Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6915
Jacques Blanc-Talon Richard Kleihorst Wilfried Philips Dan Popescu Paul Scheunders (Eds.)
Advanced Concepts for Intelligent Vision Systems 13th International Conference, ACIVS 2011 Ghent, Belgium, August 22-25, 2011 Proceedings
Volume Editors

Jacques Blanc-Talon
DGA, 7-9 rue des mathurins, 92221 Bagneux, France
E-mail: [email protected]

Richard Kleihorst
VITO-TAP, Boeretang 200, 2400 Mol, Belgium
E-mail: [email protected]

Wilfried Philips
Ghent University, St.-Pietersnieuwstraat 41, 9000 Ghent, Belgium
E-mail: [email protected]

Dan Popescu
CSIRO ICT Centre, P.O. Box 76, Epping, NSW 1710, Sydney, Australia
E-mail: [email protected]

Paul Scheunders
University of Antwerp, Universiteitsplein 1, 2610 Wilrijk, Belgium
E-mail: [email protected]

ISSN 0302-9743          e-ISSN 1611-3349
ISBN 978-3-642-23686-0  e-ISBN 978-3-642-23687-7
DOI 10.1007/978-3-642-23687-7
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2011935066

CR Subject Classification (1998): I.4, I.5, C.2, I.2, I.2.10, H.4

LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics

© Springer-Verlag Berlin Heidelberg 2011

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
Preface
This volume collects the papers accepted for presentation at the 13th International Conference on "Advanced Concepts for Intelligent Vision Systems" (ACIVS 2011). Following the first meeting in Baden-Baden (Germany) in 1999, which was part of a large multiconference, the ACIVS conference then developed into an independent scientific event and has ever since maintained the tradition of being a single-track conference.

ACIVS 2011 attracted scientists from 28 different countries, mostly from Europe, but also from Australia, Japan, Brazil, Korea, China, Tunisia, Pakistan, Taiwan and the USA.

Although ACIVS is a conference on all areas of image and video processing, submissions tend to gather within some major fields of interest. About 40% of the selected papers deal with image and video processing, including filtering and restoration and low-level analysis. This year topics related to vision, with a focus on the analysis of people, were well represented, as were papers on 3D scene estimation, processing and understanding.

We would like to thank the invited speakers Marc De Mey (UGent and Koninklijke Vlaamse Academie van België), Peter Meijer (Metamodal BV), Ben Kröse (University of Amsterdam), Lambert Spaanenburg (Lund University) and Rainer Stiefelhagen (Institute for Anthropomatics, Karlsruhe Institute of Technology and Fraunhofer Institute of Optronics, System Technologies and Image Exploitation) for enhancing the technical program with their presentations.

A conference like ACIVS would not be feasible without the concerted effort of many people and the support of various institutions. The paper submission and review procedure was carried out electronically and a minimum of three reviewers were assigned to each paper. From 124 submissions, 29 were selected for oral presentation and 37 as posters. A large and energetic Program Committee, helped by additional referees (about 95 people in total) – listed on the following pages – completed the long and demanding reviewing process. We would like to thank all of them for their timely and high-quality reviews.

Also, we would like to thank our sponsors, FWO-Vlaanderen, Ghent University, IBBT, Alcatel-Lucent, Philips, Nicta and Object Video, for their valuable support.

Last but not least, we would like to thank all the participants who trusted in our ability to organize this conference for the 13th time. We hope they attended a stimulating scientific event and enjoyed the atmosphere of the ACIVS social events in the city of Ghent.

July 2011
J. Blanc-Talon D. Popescu W. Philips R. Kleihorst P. Scheunders
Organization
ACIVS 2011 was organized by IBBT and Ghent University in Ghent, Belgium, in conjunction with the ACM/IEEE International Conference on Distributed Smart Cameras.
Steering Committee
Jacques Blanc-Talon, DGA, France
Wilfried Philips, Ghent University - IBBT, Belgium
Dan Popescu, CSIRO, Australia
Paul Scheunders, University of Antwerp, Belgium
Organizing Committee
Richard Kleihorst, VITO - Ghent University, Belgium
Wilfried Philips, Ghent University - IBBT, Belgium
Paul Scheunders, University of Antwerp, Belgium
Sponsors
ACIVS 2011 was sponsored by the following organizations:
– Flemish Fund for Scientific Research - FWO-Vlaanderen
– Ghent University - Faculty of Engineering and Architecture
– Interdisciplinary Institute for Broadband Technology - IBBT
– Alcatel-Lucent
– Philips
– Nicta
– Object Video
Program Committee
Hamid Aghajan, Stanford University, USA
Marc Antonini, Université de Nice Sophia Antipolis, France
Laure Blanc-Feraud, INRIA/I3S, France
Philippe Bolon, University of Savoie, France
Salah Bourennane, Ecole Centrale de Marseille, France
Dumitru Burdescu, University of Craiova, Romania
Vicent Caselles, Universitat Pompeu Fabra, Spain
Jocelyn Chanussot, INPG, France
Pamela Cosman, University of California at San Diego, USA
Yves D'Asseler, Ghent University, Belgium
Jennifer Davidson, Iowa State University, USA
Arturo de la Escalera Hueso, Universidad Carlos III de Madrid, Spain
Eric Debreuve, I3S, France
Christine Fernandez-Maloigne, Université de Poitiers, France
Don Fraser, Australian Defence Force Academy, Australia
Jerôme Gilles, UCLA, USA
Georgy Gimel'farb, The University of Auckland, New Zealand
Jean-Yves Guillemaut, University of Surrey, UK
Markku Hauta-Kasari, University of Eastern Finland, Finland
Dimitris Iakovidis, Technological Educational Institute of Lamia, Greece
Arto Kaarna, Lappeenranta University of Technology, Finland
Andrzej Kasinski, Poznan University of Technology, Poland
Nikos Komodakis, University of Crete, Greece
Murat Kunt, EPFL, Switzerland
Kenneth Lam, The Hong Kong Polytechnic University, China
Alessandro Ledda, Artesis University College of Antwerp, Belgium
Maylor Leung, Nanyang Technological University, Singapore
Yue Li, CSIRO ICT Centre, Australia
Brian Lovell, University of Queensland, Australia
Anthony Maeder, University of Western Sydney, Australia
Xavier Maldague, Université de Laval, Canada
Gonzalo Pajares Martinsanz, Universidad Complutense, Spain
Javier Mateos, University of Granada, Spain
Gérard Medioni, USC/IRIS, USA
Fabrice Mériaudeau, IUT Le Creusot, France
Alfred Mertins, Universität zu Lübeck, Germany
Jean Meunier, Université de Montréal, Canada
Amar Mitiche, INRS, Canada
Rafael Molina, Universidad de Granada, Spain
Adrian Munteanu, Vrije Universiteit Brussel, Belgium
Michel Paindavoine, Bourgogne University, France
Fernando Pereira, Instituto Superior Técnico, Portugal
Aleksandra Pizurica, Ghent University - IBBT, Belgium
Frederic Precioso, Paris 6, France
William Puech, LIRMM, France
Gianni Ramponi, Trieste University, Italy
Paolo Remagnino, Kingston University, UK
Martin Rumpf, Bonn University, Germany
Guillermo Sapiro, University of Minnesota, USA
Andrzej Sluzek, Nanyang Technological University, Singapore
Hugues Talbot, ESIEE, France
Jean-Philippe Thiran, EPFL, Switzerland
Matthew Thurley, Luleå University of Technology, Sweden
Frederic Truchetet, Université de Bourgogne, France
Dimitri Van De Ville, EPFL, Switzerland
Marc Van Droogenbroeck, University of Liège, Belgium
Peter Veelaert, University College Ghent, Belgium
Miguel Vega, University of Granada, Spain
Gerald Zauner, Fachhochschule Oberösterreich, Austria
Pavel Zemcik, Brno University of Technology, Czech Republic
Djemel Ziou, Sherbrooke University, Canada
Reviewers
Hamid Aghajan, Stanford University, USA
Marc Antonini, Université de Nice Sophia Antipolis, France
Sileye Ba, Telecom Bretagne, France
Thierry Baccino, Lutin Userlab, France
Charles Beumier, Royal Military Academy, Belgium
Jacques Blanc-Talon, DGA, France
Philippe Bolon, University of Savoie, France
Don Bone, Canon Information Systems Research, Australia
Alberto Borghese, University of Milan, Italy
Salah Bourennane, Ecole Centrale de Marseille, France
Dumitru Burdescu, University of Craiova, Romania
Vicent Caselles, Universitat Pompeu Fabra, Spain
Umberto Castellani, Università degli Studi di Verona, Italy
Jocelyn Chanussot, INPG, France
Cornel Cofaru, Ghent University, Belgium
Peter Corke, Queensland University of Technology, Australia
Pamela Cosman, University of California at San Diego, USA
Marco Cristani, University of Verona, Italy
Erik D'Hollander, Ghent University, Belgium
Matthew Dailey, Asian Institute of Technology, Thailand
Arturo de la Escalera Hueso, Universidad Carlos III de Madrid, Spain
Jonas De Vylder, Ghent University, Belgium
Francis Deboeverie, Ghent University College, Belgium
Eric Debreuve, I3S, France
Julie Digne, ENS Cachan, France
Koen Douterloigne, Ghent University, Belgium
Séverine Dubuisson, Université Paris VI, France
Lieven Eeckhout, Ghent University, Belgium
Christine Fernandez-Maloigne, Université de Poitiers, France
David Filliat, ENSTA, France
Don Fraser, Australian Defence Force Academy, Australia
Jerôme Gilles, UCLA, USA
Georgy Gimel'farb, The University of Auckland, New Zealand
Bart Goossens, Ghent University, Belgium
Jean-Yves Guillemaut, University of Surrey, UK
Markku Hauta-Kasari, University of Eastern Finland, Finland
Monson Hayes, Georgia Institute of Technology, USA
Dimitris Iakovidis, Technological Educational Institute of Lamia, Greece
Arto Kaarna, Lappeenranta University of Technology, Finland
Andrzej Kasinski, Poznan University of Technology, Poland
Richard Kleihorst, VITO - Ghent University, Belgium
Nikos Komodakis, University of Crete, Greece
Murat Kunt, EPFL, Switzerland
Nojun Kwak, Ajou University, Republic of Korea
Olivier Laligant, IUT Le Creusot, France
Kenneth Lam, The Hong Kong Polytechnic University, China
Peter Lambert, Ghent University, Belgium
Alessandro Ledda, Artesis University College of Antwerp, Belgium
Maylor Leung, Nanyang Technological University, Singapore
Yue Li, CSIRO ICT Centre, Australia
Hiep Luong, Ghent University, Belgium
Xavier Maldague, Université de Laval, Canada
Gonzalo Pajares Martinsanz, Universidad Complutense, Spain
Javier Mateos, University of Granada, Spain
Gérard Medioni, USC/IRIS, USA
Fabrice Mériaudeau, IUT Le Creusot, France
Alfred Mertins, Universität zu Lübeck, Germany
Jean Meunier, Université de Montréal, Canada
Amar Mitiche, INRS, Canada
Rafael Molina, Universidad de Granada, Spain
Adrian Munteanu, Vrije Universiteit Brussel, Belgium
Sergio Orjuela Vargas, Ghent University, Belgium
Michel Paindavoine, Bourgogne University, France
Dijana Petrovska, SudParis, France
Sylvie Philipp-Foliguet, ETIS, France
Wilfried Philips, Ghent University - IBBT, Belgium
Wojciech Pieczynski, TELECOM SudParis, France
Aleksandra Pizurica, Ghent University - IBBT, Belgium
Ljiljana Platisa, Ghent University, Belgium
Dan Popescu, CSIRO, Australia
Eric Postma, University of Tilburg, The Netherlands
Frederic Precioso, Paris 6, France
William Puech, LIRMM, France
Gianni Ramponi, Trieste University, Italy
Paolo Remagnino, Kingston University, UK
Martin Rumpf, Bonn University, Germany
Paul Scheunders, University of Antwerp, Belgium
Véronique Serfaty, DGA, France
Désiré Sidibé, University of Bourgogne, France
Andrzej Sluzek, Nanyang Technological University, Singapore
Dirk Stroobandt, Ghent University, Belgium
Hugues Talbot, ESIEE, France
Jean-Philippe Thiran, EPFL, Switzerland
Matthew Thurley, Luleå University of Technology, Sweden
Frederic Truchetet, Université de Bourgogne, France
Dimitri Van De Ville, EPFL, Switzerland
Marc Van Droogenbroeck, University of Liège, Belgium
David Van Hamme, University College Ghent, Belgium
Peter Van Hese, Ghent University, Belgium
Peter Veelaert, University College Ghent, Belgium
Miguel Vega, University of Granada, Spain
Gerald Zauner, Fachhochschule Oberösterreich, Austria
Pavel Zemcik, Brno University of Technology, Czech Republic
Djemel Ziou, Sherbrooke University, Canada
Witold Zorski, Cybernetics Faculty, Military University of Technology, Poland
Table of Contents
Vision Robust Visual Odometry Using Uncertainty Models . . . . . . . . . . . . . . . . . . David Van Hamme, Peter Veelaert, and Wilfried Philips
1
Supervised Visual Vocabulary with Category Information . . . . . . . . . . . . . Yunqiang Liu and Vicent Caselles
13
Nonparametric Estimation of Fisher Vectors to Aggregate Image Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hervé Le Borgne and Pablo Muñoz Fuentes
22
Knowledge-Driven Saliency: Attention to the Unseen . . . . . . . . . . . . . . . . . M. Zaheer Aziz, Michael Knopf, and Bärbel Mertsching
34
A Comparative Study of Vision-Based Lane Detection Methods . . . . . . . . Nadra Ben Romdhane, Mohamed Hammami, and Hanene Ben-Abdallah
46
A New Multi-camera Approach for Lane Departure Warning . . . . . . . . . . Amol Borkar, Monson Hayes, and Mark T. Smith
58
Classification, Recognition and Tracking Feature Space Warping Relevance Feedback with Transductive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniele Borghesani, Dalia Coppi, Costantino Grana, Simone Calderara, and Rita Cucchiara Improved Support Vector Machines with Distance Metric Learning . . . . . Yunqiang Liu and Vicent Caselles A Low-Cost System to Detect Bunches of Grapes in Natural Environment from Color Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Manuel J.C.S. Reis, Raul Morais, Carlos Pereira, Olga Contente, Miguel Bacelar, Salviano Soares, Ant´ onio Valente, Jos´e Baptista, Paulo J.S.G. Ferreira, and Jos´e Bulas-Cruz
70
82
92
Fuzzy Cognitive Maps Applied to Synthetic Aperture Radar Image Classifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gonzalo Pajares, Javier Sánchez-Lladó, and Carlos López-Martínez
103
Swarm Intelligence Based Searching Schemes for Articulated 3D Body Motion Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bogdan Kwolek, Tomasz Krzeszowski, and Konrad Wojciechowski
115
Combining Linear Dimensionality Reduction and Locality Preserving Projections with Feature Selection for Recognition Tasks . . . . . . . . . . . . . . Fadi Dornaika, Ammar Assoum, and Alireza Bosaghzadeh
127
A New Anticorrelation-Based Spectral Clustering Formulation . . . . . . . . . Julia Dietlmeier, Ovidiu Ghita, and Paul F. Whelan
139
Simultaneous Partitioned Sampling for Articulated Object Tracking . . . . Christophe Gonzales, S´everine Dubuisson, and Xuan Son N’Guyen
150
Segmentation A Geographical Approach to Self-Organizing Maps Algorithm Applied to Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thales Sehn Korting, Leila Maria Garcia Fonseca, and Gilberto Cˆ amara A Multi-Layer ‘Gas of Circles’ Markov Random Field Model for the Extraction of Overlapping Near-Circular Objects . . . . . . . . . . . . . . . . . . . . . Jozsef Nemeth, Zoltan Kato, and Ian Jermyn Evaluation of Image Segmentation Algorithms from the Perspective of Salient Region Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bogdan Popescu, Andreea Iancu, Dumitru Dan Burdescu, Marius Brezovan, and Eugen Ganea
162
171
183
Robust Active Contour Segmentation with an Efficient Global Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jonas De Vylder, Jan Aelterman, and Wilfried Philips
195
A Method to Generate Artificial 2D Shape Contour Based in Fourier Transform and Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maurício Falvo, João Batista Florindo, and Odemir Martinez Bruno
207
Image Segmentation Based on Electrical Proximity in a Resistor-Capacitor Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jan Gaura, Eduard Sojka, and Michal Krumnikl
216
Hierarchical Blurring Mean-Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Milan Šurkala, Karel Mozdřeň, Radovan Fusek, and Eduard Sojka
228
Image Analysis Curve-Skeletons Based on the Fat Graph Approximation . . . . . . . . . . . . . . Denis Khromov DTW for Matching Radon Features: A Pattern Recognition and Retrieval Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Santosh K.C., Bart Lamiroy, and Laurent Wendling
239
249
Ridges and Valleys Detection in Images Using Difference of Rotating Half Smoothing Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Baptiste Magnier, Philippe Montesinos, and Daniel Diep
261
Analysis of Wear Debris through Classification . . . . . . . . . . . . . . . . . . . . . . . Roman Juránek, Stanislav Machalík, and Pavel Zemčík
273
Fourier Fractal Descriptors for Colored Texture Analysis . . . . . . . . . . . . . . Jo˜ ao B. Florindo and Odemir M. Bruno
284
Efficiency Optimization of Trainable Feature Extractors for a Consumer Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maurice Peemen, Bart Mesman, and Henk Corporaal Salient Region Detection Using Discriminative Feature Selection . . . . . . . HyunCheol Kim and Whoi-Yul Kim Image Analysis Applied to Morphological Assessment in Bovine Livestock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Horacio M. Gonz´ alez-Velasco, Carlos J. Garc´ıa-Orellana, Miguel Mac´ıas-Mac´ıas, Ram´ on Gallardo-Caballero, and Antonio Garc´ıa-Manso Quantifying Appearance Retention in Carpets Using Geometrical Local Binary Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rolando Quinones, Sergio A. Orjuela, Benhur Ortiz-Jaramillo, Lieva Van Langenhove, and Wilfried Philips Enhancing the Texture Attribute with Partial Differential Equations: A Case of Study with Gabor Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bruno Brandoli Machado, Wesley Nunes Gon¸calves, and Odemir Martinez Bruno Dynamic Texture Analysis and Classification Using Deterministic Partially Self-avoiding Walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wesley Nunes Gon¸calves and Odemir Martinez Bruno
293 305
316
327
337
349
Image Processing Segmentation Based Tone-Mapping for High Dynamic Range Images . . . Qiyuan Tian, Jiang Duan, Min Chen, and Tao Peng Underwater Image Enhancement: Using Wavelength Compensation and Image Dehazing (WCID) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . John Y. Chiang, Ying-Ching Chen, and Yung-Fu Chen Video Stippling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thomas Houit and Frank Nielsen
360
372 384
Contrast Enhanced Ultrasound Images Restoration . . . . . . . . . . . . . . . . . . . Adelaide Albouy-Kissi, Stephane Cormier, Bertrand Zavidovique, and Francois Tranquart
396
Mutual Information Refinement for Flash-no-Flash Image Alignment . . . Sami Varjo, Jari Hannuksela, Olli Silv´en, and Sakari Alenius
405
Virtual Restoration of the Ghent Altarpiece Using Crack Detection and Inpainting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tijana Ružić, Bruno Cornelis, Ljiljana Platiša, Aleksandra Pižurica, Ann Dooms, Wilfried Philips, Maximiliaan Martens, Marc De Mey, and Ingrid Daubechies
417
429
437
A Bio-Inspired Image Coder with Temporal Scalability . . . . . . . . . . . . . . . Khaled Masmoudi, Marc Antonini, and Pierre Kornprobst
447
Self-similarity Measure for Assessment of Image Visual Quality . . . . . . . . Nikolay Ponomarenko, Lina Jin, Vladimir Lukin, and Karen Egiazarian
459
Video Surveillance and Biometrics An Intelligent Video Security System Using Object Tracking and Shape Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sang Hwa Lee, Siddharth Sharma, Linlin Sang, Jong-Il Park, and Yong Gyu Park 3D Facial Expression Recognition Based on Histograms of Surface Differential Quantities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Huibin Li, Jean-Marie Morvan, and Liming Chen
471
483
Facial Feature Tracking for Emotional Dynamic Analysis . . . . . . . . . . . . . . Thibaud Senechal, Vincent Rapp, and Lionel Prevost
495
Detection of Human Groups in Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sel¸cuk Sandıkcı, Svitlana Zinger, and Peter H.N. de With
507
Estimation of Human Orientation in Images Captured with a Range Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . S´ebastien Pi´erard, Damien Leroy, Jean-Fr´ed´eric Hansen, and Marc Van Droogenbroeck
519
Human Identification Based on Gait Paths . . . . . . . . . . . . . . . . . . . . . . . . . . Adam Świtoński, Andrzej Polański, and Konrad Wojciechowski
531
543
An Edge-Based Approach for Robust Foreground Detection . . . . . . . . . . . Sebastian Gruenwedel, Peter Van Hese, and Wilfried Philips
554
Relation Learning - A New Approach to Face Recognition . . . . . . . . . . . . . Len Bui, Dat Tran, Xu Huang, and Girija Chetty
566
Algorithms and Optimizations Temporal Prediction and Spatial Regularization in Differential Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthias Hoeffken, Daniel Oberhoff, and Marina Kolesnik Parallel Implementation of the Integral Histogram . . . . . . . . . . . . . . . . . . . . Pieter Bellens, Kannappan Palaniappan, Rosa M. Badia, Guna Seetharaman, and Jesus Labarta System on Chip Coprocessors for High Speed Image Feature Detection and Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marek Kraft, Michal Fularz, and Andrzej Kasi´ nski Fast Hough Transform on GPUs: Exploration of Algorithm Trade-Offs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gert Jan van den Braak, Cedric Nugteren, Bart Mesman, and Henk Corporaal Feasibility Analysis of Ultra High Frame Rate Visual Servoing on FPGA and SIMD Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yifan He, Zhenyu Ye, Dongrui She, Bart Mesman, and Henk Corporaal
576 586
599
611
623
3D, Depth and Scene Understanding Calibration and Reconstruction Algorithms for a Handheld 3D Laser Scanner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Denis Lamovsky and Aless Lasaruk Comparison of Visual Registration Approaches of 3D Models for Orthodontics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rapha¨el Destrez, Benjamin Albouy-Kissi, Sylvie Treuillet, Yves Lucas, and Arnaud Marchadier
635
647
A Space-Time Depth Super-Resolution Scheme for 3D Face Scanning . . . Karima Ouji, Mohsen Ardabilian, Liming Chen, and Faouzi Ghorbel Real-Time Depth Estimation with Wide Detectable Range Using Horizontal Planes of Sharp Focus Proceedings . . . . . . . . . . . . . . . . . . . . . . . Hiroshi Ikeoka, Masayuki Ohata, and Takayuki Hamamoto Automatic Occlusion Removal from Facades for 3D Urban Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chris Engels, David Tingdahl, Mathias Vercruysse, Tinne Tuytelaars, Hichem Sahli, and Luc Van Gool
658
669
681
hSGM: Hierarchical Pyramid Based Stereo Matching Algorithm . . . . . . . Kwang Hee Won and Soon Ki Jung
693
Surface Reconstruction of Rotating Objects from Monocular Video . . . . . Charlotte Boden and Abhir Bhalerao
702
Precise Registration of 3D Images Acquired from a Hand-Held Visual Sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Benjamin Coudrin, Michel Devy, Jean-Jos´e Orteu, and Ludovic Br`ethes A 3-D Tube Scanning Technique Based on Axis and Center Alignment of Multi-laser Triangulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Seung-Hae Baek and Soon-Yong Park Combining Plane Estimation with Shape Detection for Holistic Scene Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kai Zhou, Andreas Richtsfeld, Karthik Mahesh Varadarajan, Michael Zillich, and Markus Vincze
712
724
736
Simple Single View Scene Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bronislav Přibyl and Pavel Zemčík
748
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
761
Robust Visual Odometry Using Uncertainty Models

David Van Hamme1,2, Peter Veelaert1,2, and Wilfried Philips2

1 University College Ghent (Vision Systems)
2 Ghent University/IBBT (IPI)
Abstract. In dense, urban environments, GPS by itself cannot be relied on to provide accurate positioning information. Signal reception issues (e.g. occlusion, multi-path effects) often prevent the GPS receiver from getting a positional lock, causing holes in the absolute positioning data. In order to keep assisting the driver, other sensors are required to track the vehicle motion during these periods of GPS disturbance. In this paper, we propose a novel method to use a single on-board consumer-grade camera to estimate the relative vehicle motion. The method is based on the tracking of ground plane features, taking into account the uncertainty on their backprojection as well as the uncertainty on the vehicle motion. A Hough-like parameter space vote is employed to extract motion parameters from the uncertainty models. The method is easy to calibrate and designed to be robust to outliers and bad feature quality. Preliminary testing shows good accuracy and reliability, with a positional estimate within 2 metres for a 400 metre elapsed distance. The effects of inaccurate calibration are examined using artificial datasets, suggesting a self-calibrating system may be possible in future work.
1 Introduction
The current generation of GPS navigation systems is insufficiently reliable in urban conditions. Obstacles along and over the road (e.g. tall buildings, overpasses, bridges) often block the line of sight to the four satellites that is required for a positional fix. In these situations, the GPS unit temporarily loses track of position. However, it is sufficient to be able to accurately track the relative motion of the vehicle over a short distance to keep a reasonable estimate of the absolute position. In this paper we present a relative motion tracking approach based on a single forward-facing camera. There is a tendency in the automotive industry to equip consumer vehicles with cameras to perform a number of driver assistance tasks. Traffic sign recognition and lane departure warning are two examples that can readily be found on the options list of many new cars.

The calculation of the ego-motion of a moving platform from images captured from the platform itself is called visual odometry. A solid mathematical basis for visual odometry was laid in the past 20 years ([1, 10, 11, 13–15]). More recent works focus on how to bring these concepts into
practice. The approaches can be roughly divided into three categories based on the optics used. A first approach uses stereo cameras, with influential works by Obdržálek et al., 2010 [16], Kitt et al., 2010 [8], Comport et al., 2007 [6], Konolige et al., 2007 [9] and Cheng et al., 2006 [5]. A second approach is the use of an omnidirectional camera, as in Scaramuzza et al., 2009 [17], Tardif et al., 2008 [18]. The third category consists of methods that use a single, non-panoramic camera. Significant efforts include Campbell et al., 2005 [4], Mouragnon et al., 2009 [12], and Azuma et al., 2010 [2].

Both stereo vision and panoramic cameras boast obvious advantages for solving the visual odometry problem. However, such setups are impractical for use on consumer vehicles, as the stereo methods require precise and repeated calibration and the omnidirectional camera cannot be mounted inconspicuously. We will therefore focus on a solution that involves a single, standard consumer camera. Among the methods cited above, good performance has been demonstrated on indoor test sequences, and in some cases even in controlled outdoor environments. However, the application on consumer vehicles poses new challenges. As road speeds increase, the number of available feature correspondences is reduced, and their displacements in the image can become very large. A third problem is the presence of outliers: other traffic will cause feature correspondences that are of no use to calculate visual odometry.

These unsolved problems lead us to propose a novel method for a single, forward-facing camera. Central to the method is the backprojection of image features to the world ground plane, and the uncertainties associated with this backprojection. A Hough-like voting scheme is implemented to track the consistent motion in the uncertainty models of the ground plane features. Careful modelling of the uncertainties both in the backprojection and in the motion parameters, coupled with a robust voting algorithm, allows us to reliably extract the ego-motion from a calibrated system. Experiments on real data show accuracy within 2 metres after an elapsed distance of 400 metres. The details of the method will be further explained in section 3.
2 Camera Model
To accurately model the uncertainties of the backprojection in an elegant way, it is important to choose an appropriate camera model and carefully define the used coordinate systems. In our application, we are working in three different coordinate systems: 3D world coordinates, 3D camera coordinates, and 2D image coordinates. The 3D world coordinate system is chosen as a right-handed system with the origin on the road surface directly below the center of rotation of the vehicle. Note that this means that the world axes remain tied to the vehicle, and vehicle motion manifests itself as moving texture on a fixed plane. The world Z-axis is taken perpendicular to the ground plane, the Y-axis parallel to the straight-ahead driving direction of the vehicle. The camera coordinate system is also a right-handed system, and is similarly aligned. Its origin is in the center of projection of the camera, its Y-axis points in the viewing direction along
the principal axis, and the X- and Z-axis are aligned with the horizontal and vertical image sensor directions respectively. While this deviates somewhat from the conventional definitions, it facilitates calibration, as will be explained shortly. Finally, the 2D image coordinate system is defined with the origin in the top left corner of the image, the X-axis pointing right and the Y-axis pointing down. Choosing the axes this way offers some advantages from a programming point of view, as it better reflects image data organization in memory.

To describe the perspective projection from 3D world coordinates to 2D image coordinates, we will use an undistorted pinhole camera model. Let x = [x y w]^T denote the 2D image point in homogeneous coordinates, and X = [X Y Z 1]^T the corresponding point in homogeneous 3D world coordinates. The projection of X onto x is then given by:

x = C [R|t] X.   (1)

In the above expression, C is the upper triangular intrinsic camera matrix as described in Hartley and Zisserman, 2004 [7], consisting of the horizontal and vertical scaling components αx and αy and the image coordinates of the principal point (x0, y0), multiplied by a substitution matrix arising from our non-standard definition of image axes:

C = \begin{bmatrix} \alpha_x & 0 & x_0 \\ 0 & \alpha_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix}.   (2)

[R|t] is the rotation matrix R that aligns the world axes with the camera axes, augmented by the 3D translation vector t between their origins. The resulting 3x4 matrix is the projection matrix that maps homogeneous 3D world coordinates onto 2D image coordinates. Because we will only consider points in the world ground plane, Z = 0 for all points and the projection matrix reduces to a 3x3 homography matrix H:

x = \begin{bmatrix} x \\ y \\ w \end{bmatrix} = H \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}.   (3)

In our case, we are interested in the inverse transformation: we want to project image points onto the world ground plane. This projection is characterized by the inverse of H:

X = \begin{bmatrix} X \\ Y \\ W \end{bmatrix} = H^{-1} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.   (4)

If lens distortion has to be taken into account, a distortion function is applied to the left hand side of equations (1) and (3). We will consider the distortion to be rectified in advance, as can easily be done using standard methods [3].

Let us take a closer look at the components of equation (1). The camera matrix C can be obtained using standard calibration methods as described in Bouguet, 1999 [3], and will remain constant for a fixed-zoom camera.
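To make the mapping of equations (1)–(4) concrete, the following sketch (not part of the original paper; function and variable names are illustrative assumptions) builds C with the axis-substitution factor of equation (2), forms the ground-plane homography of equation (3) from C, R and t, and backprojects a pixel with its inverse as in equation (4).

```python
import numpy as np

def intrinsic_matrix(ax, ay, x0, y0):
    # Upper triangular intrinsics times the axis-substitution matrix of eq. (2):
    # camera X maps to image x, camera Z (up) to image -y, camera Y (forward) to depth.
    K = np.array([[ax, 0.0, x0],
                  [0.0, ay, y0],
                  [0.0, 0.0, 1.0]])
    S = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])
    return K @ S

def ground_homography(C, R, t):
    # For ground-plane points (Z = 0) only the first two columns of R matter,
    # so the 3x4 projection C[R|t] collapses to the 3x3 homography H of eq. (3).
    return C @ np.column_stack((R[:, 0], R[:, 1], t))

def backproject_to_ground(H, u, v):
    # Eq. (4): image pixel (u, v) -> ground-plane point (X, Y).
    Xh = np.linalg.inv(H) @ np.array([u, v, 1.0])
    return Xh[:2] / Xh[2]
```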
The vector t determines the offset between the origin of the world axes and the origin of the camera axes. We can measure t as part of the extrinsic calibration process, as it changes only very slightly with vehicle motion and load. The rotation matrix R can be constructed as a series of three rotations, each along one of the axes. The most common conventions for defining these rotations are through heading, pitch and roll (Z-, X-, and Y-axis rotations) or Euler angles (usually Z-X-Z). We will use the heading, pitch and roll configuration. Due to our non-standard choice of world and camera axes, these three angles can be measured relatively easily in world coordinates. However, only the heading can be assumed to remain constant while the vehicle is moving, as suspension movement does not affect this direction of the vehicle in any significant way. The other rotations however, pitch and roll, have a range of freedom around the calibration values obtained while stationary, as they are affected by the suspension. This range of motion will be the key challenge to overcome, as will be explained in section 3. The ground plane projection of a sample frame while stationary is shown in figure 1.
Fig. 1. Example of a test frame (left), and its ground plane backprojection (right)
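As a complement to the description of R above, the sketch below (again not the authors' code) assembles the rotation from the heading, pitch and roll angles as elementary Z-, X- and Y-axis rotations. The composition order Rz·Rx·Ry is an assumption, since the exact order is not stated here.

```python
import numpy as np

def rotation_from_hpr(heading, pitch, roll):
    # Heading: Z-axis, pitch: X-axis, roll: Y-axis rotation (angles in radians).
    # Composition order is assumed to be Rz @ Rx @ Ry.
    ch, sh = np.cos(heading), np.sin(heading)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[ch, -sh, 0.0], [sh, ch, 0.0], [0.0, 0.0, 1.0]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    Ry = np.array([[cr, 0.0, sr], [0.0, 1.0, 0.0], [-sr, 0.0, cr]])
    return Rz @ Rx @ Ry
```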
3 The Proposed Method
The first step in our proposed method is the backprojection of Harris corners to the world ground plane, using equation (4). The advantage of using backprojected ground features is that the method will still produce useful motion information when there are only few feature correspondences, due to the reduction in degrees of freedom. However, in order to accurately backproject the Harris corners, we need the immediate pitch and roll angles for each frame. These angles are not precisely known; they are only defined within an interval around the calibration values obtained at rest (cf. section 2). This range of possible suspension angles defines a region of possible backprojected locations in the world plane. Due to the trigonometric functions in the rotation matrix R, this region will not strictly be convex, but as the variation in angles is small, it can be closely approximated by a tetragon. These uncertainty regions on the world plane arising from the range of angles in the camera perspective will be referred to as Perspective Uncertainty Tetragons (PUTs). Note that there is a PUT for every Harris corner detected in the camera image.
The PUTs are easily calculated: a rotation matrix R is constructed for each of the four combinations of extremal pitch and roll values. This results in four different backprojections of each Harris corner, yielding the four corners of the tetragon for that feature. It should be noted that any inaccuracies in the feature detector can also be modelled into the PUTs. Suppose for example that the chosen feature detector is known to have poor localization, producing features in positions that can be off by one pixel from the actual point of interest. One way to account for this could be to calculate the PUTs for all possible extremal pixel coordinates of the actual feature, distributed in an area around the feature detector output, and then take the union of the PUTs to obtain the uncertainty on the world plane position. In our application, the Harris corners are assumed to be accurate within half a pixel in both the horizontal and vertical image dimension.
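A possible implementation of this computation, reusing the helper functions sketched in Section 2 (the names are assumptions, not the authors' code), backprojects a corner under the four extremal pitch/roll combinations:

```python
import numpy as np
from itertools import product
# rotation_from_hpr, intrinsic_matrix, ground_homography and
# backproject_to_ground refer to the earlier sketches.

def perspective_uncertainty_tetragon(u, v, C, t, heading, pitch_range, roll_range):
    # pitch_range and roll_range are (min, max) bounds from the suspension analysis.
    corners = []
    for pitch, roll in product(pitch_range, roll_range):
        R = rotation_from_hpr(heading, pitch, roll)
        H = ground_homography(C, R, t)
        corners.append(backproject_to_ground(H, u, v))
    return np.array(corners)  # 4 x 2 array: the PUT corner points
```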
Fig. 2. Harris corners in the camera view (left), and close-up of some of their associated PUTs (right)
Figure 2 shows the PUTs for an example frame in a test sequence. The left image shows a camera frame in which Harris corners were detected. The background of the right image is the backprojection of the camera frame based on the average angles, with the PUTs of the Harris corners drawn on top of it. Realistic limits on the pitch and roll angles can be established by analysis of test videos in which the vehicle is accelerating, braking or cornering hard. We can see from the shape and size of the PUTs in figure 2 that pitch is the main factor contributing to the uncertainty (vertical elongation), while the effect of roll remains limited.

The problem we need to solve is how to extract accurate translation and rotation parameters from correspondences between the features of consecutive frames when the position of the features is only known up to a region (the PUTs). Additionally, the method must be robust to outliers. Features that do not correspond to objects in the ground plane, or that belong to other traffic, will exhibit inconsistent motion and should have as little influence as possible.

To solve the above problem we propose a simple, transparent voting mechanism that can extract the incremental trajectory in real-time. In a first step, we must establish possible feature correspondences between consecutive frames. Let
us assume that we know the exact ground plane position of the previous frame’s features. As we stated earlier, the origin of the world coordinate system remains affixed to a point below the center of rotation of the vehicle. This means that one frame ahead, the features (which are static in the real world) will have new coordinates in the ground plane. The new coordinates depend on the speed and steering angle between the previous two frames, and on the acceleration and steering input of the driver between the previous and the current frame. The driver inputs are the unknowns that we want to determine. However, the acceleration is bound by the maximum torque of the engine and by the maximum retardation allowed by the braking system, and the steering input is limited by the ratio of the steering rack coupled with the maximum speed at which the driver is physically able to twirl the steering wheel. These bounds can be established from the specification of the vehicle and a simple experiment in which a person tries to turn the steering wheel from lock to lock as fast as possible. The bounds on driver input are also included in our uncertainty model. For each known feature position from the previous frame, the range of possible driver inputs delimits a region in the world plane that represents this feature’s possible world plane positions in the current frame. Again, this region is not strictly convex due to the rotation component, but is closely approximated by a tetragon. We will call this type of tetragon a Motion Uncertainty Tetragon (MUT), as it arises from the uncertainty on the vehicle motion. An example of a set of MUTs is shown in figure 3. Note that features in close proximity to the vehicle have a narrower MUT than distant features. The MUT of a feature position is calculated by displacing the feature along four circle segments, representing the four extremal combinations of possible speed and steering angle. The circle segments are an approximation of the real trajectory, as they assume a constant speed and steering angle over the interframe interval. In reality, the bend radius will change continuously in this interval, but the errors introduced by this approximation are small. The uncertainty of the feature detection and backprojection is now modelled in the PUTs and the uncertainty on the predicted displacement of the vehicle is modelled in the MUTs. The PUTs correspond to features in the current frame’s camera view, while the MUTs correspond to features in the previous frame’s world plane. The problem of finding correspondences between the previous frame and the current frame is thereby reduced to finding overlap between MUTs and PUTs. Once the possible feature correspondences have been established, the second problem is how to extract the correct motion parameters (i.e. speed differential and steering angle differential) from these correspondences. Once we know these differentials, we can reconstruct the trajectory of the vehicle between the two frames. The MUTs are essentially projections of a region of translation-rotation parameter space onto the ground plane. The overlap of the MUTs with the PUTs gives us information about which parameter combinations are plausible according to the observed features. Although the MUTs have slightly different sizes and
Fig. 3. Example of MUTs
shapes based on their location, their boundaries correspond to the same extremal values of speed and steering angle, and as such every MUT is a deformation of the same rectangular patch in parameter space. When a PUT overlaps with an MUT, this is evidence that the region of overlap in the MUT contains plausible vehicle motion parameters according to one of the features. We can state that the overlap expresses a vote for this region in parameter space. When we sum the region votes for all areas of PUT-MUT overlap, we obtain a measure of plausibility for every rotation-translation combination in the parameter space patch. Evidence will concentrate on those combinations that agree with the majority of the observations. This is similar in concept to Hough-based shape detection algorithms, where the shapes are found as peaks in a voted parameter space. This type of voting method has the advantage that it provides some robustness against outliers, as the contributions from bad features (e.g. features caused by other traffic or by objects not in the ground plane) will not typically have a common intersection, and as such tend to manifest themselves as noise spread out over the parameter space. Figure 4 shows the typical overlapping of MUTs and PUTs. To perform the parameter space vote in practice, we normalize every MUT to a rectangle of predefined size and represent its overlap with one or more PUTs by a binary image of this predefined size. The horizontal axis of the normalized image is a slightly nonlinearly stretched representation of the rotation differential axis, while the vertical axis is the speed differential axis. Summing the images of all normalized MUTs is essentially the same as summing region votes in the discretized parameter space. An example of a sum image is shown in figure 4. We call this the consensus image, as it is a graphic representation of the parameter consensus between all features. In the example shown, we see that the votes concentrate on an area slightly below center. This translates to a slight deceleration while driving straight ahead. We should note that in order for the evidence to concentrate in the consensus image, we need a horizontal distribution of features over the image as equal as
Fig. 4. Overlap between PUTs and MUTs, and corresponding consensus image
possible. If Harris corners are only found on the right side of the driving direction, their PUTs will all be skewed in the same direction, causing their intersections to be large. Contributions of corners from the other side of the driving direction will yield much smaller intersections, and therefore a better concentration of evidence.
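The voting step can be approximated with the following sketch (not the authors' implementation): the (speed differential, steering differential) patch is discretized on a grid, each cell predicts a displaced ground-plane position for every feature through a user-supplied motion model, and a vote is cast whenever the prediction falls inside the feature's PUT. For brevity, the exact MUT normalization and polygon overlap are replaced here by a point-in-bounding-box test; the function and parameter names are assumptions.

```python
import numpy as np

def consensus_vote(prev_pts, puts, motion_model, dv_range, dphi_range, bins=64):
    # prev_pts: ground-plane feature positions from the previous frame.
    # puts: matching list of 4x2 PUT corner arrays from the current frame.
    # motion_model(p, dv, dphi): predicted new position of p for the given
    # speed/steering differentials (e.g. a circular-arc displacement).
    votes = np.zeros((bins, bins))
    dvs = np.linspace(*dv_range, bins)
    dphis = np.linspace(*dphi_range, bins)
    for p, put in zip(prev_pts, puts):
        (xmin, ymin), (xmax, ymax) = put.min(axis=0), put.max(axis=0)
        for i, dv in enumerate(dvs):
            for j, dphi in enumerate(dphis):
                x, y = motion_model(p, dv, dphi)
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    votes[i, j] += 1  # evidence for this parameter cell
    i, j = np.unravel_index(votes.argmax(), votes.shape)
    return dvs[i], dphis[j], votes  # consensus parameters and vote image
```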
4 Results
The proposed method was tested on a trajectory on the parking lot of our campus. The trajectory is a figure of eight with a length of 431 metres. The reconstructed trajectory is shown in figure 5. The backprojections of the entire video frames are used as a background. The comparison with the ground truth of the trajectory is also shown in figure 5, with an aerial photo as background. The positional error at the end of this trajectory is 2.14 metres, while the rotational error is 7.67 degrees. When coupled to an offline map, this is sufficiently accurate to establish the road that the vehicle is travelling on. This proves the validity of the concept for filling in gaps in the GPS reception of a couple of hundred metres. However, more testing is required to evaluate robustness and the effect of other traffic (which was absent during the test run).

To evaluate the sensitivity of the method to calibration errors, the method was also tested on artificial data. A camera trajectory of 150 metres consisting of 2 sharp and 2 shallow bends above a checkerboard pattern was rendered and evaluated by the proposed method. The use of a rendered video provides the advantage that the calibration angles and trajectory are exactly known. An example of a frame from the artificial set is shown in figure 6. Figure 7 shows the results of our method compared to the ground truth, shown in black.

The red trajectory in figure 7 is the output of our method for correct calibration values. Positional and rotational errors are given in table 1. The green trajectory shows the result when the roll angle would be off by 2 degrees, for
Fig. 5. Reconstructed trajectory and surroundings according to our system (left), compared to ground truth (yellow) on aerial photo
Fig. 6. Example frame of artificial test sequence
example due to a calibration error. Two distinct effects are visible. Firstly there is an overall rotational bias that manifests itself in both left and right bends. Secondly, the right bends get truncated while the left bends get elongated. Still, for our trajectory with equal bends left and right, the positional and rotational errors are relatively small. This shows that our method is reasonably robust against roll miscalibration. The blue trajectory in figure 7 results from a 2 degree error in the pitch calibration. The effect is significant: every displacement
Fig. 7. Results for artificial test data, showing the effect of various calibration errors

Table 1. Positional and rotational errors for different miscalibrations

                          Positional error (m)   Rotational error (degrees)
  Exact calibration               1.51                     0.44
  2 Degree roll error             2.83                    -6.37
  2 Degree pitch error           16.78                    -4.64
  2 Degree heading error         18.79                   -33.40
gets severely underestimated. Rotational accuracy is still good, but the positional error quickly becomes very large. We can conclude that the method is fairly sensitive to errors in pitch calibration. Finally, the teal trajectory shows the effect of a 2 degree error in heading. Predictably, there is a strong rotational bias, both in the bends and on the straight segments. The end position is off by a very wide margin, and likewise the rotation. Clearly, the method is also sensitive to heading calibration errors. However, it should be noted that when an offline map is present, these systematic errors in position and rotation are easily detected, and the different effects of errors in the different angles should enable us to identify which calibration values are off. This in turn opens the road to a self-correction system, which would be a great asset for consumer applications.
5 Conclusion
We proposed a novel method for calculating visual odometry. The method is shown to work on real data, yielding a positional error of only around 2 metres
on a trajectory over 400 metres long. However, robustness has to be tested further on real-world data sets with other traffic present, and the algorithm should be connected to an offline map to keep drift under control for longer sequences. Despite the early stage of development though, the technique has already been proven accurate enough to pinpoint the lane in which the vehicle is driving after 300 metres of GPS silence. The effect of calibration errors was examined on artificial data, giving a good understanding of the consequences of inaccuracies in each of the calibration angles. In future work, this will be exploited to make the system self-calibrating to an extent, using the offline map as a reference. Another logical extension would be the implementation of a feedback loop of the trajectory to the vehicle suspension model to reduce perspective uncertainty. Also, the current version makes no use of the established consensus to refine the PUTs and recalculate the consensus. One can reasonably assume that such an iterative method will further improve the results, however at the cost of longer computation. Finally, a detailed analysis of the evolution of the uncertainties over time could be carried out to derive a confidence measure for the estimated position and orientation.
References
1. Amidi, O., Kanade, T., Miller, J.: Vision-based autonomous helicopter research at CMU. In: Proc. of Heli Japan 1998 (1998)
2. Azuma, T., Sugimoto, S., Okutomi, M.: Egomotion estimation using planar and non-planar constraints. In: Intelligent Vehicles Symposium (IV), pp. 855–862. IEEE, Los Alamitos (2010)
3. Bouguet, J.: Visual Methods for Three-Dimensional Modeling. Ph.D. thesis, California Institute of Technology (May 1999)
4. Campbell, J., Sukthankar, R., Nourbakhsh, I., Pahwa, A.: A robust visual odometry and precipice detection system using consumer-grade monocular vision. In: Proc. of IEEE Int. Conf. on Robotics and Automation (ICRA) 2005, pp. 3421–3427 (2005)
5. Cheng, Y., Maimone, M., Matthies, L.: Visual odometry on the Mars exploration rovers. IEEE Robotics and Automation Magazine 13(2) (2006)
6. Comport, A., Malis, E., Rives, P.: Accurate quadrifocal tracking for robust 3d visual odometry. In: Proc. of IEEE Int. Conf. on Robotics and Automation (ICRA) 2007, pp. 40–45 (2007)
7. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2004)
8. Kitt, B., Geiger, A., Lategahn, H.: Visual odometry based on stereo image sequences with RANSAC-based outlier rejection scheme. In: Intelligent Vehicles Symposium (IV), pp. 486–492. IEEE, Los Alamitos (2010)
9. Konolige, K., Agrawal, M., Solà, J.: Large-scale visual odometry for rough terrain. In: Int. Symposium on Research in Robotics (2007)
10. Levin, A., Szeliski, R.: Visual odometry and map correlation. In: Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition 2004, vol. 1-I, pp. 611–618 (2004)
11. Marks, R., Wang, H., Lee, M., Rock, S.: Automatic visual station keeping of an underwater robot. In: Proc. of IEEE Oceans 1994, pp. 137–142 (1994)
12. Mouragnon, E., Lhuillier, M., Dhome, M., Dekeyser, F., Sayd, P.: Generic and real-time structure from motion using local bundle adjustment. Image and Vision Computing 27(8) (2009)
13. Negahdaripour, S., Horn, B.: Direct passive navigation. IEEE Transactions on Pattern Analysis and Machine Intelligence 9(1) (1987)
14. Nistér, D.: An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(6) (2004)
15. Nistér, D., Naroditsky, O., Bergen, J.: Visual odometry for ground vehicle applications. Journal of Field Robotics 23 (2006)
16. Obdržálek, Š., Matas, J.: A voting strategy for visual ego-motion from stereo. In: Intelligent Vehicles Symposium (IV), pp. 382–387. IEEE, Los Alamitos (2010)
17. Scaramuzza, D., Fraundorfer, F., Siegwart, R.: Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC. In: Proc. of IEEE Int. Conf. on Robotics and Automation (ICRA) 2009, pp. 4293–4299 (2009)
18. Tardif, J.-P., Pavlidis, Y., Daniilidis, K.: Monocular visual odometry in urban environments using an omnidirectional camera. In: Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS) 2008, pp. 2531–2538 (2008)
Supervised Visual Vocabulary with Category Information

Yunqiang Liu1 and Vicent Caselles2

1 Barcelona Media - Innovation Center, Barcelona, Spain
2 Universitat Pompeu Fabra, Barcelona, Spain
[email protected], [email protected]
Abstract. The bag-of-words model has been widely employed in image classification and object detection tasks. The performance of bag-of-words methods depends fundamentally on the visual vocabulary that is applied to quantize the image features into visual words. Traditional vocabulary construction methods (e.g. k-means) are unable to capture the semantic relationship between image features. In order to increase the discriminative power of the visual vocabulary, this paper proposes a technique to construct a supervised visual vocabulary by jointly considering image features and their class labels. The method uses a novel cost function in which a simple and effective dissimilarity measure is adopted to deal with category information. In addition, we adopt a prototype-based approach which tries to find prototypes for clusters instead of using the means as in the k-means algorithm. The proposed method works like the k-means algorithm by efficiently minimizing a clustering cost function. The experiments on different datasets show that the proposed vocabulary construction method is effective for image classification.

Keywords: bag of words, visual vocabulary, category information, image classification.
1
Introduction
The bag-of-words model [1] has been widely employed in image classification and object detection tasks due to its simplicity and good performance. It describes images as sets of elementary local features called visual words. Generally, this method consists of several basic steps: (1) keypoints are sampled using various detectors, (2) keypoints are represented using local descriptors such as SIFT (scale-invariant feature transform) [2], (3) descriptors are vector quantized into a fixed-size vocabulary using a clustering algorithm, where each resulting cluster corresponds to a visual word, and (4) by mapping keypoints to visual words, images are represented by the histogram of visual word occurrences in the image. This histogram can be used as input to a classifier in the classification step. The performance of bag-of-words methods depends fundamentally on the visual vocabulary that is applied to quantize the continuous local image features into discrete visual words. Various techniques have been explored to address this quantization process for vocabulary generation, e.g. k-means [1] and
mean-shift [3]. These approaches are typically designed to build up a vocabulary with the objective of minimizing the reconstruction error. Vocabularies are usually constructed based solely on the statistics of local image patches without taking category information into account, and cannot capture the semantic relationship between image features. This semantic relationship is useful for building up discriminative vocabularies for image classification. Some recent approaches have addressed the problem of building discriminative vocabularies. Winn et al. [4] obtained a discriminative vocabulary by reducing a large one using pair-wise word merging based on the mutual information between visual words and categories. Lazebnik et al. [5] proposed to construct supervised vocabularies by minimizing mutual information loss. Moosmann et al. [6] built up discriminative visual vocabularies using extremely randomized clustering forests which consider semantic labels as stopping tests in building supervised indexing trees. Perronnin et al. [7] learned a universal vocabulary and then adapted it to class-specific data. Yang et al. [8] proposed a vocabulary learning framework that is integrated with classifier learning; they encode an image by a sequence of visual bits that constitute the vocabulary. Zhang et al. [9] constructed a category-sensitive vocabulary where the category information is incorporated as an additional term. Lian et al. [10] proposed probabilistic models for supervised vocabulary learning which seek a balance between minimization of cluster distortions and maximization of the discriminative power of the histogram representations of images. In order to increase the discriminative power of the visual vocabulary, this paper proposes a technique to construct a supervised visual vocabulary by jointly considering image features and their class labels. The method uses a novel cost function in which a simple and effective dissimilarity measure is adopted to deal with category information. In addition, a prototype-based approach is adopted which finds prototypes for clusters instead of the means used in the k-means algorithm. The proposed method works like the k-means algorithm by efficiently minimizing a clustering cost function.
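As a concrete illustration of the four-step pipeline recalled above (keypoint sampling, local description, vector quantization, histogram encoding), the following Python sketch builds a k-means vocabulary and encodes an image as a histogram of visual-word occurrences. It is only an illustrative sketch, not the implementation used in this paper; the random arrays stand in for real SIFT descriptors and the vocabulary size is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_descriptors, k=100, seed=0):
    """Cluster pooled local descriptors (N x d) into k visual words."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(train_descriptors)

def bow_histogram(image_descriptors, vocabulary):
    """Map each local descriptor to its nearest word and count occurrences."""
    words = vocabulary.predict(image_descriptors)            # hard assignment
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                       # normalized histogram

# Toy usage with random "descriptors" standing in for SIFT vectors.
rng = np.random.default_rng(0)
train = rng.normal(size=(5000, 128))
vocab = build_vocabulary(train, k=100)
image = rng.normal(size=(300, 128))
h = bow_histogram(image, vocab)   # feature vector fed to the classifier
```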
2
Supervised Vocabulary Construction
Under the bag-of-words model, images are represented by a histogram of visual word occurrences relative to a given vocabulary. Traditional vocabularies are usually constructed without taking category information into account. Consequently, histogram representations of images may not be optimal for classification tasks. Our aim is to construct a discriminative visual vocabulary by jointly considering image features and their category information. Fig. 1 shows the problem of the traditional visual vocabulary on a toy dataset. The features denoted as ◦ and + belong to classes A and B, respectively. A traditional method like k-means focuses on minimizing the overall reconstruction error without considering the category information and would generally produce two words (denoted as ×) as shown in Fig. 1(a). If we use the histogram of visual words as the feature vector for classification, the described vocabulary would
not be very attractive since the histograms for classes A and B would be the same in this case (as shown in Fig. 2(a) and (b)). A more discriminative vocabulary can be constructed if we jointly consider image features and their category information, trying to find a tradeoff between the overall reconstruction error and class purity. Fig. 1(b) shows the generated words (denoted as ×). Based on the supervised vocabulary, we can obtain histogram representations as shown in Fig. 2(c) and (d), from which we can easily distinguish classes A and B. Obviously, the supervised vocabulary is more discriminative for classifying A and B. Please note that the category information is only used at the training stage to obtain the vocabulary; at the testing stage, each feature is assigned to its nearest visual word (as Fig. 1(b) shows) using the traditional distance measure, without using category information. However, we can expect that the visual vocabulary obtained by the supervised method has high class purity for a non-annotated testing dataset if its feature distribution is similar to that of the training dataset. Thus the supervised vocabulary is also discriminative for the testing dataset.
Fig. 1. A toy problem
Fig. 2. Histogram representation for toy problem
2.1
Supervised Vocabulary Construction
Let F = {f_1, f_2, ..., f_N} be the set of local image features in the training set, where each feature is represented by a d-dimensional vector (d = 128 for the SIFT descriptor). We assume that the class label of each image is also known, so that we have Y = {y_1, y_2, ..., y_N}, where y_i is the class label corresponding to feature f_i. Our goal is to partition the dataset F into k clusters for generating a vocabulary with k words. The k-means algorithm searches for a partition into k clusters so that the sum of squared errors between the empirical mean of each cluster and its corresponding points is minimized. The process can be formulated as the following constrained optimization problem:

$$\min_{W,V} \; C(W,V) = \sum_{j=1}^{k} \sum_{i=1}^{n} w_{i,j}\, d(f_i, v_j) \qquad (1)$$

subject to $\sum_{j=1}^{k} w_{i,j} = 1,\; w_{i,j} \in \{0,1\}$,

where W = [w_{i,j}]_{n×k} is a partition matrix, V = {v_1, v_2, ..., v_k} is a set of cluster centers, d(x, y) = ||x − y||^2 is the squared Euclidean distance between two vectors x and y, and v_j is the mean of cluster j. Solving (1) is known to be an NP-hard problem. K-means solves it using an alternating optimization strategy [11]. The main steps of the k-means algorithm are as follows:

1. Select an initial V randomly or based on some prior knowledge.
2. Fix V and find the distribution for W minimizing C, which is given by:
$$w_{i,j} = \begin{cases} 1, & \text{if } d(f_i, v_j) \le d(f_i, v_t) \text{ for } 1 \le t \le k \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
3. Fix W and determine V so that C is minimized. Then v_j is given by:
$$v_j = \frac{\sum_{i=1}^{n} w_{i,j} f_i}{\sum_{i=1}^{n} w_{i,j}} \qquad (3)$$
4. Repeat steps 2 and 3 until there are no changes in any cluster.

The vocabulary constructed by k-means solely considers the statistics of local image patches without taking category information into account. It therefore cannot capture the semantic relation between image features. As discussed above, this vocabulary is not optimal for classification tasks. Inspired by [12], we propose a technique to construct a supervised visual vocabulary by jointly considering image features and their class labels. We use a novel cost function in which a simple and effective dissimilarity measure is adopted to deal with category information. Moreover, we adopt a prototype-based approach which finds prototypes for clusters instead of using the means as in the k-means algorithm. The prototype, denoted as P = {(v_j, u_j); j = 1, ..., k}, consists of two vector entries: the feature
prototype V and the corresponding label prototype U = {u_1, u_2, ..., u_k}. The feature prototype V is used to generate the visual vocabulary. In order to construct a supervised vocabulary, we modify the cost function of (1) as follows:

$$\min_{W,V,U} \; C(W,V,U) = \sum_{j=1}^{k} \sum_{i=1}^{n} w_{i,j}\, d'(f_i, p_j) \qquad (4)$$

subject to $\sum_{j=1}^{k} w_{i,j} = 1,\; w_{i,j} \in \{0,1\}$,

where $d'(f_i, p_j) = d(f_i, v_j) + \lambda\, \delta(y_i, u_j)$, and δ(x, y) is the dissimilarity measure between two class labels x and y, defined as:
$$\delta(x, y) = \begin{cases} 1, & x \ne y \\ 0, & x = y \end{cases} \qquad (5)$$
and λ is the weight parameter that adjusts the influence of label dissimilarities. When λ equals zero, the method reduces to standard k-means, and it generates a category-specific visual vocabulary when λ is large enough. Similar to k-means, we minimize C by using an alternating optimization strategy. Its main steps are as follows:

1. Select an initial prototype P randomly or based on some prior knowledge.
2. Fix P and find the distribution for the partition matrix W minimizing C, which is given by:
$$w_{i,j} = \begin{cases} 1, & \text{if } d'(f_i, p_j) \le d'(f_i, p_t) \text{ for } 1 \le t \le k \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$
3. Fix W and determine P so that C is minimized according to Lemma 1 below.
4. Repeat steps 2 and 3 until there are no changes in any cluster.

In the first step, the prototypes (v_j, u_j) are initialized within each category, and the number of prototypes is proportional to the number of features in the corresponding category. Specifically, suppose the expected number of prototypes (or visual words) is k for the whole dataset; we first assign the number of prototypes for each category according to the number of features in that category, such that the total number of prototypes over all categories equals k. Once the number of prototypes for each category is determined, we randomly initialize the prototypes within each category.

Lemma 1. Suppose that the partition distribution W is fixed. Then the function C is minimized if and only if the prototype P is assigned as follows:
$$v_j = \frac{\sum_{i=1}^{n} w_{i,j} f_i}{\sum_{i=1}^{n} w_{i,j}} \qquad (7)$$
$$u_j = \arg\max_{c}\, (n_{c,j}) \qquad (8)$$
where n_{c,j} is the number of features having the class label c within the current cluster j under W.
Proof. Let
$$C_f(W,V) = \sum_{j=1}^{k} \sum_{i=1}^{n} w_{i,j}\, d(f_i, v_j), \qquad C_y(W,U) = \sum_{j=1}^{k} \sum_{i=1}^{n} w_{i,j}\, \delta(y_i, u_j).$$
Then
$$C(W,V,U) = \sum_{j=1}^{k} \sum_{i=1}^{n} w_{i,j}\, d'(f_i, p_j)$$
can be written as C(W,V,U) = C_f(W,V) + λ C_y(W,U). Since C_f(W,V) and C_y(W,U) are nonnegative and independent of each other, minimizing C(W,V,U) with respect to V and U is equivalent to minimizing C_f(W,V) and C_y(W,U) with respect to V and U, respectively. It is obvious that C_f(W,V) is minimized when (7) holds, which is the same as in standard k-means. Moreover, C_y(W,U) can be rewritten as:
$$C_y(W,U) = \sum_{j=1}^{k} \sum_{i=1}^{n} w_{i,j}\, \delta(y_i, u_j) = \sum_{j=1}^{k} \left(n_j - n_{u_j,j}\right) \qquad (9)$$
where n_j is the number of features in cluster j according to W (notice that n_j is fixed since W is fixed), and n_{u_j,j} is the number of features with class label u_j within cluster j. C_y(W,U) is minimized if and only if every n_{u_j,j} is maximal for every j = 1, ..., k. Therefore u_j must be the class label with the largest number of features in cluster j.
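To make the procedure of this section concrete, the following Python sketch implements the alternating optimization of equations (4)–(8) under simplifying assumptions: random initialization of the prototypes (the paper initializes them per category), squared Euclidean feature distance, and non-negative integer class labels. It is an illustrative sketch, not the authors' code.

```python
import numpy as np

def supervised_vocabulary(F, y, k, lam=1.0, n_iter=20, seed=0):
    """Prototype-based clustering of features F (N x d) with integer labels y (N,).

    Alternately minimizes sum_i ||f_i - v_{c(i)}||^2 + lam * [y_i != u_{c(i)}],
    i.e. the cost of eq. (4) with the label dissimilarity of eq. (5).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(F), size=k, replace=False)
    V = F[idx].astype(float)                 # feature prototypes v_j
    U = y[idx].copy()                        # label prototypes u_j
    for _ in range(n_iter):
        # Assignment step (eq. (6)): squared distance plus label penalty.
        d = ((F[:, None, :] - V[None, :, :]) ** 2).sum(-1) + lam * (y[:, None] != U[None, :])
        assign = d.argmin(axis=1)
        # Update step (Lemma 1): cluster mean and majority class label.
        for j in range(k):
            members = assign == j
            if members.any():
                V[j] = F[members].mean(axis=0)
                U[j] = np.bincount(y[members]).argmax()
    return V, U, assign

# Toy usage with random features and two classes.
rng = np.random.default_rng(1)
F = rng.normal(size=(400, 8))
y = rng.integers(0, 2, size=400)
V, U, assign = supervised_vocabulary(F, y, k=10, lam=1.0)
```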
3
Experiments
We evaluated the proposed supervised vocabulary on the image classification task using two public datasets: MSRC-2 [4] and OT [13]. We first describe the implementation setup and then give the comparison results.
3.1
Experimental Setup
For the two datasets, we use only the grey-level information in all the experiments. The keypoints are obtained using dense sampling, since comparative results have shown that a dense set of keypoints works better than sparsely detected keypoints in many computer vision applications [14]. Specifically, we compute keypoints on a dense grid with spacing d = 7 in both the horizontal and vertical directions. SIFT descriptors are computed at each patch over a circular support area with radius r = 5. Classification is performed using KNN and SVM classifiers on both datasets. The RBF (radial basis function) kernel is used in the SVM classifier. Multi-class classification is obtained from two-class SVMs using a one-vs-all strategy. The parameters, such as the number of neighbors in KNN and the regularization parameter c in the SVM, are determined using k-fold (k = 5) cross validation.
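A small sketch of the dense sampling described above: keypoint centers are placed on a regular grid with spacing d = 7 while keeping a circular support of radius r = 5 inside the image; the SIFT computation itself is omitted, and the image dimensions are placeholders.

```python
import numpy as np

def dense_keypoints(height, width, spacing=7, radius=5):
    """Return (row, col) centers on a regular grid, keeping the support inside the image."""
    rows = np.arange(radius, height - radius, spacing)
    cols = np.arange(radius, width - radius, spacing)
    return np.array([(r, c) for r in rows for c in cols])

pts = dense_keypoints(240, 320)   # a SIFT descriptor would be computed at each of these points
```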
3.2
Experimental Results
MSRC-2. The MSRC-2 dataset contains 20 classes with 30 images per class. We choose six of these classes: {airplane, cow, face, car, bike, sheep}. The visual vocabulary size is set to 100. Moreover, we randomly divided the images within each class into two groups of equal size to form a training set and a testing set. We repeat each experiment five times over different splits. Fig. 3 shows how the classification accuracy varies with the weight parameter λ using an SVM classifier. Observe that the best performance is obtained when λ = 1.0. Therefore, λ is empirically set to 1.0 in the subsequent experiments. Please note that the best value of λ will change when different feature spaces are used to represent the image; it is recommended to set λ experimentally. We report the comparison results in terms of classification accuracy in Table 1 for the KNN and SVM classifiers. The supervised vocabulary method consistently outperforms the vocabulary obtained using k-means for both classifiers. Moreover, the computation time of the supervised vocabulary generation is very close to that of k-means. On average, supervised vocabulary generation takes 1.79 s per iteration with 114840 features and 100 words, compared with 1.78 s for k-means.
Fig. 3. Classification accuracy (%) plotted with λ

Table 1. Classification accuracy (%) on MSRC-2

        k-means   Ours
KNN     81.0      83.5
SVM     86.9      89.2
OT Dataset. The OT dataset contains 8 categories: {coast, forest, highway, insidecity, mountain, opencountry, street, tallbuilding}. We use the first 200 images per class in the experiment. Images within each class are randomly divided into two subsets of the same size to form a training set and a testing set. We set the visual vocabulary size to 200. Table 2 shows the classification results averaged over five experiments with different random splits of the dataset.
Fig. 4. Confusion matrix for supervised vocabulary

Table 2. Classification accuracy (%) on OT

        k-means   Ours
KNN     67.2      68.7
SVM     72.1      73.4
A confusion matrix for the supervised vocabulary with the SVM classifier is presented in Fig. 4 in order to give more details on the categorization of each class. The first column contains the true labels and the last row lists the predicted labels.
4
Conclusions
This paper proposed a supervised visual vocabulary construction approach which jointly considers image features and their class labels. A novel cost function is adopted in which a simple dissimilarity measure handles the category information. The proposed method runs like the k-means algorithm to minimize the clustering cost function. The experiments on different datasets demonstrate that the proposed vocabulary construction method is effective for image classification. Acknowledgements. This work was partially funded by Mediapro through the Spanish project CENIT-2007-1012 i3media and by the Centro para el Desarrollo Tecnológico Industrial (CDTI). The authors acknowledge partial support by the EU project “2020 3D Media: Spatial Sound and Vision” under FP7-ICT. Y. Liu also acknowledges partial support from the Torres Quevedo Program of the Ministry of Science and Innovation in Spain (MICINN), co-funded by the European Social Fund (ESF). V. Caselles also acknowledges partial support by the MICINN project with reference MTM2009-08171, by GRC reference 2009 SGR 773, and by the “ICREA Acadèmia” prize for excellence in research funded by the Generalitat de Catalunya.
References 1. Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: Proc. ICCV, vol. 2, pp. 1470–1477 (2003) 2. Lowe, G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) 3. Jurie, F., Triggs, B.: Creating efficient codebooks for visual recognition. In: Proc. ICCV (2005) 4. Winn, J., Criminisi, A., Minka, T.: Object categorization by learned universal visual dictionary. In: Proc. ICCV, pp. 1800–1807 (2005) 5. Lazebnik, S., Raginsky, M.: Supervised Learning of Quantizer Codebooks by Information Loss Minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(7), 1294–1309 (2009) 6. Moosmann, F., Nowak, E., Jurie, F.: Randomized clustering forests for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(9), 1632–1646 (2008) 7. Perronnin, F.: Universal and Adapted Vocabularies for Generic Visual Categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(7), 1243–1256 (2008) 8. Yang, L., Jin, R., Sukthankar, R., Jurie, F.: Unifying discriminative visual codebook generation with classifier training for object category recognition. In: Proc. CVPR (2008) 9. Zhang, C., Liu, J., Ouyang, Y., Tian, Q., Lu, H., Ma, S.: Category sensitive codebook construction for object category recognition. In: Proc. ICIP (2009) 10. Lian, X., Li, Z., Wang, C., Lv, B., Zhang, L.: Probabilistic models for supervised dictionary learning. In: Proc. CVPR (2010) 11. Jian, A., Dubes, R.: Algorithms for clustering data. Prentice Hall, Englewood Cliffs (1988) 12. Huang, Z.: Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery, 283–304 (1998) 13. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42(3), 145–175 (2001) 14. Bosch, A., Zisserman, A., Munoz, X.: Scene classification using a hybrid generative/discriminative approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(4), 712–727 (2008)
Nonparametric Estimation of Fisher Vectors to Aggregate Image Descriptors Hervé Le Borgne and Pablo Muñoz Fuentes CEA, LIST, Laboratory of Vision and Content Engineering 18 route du Panorama, BP6, Fontenay-aux-Roses, F-92265 France
Abstract. We investigate how to represent a natural image in order to be able to recognize the visual concepts within it. The core of the proposed method is a new approach to aggregate local features, based on a non-parametric estimation of the Fisher vector, which results from the derivation of the gradient of the loglikelihood. For this, we need to use low-level local descriptors that are learned with independent component analysis and thus provide a statistically independent description of the images. The resulting signature has a very intuitive interpretation, and we propose an efficient implementation as well. We show on publicly available datasets that the proposed image signature performs very well.
1 Introduction
Contemporary works on image classification (but also on object recognition and image retrieval) showed the efficiency of approaches based on the computation of local features (such as SIFT descriptors [14]). However, since the number of points of interest may vary from one image to another, the dimension of the vector representing an image (i.e., number of keypoints × local feature size) is not constant, making it unusable as input to the usual machine learning algorithms of the image classification paradigm. This drawback was circumvented through the bag-of-visterms (BOV) approach [20]. It consists in pre-computing a codebook of visual words and then coding each image according to this visual vocabulary. The simplest approach here is hard quantization, which associates each local feature with one visual word of the dictionary. Significant improvement is obtained by using a soft assignment scheme [5] or sparse coding [21,1]. The BOV signature can be refined by the spatial pyramid matching (SPM) scheme [10], which adds spatial information at several scales. It consists of averaging the local features according to hierarchical regular grids over the image (average pooling), although recent work showed that one may benefit from considering their maximum instead (maximum pooling) [21,1]. An alternative to the BOV scheme was recently proposed to aggregate local descriptors. In [16], the visual vocabulary is modeled with a Gaussian mixture model (GMM) and a signature is then derived according to the Fisher kernel principle, which consists in computing the gradient of the loglikelihood of the problem with respect to some parameters. It allows one to take advantage of a generative model of the data and of discriminative properties at the same time. The VLAD descriptor [8] can be seen as a derivative of the Fisher kernel too. In practice, it consists of accumulating the difference between (each component of) the local descriptors and the visual words of the codebook. To be applied
to image retrieval, both of these signatures were compressed using principal component analysis and a product kernel [8] or a simple binarization strategy [17]. The main contribution of this paper is to propose a non-parametric estimation of the Fisher kernel principle and then derive an image signature that can be used for image classification. The main difference with the work of Perronnin [16,17] is that he used a parametric method (GMM) to estimate the density of the visual words. Hence, two parameters affect its performance: (i) the number of Gaussian components and (ii) the parameters with respect to which the gradient is computed. By using a non-parametric estimation of the Fisher kernel principle, our approach circumvents these issues, and may also provide a more accurate model of the loglikelihood. However, the theoretical framework we propose has to be applied to local features with statistically independent dimensions. Such descriptors were proposed in [11], where the density of each dimension of the descriptors is directly estimated with parametric and non-parametric methods. Our work differs from theirs since the non-parametric estimation is only a step in our framework, and the signature we finally obtain from the Fisher kernel derivation is different. A second contribution of this work is to propose an efficient implementation of the image signature obtained according to the proposed theoretical framework. We propose some choices that make the computation of the signature simpler than in theory and show that classification results are maintained or even improved. In section 2 we review the Fisher kernel framework and the non-parametric density estimation used, and derive the proposed image signature. We present in section 3 some experimental results on several publicly available benchmarks and compare our approach to recent work in image scene classification.
2 Aggregating the Image Feature
2.1 Fisher Kernel, Score and Vector
Let X = {x_t, t = 1 ... T} be a set of vectors, for instance used to describe an image (e.g., the collection of local features extracted from it). It can be seen as resulting from a generative probability model with density f(X|θ). To derive a kernel function from such a generative model, i.e., one that is also able to exhibit discriminative properties, Jaakola [7] proposed to use the gradient of the log-likelihood with respect to the parameters, called the Fisher score:
$$U_X(\theta) = \nabla_\theta \log f(X|\theta) \qquad (1)$$
This transforms the variable length of the sample X into a fixed-length vector that can feed a classical learning machine. In the original work of [7], the Fisher information matrix F_λ is suggested to normalize the vector:
$$F_\lambda = E_X\!\left[\nabla_\theta \log f(X|\theta)\, \nabla_\theta \log f(X|\theta)^T\right] \qquad (2)$$
It then results in the Fisher vector:
$$G_X(\theta) = F_\lambda^{-1/2}\, \nabla_\theta \log f(X|\theta) \qquad (3)$$
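As a toy illustration of equations (1)–(3), the following sketch computes the Fisher score and a normalized Fisher vector for a sample under a univariate Gaussian generative model. The Gaussian is only a stand-in chosen because its Fisher information has a simple closed form; it is not the model used in this paper.

```python
import numpy as np

def gaussian_fisher_score(x, mu, sigma):
    """Fisher score U_X(theta) = grad_theta sum_t log N(x_t; mu, sigma^2)."""
    d_mu = np.sum((x - mu) / sigma**2)
    d_sigma = np.sum((x - mu)**2 / sigma**3 - 1.0 / sigma)
    return np.array([d_mu, d_sigma])

def gaussian_fisher_vector(x, mu, sigma):
    """Normalize the score by the (diagonal) Fisher information, as in eq. (3)."""
    n = len(x)
    F_diag = np.array([n / sigma**2, 2.0 * n / sigma**2])   # closed form for a Gaussian
    return gaussian_fisher_score(x, mu, sigma) / np.sqrt(F_diag)

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=200)
g = gaussian_fisher_vector(sample, mu=0.0, sigma=1.0)  # fixed-length signature of the sample
```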
2.2 Logspline Density Estimation
The traditional way of modeling a distribution density is to assume a classical parametric model such as the normal, gamma or Weibull distribution. For instance, in [16] the vocabularies of visual words are represented with Gaussian mixture models, for which the parameters (weight, mean and variance of each Gaussian) are estimated by maximum likelihood. Alternatively, we can use a nonparametric estimate of the density, such as a histogram or a kernel-based method. A histogram density estimation can be seen as modeling the unknown log-density function by a piecewise constant function and estimating the unknown coefficients by maximum likelihood. In this vein, Kooperberg [9] proposed to model the log-density function by cubic splines (twice-continuously differentiable piecewise cubic polynomials), resulting in the so-called logspline density estimation. More precisely, given the lower bound L and upper bound U of the data (L and U can be infinite), and a sequence of K values t_1, ..., t_K such that L < t_1 < ... < t_K < U (later referred to as knots) with K > 2, we consider the space S consisting of the twice-continuously differentiable functions f_s on (L, U) such that the restriction of f_s to each of the intervals [t_1, t_2], ..., [t_{K−1}, t_K] is a cubic polynomial, and f_s is linear on (L, t_1] and [t_K, U). The functions of the K-dimensional space S are called natural (cubic) splines. Let 1, B_1, ..., B_p (with p = K − 1) be a set of basis functions that span the space S, chosen such that B_1 is linear with negative slope on (L, t_1], B_2, ..., B_p are constant on (L, t_1], B_p is linear with positive slope on [t_K, U), and B_1, ..., B_{p−1} are constant on [t_K, U). Given θ = (θ_1, ..., θ_p) ∈ R^p such that
$$\int_L^U \exp\big(\theta_1 B_1(y) + \cdots + \theta_p B_p(y)\big)\, dy < \infty, \qquad (4)$$
we can consider the exponential family of distributions based on these basis functions:
$$f(y, \theta) = \exp\big(\theta_1 B_1(y) + \cdots + \theta_p B_p(y) - C(\theta)\big) \qquad (5)$$
where C(θ) is a normalizing constant such that
$$\int_{\mathbb{R}} f(y, \theta)\, dy = 1. \qquad (6)$$
As shown in [9], it is possible to determine the maximum likelihood estimate of θ with a Newton–Raphson method with step-halving. They also proposed a knot selection methodology based on the Akaike Information Criterion (AIC) to select the best model. This finally results in the estimation of the maximum likelihood estimate θ̂ of the coefficients, the knots t_1, ..., t_K, and thus the desired density. Note that at convergence (θ = θ̂) the loglikelihood is maximal, and thus its derivative is null. Let {y_1, ..., y_n} be a random sample from f. We have:
$$\left.\frac{\partial L(Y, \theta)}{\partial \theta_j}\right|_{\theta = \hat\theta} = 0 = \sum_{t=1}^{n} B_j(y_t) - n\, \frac{\partial C(\hat\theta)}{\partial \theta_j} \qquad (7)$$
Thus at convergence we have:
$$\frac{\partial C(\hat\theta)}{\partial \theta_j} = E_f\!\left[B_j(y)\right] \qquad (8)$$
where E_f[B_j(y)] denotes the Monte Carlo estimate of the expectation of B_j(·) according to f. This will be used later in our development.
2.3 Signature Derivation
Let us consider that any image is described by some D-dimensional vectors. Each feature dimension x^i can be thought of as arising as a random sample from a distribution having a density h^i. We can model the log-density function by a cubic spline, as explained in section 2.2. Hence, there exists a basis 1, B_1^i, ..., B_{p^i}^i of S such that:
$$h^i(x^i, \theta^i) = \exp\!\left(\sum_{j=1}^{p^i} \theta_j^i B_j^i(x^i) - C^i(\theta^i)\right) \qquad (9)$$
Let {y_t, t = 1 ... T} be a set of vectors extracted from a given image, seen as T independent realizations of the D-dimensional random vector Y (for simplicity, Y denotes both the image and the corresponding random vector). The log-likelihood is thus:
$$L(Y, \theta) = \sum_{t=1}^{T} \log\big(h(y_t, \theta)\big) \qquad (10)$$
where h(y_t, θ) denotes the density of Y. If one assumes the independence of all feature dimensions (this point is discussed in section 2.6), we have:
$$h(y_t, \theta) = \prod_{i=1}^{D} h^i(y_t^i, \theta^i) \qquad (11)$$
Thus:
$$L(Y, \theta) = \sum_{t=1}^{T} \sum_{i=1}^{D} \log h^i(y_t^i, \theta^i) \qquad (12)$$
Each density h^i can be estimated on the same basis as the one determined during learning. In other words, each h^i is expressed as in (9) with specific values for the coefficients θ_j^i. Hence, from (12) it follows:
$$\frac{\partial L(Y, \theta)}{\partial \theta_j^i} = \sum_{t=1}^{T} B_j^i(y_t^i) - T\, \frac{\partial C^i(\theta^i)}{\partial \theta_j^i} \qquad (13)$$
The first component of equation (13) is simply the expectation of B_j^i(y) according to the density of the considered image (i.e., estimated from the samples {y_t, t = 1 ... T}). The second component is a function that depends on θ_j^i. If one assumes that the considered samples follow a probability law quite close to the one estimated during learning, we can apply equation (8), and finally:
$$\left.\frac{\partial L(Y, \theta)}{\partial \theta_j^i}\right|_{\theta_j^i \approx \hat\theta_j^i} = E_{h^i}\!\left[B_j^i(y)\right] - E_{f^i}\!\left[B_j^i(y)\right] \qquad (14)$$
where h^i(·) is the density of the image descriptor and f^i(·) the density of the class descriptor (dimension i), the latter being estimated from local descriptors extracted from several learning images. The full gradient vector U_Y(θ) is a concatenation of these partial derivatives with respect to all parameters. Its number of components is $\sum_{i=1}^{D} p^i$, where p^i is the number of non-constant polynomials of the basis of S for dimension i. Equation (14) leads to a remarkably simple expression, whose physical interpretation is quite straightforward. It simply reflects the way a specific image (with density h^i) differs from the average world (i.e., the density f^i), through a well-chosen polynomial basis, at each dimension (figure 1). The average world $E_{f^i}[B_j^i(y)]$ can be seen as a sort of codebook. If one uses linear polynomials $B_j^i(y) = \alpha_j y^i$, equation (14) relates to the VLAD signature, with an important difference: all vectors are used (i) during learning to estimate the codewords and (ii) during test to compute the signature, whereas (i) k-means uses only the closest vectors of a codeword (cluster center) to re-estimate it at each step and (ii) VLAD likewise uses only nearest neighbors to compute the signature components (see eq. (1) in [8]).
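The following sketch illustrates how a signature could be computed from equation (14): for each (assumed independent) feature dimension, the empirical mean of each basis-function response over the image is compared with the corresponding average estimated on the learning set. The toy basis follows the form of the cubic positive-part polynomials introduced later in Section 2.5; all arrays, knot values and names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fisher_signature(image_feats, basis, train_ref):
    """Eq. (14): for every dimension i and basis function B_j^i,
    compute E_{h^i}[B_j^i(y)] - E_{f^i}[B_j^i(y)]."""
    sig = []
    for i in range(image_feats.shape[1]):
        y = image_feats[:, i]
        for B, ref in zip(basis[i], train_ref[i]):
            sig.append(B(y).mean() - ref)   # Monte Carlo estimate minus learning-set average
    return np.asarray(sig)

# Toy one-dimensional basis in the spirit of eq. (17): identity plus positive-part cubics.
knots = [-1.0, 0.0, 1.0]                                  # assumed knot positions
basis = [[lambda y: y] + [lambda y, t=t: (0.5 * (np.abs(y - t) + y - t)) ** 3 for t in knots]]

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(10000, 1))                 # stand-in learning-set responses
train_ref = [[B(train_feats[:, 0]).mean() for B in basis[0]]]
signature = fisher_signature(rng.normal(0.3, 1.0, size=(500, 1)), basis, train_ref)
```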
Fig. 1. Example images from the scene15 database (1st column), with the response of an ICA filter (dimension i) to a particular image, i.e., h^i in equation (14) (2nd column), and the average density over the category, i.e., f^i in equation (14) (3rd column)
2.4 Signature Normalization
In his seminal work, Jaakola [7] proposed to normalize the Fisher score by the Fisher information matrix. In [16], it was noted that such an operation improves the efficiency of the method in terms of discrimination, by normalizing the dynamic range of the different dimensions of the gradient vector.
To normalize the dynamic range of each dimension of the gradient vector U_Y(θ) (each U_Y(θ_j^i) is given by equation (14)), we need to compute the diagonal terms of the Fisher information matrix F_θ. Considering the expression of each U_Y(θ_j^i) given by equation (14), the diagonal terms of F_θ are:
$$F_{\theta_j^i} = E\!\left[\left(E_{h^i}[B_j^i(y)] - E_{f^i}[B_j^i(y)]\right)^2\right] \qquad (15)$$
The dynamic range being computed on the learning database (density f^i), it is thus the variance of B_j^i(y). From equation (3), the final Fisher vector is:
$$U_Y^n(\theta_j^i) = \frac{E_{h^i}[B_j^i(y)] - E_{f^i}[B_j^i(y)]}{\sigma_{f^i}[B_j^i(y)]} \qquad (16)$$
where σ_{f^i}[·] is the standard deviation computed according to the density f^i. Hence, our normalized signature U_Y^n(θ) can be regarded as the "standardizing" transformation of the raw description of the image given by the polynomial activity B_j^i(y). The "normality" here is the learning database, from which we compute an average E_{f^i}[B_j^i(y)] and a standard deviation σ_{f^i}[B_j^i(y)] at each dimension.
2.5 Efficient Implementation
We discuss in this section the choice of the knots t_1, ..., t_K and of the set of basis functions 1, B_1, ..., B_p (with p = K − 1) that span the space S defined in section 2.2, used to compute our signature according to equation (14). In his article on the logspline density estimation theory [9], Kooperberg proposed an automatic method to place the knots according to an AIC criterion. However, preliminary experiments convinced us that such a process is not necessary and that a simpler strategy is more efficient. Instead, we fix a given number of knots and place them according to the order statistics of the learning data. For instance, if K = 9, the knots are placed at the deciles of the data. Hence, at each dimension, the amount of information is regularly distributed between knots. For low-level features such as those presented in section 2.6, the knots are approximately placed according to a logarithmic distribution. Once the knots (t_1, ..., t_K) are fixed, a natural choice for the set of basis functions (1, B_1, ..., B_p) that span the space S is to consider the polynomials of the form:
$$B_0(y) = 1, \qquad B_1(y) = y, \qquad B_k(y) = \left(\frac{|y - t_k| + y - t_k}{2}\right)^{3} \;\text{ for } k > 1 \qquad (17)$$
With such a basis, B_0 = 1 has no influence on the computation of the signature (see equation (14)), while B_1(·) leads to computing the difference between the mean of the considered image (density h^i) and that of the class (density f^i). The further polynomials B_k(·) are the positive part of a cubic function that is null at knot t_k. Obviously, such a representation is strongly redundant. Hence, we propose two simplifications to this implementation. First, for polynomials B_k with k > 1, we only
consider the values y between knots t_k and t_{k+1}, in order to avoid redundancy. The difference in equation (14) is thus computed on each interval defined by the knots. This can be seen as a sort of quantization of the filter activity, the limits of the cells being the knots. Secondly, for computational efficiency, we only consider the positive part (absolute value) without computing the power three. The chosen basis thus becomes:
$$B_0(y) = 1 \;\text{(not used)}, \qquad B_1(y) = y, \qquad B_{k>1}(y) = \begin{cases} \dfrac{|y - t_k| + y - t_k}{2}, & \text{for } y < t_{k+1} \\ 0, & \text{for } y \ge t_{k+1} \end{cases} \qquad (18)$$
Such an implementation is equivalent to computing only (y − t_k) on the interval [t_k, t_{k+1}], since the polynomial is null elsewhere and y > t_k on the interval. The third proposed simplification is to use a binary weighting scheme. It consists in not considering the value of |y − t_k| in the computation but only its existence. In other words, one simply counts +1 each time a pixel activity y lies between t_k and t_{k+1}. Such a binary weighting scheme is commonly used in the design of BOV, in particular when the codebook is large [20].
2.6 Independent Low-Level Features
According to equation (11), the signature derivation requires the use of independent low-level features, such that the image description density can be expressed as a factorial code. Such features can be obtained with Independent Component Analysis (ICA) [2,6], a class of methods that aims at revealing statistically independent latent variables of observed data. In comparison, the well-known Principal Component Analysis (PCA) would reveal uncorrelated sources, i.e., with null cross-moments only up to order two. Many algorithms have been proposed to achieve such an estimation; they are well reviewed in [6]. These authors proposed the fast-ICA algorithm, which searches for sources that have maximal non-gaussianity. When applied to natural image patches of fixed size (e.g., Δ = 16 × 16 = 256), ICA results in a generative model composed of localized and oriented basis functions [6]. Its inverse, the separating matrix, is composed of independent filters w_1, ..., w_D (of size Δ) that can be used as feature extractors, giving a new representation with mutually independent dimensions. The number of filters (D) extracted by ICA is less than or equal to the input data dimension (Δ). It can be reduced by applying a PCA prior to the ICA. The responses of the D filters at some pixels (p_1, ..., p_T) of an image I(·) are thus independent realizations of the D-dimensional random vector Y. As a consequence, the density can be factorized as expected:
$$h_{ica}(I(p_t)) = \prod_{i=1}^{D} h_{ica}^i(I(p_t)) = \prod_{i=1}^{D} h_{ica}^i\big(w_i * I(p_t)\big) \qquad (19)$$
where ∗ is the convolution product. These independent low-level features can then be used according to the method presented in section 2.3, since they verify equation (11).
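A sketch of how the independent low-level features of this section could be obtained in practice: ICA filters are learned from flattened grayscale patches (here with scikit-learn's FastICA, one possible choice), and their inner products with sampled patches give the statistically independent dimensions that feed the signature of Section 2.3. The random arrays stand in for real natural-image patches; patch size and filter count follow the values quoted in the text but are otherwise assumptions.

```python
import numpy as np
from sklearn.decomposition import FastICA

def learn_ica_filters(patches, n_filters=64, seed=0):
    """patches: (N, 16*16) flattened grayscale patches; returns the rows of the separating matrix."""
    ica = FastICA(n_components=n_filters, random_state=seed, max_iter=500)
    ica.fit(patches)
    return ica.components_                    # filters w_1, ..., w_D

def filter_responses(patches, filters):
    """Responses w_i * I(p_t) for each sampled patch p_t (approximated by an inner product)."""
    return patches @ filters.T                # (T, D): one independent dimension per column

rng = np.random.default_rng(0)
train_patches = rng.normal(size=(10000, 16 * 16))          # stand-in for natural image patches
W = learn_ica_filters(train_patches, n_filters=64)
Y = filter_responses(rng.normal(size=(500, 16 * 16)), W)   # features of one image
```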
3 Experiments
3.1 Datasets and Experimental Setup
The first dataset (scene15, see figure 1) [10] was recently used in several works on scene classification [15,19,18,21,1]. It is composed of 4485 images with 200 to 400 images per category. A given image belongs to exactly one category. The original sources of the pictures include the COREL collection, personal photographs and Google image search. We followed the experimental setup of [10], using 100 images per class for training and the rest for testing. Moreover, only the grey level of the images is considered, making color-based descriptors inoperable on this dataset. The performance is measured using the classification rate, to be comparable to previous works. The second dataset (vcdt08) was used as a benchmark in the Visual Concept Detection Task of the ImageCLEF 2008 campaign [3]. It is composed of 1827 images for training and 1000 for testing, all images being annotated according to 17 categories. The categories are not exclusive, i.e., an image can belong to several of these categories and sometimes to none of them. The performance is measured using the Equal Error Rate (EER), i.e., the point at which the false acceptance rate equals the false rejection rate. Hence, this value lies between 0 and 1, and the smaller the EER the better the system: ideally, false acceptance and false rejection are both null. Two machine learning algorithms were used in the following experiments. The first is an SVM with a linear kernel [4], which has fast convergence. We consider the use of a linear kernel relevant because our method leads to a large signature (from hundreds to thousands of dimensions); in such a high-dimensional space, data are quite sparse and it is thus easier to find a hyperplane that separates them. The second learning algorithm is the fast shared boosting (FSB) [12], which is specifically designed to tackle the problem of overlapping classes (such as in the vcdt08 dataset) by using weak classifiers that share features among classes. Another benefit of this algorithm is its fast convergence when the number of classes grows. When images can belong to several classes, it has been shown to usually give better results than a classic one-versus-all strategy.
3.2 Signature Implementation
We study the effect of the simplifications proposed in section 2.5 to efficiently implement the proposed signature. In this experiment, only D = 64 filters are extracted with fastICA [6] from 40000 patches of size 16 × 16. Multi-class classification on non-exclusive classes is done with the FSB. All the experiments in this section were conducted with 32 knots. The "normal" implementation uses the polynomial basis given in equation (17). The simplifications (section 2.5) include: (i) computing the polynomial between knots instead of above them, (ii) using the absolute value instead of the third power, and (iii) in the case of the absolute value, using a binarized weighting scheme instead of the values. Table 1 shows the performances of all these possible implementations on the vcdt08 benchmark. As expected, reducing the information redundancy in the signature by computing it between knots improves the performances. Concerning the other simplifications, no significant effect is noticed for signatures computed between knots.
Table 1. Performances on the vcdt08 benchmark for several implementations of our method (see text). The lower the EER, the better the method.

          Above knots   Between knots
|.|^3     0.294         0.254
|.| val   0.286         0.257
|.| bin   0.256         0.254
Table 2. Classification accuracy on scene15 for three different implementations (see text). The equation number is given for the centered and the normalized signatures. The simple one is only the first part of equation (14). Signatures were computed with 32 knots and no grid nor pyramid.

Signature      simple   centered   normalized
                        see (14)   see (16)
64 filters     74.20    74.30      73.92
128 filters    75.58    75.71      76.08
256 filters    78.56    78.86      78.42
However, when the signature is computed with the "above knots" implementation, the binarized weighting scheme reaches performances similar to those obtained with the non-redundant implementation. Other variations in the implementation are possible, since we proposed a centered signature (equation (14)) and a normalized one (equation (16)). We also consider here a simple signature that only implements the first member of equation (14), i.e., a non-centered version of it. Since the dimensionality has little influence on the FSB [12], we tested these three signatures as input to an SVM, on the scene15 dataset (exclusive classes). Signatures were also computed for three different sets of ICA filters, of size 64, 128 and 256, with 32 knots. The optimal parameters of the SVM were determined on the learning database with a 5-fold cross validation, and the data were previously scaled according to the method proposed by the authors of [4]. The results presented in Table 2 show that performance improves when more filters are used (see the next section for a discussion on this point), but varies very little from one implementation to another. This is probably due to the data scaling of the SVM we used, which itself standardizes the signatures.
3.3 Signature Parameters
Beyond the implementation details (previous section), the signature we propose mainly depends on two parameters, namely the number of filters D and the number of knots K. In this section we show the influence of these two parameters on the scene15 dataset. We used a simple signature and computed the polynomial activity between knots with a binarized weighting scheme. The signature size is D × K. The optimal parameters of the SVM were determined on the learning database with a 5-fold cross validation, and the data were previously scaled as proposed in [4].
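A sketch of the classification protocol used for these scene15 experiments: signatures are standardized and fed to a linear SVM whose regularization parameter is selected by 5-fold cross-validation on the training set. scikit-learn is used here as a stand-in for the LIBLINEAR package of [4], and the random arrays are placeholders for real signatures and labels.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 64 * 32))        # D x K signature per image (placeholder)
y_train = rng.integers(0, 15, size=300)          # one of the 15 scene categories

clf = make_pipeline(StandardScaler(), LinearSVC())
search = GridSearchCV(clf, {"linearsvc__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)                     # picks C by 5-fold cross-validation
print(search.best_params_, search.best_score_)
```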
Table 3. Classification accuracy on scene15 for three filter bases of various size D and different numbers of knots K. The best result for a given D is marked in bold.

          K = 8   K = 16   K = 32   K = 64
D = 64    71.55   75.11    74.20    73.47
D = 128   71.82   75.88    75.58    74.84
D = 256   76.78   78.49    78.56    77.69
Results are presented in Table 3. For a given number of knots K, the best results are always obtained with the maximal number of filters, which is consistent with the previous results. Indeed, with patches of size 16 × 16 the maximal number of filters is 256. To obtain a smaller collection of ICA filters, the data are reduced with a PCA during the process (see section 2.6), thereby inducing a loss of information detrimental to the resulting descriptors. For a given number of filters D, the best results are obtained with K = 16 or K = 32. As explained above, knots can be regarded as the limits of quantization cells of the ICA filter activity. Hence, when the number of knots is large the cells are too selective (K = 64), while they generalize too much for a small number of knots (K = 8).
3.4 Comparison to Other Works
Table 4 shows the classification results for our method in comparison to recent works evaluated on the scene15 dataset. Note that for a fair comparison we considered only signatures computed on the full image. An SPM scheme, dividing the image according to a spatial grid at several scales, usually improves the performances [10,21,1]. We used the same filters as previously, with K = 16 knots to build the signature at each dimension. For [10] the results reported here are those obtained with the "strong features" (SIFT); when no spatial pyramid scheme is used, this is equivalent to a BOV approach. With 256 filters and without a spatial pyramid scheme, our method achieves an accuracy of 78.5%, which is a gain of between 2% and 6% over the best accuracies reported in the literature. Let us also notice that with 64 filters our method achieves 75.1%, which is comparable to the state of the art.
Method Accuracy % Lazebnik et al. [10] 74.8 Liu et al. [13] 75.16 Rasiwasia et al. [19] 72.2 Rasiwasia et al. [18] 72.5 Masnadi-Shirazi et al. [15] 76.74 16 U64 (ICA) 75.10 16 U128 (ICA) 75.88 16 U256 (ICA) 78.49
32
H. Le Borgne and P. Mu˜noz Fuentes
Fig. 2. Average EER over the 17 categories of VCDT 2008: ranking of the proposed approach (white) with respect to the other 53 runs (blue)
The result we obtained on vcdt08 (table 1) are competitive as well. On this benchmark, [12] reported an EER of 0.24 while we achieve 0.254 with 64 filters and 32 knots. In figure 2 we compare the average EER of our approach with the one of the runs submitted to the campaign [3]. Although the best run reported performs better, our method is ranked among the best.
4 Conclusion We proposed a non parametric estimation of the Fisher Kernel framework to derive an image signature. This high-dimensional and dense signature can reasonably feed a linear SVM. We showed the effectiveness of the approach in the context of image categorization. Experiments on challenging benchmark showed our approach outperforms recent techniques in scene classification [10,13,15,19,18], even when a spatial pyramid scheme is used [10]. Hence we report the higher accuracy on this state-of-the-art dataset in scene classification. We also reported competitive results on the vcdt08 benchmark of the ImageCLEF campaign [3]. In future work we would be interested in applying the non-parametric Fisher vector estimation presented in this paper on local features that only partially verify the property of statistical independence (see [11]) or not at all [14]. Another direction of research will concern the signature compression in order to apply it to image retrieval. Acknowledgment. This work has been partially funded by I2S in the context of the project Polinum. We acknowledge support from the ANR (project Yoji) and the DGCIS for funding us through the regional business cluster Cap Digital (project Rom´eo).
References 1. Boureau, Y., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: CVPR, San Francisco, USA (2010) 2. Comon, P.: Independent component analysis, a new concept? Signal Processing 36(3), 287– 314 (1994) 3. Deselaers, T., Deserno, T.: The visual concept detection task in imageclef 2008. In: ImageCLEF Workshop (2008)
Nonparametric Estimation of Fisher Vectors to Aggregate Image Descriptors
33
4. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: a library for large linear classification. Journal of Machine Learning Research 9, 1871–1874 (2008) 5. van Gemert, J.C., Veenman, C.J., Smeulders, A.W.M., Geusebroek, J.M.: Visual word ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(7), 1271–1283 (2010) 6. Hyv¨arinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley-Interscience, Hoboken (2001) 7. Jaakola, T., Haussler, D.: Exploiting generative models in discriminative classifiers. In: NIPS, pp. 1–8 (1999) 8. J´egou, H., Douze, M., Schmid, C., P´erez, P.: Aggregating local descriptors into a compact image representation. In: CVPR, San Francisco, USA (June 2010) 9. Kooperberg, C., Stone, C.J.: Logspline density estimation for censored data. Journal of Computational and Graphical Statistics 1, 301–328 (1997) 10. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR, Washington, DC, USA, pp. 2169–2178 (2006) 11. Le Borgne, H., Gu´erin Dugu´e, A., Antoniadis, A.: Representation of images for classification with independent features. Pattern Recognition Letters 25(2), 141–154 (2004) 12. Le Borgne, H., Honnorat, N.: Fast shared boosting for large-scale concept detection. Multimedia Tools and Applications (2010) 13. Liu, J., Shah, M.: Scene modeling using co-clustering. In: ICCV (2007) 14. Lowe, D.G.: Object recognition from local scale-invariant features. In: CVPR 1999, Los Alamitos, CA, USA, vol. 2, pp. 1150–1157 (August 1999) 15. Masnadi-Shirazi, H., Mahadevan, V., Vasconcelos, N.: On the design of robust classifiers for computer vision. In: CVPR, San Francisco, USA, pp. 779–786 (June 2010) 16. Perronnin, F., Dance, C.R.: Fisher kernels on visual vocabularies for image categorization. In: CVPR (2007) 17. Perronnin, F., Dance, C.R.: Large-scale image retrieval with compressed fisher kernels. In: CVPR, San Francisco, USA, pp. 3384–3391 (2010) 18. Rasiwasia, N., Vasconcelos, N.: Holistic context modeling using semantic co-occurrences. In: CVPR, Los Alamitos, CA, USA, pp. 1889–1895 (2009) 19. Rasiwasia, N., Vasconcelos, N.: Scene classification with low-dimensional semantic spaces and weak supervision. In: CVPR, pp. 1–6 (2008) 20. Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: ICCV, vol. 2, pp. 1470–1477 (2003) 21. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: CVPR (2009)
Knowledge-Driven Saliency: Attention to the Unseen M. Zaheer Aziz, Michael Knopf, and B¨arbel Mertsching GET LAB, Universit¨ at Paderborn, 33098 Paderborn, Germany {aziz,mertsching}@upb.de,
[email protected] http://getwww.uni-paderborn.de Abstract. This paper deals with attention in 3D environments based upon knowledge-driven cues. Using learned 3D scenes as top-down influence, the proposed system is able to mark high saliency to locations occupied by objects that are new, changed, or even missing from their location as compared to the already learned situation. The proposal addresses a system level solution covering learning of 3D objects and scenes using visual, range and odometry sensors, storage of spatial knowledge using multiple-view theory from psychology, and validation of scenes using recognized objects as anchors. The proposed system is designed to handle the complex scenarios of recognition with partially visible objects during revisit to the scene from an arbitrary direction. Simulation results have shown success of the designed methodology under ideal sensor readings from range and odometry sensors.
1
Introduction
Living cognitive systems perform various types of activities to perceive and react to their three dimensional environment. Attention towards selected signals coming into the senses help them in quickly responding to the most relevant and important items. Sharp contrast or some element of surprise in the available data lead to bottom-up attention [1][2]. On the hand, a vast amount of literature on psychophysics advocates dominance of top-down in this process of autonomous selection [3][4]. Computational modeling of the attention phenomenon has become a topic of research since more than a decade. Majority of efforts on attention modeling take the bottom-up saliency channels into account whereas higher level processes of top-down attention have been addressed only recently, in which knowledge-driven saliency is a rarely modeled aspect. This paper presents a system able to compute saliency using previously learned knowledge about 3D scenes. It makes theoretical advancements in terms of three contexts. Firstly, it introduces an attention mechanism able to perform attention in 3D space. Secondly, it includes knowledge-driven influence to compute top-down saliency. Thirdly, a multi-sensor mechanism has been devised to learn 3D objects by collecting their several 2D views through a mobile robotic vision system. Hence a complete system has been discussed that is able to learn 3D objects and scenes and, at a later visit to the learned scenes, detect salient locations based upon the changes occurred after the previous visit. J. Blanc-Talon et al. (Eds.): ACIVS 2011, LNCS 6915, pp. 34–45, 2011. c Springer-Verlag Berlin Heidelberg 2011
Knowledge-Driven Saliency: Attention to the Unseen
2
35
Related Literature
The system under discussion has roots in various disciplines. In order to be concise, selected literature is reviewed here that has conceptually close relation to the major contributions of the presented work. The fields worth mentioning here include role of recognition in human attention, efforts on recognition based artificial visual attention, psychological findings on recognition of 3D objects in brain, and work on 3D object analysis in machine vision. Role of cognition or recognition in selection of focus of attention in realworld has been established by research in psychology, e.g., [5] and [6]. The psychophysical experiments reported in [7] have shown memory-driven attentional capture on the basis of content-specific representations. In the context of learning and recognizing objects in 3D environments different theories exist about the brain functions. The theories suggesting incorporation of multiple 2D views into 3D models, e.g. [8], are simpler for implementation in vision systems of mobile robots. The psychological model in [5] suggests learning of canonical views for object representations. The work presented in [9] shows capability of learning three dimensional structure not only by looking at the target object or environment from different directions but by imagining to view it from these orientations as well. The ability of the brain to visualize a scene from an orientation in space that was not actually experienced by it is shown in [10]. A relation between visual attention and spatial memory is established in [11] with a conclusion that a detailed representation of the spatial structure of the environment is typically retained across fixations and it is also used to guide eye movements while looking at already learned objects. Formation of object representations by human vision through snapshots taken from different view angles is discussed in [12] and it is suggested that such procedure is followed only for the objects under visual attention while the unattended scene may be processed as a 2-D representation bound to the background scene as a texture. In machine vision, an analysis on importance of cognitive functions prior to focusing visual attention can be seen in [13]. Some efforts on establishing a relation between visual attention and object recognition already exist, for example [14] localizes and identifies objects using an adaptive color-based visual attention control algorithm and hierarchical neural networks for object recognition. Training of neural network for learning influences of task and context for constructing top-down saliency map was employed by [15] as well. In [16] prior knowledge of the target and distractors is used to compute the top-down gains. The bottom-up saliency maps are multiplicatively weighted by these top-down gains to yield the saliency map. A model incorporating familiarity based saliency can be seen in [17] in which top-down familiarity is computed using SIFT based object recognition. Handling representation and recognition of 3D objects has been a topic of research for some time. One of the recent proposals includes [18] with a learning approach that finds salient points on a 3D object and represents these points in a 2D spatial map based on a longitude-latitude transformation. Results of classification have been demonstrated on isolated objects. Use of canonical views has
36
M. Zaheer Aziz, M. Knopf, and B. Mertsching
been experimented in [19] where isolated objects are converted into representations having predefined size. Recognition of isolated 3D objects using partial views by artificial neural networks and semantic networks was presented in [20]. Merging 3D point cloud views from different observation directions have been used for unsupervised learning of objects in [21]. A similar work on recognition using partial views of 3D point clouds can be seen in [22]. The major aspect lacking in the existing 3D object analysis systems is their requirement of isolated 3D objects as input. The solution proposed makes a step ahead towards simultaneous object segregation and learning. Attention models applying top-down influences have shown success only in 2D scenes. The proposed system extends the scope of top-down attention towards three dimensional space and demonstrates some practical applications for mobile vision systems.
3 Proposed System
Both in learning and in learned mode, the first major module of the proposed system processes input data from vision, range, and localization sensors to obtain color- and depth-wise segmented regions. This data is then arranged into view descriptors in the sensory memory to form the basis for 3D object representations (3DOR). In learning mode, the 3DORs go to the short-term memory (STM) through a data filtering and organization process in the sensory memory. The spatial relations between the 3DORs of recognized objects are used to define the semantics of a 3D scene. The learned knowledge in the STM is stored in the long-term memory for permanent storage. When the system runs in learned mode, it performs object recognition on the 3DORs fed into the sensory memory and tries to validate the learned scenes. This capability is utilized to detect the saliency of spatial locations of new, altered, or missing objects. The following subsections provide details of these processes. It may be noted that there can be a variety of options for each of the said processes. Since the main focus of this paper is a system-level solution, in which each component is itself a research area, we have proposed simple solutions for each component to achieve the overall objectives of the whole system to a level sufficient for a proof of concept.
3.1 Multisensor Input and Preprocessing
The system requires visual input from a camera and depth input from a reliable mechanism such as a laser range sensor or a time-of-flight camera. In the current status of the solution we have utilized a laser scanner calibrated with a camera by establishing correspondence between points of the laser data and image pixels. Depth values are applied to homogeneously colored segments in the camera image to separate depth-wise different regions. Neighboring regions with a smooth variation of depth are grouped together. Such groups in the camera image represent 2D views of 3D objects from a particular observing position. A mobile robot can collect many such views by moving around a given object.
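As a rough illustration of the depth-based grouping, the following Python sketch merges neighbouring colour segments whose mean depths stay within a tolerance; the segment label map, the registered depth image and the tolerance value are illustrative assumptions, and the colour segmentation and laser–camera calibration themselves are outside the sketch.

```python
import numpy as np

def group_segments_by_depth(labels, depth, tol=0.15):
    """Merge neighbouring colour segments whose mean depths differ by less
    than `tol` (hypothetical unit: metres).  `labels` is an integer segment
    map (0 = background) and `depth` a per-pixel depth image registered to
    the camera frame.  Returns a label map where grouped segments share an id."""
    ids = [i for i in np.unique(labels) if i != 0]
    mean_depth = {i: float(depth[labels == i].mean()) for i in ids}

    parent = {i: i for i in ids}          # union-find over segment ids
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    h, w = labels.shape
    for y in range(h - 1):                # compare each pixel with its right
        for x in range(w - 1):            # and lower neighbour (4-connectivity)
            a = labels[y, x]
            for b in (labels[y, x + 1], labels[y + 1, x]):
                if a != b and a != 0 and b != 0:
                    if abs(mean_depth[a] - mean_depth[b]) < tol:
                        parent[find(a)] = find(b)

    grouped = labels.copy()
    for i in ids:
        grouped[labels == i] = find(i)
    return grouped
```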
3.2 View Descriptors
Views represent the basic information unit in the proposed 3D object recognition process. A view includes data about a 3D object's appearance from a specific angle through the camera's field of view. It is possible to have more than one view in the sensor input at a time when multiple objects are visible. Scale and, to some extent, rotation invariance are among the desired attributes of view descriptors, in order to tolerate varying distances of the camera from the objects and input taken when the robot stands on uneven ground. The view descriptors used in the presented status of the system consist of the weighted average and standard deviation of the hue (h), saturation (s), intensity (i), eccentricity (e), and orientation (o) of the regions forming the view. Physical width, height, and area (the actual number of pixels covered by the regions in the view) are also included as additional features. The physical width and height are computed using the measurements from the camera image and the distance obtained from the range sensor. Hence the view descriptor has the following format:

v = {μ_h, μ_s, μ_i, μ_e, μ_o, σ_h, σ_s, σ_i, σ_e, σ_o}

For a view consisting of the set of regions R, the attribute μ_h is computed using the hue values of the involved regions as

\mu_h = \frac{\sum_{i=1}^{|R|} A_i h_i}{\sum_{i=1}^{|R|} A_i}

where A_i is the area and h_i is the hue component of the color of the considered region. Hence the larger regions in the view contribute more to the final attribute value. The standard deviation is also weighted by the region area, hence

\sigma_h = \sqrt{\frac{\sum_{i=1}^{|R|} A_i (h_i - \mu_h)^2}{\sum_{i=1}^{|R|} A_i}}

The magnitudes of the other components of the view descriptor are computed in a similar fashion.
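The descriptor computation can be sketched in Python as below; the per-region attribute values are assumed to be precomputed by the segmentation, and the weighted standard deviation is used for the σ components.

```python
import numpy as np

def view_descriptor(regions):
    """Area-weighted view descriptor of Sect. 3.2.  `regions` is a list of
    dicts holding, for each region, the attributes h, s, i, e, o and the
    pixel area A; physical width, height and total area would be appended
    by the caller as additional features."""
    A = np.array([r['A'] for r in regions], dtype=float)
    means, stds = [], []
    for key in ('h', 's', 'i', 'e', 'o'):
        x = np.array([r[key] for r in regions], dtype=float)
        mu = np.sum(A * x) / A.sum()                          # weighted mean
        sigma = np.sqrt(np.sum(A * (x - mu) ** 2) / A.sum())  # weighted std. dev.
        means.append(mu)
        stds.append(sigma)
    return np.array(means + stds)   # v = (mu_h,...,mu_o, sigma_h,...,sigma_o)
```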
3.3 3D Object Construction Using Views
A robot can obtain many view descriptors of an observed 3D object by moving around it. The proposed view-based object representation consists of a set of pairs, each of which consists of an orientation vector pointing to the viewed object from the robot's location coupled with the view descriptor, i.e., O = {(φ_1, v_1), ..., (φ_n, v_n)}. The first view of an object is taken as reference for that object and considered as the origin for the orientations of the other views. The odometry sensor is used to know
Fig. 1. (a) Four samples of the orientation vectors from which the example object was viewed to construct its view-based 3D representation. (b) Three vectors starting from another first view taken as the origin orientation. (c) Aligning and merging of the two object representations.
the robot transformations and to localize it with respect to the origin orientation vector. Figure 1(a) shows four of the orientation vectors from which the system viewed the cubic object, v_i being the one taken as reference. Each movement of the robot around the observed object can cause a change in the acquired view descriptors. Storing all of these descriptors is not desirable. Using fixed intervals of orientation can lead to redundant storage of similar descriptors on one hand while missing some important ones on the other. The proposed solution is to continue the robot's movement around the object until the current view descriptor v_i differs from the previously stored one v_{i−1} by a certain amount, i.e., until the distance Δ(v_i, v_{i−1}) exceeds a threshold τ_d.
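A minimal sketch of this acquisition loop is shown below; sense_view and sense_orientation stand in for the robot's sensing interface, which is not specified here, and the loop bound is illustrative.

```python
import numpy as np

def collect_views(sense_view, sense_orientation, tau_d, max_steps=360):
    """Accumulate view descriptors while the robot circles an object.  A new
    view is stored only when it differs from the previously stored one by at
    least tau_d.  `sense_view` and `sense_orientation` are hypothetical
    callables returning the current descriptor and the orientation vector
    relative to the first (reference) view."""
    views = []
    v_prev = None
    for _ in range(max_steps):
        v = np.asarray(sense_view())
        phi = np.asarray(sense_orientation())
        if v_prev is None or np.linalg.norm(v - v_prev) >= tau_d:
            views.append((phi, v))        # keep only sufficiently novel views
            v_prev = v
    return views
```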
3.4 Learning of Objects into Knowledge
We propose to accomplish the learning using three levels of memory. The first level is the sensory memory, which remembers the last descriptor v_{i−1} along with the localization data of v_{i−1} and of the robot itself. The current v_i is compared with v_{i−1} to filter out non-useful descriptors. Useful v_i are stored into the short-term memory (STM). Contents of the STM are stored into the long-term memory (LTM) for use after a restart of the robot. Location information is not stored in the STM and LTM because objects are to be recognized without reference to their locations. It is likely that an object may go out of sensor range and come back into view from another perspective due to wandering of the robot. Hence, a mechanism has to be provided for extending learned object representations by merging new ones into them. Consider the situation shown in figures 1(a) and 1(b), where the robot has collected an object representation with four view descriptors into the STM and another one with three view descriptors is now available. In order to merge these two sets, the first question is how to find the transformation that aligns them. For this purpose we define the concept of dominator view descriptors. Two consecutive view descriptors {v_a, v_b} in an object O_i will be called dominant on v_c ∈ O_j if the distances Δ(v_a, v_c) and Δ(v_b, v_c) both remain below a threshold τ_d. In the example shown in figure 1, vector v_m is dominated by {v_k, v_l}.
As a first step to find the dominating pairs, we pick those elements of the larger object representation that have a small vector distance from some element of the smaller one into a set of matches M(O_i, O_j). Hence, for O_i and O_j such that |O_i| ≤ |O_j|,

M(O_i, O_j) = {v_a | v_a ∈ O_j ∧ Δ(v_a, v_c) < τ_d for some v_c ∈ O_i}

The v_c with minimum distance is picked in case more than one v_c satisfies Δ(v_a, v_c) < τ_d. Then a set D(O_i, O_j) is constructed containing a pair of dominators for each v_c ∈ O_i as

D(O_i, O_j) = {(v_c^x, v_c^y) | v_c^x, v_c^y ∈ M(O_i, O_j) ∧ Θ(φ(v_c^x), φ(v_c^y)) ≤ τ_θ}

where Θ(·, ·) computes the minimal angle between the orientations of the view descriptors delivered by φ(·). The upper bound τ_θ is applied to avoid accepting a pair with a very wide angle as a dominator pair. For example, a pair (v_a^x, v_a^y) of visually similar faces pointing in opposite directions in space cannot be accepted as dominators for a given v_c. The set of dominator pairs is used to find the transformation required by each v_c ∈ O_i to align it with the best matching descriptor from its dominator pair. These transformation candidates are collected in a set T(O_i, O_j) as

T(O_i, O_j) = {(v_c, t) | t = Θ(φ(v_c^z), φ(v_c)) s.t. Δ(v_c^z, v_c) = min(Δ(v_c^x, v_c), Δ(v_c^y, v_c))}
It may be noted that without the above-mentioned step-wise refinements, a v_c ∈ O_i could try to align itself with any visually similar v_a ∈ O_j, leading to a high probability of merging errors. Ideally all members of T(O_i, O_j) should have the same transformation value, but in practice this may not be the case. T(O_i, O_j) is filtered to obtain the candidate values for the transformation between O_i and O_j by picking those members of T(O_i, O_j) that occur more than τ_n times, i.e.,

C_T = {t | (v_c, t) ∈ T(O_i, O_j) ∧ N(t, T(O_i, O_j)) > τ_n}

where N(·, ·) counts the elements in the given set having values similar to the first parameter. The mean of the elements collected in C_T is taken as the transformation magnitude to align O_i to O_j. The transformed version O_i^T = μ(C_T)·O_i is merged into O_j by inserting the new elements of O_i^T into O_j while maintaining an order of orientations for the view descriptors.
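A strongly simplified sketch of this merging step is given below; it treats orientations as scalar angles, collapses the dominator-pair construction into a nearest-match vote, and uses an illustrative binning width, so it only conveys the flavour of the procedure rather than its full definition.

```python
import numpy as np

def merge_objects(O_small, O_large, tau_d, tau_theta, tau_n, angle_bin=5.0):
    """Each object is a list of (phi, v) pairs with phi an orientation angle
    in degrees and v a view descriptor (numpy array).  The best matching view
    of O_large is found for every view of O_small, the angle offsets of the
    accepted matches are collected, and the most frequent offset (if it has
    more than tau_n supporters) is used to align and insert the views."""
    candidates = []
    for phi_c, v_c in O_small:
        dists = [np.linalg.norm(v_c - v_a) for _, v_a in O_large]
        k = int(np.argmin(dists))
        if dists[k] < tau_d:
            t = O_large[k][0] - phi_c          # candidate transformation
            if abs(t) <= tau_theta:
                candidates.append(t)
    if not candidates:
        return list(O_large)

    candidates = np.array(candidates)
    support = [(int(np.sum(np.abs(candidates - t) < angle_bin)), t) for t in candidates]
    votes, t_best = max(support)
    if votes <= tau_n:
        return list(O_large)                   # no consistent transformation found

    C_T = candidates[np.abs(candidates - t_best) < angle_bin]
    t_star = float(C_T.mean())                 # mean transformation magnitude

    merged = list(O_large)
    for phi_c, v_c in O_small:
        merged.append((phi_c + t_star, v_c))   # aligned views of the smaller object
    merged.sort(key=lambda pv: pv[0])          # keep views ordered by orientation
    return merged
```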
3.5 Object Recognition in 3D
For a mobile vision system, most parts of the viewed objects are hidden when looking from a particular observation direction, due to occlusion by other objects or by their own back faces. It is not desirable that all sides should be examined before recognizing a 3D object. The object representation mechanism proposed in the previous section retains sufficient information to recognize a learned object when it is sensed from an arbitrary observation direction or under minor occlusions. The simplest recognition criterion would be to search for a single observed view in the learned knowledge using the merging process described above. A more confident recognition is available when the system can collect two or more views that fulfill the merging criteria.
3.6 Learning of 3D Scenes
A scene comprises a specific configuration of objects. A structure to learn scenes requires references to the learned object representations and the spatial relations between these objects. A graph-based approach is adopted for extracting and storing the spatial relations. A scene representation graph S_i = {O, E, β} consists of nodes O referring to the involved objects, edges E ⊆ O × O between the objects, and 3D displacement vectors β : E → R³ assigned to each edge. The object recognition mechanism provides the input for the nodes O. The displacement vectors β depend upon the accuracy of the odometry and range sensor data that lead to absolute locations in world coordinates. The sensory memory stores this information and establishes an edge E between the involved objects as soon as more than one entry is available to it. Since a mobile robot faces dynamic scenarios in which objects may appear and get occluded with the passage of time, the established graph may change rapidly, especially during the scene learning phase. At the time of merging two object representations, their respective nodes are merged, the edge between them is deleted, and the edges from other objects to these nodes are updated to point to the merged one. It is notable that scenes can be learned even with partially learned objects.
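A minimal data structure for such a scene graph might look as follows; absolute locations are kept only to derive the displacement vectors, mirroring the role of the sensory memory, and the object-id and location types are assumptions for illustration.

```python
import numpy as np

class SceneGraph:
    """Minimal scene representation S = {O, E, beta}: nodes refer to learned
    objects, edges carry 3-D displacement vectors between object locations."""

    def __init__(self):
        self.nodes = set()   # object ids (O)
        self.loc = {}        # absolute locations, kept only to derive beta
        self.beta = {}       # (i, j) -> displacement vector, i.e. E and beta

    def add_object(self, obj_id, location):
        """Register a recognised object; edges to all previously seen objects
        are established as soon as more than one entry is available."""
        location = np.asarray(location, dtype=float)
        for other, other_loc in self.loc.items():
            self.beta[(other, obj_id)] = location - other_loc
        self.nodes.add(obj_id)
        self.loc[obj_id] = location

    def merge_nodes(self, keep_id, drop_id):
        """Collapse two nodes after their object representations are merged;
        edges from other objects are redirected to the kept node."""
        self.nodes.discard(drop_id)
        self.loc.pop(drop_id, None)
        for (i, j) in list(self.beta):
            vec = self.beta.pop((i, j))
            a = keep_id if i == drop_id else i
            b = keep_id if j == drop_id else j
            if a != b:
                self.beta[(a, b)] = vec
```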
3.7 Knowledge-Driven Saliency
The proposed mechanisms for object and scene representation provide an adequate base for the computation of familiarity-based saliency in 3D environments. Top-down saliency can be determined using the learned knowledge of objects and scenes in two scenarios. Firstly, it may be desirable to let known objects pop out in an unlearned environment, e.g., known landmarks should attract attention among surrounding distractors. On the other hand, in some applications unknown or unrecognizable objects should be treated as more salient. This type of scenario includes attention to objects in a known scene that have undergone alteration, have been replaced by another object, or are missing from their place. In the current status of the system we compute the saliency keeping these two scenarios separate from each other. For O^S being the object representations available in the sensory memory and O^M the set of learned objects in the STM, the familiarity-based saliency of an object O_i^S is

ψ^f = 1 − Δ′(O_i^S, O_k^M), where Δ′(O_i^S, O_k^M) = min{Δ′(O_i^S, O_j^M) : O_j^M ∈ O^M}

Here Δ′(·, ·) computes the normalized vector difference between the given objects (0 ≤ Δ′(·, ·) ≤ 1). The saliency based upon change detection is computed using the learned scenes. For this purpose a scene configuration is selected from the STM using the recognized objects in the current view, which serve as anchors to activate the relevant scene configuration S^a(O, E, β). The saliency ψ^l of a location
pointed to by the displacement vectors β_ik(S^a) associated with the edges E_ik(S^a) of a recognized anchor object O_k(S^a) is computed as

ψ^l = Δ′(O^l, O_ik(S^a))

where O^l is the object visible at the location pointed to by β_ik(S^a) and O_ik(S^a) is the object expected at location l according to the stored knowledge. It is obvious that a high saliency will be signaled when the location O^l is empty. The saliency of the objects O_i^S available in the sensory memory with respect to the active scene configuration S^a is computed as

ψ_i^s = 1 − Δ′(O_i^S, L(O_j(S^a)))

where L(·) picks the object at the same location as O_i^S from the active scene configuration. Combining the saliencies ψ^l and ψ_i^s based upon the objects actually sensed in the current observation provides a saliency map that indicates the locations where some object has been added, changed, or is missing compared to the known configuration around the anchor object(s).
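The two saliency terms can be illustrated with the following sketch, in which objects are represented by feature vectors and the normalisation of Δ′ to [0, 1] is an assumption made for illustration.

```python
import numpy as np

def normalized_delta(a, b):
    """Normalised vector difference Delta'(., .) in [0, 1]; assumes the
    object feature vectors are themselves scaled to [0, 1]."""
    return min(float(np.linalg.norm(a - b)) / np.sqrt(len(a)), 1.0)

def familiarity_saliency(sensed_objects, learned_objects):
    """psi^f for every sensed object: one minus the distance to its closest
    learned object, so unknown objects receive high saliency."""
    return [1.0 - min(normalized_delta(o_s, o_m) for o_m in learned_objects)
            for o_s in sensed_objects]

def change_saliency(expected_object, observed_object):
    """psi^l for one location anchored in the active scene configuration:
    distance between the expected and the actually visible object; an empty
    location (observed_object is None) yields maximal saliency."""
    if observed_object is None:
        return 1.0
    return normalized_delta(expected_object, observed_object)
```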
4 Results
It is apparent that the system discussed in this paper involves the integration of many modules that are currently under research by the robotics community. Sensor noise and the accumulation of related errors are among the serious problems to be tackled. Since the proposed system relies upon ideal solutions for robot localization and pose estimation, we report here experiments done in a robot simulation framework in order to concentrate on conceptual testing without being distracted by side issues. The robot simulation framework SIMORE [23][24] is able to simulate robots with fully functional virtual cameras, laser range sensors, and localization feedback. The robots can interact with their virtual environment under simulated physical forces. Hence experiments performed in this framework are equivalent to doing them in the real world, with the advantage that the sensor data is noise free. Hence this experimentation platform provides sufficient feedback for a proof of concept of the current state of the proposed system. Secondly, this framework makes it possible to show results in scenes with controlled complexity in order to give a clear demonstration of the contributions of this work. Handling issues related to sensor noise and localization inaccuracies is planned as part of future work. Figure 2(a) shows a snapshot of the experiment in which the virtual robot drives around in a scene with two objects. The beams of the laser scanner intersecting the objects are also visualized by the simulation framework. The view seen by the robot through its camera is shown in figure 2(b). Figure 3(a) shows the system maneuvering in learning mode in a scene with three objects. In figure 3(b) it can be seen that the system has marked two already learned objects as known, represented by green rectangles and crosses on the regions involved in their view descriptors.
Fig. 2. Experimentation platform used to obtain results of the current state of the proposed system. (a) A simulated robot roaming in a virtual reality scene with visualization of its laser range scanner beams. (b) Camera input as seen by the robot.
Fig. 3. Familiarity-based saliency detected by the proposed system. (a) The system observes a scene with two known objects and one unknown object. (b) Regions involved in the view descriptors of the known objects, marked by the system using green rectangles and crosses. (c) Saliency map showing the high conspicuity of the unknown object among the known ones.
The third object is as yet unknown, hence its familiarity-based saliency (ψ^f) is high, as seen in figure 3(c). Figures 4, 5, and 6 show samples from experiments that demonstrate the ability of the system to detect changes in a learned scene (the scene shown in figure 3(a)). It can be observed in figure 4 that the leftmost object (the cuboid with blue stripes) has been replaced by a yellow object. The system approaches the scene from almost the opposite direction to that of figure 3(a) and is able to recognize the central object (having red and yellow stripes) as anchor. Using this anchor it can detect the changed object and mark it salient for attention. In figure 5, an additional object has been placed in the scene. The system approaches the scene from an arbitrary direction, recognizes two known objects, and marks them as anchors. The newly added object is detected on scene validation and is marked as salient by the system. Figure 6, on the other hand, demonstrates the case where one of the objects (the middle one with red and yellow stripes) is missing from the scene known to the system (figure 3(a)). The system approaches the scene from yet another direction. After recognizing
Fig. 4. (a) The system observes an already known scene (the one shown in figure 3(a)). (b) A known object recognized (striped one at right) using view descriptors of current observation and marked as anchor. (c) Already learned scene recognized using anchor object and the replaced object detected as salient.
Fig. 5. (a) Observation of an already known scene (the one shown in figure 3(a)) from a different view direction. (b) Known objects recognized (left and right ones) and marked as anchors. (c) Already learned scene recognized using anchor object and an added object detected as salient.
Fig. 6. (a) Observation of the scene shown in figure 3(a) from a different view direction. (b) A known object recognized and marked as anchor. (c) Already learned scene recognized using the anchor object and a missing object detected as salient.
the visible object as anchor, the system expects the missing object to be visible in its sensor range. Failing to find it, it detects the missing object and marks its location as salient in the saliency map.
5 Conclusion
The topic of knowledge-driven saliency has been addressed in this work from a system-level perspective. A methodology is developed to learn 3D objects and then the scenes formed by these objects. The knowledge is managed by sensory, short-term, and long-term memories. The designed structure allows recognition at a later visit from an arbitrary direction and detects saliency with respect to the changes that have occurred. Results under ideal sensor conditions are very promising and show the success of the developed solution. The proposed approach is novel in three contexts, namely, visual attention in 3D environments by mobile vision systems, the application of knowledge for finding top-down saliencies, and a learning mechanism for 3D objects allowing recognition from an arbitrary viewing direction at revisit. Future work in this direction includes dealing with the complexity of real-life scenarios and noisy sensor data. Acknowledgment. We gratefully acknowledge the funding of this work by the German Research Foundation (DFG) under grant Me 1289/12-1 (AVRAM).
References
1. Treisman, A.M., Gelade, G.: A feature-integration theory of attention. Cognitive Psychology 12, 97–136 (1980)
2. Posner, M., Snyder, C., Davidson, B.: Attention and the detection of signals. Journal of Experimental Psychology: General 109(2), 160–174 (1980)
3. Chen, X., Zelinsky, G.: Real-world visual search is dominated by top-down guidance. Vision Research 46(24), 4118–4133 (2006)
4. Chen, X., Zelinsky, G.J.: Is visual search a top-down or bottom-up process? Journal of Vision 6, 447 (2006)
5. Blanz, V., Tarr, M., Bülthoff, H., Vetter, T.: What object attributes determine canonical views? Perception 28, 575–600 (1999)
6. Endres, D., Vintila, F., Bruce, N., Bouecke, J., Kornprobst, P., Neumann, H., Giese, M.: Hooligan detection: the effects of saliency and expert knowledge. Perception 39, ECVP Abstract Supplement, 193 (2010)
7. Olivers, C., Meijer, F., Theeuwes, J.: Feature-based memory-driven attentional capture: Visual working memory content affects visual attention. Journal of Experimental Psychology 32(5), 1243–1265 (2006)
8. Palmer, S.: Vision Science: Photons to Phenomenology, vol. 2. MIT Press, Cambridge (1999)
9. Oman, C.M., Shebilske, W.L., Richards, J.T., Tubré, T.C., Beall, A.C., Natapoff, A.: Three dimensional spatial memory and learning in real and virtual environments. Spatial Cognition and Computation 2, 355–372 (2000)
10. Shelton, A.L., McNamara, T.P.: Spatial memory and perspective taking. Memory & Cognition 32, 416–426 (2004)
11. Aivar, M.P., Hayhoe, M.M., Chizk, C.L., Mruczek, R.E.B.: Spatial memory and saccadic targeting in a natural task. Journal of Vision 5, 177–193 (2005)
12. Hoshino, E., Taya, F., Mogi, K.: Memory formation of object representation: Natural scenes. In: Wang, R., et al. (eds.) Advances in Cognitive Neurodynamics, pp. 457–462 (2008)
13. Follet, B., Le Meur, O., Baccino, T.: Modeling visual attention on scenes. Studia Informatica Universalis 8, 150–167 (2010)
14. Fay, R., Kaufmann, U., Markert, H., Palm, G.: Adaptive visual attention based object recognition. In: Proceedings of the IEEE SMC UK-RI Chapter Conference (2005)
15. Rasolzadeh, B., Tavakoli, A.T., Eklundh, J.O.: Attention in cognitive systems. Theories and systems from an interdisciplinary viewpoint, pp. 123–140. Springer, Heidelberg (2008)
16. Navalpakkam, V., Itti, L.: An integrated model of top-down and bottom-up attention for optimizing detection speed. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2049–2056. IEEE, Los Alamitos (2006)
17. Lee, S., Kim, K., Kim, J., Kim, M., Yoo, H.: Familiarity based unified visual attention model for fast and robust object recognition. Pattern Recognition 43, 1116–1128 (2010)
18. Atmosukarto, I., Shapiro, L.: A learning approach to 3D object representation for classification. Structural, Syntactic, and Statistical Pattern Recognition, pp. 267–276 (2010)
19. Denton, T., Demirci, M., Abrahamson, J., Shokoufandeh, A., Dickinson, S.: Selecting canonical views for view-based 3-D object recognition. In: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 2, pp. 273–276. IEEE, Los Alamitos (2004)
20. Büker, U., Hartmann, G.: Knowledge-based view control of a neural 3-D object recognition system. In: ICPR, vol. 4, pp. 24–29. IEEE, Los Alamitos (2002)
21. Ruhnke, M., Steder, B., Grisetti, G., Burgard, W.: Unsupervised learning of 3D object models from partial views. In: ICRA 2009, pp. 801–806. IEEE, Los Alamitos (2009)
22. Salamanca, S., Cerrada, C., Adán, A., Cerrada, J., Adán, M.: Free-Shaped Object Recognition Method from Partial Views Using Weighted Cone Curvatures. Progress in Pattern Recognition, Image Analysis and Applications, pp. 222–232 (2005)
23. Kutter, O., Hilker, C., Simon, A., Mertsching, B.: Modeling and simulating mobile robots environments. In: 3rd International Conference on Computer Graphics Theory and Applications (GRAPP 2008), Funchal, Portugal (2008)
24. Kotthäuser, T., Mertsching, B.: Validating Vision and Robotic Algorithms for Dynamic Real World Environments. In: Second International Conference on Simulation, Modeling and Programming for Autonomous Robots, SIMPAR (2010)
A Comparative Study of Vision-Based Lane Detection Methods

Nadra Ben Romdhane¹, Mohamed Hammami², and Hanene Ben-Abdallah¹

¹ MIRACL-FSEG, Sfax University, Rte Aeroport Km 4, 3018 Sfax, Tunisia
[email protected], [email protected]
² MIRACL-FS, Sfax University, Rte Sokra Km 3 BP 802, 3018 Sfax, Tunisia
[email protected]
Abstract. Lane detection consists of detecting the limits of the lane in which the vehicle carrying the camera is moving. The aim of this study is to propose a lane detection method based on digital image processing. Morphological filtering, the Hough transform and linear-parabolic fitting are applied to realize this task. The results of our proposed method are compared with three previously proposed methods. The method presented here was tested on video sequences filmed by the authors on Tunisian roads, on a video sequence provided by Daimler AG, as well as on the PETS2001 dataset provided by Essex University. Keywords: Driver assistance system, Lane detection, Hough transform, Morphological filtering.
1 Introduction

In the majority of cases, lane markings exist on both sides of the road; in other cases, the road is not marked. Lane detection consists of detecting the lane marking limits. This task is relatively easy when the texture of the road is uniform and the lane presents very clear markings. However, it becomes non-trivial particularly in the presence of various markings (cf. Figure 1.a), the presence of obstacles that cover lane markings (cf. Figure 1.b), the effects of weather conditions (cf. Figure 1.c) and the time of acquisition (cf. Figure 1.d). The difficulty is further accentuated by camera movements, blur effects, and abrupt or unexpected actions of drivers, which are likely to distort the result of detection. Hence, all these circumstances must be considered to ensure the effectiveness of lane detection; this is the objective of our presented approach.
Fig. 1. Different environment conditions
In this paper, we are interested in marked roads where the markings take the form of two parallel linear or curved ribbons, with continuous or dashed lines. The remainder of this paper is organized as follows: Section 2 reviews the state of the art in lane detection. Section 3 details our proposed method. Section 4 presents a set of experimental results and performance evaluations. Section 5 concludes by summarizing the major contributions of the present work.
2 State of the Art

The aim of lane detection and tracking based on image processing is to locate, in the first frame of a sequence, the Lane Limits (LL) of the road in which the vehicle is engaged and then to track these limits in the remaining frames. Various lane detection methods have been proposed in the literature. These methods can be broadly grouped into two approaches, namely the model-based approach [5-10] and the feature-based approach [11-15].

2.1 Model-Based Approach

The methods in this category detect lane limits based on the assumption that the form of a given lane can be modeled by a curve that can be linear, parabolic, etc. These methods follow a top-down strategy to detect the lane limits. Thus, the detection consists in searching for the adequate parameters of the curve to be modeled, using parametric or explicit models.

Parametric models. The most widely used algorithm in lane detection applies the Hough transform to approximate pixels extracted by a gradient operator. Yeh [6] applies the Hough transform to oblique image edges. Collado [12], on the other hand, applies the Hough transform and fixes the orientation ranges to limit the number of lines to be detected; the two required lines, referring respectively to the right and left sides, are those having the smallest and the greatest orientation values. With the Hough transform model, the LL are detected very quickly. Nevertheless, the model is characterized by a low detection precision since it only represents the linear portions of the limits. To overcome this problem, various algorithms were proposed to detect slightly curved lanes based on a local detection of the LL instead of a global one. In particular, Tian [10] divides the lower half of the image contour into five sections of decreasing width from the bottom to the top, and then applies the Hough transform in each section. As for Jung [9], he divides the image horizontally into two regions corresponding to the near field and the far field. Based on this near and far region delimitation, we presented in our previous work [21] a lane detection algorithm based on the Hough transform and linear-parabolic fitting.

Explicit models. In contrast to parametric models, explicit models make it possible to detect arbitrary road shapes. The majority of these models, such as deformable and recursion-based models, are initialized with points positioned either on road boundaries or on lane markings. Deformable models are complex models, but they have a certain degree of freedom in their evolution compared to parametric models. They require an approximate initialization of the model on the segmented image. Then, the model evolves iteratively in order to minimize an energy function. The
model stops when a specified criterion is reached. The works in [10, 11] rely on B-snake interpolation to model linear and curved portions of the road starting from the image edges. They simplify the problem of detecting the two lane boundaries to the detection of the middle of the lane. After the approximate pixel extraction step, they apply the Hough transform to different sections of the image to take curved roads into account. They use the position of the vanishing point (the intersection of the right and left limits) detected in each section to determine the middle pixels of the lane. These pixels are then adjusted by applying the deformable model. A recursive method for detecting lane markings is proposed by Aufrère [28]. He describes the LL using the positions of the left and right boundaries of the road in several horizontal lines of the image. He first delimits the region of interest, which has a trapezoidal form, and then detects the lane marking segments by searching, for each line, the pixel having the maximum gradient value. He thereafter defines two new regions of interest. The updated model is initialized close to the detected segment and the algorithm is stopped based on a depth criterion. In spite of the effectiveness of explicit models in modelling various road shapes, the segmentation result depends enormously on the initialization of these models. Moreover, they are characterized by their great sensitivity to noise and they require considerable execution time, which makes these methods inadequate for real-time systems.

2.2 Feature-Based Approach

The feature-based methods detect lane limits based on characteristics extracted from marking regions or from marking edges. These limits can be detected using different methods, such as geometric filtering methods, supervised training methods, or tracking methods.

Rule-based methods. The rule-based methods consist in searching for forms similar to those of lane marking limits according to user-defined classification rules. D'Cruz [18] uses position characteristics to extract the required regions and adjusts the retained pixels by a quadratic function. Cheng [19] uses colour for the image segmentation, and then calculates form, geometry and motion characteristics to eliminate false detections. As for Yim [29], he extracts the characteristics of the starting position, the direction and the intensity. These methods are conceptually fast and simple but they are not general; they are specific to the conditions defined by the user.

Supervised training methods. The supervised training methods automatically produce the classification rules based on a training image dataset containing examples of processed cases. Gonzalez [30] extracts position, form and geometry characteristics from the segmented image (the size of each segmented object, the centre of gravity, the orientation, the maximum width, etc.). Then, he selects the marking areas from the segmented regions using a decision tree classifier. The author detects the LL by adjusting the centres of gravity of the retained regions by a quadratic function. Kim [15] extracts 81 parameters, from small windows, related to the RGB components. Then he applies a multi-layer perceptron neural network to classify the retained regions into markings and non-markings.
Tracking methods. The basic idea of the tracking methods is to consider iteratively, from a starting point, different segments of a defined length and with different directions. In each iteration, only the segment which best corresponds to the required structure according to given criteria is kept. The authors of [15, 27] apply a Markov process to link the segments retained in the approximate pixel extraction step, which uses a morphological gradient operator. The aim of this process is to predict the position of the following segment knowing only the position of the current one. In each iteration, they consider different segments of a given length oriented in different directions with respect to the initial segment's direction. The segment which best corresponds to the LL is kept and added to the list of pixels of the tracked limit.
3 A New Method for Lane Detection

This section describes our proposed lane detection method. Our process, illustrated in Figure 2, comprises three stages: pre-processing (delimitation of the region of interest, median filtering, contrast enhancement), extraction of approximate pixels (image segmentation, edge pixel extraction), and lane detection (detection in the near and far lane regions).

Fig. 2. Lane detection process
In the pre-processing step, we apply a list of transformations to delimit the region of interest and to improve the contrast of the image. In the extraction of approximate pixels step, we first segment the image to extract the approximate regions of the lane limit markings, and then we select their edge pixels. In the lane detection step, as the road may be formed by linear and curved portions, we present the techniques we apply to detect the corresponding lane markings.

3.1 Pre-processing

In the pre-processing step, we treat the image to restrict the region of interest and improve the contrast of the image. We resize each processed image to half of the initial
size, and then delimit our region of interest to the lower part, which occupies a third of the image. We delimit two rectangular regions to detect the linear portions of the road: ROIr to detect the right limit, and ROIl to detect the left limit (cf. Figure 3.a). In some cases, shadows, lights and object reflections as well as blurred lane markings are present in the image. To deal with these cases, and in order to improve the result of the approximate pixel extraction, only the pixels belonging to bright areas are extracted from each region of interest, since lane marking areas are bright. Hence, we proceed according to the following three steps:

1) We apply a median filter to the gray level image, given that it is the most used method to reduce impulsive noise while preserving sharp edges.

2) We apply the top-hat transformation to extract the bright regions regardless of background variations; thus, the extraction of lane markings will not be affected by the presence of strong shadows or illuminations. The result of the top-hat transformation is illustrated in Figure 3.b.

3) We increase the contrast of the approximate lane marking regions in order to facilitate their extraction. Since the road in ROIr and ROIl is generally free, the mean value ‘M’ of the pixels composing each region in the grayscale image indicates the road condition. Lower values may be caused either by the presence of strong shadows, a cloudy day or darkness; in these cases, lane markings are generally blurred. In contrast, higher values indicate a sunny day; in these cases, lane markings are generally clear. Thus, we enhance the pixel intensities by taking the square of the intensities of the top-hat transformed image if ‘M’ is above a threshold; otherwise, we take their cubic values to further enhance the image. The value of this threshold was determined after a series of experiments. As a result of this operation, the brightest pixels, which in the majority refer to the lane markings, end up with an intensity value of 255, which facilitates their extraction in the following step. This enhancement is not affected by light reflections, thanks to the use of the top-hat transformation as indicated above. The result of the contrast enhancement is illustrated in Figure 3.c.
Fig. 3. Pre-processing step. (a) Original image, (b) image after top hat transformation, (c) image after contrast enhancement.
Our proposed contrast enhancement step allows us to improve the contrast of the lane markings in different environment conditions, such as the presence of strong shadows, illumination variation, object reflections, darkness, and cloudy, heavy-rain or intensely sunny days.
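A possible OpenCV realisation of this pre-processing chain is sketched below; the kernel size, the mean threshold and the decision to apply the power law to the raw 8-bit top-hat values (so that marking pixels saturate to 255) are illustrative assumptions.

```python
import cv2
import numpy as np

def enhance_roi(roi_bgr, mean_threshold=100, kernel_size=15):
    """Pre-processing of one region of interest (Sect. 3.1): median filtering,
    top-hat transform and conditional power-law stretching.  The kernel size
    and the mean threshold are illustrative values."""
    gray = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)                       # impulsive-noise removal

    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    tophat = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, kernel)  # keep bright structures

    m = gray.mean()                                      # road/illumination indicator
    power = 2 if m > mean_threshold else 3               # stronger boost in dark scenes
    enhanced = np.clip(tophat.astype(np.int64) ** power, 0, 255)
    return enhanced.astype(np.uint8)                     # marking pixels saturate to 255
```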
3.2 Extraction of Approximate Pixels

In this step, we extract the approximate pixels of the lane markings based on the pre-processed image. To reach this objective, we proceed in two steps:

1) We retain the pixels having an intensity equal to 255 (cf. Figure 4.a).

2) We label the obtained binary mask, eliminate small regions and extract the edge pixels of the retained regions to obtain the approximate pixels. This is done by first delimiting each retained region by a rectangle and then collecting the first white pixel encountered in each line of the rectangle, scanning the lines from left to right in ROIr and from right to left in ROIl. The result is illustrated in Figure 4.b.
Fig. 4. Extraction of approximate pixels step. (a) segmented image, (b) retained pixels.
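The extraction step can be sketched as follows; the minimum region area is an illustrative value, and the scan direction flag distinguishes ROIr from ROIl.

```python
import cv2
import numpy as np

def approximate_pixels(enhanced, scan_left_to_right=True, min_area=30):
    """Sect. 3.2: keep pixels at intensity 255, drop small connected
    components and collect the first white pixel of each row of every
    remaining component.  `scan_left_to_right` is True for ROIr and False
    for ROIl; `min_area` is an illustrative value."""
    mask = (enhanced == 255).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)

    pixels = []
    for lbl in range(1, n):
        if stats[lbl, cv2.CC_STAT_AREA] < min_area:
            continue                                   # eliminate small regions
        x = stats[lbl, cv2.CC_STAT_LEFT]
        y = stats[lbl, cv2.CC_STAT_TOP]
        w = stats[lbl, cv2.CC_STAT_WIDTH]
        h = stats[lbl, cv2.CC_STAT_HEIGHT]
        for row in range(y, y + h):
            cols = np.where(labels[row, x:x + w] == lbl)[0]
            if cols.size:
                col = cols[0] if scan_left_to_right else cols[-1]
                pixels.append((row, x + col))          # first white pixel of the row
    return pixels
```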
3.3 Lane Detection

In this section, we present the lane detection step, which is based on the model-based approach. As the road may be formed by linear and curved portions, we present the techniques we apply to detect the lane markings in the near and far lane regions.

Detection in the near lane region. The near lane region includes ROIr and ROIl. In this region, we apply the Hough transform to detect the lane limits, given its rapidity and its robustness in the presence of noise and partial coverage of markings. In the Hough transform, each line is a vector of parametric coordinates, namely the orientation θ and the distance ρ to the origin. The first coordinate corresponds to the angle formed by the vertical axis of the image and the segment passing through the origin and perpendicular to this line. From the set of obtained segments, we limit the search to lines that have an orientation between 25° and 70° in ROIl, and an orientation between -25° and -70° in ROIr. Among these lines, we retain the segment with the maximum length in each region (cf. Figure 5.a). Finally, we project the two retained limits up to the sides of ROIr and ROIl (cf. Figure 5.b).
Fig. 5. Lane detection step based on model approach
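The near-region detection can be illustrated with the sketch below, which uses the probabilistic Hough transform of OpenCV rather than the standard ρ–θ accumulator; the vote threshold, minimum length and maximum gap are illustrative values.

```python
import cv2
import numpy as np

def detect_near_limit(edge_pixels, roi_shape, is_left_roi):
    """Detect the lane limit of one near-region ROI: run a (probabilistic)
    Hough transform on the approximate pixels, keep lines whose orientation
    w.r.t. the vertical axis lies in [25, 70] degrees for ROIl or
    [-70, -25] degrees for ROIr, and return the longest one."""
    mask = np.zeros(roi_shape, dtype=np.uint8)
    for r, c in edge_pixels:
        mask[r, c] = 255

    lines = cv2.HoughLinesP(mask, 1, np.pi / 180, threshold=30,
                            minLineLength=20, maxLineGap=10)
    if lines is None:
        return None
    best, best_len = None, 0.0
    for x1, y1, x2, y2 in lines[:, 0, :]:
        dx, dy = x2 - x1, y2 - y1
        if dy < 0:                                         # orient segments consistently
            dx, dy = -dx, -dy
        angle = np.degrees(np.arctan2(dx, dy))             # angle from the vertical axis
        in_range = (25 <= angle <= 70) if is_left_roi else (-70 <= angle <= -25)
        length = np.hypot(dx, dy)
        if in_range and length > best_len:
            best, best_len = (x1, y1, x2, y2), length
    return best                                            # longest admissible segment
```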
Using the linear equation of each detected limit, we determine the coordinates u and v of their pixels and we organise them in two different lists: List_r for the right limit, and List_l for the left limit:
\text{List}_r = \{(u_i^r, v_i^r) \mid i = 1, 2, \ldots, m_1\}    (1)

\text{List}_l = \{(u_i^l, v_i^l) \mid i = 1, 2, \ldots, m_2\}    (2)
where m_1 and m_2 are the numbers of pixels collected in List_r and List_l, respectively.

Detection in the far lane region. To account for a possible curved lane portion, we delimit a third region of interest ROIc, bounded by Hz and Y, according to the position of the detected limits in the near region (cf. Figure 6.b). Hz is the horizontal line passing through the intersection of the two detected limits, namely the vanishing line; Y is the horizontal line that delimits the near lane region. The detection process is detailed in our previous work [21]. To detect the lane markings in ROIc, we proceed in two steps: 1) We apply the same pre-processing step as in ROIl and ROIr (cf. Figure 6.c). 2) We extract the approximate pixels by following a mid-to-side strategy: we scan the horizontal lines of ROIc upwards, and from the middle towards the right and left sides (cf. Figure 6.d). The lane centre position for each row can be determined based on the width of the lane in the corresponding row. For each side, we retain the first white pixel encountered that satisfies certain constraints, and we add it to the appropriate list (List_r or List_l). We begin by collecting some approximate pixels from the two sides of the lane. Then we apply the Hough transform to detect the main line segment on each side and search for the position of the new vanishing point. By comparing this new position with the initial one, we get an idea of the road trajectory and then continue the extraction process.
Fig. 6. Lane detection in near and far lane regions. (a) Detection of linear portions in ROIr and ROIl, (b) delimitation of ROIc, (c) pre-processing of ROIc, (d) extraction of approximate pixels, (e) lane detection.
To obtain a continuous layout of the limits in the near and far lane regions, we apply to each obtained list a linear-parabolic model that corresponds to a quadratic polynomial function. The final result is illustrated in Figure 6.e.
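The fitting of each list can be sketched as a least-squares quadratic fit; which of the two image coordinates is treated as the independent variable is an assumption made here for illustration.

```python
import numpy as np

def fit_lane_limit(point_list):
    """Least-squares quadratic fit u = a*v**2 + b*v + c to one list of lane
    limit pixels (u_i, v_i); returns a callable giving u for any v."""
    u = np.array([p[0] for p in point_list], dtype=float)
    v = np.array([p[1] for p in point_list], dtype=float)
    a, b, c = np.polyfit(v, u, 2)
    return lambda vv: a * vv ** 2 + b * vv + c
```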
4 Experimental Results and Performance Evaluation

In order to evaluate the performance of our proposed method, we carried out a series of experiments on various sequences. To clarify the experimental conditions, the following sections start by providing a brief overview of the test dataset and then proceed to present the experimental results and the performance evaluation.
4.1 The Test Dataset

The test dataset is composed of several video sequences divided into three sets. The first set (S1) contains four video sequences that we captured on Tunisian roads: Video 1 (1400 frames), Video 2 (1220 frames), Video 3 (860 frames) and Video 4 (500 frames). They were taken on highways and main roads at different times of the day and on sunny, intensely sunny, cloudy and rainy days. The roadway comprises two lanes separated by continuous and/or discontinuous white lines. The images present various road conditions, such as the presence of blurred, contrasted and well-delimited markings, as well as various types of markings and partial coverage of markings by obstacles. The second set (S2) contains a night-vision stereo sequence of 600 frames provided by Daimler AG¹. The third set (S3) contains the video sequences of the PETS2001² dataset provided by Essex University (England): a sequence (Video 5) of 2866 frames and a sequence (Video 6) of 2867 frames. They were taken during sunny and intensely sunny days, and they show a roadway that comprises three lanes separated by discontinuous lane markings; two continuous white lines delimit the edges of the road.

4.2 Experimental Results

This section presents the results of a series of experiments carried out with our proposed lane detection method. In order to evaluate and compare the performance of our lane detection method, we have implemented three algorithms among the best-known methods. For the model-based approach, we implement the Canny/Hough Estimation of Vanishing Points (CHEVP) algorithm proposed by Wang et al. [10] and adopted by Tian et al. [11], which is based on local parametric detection (Method A), and the global parametric detection method (Method B) proposed by Zhu et al. [9]. For the feature-based approach, we implement the tracking method (Method C) proposed by Tsai et al. [16]. We also present the detection results based on our previous work [21]. In this comparison, we discarded the explicit parametric methods of the model-based approach and both the geometric filtering and supervised training methods of the feature-based approach, for two main reasons: i) their great sensitivity to noise and to partial coverage of lane markings by obstacles; and ii) our quest for a robust and fast algorithm to delimit the lane limits. Table 1 illustrates the detection results of the different methods in different environment conditions. The first column of Table 1 indicates the environment condition. The second column shows the original images taken from S1 (cf. Figures a (Video 1), b (Video 2), e (Video 3), c and h (Video 4)), S2 (cf. Figure d) and S3 (cf. Figures g (Video 5) and f (Video 6)). The following columns successively illustrate the lane detection results obtained with our proposed method, our previous work, Method A, Method B and Method C.
¹ http://www.mi.auckland.ac.nz
² ftp://ftp.cs.rdg.ac.uk
Table 1. Comparative evaluation of the lane detection step. Columns: environment condition, original image, proposed method, previous work, Method A, Method B, Method C. Rows: (a) strong bridge shadow and object reflections, (b) cloudy day, (c) heavy rain, (d) night time, (e) normal conditions, (f) obstacles, (g) strong tree shadows, (h) intense rain.
Most of the methods give good results in normal conditions (cf. Figure e) and at night time (cf. Figure d). With Method A, a misalignment was caused by obstacles (cf. Figure f) and some misdetections were caused by strong shadows, object reflections and intense lighting where the lane markings have a very low contrast (cf. Figures a and c). Such failures are due to the use of the Canny edge detector on the original image to extract the edge pixels without applying a pre-processing step to enhance the lane markings. As for Method B, which performs a global detection of the lane markings, some misdetections were caused by strong shadows, object reflections and intense lighting where the lane markings have a very low contrast (cf. Figures a, c and g). Such failures are due to the use of the histogram equalization technique in the pre-processing step to enhance the image, which is not effective in all environments. Some misalignments were obtained with Method C. As illustrated in Figures a, b and f, these misalignments are caused by a false detection of the first or the following lane segments in the presence of object reflections on the windshield of the car, obstacles or a lack of markings. We noted that this method gives good results in the presence of heavy rain, strong shadows and intense lighting due to the use of morphological transformations in its pre-processing step (cf. Figures c and g). With our previous work, we obtained good results in normal environment conditions. It efficiently detects the lane limits in the presence of curved roads
(cf. Figures b and e) and in the presence of obstacles (cf. Figure f). Furthermore, our previous work detects the limits during a cloudy day (cf. Figure b) and at night time (cf. Figure d). Its weakness appears in the presence of strong shadows (cf. Figures a and g), object reflections and intense lighting (cf. Figure a) and in the presence of rain (cf. Figure c). The robustness of our new method in comparison with our previous work and the three other methods is seen in frames presenting a combination of strong shadows, widely spaced blurred lane markings, illumination variation and intense lighting (cf. Figures a, c and g), thanks to the efficiency of our pre-processing step. Our method gives, like Method C, good results in the presence of heavy rain (cf. Figure c). Furthermore, it efficiently detects the limits in the case of curved lanes (cf. Figures b and e) and in the presence of obstacles (cf. Figure f). There remain a few cases where detection fails with our method as well as with the others, such as the presence of a sharply curved lane, a non-flat road, a congested road and intense rain, as illustrated in Figure h.

4.3 Performance Evaluation

In order to further evaluate our lane detection method and put it in context, we compared its detection performance with the three methods in terms of Recall, Precision and F-measure. We performed this evaluation on the set S3 (the PETS2001 dataset) since it is a known dataset, and in order to provide our lane detection rates for comparison with other research. We manually detected the limits of 1000 frames: the first 500 frames of Video 5 and the first 500 frames of Video 6. Considering the reference lane limits, which correspond to the lane limits that we detected manually, and the extracted lane limits, which correspond to the automatically extracted ones, these measures are defined by the following equations:
\text{Recall} = \frac{\text{length of matched reference}}{\text{length of reference}}    (4)

\text{Precision} = \frac{\text{length of matched extraction}}{\text{length of extraction}}    (5)

\text{F-measure} = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}    (6)

The performance measures of the different methods are illustrated in Figure 7.

Fig. 7. Comparative performance evaluation of the five methods
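Given the matched and total lengths, the three measures of Eqs. (4)–(6) reduce to the following sketch.

```python
def lane_detection_scores(len_matched_ref, len_ref, len_matched_ext, len_ext):
    """Recall, Precision and F-measure of Eqs. (4)-(6), computed from the
    lengths (e.g. in pixels) of the matched and total reference and
    extracted lane limits."""
    recall = len_matched_ref / len_ref
    precision = len_matched_ext / len_ext
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure
```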
With our proposed method, we obtained an average rate of 92.04% for the Recall, 91.53% for the Precision and 91.76% for the F-measure. These results show the improvement over our previous work, which gave an average rate of 78.59% for the Recall, 78.16% for the Precision and 72.55% for the F-measure. Our results are almost equal to those of Method C. However, this method is characterized by its sensitivity to noise, which leads it to present more cases of false detection than our method. On the other hand, our previous work, Method A and Method B give lower rates, especially because of the presence of strong shadows and intense lighting in some frames.
5 Conclusion

In this study, we presented our lane detection method. Our experimental evaluation showed that, with our method, we can detect lane limits in the majority of images in the presence of strong shadows, partial coverage of the limits by obstacles, reflection of objects on the windshield of the car, different lane markings that can be blurred, contrasted, and continuous or dashed, and in different weather conditions. The comparison of our results with three known methods demonstrates its robustness in complex environment conditions. The findings of the present study are promising. For this reason, studies are currently underway in our laboratories to investigate the tracking of the limits in order to obtain a continuous detection while minimizing noise and execution time.
References
1. Dai, X., Kummert, A., Park, S.B., Neisius, D.: A warning algorithm for lane departure warning system. In: IEEE Intelligent Vehicles Symposium (2009)
2. Yu, B., Zhang, W., Cai, Y.: A Lane Departure Warning System based on Machine Vision. In: IEEE Pacific-Asia Workshop on Computational Intelligence and Industrial Application (2008)
3. Guo, L., Wang, J., Li, K.: Lane Keeping System Based on THASV-II Platform. In: IEEE International Conference on Vehicular Electronics and Safety (2006)
4. Dariu, M.G., Uwe, F., Wohler, C., Gorzig, S.: Real-time vision for intelligent vehicles. IEEE Instrumentation and Measurement Magazine (2001)
5. Assidiq, A., Khalifa, O., Islam, R., Khan, S.: Real Time Lane Detection for Autonomous Vehicles. In: Proceedings of the International Conference on Computer and Communication Engineering, Kuala Lumpur, Malaysia (2008)
6. Yeh, C., Chen, Y.: Development of Vision-Based Lane and Vehicle Detecting Systems via the Implementation with a Dual-Core DSP. In: IEEE Int. Transp. Systems Conference, Canada (2006)
7. Lin, H., Ko, S., Hi, W., Kim, Y., Kim, H.: Lane departure identification on Highway with searching the region of interest on Hough space. In: International Conference on Control, Automation and Systems, Seoul, Korea (2008)
8. Jung, C.R., Christian, R.K.: A Robust Linear-Parabolic Model for Lane Following. In: Proceedings of the XVII Brazilian Symposium on Computer Graphics and Image Processing (2004)
9. Zhu, W., Chen, Q., Wang, H.: Lane Detection in Some Complex Conditions. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China (2006)
10. Wang, Y., Shen, D., Teoh, E.K.: Lane Detection Using Spline Model. Pattern Recognition Letters 21, 677–689 (2000)
11. Tian, M., Fuqiang, L., Wenhong, Z., Chao, X.: Vision Based Lane detection for Active Security in Intelligent Vehicle. In: IEEE Conference on Vehicular Electronics and Safety, pp. 507–511 (2006)
12. D’Cruz, C., Zou, J.J.: Lane detection for driver assistance and intelligent vehicle applications. In: International Symposium on Communications and Information Technologies (2007)
13. Cheng, H., Jeng, B., Tseng, P., Fan, K.: Lane detection with moving vehicle in the traffic scenes. IEEE Trans. on Intelligent Transportation Systems 7, 571–582 (2006)
14. Gonzalez, J.P., Omit, O.: Lane Detection Using Histogram-Based Segmentation and Decision Trees. In: Int. Transp. Systems Conference Proceedings, Dearborn (MI), USA (2000)
15. Zu, K.: Realtime Lane Tracking of Curved Local Road. In: Proceedings of the IEEE Intelligent Transportation Systems Conference, Toronto, Canada (2006)
16. Tsai, M., Hsu-Yung, C., Yu, C., Tseng, C., Fan, K., Hwang, J., Jeng, B.: Lane Detection Using Directional Random Walks. In: International Conference on Acoustics, Speech, and Signal Processing (2008)
17. Asif, M., Arshad, M.R., Yousuf, M., Zia, I., Yahya, A.: An Implementation of Active Contour and Kalman Filter for Road Tracking. International Journal of Applied Mathematics (2006)
18. Lim, K.H., Seng, K.P., Ngo, A.C., Ang, L.: Real-time Implementation of Vision-based Lane Detection and Tracking. In: Int. Conference on Intelligent Human-Machine Systems and Cybernetics (2009)
19. Wang, Y., Teoh, E.K., Shen, D.: Lane detection and tracking using B-Snake. Image and Vision Computing 22, 269–280 (2004)
20. Boumediene, M., Ouamri, A., Dahnoun, N.: Lane Boundary Detection and Tracking using NNF and HMM Approaches. In: IEEE Intelligent Vehicles Symposium, Istanbul, Turkey (2007)
21. Ben Romdhane, N., Hammami, M., Ben-Abdallah, H.: An Artificial Vision-based Lane Detection Algorithm. In: International Conference on Computer Science and its Applications, Korea (2009)
A New Multi-camera Approach for Lane Departure Warning

Amol Borkar¹, Monson Hayes², and Mark T. Smith³

¹ Georgia Institute of Technology, Atlanta, GA, USA
[email protected]
² Chung-Ang University, Seoul, Korea
[email protected]
³ Kungliga Tekniska Högskolan, Stockholm, Sweden
[email protected]
Abstract. In this paper, we present a new multi-camera approach to Lane Departure Warning (LDW). Upon acquisition, the captured images are transformed to a bird's-eye view using a modified perspective removal transformation. Then, camera calibration is used to accurately determine the position of the two cameras relative to a reference point. Lane detection is performed on the front and rear camera images, which are combined using data fusion. Finally, the distance between the vehicle and the adjacent lane boundaries is determined, allowing LDW to be performed. The proposed system was tested on real-world driving videos and shows good results when compared to ground truth. Keywords: Lane departure warning, multi camera calibration, data fusion, non-overlapping camera networks.
1 Introduction
Driver Assistance (DA) systems are a common accessory in today's passenger and commercial vehicles. DA systems, as the name suggests, are systems that provide aid or feedback to the driver of a vehicle. One example of such a system is Lane Departure Warning (LDW). LDW is a safety feature that warns the driver if the vehicle appears to change lanes unless certain conditions are met, e.g., the turn indicator is on. The hardware often consists of a single camera which acquires images out of the front windshield that are analyzed to determine if the vehicle is within a certain distance of a lane boundary. Although LDW has been researched for many years, it is only recently that LDW systems have started appearing in luxury cars and as after-market products. Variations of the Time to Lane Crossing (TLC) are some of the most popular techniques used for LDW [1–4]. Using a “look-ahead” approach, the future position of the vehicle is estimated based on the steering angle and by extrapolating detected lane positions; the driver is then warned if a lane crossing is predicted. The lane detection component of LDW often produces noisy results; consequently, the extrapolated estimates may not accurately represent
future lane positions, leading to false signaling. Other LDW techniques involve determining the angles or orientations of detected lane boundaries to hypothesize a distance metric [5–7]; the actual distance to the adjacent boundaries is not computed. In the proposed work, we compute the immediate distance between the vehicle and the adjacent boundaries using known camera geometries. In addition, we add a second camera that looks out of the rear windshield. This provides redundant lane boundary position information which is used in data fusion. Since we are combining data from multiple cameras, calibration is important. Calibration for stereo and n-camera setups is often performed using variations of standard test patterns like checkerboards to establish a baseline [8]. However, these techniques require each camera to be able to see the same checkerboard. Since our cameras face in opposite directions, the checkerboard methods are ineffective. Work on calibration for non-overlapping camera networks has been conducted in [9–11]. The method shown in [9, 10] uses a planar mirror, while [11] uses a moving rig to calibrate each camera. However, the rig movement appears arbitrary, which can severely affect the estimates of camera positions. A similar experiment with a multi-camera setup for vehicular applications has been shown in [12]; however, the method used to merge information from the two cameras is only stated briefly and lacks details. Additionally, calibration of opposite-facing cameras is not discussed. In our work, we have devised a new yet simple calibration technique that allows us to accurately determine the position of the two cameras in the vehicle. This paper is organized as follows: subsequent to the introduction, pre-processing in the form of a geometric transformation applied to the acquired images is explained. Then, the details of camera calibration and data fusion are described. Finally, LDW is completed by explaining the technique used to measure the distance to lane boundaries. The proposed approach was compared to a single-camera LDW system as well as ground truth and showed good results when tested on real-world driving videos.
2 Pre-processing
In a camera image, lane markings appear as lines that decrease in thickness and converge near the horizon. This can be seen on the left in Fig. 1. Consequently, feature extraction can become difficult since the line features change as a function of distance from the camera due to the effects of perspective. As a result, the camera-acquired images first undergo a geometric transformation known as Inverse Perspective Mapping (IPM) to remove the effects of perspective from the image [13]. With IPM, the captured images are transformed to appear as a bird’s-eye view, with lane markers now appearing as nearly parallel lines. In addition, the thickness of these lines remains constant in the IPM image, as shown on the right in Fig. 1. The camera configuration is shown in Fig. 2. The mapping from world co-ordinates (x, y, z = 0) to image co-ordinates (r, c) is given as

r(x) = \frac{M-1}{2}\left[1 + \frac{h - x\tan\theta_o}{h\tan\theta_o + x}\cot\alpha_v\right] + 1    (1)
Fig. 1. Pre-processing stage

Fig. 2. Camera parameters. (a) Side view. (b) Top view.
c(x, y) = \frac{N-1}{2}\left[1 - \frac{y}{h\sin\theta_o + x\cos\theta_o}\cot\alpha_u\right] + 1    (2)

where h is the height of the camera, θo is the pitch, αu is the horizontal field of view, αv is the vertical field of view, γ is the yaw, and (M × N) is the size of the camera image. Furthermore,

\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\gamma & -\sin\gamma \\ \sin\gamma & \cos\gamma \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}    (3)

provides the yaw correction by rotating the world co-ordinates about the optical center of the camera. This mapping differs slightly from the technique described in [13], where the horizontal and vertical fields of view of the camera are assumed to be equal. However, in most commercial camera hardware, these values are different.
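To make the mapping concrete, the sketch below implements Eqs. (1)–(3) as stated; it is a minimal illustration, not the authors' code. All parameter values, the pixel indexing convention and whether αu, αv denote half or full fields of view are assumptions of this sketch.

```python
import numpy as np

def world_to_image(x, y, h, theta0, alpha_u, alpha_v, gamma, M, N):
    """Map a world point (x, y, z=0) to image coordinates (r, c) via Eqs. (1)-(3).

    h: camera height, theta0: pitch, alpha_u/alpha_v: horizontal/vertical fields
    of view, gamma: yaw, (M, N): image size in rows x columns. Angles in radians.
    """
    # Eq. (3): yaw correction, rotating the world co-ordinates about the optical center
    xr = np.cos(gamma) * x - np.sin(gamma) * y
    yr = np.sin(gamma) * x + np.cos(gamma) * y

    # Eq. (1): row index depends only on the yaw-corrected longitudinal distance
    r = (M - 1) / 2.0 * (1 + (h - xr * np.tan(theta0)) / (h * np.tan(theta0) + xr)
                         / np.tan(alpha_v)) + 1

    # Eq. (2): column index
    c = (N - 1) / 2.0 * (1 - yr / (h * np.sin(theta0) + xr * np.cos(theta0))
                         / np.tan(alpha_u)) + 1
    return r, c

# Illustrative call only (values are placeholders, h in feet as in Section 3)
r, c = world_to_image(x=30.0, y=2.0, h=4.79, theta0=np.radians(6),
                      alpha_u=np.radians(25), alpha_v=np.radians(20),
                      gamma=0.0, M=480, N=640)
print(r, c)
```

Inverting this lookup over a regular (x, y) grid yields the bird's-eye IPM image described above.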
3 Camera Calibration
In camera calibration, an origin needs to be specified. This point is chosen as the center of the vehicle, as shown in Fig. 3, and denoted as P_o = [0 0 0]^T. The positions of the front and rear cameras relative to P_o are then determined. These positions are denoted as P_f for the front camera and P_r for the rear camera. P_r is also oriented looking out of the back of the vehicle, which results in its IPM image
Fig. 3. Position of the origin and two cameras
Fig. 4. Sample camera images. (a) Front camera image. (b) Rear camera image.
being rotated 180°. Since small measurement errors in P_f and P_r could lead to critical errors during the data fusion used later, the accuracy of these measurements is verified by comparing them to reference data subject to certain constraints. The reference data is made up of images captured while driving at a constant speed near the middle of a straight road where the distance between adjacent lane boundaries is constant and uniform. Both cameras undergo basic calibration to determine yaw (γ) and pitch (θo), which prevents the lane boundaries from appearing skewed or tilted in the IPM image. The cameras are therefore 180° opposed, facing in opposite directions along the X-axis of the reference co-ordinate system shown in Fig. 3. Sample front and rear camera images are shown in Fig. 4. The measured X and Y co-ordinates of P_f are assumed to be correct. The techniques described below accurately determine the height of P_f as well as the translation of P_r relative to P_o.

3.1 Height Measurement
We combine Eqs. (1) and (2) to determine the mapping of a pixel from the camera to world co-ordinates as

Y(r, c) = h \cdot \frac{\left[1 - 2\left(\frac{c-1}{N-1}\right)\right]\tan\alpha_u}{\sin\theta_o - \left[1 - 2\left(\frac{r-1}{M-1}\right)\right]\tan\alpha_v\cos\theta_o}    (4)

We choose two points (r, c1) and (r, c2) on adjacent lane boundaries in the camera image and find their corresponding mappings to the IPM domain as Y(r, c1) and Y(r, c2), respectively, as shown in Fig. 5. Since it is not possible to park on the
Fig. 5. Mapping points from the camera image to the IPM image. (a) Two points in the camera image. (b) Two points in the IPM image.
highway and physically measure the distance between adjacent lane boundaries, we refer to the Federal Highway Administration’s (FHA) handbook regarding specifications for painted markings [14], which states that the distance between lane boundaries on highways is on average 12 ft. Using this information, we can set

Y(r, c_1) - Y(r, c_2) = \Delta Y = 12 \text{ feet}    (5)

Hence, we substitute Eq. (4) into Eq. (5) and solve for h as

h = \Delta Y \cdot \frac{\sin\theta_o - \left[1 - 2\left(\frac{r-1}{M-1}\right)\right]\tan\alpha_v\cos\theta_o}{2\left[\frac{c_2 - c_1}{N-1}\right]\tan\alpha_u}    (6)
Similarly, we can repeat this procedure to determine the height of the rear camera.
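A minimal sketch of the height estimate of Eq. (6) follows; it assumes the reconstruction of Eqs. (4)–(6) above, and the pixel and angle values used in the example are placeholders, not calibration data from the paper.

```python
import numpy as np

def camera_height(r, c1, c2, theta0, alpha_u, alpha_v, M, N, delta_y=12.0):
    """Estimate the camera height via Eq. (6) from two pixels (r, c1), (r, c2)
    lying on adjacent lane boundaries in the same image row.

    delta_y is the known lane width (12 ft per the FHA handbook cited above);
    the remaining parameters follow the notation of Eqs. (1)-(2).
    """
    num = np.sin(theta0) - (1 - 2 * (r - 1) / (M - 1)) * np.tan(alpha_v) * np.cos(theta0)
    den = 2 * ((c2 - c1) / (N - 1)) * np.tan(alpha_u)
    return delta_y * num / den

# Illustrative values only; a real calibration would average over many frames
h = camera_height(r=400, c1=180, c2=460, theta0=np.radians(6),
                  alpha_u=np.radians(25), alpha_v=np.radians(20), M=480, N=640)
print(f"estimated camera height: {h:.2f} ft")
```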
3.2 Offset between Cameras
If the two cameras are not aligned properly, there will be a visible offset between the lane markings when observed in each IPM image. This is illustrated in Fig. 6 by placing the two IPM images side by side. Since P_f is in line with the origin, the offset is adjusted by translating the rear camera IPM image along the Y-axis until the lane markings line up correctly. The translation in the IPM image is scaled to world units to determine the actual offset between the two cameras.
3.3 Distance between Cameras
To determine the distance between the cameras, we need to find objects with corners in each IPM image. In our case, we use the diamond from the high occupancy vehicle (HOV) lane, as shown in Fig. 7. Since the vehicle is traveling at a constant speed, and knowing the time delay until the diamond from the front camera IPM image reappears in the rear camera IPM image, the total covered distance can be computed as

w = Speed × Time Delay    (7)
Fig. 6. Error in camera alignment, which results in lane markings being offset from one view to the other.
Fig. 7. Determining the distance between the cameras
By setting (x = 0, y = 0) in Eq. (3), the optical center of each camera relative to its IPM image can be determined; it is illustrated by the red dots in Fig. 7. Since the lens used in this implementation has a focal length of 5 mm, it is safe to assume that the optical center is analogous to the camera position. As a result, it is possible to determine the distance j or k between a specific point on the diamond and the camera itself. Consequently, the distance between the cameras can be estimated as

Distance between cameras = w − (j + k)    (8)
After completing the three verification steps described above, the position of the front camera is determined as P_f = [2.0  0.0  4.79]^T and that of the rear camera as P_r = [−7.51  −0.48  4.94]^T relative to P_o. The units are in feet.
4 Lane Detection
Lane detection is performed using the technique presented in [15] on both forward and rear camera views. Following IPM, the transformed image is converted
from RGB to YCbCr to aid in color segmentation. Then, the images are cross-correlated with a collection of predefined templates to find candidate lane regions. These regions then undergo connected-components analysis, morphological operations, and elliptical projections to approximate the positions of the lane markers. A Kalman filter enables tracking lane markers on curved roads, while RANSAC helps improve estimates by eliminating outliers [15].
5 Data Fusion
In a standalone IPM image, the optical center, e.g., P_f, is at the origin. Each pixel in the IPM image has a certain (x, y) co-ordinate relative to its optical center. This approach is acceptable when dealing with a single camera; however, when dealing with multiple cameras, a common co-ordinate system needs to be established. In our implementation, P_o is chosen as the origin of this space and shown as a red dot in Fig. 8. The two green dots in Fig. 8 represent P_f and P_r. Next, a transformation T_f that maps only the (x, y) co-ordinates of P_o to P_f is determined. This transformation, when applied to the IPM image, maps the (x, y) co-ordinates of each pixel to a location in the common co-ordinate system that is now relative to P_o. Consequently, the lane boundaries detected in the front camera IPM image are also mapped to the common space and described by (x, y) co-ordinates relative to P_o. A transformation mapping P_o to P_r is also computed and used to relocate the rear camera IPM pixels to the common co-ordinate space.
Fig. 8. Position of the origin and two cameras
Upon completing the front and rear lane detections, data fusion in the form of a parametric linear spline connects the detected lane boundaries. Using an end point from each boundary, as shown in Fig. 9, the locations of the lane boundaries outside the cameras’ view can be estimated using Eq. (9):

Q(t) = \begin{pmatrix} (1-t) & t \end{pmatrix} \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix}    (9)
Fig. 9. Spline-estimated boundary shown as a dotted green line between the two end points Q1 and Q2
6 Lane Departure Warning
Upon completing camera calibration and data fusion, the following is known:

1. Center of the vehicle.
2. Dimensions of a reference vehicle.
3. Location of lane boundaries.

Using the available information, the distance to the lane boundary (in feet) on either side of the vehicle can be calculated as

Distance to marking = \left| LB_{x=0} - \mathrm{sign}[LB_{x=0}] \cdot \frac{\text{Car Width}}{2} \right| \text{ feet}    (10)
where LB_{x=0} is simply the y-value of the estimated lane boundary from Eq. (9) evaluated at x = 0. The orange bounding box in Fig. 10 depicts the dimensions of the reference vehicle in the common co-ordinate space, with P_o as the red dot. Using Eq. (10), LDW can be performed systematically by warning the driver if the vehicle is within a minimum distance from a lane boundary.
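The sketch below evaluates Eqs. (9) and (10) for one lane boundary; the end points, the car width and the assumption that the two end points have distinct x-values are illustrative, not values from the paper.

```python
import numpy as np

def boundary_y_at_vehicle(q1, q2):
    """Evaluate the linear spline of Eq. (9) at x = 0 to obtain LB_{x=0}.

    q1, q2 are (x, y) end points of the same lane boundary taken from the
    front and rear camera detections, expressed in the common co-ordinate
    system centered at Po.
    """
    (x1, y1), (x2, y2) = q1, q2
    t = (0.0 - x1) / (x2 - x1)          # parameter value where the spline crosses x = 0
    return (1 - t) * y1 + t * y2        # y-component of Q(t) = (1-t) Q1 + t Q2

def distance_to_marking(lb_y, car_width):
    """Eq. (10): lateral distance from the vehicle side to a lane boundary (feet)."""
    return abs(lb_y - np.sign(lb_y) * car_width / 2.0)

lb = boundary_y_at_vehicle((2.0, 5.8), (-7.5, 6.1))   # hypothetical end points
print(distance_to_marking(lb, car_width=6.0))
```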
Fig. 10. Distance to lane boundaries on either side
7 Results and Analysis
Two time-synchronized NTSC cameras were used in acquiring video streams for testing. The streams were captured while driving on roadways in and around
Atlanta, GA, USA. Several minutes’ worth of video were used in testing. The rear camera was inoperable at night without artificial illumination; hence, testing was restricted to daytime only. Additionally, testing was conducted only on highway footage, as most commercial LDW systems operate above 45 mph [16, 17].
Fig. 11. Front camera image and estimated position of the vehicle within the lane
In Fig. 11, a few examples of the images captured by the front camera are shown with an overhead visualization estimating the position of the vehicle within a lane. Fig. 12 shows an example of the vehicle approaching very close to a lane boundary over the duration of a clip. Fig. 12a–12c show the camera-captured images with the corresponding visualization. When the vehicle is within 0.75 ft of a lane boundary, a warning is signaled. Hysteresis is used, whereby the vehicle needs to travel more than 1 ft away from the lane boundary for the warning to be disabled. This reduces flicker around the minimum distance threshold. Fig. 12d shows a plot of the estimated distance between the right lane boundary and the vehicle over the duration of the video. The plot also compares the multi-camera approach, the single-camera approach and the ground truth. To estimate distance using a single camera, we refer back to Fig. 9 and evaluate the line formed by Q1 and Q3 at x = 0. A straight-line model is less affected by noise in the lane detection results; hence, it is used for extrapolation. Ground truth estimates are produced using cubic interpolation between the ground truth lane boundary locations in the front and rear camera images. In Fig. 12d, it can be observed that the estimates of the multi-camera approach are often very close to the ground truth. The single-camera approach fares well on straight roads but faces difficulty when encountering a curve. Fig. 12e provides a visual comparison between the ground truth, single-camera and multi-camera approaches when the vehicle encounters a turn in the test clip. The advantage of using the described multi-camera setup is that estimating distances to the lane boundaries becomes an interpolation problem. With a single-camera setup, it is an extrapolation problem, which appears to be less accurate.
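The warning-with-hysteresis behavior described above can be sketched as a small state machine; the 0.75 ft and 1 ft thresholds come from the text, while the class name and interface are illustrative assumptions.

```python
class LaneDepartureWarning:
    """Hysteresis around the minimum-distance threshold."""

    def __init__(self, warn_at=0.75, clear_at=1.0):
        self.warn_at = warn_at      # warn when closer than this (ft)
        self.clear_at = clear_at    # release only once farther than this (ft)
        self.active = False

    def update(self, distance_to_boundary):
        # Trigger below warn_at; keep warning until the vehicle exceeds clear_at
        if not self.active and distance_to_boundary < self.warn_at:
            self.active = True
        elif self.active and distance_to_boundary > self.clear_at:
            self.active = False
        return self.active

ldw = LaneDepartureWarning()
for d in [1.4, 0.9, 0.7, 0.8, 0.95, 1.1]:
    print(d, ldw.update(d))
```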
Fig. 12. An example of Lane Departure Warning (LDW). (a) Frame 73. (b) Frame 289. (c) Frame 429. (d) Distance (ft) between vehicle and right lane marker over the frames of the clip. (e) Comparison between the ground truth (red), single camera (green) and multi camera (blue) approach; detected lane markings are shown in orange.
Since the lane boundary estimates are calculated from the findings of the front and rear lane detectors, errors in lane detection reflect directly on the lane boundary estimates. A few instances of incorrect lane detections, shown in Fig. 13, resulted in incorrect estimates of the distances to the lane boundaries. In addition, when an incomplete lane detection occurs, i.e., only one boundary is detected in either the front or rear camera image, the estimates around the vehicle cannot be determined. This is shown in Fig. 13c.
Fig. 13. A few examples where the multi-camera approach faces difficulty
8 Conclusion
Presented in this paper is a novel approach to Lane Departure Warning (LDW) using multiple cameras. First, the captured images undergo an improved Inverse Perspective Mapping (IPM) transform to produce a bird’s-eye view image. Then, camera calibration is used to accurately determine the positions of the two cameras. Following lane detection, data fusion is used to connect the detected lane boundaries from one camera view to the other. By determining the distance between the vehicle and the lane boundaries, an LDW system is realized. The proposed setup showed good results when tested with real-world driving videos. The multi-camera approach also produced estimates close to the ground truth on both straight and curved roads in the test data, while the single-camera approach faced difficulty on the curved roads.
9 Future Work
The presented work is part of ongoing research. The use of smooth curves instead of straight-line estimates for lane boundaries will be explored. Handling situations in which an incomplete lane detection occurs needs to be investigated. Kalman or particle filters will be used to track LB_{x=0} on either side of the vehicle. Additional tests with complex scenarios will be conducted. Illumination hardware to enable night-time multi-camera usage will be purchased.
References 1. LeBlanc, D., Johnson, G., Venhovens, P., Gerber, G., DeSonia, R., Ervin, R., Lin, C., Ulsoy, A., Pilutti, T.: CAPC: A road-departure prevention system. IEEE Control Systems Magazine 16(6), 61–71 (1996) 2. Kwon, W., Lee, J., Shin, D., Roh, K., Kim, D., Lee, S.: Experiments on decision making strategies for a lane departure warning system. In: Proceedings of IEEE International Conference on Robotics and Automation, vol. 4, pp. 2596–2601. IEEE, Los Alamitos (1999)
3. Kim, S., Oh, S.: A driver adaptive lane departure warning system based on image processing and a fuzzy evolutionary technique. In: Proceedings of IEEE Intelligent Vehicles Symposium, pp. 361–365. IEEE, Los Alamitos (2003) 4. Dai, X., Kummert, A., Park, S., Neisius, D.: A warning algorithm for Lane Departure Warning system. In: IEEE Intelligent Vehicles Symposium, pp. 431–435. IEEE, Los Alamitos (2009) 5. Hsiao, P., Yeh, C., Huang, S., Fu, L.: A portable vision-based real-time lane departure warning system: day and night. IEEE Transactions on Vehicular Technology 58(4), 2089–2094 (2009) 6. Lin, Q., Han, Y., Hahn, H.: Real-time Lane Detection Based on Extended Edgelinking Algorithm. In: Second International Conference on Computer Research and Development, pp. 725–730 (2010) 7. Leng, Y.-C., Chen, C.-L.: Vision-based lane departure detection system in urban traffic scenes. In: 11th International Conference on Control Automation Robotics Vision, ICARCV 2010, pp. 1875–1880 (2010) 8. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) 9. L´ebraly, P., Deymier, C., Ait-Aider, O., Royer, E., Dhome, M.: Flexible extrinsic calibration of non-overlapping cameras using a planar mirror: Application to visionbased robotics. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2010, pp. 5640–5647. IEEE, Los Alamitos (2010) 10. Kumar, R., Ilie, A., Frahm, J., Pollefeys, M.: Simple calibration of non-overlapping cameras with a mirror. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008. IEEE, Los Alamitos (2008) 11. Pagel, F.: Calibration of non-overlapping cameras in vehicles. In: IEEE Intelligent Vehicles Symposium (IV), pp. 1178–1183. IEEE, Los Alamitos (2010) 12. Ieng, S., Vrignon, J., Gruyer, D., Aubert, D.: A new multi-lanes detection using multi-camera for robust vehicle location. In: Proceedings of IEEE Intelligent Vehicles Symposium, pp. 700–705. IEEE, Los Alamitos (2005) 13. Bertozzi, M., Broggi, A.: GOLD: a parallel real-time stereo vision system for generic obstacle and lane detection. IEEE Transactions on Image Processing 7(1), 62–81 (1998) 14. Federal Highway Administration. Manual Uniform Traffic Control Devices (November 2009), http://mutcd.fhwa.dot.gov/ 15. Borkar, A., Hayes, M., Smith, M.T.: A template matching and ellipse modeling approach to detecting lane markers. In: Blanc-Talon, J., Bone, D., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2010, Part II. LNCS, vol. 6475, pp. 179–190. Springer, Heidelberg (2010) 16. Infiniti USA, 2012 Infiniti M Specs & Options (2011), http://www.infinitiusa. com/m/specs-options 17. Mercedes-Benz USA, 2011 Mercedes-Benz S550 Specs (2011), http://www.mbusa. com/mercedes/vehicles/explore/specs/class-S/model-S550V
Feature Space Warping Relevance Feedback with Transductive Learning Daniele Borghesani, Dalia Coppi, Costantino Grana, Simone Calderara, and Rita Cucchiara University of Modena and Reggio Emilia, Modena, Italy http://imagelab.ing.unimore.it
Abstract. Relevance feedback is a widely adopted approach to improve content-based information retrieval systems by keeping the user in the retrieval loop. Among these, feature space warping has been proposed as an effective approach for bridging the gap between high-level semantics and low-level features. Recently, the combination of feature space warping and query point movement techniques has been proposed in contrast to learning-based approaches, showing good performance under different data distributions. In this paper we propose to merge feature space warping and transductive learning, in order to benefit both from the ability to adapt the data to the user’s hints and from the information coming from unlabeled samples. Experimental results on an image retrieval task reveal significant performance improvements from the proposed method. Keywords: Relevance feedback, covariance matrices, transductive learning, feature space warping.
1 Introduction
The use of relevance feedback strategies in information retrieval, and in particular in content-based image retrieval systems, is widely considered a very precious (sometimes necessary) addition to the system itself. At the present time, it is the most effective way to capture the user’s search intentions. The reason is pretty straightforward: the automatic association of low-level features to high-level semantics is still a very open problem, and the only practical way to identify what the user is looking for is to include him in the retrieval loop, letting him provide feedback (positive, negative or both) about what is going on. The common scenario in which relevance feedback is used within content-based image retrieval systems is the following:

1. An initial query-by-keyword or query-by-example is performed, producing a list of results ranked by increasing distance from the query in the feature space;
2. The user provides some good (and, implicitly or explicitly, bad) feedback on the displayed images, choosing in other words what is relevant and what is irrelevant;
3. An algorithm uses this information to change the displayed results in a “refinement” step to accommodate the user’s judgments;
4. Back to step 2, looping until a certain condition (or satisfaction) is reached.

In this paper, we focus on the third step, proposing a new and effective strategy for relevance feedback based on Transductive Learning. The main contribution of this work is the joint use of Transductive Learning (successfully applied to a wide variety of pattern recognition and classification tasks) with Feature Space Warping, a widely used technique for relevance feedback. Feature Space Warping, in particular, allows the relations between feature points to be modified accordingly: objects similar to positive feedbacks move closer to the query, while dissimilar objects are pushed away. We show that the union of these two techniques overcomes their respective limitations, providing a significant boost in performance over the two techniques used alone.
2 Related Work
The literature on this topic is vast [22, 5], since the problem can be approached from several points of view (computer vision, database management, human-computer interaction, artificial intelligence, even psychology). Moreover, aside from the research on algorithms for relevance feedback, there is a wide literature on how the performance of a system with relevance feedback can be safely evaluated in order to provide a fair comparison between different techniques. Regarding the algorithms, we can identify, very generally, three classes. In the first one (called Query Point Movement, QPM in short), we try to move the query point in order to create a more complete query (a fast technique to overcome slow convergence is proposed in [10]). In the second one (called Feature Space Warping, FSW in short), we try instead to manipulate the feature space or the metric space, in order to shape it in the direction of the users’ feedbacks [1, 12]. The third one applies machine learning procedures (like SVM or AdaBoost) to learn how to separate relevant samples from irrelevant ones [16, 17]. Among the usual techniques, based on SVM or boosting, we preferred testing a transduction-based learning approach. Some authors followed the same path (see [20, 15, 14, 13]). The idea is to take advantage of both the unlabeled and labeled samples in a transductive inference manner, learning from an incremental amount of training samples (feedbacks, in this case). Given the algorithm, the problem of evaluation is controversial. Even back in the Seventies, Williamson [19] proposed an evaluation methodology to tackle the so-called “ranking effect” in the “fluid” relevance feedback evaluation, i.e., the overestimated performance improvement (in terms of recall and precision) due to the repositioning of positive feedbacks at the top of the rank, as opposed to the underestimated performance improvement of the “frozen” relevance feedback evaluation, which maintains the original ranks of documents along the sessions. In his “reranked original” ranking proposal, the best ranks are assigned to the relevant documents and the worst ranks to the non-relevant documents; those documents not yet judged would remain in their original order, but with a rank decreased
by the number of non-relevant documents identified. In [9], a comprehensive analysis tries to find out the reasons for relevance feedback evaluation problems, in particular problems with the dataset (characteristics and relative ground truth), problems with the comparison (different measures, different ranking approaches that make a comparison unfair, the need for rank normalization), and finally the problem of parameter settings, which can be impractical in a real context. In our opinion, a quite fair and complete set of measures has been proposed in [11], where the authors proposed:

– actual recall and precision (computed at each iteration and relative to the current set of retrieved images solely)
– new recall and precision (computed at each iteration and relative to the previous set of retrieved images solely)
– cumulative recall and precision (computed at each iteration and relative to the whole set of iterations so far)

In this way, we can describe the behavior of the retrieval system both in terms of speed (how fast valuable images are retrieved over time) and in terms of completeness (how many good images the retrieval system finds globally). Finally, as suggested in [8], we tried to concentrate the analysis on feasible search tasks, i.e., visual topics with a good number of representatives and a low degree of uncertainty in the evaluation, in order to ensure a valuable reference ground truth.
3 Visual Similarity Using Covariance Matrices
In order to accomplish effective similarity retrieval on these images, we rely on a simple yet effective feature which allows both color- and edge-based information to be considered, namely covariance matrices. Computing the covariance region descriptor from multiple information sources yields a straightforward technique for a low-dimensional feature representation [18]. In particular, we use normalized pixel locations (x/W, y/H), RGB values in the range [0, 1] and the norm of the first derivatives of the intensity with respect to x and y, calculated through the filter [−1 0 1]^T. The covariance of a region is thus a 7 × 7 matrix. Notice that covariance matrices do not form a vector space, but if we concentrate on nonsingular covariance matrices, we can observe that they are symmetric positive definite, and as such they lie on a Riemannian manifold. In order to rank images by visual similarity to a given query, we need to measure the distance between covariance matrices. In [6] the following distance measure for positive definite symmetric matrices is proposed:

\rho(X, Y) = \sqrt{\sum_{i=1}^{d} \ln^2 \lambda_i(X, Y)}    (1)
where {λ_i(X, Y)}_{i=1..d} are the generalized eigenvalues of the two non-singular covariance matrices X and Y.
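A minimal sketch of the descriptor and of the distance of Eq. (1) follows. Using the mean of the RGB channels as the intensity for the gradients, the boundary handling of the convolution and the random test images are assumptions of this sketch, not choices stated in the paper.

```python
import numpy as np
from scipy.linalg import eigh

def covariance_descriptor(img):
    """7x7 covariance of [x/W, y/H, R, G, B, |Ix|, |Iy|] for an RGB image in [0, 1]."""
    H, W, _ = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    intensity = img.mean(axis=2)
    ix = np.abs(np.apply_along_axis(lambda r: np.convolve(r, [-1, 0, 1], 'same'), 1, intensity))
    iy = np.abs(np.apply_along_axis(lambda c: np.convolve(c, [-1, 0, 1], 'same'), 0, intensity))
    feats = np.stack([xs / W, ys / H, img[..., 0], img[..., 1], img[..., 2], ix, iy])
    return np.cov(feats.reshape(7, -1))

def covariance_distance(X, Y):
    """Eq. (1): sqrt of the sum of squared log generalized eigenvalues of (X, Y)."""
    lam = eigh(X, Y, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))

a = covariance_descriptor(np.random.rand(64, 64, 3))
b = covariance_descriptor(np.random.rand(64, 64, 3))
print(covariance_distance(a, b))
```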
Unfortunately, distance alone is not enough for our purposes. In fact, in order to enable the user to provide relevance feedback, we need to move the query and the other points with linear combinations. To this aim two steps are required: the projection on a Euclidean tangent space, and the extraction of the orthonormal coordinates of the tangent vector. By combining these two steps, the projection y of a covariance matrix Y on the hyperplane tangent to the covariance matrix X can be written as

y = \mathrm{vec}\!\left(\log\!\left(X^{-\frac{1}{2}}\, Y\, X^{-\frac{1}{2}}\right)\right)    (2)

where the vector operator is defined as

\mathrm{vec}(y) = \left[\, y_{1,1} \;\; \sqrt{2}\,y_{1,2} \;\; \sqrt{2}\,y_{1,3} \;\; \ldots \;\; y_{2,2} \;\; \sqrt{2}\,y_{2,3} \;\; \ldots \;\; y_{d,d} \,\right]    (3)
In this way, after the selection of an appropriate projection origin, every covariance matrix is projected to a 28-dimensional feature vector lying in a Euclidean space. This process is easily invertible. We can compute the corresponding covariance matrix on the Riemannian manifold, starting from the 28-dimensional feature vector, using the following formulation:
Y = X^{\frac{1}{2}} \exp\!\left(\mathrm{vec}^{-1}(y)\right) X^{\frac{1}{2}}    (4)
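The helper routines below sketch Eqs. (2)–(4); they are a minimal illustration using dense SciPy matrix functions, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import logm, expm, sqrtm, inv

def vec(S):
    """Eq. (3): upper-triangular entries, off-diagonals scaled by sqrt(2)."""
    iu = np.triu_indices(S.shape[0])
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return S[iu] * scale

def vec_inv(v, d):
    """Inverse of vec(): rebuild the symmetric d x d matrix."""
    S = np.zeros((d, d))
    iu = np.triu_indices(d)
    S[iu] = v * np.where(iu[0] == iu[1], 1.0, 1.0 / np.sqrt(2.0))
    return S + S.T - np.diag(np.diag(S))

def project_to_tangent(X, Y):
    """Eq. (2): tangent-space coordinates of Y at the projection origin X."""
    Xis = inv(sqrtm(X))
    return vec(logm(Xis @ Y @ Xis))

def back_to_manifold(X, y):
    """Eq. (4): map tangent coordinates back to a covariance matrix."""
    Xs = sqrtm(X)
    return Xs @ expm(vec_inv(y, X.shape[0])) @ Xs
```

For the 7 x 7 covariance descriptors used here, vec() indeed produces the 28-dimensional vectors mentioned above.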
4 Mean-Shift Feature Space Warping with Remapping
In this work, we started from the relevance feedback technique proposed by Chang et al. [4], called Mean Shift Feature Space Warping (MSFSW). Given a query point q in the feature vector space, k samples are retrieved by nearest neighbor search. By examining the results, the user provides his feedback by specifying the relevance of M of these samples, forming two sets: {f_p} and {f_n}, the relevant and irrelevant sets respectively. These are employed to move all data samples {p} toward or away from the warping center w. In particular, for each p, its warped point p' is given by

p' = p + \lambda \sum_{j=1}^{M} u_j \exp\!\left(-c\,|p - f_j|\right)(w - p)    (5)
where the scalar value u_j is set to +1 if f_j ∈ {f_p}, and to −1 if f_j ∈ {f_n}. Two global coefficients, c and λ, are required to control the influence of each feedback on each sample and the maximum moving factor of any point p toward or away from the warping center w. The original FSW algorithm fixes the warping center w at q. Thus, the query point always stays in its original position. Other points move toward or away from q based on their proximity to the relevant and irrelevant sets. But, according to the analysis proposed in [4], the FSW algorithm tends to perform poorly under Gaussian distributions when the query point is far away from the cluster center.
For this reason, in MSFSW the authors proposed to move the warping center instead of keeping it at q. They suggest adopting Rocchio’s query movement formula:

w = \alpha w + \beta \bar{f}_p - \gamma \bar{f}_n    (6)

where w is the warping center (initially set to q), and \bar{f}_p and \bar{f}_n are the means of the sets {f_p} and {f_n}. Another set of parameters, α, β and γ, is required and must be tuned to optimize the performance. The MSFSW algorithm provides a flexible parameterization for switching between the two extreme algorithms: QPM by setting α = γ = λ = 0 and β = 1, and FSW by setting α = 1 and β = γ = 0. Given the final user target of our application, exposing the parameter configuration to the user was out of the question. Thus, we determined the parameter configuration which provided the best results on a small initial training set, using an automatic exhaustive search procedure. From the above equations, it is clear that we need a way to compute a linear combination of the feature vectors. For this reason, we employed the projection of the covariance matrices on the tangent space previously described (Eq. 2). As mentioned before, the projection requires a point from which to determine the orthonormal coordinates of the tangent vector (i.e., the vector in the Euclidean space). Our experiments confirm that the choice of this point is fundamental to guarantee an optimal correspondence between the distances computed on the Riemannian manifold and those computed on the tangent space. Thus, when the user requires a refinement of a similarity search of a previously selected image, we project the whole feature space on the chosen query point (i.e., the covariance matrix of the selected image), then we rank the results and show them to the user in order to perform further refinements. Since this mapping is a homeomorphism around the neighborhood of the point, the structure of the manifold is locally preserved. The problem here is that the first step of MSFSW moves the warping center away from the current one, and this may impact the quality of the projected vectors. For this reason we propose to employ an intermediate step of reprojection around the new warping center. The proposed relevance feedback approach works iteratively according to the following sequence of steps:

1. Given the previous warping center w_{i−1} and feedbacks {f}_{i−1}, the new warping center w_i is computed by means of Eq. 6;
2. All points {p}_{i−1} and w_i are reprojected on the manifold, defining the set of remapped points {R}_i, exploiting Eq. 4:

R_i = W_{i-1}^{\frac{1}{2}} \exp\!\left(\mathrm{vec}_I^{-1}(p_{i-1})\right) W_{i-1}^{\frac{1}{2}}    (7)
3. Now the tangent space at the new warping center w_i is taken into consideration: all remapped points on the manifold {R}_i are mapped into the new Euclidean space, exploiting Eq. 2:

r_i = \mathrm{vec}_I\!\left(\log\!\left(W_i^{-\frac{1}{2}} R_i W_i^{-\frac{1}{2}}\right)\right)    (8)
4. At this point, we can apply the FSW on the set {r}_i, as in Eq. 5, finally obtaining the new set of points {p}_i.

Notice that on the first iteration only the feedbacks {f}_0 must be initially mapped to the Euclidean space tangent at the query point (for step 1). Thus, in step 2, only the new warping center w_1 must be remapped, since the set of remapped points {R}_1 would be equal to the original dataset itself.
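The sketch below illustrates the warping of Eq. (5) and the center update of Eq. (6) in tangent-space coordinates, using the parameter values reported in the experiments section; the manifold remapping of steps 2 and 3 (not shown) would use projection helpers such as those sketched after Eq. (4). It is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def fsw_step(points, feedbacks, signs, w, lam=0.7, c=0.8):
    """Eq. (5): warp every tangent-space point toward/away from the center w."""
    warped = points.copy()
    for idx, p in enumerate(points):
        pull = sum(s * np.exp(-c * np.linalg.norm(p - f)) for f, s in zip(feedbacks, signs))
        warped[idx] = p + lam * pull * (w - p)
    return warped

def rocchio_center(w, pos, neg, alpha=0.2, beta=0.5, gamma=0.3):
    """Eq. (6): move the warping center using the feedback means."""
    return alpha * w + beta * pos.mean(axis=0) - gamma * neg.mean(axis=0)

# One iteration over hypothetical 28-dimensional tangent vectors
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 28))
pos, neg = pts[:3], pts[3:6]
w = rocchio_center(np.zeros(28), pos, neg)
pts = fsw_step(pts, np.vstack([pos, neg]), [1, 1, 1, -1, -1, -1], w)
```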
5 Transductive Relevance Feedback
As mentioned in Section 2, the relevance feedback problem can be analyzed as a semi-supervised learning problem, in which the positive and the negative feedbacks given by the users iteratively (and incrementally) constitute the training set of the algorithm. In this paper, we propose a graph-based transductive learning method for this purpose, defining a graph where the vertices represent the labeled and unlabeled images of the dataset, while the edges encode the similarity between them, in our case obtained from the distance between covariance matrices. Graph-based methods are nonparametric, discriminative, and transductive by definition [21], and labels can be assumed to be smooth over the graph. Starting from the whole dataset of n images, let us define a set L of labeled images (x_1, y_1), ..., (x_l, y_l) in which C classes are defined, so y_i ∈ {1, ..., C}. The other images belong to a set U of u unlabeled images, with n = l + u. Now let us define a function f : R^n → [0, 1] which denotes the confidence of each image belonging to one class. Formally, we can define a cost function J on f as:

J(f) = \sum_{(i,j)=1}^{n} \left(f(x_i) - f(x_j)\right)^2 w_{ij} + \lambda \sum_{i=1}^{l} \left(f(x_i) - y_i\right)^2    (9)
with λ as a regularization parameter (in our case, λ = 1). In a minimization process, this cost function encourages the predicted confidences of strongly connected samples (large w_{ij}) to agree, while keeping f(x_i) close to the true labels y_i on the labeled set. Once converted into matrix notation, Eq. 9 becomes:

J(f) = (f(X) - Y)^T (f(X) - Y) + \lambda\, f(X)^T L\, f(X)    (10)
where L = D − W is the graph Laplacian. W is the weight matrix, while D is the diagonal matrix of vertex degrees:

d_{ii} = \sum_{j=1}^{n} w_{ij}    (11)
The cost function minimization has a closed-form solution, namely:

f = (I + \lambda L)^{-1} Y    (12)
At the end of the computation, f contains the class confidence of each new sample.
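A minimal sketch of this step follows, assuming the closed-form minimizer of Eq. (10), i.e., f = (I + λL)^{-1}Y; the multi-class label encoding and the dense solver are assumptions of the sketch.

```python
import numpy as np

def transductive_confidences(W, Y, lam=1.0):
    """Solve (I + lam * L) f = Y with L = D - W (minimizer of Eq. (10)).

    W is the (n x n) affinity matrix built in this section; Y holds the known
    labels for labeled images (one column per class) and zeros for unlabeled
    ones. Large collections would require sparse solvers instead.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.solve(np.eye(n) + lam * L, Y)
```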
In the transductive process, we want to transfer labels from labeled samples to unlabeled ones: in other words, we want samples which are close in the feature space to share the same label. To satisfy this local constraint, we construct the weight matrix W accordingly, i.e., the matrix in which each element w_{ij} encodes the relation between two vertices (thus two images) x_i, x_j. To move from distances to affinities, we use the following formulation:
w_{i,j} = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)    (13)

where σ is a bandwidth parameter to tune the relations between vertices, and the distance is computed as the L2 norm after the conversion of the covariance matrices on the Riemannian manifold to vectors in the Euclidean space. W can be subdivided into four submatrices:

W = \begin{pmatrix} W^{ll} & W^{lu} \\ W^{ul} & W^{uu} \end{pmatrix}    (14)

where W^{ll} (a fully connected graph) denotes relations between labeled data, W^{uu} (a k-nearest neighbor graph) denotes relations between the candidate images yet to be labeled, and the symmetric subgraphs W^{ul} and W^{lu} (still k-nearest neighbor) denote the relations between positive and candidate images. The relations used to compute the values in these graphs are the following:

W^{ll}_{i,j} = \frac{1}{n}    (15)

W^{uu}_{i,j} = \begin{cases} \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) & \text{if } x_i \in \mathrm{knn}(x_j) \\ 0 & \text{otherwise} \end{cases}    (16)

W^{lu}_{i,j} = W^{ul}_{i,j} = \begin{cases} \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) & \text{if } x_i \in \mathrm{knn}(x_j) \\ 0 & \text{otherwise} \end{cases}    (17)
where k = 10 and σ = 1. After the first ranking by similarity, the user selects the positive feedbacks while the unselected samples are considered as negative. Then the process described in this section is iterated following the user’s need or until no more changes in the rank occurs.
6
Transductive Relevance Feedback with Feature Space Warping
The main benefit regarding the use of FSW is the possibility to move potentially irrelevant samples away from the query center, while attracting far away relevant samples toward the query center. This characteristic turns out to be very important in the the transductive learning context, especially when the samples
Feature Space Warping Relevance Feedback with Transductive Learning
77
retrieved by the system using similarity only are poor. Recalling that covariance matrices express points lying on a Riemannian Manifold and their distances, according to (1), are the geodesic distances on the manifold, we can motivate the choice of the graph Laplacian transductive approach observing that during feedback iterations we try to learn the underlying geometry of the manifold composed by positive query results points as realizations. Since the Laplacian is the discretized approximation of the Laplace-Beltrami operator on the manifold [3], we can learn the manifold structure directly from the analysis of the graph Laplacian itself. The basic idea is to strengthen, in the Laplacian, the contribution of positive query points while weakening the contribution of negatives one. This is equivalent, under a continuous relaxation of the graph affinities, to adding and removing links and path in the graph. With this premises, the user intervention allows to enhance the geodesic distances to better represent the geometry of the manifold where the query results are lying on. Additionally, it is important to remark that for transductive methods, the ratio between the number of labeled and unlabeled samples is fundamental: if the number of training samples is low as well as the number of unlabeled data goes to infinity, the learning procedure leads to uninformative membership functions [2]. To overcome this problem, we introduced a further step in the transductive learning procedure, modifying the value of W lu and W ul elements in the affinity matrix in analogy with Eq. 5. In particular, given P the set of positive samples indexes and N the set of negative ones, for each element of the affinity matrix Wijlu , its warped version Wijlu is corrected by factors δjp and δjn : δjp = − exp (−cWij ) (18) i∈P
δjn =
exp (−cWij )
(19)
i∈N
The warped elements Wijlu are finally described by the following equation:
(20) Wijlu = Wijlu + λ δjp + δjn where the global coefficients c and λ assume the same tuning functionalities as described in Section 4.
7
Experimental Results
In order to verify the effectiveness of the proposed approach, we report the results on an historical images dataset, created using the procedure described in [7], and composed of 2282 pictures. We performed an automatic simulation of relevance feedback interaction, in order to avoid human errors. We evaluate the techniques using the 171 visual queries provided with the dataset annotation, in which several prototypes of 6 object classes were retrieved (Fig. 1). We compared the following algorithms:
78
D. Borghesani et al.
Fig. 1. Samples taken from the 6 query classes
– Naive relevance feedback (actually no relevance feedback at all): the system discards the current set of n results and proposes to the user the next n, following the original rank given by the visual similarity; – MSFSW: original Mean Shift Feature Space Warping proposal by [10], with an empirically optimized set of parameter α = 0.2, β = 0.5 and γ = 0.3 for the means-shift part and λ = 0.7 and c = 0.8; – MSFSW with remapping: our modification of the original MSFSW algorithm which performs a remapping of the entire feature set using the mean of positive feedbacks as tangent point for the conversion from Riemannian Manifold to Euclidean Space; – TL with NN: the transductive learning approach which uses positive feedbacks as samples and defines the relevance feedback as a process to assign a label to unlabeled samples. The affinity matrix is filled only for the k = 20 nearest neighbors; – TL with FSW: the same transductive learning approach jointly used with Feature Space Warping on the affinity matrix Performance has been evaluated with a user-centric perspective: we choose to use metrics clearly easy to comprehend by a user in front of the application, and we included a fixed number T = 10 of iterations, to convey that the user will get bored and stop pursuing in the search at most after 10 refinements. We chose cumulative recall as representative metric, namely the recall provided by the system at each step i = 1 . . . T . While the first steps give an idea of the convergence capabilities of the algorithm, the last step give an overall evaluation of the algorithm itself. The results are presented in Table 1 and Fig. 2. The first step in the chart is the retrieval by similarity, sorted by increasing distances. The naive procedure is the baseline for the comparison. The original MSFSW proposal has a good behavior in the first step, but the deformation after the first projection limits its effectiveness in the following steps, probably due to the side effects highlighted in Section 4. Much better performance are obtained with remapping on the mean of positive feedbacks, which continually improve in substantial way up to the forth step. In the next steps, the improvement remains marginal, mimicking the behavior of the original version. The Transductive Learning approach has a milder gradient: the performance is comparable in the first steps, but the number of steps required to gather the
Feature Space Warping Relevance Feedback with Transductive Learning
79
100% 90% 80%
averagerecall
70% 60% 50% 40% 30% 20% 10% 00% 0
1
2
Naive
3
MSFSW
4
5 iterations MSFSWw/R
6
TLNN
7
8
9
10
TLFSW
Fig. 2. Comparison of the proposed techniques in terms of recall at each iteration step
Table 1. Recall values (%) at different iteration steps

method        1     2     3     4     5     10
Naive        26.7  31.1  34.1  37.1  39.5  47.5
MSFSW        34.5  40.3  41.9  43.7  45.4  54.1
MSFSW w R    36.6  50.8  59.9  63.5  65.3  68.6
TLNN         30.5  39.3  46.8  52.8  57.8  68.0
TLFSW        37.0  52.3  65.7  75.3  80.7  87.0
same performance as the other techniques is too high. The proposal of this paper, instead, shows the steepest gradient in the first steps and, from the fifth step on, it maintains a considerable and increasing improvement over the best performer so far (MSFSW with remapping), up to 19%.
8 Conclusions
In this paper we presented a relevance feedback strategy designed to merge a transduction-based relevance feedback with the advantages of feature space warping. In our tests, this procedure appears to be extremely promising. In the future, we plan to improve the technique on the scalability side, in other words to test it on large-scale image collections.
References 1. Bang, H.Y., Chen, T.H.: Feature space warping: an approach to relevance feedback. In: IEEE International Conference on Image Processing, pp. 968–971 (2002) 2. Belkin, B., Srebro, N., Zhou, X.: Statistical analysis of semi-supervised learning: The limit of infinite unlabelled data. Neural Information Processing Systems (2009) 3. Belkin, M., Niyogi, P.: Semi-supervised learning on riemannian manifolds. Machine Learning 56(1-3), 209–239 (2004) 4. Chang, Y.J., Kamataki, K., Chen, T.H.: Mean shift feature space warping for relevance feedback. In: IEEE International Conference on Image Processing, pp. 1849–1852 (2009) 5. Crucianu, M., Ferecatu, M., Boujemaa, N.: Relevance feedback for image retrieval: a short survey. In: State of the Art in Audiovisual Content-Based Retrieval, Information Universal Access and Interaction including Datamodels and Languages DELOS2 Report (2004) 6. F¨ orstner, W., Moonen, B.: A metric for covariance matrices. Technical report, Stuttgart University (1999) 7. Grana, C., Borghesani, D., Cucchiara, R.: Picture Extraction from Digitized Historical Manuscripts. In: ACM International Conference on Image and Video Retrieval (July 2009) 8. Huiskes, M.J., Lew, M.S.: Performance evaluation of relevance feedback methods. In: ACM International Conference on Image and Video Retrieval, pp. 239–248 (2008) 9. Jin, X., French, J., Michel, J.: Toward consistent evaluation of relevance feedback approaches in multimedia retrieval. In: International Workshop on Adaptive Multimedia Retrival (July 2005) 10. Liu, D., Hua, K.A., Vu, K., Yu, N.: Fast query point movement techniques for large cbir systems. IEEE Transactions on Knowledge and Data Engineering 21(5), 729–743 (2009) 11. Luo, J., Nascimento, M.A.: Content-based sub-image retrieval using relevance feedback. In: ACM International Workshop on Multimedia Databases, pp. 2–9 (2004) 12. Nguyen, G.P., Worring, M., Smeulders, A.W.M.: Interactive search by direct manipulation of dissimilarity space. IEEE Transactions on Multimedia 9(7), 1404– 1415 (2007) 13. Radosavljevic, V., Kojic, N., Zajic, G., Reljin, B.: The use of unlabeled data in image retrieval with relevance feedback. In: Symposium on Neural Network Applications in Electrical Engineering, pp. 21–26 (September 2008) 14. Sahbi, H., Audibert, J.-Y., Keriven, R.: Graph-cut transducers for relevance feedback in content based image retrieval. In: IEEE International Conference on Computer Vision, pp. 1–8 (October 2007) 15. Sahbi, H., Etyngier, P., Audibert, J.-Y., Keriven, R.: Manifold learning using robust graph laplacian for interactive image search. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1–8 (June 2008) 16. Tao, D., Tang, X., Li, X., Wu, X.: Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7), 1088–1099 (2006) 17. Tieu, K., Viola, P.: Boosting image retrieval. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 228–235 (2000) 18. Tuzel, O., Porikli, F., Meer, P.: Pedestrian Detection via Classification on Riemannian Manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(10), 1713–1727 (2008)
19. Williamson, R.E.: Does relevance feedback improve document retrieval performance? SIGIR Forum 13, 151–170 (1978) 20. Wu, Y., Tian, Q., Huang, T.S.: Integrating unlabeled images for image retrieval based on relevance feedback. In: International Conference on Pattern Recognition, vol. 1, pp. 21–24 (2000) 21. Zha, Y., Yang, Y., Bi, D.: Graph-based transductive learning for robust visual tracking. Pattern Recognition 43, 187–196 (2010) 22. Zhou, X.S., Huang, T.S.: Relevance feedback in image retrieval: A comprehensive review. Multimedia Syst. 8(6), 536–544 (2003)
Improved Support Vector Machines with Distance Metric Learning Yunqiang Liu1 and Vicent Caselles2 1
Barcelona Media - Innovation Center, Barcelona, Spain 2 Universitat Pompeu Fabra, Barcelona, Spain
[email protected],
[email protected]
Abstract. This paper introduces a novel classification approach which improves the performance of support vector machines (SVMs) by learning a distance metric. The metric learned is a Mahalanobis metric previously trained so that examples from different classes are separated with a large margin. The learned metric is used to define a kernel function for SVM classification. In this context, the metric can be seen as a linear transformation of the original inputs before applying an SVM classifier that uses Euclidean distances. This transformation increases the separability of classes in the transformed space where the classification is applied. Experiments demonstrate significant improvements in classification tasks on various data sets. Keywords: Classification, support vector machine, metric learning, kernel function.
1 Introduction
The Support Vector Machine (SVM) algorithm [10] has been widely adopted in classification and pattern recognition tasks because of its very good performance. SVM is a learning technique based on statistical learning theory and the principle of structural risk minimization. By means of a kernel function, SVMs map the input data into a higher dimensional feature space, making its separation easier. The kernel operation permits avoiding the explicit mapping and unnecessary computations. The kernel function plays a very important role in the performance of the SVM classifier. Among the most popular kernels are the Gaussian, the polynomial and the exponential radial basis function kernels. Most of the existing kernels employed in SVMs measure the similarity between pairs of (input) features using their Euclidean inner product or Euclidean distance, assuming that all features are equally scaled and equally relevant and thus contribute equally to the construction of the separating hyperplane. However, some features can be more informative than others under the specific data distribution. Most SVM classifiers ignore the statistical regularities which can be estimated from a training set with labeled examples.
Similar considerations have been partially explored in the context of feature selection and feature weighting. For example, [8] directly exploits the informativeness of features in order to determine which features to remove and which to retain. SVM-RFE (Recursive Feature Elimination) [4] uses a backwards feature elimination procedure to remove features based on the values of their weights which are obtained by a linear kernel SVM. The Entropy-based Recursive Feature Elimination(E-RFE) method [3] eliminates uninteresting features according to the entropy of the weights distribution of a SVM classifier. Wang et al. [11] incorporate data distribution information of the input space into the kernel function by constructing a weighted Mahalanobis distance kernel. The data structure for each class in the input space is found via agglomerative hierarchical clustering. This paper proposes to exploit the discriminatory information of features using a learned distance metric. Recently, distance metric learning was designed for clustering and kNN classification [6],[9]. Xing et al. [14] learn a distance metric for clustering by minimizing the distances between similarly labeled data while maximizing the distances between differently labeled data. Domeniconi et al. [1] use the decision boundaries of SVMs to induce a locally adaptive distance metric for kNN classification. Weinberger et al. [12] proposed a large margin nearest neighbor (LMNN) classification approach by formulating the metric learning problem in a large margin setting for kNN classification. In this paper, we introduce a novel classification approach which improves the performance of support vector machines (SVMs) by learning a distance metric. The metric is a Mahalanobis metric which has been learned using a training set so that examples from different classes are separated with a large margin. Then we use the learned metric to compute a kernel function for SVM classification. In this context, the metric can be seen as a linear transformation of the original inputs before applying an SVM classifier that uses Euclidean distances. Thanks to this transformation we increase the separability of classes in the transformed space where the classification is applied. Let us describe the plan of the paper. In Section 2 we briefly review the theory of SVMs for binary classification and we describe the proposed improvement using a learned Mahalanobis metric. In Section 3 we describe the method to learn the metric. We display some experiments in Section 4 showing the performance of the proposed method. We summarize our conclusions in Section 5.
2 Support Vector Machine with Metric
In this section, we propose an improvement of SVM classification based on the use of a Mahalanobis metric. For that, let us briefly review the theory of SVMs for binary classification. Assume that we are given a set of training data {(x_i, y_i)}_{i=1}^{n}, where the x_i are in the vector space R^d, and y_i ∈ {−1, 1} is the class label of vector x_i, i = 1, 2, ..., n. Recall that, to obtain the optimal separating
hyperplane which minimizes the training error and maximizes the margin, one solves the following constrained optimization problem:

\min_{w,\, b,\, \xi_i} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\!\left(w^T \phi(x_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, n    (1)
where φ : R^d → R^m is a kernel feature mapping, w ∈ R^m and b ∈ R determine the hyperplane in the feature space, the ξ_i are slack variables, and C is the regularization parameter which determines the trade-off between the maximization of the margin and the amount of misclassification errors. The above optimization problem is solved using its dual:

\max_{a_i} \; \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i,j=1}^{n} a_i a_j y_i y_j k(x_i, x_j)
\quad \text{subject to} \quad \sum_{i=1}^{n} a_i y_i = 0, \;\; 0 \le a_i \le C, \;\; i = 1, \ldots, n    (2)

which in turn is solved using quadratic programming. The kernel function k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩, where ⟨·, ·⟩ denotes an inner product between two vectors, is introduced to handle nonlinearly separable cases without an explicit construction of the feature mapping φ. The input vector x is classified according to the output value

f(x) = \mathrm{sign}\!\left(\sum_{i=1}^{n} a_i y_i k(x_i, x) + b\right)    (3)
In SVM classification, a proper choice of the kernel function is necessary to obtain good results. The kernel function determines the degree of similarity between two data vectors. One of the most common is the exponential radial basis function (RBF) kernel

k_{rbf}(x_i, x_j) = \exp\!\left(-\gamma\, d(x_i, x_j)\right), \quad \gamma > 0    (4)
where γ > 0 is the width of the Gaussian and d(x_i, x_j) is the squared distance between x_i and x_j. In many cases, the distance is just the Euclidean distance. This is reasonable when all features are equally scaled and equally relevant. Unfortunately, some features can be more informative than others under the specific data distribution. Euclidean distances ignore the statistical regularities which can be estimated from a training set with labeled examples. Ideally, we would like an adaptive distance metric that adjusts to the particular data set at hand.
Based on this consideration, we propose an improved support vector machine, named SVM-M, obtained by learning a Mahalanobis metric. Specifically, we wish to learn a squared distance of the form:

d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)    (5)
where M = P^T P is a positive semi-definite matrix (A^T denotes the transpose of the matrix A). As we will describe in the next section, the distance is trained so that examples from different classes are separated with a large margin. Using the matrix M, we can transform the examples by x̂_i = P^T x_i, i = 1, ..., n. The purpose of the linear transformation is to separate the differently labeled examples by a large margin in the transformed space, so that examples from the same class move closer and those from different classes move away, increasing the separability of the classes. Once the distance metric is obtained, we could construct the SVM-M by substituting the input feature vector x_i by the transformed one x̂_i. In practice, it is not necessary to know the transformation matrix P exactly; it suffices to define the kernel function using the metric obtained. Taking the exponential RBF kernel, we define:

k_{rbf,M}(x_i, x_j) = \exp\!\left(-\gamma\, d_M(x_i, x_j)\right), \quad \gamma > 0    (6)
where dM (xi , xj ) is the squared distanced given in (5). Then SVM-M is built by substituting the original kernel function in (2) and (3) by the kernel (6).
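As an illustration, a minimal sketch of how a metric-based kernel of the form (6) could be plugged into an off-the-shelf SVM is given below. Python with NumPy and scikit-learn is assumed; the matrix M and the hyper-parameters C and gamma are placeholders and not values from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def mahalanobis_rbf_kernel(Xa, Xb, M, gamma):
    """Kernel (6): k(x_i, x_j) = exp(-gamma * (x_i - x_j)^T M (x_i - x_j))."""
    # Pairwise differences, shape (len(Xa), len(Xb), d)
    diff = Xa[:, None, :] - Xb[None, :, :]
    # Squared Mahalanobis distances d_M(x_i, x_j)
    d2 = np.einsum('ijk,kl,ijl->ij', diff, M, diff)
    return np.exp(-gamma * d2)

# Hypothetical data and a learned positive semi-definite matrix M
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 4)), rng.integers(0, 2, 100) * 2 - 1
X_test = rng.normal(size=(20, 4))
M = np.eye(4)          # placeholder: the identity recovers the ordinary RBF kernel
gamma, C = 0.5, 1.0    # placeholder hyper-parameters

K_train = mahalanobis_rbf_kernel(X_train, X_train, M, gamma)
K_test = mahalanobis_rbf_kernel(X_test, X_train, M, gamma)

svm = SVC(kernel='precomputed', C=C).fit(K_train, y_train)
pred = svm.predict(K_test)
```

Passing the kernel as a precomputed Gram matrix avoids forming the transformed inputs explicitly, which mirrors the observation above that P itself is not needed.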
3 Distance Metric Learning
We describe in this section the proposed method to learn a squared distance of the form (5). The desired metric is trained with the goal that examples from different classes are separated with a large margin, while the distance between examples in the same class is minimized. Ideally, in order to separate the examples from different classes, the squared distance d_M should satisfy the constraints

d_M(x_i, x_k) > d_M(x_i, x_j)   ∀(i, j, k) ∈ D,    (7)

where D is the set of triple indexes:

D := {(i, j, k) : y_i = y_j, y_i ≠ y_k}.    (8)

Since it may not be possible to satisfy all these constraints simultaneously, as usual, we introduce the slack variables ξ_ijk, so that:

d_M(x_i, x_k) − d_M(x_i, x_j) ≥ 1 − ξ_ijk   ∀(i, j, k) ∈ D.    (9)
On the other hand, the distance between examples of the same class should be small. Therefore, we formulate the following optimization problem:

min_{M, ξ_ijk}  ∑_{(i,j)∈S} d_M(x_i, x_j) + C ∑_{(i,j,k)∈D} ξ_ijk    (10)
subject to  d_M(x_i, x_k) − d_M(x_i, x_j) ≥ 1 − ξ_ijk,  ∀(i, j, k) ∈ D,
            ξ_ijk ≥ 0,
            M ≥ 0,
where S is the set of example pairs which belong to the same class, and C is a positive constant. The constraint M ≥ 0 indicates that the matrix M has to be positive semi-definite. As usual, the slack variables ξ_ijk allow a controlled violation of the constraints: a non-zero value of ξ_ijk allows a triple (i, j, k) ∈ D not to meet the margin requirement, at a cost proportional to ξ_ijk. The optimization problem (10) is a semidefinite programming (SDP) problem and can be solved using standard optimization software. However, the large number of constraints in (9) makes the problem computationally intensive for large data sets. As in [12], in order to overcome this difficulty, we select a subset of these constraints and redefine S and D as:

S := {(i, j) : y_i = y_j, η_ij = 1},    (11)

D := {(i, j, k) : y_i = y_j, y_i ≠ y_k, η_ij = 1, η_ik = 1},    (12)
where η_ij indicates whether example j is a neighbor of example i, the neighboring relations being defined by the Euclidean distance. Notice that (11) and (12) restrict the constraints to pairs of neighbors. This is reasonable, since only examples which are neighbors but do not share the same class label need to be separated by the learned metric, while we do not care about pairs which belong to different classes and are already far apart. We solved problem (10) using the algorithm proposed in [12]. In our experiments we constructed the sets (11) and (12) using the five nearest neighbors (with respect to the Euclidean distance) of each input example.
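For concreteness, the neighbor-based construction of the sets (11) and (12) could be sketched as follows. This is a simplified illustration in Python/NumPy; the helper name and the use of scikit-learn's NearestNeighbors are our own assumptions and not part of the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_constraint_sets(X, y, n_neighbors=5):
    """Build S (same-class neighbor pairs) and D (triples (i, j, k) with
    y_i = y_j, y_i != y_k, where j and k are Euclidean neighbors of i)."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    # Column 0 of the neighbor list is the point itself, so drop it
    neighbors = nn.kneighbors(X, return_distance=False)[:, 1:]

    S, D = [], []
    for i, neigh_i in enumerate(neighbors):
        same = [j for j in neigh_i if y[j] == y[i]]
        diff = [k for k in neigh_i if y[k] != y[i]]
        S.extend((i, j) for j in same)
        D.extend((i, j, k) for j in same for k in diff)
    return S, D

# Hypothetical usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 3, size=60)
S, D = build_constraint_sets(X, y, n_neighbors=5)
```

The resulting pairs and triples would then be fed to an LMNN-style solver such as the one of [12]; the solver itself is not reproduced here.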
4 Experiments
We evaluated the SVM-M approach for classification tasks on several public data sets: Iris and Balance data sets from the UCI Machine Learning Repository [2], US Postal Service (USPS) data set [5], Graz-02 [7], and MSRC-2 [13]. Iris and Balance are low dimensional data sets, both with 4 dimensions. USPS is a data set of gray level images of the 10 digits from ’0’ to ’9’ and we represent each of them using a vector obtained by applying principal component analysis (PCA). Graz-02 and MSRC-2 are also image data sets, and we use the
bag-of-words model to represent those images by a high dimensional descriptor. We compare the proposed approach with standard KNN, LMNN (large margin nearest neighbor) and standard SVM classification methods. In the SVM and SVM-M methods, we use the RBF kernels (4) and (6), respectively. Parameters such as the number of neighbors in KNN and the regularization parameter C are determined using k-fold (k = 5) cross validation.

4.1 Low Dimension Data Sets
The Iris data set includes 150 examples in 3 classes: Setosa, Versicolour and Virginica. The Balance data set includes 625 examples in 3 classes: {L, B, R}. Examples are randomly divided into two subsets of the same size to form a training set and a testing set. Multiclass classification is obtained from two-class SVMs using the one-vs-all strategy. We repeat each experiment five times with different splits, and report the average results. Table 1 shows the classification ratio on the two data sets. Our results show a significant improvement on both data sets. The confusion matrix for the Balance data set using the SVM-M classifier is presented in Table 2 in order to give more details on the categorization of each class (for a single run). The first column contains the true labels and the first row lists the predicted labels. The computation time required to compute the matrix M by solving (10) is around 1.2s and 4.7s for the Iris and Balance data sets, respectively (using a CPU of 2.67GHz).

Table 1. Classification results (%) on low dimension data sets

          KNN    LMNN   SVM    SVM-M
Balance   87.5   89.9   95.8   97.2
Iris      94.7   96.3   96.0   98.7

Table 2. Confusion matrix for Balance using SVM-M

Class    L     B     R
L      143     0     0
B        4    17     3
R        0     0   145
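A rough sketch of the evaluation protocol described above (random 50/50 splits, one-vs-all SVMs, five repetitions) is given below with scikit-learn. The plain RBF-SVM stands in for SVM-M, and C and gamma are placeholders rather than the cross-validated values of the paper.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = []
for seed in range(5):                      # five random 50/50 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    # One-vs-all RBF-SVM; with a learned metric M = P^T P one would first
    # transform the inputs, e.g. X_tr = X_tr @ P and X_te = X_te @ P
    clf = OneVsRestClassifier(SVC(kernel='rbf', C=1.0, gamma='scale'))
    scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))
print(f"average classification ratio: {100 * np.mean(scores):.1f}%")
```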
4.2 Image Data Sets
The USPS handwritten digits data set was created for testing classification methods for handwritten digits on envelopes. It contains 11,000 normalized gray-scale digit images, with 1,100 examples for each of the 10 digits (from '0' to '9'). Images have a normalized size of 16 × 16 pixels. We apply principal component analysis (PCA) to reduce the dimensionality before classification, and each image is projected onto its leading principal components. In our experiments we use
Fig. 1. Classification ratio (%) on USPS
Table 3. Confusion matrix for the USPS data set using SVM-M

     1    2    3    4    5    6    7    8    9    0
1  548    0    2    0    0    0    0    0    0    0
2    2  538    2    1    1    1    2    0    1    2
3    1    5  525    0    8    0    5    5    1    0
4    1    1    0  540    0    4    0    1    3    0
5    0    1    6    0  531    1    0    8    1    2
6    0    7    0    6    2  533    0    1    0    1
7    0    0    0    1    0    0  546    0    3    0
8    1    0    2    2    8    0    2  533    2    0
9    0    0    3    2    1    0    5    1  538    0
0    1    0    0    0    0    2    2    0    0  545
the first 30 leading components, since they suffice to obtain a good classification ratio. The training and testing sets are created by randomly dividing the data set into two parts (with specified percentages). We repeat each experiment five times with different splits for each fixed percentage of the training set, and we report the average classification ratio. Fig. 1 shows the classification results obtained by varying the percentage of the training set. Table 3 shows the confusion matrix for the USPS data set (for a single run) corresponding to the SVM-M classifier when the percentage of the training set is 50%. The first column lists the true labels and the first row the predicted labels. The numbers of correctly classified images for each category are shown on the diagonal. In this case, the computation time for obtaining the metric M is 41s on average. For the image data sets Graz-02 and MSRC-2, we use a bag-of-words model to represent the images. In our experiments, we use only the gray level information, although there may be room for further improvement by including color information.
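The dimensionality reduction step above could look as follows; a minimal sketch with scikit-learn, where the USPS-like data is a placeholder array.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the USPS images: rows are flattened 16x16 digits (256 features)
X_train_images = np.random.default_rng(0).random((500, 256))
X_test_images = np.random.default_rng(1).random((100, 256))

pca = PCA(n_components=30)                       # keep the 30 leading components
X_train_30 = pca.fit_transform(X_train_images)   # fit on the training split only
X_test_30 = pca.transform(X_test_images)
```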
Fig. 2. Example images from Graz-02, one image per class: (a) bike, (b) car, (c) person, (d) background
Our image representation follows the standard bag-of-words model. Each image is divided into equivalent blocks on a regular grid with spacing d = 7 both in the horizontal and vertical directions. We take the set of grid points as keypoints, each with a circular support area of radius r = 5 (in our experiments). Each support area can be taken as a local patch. Notice that the patches overlap. Each patch is described by a SIFT (scale-invariant feature transform) descriptor. Then a visual vocabulary is built up by vector quantizing the descriptors using a clustering algorithm such as K-means. Each resulting cluster corresponds to a visual word. With this vocabulary, each descriptor is assigned to its nearest visual word. The vector containing the number of occurrences of each visual word in the image is used as the feature vector. We carried out binary classification experiments with Graz-02. Graz-02 contains 3 object classes {bike, car, person} and a background class. It is a challenging data set and, as an example, some images are shown in Fig. 2. We used the first 200 images of each class. Moreover, the images within each class are randomly divided into two subsets of the same size to form a training set and a testing set. We set the visual vocabulary size to 200. The binary classification is performed as object class against background. We report in Fig. 3 the averaged results over five experiments with different random splits of the data set. The SVM-M method consistently outperforms the standard SVM and LMNN classifiers. In the data set MSRC-2, there are 20 classes and 30 images per class. In our experiment, we choose six of them: {tree, cow, face, car, bike, book}. The visual vocabulary size is set to 100. Moreover, we randomly divided
Fig. 3. Classification ratio (%) on the image data set Graz-02

Table 4. Classification results (%) on MSRC-2

KNN    LMNN   SVM    SVM-M
82.2   83.3   86.7   88.9
the images within each class into two groups of the same size to form a training set and a testing set. We repeated each experiment five times using different splits, and we report the averaged results in Table 4.
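A sketch of the dense-grid bag-of-words representation described above (grid spacing 7, patch radius 5, K-means vocabulary of 200 words) is given below. OpenCV's SIFT and scikit-learn's KMeans are assumed; the image list is a placeholder.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def dense_sift_descriptors(gray, spacing=7, radius=5):
    """SIFT descriptors computed at a regular grid of keypoints."""
    sift = cv2.SIFT_create()
    keypoints = [cv2.KeyPoint(float(x), float(y), 2.0 * radius)
                 for y in range(radius, gray.shape[0] - radius, spacing)
                 for x in range(radius, gray.shape[1] - radius, spacing)]
    _, descriptors = sift.compute(gray, keypoints)
    return descriptors

def bow_histogram(descriptors, kmeans):
    """Histogram of visual-word occurrences (the image feature vector)."""
    words = kmeans.predict(descriptors)
    return np.bincount(words, minlength=kmeans.n_clusters)

# Placeholder gray-level images standing in for the Graz-02 / MSRC-2 data
train_images = [np.random.default_rng(i).integers(0, 256, (128, 128),
                                                  dtype=np.uint8) for i in range(4)]
all_desc = np.vstack([dense_sift_descriptors(img) for img in train_images])
kmeans = KMeans(n_clusters=200, random_state=0).fit(all_desc)   # visual vocabulary
features = np.array([bow_histogram(dense_sift_descriptors(img), kmeans)
                     for img in train_images])
```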
5 Conclusions
In this paper, we have proposed to improve the performance of support vector machine classification by learning a distance metric which captures the discriminatory information of the features. The learned metric is used to define the kernel function for SVM classification. In this context, using the new metric can be interpreted as a linear transformation of the original inputs before applying the SVM classifier with Euclidean distances. Because the distance metric is trained so that the examples from different classes are separated with a large margin, we increase their separability in the transformed space where the classification is applied. Experiments on various data sets show that the proposed method yields significant improvements in classification tasks.
Acknowledgements. This work was partially funded by Mediapro through the Spanish project CENIT-2007-1012 i3media and by the Centro para el Desarrollo Tecnológico Industrial (CDTI). The authors acknowledge partial support by the EU project "2020 3D Media: Spatial Sound and Vision" under FP7-ICT. Y. Liu also acknowledges partial support from the Torres Quevedo Program of the Ministry of Science and Innovation in Spain (MICINN), co-funded by the European Social Fund (ESF). V. Caselles also acknowledges partial support by the MICINN project with reference MTM2009-08171, by GRC reference 2009 SGR 773 and by the "ICREA Acadèmia" prize for excellence in research, funded by the Generalitat de Catalunya.
References 1. Domeniconi, C., Gunopulos, D., Peng, J.: Large margin nearest neighbor classifiers. IEEE Transactions on Neural Networks 16(4), 899–909 (2005) 2. Frank, A., Asuncion, A.: UCI machine learning repository, university of california, irvine, school of information and computer sciences (2010), http://archive.ics. uci.edu/ml 3. Furlanello, C., Serafini, M., Merler, S., Jurman, G.: An accelerated procedure for recursive feature ranking on microarray data. Neural Networks 16(5-6), 641–648 (2003) 4. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Machine learning 46(1), 389–422 (2002) 5. LeCun, A., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation, 541–551 (1989) 6. Nguyen, N., Guo, Y.: Metric Learning: A Support Vector Approach. Machine Learning and Knowledge Discovery in Databases, 125–136 (2008) 7. Opelt, A., Pinz, A., Fussenegger, M., Auer, P.: Generic object recognition with boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 416– 431 (2006) 8. Rakotomamonjy, A.: Variable selection using svm based criteria. The Journal of Machine Learning Research 3, 1357–1370 (2003) 9. Ramanan, D., Baker, S.: Local distance functions: A taxonomy, new algorithms, and an evaluation. In: International Conference on Computer Vision, pp. 301–308. IEEE, Los Alamitos (2009) 10. Vapnik, V.N.: The nature of statistical learning theory. Springer, Heidelberg (2000) 11. Wang, D., Yeung, D.S., Tsang, E.C.: Weighted mahalanobis distance kernels for support vector machines. IEEE Transactions on Neural Networks 18(5), 1453–1462 (2007) 12. Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10, 207–244 (2009) 13. Winn, J., Criminisi, A., Minka, T.: Object categorization by learned universal visual dictionary. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1800–1807. IEEE, Los Alamitos (2005) 14. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning with application to clustering with side-information. Advances in Neural Information Processing Systems, 521–528 (2003)
A Low-Cost System to Detect Bunches of Grapes in Natural Environment from Color Images

Manuel J.C.S. Reis1, Raul Morais2, Carlos Pereira3, Olga Contente4, Miguel Bacelar3, Salviano Soares1, António Valente1, José Baptista3, Paulo J.S.G. Ferreira5, and José Bulas-Cruz3

1 IEETA/UTAD, ECT, Dept. Engenharias, 5001-801 Vila Real, Portugal
{mcabral,avalente,salblues}@utad.pt
2 CITAB/UTAD, ECT, Dept. Engenharias
[email protected]
3 UTAD, ECT, Dept. Engenharias
[email protected], {mbacelar,baptista,jcruz}@utad.pt
4 Escola Superior de Tecnologia de Viseu, Dep.Eng.Mec.Ges.Ind., 3504-510 Viseu
[email protected]
5 IEETA/Universidade de Aveiro, Campus Univ. Santiago, 3810-193 Aveiro
[email protected]
Abstract. Despite the benefits of precision agriculture and precision viticulture production systems, their adoption rate in the Portuguese Douro Demarcated Region remains low. One of the most demanding tasks in wine making is harvesting. Even for humans, the environment makes grape detection difficult, especially when the grapes and leaves have a similar color, which is generally the case for white grapes. In this paper, we propose a system for the detection and location, in the natural environment, of bunches of grapes in color images. The system is also able to distinguish between white and red grapes and, at the same time, it calculates the location of the bunch stem. The proposed system achieved 97% and 91% correct classifications for red and white grapes, respectively.
Keywords: precision viticulture, visual inspection, grape detection, image processing.
1 Introduction
Precision agriculture (PA) and precision viticulture (PV) are production systems that promote variable management practices within a field according to site-specific conditions. The concept is based on new tools and information sources provided by modern technologies, such as yield monitoring devices, soil, plant and pest sensors and remote sensing [1, 2]. PA and PV generally have two main objectives: to render the production more cost-effective and to reduce its
environmental impact. The first objective can be achieved by reducing production costs and improving productivity. The second objective relates to the accuracy and the ability to control the application of production factors, such as chemicals, which should be done within a fair measure of the real needs of crops. Despite the benefits, such diversity restrains the rate of adoption of these technological tools, which varies considerably from country to country, and from region to region [3, 4]. In addition to environmental benefits, such as those related to better water and nutrient management, PA and PV systems can increase worker productivity, augment product throughput and improve selection reliability and uniformity. Additionally, the use of robotic systems in agriculture has seen a sharp increment in recent years. Many systems have been described for the automation of various harvesting processes, such as for fruit location, detachment and transfer, among other, [6]. They have been used in the harvesting of, for example, melons [5], cucumbers [7], tomatoes [8] and oranges [9]. For a survey of several aspects of these systems see, for example, [10] and [11]. The wine industry (see the report of the US Department of Energy [12]) is interested in using autonomous robotic systems for grape harvesting, phytosanitary treatments, among other very time and human resources consuming tasks, for multiple reasons; this is particularly true for the Douro Demarcated Region (DDR) of Portugal as explained below. Harvesting conditions, in particular, affect wine quality, and several techniques need to be used in order to produce quality wines, and including scheduling details [13], and ending up with production details [14, 15, 16]. The DDR (a UNESCO World Heritage Site and the oldest Wine Demarcated Region of the World, and where the Port wine is actually produced and then shipped at the city of OPorto), due to its unique characteristics, poses very specific challenges, mainly due to the topographic profile, pronounced climatic variations and complex soil characteristics. It is located in northeast Portugal and consists mostly of steep hills (slopes reaching 15%) and narrow valleys that flatten out into plateaux above 400m. The Douro river dug deeply into the mountains to form its bed, and the dominant element of the landscape are the vineyards, planted in terraces fashioned from the steep rocky slopes and supported by hundreds of kilometres of drystone walls. Grape harvest and disease predictions, as well as the assessment of the grape value, are currently left to the grape-growers, without the help of decision-support mechanisms, in an environment where no significant irrigation systems exists. Traditionally, vineyards have been harvested by humans. However, harvesting is difficult, particularly in a region with the topographic and climatic characteristics of the DDR. The manpower needs are large and it is getting more and more difficult to find qualified workers. Autonomous systems could not only reduce the harvesting cost and manpower needs, they could also work during the night. There is however one important constraint: although it is not essential that the work rate of a harvesting robot surpasses that of a human, it is crucial that they satisfy quality control levels at least similar to those achieved
by humans. Also, the existing machines harvest grapes by striking the vine, a process that is not recommended for some wines, such as champagne, for chemical reasons (e.g., oxidation), but also because of some deposits being collected with the grapes. These machines need, at least, one operator to harvest. They also require previous preparation of the vineyard, such as cutting the tips. Within the harvesting process the first step to take is to locate the bunches of grapes. This can be done using visual inspection. Unfortunately, this location process is much easier for fruits than for vine grapes; see, for example, [17, 18]. Even for humans, the environment makes grape detection difficult, especially when the grapes and leaves have a similar color. To an automatic recognition application, the bunches of grapes can appear at different scales, they can be partially occluded by leaves, the luminance of images can vary widely due to the sun, clouds, leaves, and so on, changing environmental conditions. Many works have been devoted to the definition of object invariant descriptors for simple geometric transformations; see, for example, [19, 20]. The Zernike moments [21, 22, 23] have been developed to overcome the major drawbacks of regular geometrical moments regarding noise effects and image quantization error. Zernike moments were successfully used in the detection of red grapes [24], but nothing is known about their performance in connection with white grapes. Moreover, besides the difficulties associated with the calculation of the Zernike moments, the method proposed by Chamelat et al. [24] implies two distinct phases, training and recognition, the training phase being crucial to the results of the whole system. The computation time for the learning step, for 17 images, is reported to take 5 minutes on a 2.8 GHz Pentium 4. The computation time for the recognition step (identification of each block of size 16 × 16 pixels) takes less than one second. We have tried to use Zernike moments to detect white grapes but with questionable success (with less than 50% of correct classifications). In this paper we propose a much simpler method for the detection and location, in natural environment, of bunches of grapes in color images, using only very simple and basic image processing techniques. The system is also able to distinguish between white and red grapes. Additionally, it can also calculate the location of the bunch stem, and can be used to help guiding a harvesting robot. The system presented here represents part of an effort that is being made by our team to help with the introduction of PA and PV in the farmers’ everyday practices in the DDR [25, 26], and is intended to be used in an autonomous harvesting robotic system. This paper is organized as follows. In the next section, we present the proposed system. Its performance and efficiency is discussed in section 3, where experimental results obtained with real images are given. Section 4 is devoted to the conclusions and future work.
2 The Grape Recognition System
Very soon in our work we have realized that if we can solve the problem of detecting bunches of white grapes we also solve the problem of detecting red grapes,
as it will become clear in the results section. Consequently, we have focused our efforts on the detection of bunches of white grapes. For the reasons presented in the previous section, we want the whole system to be able to work during night conditions (i.e., darkness, with very little or no brightness variation), and also to be able to distinguish between red and white grapes. To this end, the system simply makes a first pass through the original (night captured) image, counting the number of pixels that are "inside" the limits of the Red, Green and Blue (RGB) components of (044, 051, 064), (033, 041, 054), (055, 062, 075), and (018, 024, 036) for red grapes, and (102, 108, 089), (095, 104, 085), (076, 090, 078), and (083, 089, 038) for white grapes. These four central values (colors) per grape type were determined experimentally (by trial and error) during the development phase. The system seeks pixels within the "limits" of these central values: the default limit contains all the values within 8% of these central values for red grapes and 15% for white grapes (these were the values experimentally determined from the night captured images). The largest count indicates the type of grape, i.e., red or white.
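A minimal sketch of this first pass is given below, using the central RGB values listed above. It assumes the "within 8% (or 15%)" tolerance is taken relative to each central value, which is one possible reading of the text; the function names and the per-channel test are our own.

```python
import numpy as np

RED_CENTRES = np.array([(44, 51, 64), (33, 41, 54), (55, 62, 75), (18, 24, 36)])
WHITE_CENTRES = np.array([(102, 108, 89), (95, 104, 85), (76, 90, 78), (83, 89, 38)])

def count_matching_pixels(image_rgb, centres, tolerance):
    """Count pixels whose RGB values lie within +/- tolerance of any central colour
    (tolerance interpreted as a fraction of each central value, an assumption)."""
    pixels = image_rgb.reshape(-1, 3).astype(np.float64)
    mask = np.zeros(len(pixels), dtype=bool)
    for c in centres:
        mask |= np.all(np.abs(pixels - c) <= tolerance * c, axis=1)
    return int(mask.sum())

def identify_grape_type(image_rgb):
    red_count = count_matching_pixels(image_rgb, RED_CENTRES, 0.08)
    white_count = count_matching_pixels(image_rgb, WHITE_CENTRES, 0.15)
    return 'red' if red_count > white_count else 'white'

# Toy example with a random night-time image (H x W x 3, RGB)
image = np.random.default_rng(0).integers(0, 256, (120, 160, 3))
print(identify_grape_type(image))
```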
Fig. 1. Processing steps: (a) original image containing bunches of red grapes; (b) color mapping
If initial conditions are known, i.e., if we know that a parcel consists of a single type of grape (say white), then the system can be switched to that type of grape mode (say white mode), skipping the grape identification phase. Most of the vineyards in the DDR are characterized by their small area size, and by having more than one type of grape (red or white and even different types of caste) per parcel and even for bard, particularly in the case of old vines (the more recent vines were constructed and organized having in mind better control of the number and type of castes—and color, i.e., red and white—, but still, more than 120 different castes are used in the region). Once the type of grape (red or white) is established, the system follows three additional steps: color mapping, morphological dilation, black areas and stem
detection. The color mapping step is done based on the conditions established during the grape identification step. At the end of this step we will have a binary image (black for the pixels in the range and white for the other pixels). Figure 1(a) shows the original image, containing bunches of red grapes, and figure 1(b) presents the result of applying the color mapping step.
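The dilation and region-labeling steps described in this and the following paragraphs could be sketched as follows with SciPy. The 3 x 3 structuring element, the 60/100 iteration counts and the 8-connectivity come from the text; the binary mask here is a placeholder for the output of the color mapping step.

```python
import numpy as np
from scipy import ndimage

def detect_bunch_regions(binary_map, n_dilations):
    """binary_map: True where a pixel matched the grape colour limits."""
    struct = np.ones((3, 3), dtype=bool)                 # 3x3 structuring element
    dilated = ndimage.binary_dilation(binary_map, structure=struct,
                                      iterations=n_dilations)
    # 8-connectivity labelling of the resulting uniform regions
    labels, n_regions = ndimage.label(dilated, structure=np.ones((3, 3)))
    regions = []
    for idx, s in enumerate(ndimage.find_objects(labels), start=1):
        height, width = s[0].stop - s[0].start, s[1].stop - s[1].start
        area = int((labels[s] == idx).sum())
        regions.append({'bbox': s, 'width': width, 'height': height, 'area': area})
    return regions

# Stand-in for the colour-mapping output of a red-grape image
colour_mask = np.zeros((480, 640), dtype=bool)
colour_mask[200:260, 300:350] = True
# 60 dilation iterations for red grapes, 100 for white, as described in the text
regions = detect_bunch_regions(colour_mask, n_dilations=60)
```

The width, height and area of each region can then be thresholded to discard implausible detections, as discussed below.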
Fig. 2. Processing steps: (a) morphological dilation; (b) black areas (bunches) and stem detection step
As we can see in figure 1(b), the resulting image does not generally have a uniform (continuous) black region, but several regions where the concentration of black pixels is greater. The morphological dilation operation is meant to fill in the gaps between the pixels, yielding uniform black regions. We have used a square of 3 × 3 pixels as the structuring element, one of the simplest and most commonly used structuring elements. We should recall that the shape of this structuring element is not very far from the shape of a grape berry. Because the regions resulting from the color mapping step are very different for red grapes and white grapes, with a greater sparsity when dealing with white grapes, we have typically used 60 iterations of the morphological dilation for red grapes, and 100 iterations for white grapes. Figure 2(a) shows the resulting image after the dilation operation, applied to the image presented in figure 1(b). Obviously, we have tried different sizes for the structuring element, but when its size was increased, even with a lower number of morphological dilation iterations, the number of overlapping uniform black regions tended to increase; these overlaps correspond to incorrectly merged bunches, leading to incorrect identifications. The final step is concerned with black regions and stem detection. First, the contiguous regions are counted and labeled (numbered) using 8-connectivity. Then, for each region, its width, height and area are calculated
Fig. 3. Example of identification final result: (a) original image; (b) identification result
(in pixels) so that we can discard false detections of very small bunches, or regions too large to contain only one bunch of grapes. Note that in this last case the region will be composed of two or more bunches and the system will need to capture more images (from different angles) to correctly handle these situations. These parameters, and also the total number of admissible areas, are all adjustable. Next, we determine the number of regions and, for each region, its center, area, width, height, perimeter and boundaries. For each region, based on the pixel distribution and density around its center, the "horizontal" (width) and "vertical" (height) axes of the bunch are determined, i.e., the bunch orientation. Then, with these axes (and orientation), and with the region's limits, we locate the bunch stem. Figure 2(b) shows the resulting image after the final step, and figure 3 shows another example consisting of the original image and the resulting identification.

Table 1. Summary of grape detection results. Correct means that all bunches present in each image were correctly identified. Incorrect means that the system classified some areas of the image as if they were bunches although no bunches at all were present, or missed the identification of an existing bunch.

            White               Red
Correct     172 (91%)           34 (97%)
Incorrect   18 (9%)             1 (3%)
Totals      190 (100%) images   35 (100%) images

3 Results and Discussion
The images presented here and used to test the system were captured during the night, with a Panasonic FZ28 camera (http://www.panasonic.co.uk/html/en_GB/1258590/index.html), simply using its internal flash, i.e., no other lighting system was used. By capturing the images during the night we avoid any spurious reflections or bad illumination that occur during sunny daytime, such as the momentary presence of clouds, and also, as explained above, preserve some important wine chemical
properties (avoiding, e.g., oxidation). In total, there were 190 images containing bunches of white grapes, and 35 images of red grapes (the number of images containing bunches of white grapes is much higher than that of red grapes because the identification of bunches of red grapes is much simpler, as explained above). A summary of the results can be seen in Table 1. As we can see, there were 172 correct results for bunches of white grapes; this means that all bunches present in each of these 172 images were correctly identified. We emphasize this fact because we want this system to become, in the near future, part of a harvesting robot, and we know that most of the infield images captured by this robot could contain more than one bunch per image, which clearly is not an optimal situation, but a realistic one. Recall that in a practical infield situation the robot can take as many images as needed to ensure that no more bunches are present for harvesting. However, we also have 18 images with incorrect or false detections. This means that the system classified some areas of the image as if they were bunches although no bunches at all were present, or missed the identification of an existing bunch. Also, if two regions were merged during the morphological dilation process they were counted as an "incorrect" result. As we had foreseen, the system's performance, in percentage, is better for bunches of red grapes than for white grapes. Recall that the color of red grapes is very different (contrasting) from their surroundings (e.g., the color of the leaves); however, one must note the low number of red grape sample images.
Fig. 4. Example of a possible robot’s positioning correction by noticing “how far” it is from the bunches: (a) original image; (b) identification result
Additionally, we also argue that the system presented here can help guide the robot. As can be seen in figure 4, although the picture was captured very far away from the grape bunches, the system manages to identify the presence of bunches of grapes. So, the system can tell the robot to move along that direction, adjusting (fine tuning) its position or trajectory. Usually, the presence of many bunches of grapes in one image indicates that the robot is at a considerable distance from those bunches. Obviously, this can only be a contribution to the robot's trajectories or positioning. In the examples of figures 5 and 6, we can see that the bunch in the center of the figure is correctly identified. However, we can also see that there are more bunches to be harvested.
Fig. 5. Example of overlapping bunches and identification result
Fig. 6. Example of a possible robot’s positioning correction by noticing the presence of more bunches; (a) original image, (b) in this case, only the central bunch was correctly identified, so more pictures are needed (see text).
By capturing one more image after the harvesting of the central bunch, the system will detect the presence of more bunches and then correct the robot's trajectory, as in the previous situation. As noted before, the presence of leaves may (partially) occlude a bunch; in these cases a mechanical system like the one presented in [27], basically a blower/fan to remove leaves, can be used to help solve this problem. We tested the system on a computer running Microsoft Windows XP Home Edition, with an Intel Core Duo Processor T2300 at 1.66 GHz, Mobile Intel 945PM Express Chipset, 3GB DDR2 667MHz SDRAM, and an NVIDIA GeForce Go 7300 External 128MB VRAM video card. In order to reduce the identification time, we tested several image resolutions, as presented in Table 2. As we can see, even with a resolution of 1.3 Mega-pixel (MP) the system is able to produce accurate results. Note that the identification of bunches of red grapes takes less time, because it uses a lower number of iterations during the dilation operation step.
Table 2. Summary of grape detection time versus image resolution (seconds versus Mega Pixels)

        10MP    3MP     1.3MP
White   1.5 s   0.22 s  0.16 s
Red     1.0 s   0.15 s  0.08 s
As for the system cost, nowadays the price of the Panasonic FZ28 camera is about 300 €. The processing unit may be one of the ones used to control the robot, but if we choose to use a separate computer board we may purchase it below 200 €. This gives us a total below 500 €.
4 Conclusions and Ongoing Work
Within the context of Precision Agriculture/Viticulture, and because the Douro Demarcated Region has its own very particular characteristics, a vision inspection system was developed in order to identify bunches of grapes, for later inclusion in a robotic system for harvesting in night conditions. As explained above, this system is also able to automatically distinguish between white and red grapes. Additionally, it can also calculate the location of the bunch stem, and can be used to help the robot's location and guiding system. The system was targeted to identify bunches of grapes during night conditions, and we have achieved 97% and 91% of correct classifications for red and white grapes, respectively. Concerning the identification algorithm, we are trying to make it more flexible. For example, during the morphological dilation step, the number of iterations should depend on the merging of the different regions, i.e., the iteration process should stop if the regions are already contiguous or if two distinct regions are merged. Also, the tiling and distribution of these contiguous regions should be analyzed in order to prevent incorrect identifications, such as two or more stacked regions, or regions much wider than they are high, among others. Obviously, a dedicated (specific) lighting system might help in producing better identification results (starting with a simple diffuser), but it would also increase the system cost and the needed resources (e.g., power supply). We are also currently testing multi-spectral cameras. This solution is far more expensive (when compared with the roughly 300 € of the Panasonic FZ28 camera), but it could bring information about grape maturation and alcoholic level. A cheaper alternative may include infra-red cameras, but red grapes yield more thermal information than white grapes, and so, once again, the identification of red grapes seems simpler than that of white grapes. In addition to the tasks of detection and harvesting, the robotic system may also contribute to mitigate the environmental impact of chemical plant protection products, as their application by robotic systems would be made only at points of interest identified by the vision system, among other possible applications (pruning, trimming the vines, disease detection, etc.).
References [1] Bramley, R.G.V.: Precision viticulture — research supporting the development of optimal resource management for grape and wine production (2001), http://www. crcv.com.au/research/programs/one/workshop14.pdf [2] Bramley, R.G.V.: Progress in the development of precision viticulture — variation in yield, quality and soil properties in contrasting australian vineyards (2001), http://www.crcv.com.au/research/programs/one/bramley1.pdf [3] Seelan, S., Laguette, S., Casady, G., Seielstad, G.: Remote sensing applications for precision agriculture: a learning community approach. Remote Sensing of Environment 88(1-2), 157–169 (2003) [4] Tongrod, N., Tuantranont, A., Kerdcharoen, T.: Adoption of precision agriculture in vineyard. In: ECTI-CON: 2009 6th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, vol. 1 and 2, pp. 695–698 (2009) [5] Edan, Y., Rogozin, D., Flash, T., Miles, G.: Robotic melon harvesting. IEEE Transactions on Robotics and Automation 16, 831–835 (2000) [6] Sistler, F.: Robotics and intelligent machines in agriculture. IEEE Journal of Robotics and Automation 3, 3–6 (1987) [7] Van Henten, E., Hemming, J., Van Tuijl, B., Kornet, J., Meuleman, J., Bontsema, J., Van Os, E.: An autonomous robot for harvesting cucumbers in greenhouses. Autonomous Robots 13, 241–258 (2002) [8] Monta, M., Kondo, N., Ting, K.: End-effectors for tomato harvesting robot. Artifficial Intelligence Review 12, 11–25 (1998) [9] Recce, M., Taylor, J., Plebe, A., Tropiano, G.: Vision and neural control for an orange harvesting robot. In: Proceedings of the 1996 International Workshop on Neural Networks for Identification, Control, Robotics, and Signal/Image Processing (NICROSP 1996), p. 467 (1996) [10] Jimenez, A., Jain, A., Ceres, R., Pons, J.: Automatic fruit recognition: A survey and new results using range/attenuation images. Pattern Recognition 32, 1719– 1736 (1999) [11] Sarig, Y.: Robotics of fruit harvesting: A state-of-the-art review. Journal of Agricultural Engineering Research 54, 265–280 (1993) [12] U.S. Department of Energy. Assessment study on sensors and automation in the industries of the future:reports on industrial controls, information processing, automation, and robotics. Technical report, U.S. Department of Energy, Energy Efficiency and Renewable Energy, Industrial Technologies Program (2004) [13] Bohle, C., Maturana, S., Vera, J.: A robust optimization approach to wine grape harvesting scheduling. European Journal of Operational Research 200(1), 245–252 (2010) [14] Mateo, E.M., Medina, A., Mateo, F., Valle-Algarra, F.M., Pardo, I., Jimenez, M.: Ochratoxin A removal in synthetic media by living and heat-inactivated cells of Oenococcus oeni isolated from wines. Food Control 21(1), 23–28 (2010) [15] Lopes, M.S., Mendonca, D., Rodrigues dos Santos, M., Eiras-Dias, J.E., da Camara Machado, A.: New insights on the genetic basis of Portuguese grapevine and on grapevine domestication. Genome 52(9), 790–800 (2009) [16] Marmol, Z., Cardozo, J., Carrasquero, S., Paez, G., Chandler, C., Araujo, K., Rincon, M.: Evaluation of total polyphenols in white wine dealt with chitin. Revista De La Facultad De Agronomia De La Universidad Del Zulia 26(3), 423–442 (2009)
[17] Rosenberger, C., Emile, B., Laurent, H.: Calibration and quality control of cherries by artificial vision. International Journal of Electronic Imaging, Special Issue on Quality Control by Artificial Vision 13(3), 539–546 (2004) [18] Lu, J., Gouton, P., Guillemin, J., My, C., Shell, J.: Utilization of segmentation of color pictures to distinguish onions and weeds in field. In: Proceeding of International Conference on Quality Control by Artificial Vision (QCAV), vol. 2, pp. 557–562 (2001) [19] Petrou, M., Kadyrov, A.: Affine invariant features from the trace transform. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1), 30–44 (2004) [20] Jain, A., Duin, R., Mao, J.: Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), 4–37 (2000) [21] Choksuriwong, A., Laurent, H., Emile, B.: Comparison of invariant descriptors for object recognition. In: IEEE International Conference on Image Processing, ICIP 2005, vol. 1, pp. I-377–I-380 (September 2005) [22] Chong, C., Raveendran, P., Mukundan, R.: Mean shift: A comparative analysis of algorithms for fast computation of Zernike moment. Pattern Recognition 36, 731–742 (2003) [23] Khotanzad, A., Hong, Y.: Invariant image recognition by Zernike moments. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(5), 489–497 (1990) [24] Chamelat, R., Rosso, E., Choksuriwong, A., Rosenberger, C., Laurent, H., Bro, P.: Grape detection by image processing. In: IECON 2006 - 32nd Annual Conference on IEEE Industrial Electronics, vol. 1-11, pp. 3521–3526 (2006) [25] Morais, R., Fernandes, M., Matos, S., Serˆ odio, C., Ferreira, P., Reis, M.: A ZigBee multi-powered wireless acquisition device for remote sensing applications in precision viticulture. Computers and Electronics in Agriculture 62(2), 94–106 (2008) [26] Morais, R., Matos, S., Fernandes, M., Valente, A., Soares, S., Ferreira, P., Reis, M.: Sun, wind and water flow as energy supply for small stationary data acquisition platforms. Computers and Electronics in Agriculture 64(2), 120–132 (2008) [27] Edan, Y., Miles, G.E.: Systems engineering of agricultural robot design. IEEE Transactions on Systems, Man and Cybernetics 24(8), 1259–1264 (1994)
Fuzzy Cognitive Maps Applied to Synthetic Aperture Radar Image Classifications

Gonzalo Pajares1, Javier Sánchez-Lladó2, and Carlos López-Martínez2

1 Dpto. Ingeniería del Software e Inteligencia Artificial, Facultad de Informática, University Complutense of Madrid, 28040 Madrid, Spain
2 Remote Sensing Laboratory (RSLab), Signal Theory and Communications Department, Universitat Politècnica de Catalunya (UPC), 08034 Barcelona, Spain
[email protected], [email protected], [email protected]
Abstract. This paper proposes a method based on Fuzzy Cognitive Maps (FCM) for improving the classification provided by the Wishart maximum-likelihood based approach in Synthetic Aperture Radar (SAR) images. FCM receives the classification results provided by the Wishart approach and creates a network of nodes, associating a pixel to a node. The activation levels of these nodes define the degree of membership of each pixel to each class. These activation levels are iteratively reinforced or punished based on the existing relations among each node and its neighbours, also taking into account the node under consideration itself. Through a quality coefficient we measure the performance of the proposed approach with respect to the Wishart classifier.
Keywords: Fuzzy Cognitive Maps, Wishart classifier, Synthetic Aperture Radar (SAR), Polarimetric SAR (POLSAR), classification.
1 Introduction
Nowadays, the increasing technology in Polarimetric Synthetic Aperture Radar (PolSAR) remote sensors is demanding solutions for different applications based on the data they provide. One of such applications is data classification, to identify the nature of the different structures in the imaged surfaces and volumes based on the scattering of microwaves. Terrain and land-use classification are probably the most important applications of PolSAR, where many supervised and unsupervised classification methods have been proposed [1,2,3,4,5]. In [6], the fuzzy c-means clustering algorithm is applied for unsupervised segmentation of multi-look PolSAR images. A statistical distance measure is derived from the complex Wishart distribution of the complex covariance matrix. Methods based on the complex Wishart distribution have gained interest because of their performance. The scattering of microwaves coming from the surfaces is mapped as a coherence matrix for each pixel. Cloude and Pottier [7,8], based on the eigenvalues and eigenvectors of the coherence matrix, obtained measures of the average scattering mechanism (alpha, α) and of the randomness of the scattering (entropy, H). For classification, the H-α plane is divided into nine zones. Lee et al. [9] derive a
probability density function for the coherence matrix assuming that this matrix has the complex Wishart distribution. Based on the H- α classification, each class or cluster is characterized by its centre. Ersahin et al. [10] applied graph partitioning and human perceptual skills based on the Wishart distribution. In this paper, class and cluster are terms used without distinction. Each pixel is classified as belonging to a cluster based on the minimum distance between its coherence matrix and the cluster centres. The Wishart classification approach is an iterative process where each pixel in the whole image is re-classified until no more pixel assignments occur or a termination criterion is met. We have verified that the first rarely occurs and we apply as criterion the one giving the best partition, where a quality measurement is obtained through a specific coefficient. Once we obtain the best partition with the Wishart classifier, each pixel has been classified as belonging to a cluster. We have designed a new iterative process based on the Fuzzy Cognitive Maps (FCM) paradigm, assuming that the Wishart classifications can be still improved. FCM is a well-developed modelling methodology for complex systems that allow to describe the behaviour of a system in terms of concepts. Under its most general approach, each concept represents an entity where the concepts are joined by causal edges and have assigned an activation level [11,12]. With the goal of improving the partitions, we build a set of networks, as many as clusters provided by Wishart, i.e. nine in our approach. Each concept at each network represents a pixel in the original image and its activation level the degree of membership to the class that originated the network. Initially, the activation levels are computed from the distances between the coherency matrix associated to each pixel and the cluster centre provided by Wishart classifier. The values of the causal edges, linking a concept with other concepts, are computed considering the activation levels of all involved concepts and also under the goal of achieving the best partition as possible, based on the measure provided by the quality coefficient. The design of the FCM, as a whole, makes the main finding of this paper. Several experiments allow us to verify the improvement in the classification results achieved by Wishart. The paper is organized as follows. In Section 2 the FCM scheme is proposed for SAR data classification, giving details about the complex Wishart classifier and the theory about the cluster separability measures, as they are required by the FCM process. The performance of the method is illustrated in Section 3, where a comparative study against the original Wishart-based approach is carried out. Finally, Section 4 provides a discussion of some topics and conclusions.
2 Fuzzy Cognitive Maps Framework in SAR Classification

2.1 The Combined H-α Decomposition and Maximum Likelihood-Based Wishart

Before the FCM paradigm is applied, the Wishart-based classification process described in [9] is carried out, synthesized as follows:
1) The polarimetric scattering information can be represented by a target vector for a pixel, k_l = [S_hh, √2 S_hv, S_vv]^T, or, another way to work, by the Pauli scattering vector k_p = 2^{−1/2} [S_hh + S_vv, S_hh − S_vv, 2S_hv]^T; in this case we let a new feature vector be defined, and T_i = k_i k_i^{*T} is the Hermitian product of
is the hermitian product of
target vectors of the one-look ith pixel. POLSAR data are generally multilook processed for speckle noise reduction by averaging n neighbouring pixels. The coherency matrix is given in (1) where superscripts, * and T denote complex conjugate and matrix transposition, respectively. 1 n *T (1) T = ∑ ki k i n i =1 2) From the coherency matrices, we apply the H/α decomposition process as a refined scheme for parameterizing polarimetric scattering problems. The scattering entropy, H, is a key parameter in determining the randomness about the model. The α angle characterizes the scattering mechanism proposed in [7]. 3) The next step is to classify the image into nine classes in the H − α plane. 4) We have available nine zones (classes) where each class is identified as wj; i.e. in our approach j varies from 1 to 9. Compute the initial cluster centre from the coherency matrices for all pixels belonging to each class wj according to the number of pixels nj in that class wj, t
Vj =
n
1 j ∑ T n j i =1
(2)
i
t
where t denotes the iteration, i.e. V_j^t is the mean for the class w_j at the iteration t.
5) Compute the distance measure for each pixel i, characterized by its coherency matrix T_i, to the cluster centre as follows:

d(T_i, V_j^t) = ln|V_j^t| + Tr((V_j^t)^{−1} T_i).    (3)
6) Assign the pixel to the class with the minimum distance measure:

i ∈ w_j  iff  d(T_i, V_j^t) < d(T_i, V_m^t)  ∀ w_j ≠ w_m.    (4)
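Steps 5) and 6) could be sketched as follows with NumPy; the coherency matrices T and the cluster centres V are assumed to be available (here replaced by tiny synthetic Hermitian matrices), and the function names are our own.

```python
import numpy as np

def wishart_distance(T_i, V_j):
    """Eq. (3): d(T_i, V_j) = ln|V_j| + Tr(V_j^{-1} T_i)."""
    _, logdet = np.linalg.slogdet(V_j)
    return float(logdet.real + np.trace(np.linalg.inv(V_j) @ T_i).real)

def assign_pixels(T, V):
    """T: (n_pixels, 3, 3) coherency matrices; V: (n_classes, 3, 3) cluster centres.
    Returns the minimum-distance class of each pixel (Eq. (4)) and the distances."""
    d = np.array([[wishart_distance(Ti, Vj) for Vj in V] for Ti in T])
    return np.argmin(d, axis=1), d

# Tiny synthetic example: 5 pixels, 9 classes, 3x3 Hermitian positive-definite matrices
rng = np.random.default_rng(0)
def random_coherency(n):
    A = rng.normal(size=(n, 3, 3)) + 1j * rng.normal(size=(n, 3, 3))
    return A @ A.conj().transpose(0, 2, 1) + 3 * np.eye(3)
T, V = random_coherency(5), random_coherency(9)
labels, d = assign_pixels(T, V)
```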
7) Verify if a small number of pixels change their assignment to the clusters, otherwise set t = t + 1 and return to step 1. Always stop if a prefixed number of iterations t_max is reached, in which case we choose the partition with the greatest quality, according to a coefficient measuring the cluster separation (next section).

2.2 Cluster Separation Measures
Three useful measures can be used for quantitative statistical analysis [13], namely: (a) the dispersion within a class; (b) the distance between classes; and (c) a combination of the above that gives the class separability.
1) The dispersion within clusters (D_ii) is defined as the averaged distance from the pixels in cluster w_i to the cluster centre V_i. It measures the compactness of cluster w_i and is given by

D_ii = (1/n_i) ∑_{k=1}^{n_i} d(T_k, V_i) = ln(|V_i|) + Tr(V_i^{−1} V_i),    (5)

where a large D_ii value indicates that the pixels are dispersed within the cluster.
2) The distance between two clusters (D_ij) is defined as

D_ij = (1/2) [ ln(|V_i|) + ln(|V_j|) + Tr(V_i^{−1} V_j + V_j^{−1} V_i) ],    (6)

where large D_ij values indicate a high separation of these two clusters.
3) The cluster separability (R_ij) is defined for the clusters w_i and w_j as follows:

R_ij = (D_ii + D_jj) / D_ij,    (7)
where a small R_ij value indicates that these two clusters are well separated. Based on the above measures, the goal is to achieve small dispersion of the clusters and large distances between clusters, which lead to small R_ij values. To verify the performance quantitatively, and also in order to choose the best partition, we compute the global averaged cluster separability through the following equation:

R = (1/n_w) ∑_i ∑_j R_ij,  i ≠ j,    (8)

where n_w is the number of R_ij combinations with i ≠ j; since in our experiments we have nine zones (clusters) and we combine them two by two, n_w = 36.

2.3 The Fuzzy Cognitive Maps Process

2.3.1 Problem Formulation: Previous Considerations
According to equation (4), a pixel i belongs to the cluster w_j if the distance to the corresponding cluster centre is minimum among all distances to the remaining cluster centres. Based on these distances, we define the support received by the pixel i for belonging to the cluster w_j as follows:

μ_i^j(t) = 2 exp(−d(T_i, V_j^t)) / ∑_{h=1}^{m} exp(−d(T_i, V_h^t)) − 1,    (9)
(9)
where t denotes, as indicated before, the iteration number which controls the FCM iterative process, as we will see later; the subindex h varies from 1 to m, i.e. it represents the nine zones from h = w1 to h = m = w9. As we can see, the support μij (t ) varies in the range (–1,+1] which is considered as the fuzzy causal interval for our approach. Indeed, if d
(T
)
,V j = 0 then μi (t ) = +1 and if d t
i
j
(T
t
i
)
,V j → ∞
then μi (t ) → −1 . With this transformation, the decision rule in (4) can be expressed j
as a function of the support at the iteration t, according to the equation (10), which expresses that the pixel i belongs to the cluster wj because the support received by the pixel for this cluster is the greatest of all supports received for the remainder clusters. j m (10) i ∈ w j iff μi (t ) > μi (t ) ∀ w j ≠ wm 2.3.2 Problem Formulation: Architecture For each cluster wj, we build a network of nodes, netj, where the topology of this network is established by the spatial distribution of the pixels in the image to be classified with size M × N. Each node i in the netj, is associated to the pixel location
Fuzzy Cognitive Maps Applied to Synthetic Aperture Radar Image Classifications
107
(x,y) in the image, i.e. i ≡ ( x, y) . So, the node i in the netj is initialized with the support provided by the Wishart classifier through equation (9) at the last iteration executed. These initial support values are also the initial activation levels associated to the concepts in the networks under the FCM paradigm, as described in the next section. Through the FCM the activation level of each concept is reinforced or punished iteratively based on the influences exerted by their neighbours. Figure 1 displays the architecture and the set of networks. As we can see, from the original image we build the j nets (j = 1 to 9). Every node with its activation level or support μij (t ) on each netj is associated to a pixel i on the original image, both with identical locations (x,y). The activation levels at each network are updated according to the number of iterations t. x y
μ
μ k1
net1
1 i
t+1
x k y
i
j = 1,…,9 nets
x original image pixels i and k
y
μ
μk9
net9
9 i
t+1
Fig. 1. Network architecture from the original image
2.3.3 FCM Process and Network Topology As mentioned before, FCMs allow describing the behaviour of a system in terms of concepts. More specifically, they are fuzzy signed directed graphs with feedback [11]. The directed edge eik from causal concept Ci to concept Ck measures how much Ci causes Ck. Edges eik take values in the fuzzy causal interval [ −1, +1] , eik = 0 indicates
no causality; eik > 0 indicates causal increase, this means that Ck increases as Ci increases and vice versa, Ck decreases as Ci decreases; eik < 0 indicates causal decrease or negative causality, Ck decreases as Ci increases and Ck increases as Ci decreases. Given a FCM with a number n of concepts Ci, i.e. i = 1,…,n, the value assigned to each concept, called activation level, can be updated iteratively, until convergence, based on the external influences exerted by the other nodes Ck on Ci and its selfinfluence through Ci. Several approaches have been proposed to map these influences, such as the one proposed by Tsardias and Margaritis [10], which introduces a mechanism for achieving high network stability, as we will see below.
Ai (t + 1) = f ( Ai (t ), ∑ nk =1 eki (t ) Ak (t ) ) − d i Ai (t )
(11)
108
G. Pajares, J. Sánchez-Lladó, and C. López-Martínez
The explanation of the terms in the equation (11) is as follows: 1) Ai (t ) and Ak (t ) are respectively the activation levels of the concepts Ci and Ck at the iteration, t. The sum is extended to all n concepts available. Nevertheless, only the concepts with edge values different from zero exert influences over the concept Ci trying to modify its current activation level Ai (t ) towards Ai (t + 1) . 2) eki(t+1) are the fuzzy causalities between concepts, defined as above but considering that they could vary dynamically with the iterations. 3) di ∈ [0,1] is the decay factor of certainty concept Ci. It determines the fraction of the current activation level that will be subtracted from the new one as a result of the concept’s natural intention to get stable activation levels. The bigger the decay factor, the stronger the decay mechanism. This factor was introduced in [11] as a mechanism for introducing a degree of instability, so that those concepts destabilised intentionally but with a high degree of real stability tend towards its stabilized activation level. On the contrary, if the activation level is unstable, the decay mechanism induces a continuous variability on the activation level. 4) f is a non-linear function that determines the activation level of each concept, the sigmoid function is commonly used, i.e. f ( x ) = tanh( x ) . The variable x in f(x) represents a combination of the following two terms Ai (t ) and
∑
n
e (t ) Ak (t ) , in this
k =1 ki
paper we have chosen the arithmetic mean for combination. Based on the network architecture defined above, it is easy to associate each concept Ci in the netj to the pixel location (x,y). We can define a concept for each node at each netj, Ci j , where its activation level results Ai j (t ) . We define the activation level as the support received by the pixel i for belonging to the cluster wj at the iteration t, defined in equation (9), as Ai j (t ) ≡ μi j (t ) , these supports are reinforced or punished through the FCM mechanism, equation (11); the decision is made according to equation (10) based on the supports updated. Now we concentrate the effort on defining the fuzzy causalities between concepts eki(t) and the decay factor di, both involved in equation (11), which are required for updating μij (t ) . Both terms must be expressed considering the netj in which they are involved, i.e. ekij (t ) and di j . The term ekij (t ) is a combination of two coefficients representing the mutual influence exerted by the k neighbours over i, namely: a) a regularization coefficient which computes the consistency between the activation levels or supports of the nodes in a given neighbourhood for each netj; b) a separation coefficient which computes the consistency between the clusters in terms of separability, where high separability values are suitable. The neighborhood N in is defined as the n-connected spatial region in the network around the node i, taking into account the mapping between the pixels in the images and the nodes in the networks. The regularization coefficient is computed at the iteration t as follows, n ⎧ 1 − μ j (t ) − μ j (t ) k ∈ Ni , i ≠ k i k ⎪ j (12) rik (t ) = ⎨ n ⎪⎩0 k ∉ N i or i = k
From (12) we can see that r_ik^j(t) ranges in (-1, +1], where +1 is obtained with μ_i^j(t) = μ_k^j(t). This means that both supports have identical values, i.e. maximum consistency between nodes. On the contrary, if μ_i^j(t) and μ_k^j(t) take the most extreme opposite values, such as μ_i^j(t) = +1 and μ_k^j(t) = −1 or vice versa, then r_ik^j(t) = −1, which is its lower limit, expressing minimum consistency between nodes i and k. The separation coefficient at iteration t is computed taking into account the labels assigned to the pixels associated with the nodes according to the classification decision rule given in equation (10). Assume that pixels i and k are classified as belonging to clusters wr and ws respectively, i.e. labelled as r and s. Because we are trying to achieve maximum separability between clusters, we compute the averaged cluster separability according to equation (7). We compute the separabilities between i and its k neighbours in N_i^n. A low R_rs value, or equivalently a high R_rs^{-1}, expresses that clusters wr and ws are well separated. Based on this reasoning, the separation coefficient is defined as follows,

c_ik(t) = 2 [ R_rs^{-1}(t) / Σ_{u ∈ N_i^n} R_ru^{-1}(t) ] − 1    (13)
Through the fraction in equation (13) we normalize the values to the interval (0, +1], because we compute the cluster separabilities between the cluster wr to which the pixel i belongs and the clusters wu to which the neighbours of i in N_i^n belong. One of these clusters is ws, the cluster to which k belongs. The coefficients 2 and 1 in equation (13) are introduced so that c_ik(t) ranges in (-1, +1]. This mapping is made to achieve the same range as r_ik(t). Note that the separation coefficient is independent of j, i.e. of the net_j, because the labelling used for its computation involves the activation levels of all networks. This implies that it is identical for all networks. Both coefficients, regularization and separation, are combined as a weighted average, taking into account the signs, as follows,
W_ik^j(t) = γ r_ik^j(t) + (1 − γ) c_ik(t);    e_ik^j(t) = W_ik^j(t)    (14)
γ ∈ [0,1] is a constant value representing the trade-off between both coefficients; e_ik^j(t) represents the degree of consistency between nodes i and k in the net_j at iteration t. We define the decay factor based on the assumption that a concept Ci whose activation level in the net_j remains highly stable should keep that level, whereas an unstable concept should lose some of its activation. We build an accumulator of cells of size q = L × M, where each cell i is associated with the concept Ci. Each cell i stores the number of times, h_i^j, that the concept Ci has significantly changed its activation level in the net_j. Initially, all h_i^j values are set to zero, and then h_i^j = h_i^j + 1 if |μ_i^j(t+1) − μ_i^j(t)| > ε. The stability of the node i is measured as
the fraction of changes accumulated by the cell i compared with the changes in its neighbourhood k ∈ N_i^m and the number of iterations t. The decay factor is computed as follows,

d_i^j = 0   if h_i^j = 0 and h̄_k^j = 0;    d_i^j = h_i^j / [ (h̄_k^j + h_i^j) t ]   otherwise.    (15)
where h_i^j is defined above and h̄_k^j is the average value accumulated by the concepts k ∈ N_i^m. As one can see from equation (15), if h_i^j = 0 and h̄_k^j = 0 the decay factor takes the null value; this means that no changes occur in the activation levels of the concepts, i.e. high stability is achieved. If the fraction of changes is small, the stability of the node i is also high and the decay term tends towards zero. Even if the fraction is constant, the decay term also tends to zero as t increases; this means that some changes may occur initially and no more changes are detected afterwards, which is another sign of stability. The decay factor subtracts a fraction from the new activation level; this implies that the activation level could take values smaller than -1 or greater than +1. In these cases, the activation level is clipped to -1 or +1, respectively. Once the FCM process ends, each concept Ci has reached an activation level μ_i^j(t) that determines the degree of belonging of the pixel i, represented by Ci, to the cluster j. The final decision for classifying the pixel i is made according to (10).
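To make the interplay of equations (11)-(15) concrete, the following sketch shows one FCM update of the support maps in NumPy. It is only an illustration under simplifying assumptions: the pixel neighbourhood indices (nbr_idx), the current labels and an inverse-separability matrix R_inv are assumed to be precomputed, and the arithmetic-mean combination, clipping and change counting follow the description above; it is not the authors' implementation.

```python
import numpy as np

def fcm_update_step(mu, R_inv, labels, nbr_idx, h, t, gamma=0.6, eps=0.01):
    """One FCM iteration over the support maps (illustrative sketch of Eqs. (11)-(15)).

    mu      : (J, P) supports mu[j, i] in [-1, 1] for pixel i and cluster j
    R_inv   : (J, J) inverse cluster separabilities R^{-1} (placeholder values)
    labels  : (P,)  current cluster label of each pixel (decision rule of Eq. (10))
    nbr_idx : (P, K) indices of the K spatial neighbours of each pixel
    h       : (J, P) change counters h_i^j used by the decay mechanism
    """
    J, P = mu.shape
    new_mu = np.empty_like(mu)
    for j in range(J):
        for i in range(P):
            k = nbr_idx[i]
            r = 1.0 - np.abs(mu[j, i] - mu[j, k])            # regularization, Eq. (12)
            ru = R_inv[labels[i], labels[k]]
            c = 2.0 * ru / max(ru.sum(), 1e-12) - 1.0        # separation, Eq. (13)
            e = gamma * r + (1.0 - gamma) * c                # fuzzy causalities, Eq. (14)
            h_bar = h[j, k].mean()
            d = 0.0 if (h[j, i] == 0 and h_bar == 0) \
                else h[j, i] / ((h_bar + h[j, i]) * t)       # decay factor, Eq. (15)
            x = 0.5 * (mu[j, i] + np.dot(e, mu[j, k]))       # arithmetic-mean combination
            new_mu[j, i] = np.clip(np.tanh(x) - d * mu[j, i], -1.0, 1.0)  # Eq. (11)
            if abs(new_mu[j, i] - mu[j, i]) > eps:           # track significant changes
                h[j, i] += 1
    return new_mu
```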
3 Results

3.1 Design of a Test Strategy
In order to assess the validity and performance of our proposed classification approach we use the well-tested NASA/JPL AIRSAR image of the San Francisco Bay. The images are 900 x 1024 pixels in size. Because our proposed FCM approach starts after the iterative complex Wishart process has finished, the first task consists of determining the best number of iterations for the Wishart process. This is carried out by executing this process from 1 to tmax, fixed to 7 in our experiments. For every iteration, we compute the averaged separability value according to equation (8) for the classification obtained at that iteration and select the number of iterations tw with the minimum averaged separability coefficient. The classification results obtained for tw are the inputs for the FCM. Because the FCM depends on several parameters, we have carried out several experiments for adjusting them. The strategy is based on the following steps: 1) Fix the γ coefficient involved in equation (14) at this initial step. It was fixed to γ = 0.6 after experimentation. 2) Fix the maximum number of iterations tmax. We have set tmax to 4 because after experimentation we have verified that more iterations are not suitable due to an over-smoothing of the textured regions.
3) With the above parameters fixed, we test the FCM for the following three neighbourhood regions: N_i^8, N_i^24 and N_i^48. As before, bigger neighbourhoods produce excessively smoothed results. The best neighbourhood is the one with the minimum averaged separability value R. 4) Once we have obtained the best neighbourhood and tmax as described before, we test several γ values ranging from 0.2 to 1.0 in steps of 0.2 and each iteration from t = 1 to tmax = 4. For each γ at each iteration, we compute the R values, eq. (8).

3.2 Results
According to the above strategy, we show the results obtained at the different steps of the process. The averaged separability coefficient values for the complex Wishart classifier, equation (8), are displayed in Table 1 for iterations 1 to tmax = 7.

Table 1. Averaged separability values R for Wishart against the number of iterations

# of iterations   1       2       3        4        5        6       7
R                 91.78   78.29   116.12   145.22   112.78   92.98   106.62
As one may observe, the best cluster separability is achieved for two iterations, because the minimum averaged separability coefficient value is obtained there. This is the number of iterations employed for the complex Wishart approach in our experiments, i.e. tw = 2. Table 2 displays the values obtained by the FCM for a number of iterations varying from 1 to tmax and for the specified neighbourhoods with γ = 0.6.

Table 2. Averaged separability values R for the FCM process against the number of iterations and different neighbourhoods with γ = 0.6

Neighbourhood    FCM: # of iterations
                 1       2       3       4
N_i^8            79.5    79.0    71.9    85.9
N_i^24           189.4   177.8   198.1   197.7
N_i^48           361.3   337.5   449.5   455.3
As one can see from the results in Table 2, the best performance is achieved for two iterations with a neighbourhood of N_i^8. This is the neighbourhood finally used for all experiments. The results obtained with N_i^24 and N_i^48 are clearly worse; one explanation is that with these values the number of neighbours forcing the change
Table 3. Averaged separability values R for the FCM process against the number of iterations with different γ values

Iteration   γ = 0.0   γ = 0.2   γ = 0.4   γ = 0.6   γ = 0.8   γ = 1.0
1           82.9      78.7      81.7      78.5      77.7      77.9
2           72.5      73.5      79.6      66.8      69.8      72.8
3           75.7      72.5      82.7      71.1      71.5      72.8
4           73.8      71.6      78.3      70.3      73.8      74.1
of the activation levels of the central pixel is large; according to equations (12) and (13), this implies that pixels belonging to different clusters try to modify the value of the central pixel. Table 3 displays the averaged separability values for different γ values ranging from 0 to 1 and for four iterations. From Table 3, the best performance during the four iterations is achieved for γ = 0.6. These values are obtained for two iterations of Wishart, where R was 78.29, and this value is improved by the FCM for iterations 2, 3 and 4. This fact verifies the performance of the proposed FCM approach against Wishart. Figure 2(a) displays the classification results obtained by the complex Wishart approach after the two programmed iterations, based on the reasoning expressed above. Figure 2(b) displays the classification results obtained by the proposed FCM approach after two iterations according to the discussion above, based on the results shown in Table 3, i.e. N_i^8 and γ = 0.6 at iteration 2. The colour bar, at the right, indicates the colour assigned to each of the nine clusters; the number i identifies the cluster wi. The data and classification results in figure 2(a) are the inputs for the FCM approach. Figures 2(c)-(f) display two expanded areas extracted from the images in (a) and (b), obtained with Wishart (c,d) and FCM (e,f).
Fig. 2. (a) Classification by Wishart after two iterations; (b) classification obtained by FCM after two iterations with a neighbourhood of 3 × 3 and γ = 0.6; (c)-(d) expanded areas obtained with Wishart; (e)-(f) the same areas processed with FCM
3.3 Discussion
Two points of view may be considered for the analysis: quantitative and qualitative. From the quantitative point of view, our FCM approach was focused on achieving maximum cluster separability. According to the results in Tables 2 and 3, the best performance in terms of separability is achieved with two iterations under a neighbourhood of N_i^8, where the average cluster separability is lower than the one obtained with the Wishart approach. More than two iterations produce an over-homogenization effect, and all pixels in a given region are re-classified as belonging to the same cluster. The results obtained by the FCM represent a quality improvement with respect to the result obtained by the complex Wishart approach, both displayed in figure 2. As in [9], we have verified that during the FCM process several cluster centres are shifted between the clusters, changing their positions. In the referenced work some qualitative improvements are justified based on this fact, and the same is therefore applicable to our approach, as described below. From the qualitative point of view, the observation of the images in figure 2 allows us to make the following considerations:
a) The high-entropy vegetation consisting of grass and bushes belonging to the cluster w2 has been clearly homogenized; this is because many pixels belonging to w4 in these areas are re-classified as belonging to w2.
b) The three distinct surface scattering mechanisms from the ocean surface identified in [9] are clearly displayed in figure 2(b), i.e. they appear under the clusters labelled as w6 (area of high entropy), w8 (ripples, near the coast) and w9 (smooth ocean surface).
c) Also, in accordance with [9], the areas with abundant city blocks display medium-entropy scattering. We have homogenized the city block areas by removing pixels in such areas that belong to clusters w1 and w2, so that they are re-classified as belonging to w4 and w5, as expected.
d) Some structures inside other broader regions are well isolated. This occurs in the rectangular area corresponding to a park, where the internal structures with high entropy are clearly visible [14].
e) Additionally, the homogenization effect can be interpreted as a mechanism for speckle noise reduction during the classification phase, avoiding early filtering for classification tasks.
4 Conclusions

We have proposed an effective iterative approach based on the FCM paradigm, which outperforms the complex Wishart approach. The advantages of this mechanism are that it achieves an important homogenization effect in the classified areas while preserving the important structures inside broader areas, and that it has the ability to discover different scattering mechanisms. The main drawbacks of the proposed approach are its high computational cost, about 12 minutes, and the need to set some parameters. Nevertheless, Wishart itself takes about 7 minutes, both being executed under Matlab R2009 on a 2.4 GHz CPU. The effectiveness of this classification approach has been illustrated on the well-tested NASA/JPL AIRSAR L-band data of San Francisco Bay.
In the future, the computational cost could be reduced through parallel implementations. Also, because the iterative process has proven effective, we expect that other methods of the same nature will be investigated in the near future. Acknowledgements. This work has been partially funded by the Ramon y Cajal program and the National I+D project TEC2007-65690/TCM.
References
1. Lee, J.S., Grunes, M.R., Pottier, E., Ferro-Famil, L.: Unsupervised Terrain Classification Preserving Scattering Characteristics. IEEE Trans. Geoscience Remote Sens. 42(4), 722–731 (2004)
2. Lee, J.S., Pottier, E.: Polarimetric Radar Imaging: From Basics to Applications. CRC Press, Boca Raton (2009)
3. Sharma, J.J., Hajnsek, I., Papathanassiou, K.P., Moreira, A.: Polarimetric Decomposition over Glacier Ice using Long-Wavelength Airborne PolSAR. IEEE Trans. Geoscience Remote Sens. 49(1), 519–535 (2011)
4. Ferro-Famil, L., Pottier, E., Lee, J.S.: Unsupervised Classification of Multifrequency and Fully Polarimetric SAR Images Based on the H/A/Alpha-Wishart Classifier. IEEE Trans. Geoscience Remote Sensing 39(11), 2332–2342 (2001)
5. Wang, Y., Han, C., Tupin, F.: PolSAR Data Segmentation by Combining Tensor Space Cluster Analysis and Markovian Framework. IEEE Geoscience and Remote Sensing Letters 7(1), 210–214 (2010)
6. Du, L., Lee, J.S.: Fuzzy classification of earth terrain covers using complex polarimetric SAR data. Int. J. Remote Sensing 17(4), 809–826 (1996)
7. Cloude, S.R., Pottier, E.: Application of the H/A/α polarimetric decomposition theorem for land classification. In: Proc. SPIE, vol. 3120, pp. 132–143 (1997)
8. Cloude, S.R., Pottier, E.: An entropy based classification scheme for land applications of POLSAR. IEEE Trans. Geosci. Remote Sensing 35(1), 68–78 (1997)
9. Lee, J.S., Grunes, M.R., Ainsworth, T., Du, L.J., Schuler, D., Cloude, S.: Unsupervised classification using polarimetric decomposition and the complex Wishart classifier. IEEE Trans. Geosci. Remote Sensing 37(5), 2249–2258 (1999)
10. Ersahin, K., Cumming, I.G., Ward, R.K.: Segmentation and Classification of Polarimetric SAR Data Using Spectral Graph Partitioning. IEEE Trans. Geoscience Remote Sens. 48(1), 164–174 (2010)
11. Tsardias, A.K., Margaritis, K.G.: An experimental study of the dynamics of the certainty neuron fuzzy cognitive maps. Neurocomputing 24, 95–116 (1999)
12. Kosko, B.: Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence. Prentice-Hall, NJ (1992)
13. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell. 1, 224–227 (1979)
14. Google Earth, The Google Earth product family (2011), http://www.google.es/intl/es_es/earth/download/ge/agree.html
Swarm Intelligence Based Searching Schemes for Articulated 3D Body Motion Tracking
Bogdan Kwolek, Tomasz Krzeszowski, and Konrad Wojciechowski
Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa
http://www.pjwstk.edu.pl
Abstract. We investigate swarm intelligence based searching schemes for effective articulated human body tracking. The fitness function is smoothed in an annealing scheme and then quantized. This allows us to extract a pool of candidate best particles. The algorithm selects a global best from such a pool. We propose a global-local annealed particle swarm optimization to alleviate the inconsistencies between the observed human pose and the estimated configuration of the 3D model. At the beginning of each optimization cycle, estimation of the pose of the whole body takes place, and then the limb poses are refined locally using a smaller number of particles. The investigated searching schemes were compared through qualitative visual evaluations as well as quantitatively through the use of motion capture data as ground truth. The experimental results show that our algorithm outperforms the other swarm intelligence searching schemes. The images were captured using a multi-camera system consisting of calibrated and synchronized cameras.
1 Introduction
Vision-based tracking of human bodies is a significant problem due to various potential applications such as user-friendly interfaces, virtual reality, surveillance, clinical analysis and sport. The aim of articulated body tracking is to estimate the joint angles of the human body at any time. It is one of the most challenging problems in computer vision due to body self-occlusions, the high-dimensional and nonlinear state space, and the large variability in human appearance. To alleviate some of the difficulties, much previous work has investigated the use of 3D human body models of various complexity to recover the position, orientation and joint angles from 2D image sequences [3][4][11][13][15]. An articulated human body can be considered as a kinematic chain consisting of at least eleven segments, corresponding naturally to body parts. This means that around twenty-six parameters might be needed to describe the full body pose. The state vectors describing the human pose are computed by fitting the articulated body model to the observed person's silhouette. Thus, the 3D model-based approaches rely on searching the pose space to find the geometrical configuration of the 3D model that best matches the current image observations [12]. In order to cope with practical difficulties arising from occlusions and depth ambiguities, multiple cameras and a simplified background are typically used by the research community [13][15].
2 Searching Schemes for Articulated Human Body Tracking
In articulated 3D human body tracking, techniques based on particle filtering are widely used. Particle filters [5] are recursive Bayesian filters that are based on Monte Carlo simulations. They approximate a posterior distribution for the configuration of a human body given a series of observations. The high dimensionality of articulated body motion requires a huge number of particles to represent the posterior probability of the states well. In such spaces, sample impoverishment may prevent particle filters from maintaining a multimodal distribution for long periods of time. Therefore, many efforts have been spent on developing methods for confining the search space to promising regions containing the true body pose. Deutscher and Reid [3] proposed an annealed particle filter, which combines an annealing scheme with stochastic sampling to concentrate the particle spread near the global maximum. In the discussed approach the fitness function is smoothed using annealing factors 0 = α1 < α2 , . . . , < αn = 1, and the particles migrate towards the global maximum without getting stuck in local maxima. Additionally, a crossover operation is utilized in order to maintain the diversity of the particles. The configuration space can also be constrained using a hierarchical search. In such an approach, a part of the articulated model is localized independently in advance, and then its location is used to constrain the search for the remaining limbs. In [6], an approach called search space decomposition is proposed, where on the basis of color cues the torso is localized first and then used to confine the search for the limbs. However, in realistic scenarios, among others due to occlusions, it is not easy to localize the torso and to reliably extract such a good starting guess for the search. Compared with the ordinary particle filter, the annealed particle filter greatly improves the tracking performance. However, it still requires a considerable number of particles. Since the particles do not exchange information and do not communicate with each other, they have a reduced capability of focusing the search on regions of interest depending on the previously visited values. In contrast, particle swarm optimization (PSO) [7], which is a population-based searching technique, has high search efficiency by combining local search (by self experience) and global search (by neighboring experience). In particular, a few simple rules result in highly effective exploration of the search space. The PSO is initialized with a group of random particles (hypothetical solutions) and then searches the hyperspace (i.e. Rn) of a problem for optima. Particles move through the solution space and undergo evaluation according to some fitness function f(). Much of the success of PSO algorithms comes from the fact that individual particles have a tendency to diverge from the best known position in any given iteration, enabling them to ignore local optima, while the swarm as a whole gravitates towards the global extremum. If the optimization problem is dynamic, the aim is no longer to seek the extrema, but to follow their progression through the space as closely as possible. Since the object tracking process is a dynamic optimization problem, the tracking can be achieved through
incorporating the temporal continuity information into the traditional PSO algorithm. This means that the tracking can be accomplished by a sequence of static PSO-based optimizations determining the best person's pose, followed by re-diversification of the particles to cover the possible states in the next time step. In the simplest case, the re-diversification of particle i can be realized as follows:

x_t^{(i)} ← N(x̂_{t−1}, Σ)    (1)

where x̂_{t−1} is the estimate of the state at time t − 1. In order to improve the convergence speed, Clerc and Kennedy [2] proposed to use the constriction factor ω in the following form of the equation for the calculation of the i-th particle's velocity:

v_{i,k+1} = ω [ v_{i,k} + c1 r1 (p_i − x_{i,k}) + c2 r2 (g − x_{i,k}) ]    (2)

where constants c1 and c2 are used to balance the influence of the individual's knowledge and that of the group, respectively, r1 and r2 are uniformly distributed random numbers, x_i is the position of the i-th particle, p_i is the local best position of the particle, and g stands for the global best position. In our approach the value of ω depends on the annealing factor α in the following manner:

ω = −0.8 α + 1.4    (3)

where α = 0.1 + k/(K+1), k = 0, 1, . . . , K, and K is the number of iterations. The annealing factor is also used to smooth the objective function. The larger the iteration number, the smaller the smoothing. In consequence, in the last iteration the algorithm utilizes the non-smoothed function. The algorithm, which we term annealed particle swarm optimization (APSO), can be expressed in the following pseudo-code:
 1. For each particle i
 2.     initialize v_t^{i,0}
 3.     x_t^{i,0} ∼ N(g_{t−1}, Σ0)
 4.     p_t^i = x_t^{i,0},  f_t^i = f(x_t^{i,0})
 5.     u_t^i = f_t^i,  ũ_t^i = (u_t^i)^{α0}
 6. i* = arg min_i ũ_t^i,  g_t = p_t^{i*},  w_t = u_t^{i*}
 7. For k = 0, 1, . . . , K
 8.     update ω_α on the basis of (3)
 9.     G = arg min_i round(num_bins · ũ_t^i)
10.     For each particle i
11.         select a particle from {G ∪ g_t} and assign it to g_t^i
12.         v_t^{i,k+1} = ω_α [ v_t^{i,k} + c1 r1 (p_t^i − x_t^{i,k}) + c2 r2 (g_t^i − x_t^{i,k}) ]
13.         x_t^{i,k+1} = x_t^{i,k} + v_t^{i,k+1}
14.         f_t^i = f(x_t^{i,k+1})
15.         if f_t^i < u_t^i then p_t^i = x_t^i,  u_t^i = f_t^i,  ũ_t^i = (u_t^i)^{αk}
16.         if f_t^i < w_t then g_t = x_t^i,  w_t = f_t^i
The smoothed objective functions undergo quantization, which constrains the real values to a relatively small discrete set of bin values (integers), see the 9th line in the pseudo-code. Thanks to this operation, similar function values are clustered into the same bins. In each iteration the algorithm determines the set G of the particles which, after the quantization of the smoothed fitness function from the previous iteration, assumed the smallest values (the best fitness scores), see the 9th line in the pseudo-code. For each particle i the algorithm selects the global best particle g_t^i from {G ∪ g_t}, where g_t denotes the current global best particle of the swarm. That means that the whole swarm selects the global best location from a set of candidate best locations. Figure 1 depicts the number of particles in each bin, determined in one of the experiments, where 200 particles, 10 iterations and a quantization into 30 bins were employed in estimating the human pose.
Fig. 1. Number of the particles in each bin in iterations 1,..,10
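The annealed selection of the global best from a pool of quantized candidates is the part of APSO that differs most from standard PSO. The following sketch illustrates one APSO iteration in NumPy under simplifying assumptions: the fitness function, the constants c1 and c2, and the normalization of the smoothed scores to [0, 1] before binning are placeholders chosen for the example, the fitness is assumed positive and minimized, and smoothing is approximated by raising the fitness to the power α as in the pseudo-code; it is not the authors' implementation.

```python
import numpy as np

def apso_iteration(x, v, p, p_fit, g, g_fit, fitness, k, K,
                   c1=2.05, c2=2.05, num_bins=30):
    """One iteration of annealed PSO (sketch). x, v: (N, D); p, g: personal/global bests."""
    alpha = 0.1 + k / (K + 1.0)          # annealing factor
    omega = -0.8 * alpha + 1.4           # annealed constriction factor, Eq. (3)

    # quantize the smoothed (annealed) personal-best scores into bins
    u_tilde = np.power(p_fit, alpha)
    bins = np.round(num_bins * u_tilde / (u_tilde.max() + 1e-12)).astype(int)
    pool = p[bins == bins.min()]         # pool G of candidate global bests

    n, d = x.shape
    for i in range(n):
        # each particle picks its own global best from the pool (plus the swarm best g)
        candidates = np.vstack([pool, g])
        g_i = candidates[np.random.randint(len(candidates))]
        r1, r2 = np.random.rand(d), np.random.rand(d)
        v[i] = omega * (v[i] + c1 * r1 * (p[i] - x[i]) + c2 * r2 * (g_i - x[i]))
        x[i] = x[i] + v[i]
        f_i = fitness(x[i])
        if f_i < p_fit[i]:               # update personal best
            p[i], p_fit[i] = x[i].copy(), f_i
        if f_i < g_fit:                  # update swarm best
            g, g_fit = x[i].copy(), f_i
    return x, v, p, p_fit, g, g_fit
```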
In [15], an annealed PSO based particle filter has been proposed and evaluated for tracking an articulated 3D human body. Our approach is different from the discussed algorithm; in particular, it relies on an annealing scheme that is based on the smoothing and the quantization of the fitness function. Additionally, the constriction factor that controls the convergence rate depends on the annealing factor. In the global-local particle swarm optimization (GLPSO) algorithm [9], at the beginning of each frame we estimate the pose of the whole body using PSO. Given the pose of the whole body, we construct state vectors consisting of the estimated state variables for the pelvis and torso/head, the arms and the legs. At this stage the state variables describing the pose of the legs are perturbed by normally distributed noise. Afterwards, we execute particle swarm optimization in order to calculate a refined estimate of the leg poses. Such refined state variables are then placed in the state vector of the whole body. The state variables describing the hands are refined analogously. Our global-local annealed particle swarm optimization algorithm (GLAPSO) operates analogously, but instead of the ordinary PSO algorithm we employ APSO optimizations. A sketch of this global-local refinement scheme is given below.
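The following sketch shows only the control flow of the global-local scheme described above, assuming that an optimizer such as the APSO routine sketched earlier is available; the state-vector index ranges (pelvis/torso, arms, legs), the particle budgets and the perturbation scales are illustrative placeholders, not the authors' exact configuration.

```python
import numpy as np

def glapso_frame(prev_pose, fitness, optimize, sigma_full=0.1, sigma_limb=0.05):
    """Global-local pose estimation for one frame (illustrative sketch).

    prev_pose : (26,) pose estimated in the previous frame
    fitness   : callable evaluating a full 26-DOF pose
    optimize  : callable (fitness, init_mean, sigma, n_particles) -> best pose,
                e.g. a wrapper around the APSO iteration sketched above
    """
    # hypothetical index ranges within the 26-DOF state vector
    LEGS = slice(14, 26)
    ARMS = slice(6, 14)

    # 1) global step: estimate the whole-body pose (e.g. 200 particles)
    pose = np.asarray(optimize(fitness, prev_pose, sigma_full, n_particles=200))

    # 2) local step: refine the legs with a smaller swarm (e.g. 50 particles)
    def legs_fitness(legs):
        trial = pose.copy()
        trial[LEGS] = legs                 # only the leg variables are optimized
        return fitness(trial)
    pose[LEGS] = optimize(legs_fitness, pose[LEGS], sigma_limb, n_particles=50)

    # 3) local step: refine the arms analogously
    def arms_fitness(arms):
        trial = pose.copy()
        trial[ARMS] = arms
        return fitness(trial)
    pose[ARMS] = optimize(arms_fitness, pose[ARMS], sigma_limb, n_particles=50)

    return pose
```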
3 Tracking Framework
The skeleton of the human body is modeled as a kinematic tree. The articulated 3D model consists of eleven segments with limbs represented by truncated cones, which model the pelvis, torso/head, upper and lower arms and legs. The configuration of the model is defined by 26 DOF. It is parameterized by the position and the orientation of the pelvis in the global coordinate system and by the relative angles between the connected limbs. In order to obtain the 3D human pose, each truncated cone is projected into the 2D image plane via perspective projection. In such a way we obtain an image with the rendered model in a given configuration. Such image features are then matched to the person extracted by image analysis. In most of the approaches to articulated 3D human body tracking, the cameras are static and background subtraction algorithms are utilized to extract the object of interest [14]. In addition, image cues like edges, ridges and color are frequently used to obtain a better delineation of the person [11]. In [10], face detection, head-shoulders contour matching and elliptical skin-blob detection techniques were used to estimate 3D human poses in static images. In our approach, the background subtraction algorithm [1] is used to extract the binary image of the person, see Fig. 2b). It is then used to calculate a silhouette-overlap term.
Fig. 2. Person segmentation. Input image a), foreground b), gradient magnitude c), edge distance map d), skin color patches e), extracted body segments f).
Image edges complement silhouette features and contribute toward precisely aligning the body limbs. The most common type of edge detection process uses a gradient operator. Gradient features share many properties with optical flow. In particular, they do not depend on background subtraction. The gradient angle is invariant to global changes of image intensities. In contrast to optical flow, gradient features are discriminative for both moving and non-moving body parts. In our approach, the gradient magnitude, see Fig. 2c), is masked with the closed image of the foreground and then used to generate the edge distance map, see also Fig. 2d). It assigns each pixel a value that is the distance between that pixel and the nearest nonzero edge pixel. In our implementation we employ the chessboard distance and limit the number of iterations of the chain propagation to three. Additionally, in the GLPSO and GLAPSO algorithms we perform the segmentation of the person's shape into individual body parts. To accomplish this we model the distribution of skin color using a 16 × 16 histogram in rg color space. The histogram back-projection is employed to identify the skin patches and to
extract the skin binary masks, see Fig. 2e). Such masks are then used to delineate the skin segments in the person binary images. Taking into account the height of the extracted person we perform a rough segmentation of the legs and feet, see Fig. 2f). The fitness score is calculated on the basis of the following expression: f(x) = 1 − (f1(x)^{w1} · f2(x)^{w2}), where the w's denote weighting coefficients that were determined experimentally. The function f1(x) reflects the degree of overlap between the segmented body parts and the model's parts projected into the 2D image. The overlap degree is calculated by checking the overlap from the binary image to the considered rasterized image of the model and vice versa. The larger the degree of overlap, the larger the value of f1(x). In the GLPSO and GLAPSO algorithms the silhouette-overlap term is calculated with consideration of the distinguished body parts. The second function reflects the degree of overlap between model edges and image edges. At this stage the above-mentioned edge-proximity term is utilized.
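As an illustration of how such a fitness score can be assembled, the sketch below combines a bidirectional silhouette-overlap term with an edge-proximity term computed from a distance map; the exact form of the two terms, the weighting exponents and the use of OpenCV's distanceTransform are assumptions made for the example, not the exact implementation described above.

```python
import numpy as np
import cv2

def chessboard_edge_distance(edges_u8):
    """Chebyshev (chessboard) distance from every pixel to the nearest edge pixel."""
    return cv2.distanceTransform(255 - edges_u8, cv2.DIST_C, 3)

def fitness(model_mask, model_edges, person_mask, edge_dist_map, w1=1.0, w2=1.0):
    """f(x) = 1 - (f1^w1 * f2^w2): lower is better (sketch with assumed weights).

    model_mask    : binary rasterization of the projected 3D model
    model_edges   : binary edge image of the projected model
    person_mask   : binary foreground (person) image
    edge_dist_map : distance to the nearest image edge pixel
    """
    eps = 1e-9
    # f1: bidirectional silhouette overlap (image -> model and model -> image)
    inter = np.logical_and(model_mask, person_mask).sum()
    f1 = 0.5 * (inter / (person_mask.sum() + eps) + inter / (model_mask.sum() + eps))

    # f2: edge proximity, high when model edges fall close to image edges
    d = edge_dist_map[model_edges > 0]
    f2 = 1.0 / (1.0 + d.mean()) if d.size else 0.0

    return 1.0 - (f1 ** w1) * (f2 ** w2)
```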
4 Experimental Results
The algorithms were compared both through qualitative visual evaluations and quantitatively through the use of motion capture data as ground truth. The images were captured using a multi-camera system consisting of four calibrated and synchronized cameras. The system acquires color images of size 1920 × 1080 at a rate of 24 fps. Each pair of cameras is approximately perpendicular to the other two. The placement of the video cameras in our laboratory is shown in Fig. 3. A commercial motion capture (MoCap) system from Vicon Nexus provides ground-truth motion of the body at a rate of 100 Hz. It utilizes reflective markers and sixteen cameras to recover the 3D locations of such markers. The cameras are all digital and are able to
Fig. 3. Layout of the laboratory and camera setup. The images illustrate human motion tracking in frame #20 seen in view 1 and 2 (upper row), and in view 3 and 4 (bottom row).
Swarm Intelligence Based Searching Schemes
121
differentiate overlapping markers from each camera’s view. The synchronization between the MoCap and multi-camera system is done through hardware from Vicon Giganet Lab. Figure 4 demonstrates some tracking results that were obtained using particle swarm optimization (PSO), see images in the first row, global-local PSO, see images in the second row, annealed PSO, see images in the third row, and globallocal annealed PSO, see images in the last row. Each image consists of two subimages, where the left sub-image contains the model overlaid on the images from the view 1, whereas the second one illustrates the model overlaid on the images from the view 4. In the experiments presented below we focused on analyses of motion of walking people with bared and freely swinging arms. The analysis of the human way of walking, termed gait, can be utilized in several applications ranging from medical applications to surveillance. This topic is now a very active research area in the vision community.
Fig. 4. Articulated 3D human body tracking. Shown are results in frames #20, 40, 60, 80, 100, 120, 140, obtained by PSO (1st row), GLPSO (2nd row), APSO (3rd row), GLAPSO (4th row), respectively. The left sub-images are seen from view 1, whereas the right ones are seen from view 4.
For fairness, in all experiments we use identical particle configurations. For the global-local PSO and the global-local annealed PSO, the total number of particles responsible for tracking the whole body, the arms and the legs corresponds to the number of particles in the PSO and APSO. For instance, the use of 300 particles in PSO or APSO is equivalent to the use of 200 particles for tracking the full body, 50 particles for tracking the arms and 50 particles for tracking both legs
in GLPSO or GLAPSO. The use of 200 particles in PSO and APSO corresponds to the use of 150, 25 and 25 particles, respectively, whereas the use of 100 particles corresponds to 80 particles for tracking the global configuration of the body, along with 10 and 10 particles for tracking the hands and the legs, respectively. In Tab. 1 we can see some quantitative results that were obtained using two image sequences. For each sequence the results were averaged over ten runs with different initializations. They were achieved using image sequences consisting of 150 and 180 frames. In the quantization the number of bins was set to 30. Figure 4 demonstrates the images from sequence two.

Table 1. Average errors for M = 39 markers in two image sequences

                              Seq. 1                      Seq. 2
Method   #particles  it.   error [mm]  std. dev. [mm]   error [mm]  std. dev. [mm]
PSO      100         10    86.17       50.58            73.89       35.98
PSO      100         20    77.71       39.36            67.58       32.15
PSO      300         10    75.31       41.50            65.56       30.26
PSO      300         20    75.11       38.42            63.43       28.63
GLPSO    100         10    80.95       42.69            68.50       32.00
GLPSO    100         20    67.66       27.15            67.17       30.08
GLPSO    300         10    68.58       30.98            64.40       28.01
GLPSO    300         20    67.96       30.03            62.87       26.00
APSO     100         10    71.56       36.26            65.04       29.74
APSO     100         20    68.81       31.87            61.29       26.86
APSO     300         10    66.51       29.63            61.78       26.69
APSO     300         20    64.63       28.91            59.70       24.98
GLAPSO   100         10    69.44       31.21            63.37       30.74
GLAPSO   100         20    63.71       28.79            60.42       26.72
GLAPSO   300         10    60.07       21.07            60.71       24.41
GLAPSO   300         20    58.96       19.43            57.62       22.49
The pose error in each frame was calculated using M = 39 markers m_i(x), i = 1, . . . , M, where m_i ∈ R^3 represents the location of the i-th marker in the world coordinates. It was expressed as the average Euclidean distance:

E(x, x̂) = (1/M) Σ_{i=1}^{M} ||m_i(x) − m_i(x̂)||    (4)
where m_i(x) stands for the marker position calculated using the estimated pose, whereas m_i(x̂) denotes the position determined using data from our motion capture system. From the above set of markers, four markers were placed on the head, seven markers on each arm, six on the legs, five on the torso and four markers were attached to the pelvis. For the discussed placement of
the markers on the human body the corresponding marker's assignment on the 3D model was established. Given the estimated human pose the corresponding 3D positions of virtual markers were determined. On the basis of data stored in c3d files the ground truth was extracted and then utilized in calculating the average Euclidean distance given by (4). The errors that are shown in Tab. 1 were calculated on the basis of the following equation:

Err(x, x̂) = (1/(L M)) Σ_{k=1}^{L} Σ_{i=1}^{M} ||m_i(x) − m_i(x̂)||    (5)
where L denotes the number of frames in the test sequences. As Tab. 1 shows, the particle swarm optimization algorithm allows us to obtain quite good results. The GLPSO outperforms the PSO algorithm. However, such considerably better results can only be obtained if skin and leg segmentation is used in GLPSO. Due to the global-local searching strategy, the GLPSO algorithm is superior in utilizing the information about the location of the hands and the legs. The APSO algorithm tracks better in comparison to GLPSO and PSO. Moreover, it requires no segmentation of skin patches or body parts. The GLAPSO takes advantage of both algorithms and achieves the best results. The plots shown in Fig. 5 illustrate the pose error versus frame number obtained by each algorithm. We can see that the PSO-based algorithm provides estimates with errors that are sporadically larger than 100 mm. The average error of the GLAPSO is far below 60 mm. In the GLPSO algorithm, at the beginning of each PSO cycle, the estimation of the pose of the whole body takes place and then the poses of the limbs are refined locally using a smaller number of particles. In the algorithms based on annealing the particles are beforehand weighted by smoothed versions of the weighting function, where the
Fig. 5. Pose error [mm] versus frame number
influence of the local minima is weakened first but increases gradually. This leads to consistent tracking of the 3D articulated body motion. In consequence, the errors of our algorithm, which takes advantage of the two different strategies for exploration of the search space, are far smaller. The discussed algorithm is capable of achieving better results because of its ability to think globally and act locally. As a result, the GLAPSO algorithm outperforms the GLPSO and APSO algorithms both in terms of the tracking accuracy as well as the consistency of tracking the human motion. In particular, the standard deviation is far smaller in comparison to the standard deviation of the other investigated algorithms. Figure 6 presents the pose estimation errors for particular body parts. The results were obtained in 20 iterations using the GLAPSO algorithm with 200 particles for the whole body tracking and 2 × 50 particles for tracking the arms and legs. As we can observe, our algorithm can track body limbs with lower errors and it is robust to ambiguous configurations such as self-occlusion. In more detail, from the discussed plot we can see that for all body parts except the left forearm the maximal error does not exceed 100 mm. The complete human motion capture system was written in C/C++. The system runs on Windows and Linux in both 32-bit and 64-bit modes. It operates on color images with a spatial resolution of 960 × 540. The entire tracking process takes approximately 2.1 sec. per frame on an Intel Core i5 2.8 GHz using a configuration with 100 particles and 10 iterations. If Open Multi-Processing (OpenMP) is employed, the tracking is completed in 1.12 sec. The image processing and analysis takes about 0.45 sec. One of the future research directions of the presented approach is to explore CUDA/GPU in order to speed up the computations [8].
Fig. 6. Tracking errors [mm] versus frame number
5 Conclusions
We have presented a vision system that effectively utilizes swarm intelligence searching schemes to achieve better articulated 3D human body tracking. By combining two searching strategies, namely annealed and global-local, the proposed method can tackle the inconsistency between the observed body pose and the estimated model configurations. Due to its better capability of exploring the search space, the combination of the above-mentioned searching strategies leads to superior tracking of the articulated 3D human motion. Our global-local annealed (GLAPSO) algorithm is able to track the articulated 3D human motion reliably in multi-camera image sequences. In particular, the resulting algorithm is robust to ambiguous body configurations such as self-occlusion. Moreover, it performs satisfactorily even when a small number of particles is employed, say 100 particles and 10 iterations. The fitness function is smoothed in an annealing scheme and then quantized. This allows us to maintain a pool of candidate best particles. Furthermore, the constriction factor that controls the convergence rate depends on the annealing factor. To show the advantages of our algorithm, we have conducted several experiments on walking sequences and investigated global-local and annealed searching strategies. The algorithms were compared both through qualitative visual evaluations and quantitatively through the use of motion capture data as ground truth. Acknowledgment. This paper has been supported by the project “System with a library of modules for advanced analysis and an interactive synthesis of human motion” co-financed by the European Regional Development Fund under the Innovative Economy Operational Programme - Priority Axis 1.
References
1. Arsic, D., Lyutskanov, A., Rigoll, G., Kwolek, B.: Multi camera person tracking applying a graph-cuts based foreground segmentation in a homography framework. In: IEEE Int. Workshop on Performance Evaluation of Tracking and Surveillance, pp. 30–37. IEEE Press, Piscataway (2009)
2. Clerc, M., Kennedy, J.: The particle swarm - explosion, stability, and convergence in a multidimensional complex space. IEEE Trans. on Evolutionary Computation 6(1), 58–73 (2002)
3. Deutscher, J., Blake, A., Reid, I.: Articulated body motion capture by annealed particle filtering. In: IEEE Int. Conf. on Pattern Recognition, pp. 126–133 (2000)
4. Deutscher, J., Reid, I.: Articulated body motion capture by stochastic search. Int. J. Comput. Vision 61(2), 185–205 (2005)
5. Doucet, A., Godsill, S., Andrieu, C.: On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10(1), 197–208 (2000)
6. Gavrila, D.M., Davis, L.S.: 3-D model-based tracking of humans in action: a multiview approach. In: Proc. of the Conf. on Computer Vision and Pattern Recognition (CVPR 1996), pp. 73–80. IEEE Computer Society, Washington, DC (1996)
7. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proc. of IEEE Int. Conf. on Neural Networks, pp. 1942–1948. IEEE Press, Piscataway (1995)
8. Krzeszowski, T., Kwolek, B., Wojciechowski, K.: GPU-accelerated tracking of the motion of 3D articulated figure. In: Bolc, L., Tadeusiewicz, R., Chmielewski, L.J., Wojciechowski, K. (eds.) ICCVG 2010. LNCS, vol. 6374, pp. 155–162. Springer, Heidelberg (2010)
9. Krzeszowski, T., Kwolek, B., Wojciechowski, K.: Model-based 3D human motion capture using global-local particle swarm optimizations. In: Burduk, R., Kurzyński, M., Woźniak, M., Żołnierek, A. (eds.) Computer Recognition Systems 4. Advances in Intelligent and Soft Computing, vol. 95, pp. 297–306. Springer, Heidelberg (2011)
10. Lee, M.W., Cohen, I.: A model-based approach for estimating human 3D poses in static images. IEEE Trans. Pattern Anal. Mach. Intell. 28, 905–916 (2006)
11. Schmidt, J., Fritsch, J., Kwolek, B.: Kernel particle filter for real-time 3D body tracking in monocular color images. In: IEEE Int. Conf. on Face and Gesture Rec., pp. 567–572. IEEE Computer Society Press, Southampton (2006)
12. Sidenbladh, H., Black, M., Fleet, D.: Stochastic tracking of 3D human figures using 2D image motion. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 702–718. Springer, Heidelberg (2000)
13. Sigal, L., Balan, A., Black, M.: HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Int. Journal of Computer Vision 87, 4–27 (2010)
14. Sminchisescu, C., Kanaujia, A., Li, Z., Metaxas, D.: Discriminative density propagation for 3D human motion estimation. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. I:390–I:397 (2005)
15. Zhang, X., Hu, W., Wang, X., Kong, Y., Xie, N., Wang, H., Ling, H., Maybank, S.: A swarm intelligence based searching strategy for articulated 3D human body tracking. In: IEEE Workshop on 3D Information Extraction for Video Analysis and Mining in Conjunction with CVPR, pp. 45–50. IEEE, Los Alamitos (2010)
Combining Linear Dimensionality Reduction and Locality Preserving Projections with Feature Selection for Recognition Tasks
Fadi Dornaika^{1,2}, Ammar Assoum^{3}, and Alireza Bosaghzadeh^{1}
^1 University of the Basque Country, San Sebastian, Spain
^2 IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
^3 LaMA Laboratory, Lebanese University, Tripoli, Lebanon
Abstract. Recently, a graph-based method was proposed for Linear Dimensionality Reduction (LDR). It is based on Locality Preserving Projections (LPP). It has been successfully applied in many practical problems such as face recognition. In order to solve the Small Size Problem that usually affects face recognition, LPP is preceded by a Principal Component Analysis (PCA). This paper has two main contributions. First, we propose a recognition scheme based on the concatenation of the features provided by PCA and LPP. We show that this concatenation can improve the recognition performance. Second, we propose a feasible approach to the problem of selecting the best features in this mapped space. We have tested our proposed framework on several public benchmark data sets. Experiments on ORL, UMIST, PF01, and YALE Face Databases and MNIST Handwritten Digit Database show significant performance improvements in recognition. Keywords: Linear Dimensionality Reduction, Locality Preserving Projections, feature selection, nearest neighbor classifier, object recognition.
1 Introduction
In most computer vision and pattern recognition problems, the large number of sensory inputs, such as images and videos, is computationally challenging to analyze. In such cases it is desirable to reduce the dimensionality of the data while preserving the original information in the data distribution, allowing for more efficient learning and inference. During the last few years, a large number of approaches have been proposed for constructing and computing the embedding. We categorize these methods by their linearity. The linear methods, such as Principal Component Analysis (PCA) [1] and Multi-Dimensional Scaling (MDS) [2], are effective in observing the Euclidean structure. The nonlinear methods, such as Locally Linear Embedding (LLE) [3], Laplacian eigenmaps [4] and Isomap [5], focus on preserving the geodesic distances. Most previous research, e.g. LDA and its variants, aimed to find a good subspace in which the different class masses can be separated in a global way. In general, it might be hard to find a subspace which has good separability for the whole data
set, which motivates us to consider local methods, since empirically the local methods may have stronger discriminative power than global methods. Linear Dimensionality Reduction (LDR) techniques have been increasingly important in pattern recognition [6] since they permit a relatively simple mapping of data onto a lower-dimensional subspace, leading thus to simple and computationally efficient classification strategies. The issue of selecting the new components in the new subspace has not received much attention. In most practical cases, relevant features in the new embedded space are not known a priori. Finding out what features to use in a classification task is referred to as feature selection. Although there has been a great deal of work in machine learning and related areas to address this issue, these results have not been fully explored or exploited in emerging computer vision applications. Only recently there has been an increased interest in deploying feature selection in applications such as face detection, vehicle detection, and pedestrian detection. Most efforts in the literature have largely ignored the feature selection problem and have focused mainly on developing effective feature extraction methods and applying powerful classifiers. The main trend in feature extraction has been to represent the data in a lower-dimensional space computed through a linear or non-linear transformation satisfying certain properties. The goal is to find a new set of features that represents the target concept in a more compact and robust way but that also provides more discriminative information. Recently, a graph-based method was proposed for Linear Dimensionality Reduction (LDR). It is based on Locality Preserving Projections (LPP). LPP is a typical linear graph-based dimensionality reduction (DR) method that has been successfully applied in many practical problems such as face recognition. LPP is essentially a linearized version of Laplacian Eigenmaps. When dealing with face recognition problems, LPP is preceded by a Principal Component Analysis (PCA) step in order to avoid possible singularities. All face recognition works using LPP compute the recognition rates for a wide range of dimensions/eigenvectors. However, no work has dealt with the estimation of the optimal dimension. Furthermore, no work has addressed the selection of the relevant eigenvectors. The main contributions of the paper are as follows. First, we propose a recognition scheme based on concatenating the features provided by PCA and LPP. We show that this concatenation can improve the recognition performance of LPP. Second, we propose a feasible approach to the problem of selecting the best features in this mapped space: how many and which eigenvectors to retain for both PCA and LPP. Our proposed method is inspired by [7]. In [7], the authors use the feature selection paradigm with PCA for the task of object detection. They demonstrate the importance of feature selection (eigenvector selection) in the context of two object detection problems: vehicle detection and face detection. However, our proposed work differs from [7] in three aspects. First, we address Linear Dimensionality Reduction and feature selection for recognition tasks–a more generic problem than object detection. Second, our study is not limited to the use of the classical PCA. Third, we propose a novel scheme for selecting eigenvectors associated with a concatenation of two linear embedding techniques: PCA+LPP.
The remainder of the paper is organized as follows. Section 2 describes the classical Linear Dimensionality Reduction schemes. Section 3 presents the proposed approach for the estimation of embedding and feature selection. Section 4 presents some experimental results obtained with five benchmark data sets.
2 Linear Dimensionality Reduction

In this section, we provide a brief description of PCA and LPP.

2.1 Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a popular linear feature extractor for unsupervised dimensionality reduction [8]. We assume that we have a set of N samples {x_i}_{i=1}^N ⊂ R^D. The mean sample, x̄, is subtracted from every sample x_i. The set of centered samples is represented by a D × N matrix X. Then the covariance matrix is computed as

S = (1/N) X X^T    (1)
The new embedded space will have its origin at x̄ and its principal directions given by the eigenvectors of the covariance matrix S associated with the largest eigenvalues. A new sample x can be approximated by the M principal modes:

x ≈ x̄ + Σ_{l=1}^{M} y_l w_l    (2)
  = x̄ + W y    (3)

where the D × M matrix W represents the first M eigenvectors. The M-dimensional vector y is the low-dimensional representation of the original sample x. It is computed by a simple projection onto the principal modes:

y = W^T (x − x̄)    (4)
The number of principal modes, M, is chosen such that the variability of the retained modes corresponds to a high percentage of the total variability. In brief, the PCA embedding takes as input a set of sample data and produces an orthonormal set of vectors representing the eigenvectors of the sample covariance matrix associated with the M < D largest eigenvalues, i.e., w_1, w_2, . . . , w_M. Turk and Pentland [1] proposed the first application of PCA to face recognition. Since the basis vectors constructed by PCA had the same dimension as the input face images, they were named “Eigenfaces”. Principal Component Analysis (PCA) has been successfully applied to construct linear models of shape, gray-level, and motion. A full review of PCA applications in computer vision is beyond the scope of this paper.
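As a small illustration of equations (1)-(4), the sketch below computes a PCA projection of centered data with NumPy; the choice of M by a retained-variance threshold is an assumption made for the example, not a prescription from the text.

```python
import numpy as np

def pca_fit(X, var_ratio=0.95):
    """X: (D, N) data matrix, one sample per column. Returns the mean and projection W."""
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar                                  # centered samples
    S = Xc @ Xc.T / X.shape[1]                      # covariance matrix, Eq. (1)
    eigval, eigvec = np.linalg.eigh(S)              # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]                # sort by decreasing eigenvalue
    eigval, eigvec = eigval[order], eigvec[:, order]
    # keep M modes covering e.g. 95% of the total variability (assumed criterion)
    M = int(np.searchsorted(np.cumsum(eigval) / eigval.sum(), var_ratio)) + 1
    return x_bar, eigvec[:, :M]

def pca_project(X_new, x_bar, W):
    """Low-dimensional representation y = W^T (x - x_bar), Eq. (4); X_new: (D, n) columns."""
    return W.T @ (X_new - x_bar)
```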
2.2 Locality Preserving Projections (LPP)
Many dimensionality reduction techniques can be derived from a graph whose nodes represent the data samples and whose edges quantify both the similarity and closeness among pairs of samples [9]. Locality Preserving Projection (LPP) [10,11,12] can be seen as a linearized version of Laplacian Eigenmaps (LE) [4]. In [13], the authors have extended the LPP to the supervised case. We assume that we have a set of N samples {x_i}_{i=1}^N ⊂ R^D. Define a neighborhood graph on these data, such as a K-nearest-neighbor or ε-ball graph, or a full mesh, and weigh each edge x_i ∼ x_j by a symmetric affinity function A_ij = K(x_i; x_j), typically Gaussian:

A_ij = exp( − ||x_i − x_j||^2 / β )    (5)

where β is usually set to the average of squared distances between all pairs. Let A denote the symmetric affinity matrix whose elements are defined by (5). We seek latent points {y_i}_{i=1}^N ⊂ R^L that minimize (1/2) Σ_{i,j} ||y_i − y_j||^2 A_ij, which discourages placing far apart latent points that correspond to similar observed points. For the purpose of presentation simplicity, we present the one-dimensional mapping case in which the original data set {x_i}_{i=1}^N is mapped to a line. Let z = (y_1, y_2, . . . , y_N)^T be such a map (a column vector). Note that here every data sample is mapped to a real value. A reasonable criterion for choosing a good map is to optimize the following objective function under some constraints:

min (1/2) Σ_{i,j} (y_i − y_j)^2 A_ij    (6)
Minimizing function (6) imposes a heavy penalty if neighboring points x_i and x_j are mapped far apart. By a simple algebraic manipulation, function (6) can be written as

(1/2) Σ_{i,j} (y_i − y_j)^2 A_ij = (1/2) Σ_{i,j} (y_i^2 A_ij + y_j^2 A_ij − 2 y_i A_ij y_j)
                                = Σ_i y_i^2 D_ii − Σ_{i,j} y_i A_ij y_j
                                = z^T D z − z^T A z = z^T L z    (7)
where D is the diagonal weight matrix, whose entries are column (or row, since A is symmetric) sums of A, and L = D − A is the Laplacian matrix. In the LPP formulation, the latent data are simply given by a linear mapping of the original data. This means that the projection of xi is given by yi = wT xi .
Thus, the one-dimensional map z = (y_1, y_2, . . . , y_N)^T (column vector) is given by

z = X^T w    (8)

where X = (x_1, x_2, . . . , x_N) is the data matrix. Finally, by combining (8) and (7) and by imposing the constraint z^T D z = 1, the minimization problem reduces to finding:

min_w  w^T X L X^T w    s.t.  w^T X D X^T w = 1    (9)

The constraint is used in order to remove the arbitrary scale associated with the mapping. The transformation vector w that minimizes the objective function is given by the minimum-eigenvalue solution to the generalized eigenvalue problem:

X L X^T w = λ X D X^T w    (10)

For a multi-dimensional mapping, each data sample x_i is mapped into a vector y_i. The aim is to compute the projection directions W_LPP = (w_1, w_2, . . . , w_L). These vectors are given by the generalized eigenvectors of (10), ordered according to their eigenvalues, 0 ≤ λ_1 ≤ λ_2 ≤ . . . ≤ λ_L. Then, the mapping of an unseen sample x is given by

y = W_LPP^T x    (11)

2.3 PCA+LPP
In many real-world problems such as face recognition, the matrix X D X^T is sometimes singular. This stems from the fact that the number of images in the training set, N, is often much smaller than the number of pixels in each image. To overcome the complication of a singular X D X^T, the original data are first projected onto a PCA subspace so that the resulting matrix X D X^T is non-singular. This method, when applied to faces, is called Laplacianfaces. The global transform is given by W = W_PCA W_LPP.
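A compact sketch of the PCA+LPP pipeline described above is given below; it reuses the pca_fit routine sketched earlier, uses a full-mesh Gaussian affinity with β set to the average squared pairwise distance, and solves the generalized eigenproblem of equation (10) with SciPy. It is an illustration under these assumptions, not the exact implementation used in the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def lpp_fit(X, L_dim=30):
    """X: (D, N) data with samples as columns. Returns the LPP projection W_LPP."""
    sq = squareform(pdist(X.T, metric="sqeuclidean"))   # pairwise squared distances
    beta = sq.mean()                                     # average squared distance
    A = np.exp(-sq / beta)                               # Gaussian affinities, Eq. (5)
    D = np.diag(A.sum(axis=1))                           # degree matrix
    Lap = D - A                                          # graph Laplacian L = D - A
    # generalized eigenproblem X L X^T w = lambda X D X^T w, Eq. (10)
    eigval, eigvec = eigh(X @ Lap @ X.T, X @ D @ X.T)    # eigenvalues in ascending order
    return eigvec[:, :L_dim]                             # smallest-eigenvalue directions

def pca_lpp_fit(X, var_ratio=0.95, L_dim=30):
    """PCA followed by LPP; the global transform is W = W_PCA W_LPP."""
    x_bar, W_pca = pca_fit(X, var_ratio)                 # pca_fit from the earlier sketch
    Y = W_pca.T @ (X - x_bar)                            # project data to the PCA subspace
    W_lpp = lpp_fit(Y, L_dim=min(L_dim, Y.shape[0]))
    return x_bar, W_pca @ W_lpp                          # global transform W

# embedding of new (column) samples: y = W.T @ (x - x_bar)
```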
3 Feature Selection

3.1 Feature Selection
In a great variety of fields, including pattern recognition and machine learning, the input data are represented by a very large number of features, but only a few of them are relevant for predicting the label. Many algorithms become computationally intractable when the dimension is high. On the other hand, once a good small set of features has been chosen, even the most basic classifiers (e.g., 1-nearest neighbor) can achieve desirable performance. Therefore, feature selection, i.e. the task of choosing a small subset of features which is sufficient to predict the target labels well, is critical to minimize the classification error. At the same time, feature selection also reduces training and inference time and
132
F. Dornaika, A. Assoum, and A. Bosaghzadeh
leads to a better data visualization as well as to a reduction of measurement and storage requirements. Roughly speaking, feature selection algorithms have two key problems: search strategy and evaluation criterion. According to the criterion, feature selection algorithms can be categorized into filter model and wrapper model. In the wrapper model, the feature selection method tries to directly optimize the performance of a specific predictor (classification or clustering algorithm). This is usually achieved through an evaluation function and a given search strategy. The main drawback of this method is its computational deficiency. In the filter model, the feature selection is done as a preprocessing, without trying to optimize the performance of any specific predictor directly [14,15]. A comprehensive discussion of feature selection methodologies can be found in [16,17]. 3.2
Eigenvector Selection for Object Recognition
In the sequel, we will adopt a wrapper technique for eigenvector selection. The evaluation strategy will be directly encoded by the recognition accuracy over validation sets. Without any loss of generality, the classifier used after Linear Dimension Reduction and eigenvector selection will be the KNN classifier. This classifier is one of the oldest and simplest methods for pattern classification and it is one of the top 10 algorithms in data mining [18]. The adopted search strategy will be carried out by a Genetic Algorithm (GA). We use a simple encoding scheme in the form of a bit string whose length is determined by the number of eigenvectors. Each eigenvector is associated with one bit in the string. If the ith bit is 1, then the ith eigenvector is selected, otherwise, that component is ignored. Each string thus represents a different subset of eigenvectors. Evaluation criterion. The goal of feature selection is to use less features to achieve the same or better performance. Therefore, the fitness evaluation contains two terms: (1) accuracy and (2) the number of selected features. The performance of the KNN classifier is estimated using a validation data set which guides the GA search. Each feature subset contains a certain number of eigenvectors. If two subsets achieve the same performance, while containing different number of eigenvectors, the subset with fewer eigenvectors is preferred. Between accuracy and feature subset size, accuracy is our major concern. We used the fitness function shown below to combine the two terms: F itness = c1 Accuracy + c2 Zeros
(12)
where Accuracy corresponds to the classification accuracy on a validation set for a particular subset of eigenvectors, and Zeros corresponds to the number of eigenvectors not selected (i.e., zeros in the individual). c1 and c2 are two positive coefficients with c2 << c1 . In our implementation, c1 was set to 104 and c2 to 1. Search strategy: a genetic algorithm. Genetic Algorithms (GAs) are biologically motivated adaptive systems based on natural selection and genetic
Combining LDR and LPP with Feature Selection for Recognition Tasks
133
recombination [19]. In the standard GA, candidate solutions are encoded as fixed length vectors–strings. The initial population of solutions is chosen randomly. These candidate solutions are allowed to evolve over a certain number of generations. At each generation, the fitness of each string is calculated; this is a measure of how well the string optimizes the objective function. Subsequent generations are created through a process of selection, recombination, and mutation. An individual string fitness is used to probabilistically select which individuals will recombine. Crossover operators merge the information contained within pairs of selected parents by placing random subsets of the information from both parents into their respective positions in a member of the subsequent generation. Due to the random factors involved in producing children strings, the children may, or may not, have higher fitness values than their parents. Nevertheless, because of the selective pressure applied through a number of generations, the overall trend is towards higher fitness strings. Mutations are used to help preserve diversity in the population. Mutations introduce random changes into the chromosomes. The genetic search process is iterative: evaluating, selecting, and recombining strings in the population during each iteration (generation) until reaching some termination condition. The basic algorithm, where P (t) is the population of strings at generation t, is given below: Algorithm GA() 1. t = 0 2. initialize P(t) 3. evaluate P(t) 4. while (termination condition is not satisfied) 5. do 6. begin 7. select P(t+1) from P(t) 8. recombine P(t+1) 9. evaluate P(t+1) 10. t = t +1 11. end The initial population is generated randomly. This, however, would produce a population where each individual contains approximately the same number of 1s and 0s on the average. To explore subsets of different numbers of features, the number of 1s for each individual is generated randomly. Then, the 1s are randomly scattered in the string. In all of our experiments, we used a population size of 100 and 50 generations. In our experiments, we have observed that the GA converged in less than 50 generations. Selection Our selection strategy was cross generational. Assuming a population of size I, the offspring double the size of the population and we select the best I individuals from the combined parentoffspring population. Crossing We use uniform crossing in which each bit of the offspring is selected randomly from the corresponding bits of the parents. The crossover probability used in all of our experiments was 0.6. Mutation We use
134
F. Dornaika, A. Assoum, and A. Bosaghzadeh
the traditional mutation operator which just flips a specific bit with a very low probability. The mutation probability used in all of our experiments was 0.05. Figure 1 illustrates how the fitness of every individual string is evaluated using training and validation sets. The matrix P and L denote the whole set of eigenvectors associated with PCA and LPP, respectively. The matrices P and L denote the selected subset of eigenvectors. We point out that in this proposed scheme both the PCA and LPP output are used to form the new feature vectors. The corresponding recognition rate will be used as the first term in Eq. (12) in order to compute the fitness of the string.
Fig. 1. Evaluating the fitness of a given individual string by the Genetic Algorithm. The matrix P denotes the whole set of eigenvectors associated with PCA. The matrix P denotes a putative subset of eigenvectors. The corresponding recognition rate that combine selected PCA eigenvevtors and selected LPP eigenvectors will be used for computing the fitness of the string (Eq. (12)).
4
Experimental Results
4.1
Benchmark Data Sets
The data sets used are ORL face data set, UMIST face data set, YALE face data set, PF01 face data set and MNIST handwritten digit data set1 . The details of these data sets are described in Table 1. Some samples of these data sets are shown in Figure 2. 1
ORL, UMIST, YALE, PF01, and MNIST data sets can be retrieved respectively from http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html http://www.shef.ac.uk/eee/research/vie/research/face.html http://see.xidian.edu.cn/vipsl/database_Face.html http://nova.postech.ac.kr/special/imdb/imdb.html http://yann.lecun.com/exdb/mnist/
Combining LDR and LPP with Feature Selection for Recognition Tasks
ORL
UMIST
YALE
PF01
MNIST Fig. 2. Some samples of ORL, UMIST, YALE, PF01 and MNIST data sets
135
136
F. Dornaika, A. Assoum, and A. Bosaghzadeh Table 1. Details of benchmark data sets Data set Data set size (N ) Sample dimension (D) Nb. classes (C) ORL 400 10304 40 UMIST 575 10304 20 YALE 165 77760 15 PF01 1819 2304 103 MNIST 7000 784 10
Table 2. Average recognition rates obtained with five data sets using four LPP embedding schemes. The figures in parentheses denote the percentage of the training set.
ORL (50%) UMIST (50%) YALE (70%) PF01 (70%) MNIST (70%)
4.2
LPP LPP with selection PCA:LPP PCA:LPP with selection 85.0% 92.2% 92.2% 95.4% 93.2% 97.9% 96.7% 99.2 % 78.5% 94.0% 81.0% 96.5% 43.0% 70.1% 45.4% 71.6% 72.9% 88.1% 84.2% 93.%
Method Comparison
We empirically evaluate the improvement obtained by our proposed methods on the above data sets. We have performed a number of experiments and comparisons to demonstrate the importance of feature selection for face recognition based on linear embedding. First, recognition experiments are conducted on the data sets using the classical Linear Dimensionality Reduction schemes, i.e. using the top eigenvectors, followed by the KNN classifier (K=1). Second, recognition experiments are conducted on the same data sets using the Linear Reduction schemes with feature selection. In this scheme, we use ten fold cross-validation scheme. For each fold (unless stated otherwise), 50 % of the samples are used for training and the remaining samples are used for testing. Table 2 summarizes the average recognition rate over the 10 partitions for the five data sets. The first column illustrates the average recognition rates obtained by the LPP embedding. The second column corresponds to LPP with feature selection (i.e., the eigenvectors of LPP were selected by a GA). The third column illustrates the average recognition rates obtained by concatenating the PCA output and the LPP output. The fourth column corresponds to the feature selection scheme on the concatenated features. As can be seen, the feature selection framework has provided high recognition rates for the data sets. We can observe that by concatenating the PCA output and the LPP output the recognition rate gets improved for all data sets. We can also observe that the recognition rate is improved by adopting the feature selection paradigm for the concatenated features. Figure 3 illustrates the recognition rates obtained for YALE data set for all 10 partitions. In this experiment, the percentage of training samples was set to 70%. The blue bars correspond to the classical LPP. The red bars correspond to
Combining LDR and LPP with Feature Selection for Recognition Tasks
137
100 90
Recognition rate (%)
80 70 60 50 LPP Selective LPP
40 30 20 10 0
Set 1
Set 2
Set 3
Set 4
Set 5
Set 6
Set 7
Set 8
Set 9
Set 10
Fig. 3. Recognition rates using LPP and selective LPP applied on YALE data set. The blue bars correspond to the classical LPP. The red bars correspond to the selective LPP.
the mapping obtained by the selective LPP. As can been seen, the use of feature selection paradigm has improved the recognition rate. Using Matlab on an Intel Core I3 530 CPU (2.93GHz), the CPU times needed for selecting the LPP features with ORL, UMIST, YALE, PF01 and MNIST data sets are respectively 68.8, 133.8, 8.5, 3983.6 and 110.3 seconds. The number of individuals was set to 100 and the number of generations to 10.
5
Conclusion
We presented a new method for identifying critical eigenvectors for multi-class recognition problems. The eigenvectors correspond to the linear embedding given by a concatenation of Principle Component Analysis (PCA) and Locality Preserving Projections (LPP). The procedure of eigenvectors identification is based on maximizing the recognition rate with a genetic algorithm. Experiments are conducted on five benchmark data sets. These experiments have shown that the recognition rate has increased and that the compression of data is improved. Acknowledgment. This work was supported by the Spanish Government under the project TIN2010-18856.
References 1. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3, 71–86 (1991) 2. Borg, I., Groenen, P.: Modern Multidimensional Scaling: theory and applications. Springer, New York (2005) 3. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2319–2327 (2000) 4. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396 (2003) 5. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000)
138
F. Dornaika, A. Assoum, and A. Bosaghzadeh
6. Martinez, A.M., Zhu, M.: Where are linear feature extraction methods applicable? IEEE Trans. Pattern Analysis and Machine Intelligence 27(12), 1934–1944 (2005) 7. Suna, Z., Bebisa, G., Miller, R.: Object detection using feature subset selection. Pattern Recognition 37, 2165–2176 (2004) 8. Jolliffe, I.T.: Principal Component Analysis. Springer, New York (2002) 9. Yan, S., Xu, D., Zhang, B., Zhang, H., Yang, Q., Lin, S.: Graph embedding and extension: a general framework for dimensionality reduction. IEEE Trans. on Pattern Analysis and Machine Intelligence 29(1), 40–51 (2007) 10. He, X., Niyogi, P.: Locality preserving projections. In: Conference on Advances in Neural Information Processing Systems (2003) 11. He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H.J.: Face recognition using laplacianfaces. IEEE Trans. Pattern Anal. Mach. Intelligence 27(3), 328–340 (2005) 12. Zhang, L., Qiao, L., Chen, S.: Graph-optimized locality preserving projections. Pattern Recognition 43, 1993–2002 (2010) 13. Yu, W., Teng, X., Liu, C.: Face recognition using discriminant locality preserving projections. Image and Vision Computing 24, 239–248 (2006) 14. Mitra, P., Murthy, C., Pal, S.: Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Analysis and Machine Intelligence 24, 301–312 (2002) 15. Liua, H., Suna, J., Liua, L., Zhang, H.: Feature selectionwith dynamic mutual information. Pattern Recognition 43, 1330–1339 (2009) 16. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182 (2003) 17. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowledge Data Engineering 17, 494–502 (2005) 18. Wu, X., Kumar, V., et al.: Top 10 algorithms in data mining. Knowledge Information Systems 14(1), 1–37 (2008) 19. Srinivas, M., Patnaik, L.: Genetic algorithms: a survey. IEEE Computer 27(6), 17–26 (1994)
A New Anticorrelation-Based Spectral Clustering Formulation Julia Dietlmeier, Ovidiu Ghita, and Paul F. Whelan Centre for Image Processing and Analysis, Dublin City University, Dublin, Ireland
[email protected],
[email protected],
[email protected]
Abstract. This paper introduces the Spectral Clustering Equivalence (SCE) algorithm which is intended to be an alternative to spectral clustering (SC) with the objective to improve both speed and quality of segmentation. Instead of solving for the spectral decomposition of a similarity matrix as in SC, SCE converts the similarity matrix to a column-centered dissimilarity matrix and searches for a pair of the most anticorrelated columns. The orthogonal complement to these columns is then used to create an output feature vector (analogous to eigenvectors obtained via SC), which is used to partition the data into discrete clusters. We demonstrate the performance of SCE on a number of artificial and real datasets by comparing its classification and image segmentation results with those returned by kernel-PCA and Normalized Cuts algorithm. The column-wise processing allows the applicability of SCE to Very Large Scale problems and asymmetric datasets. Keywords: Spectral clustering, Image segmentation, Dimensionality reduction, Latent variables.
1
Introduction
Recent years have witnessed an enormous increase in research and applications devoted to spectral clustering (SC) where the problem of grouping is reformulated in an induced feature space. This attention comes not undeserved for several reasons. Firstly, SC can be referred to as a fully unsupervised classification method [13]. Secondly, SC excels in discovering hidden and secondary relationships [20], managing non-convex cluster shapes, non-metric data [14] and noise reduction [11] in a well defined and theoretically sound framework. Finally, the segmentation and grouping based on eigenvectors is able to return the perceptual organization features present in an image [16,5,6]. Conceptually, SC belongs to the domain of manifold learning methods aimed at the unsupervised extraction of a low-dimensional representation [19]. The term spectral therein refers to a broad family of clustering methods that make use of the eigenvectors of some normalized similarity matrix [15]. Different SC algorithms formalize the grouping problem in different ways and differ widely in the retained number and ranking of eigenvectors and matrix normalization steps [8,7]. By far, the most popular application to image segmentation is the normalized cuts (Ncut) J. Blanc-Talon et al. (Eds.): ACIVS 2011, LNCS 6915, pp. 139–149, 2011. c Springer-Verlag Berlin Heidelberg 2011
140
J. Dietlmeier, O. Ghita, and P.F. Whelan
algorithm presented by Shi and Malik in 1997 as the first application of SC to computer vision and image analysis domain [4]. Despite its merits, SC has also its limitations associated with the computational complexity of spectral decomposition [19,6] and the problem of discretization of continuous eigenvectors [16]. Particularly, pixel classification tasks for large indefinite (possibly asymmetric) and fully dense similarity matrices form a considerable computational bottleneck for SC. Given an image with N pixels, the size of the similarity matrix increases to N 2 × N 2 and the decompositionbased implementations of SC become quickly infeasible. This is a well-known fact which has been continually emphasized over the past decade [19,6]. Approaches dealing with this paradigm range from exploiting the sparsity, subsampling of an image or similarity matrix to the low-rank approximation methods such as Nystr¨ om algorithm [21,23]. We infer that this is still an open problem from the most recent work by Chen et al. in [25] where the team of researchers presents a parallel HPC implementation of SC. The other recent work by Tung et al. in [24] approaches the scalability problem of SC by using a combination of blockwise processing and stochastic ensemble consensus. In this contribution we do not seek numeric or platform-related solutions to speed-up SC but rather a method which questions the optimality of eigenvectors. We therefore raise a question if the important aspects of the data can be represented through a less expensive alternative. This consideration opens the door to a much broader range of techniques in statistical machine learning and dimensionality reduction which forms the basis for any SC implementation. Therein, the family of learning methods encompasses but is not limited to principal component analysis (PCA), kernel-PCA, Linear Discriminant Analysis (LDA), other generative, discriminative, latent variable [18] and Independent Component Analysis (ICA) methods [11]. In this paper, we assume a low-dimensional manifold and search for a computationally less expensive alternative to eigenvector-based analysis. We next highlight the idea behind our algorithm that given a separable dataset [18], the core information related to the leading eigenvectors is contained in the columns of a kernel matrix. We call our algorithm Spectral Clustering Equivalence (SCE) and in the next sections we show its connection with kernel-PCA (Sections 2,3), Ncut (Section 4) and the Ising model [3] (Section 5). The major contribution of our paper is the reformulation of the standard spectral clustering through the construction of uncorrelated, orthogonal and centered components without the use of eigenvectors.
2
Development of Our Method
In multidimensional scaling (MDS), the spectral decomposition is carried out on the inner product matrix (Gram) G in the feature space with the main emphasis to preserve the inner-point distances [9]. Let us denote a similarity matrix by S and a dissimilarity matrix by A (see Table 1). Either A or S can be viewed as a dot product matrix G in some feature space according to Sch¨ olkopf and Smola
A New Anticorrelation-Based Spectral Clustering Formulation
141
Table 1. SCE notations
S A C R ρmin c1,2 t1,2 ⇒ M z Φ X ˜ G λmin
similarity matrix kernel matrix (dissimilarity matrix) column-centered kernel matrix correlation matrix computed on the columns of C minimal signed Pearson correlation coefficient, ρ ≡ cos ∠c1 , c2 a pair of most anticorrelated columns of C, ∠c1 , c2 ≡ θ1 → π orthogonal complement to c1,2 , θ2 → π/2 next following operation mixing matrix M = Cz equivalence coefficient centered and uncorrelated SCE feature vectors eigenvector-based feature vectors embedded (psd) Gram matrix [9] minimal eigenvalue (additive shift constant) [9]
[27] or transformed to G with kernelization and normalization [9] according to a particular SC formulation and application [6]. We assume that A is a generally indefinite (possibly asymmetric) matrix and interpret it as a multidimensional space spanned by its columns. Further, we center the columns of A, call the new matrix C and consider a 2-class data partitioning problem. In order to answer the question about which columns in C carry more information about the binary class labels, we proceed with the analysis of linear dependencies present in C. From the related works on linear dependency analysis, Srebro and Jaakkola in [17], for example, also seek to identify a low-dimensional subspace that captures the dependent and the ”important” aspects of the data, and separate them from independent variations. Thus, a natural way to conduct dependency analysis is to analyze correlations between different variables and the first step, prior to applying correlation analysis, is the centering of variables. Contrary to the formation of G in kernel-PCA which involves double-centering [9], our normalization of A in order to obtain C does not involve row centering. Next, we define the correlation between two centered columns, c1 and c2 , according to the formula of Pearson product-moment correlation coefficient [1]: N N N ρ1,2 = c1,i c2,i / c21,i c22,i . (1) i=1
i=1
i=1
A strong negative correlation provides a suitable measure of discrimination according, for example, to [2] and also indicates that the decrease in one variable
142
J. Dietlmeier, O. Ghita, and P.F. Whelan
is controlled by the increase in the second variable. In regards to natural images, it is reasonable to view foreground and background as the two most distinct and thus most anticorrelated image structures. We therefore are interested in the lower bound of ρ ∈ [−1, 1] and define a pair of observations (columns of C) with a strong negative correlation ρij → −1 dissociation patterns. In order to construct the orthogonal and uncorrelated kernel-PCA estimates we first draw on the idea of canonical spaces [28]. Let X and Y be the two unknown subspaces spanned by the columns of C. The largest canonical angle between X and Y is defined as θ(X , Y) = maxx∈X miny∈Y ∠(x, y) [28]. We know that the correlation between centered variables is equivalent to the cosine of the angle between these variables [1]. The cosine of the largest (canonical) angle θ → π can therefore be interpreted as the minimal signed Pearson correlation coefficient between the columns of C, ρmin → −1 [1]. According to the cosine input image ⇒ features ⇒ S ⇒ A ⇒
output Dimensionality ⇒ X1,2 (SC) Reduction Φ1,2 (SCE)
˜ =U ˜ Λ˜U ˜T ⇒ X = U ˜ A ⇒ G = UΛUT ⇒ λmin ⇒ G
Λ˜
set of discrete clusters
⇒ X1,2 . (2)
A ⇒ C ⇒ R ⇒ ρmin ⇒ c1,2 ⇒⊥ ⇒ t1,2 ⇒ Mt1,2 ⇒ decorrelate ⇒ Φ1,2 . (3) with 2 × 2 PCA and center Fig. 1. Algorithm description and the comparison of SC (2) and SCE (3). Both methods take a similarity matrix S as an input and produce a pair of orthogonal and uncorrelated feature vectors. SC is based on the eigen-decomposition of the pseudo-Gram matrix G. Conversely, SCE is a decomposition-free method which is based on the dependency analysis of the column-centered kernel matrix C. Because SCE works on the columns of C, it can generally be applied on asymmetric, non-PSD and rectangular datasets. The convention to write a pipeline of equations with ”⇒” has been adopted from [9].
definition, π is the maximum possible angle corresponding to cos(θ) = ρmin = −1 [1]. Thus, ρmin not only defines a pair of mostly anticorrelated columns c1 and c2 but also provides the link with the first canonical angle θ1 . According to Stewart in [28], the number of canonical angles in the case of dim(X ) < dim(Y) is equal to dim(X ), which in our case dim(X ) = 2. This fact allows the construction of the second SCE-based feature component by employing the orthogonality constraint to obtain θ2 . For this purpose for ∀k ∈ [1, N 2 ] we seek an orthogonal
A New Anticorrelation-Based Spectral Clustering Formulation
143
complement vector ck to c1,2 , and in this process we discard the least orthogonal pair of vectors: c1,k , if |∠c1 , ck − π/2| < |∠c2 , ck − π/2| t1,2 = (4) c2,k , otherwise . Further, we multiply t1,2 by Cz (”C to the power of z”) to maximize linear dependency and subsequently decorrelate (and center) with PCA on the computed 2 × 2 covariance matrix (refer to Table 1 for notations) in order to obtain the pair Φ1,2 of orthogonal, uncorrelated and centered SCE feature vectors. In this formulation, the SCE approximation of kernel-PCA feature vectors X1,2 is controlled by the coefficient z. After decorrelation and centering and similarly to the ranking of the PCA components [11], our selection of the leading feature vector Φ1 is based on the maximum variance principle such as σ 2 (Φ1 ) > σ 2 (Φ2 ). The product Cz t can also be viewed as a multivariate polynomial regression model with t being a vector of N 2 × 2 regression coefficients. The algorithm description and the progression from c to Φ is outlined in Fig. 1.
3
Results of Comparison to Kernel-PCA
In this section we compare SCE-based classification to the kernel-PCA-based result and consider an experiment with an asymmetric dataset. We created an interlocked spirals dataset, shown in Fig. 2, which is considered to be a challenging benchmark for spectral clustering [22]. To alleviate this challenging clustering SC on symmetrized Ssym
SC on symmetrized Ssym
( -1)
SCE on asymmetric S; ρ
= - 0.4
SCE on asymmetric S
140
2
120 100 80
1
1
40
t
X
2
60
0
20 0
-1
20 40 60 5
-2 0
5
10
X
1
(a)
15
20 x 10
-5000
23
0
t
(b)
5000
2
(c)
(d)
Fig. 2. Segmentation of asymmetric data with binary k-means in SC (a) and SCE (c) constructed feature spaces. An interlocked two spirals dataset is considered as a challenging benchmark for SC [22]. As can be seen in (c), SCE results in a better intercluster separation in the feature space. As the symmetry condition constitutes one of the four metric axioms [12], this example also tests the non-metric invariance of SCE.
problem, Chang and Yeung present in [22] a robust path-based spectral clustering algorithm with the use of a Gaussian kernel. The main objective of our experiment is to demonstrate the asymmetric and non-metric invariance of SCE and we designed the following asymmetric similarity measure:
sij = b1 · dxi − dxj , (5) sji = b2 · dyi − dyj .
144
J. Dietlmeier, O. Ghita, and P.F. Whelan
In the experiment outlined in Fig. 2 each spiral consists of 151 data points and is generated according to the equation of Archimedean spiral. Two separately computed coordinate vectors have been further concatenated to form x and y vectors. In (5) we are using the first derivatives dx and dy of the raw coordinates x and y. The model parameters b1 and b2 control the degree of asymmetry and with b1 = 20 and b2 = 2 we obtain a highly asymmetric S. Due to the symmetric formulation of SC we decompose S into its symmetric and skew-symmetric parts S = Ssym + Sskew according to [10]. Because of its symmetric formulation, SC disregards Sskew and diagonalizes only Ssym . Given a high degree of asymmetry, SC fails to correctly identify the two separate spirals as illustrated in Fig. 2(b). Conversely, SCE fully utilizes the information in the asymmetric component of S to achieve the correct separation (Fig. 2(d)) and results in a better projection and thus higher inter-cluster separability than kernel-PCA. The application of SCE to image segmentation and its relation to Ncut will be investigated in the next section.
4
Connection of SCE with Normalized Cuts
Ncut is the graph-theoretic formulation of SC with the objective to minimize a normalized measure of disassociation [7]. Ncut operates on the 2nd generalized eigenvector of a normalized weight matrix W where the normalization procedure has the purpose to penalize large image segments. Ncut then computes the diagonal matrix D containing the sum of all edges solves for the eigenvectors of and 1 1 N = D− 2 WD− 2 with N(i, j) = W(i, j)/ D(i, i) D(j, j). The second smallest generalized eigenvector λ2 of W is a componentwise ratio of the second and first largest eigenvectors of N [6]. We are interested if our Φ-based approximation can provide computational savings over the Ncut algorithm while maintaining the same image partition. For comparative purposes we have acquired the Ncut demo software from [26] and used the supplied parameters for the calculation of the adjacency matrix based on intervening contour similarities. In order to compare the segmentation results, we take the returned N matrix and compute Φ as outlined in Fig. 1. There are two aspects which are non-trivial in connection with Ncut: definition of a feature similarity and selection of a partitioning threshold which can take the values of 0, median or a point that minimizes N cut(A, B) [7]. In our SCE formulation we center the columns of N before mixing with t for a number of z iterations. We can also implement Nz t mixing iteratively by centering only the mixed N 2 × 2 components after each iteration. The successive centering results in the ideal non-parametric case, where we partition the graph according only to the signs (A = {Φ1 > 0}, B = {Φ1 ≤ 0}). For our experiments we used a Dell Precision M6300 dual-core notebook with 2GB RAM and Matlab R2009a environment. The results of an experiment on the full resolution 321×481 images from the Berkeley segmentation database are shown in Fig. 3. The returned binary Ncut partition (Fig. 3, third column) is given by the second computed eigenvector (fifth column). We observe the qualitative equivalence
A New Anticorrelation-Based Spectral Clustering Formulation SCE, thresholding Φ1
145
Ncut returned result seg=2 -0.2
0.5
-0.4 0
-0.6
-0.5
-0.8 -1
SCE, thresholding Φ1
Ncut returned result seg=2 -0.3 -0.4
0.2
-0.5
0
-0.6
-0.2
-0.7
SCE, thresholding Φ1
0.4
-0.4
-0.8
-0.6
-0.9
-0.8
Ncut returned result seg=2 -0.2
0.5 -0.4 -0.6 -0.8
0 -0.5
Fig. 3. Original 321 × 481 images from the Berkeley database are shown in the first column and SCE-based results in the second column. The last three columns show the Ncut-based result and the first and the second eigenvector returned by Ncut. This diagram is best viewed in color.
between SCE segmentation based on Φ1 and the first eigenvector (fourth column) returned by Ncut algorithm which also outputs a very narrow-banded sparse matrix N with ≈ 0.1% non-zero elements. Due to the inherent sparsity advantage which has its roots in the definition of similarities [7] Ncut does not rely on the direct eigen-decomposition of N. Instead, it uses the iterative Lanczos eigensolver [28] which, similarly to SCE, is also based on sparse matrix-vector multiplications. In our experiments with the returned sparse N, the ρmin is marginally low ρmin = −8.6025e − 004 which explains the high number of SCE iterations (z ≈ 1e4) needed to approximate the first Ncut eigenvector.
5
SCE Extension with Latent Variables
It is known that although the eigenvectors are efficient in capturing the perceptual organization features [16,5,6], binary Ncut solution does not guarantee the correct discrete image partitions [7]. Conversely, in our SCE formulation, the approximation to the first Ncut eigenvector is connected with the rotation and scaling of two hyperplanes, implemented through iterative N by t multiplications. Therefore, it is reasonable to assume that the foreground innovations can be extracted ahead of the eigenvector equivalence condition and in a much shorter time. To approach these problems, we initially followed the idea of a greedy search and designed an optimization procedure detailed in Table 2. Therein, we view the matrix N as the matrix of features and consider that N is sparse. Further, we view the signs of a pair of columns as two latent (hidden)
146
J. Dietlmeier, O. Ghita, and P.F. Whelan Table 2. Ising-based SCE extension
function [Ω, A, B] = scecut(N, t, itmax) k = cov(t); [u, v] = eigs(k); t = t ∗ u; t = t − ¯t; it = 0; while it
std(b2 ) t1 = −sign(b1 ) ∗ b2 ; t2 = b2 ; else t1 = −sign(b2 ) ∗ b1 ; t2 = b1 ; end it = it + 1; end Ω(:, 1) = t1 ; Ω(:, 2) = t2 A : Ω(:, 1) > 0; B : Ω(:, 1) ≤ 0;
binary support variables s1 (+) and s2 (−) and thus establish the connection with the Ising model which is a special Markov random field (MRF) [3]. In Table 2 we define s1 = sign(b1 ) if std(b1 ) > std(b2 ) and s2 = sign(b2 ) if std(b1 ) ≤ std(b2 ), where ”std” denotes the standard deviation. Instead of using the Ising model to represent pixels, we work with similarities contained in the normalized sparse matrix N. We denote the optimized feature vectors by Ω. The initial condition is given by the pair of columns t1,2 which, with the change in notation such that S ≡ N, can be obtained according to: N ⇒ P ∈ RN
2
×H
⇒ pi = pi − p¯i ⇒ R ⇒ c1,2 ⇒ t1,2 .
(6)
Due to memory limitations we did not search for the global minimum on correlation in N but instead operated on a subset matrix P of H randomly selected columns. Thus, in the experiment shown in Fig. 4 we selected H = 100 in order to process a 321 × 481 image from the Berkeley database. Although the automatic selection of the stopping criterion is still an ongoing work, we note that one possibility to obtain the optimal partition is to analyze the dynamic oscillatory behavior of the correlation coefficient ρ(t1 , t2 ) (first column in Fig. 4) and we observed that the optimal figure ground cut occurs at the change in phase of ρ. The binary A (figure) and B (ground) partitions have to be computed twice for the two successive iterations corresponding to the ρ transition. The final segmentation result F combines the intermediate results at different iterations such that F = F1 ∩ F2 , where F1 = A1 ∪ B1 and F2 = A2 ∪ B2 (see Fig. 4 last row). Random subsampling of N (6) explains somewhat different, but consistent with perceptual meaning, segmentation results in the second and the third row of Fig. 4, where we used different random subsets of N. The results in Fig. 4 show that not only the Ising-based SCE detects the foreground innovations in the analyzed image but also has a factor 2 speed-up compared to Ncut.
A New Anticorrelation-Based Spectral Clustering Formulation
SCE, itmax=298, 41.24 sec
147
SCE, itmax=438, 55.13 sec
Ncut output seg=2, 126.3 sec
SCE, itmax=439, 54.26 sec
SCE combined output
SCE, itmax=299, 37.3 sec
SCE combined result
Fig. 4. Concept of the Ising-based SCE. We analyze the dynamic oscillatory behavior of the correlation coefficient to find the optimal transition. For the two iterations near the transition point we compute the binary classification and combine the results to yield the optimal figure ground segmentation. The transition is given by the point where the hyperplanes are flipped around the Ω1 (strongest) axis as can be seen in the second column. On our computer SCE runs faster (see the computational time above the diagrams) than Ncut and returns more perceptually meaningful binary segmentation.
6
Conclusions and Future Work
In this paper we developed an efficient alternative to eigenvector-based feature classification. We started by examining the conditions of the feature space equivalence between the proposed SCE and the kernel-PCA outputs. We further have shown that the proposed algorithm reduces the dimension of the feature space while improving classification performance and thus results in a better projection and higher inter-cluster separability than kernel-PCA. In regard to image segmentation, we demonstrated that the proposed method has potential to replace eigenvector-based computation at least for applications considering the detection of foreground innovations. Our future work will concentrate on generalizing SCE to multiclass problems as well as investigating the regularization and stopping criteria of the proposed Ising-based SCE extension.
148
J. Dietlmeier, O. Ghita, and P.F. Whelan
Although the Ising model takes SCE beyond the equivalence pursuit, it shows that segmentation without eigenvectors is a more flexible framework than that offered by the standard spectral clustering. Acknowledgments. This research was supported by the National Biophotonics and Imaging Platform Ireland funded under the HEA PRTLI Cycle 4, co-funded by the Irish Government and the European Union - Investing in your future. Our special thanks go to the anonymous referees whose invaluable suggestions and comments have helped us to improve this paper.
References 1. Rodgers, J.L., Nicewander, W.A.: Thirteen Ways to Look at the Correlation Coefficient. The American Statistician 42, 59–66 (1988) 2. Tran, H.T., Romanov, D.A., Levis, R.J.: Control Goal Selection Through Anticorrelation Analysis in the Detection Space. Journal of Physical Chemistry A 110, 10558–10563 (2006) 3. Cevher, V., Duarte, M.F., Hegde, C., Baraniuk, R.G.: Sparse Signal Recovery Using Markov Random Fields. Neural Information Processing Systems (NIPS), 257–264 (2008) 4. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. In: Computer Vision and Pattern Recognition (CVPR), pp. 731–737 (1997) 5. Perona, P., Freeman, W.T.: A factorization approach to grouping. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 655–670. Springer, Heidelberg (1998) 6. Weiss, Y.: Segmentation Using Eigenvectors: A Unifying View. In: International Conference on Computer Vision, ICCV (1999) 7. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 22, 888–905 (2000) 8. Ng, A.Y., Jordan, M.I., Weiss, Y.: On Spectral Clustering: Analysis and an Algorithm. Neural Information Processing Systems (NIPS) 14, 849–856 (2001) 9. Roth, V., Laub, J., Kawanabe, M., Buhmann, J.M.: Optimal Cluster Preserving Embedding of Nonmetric Proximity Data. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 25, 1540–1551 (2003) 10. Constantine, A.G., Gower, J.C.: Graphical Representation of Asymmetric Matrices. Journal of the Royal Statistical Society. Series C (Applied Statistics) 27, 297– 304 (1978) 11. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (2002) 12. Anderberg, M.R.: Cluster Analysis for Applications. Academic Press Inc., London (1973) 13. Zheng, N., Xue, J.: Statistical Learning and Pattern Analysis for Image and Video Processing. Springer-Verlag London Limited (2009) 14. Laub, J., Roth, V., Buhmann, J.M., M¨ uller, K.-R.: On the Information and Representation of Non-Euclidean Pairwise Data. Pattern Recognition 39, 1815–1826 (2006) 15. Alzate, C., Suykens, J.A.K.: Multiway Spectral Clustering with Out-of- Sample Extensions Through Weighted Kernel PCA. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 32, 335–347 (2010)
A New Anticorrelation-Based Spectral Clustering Formulation
149
16. Monteiro, F.C., Campilho, A.C.: Spectral Methods in Image Segmentation: A Combined Approach. In: Marques, J.S., P´erez de la Blanca, N., Pina, P. (eds.) IbPRIA 2005. LNCS, vol. 3523, pp. 191–198. Springer, Heidelberg (2005) 17. Srebro, N., Jaakkola, T.: Linear Dependent Dimensionality Reduction. Advances in Neural Information Processing Systems (NIPS) 16, 145–152 (2003) 18. Sanguinetti, G.: Dimensionality Reduction in Clustered Data Sets. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 30, 535–540 (2008) 19. Talwalkar, A., Kumar, S., Rowley, H.: Large-Scale Manifold Learning. In: Computer Vision and Pattern Recognition, CVPR (2008) 20. P¸ekalska, E., Harol, A., Duin, R.P.W., Spillmann, B., Bunke, H.: Non-Euclidean or Non-metric Measures Can Be Informative. In: Yeung, D.-Y., Kwok, J.T., Fred, A., Roli, F., de Ridder, D. (eds.) SSPR 2006 and SPR 2006. LNCS, vol. 4109, pp. 871–880. Springer, Heidelberg (2006) 21. Belabbas, M.-A., Wolfe, P.: Spectral Methods in Machine Learning and New Strategies for Very Large Datasets. Proceedings of National Academy of Sciences (PNAS) of the USA 106, 369–374 (2009) 22. Chang, H., Yeung, D.-Y.: Robust Path-based Spectral Clustering. Pattern Recognition 41, 191–203 (2008) 23. Fowlkes, C., Belongie, S., Chung, F., Malik, J.: Spectral Grouping Using the Nystr¨ om Method. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 26, 214–225 (2004) 24. Tung, F., Wong, A., Clausi, D.A.: Enabling Scalable Spectral Clustering for Image Segmentation. Pattern Recognition 43, 4069–4076 (2010) 25. Chen, W.-Y., Song, Y., Bai, H., Lin, C.-J., Chang, E.Y.: Parallel Spectral Clustering in Distributed Systems. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 33, 568–586 (2010) 26. Cour, T., Yu, S. and Shi, J.: Ncut demo software, http://www.cis.upenn.edu/ ~jshi/software 27. Sch¨ olkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge (2002) 28. Stewart, G.W.: Matrix Algorithms, Eigensystems, vol. II. SIAM, Philadelphia (2001)
Simultaneous Partitioned Sampling for Articulated Object Tracking Christophe Gonzales, S´everine Dubuisson, and Xuan Son N’Guyen Laboratoire d’Informatique de Paris 6 (LIP6/UPMC) 4 place Jussieu, 75005 Paris, France [email protected]
Abstract. In this paper, we improve the Partitioned Sampling (PS) scheme to better handle high-dimensional state spaces. PS can be explained in terms of conditional independences between random variables of states and observations. These can be modeled by Dynamic Bayesian Networks. We propose to exploit these networks to determine conditionally independent subspaces of the state space. This allows us to simultaneously perform propagations and corrections over smaller spaces. This results in reducing the number of necessary resampling steps and, in addition, in focusing particles into high-likelihood areas. This new methodology, called Simultaneous Partitioned Sampling, is successfully tested and validated for articulated object tracking.
1
Introduction
Articulated object tracking is an important computer vision task for a wide variety of applications including gesture recognition, human tracking and event detection. However, tracking articulated structures with accuracy and within a reasonable time is challenging due to the high dimensionality of the state and observation spaces. In the optimal filtering context, the goal of tracking is to estimate a state sequence {xt }t=1,...,T whose evolution is specified by a dynamic equation xt = ft (xt−1 , nxt ) given a set of observations. These observations {yt }t=1,...,T , are related to the states by yt = ht (xt , nyt ). Usually, ft and ht are vector-valued and time-varying transition functions, and nxt and nyt are Gaussian noise sequences, independent and identically distributed. All these equations are usually considered in a probabilistic way and their computation is decomposed in two main steps. First the prediction of the density function p(xt |y1:t−1 ) = xt−1 p(xt |xt−1 )p(xt−1 |y1:t−1 )dxt−1 with p(xt |xt−1 ) the prior density related to transition function ft , and then a filtering step p(xt |y1:t ) ∝ p(yt |xt )p(xt |y1:t−1 ) with p(yt |xt ) the likelihood density related to the measurement function ht . When functions ft and ht are linear, or linearizable, and when distributions are Gaussian or mixtures of Gaussians, the sequence {xt }t=1,...,T can be computed analytically by Kalman, Extended Kalman or Unscented Kalman Filters [4]. Unfortunately, most vision tracking problems involve nonlinear functions and non-Gaussian distributions. In such cases, tracking methods based on particle J. Blanc-Talon et al. (Eds.): ACIVS 2011, LNCS 6915, pp. 150–161, 2011. c Springer-Verlag Berlin Heidelberg 2011
Simultaneous Partitioned Sampling for Articulated Object Tracking
151
filters [4,6], also called Sequential Monte Carlo Methods (SMC), can be applied under very weak hypotheses: their principle is not to compute the parameters of the distributions, but to approximate these distributions by a set of N weighted (i) (i) samples {xt , wt }, also called particles, corresponding to hypothetical state realizations. As optimal filtering approaches do, they consist of two main steps: (i) a prediction of the object state in the scene (using previous observations), (i) (i) that consists in propagating the set of particles {xt , wt } according to a pro(i) posal function q(xt |x0:t−1 , yt ), followed by (ii) a correction of this prediction (using a new available observation), that consists in weighting the particles ac(i)
(i)
(i)
p(x
(i)
|x
(i)
)
t t−1 cording to a likelihood function, so that wt ∝ wt−1 p(yt |xt ) , with (i) q(xt |x0:t−1 ,yt ) N (i) = 1. Particles can then be resampled, so that those with highest i=1 wt weights are duplicated, and those with lowest weights are suppressed. There exist many models of particle filters, each one having its own advantages. For example, the Condensation algorithm [9] has proved to be robust to clutter and occlusion due to its multiple hypotheses. Unfortunately, the computational cost of particle filters highly depends on the number of dimensions of the state space and, for large state spaces, costs may be unrealistically high due to the large number of particles needed to approximate the distributions and to the costs (i) of computing weights wt . In this paper, we propose a new way to reduce the number of particles necessary to treat high-dimensional state spaces, by reducing the number of resampling steps and the variance of the particle set. This paper is organized as follows. Section 2 gives a short overview of the existing approaches that try to solve this high-dimensionality problem. Section 3 recalls the Partitioned Sampling approach and its limitations, then details our approach. Section 4 gives comparative tracking results on challenging synthetic video sequences. Finally, concluding remarks and perspectives are given in Section 5.
2
Reducing Particle Filter’s Complexity: The Problem of High-Dimentional State Spaces
Dealing with high-dimensional state and observation spaces is a major concern of the vision tracking research community, especially when using the particle filter framework for articulated object tracking. There exist essentially two ways to tackle high-dimensional problems: either reduce the dimension of the state space/search space or exploit conditional independences naturally arising in the state space to partition the latter into low-dimensional spaces where few particles are needed. Among those algorithms that follow the first way, some exploit tailored proposal functions to better guide particles during the prediction step. For instance, in [3], attractors corresponding to specific known state vectors are used to better diffuse particles and then to explore more efficiently the state space. For articulated body tracking, many model-driven approaches using prior knowledge [7] on the movement of articulated parts derived from physical models, have been successfully applied. In [8], the environment is assumed to influence the movement
152
C. Gonzales, S. Dubuisson, and X.S. N’Guyen
of body parts and environmental constraints are thus integrated into the tracking scheme. Search into the state space may also be improved using optimization techniques. In [18], a population-based metaheuristic is for instance exploited: a path relinking scheme is used to resample particles so as to avoid missing modes of the probability distribution to estimate. Deutscher et al. [5] also proposed the Annealed Particle Filter that consists of adding to the resampling step simulated annealing iterations to diffuse particles into high-likelihood areas. In [2] a new optimization technique is also considered, which is more efficient than previous classical gradient methods because it incorporates constraints. The second family of approaches consists of reducing the number of necessary particles by exploiting conditional independences in the state space to divide it into small parts. For instance, in [16], graphical models are used to derive conditional density propagation rules and to model interpart interactions between articulated parts of the object. A belief inference is also used in [19], where the articulated body is modeled by a Dynamical Bayesian Network in which inference is computed using both Belief Propagation and Mean Field algorithms. In [1] a body is modeled by a factor graph, and the marginal of each part of the body is computed using Belief Propagations and a specific particle filter. Then, the global estimation consists of recomputing all the weights by taking into account the links between parts of the body. In [17] Bayesian Networks are exploited to factor the representation of the state space, hence reducing the complexity of the particle filter framework. Partitioned Sampling (PS) [11,12] was proposed by MacCormick and Isard in 2000 and is one of the most popular frameworks. The key idea, that will be described in Section 3.1, is to divide the joint state space xt into a partition of P elements, i.e. one element per object part, and for each one, to apply the transition function and to perform a weighted resampling operation. PS was first applied in multiple object tracking. The order of treatment of the objects was fixed over time, which made it fail when there were occlusions. In Dynamic Partition Sampling [20], the posterior distribution is represented by a mixture model, whose mixture components represent a specific order of treatment of the objects. However, when the set of configurations is large, this approach becomes intractable. The Ranked Partition Sampling [21] proposes to simultaneously estimate the order of treatment of the objects and their distributions. For the articulated object tracking purpose, PS suffers from numerous resampling steps that increase noise as well as decrease the tracking accuracy over time. We propose in this paper an adaptation of PS to efficiently deal with an articulated object by exploiting independence between its parts to reduce the number of resampling steps. We call this new methodology Simultaneous Partitioned Sampling (SPS) and derive its modeling in the next section.
3 3.1
Proposed Approach Partitioned Sampling (PS)
Partitioned Sampling (PS) is a very effective Particle Filter that exploits some decomposition of the system dynamics in order to reduce the number of particles
Simultaneous Partitioned Sampling for Articulated Object Tracking
153
needed to track objects when the state space dimensions are large. The basic idea is to divide the state space into an appropriate set of partitions and to apply sequentially a Particle Filter on each partition, followed by a specific resampling ensuring that the sets of particles computed actually represent the joint distributions of the whole state space. This specific resampling is called a “Weighted Resampling”. Let g : X → R be a strictly positive continuous function on X called a weighting function. Given a (i) (i) (i) set of particles Pt = {xt , wt }N i=1 with weights wt , weighted resampling pro(i) (i) duces a new set of particles Pt = {x t , w t }N i=1 representing the same distribution as Pt while located at the peaks of function g. To achieve this, let an “imporN (i) (j) tance distribution” ρt be defined on {1, . . . , N } by ρt (i) = g(xt )/ j=1 g(xt ) for i = 1, . . . , N . Select independently indices k1 , . . . , kN according to proba(i) (i) bility ρt . Finally, construct a set of particles Pt = {x t , w t }N i=1 defined by (i) (i) (k ) (k ) i i x t = xt and w t = wt /ρt(ki ). MacCormick [10] shows that Pt represents the same probability distribution as Pt while focusing on the peaks of g. The basic idea underlying Partitioned Sampling is to exploit a “natural” decomposition of the system dynamics w.r.t. subspaces of the state space in order to apply Particle Filtering only on those subspaces. This allows for a significant reduction in the number of particles needed to track complex objects. More precisely, assume that state space X can be partitioned as X = X 1 × · · · × X P , i.e., the system is viewed as being composed of P parts. For instance, a system representing a hand could be decomposed as X hand = X palm × X thumb × X index × X middle × X ring × X little . In addition, assume that the dynamics of the whole system follows this decomposition, i.e., that there exist functions fti : X → X satisfying that the projection of x over X 1 ×· · ·×X i−1 equals that of x whenever x = fti (x) and such that: ft (xt−1 , nxt ) = ftP ◦ ftP −1 ◦ · · · ◦ ft2 ◦ ft1 (xt−1 ),
(1)
where ◦ is the usual function composition operator. By definition, each function fti can propagate the particles over subspace X i × · · · × X P , i.e., it can only modify the substates of the particles defined on X i × · · · × X P . However, in practice, function fti usually just modifies the substate defined on X i . One step of a “standard” Particle Filter would resample particles, propagate them using proposal function ft and finally update the particle weights using the observations at hand. Here, exploiting the features of weighted resampling, Partitioned Sampling achieves the same result by substituting the ft propagation by a sequence of applications of the fti followed by weighted resamplings, as shown in Fig. 1. In this figure, operations “∗fti ” refer to propagations of particles using proposition function fti as defined above and operations “∼ gti ” refer to weighted resamplings w.r.t. importance function gti . Of course, to be effective, PS needs gti to be peaked with the same region as the posterior distribution restricted to X i . As an example, assume that X = X 1 × X 2 and consider that the large square on Fig. 2.a represents the whole of X . Then, the effect of propagating particles according to ft1 and resampling w.r.t. gt1 corresponds to direct the set of particles into the vertical shaded rectangle (where the peaks of gt1 are
154
C. Gonzales, S. Dubuisson, and X.S. N’Guyen p(xt |xt−1 , y1:t−1 )
∼
∗ft1
∼ gt1
∗ft2
∼ gt2
∗ftP
×p(yt|xt )
p(xt |y1:t )
Fig. 1. Partitioned Sampling condensation diagram X2
X3
X1
X2
a) Partition Sampling
b) Simultaneous Partition Sampling
Fig. 2. Interpretation of Partitioned Sampling ans Simultaneous Partition Sampling
located). Further propagating these particles using ft2 and resampling w.r.t. gt2 head them toward the small shaded rectangle that precisely corresponds to the peaks of p(xt |y1:t ). This scheme can be significantly improved when the likelihood function decomposes on subsets X i , i.e., when: P
p(yt |xt ) =
pi (yti |xit ),
(2)
i=1
where yti and xit are the projections of yt and xt on X i respectively. Such a decomposition naturally arises when tracking articulated objects. In these cases, PS condensation diagram can be substituted by that of Fig. 3. MacCormick and Isard show that this new diagram produces mathematically correct results [12]. 3.2
Our Contribution: Simultaneous Partitioned Sampling (SPS)
In a sense, the hypotheses used by Partition Sampling can best be explained on a dynamic Bayesian Network (DBN) representing the conditional independences
p(xt |xt−1 , y1:t−1 )
∗ft1
×p1t
∼
∗ft2
×p2t
∼
∗ftP
×pP t
p(xt|y1:t )
Fig. 3. Improved Partitioned Sampling condensation diagram
Simultaneous Partitioned Sampling for Articulated Object Tracking
x2t−1
x2t 2 yt−1
x2t+1 yt2
x1t−1
2 yt+1
x1t 1 yt−1
x1t+1 yt1
x3t−1
1 yt+1
x3t 3 yt−1
x3t+1 yt3
time slice t − 1
155
time slice t
3 yt+1
time slice t + 1
Fig. 4. A Dynamic Bayesian network
between random variables of states and observations [13]. Assume for instance that an object to be tracked is composed of 3 parts: a torso, a left arm and a right arm. Let x1t , x2t , x3t represent these parts respectively. Then, the probabilistic dependences between these variables and their observations yt1 , yt2 , yt3 , can be represented by the DBN of Fig. 4. In this figure, Eq. (2) implicitly holds because, conditionally to states xit , observations yti are independent of the other random variables. In addition, the probabilistic dependences between substates x1t , x2t , x3t suggest that the dynamics of the system is decomposable on X 1 × X 2 × X 3 . As a consequence, the condensation diagram of Fig. 3 can be exploited to track the object. Through the d-separation criterion [14], DBNs offer a strong framework for analyzing probabilistic dependences among sets of random variables. By this criterion, it can be remarked that, on Fig. 4, x3t is independent of x2t conditionally to x1t and x3t−1 . Similarly, x2t can be shown to be independent of x3t conditionally to x1t and x2t−1 . As a consequence, propagations/corrections over subspaces X 2 and X 3 can be performed simultaneously (since they are independent). This suggests the condensation diagram of Fig. 5, which we call a Simultaneous Partitioned Sampling (SPS). It is easily seen that, as for PS, the set of particles resulting from SPS represents probability distribution p(xt |y1:t ). The major difference with PS is that, by resampling only after both x2t and 3 xt have been processed, we can gain in accuracy. Actually, consider Fig. 2.b in which the shaded rectangles explain how PS achieves concentrating iteratively on the peaks of p(xt |y1:t ). After processing subspace X 2 , PS focuses on the light
Fig. 5. Basic Simultaneous Partitioned Sampling condensation diagram
After processing subspace X^2, PS focuses on the light gray rectangle. Therefore, this rectangle is determined only using observation y_t^2. If, for a given particle, substate x_t^1 was near the edge of a rectangle (i.e., not too close to a peak), then one observation y_t^2 may be insufficient to discard this value of x_t^1, whereas two observations y_t^2 and y_t^3 may well be sufficient. In other words, taking into account multiple independent observations can focus the particles on smaller peaked regions of the state space. For instance, in Fig. 2.b, it may well be the case that, instead of ending up with particles located in the dark shaded area, SPS focuses them on the smaller dashed rectangle. Of course, to be effective, SPS needs the propagations/corrections on all the subspaces processed simultaneously to be "good", in the sense that they end up with high weights. A naive approach to SPS would not guarantee this property. For instance, in the example of Fig. 4, a particle may be close to the true state of the left arm and far from that of the right arm. In such a case, the overall weight of the particle would be low. If numerous particles had this feature, SPS would perform poorly. Fortunately, the conditional independences exploited by SPS also enable a substate swapping operation that significantly improves the concentration of the particles in the high-likelihood areas. The idea is that if two particles, say (x_t^{1,(i)}, x_t^{2,(i)}, x_t^{3,(i)}) and (x_t^{1,(j)}, x_t^{2,(j)}, x_t^{3,(j)}), have the same substate on X^1, i.e., x_t^{1,(i)} = x_t^{1,(j)}, then we can swap their substates on X^2 or X^3 (thus creating new particles (x_t^{1,(i)}, x_t^{2,(j)}, x_t^{3,(i)}) and (x_t^{1,(j)}, x_t^{2,(i)}, x_t^{3,(j)})) without altering the probability distribution represented by the set of particles. This feature is actually guaranteed by the d-separation criterion. Consequently, if one particle is close to the true state of the left arm and far from that of the right arm, while another is close to the true state of the right arm and far from that of the left arm, then, provided they have the same value on X^1, we can substitute for them a new particle close to the true states of both arms and a new particle far from those true states. Of course, after swapping, resampling will essentially take the best particle into account. By having the best values on both X^2 and X^3, this particle allows SPS to concentrate on smaller high-peaked regions than PS. This leads to the new condensation diagram of Fig. 6, where the operation labeled "2,3" represents substate swapping on X^2 and X^3. Of course, this scheme can easily be generalized. Assume that X = \prod_{i=1}^{P} X^i. Partition the set {X^i}_{i=1}^{P} into subsets Y^j = {X^i}_{i ∈ I_j} such that ∪_j I_j = {1, . . . , P} and such that each pair X^{i_1}, X^{i_2} belonging to the same set Y^j is independent conditionally on the subspaces X^i ∈ Y^{j'} with j' < j.
Fig. 6. Complete Simultaneous Partitioned Sampling condensation diagram
Then, for each j, the simultaneous propagation/correction/swapping/resampling scheme described above can be applied to the sets of Y^j (note, however, that to guarantee that the probability distributions remain unchanged, swapping some substate x_t^i ∈ Y^j actually requires swapping accordingly all the substates x_t^s that are not d-separated from x_t^i conditionally on {x_t^r ∈ Y^k : k < j}). In the next section, we highlight the advantages of SPS by comparing it to PS on challenging synthetic video sequences.
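For concreteness, the sketch below illustrates one possible way to realize this swapping step in Python for the three-part example. The particle representation (arrays x1, x2, x3), the per-part weights w2 and w3, and the grouping of particles by identical X^1 substates are assumptions of this example, not the authors' implementation; it only shows that within each group sharing the same X^1 value, the X^2 and X^3 substates can be permuted independently.

```python
import numpy as np

def swap_substates(x1, x2, x3, w2, w3):
    """Recombine substates of particles that share the same X^1 substate.

    x1, x2, x3 : arrays of shape (N, d_i) holding the per-part substates.
    w2, w3     : per-part weights (length N) measuring how well x2 / x3
                 fit their respective observations.
    Within each group of particles with identical x1, the x2 substates are
    reordered by decreasing w2 and the x3 substates by decreasing w3, so that
    the best left-arm and right-arm hypotheses end up in the same particle.
    Since only permutations within a group are applied, the represented
    distribution is unchanged.
    """
    x1 = np.asarray(x1).reshape(len(x1), -1)
    x2, x3 = np.asarray(x2).copy(), np.asarray(x3).copy()
    w2, w3 = np.asarray(w2, float).copy(), np.asarray(w3, float).copy()
    # group particle indices by their (rounded) x1 substate
    groups = {}
    for idx, row in enumerate(x1):
        groups.setdefault(tuple(np.round(row, 6)), []).append(idx)
    for idxs in groups.values():
        idxs = np.array(idxs)
        order2 = idxs[np.argsort(-w2[idxs])]   # best x2 first
        order3 = idxs[np.argsort(-w3[idxs])]   # best x3 first
        x2[idxs], w2[idxs] = x2[order2], w2[order2]
        x3[idxs], w3[idxs] = x3[order3], w3[order3]
    return x2, x3, w2, w3
```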
4 Experimental Results
We have chosen to test our method and to compare it with PS on synthetic video sequences because we wanted to highlight its interest in terms of dimensionality reduction and tracking accuracy without having to take into account specific properties of the images (noise, etc.). Moreover, it is possible to simulate specific motions and thus to compare our method with PS accurately. We have generated our own synthetic video sequences, each one containing 300 frames, showing a P-part articulated object (P ∈ {3, 5, 7, 9, 11}) translating and deforming over time; see examples of frames in Fig. 7. The goal here is to observe the capacity of PS and SPS to deal with articulated objects composed of a varying number of parts and subject to weak or strong motions.
Fig. 7. Some frames of synthetic sequences: an articulated object with 7 and 9 parts
The tracked articulated object is modeled by a set of P rectangles. The state space contains parameters describing each rectangle, and is defined by x_t = {x_t^1, x_t^2, . . . , x_t^P}, with x_t^p = {x_t^p, y_t^p, θ_t^p}, where (x_t^p, y_t^p) denotes the center of the pth rectangle and θ_t^p its orientation. A particle x_t^{(i)} = {x_t^{1,(i)}, x_t^{2,(i)}, . . . , x_t^{P,(i)}} is then a possible configuration of the articulated object. In the first frame, particles are uniformly generated around the object. During the prediction step, particles are propagated following a random walk whose variance has been manually chosen. The weights of the particles are then computed using the current observation (i.e., the current frame). A classical approach consists in integrating color distributions given by histograms into particle filtering [15], by measuring the similarity between the distribution of pixels in the regions of the estimated parts of the articulated object and that of the corresponding reference regions. This similarity is determined by computing the Bhattacharyya distance d between the histograms of the target and reference regions. Finally, the particle weights are given by w_t^{(i)} = w_{t-1}^{(i)} p(y_t | x_t^{(i)}) ∝ w_{t-1}^{(i)} e^{−λ d²}, with λ = 50 in our tests. For both approaches, the global joint distribution of the articulated object is estimated by starting from its center part. PS then propagates and corrects particles part after part to derive a global estimation of the object.
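As an illustration of this weighting scheme, the minimal sketch below computes a particle weight from color histograms via the Bhattacharyya distance. The histogram construction, the bin count and the function names are assumptions of this example and not the authors' code.

```python
import numpy as np

def color_histogram(pixels, bins=8):
    """Normalized joint RGB histogram of the pixels inside one rectangle."""
    hist, _ = np.histogramdd(pixels.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist / max(hist.sum(), 1e-12)

def bhattacharyya_distance(p, q):
    """d = sqrt(1 - sum_i sqrt(p_i q_i)) for two normalized histograms."""
    bc = np.sum(np.sqrt(p * q))
    return np.sqrt(max(1.0 - bc, 0.0))

def particle_weight(prev_weight, target_pixels, reference_hist, lam=50.0):
    """w_t = w_{t-1} * exp(-lam * d^2), cf. the weight update in the text."""
    d = bhattacharyya_distance(color_histogram(target_pixels), reference_hist)
    return prev_weight * np.exp(-lam * d ** 2)
```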
Fig. 8. Convergence study (50 runs): comparison between PS and SPS (tracking error vs. frame number), from left to right, top to bottom: N = 2, N = 5, N = 10, N = 20, N = 30 and N = 40, for an articulated object composed of (a) 7 parts, (b) 9 parts
SPS considers the left and right parts as totally independent, and thus propagates and corrects them simultaneously. PS and SPS are compared in terms of tracking accuracy. For that, we measure the tracking error as the distance between the ground truth and the estimated articulated object at each instant. This distance is given by the sum of the Euclidean distances between the corners of the estimated rectangles and the corresponding corners of the ground-truth shape. We also measure the variance of the particle set. In the first test, we compare the convergence of PS and SPS. For that, we use synthetic sequences showing the same articulated object. We compute the tracking errors by averaging over 50 different runs, and repeat these 50 runs for different numbers N of particles. Tracking errors for articulated objects composed of 7 and 9 parts are shown in Fig. 8(a-b). As we can see in this figure, SPS always outperforms PS, even when N becomes very high.
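For reference, this error measure amounts to only a few lines of code; the corner ordering and the array shapes used here are assumptions of the sketch.

```python
import numpy as np

def tracking_error(estimated_corners, ground_truth_corners):
    """Sum of Euclidean distances between corresponding rectangle corners.

    Both arrays have shape (P, 4, 2): P rectangles, 4 corners, (x, y),
    with corners listed in the same order for estimate and ground truth.
    """
    diff = np.asarray(estimated_corners, float) - np.asarray(ground_truth_corners, float)
    return float(np.sum(np.linalg.norm(diff, axis=-1)))
```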
Fig. 9. Comparison of variances (50 runs) obtained for PS and SPS, for a set of N = 10 particles, for an articulated object composed of, from left to right, 7 and 9 parts
Fig. 10. PS vs. SPS: case of strong motions at the beginning of the sequence, N = 20 (50 runs). Tracking errors for an articulated object of, from left to right: 3, 5, 7, 9 and 11 parts.

Table 1. Execution times in seconds of PS and SPS, for N = 10 to 200 particles (the times reported are averages over 20 runs)

# parts  tracker   N=10   N=20   N=40   N=60   N=80   N=100   N=200
   3     PS        0.54   0.88   1.58   2.99   2.82    3.45    6.80
   3     SPS       0.56   0.91   1.61   2.33   2.84    3.50    6.92
   5     PS        0.71   1.26   2.33   3.34   4.43    5.43   11.62
   5     SPS       0.74   1.32   2.42   3.43   4.51    5.60   12.05
   7     PS        0.95   1.72   3.21   4.7    6.22    7.86   17.10
   7     SPS       1.00   1.76   3.28   4.84   6.46    7.99   17.73
   9     PS        1.21   2.18   4.12   6.16   8.12   10.35   22.06
   9     SPS       1.25   2.23   4.25   6.32   8.40   10.59   22.77
  11     PS        1.41   2.61   4.99   7.38  10.02   12.69   26.37
  11     SPS       1.46   2.77   5.11   7.61  10.33   13.01   27.18
One can also notice that, with a very small number of particles (N = 2), SPS shows more robustness. In fact, SPS is more stable during the periods of the sequence where the motion is stronger, when PS totally loses the object to track. Figure 9 confirms the stability of the proposed approach: the variance of the particle set is lower with SPS. This shows that the particles are more concentrated around high-likelihood values than with PS. This is mainly due to the fact that PS performs twice as many resampling steps as SPS, introducing more noise. To test the stability of our approach, we have generated video sequences in which the motion at the beginning of the sequence is strong. Comparative tracking errors of PS and SPS are reported in Figure 10, for different articulated objects (i.e., containing 3, 5, 7, 9 and 11 parts).
Fig. 11. PS vs. SPS: case of strong motions for a 5-part object, N = 10, from left to right: two frames, tracking errors and variance of the particle set (50 runs). SPS outperforms PS because of its capacity to concentrate particles more around high-likelihood areas.
Here again we see that SPS is less disturbed by this strong motion than PS. Moreover, Table 1 shows that the efficiency of SPS over PS is not achieved at the expense of response times (SPS is usually not more than 4% slower than PS). We have also compared the efficiency of PS and SPS in cases of very strong and erratic movements of parts of the articulated object throughout the sequence. Comparative tracking results for a 5-part articulated object are given in Figure 11. We can see two example frames of this sequence and the deformation that the articulated object undergoes. Tracking errors are considerably decreased with SPS, as is the variance.
5 Conclusion
We have presented a new methodology, Simultaneous Partitioned Sampling, that uses independence properties to simultaneously propagate and correct particles in conditionally independent subspaces. As a result, the particle set is more concentrated in high-likelihood areas, and thus the estimation of the probability density of the tracked object is more accurate. Empirical tests have shown that SPS outperforms PS, especially in cases where the object motion is strong and when the dimension of the state space increases (i.e., the number of parts is large). It still remains to validate this approach on real video sequences. There is also room for improving SPS, especially its swapping method. Currently, we are working on linear-programming-based techniques to determine the optimal swaps.
References

1. Bernier, O., Cheungmonchan, P., Bouguet, A.: Fast nonparametric belief propagation for real-time stereo articulated body tracking. Computer Vision and Image Understanding 113(1), 29–47 (2009)
2. Bray, M., Koller-Meier, E., Müller, P., Schraudolph, N.N., Van Gool, L.: Stochastic optimization for high-dimensional tracking in dense range maps. IEE Proceedings Vision, Image and Signal Processing 152(4), 501–512 (2005)
3. Chang, W.Y., Chen, C.S., Jian, Y.D.: Visual tracking in high-dimensional state space by appearance-guided particle filtering. IEEE Transactions on Image Processing 17(7), 1154–1167 (2008)
4. Chen, Z.: Bayesian filtering: from Kalman filters to particle filters, and beyond (2003)
5. Deutscher, J., Davison, A., Reid, I.: Automatic partitioning of high dimensional search spaces associated with articulated body motion capture. In: CVPR, vol. 2, pp. 669–676 (2005)
6. Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings of Radar and Signal Processing 140(2), 107–113 (1993)
7. Hauberg, S., Sommer, S., Pedersen, K.: Gaussian-like spatial priors for articulated tracking. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 425–437. Springer, Heidelberg (2010)
8. Hauberg, S., Pedersen, K.S.: Stick it! Articulated tracking using spatial rigid object priors. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 758–769. Springer, Heidelberg (2011)
9. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29, 5–28 (1998)
10. MacCormick, J.: Probabilistic modelling and stochastic algorithms for visual localisation and tracking. Ph.D. thesis, Oxford University (2000)
11. MacCormick, J., Blake, A.: A probabilistic exclusion principle for tracking multiple objects. In: ICCV, pp. 572–587 (1999)
12. MacCormick, J., Isard, M.: Partitioned sampling, articulated objects, and interface-quality hand tracking. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 3–19. Springer, Heidelberg (2000)
13. Murphy, K.: Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. thesis, UC Berkeley, Computer Science Division (2002)
14. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Francisco (1988)
15. Pérez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-based probabilistic tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 661–675. Springer, Heidelberg (2002)
16. Qu, W., Schonfeld, D.: Real-time decentralized articulated motion analysis and object tracking from videos. IEEE Transactions on Image Processing 16(8), 2129–2138 (2007)
17. Rose, C., Saboune, J., Charpillet, F.: Reducing particle filtering complexity for 3D motion capture using dynamic Bayesian networks. In: AAAI, pp. 1396–1401 (2008)
18. Sánchez, A., Pantrigo, J., Gianikellis, K.: Combining Particle Filter and Population-based Metaheuristics for Visual Articulated Motion Tracking. Electronic Letters on Computer Vision and Image Analysis 5(3), 68–83 (2005)
19. Shen, C., van den Hengel, A., Dick, A., Brooks, M.: 2D articulated tracking with dynamic Bayesian networks. In: Das, G., Gulati, V.P. (eds.) CIT 2004. LNCS, vol. 3356, pp. 130–136. Springer, Heidelberg (2004)
20. Smith, K., Gatica-Perez, D.: Order matters: a distributed sampling method for multi-object tracking. In: BMVC, pp. 25–32 (2004)
21. Widynski, N., Dubuisson, S., Bloch, I.: Introducing fuzzy spatial constraints in a ranked partitioned sampling for multi-object tracking. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Chung, R., Hammoud, R., Hussain, M., Kar-Han, T., Crawfis, R., Thalmann, D., Kao, D., Avila, L. (eds.) ISVC 2010. LNCS, vol. 6453, pp. 393–404. Springer, Heidelberg (2010)
A Geographical Approach to Self-Organizing Maps Algorithm Applied to Image Segmentation

Thales Sehn Korting, Leila Maria Garcia Fonseca, and Gilberto Câmara
Image Processing Division, National Institute for Space Research – INPE, São José dos Campos – SP, Brazil
{tkorting,leila,gilberto}@dpi.inpe.br
Abstract. Image segmentation is one of the most challenging steps in image processing. Its results are used by many other tasks regarding information extraction from images. In remote sensing, segmentation generates regions according to the targets found in a satellite image, like roofs, streets, trees, vegetation, agricultural crops, or deforested areas. Such regions are then used by classification algorithms to differentiate land uses. In this paper we investigate a way to perform segmentation using a strategy that classifies and merges spectrally and spatially similar pixels. For this purpose we use a geographical extension of the Self-Organizing Maps (SOM) algorithm, which exploits the spatial correlation among nearby pixels. The neurons in the SOM cluster the objects found in the image, and such objects define the image segments.
1 Introduction
Image segmentation is currently one of the most challenging tasks in digital image processing. The results of segmentation are used by many other tasks regarding information extraction from images. One simple definition states that a good segmentation should separate the image into simple regions with homogeneous behavior [7]. Algorithms for segmentation cover splitting one image into its components with respect to a specific context. The context includes scale, since the regions produced by segmentation can represent different levels of perception. In an urban remote sensing image, the objects found describe buildings at a finer scale, or city blocks at a coarser scale. Depending on the application, both scales provide enough information for object detection. Context also includes neighborhood, in accordance with Tobler's law, which states that near things are more related than distant things [18]. In the image processing area, this law may be interpreted as: near pixels are more similar than distant pixels. Extending this interpretation to image segmentation, we can state that near and similar pixels should describe the regions in the segmentation results. There are several applications in remote sensing that depend on the results of image segmentation. The objects found must correspond to the targets in a satellite image, such as roofs [5], streets [8] and trees in urban areas.
In agricultural imagery, the algorithm should extract the different crops [16], or deforested areas in forests [17]. Such land covers are then used by classification algorithms to differentiate land uses. Several segmentation algorithms applied to satellite imagery have already been proposed in the literature. However, none of them deeply exploits the spatial correlation among points as suggested by Tobler's law. In this scenario we present a novel segmentation technique which aims to fill this gap. The paper is organized as follows. Section 2 presents the literature review, Section 3 presents the methodology, Section 4 shows the results and Section 5 concludes the paper.
2 Literature Review
Traditional segmentation algorithms are often based on pixel similarity, comparing one pixel to its neighbors. Several strategies are based on the technique named "region growing", which merges neighboring pixels with a certain degree of similarity. Some examples of such algorithms are defined in [4] and [1]. The similarity is often based only on spectral properties, and seldom refers to the location of the pixels. Many other strategies for image segmentation have already been proposed in the literature. For further reading on traditional segmentation techniques, please refer to [19]. The present work integrates a classification approach to perform segmentation. The employed classification algorithm is a modification of the well-known Self-Organizing Maps – SOM [10]. This work investigates an approach to merge spectrally and spatially similar pixels by using a geographical extension of the SOM algorithm, named GeoSOM [2].

2.1 SOM Applied to Remote Sensing
SOM is a powerful tool for exploring huge amounts of high-dimensional data. It defines an elastic, topology-preserving grid of points that fits to the input space [12]. Initially proposed as a visualization tool for n-dimensional data, SOM has been applied in a variety of applications, such as clustering, dimensionality reduction, classification, sampling, vector quantization, and data mining [10]. Basically, according to [21], the SOM training algorithm is similar to K-means clustering, with an added spatial smoothness constraint. Neurons are initialized randomly, and at each iteration the neuron with the most similar response to the input data is called the winner. The winner and its neighbors are updated to become more similar to the corresponding input value. The size of the considered neighborhood is decreased along the epochs, so that in later phases only the winner neurons are updated, with very small refinements. In remote sensing, several works employ SOM for classification. The work of [21] tested the algorithm to classify different vegetation types using radar images of agricultural areas. The idea of semi-supervised SOM was applied, where the neuron closest to a certain pattern is associated with one class
of the training objects. [11] performed a comparison between the Expectation-Maximization and SOM algorithms applied to intra-urban classification, using high spatial resolution images to derive the results. According to the authors, both algorithms achieved good results and presented their own advantages. However, SOM classified the data in less time, which, given the currently increasing amount of data, is an important advantage. [9] concluded that SOM is a good algorithm choice for monitoring land cover change. They investigated the optimal classification algorithm applied to multi-temporal data based on monthly phenological characteristics. [13] employed subpixel analysis using Landsat ETM+ data to estimate impervious surface coverage, lawn, and woody tree cover in typical urban landscapes. The authors combined SOM, Learning Vector Quantization (LVQ), and Gaussian Mixture Model (GMM) methods to estimate the posterior probability of the land cover components. [20] applied SOM to the classification of multi- and hyperspectral satellite imagery. Since SOM classification is considered to be a topology-preserving projection of high-dimensional data onto a low-dimensional lattice, their results emphasized the necessity of a faithful topological mapping for correct interpretation.

2.2 Segmentation Using SOM
Classifying image pixels to perform segmentation is not a new topic. [15] presented an approach using the hue and saturation image components to segment personal photographs. SOM was employed to estimate the main chromaticities found. Then, each pixel was classified according to the identified classes. Neighboring pixels belonging to the same class were merged in the resulting segmentation. Also applied to personal photographs, [3] proposed an extension of the traditional SOM algorithm, which maps classes of pixel intensities onto a hierarchical structure of neurons. More recently, [12] applied SOM to the segmentation of images for content-based image retrieval (CBIR). CBIR has been targeted at interactive use, where the task is to return interesting or relevant images from an unannotated database. In their method, the color and texture values are used to train the map and to classify the segments. SOM has also been applied to magnetic resonance brain image segmentation [14], including spatial constraints through a Markov Random Field (MRF) model to improve the results.
3 Methodology
The main idea is to employ classification to perform segmentation, i.e., neighboring pixels with the same class are merged into segments. The novelty is to take advantage of the spatial correlation among the pixels, since we adopted a geographical extension of the SOM algorithm, which considers the spatial proximity of the pixel elements when merging them into a single object. This SOM extension was proposed by [2] and is named GeoSOM. According to the authors, GeoSOM has the potential to organize the SOM output space according to the geographic proximities of the input patterns.
As GeoSOM finds patterns in the image pixels, it must fit homogeneous regions to these patterns, and it considers these regions as the resulting segments. A simple way of inserting spatial constraints into classification is to include spatially relevant variables, which are computed as any other variable [2]. In this case there remains the choice of which spatial variables to use. The possibilities are endless, and depend on the objectives pursued. One also has to find the weights to be attributed to the geographic variables, thus giving more (or less) importance to the geographic information. The weights of the neurons have their own geographical properties. Since neurons have spatial positions, they are expected to converge to patterns found nearby in the image. In the resulting segmentation, clusters reflect the spectral distribution of the data, and are positioned according to the spatial occurrence of the patterns. Figure 1 shows this approach. In the following we describe the GeoSOM algorithm. Let:
– X be the set of N training patterns x_1, x_2, . . . , x_N, each of these having a set of geographic components geo_i and another set notgeo_i;
– W be a p × q grid of units w_ij, where i and j are their coordinates on that grid, each of these units having a set of components w_{geo,ij} and another set w_{notgeo,ij};
– α be the learning rate, assuming values in the interval (0, 1);
– h(w_ij, w_mn, r) be a neighborhood function, used to update the neurons w_ij and w_mn with radius r, generally expressed by the Gaussian equation h(w_ij, w_mn, r) = e^{-\frac{1}{2}\frac{(i-m)^2 + (j-n)^2}{r^2}};
– k be the radius of the geographical Best Matching Unit (BMU) to be searched;
– f be true if the units are at fixed geographical locations.
1  repeat
2    for m = 1 → N
3      ∀ w_ij ∈ W, calculate d_ij = ||geo_m − w_{geo,ij}||
4      w_{winnergeo} is the unit that minimizes d_ij
5      select the set W_{winners} of w_ij such that ||w_{winnergeo} − w_ij|| ≤ k
6      ∀ w_ij ∈ W_{winners}, calculate d_ij = ||x_m − w_ij||
7      w_{winner} is the unit that minimizes d_ij
8      if f = true
9        update units w_ij ∈ W: w_{notgeo,ij} = w_{notgeo,ij} + α h(w_{notgeo,winner}, w_{notgeo,ij}, r) ||x_m − w_ij||
10     else
11       update units w_ij ∈ W: w_ij = w_ij + α h(w_{winner}, w_ij, r) ||x_m − w_ij||
       end if
12     decrease α and r
13  until α = 0
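A compact Python rendering of this training loop is given below for concreteness. It is a sketch under several assumptions that are not part of the original specification: the first n_geo components of each pattern are taken as the geographic coordinates, the units are kept at fixed geographic locations (f = true), the search radius k is measured in output-grid units, the update uses the standard SOM vector difference (x − w), and α and r decay linearly.

```python
import numpy as np

def train_geosom(X, grid_shape=(10, 10), n_geo=2, k=1,
                 alpha0=0.5, r0=3.0, epochs=50, seed=0):
    """Minimal GeoSOM sketch with units at fixed geographic locations.

    X          : (N, D) training patterns; columns [:n_geo] are geographic.
    grid_shape : p x q output grid.
    k          : search radius (here: in grid units) around the geographic BMU.
    """
    rng = np.random.default_rng(seed)
    p, q = grid_shape
    D = X.shape[1]
    # grid coordinates of the units and randomly initialized weight vectors
    coords = np.stack(np.meshgrid(np.arange(p), np.arange(q), indexing="ij"),
                      axis=-1).reshape(-1, 2).astype(float)
    W = X[rng.integers(0, len(X), size=p * q)].astype(float).copy()
    for epoch in range(epochs):
        alpha = alpha0 * (1.0 - epoch / epochs)          # linear decay (assumption)
        r = max(r0 * (1.0 - epoch / epochs), 1e-3)
        for x in X:
            # 1) geographic BMU: unit closest in the geographic components
            geo_bmu = np.argmin(np.linalg.norm(W[:, :n_geo] - x[:n_geo], axis=1))
            # 2) candidate units within radius k (grid distance) of the geo BMU
            cand = np.flatnonzero(np.linalg.norm(coords - coords[geo_bmu], axis=1) <= k)
            # 3) winner: candidate closest to the full pattern
            winner = cand[np.argmin(np.linalg.norm(W[cand] - x, axis=1))]
            # 4) update only the non-geographic components (f = true)
            h = np.exp(-0.5 * np.sum((coords - coords[winner]) ** 2, axis=1) / r ** 2)
            W[:, n_geo:] += alpha * h[:, None] * (x[n_geo:] - W[:, n_geo:])
    return W.reshape(p, q, D)
```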
During the training stage, the algorithm searches for the neuron nearest to the input value, the so-called w_{winnergeo}.
Fig. 1. Inserting spatial variables as input data – the GeoSOM approach
In this case, "nearest" considers the locations of the input and the neurons. Afterwards, the algorithm searches for the most similar neuron in terms of spectral properties within the neighborhood of w_{winnergeo}, finding the neuron w_{winner}. At this point the parameter f plays an important role. If f is true, the neurons are updated only in their spectral properties. If f is false, then besides updating the spectral properties of the neurons, their locations are changed as well. The learning rate α and the radius r decrease along the epochs, allowing the algorithm to converge smoothly to the discovered patterns. The algorithm stops when one of these two parameters reaches 0, or when a pre-defined number of epochs has been performed. The patterns found are then converted into a segmentation.
Fig. 2. Segmentation result: a) Input image, and b) segmentation with GeoSOM
When the algorithm stops, the neurons have spectral properties similar to the main elements found in the image. Summarizing, the neurons in the SOM cluster the objects found in the image, and such objects define the image segments.
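A minimal sketch of this last step (turning the per-pixel classification map into segments) is shown below; the use of scipy.ndimage and the 4-connectivity choice are assumptions of the example rather than the authors' implementation.

```python
import numpy as np
from scipy import ndimage

def labels_to_segments(class_map):
    """Merge neighboring pixels that share the same winning neuron.

    class_map : 2D integer array of per-pixel winning-neuron indices.
    Returns a 2D array of segment ids, where each segment is a maximal
    4-connected set of pixels with the same class.
    """
    segments = np.zeros_like(class_map, dtype=np.int32)
    structure = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])  # 4-connectivity
    next_id = 0
    for c in np.unique(class_map):
        labelled, n = ndimage.label(class_map == c, structure=structure)
        segments[labelled > 0] = labelled[labelled > 0] + next_id
        next_id += n
    return segments
```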
4 Results
The following results consist of input images and their segmentations using GeoSOM. To produce these results we tuned the algorithm with the set of parameters that gave the best results according to visual inspection. In addition, we selected small image crops and segmented them with different parameters. According to [6], "a strong and experienced evaluator of segmentation techniques is the human eye/brain combination", and according to [1], "no segmentation result – even if quantitatively proofed – will convince, if it does not satisfy the human eye". The first image is a crop of a region in the state of Minas Gerais, Brazil, acquired by the CBERS-2B satellite, instrument CCD¹. The spatial resolution of the image is 20 m. Figure 2 shows the segmentation. In this first result we used a map with 9 neurons (a 3 × 3 grid), with a learning rate α = 0.2 for 50 epochs. 410 regions were generated. The segmented regions properly describe the main objects found in the image. In the central region, which is spectrally heterogeneous, the algorithm also obtained proper segments. As a drawback, some mistakes were noticed at the borders, as they were segmented as new objects. However, the training step in this example used merely 10% of the pixel information present in the image. This provides our approach with a significant reduction in the volume of data to be processed, reducing the processing time as well.

¹ Free remote sensing imagery at http://www.dgi.inpe.br/CDSR/
Fig. 3. Segmentation result: a) Input image, and b) segmentation with GeoSOM
We performed a second test using a high spatial resolution input image, segmenting a small crop of a Quickbird scene of an urban region in the state of São Paulo, Brazil. The spatial resolution of the image is 0.5 m. The target objects in this image are, in contrast to the first example, intra-urban objects, such as roofs, trees, shadows, streets, and so on. Figure 3 shows the segmentation. In this second result we used a map with 25 neurons (a 5 × 5 grid), with learning rate α = 0.2 for 200 epochs. 911 regions were generated. Since we used a larger number of neurons, the algorithm generated more regions. Even the roofs of the houses were split, in some cases, into more than one region. Inspecting the values of the neurons responsible for the roof regions shows that they have similar spectral values but different spatial locations. The result shown in Figure 3 is considered an over-segmentation, because in some cases the same object is reflected in more than one segmented region. However, it remains a useful result, since no region contains more than one object. In this case the main objects of the image can still be distinguished correctly, by applying proper inference to the data.
5 Conclusion
This article presented the idea of applying an extension of the SOM algorithm, called GeoSOM, to remote sensing image segmentation. We described the main steps of the algorithm, and how to perform the segmentation. To evaluate the technique, we tested two images with different spatial resolutions and contexts. By tuning the parameters properly, we were able to generate visually appealing results, which makes this an encouraging approach to segmentation. One important aspect of this approach is the reduced amount of data required to perform the segmentation. With 10% of the total pixels in the images, the neurons were trained and the results were able to properly split the images
into their component regions. One drawback was detected in the segmentation of some borders, since they do not present spectral properties similar to those of the bordering objects. However, the smoothness of the resulting objects can be improved by adding a parameter defining a minimum area, or some other threshold applied to the spatial properties of the generated segments. Future research includes extending this approach by inspecting the dynamics of the neuron locations during training. We believe that it is possible to describe the position of the most relevant objects in the image. If the image is classified after the segmentation, it is possible to infer the spatial distribution of the patterns. An interesting visualization tool can be derived from this approach, by retrieving the position and the number of neurons at each place in the image.
References

1. Baatz, M., Schäpe, A.: Multiresolution Segmentation: an optimization approach for high quality multi-scale image segmentation. In: Wichmann-Verlag (ed.) XII Angewandte Geographische Informationsverarbeitung. Herbert Wichmann Verlag, Heidelberg (2000)
2. Bação, F., Lobo, V., Painho, M.: Applications of different self-organising map variants to geographical information science problems. Self-organising Maps: Applications in Geographic Information Science (2008)
3. Bhandarkar, S., Koh, J., Suk, M.: Multiscale image segmentation using a hierarchical self-organizing map. Neurocomputing 14(3), 241–272 (1997)
4. Bins, L., Fonseca, L., Erthal, G., Li, F.: Satellite imagery segmentation: a region growing approach. In: Brazilian Remote Sensing Symposium, vol. 8 (1996)
5. Chesnel, A.L., Binet, R., Wald, L.: Object oriented assessment of damage due to natural disaster using very high resolution images. In: IEEE International Geoscience and Remote Sensing Symposium, IGARSS 2007, pp. 3736–3739 (2007)
6. Gamanya, R., De Maeyer, P., De Dapper, M.: An automated satellite image classification design using object-oriented segmentation algorithms: A move towards standardization. Expert Systems with Applications 32(2), 616–624 (2007)
7. Haralick, R., Shapiro, L.: Image segmentation techniques. Applications of Artificial Intelligence II 548, 2–9 (1985)
8. He, Y., Wang, H., Zhang, B.: Color based road detection in urban traffic scenes. In: Proceedings of IEEE Intelligent Transportation Systems, vol. 1 (2003)
9. Kim, D., Jeong, S., Park, C.: Comparison of Three Land Cover Classification Algorithms – ISODATA, SMA, and SOM – for the Monitoring of North Korea with MODIS Multi-temporal Data. Korean Journal of Remote Sensing 23(3), 181–188 (2007)
10. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Berlin (2001)
11. Korting, T.S., Fonseca, L.M.G., Bação, F.: Expectation-Maximization x Self-Organizing Maps for Image classification. In: IEEE International Conference on Signal Image Technology and Internet Based Systems, SITIS 2008, pp. 359–365 (2008)
12. Laaksonen, J., Viitaniemi, V., Koskela, M.: Application of Self-Organizing Maps and automatic image segmentation to 101 object categories database. In: Proc. Fourth International Workshop on Content-Based Multimedia Indexing (CBMI 2005), Riga, Latvia. Citeseer (2005)
13. Lee, S., Lathrop, R.: Subpixel analysis of Landsat ETM+ Using Self-Organizing Map (SOM) Neural Networks for Urban Land Cover Characterization. IEEE Transactions on Geoscience and Remote Sensing 44(6), 1642–1654 (2006)
14. Li, Y., Chi, Z.: MR Brain image segmentation based on self-organizing map network. International Journal of Information Technology 11(8), 45–53 (2005)
15. Moreira, J., Costa, L.F.: Neural-based color image segmentation and classification using self-organizing maps. Proceedings of the IX SIBGRAPI 12(6), 47–54 (1996)
16. Perez, A., Benlloch, J., Lopez, F., Christensen, S.: Colour and shape analysis techniques for weed detection in cereal fields. Computers and Electronics in Agriculture 25, 197–212 (2000)
17. Silva, M., Câmara, G., Souza, R., Valeriano, D., Escada, M.: Mining patterns of change in remote sensing image databases. In: The Fifth IEEE International Conference on Data Mining, New Orleans, Louisiana, USA. Citeseer (2005)
18. Tobler, W.: A Computer Movie Simulating Urban Growth in the Detroit Region. Economic Geography 46, 234–240 (1970)
19. Vergés-Llahí, J.: Color Constancy and Image Segmentation Techniques for Applications to Mobile Robotics. Ph.D. thesis, UPC (2005)
20. Villmann, T., Merényi, E.: Extensions and modifications of the Kohonen-SOM and applications in remote sensing image analysis. Studies in Fuzziness and Soft Computing 78, 121–144 (2002)
21. Wehrens, R.: Self-organising Maps for Image Segmentation. In: Advances in Data Analysis, Data Handling and Business Intelligence: Proceedings of the 32nd Annual Conference of the Gesellschaft für Klassifikation e.V., Joint Conference with the British Classification Society (BCS) and the Dutch/Flemish Classification, p. 373. Springer, Heidelberg (2009)
A Multi-Layer 'Gas of Circles' Markov Random Field Model for the Extraction of Overlapping Near-Circular Objects

Jozsef Nemeth¹, Zoltan Kato¹, and Ian Jermyn²
¹ Image Processing and Computer Graphics Department, University of Szeged, P.O. Box 652, 6701 Szeged, Hungary
² Department of Mathematical Sciences, Durham University, South Road, Durham DH1 3LE, United Kingdom
Abstract. We propose a multi-layer binary Markov random field (MRF) model that assigns high probability to object configurations in the image domain consisting of an unknown number of possibly touching or overlapping near-circular objects of approximately a given size. Each layer has an associated binary field that specifies a region corresponding to objects. Overlapping objects are represented by regions in different layers. Within each layer, long-range interactions favor connected components of approximately circular shape, while regions in different layers that overlap are penalized. Used as a prior coupled with a suitable data likelihood, the model can be used for object extraction from images, e.g. cells in biological images or densely-packed tree crowns in remote sensing images. We present a theoretical and experimental analysis of the model, and demonstrate its performance on various synthetic and biomedical images.
1 Introduction

Object extraction remains one of the key problems of computer vision and image processing. The problem is easily stated: find the regions in the image domain occupied by a specified object or objects. The solution of this problem often requires high-level knowledge about the shape of the objects sought in order to deal with high noise, cluttered backgrounds, or occlusions [4,11,8,1]. As a result, most approaches to extraction have, to differing degrees and in different ways, incorporated prior knowledge about the shape of the objects sought. Early approaches were quite generic, essentially encouraging smoothness of object boundaries [6,9,3,2,10]. For example, [10] uses a Markovian smoothness prior (basically a Potts model, i.e. boundary length is penalized); [6] uses a line process to control the formation of region boundaries and control curvature; while classical active contour models [9] use boundary length and curvature, and region area in order to favor smooth closed curves [3,2].
This research was partially supported by the grant CNK80370 of the National Office for Research and Technology (NKTH) & Hungarian Scientific Research Fund (OTKA); by the European Union and co-financed by the European Regional Development Fund within the project TAMOP-4.2.1/B-09/1/KONV-2010-0005.
Subsequently there has been a great deal of work on the inclusion of more specific prior shape knowledge in a variational [4,13] or probabilistic [5,15] framework. Many of these methods rely on a kind of template matching: shape variability is modeled as deformations of a reference shape or shapes. Although these methods are useful for many applications, the major drawback of using a reference shape (or shapes) is that handling an unknown number of instances of an object in the same image is difficult. An alternative approach, known as ‘higher-order active contours’ (HOACs), was presented and developed in [11,7,8]. HOAC models integrate shape knowledge without using reference shapes via the inclusion of explicit long-range dependencies between region boundary points. The lack of reference shapes means that they can be used to extract multiple instances of the same object. In [8], Horvath et al. showed how to set the parameters of the model introduced in [11] to favor regions consisting of any number of approximately circular connected components, each component having approximately the same, specified radius. This ‘gas of circles’ (GOC) model was successfully used for the extraction of tree crowns from aerial images. A subsequent reformulation of HOAC models (and active contour models in general) as equivalent phase field models [12,7] brings a number of theoretical and algorithmic advantages. One of the most important of these is that phase field models can be interpreted as real-valued Markov random fields (MRFs), thereby allowing the theoretical and algorithmic toolbox of random field theory to be brought to bear. In [1], this was carried out, and an MRF GOC model equivalent to the phase field GOC model was developed. For many important applications, for example the extraction of cells from light microscope images in biology, or the extraction of densely packed tree crowns in remote sensing images, these methods have limitations. The first is due to the representation: distinct overlapping objects cannot be represented. This is because the representation used is of a region, i.e. a subset of the image domain, and not of objects as such. Thus if the regions corresponding to two objects overlap, they form the single region that is their union. This cannot be distinguished from a single object occupying the same region. The second is due to the model: the same long-range interactions that favor nearcircular shapes also introduce a repulsive energy between nearby objects that means that configurations containing nearby objects have low probability, even if they do not overlap. In this paper, we propose a generalization of the MRF GOC model that overcomes these limitations: the multi-layer MRF GOC model. This consists of multiple copies of the MRF GOC model in [1], each copy being known as a layer. Now overlapping objects can be represented, as subsets of two different layers. The layers interact via a penalty for the overlap of regions in different layers, and this inter-layer interaction is crucial, particularly when a likelihood term is added. In its absence, the maximum probability configuration would simply be the same in all layers and equal to that found using the model in [1]. The result is that rather than the regions corresponding to two overlapping objects necessarily merging into a single region, it may be energetically favourable for the two regions corresponding to the two separate objects to appear in different layers. 
We begin by recalling the single-layer ‘gas of circles’ model.
2 The Single-Layer 'Gas of Circles' Model

The 'gas of circles' model assigns high probability to regions in the image domain consisting of some number of approximately circular connected components, each of which has approximately the same, specified radius, and that are more than a certain distance apart. There are three equivalent formulations of the model: higher-order active contours (HOACs) [8], phase fields [7], and Markov random fields [1]. In the next three subsections, we explain the three formulations, since each provides some insight into the model, and the equivalences between them.

2.1 Contour Representation

In the contour formulation, a region R is represented by its boundary ∂R, which is an equivalence class (under diffeomorphisms of their domain) of zero or more closed parameterized curves. The HOAC energy for the GOC model is [8]:

E(\gamma) = \lambda_c L(\gamma) + \alpha_c A(\gamma) - \frac{\beta_c}{2} \iint n(t) \cdot n(t')\, G(\gamma(t) - \gamma(t'))\, dt\, dt' ,   (1)

where the contour γ of length L(γ) represents the boundary ∂R of the extracted foreground regions, with total area A(γ). The last term of Eq. (1) is responsible for the geometry of the extracted regions, where n and n' are the normal vectors at t and t' respectively, while G is the so-called interaction function

G(z) = \begin{cases} \frac{1}{2}\left( 2 - \frac{z}{d} - \frac{1}{\pi}\sin\frac{\pi(z-d)}{d} \right) & \text{if } z < 2d, \\ 1 - H(z - d) & \text{otherwise,} \end{cases}   (2)

where d controls the range of interaction and H is the Heaviside step function. Horvath et al. showed in [8] that parameter triples (λ_c, α_c, β_c) satisfying certain stability conditions will produce circular regions of a given radius r, yielding the first definition of the 'gas of circles' HOAC model.
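To make the role of the interaction function concrete, the short sketch below evaluates Eq. (2) as reconstructed above; the vectorized interface and the default d = 2 are choices of this example only.

```python
import numpy as np

def interaction_G(z, d=2.0):
    """Interaction function G(z) of Eq. (2), as reconstructed above.

    For z < 2d it decreases smoothly from G(0) = 1 to G(2d) = 0; beyond 2d
    the value 1 - H(z - d) is identically zero (H is the Heaviside step).
    """
    z = np.asarray(z, dtype=float)
    short_range = 0.5 * (2.0 - z / d - np.sin(np.pi * (z - d) / d) / np.pi)
    heaviside = (z - d >= 0).astype(float)           # H(z - d)
    return np.where(z < 2 * d, short_range, 1.0 - heaviside)
```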
Fig. 1. The interaction function G(z) for d = 2 and the corresponding geometric kernel (shown in 2D and 3D)
2.2 Phase Field Representation

The phase field framework represents a region R by a function φ : D → ℝ defined on the image domain D ⊂ ℝ², and a threshold t: R = ζ_t(φ) = {x ∈ D : φ(x) ≥ t}. The phase field formulation E(φ) of the contour energy Eq. (1) was described in [12]:

E(\phi) = \int_D \left\{ \frac{D_f}{2} |\nabla\phi|^2 + \lambda_f \left( \frac{\phi^4}{4} - \frac{\phi^2}{2} \right) + \alpha_f \left( \phi - \frac{\phi^3}{3} \right) \right\} - \frac{\beta_f}{2} \int_{D \times D} \nabla\phi \cdot \nabla'\phi \; G(x - x') .   (3)

It is convenient to integrate the non-local term by parts:

- \frac{\beta_f}{2} \int_{D \times D} \nabla\phi \cdot \nabla'\phi \; G(x - x') = \frac{\beta_f}{2} \int_{D \times D} \phi \, \phi' \, \underbrace{\nabla^2 G(x - x')}_{\hat{G}(x - x')} .

The value φ_R that minimizes E(φ) for a fixed region R takes the value +1 inside R and −1 outside, away from the boundary ∂R, while changing smoothly from −1 to +1 in a narrow interface region around ∂R. Basically, the linear operator Ĝ acts directly on the phase field φ as a geometric kernel (see Fig. 1). In the 'gas of circles' model, the parameters of E(φ) are adjusted using the contour stability analysis and the equivalence between the formulations so that a circle of the desired radius is stable [7,8].
Fig. 2. MRF neighbourhoods: intra-layer interactions (singleton, doubleton & long range) and the inter-layer interaction
2.3 Binary MRF Representation

Discretizing the field energy Eq. (3) leads to a Markovian interpretation of the phase field model, where φ becomes a random field ω taking the discrete values ±1 [1]. The resulting energy of the prior distribution P(ω) is given by

U(\omega) = \alpha \sum_s \omega_s + \frac{D}{2} \sum_{s \sim s'} (\omega_s - \omega_{s'})^2 + \frac{\beta}{2} \sum_{s, s'} F_{ss'} \, \omega_s \omega_{s'} ,   (4)
where s denotes the lattice sites (or pixels) of the discrete image domain S and ∼ is the nearest neighbour relation. The model parameters are related to those of the phase field model by α = 2α_f / 3; β = β_f; while D = 0.82 D_f / 4 incorporates the integral over pairs of boundary
lattice cells. F_{ss'} is a discrete approximation of Ĝ [1], which also determines the size of the neighborhood: {s' ∈ S : |s − s'| < 2d}, as shown in Fig. 2. The singleton potential α ω_s of the prior energy corresponds to an area term: a lower α favors more foreground pixels and vice versa, while the doubleton potential D(ω_s − ω_{s'})², acting over the nearest neighborhood of s, ensures smoothness by penalizing boundary formation. Finally, the long-range potentials enforce the geometric constraints, thereby forming circles:

\beta F_{ss'} \omega_s \omega_{s'} = \begin{cases} -\beta F_{ss'} & \text{if } \omega_s \neq \omega_{s'}, \\ +\beta F_{ss'} & \text{otherwise.} \end{cases}   (5)

From Fig. 2, it is clear that the long-range potentials favour the same label when |s − s'| < d (attractive case) and different labels when d < |s − s'| < 2d (repulsive case), where d̃ ≈ d denotes the zero of Ĝ.
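A direct transcription of Eq. (4) for a small binary field is sketched below. The representation of F as a dictionary of offsets, the periodic (toroidal) boundary handling via np.roll, and the convention that the nearest-neighbour sum counts each unordered pair once are simplifications of this example, not part of the model definition.

```python
import numpy as np

def goc_prior_energy(omega, F, alpha, D, beta):
    """Single-layer 'gas of circles' prior energy U(omega) of Eq. (4).

    omega : 2D array with values in {-1, +1}.
    F     : dict mapping offsets (dy, dx) -> F_{ss'}, a discrete approximation
            of the long-range kernel (include both (dy, dx) and (-dy, -dx)).
    """
    U = alpha * omega.sum()
    # doubleton term over nearest-neighbour pairs (right and down offsets,
    # so every unordered pair is counted once; adjust the factor if the
    # sums in Eq. (4) run over ordered pairs)
    for dy, dx in ((0, 1), (1, 0)):
        a = omega[:omega.shape[0] - dy, :omega.shape[1] - dx]
        b = omega[dy:, dx:]
        U += (D / 2.0) * np.sum((a - b) ** 2)
    # long-range term: (beta/2) * sum_{s,s'} F_{ss'} omega_s omega_{s'}
    for (dy, dx), f in F.items():
        shifted = np.roll(np.roll(omega, dy, axis=0), dx, axis=1)
        U += (beta / 2.0) * f * np.sum(omega * shifted)
    return U
```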
3 The Multi-Layer MRF 'Gas of Circles' Model

We are now in a position to describe the multi-layer generalization of the MRF GOC model just described. The MRF GOC model has two limitations that render it inappropriate for many applications. First, touching or overlapping objects cannot be represented as separate entities in this model. This is because the representation used is of a region, not of objects as such. If the regions R_1 and R_2 corresponding to two objects overlap, the result is a single region R = R_1 ∪ R_2 that cannot be distinguished from the representation of a single object occupying the whole of R. Second, the model energy has a sometimes undesirable effect: it discourages connected components from being too close to one another. This is because the same interactions that favor stable circles also produce a repulsive interaction that raises the energy when two circles are closer than 2d. Thus while this model is able to separate, for example, tree crowns in regular plantations, it cannot represent, nor does it model well, configurations in which objects are touching or overlapping (cf. Fig. 8). The multi-layer MRF GOC model removes both these limitations by using multiple copies of the MRF GOC model, as follows. The domain of the binary random field becomes S̃ = L × S; alternatively, the field is a map from S to {−1, +1}^L, where L denotes either Z^+ or the set {1, . . . , ℓ}. Hence ω = {ω^{(i)}} for i ∈ L, where ω^{(i)} : S → {−1, +1}. In principle, we would like L = Z^+, i.e., an infinite number of layers, as this would place no restrictions on the possible configurations. In practice, there is always a maximum number of mutual overlaps, and ℓ need be no larger than this. Sites that only differ in the value of i correspond to the same spatial point. Thus S̃ can be thought of as a series of layers, each of which is isomorphic to S, hence the name 'multi-layer'. It is clear that the multi-layer field can represent overlapping objects, simply by placing the regions corresponding to them on different layers. The Gibbs energy Ũ of the multi-layer model is the sum of the MRF GOC energies of each layer, plus an inter-layer interaction term that penalizes overlaps (see Fig. 2):

\tilde{U}(\omega) = \sum_{i=1}^{\ell} U(\omega^{(i)}) + \frac{\kappa}{4} \sum_{i \neq j} \sum_s \big(1 + \omega_s^{(i)}\big)\big(1 + \omega_s^{(j)}\big) ,   (6)
Fig. 3. Configurations of two overlapping circles of radius r (overlap width w > 0, polar angles θ_1, θ_2) and corresponding plots of E^{(M)}(r, w) and E^{(S)}(r, w) vs. w for two circles of radius r = 10
where κ is a new parameter controlling the strength of the overlap penalty.¹ Note that the inter-layer energy is ultralocal: only corresponding sites on different layers interact. Thus two regions in different layers experience no interaction at all unless they overlap. This eliminates the repulsive energy that exists in the single-layer model, because nearby but non-overlapping regions in different layers always have lower energy than the same regions in the same layer, assuming the intra-layer interactions are repulsive.

3.1 Energy of Two Interacting Circles

In order better to understand the behaviour of the model, in this section we analyze the energy of two circles, on the same layer and on different layers. We consider the configurations shown in Fig. 3, where w stands for the size of the intersection: w < 0 means the circles do not intersect, while w > 0 represents a non-empty intersection of width w. We want to express the energy of these configurations as a function of w. We take advantage of the equivalence of the 'gas of circles' MRF and HOAC models to use the higher-order active contour energy Eq. (1) to compute the energy of the two circles. The parameters of this energy come from the equivalences between the three formulations: β_c = 4β; the unit weight of a boundary point is 4D/0.82; while the difference in energy between an interior and an exterior point is 2α. Thus the MRF energy of a single circle with radius r can be written as

E(r) = \frac{4D}{0.82}\, 2\pi r + 2\alpha \pi r^2 - 2\beta \int_0^{2\pi} d\theta\, d\theta'\; r^2 \cos(\theta - \theta')\, G(\gamma(\theta) - \gamma(\theta')) ,   (7)

where γ is an embedding corresponding to the circle, parameterized, as shown in Fig. 3, by the polar angle θ.

Different layers: When the two circles are in different layers, the only interaction energy is the inter-layer overlap penalty. Thus the energy is constant until the circles start to overlap. It then starts to increase:
¹ Notice that Ũ is invariant to permutations of the layers. This will remain true even after we add a likelihood energy. Thus all configurations, and in particular minimum energy configurations, are ℓ! times degenerate. In practice, this degeneracy will be spontaneously broken by the optimization algorithm.
E^{(M)}(r, w) = 2E(r) + \kappa A(r, w) ,   (8)

where A(r, w) is the area of the overlap, given by

A(r, w) = \begin{cases} 2\left( r^2 \arccos\left(1 - \frac{w}{2r}\right) - \left(r - \frac{w}{2}\right)\sqrt{rw - \frac{w^2}{4}} \right) & \text{if } w > 0, \\ 0 & \text{otherwise.} \end{cases}   (9)
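The overlap area of Eq. (9), in the equivalent form written above, is easy to check numerically; the function below is a small sketch (the endpoint checks are only for illustration).

```python
import numpy as np

def overlap_area(r, w):
    """Area of the intersection of two circles of radius r whose lens has
    width w along the line of centres (centre distance 2r - w); cf. Eq. (9).
    Returns 0 when the circles do not overlap (w <= 0)."""
    if w <= 0:
        return 0.0
    w = min(w, 2.0 * r)                       # w = 2r means coincident circles
    return 2.0 * (r ** 2 * np.arccos(1.0 - w / (2.0 * r))
                  - (r - w / 2.0) * np.sqrt(r * w - w ** 2 / 4.0))

# sanity checks: no overlap -> 0; full overlap (w = 2r) -> pi * r^2
print(overlap_area(10, -2), overlap_area(10, 10), overlap_area(10, 20))
```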
Same layer: When the two circles are in the same layer, they interact if w > −2d for the particular form of interaction function in Eq. (2). (Note that we need only consider w ≤ 2r, where r is the radius of the circles, due to symmetry.) Thus if w ≤ −2d, the energy is simply 2E(r). For w > −2d, the energy increases with w until w ≈ 0. As the circles start to overlap (and thus no longer form two circles, but a combined 'dumbbell' shape), there is effectively an attractive energy that causes an energy decrease with increasing w until the combined shape, and thus the energy, becomes that of a single circle (w = 2r). More precisely, the energy of the two circles is

E^{(S)}(r, w) = \frac{4D}{0.82}\, 2\big(2r\pi - L(r, w)\big) + 2\alpha \big(2r^2\pi - A(r, w)\big)
  - 4\beta \int_{\theta_s}^{\theta_f} d\theta_1\, d\theta_1'\; r^2 \cos(\theta_1 - \theta_1')\, G(\gamma_1(\theta_1) - \gamma_1(\theta_1'))
  - 2\beta \int_{\theta_s}^{\theta_f} d\theta_1\, d\theta_2\; r^2 \cos(\theta_1 - (\pi - \theta_2))\, G(\Delta(\theta_1, \theta_2, w)) ,   (10)

where γ_1 and γ_2 are two embeddings corresponding to the two circles, parameterized by the angles θ_1 and θ_2 respectively, as shown in Fig. 3. We have taken advantage of symmetry to write the second line in terms of γ_1 only. L(r, w) is the arc length of the intersection segment, while

\Delta(\theta_1, \theta_2, w) = \sqrt{\big(r(\sin\theta_1 - \sin\theta_2)\big)^2 + \big(2r - w - r(\cos\theta_1 - \cos\theta_2)\big)^2}   (11)

is the distance between the points γ_1(θ_1) and γ_2(θ_2). The limits θ_s = \cos^{-1}\!\big(\min(1, 1 - \frac{w}{2d})\big) and θ_f = 2π − θ_s are the radial angles of the two intersection points. The right-hand side of Fig. 3 shows plots of E^{(M)}(r, w) and E^{(S)}(r, w) against w for circles with r = 10. When the overlap is greater than a certain threshold, controlled by κ, the energy of two circles in different layers becomes greater than that of two partially merged circles in one layer. Below this threshold, the two-layer configuration has the lower energy. The stable configuration energy of two circles is given by the lower envelope of the curves in Fig. 3, and thus the repulsive energy that exists in the single-layer MRF GOC model is eliminated in the multi-layer MRF GOC model.
4 Experimental Results

In this section, we report on the quantitative evaluation of the behavior and performance of the multi-layer MRF GOC model in object extraction problems involving
simulated data and microscope images. Results were obtained as MAP estimates, using the multi-layer MRF GOC model as a prior, combined with a likelihood energy UL ˜ (ω), to be described shortly: ω ˆ = arg maxω P (I|ω)P (ω) = arg minω UL (I, ω) + U where I : S → is the image data. Optimization was performed using Gibbs sampling coupled with simulated annealing [6]. The annealing schedule was exponential, with half-life at least 70 iterations, and a starting temperature of 3.0 for the parameter values used in the experiments. 4.1 Data Likelihood The data likelihood models the image in the interior and exterior regions using Gaussian distributions with constant means, and covariances equal to different multiples of the identity. In addition, we add an image gradient term connecting neighboring pixels, as follows. For each pair of neighboring sites, s and s , let (s, s ) be the unit vector pointing from s to s . Let sˆ = arg maxt∈{s,s } (|∇I(t)|). Let h(s, s ) = |(s, s ) · ∇I(ˆ s)|. Then define (i) (i) h(s, s ) ωs = ωs , gi (s, s ) = (12) |∇I(ˆ s)| − h(s, s ) otherwise. The likelihood energy then becomes 2
Is − μω(i) γ2
1/2 s UL (I, ω) = γ ln (2π) σω(i) + + gi (s, s ) , s 2σ 2 (i) 2 s s i s ∼s ωs (13) where γ and γ2 are positive weights. In practice, the parameters μ±1 and σ±1 of the Gaussian distributions were learned from representative samples.
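As a rough illustration of Eqs. (12)-(13), the numpy sketch below evaluates the likelihood energy for a single layer of a binary labelling. It simplifies the pairwise term by evaluating the image gradient at the first site of each neighbouring pair rather than at ŝ, and the function name and parameter layout are our own; it is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def likelihood_energy(I, omega, mu, sigma, gamma=1.0, gamma2=0.0):
    """Data likelihood of Eq. (13) for one layer.

    I      : 2-D image array
    omega  : 2-D array of labels in {-1, +1} for this layer
    mu, sigma : dicts mapping label (+1 / -1) to the Gaussian mean / std
    (hypothetical helper; the paper sums this over all layers i)
    """
    # Gaussian data term: ln((2*pi)^(1/2) * sigma) + (I - mu)^2 / (2 * sigma^2)
    mu_map = np.where(omega > 0, mu[+1], mu[-1])
    sig_map = np.where(omega > 0, sigma[+1], sigma[-1])
    data = np.log(np.sqrt(2 * np.pi) * sig_map) + (I - mu_map) ** 2 / (2 * sig_map ** 2)
    energy = gamma * data.sum()

    # Gradient coupling term of Eq. (12), summed over horizontal and vertical pairs;
    # the gradient is evaluated at the first site of each pair (simplification).
    gy, gx = np.gradient(I.astype(float))
    grad_mag = np.hypot(gx, gy)
    for axis, comp in ((0, gy), (1, gx)):
        same = np.equal(omega, np.roll(omega, -1, axis=axis))
        h = np.abs(comp)                      # |unit vector . grad I| along this axis
        g = np.where(same, h, grad_mag - h)   # Eq. (12)
        sl = [slice(None)] * 2
        sl[axis] = slice(0, -1)               # drop the wrap-around pair from np.roll
        energy += 0.5 * gamma2 * g[tuple(sl)].sum()
    return energy
```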
Fig. 4. Stable configurations of the multi-layer MRF GOC model for different numbers of layers ℓ and values of κ (panels: ℓ = 1; ℓ = 2, κ = 0.4; ℓ = 3, κ = 0.4; ℓ = 4, κ = 0.4; ℓ = 5, κ = 0.05; ℓ = 5, κ = 0.4; ℓ = 6, κ = 0.05; ℓ = 6, κ = 0.4)
Fig. 5. Plots of the relative interior area (left) and shape error (right) of the stable configurations against κ
4.2 Simulation Results with the Multi-Layer MRF GOC Model

In the first experiment, we study the global minima of Ũ. Choosing, wlog, d = 10, with the intra-layer parameters α = 0.18634, D = 0.15451, and β = 0.091137 set according to the stability constraints [8,1] and to ensure that stable circles have negative energy, Ũ was then minimized for different numbers of layers and values of κ. Fig. 4 shows representative examples of these optimal configurations. The top-left result has ℓ = 1: note the spacing of the circles due to the intra-layer repulsive energy. When there are more layers, the intra-layer energies favour a similarly dense ‘gas of circles’ in each layer. For ℓ ≤ 3, every layer may contain such a configuration without the circles in different layers overlapping. For ℓ > 3, it is not possible to achieve both an optimal configuration in each layer and zero overlap energy. For small κ, the model tries to generate a dense configuration in each layer at the price of having overlaps. For large κ, the situation is the opposite: the model tries to avoid overlaps at the price of having fewer circles in each layer. Fig. 5 shows a plot of the relative interior area (1/N) H(ω) against κ, where N = |S|. The value is almost constant for ℓ ≤ 3, while for ℓ > 3, the value decreases with κ. The circularity of the regions was also evaluated. The right-hand plot in Fig. 5 shows the percentage of pixels outside the ideal desired circles. Although for ℓ > 3 these errors increase slightly, overall they remain low, meaning that the connected components remain circles to good accuracy for all ℓ and κ.

4.3 Quantitative Evaluation on Synthetic Images

In this experiment, we demonstrate the efficiency of our model in separating overlapping circles. A series of noisy synthetic images were generated containing two circles of radius 10 with different degrees of overlap. The weights in the likelihood energy were set to γ = 0.1 and γ₂ = 0, i.e. no gradient term was used. We used two layers and differing κ values in the range [0.01, 1]. Segmentation error was evaluated as the proportion of incorrectly segmented pixels. A plot of these errors versus the amount of overlap w and κ is shown in Fig. 6. Note that there is a rather clear drop in the segmentation error for κ ≅ 0.7. When w > 10 (corresponding to an overlap of greater than 50%), a larger κ is required to get an accurate segmentation.
Fig. 6. Results on noisy synthetic images (SNR = 0 dB) containing two circles of radius 10 with different degrees of overlap. Left: typical extraction results (columns: noisy image, small κ, best κ, big κ). Right: plot of segmentation error as a function of degree of overlap (w) and κ.
In the last case in Fig. 6, κ = 0.88 was needed, and for w > 15 it is hard to get good-quality results. In summary, the model performs well for reasonable overlaps and is not sensitive to the value of κ. On the other hand, there is a performance drop for very large overlaps.

4.4 Application in Biomedical Imaging

Biomedical image segmentation aims to find the boundaries of various biological structures, e.g. cells, chromosomes, genes, proteins and other sub-cellular components in various image types [14]. Light microscope techniques are often used, but the resulting images are frequently noisy, blurred, and of low contrast, making accurate segmentation difficult. In many cases, the geometric structures involved are near-circular with many overlaps, so that our model seems well suited to extracting the desired structures. The extraction results shown in Fig. 7 and Fig. 8 demonstrate the effectiveness of the proposed multi-layer MRF GOC model for this type of task. Computation times vary from ∼20 s to ∼1000 s for images of size N = 10⁴. The key factor is the number of layers, with the minimum time corresponding to ℓ = 2 and the maximum to ℓ = 6.
Fig. 7. Extraction of cells from light microscope images using the multi-layer MRF GOC model
Fig. 8. Extraction of lipid drops from light microscope images using the multi-layer MRF GOC model
5 Conclusion

The multi-layer MRF GOC model enables the representation and modeling of object configurations consisting of an a priori unknown number of approximately circular objects of roughly the same size, which may touch or overlap. Such configurations occur in a number of domains, notably biomedicine and biology (e.g. cell images), and remote sensing (e.g. images of closely planted trees). Experiments show that the model behaves as expected on theoretical grounds, and that, when coupled with an appropriate likelihood model, it can successfully extract such object configurations from synthetic and real images. The multi-layer model should also enable the extraction of several sets of approximately circular objects of different sizes, by setting the model parameters differently on different layers of the model.
References 1. Blaskovics, T., Kato, Z., Jermyn, I.: A Markov random field model for extracting nearcircular shapes. In: IEEE Proceedings of International Conference on Image Processing, pp. 1073–1076. IEEE, Cairo (2009) 2. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. International Journal of Computer Vision 22(1), 61–79 (1997) 3. Cohen, L.: On active contour models and balloons. Computer Vision, Graphics and Image Processing: Image Understanding 53, 211–218 (1991) 4. Cremers, D., Tischhauser, F., Weickert, J., Schnorr, C.: Diffusion snakes: Introducing statistical shape knowledge into the Mumford-Shah functional. International Journal of Computer Vision 50(3), 295–313 (2002) 5. Flach, B., Schlesinger, D.: Combining shape priors and MRF-segmentation. In: da Vitoria Lobo, N., Kasparis, T., Roli, F., Kwok, J.T., Georgiopoulos, M., Anagnostopoulos, G.C., Loog, M. (eds.) S+SSPR 2008. LNCS, vol. 5342, pp. 177–186. Springer, Heidelberg (2008) 6. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741 (1984) 7. Horvath, P., Jermyn, I.H.: A ‘gas of circles’ phase field model and its application to tree crown extraction. In: Proceedings of European Signal Processing Conference (EUSIPCO), Poznan, Poland (September 2007)
8. Horvath, P., Jermyn, I., Kato, Z., Zerubia, J.: A higher-order active contour model of a ‘gas of circles’ and its application to tree crown extraction. Pattern Recognition 42(5), 699–709 (2009) 9. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988) 10. Kato, Z., Berthod, M., Zerubia, J.: A hierarchical Markov random field model and multitemperature annealing for parallel image classification. Computer Vision, Graphics and Image Processing: Graphical Models and Image Processing 58(1), 18–37 (1996) 11. Rochery, M., Jermyn, I.H., Zerubia, J.: Higher order active contours. International Journal of Computer Vision 69(1), 27–42 (2006), http://dx.doi.org/10.1007/ s11263-006-6851-y 12. Rochery, M., Jermyn, I.H., Zerubia, J.: Phase field models and higher-order active contours. In: Proc. IEEE International Conference on Computer Vision (ICCV), Beijing, China (October 2005) 13. Rousson, M., Paragios, N.: Shape priors for level set representations. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 78–92. Springer, Heidelberg (2002) 14. Russell, C., Metaxas, D., Restif, C., Torr, P.: Using the Pn Potts model with learning methods to segment live cell images. In: IEEE 11th International Conference on Computer Vision, pp. 1–8. IEEE, Los Alamitos (2007) 15. Srivastava, A., Joshi, S., Mio, W., Liu, X.: Statistical shape analysis: Clustering, learning, and testing. IEEE Trans. Pattern Analysis and Machine Intelligence 27(4), 590–602 (2005)
Evaluation of Image Segmentation Algorithms from the Perspective of Salient Region Detection Bogdan Popescu, Andreea Iancu, Dumitru Dan Burdescu, Marius Brezovan, and Eugen Ganea University of Craiova, Software Engineering Department, Craiova, Bd. Decebal 107, Romania {bogdan.popescu,andreea.iancu}@itsix.com, {dumitru burdescu,brezovan marius,ganea eugen}@software.ucv.ro
Abstract. The present paper addresses the problem of image segmentation evaluation by comparing seven different approaches. We present a new method of salient object detection with very good results relative to other already known object detection methods. We developed a simple evaluation framework in order to compare the results of our method with other segmentation methods. The results of our experimental work offer good perspectives for our algorithm in terms of efficiency and precision. Keywords: color segmentation; graph-based segmentation; salient region detection.
1
Introduction
Image segmentation is a very important operation performed on acquired images. The evaluation of this process [5] focuses on two main topics: generality and objectivity. Generality means that the test images in the benchmark should have a large variety so that the evaluation results can be extended to other images and applications. Objectivity means that all the test images in the benchmark should have an unambiguous ground-truth segmentation so that the evaluation can be conducted objectively. The main target of the image segmentation process is the domain-independent partition of the image into a set of regions which are visually distinct and uniform with respect to some property, such as grey level, texture or color. The problem of segmentation is an important research field and many segmentation methods have been proposed in the literature so far ([1],[6],[7],[10],[2],[3],[4]). The objective of this paper is to underline the very good results of image segmentation obtained by our segmentation technique, Graph-Based Salient Object Detection, and to compare them with other existing methods. The other methods that we use for comparison are: Efficient Graph-Based Image Segmentation (Local Variation), Normalized Cuts, Unsupervised Segmentation of Natural Images
via Lossy Data Compression, ROI-SEG: Unsupervised Color Segmentation by Combining Differently Focused Sub-Results, Mean Shift and Multi-Layer Spectral Segmentation. They are all complex and well-known algorithms, with very good results and proven efficiency in this area. The experiments were completed using the images and ground-truth segmentations of the MSRA Salient Object Database [17]. The paper is organized as follows: in Section 2 we briefly present previous studies in the domain of image segmentation evaluation, and in Section 3 we present the segmentation methods we compare, including the one we propose. The methodology of performance evaluation is presented in Section 4. The experimental results are presented in Section 5. Section 6 concludes the paper and outlines the main directions of the future work.
2
Related Work
Image segmentation evaluation is an open subject in today's image processing field. The goal of existing studies is to establish the accuracy of each individual approach and to find new improvement methods. Some of the previous works in this domain do not require a ground-truth image segmentation as the reference. In these methods, the segmentation performance is usually measured by some contextual and perceptual properties, such as the homogeneity within the resulting segments and the inhomogeneity across neighboring segments. Most segmentation evaluation methods, however, require ground-truth image segmentations as reference. Since the construction of the ground-truth segmentation for many real images is labor-intensive and sometimes not well or uniquely defined, most prior image segmentation methods are only tested on some special classes of images used in special applications where the ground-truth segmentations are uniquely defined, on synthetic images where the ground-truth segmentation is also well defined, and/or on a small set of real images. The main drawback of providing such a reference is represented by the resources that are needed. However, after analyzing the differences between the image under study and the ground-truth segmentation, a performance proof is obtained. Region-based segmentation methods can be broadly classified as either model-based [14] or visual feature-based [15] approaches. A distinct category of region-based segmentation methods that is relevant to our approach is represented by graph-based segmentation methods. Most graph-based segmentation methods attempt to search for a certain structure in the associated edge-weighted graph constructed on the image pixels, such as a minimum spanning tree [6] or a minimum cut [16]. Related work on quantitative segmentation evaluation includes both standalone evaluation methods, which do not make use of a reference segmentation, and relative evaluation methods employing ground truth. For standalone evaluation of image segmentations, metrics for intra-object homogeneity and inter-object disparity have been proposed in [24]. A combined
segmentation and evaluation scheme is developed in [25]. Although standalone evaluation methods can be very useful in such applications, their results do not necessarily coincide with the human perception of the goodness of segmentation. However, when a reference mask is available or can be generated, relative evaluation methods are preferred in order for the segmentation results to coincide with the human perception of the goodness of segmentation. Martin et al. [26] propose two metrics that can be used to evaluate the consistency of a pair of segmentations based on human segmentations. The measures are designed to be tolerant to refinement. A methodology for evaluating the performance of boundary detection techniques with a database containing ground-truth data was developed in [27]. It is based on the comparison of the detected boundaries with respect to human-marked boundaries using the Precision-Recall framework. Other evaluation methods are based on pixel discrepancy measures. In [28], the use of binary edge masks and scalable discrepancy measures is proposed. Odet et al. [29] also propose an evaluation method based on edge pixel discrepancy, but one in which a correspondence of regions between the reference mask and the examined one is established. This paper presents a benchmark for evaluating general-purpose image segmentation methods on a large variety of real images with well defined objects as the ground truth. It is a continued study of our previous work, with a more thorough investigation of the performance variation resulting from different parameter settings in seven well-known, region-based segmentation methods.
3
Segmentation Methods
We will compare seven different segmentation techniques: Unsupervised Segmentation of Natural Images via Lossy Data Compression [4], ROI-SEG: Unsupervised Color Segmentation by Combining Differently Focused Sub-Results [2], the Mean Shift-based segmentation algorithm [7], Multi-Layer Spectral Segmentation [3], the Efficient Graph-Based segmentation algorithm [6], the Normalized Cuts segmentation algorithm [10] and our own contour-based segmentation method [13]. We have chosen Mean Shift-based segmentation because it is generally effective and has become widely used in the vision community. The Efficient Graph-Based segmentation algorithm was chosen as an interesting comparison to Mean Shift. Its general approach is similar; however, it excludes the mean shift filtering step itself, thus partially addressing the question of whether the filtering step is useful. Due to its computational efficiency, Normalized Cuts represents a solid reference in our study. The other three methods have been chosen because they combine efficient segmentation algorithms like Mean Shift or N-Cuts in a very good manner. We have used the MSRA Salient Object Database as the reference for our research. From the many images the database provides, we have chosen 100 representative images as our experimental basis. We tried an alternative methodology to the well-known Berkeley segmentation dataset [11].
Our experimental results have shown that [2] offers the most similar approach to the one we are presenting. Our contribution is based on two main aspects: (a) in order to minimize the running time we construct a hexagonal structure based on the image pixels, that is used in both the color-based and the syntactic-based segmentation algorithms, and (b) we propose an efficient method for segmentation of color images based on spanning trees and both color and syntactic features of regions.

3.1 Graph-Based Salient Object Detection
We introduce an efficient segmentation method that uses color and some geometric features of an image to process it and produce a reliable result [13]. The color space we use is RGB, because of its color consistency and computational efficiency. What is particular about this approach is the use of a hexagonal structure instead of individual color pixels. In this way we can represent the structure as a grid-graph G = (V, E) where each hexagon h in the structure has a corresponding vertex v ∈ V, as presented in Figure 1. Every hexagon has six neighbors and each neighborhood connection is represented by an edge in the set E of the graph. For each hexagon in the structure two important attributes are associated: the dominant color and the coordinates of the gravity center. Basically, each hexagonal cell contains eight pixels: six on the frontier and two in the middle. Image segmentation is realized in two distinct steps. The first step is a pre-segmentation step in which only color information is used to determine an initial segmentation. The second step is a syntactic-based segmentation step in which both color and geometric properties of regions are used. The first step of the segmentation algorithm uses a color-based region model and produces a forest of maximum spanning trees based on a modified form of Kruskal's algorithm. In this case the evidence for a boundary between two adjacent regions is based on the difference between the internal contrast and the external contrast between the regions. The color-based segmentation algorithm builds a maximal spanning tree for each salient region of the input image.
Fig. 1. The grid-graph constructed on the hexagonal structure of an image
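As an illustration of the data structure behind Fig. 1, the sketch below groups pixels into small offset-row cells of roughly eight pixels, stores the mean (dominant) colour and gravity centre per cell, and links each cell to up to six neighbours. The cell shape, the neighbour offsets and the helper name are simplifications chosen for brevity; the authors' exact hexagonal construction differs in detail.

```python
import numpy as np

def hex_grid_graph(image, cell_h=2, cell_w=4):
    """Approximate hexagonal grid-graph: cells of about 8 pixels in offset rows,
    each with a mean colour and gravity centre, linked to up to six neighbours."""
    H, W, _ = image.shape
    cells, edges = {}, set()
    for ci, i in enumerate(range(0, H - cell_h + 1, cell_h)):
        offset = (ci % 2) * (cell_w // 2)            # shift every other row
        for cj, j in enumerate(range(offset, W - cell_w + 1, cell_w)):
            block = image[i:i + cell_h, j:j + cell_w].reshape(-1, 3)
            cells[(ci, cj)] = {
                "color": block.mean(axis=0),                      # dominant colour
                "center": (i + cell_h / 2.0, j + cell_w / 2.0),   # gravity centre
            }
    # up to six neighbours: left/right plus four cells in the adjacent rows
    nbrs = [(0, -1), (0, 1), (-1, -1), (-1, 0), (1, -1), (1, 0)]
    for (ci, cj) in cells:
        for di, dj in nbrs:
            other = (ci + di, cj + dj)
            if other in cells:
                edges.add(tuple(sorted([(ci, cj), other])))
    return cells, edges
```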
The second step of the segmentation algorithm uses a new graph, which has a vertex for each connected component determined by the color-based segmentation algorithm. In this case the region model contains in addition some geometric properties of regions, such as the area of the region and the region boundary. The final segmentation step produces a forest of minimum spanning trees based on a modified form of Borůvka's algorithm. Each determined minimum spanning tree represents a final salient region returned by the segmentation algorithm.

3.2 Unsupervised Segmentation of Natural Images via Lossy Data Compression
The clustering method follows the principle of lossy minimum description length (LMDL) [18]: Principle (Data Segmentation via Lossy Compression). The optimal segmentation minimizes the number of bits needed to code the segmented data, subject to a given distortion. The lossy compression-based method is applied to segmenting natural images. First, a low-level segmentation is applied to partition an image into many small homogeneous patches; this introduces the notion of superpixels. The superpixels are used to initialize the mid-level texture-based segmentation, which minimizes the total coding length of all the texture features by repeatedly merging adjacent segments, subject to a distortion. The method studies several simple heuristics for choosing a good ε for each image.

3.3 ROI-SEG: Unsupervised Color Segmentation by Combining Differently Focused Sub-Results
This algorithm is used for the detection of a set of connected segments in a color image, based on a previously defined region-of-interest (ROI). The detected segments all have approximately the same color distribution as the input ROI. The sub-segmentation algorithm, which takes an arbitrarily shaped ROI and a color image as input, can be roughly divided into four subsequent steps. First, the image is converted into the CIE Luv color space in order to have an isotropic feature space. Second, the color distribution of the ROI is modeled by a Gaussian Mixture Model (GMM), which is initialized by the results of a Mean Shift algorithm. Third, all color pixels are ordered by calculating Bhattacharyya distance values for each pixel; this is done by an efficient integral-image-based approach. The ordered pixel values are passed to a modified version of the Maximally Stable Extremal Region (MSER) detector to compute the final result: a set of connected regions which approximately have the same color appearance as the input ROI.

3.4 Multi-Layer Spectral Segmentation
This algorithm introduces an affinity model for image segmentation that uses the relevance scores, learnt from the test image by semi-supervised learning [19], [20], [21], as graph affinities. The first step is to construct a multi-layer graph with
pixels and regions generated by the mean shift algorithm [7]. A semi-supervised strategy is applied on the affinities in order to efficiently estimate them. In a single multi-layer framework of Normalized Cuts, the proposed full affinities are used to simultaneously cluster all pixel and region nodes into visually coherent groups across all layers. The algorithm offers high-quality segmentation results by considering all intra- and inter-layer affinities in the spectral framework. The computation is made efficient by the eigen-decomposition of a sparse matrix. This solution produces much better segmentations with object details than other spectral segmentation methods such as Normalized Cuts (NCut) [10] on natural images.

3.5 Efficient Graph-Based Image Segmentation
Efficient Graph-Based image segmentation [6] is an efficient method of performing image segmentation. The basic principle is to directly process the data points of the image, using a variation of single linkage clustering without any additional filtering. A minimum spanning tree of the data points is used to perform traditional single linkage clustering, from which any edges with length greater than a given threshold are removed [9]. Let G = (V, E) be a fully connected graph, with m edges {e_i} and n vertices. Each vertex is a pixel, x, represented in the feature space. The final segmentation will be S = (C₁, ..., C_r), where C_i is a cluster of data points. The algorithm [6] can be shortly presented as follows:

1. Sort E = (e₁, ..., e_m) such that |e_t| ≤ |e_{t′}| for all t < t′.
2. Let S⁰ = ({x₁}, ..., {x_n}); in other words, each initial cluster contains exactly one vertex.
3. For t = 1, ..., m:
   (a) Let x_i and x_j be the vertices connected by e_t.
   (b) Let C^{t−1}_{x_i} be the connected component containing point x_i on iteration t−1 and l_i = max_{mst} C^{t−1}_{x_i} be the longest edge in the minimum spanning tree of C^{t−1}_{x_i}. Likewise for l_j.
   (c) Merge C^{t−1}_{x_i} and C^{t−1}_{x_j} if
       |e_t| < min( l_i + k/|C^{t−1}_{x_i}|, l_j + k/|C^{t−1}_{x_j}| ).   (1)
4. S = S^m.
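The merge rule of Eq. (1) is easy to realize with a union-find structure; the sketch below is a compact Python rendering of steps 1-4 above, taking an edge list of (weight, i, j) triples and the scale parameter k. It is a didactic sketch rather than the reference implementation of [6].

```python
class DisjointSet:
    """Union-find with per-component size and longest-internal-edge bookkeeping."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.max_edge = [0.0] * n        # longest MST edge inside each component (l_i)

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b, w):
        a, b = self.find(a), self.find(b)
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        self.max_edge[a] = max(self.max_edge[a], self.max_edge[b], w)

def segment(n_vertices, edges, k):
    """edges: list of (weight, i, j); k: scale of observation."""
    ds = DisjointSet(n_vertices)
    for w, i, j in sorted(edges):                  # step 1: sort by weight
        a, b = ds.find(i), ds.find(j)
        if a == b:
            continue
        thr_a = ds.max_edge[a] + k / ds.size[a]    # l_i + k / |C_i|
        thr_b = ds.max_edge[b] + k / ds.size[b]    # l_j + k / |C_j|
        if w < min(thr_a, thr_b):                  # merge criterion, Eq. (1)
            ds.union(i, j, w)
    return ds
```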
3.6 Normalized Cuts
The Normalized Cuts method models an image using a graph G = (V, E), where V is a set of vertices corresponding to image pixels and E is a set of edges connecting neighboring pixels. The edge weight w(u, v) describes the affinity between two vertices u and v based on different metrics like proximity and intensity similarity. The algorithm segments an image into two segments that correspond to a graph cut (A, B), where A and B are the vertices in the two resulting subgraphs.
The segmentation cost is defined by

Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V),   (2)

where cut(A, B) = Σ_{u∈A, v∈B} w(u, v) is the cut cost of (A, B) and assoc(A, V) = Σ_{u∈A, v∈V} w(u, v) is the association between A and V. The algorithm finds a graph cut (A, B) with a minimum cost in Eq. (2). Since this is an NP-complete problem, a spectral graph algorithm was developed to find an approximate solution [10]. This algorithm can be recursively applied on the resulting subgraphs to get more segments. For this method, the most important parameter is the number of regions to be segmented. Normalized Cuts is an unbiased measure of dissociation between the subgraphs, and it has the property that minimizing normalized cuts leads directly to maximizing the normalized association relative to the total association within the sub-groups.
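For a given binary partition, the cost of Eq. (2) can be evaluated directly from the affinity matrix, as in the short sketch below; the function name and the boolean-mask interface are ours.

```python
import numpy as np

def ncut_cost(W, mask):
    """Normalized-cut cost of Eq. (2).

    W    : (n, n) symmetric affinity matrix w(u, v)
    mask : boolean array of length n, True for vertices in A, False for B
    """
    A, B = mask, ~mask
    cut = W[np.ix_(A, B)].sum()        # cut(A, B)
    assoc_A = W[A, :].sum()            # assoc(A, V)
    assoc_B = W[B, :].sum()            # assoc(B, V)
    return cut / assoc_A + cut / assoc_B
```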
3.7 Mean Shift
The Mean Shift-based segmentation technique [7] is one of many techniques dealing with “feature space analysis”. Advantages of feature-space methods are the global representation of the original data and the excellent tolerance to noise [12]. The algorithm has two important steps: a mean shift filtering of the image data in feature space, and a clustering of the already filtered data points. During the filtering step, segments are processed using the kernel density estimation of the gradient. Details can be found in [7]. A uniform kernel for gradient estimation with radius vector h = [hs, hs, hr, hr, hr] is used, where hs is the radius of the spatial dimensions and hr the radius of the color dimensions. Combining these two parameters, complex analysis can be performed while training on different subjects. Mean shift filtering is only a preprocessing step. Another step is required in the segmentation process: clustering of the filtered data points {x′}. During filtering, each data point in the feature space is replaced by its corresponding mode. This suggests a single linkage clustering that converts the filtered points into a segmentation. Another paper that describes the clustering is [8]. A region adjacency graph (RAG) is created to hierarchically cluster the modes. Also, edge information from an edge detector is combined with the color information to better guide the clustering. This is the method used in the available EDISON system, also described in [8]. The EDISON system is the implementation we use in our evaluation system.
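The filtering step can be illustrated by following the joint spatial-range mean-shift trajectory of a single grey-scale pixel with a flat kernel, as in the sketch below. A square spatial window of half-size hs stands in for the circular kernel, and the routine is a didactic sketch, not the EDISON implementation used in the evaluation.

```python
import numpy as np

def mean_shift_filter_pixel(F, i, j, hs=8, hr=10, n_iter=20):
    """Joint spatial-range mean shift for pixel (i, j) of a grey-scale image F,
    with a flat kernel: square spatial window of half-size hs, range half-width hr."""
    H, W = F.shape
    y, x, v = float(i), float(j), float(F[i, j])
    for _ in range(n_iter):
        y0, y1 = int(max(0, y - hs)), int(min(H, y + hs + 1))
        x0, x1 = int(max(0, x - hs)), int(min(W, x + hs + 1))
        win = F[y0:y1, x0:x1].astype(float)
        yy, xx = np.mgrid[y0:y1, x0:x1]
        near = np.abs(win - v) <= hr            # flat kernel in the range domain
        if not near.any():
            break
        ny, nx, nv = yy[near].mean(), xx[near].mean(), win[near].mean()
        if abs(ny - y) < 0.5 and abs(nx - x) < 0.5 and abs(nv - v) < 0.5:
            return nv                           # converged to a mode
        y, x, v = ny, nx, nv
    return v                                    # pixel is replaced by the mode's range value
```

Applying this to every pixel and then clustering pixels whose modes coincide (single linkage) yields the segmentation described above.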
4
Saliency Detection Performance Evaluation
We present comparative results of segmentation performance for our contour-based segmentation method and the six alternative segmentation methods
mentioned above. Our evaluation measure is based on the performance measure from [22], which we have modified and adapted to fit our needs and better illustrate the results. The image is partitioned into two segments, with one as the foreground and the other as the background. Of course, the segmentation methods produce more than two regions. All the methods partition an image into a set of disjoint segments without labeling the foreground and background. We have used a region-merging strategy so that they can be fairly evaluated in the benchmark. If the segments in the image I are {R₁, R₂, ..., R_n}, the ground-truth foreground segment corresponds to a subset of the disjoint segments. To evaluate these methods we apply a strategy to merge the segments and then use the merged region as the detected foreground object. For each segment R_i in an image, we count it into the foreground R if it has more than 50% overlap with the ground-truth foreground A in terms of area. The basic performance measure we use for this analysis is the Jaccard coefficient [23], which measures the region coincidence between the segmentation result and the ground truth. Specifically, let the region A be the ground-truth foreground object and the region R be the merged segments derived from the segmentation result using the region-merging strategy. The segmentation accuracy is

P(R; A) = |R ∩ A| / |R ∪ A| = |R ∩ A| / (|R| + |A| − |R ∩ A|),   (3)

where |X| represents the cardinality of the set X, which estimates the area of the region X. This measure has no bias towards segmentations that produce an overly large or small number of segments. In this equation, R ∩ A measures how much of the ground-truth object is detected and R ∪ A is a normalization factor that normalizes the accuracy measure to the range [0, 1]. Based on the normalization factor, the accuracy measure penalizes the detection of irrelevant regions as foreground segments. This region-based measure is insensitive to small variations in the ground-truth construction and incorporates the accuracy and recall measurement into one unified function. We have adapted this metric by dividing the result by the square root of the number of objects in the current segmentation:

PP(R; A) = P(R; A) · √(‖A‖ / ‖R‖),   (4)

where ‖S‖ represents the number of detected regions of a segmentation S. We use the value PP(R; A) for comparisons because the current segmentation needs to have almost the same number of segments as the ground truth. We used the MSRA Salient Object Database to perform our analysis and evaluation. This is a public dataset that aims to facilitate research in multimedia information retrieval and related areas. The images in the dataset are collected from a commercial search engine with more than 1000 queries. It contains about 1 million images and 20,000 videos. The surrounding texts that are obtained from
Fig. 2. Comparative analysis for: Graph-Based Salient Object Detection - GBSOD, Efficient Graph-Based Image Segmentation (Local Variation) - EGB, Mean Shift - MS, Normalized Cuts - NC, Unsupervised Segmentation of Natural Images via Lossy Data Compression - YA, ROI-SEG: Unsupervised Color Segmentation by Combining Differently Focused Sub-Results - DO, Multi-Layer Spectral Segmentation - KI
more than 1 million Web pages are also provided. The images and videos have been comprehensively annotated, including their relevance levels to corresponding queries, semantic concepts of images, and category and quality information of videos. Six standard tasks are defined on this dataset: (1) image search reranking; (2) image annotation; (3) query-by-example image search; (4) video search reranking; (5) video categorization; and (6) video quality assessment. The segmentation accuracy mentioned above only provides an upper bound of the segmentation performance by assuming an ideal postprocess of region merging for applications without a priori known exact ground truth. For the extreme case where each pixel is partitioned as a segment, the upper-bound performance obtained is a meaningless value of 100%. This is a little similar to the GCE and LCE measures developed in the Berkeley benchmark. But the difference is that GCE and LCE also result in meaningless high accuracy when too few segments are produced, such as the case where the whole image is partitioned as a single segment. In this paper, we always set the segmentation parameters to produce a reasonably small number of segments when applying the strategy to merge the image regions.
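The region-merging strategy and the two measures of Eqs. (3)-(4) translate directly into code; the sketch below works on boolean masks. The normalisation in adapted_PP follows our reading of Eq. (4) (scaling P by the square root of the ratio of ground-truth objects to detected regions), so treat the exact factor as an assumption.

```python
import numpy as np

def merged_foreground(segments, gt):
    """Region-merging strategy of Section 4: a segment is counted as foreground
    if more than 50% of its area overlaps the ground-truth object gt."""
    R = np.zeros_like(gt, dtype=bool)
    for seg in segments:                               # each seg is a boolean mask
        if np.logical_and(seg, gt).sum() > 0.5 * seg.sum():
            R |= seg
    return R

def accuracy_P(R, A):
    """Jaccard coefficient of Eq. (3)."""
    inter = np.logical_and(R, A).sum()
    union = np.logical_or(R, A).sum()
    return inter / union if union else 0.0

def adapted_PP(R, A, n_detected, n_ground_truth=1):
    """Adapted measure of Eq. (4): with a single ground-truth object this divides
    P(R;A) by the square root of the number of detected regions."""
    return accuracy_P(R, A) * np.sqrt(n_ground_truth / n_detected)
```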
5
Experimental Results
Our study of segmentation quality is based on experimental results and uses the MSRA Salient Object Database provided at [17]. In order to obtain the evaluation results, we have performed segmentation of the selected images with all the algorithms using various parameter settings.
More precisely, by varying some parameters, we have obtained 10 distinct points that define the curve for each approach. For Normalized Cuts [10] we have varied the number of segments in the range {5, 10, 12, 15, 20, 25, 30, 40, 50, 70}. The variable parameter for Efficient Graph-Based Image Segmentation [6] was the scale of observation, k, in the range {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}. For Mean Shift [7] we have used the 10 combinations of spatial bandwidth {8, 16} and range bandwidth {4, 7, 10, 13, 16}. For our customized performance measure we have obtained the results illustrated in Figure 2. From the presented diagram we can see that the PP(R;A) curve for our proposed method, denoted GBSOD (Graph-Based Salient Object Detection), lies above the other curves, indicating a better performance result and a balanced algorithm.
6
Conclusion and Future Work
We described in this paper a new graph-based method for image segmentation and extraction of visual objects. Starting from a survey of several segmentation strategies, we have performed an image segmentation evaluation experiment. Our segmentation method and six other segmentation methodologies were chosen for the experiment, and the complementary nature of the methods was demonstrated in the results. The study results offer a clear view of the effectiveness of each segmentation algorithm, trying in this way to offer a solid reference for future studies. Future work will be carried out in the direction of integrating syntactic visual information into the semantic level of a semantic image processing and indexing system.
References 1. Fu, K., Mui, J.: A survey on image segmentation. Pattern Recognition (1981) 2. Donoser, M., Bischof, H.: ROI-SEG: Unsupervised Color Segmentation by Combining Differently Focused Sub Results. Institute for Computer Graphics and Vision Graz University of Technology 3. Kim, T.H., Lee, K.M., Lee, S.U.: Learning Full Pairwise Affinities for Spectral Segmentation. Dept. of EECS, ASRI, Seoul National University, 151-742, Seoul, Korea 4. Yang, A.Y., Wright, J., Sastry, S., Ma, Y.: Unsupervised Segmentation of Natural Images via Lossy Data Compression (2007) (preprint submitted to Elsevier) 5. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. Int. Conf. Comp. Vis., vol. 2, pp. 416–425 (2001) 6. Felzenszwalb, P., Huttenlocher, D.: Efficient Graph-Based Image Segmentation. Intl. J. Computer Vision 59(2) (2004)
7. Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach toward Feature Space Analysis. IEEE Trans. Pattern Analysis and Machine Intelligence 24, 603–619 (2002) 8. Christoudias, C., Georgescu, B., Meer, P.: Synergism in Low Level Vision. In: Proc. Intl. Conf. Pattern Recognition, vol. 4, pp. 150–156 (2002) 9. Unnikrishnan, R., Pantofaru, C., Hebert, M.: Toward Objective Evaluation of Image Segmentation Algorithms. IEEE Transactions on Pattern Analysis and Machine Inteligence 29(6) (2007) 10. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8) (2000) 11. Berkeley Segmentation and Boundary Detection Benchmark and Dataset (2003), http://www.cs.berkeley.edu/projects/vision/grouping/segbench 12. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons, New York (2000) 13. Brezovan, M., Burdescu, D.D., Ganea, E., Stanescu, L.: An Adaptive Method for Efficient Detection of Salient Visual Object from Color Images. In: Proc. of the 20th International Conference on Pattern Recognition (ICPR), Istambul, pp. 2346–2349 (2010) 14. Carson, C., Belongie, S., Greenspan, H., Malik, J.: Blobworld: Image segmentation using expectation-maximization and its application to image querying and classification. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(8), 1026–1037 (2002) 15. Fauqueur, J., Boujemaa, N.: Region-based image retrieval: Fast coarse segmentation and fine color description. Journal of Visual Languages and Computing 15(1), 69–95 (2004) 16. Shi, J., Malik, J.: Normalized cuts and image segmentation. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp. 731–737 (1997) 17. MSRA Salient Object Database, http://research.microsoft.com/en-us/um/ people/jiansun/SalientObject/salient_object.htm 18. Dowson, D., Landau, B.: The Frechet distance between multivariate normal distributions. Journal Multivariate Analysis 12(3), 450–455 (1982) 19. Brin, S., Page., L.: The anatomy of a large-scale hypertextual web search engine. In: WWW (1998) 20. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Scholkopf, B.: Learning with local and global consistency. In: NIPS (2003) 21. Pan, J.-Y., Yang, H.-J., Faloutsos, C., Duygulu, P.: Automatic multimedia crossmodal correlation discovery. In: KDD (2004) 22. Ge, F., Wang, S., Liu, T.: New benchmark for image segmentation evaluation. Journal of Electronic Imaging 16(3), 033011 (JulSep 2007) 23. Cox, T., Cox, M.: Multidimensional Scaling. 2nd edn. Chapman and Hall/CRC Press, Boca Raton, FL (2000) 24. Levine, M., Nazif, A.: Dynamic measurement of computer generated image segmentations. IEEE Trans. on Pattern Analysis and Machine Intelligence 7, 155–164 (1985) 25. Zhang, Y., Wardi, Y.: A recursive segmentation and classification scheme for improving segmentation accuracy and detection rate in realtime machine vision applications. In: Proc. of the Int. Conf. on Digital Signal Processing (DSP 2002), vol. 2 (July, 2002)
26. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. of the IEEE Conference on Computer Vision, pp. 414–425 (2001) 27. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color and texture cues. IEEE Trans. on Pattern Analysis and Machine Intelligence 26(5), 530–549 (2004) 28. Huang, Q., Dom, B.: Quantitative methods of evaluating image segmentation. In: Proc. of the Int. Conf. on Image Processing (ICIP 1995), vol. 3, pp. 53–56 (1995) 29. Odet, C., Belaroussi, B., Benoit-Cattin, H.: Scalable discrepancy measures for segmentation evaluation. In: Proc. of the Int. Conf. on Image Processing (ICIP 2002), vol. 1, pp. 785–788 (2002)
Robust Active Contour Segmentation with an Efficient Global Optimizer Jonas De Vylder, Jan Aelterman, and Wilfried Philips Department of Telecommunications and Information Processing, IBBT - Image Processing and Interpretation, Ghent University, St-Pietersnieuwstraat 41, B-9000 Ghent, Belgium [email protected] http://telin.ugent.be/~ jdvylder/
Abstract. Active contours or snakes are widely used for segmentation and tracking. Recently a new active contour model was proposed, combining edge and region information. The method has a convex energy function, thus becoming invariant to the initialization of the active contour. This method is promising, but has no regularization term; therefore its segmentation results are highly dependent on the quality of the images. We propose a new active contour model which also uses region and edge information, but which has an extra regularization term. This work provides an efficient optimization scheme based on Split Bregman for the proposed active contour method. It is experimentally shown that the proposed method gives significantly better results in the presence of noise and clutter. Keywords: Active contours, segmentation, convex optimization, Split Bregman.
1
Introduction
Since Kass et al. [1] introduced their snakes, the active contour framework has become a constantly recurring topic in the segmentation literature. The framework allows easy tuning to specific segmentation and tracking problems: prior motion information of objects which need to be tracked [2–4], specific shape models [2, 5, 6], region statistics of objects [7, 8], etc. are just a few of the different forms of prior knowledge which have been incorporated in the active contour framework. In the active contour framework, an initial contour is moved and deformed in order to minimize a specific energy function. This energy function should be minimal when the contour is delineating the object of interest, e.g. a leaf. Two main groups can be distinguished in the active contour framework: one group representing the active contour explicitly as a parameterized curve, and a second group which represents the contour implicitly using level sets. In the first group, also called snakes, the contour generally converges towards edges in the image [1, 5, 9]. The second group generally has an energy function based on region
properties, such as the variance of intensity of the enclosed segment [7, 10]. These level set approaches have gained a lot of interest since they have some benefits over snakes. For example, they can easily change their topology, e.g. splitting a segment into multiple unconnected segments. Recently an active contour model has been proposed with a convex energy function, making it possible to define fast global optimizers [11, 12]. These global active contours have the benefit that their result is no longer dependent on the initialization. In [13], Bresson et al. proposed a new type of active contour with a convex energy function, a model which combined edge information and region information. This method combines the original snake model [1] with the active contour model without edges [7]. The proposed model has an energy function which is completely defined by the image, thus eliminating the possibility of regularization. Although the method has some interesting benefits, it lacks robustness to noise and clutter. To tackle this problem, we propose a new active contour model which has the benefits of the model proposed in [13], but which has an extra regularization term. This regularization term enforces smoothness of the boundaries of the segments. This results in segments with a smooth boundary, i.e. it avoids jagged edges due to noise. This paper is arranged as follows. The next section briefly enumerates the notations and symbols used in this paper. In Section 3 the current state of the art of global optimum active contours is summarized and expanded with the proposed method. Section 4 elaborates on a fast optimization method which can be used to calculate the proposed active contour. The next section shows some examples and quantitative results of our technique in comparison to other active contour formulations; both convergence speed and segmentation results are examined. Section 6 recapitulates and concludes.
2
Notations and Definitions
In the remainder of this paper we will use specific notations, some conventional, some more peculiar to this work. Therefore we briefly summarize the notations and symbols used in this work. We will refer to an image F in its vector notation, i.e. f(i·m + j) = F(i, j), where m × n is the dimension of the image. In a similar way we will represent the contour in vector format, u. If a pixel U(i, j) is part of the segment, it will have a value above a certain threshold; all background pixels will have a value lower than the given threshold. Note that this is similar to level sets. The way these contours are optimized, however, is different than with classical level-set active contours, as is explained in the next section. We will use image operators, i.e. gradient, divergence and Laplacian, in combination with this vector notation; however, the semantics of the image operators remains the same as if they were used with the classical matrix notation:

∇f(i·m + j) = ( F(i+1, j) − F(i, j), F(i, j+1) − F(i, j) )
∇ · f(i·m + j) = F(i+1, j) − F(i, j) + F(i, j+1) − F(i, j)
∇² f(i·m + j) = ∇ · ∇f(i·m + j)
Further we will use the following inner product and norm notations:

⟨f, g⟩ = Σ_{i=1}^{mn} f(i) g(i),
‖f‖_{α,g} = ( Σ_{i=1}^{mn} g(i) |f(i)|^α )^{1/α}.
If the weights g(i) = 1 for all i, then we will omit g, since we assume this will not cause confusion, but will increase readability.
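The vector notation and the discrete operators and norms above can be made concrete with a few lines of numpy; the sketch below uses forward differences with a zero pad at the last row/column, which is one common convention and an assumption on our part.

```python
import numpy as np

def to_vector(F):
    """Vector notation of the paper: f(i*m + j) = F(i, j)."""
    return F.reshape(-1)

def grad(F):
    """Forward differences as defined in Section 2 (zero at the last row/column)."""
    gx = np.zeros_like(F); gx[:-1, :] = F[1:, :] - F[:-1, :]   # F(i+1, j) - F(i, j)
    gy = np.zeros_like(F); gy[:, :-1] = F[:, 1:] - F[:, :-1]   # F(i, j+1) - F(i, j)
    return gx, gy

def weighted_norm(f, alpha, g=None):
    """|| f ||_{alpha, g} = ( sum_i g(i) |f(i)|^alpha )^(1/alpha)."""
    g = np.ones_like(f) if g is None else g
    return (g * np.abs(f) ** alpha).sum() ** (1.0 / alpha)
```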
3
Convex Energy Active Contours
In [11] an active contour model was proposed which has global minimizers. This active contour is calculated by minimizing the following convex energy:

E[u] = ‖∇u‖₁ + μ⟨u, r⟩   (1)

with

r = (m_f − f)² − (m_b − f)².   (2)
Here f represents the intensity values in the image, m_f and m_b are respectively the mean intensity of the segment and the mean intensity of the background, i.e. every pixel not belonging to the segment. Note that this energy is convex only if m_f and m_b are constant. This problem can be solved by iterating between the following two steps: first fix m_f and m_b and minimize eq. (1), secondly update m_f and m_b. Chan et al. found that the steady state of the gradient flow corresponding to this energy, i.e.

du/dt = ∇ · ( ∇u / |∇u| ) − μr,   (3)
coincides with the steady state of the gradient flow of the original Chan-Vese active contours [7, 11]. So minimizing eq. (1) is equivalent to finding an optimal contour which optimizes the original Chan-Vese energy function. Although the energy in eq. (1) does not have a unique global minimizer, a well defined minimizer can be found within the interval [0, 1]^n:

u* = arg min_{u∈[0,1]^n} ‖∇u‖₁ + μ⟨u, r⟩.   (4)
Note that this results in a minimizer whose values are between 0 and 1. It is however desirable to have a segmentation result where the values of the minimizer are constrained to {0, 1}, i.e. a pixel belongs to a segment or not. Therefore u* is thresholded, i.e.

Φ_α(u*(x)) = { 1   if u*(x) > α,
             { 0   otherwise,   (5)
with a predefined α ∈ [0, 1]. In [14] it is shown that Φ_α(u*) is a global minimizer for the energy in eq. (1) and by extension for the energy function of the original Chan-Vese active contour model. In [13] the convex energy function in eq. (1) was generalized in order to incorporate edge information:

E[u] = ‖∇u‖_{1,g} + μ⟨u, r⟩,   (6)
where g is the result of an edge detector, e.g. g = 1/(1 + |∇f|). The active contour minimizing this energy function can be seen as a combination of edge-based snake active contours [1] and the region-based Chan-Vese active contours [7]. Since this method only minimizes energy terms based on the image, it is highly influenced by the quality of the image. In the presence of noise and clutter the method will find false segments or distorted segment boundaries. In order to make the method more robust we propose to extend the energy function in eq. (6) with an extra regularization term:
E[u] = ‖∇u‖₁ + γ‖∇u‖_{1,g} + μ⟨u, r⟩,   (7)
where γ is a weighting parameter defining the influence of the extra regularization term. This regularization term approximates the length of the segment boundaries, thus penalizing small false segments and highly curved boundaries due to noise.
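The pieces of the proposed energy fit together as in the short sketch below: r is the region term of Eq. (2) computed from the current foreground/background means, g is the edge indicator, and the two weighted total-variation terms implement Eq. (7) with an anisotropic |∇u|. The parameter defaults, the centred differences from np.gradient, and the assumption that both regions are non-empty are ours.

```python
import numpy as np

def energy(u, f, mu=0.001, gamma=1.0):
    """Proposed energy of Eq. (7) for a relaxed labelling u in [0, 1]."""
    s = u > 0.5                                      # assumes both regions are non-empty
    mf, mb = f[s].mean(), f[~s].mean()
    r = (mf - f) ** 2 - (mb - f) ** 2                # region term, Eq. (2)
    gx, gy = np.gradient(f)
    g = 1.0 / (1.0 + np.hypot(gx, gy))               # edge indicator g = 1 / (1 + |grad f|)
    ux, uy = np.gradient(u)
    tv = np.abs(ux) + np.abs(uy)                     # anisotropic |grad u|
    return tv.sum() + gamma * (g * tv).sum() + mu * (u * r).sum()
```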
4
Optimization
Due to the convexity of the energy function in eq. (7), a wide range of minimizers can be used to find an optimal contour u†. The Split Bregman method is an efficient optimization technique for solving L1-regularized problems and has good convergence properties. In order to find a contour which minimizes eq. (7), the Split Bregman method will “de-couple” the L1 and L2 norms, by introducing a new variable d and by putting constraints on the problem. This results in the following optimization problem:

(u†, d†) = arg min_{u,d} ‖d‖₁ + γ‖d‖_{1,g} + μ⟨u, r⟩   such that d = ∇u.   (8)
This optimization problem can be converted to an unconstrained problem by adding a quadratic penalty function, i.e.

(u†, d†) = arg min_{u,d} ‖d‖₁ + γ‖d‖_{1,g} + μ⟨u, r⟩ + (λ/2)‖d − ∇u‖₂²,   (9)
where λ is a weighting parameter. If λ is high, d ≈ ∇u; however, setting λ high introduces numerical instability. Note that the quadratic penalty function only approximates the constraint d = ∇u. However, by using a Bregman iteration technique [15], this constraint can be enforced exactly in an efficient way. In
the Bregman iteration technique an extra vector b_k is added to the penalty function. Then the following two unconstrained steps are iteratively solved:

(u_{k+1}, d_{k+1}) = arg min_{u_k, d_k} ‖d_k‖₁ + γ‖d_k‖_{1,g} + μ⟨u_k, r⟩ + (λ/2)‖d_k − ∇u_k − b_k‖₂²,   (10)

b_{k+1} = b_k + ∇u_{k+1} − d_{k+1}.   (11)
The first step requires optimizing for two different vectors. We approximate these optimal vectors by alternating between optimizing eq. (10) for u and optimizing eq. (10) for d independently:

u_{k+1} = arg min_{u_k} μ⟨u_k, r⟩ + (λ/2)‖d_k − ∇u_k − b_k‖₂²,   (12)

d_{k+1} = arg min_{d_k} ‖d_k‖₁ + γ‖d_k‖_{1,g} + (λ/2)‖d_k − ∇u_{k+1} − b_k‖₂².   (13)
The first problem can be optimized by solving a set of Euler-Lagrange equations. For each element u(i) of the optimal u the following optimality condition should be satisfied:

∇²u(i) = (μ/λ) r(i) + ∇ · (d(i) − b(i)).   (14)

Note that this system of equations can be written as Au = w. In [12] it was proposed to solve this linear system using the iterative Gauss-Seidel method. In order to guarantee the convergence of this method, A should be strictly diagonally dominant or positive semi-definite; unfortunately, A is neither. Instead we optimize eq. (14) using the iterative conjugate residual method, which is a Krylov subspace method for which convergence is guaranteed if A is Hermitian [16]. The solution of eq. (14) is unconstrained, i.e. u(i) does not have to lie in the interval [0, 1]. Note that minimizing eq. (12) for u(i), i.e. with all other elements of u held constant, is equivalent to minimizing a quadratic function. If u(i) ∉ [0, 1] then the constrained optimum is either 0 or 1, since a quadratic function is monotonic in an interval which does not contain its extremum. So the constrained optimum can be calculated as follows:

u*(i) = max( min(u(i), 1), 0 ).   (15)

In order to calculate an optimal d_k, we can rewrite eq. (13) as follows:

d_{k+1} = arg min_{d_k} ‖d_k‖_{1,(1+γg)} + (λ/2)‖d_k − ∇u_{k+1} − b_k‖₂².   (16)
A closed form solution for this optimization step can be calculated using the shrinking operator, i.e.

d_{k+1}(i) = shrink( ∇u(i) + b_k(i), 1 + γg(i), λ ),   (17)
where

shrink(τ, θ, λ) = { 0                      if |τ| ≤ θ/λ,
                  { τ − (θ/λ) sgn(τ)       otherwise.   (18)
In Algorithm 1 we give an overview in pseudo code of the complete optimization algorithm. As an initial value for d and b we chose (0, 0). The initial estimation of m_f and m_b can be calculated based on Otsu thresholding. The CR function solves eq. (14) using the conjugate residual method, given the parameters b_k, d_k and r_k. Note that the last line is the update of r_k based on the new mean intensities of the foreground/background, which were calculated in the previous two lines of code.

Algorithm 1. Split Bregman for active contour segmentation
1:  while ‖u*_{k+1} − u*_k‖₂ > ε do
2:      u_{k+1} = CR(b_k, d_k, r_k)
3:      u*_{k+1} = max(min(u_{k+1}, 1), 0)
4:      d_{k+1} = shrink(∇u*_{k+1} + b_k, 1 + γg, λ)
5:      b_{k+1} = b_k + ∇u*_{k+1} − d_{k+1}
6:      s_k = Φ_α(u*_{k+1})
7:      mf_{k+1} = ‖s_k ∘ f‖₁ ‖s_k‖₁⁻¹
8:      mb_{k+1} = ‖s_k^c ∘ f‖₁ ‖s_k^c‖₁⁻¹
9:      r_{k+1} = (mf_k − f)² − (mb_k − f)²
10: end while
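A compact numpy rendering of Algorithm 1 is sketched below. A few Jacobi sweeps (with periodic boundaries, for brevity) stand in for the conjugate residual solver CR of line 2, the edge weight g and the initialisation are illustrative, and μ = 0.001, λ = 0.5 follow the values reported in Section 5; the code is a sketch of the scheme, not the authors' implementation.

```python
import numpy as np

def shrink(tau, theta, lam):
    """Soft-thresholding operator of Eq. (18), applied element-wise."""
    return np.where(np.abs(tau) <= theta / lam, 0.0,
                    tau - (theta / lam) * np.sign(tau))

def grad(u):
    """Forward differences (zero at the last row/column)."""
    gx = np.zeros_like(u); gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy = np.zeros_like(u); gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def div(px, py):
    """Discrete divergence, the negative adjoint of grad."""
    dx = np.zeros_like(px)
    dx[0, :] = px[0, :]; dx[1:-1, :] = px[1:-1, :] - px[:-2, :]; dx[-1, :] = -px[-2, :]
    dy = np.zeros_like(py)
    dy[:, 0] = py[:, 0]; dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]; dy[:, -1] = -py[:, -2]
    return dx + dy

def segment(f, mu=0.001, lam=0.5, gamma=1.0, alpha=0.5, n_iter=100, n_jacobi=10):
    f = f.astype(float)
    gx, gy = np.gradient(f)
    g = 1.0 / (1.0 + np.hypot(gx, gy))               # edge indicator of Section 3
    u = (f - f.min()) / (f.max() - f.min() + 1e-9)   # relaxed labelling in [0, 1]
    dx, dy, bx, by = (np.zeros_like(f) for _ in range(4))
    s = u > alpha                                    # assumes two non-empty regions
    mf, mb = f[s].mean(), f[~s].mean()
    r = (mf - f) ** 2 - (mb - f) ** 2
    for _ in range(n_iter):
        # line 2 (Eq. (14)): laplacian(u) = (mu/lam) r + div(d - b), a few Jacobi sweeps
        rhs = (mu / lam) * r + div(dx - bx, dy - by)
        for _ in range(n_jacobi):
            nbr = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                   np.roll(u, 1, 1) + np.roll(u, -1, 1))
            u = (nbr - rhs) / 4.0
        u = np.clip(u, 0.0, 1.0)                     # line 3 (Eq. (15))
        ux, uy = grad(u)
        dx = shrink(ux + bx, 1.0 + gamma * g, lam)   # line 4 (Eq. (17))
        dy = shrink(uy + by, 1.0 + gamma * g, lam)
        bx, by = bx + ux - dx, by + uy - dy          # line 5 (Eq. (11))
        s = u > alpha                                # lines 6-9: region means and term r
        if s.any() and (~s).any():
            mf, mb = f[s].mean(), f[~s].mean()
            r = (mf - f) ** 2 - (mb - f) ** 2
    return u > alpha
```

Calling segment(image) returns a boolean foreground mask; in the paper the thresholded Φ_α(u*) plays this role.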
5 Results

5.1 Examples
A typical application for active contours is segmentation of organs in medical images. As an example, Fig. 1 shows the result of segmenting white matter in an MRI image of the brain. The top row shows the raw MRI image on the left and the segmentation result of the Chan-Vese convex active contour (CVAC); this uses regularization but does not incorporate edge information. The CVAC shows a good segmentation result, although the method does make some small errors near the borders of the contour. These errors generally consist of background pixels with “high” intensity which are considered to be foreground. Due to their “high” intensity they resemble the segment; however, considering the neighbouring edges, it is unlikely that they actually belong to the segment, e.g. the white matter in this example. Some of these segmentation errors are indicated by the green arrows. The bottom row of Fig. 1 shows the segmentation results of CVACs which incorporate edge information. The left image does not incorporate a regularization term in its energy function, resulting in lots of small segments due to noise. The right image shows the segmentation result of the proposed active contour model. This active contour does not suffer from noise, nor does it have the small errors near borders which occur with the original Chan-Vese active contour model.
Fig. 1. An example of brain white matter segmentation in MRI images. Top left shows an MRI slice of the brain. Top right depicts the segmentation result of the Chan-Vese convex active contours. Bottom left shows the Chan-Vese convex active contour segmentation using edge information. Bottom right depicts the proposed segmentation method, i.e. active contours using edge information with regularization.
202
J. De Vylder, J. Aelterman, and W. Philips
Fig. 2. An example of segmentation in photos with clutter. Top left shows a gray-scale photo of a squirrel. Top right depicts the segmentation result of the Chan-Vese convex active contours. Bottom left shows the Chan-Vese convex active contour segmentation using edge information. Bottom right depicts the proposed segmentation method, i.e. active contours using edge information with regularization.
A second example is shown in Fig. 2, where a squirrel has to be segmented out of a gray-scale image. The original CVAC, shown in the top row on the right, results in a poor segmentation: a part of the head and a piece of the paw are missing in the segmentation result. Incorporating edge information helps to recover these missing parts, as can be seen in the bottom row. Due to the clutter in the background the method finds a lot of false segments if there is no regularization, as can be seen on the left. The proposed method, however, finds the biggest part of the squirrel without adding any background pixels, as is shown on the right of Fig. 2. Although the proposed method gives better results on noisy images or images with clutter, it comes with a cost, i.e. the active contour converges slower. The proposed method does converge slower than the CVAC which incorporates edge information. The convergence speed depends on the amount of regularization needed. If hardly any regularization is needed, i.e. γ in eq. (7) >> 1, the speed approaches the convergence speed of the CVAC with edge information [13]. Whereas if the regularization factor is dominant in the energy function, the convergence speed approaches the convergence speed of the original CVAC using Split Bregman optimization [15].
Fig. 3. Convergence speed of different active contours as a function of the image size n (pixels); the vertical axis is CPU time (s). The full lines are active contours from the literature: Chan-Vese active contours with and without incorporation of edge information. The dotted lines represent the convergence speed of the proposed method with different γ (γ = 5/1, 5/2, 5/3, 5/4), i.e. different ratios of regularization.
Fig. 3 shows the convergence speed as a function of the image size. Between these four different CVACs only the γ parameter was changed; μ and λ were constant between all methods, i.e. 0.001 and 0.5 respectively. The full lines depict the CVACs from the literature: in green the method using edge information, in blue the method using regularization without edge information. The dotted lines show the convergence speed of the proposed method for different γ in eq. (7).

5.2 Error Metric
For the validation of the segmentation, the Dice coefficient is used. If S is the resulting segment from the active contour, i.e. Φ_{0.5}(u*), and GT the ground truth segment based on manual segmentation, then the Dice coefficient between S and GT is defined as

d(S, GT) = 2 Area(S ∧ GT) / ( Area(S) + Area(GT) ),   (19)

where S ∧ GT consists of all pixels which belong both to the detected segment and to the ground truth segment. If S and GT are equal, the Dice coefficient is equal to one. The Dice coefficient will approach zero if the regions hardly overlap.
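Eq. (19) amounts to the following few lines for boolean masks (a trivial sketch; the degenerate empty-mask case is handled with an arbitrary convention).

```python
import numpy as np

def dice(S, GT):
    """Dice coefficient of Eq. (19) for two boolean masks."""
    inter = np.logical_and(S, GT).sum()
    denom = S.sum() + GT.sum()
    return 2.0 * inter / denom if denom else 1.0
```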
5.3 Noise Robustness
In order to quantitatively validate the proposed method, a dataset of synthetic fluorescent microscopic images was segmented. This dataset has been developed by Ruusuvuori et al. [17] and serves as a benchmark for segmentation algorithms.
Fig. 4. An example of the microscopy dataset used for validating the robustness of the proposed segmentation technique. The image is contaminated with different levels of white Gaussian noise. The top row has an SNR of 3, 5, 7 and 9 respectively. The bottom row has an SNR of 11, 13, 15 and 17 respectively.
[Fig. 5 plot: average Dice coefficient (0.4–1.0) versus SNR (dB, 2–20), for CVAC with edges and for CVAC with edges and regularization.]
Fig. 5. The average Dice coefficient for segmentation of a synthetic dataset contaminated with different noise levels. The Dice coefficients were calculated for the segmentation result coming from the Chan-Vese active contours with edge information as well as for segmentation using the proposed method.
The dataset consists of 20 images, each containing 300 fluorescent nuclei. We contaminated this dataset with eight different levels of white Gaussian noise so that we could measure the influence of noise on the proposed segmentation technique. A close-up of such an image with different noise levels can be seen in Fig. 4. The segmentation quality of the proposed method was measured using the average Dice coefficient for the full dataset and compared with the average Dice coefficient of segmentation using the CVAC with edge information. As can be seen in Fig. 5, the proposed method is significantly more robust. The proposed method still achieves a Dice coefficient of 0.7 for an image set with an SNR of only 3, in comparison with a Dice coefficient of 0.4 for the state-of-the-art CVAC.
6
Conclusion
In this paper a new active contour method has been proposed. The method is comparable with the work proposed in [13]; both methods are a combination of the original Chan-Vese active contours and snakes. The proposed method allows extra regularization, which was not possible in the method proposed by Bresson. The proposed method uses a convex energy function, allowing the use of global optimizers. This paper proposes a regularization term which enforces a smooth contour. However, other convex regularization terms could be used as well, for example regularization using non-local self-similarity or based on texture, such as is done in [18]. An efficient optimizer has been proposed using the Split Bregman optimization scheme. This results in efficient and fast optimization, although not as fast as the active contour proposed by Bresson [13]. However, it has been experimentally shown that the proposed method is significantly more robust to noise and clutter than the method proposed in [13]. Acknowledgment. This research has been made possible by the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT).
References 1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. International Journal of Computer Vision, 321–331 (1988) 2. Isard, M., Blake, A.: Active contours. Springer, Heidelberg (1998) 3. Ray, N., Acton, S.: Motion gradient vector flow: An external force for tracking rolling leukocytes with shape and size constrained active contours. IEEE Transaction on Medical Imaging 23, 1466–1478 (2004) 4. Tang, J.: A multi-direction gvf snake for the segmentation of skin cancer images. Pattern Recognition (2008) 5. Charmi, M.A., Derrode, S., Ghorbel, S.: Fourier-based geometric shape prior for snakes. Pattern Recognition Letters 29, 897–904 (2008) 6. Rochery, M., Jermyn, I.H., Zerubia, J.: Higher order active contours. Int. J. Comput. Vision 69(1), 27–42 (2006) 7. Chan, T., Vese, L.: An active contour model without edges. Scale-Space Theories in Computer Vision 1682, 141–151 (1999)
8. Mille, J.: Narrow band region-based active contours and surfaces for 2d and 3d segmentation. Computer Vision and Image Understanding 113(9), 946–965 (2009) 9. Tsechpenakis, G., Rapantizikos, K., Tsapatsoulis, N., Kollias, S.: A snake model for object tracking in natural sequences. Signal Processing: Image Communication 19, 219–238 (2004) 10. Goldenberg, R., Kimmel, R., Rivlin, E., Rudzsky, M.: Fast geodesic active contours. IEEE Transactions on Image Processing 10(10), 1467–1475 (2001) 11. Chan, T.F., Esedoglu, S., Nikolova, M.: Algorithms for finding global minimizers of image segmentation and denoising models. Siam Journal on Applied Mathematics 66(5), 1632–1648 (2006) 12. Goldstein, T., Bresson, X., Osher, S.: Geometric applications of the split bregman method: Segmentation and surface reconstruction. Journal of Scientific Computing 45(1-3), 272–293 (2010) 13. Bresson, X., Esedoglu, S., Vandergheynst, P., Thiran, J.P., Osher, S.: Fast global minimization of the active contour/snake model. Journal of Mathematical Imaging and Vision 28(2), 151–167 (2007) 14. Bresson, X., Chan, T.F.: Active contours based on chambolle’s mean curvature motion. In: 2007 IEEE International Conference on Image Processing, vol. 1-7, pp. 33–36 (2007) 15. Goldstein, T., Osher, S.: The split bregman method for l1-regularized problems. Siam Journal on Imaging Sciences 2(2), 323–343 (2009) 16. Saad, Y.: Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia (2003) 17. Ruusuvuori, P., Lehmussola, A., Selinummi, J., Rajala, T., Huttunen, H., YliHarja, O.: Set of synthetic images for validating cell image analysis. In: Proc. of the 16th European Signal Processing Conference, EUSIPCO-2008 (2008) 18. Aleman-Flores, M., Alvarez, L., Caselles, V.: Texture-oriented anisotropic filtering and geodesic active contours in breast tumor ultrasound segmentation. Journal of Mathematical Imaging and Vision 28(1), 81–97 (2007)
A Method to Generate Artificial 2D Shape Contour Based in Fourier Transform and Genetic Algorithms Maurício Falvo¹,², João Batista Florindo¹, and Odemir Martinez Bruno¹
¹ Universidade de São Paulo - Instituto de Física de São Carlos, São Carlos, Brasil [email protected], [email protected], [email protected] ² Faculdade Adventista de Hortolândia, Hortolândia, Brasil
Abstract. This work presents a simple method to generate 2D contours based on a small number of samples. The method uses the Fourier transform and genetic algorithms. Using crossover and mutation operators, new samples were generated. An application case is presented and the samples produced were tested in the construction of a classifier. The results obtained indicate that the method can be a good solution to the small sample problem for feature vectors based on shape characteristics. Keywords: small sample problem, Fourier transform, genetic algorithms.
1
Introduction
The building processes of high quality classifiers are fundamentally based on a set of training vectors, which are obtained from a number of samples greater than the number of features that make up each vector. However, when the number of samples is less than the number of features, we have a small sample problem [10]. In the literature there are works that propose some solutions to the small sample problem. The main idea to fix the small sample problem is to complement the sample set with artificial samples. The challenge of this approach is how to create artificial samples with characteristics capable of improving the statistical model without biasing it. In computer vision, Pomerleau, in his ALVINN project, used real images to create artificial samples which were applied in the training of neural networks [19,18]. Examples from object and speech recognition based on artificial samples can be found in [15]. Li and Fang [13] proposed a sample generation method named non-linear virtual sample generation (NVSG). This same method was used to generate an artificial data set from the Iris Plant Database. Lately this method was applied to facilitate cancer identification for DNA microarray data [14]. However, creating artificial samples is not a trivial issue because of the difficulty of knowing a function that can generate a statistically consistent set of features for a class of objects. In the construction of classifiers for the recognition of biological shapes [2], where the number of samples may be a critical factor due to difficult acquisition or
the rare occurrence of the researched species [17], the small sample problem becomes more evident. This article proposes a simple, efficient and statistically consistent method to create artificial shapes based on a small number of real samples, using genetic algorithms and the Fourier transform. The method was originally designed to produce samples of plant leaves but may be used to create any other set of contours. A brief theoretical introduction is given in Section 2. In Section 3 the developed method is detailed, and a case study is shown in Section 4, with the construction of a leaf shape classifier using artificial samples. In Section 5 the results and conclusions are presented.
2
A Brief Theoretical Introduction
This section presents a brief introduction to the Fourier transform applied to contour representation in the complex domain, and to the basic concepts of genetic algorithms. 2.1
Shape Representation in the Frequency Domain
The parametric signature combines the pairs of contour coordinates (X, Y), which can be expressed as a 1D signal (see Eq. (1)) when represented in the complex number domain [3]:

c(t) = x(t) + y(t)i.   (1)

From this signal, a set of contour descriptors is obtained by the discrete Fourier transform (DFT) [7], as described in [3]. These descriptors are represented in the Fourier spectrum by frequency components. The low frequencies define the global contour form, while the high frequencies define the minor details. If some high frequencies of the descriptor vector are removed (replaced by zeros by an ideal low-pass filter) and the inverse Fourier transform is applied, a 1D complex contour signal is recovered with its details reduced. This process is described by the sequence of Eqs. (2), (3) and (4). Note that ωc is the cut frequency of the ideal low-pass filter.

F(ω) = F{c(t)},   (2)

H(F(ω)) = 0 if |ω| > ωc, and H(F(ω)) = F(ω) otherwise,   (3)

c'(t) = F⁻¹{H(F(ω))}.   (4)
Figure 1 shows examples of reconstructed signals after an ideal low-pass filter was applied with different cut frequencies ω. Samples that belong to the same class are therefore distinguished by their high frequency components in the Fourier spectrum and grouped by their low frequency components. Note that with only a few descriptors it is possible to characterize the general shape of a contour. This property makes the Fourier descriptors widely used for the differentiation and characterization of boundaries [11], [22], [1], [6].
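As an illustration of Eqs. (1)–(4), the following Python/NumPy sketch (not part of the original paper; function and variable names are hypothetical) computes the Fourier descriptors of a closed contour and reconstructs a smoothed contour with an ideal low-pass filter.

import numpy as np

def fourier_descriptors(x, y):
    """Fourier descriptors of a closed contour given by coordinate arrays x, y:
    c(t) = x(t) + i*y(t) (Eq. (1)), F = DFT{c} (Eq. (2))."""
    c = np.asarray(x, dtype=float) + 1j * np.asarray(y, dtype=float)
    return np.fft.fft(c)

def lowpass_reconstruct(F, cut):
    """Ideal low-pass filtering of the descriptors (Eq. (3)) followed by the
    inverse DFT (Eq. (4)); 'cut' is the number of kept low-frequency
    coefficients on each side of the spectrum (the cut frequency)."""
    H = np.zeros_like(F)
    H[:cut + 1] = F[:cut + 1]        # low positive frequencies (incl. DC term)
    if cut > 0:
        H[-cut:] = F[-cut:]          # low negative frequencies
    c_rec = np.fft.ifft(H)
    return c_rec.real, c_rec.imag    # reconstructed x(t), y(t)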
[Fig. 1 panels: reconstructions for cut frequencies ω = 10, 20, 30, 60, 120, 180, 240 and 300.]
Fig. 1. Contours of Passiflora caerulea after applying an ideal low-pass filter with different cut frequencies ω
2.2
Genetic Algorithms
Inspired by the principles of evolution and natural selection, genetic algorithms are usually applied to practical problems of optimization and search [9]. Their application first involves encoding a finite set of potentially feasible solutions, named the initial population (see Figure 2). Each solution is ranked by a fitness function which represents the problem to be solved. The solutions that do not have a good rank are discarded and replaced by the same number of new ones. The replacement is done by generating new solutions, obtained by combining the data of two other solutions that had a good rank. Thus, the initial number of possible solutions in the set is reestablished. The cycle of selection, disposal and generation is repeated until an optimal solution set is found [5]. Note that this mimics natural selection and biological reproduction, including the crossover and mutation operations that occur during the reproduction of new solutions. Crossover and mutation are discussed below. The Crossover Operator: Considered primarily responsible for imitating the biological process of heredity, the crossover operator acts simultaneously on two solutions (as it would on chromosomes), creating two new solutions.
Fig. 2. Basic process of genetic algorithm
Figure 3 exemplifies the application of a crossover operation. Similarly to the biological process, a cut point is chosen for the parents, at which they are broken and their parts swapped, resulting in two new solution elements. The intensity of its use is measured by the crossover rate, which is defined as the ratio between the number of offspring generated by the operator and the total number of individuals in the population. A very low crossover rate can lead the genetic algorithm to a false local maximum of the fitness function, but if this rate is too high the algorithm can spend time investigating irrelevant regions of the solution space [4,12]. The Mutation Operator: Applied to solutions generated by the crossover operator, the mutation operator changes the coded information elements of a solution (as it would the genes of a chromosome). In the case of a solution encoded in binary form, this implies inverting the selected bits. The emergence of new coded elements (or genes) that were not present before in the population can contribute to solving the problem [12]. Thus, the influence of the mutation operator is evaluated by the mutation rate. This is defined as the ratio between the total number of genes altered and the total number of genes in the population. When the mutation rate is very low, the probability of finding genes that may contribute to solving the problem is reduced. However, a very high mutation rate causes serious harm to the younger generations, due to the loss of similarity between children and their parents, reflecting the inability of the algorithm to keep its search history [4]. It is therefore very important to make balanced use of this operator in each new generation to keep control over its effects.
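A minimal sketch of the two operators, applied to vectors of Fourier descriptors as used later in this paper, is given below (Python/NumPy; the mutation rate and perturbation scale are illustrative assumptions, not values from the paper).

import numpy as np

rng = np.random.default_rng()

def crossover(parent_a, parent_b):
    """One-cut-point crossover of two descriptor vectors of equal length."""
    cut = rng.integers(1, len(parent_a))            # random cut point
    child1 = np.concatenate([parent_a[:cut], parent_b[cut:]])
    child2 = np.concatenate([parent_b[:cut], parent_a[cut:]])
    return child1, child2

def mutate(descriptors, rate=0.02, scale=0.05):
    """Mutation: randomly perturb a small fraction of the (complex) frequency
    values; 'descriptors' is assumed to be the output of np.fft.fft."""
    child = descriptors.copy()
    mask = rng.random(len(child)) < rate            # genes selected for mutation
    noise = scale * (rng.standard_normal(mask.sum())
                     + 1j * rng.standard_normal(mask.sum()))
    child[mask] += noise * np.abs(child[mask])      # perturbation relative to magnitude
    return child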
Fig. 3. Example of application of the crossover and mutation operators
3
The Proposed Method
Given a class of objects for which artificial boundary samples are to be generated, the whole process basically involves:
– Division of the real samples into two families (A and B): the aim is to use one family (the A family) as the initial source of the Fourier descriptors of the contours, in the generation of descriptors of artificial samples that will be used for training a classifier. The B family will also be used to generate artificial samples, but these will be used only to test the validity of the classifier trained with the artificial samples of family A. This procedure is needed to avoid a biased classifier.
– Extraction of the Fourier descriptors from the real contours of families A and B: in this process the samples of both families are binarized and a contour-extraction algorithm of the "chain code" type is applied. A vector of the parametric contour in the domain of complex numbers is assembled according to Eq. (1). This vector is transformed by the DFT to the frequency domain and feature vectors are produced.
– Descriptor generation of new individuals through the crossover and mutation operators: the generation of new samples starts with the random selection of two members of family A. To these, the crossover operator is applied, producing two new individuals. Sometimes, randomly, a mutation operator is triggered to change the value of some frequencies of these new individuals. The frequencies at which the mutation operator is applied are chosen randomly. Two implementations of the crossover operator were used, with one and two cut points [12] (see Figure 3).
– Selection of the new individuals generated: as occurs in every biological process of reproduction, sometimes there are individuals who show severe abnormalities
Fig. 4. Example of artificial shapes of Passiflora suberosa. At the top of the samples it is possible to see the parents used in the process, represented by their outlines.
compared to their parents. In order to produce a representative sample of a class, individuals that show no consistent characteristics should be discarded. This disposal may be performed by cluster analysis, using the k-means algorithm [8] with k = 3: two of the resulting groups tend to resemble the parents, while the third one contains the samples that are most different from them; the samples of this third group are selected for disposal (a minimal sketch of this step is given below).
– Shape reconstruction of the new individuals generated: finally, the shapes of the new samples are easily obtained by applying the inverse discrete Fourier transform. The resulting images may have discontinuity problems in their outlines, which can be solved with a low computational cost through the morphological operators of dilation and erosion [20]; see Figure 4.
In the next section the method is evaluated in the construction of a leaf shape classifier.
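The following Python sketch illustrates the disposal and reconstruction steps described above. It assumes scikit-learn for k-means and uses a simple heuristic (discarding the cluster farthest from the overall mean) where the paper does not fully specify the selection rule; names and parameters are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def select_offspring(offspring):
    """Disposal step: cluster the generated descriptor vectors with k-means
    (k = 3) and discard the most atypical cluster (here: the cluster whose
    centre lies farthest from the overall mean of the magnitudes)."""
    X = np.abs(np.asarray(offspring))              # cluster on descriptor magnitudes
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
    centre = X.mean(axis=0)
    dist = [np.linalg.norm(X[labels == k].mean(axis=0) - centre) for k in range(3)]
    worst = int(np.argmax(dist))                   # cluster selected for disposal
    return [o for o, l in zip(offspring, labels) if l != worst]

def reconstruct_shape(descriptors):
    """Shape reconstruction of a selected individual by the inverse DFT."""
    c = np.fft.ifft(descriptors)
    return c.real, c.imag                          # x(t), y(t) of the new contour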
4
Validation Method
To validate the developed method, six different species with a small number of samples were used: Passiflora gibertii (5 samples), Passiflora miersii (6 samples), Passiflora suberosa (7 samples), Rauwolfia sellowii (8 samples), Cariniana estrellensis (9 samples) and Caesalpinia echinata (25 samples), shown in Fig. 5. All samples were scanned on a flatbed scanner with a resolution of 300 dpi. The binarization of the images was performed using the average thresholding algorithm
Fig. 5. a. Passiflora gibertii, b. Passiflora miersii, c. Passiflora suberosa, d. Rauwolfia sellowii, e. Cariniana estrellensis; f. Caesalpinia echinata
[22] and iterative thresholding [21]. To remove possible noise in the binary images, the morphological operators of opening and closing [20] were applied. The parametric contour was obtained with the contour tracing algorithm [16]. The 1D complex signal was obtained and the DFT was applied, resulting in the Fourier descriptors of the contour. As described in Section 3, each class of samples went through the process of generating artificial samples: the descriptors of family A produced 200 artificial samples that were used for training the neural networks, while the descriptors of family B were used to generate 50 samples for testing the classifier. The feature vector was based on 60 Fourier descriptors of curvature (for more details see [3]) built from artificial and natural samples. The classifier used was a Multi-Layer Perceptron (MLP) neural network with three layers of 30, 25 and 1 neurons, respectively. Each network is thus of a binary type, trained specifically to identify whether or not a sample belongs to a single class. Six classifiers were therefore trained and tested, each corresponding to one of the plant classes cited previously in this section. Finally, these networks were subjected to two types of tests: a first one in which each classifier tries to identify all the artificial samples created, and a second one that uses only the real samples (in this case, the samples of families A and B are reunited).
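A rough sketch of one such binary classifier, using scikit-learn's MLPClassifier with two hidden layers of 30 and 25 neurons (the single output neuron is implicit in that API), is shown below; the library choice and the data arrays are assumptions of this sketch, not of the paper.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical arrays: rows are 60-element curvature Fourier descriptor vectors.
# X_train would come from the 200 artificial samples generated from family A,
# X_test from the 50 samples generated from family B.
def train_binary_classifier(X_train, y_train):
    """One binary MLP per species, roughly matching the 30-25-1 architecture
    described above."""
    clf = MLPClassifier(hidden_layer_sizes=(30, 25), max_iter=2000)
    clf.fit(X_train, y_train)                      # y_train: 1 = target species, 0 = other
    return clf

def classification_rate(clf, X_test, y_test):
    """Fraction of correctly classified test samples."""
    return (clf.predict(X_test) == np.asarray(y_test)).mean()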
5
Results and Conclusions
Table 1 presents the classification results for the artificial samples in the first column. The classifier for the species Passiflora miersii had an accuracy rate of 86%, while the other classes had rates ranging from 92% to 100%. The second column of Table 1 shows the results of the classification of the natural samples, whose rates ranged from 71.43% to 100%. As can be seen, the results indicate that the generated artificial samples fulfilled their role in the building of the classifiers. The results obtained both in the classification of natural and of artificial samples showed consistent values. Also, the fact that not all classifiers achieve an absolute classification rate reinforces the consistency of the method of generating artificial samples.
Table 1. Classification rate of artificial and natural samples by class

Class                  | % Artificial samples | % Natural samples
Passiflora gibertii    | 100.0%               | 71.43%
Passiflora miersii     | 86.0%                | 83.33%
Passiflora suberosa    | 100.0%               | 71.43%
Rauwolfia sellowii     | 92.0%                | 100.00%
Cariniana estrellensis | 99.5%                | 77.78%
Caesalpinia echinata   | 100.0%               | 100.00%
References 1. Bandera, A., Urdiales, C., Arrebola, F., Sandoval, F.: 2d object recognition based on curvature functions obtained from local histograms of the contour chain code. Pattern Recognition Letters 20, 49–55 (1999) 2. Castañón, C.A., Fraga, J.S., Fernandez, S., Gruber, A., da Fontoura Costa, L.: Biological shape characterization for automatic image recognition and diagnosis of protozoan parasites of the genus Eimeria. Pattern Recognition 40(7), 1899–1910 (2007), http://www.sciencedirect.com/science/article/B6V14-4MMWHFH-2/2/d21bd74af95fc6439f5e08fc1ca175f7 3. Costa, L.F., Cesar, R.M.J.: Shape Analysis and Classification - Theory and Practice. Image processing series. CRC Press, Boca Raton (2001) 4. Gen, M., Cheng, R.: Genetic algorithms and engineering design. John Wiley & Sons, Inc., New York (1997) 5. Goldberg, D.E.: Genetic algorithms in search, optimization and machine learning. Addison-Wesley Publishing Company, Inc., Massachusetts (1989) 6. Gonzalez, R.C., Woods, R.E.: Processamento de Imagens Digitais. Edgard Blücher Ltda., São Paulo (2000) 7. Granlund, G.H.: Fourier preprocessing for hand print character recognition. IEEE Transactions on Computers 2(C-21), 195–201 (1972) 8. Hair, J.F.J., Black, W.C.: Cluster Analysis. In: Reading and Understanding More Multivariate Statistics, pp. 147–205. American Psychological Association, Washington, D.C. (2000) 9. Holland, J.H.: Adaptation in natural and artificial systems. MIT Press, Cambridge (1975) 10. Jain, A.K., Dubes, R.: Feature definition in pattern recognition with small sample size. Pattern Recognition 10(2), 85–97 (1978), http://www.sciencedirect.com/science/article/B6V14-48MPGX5-2S/2/0aff77f32970d8fbecc9104d3e636267 11. Jain, R., Kasturi, R., Schunk, B.G.: Machine Vision. Computer Science. McGraw-Hill, New York (1995) 12. Lacerda, E.G.M.d., Carvalho, A.C.P.L.F.: Introdução aos algoritmos genéticos. In: Sistemas Inteligentes - Aplicações a Recursos Hídricos e Ciências Ambientais, 1st edn., pp. 99–150, Coleção ABRH de Recursos Hídricos, Editora da Universidade do Rio Grande do Sul, Porto Alegre (1999) 13. Li, D.C., Fang, Y.H.: A non-linearly virtual sample generation technique using group discovery and parametric equations of hypersphere. Expert Systems with Applications 36(1), 844–851 (2009), http://www.sciencedirect.com/science/article/B6V03-4R53W74-6/2/3997723c6618ff44b5847cc7b10dc83f
14. Li, D.C., Fang, Y.H., Lai, Y.Y., Hu, S.C.: Utilization of virtual samples to facilitate cancer identification for DNA microarray data in the early stages of an investigation. Information Sciences 179(16), 2740–2753 (2009), http://www.sciencedirect.com/science/article/B6V0C-4W3HX9P-2/2/8c2f015476973daf78cc54e540c0e70a 15. Niyogi, P., Girosi, F., Poggio, T.: Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE 86(11), 2196–2209 (1998) 16. Pavlidis, T.: Algorithms for Graphics and Image Processing. Computer Science Press, Rockville (1982) 17. Pazoti, M.A., Garcia, R.E., Pessoa, J.D.C., Bruno, O.M.: Comparison of shape analysis methods for Guinardia citricarpa ascospore characterization. Electronic Journal of Biotechnology 8(3), 1–6 (2005) 18. Pomerleau, D.A.: Neural network vision for robot driving. In: The Handbook of Brain Theory and Neural Networks, pp. 161–181. University Press (1996) 19. Pomerleau, D.A.: Efficient training of artificial neural networks for autonomous navigation. Neural Computation 3(1), 88–97 (1991) 20. Serra, J.: Image Analysis and Mathematical Morphology, vol. 1. Academic Press, London (1982) 21. Sezgin, M., Sankur, B.: Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging 1(13), 146–165 (2004) 22. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision, 2nd edn. PWS Publishing (1998)
Image Segmentation Based on Electrical Proximity in a Resistor-Capacitor Network Jan Gaura, Eduard Sojka, and Michal Krumnikl VŠB - Technical University of Ostrava, Faculty of Electrical Engineering and Computer Science, 17. listopadu 15, 708 33 Ostrava-Poruba, Czech Republic {jan.gaura,eduard.sojka,michal.krumnikl}@vsb.cz
Abstract. Measuring the distances is an important problem in many image-segmentation algorithms. The distance should tell whether two image points belong to a single or, respectively, to two different image segments. The paper deals with the problem of measuring the distance along the manifold that is defined by image. We start from the discussion of difficulties that arise if the geodesic distance, diffusion distance, and some other known metrics are used. Coming from the diffusion equation and inspired by the diffusion distance, we propose to measure the proximity of points as an amount of substance that is transferred in diffusion process. The analogy between the images and electrical circuits is used in the paper, i.e., we measure the proximity as an amount of electrical charge that is transported, during a certain time interval, between two nodes of a resistor-capacitor network. We show how the quantity we introduce can be used in the algorithms for supervised (seeded) and unsupervised image segmentation. We also show that the distance between the areas consisting of more than one point (pixel) can also be easily introduced in a meaningful way. Experimental results are also presented.
1
Introduction
In many image-segmentation algorithms, measuring the distances plays the key role. The distance is usually needed as a quantity the value of which expresses the fact that two points should belong to a single or, respectively, to two different image segments. The use of the Euclidean distance measuring the straight-line lengths is often considered. In the case of image segmentation, however, the input data (images) define certain manifolds in some space (e.g., the brightness function defines such a manifold). The key question then is whether or not it could make some sense to measure the distances along this manifold instead of measuring the straight-line distance. Another natural question is what metric should be used. In the sequel, therefore, we discuss the properties of some metrics from the point of view of how useful they can be for image segmentation, especially the metrics that can be used on the manifolds. Naturally, not all algorithms are immediately prepared for measuring the distances on the manifold since they also deal with some other points arising as new positions or centroids
that may lie outside the manifold. The works [2] and [8] may be taken as an example of how an algorithm, mean shift in this case, can be modified such that the distance on the manifold could be used instead of the Euclidean distance. Consider two points, denoted by A and B, respectively, on the manifold in the situations as follows (Fig. 1). In Fig. 1a, A and B both lie in one area with a constant brightness. The distance we expect in this case is small since both points should belong to a single resulting image segment. In Fig. 1b, although the points have the same brightness, they apparently should not belong to one image segment since the areas in which they lie are separated by a “dam” created by the manifold. The distance we expect in this case is big. Fig. 1c shows two areas with the same brightness that are connected by a “channel” (the areas are separated by a weak boundary). The width of the channel determines whether or not the areas should be regarded as one segment. Usually, if the channel is narrow, we expect that the distance should be big. Intuitively, the wider the channel is, the shorter distance between A and B should be reported. Preferably, the distance should reflect a relative width of channel in some intelligible way. In Figs. 1d, 1e, two situations are depicted in which we would probably say that the points A and B belong to one area and have similar distances in both cases, in spite of the fact that one area is narrow and the other is wide.
Fig. 1. On some problems arising when measuring the distances between two points (A, B) in images: The images are represented by the contour lines (isolines) of brightness similarly as in a map; various situations (a – e) are discussed in the text
Euclidean distance measures the straight-line distances between the points that can be in general positions (not only on the manifold). As a result of the fact that the distance is not measured as a route along the manifold, it can neither reveal the separation by a dam (Fig. 1b) nor quantify the width of a possible channel between the areas with similar brightness (Fig. 1c), which may deteriorate the quality of algorithms. The distance does not depend on the width of the areas as was explained in Figs. 1d, 1e, which is desirable. Geodesic distance measures the length of the shortest path lying entirely on the manifold. The distance reveals the dams (Fig. 1b) correctly. No hint about the channel width (Fig. 1c) is given; even a very thin channel causes that a short distance is reported. In the cases from Figs. 1d, 1e, the same distance is detected, which is correct. The influence of noise is big. Even if the area is theoretically flat, the walker moving from A to B must overcome small hills and depressions
on the manifold caused by noise, which integrates along the whole path from A to B. A big distance may be incorrectly reported. If the use of geodesic distance is mentioned in some algorithms, it is probably due to its ability to detect the dams. Other mentioned problems remain. From [3], for example, the possibility of computing k shortest paths follows, which could be used for measuring the width of the channel. Nevertheless, the problems with distinguishing the situations depicted in Fig. 1c and 1d still remain. Resistance distance is a metric on graphs introduced by Klein and Randic [5] and further developed also by others [1]. In image segmentation, the use of a resistor network has already been mentioned too [4]. The resistance distance is defined as a resistance between two points in the resistor network that corresponds to the image grid; the resistances between the particular neighbouring nodes increase with the increasing local image contrast. Intuitively, the resistance distance explores many parallel paths whereas the geodesic distance explores only the shortest of them. The resistance distance is able to reveal the dams correctly. It also measures the width of channels. The wider the channel is, the higher number of conductive parallel paths runs through the channel, which decreases the resistance between the points. Unfortunately, the behaviour in the case of measuring the distance in the objects of various width/size (Figs. 1d, 1e) does not seem to be fully satisfactory. In the wide objects, shorter distances are reported than in the objects that are narrow. The resistance distance may introduce an unwanted ordering of distance values, e.g., it may report the distance from Fig. 1e as the shortest, then the distance from Fig. 1c, and, finally, the distance from Fig. 1d. This is a problem since, at the time when the distance is measured, the shape of the object is not known and, therefore, it is difficult to judge on the basis of the value whether or not the pixels belong to one object. The resistance distance can be measured not only between two points but also between two sets of points (we can connect all the nodes of one set together and similarly also all the nodes of the other set and measure the resistance between these new two terminals). Diffusion distance is a cumulative squared difference between the fields obtained by solving the diffusion equation for a chosen time and for the impulses placed at the points between which the distance is to be computed. The problem with the diffusion distance is the interpretation of the value that is obtained. From the value itself, it is not clear whether at all some amount of the substance diffusing from one point reached the other point and, therefore, whether the situation along the whole path between both points is taken into account (otherwise, the value can only reflect how the substance spreads in the neighbourhood of both points). The definition of the diffusion distance can be easily extended also for measuring the distance between two sets of points (areas). In this case, however, the distance also takes into account the difference in size of both sets, which makes the distance hard to use in the algorithms based on merging smaller areas into bigger ones. Let it be pointed out that the use of standard diffusion distance has been reported in some contexts, e.g., for histogram comparison [6], but not for segmenting images.
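For illustration, the resistance distance mentioned above can be computed on a small weighted graph via the pseudoinverse of the graph Laplacian, using the identity R_ij = L⁺_ii + L⁺_jj − 2L⁺_ij. The following Python sketch is not from the paper; it assumes W is a symmetric matrix of edge conductances, and for image-sized graphs a sparse solver would be needed instead of a dense pseudoinverse.

import numpy as np

def resistance_distance(W, i, j):
    """Resistance distance between nodes i and j of a graph with symmetric
    edge-conductance matrix W (W[a, b] = conductance of the resistor between
    nodes a and b, 0 if they are not connected)."""
    L = np.diag(W.sum(axis=1)) - W        # graph Laplacian
    Lp = np.linalg.pinv(L)                # Moore-Penrose pseudoinverse
    return Lp[i, i] + Lp[j, j] - 2.0 * Lp[i, j]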
It follows that none of the discussed metrics is problem-free. Intuitively, it seems that a tool for measuring the sizes of objects is missing in some of them. The width of the channel from Fig. 1c, for example, should be measured relatively with respect to the sizes of the areas on both its ends. The problem that different values are reported for the distances from Figs. 1d, 1e by some metrics can also be viewed as a problem whose roots lie in the absence of measuring the sizes. The metric should recognise the circumstances and adjust the way in which the distance is measured accordingly. The diffusion distance measures the sizes in some way, but it has other disadvantages. In this paper, we propose a new method for measuring the distances on the image manifold. The method is based on the diffusion equation and addresses the drawbacks of the diffusion distance. In contrast with the diffusion distance, we measure the amount of substance that is transferred in the diffusion process. The analogy between images and electrical circuits is used for explaining the method (other analogies, e.g., heat, could be used too), which means that the amount of electrical charge that is transported between two nodes/areas in an equivalent resistor-capacitor network during a certain time interval is measured. Because of using this analogy, the method resembles the resistance distance discussed before. The difference is that the pure resistive network is modified by adding the “grounding” capacitors to the nodes. Informally speaking, the capacitors are used as a tool for measuring the sizes. Since the capacitors correspond to the pixels and since a certain electrical charge is needed for charging each capacitor/pixel, the electrical charge that is required to charge an area consisting of many pixels depends on the number of pixels, i.e., on the size of the area. If some parts of an image are to be regarded as a connected area, it is required that a certain electrical charge, proportional to the size of the parts, can be transported between the parts through the resistors of the network in a certain limited time. A big difference is introduced by adding the capacitors since, instead of a steady state of the resistor network, a transient state is solved. The use of the resistor-capacitor network is not new in digital image processing. The well-known Perona-Malik image filtering method [7] is based on solving the diffusion partial differential equation. One of the possible technical interpretations of such an equation is that it solves a resistor-capacitor circuit/network. Since, in order to determine the quantity mentioned in the previous paragraph, we solve the diffusion equation, it is clear that the quantity we introduce has something in common, in this sense, with the diffusion distance. It, however, is not the same since it is defined in a substantially different way and has different properties. The quantity we use may be regarded as a directed proximity/distance (see the next section). At this expense, it can be computed and used effectively in the given context. The paper is organised as follows. In the next section, the resistor-capacitor network is introduced as well as the method by which the distance/proximity can be introduced in that network. In Section 3, it is shown how the proximity can be used for image segmentation. The experimental results are presented in Section 4.
2
Proximity in a Resistor-Capacitor Network
In this section, we introduce the resistor-capacitor (RC) network and the term proximity. We understand the proximity to be an opposite to the distance (if the proximity is big, the distance is small and vice versa). The problem can be seen as either continuous or discrete. We start from the continuous version since some formulations and expressions can be written and understood more easily in that case. The discrete version will be described later on. Strictly speaking, only the discrete version deserves the name network since, instead of a network, we have a circuit with a continuous resistor and capacitor in the continuous case. In the continuous case, we have a thin two-dimensional sheet of a resistive material. The resistive sheet lies on a dielectric layer and this layer lies on a conductive base, which creates a continuous capacitor (Fig. 2). The current density, denoted by J(x, y), in the resistive sheet is connected with the intensity, denoted by E(x, y), of the electric field by the Ohm law J(x, y) = σ(x, y)E(x, y),
(1)
where σ is a conductivity tensor. The capacity is C(x, y) per area unit. Generally, both σ as well as C may vary at different places (x, y). If the electric potential, denoted by ϕ(x, y), is introduced, the field intensity is E(x, y) = −∇ϕ(x, y).
Fig. 2. A continuous resistor-capacitor circuit: A thin resistive sheet rests on a dielectric layer that lies on a grounded ideally conductive base (for the meaning of A and B, see the further text)
The equation needed for solving the potential in such a circuit is well known. For convenience of the reader, we briefly recall how it can be deduced. Consider an area dS on the resistive sheet; let ∂S stand for its boundary. The capacitor created inside dS is charged by the current flowing into dS along ∂S. Expressing this equilibrium mathematically, applying the Gauss theorem, and setting δϕ/δt ≡ ϕt, we successively obtain the following equations (n, δϕ, δt stand for the normal to ∂S, for the change of ϕ, and for a short time interval, respectively)

∫_S δϕ C dS = ∫_{∂S} J·n dl δt,   (2)

∫_S ϕt C dS = ∫_{∂S} J·n dl = ∫_S div J dS = −∫_S div(σ∇ϕ) dS,   (3)
Cϕt = −div(σ∇ϕ),   (4)
which is the desired equation (it has the form of the diffusion equation). By making use of the equation, the potential field ϕ can be solved in time, providing that certain initial conditions (e.g., the potential at t = 0) and boundary conditions (potential at a certain set of points) are known. The equation has been known for a long time (in image processing, it was probably first used in [7]) and the techniques for its solution are well established. In this paper, the equation and the potential are used for measuring the proximity between the areas in image. We introduce the proximity, denoted by p(A, B), of the area A to the area B (A and B do not overlap; Fig. 2) as follows. For all points inside and on the boundary of B, we set C(x, y) = 0, ϕ(x, y) = 1; both C and ϕ are constant over time in B. For all points outside B, we set C(x, y) = 1; ϕ(x, y) evolves in time from ϕ(x, y)|t=0 = 0. (It follows that A is electrically supplied from B.) The proximity p(A, B), i.e., the proximity of A to B, is then measured as a charge, denoted by QA, that is accumulated in A after a certain chosen time, which can be computed by integrating the potential

p(A, B) = QA = ∫_A Cϕ dS = ∫_A ϕ dS.   (5)
The value of proximity is not symmetric in general, i.e., it may happen that p(A, B) ≠ p(B, A), which, however, need not be regarded as an obstacle for its practical use (with regard to its non-symmetry, it can be compared to the directed Hausdorff distance; the standard diffusion distance is symmetric). In addition to measuring the proximity, the electric circuit described before can also be used for measuring the electrical size of an area and for finding its electrical center (generally, diffusion size or diffusion center). Let A be an area that is to be measured. (The reader can imagine A in such a way that inside A, the conductivity is big; ∂A runs around A through the points at which the conductivity is low.) For measuring the size of A, we use a very small area B that is placed into A, and we measure the proximity p(A\B, B) as was described before, i.e., the area A \ B is electrically supplied from B. Naturally, the charge accumulated in A \ B also depends on where B is placed inside A (providing that the size of B is small and constant). We call the position for which the charge accumulated in A \ B is the biggest one the electrical centre of A.
Fig. 3. Measuring the proximity of A to B: From the nodes of B, the capacitors are removed and the nodes are connected to a unit potential; all the capacitors are discharged at the beginning; the proximity is measured as a total charge accumulated in the nodes of A after a chosen time interval
−C(x, y)ϕt(x, y) = σ((x, y), (x − 1, y)) [ϕ(x, y) − ϕ(x − 1, y)]
                 + σ((x, y), (x, y − 1)) [ϕ(x, y) − ϕ(x, y − 1)]
                 + σ((x, y), (x + 1, y)) [ϕ(x, y) − ϕ(x + 1, y)]
                 + σ((x, y), (x, y + 1)) [ϕ(x, y) − ϕ(x, y + 1)].   (6)

In the discrete case, the areas consist of nodes/pixels. In correspondence to the continuous case, the proximity p(A, B) is determined as follows (Fig. 3). From the nodes creating B, the capacitors are disconnected; instead, the nodes are connected to a unit potential. The potential in the network is then solved by making use of Eq. (6), providing that C(x, y) = 1 and ϕ(x, y)|t=0 = 0 outside B. The value of p(A, B) can then be computed as a sum of the charges in the capacitors that are connected to the nodes creating A, which is the same as the sum of the potentials since C(x, y) = 1. In the RC network, the conductivities of the resistors between the neighbouring nodes correspond to the values of the local contrast in the image. We set the conductivity to the value of the sigmoid function 1/(1 + e^{λ(d−μ)}), where d stands for the contrast between the corresponding neighbouring pixels (it can be either the absolute value of the simple difference of brightness or a colour contrast computed in some way), and λ and μ are chosen real constants. The meaning of the constants can be easily understood. A contrast less than μ is regarded more as noise; a contrast greater than this value is regarded more as a sign of an object boundary existing at that place. The value of λ determines how sharp this decision is. The choice of λ and μ should ensure that the value of the conductivity approaches 0 for a big contrast and 1 for a small contrast. Alternatively, other formulas for determining the conductivity can also be used, e.g., e^{−λd²} [4]. A remark should be made regarding the problem of selecting the appropriate time interval for solving the RC network since it is a parameter that determines the behaviour of the method. Say that the proximity p(A, B) is defined as was
explained before. If the time interval is infinitely small, the charge accumulated in A will be zero. If, on the other hand, the time interval is infinitely long, each capacitor in A will be charged to the unit potential (i.e., the same potential that was applied to the nodes of B), regardless of the image on the basis of which the conductivities were computed. In the network, a transient process is solved. The time for which the solution is obtained determines how the network measures the distances and sizes. If the time is longer, the more distant (in the x, y plane) pixels are taken into account. From the point of view of practical computing, it is worth mentioning that disconnecting the capacitors and holding the potential at the unit level as was described before can also be simulated by setting C(x, y) = ∞ and ϕ(x, y)|t=0 = 1 at the corresponding nodes. It also further illustrates the difference between the diffusion distance and the distance we introduce. Both the distances can be computed by the spectral matrix decomposition. For the diffusion distance, solving the ordinary eigenvalue problem is sufficient since the values of all capacitors are usually C(x, y) = 1. In the case of the distance we introduce in this paper, solving the generalised eigenvalue problem would be required. Whereas the eigenvalue problem must be solved only once in the case of the diffusion distance, the generalised eigenvalue problem must be solved for every source area in the case of the distance we introduce. This makes the computation through the matrix spectral decomposition ineffective. Therefore, we directly integrate Eq. (6).
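A minimal sketch of such a direct integration of Eq. (6) by explicit Euler steps is given below (Python/NumPy; the step size, number of steps and the sigmoid parameters are illustrative assumptions, not values from the paper).

import numpy as np

def proximity_map(image, B_mask, t_steps=200, dt=0.2, lam=10.0, mu=0.1):
    """Sketch of directly integrating Eq. (6). image: 2D array with values in
    [0, 1]; B_mask: boolean mask of the source area B (held at unit potential,
    capacitors removed). Returns the potential field phi; the proximity
    p(A, B) is then the sum of phi over the pixels of A (since C = 1)."""
    phi = np.zeros_like(image, dtype=float)
    phi[B_mask] = 1.0
    # conductivities between horizontal and vertical neighbours (sigmoid of contrast)
    d_h = np.abs(np.diff(image, axis=1))
    d_v = np.abs(np.diff(image, axis=0))
    sig_h = 1.0 / (1.0 + np.exp(lam * (d_h - mu)))
    sig_v = 1.0 / (1.0 + np.exp(lam * (d_v - mu)))
    for _ in range(t_steps):
        flow = np.zeros_like(phi)
        jh = sig_h * (phi[:, :-1] - phi[:, 1:])   # current from left to right neighbour
        jv = sig_v * (phi[:-1, :] - phi[1:, :])   # current from top to bottom neighbour
        flow[:, :-1] -= jh
        flow[:, 1:] += jh
        flow[:-1, :] -= jv
        flow[1:, :] += jv
        phi += dt * flow                          # C(x, y) = 1 outside B
        phi[B_mask] = 1.0                         # B stays at unit potential
    return phi

# proximity of an area A (boolean mask) to B:
# p_AB = proximity_map(img, B_mask)[A_mask].sum()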
3
Using the Proximity for Image Segmentation
In this section, we show how the proximity introduced in Section 2 can be used for image segmentation. We start from the seeded segmentation and continue to the general (unseeded) case. In seeded segmentation, the nuclei (seeds) of the desired areas (objects) in images are given by the operator, which makes the problem easier. Let S_i^0 stand for the i-th given initial seed. In its essence, the algorithm we propose is iterative, i.e., the areas grow from the seeds in time; S_i^k stands for the i-th area entering into the k-th iteration; after the iteration, the area S_i^{k+1} is obtained; S_i^0 is a given seed of the area. At the beginning, the membership of pixels to objects is known for the pixels lying in the seeds. The goal is to decide the membership also for all remaining pixels in image. During the k-th iteration of the algorithm, the following is done. For each pixel X_j whose membership has not been decided yet, its proximity p(X_j, S_i^k) for all i is used. The pixel is decided to be a member of a certain area providing that its proximity measured to that area is the highest and, at the same time, higher than a chosen threshold. If no pixel is decided in a certain iteration (i.e., if S_i^{k+1} = S_i^k for all i), the algorithm stops. The following is worth mentioning: If the threshold mentioned before is chosen to be zero, the algorithm stops after one iteration. It simply only computes the values of p(X_j, S_i^0), i.e., only the proximity measured to the given initial seeds is taken into account. From the computational point of view, it should be noted that the
RC network is solved for every seed and for every area arising from this seed during the iterations. After solving the network for S_i^k, the proximities p(X_j, S_i^k) are obtained for all X_j lying outside S_i^k. The problem of general unseeded segmentation can be solved in such a way that the nuclei (seeds) of future areas in the image are found automatically instead of being determined by the operator. Once a method for determining them is available, the algorithm described in the previous paragraph can be used again for growing the nuclei into the corresponding areas. By making use of the ideas introduced in Section 2, the nuclei can be found as electrical centers. For each pixel X_i, the RC network is solved providing that it is electrically supplied from X_i (i.e., the capacitor at X_i is disconnected and X_i is connected to a unit potential). The electric charge transported from X_i to the rest of the network is computed. If the charge from X_i is greater than the charge computed for the neighbours of X_i and if, at the same time, the charge is greater than a chosen threshold, X_i is declared to be a nucleus of an area. It can be pointed out that the threshold ensures that the future area is big enough; the condition of local maximum ensures that the nucleus lies in the centre of the area (in the electrical centre). For big and complicatedly shaped areas, it may happen that more than one nucleus is found (Fig. 9). A post-processing step carried out after the growing described in the previous paragraph may be needed in such cases, during which all particular parts of the bigger areas are connected together along the “invisible boundaries” between the parts (a similar situation is also known from some other segmentation algorithms, for example, from mean shift). From the computational point of view, it should be mentioned that, for finding the seeds as was described, the RC network must be solved many times (one solution for each pixel charging the network). It follows that the method is computationally expensive. Some heuristics may at least partially improve the situation, e.g., the heuristic based on a priori excluding the pixels at which the nuclei cannot be expected.
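The seeded growing described above can be sketched as follows, reusing the hypothetical proximity_map() from the previous sketch; the threshold and iteration limit are illustrative assumptions.

import numpy as np

def seeded_segmentation(image, seed_masks, threshold=0.5, max_iter=10):
    """Iterative seeded growing. seed_masks is a list of boolean masks S_i^0;
    the label -1 marks pixels whose membership has not been decided."""
    labels = np.full(image.shape, -1, dtype=int)
    regions = [m.copy() for m in seed_masks]
    for i, m in enumerate(regions):
        labels[m] = i
    for _ in range(max_iter):
        # proximity of every undecided pixel to every current area S_i^k
        prox = np.stack([proximity_map(image, m) for m in regions])
        undecided = labels < 0
        best = prox.argmax(axis=0)                      # area with the highest proximity
        accept = undecided & (prox.max(axis=0) > threshold)
        if not accept.any():
            break                                       # no pixel decided: stop
        labels[accept] = best[accept]
        for i in range(len(regions)):
            regions[i] = labels == i                    # S_i^{k+1}
    return labels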
4
Experimental Results
In this section, we present some experimental results. The experiments are mostly oriented towards practically illustrating the claims and expectations stated before. Firstly, the problem of measuring the proximity in the presence of a weak boundary or “channel” between two objects (Fig. 1c) is illustrated. In Fig. 4, a source image is depicted as well as the images showing that the values of proximities given by the RC network are useful and lead to the wanted segmentation that was computed by the algorithm described in Sec. 3. In Fig. 4, it can also be seen that the proximities both in the wide as well as in the narrow areas are measured approximately in the same way, which is a problem that was discussed before by making use of Figs. 1d, 1e. The next series of experiments will be presented for the image from Fig. 5. In Fig. 6, the process of growing of S_i^0 into S_i^k is shown for one initial seed. The influence of time, which is a parameter that must be chosen, is illustrated in Fig. 7. In Fig. 8, a result of a seeded segmentation is presented. Fig. 9 illustrates
Fig. 4. Measuring the proximity in the presence of a “channel” (weak boundary) between two objects (left image): Contrary to the Euclidean and geodesic distances, the values of proximities given by the RC network are useful and lead to the wanted segmentation. The proximities to a seed (small bright square) in the left area and to a seed in the right area are shown (middle images). The proximities are depicted in gray levels (totally bright means the highest proximity). The final segmentation is also shown (right image); different areas have different brightness.
Fig. 5. The image referred to in the further experiments
Fig. 6. Growing an initial seed (small bright square in the left image) during iterations; the results after the first, third and the fifth iterations are depicted (the left, middle, and the right image, respectively); the grey level shows the proximity of each pixel to the seed or its successive area (the seed as well as the successive areas are depicted as totally bright)
the method of automatically detecting the seeds. For each pixel, the electrical charge that is transported into the network if the network is charged from that pixel is shown. The seeds are detected as the maxima in the charge map created in this way.
Fig. 7. On the influence of the length of the time interval: The results after the first, third, and the fifth iteration are shown similarly as in the previous figure; a longer time interval for solving the transient state in the RC network was used in this case
Fig. 8. Supervised segmentation: The seeds were determined manually in this case (left image). The result of segmentation is depicted in the right image; the particular segments are distinguished by different grey levels; for the pixels in the totally black areas, the membership to any segment has not been found.
Fig. 9. A map of input charges and the seeds detected automatically: The grey level at each pixel (except the small squares) depicts the size of charge that is transported into the network if it is charged from that pixel (i.e., the bigger is the area into which the pixel belongs, the bigger is the charge and the brighter is the pixel); the small squares show the seeds that were found automatically as maxima in the charge map
5
Conclusion
A method has been presented for measuring the distance/proximity along the manifold that is defined by the image, for the purpose of using this method in image segmentation. The information that is contained in the image is transformed into an equivalent resistor-capacitor network. Its transient state, which is described by the diffusion equation, is then solved. We measure the proximity as an amount of electrical charge that is transported, during a chosen time interval, between two
nodes of the network. Among the advantages of the method is that it can measure the distances not only between particular points, but also between a set of points and a point, and between two sets of points, which was exploited in the image-segmentation algorithm that has also been proposed. The numerical solution of the diffusion equations is well understood. From the point of view of computational complexity, the proposed segmentation method seems to be well acceptable for seeded segmentation since the number of times the resistor-capacitor network must be solved depends on the number of seeds, which is usually not too high. The method of automatically detecting the nuclei of areas that has also been proposed, on the other hand, seems to be extremely computationally expensive since, in essence, the network must be solved for each pixel separately. Moreover, a post-processing step is required. Further research is apparently needed in this area. Although the diffusion equation itself has been known for a long time and although it has been used many times in the context of image processing, e.g., for filtering, we are not aware of it being used for introducing the distance in the way done in this paper. The diffusion distance is defined in another way. Its properties make the diffusion distance not very useful for use in image segmentation algorithms. Acknowledgements. This work was partially supported by the grant SP 2011/163 of VŠB - Technical University of Ostrava, Faculty of Electrical Engineering and Computer Science.
References 1. Babic, D., Klein, D.J., Lukovits, I., Nikolic, S., Trinajstic, N.: Resistance-Distance Matrix: A Computational Algorithm and Its Applications. Int. J. Quant. Chem. 90, 166–176 (2002) 2. Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 1–18 (2002) 3. Eppstein, D.: Finding the k Shortest Paths. SIAM Journal on Computing 28(2), 652–673 (1999) 4. Grady, L.: Random Walks for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(11), 1768–1783 (2006) 5. Klein, D.J., Randi, M.: Resistance Distance. J. Math. Chem. 12, 81–95 (1993) 6. Ling, H.: Diffusion distance for histogram comparison. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 246–253 (2006) 7. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(7), 629–639 (1990) 8. Sheikh, Y.A., Khan, E.A., Kanade, T.: Mode-seeking by Medoidshifts. In: International Conference on Computer Vision, pp. 1–8 (2007)
Hierarchical Blurring Mean-Shift Milan Šurkala, Karel Mozdřeň, Radovan Fusek, and Eduard Sojka Technical University of Ostrava, Faculty of Electrical Engineering and Informatics, 17. listopadu 15, 708 33 Ostrava-Poruba, Czech Republic {milan.surkala.st,karel.mozdren,radovan.fusek.st,eduard.sojka}@vsb.cz
Abstract. In recent years, various Mean-Shift methods have been used for the filtering and segmentation of images and other datasets. These methods achieve good segmentation results, but the computational speed is sometimes very low, especially for big images and some specific settings. In this paper, we propose an improved segmentation method that we call Hierarchical Blurring Mean-Shift. The method achieves a significant reduction of computation time with minimal influence on segmentation quality. A comparison of our method with traditional Blurring Mean-Shift and Hierarchical Mean-Shift with respect to segmentation quality and computational time is presented. Furthermore, we study the influence of parameter settings in various hierarchy depths on computational time and number of segments. Finally, results promising reliable and fast image segmentation are presented. Keywords: mean-shift, segmentation, filtration, hierarchy, blurring.
1 Introduction
Mean-Shift (MS) is a clustering algorithm that appeared in 1975 [6] and is still used for data filtering and segmentation today. Mean-Shift seeks the position with the highest density of data points. These groups of points are called clusters (segments). Mean-shift is not only used as a statistical tool, but also as a method for image segmentation, object tracking [5], etc. Recently, a lot of new Mean-Shift variations have been developed. These methods use various approaches to improve the original idea. In 1995 MS was revised by Cheng [2] and in 1999 and 2002 by Comaniciu [4,3]. One of the most interesting variations of MS is Blurring Mean-Shift (BMS), which appeared in 2006 [1]. It has been proved that BMS has a lower number of iterations per data point and, therefore, a higher speed. As for stopping conditions, it is more difficult to apply the Gaussian kernel because it is not truncated. However, using truncated kernels such as the Epanechnikov kernel makes it possible to ignore data that are too far away from the kernel centre. Even with such advantages, BMS is still slow and not appropriate for general use. In 2009, P. Vatturi and W.-K. Wong presented Hierarchical Mean-Shift [8], based on the idea of an effective application of multiple Mean-Shift filtrations with variable kernel sizes. The computational time of the Mean-Shift method is strongly
dependent on the number of input data points and on the size of the Mean-Shift window (kernel size). The first step is to use MS with a small kernel window to get a lower number of small segments in a short time. These segments are considered as data points with the weight that is equal to the number of data points they represent. In the next hierarchical step (stage) the number of data points is reduced due to preprocessing (previous stage). This allows us to use a greater kernel without the unwanted increase of computational time. The number of hierarchical steps is not limited. In each step, a new level of hierarchy with greater segments is created. The paper is organised as follows. In Section 2, we introduce the reader to the original Mean-Shift method. Section 3 is devoted to the Blurred Mean-Shift version. A new approach (algorithm) is presented in Section 4. The experiments are presented in Section 5.
2 Mean-Shift
Consider $X = \{x_n\}_{n=1}^{N} \subset \mathbb{R}^d$ as a dataset of N points in the d-dimensional space. Although the characteristics of these points are not limited, we focus on images and, therefore, these points are the set of pixels in the processed image. We define the kernel density estimator with a kernel K(x) as
\[ p(x) = \frac{1}{N}\sum_{n=1}^{N} K\!\left(\left\|\frac{x - x_n}{\sigma}\right\|^{2}\right), \tag{1} \]
where x is a processed data point (pixel), x_n is a data point in the neighbourhood of the point x, and σ is a bandwidth limiting the size of the neighbourhood (radius). We can distinguish between two types of radii in images. The first one, denoted by σs, is a radius in the spatial domain (for example, a circle with a radius of 20 pixels). The spatial radius is usually identical for the x and y axes, so we need only one σs. The second radius, denoted by σr, is a radius in the range domain (colour or intensity). If grey-scale images are processed, only one σr is used, whereas, in colour images, three different quantities σr may be required, one for each colour channel. In most cases, the same radius is chosen for all three channels, so σr is simplified to one value. We can choose from many types of kernels K(x). The simplest is the uniform flat kernel, which takes the value of 1 (full contribution of pixels) over the whole area of the searching window (area of the kernel). This kernel is defined as follows
\[ K(x) = \begin{cases} 1, & \text{if } \|x\| \le 1, \\ 0, & \text{if } \|x\| > 1. \end{cases} \tag{2} \]
The second most common kernel is the Epanechnikov kernel. It minimizes the asymptotic mean integrated square error and is therefore an optimal kernel. The Epanechnikov kernel is defined by the following equation
\[ K(x) = \begin{cases} 1 - \|x\|^2, & \text{if } \|x\| \le 1, \\ 0, & \text{if } \|x\| > 1. \end{cases} \tag{3} \]
The Gaussian kernel is very popular; nevertheless, it has some disadvantages. σs does not limit the size of the searching window but only modifies the shape of the Gaussian curve. The computation is carried out for all data points (pixels in the image). Therefore, the use of the Gaussian kernel is very computationally demanding, and the only way to achieve a better speed is to truncate the kernel. Fortunately, because of the small contribution of distant pixels, the truncation has only a small influence on the quality of segmentation. The Gaussian kernel is represented by the equation
\[ K(x) = e^{-\frac{1}{2}\|x\|^{2}}. \tag{4} \]
The density gradient estimator, denoted by k(x), is defined as follows
\[ k(x) = -K'(x). \tag{5} \]
Therefore, we can rewrite the Epanechnikov kernel for the mean-shift computation as
\[ K(x) = \begin{cases} 1 - x, & \text{if } 0 < x < 1, \\ 0, & \text{otherwise}. \end{cases} \tag{6} \]
The mean-shift vector represents the change of position for each data point (pixel in the image). The data used for the computation remain unchanged in all iterations of mean-shift. The goal is to find an attractor. An attractor is a point to which all "similar" data points from a searching window converge. At such a point, the density of the input data points takes its local maximum. All pixels that converge to one attractor form a cluster (segment). The mean-shift vector is represented by the equation
\[ m_{\sigma,k}(x) = \frac{\sum_{i=1}^{N} x_i\, k\!\left(\left\|\frac{x - x_i}{\sigma}\right\|^{2}\right)}{\sum_{i=1}^{N} k\!\left(\left\|\frac{x - x_i}{\sigma}\right\|^{2}\right)} - x. \tag{7} \]
The first term on the right-hand side is the new position of x computed from all relevant data in the searching window; the second term is the former position of x.
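To make the update of Eq. (7) concrete, the following sketch (Python/NumPy; not taken from the paper) computes one mean-shift step for a single pixel treated as a joint spatial-range vector. The Epanechnikov-style truncated profile, the joint handling of σs and σr, and all function names are assumptions made for illustration.

```python
import numpy as np

def epanechnikov_profile(d2):
    """Truncated profile k(x) = 1 - x for 0 <= x < 1, and 0 otherwise (cf. Eq. (6))."""
    return np.where(d2 < 1.0, 1.0 - d2, 0.0)

def mean_shift_step(x, data, sigma_s, sigma_r):
    """One mean-shift update of the point x over `data` (rows: [row, col, intensity]).

    Spatial and range coordinates are divided by their bandwidths so that a single
    normalised squared distance can be fed to the kernel profile, as in Eq. (7).
    """
    scale = np.array([sigma_s, sigma_s, sigma_r], dtype=float)
    d2 = np.sum(((data - x) / scale) ** 2, axis=1)    # ||(x - x_i)/sigma||^2 per point
    w = epanechnikov_profile(d2)                      # kernel weights k(.)
    if w.sum() == 0.0:
        return np.asarray(x, dtype=float)             # empty window: the point stays put
    return (w[:, None] * data).sum(axis=0) / w.sum()  # new position; m(x) = result - x
```

Iterating this step until the displacement falls below a threshold moves each pixel to its attractor; pixels sharing an attractor form one cluster (segment).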
3 Blurring Mean-Shift
Blurring Mean-Shift (BMS) is another method, with a slightly changed equation leading to a significantly different method of computation. MS does not change the original dataset (it moves the data points, but the computation is done with the original data). As a result, it evolves slowly towards the local density maxima. BMS changes the dataset in each iteration with the computed values. As source data for each iteration, the data modified in the previous iteration are used. It is proven that BMS has faster convergence in comparison to MS. The main advantage of BMS is that it reduces the number of iterations, even though the number of input points remains the same. The equation of BMS can be written as
\[ m_{\sigma,k}(x) = \frac{\sum_{i=1}^{N} x_i\, k\!\left(\left\|\frac{x_m - x_i}{\sigma}\right\|^{2}\right)}{\sum_{i=1}^{N} k\!\left(\left\|\frac{x_m - x_i}{\sigma}\right\|^{2}\right)} - x. \tag{8} \]
BMS has several more advantages. After each iteration, for example, there is visible progress in filtration. If the pixels that should form one cluster are close enough, without any pixel that should not be in the cluster, all these pixels will be grouped in the next iteration. The original MS does not have this feature and iterates slowly until all the data points reach the common convergence point. This often leads to long computational times. BMS has a tendency to form clusters whose size is similar to the size of the searching window; therefore, the number of clusters can be approximately predicted. Nevertheless, the shape of these clusters follows the gradients in the image very well. Many different variants of mean-shift have been proposed, such as Evolving Mean-Shift [9], Medoid-Shift [7] and variations of MS and BMS. In the next section we describe our MS variation.
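The defining feature of BMS, replacing the whole dataset with the shifted points after every iteration, can be sketched as follows (Python/NumPy, reusing the `mean_shift_step` helper sketched above; the stopping rule and the iteration cap are assumptions, not the authors' implementation):

```python
import numpy as np

def blurring_mean_shift(data, sigma_s, sigma_r, max_iter=50, tol=1e-3):
    """Blurring Mean-Shift: each iteration recomputes every point against the
    dataset produced by the previous iteration, so the data get progressively
    'blurred' towards the cluster centres."""
    points = np.asarray(data, dtype=float).copy()
    for _ in range(max_iter):
        shifted = np.array([mean_shift_step(p, points, sigma_s, sigma_r)
                            for p in points])
        movement = np.max(np.linalg.norm(shifted - points, axis=1))
        points = shifted                      # the dataset itself is replaced here
        if movement < tol:                    # stop when the points barely move
            break
    return points
```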
4 Hierarchical Blurring Mean-Shift
We propose a new method called Hierarchical Blurring Mean-Shift (HBMS) that is based on a modification of Mean-Shift called Hierarchical Mean-Shift (HMS). HMS uses the original MS method, which is applied hierarchically with various kernel sizes. First of all, it uses a very small searching window (σs), which leads to the creation of a large number of little segments representing just a few data points. Because of the small searching window size, mean-shift is processed relatively quickly. Consider this as an input to the next stage of mean-shift running with a bigger searching window. Consequently, the segments become new data points. From the first stage, we obtain a filtered dataset with a lower number of data points than the original dataset contained. In the new dataset, each new data point has a weight that is equal to the number of original data points it represents. Repeatedly running mean-shift on the dataset with a larger searching window results in significantly reduced computational time. The main idea is to reduce the number of segments and make them bigger in each stage. MS and BMS are slow because they use a large searching window from the start. Hierarchical mean-shift processes data with a smaller searching window in the first stage, which results in faster computation. Although HMS is faster than MS and BMS, it is still too slow for large input images. The main problem is that HMS uses the original MS as its basic algorithm. MS is not well suited for hierarchical processing since it is slower and produces a large number of segments. Therefore, each subsequent stage of HMS still has a large dataset. Our objective is to use a more advanced mean-shift technique, which is more suitable for processing in a hierarchical way. Our method uses BMS for each stage. This is much more effective and significantly faster, due to a lower number of iterations per pixel. Another advantage is that BMS forms more uniform clusters. They are also bigger and less numerous. Knowing that, we can say that BMS decreases the number of iterations per data point and, moreover, it greatly reduces the number of segments for each subsequent stage of HBMS, so we can expect a high reduction of computational time. We will show this in Section 5. The following algorithm shows an example of implementation in pseudo code.
Fig. 1. Example of filtration and segmentation evolving after each stage of HBMS: (a) first-stage filtration, (b) second-stage filtration, (c) third-stage filtration; (d) first-stage segmentation, (e) second-stage segmentation, (f) third-stage segmentation
Algorithm 1. Hierarchical Blurring Mean Shift
1: σ ← {σ1, ..., σj}, where j = number of stages
2: for i ∈ {1, ..., j} do
3:   repeat
4:     for m_i ∈ {1, ..., N_i} do
5:       ∀n: p(n|x_{m_i}) ← k(‖(x_{m_i} − x_n)/σ_i‖²) / Σ_{n'=1}^{N_i} k(‖(x_{m_i} − x_{n'})/σ_i‖²)
6:       y_{m_i} ← Σ_{n=1}^{N_i} p(n|x_{m_i}) x_n
7:     end for
8:     ∀m_i: x_{m_i} ← y_{m_i}
9:   until stop
10: end for
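Algorithm 1 can be read as the driver loop sketched below (Python/NumPy; an illustration, not the authors' code): each stage runs a blurring pass with its own bandwidth, merges points that end up at (nearly) the same attractor into a single weighted point, and feeds the reduced, weighted dataset to the next stage. The merging tolerance, the weighted update and the fixed iteration count are assumptions of the sketch.

```python
import numpy as np

def weighted_bms_stage(points, weights, sigma, n_iter=20):
    """One blurring stage with a truncated (Epanechnikov-like) profile in which
    every point carries the number of original pixels it represents as a weight."""
    pts = points.copy()
    for _ in range(n_iter):
        new_pts = np.empty_like(pts)
        for m, x in enumerate(pts):
            d2 = np.sum(((pts - x) / sigma) ** 2, axis=1)
            w = weights * np.where(d2 < 1.0, 1.0 - d2, 0.0)
            new_pts[m] = (w[:, None] * pts).sum(axis=0) / w.sum()
        pts = new_pts
    return pts

def merge_attractors(points, weights, tol):
    """Collapse points closer than `tol` into one point carrying the summed weight."""
    merged_pts, merged_w = [], []
    for p, w in zip(points, weights):
        for i, q in enumerate(merged_pts):
            if np.linalg.norm(p - q) < tol:
                merged_pts[i] = (q * merged_w[i] + p * w) / (merged_w[i] + w)
                merged_w[i] += w
                break
        else:
            merged_pts.append(p.copy())
            merged_w.append(float(w))
    return np.array(merged_pts), np.array(merged_w)

def hierarchical_bms(pixels, sigmas, merge_tol=1.0):
    """HBMS driver: `pixels` is (N, d); `sigmas` holds one bandwidth (scalar or
    per-dimension vector) per stage, growing from stage to stage."""
    points = np.asarray(pixels, dtype=float)
    weights = np.ones(len(points))
    for sigma in sigmas:                       # one blurring pass per stage
        points = weighted_bms_stage(points, weights, np.asarray(sigma, dtype=float))
        points, weights = merge_attractors(points, weights, merge_tol)
    return points, weights                     # surviving attractors and segment sizes
```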
The problem is to find an appropriate resizing of the searching window for each stage. If we enlarge the window too much while the number of segments from the previous stage is still high, the processing in the next stages will take too long. The greater the reduction of data points is, the greater the enlargement of the searching window can be. The size of the initial window is discussed in Section 5.
5 Experiments
This section presents the experiments with the HBMS algorithm. The first example is focused on the comparison between the computational speed of HBMS and HMS. We examine the idea that the HBMS approach is much faster than the hierarchical approach to MS (HMS) because BMS is faster than MS. We also study the effect of the acceleration on image filtration quality, and we compare how an increasing amount of noise influences the quality of image filtering. As a source image, the famous Lena gray-scale photo at 256 × 256 pixels was used; for speed tests, the original 512 × 512 version and resized versions were used too. To demonstrate the segmentation quality of our method we also used additional images. The algorithms were tested on a computer with an Intel Core i5-M520 dual-core processor with 2.4 GHz core frequency, 4 GB of DDR3-1333 memory and a 64-bit operating system. All tests were run in a single-core configuration. The first test examined the speed of HMS and our HBMS implementation using two- and three-stage hierarchies. In the last stage, the value of σs was always set to 40. The two-stage configuration started with σs1 = 5, the three-stage configuration used σs1 = 3 and σs2 = 9. In all stages, σr was set to 24. Table 1 has two columns for each tested resolution. The first column represents the values of the mean square error (MSE) between the final and the original image. MSE shows the intensity difference between the original and the filtered image. The smaller the error is, the better is the correspondence between the original and the filtered image. A good segmentation has small granularity (number of segments) and concurrently a small intra-segment intensity variance (MSE error). The second column represents the time taken to complete the computation. MSE is computed as
\[ MSE = \frac{1}{MN}\sum_{i=0}^{M}\sum_{j=0}^{N}\bigl(G_1(i,j) - G_2(i,j)\bigr)^2, \tag{9} \]
where G1 and G2 are the images to be compared, the indexes i and j stand for pixel coordinates, and M and N represent the width and height of the image. The results were surprising: we achieved a smaller MSE error, especially at higher resolutions, if BMS was used as the base for hierarchical segmentation.
Table 1. Comparison of speed and MSE depending on resolution. For each resolution (64 × 64, 128 × 128, 256 × 256, 512 × 512, 1024 × 1024, 1536 × 1536), MSE and t[s] are reported for MS, HMS2, HMS3, BMS, HBMS2 and HBMS3.
145 200 317 209 232 284
15 0,7 0,45 4,5 0,4 0,3
149 206 263 212 197 216
100 4,6 1,7 35,5 1,4 0,9
307 1516 252 12207 228 50 219 425 172 196 10,5 162 60 127 161 138 152 1085 203 6,2 162 33 111 171 3,6 151 20 111
4310 313 156 76
91 95
358 176
Fig. 2. Computational time with respect to image resolution
Fig. 3. Segmentation output from each method we compared: (a) MS, t = 1588 s; (b) HMS2, t = 50 s; (c) HMS3, t = 10.5 s; (d) BMS, t = 161 s; (e) HBMS2, t = 6.2 s; (f) HBMS3, t = 3.6 s
The authors of BMS [1] proved that this method has cubic convergence while the original MS has only linear convergence, which means that the use of BMS should lead to faster processing. Table 1 shows that the computation time increases approximately linearly with the size of the image. This means that an image with a 4-times bigger area is computed in an approximately 4-times longer time. From the table, it follows that the computation time increases approximately 8-times (or even more) if HMS is used. Therefore, the difference in performance is relatively small in the case of small images but it is significantly higher for larger images. For example, a 1-megapixel image is processed in 3 minutes with HBMS, whereas HMS needs more than one hour (2-stage configuration). We use our own implementation of HMS and HBMS. It might seem that the hierarchical approach helps the original HMS more than HBMS; however, the difference is not that significant. HMS benefits from a higher number of stages while HBMS is quite fast even with a lower number of stages. The speed-up is big enough even with two stages in HBMS, and it is not necessary to use three or more stages. The more stages used, the better filtration ability and computational speed we get. The drawback is that the granularity increases and the segmentation does not seem to be as good. Using more stages creates larger segments with a higher intensity difference between distant attractors. This increases the probability that a segment will not be included in the computation because its attractor does not fit in the searching window even though a large part of the segment is inside this window. This creates small segments on the borders of large segments, which would not happen if a larger searching window in the spatial domain was used. We can say that all of these methods achieved a good segmentation quality, but the main benefit of HBMS is its significantly higher speed. The next test examines the influence of noise on segmentation quality. Two-stage HMS and HBMS are used on the 256 × 256 image. The noise modifies each pixel in the intensity channel. The noise intensities range from 5 to 70 in pixel intensity value. We study the filtration quality (mean square error in comparison with the original noise-free image) and the number of segments. In Table 2, we can see that both HMS and HBMS have an increasing mean square error with an increasing amount of noise applied to the image. It means that images with a high level of noise cannot be filtered with a low spatial radius. In HBMS, the number of segments does not depend much on the noise level, although the segments are scattered.

Table 2. Influence of noise on filtration error and number of segments

Noise  HMS MSE  HMS segments  HBMS MSE  HBMS segments
0      237      60            187       57
5      213      73            187       64
10     221      71            197       51
15     301      54            201       50
20     255      64            213       66
25     243      60            220       51
30     312      69            244       52
35     398      90            287       48
40     458      77            349       67
45     555      76            497       60
50     617      112           613       69
55     745      113           746       61
60     993      166           942       86
65     1163     183           1135      67
70     1299     152           1306      71
Fig. 4. Influence of noise on number of segments
Fig. 5. Segmentation output after each stage (the original image, after the first stage, after the second stage) for the brain image (σs1 = 7, σs2 = 70) and the images from the Berkeley image database (σs1 = 4, σs2 = 50)
The noise is still easily visible in the processed image, but it is obvious that the pixels darkened by noise are grouped together. The same applies to the lightened pixels. In HMS, stronger noise does lead to an increased number of segments. To illustrate the capabilities of the HBMS method, we provide three additional images. The first image is a medical scan of a brain with a size of 550 × 650 pixels. The bandwidth σs1 was set to 7, and σs2 was set to 70 pixels. The range bandwidth σr remains unchanged with a value of 24 in pixel intensity. After that, we chose two images from the Berkeley image database with a size of 481 × 321 pixels. The mountains image is the simplest one. The tree image is a more difficult one because it has a very large number of small leaves, so it could be difficult to merge them into one segment. Both images from the Berkeley database were segmented with the values σs1 = 4, σs2 = 50. The original brain image consists of 357,500 pixels. The dataset was reduced to 3,594 segments after the first stage and 72 segments after the second one. The computation required 53.8 seconds for the first stage and 26.1 seconds for the second one. In the mountains image, HBMS reduced the number of pixels from 154,401 to 2,749 in the first stage and 29 segments in the final stage. The computation lasted 9.1 seconds (first stage) and 6.5 seconds (second stage); in total, it took 16.3 seconds. For comparison, the original BMS with the same 50-pixel searching window lasted 894 seconds and the final segmentation had 62 segments. For the tree image, the number of pixels was reduced from 154,401 to 3,710 segments after the first stage and 65 segments after the second stage. For this case, the execution time was 17.6 seconds.
6 Conclusion
In this paper, we have proposed a new segmentation method based on hierarchical Mean-Shift. Blurring Mean-Shift was used instead of the original MS, which has many disadvantages. The proposed method decreases the computational time, due to the use of BMS on which HBMS is based, and decreases the number of segments; this is why each succeeding stage is faster. The computational time can reach tens of minutes or hours for large images using the original HMS, while our HBMS method needs only a few minutes and the quality remains comparable to the original HMS, BMS and MS methods.
References 1. Carreira-Perpiñán, M.: Fast nonparametric clustering with Gaussian blurring mean-shift. In: Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pp. 153–160. ACM, New York (2006) 2. Cheng, Y.: Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 17, 790–799 (1995) 3. Comaniciu, D., Meer, P.: Mean shift analysis and applications. In: The Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, p. 1197 (1999)
4. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603–619 (2002) 5. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25, 564–575 (2003) 6. Fukunaga, K., Hostetler, L.: The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21(1), 32–40 (1975) 7. Sheikh, Y.A., Khan, E.A., Kanade, T.: Mode-seeking by medoidshifts. In: IEEE International Conference on Computer Vision, pp. 1–8 (2007) 8. Vatturi, P., Wong, W.K.: Category detection using hierarchical mean shift. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2009, pp. 847–856. ACM, New York (2009) 9. Zhao, Q., Yang, Z., Tao, H., Liu, W.: Evolving mean shift with adaptive bandwidth: A fast and noise robust approach. In: ACCV (1) 2009, pp. 258–268 (2009)
Curve-Skeletons Based on the Fat Graph Approximation Denis Khromov Moscow State University [email protected]
Abstract. We present a new definition of the 3D curve-skeleton. This definition provides a mathematically strict way to compare and evaluate various approaches to the skeletonization of 3D shapes. The definition is based on the use of fat curves. A fat curve is a 3D object which allows approximating tubular fragments of the shape. A set of fat curves is used to approximate the entire shape; such a set can be considered as a generalization of the 2D medial axis. We also present an algorithm which builds curve-skeletons according to the given definition. The algorithm is robust and efficient. Keywords: curve-skeleton, medial axis, shape analysis.
1 Introduction
The medial axis, first introduced in [1], has been proved to be very useful for 2D shape analysis. The medial axis of a closed bounded set Ω ⊂ R2 is the set of points having more than one closest point on the boundary; or, equivalently, the medial axis is the set of centers of the maximal circles inscribed in Ω. The medial axis is a graph embedded in R2. This graph emphasizes geometrical and topological properties of the shape Ω. Such graphs are usually called skeletons. There are efficient algorithms for 2D medial axis computation. It would be natural to try to use the same approach for 3D shape analysis. The medial axis of a 3D shape Ω ⊂ R3 is the set of points having more than one closest point on the boundary. But such an object is not a graph, since it may contain 2D sheets [6]. Those sheets may be very complex, and there are some methods which try to simplify the inner structure of the medial axis in 3D [2]. Therefore, the 3D medial axis is as difficult to process as the initial shape Ω. So there is a problem: how to define a skeleton of a 3D shape as a graph embedded in R3 so that this graph has all of the advantages of the 2D medial axis? Such graphs are called curve-skeletons. To date, there are lots of publications on curve-skeletons. However, unlike the 2D case, where the strict mathematical definition of the medial axis was given decades ago, a definition of the 3D curve-skeleton still hasn't been presented. Usually, a curve-skeleton is defined as the result of applying some algorithm to the 3D shape. There is no way to compare these algorithms with each other because their working principles and results
may have a totally different nature. It's very difficult to evaluate the quality of the skeletons produced by those algorithms, since there is no formal criterion for such an evaluation. There's only visual evaluation, which is subjective and not mathematical at all. In [4,8] the authors presented a classification of curve-skeleton algorithms. The curve-skeleton is intuitively defined as a 1D thinning of the 3D object. The authors also made a list of possible properties of a curve-skeleton. Some of these properties are strict (for example, topological equivalence between the curve-skeleton and the original shape), others are intuitive and should be formalized (centeredness of the skeleton and the possibility of reconstruction of the original 3D object). Almost every published algorithm computes skeletons which have some of these properties. In such cases, these properties are considered to be advantages of the algorithm. One of the popular approaches is based on the thinning of voxel images. Such thinning may be done directly (deleting boundary voxels step-by-step, [7,11]) or with the distance function [10]. The skeletons produced by such methods are not continuous but discrete objects. Algorithms of this class are not universal because they're applicable to voxel images only. Finally, there is no mathematical criterion to evaluate and compare different techniques of thinning. It seems natural to try to extract a 1D curve-skeleton from the 2D medial axis. The medial axis itself is a very complicated object, which consists of quadratic surfaces, so it's usually replaced by some approximation. However, the extraction of a 1D piece from the medial axis is usually based on some successfully found heuristic. For example, in [5] such an extraction is done with the help of geodesics on the boundary surface. As in the previous case, there's no strict criterion for the evaluation and comparison of various heuristics for 1D curve-skeleton extraction. There are some other techniques used to compute curve-skeletons, such as the usage of optimal cut planes [9] or a physical interpretation of the problem [3]. However, these methods are also successfully found heuristics which allow computing some object visually corresponding to the human idea of the curve-skeleton. And again, a formal mathematical evaluation of these algorithms doesn't seem to be possible. In this paper, a strict definition of the curve-skeleton is given. The model being proposed
1. allows evaluating the correspondence between the curve-skeleton and the original object;
2. approximates the given shape with a fixed precision;
3. doesn't depend on the type of the shape description (polygonal model, voxel image or point cloud).
Also, an algorithm which computes the curve-skeleton according to the definition is presented.
2 Definitions
Let C be a set of smooth curves in R3. For every curve c ∈ C, there is a set Rc of continuous non-negative functions defined on c.
Definition 1. A fat curve is a pair (c, r), where c ∈ C, r ∈ Rc. The curve c is said to be an axis of the fat curve, and the function r is its radial function.
Definition 2. An image of the fat curve (c, r) is the set of points
\[ I(c, r) = \{x \in \mathbb{R}^3 \mid \exists y \in c : \rho(x, y) \le r(y)\}. \tag{1} \]
Definition 3. A boundary of the fat curve (c, r) is the set of points
\[ \partial I(c, r) = \{x \in I(c, r) \mid \forall y \in c : \rho(x, y) \ge r(y)\}. \tag{2} \]
The fat curve is an object which is very convenient to approximate tubular 3D shapes (see Fig. 1).
Fig. 1. Fat curve
Definition 4. Let C be a set of fat curves such that the axes of these fat curves intersect each other only at their endpoints. A fat graph F over a set C is a graph whose edges are the fat curves from C and whose vertices are the endpoints of their axes.
Definition 5. A boundary ∂F of a fat graph F is the union of the boundaries of the fat curves composing F.
Let FC be the set of all possible fat graphs. Consider a connected 3D manifold Ω embedded in R3 with boundary ∂Ω. We'll approximate Ω with some fat graph.
Definition 6. A distance between a point x ∈ R3 and the fat graph F is the distance between x and the closest point on the boundary of F:
\[ \rho(x, F) = \min_{y \in \partial F} \rho(x, y). \tag{3} \]
Definition 7. A distance between a manifold Ω and a fat graph F is the value
\[ \varepsilon(\Omega, F) = \int_{x \in \partial\Omega} \rho^2(x, F)\, dS. \tag{4} \]
Approximation quality can be evaluated by two values: the distance ε(Ω, F) and the complexity of the fat graph F. The complexity of a fat graph can be evaluated as
1. the sum of the lengths of the axes of the fat curves composing the fat graph;
2. the number of fat curves.
If the set C is rather wide, it's better to use the first criterion in order to avoid too crooked curves. However, if C is narrow and doesn't contain such curves, the second criterion can be used since it's very simple. In these terms, the problem of approximation with a fat graph can be defined as follows:
1. compute an approximation with the smallest possible ε(Ω, F) and a fixed fat graph complexity;
2. compute an approximation with the least possible complexity and
\[ \varepsilon(\Omega, F) < \varepsilon_0, \tag{5} \]
where ε0 is a fixed value.
The fat graph can also be defined for planar curves. The 2D medial axis would be an example of such a graph if we define the radial function at a point x to be equal to the distance from x to the boundary. The image of this special graph coincides with the whole shape, so its approximation error is zero.
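As an illustration of Definitions 1-7, the sketch below (Python/NumPy; not part of the paper) represents a fat curve by sampled axis points with per-sample radii, evaluates the point-to-fat-graph distance of Eq. (3) approximately, and discretises Eq. (4) as a weighted sum over sampled boundary points. The sample-based discretisation and the class layout are assumptions made for the sketch.

```python
import numpy as np

class FatCurve:
    """A fat curve (c, r): sampled axis points (n, 3) and one radius per sample."""
    def __init__(self, axis_points, radii):
        self.axis = np.asarray(axis_points, dtype=float)
        self.radii = np.asarray(radii, dtype=float)

    def distance(self, x):
        # distance from x to the boundary of this fat curve, approximated
        # sample-wise as | ||x - c(t)|| - r(t) | minimised over the axis samples
        d_axis = np.linalg.norm(self.axis - np.asarray(x, dtype=float), axis=1)
        return float(np.min(np.abs(d_axis - self.radii)))

class FatGraph:
    """A fat graph: fat curves whose axes meet only at their endpoints."""
    def __init__(self, fat_curves):
        self.curves = list(fat_curves)

    def distance(self, x):
        # Eq. (3): distance to the closest boundary point over all fat curves
        return min(c.distance(x) for c in self.curves)

def approximation_error(boundary_samples, fat_graph, areas=None):
    """Discrete version of Eq. (4): squared distances of surface samples, optionally
    weighted by the surface area element dS each sample represents."""
    d2 = np.array([fat_graph.distance(x) ** 2 for x in boundary_samples])
    return float(np.sum(d2 if areas is None else d2 * areas))
```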
3 Fat Graph Fitting
3.1 General Overview
In this chapter, we present the common scheme which allows computing and optimizing curve-skeletons according to the given definitions.
Definition 8. A topological class of fat graphs is an equivalence class of FC by the relation of isomorphism between graphs.
Let FC|G be the topological class of a graph G (i.e. FC|G is the set of all fat graphs isomorphic to G). Minimization of ε̃(V, F) on the set FC is done in several stages. First we select a topological class and then consider fat graphs only from this class. Then we find the first approximation of the fat graph and construct a sequence converging to some local minimum of the function ε̃.
3.2 Fat Graph in the Fixed Topological Class
Let E be the set of edges of the graph G, and V the set of its vertices. Since we consider only those fat graphs which are isomorphic to G, E also denotes the set of fat curves of the fat graph F.
Definition 9. A skeletal structure S on V in the topological class FC|G, defined by the graph G, is a pair of mappings (t, g):
\[ t : V \to [0; 1], \qquad g : V \to E. \tag{6} \]
Definition 10. A fat graph F and a skeletal structure S are called compatible if for each vertex v ∈ V
\[ \rho(v, F) = \rho(v, c(t(v))) + r(t(v)), \tag{7} \]
where c(t) is a Bezier curve representing the axis of g(v) and r(t) is its radial function.
The iterative algorithm, which finds the fat graph in the given topological class, consists of the following stages.
1. Selection of the initial approximation.
2. Fitting of a compatible pair of a skeletal structure and a fat graph with the zero radial function.
3. Radial function fitting, skeletal structure elaboration.
4. If the approximation error ε̃ changed less than some fixed value, then stop; otherwise, go to stage 2.
We use the segmentation of the vertex set described above to obtain the initial approximation of the skeletal structure. g(v) is defined due to the one-to-one correspondence between segments S(q) and edges E.
Fat Graph with the Zero Radial Function. In this chapter, all of the radial functions are equal to zero: r(t) ≡ 0. Let FC0 be the set of all fat graphs with such radial functions. Let cv be the axis of the fat curve g(v). We solve the following optimization problem:
\[ J_1(F) = \sum_{v \in V} \rho^2\bigl(v, c_v(t(v))\bigr) \to \min_{F \in \mathcal{F}_C^0|G}. \tag{8} \]
Bezier curves cv ∈ Bn are defined via their reference points. Each of these curves has (n + 1) reference points. Two of them are endpoints; these are vertices of the fat graph. Those curves which are incident to the same vertex of the fat graph share one reference point. There are |VG| different endpoints. Each curve also has (n − 1) reference points which are not endpoints. Therefore, the set of fat curves C is defined by k1 points from R3, where
\[ k_1 = |V_G| + |E_G|(n - 1). \tag{9} \]
Here and below |A| indicates the number of elements in the set A.
We minimize function J1 with respect to k1 3D vectors, or, which is the same, 3k1 real numbers. To do so, we equate all of the partial derivatives to zero. That would be a system of 3k1 linear equations with 3k1 variables. The solution of this system defines the desired curves. Having the fat graph F , we use it to elaborate the existing skeletal structure S in order to make it compatible with F . It means recalculation of the t and g values. The solution of an equation (which can be found with the help of Newton’s method)
\[ \langle \nabla c(t), v - c(t) \rangle = 0, \quad t \in [0; 1], \tag{10} \]
gives us the closest point to v on the curve c. That allows us to find the fat curve g(v) closest to v and the corresponding parameter value t(v). Again, we compute an optimal zero fat graph for this new skeletal structure, then elaborate the skeletal structure, and so on. This iterative process converges, and its result is a compatible pair of a fat graph and a skeletal structure. This fat graph is optimal for the skeletal structure S (by which we mean a local minimum of J1).
Non-Zero Radial Function. Let S be a skeletal structure and F ∈ F_C^0|G a fat graph compatible with S. We choose the radial functions of the fat curves of F in such a manner that the function
\[ J_2(F) = \sum_{v \in V} \rho(v, F) \tag{11} \]
reaches its minimum. Each radial function is defined by (n + 1) non-negative values. Therefore, J2 should be minimized with respect to k2 variables, where
\[ k_2 = |E_G|(n + 1). \tag{12} \]
Like in the previous case, we reduce this problem to a system of k2 linear equations and then elaborate the skeletal structure, considering that now the radial functions are not constant and depend on t(v).
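The projection step behind Eq. (10) can be sketched as follows (Python/NumPy; illustrative, not the paper's implementation): Newton iterations are applied to f(t) = ⟨c'(t), c(t) − v⟩, with t clamped to [0, 1]. The numeric derivative, the initial guess and the iteration count are assumptions of the sketch.

```python
import numpy as np
from math import comb

def bezier(ctrl, t):
    """Point on a Bezier curve with control points ctrl of shape (n+1, 3) at parameter t."""
    n = len(ctrl) - 1
    b = np.array([comb(n, i) * (t ** i) * ((1 - t) ** (n - i)) for i in range(n + 1)])
    return b @ ctrl

def bezier_derivative(ctrl, t):
    """Derivative c'(t): a Bezier curve of degree n-1 built on forward differences."""
    n = len(ctrl) - 1
    return bezier(n * (ctrl[1:] - ctrl[:-1]), t)

def closest_parameter(ctrl, v, t0=0.5, iters=20):
    """Newton's method on f(t) = <c'(t), c(t) - v>; its root gives the parameter of
    the point on the curve closest to v (cf. Eq. (10))."""
    ctrl = np.asarray(ctrl, dtype=float)
    v = np.asarray(v, dtype=float)
    t, h = float(t0), 1e-5
    for _ in range(iters):
        f = np.dot(bezier_derivative(ctrl, t), bezier(ctrl, t) - v)
        f_h = np.dot(bezier_derivative(ctrl, t + h), bezier(ctrl, t + h) - v)
        df = (f_h - f) / h                      # finite-difference derivative of f
        if abs(df) < 1e-12:
            break
        t = min(1.0, max(0.0, t - f / df))      # Newton step, clamped to [0, 1]
    return t
```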
4 Implementation
The main issue which wasn’t discussed in the previous chapter but seems to be very important in the practical implementations of the method is how to choose the topological class. It’s possible to use any existing algorithm which produces curve-skeletons. But the proposed scheme has the advantage that the first approximation of the skeleton doesn’t have to be very nice and accurate. It can be easily fitted into the shape afterwards. That means that we can use some algorithm which is inaccurate but very fast, hoping to improve it during the fat graph fitting. One way to do so is to use the 2D medial axis of some Ω’s planar projection. There’re some facts in favor of this decision.
– It seems that medial information plays a key role in human vision and visual perception [8]. But human vision is planar, so if some 3D graph appears to be a good curve-skeleton, its projection would also be considered as a graph which is very close to the 2D medial axis of the object.
– The 2D medial axis is a well-defined and well-examined object. There are fast and robust algorithms for 2D skeletonization.
– As mentioned above, the 2D medial axis can be considered as a fat graph which has zero approximation error. It's possible to try to bring this property into 3D as closely as possible.
– The 2D medial axis of a polygonal shape consists of smooth curves of degree 1 and 2, which can be fitted in the manner described above.
An example of the object and its planar projection is shown in Fig. 2.
Fig. 2. A model of a horse (left), its orthogonal planar projection with the 2D medial axis (center) and the approximating fat graph (right)
Each knot of the 2D skeleton is a projection of at least 2 points on the surface (see Fig. 3). A maximal inscribed ball tangent to the surface ∂Ω at those two points is a good mapping of the knot into the 3D space.
Fig. 3. 2D projection (left) of a 3D cylinder (right); center of the maximum inscribed circle is a projection of points A, B
The main problem of this method is the possibility of occlusions. For example, if one of the legs of the horse in Fig. 2 was occluded by another one, the usage of this particular projection would lead to an incorrect result. But since it’s possible to evaluate the approximation error of the fat graph, we can try a lot of projections and select the best one.
5 Experiments
The described algorithm was successfully implemented. As mentioned above, it’s impossible to compare the quality of the skeletons produced by other methods, since there has been no numerical criterion for evaluation of the difference between the curve-skeleton and the given shape. We’ll prove the capacity of our approach in the way that is common in the literature on the curve-skeletons, which is based on the visual evaluation and experimental proof of the claimed properties. First of all, we demonstrate the examples of curve-skeletons produced by the described algorithm (see Fig. 4). 3D models which have been chosen for the experiment are widely used to evaluate various computer geometry algorithms, so they’re appropriate for the visual comparison with other methods. It’s useful to discuss the properties listed in [4]. Homotopy equivalence between the curve-skeleton and the shape is not guaranteed, since the fat graph
Fig. 4. Examples
Fig. 5. Curve-skeletons of the horse with different approximation error: (a) ε = 0.0077, (b) ε = 0.0080, (c) ε = 0.0082, (d) ε = 0.0089
with a relatively large number of edges approximates wide non-tubular fragments of the shape with a loop consisting of a pair of edges. However, this property can easily be provided by stricter requirements on the topological class. Invariance under isometric transformations (i.e. transformations in which the distances between points are preserved) is obvious. The possibility of reconstruction of the original shape is provided by the definition of a fat graph: the image of the fat graph is a 3D manifold which approximates the original object with a known precision. The reliability (which means that every boundary point is visible from at least one curve-skeleton location) is not guaranteed, but it's achieved when the fat graph has enough edges. The robustness is implied by the robustness of the function ε(Ω, F). In order to justify the meaningfulness of the function ε(Ω, F), which is the core part of the described method, we've prepared a number of various curve-skeletons of the same object. These skeletons were made without any fitting and with badly tuned parameters of the algorithm. The curve-skeletons and their corresponding approximation error values are shown in Fig. 5. It's pretty obvious that the greater the value of ε(Ω, F), the worse the visual quality of the produced curve-skeleton. That implies that the proposed definition is not only theoretically grounded but also has some practical utility.
6 Conclusion
In the paper, a new mathematical model for curve-skeleton formalization was presented. This model allows comparing and studying various approaches to 3D skeletonization. Also, a new algorithm for skeletonization was described, implemented and discussed. Further research involves the following issues:
– elaboration of the model, in particular, approximation evaluation via the Hausdorff metric;
– further development of the algorithm, better selection of the first approximation and formalization of the iterative fitting.
References 1. Blum, H.: A Transformation for Extracting New Descriptors of Shape. In: WathenDunn, W. (ed.) Models for the Perception of Speech and Visual Form, pp. 362–380. MIT Press, Cambridge (1967) 2. Chang, M.C., Kimia, B.B.: Regularizing 3D medial axis using medial scaffold transforms. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. p. Accepted. IEEE Computer Society, Los Alamitos (2008) 3. Chuang, J.H., Tsai, C.H., Ko, M.C.: Skeletonization of three-dimensional object using generalized potential field. IEEE Trans. Pattern Anal. Mach. Intell. 22, 1241– 1251 (2000), http://dx.doi.org/10.1109/34.888709 4. Cornea, N.D., Silver, D.: Curve-skeleton properties, applications, and algorithms. IEEE Transactions on Visualization and Computer Graphics 13, 530–548 (2007) 5. Dey, T.K., Sun, J.: Defining and computing curve-skeletons with medial geodesic function. In: Proceedings of the Fourth Eurographics Symposium on Geometry Processing, pp. 143–152. Eurographics Association, Aire-la-Ville (2006), http://portal.acm.org/citation.cfm?id=1281957.1281975 6. Giblin, P., Kimia, B.B.: A formal classification of 3d medial axis points and their local geometry. IEEE Trans. Pattern Anal. Mach. Intell. 26, 238–251 (2004), http://dx.doi.org/10.1109/TPAMI.2004.1262192 7. Pal´ agyi, K., Kuba, A.: A parallel 12-subiteration 3d thinning algorithm to extract medial lines. In: Sommer, G., Daniilidis, K., Pauli, J. (eds.) CAIP 1997. LNCS, vol. 1296, pp. 400–407. Springer, Heidelberg (1997), http://portal.acm.org/citation.cfm?id=648241.752833 8. Siddiqi, K., Pizer, S.: Medial Representations: Mathematics, Algorithms and Applications, 1st edn. Springer Publishing Company, Incorporated, Heidelberg (2008) 9. Tagliasacchi, A., Zhang, H., Cohen-Or, D.: Curve skeleton extraction from incomplete point cloud. ACM Trans. Graph. 28, 1–71 (2009), http://doi.acm.org/10.1145/1531326.1531377 10. Telea, A., van Wijk, J.J.: An augmented fast marching method for computing skeletons and centerlines. In: Proceedings of the Symposium on Data Visualisation, VISSYM 2002, pp. 251–260. Eurographics Association, Aire-la-Ville (2002), http://portal.acm.org/citation.cfm?id=509740.509782 11. Wang, Y.S., Lee, T.Y.: Curve-skeleton extraction using iterative least squares optimization. IEEE Transactions on Visualization and Computer Graphics 14, 926–936 (2008), http://portal.acm.org/citation.cfm?id=1373109.1373261
DTW for Matching Radon Features: A Pattern Recognition and Retrieval Method Santosh K.C.1 , Bart Lamiroy2 , and Laurent Wendling3 1
INRIA Nancy Grand Est Research Centre, LORIA - Campus Scientifique, BP 239 - 54506 Vandoeuvre-lès-Nancy Cedex, France [email protected] 2 Nancy Université INPL, LORIA - Campus Scientifique, BP 239 - 54506 Vandoeuvre-lès-Nancy Cedex, France [email protected] 3 LIPADE, Université Paris Descartes 75270 Paris Cedex 06, France [email protected]
Abstract. In this paper, we present a method for the recognition and retrieval of patterns such as graphical symbols and shapes. It is based on dynamic programming for matching Radon features. The key characteristic of the method is to use the DTW algorithm to match corresponding pairs of histograms at every projecting angle. This exploits the property of the Radon transform to include both the boundary and the internal structure of shapes, while, thanks to DTW, avoiding compressing the pattern representation into a single vector and thus losing information. Experimental results show that the method is robust to distortion and degradation, including affine transformations. Keywords: Radon Transform, DTW, Pattern Recognition and Retrieval.
1 Introduction
Given patterns such as symbols of any kind, cursive characters, but also forensic elements like human faces, shoe- or finger-prints, recognition or classification can be done using structural [10], statistical [17] as well as hybrid approaches. In structural approaches, graph-based methods have been widely used [10]. They provide a powerful relational representation. However, they suffer from intense computational complexity due to the generally NP-hard problem of sub-graph matching resulting from the variation of the graph structure with the level of noise, occlusion, distortion etc. In some cases, however, matching optimisation can be obtained, but this depends strongly on how the image is described [18]. Besides, structural approaches do not have a rich set of mathematical tools [7], unlike statistical approaches [17]. We therefore emphasise, in the rest of this paper, statistical pattern analysis, representation and recognition. Shape representation has been an important issue in pattern analysis and recognition [23,31]. In this context, features are often categorised as region-based or contour-based descriptors. Generally, contour-based descriptors
include Fourier Descriptors (FD) [32,12]. Contour information can also come from polygonal primitives [1] or curvature information [3,5]. In the case of the latter, the shape is described in the scale space by the maximum number of curvatures. Other methods, like Shape Context (SC) [4] or skeleton approaches [33], are based on contour information. In short, contour-based descriptors are appropriate for silhouette shapes, since they cannot capture the interior content, nor disconnected shapes or shapes with holes where boundary information is not available. On the other hand, region-based descriptors account for all pixels within patterns. Common methods are based on moments [28,2,13], including geometric, Legendre, Zernike, and Pseudo-Zernike moments. Comparative studies [2,28] have demonstrated the interest of improving the invariance properties and reducing the computational time of the Zernike moments [8]. On the other hand, to overcome the drawbacks of contour-based Fourier descriptors, Zhang and Lu [30] have proposed a region-based Generic Fourier Descriptor (GFD). To avoid the problem of rotation in the Fourier spectra, the 2D Fourier Transform (FT) is applied on a polar-raster sampled shape. This approach outperforms common contour-based (classical Fourier and curvature approaches) and region-based (Zernike moments) shape descriptors. Region-based descriptors, on the whole, can be applied generally. However, their high computational complexity needs to be considered. Besides, the use of normalisation in order to satisfy common geometric invariance properties introduces errors, and such descriptors are sensitive to noise, eventually affecting the whole recognition process. Pattern representation must be sufficiently enriched with important information. Moreover, global pattern representation is the premier choice due to its simplicity, as it avoids the extra pre-processing and segmentation required by local pattern representation. To accomplish recognition, matching is another concern. In other words, the feature selection corresponds to the matching technique and eventually affects the overall performance of the method. For instance, compressing pattern information into a single vector, as in global signal-based descriptors, provides immediate matching while not offering complete shape information. In these respects, we take advantage of the Radon transform [11] to represent patterns, and DTW is used to match patterns of any size, which avoids compressing the pattern representation into a single vector, unlike the use of the R-transform [27], for instance. The work is inspired by previous works such as 2D shape categorisation [22], gait recognition [6], off-line signature verification [9,24] and orientation estimation as in [16]. We have examined the method on two different datasets: graphical symbols [15] and shapes [26]. The remainder of the paper is organised as follows. We start with detailing the proposed method in Section 2, which mainly includes pattern representation and matching. Section 3 provides a series of tests. In Section 4, the analysis of the results is examined and discussed thoroughly. The paper is concluded in Section 5 along with future perspectives.
2 Method
In this work, we use the Radon transform to represent patterns [11]. Radon-based descriptors do not only encode contour information ([14,22], for instance); they also encode the internal structure. Radon transforms are essentially a set of parametrized histograms. We apply Dynamic Time Warping to align the histograms at each projecting angle in order to absorb the varying histogram sizes resulting from image signal variations.
2.1 Pattern Representation
The Radon transform consists of a collection of projections of a pattern at different angles [11]. This is illustrated in Fig. 1. In other words, the Radon transform of a pattern f(x, y), for a given set of angles, can be thought of as the projection of all non-zero points. The resulting projection is the sum of the non-zero points of the pattern in each direction, thus forming a matrix. The matrix elements are related to the integral of f over a line L(ρ, θ) defined by ρ = x cos θ + y sin θ and can formally be expressed as
\[ R(\rho, \theta) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\, \delta(x\cos\theta + y\sin\theta - \rho)\, dx\, dy, \]
where δ(.) is the Dirac delta function, δ(x) = 1 if x = 0 and 0 otherwise. Also, θ ∈ [0, π[ and ρ ∈ ]−∞, ∞[. For the Radon transform, let Li be a line in normal form (ρi, θi). Following Fig. 1(c), for all θi, the Radon transform can now be described as the length of the intersections of all lines Li. Note that the range of ρ, i.e., −ρmin < ρ ≤ ρmax, is entirely based on the size of the pattern. Since the Radon transform itself does not satisfy invariance properties, we consider the following affine transformation properties to adapt it for recognition. In the case of translation, we use the image centroid (xc, yc) such that the translation vector is u = (xc, yc): R(ρ − xc cos θ − yc sin θ, θ). Therefore, translation of f results in a shift of its transform in ρ by a distance equal to the projection of the translation vector onto the line L (see Fig. 1(c)). For rotation, we estimate the orientation angle as implemented in [16]. The orientation can be estimated as
\[ \alpha = \arg\min_{\theta} \frac{d^2 \sigma_\theta^2}{d\theta^2}, \]
where \( \sigma_\theta^2 = \frac{1}{P}\sum_{\rho} (R(\rho, \theta) - \mu_\theta)^2 \) is the variance of the projection at θ, \( \mu_\theta = \frac{1}{P}\sum_{\rho} R(\rho, \theta) \), and P is the number of samples. If the angle of rotation is α, then Rα(ρ, θ) = R(ρ, θ + α). This simply implies a circular shift of the histograms, so it does not require duplication of the histograms from [0, π[ to [π, 2π[ as in [9] to achieve rotation invariance. For scaling, we simply normalise the histograms into [0, 1] at every projecting angle. Fig. 2 shows Radon features for reference, rotation, scaling, as well as degradation instances from a known class of graphical symbols [15]. In all cases, the Radon histograms from the corresponding sample images are similar to each other.
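A sketch of this feature-extraction step is given below (Python, using scikit-image's `radon`; the choice of this library, the angular sampling, the discrete second difference used for the orientation estimate and the omission of explicit centroid centring are assumptions of the sketch, not the authors' implementation):

```python
import numpy as np
from skimage.transform import radon

def radon_features(image, n_angles=180):
    """Radon histograms R(rho, theta), theta in [0, 180[, made rotation-tolerant by
    a circular shift to the estimated orientation and scale-tolerant by a per-angle
    normalisation of each histogram into [0, 1]."""
    thetas = np.arange(n_angles) * (180.0 / n_angles)
    sinogram = radon(image.astype(float), theta=thetas)    # one column per projecting angle

    # orientation: the angle minimising the second derivative of the projection variance
    variance = sinogram.var(axis=0)
    second_diff = np.gradient(np.gradient(variance))
    alpha = int(np.argmin(second_diff))
    sinogram = np.roll(sinogram, -alpha, axis=1)            # circular shift over theta

    # scaling: normalise every histogram (column) into [0, 1]
    col_max = sinogram.max(axis=0)
    col_max[col_max == 0] = 1.0
    return sinogram / col_max
```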
Fig. 1. Radon transform: (a) basic projections theory, (b) projection at angle θ, (c) definition
Fig. 2. Radon features for ideal (reference), rotation, scaling as well as noisy samples: (a) reference, (b) rotation, (c) scaling, (d) degradation (noise)
2.2 Pattern Matching
As explained in Section 2.1, the Radon transform matrix R(ρ, θ) can represent any pattern P. Given two patterns Pq and Pd, matching can be obtained between the corresponding histograms Rq(ρ, θ) and Rd(ρ, θ). Radon transforms generate different ρ sizes depending on the size of the image contents. In order to be able to adapt to these differences in size, we develop the following approach: Dynamic Time Warping [20] allows us to find the dissimilarity between two non-linear sequences potentially having different lengths. In the Radon matrix R(ρ, θ), each column is the Radon histogram for one projecting angle. Let us denote these histograms {Hθi}i=0,...,Θ−1 (see Fig. 3). In this illustration, vertical lines represent the Radon histograms at every projecting angle. Let us consider two column vector sequences from R(ρ, θ), representing histograms Hq and Hd of length K and L, respectively: Hq = {hq_k}k=1,...,K and Hd = {hd_l}l=1,...,L. At first, a matrix M of size K × L is constructed. Then, for each element in matrix M, a local distance metric δ(k, l) between the events ek and el is computed: δ(k, l) = (ek − el)², where ek = hq_k and el = hd_l. Let D(k, l) be the global distance up to (k, l),
\[ D(k, l) = \min\left[D(k-1, l-1),\ D(k-1, l),\ D(k, l-1)\right] + \delta(k, l), \]
with the initial condition D(1, 1) = δ(1, 1), which allows the warping path to go diagonally from the starting node (1, 1) to the end (K, L). The main aim is to find the
Fig. 3. Radon histogram at every projecting angle θi
path with the least associated cost. The warping path therefore provides the difference cost between the compared sequences. Formally, the warping path is W = {wt}t=1,...,T, where max(K, L) ≤ T < K + L − 1 and the t-th element of W is wt = (k, l)t ∈ [1 : K] × [1 : L] for t ∈ [1 : T]. The optimised warping path W satisfies the following three conditions:
c1. boundary condition: w1 = (1, 1) and wT = (K, L);
c2. monotonicity condition: k1 ≤ k2 ≤ · · · ≤ kK and l1 ≤ l2 ≤ · · · ≤ lL;
c3. continuity condition: wt+1 − wt ∈ {(1, 1), (0, 1), (1, 0)} for t ∈ [1 : T − 1].
c1 conveys that the path starts at (1, 1) and ends at (K, L), aligning all elements to each other. c2 restricts the allowable steps in the warping path to adjacent cells, never going back, and the monotonicity condition forces the path to advance one step at a time. Note that c3 implies c2. We then define the global distance between Hq and Hd as
\[ \Delta\left(H^q, H^d\right) = \frac{D(K, L)}{T}. \]
The last element of the K × L matrix, normalised by T, provides the DTW distance between the two sequences, where T is the number of discrete warping steps along the diagonal of the DTW matrix.
Matching Score. Aggregating the distances between the histograms at all corresponding projecting angles θi of Pq and Pd yields a global pattern-matching score,
\[ Dist(P^q, P^d) = \sum_{i=0}^{\Theta-1} \Delta\left(H^q_{\theta_i}, H^d_{\theta_i}\right). \]
Scores are normalised into [0, 1] by
\[ Dist(.) = \frac{Dist(.) - Dist_{\min}(.)}{Dist_{\max}(.) - Dist_{\min}(.)}. \]
As shown in Fig. 2, it is important to notice that there may be significant amplitude differences between the Radon histograms from one sample to another. These amplitude differences are very well handled by the DTW algorithm. In addition, Fig. 4 gives an overview of the matching score values of the proposed method under affine transformations, noise addition and stretching.
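The DTW recurrence and the per-angle aggregation described above can be sketched as follows (Python/NumPy; an illustration using the same squared local distance, with the normalisation by the number of warping steps T obtained by backtracking):

```python
import numpy as np

def dtw_distance(hq, hd):
    """DTW between two Radon histograms, normalised by the number of warping steps."""
    K, L = len(hq), len(hd)
    D = np.full((K + 1, L + 1), np.inf)
    D[0, 0] = 0.0
    for k in range(1, K + 1):
        for l in range(1, L + 1):
            delta = (hq[k - 1] - hd[l - 1]) ** 2           # local cost delta(k, l)
            D[k, l] = delta + min(D[k - 1, l - 1], D[k - 1, l], D[k, l - 1])
    # backtrack only to count the warping steps T
    k, l, T = K, L, 1
    while (k, l) != (1, 1):
        step = np.argmin([D[k - 1, l - 1], D[k - 1, l], D[k, l - 1]])
        if step == 0:
            k, l = k - 1, l - 1
        elif step == 1:
            k -= 1
        else:
            l -= 1
        T += 1
    return D[K, L] / T

def pattern_distance(Rq, Rd):
    """Sum of per-angle DTW distances between two Radon matrices (columns = angles)."""
    return sum(dtw_distance(Rq[:, i], Rd[:, i]) for i in range(Rq.shape[1]))
```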
Fig. 4. Matching scores between the samples cow1 (reference), cow2 (noise), cow3 (translate + scale), cow4 (rotate) and cow5 (stretch). It provides the differences that exist between the samples due to noise, translation, rotation as well as scaling.

        cow1    cow2    cow3    cow4    cow5
cow1    0.0000  0.0033  0.0004  0.0023  0.0030
cow2            0.0000  0.0023  0.0034  0.0031
cow3                    0.0000  0.0018  0.0026
cow4                            0.0000  0.0035
cow5                                    0.0000
2.3 Pattern Recognition and Retrieval
We can now use the previously described approach as a global pattern-matching score. This score expresses the similarity between the database patterns and the query. Our problem is: given a set of points S in a metric space Ms and a query point q ∈ Ms, find the closest point in S to q. We express similarity as
\[ Similarity(P^q, P^d) = 1 - Dist(P^q, P^d) = \begin{cases} 1 & \text{for the closest candidate,} \\ 0 & \text{for the farthest candidate.} \end{cases} \]
Ranking can therefore be expressed in decreasing order of similarity. In our experiments, we will distinguish "recognition" (search for the closest candidate) from "retrieval" (where the closest candidates are retrieved for a given short-list).
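For recognition and retrieval, this reduces to sorting the database by the normalised similarity. A minimal sketch (Python), assuming the `pattern_distance` helper sketched in Section 2.2 and a list of database Radon matrices:

```python
def rank_database(Rq, database):
    """Return database indices sorted from most to least similar to the query Rq.

    Distances are normalised into [0, 1] over the database and turned into
    similarities; recognition keeps the first index, retrieval a short-list.
    """
    dists = [pattern_distance(Rq, Rd) for Rd in database]
    d_min, d_max = min(dists), max(dists)
    span = (d_max - d_min) or 1.0
    sims = [1.0 - (d - d_min) / span for d in dists]
    return sorted(range(len(database)), key=lambda i: sims[i], reverse=True)
```

Recognition keeps only the first index of the returned ranking, while retrieval keeps the requested short-list (top-20, top-40, and so on).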
3 Experiments
3.1 Benchmarking Methods
We confront the D-Radon method with well-known descriptors: R-signature [27], GFD [30], SC [4] and Zernike [19]. For those descriptors, it is important to fit the best parameters. For GFD, we have tuned the radial (4 : 12) and angular (6 : 20) frequencies to get the best combinations. For SC, we follow [4]. In the case of Zernike, we have used 36 Zernike functions of order less than or equal to 7. For Radon, the projecting range is [0, π].
3.2 Experimental Results
We work on several different datasets in different contexts. However, we primarily focus on A. distorted and degraded symbols in document analysis – graphics recognition and retrieval, and then B. shape retrieval as a CBIR application.
In order to test the robustness of the methods, we work on raw data; no pre-filtering (de-noising, for instance) has been applied.
A. Symbol Recognition and Retrieval
GREC2003 dataset – symbol recognition contest [15]. In this dataset, we have used the following categories: ideal, scaling, distortion as well as degradation. Altogether, there are 50 different model symbols. Those symbols are grouped into 3 sets, containing 5, 20 and 50 model symbols. Each model symbol has 5 test images in every category except the ideal one. Ideal test images are directly taken from the set of model symbols, and therefore this test evaluates the ability of simple shape discrimination as the number of symbols increases. Since vectorial distortion works only with symbols with straight lines, and not arcs, it is applied to a subset of 15 model symbols. Besides, there are 9 models of degradation, aiming to evaluate the robustness to scalability with degradation. Fig. 5 shows a few samples of the GREC2003 dataset.
Fig. 5. GREC2003 samples – graphical symbols: (a) ideal, (b) scaling, (c) distortion, (d) 9 different degradation models (m1 to m9)
To evaluate the method, each test image is matched with the model symbols to get the closest model. Experimental results for all the aforementioned categories are shown in Table 1. Note that since there are 9 models of degradation (m1 to m9), there are nine tests for every sample in each set (see Fig. 5(d)). Based on the results from ideal test images, one cannot judge the superiority of any method; only a running time comparison could discriminate them. For scaled images, R−signature lags far behind. Our method achieves a 100% recognition rate while not differing substantially from GFD, SC and Zernike. Results from test images with vectorial distortions show identical behaviour to the scaled as well as rotated samples. However, we observe noteworthy differences in the case of binary degradations. Overall, D−Radon performs best of all.
CVC hand-drawn symbol dataset – As in [29], we have used 10 × 300 sample images, i.e., 10 different known classes of hand-drawn architectural symbols with 300 instances each. Samples exhibit distortions, gaps, overlapping as well as missing parts within the shapes. Fig. 6 shows a few samples. In this dataset, we aim to retrieve all 300 instances for every chosen query.
Table 1. Recognition rate in % for GREC2003 dataset
Test images                     R-sign.   GFD   Zernike   SC   D-Radon
ideal
  set1     5×1                     100    100      100    100     100
  set2     20×1                    100    100      100    100     100
  set3     50×1                    100    100      100    100     100
Average – ideal                    100    100      100    100     100
scale
  set1     5×5                      45    100      100    100     100
  set2     20×5                     36    100       98    100     100
  set3     50×5                     28     98       96     98     100
Average – scale                     37     99       98     99     100
distort
  distort1-set1   5×5               20    100      100    100     100
  distort2-set1   5×5                8    100      100    100     100
  distort3-set1   5×5                8    100      100    100     100
  distort1-set2   15×5               8    100      100    100     100
  distort2-set2   15×5               4    100      100    100     100
  distort3-set2   15×5               4    100      100    100     100
Average – distort                    7    100      100    100     100
degrade
  set1     5×5×9                    12     86       79     87      95
  set2     20×5×9                   07     93       79     76      96
  set3     50×5×9                   07     89       77     70      93
Average – degrade                   08     89       78     77      95
Table 2 shows the average retrieval rate for all requested short-lists (e.g., top-20, top-40 and so on). Up to top-60, one cannot decide which method performs best, since there are no notable differences in retrieval rate; the distinction only appears after top-60. To be precise, the aim of this test is to evaluate the retrieval stability of the methods. D−Radon reaches a rate of 86% in top-300, more than 16% ahead of SC – a fairly large margin – while GFD trails SC by approximately 9%. R−signature provides average results compared to Zernike. Compared to all of them, D−Radon outperforms.
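One plausible reading of the retrieval-rate measure reported in Table 2 is sketched below (our interpretation, not the authors' exact protocol): the fraction of the first k ranked candidates that belong to the query class, to be averaged over all queries and classes.

```python
import numpy as np

def retrieval_rate(ranked_labels, query_label, shortlist):
    """Fraction of the first `shortlist` ranked candidates whose class
    matches the query class (one query; average over queries elsewhere)."""
    top = np.asarray(ranked_labels)[:shortlist]
    return float(np.mean(top == query_label))
```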
Fig. 6. 2 hand-drawn samples from 10 different known classes
Table 2. Average retrieval rate in % for CVC hand-drawn symbol dataset
Test images: 10×300. Average retrieval rate (%) for each short-list:

            top20 top40 top60 top80 top100 top120 top140 top160 top180 top200 top220 top240 top260 top280 top300
Zernike       48    42    39    37     35     34     33     32     30     30     29     28     28     27     26
R-sign.       82    75    69    65     62     59     56     54     51     49     48     46     45     43     42
GFD           96    93    90    88     85     83     81     78     76     73     71     68     66     63     61
SC            98    95    95    92     91     88     87     85     83     81     78     78     75     73     70
D-Radon       99    99    98    97     97     96     95     94     93     92     91     90     89     87     86
Fig. 7. 2 samples from each class of (a) shapes99 and (b) shapes216 datasets
B. Shape Retrieval
We have used two shape datasets [26]: Kimia's Shapes99 and Shapes216. The Shapes99 dataset consists of 9 classes, each containing 11 samples; the Shapes216 dataset has 18 classes of 12 samples each. Fig. 7 shows a few samples of both datasets. As in [26], we have used the datasets for recognition. For the retrieval rate, since there are N instances of each class, we have increased the proximity search from 1 to N in steps of +1. In addition, we have computed the Bull's eye score [21,4], i.e., the ratio of the number of correct matches within a proximity search space of 2N to the total number of possible matches in the dataset. Table 3 shows the experimental results for both datasets. Compared to Zernike and R−signature, our method outperforms them by a significant margin, while it performs almost equally with GFD and SC.
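As an illustration of the Bull's eye measure mentioned above, the sketch below (variable names and the handling of the query item are our assumptions) counts, for every query, how many same-class shapes appear within the top 2N retrievals and divides by the total number of possible matches.

```python
import numpy as np

def bulls_eye_score(rankings, labels, n_per_class):
    """Bull's eye score: correct matches within the top 2N retrievals over
    all queries, divided by the total number of possible matches.
    `rankings[q]` holds database indices sorted by similarity for query q."""
    labels = np.asarray(labels)
    hits, possible = 0, 0
    for q, order in enumerate(rankings):
        same_class = labels[np.asarray(order)[: 2 * n_per_class]] == labels[q]
        hits += int(np.sum(same_class))
        possible += n_per_class
    return hits / possible
```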
4
Discussions
We analyse the behaviour of the methods based on the key characteristics as well as the major challenges of the test images. In the graphical symbol datasets, samples are distorted, embedded with different levels of noise and even degraded. In particular, the hand-drawn symbol dataset contains missing parts and severe vectorial distortions, in addition to significant size variations and multi-class similarity between the classes. Besides, occlusion is more frequent in the hand-drawn and shape datasets. Of course, global signal-based descriptors are easy to implement, as explained in Section 1. However, they are generally not well adapted to capturing small detail changes precisely, nor are they particularly tolerant to image distortions, noise and missing parts. We have observed such behaviour throughout the tests. Considering the GREC2003 dataset, for instance, for the degradation models R−signature is affected by every dot (from noise) due to the square effect (via the R−transform). Similarly, SC is affected where the level of noise is high (see m2 to m7 in Fig. 5), since it accounts for those pixels while sampling.
Table 3. Average retrieval rate and Bull's eye score in % for Kimia's datasets

shapes99 (9×11)    top1 top2 top3 top4 top5 top6 top7 top8 top9 top10 top11  bull's eye
  Zernike           100   74   66   59   53   48   45   42   40   37    35       48
  R-sign.           100   83   73   66   60   56   51   49   47   45    43       59
  GFD               100   97   95   92   89   87   85   84   82   80    77       84
  SC                100   97   96   92   90   87   83   80   77   73    70       86
  D-Radon           100   97   95   94   92   91   86   81   80   79    77       87

shapes216 (18×12)  top1 top2 top3 top4 top5 top6 top7 top8 top9 top10 top11 top12  bull's eye
  Zernike           100   81   71   63   57   53   50   48   45   43    41    39       48
  R-sign.           100   86   80   76   71   67   65   62   59   57    54    52       64
  SC                100   97   93   91   87   85   84   81   79   78    76    73       83
  GFD               100   99   97   95   93   91   90   89   88   86    83    80       87
  D-Radon           100   99   95   93   91   90   89   86   85   84    83    81       88
Table 4. Average running time for generating features and matching for a single pair

Time (sec.):   R-sign. 1.5    GFD 15    Zernike 20    SC 43    D-Radon 77
For Zernike, the disadvantage (in particular when high-degree polynomials are involved) is the unequal distribution of the nodal lines over the unit disk; we have observed this effect for the degradation models m8 and m9 in Fig. 5. Similar situations occur in the CVC hand-drawn symbols. GFD, on the whole, provides average results. Those descriptors provide interesting results for silhouette shapes. Considering such datasets, our method performs reasonably better. More specifically, it can optimally handle noisy, degraded as well as distorted samples, and samples where the internal content needs to be considered. The running time complexity is comparatively high since DTW is used for matching the Radon histograms; it largely depends, however, on how big the image is. As far as computational cost is concerned, the observed average running time for all methods is given in Table 4. We have used MATLAB 7.8.0 on a Linux platform.
5
Conclusions
We have presented a method for graphics recognition in document analysis and for shape retrieval in CBIR applications. The method is based on dynamic programming for matching Radon features. It is quite simple and easy to implement since it is parameter free, and computing the Radon transform is quite immediate; the running time complexity lies in the matching. However, it could be substantially reduced by using optimised DTW [25] – a step to go further.
References 1. Attalla, E., Siy, P.: Robust shape similarity retrieval based on contour segmentation polygonal multiresolution and elastic matching. PR 38(12), 2229–2241 (2005) 2. Bailey, R.R., Srinath, M.: Orthogonal moment features for use with parametric and non-parametric classifiers. IEEE PAMI 18(4), 389–399 (1996)
3. Bandera, C.U.A., Sandoval, F.: Non-parametric planar shape representation based on adaptive curvature functions. PR 35, 43–53 (2002) 4. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE PAMI 24(4), 509–522 (2002) 5. Bernier, T., Landry, J.A.: A new method for representing and matching shapes of natural objects. PR 36(8), 1711–1723 (2003) 6. Boulgouris, N.V., Chi, Z.X.: Gait recognition using radon transform and linear discriminant analysis. IEEE Image Processing 16(3), 731–740 (2007) 7. Bunke, H., Riesen, K.: Recent advances in graph-based pattern recognition with applications in document analysis. PR 44(5), 1057–1067 (2011) 8. Chong, C., Raveendran, P., Mukudan, R.: A comparative analysis of algorithms for fast computation of zernike moments. PR 36, 731–742 (2003) 9. Coetzer, J.: Off-line Signature Verification. Ph.D. thesis, University of Stellenbosch (2005) 10. Conte, D., Foggia, P., Sansone, C., Vento, M.: Thirty years of graph matching in pattern recognition. IJPRAI 18(3), 265–298 (2004) 11. Deans, S.R.: Applications of the Radon Transform. Wiley Interscience Publications, New York (1983) 12. El-ghazal, A., Basir, O., Belkasim, S.: Farthest point distance: A new shape signature for fourier descriptors. Signal Processing: Image Communication 24(7), 572– 586 (2009) 13. Flusser, J.: On the independence of rotation moment invariants. PR 33(9), 1405– 1410 (2000) 14. Fr¨ anti, P., Mednonogov, A., Kyrki, V., K¨ alvi¨ ainen, H.: Content-based matching of line-drawing images using the hough transform. IJDAR 3(2), 117–124 (2000) 15. GREC: International symbol recognition contest at grec2003 (2003), http://www.cvc.uab.es/grec2003/SymRecContest/ 16. Jafari-Khouzani, K., Soltanian-Zadeh, H.: Radon transform orientation estimation for rotation invariant texture analysis. IEEE PAMI 27(6), 1004–1008 (2005) 17. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: A review. IEEE PAMI 22(1), 4–37 (2000) 18. Santosh,K.C., Wendling, L., Lamiroy, B.: Using spatial relations for graphical symbol description. In: ICPR. pp. 2041–2044 (2010) 19. Kim, W.Y., Kim, Y.S.: A region-based shape descriptor using zernike moments. Signal Processing: Image Communication 16(1-2), 95–102 (2000) 20. Kruskall, J.B., Liberman, M.: The symmetric time warping algorithm: From continuous to discrete. In: Time Warps, String Edits and Macromolecules: The Theory and Practice of String Comparison, pp. 125–161. Addison-Wesley, Reading (1983) 21. Latecki, L.J., Lakmper, R., Eckhardt, U.: Shape descriptors for non-rigid shapes with a single closed contour. In: CVPR, pp. 1424–1429 (2000) 22. Leavers, V.F.: Use of the two-dimensional radon transform to generate a taxonomy of shape for the characterization of abrasive powder particles. IEEE PAMI 22(12), 1411–1423 (2000) 23. Loncaric, S.: A survey of shape analysis techniques. PR 31(8), 983–1001 (1998) 24. Jayadevan, R., Kolhe, S.R., Patil, P.M.: Dynamic time warping based static hand printed signature verification. PRR 4(1), 52–65 (2009) 25. Salvador, S., Chan, P.: Toward accurate dynamic time warping in linear time and space. Intell. Data Anal. 11(5), 561–580 (2007) 26. Sebastian, T.B., Klein, P.N., Kimia, B.B.: Recognition of shapes by editing shock graphs. In: ICCV, pp. 755–762 (2001)
27. Tabbone, S., Wendling, L., Salmon, J.-P.: A new shape descriptor defined on the radon transform. CVIU 102(1), 42–51 (2006) 28. Teh, C.H., Chin, R.T.: On image analysis by the methods of moments. IEEE PAMI 10(4), 496–513 (1988) 29. Wendling, L., Rendek, J., Matsakis, P.: Selection of suitable set of decision rules using choquet integral. In: SSPR/SPR, pp. 947–955 (2008) 30. Zhang, D., Lu, G.: Shape-based image retrieval using generic fourier descriptor. Signal Processing: Image Communication 17, 825–848 (2002) 31. Zhang, D., Lu, G.: Review of shape representation and description techniques. PR 37(1), 1–19 (2004) 32. Zhang, D., Lu, G.: Study and evaluation of different fourier methods for image retrieval. IVC 23(1), 33–49 (2005) 33. Zhu, S.C., Yuille, A.L.: Forms: A flexible object recognition and modelling system. IJCV 20(3), 187–212 (1996)
Ridges and Valleys Detection in Images Using Difference of Rotating Half Smoothing Filters Baptiste Magnier, Philippe Montesinos, and Daniel Diep Ecole des Mines d’ALES, LGI2P, Site EERIE, Parc Scientifique G.Besse 30035 Nimes Cedex 1 {Baptiste.Magnier,Philippe.Montesinos,Daniel.Diep}@mines-ales.fr
Abstract. In this paper we propose a new ridge/valley detection method in images based on the difference of rotating Gaussian semi filters. The novelty of this approach resides in the mixing of ideas coming both from directional filters and DoG method. We obtain a new ridge/valley anisotropic DoG detector enabling very precise detection of ridge/valley points. Moreover, this detector performs correctly at crest lines even if highly bended, and is precise on junctions. This detector has been tested successfully on various image types presenting difficult problems for classical ridges/valleys detection methods. Keywords: Ridge/valley, directional filter, Gauss filter, difference of Gaussian, anisotropic.
1
Introduction
Anisotropic filters play an important part in image processing. Indeed, anisotropic filters provide good results and are often used in edge detection [5] [10], texture removal [9], image enhancing [12] and restoration [11]. In several domains, anisotropic filters allow for a better robustness than classical methods. However, they are seldom used in crest line detection. Ridges and valleys are formed by the points where the intensity gray level reaches a local extremum in a given direction (illustrated in Fig. 1); this direction is the normal to the curve traced by the ridge or, respectively, the valley at this point [3]. Crest lines correspond to important features in many images. Ridges and valleys are associated with, but not limited to, roads in aerial images [7] or blood vessels in medical images [1] [6]. Classical edge detection [2] fails to detect ridges or valleys in images; instead, it results in two edges, one on each side of the ridge or the valley (illustrated in Fig. 1(e)). Edges can be used to detect straight lines using a Hough transform. It is advisable to compute edges using [5], which creates straight contours [13]. However, this method is adapted only to straight lines [4]. According to [6], crest line extraction can be divided into three main categories of segmentation algorithms. The first refers to pattern recognition and filtering
Fig. 1. Valley and ridge in scalar images. (a) Valley in an image. (b) Surface representation of a valley. (c) Ridge in an image. (d) Surface representation of a ridge. The z axis corresponds to the intensity gray level. (e) Edge detection on the image in (a).
techniques (for example differential geometry [1] [14] and morphology), the second to model-based approaches (snakes) [8] [7], and the third to tracking-based approaches. Filtering techniques are well adapted to ridge and valley extraction because they are able to smooth the noise and amplify the crest line information by computing surface curvature [1] [14]. However, the results obtained by these approaches can present a high false detection rate in noisy images, mainly because the high-pass filtering used for the second derivative is sensitive to the noise level. In this paper, we present a rotating filter (inspired by [9] [1] and [10]) able to detect ridges and valleys. Our ridge/valley detector implements anisotropic directional linear filtering by means of the difference of two rotating half smoothing filters. Then, we compute a ridge or valley operator using a local directional maximization or, respectively, minimization of the response of the filters. These directions correspond to the orientation of a crest line or a junction of ridges/valleys. Contrary to several approaches involving crest lines, this algorithm performs well even on highly bended ridges or valleys. Moreover, our detector is robust at crest line junctions and bends thanks to these two rotating half smoothing kernels. Finally, due to its strong smoothing in the directions of the crest line, the detection is not sensitive to noise. This paper is organized as follows. In Section 2, we present an anisotropic Gaussian smoothing filter. We present a robust crest line detector using the difference of half directional Gaussian filters in Section 3. Section 4 is devoted to experimental results, comparison with another method and results evaluation. Finally, Section 5 concludes this paper and presents future work.
2
A Rotating Smoothing Filter
In our method, for each pixel of the original image, we use a rotating smoothing filter in order to build a signal s which is a function of a rotation angle θ and the underlying signal. As shown in [10] and [9], smoothing with rotating filters means that the image is smoothed with a bank of rotated anisotropic Gaussian kernels:
Fig. 2. A smoothing rotating filter and points selection on an original image: (a) smoothing filter, (b) rotating filters, (c) points selection on an original image 712×220.
Fig. 3. Image in Fig. 2(c) smoothed using different parameters and different orientations: (a) θ = 34 degrees, μ = 10, λ = 1; (b) θ = 270 degrees, μ = 10, λ = 1; (c) θ = 34 degrees, μ = 10, λ = 1.5; (d) θ = 270 degrees, μ = 10, λ = 1.5.
$$G_{(\mu,\lambda)}(x, y, \theta) = C \cdot H\!\left(R_\theta \begin{pmatrix} x \\ y \end{pmatrix}\right) \, e^{\,-\begin{pmatrix} x & y \end{pmatrix} R_\theta^{-1} \begin{pmatrix} \frac{1}{2\mu^2} & 0 \\ 0 & \frac{1}{2\lambda^2} \end{pmatrix} R_\theta \begin{pmatrix} x \\ y \end{pmatrix}} \qquad (1)$$
where C is a normalization coefficient, Rθ a rotation matrix of angle θ, x and y the pixel coordinates, and μ and λ the standard deviations of the Gaussian filter. As we need only the causal part of the filter (illustrated in Fig. 2(a)), we simply “cut” the smoothing kernel in the middle; this operation corresponds to the Heaviside function H [10]. By convolution with these rotated kernels (see Fig. 2(b)), we obtain a collection of directionally smoothed images Iθ = I ∗ G(μ,λ)(θ). For computational efficiency, we first rotate the image at discretized orientations from 0 to 360 degrees (with Δθ = 1, 2, 5, or 10 degrees, depending on the angular precision needed and the smoothing parameters) before applying non-rotated smoothing filters; μ and λ define the standard deviations of the Gaussian filter (illustrated in Fig. 2(a)). As the image is rotated instead of the filters, the filtering implementation can use an efficient recursive approximation of the Gaussian filter. As presented in [10], the implementation is quite straightforward. In a second step, we apply an inverse rotation to the smoothed image and obtain a bank of 360/Δθ images (some examples are available in Fig. 3).
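A minimal sketch of the rotate–smooth–unrotate strategy is given below; it is our own illustration, not the authors' implementation. For brevity it applies the full anisotropic Gaussian kernel, whereas the actual method additionally keeps only the causal half of the kernel (the Heaviside cut); the function name and parameters are ours.

```python
import numpy as np
from scipy.ndimage import rotate, gaussian_filter

def directional_smoothing_bank(image, mu, lam, dtheta=10):
    """Bank of directionally smoothed images I_theta: rotate the image,
    apply a separable Gaussian with standard deviations (mu along x,
    lambda along y), then rotate back.  The half (causal) kernel of the
    paper would additionally zero one half-plane before smoothing."""
    image = np.asarray(image, dtype=float)
    bank = {}
    for theta in range(0, 360, dtheta):
        rotated = rotate(image, theta, reshape=False, order=1, mode='nearest')
        smoothed = gaussian_filter(rotated, sigma=(lam, mu))
        bank[theta] = rotate(smoothed, -theta, reshape=False, order=1, mode='nearest')
    return bank
```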
3 Ridge/Valley Lines Detection Using Difference of Directional Gaussian Filters
3.1 Difference of Rotated Half Smoothing Filters (DRF)
As presented in Fig. 4(a), we want to estimate at each pixel a smoothed second derivative of the image along a curve crossing this pixel. In one dimension, the second derivative of a signal can be estimated thanks to a DoG operator. For our problem, we just apply two filters with two different λ and the same μ to obtain directional derivatives (one example of two discretized filters is available in Fig. 4(d)). Then, we compute the difference of these two filters to obtain the desired smoothed second derivative information in the thin net directions (illustrated in Fig. 4(b)).
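The difference of the two smoothing banks can be sketched as follows; this builds on the directional_smoothing_bank sketch given earlier (both helpers are our own, hedged illustrations of the idea, not the authors' code).

```python
def drf_responses(image, mu, lam1, lam2, dtheta=10):
    """D(x, y, theta): difference of two directionally smoothed images that
    share mu but use two different lambda values, for every discretised
    orientation (relies on the directional_smoothing_bank sketch above)."""
    bank1 = directional_smoothing_bank(image, mu, lam1, dtheta)
    bank2 = directional_smoothing_bank(image, mu, lam2, dtheta)
    return {theta: bank1[theta] - bank2[theta] for theta in bank1}
```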
3.2 Pixel Classification
Applying the DRF filter by convolution to each pixel of an image (by means of the rotated-image technique defined above), we obtain for each pixel a signal which corresponds to a 360/Δθ scan in all directions (see Fig. 5). Our idea is then to characterize pixels which belong to a crest line (a ridge or a valley), and thus to build our detector. Let D(x, y, θ) be the pixel signal obtained at a pixel P located at (x, y). D(x, y, θ) is a function of the direction θ such that:
Fig. 4. DRF filter descriptions: (a) a DRF, (b) DRF in the thin net directions, (c) discretized filter. For (c) top: μ = 10 and λ = 1. For (c) bottom: μ = 10 and λ = 1.5.
Fig. 5. DRF result of Fig. 2(c) at different orientations ((a) θ = 34 degrees, (b) θ = 270 degrees) using the following parameters: μ = 10, λ1 = 1 and λ2 = 1.5 (normalized images)
Fig. 6. Examples of the functions D(x, y, θ) at the selected points (Point 1 to Point 10) in Fig. 2(c) using μ = 10, λ1 = 1, λ2 = 1.5. The x-axis corresponds to the value of θ (in degrees) and the y-axis to D(x, y, θ).
$$D(x, y, \theta) = G_{(\mu,\lambda_1)}(x, y, \theta) - G_{(\mu,\lambda_2)}(x, y, \theta) \qquad (2)$$
where x and y are the pixel coordinates and μ, λ1 and λ2 correspond to the standard deviations of the Gaussians. Some examples are represented in Fig. 6. We define a ridge/valley operator Σ(x, y) by the following expression:
$$\Sigma(x, y) = D(x, y, \theta_{M_1}) + D(x, y, \theta_{M_2}) + D(x, y, \theta_{m_1}) + D(x, y, \theta_{m_2}) \qquad (3)$$
where θM1, θM2 are the directions of the local maxima of the function D and θm1, θm2 the directions of the local minima (see the example in Fig. 8(a)). The conditions of detection are as follows: if Σ(x, y) > Σth, the pixel P belongs to a ridge line; if Σ(x, y) < −Σth, the pixel P belongs to a valley line, where Σth > 0. On a typical valley (for example point 1 in Fig. 6), the pixel signal at the minimum of a valley contains at least two negative sharp peaks. For ridges (for example point 7 in Fig. 6), the pixel signal at the maximum of a ridge contains at least two positive peaks. These sharp peaks correspond to the two directions of the curve (an entering and a leaving path). In the case of a junction, the number of peaks corresponds to the number of crest lines in the junction (see point 4 in Fig. 6).
Fig. 7. η extraction (Σ(x, y) > Σth). (a) η computation from θM1 and θM2. (b) η corresponds to the direction perpendicular to the crest line at the level of a pixel P.
We obtain the same information for bended lines (illustrated by point 2 in Fig. 6). However, at the level of an edge, the absolute value of Σ is close to 0 because the absolute values of D at θM1, θM2, θm1 and θm2 are close to each other (see points 6 and 7 in Fig. 6). Finally, due to the strong smoothing, D is close to 0 in the presence of noise when there is neither a crest line nor an edge (illustrated by point 10 in Fig. 6); that is why our method is robust to noise. Note that Σth can be used as a parameter for the hysteresis threshold (see next section).
3.3 Ridge and Valley Extractions
Once Σ(x, y) is computed, we simply estimate η(x, y) (see Fig. 7(a) and (b)) by:
$$\eta(x, y) = (\theta_{M_1} + \theta_{M_2})/2 \ \text{ when } \Sigma(x, y) > \Sigma_{th}, \qquad \eta(x, y) = (\theta_{m_1} + \theta_{m_2})/2 \ \text{ when } \Sigma(x, y) < -\Sigma_{th}.$$
Thus, from Σ(x, y) and η(x, y) (an example is shown in Fig. 8(b)), crest lines can easily be extracted by computing the local maxima of Σ(x, y) in the direction η(x, y) for ridge detection (and the minima for valley detection); examples can be seen in Fig. 8(c) and (d).
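The per-pixel classification rule can be sketched as follows. This is our own simplified illustration: the two largest and two smallest samples of the angular signal stand in for the local maxima and minima used by the paper, and the averaging of directions ignores angle wrap-around.

```python
import numpy as np

def classify_pixel(d_theta, thetas, sigma_th):
    """Given the angular signal D(x, y, theta) at one pixel, return
    ('ridge' | 'valley' | None, eta) according to the Sigma operator."""
    d = np.asarray(d_theta, dtype=float)
    order = np.argsort(d)
    m1, m2 = order[0], order[1]      # two smallest responses (minima)
    M1, M2 = order[-1], order[-2]    # two largest responses (maxima)
    sigma = d[M1] + d[M2] + d[m1] + d[m2]
    if sigma > sigma_th:
        return 'ridge', 0.5 * (thetas[M1] + thetas[M2])
    if sigma < -sigma_th:
        return 'valley', 0.5 * (thetas[m1] + thetas[m2])
    return None, None
```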
4
Results
We present results obtained both on synthetic and real images using our DRF detector. We denote by τL the lower hysteresis threshold and by τH the higher one.
4.1 Results on a Synthetic Image
Fig. 9 shows results of ridges and valleys detection on a noisy synthetic image. Ridges are correctly detected as well as valleys, whereas our detector is not misled by the contour of the black object and the noise. At the end of this section an evaluation of the robustness of our detector in presence of noise is provided.
Fig. 8. Example of the different steps of line extraction in the image presented in Fig. 2(c) using Δθ = 2 (degrees), μ = 10, λ1 = 1, λ2 = 1.5: (a) image of Σ, (b) η image (η in degrees, modulo 180), (c) maxima of Σ in the η direction, (d) minima of Σ in the η direction. All these images are normalized.
Fig. 9. Ridges and valleys detection on a noisy synthetic image (712 × 220). Left to right: noisy image (uniform noise), ridge detection and valley detection. Δθ = 2 (degrees), μ = 10, λ1 = 1, λ2 = 1.5, τL = 0.03 and τH = 0.08.
Fig. 10. Valley detection of blood vessels in the brain: (a) original image 312×312, (b) valley detection, (c) superposition of (b) on (a). Δθ = 5 (degrees), μ = 5, λ1 = 1, λ2 = 1.5, τL = 0.0001 and τH = 0.005.
4.2
Results on Real Images
We have tested our detector on several different real images and compared our method with the one described in [1]. In the first image of blood vessels, Fig. 10, the aim is to extract thin nets. This image is not corrupted by noise, so it is quite easy for the DRF detector to compute dark as well as bright crest lines. Highly bended valleys are easily extracted from the image. The superposition of the detected valleys on the original image shows satisfying results in terms of precision.
Fig. 11. Valley detection of watermarks in a paper: (a) original image, (b) result of [1], (c) our result. (b) σ = 1.5, τL = 0.5 and τH = 0.8. (c) Δθ = 5 (degrees), μ = 10, λ1 = 1, λ2 = 1.5, τL = 0.001 and τH = 0.008.
Fig. 12. Ridge detection on a satellite image (277 × 331): (a) original image, (b) result of Ziou [14], (c) result of [1], (d) our result. For (c): σ = 1.5, τL = 0.5 and τH = 0.7. For (d): Δθ = 2, μ = 3, λ1 = 1.33, λ2 = 2, τL = 0.002 and τH = 0.01.
In Fig. 11, the aim is to extract vertical watermarks. As this image is very noisy, extracting the valleys caused by the watermarks is very hard for classical methods. However, our detector performs well; the rate of noise in the results is much smaller than with the method proposed in [1]. Roads often appear as ridges in satellite images. In Fig. 12, roads are clearly visible, as opposed to Fig. 13, where our method nevertheless detects ridges even if they are highly bended. Moreover, it performs well at junctions. We have compared our results with the methods presented in [1] and [14]; these results clearly show the superiority of our approach. In Fig. 13(e), the crest lines are not very sharp; however, our detector is able to extract most of the roads. The last figure shows the efficiency of our method against noise. In Fig. 14, we have tested ridge and then valley detection. This result greatly satisfies us because our approach is able to detect both the short valleys created by the letters in the image and the ridges between these same letters, while detecting the other ridges. Moreover, the noise in this image does not affect our detection. We provide quantitative results on noisy images in the next paragraph. A result database is available online1.
1 http://www.lgi2p.ema.fr/~magnier/Demos/DRFresults.html
Fig. 13. Ridge detection on aerial images: (a) original image 500×500, (b) our ridge detection on (a), (c) result of [1] on (a), (d) original image 1000×1000, (e) our ridge detection superposed on (d). In (b) and (e): Δθ = 5 (degrees), μ = 10, λ1 = 1 and λ2 = 1.5. (b) τL = 0.02 and τH = 0.06. (e) τL = 0.01 and τH = 0.0025. (c) σ = 1.5, τL = 0.55 and τH = 0.65.
Fig. 14. Ridges and valleys detection (in red) on a noisy real image: (a) original image 403 × 351, (b) ridge detection, (c) valley detection. Δθ = 5 (degrees), μ = 5, λ1 = 2, λ2 = 3, τL = 0.01 and τH = 0.03.
Fig. 15. Images 160×80 with different levels of noise L: (a) L = 0.1, (b) L = 0.5, (c) L = 0.7, (d) L = 0.8, (e) L = 0.9.
Fig. 16. Error evaluation of our approach: (a) error of our detector (number of true negative and false positive pixels vs. noise level), (b) comparison of the total error with [1] (DRF method vs. TNE method).
Table 1. Confusion matrix showing detection errors
                       Detection positive   Detection negative
Actual pixels true            346                   26
Actual pixels false            34                12394
4.3 Results Evaluation
In order to obtain quantitative results, we have also conducted a number of tests with synthetic images including thin one-pixel-wide ridges or valleys. Fig. 15 shows an example of such valleys with a simple image composed of a square and a circle. In our test, we performed a valley detection and compared the result to the original image, pixel per pixel. We thus obtained a quantified error by taking the difference between the two images. In artificial intelligence, confusion matrices are often used to evaluate classifier errors. An example of such a confusion matrix is shown in Table 1, with the following parameters: λ1 = 1, λ2 = 1, μ = 5 for a noise-free image. In this example, we see that 346 pixels out of the 372 that build the figure were correctly found, whereas 60 pixels (34 + 26) were mistaken. Influence of noise: We analyzed the effect of adding a uniform white noise to the original image using the following formula:
Im = (1 − L)I0 + L·IN, where I0 is the original image, IN an image of random uniform noise and Im the resulting noisy image. As expected, the number of errors increases monotonically with the noise level L. Two curves have been plotted in Fig. 16(a): the number of true negative pixels and the number of false positive pixels, which both constitute errors. For low levels of noise (L < 0.8), small variations in the number of errors are caused by the sampling effect: lines in the image are projected on a square grid and binarized, generating some quantization inaccuracies. In particular, the drawing of a circle may slightly differ from one detection to the other, leading however to perceptually equivalent representations. As a result, the number of errors remains relatively low even at a high level of noise, showing the good robustness of the DRF filter. Comparison with another method: In a second part, we compared the results of the DRF filter with those obtained by the method of [1], called TNE. The total number of errors, i.e. false positives + true negatives, has been plotted in Fig. 16(b). Both methods show the same robustness to noise, but the DRF filter clearly outperforms the TNE method. Noise relocates the maxima position in the TNE method, so the crest lines are detected with an offset of one pixel, whereas with our approach the strong smoothing in the directions of the crest line does not relocate the detection (Fig. 7(b)).
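The noise-mixing formula and the pixel-wise error counting can be sketched as follows; this is our own illustration, using standard confusion-matrix terms, and the helper names are assumptions rather than the authors' code.

```python
import numpy as np

def add_uniform_noise(i0, level, rng=np.random.default_rng()):
    """Im = (1 - L) * I0 + L * IN, with IN a uniform random image spanning
    the intensity range of I0."""
    i0 = np.asarray(i0, dtype=float)
    noise = rng.uniform(i0.min(), i0.max(), size=i0.shape)
    return (1.0 - level) * i0 + level * noise

def detection_errors(detected, ground_truth):
    """Pixel-wise confusion counts between a binary detection map and the
    one-pixel-wide ground truth: (TP, FP, FN, TN)."""
    d = np.asarray(detected, dtype=bool)
    g = np.asarray(ground_truth, dtype=bool)
    tp = int(np.sum(d & g))
    fp = int(np.sum(d & ~g))
    fn = int(np.sum(~d & g))
    tn = int(np.sum(~d & ~g))
    return tp, fp, fn, tn
```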
5
Conclusion
We have presented a new, precise and robust detection method for ridges and valleys based on the difference of two half rotating linear smoothing filters and a local maximization/minimization. Our method is able to detect ridges and valleys even if they are highly bended. Moreover, due to the two half rotating smoothing kernels, our approach makes it possible to compute the two directions of a crest line and the two principal directions at junctions. Finally, the strong smoothing in the direction of the crest line makes the method highly robust to noise. This detector has been tested successfully on various image types presenting difficult problems for classical crest line detection methods. Next on our agenda is to extend this approach to the detection of isolated junctions. This contribution will improve the DRF detector, which at present handles only the two directions corresponding to the maxima/minima of the signal at each pixel.
References 1. Armande, N., Montesinos, P., Monga, O.: Thin Nets Extraction using a Multi-Scale Approach. In: Scale-Space Theory in Computer Vision, pp. 361–364 (1997) 2. Canny, J.F.: A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8(6), 679–698 (1986)
3. Do Carmo, M.P.: Differential Geometry of Curves and Surfaces. Prentice Hall, Englewood Cliffs (1976) 4. El Mejdani, S., Egli, R., Dubeau, F.: Old and New Straight-Line Detectors: Description and Comparison. Pattern Recognition 41(6), 1845–1866 (2008) 5. Geusebroek, J., Smeulders, A., van de Weijer, J.: Fast Anisotropic Gauss Filtering. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 99–112. Springer, Heidelberg (2002) 6. Kirbas, C., Quek, F.: A Review of Vessel Extraction Techniques and Algorithms. ACM Computing Surveys 36(2), 81–121 (2004) 7. Laptev, I., Mayer, H., Lindeberg, T., Eckstein, W., Steger, C., Baumgartner, A.: Automatic Extraction of Roads from Aerial Images based on Scale Space and Snakes. Machine Vision and Applications 12(1), 23–31 (2000) 8. Lindeberg, T.: Edge Detection and Ridge Detection with Automatic Scale Selection. International Journal of Computer Vision 30(2), 117–154 (1998) 9. Magnier, B., Montesinos, P., Diep, D.: Texture Removal by Pixel Classification using a Rotating Filter. In: IEEE 36th International Conference on Acoustics, Speech and Signal Processing, pp. 1097–1100 (2011) 10. Montesinos, P., Magnier, B.: A New Perceptual Edge Detector in Color Images. In: Advanced Concepts for Intelligent Vision Systems, pp. 209–220 (2010) 11. Tschumperl´e, D.: Fast Anisotropic Smoothing of Multi-Valued Images using Curvature-Preserving PDEs. IJCV 68(1), 65–82 (2006) 12. Weickert, J.: Coherence-Enhancing Diffusion Filtering. International Journal of Computer Vision 31(2/3), 111–127 (1999) 13. Zhou, J., Bischof, W.F., Sanchez-Azofeifa, A.: Extracting Lines in Noisy Image Using Directional Information. In: 18th International Conference on Pattern Recognition, vol. 2, pp. 215–218 (2006) 14. Ziou, D.: Line Detection using an Optimal IIR Filter. Pattern Recognition 24(6), 465–478 (1991)
Analysis of Wear Debris through Classification Roman Juránek1, Stanislav Machalík2, and Pavel Zemčík1 1
Graph@FIT Faculty of Information Technology Brno University of Technology {ijuranek,zemcik}@fit.vutbr.cz 2 University of Pardubice [email protected]
Abstract. This paper introduces a novel method of wear debris analysis through classification of the particles based on machine learning. Wear debris consists of particles of metal found in e.g. lubricant oils used in engineering equipment. Analytical ferrography is one of methods for wear debris analysis and it is very important for early detection or even prevention of failures in engineering equipment, such as combustion engines, gearboxes, etc. The proposed novel method relies on classification of wear debris particles into several classes defined by the origin of such particles. Unlike the earlier methods, the proposed classification approach is based on visual similarity of the particles and supervised machine learning. The paper describes the method itself, demonstrates its experimental results, and draws conclusions.
1
Introduction
Analysis of wear debris allows for monitoring of the status as well as the long-term trend of wear of engineering equipment parts [1]. It plays an important role in preventive measures that can help avoid faults of engineering equipment. This paper presents a novel approach that extends the possibilities of analytical ferrography, one of the most frequently used methods for the analysis of particles occurring in wear debris found in operational liquids, especially lubricating oils. The extension consists of the proposal and implementation of an automatic classifier of particle images. The approach relies on particle sedimentation on a glass board during the flow of the oil sample through a strong inhomogeneous magnetic field; the images of the particles are then obtained using a microscope [1]. While the method is very successful, determining the wear classes for individual particles depends on the skills of the operator performing the evaluation [2]. The idea of using the analytical ferrography method with tools for automated particle classification is not completely new. Some proposals as well as implementations of systems based on ferrograms can be found in the literature [2–7]. Most of the proposed systems were not developed up to practical implementation. A system working in practice is described in [7]; however, it is not suitable for use
with analytical ferrography. The described system works with surface textures, so the particle images have to be obtained using an electron microscope, which is difficult; analytical ferrography cannot be used to obtain particle images that would allow analyzing their surface. The aim of the presented work is to introduce a new automatic method for particle classification using supervised machine learning. The approach is based on within-class visual similarity of particles. The rest of this paper is structured as follows. Section 2 describes approaches currently used in the classification of wear particles. Section 3 characterizes our data and presents the normalization process that we used in our experiments. The machine learning and feature extraction are described in Section 4. A summary of the results is presented in Section 5. We conclude in Section 6 with some ideas for future work.
2
Background
Most systems that classify wear particles are based on the specification of particular shape factors of particles, i.e., quantities that characterize the morphological properties of particles. This procedure is based on the assumption that particles formed during a particular process of wear can be characterized by certain values of shape factors. Although this assumption is probably correct, a set of shape factors and sets of corresponding values that allow correct assignment of the wear type and its severity for each particle has not been specified yet. The most frequent techniques for the identification and correct diagnosis of wear particles, apart from the laboratory approach of identifying individual elements using a microscope, are neural networks [3, 4] and other methods of artificial intelligence. These methods use particle classifiers and expert systems which learn how to recognize particles based on their morphological parameters or other patterns occurring in the image. The particles can also be analyzed using the Fourier transform [4]; in this method, the amplitudes of various harmonics can be viewed as shape descriptors. However, jagged edges of particles can cause problems. Other methods include the exploitation of fractal mathematics [5] or fuzzy logic [8]. Similarly to neural networks facilitating the analysis of shape factors, the above methods have to deal with the problem of selecting a suitable set of shape factors. In spite of the large number of different shape factors that have been examined for classification, satisfactory results have not been achieved yet [4, 9, 10]. Due to these facts, supervised machine learning was used for the classification of wear particles. The aim of the proposed work is to establish an alternative classification method that uses images of particles. Hence, the database that serves as the base for the learning process is populated directly with images of particles instead of a set of shape factors, and a classifier distinguishing the particle classes is trained. This paper describes the results of initial experiments.
Fig. 1. Classes of particles with typical representatives
3
Description of Data
In the experiments we used a database of 2719 sample images of particles divided into four classes by an expert. The classes and their representatives are shown in Fig. 1. Each class corresponds to a type of wear by which the particles were generated. The main source of sample images was the SpectroLNF system, which produces samples automatically. The second source was ferrogram images, which were manually segmented and rescaled to match the resolution of the SpectroLNF. The resolution of all particles was 2.4μm/pixel. The classes used in this work were the following.
– Cutting (CU) – Cutting wear particles are abnormal. They are generated as a result of one surface penetrating another. Their presence and quantity should be carefully monitored. If a system shows an increased quantity of large (50μm long) cutting wear particles, a component failure is potentially imminent.
– Fatigue (FA) – This class contains particles that are formed as an effect of repeated passes through the system, which results in plastic deformation of the particles. They have a smooth surface and an irregularly shaped circumference. Their occurrence is usually accompanied by an occurrence of spherical particles.
– Sliding (SL) – Sliding particles have an oblong shape and an irregular edge. They are smaller than fatigue particles but still larger than 15μm.
– Sphere (SP) – Spherical particles can be generated if there is insufficient lubrication or a depletion of extreme pressure additives in high load or high stress conditions. Spheres are also produced by fatigue of rolling element bearings.
3.1 Normalization
The range of sizes of particles in the source images is very broad (see Fig. 2). Most particles are from 10 to 20 pixels wide, but much larger particles also exist. Such diverse samples are not suitable for machine learning based on visual similarity, and thus we propose a normalization process that transforms all particles to sample images of constant size (e.g. 24 × 24 pixels, shown in Fig. 2(c)) prior to machine learning.
Fig. 2. a) Source photo from microscope, b) segmented image and c) extracted and normalized particle samples
Fig. 3. Image normalization: (a) source image with particle bounding box and major and minor axes, (b) image normalized to size w without rotation, (c) rotation of (a) with new bounding box, and (d) image normalized with rotation.
Linear scaling would result in a large number of small particles covering very few pixels and a few large particles covering the whole area of the sample image. The requirements for the normalization are the following: first, all particles must fit into a sample image of constant size, and second, information about particle size must not be discarded completely, as it could be relevant for classification. Therefore we compress the dynamic range of particle sizes in a non-linear fashion. In the proposed normalization scheme, we take the length l of the longer side of the particle's bounding box and transform it to the new length l′ by (1), where c is a coverage factor, w the size of the sample image and α the normalization factor. The result of the normalization is shown in Fig. 3(a,b).
$$l' = cw\left(1 - e^{-\alpha l}\right) \qquad (1)$$
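A minimal sketch of the size normalization of Eq. (1) follows; it is our own illustration, and the default values w = 24, c = 0.8 and α = 0.2 are simply settings mentioned or tested in the paper, used here as examples.

```python
import numpy as np

def normalised_length(l, w=24, c=0.8, alpha=0.2):
    """Non-linear size normalisation l' = c * w * (1 - exp(-alpha * l)) of
    the longer bounding-box side, as in Eq. (1)."""
    return c * w * (1.0 - np.exp(-alpha * l))

def particle_scale(l, w=24, c=0.8, alpha=0.2):
    """Scale factor to resample a particle so that its longer side becomes
    the normalised length inside a w x w sample image."""
    return normalised_length(l, w, c, alpha) / float(l)
```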
Fig. 4. Dependence of particle size in sample on real particle size for normalization factor α in range from 0.01 to 0.2. The sample size in this case was w = 24px.
Fig. 5. Different settings of α. Size differences between particles are more distinct at lower values of α; higher values result in more uniform sizes.
Another factor that can be subject to normalization is the particle rotation in the sample image. The class of the particle is independent of the rotation, but the rotation strongly increases the visual variability of samples within a class. Therefore, we remove the rotation prior to scale normalization by aligning the major axis of the particle with an axis of the sample coordinate system (see Fig. 3(c,d)). The major axis is found by PCA.
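The PCA step can be sketched as follows (our own illustration; the function name and the choice of foreground-pixel coordinates as the PCA input are assumptions). Rotating the particle crop by the negative of the returned angle aligns its major axis with the x-axis of the sample.

```python
import numpy as np

def major_axis_angle(mask):
    """Orientation (radians) of the major axis of a binary particle mask,
    from the eigenvectors of the covariance matrix of the foreground
    pixel coordinates (the PCA alignment described above)."""
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys], axis=0).astype(float)
    coords -= coords.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(coords))
    major = eigvecs[:, np.argmax(eigvals)]   # eigenvector of the largest eigenvalue
    return float(np.arctan2(major[1], major[0]))
```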
4
Classification
Normalized images of particles were used as the training set for machine learning. We use Real AdaBoost [11, 12], which produces a strong classifier as a composition of weak classifiers, or hypotheses. It is a greedy algorithm that in each iteration selects the weak classifier which performs best on the current training set and adds it to the strong classifier. In the context of pattern recognition in images, simple image features are often used. Viola and Jones [13] used Haar-like features [14] and a cascade of classifiers to build a very fast and accurate classifier for the detection of faces. Other features that are often used are LBP [15], Histograms of Oriented Gradients (HOG) [16] and Local Rank Functions [17].
Fig. 6. CS-LBP feature sampling and evaluation
4.1
Image Features
In our work, we use a center-symmetric modification of Local Binary Patterns [18] as low-level image features. This feature samples the local neighborhood and compares each sample with the value of the opposite sample (see Fig. 6); thus, for each pair of samples a binary value is generated. The samples are taken by convolution with a rectangular kernel. Contrary to standard LBP features, the output of CS-LBP is not an 8-bit code but only a 4-bit code. The response of the feature is calculated by (2), where v is the vector of eight samples obtained from the image and δt the comparison function (3), which is 0 when the two samples have similar values (difference lower than the threshold t), and 1 otherwise. In our experiments we fixed the threshold to t = 32. CS-LBP(v, t) =
$\displaystyle\sum_{i=1}^{4} \delta_t(v_i, v_{i+4}) \, 2^i \qquad (2)$
δt (a, b) = |a − b| > t 4.2
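A small sketch of the CS-LBP response for a single pixel is given below; it is our own illustration. The bit weighting here is the usual power-of-two packing of the four pair comparisons, which only relabels the 16 possible codes and does not affect the classifier.

```python
import numpy as np

def cs_lbp_code(samples, t=32):
    """CS-LBP response for one pixel from its eight neighbourhood samples
    v1..v8: each opposite pair (v_i, v_{i+4}) contributes one bit, set
    when |v_i - v_{i+4}| exceeds the threshold t (cf. Eqs. (2)-(3))."""
    v = np.asarray(samples, dtype=float)
    code = 0
    for i in range(4):
        if abs(v[i] - v[i + 4]) > t:
            code |= 1 << i
    return code            # a 4-bit value in [0, 15]
```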
(3)
Machine Learning
The AdaBoost algorithm and its modifications are widely used in object detection with classifiers [13, 16, 19–22] where it achieves high accuracy of detection while keeping false alarm rate on low level. We use Real AdaBoost to train binary classifier distinguishing between one class of particles and the rest. Thus for recognition of all four classes we need four classifiers. The input for the learning is a set of normalized samples with their classes and set of weak hypotheses. We use space partitioning weak hypotheses [23] based on CS-LBP features. The output is classifier (4) which for given sample X evaluates responses ht (X) of all weak hypotheses and sums them. The decision is positive when the sum exceeds a threshold, otherwise it is negative. H(X) =
T
ht (X)
(4)
t=1
For the experiments, we set the threshold such that the false negative rate corresponds to the false positive rate. However, for particular application it could be adjusted to any point on a ROC curve.
Analysis of Wear Debris through Classification
5
279
Experiments and Results
This section summarizes experiments carried on with the data and discusses influence of normalization parameters on the results. In the most of the published works on the performance of analytical ferrography, the results are not numerically evaluated (ferrography, so far, is mostly based on subjective evaluation). The error rate in the range of 10-15 percent can be considered sufficient and the error rate below 10 percent as very good. The results described here cover only small part of experiments that were actually carried out. Supplementary materials along with the data and software used for image normalization can be found on-line1 . 5.1
Experimental Setup
We experimented with training of binary classifiers distinguishing between one category of particles and the rest. We tested different normalization parameters. The setting of the experiments was following. – The coverage factor c was set constantly to 0.8 to fit all particles to the sample and keeping some border. – Sizes of sample images w were 16, 24, 32 and 48 pixels. – Normalization factors α were 0.02, 0.05, 0.1, 0.2 and 0.5 to test influence of absolute size of particle to the classification accuracy. – Data were processed both with and without rotation normalization to test whether the rotation has influence on the accuracy. For each setting, dataset from original data was generated and used for machine learning with AdaBoost algorithm. In each test, classifiers for every of four classes were trained and evaluated (ROC curve and Equal Error Rate measure). The data were randomly divided to independent training and testing set as shown in Table 1. Table 1. Amounts of samples for training and testing
CU FA SL SP
5.2
        Training   Testing
  CU       400        335
  FA       400        507
  SL       400        389
  SP       100        166
Results
The classification accuracy depends on many parameters, namely normalization factor α, size of samples and using of PCA. 1
FIT BUT server http://medusa.fit.vutbr.cz/particles
280
R. Jur´ anek, S. Machal´ık, and P. Zemˇc´ık
Fig. 7. Influence of normalization factor α on classification accuracy of the Cutting class, w = 24 and PCA was used
0.05
0.1
0.15 0.2 0.25 False negative rate
Fig. 8. ROC curves of classifiers for different sizes of samples for all classes, α = 0.2 and PCA normalization
The α influences relative size of particles in sample images. We tested several settings of the parameter to discover whether the task is dependent on the size of particles. Fig. 7 shows ROC curve for class Cutting with the α set to different values (for other classes the results are similar).
0.05
0.1
0.25 0.2 0.15 False negative rate
0.3
0.35
Fig. 9. Comparison of normalization without PCA (left) and with PCA (right) for all classes, w = 32, α = 0.2 Table 2. Equal Error Rate measure (in percent) for normalization without and with PCA for all classes and normalization factors, w = 32. Best results are marked with bold font. % Cutting Fatigue Sliding Sphere
α = 0.02 α = 0.05 α = 0.1 16.3/13.8 12.2/12.6 14.4/13.1 16.0/15.8
8.3/5.6 12.0/11.0 12.0/10.0 10.3/11.4
6.5/5.6 10.2/9.4 10.5/8.7 8.4/8.7
α = 0.2
α = 0.5
5.4/4.7 11.2/11.0 12.8/11.0 7.8/6.0
This experiment shows that the accuracy increases with α. The accuracy is highest for α = 0.2, when small particles (smaller than 20 pixels, see Fig. 4) keep their relative size and larger particles are scaled down to fit the sample size. A large sample size means more information to extract and thus potentially more accuracy to gain. Fig. 8 shows the ROC curves of all classes for different sample sizes. Another tested parameter is the influence of rotation normalization on the accuracy. Fig. 9 shows a comparison of the accuracy on data without and with rotation normalization; it is clear that the normalization of rotation helps and increases accuracy. To summarize the results, a reasonably high α is better, which means partial independence of the absolute particle size. Bigger samples are better, but not too large, as the data could be interpolated and particle edges could be distorted. The rotation normalization increases accuracy. The classification accuracy for w = 32 is summarized in Table 2, which displays the classification error for the equal-error-rate classifier setting. It should be noted that for different classes a different normalization should be used in the target application to achieve the best performance. Additionally, the ratio of false alarms and correct classifications can be tuned by setting the classification threshold.
6
Conclusions
The presented paper proposed a novel method of wear debris analysis through supervised machine learning, more specifically through visual similarity evaluation using an AdaBoost classifier. The proposed approach is based on locating the particles in the image of wear debris, followed by normalization of the size and orientation of the particles and classification into four basic classes. It has been shown that this approach leads to a better understanding of the content of the debris than more traditional methods based on the size of the particles and simple shape evaluation. The proposed method is, therefore, a good basis for the interpretation of engineering equipment wear and the early detection and prevention of failures. Additionally, it has been demonstrated that the proposed method of size normalization has a significant effect on the performance of classification of wear debris particles and that the normalization of orientation has a positive effect as well. Future work includes further improvements in class definition and classification, interpretation of class overlaps, and evaluation of the efficiency and precision of the whole processing chain. Acknowledgements. This research was supported by the research project MSM 0021627505 “Transport systems theory” and MSM 0021630528 “Security oriented research in information technologies” from the Czech Ministry of Youth and Sports and the BUT FIT grant FIT-10-S-2.
References 1. Maci´ an, V., Payri, R., Tormos, B., Montoro, L.: Applying analytical ferrography as a technique to detect failures in diesel engine fuel injection systems. Wear 260(4-5), 562–566 (2006) 2. Roylance, B.J.: Ferrography–then and now. Tribology International 38(10), 857– 862 (2005); Ferrography and Friends - Pioneering Developments in Wear Debris Analysis 3. Xu, K., Luxmoore, A.R., Jones, L.M., Deravi, F.: Integration of neural networks and expert systems for microscopic wear particle analysis. Knowledge-Based Systems 11(3-4), 213–227 (1998) 4. Raadnui, S.: Wear particle analysis–utilization of quantitative computer image analysis: A review. Tribology International 38(10), 871–878 (2005); Ferrography and Friends - Pioneering Developments in Wear Debris Analysis 5. Thomas, A.D.H., Davies, T., Luxmoore, A.R.: Computer image analysis for identification of wear particles. Wear 142(2), 213–226 (1991) 6. Peng, Z., Kirk, T.B.: Wear particle classification in a fuzzy grey system. Wear, 225-229(Part 2): 1238–1247 (1999) 7. Stachowiak, G.P., Stachowiak, G.W., Podsiadlo, P.: Automated classification of wear particles based on their surface texture and shape features. Tribology International 41(1), 34–43 (2008) 8. Umeda, A., Sugimura, J., Yamamoto, Y.: Characterization of wear particles and their relations with sliding conditions. Wear 216(2), 220–228 (1998)
9. Machal´ık, S.: Recognition of objects in ferrography using image analysis. PhD thesis, Brno University of Technology (2007) 10. Peng, Z.: An integrated intelligence system for wear debris analysis. Wear 252(910), 730–743 (2002) 11. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997) 12. Freund, Y., Schapire, R.: A short introduction to boosting. Japonese Society for Artificial Intelligence 14(5), 771–780 (1999) 13. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 1:511–I–518 (2001) 14. Papageorgiou, C.P., Oren, M., Poggio, T.: A general framework for object detection. In: ICCV 1998: Proceedings of the Sixth International Conference on Computer Vision, page. 555. IEEE Computer Society, Washington, DC (1998) 15. Zhang, L., Chu, R., Xiang, S., Liao, S., Li, S.Z.: Face detection based on multi-block lbp representation. In: ICB, pp. 11–18 (2007) 16. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR 2005: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 886–893. IEEE Computer Society, Washington, DC (2005) ˇ adn´ık, M.: 17. Herout, A., Zemˇc´ık, P., Hradiˇs, M., Jur´ anek, R., Havel, J., Joˇsth, R., Z´ Low-Level Image Features for Real-Time Object Detection, page. 25. IN-TECH Education and Publishing (2009) 18. Heikkil¨ a, M., Pietik¨ ainen, M., Schmid, C.: Description of interest regions with local binary patterns. Pattern Recogn. 42, 425–436 (2009) 19. Sochman, J., Matas, J.: Waldboost - learning for time constrained sequential detection. In: CVPR 2005: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 150–156. IEEE Computer Society, Washington, DC (2005) 20. Zhu, Q., Yeh, M.-C., Cheng, K.-T., Avidan, S.: Fast human detection using a cascade of histograms of oriented gradients. In: CVPR 2006: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1491–1498. IEEE Computer Society, Washington, DC (2006) 21. Li, S.Z., Zhu, L., Zhang, Z., Blake, A., Zhang, H., Shum, H.Y.: Statistical Learning of Multi-view Face Detection. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 67–81. Springer, Heidelberg (2002) 22. Li, S., Zhang, Z., Shum, H., Zhang, H.: Floatboost learning for classification. In: The Conference on Advances in Neural Information Processing Systems, NIPS (2002) 23. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37(3), 297–336 (1999)
Fourier Fractal Descriptors for Colored Texture Analysis João B. Florindo and Odemir M. Bruno Instituto de Física de São Carlos (IFSC), Universidade de São Paulo (USP), Avenida do Trabalhador São-carlense, 400, 13560-970 São Carlos SP, Brazil [email protected], [email protected]
Abstract. This work proposes a texture descriptor based on the Fourier fractal dimension applied to the analysis of colored textures. The technique consists of transforming the color space of the texture through a colorimetric approach, followed by the extraction of fractal descriptors from each transformed color channel. The fractal descriptors are obtained from the Fourier fractal dimension of the texture and are finally concatenated, composing the complete descriptor. The performance of the proposed technique is compared to the classical chromaticity moments and the state-of-the-art multispectral Gabor filters in the classification of samples from the VisTex dataset. Keywords: fractal descriptors, color texture, fractal dimension, Fourier transform.
1 Introduction
Texture analysis is an extensively studied field in computer vision, applied to a wide variety of problems involving the description, classification and segmentation of images. Essentially, texture analysis may be categorized into four approaches: structural, spectral, statistical and modeling-based [12]. The last category includes texture analysis based on fractal theory, which is the focus of this work. Fractal theory is an interesting tool for modeling objects from the real world, as attested from the very beginning of the formalization of the theory [9]. In particular, the fractal dimension is a valuable measure to describe a natural object, even if it cannot properly be considered a fractal object [4]. Following this idea, many methods based on fractal theory were developed for the analysis of textures extracted from real scenes [17][21][13]. Most of these methods use the fractal dimension, directly or indirectly, for the characterization of the texture. [11] extended the use of fractal dimension methods to the concept of fractal descriptors. In general, fractal descriptors are a set of features extracted from an object, represented in an image through its shape, contour, texture,
etc. These descriptors are calculated from a measure developed in fractal theory, generally the fractal dimension. Examples of interesting applications of fractal descriptors may be found in [11][16][2], among others. This work proposes a novel method for the extraction of fractal descriptors from colored textures. The technique consists of a physical transform of the color space, described in [6], followed by the use of the Fourier fractal dimension to provide the descriptors. The experiments are carried out on the well-known VisTex texture dataset. The performance of the proposed method is compared to classical approaches for the characterization of colored textures, namely chromaticity moments and multispectral Gabor filters. This work is divided into 7 sections. The following section introduces fractal geometry and defines the fractal dimension and, more specifically, the Fourier fractal dimension. The third section addresses fractal descriptors. The fourth describes the proposed method. The fifth section describes the experiments and the sixth presents their results. The last section concludes the work.
2 Fractal Geometry
Fractal geometry is the area of mathematics which deals with fractal objects. These are objects which do not obey the rules of classical Euclidean geometry. Mandelbrot [9] defined them as geometrical sets whose Hausdorff-Besicovitch dimension strictly exceeds the topological dimension. In practice, fractals are objects presenting infinite complexity and self-similarity, that is, one may observe structures which repeat themselves at different (theoretically infinite) observation scales. An interesting point is that the repetition of structural patterns at different scales is a recurrent characteristic of many objects or scenes from the real world, like a cauliflower, the alveoli in the lung, the stream of a river, etc. Mandelbrot [9] observed that this fact suggests the use of fractals to model objects in nature which could not be well handled by classical Euclidean tools.

2.1 Fractal Dimension
In computational solutions for the analysis of natural objects, the use of methods which extract features from the object is fundamental. In the fractal modeling approach, a very important feature to be calculated is the fractal dimension of the object. In fact, [4] demonstrated that the fractal dimension can be calculated even from a non-fractal object. Over the years, many mathematical definitions were developed for the fractal dimension. Among these definitions we can cite the Hausdorff-Besicovitch dimension, the packing dimension, the Renyi dimension, etc. [5]. A common expression for the fractal dimension D of a set S may be provided through

D = \lim_{\delta \to 0} \frac{\log(M_\delta(S))}{\log(\delta)},    (1)
where M is a set measure defined according to the specific fractal dimension approach and δ is the scale parameter.

2.2 Fourier Fractal Dimension
In this work, we focus specifically on one definition of fractal dimension: the Fourier fractal dimension. The interest in this method here is due to some interesting characteristics, like its ease of implementation, its good computational performance and its invariance to geometrical transforms. Besides, the Fourier method allows better control over noise present in the analyzed object. Basically, the Fourier fractal dimension is calculated from an exponent relation between the power spectrum of the Fourier transform of the image of the analyzed object and the frequency variable [18]. Thus, the measure M in Equation 1 is the power spectrum P, while the frequency f plays the role of δ. The exponential relation is summarized by

P \propto f^{-\alpha},    (2)

where α is the exponentiation parameter, related to the Hurst coefficient [18]. The fractal dimension is related to α by

D = \frac{\alpha + 6}{2}.    (3)
In practice, for computational purposes, the coefficient α is the slope of the straight line fitted to the curve log(P) × log(f). Figure 1 shows the curves P × f and log(P) × log(f) for a specific image representing a natural texture.
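As a minimal illustration of this estimation (a sketch, not the authors' implementation; the radial binning of the spectrum and the fitted frequency range are assumptions), the slope α can be obtained from a radially averaged power spectrum as follows:

import numpy as np

def fourier_fractal_dimension(image):
    """Estimate the Fourier fractal dimension of a 2-D gray-level texture.

    The power spectrum is radially averaged and a line is fitted to
    log(P) versus log(f); the slope gives alpha and D = (alpha + 6) / 2.
    """
    img = np.asarray(image, dtype=float)
    # Power spectrum of the centered 2-D Fourier transform
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2

    # Radial frequency of every spectral coefficient
    h, w = spectrum.shape
    cy, cx = h // 2, w // 2
    y, x = np.indices((h, w))
    radius = np.sqrt((y - cy) ** 2 + (x - cx) ** 2)

    # Radially averaged power spectrum P(f) inside the inscribed disk,
    # ignoring the DC component
    max_f = min(cy, cx)
    freqs = np.arange(1, max_f)
    power = np.array([spectrum[(radius >= f) & (radius < f + 1)].mean()
                      for f in freqs])

    # Slope of the straight line fitted to log(P) x log(f)
    slope, _ = np.polyfit(np.log(freqs), np.log(power), 1)
    alpha = -slope                      # P ~ f^(-alpha), Eq. (2)
    return (alpha + 6.0) / 2.0          # Eq. (3)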
Fig. 1. Illustration of the Fourier fractal dimension method. (a) Texture whose dimension must be estimated. (b) Curve power spectrum (P ) × frequency (f ). (c) Curve in log-log scale of P × f .
3 Fractal Descriptors
With the aim of obtaining a richer descriptor for objects represented in digital images and, consequently, improving the performance in description problems, [11] extended the use of the fractal dimension to the concept of fractal descriptors.
Fractal descriptors are a set of features extracted from an object based on the fractal modeling of that object. Usually, these descriptors are extracted from the fractal dimension. In [11], the authors use the Minkowski sausage method, a computational implementation of the Bouligand-Minkowski fractal dimension. The sausage method consists of dilating the object by circles of radius r. For each radius, the number of pixels of the dilated object is called the dilation area A(r). Varying the value of r, the fractal dimension is obtained from the slope of a straight line fitted to the curve log(A(r)) × log(r). Instead of using only the fractal dimension, [11] extracts measures from the curve log(A(r)) × log(r), like the peaks, the dispersion and the area under the graph of the curve. These measures compose the fractal descriptors. In [11] these descriptors provided interesting results in the analysis of gene expression patterns in embryonic development. [16] and [2] use the whole log(A(r)) × log(r) curve, after the application of the Fourier derivative to this curve. The authors obtained excellent results in the classification of Brazilian plant species, based on the contour and nervures of the leaves. In turn, [1] used a three-dimensional version of the sausage method to provide fractal descriptors for texture images. This method was applied to the classification of leaves based on their internal texture, achieving good results.
4 Proposed Method
This work proposes a novel method for extracting fractal descriptors from colored textures. Indeed, it is well known that color is a very useful characteristic in image analysis [10], especially in the description of textures extracted from natural scenes. As a consequence, many methods were developed considering the color parameter in texture analysis [3][15][8]. However, most of these methods do not take into account the spatial structure of the image when analyzing the color characteristic. Works like [7] and [20] proposed novel strategies in order to overcome such limitations. In particular, [7] developed an interesting model based on physical aspects of color, associated with the spatial structure of the pixels in the image. In practice, the model is obtained by the application of a linear transform to the color channels of the original image. In the case of RGB (Red-Green-Blue color space) images, as used in this work, the transform may be represented through

\begin{pmatrix} \tilde{E}_\lambda \\ \tilde{E}_{\lambda\lambda} \\ \tilde{E}_{\lambda\lambda\lambda} \end{pmatrix} = \begin{pmatrix} 0.06 & 0.63 & 0.31 \\ 0.19 & 0.18 & -0.37 \\ 0.22 & -0.44 & 0.06 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix},    (4)
where R, G and B are the original color channels and \tilde{E}_\lambda, \tilde{E}_{\lambda\lambda} and \tilde{E}_{\lambda\lambda\lambda} are the transformed channels. This approach presented interesting results when associated with Gabor filters in [7]. This work proposes the use of the values of the curve log(P) described in Section 2.2 as the fractal descriptors for the colored texture. These descriptors are extracted from each channel of the above transform and concatenated into a single feature vector, which is used to describe the texture image. Figure 2 illustrates the discrimination capability of the proposed descriptors.

Fig. 2. Discrimination of colored textures with the Fourier fractal descriptors. (a) The original texture images. (b) The curve of the descriptors for each sample (the descriptors are dimensionless data). The blue line represents the samples from class 1, the red lines the samples from class 2 and the green lines the samples from class 3. Notice the visual discrimination among the curves in each class.
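A compact sketch of the whole descriptor pipeline is given below. It reuses the radially averaged spectrum idea of Section 2.2; the number of retained frequencies and the use of the inscribed-disk radius are assumptions, not choices reported by the authors.

import numpy as np

# Transform of Eq. (4): rows map (R, G, B) to the three spectral channels.
COLOR_TRANSFORM = np.array([[0.06,  0.63,  0.31],
                            [0.19,  0.18, -0.37],
                            [0.22, -0.44,  0.06]])

def log_power_descriptors(channel, n_freq=64):
    """Values of log(P) over a radially averaged power spectrum (Sec. 2.2)."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(channel))) ** 2
    h, w = spectrum.shape
    n_freq = min(n_freq, min(h, w) // 2 - 1)
    y, x = np.indices((h, w))
    radius = np.sqrt((y - h // 2) ** 2 + (x - w // 2) ** 2)
    power = [spectrum[(radius >= f) & (radius < f + 1)].mean()
             for f in range(1, n_freq + 1)]
    return np.log(np.array(power))

def fourier_fractal_color_descriptors(rgb_image, n_freq=64):
    """Concatenated Fourier fractal descriptors of a colored texture."""
    rgb = np.asarray(rgb_image, dtype=float)        # shape (H, W, 3)
    channels = rgb @ COLOR_TRANSFORM.T              # Eq. (4) applied per pixel
    return np.concatenate([log_power_descriptors(channels[..., c], n_freq)
                           for c in range(3)])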
5 Experiments
In order to verify the performance of the proposed descriptor technique, an experiment was carried out in which samples from the classical texture dataset VisTex [14] are classified using the proposed descriptors. Considering the heterogeneity of the VisTex dataset, we opted for the classes with the most significant number of samples in the dataset. Thus, we used the classes employed in [19], that is, Bark, Fabric, Food, Metal, Sand, Tile and Water. We used the samples with size 128 × 128, and from each sample 4 windows of size 64 × 64 were generated. Figure 3 shows a sample from each class used in the experiments. The Fourier fractal descriptors were then extracted from each sample, composing the feature matrix used in the classification process. For the sake of comparison, we also extracted features using a classical and a state-of-the-art method for colored texture analysis, that is, chromaticity moments [15] and multispectral Gabor filters [7]. For the classification, we used the KNN (K-Nearest Neighbors) method with K = 1, the parameter value which gave the best classification results.
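For reference, a minimal 1-NN classification of such descriptor vectors could look like the sketch below (the Euclidean metric and the evaluation helper are assumptions; the paper does not detail its KNN implementation):

import numpy as np

def knn1_predict(train_X, train_y, test_X):
    """Classify each test vector with its single nearest training neighbor."""
    predictions = []
    for x in test_X:
        distances = np.linalg.norm(train_X - x, axis=1)   # Euclidean distance
        predictions.append(train_y[np.argmin(distances)])
    return np.array(predictions)

def correctness_rate(pred, truth):
    """Percentage of test samples whose predicted class matches the truth."""
    return 100.0 * np.mean(pred == truth)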
Fig. 3. Samples from each VisTex class used in the experiments. (a) Bark. (b) Fabric. (c) Food. (d) Metal. (e) Sand. (f) Tile. (g) Water.
6 Results
The following tables show the performance of the compared techniques on the VisTex dataset. Table 1 shows the global correctness rate for each method, that is, the percentage of samples classified correctly by each technique.

Table 1. Comparison between the correctness rate of the proposed Fourier fractal descriptors and the classical chromaticity moments and Gabor filters. The best result is underlined.

Method                  Correctness rate (%)
Chromaticity moments    68.8312
Gabor filter            92.5325
Proposed method         95.1299
We observe in Table 1 that the proposed method presented the highest correctness rate. The advantage is larger relative to the chromaticity method, which is explained by the simplicity of the chromaticity technique. In turn, Gabor presented a better performance; in fact, the Gabor method is a sophisticated state-of-the-art technique which presents excellent results in the literature. Tables 2, 3 and 4 present the confusion matrix corresponding to each method in Table 1. In these matrices, the real class of the sample is given in the row and the class obtained by the classification method in the column. The confusion matrices demonstrate that the proposed method presented the best performance in each class. In fact, it correctly classifies all the samples
Table 2. Confusion matrix for the classification of VisTex dataset using the chromaticity moment descriptors

25   5   2   3   4  13   0
10  61   3   3   1   2   0
 1   1  43   0   1   2   0
 1   5   0  16   0   0   2
 1   2   2   0  20   3   0
10   3   5   0   4  22   0
 2   1   0   4   0   0  25
Table 3. Confusion matrix for the classification of VisTex dataset using the multispectral Gabor descriptors

45   3   0   1   2   1   0
 1  77   0   0   0   2   0
 0   0  48   0   0   0   0
 0   1   0  23   0   0   0
 0   0   0   0  28   0   0
 6   2   0   2   1  33   0
 0   1   0   0   0   0  31
Table 4. Confusion matrix for the classification of VisTex dataset using the proposed Fourier fractal descriptors

50   1   0   0   0   1   0
 1  74   1   0   1   1   2
 0   0  48   0   0   0   0
 0   0   0  24   0   0   0
 0   0   0   0  28   0   0
 1   1   0   0   0  42   0
 2   3   0   0   0   0  27
in classes 3, 4 and 5 (corresponding to rows 3, 4 and 5). In addition, it had a significantly better performance in class 1, hitting 5 more samples than the Gabor method. As expected from the global results in Table 1, the worst results in all classes were obtained by the chromaticity method.
7 Conclusion
This work studied the development of a fractal descriptor based on the Fourier fractal dimension, applied to the description of colored textures. The proposed method was compared to other classical methods for colored texture analysis in the classification of samples from the VisTex dataset. The methods used in the comparison were the classical chromaticity moments and the state-of-the-art multispectral Gabor filters. The proposed method presented the best performance in the experiments, achieving the highest correctness rate in all the classes of the dataset.
In this way, the results demonstrated that Fourier fractal descriptors are a powerful technique for colored texture analysis. The results suggest the use of the proposed descriptors in applications involving texture discrimination in colored images, like classification, description for mathematical modeling, segmentation, etc. We are also encouraged to carry out further and deeper studies on this novel technique, for instance by analyzing the use of other classification methods and other texture datasets.

Acknowledgments. Joao B. Florindo gratefully acknowledges the financial support of CNPq (National Council for Scientific and Technological Development, Brazil) (Grant #306628/2007-4). Odemir M. Bruno gratefully acknowledges the financial support of CNPq (National Council for Scientific and Technological Development, Brazil) (Grants #308449/2010-0 and #473893/2010-0) and FAPESP (The State of São Paulo Research Foundation) (Grant #2011/01523-1).
References 1. Backes, A.R., Casanova, D., Bruno, O.M.: Plant leaf identification based on volumetric fractal dimension. International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI) 23(6), 1145–1160 (2009) 2. Bruno, O.M., de Oliveira Plotze, R., Falvo, M., de Castro, M.: Fractal dimension applied to plant identification. Information Sciences 178(12), 2722–2733 (2008) 3. Caelli, T., Reye, D.: On the classification of image regions by colour, texture and shape. Pattern Recognition 26(4), 461–470 (1993) 4. Carlin, M.: Measuring the complexity of non-fractal shapes by a fractal method. Pattern Recognition Letters 21(11), 1013–1017 (2000) 5. Falconer, K.J.: The Geometry of Fractal Sets. Cambridge University Press, New York (1986) 6. Geusebroek, J.-M., van den Boomgaard, R., Smeulders, A.W.M., Dev, A.: Color and scale: The spatial structure of color images. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1842, pp. 331–341. Springer, Heidelberg (2000) 7. Hoang, M.A., Geusebroek, J.M., Smeulders, A.W.: Color texture measurement and segmentation. Signal Processing 85(2), 265–275 (2005) 8. Jain, A., Healey, G.: A Multiscale Representation Including Opponent Color Features for Texture Recognition. IEEE Transactions on Image Processing 7(1), 124– 128 (1998) 9. Mandelbrot, B.B.: The Fractal Geometry of Nature. Freeman, NY (1968) 10. Manjunath, B., Ohm, J., Vasudevan, V., Yamada, A.: Color and Texture Descriptors. IEEE Transactions on Circuits and Systems for video Technology 11(6), 703– 715 (2001) 11. Manoel, E.T.M., da Fontoura Costa, L., Streicher, J., M¨ uller, G.B.: Multiscale fractal characterization of three-dimensional gene expression data. In: SIBGRAPI, pp. 269–274. IEEE Computer Society, Los Alamitos (2002) 12. Materka, A., Strzelecki, M., Analysis, T., Review, M.A., Materka, A., Strzelecki, M.: Texture analysis methods - a review. Tech. rep., Institute of Electronics, Technical University of Lodz (1998) 13. Millan, H., Gonzalez-Posada, M.: Modelling Soil Water Retention Scaling. Comparison of a Classical Fractal Model with a Piecewise Approach. GEODERMA 125(12), 25–38 (2005)
14. MIT: Mit vistex texture database, http://vismod.media.mit.edu/vismod/imagery/VisionTexture/vistex.html 15. Paschos, G.: Chromatic correlation features for texture recognition. Pattern Recognition Letters 19(8), 643–650 (1998) 16. Plotze, R.O., Padua, J.G., Falvo, M., Vieira, M.L.C., Oliveira, G.C.X., Bruno, O.M.: Leaf shape analysis by the multiscale minkowski fractal dimension, a new morphometric method: a study in passiflora l (passifloraceae). Canadian Journal of Botany-Revue Canadienne de Botanique 83(3), 287–301 (2005) 17. Quevedo, R., Mendoza, F., Aguilera, J.M., Chanona, J., Gutierrez-Lopez, G.: Determination of Senescent Spotting in Banana (Musa cavendish) Using Fractal Texture Fourier Image. Journal of Food Engineering 84(4), 509–515 (2008) 18. Russ, J.C.: Fractal Surfaces. Plenum Press, New York (1994) 19. Singh, S., Sharma, M.: Texture analysis experiments with meastex and vistex benchmarks. In: Singh, S., Murshed, N., Kropatsch, W.G. (eds.) ICAPR 2001. LNCS, vol. 2013, pp. 417–424. Springer, Heidelberg (2001) 20. Thai, B., Healey, G.: Modeling and Classifying Symmetries Using a Multiscale Opponent Color Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11), 1224–1235 (1998) 21. Tian-Gang, L., Wang, S., Zhao, N.: Fractal Research of Pathological Tissue Images. Computerized Medical Imaging and Graphics 31(8), 665–671 (2007)
Efficiency Optimization of Trainable Feature Extractors for a Consumer Platform Maurice Peemen, Bart Mesman, and Henk Corporaal Eindhoven University of Technology, The Netherlands [email protected]
Abstract. This paper proposes an algorithmic optimization for the feature extractors of biologically inspired Convolutional Neural Networks (CNNs). CNNs are successfully used for different visual pattern recognition applications such as OCR, face detection and object classification. These applications require complex networks exceeding 100,000 interconnected computational nodes. To reduce the computational complexity a modified algorithm is proposed; real benchmarks show a 65–83% reduction, with equal or even better recognition accuracy. Exploiting the available parallelism in CNNs is essential to reduce the computational scaling problems. Therefore the modified version of the algorithm is implemented and evaluated on a GPU platform to demonstrate its suitability for a cost-effective parallel platform. A speedup of 2.5x with respect to the standard algorithm is achieved. Keywords: Convolutional Neural Networks, Feature Extraction, GPU.
1 Introduction
Visual object recognition is a computationally demanding task that will be used in many future applications. An example is surveillance, for which multiple human faces have to be detected, recognized and tracked from a video stream. The classical approach to achieve visual object recognition using a computer is to split the task into two distinct steps [9]: feature extraction and classification. During the first step, the original input is typically preprocessed to extract only relevant information. Classification with only the relevant information makes the problem easier to solve and the result becomes invariant to external sources like light conditions that are not supposed to influence the classification. Classical approaches use matching algorithms for classification. These compute the difference between the feature vector and a stored pattern to distinguish different object classes. To construct a pattern which gives robust results is a very difficult task. Therefore, it is desirable to train a classifier for a certain task by using a labeled set of examples. Convolutional Neural Networks (CNNs) are fully trainable pattern recognition models that exploit the benefits of two step classification by using feature extraction [7]. CNN models are based on Artificial Neural Networks (ANNs) [4] but their network structure is inspired by the visual perception of the human
Fig. 1. An Example CNN architecture for a handwritten digit recognition task
brain. The network architecture of an example CNN is depicted in Fig. 1. The processing starts with feature extraction layers and is finished by fully connected ANN classification layers. Using these different layers delivers robust recognition accuracy that is invariant to small geometric transformations of the input images. This robust recognition accuracy is why CNNs are successfully used for classification tasks on real-world data [3][7][14]. It is a challenge to implement these CNNs for real-time recognition due to their large computational workload, especially on high-resolution images. Consider for example the results that are published for face recognition applications on programmable architectures (see Table 1). These results do not yet meet real-time requirements, and assume a relatively low resolution. To reach recognition results at 20 frames per second for 1280x720 HD video streams the processing speed must be improved considerably. The processing problem gets even worse when the implementation platform is a low-cost consumer platform as used in smart phones. So a more efficient algorithm is needed. The contribution of this work is a modified CNN architecture that reduces the computational workload and data transfer. Training rules for the modified architecture are derived and the recognition accuracy is evaluated with two real-world benchmarks. An Intel CPU and an Nvidia CUDA-enabled Graphics Processing Unit (GPU) are used to demonstrate the performance improvement of the modified feature extraction layers. The content of the paper is as follows. Section 2 contains an overview of the CNN model introduced in [6]. Section 3 describes the algorithmic optimization and derives training rules. In Section 4, the recognition accuracy is evaluated. Section 5 describes the mapping of the feature extractors and evaluates the speedup of the modification. Section 6 describes related work and in Section 7, the paper is summarized and concluded.

Table 1. Frame rate for a face recognition CNN on three programmable platforms

platform                              input pixels   frames per second
1.6 GHz Intel Pentium IV [3]          384x288        4.0
2.33 GHz 8-core Intel Xeon [1]        640x480        7.4
128-core GPU Nvidia Tesla C870 [1]    640x480        9.5
2 CNN Algorithm Overview
An example architecture of a CNN is shown in Fig. 1; this one is used for handwritten digit recognition [7]. The last two layers n1 and n2 function as an ANN classifier. The first layers of the network, C1 up to S2, function as a trainable feature extractor. These are ANN layers with specific constraints to extract position-invariant features from two-dimensional shapes. The different layers in this architecture can be described as follows:
1) Convolution Layers (CLs): The feature maps of CLs, such as C1 and C2 in Fig. 1, contain neurons that take their synaptic inputs from a local receptive field, thereby detecting local features. The weights of neurons in a feature map are shared, so the exact position of the local feature becomes less important, thereby yielding shift invariance. The schematic overview of a convolution neuron for a one-dimensional input is shown in Fig. 2(a). The schematic shows the names of the different variables: x for the input window, v for the shared trainable kernel weights and b for the trainable bias value. The inputs are used to compute a weighted sum with kernel size K; this is represented as the neuron potential p. To generate an output value y, the potential value is passed through an activation function φ(p). For a two-dimensional feature map the model is rewritten, and the neuron operation can be described by

y[m,n] = \phi(p) = \phi\left(b + \sum_{k=0}^{K-1} \sum_{l=0}^{K-1} v[k,l]\, x[m+k, n+l]\right),    (1)

where

\phi(x) = \frac{1}{1 + \exp(-x)}.    (2)
The sigmoid activation function of (2) is used in this work, but many other functions can be used [4]. The kernel operation is a two-dimensional convolution on the valid region of the input. This 2d-convolution is done multiple times with different kernels to generate multiple feature maps that are specialized to extract different features. Some of these feature maps, as in layer C2 of Fig. 1, are fed by multiple inputs; in this case multiple kernels are used, one for each input, and the results are summed.
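To make the CL computation of (1)–(2) concrete, a direct (unoptimized) sketch in Python/NumPy is given below; the array shapes are assumptions and the loop nest mirrors the equations rather than the authors' optimized C implementation.

import numpy as np

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))           # Eq. (2)

def convolution_layer(x, v, b):
    """Valid 2-D convolution feature map of Eq. (1).

    x : 2-D input map, v : K x K shared kernel, b : scalar bias.
    """
    K = v.shape[0]
    M = x.shape[0] - K + 1
    N = x.shape[1] - K + 1
    y = np.empty((M, N))
    for m in range(M):
        for n in range(N):
            p = b + np.sum(v * x[m:m + K, n:n + K])   # neuron potential
            y[m, n] = sigmoid(p)
    return y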
Fig. 2. Schematic models of different neuron types in a CNN. (a) Convolution Layer (CL) neuron. (b) Subsample Layer (SL) neuron. (c) Neuron Layer (NL) neuron.
2) Subsampling Layers (SLs): A CL is succeeded by an SL to carry out a data reduction operation on the CL output. The data reduction is done by local averaging over a predefined, non-overlapping window whose size is described by the subsample factor S. The result of local averaging is multiplied by a shared trainable coefficient u, and a shared bias coefficient is added to the result before it is passed through the activation function. The schematic model of a one-dimensional subsampling neuron is depicted in Fig. 2(b). The mathematical model of a two-dimensional feature map gives

y[m,n] = \phi(p) = \phi\left(b + u \sum_{k=0}^{S-1} \sum_{l=0}^{S-1} x[mS+k, nS+l]\right).    (3)
3) Neuron Layers (NLs): The output layers of a CNN, such as n1 and n2 in Fig. 1, contain classical neuron models or perceptrons [4]. The perceptron model depicted in Fig. 2(c) has a unique set of weights w and bias b for each neuron. With its unique set of weights each neuron can detect a different pattern; this is used to make the final classification. In most NLs the result of the preceding layer is used as a one-dimensional fully connected input. When K equals the number of neurons in the preceding layer, the expression to compute the NL output is given as

y[n] = \phi(p) = \phi\left(b[n] + \sum_{k=0}^{K-1} w[n,k]\, x[k]\right).    (4)
An important property of the CNN architecture is that all synaptic weights and bias values can be trained by cycling the simple and efficient stochastic mode of the error back-propagation algorithm through the training sample [7].
3 Algorithm Optimization
As shown in Section 1, the high computational complexity of CNNs restricts their application to high-performance computer architectures. To enable CNN applications on cheap consumer platforms, a reduction of the computational workload is very desirable. This reduction is achieved by a high-level modification that reduces the number of Multiply Accumulate (MACC) operations and the amount of data movement in the feature extractor. High-level modifications to an algorithm can have a huge impact on performance, but in most cases they involve a trade-off between recognition accuracy and computational complexity. Therefore changes in the algorithm must be analyzed carefully to verify that the classifier does not lose the learning and classification abilities it had before. To analyze this, new training rules are derived and two real-world benchmarks are used to validate the recognition performance.

3.1 Merge Convolution and Subsampling
The data dependencies between a CL and an SL, as depicted in Fig. 3, show that the SL output can be calculated directly from the input. Therefore the succeeding
Fig. 3. Feature extraction layer example with 2d-convolution kernel size K =3 and subsample factor S =2, data dependencies are visualized from input to CL and SL. For the merged method of operation there is no need for a middle CL.
operations are merged; this is only possible if the activation function of the CL is linear. The merged expression with the corresponding coefficients is derived by substituting the CL expression (1), with a linear activation function, into the SL expression (3):

y[m,n] = \phi_s\left(b_s + u \sum_{i=0}^{S-1} \sum_{j=0}^{S-1} c[mS+i, nS+j]\right)
       = \phi_s\left(b_s + u \sum_{i=0}^{S-1} \sum_{j=0}^{S-1} \phi_c\left(b_c + \sum_{k=0}^{K-1} \sum_{l=0}^{K-1} v[k,l]\, x[mS+i+k, nS+j+l]\right)\right)
       = \tilde{\phi}\left(\tilde{b} + \sum_{k=0}^{K+S-2} \sum_{l=0}^{K+S-2} \tilde{v}[k,l]\, x[mS+k, nS+l]\right)    (5)
The enlarged kernel \tilde{v} is constructed from all coefficients that are multiplied with each input value x. The new bias \tilde{b} is the CL bias multiplied by u and added to the SL bias. From Fig. 3 and (5) it is concluded that merging a linear CL and an SL results in a reduction of MACC operations while retaining functional correctness. With the significant reduction of MACC operations the number of memory accesses is also reduced, because there is no intermediate storage of a CL. Table 2 shows expressions for the number of kernel weights, MACC operations and memory accesses that are required to calculate a feature map output. The reduction of MACC operations for multiple merged CL and SL configurations is depicted in Fig. 4.

Table 2. For a feature map output the required number of weights, MACC operations and memory accesses depend on kernel size K and subsample factor S

feature extractor   # kernel weights   # MACC operations   # mem. accesses
CL and SL           K^2 + 1            S^2 (K^2 + 1)       S^2 (K^2 + 2) + 1
merged              (K + S - 1)^2      (K + S - 1)^2       (K + S - 1)^2 + 1
M. Peemen, B. Mesman, and H. Corporaal
90%
reduction #MACC
78%
80%
76%
89% 84%
69% 72%
75% 70%
64%
65% 60%
83%
82%
81%
79%
88%
88%
86%
85%
82%
85%
60%
63%
67%
65%
68%
69%
70%
55%
55% 50%
4
2
3
4
5
3 6
K
7
2 8
S
9
Fig. 4. Reduction of the #MACC operations to calculate a merged feature map compared to the original algorithm for multiple kernel sizes K and subsample factors S
It is not possible to derive the coefficients of v˜ and ˜b if a non-linear activation function is used in the CL. This is not a problem; the learning algorithm is adapted such that it can train the coefficients for the merged configuration. After merging the CL and SL the weight space is changed; therefore training could find better solutions in the weight space which makes derivation from the old weight space suboptimal. The recognition performance of such a trained kernel is evaluated in Section 4. During the remaining part of this paper the merged configuration is used. The merged layers are named Feature Extraction Layers (FELs). For completeness the expression is given for a variable number of input feature maps, y[m, n] = φ(b +
K−1 K−1
vq [k, l]xq [mS + k, nS + l]).
(6)
q∈Q k=0 l=0
As depicted in Fig. 5(a) the number of input feature maps can vary for each feature map. The set Q in (6) contains the indices of the connected input feature maps. The constant K describes the new kernel size and S describes the step size of the input window on an input feature map as depicted in Fig. 5(b). X
y3 Y0
1
input X0 Y11
Y02
S
K v[k,l]
Y12 Y2
y[m,n]
2
Y32
(a)
y4
unique weights
unique weights
input
kernel
output
(b)
Fig. 5. Variables and indices required for feed-forward computation. (a) Feature map naming with connections. (b) Variables for computation of feature map neurons.
Efficiency Optimization of Trainable Feature Extractors
3.2
299
Training with Error Back-Propagation
The training algorithm that is used to learn the coefficients of the merged FELs is the on-line mode of error back-propagation [12]. The derivation of the training rules is described in detail because the merged FELs change the published CNN training expressions [7]. The new contributions to the training procedure are the steps involving the merged FELs. The basic idea of error back-propagation is to calculate the partial derivatives of the output error in function of the weights for a given input pattern. The partial derivatives are used to perform small corrections to the weights to the negative direction of the error derivative. This procedure is split into three parts; feed-forward processing, compute partial derivatives and update the weights. For clarity the network depicted in Fig. 5(a) is used to explain the training procedure. 1) Feed-forward processing: Before training all weights are initialized to a small random value. Then a pattern of the training sample is processed in feed-forward mode through the FELs (6) and the NLs (4). The feed-forward propagation results in an output vector that is compared with the desired output in the cross-entropy (CE) error-function [5], ECE = − dn log(yn ) + (1 − dn ) log(1 − yn ). (7) n∈N
In (7) the set N contains all output neuron indices and dn the target values. 2) Compute partial derivatives: In the previous expressions x is used as input and y as output, for the error derivatives more variables are required. The remaining expressions use λ to describe in which layer variables are positioned. The partial derivatives are found by applying the chain rule on the CE error function of (7), which results in ∂ECE ∂ECE ∂ynλ ∂pλn = ∂wnλ [k] ∂ynλ ∂pλn ∂wnλ [k] y λ − dn ˙ λ λ−1 = λn φ(pn )yk yn (1 − ynλ )
(8)
= δnλ ykλ−1
where, ˙ φ(x) = φ(x)(1 − φ(x)) = yn (1 − yn ),
(9)
∂ECE ∂y . (10) ∂y ∂p Efficient computation of the partial derivatives for NLs that are not positioned at the output is performed by reusing the local gradient δ of the succeeding layer. ∂ECE ∂y λ ∂pλ ∂y λ−1 ∂pλ−1 ∂ECE n n i i = ∂wnλ−1 [k] i∈D ∂yiλ ∂pλi ∂ynλ−1 ∂pλ−1 ∂wnλ−1 [k] n (11) = δiλ wiλ [n]ynλ−1 (1 − ynλ−1 )ykλ−2 δ=
i∈D
= δnλ−1 ykλ−2
300
M. Peemen, B. Mesman, and H. Corporaal
The set D in (11) contains all neurons of the succeeding layer that are connected to neuron ynλ−1 or yn3 in Fig. 5(a). To compute the partial derivatives for multiple NLs (11) is used recursively. The calculation of the gradients for weights in the FELs is done in two steps. First the local gradients are computed by back-propagation from the succeeding layer. Second the local gradients are used to compute the gradients of the weights. Computation of the local gradients for an FELs succeeded by an NL such as for y 2 in Fig. 5(a) is expressed as ˙ λ [m, n]). δ λ [m, n] = δiλ+1 wiλ+1 [m, n]φ(p (12) i∈D
If the succeeding layer is an FEL a select set of neurons is connected which makes computation of the local gradients complex. The connection pattern is influenced by the current neuron indices, the subsample factor and the kernel size as depicted in Fig. 6. For the two-dimensional case the local gradient is δ λ [m, n] =
max K
L max
˙ λ [m, n])) (13) (δqλ+1 [k, l]vqλ+1 [m − Sk, n − Sl]φ(p
q∈Q k=Kmin l=Lmin
where, Kmax =
m m−K +S n n−K +S , Kmin = , Lmax = , Lmin = . S S S S
Border effects restrict Kmin , Kmax , Lmin and Lmax to the featuremap indices. The obtained local gradients are used to compute the gradients of the FEL coefficients. The bias is connected to all neurons in a feature map therefore the gradient is computed by summation over the local gradients in a feature map. M N ∂ECE = δ[m, n] ∂b m=0 n=0
(14)
The gradients for the kernel weights of a FEL are computed by M N ∂ECE = δ λ [m, n]y λ−1 [mS + k, nS + l]. ∂vλ [k, l] m=0 n=0
(15)
3) Update the coefficients of the network: The delta rule of the error backpropagation algorithm is used to keep the training algorithm simple and easy to reproduce with η as single learning parameter. The update function is given as Wnew = Wold − η
∂ECE . ∂Wold
(16)
In the update function (16) W represent the weights w, kernels v and bias b for all possible indices in the network.
Efficiency Optimization of Trainable Feature Extractors
yλ+1
301
0 1 2 δλ+1[k] vλ+1[n-Sk]
yλ 0 1 2 3 4 5 6 7 δλ[n] Fig. 6. Two one dimensional FELs with K=4 and S=2. To compute the local error gradient the error is back-propagated from the succeeding layer.
4
Validate the Recognition Performance
Evaluation of the recognition performance of the merged feature extractors is performed by a training task on two published datasets. The availability of published training results for CNN implementations is the main motivation to use these datasets for a fair comparison. The first training task is performed on the MNIST handwritten digit dataset [7]. This dataset consist of 28x28 pixel images of handwritten digits as shown in Fig. 7(a). For evaluation of feature extraction based on separated or merged CLs and SLs, a MATLAB implementation of a CNN based on LeNet-5 [7] is trained for both configurations. For fair training and testing the original separation of the MNIST data into 60,000 training and 10,000 test samples is used. The classification performance is expressed as the percentage of the test set that is misclassified. Classification of a pattern of the test set is performed by selecting the output neuron that is activated the most (winner takes all). These outputs represent the digits zero to nine. The classification score for the original and the merged network for the MNIST dataset is summarized in Table 3. The second dataset is the small-NORB stereoscopic object classification dataset [8]. This dataset consists of 96x96 pixel image pairs which belong to one of five object classes; a subset is shown in Fig. 7(b). The dataset contains 50 different objects, 25 for training and 25 for testing which are equally distributed over the five classes. These objects are shown from different angles and with different lightning conditions, which makes that each set consist of 24,300 image pairs. For comparison of the recognition performance, implementations with separated or merged CLs and SLs which are based on the LeNet-7 [8] are trained. The classification scores for the small-NORB dataset are also shown in Table 3.
(a) MNIST
(b) small-NORB
Fig. 7. Subset of the visual patterns that are used for training
302
M. Peemen, B. Mesman, and H. Corporaal Table 3. Comparison of the training results for MNIST and NORB data set
benchmark MNIST LeNet-5 [7] separated CLs and SLs merged CLs and SLs reduction
misclassification 0.82% 0.78% 0.71% 8.97%
# MACC ops. 281,784 281,784 97,912 65%
NORB LeNet-7 [8] separated CLs and SLs merged CLs and SLs reduction
6.6% 6.0% 6.0% 0%
3,815,016 3,815,016 632,552 83%
# coefficients FELs 1,716 1,716 2,398 -40% 3,852 3,852 6,944 -80%
Important to conclude from the results of the experiments is that merging the convolution and subsample layers of the feature extractor do not negatively influence the networks ability to generalize. The number of MACC operation to do feed-forward detection is significantly reduced. As mentioned before there is also a non-favorable property, this is the increase of the number of coefficients which is due to the increased kernel sizes as described in Section 3.1. The networks that are implemented for this experiment do not need extra preprocessing of input patterns, such as mean removal. The training procedures used in [7] and [8] use this extra preprocessing on the input data to improve recognition results.
5
Implementation
Demonstration of the recognition speedup as result of merged FELs is performed with a real application. The application is an internally developed implementation of a road sign detection and classification CNN [11]. The CNN is build with merged FELs and with separated CLs and SLs to compare the processing speed. Both CNNs are trained to classify road signs on a 1280x720 HD input image. First a C implementation of the two feature extractor configurations are executed on a 2.66 GHz Core-i5 M580 platform as a reference. The implementation is optimized to exploit data locality by loop interchanges. Compilation of the code is done with MS Visual Studio 2010, all compiler flags are set to optimize for execution speed. The timing results for the two feature extractors are shown in the first column of Table 4. The speedup of a factor 2.8x after merging matches the expectation. In Fig. 4 is shown that the workload of the feature extractor with kernel size K = 5 and subsample factor S = 2 is reduced with 65%. Table 4. Timing comparison of the FELs for the standard and the merged configuration configuration CLs and SLs merged FELs speedup
CPU 577 ms 203 ms 2.84 x
GPU 6.72 ms 2.71 ms 2.48 x
speedup 86 x 75 x
Efficiency Optimization of Trainable Feature Extractors
303
The processing in the feature extractor contains a huge amount of parallelism. A cost effective platform to exploit parallelism is a GPU. Therefore the standard and the merged feature extractors are mapped to a GPU platform to test the impact of the proposed optimization on a parallel implementation. The platform that is used for the experiment is an Nvidia GTX460. First the GPU implementation is optimized to improve data locality by loop interchange and tiling. Second GPU specific optimizations described in the CUDA programming guide [10] are applied. Memory accesses to the images are grouped to have coalesced memory accesses and the used kernel coefficients are stored in the fast constant memory. As final optimization the non-linear sigmoid activation function is evaluated fast using the special function units of the GPU. The following intrinsics from the CUDA programming guide are used to perform a fast but less accurate evaluation of the sigmoid activation function. __fdividef(1,1+__expf(-x)); The kernel execution times for the two GPU implementations are shown in Table 4. The GPU speedup after merging is close to the CPU speedup, this shows that the performance gain of merging is not reduced much due to a parallel implementation.
6
Related Work
Acceleration of CNNs is not a new field of research. Since a few years the first dedicated hardware implementations of CNNs for FPGA platforms are published in [1] and [2]. These implementations are based on hand crafted systolic implementations of the convolution operation to speed up execution time. Non of these implementations explore high level trade-offs to the CNN algorithm. A different simplified CNN is given in [13]. Instead of averaging with subsampling using (3), they only calculate convolution outputs for S 2 pixels. This also reduces computational complexity, but likely at a severe recognition quality loss. However the paper does not report on this. Furthermore, the work in [13] differs from this work because no analysis is published that shows how kernel size K and subsample factor S influence the reduction of computational complexity. Performance measurements of the simplified algorithm on real platforms such as a GPU are not published.
7
Conclusion
In this work a high level algorithm modification is proposed to reduce the computational workload of the trainable feature extractors of a CNN. The learning abilities of the modified algorithm are not decreased; this is verified with real world benchmarks. These benchmarks show that the modification results in a reduction of 65-83% for the required number of MACC operations in the feature extraction stages.
304
M. Peemen, B. Mesman, and H. Corporaal
To measure the real speedup that is gained by the algorithm modification; an implementation of a road sign classification system is performed. This application is mapped to a CPU and a GPU platform. The speedup of the CPU implementation is a factor 2.7 where the GPU implementation gains a factor 2.5, compared to the original convolution and subsample feature extractor. These speedups on real platforms prove that the proposed modification is suitable for parallel implementation. The modifications that are proposed in this paper enable implementations of CNNs on low cost resource constrained consumer devices. This enables a legacy of applications that use trainable vision systems on mobile low-cost devices such as smartphones or smart cameras.
References 1. Chakradhar, S., Sankaradas, M., Jakkula, V., Cadambi, S.: A dynamically configurable coprocessor for convolutional neural networks. In: ISCA 2010: Proceedings of the 37th Annual International Symposium on Computer Architecture, pp. 247– 257. ACM, New York (2010) 2. Farabet, C., Poulet, C., Han, J., LeCun, Y.: Cnp: An fpga-based processor for convolutional networks. In: International Conference on Field Programmable Logic and Applications, FPL 2009, pp. 32–37 (August 2009) 3. Garcia, C., Delakis, M.: Convolutional face finder: A neural architecture for fast and robust face detection. IEEE Trans. Pattern Anal. Mach. Intell. 26(11), 1408– 1423 (2004) 4. Haykin, S.: Neural Networks and Learning Machines, 3rd edn. Prentice Hall, Englewood Cliffs (2008) 5. Hinton, G.E.: Connectionist learning procedures. Artif. Intell. 40(1-3), 185–234 (1989) 6. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989) 7. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998) 8. Lecun, Y., Huang, F.J., Bottou, L.: Learning methods for generic object recognition with invariance to pose and lighting. In: In Proceedings of CVPR 2004 (2004) 9. Nixon, M., Aguado, A.S.: Feature Extraction & Image Processing, 2nd edn. Academic Press, London (2008) 10. Nvidia: NVIDIA CUDA C Programming Guide 3.2. NVIDIA Corporation (2010) 11. Peemen, M., Mesman, B., Corporaal, C.: Speed sign detection and recognition by convolutional neural networks. In: Proceedings of the 8th International Automotive Congress, pp. 162–170 (2011) 12. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 318–362. MIT Press, Cambridge (1986) 13. Simard, P., Steinkraus, D., Platt, J.C.: Best practices for convolutional neural networks applied to visual document analysis. In: ICDAR, pp. 958–962 (2003) 14. Szarvas, M., Yoshizawa, A., Yamamoto, M., Ogata, J.: Pedestrian detection with convolutional neural networks. In: Proceedings IEEE Intelligent Vehicles Symposium, Las Vegas, NV, pp. 224–229 (June 2005)
Salient Region Detection Using Discriminative Feature Selection HyunCheol Kim and Whoi-Yul Kim Department of Electronics and Computer Engineering, Hanyang University, Seoul, Republic of Korea [email protected], [email protected]
Abstract. Detecting visually salient regions is useful for applications such as object recognition/segmentation, image compression, and image retrieval. In this paper we propose a novel method based on discriminative feature selection to detect salient regions in natural images. To accomplish this, salient region detection was formulated as a binary labeling problem, where the features that best distinguish a salient region from its surrounding background are empirically evaluated and selected based on a two-class variance ratio. A large image data set was employed to compare the proposed method to six state-ofthe-art methods. From the experimental results, it has been confirmed that the proposed method outperforms the six algorithms by achieving higher precision and better F-measurements. Keywords: Visual saliency, salient regions, discriminative feature selection.
1
Introduction
Visual saliency is the mechanism that makes objects or regions stand out and attract attention relative to their surroundings. Through the benefits of visual saliency, when one senses the environment, one can correctly and rapidly identify interesting or important regions that are visually salient in complex scenes. However, questions remain regarding what makes particular regions salient or what is salient in an image. The answers to these questions can be slightly different for each individual since they may encompass the viewer’s subjectivity such as their personal preferences and experiences. Nevertheless, some common principles [1]–[5] exist that govern the process of selecting salient regions. If such an attention mechanism can be appropriately modeled, then it can be applied to a wide range of applications such as automatic image cropping [19], image segmentation [10, 20, 21, 22], and object detection/recognition [23, 24, 25]. Beginning with research by James [1], who was the first person to formulate a human attention theory, early visual saliency studies were mainly performed by researchers in physiology, psychology, and neural systems [2, 3]. In recent years, computer vision researchers have devoted much effort to detect salient regions and many computational models of visual saliency have been proposed. Itti’s model [4], which was inspired by the feature integration theory, is one of the most influential J. Blanc-Talon et al. (Eds.): ACIVS 2011, LNCS 6915, pp. 305–315, 2011. © Springer-Verlag Berlin Heidelberg 2011
306
H. Kim and W.-Y. Kim
computational models. The model is based on the biologically plausible architecture proposed in [3]. In order to find locations that stand out locally from their surroundings, center-surround operations are implemented on multi-scaled feature images that are created using a difference of Gaussians. Recently, Walther and Koch [5] extended Itti’s model for addressing proto-objects in natural scenes and created Saliency Tool Box (STB). Hou and Zhang [6] introduced the Incremental Coding Length (ICL) as a general principle by which to distribute energy in an attention system and proposed a dynamic visual attention model based on the rarity of features. Zhang et al. [7] employed a Bayesian framework to estimate the probability of a target at every location. The model detects saliency using natural statistics (SUN) that correspond to an organism’s visual experience over time. The probability distribution over the difference of Gaussian or Independent Component Analysis features is learned from a training set of natural images. Harel et al. [8] proposed graph-based visual saliency (GBVS) including a framework for activation, normalization, and combination. The model extended the technique presented in [18] through the use of a better dissimilarity measure to define edge weights. Hou and Zhang [9] explored the role of spectral components and proposed a spectral residual (SR) approach based on the Fourier transform. The researchers analyzed the log-spectrum of images so as to approximate the innovation part of an image by removing statistically redundant components. Achanta et al. [10] performed a frequency domain analysis on several conventional methods and compared the spatial frequency content retained from an original image. Based on the analysis, the researchers introduced a frequency-tuned (IG) approach that estimates center-surround contrast using color and luminance features. Unlike the previously described methods, IG provides full resolution saliency maps with well-defined boundaries for salient objects. A key issue addressed in this paper is the accurate detection of salient regions from images. For accurate detection, it is suggested that discriminative features that effectively distinguish a salient region from non-salient regions are selected. The discriminative feature selection model consists of two processes: (a) identification of the photographic composition process that captures the photographer’s intent so that an image may be divided into two disjoint salient and non-salient parts, (b) a feature discriminability process that evaluates the discriminating power of a feature and selects the most discriminative features. From an object detection perspective, saliency detection can be regarded as a binary labeling problem [11]. However, unlike typical problems such as face detection, in saliency detection there are no fixed targets that have to be detected, i.e., targets are different for every image. Hence, a detection target will first be determined in (a) by considering the salient part as the target. Features with high discrimination capabilities are then empirically sought based on a two-class variance ratio with samples that are extracted from salient and non-salient parts. In Section 2, a salient region detection framework that describes the feature selection mechanism is presented. The saliency computation is described in Section 3 and experimental results with a large image data set are detailed in Section 4. Conclusions are ultimately presented in Section 5.
Salient Region Detection Using Discriminative Feature Selection
307
Fig. 1. Proposed framework for salient region detection and segmentation. Feature maps are generated from the top ranked N (= 3 in this example) features and are integrated to obtain the saliency map.
2
Salient Region Detection
The saliency region detection framework proposed in this work is shown in Fig. 1. Given an input image, an identification of the photographic composition is first performed so as to divide the image region into two parts that are likely to be salient or non-salient. After dividing the image into salient and non-salient parts according to the identified photographic composition, discriminative features that are efficient in detecting a salient region are sought. With the selected discriminative features, feature maps are generated in order to capture the spatial contrast, where the contrast is computed as the difference between an average of the feature values in the non-salient part and each feature value. The outputs from these feature maps are weighted summed so as to yield a master map of the saliency. 2.1
Identification of Photographic Composition
There are some heuristic guidelines (e.g., rule of thirds, lines, framing, visual balance, and diagonal dominance) that make a photograph more appealing by increasing its aesthetic value [12]. People usually reflect these rules in their photographs when they take a picture. Therefore, if the composition rule for an image is correctly identified, then a salient region in the image can be easily and effectively found [12, 13]. Precisely identifying the photographic composition of an image is very difficult without information regarding a salient object in the image. Hence, the intent here is to simplify the identification problem by partitioning the image region R into two disjoint parts: salient Rs and non-salient Rns. For this purpose, the photographic region templates (PRT) proposed in [13] are utilized so as to capture the photographer’s intentions in a simple manner. As shown in Fig. 2, the PRT is composed of a total of 9
308
H. Kim and W.-Y. Kim
region templates: one center (RT1), four corners (RT2–RT5), two horizontals (RT6 and RT7), and two verticals (RT8 and RT9). Each template RTi is partitioned into two subregions: an object region RTio and a background region RTib. A salient region can be located in one RTio among the 9 object regions. In order to decide which region template is the most suitable match to the photographic composition of an input image, interest points that are characterized by a significant local variation in image values (obtained using the Harris detector of [14]) are used as a feature. Interest points can be a good clue for identifying photographic composition since they attract attention and fixation due to their high information content. A salient region is likely to be positioned at an area that contains many interest points. Hence, a salient region can be detected by searching an area that simultaneously contains as many interest points and as few non-interest points as possible. Let nio and nib be the number of interest points in RTio and RTib, respectively, for the ith region template. A matching score fM(RTi) indicating how well a region template decomposes an image is then defined as
f_M(RT_i) = \frac{n_i^o - n_i^b}{n_i^o + n_i^b} \times 100.   (1)
A best matching template (BMT) is obtained by searching a region template that has a maximum fM(RTi) value among all the candidate region templates that satisfy the following two constraints:
\text{i)}\ f_M(RT_i) \ge th_1, \qquad \text{ii)}\ |f_M(RT_i) - f_M(RT_1)| \ge th_2.   (2)
With regard to the constraints, i) allows the BMT to be selected from the region templates that have a significant difference between n_i^o and n_i^b, and ii) is set to give more weight to the center template, since a strong bias for human fixations tends to be near the center of an image [7, 15, 16]. Referring to a study by Judd et al. [16], it was reported that 40% and 70% of fixations lie within the center 11% and 25% of an image, respectively. The BMT is ultimately determined by the following formula

BMT = \begin{cases} \arg\max_{RT_i \in RT} f_M(RT_i), & \text{if } RT \neq \emptyset, \\ RT_1, & \text{otherwise,} \end{cases}   (3)
where RT is the set of region templates that satisfy (2).
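As an illustration of how the template matching above can be evaluated, the following sketch computes eqs. (1)-(3) directly. It is not the authors' implementation: it assumes the nine region templates are supplied as binary masks (object region RT_i^o marked with 1), with templates[0] corresponding to the centre template RT_1, that Harris interest points have already been extracted as (x, y) coordinates, and that th_1 and th_2 take the values reported later in the experiments (70 and 40).

import numpy as np

def matching_score(mask, points):
    # Eq. (1): relative difference between interest points inside the
    # object region (mask == 1) and those falling in the background region.
    inside = sum(1 for (x, y) in points if mask[int(y), int(x)] == 1)
    outside = len(points) - inside
    if inside + outside == 0:
        return 0.0
    return 100.0 * (inside - outside) / (inside + outside)

def select_bmt(templates, points, th1=70.0, th2=40.0):
    # Eqs. (2)-(3): keep templates passing both constraints and take the
    # one with the maximum score; fall back to the centre template RT1.
    scores = [matching_score(m, points) for m in templates]
    candidates = [i for i, s in enumerate(scores)
                  if s >= th1 and abs(s - scores[0]) >= th2]
    if not candidates:
        return 0, scores          # RT1 (index 0) when the set RT is empty
    return max(candidates, key=lambda i: scores[i]), scores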
2.2 Feature Selection for Saliency
The proposed feature selection mechanism consists of the following three steps: 1) generating a pool of features, 2) computing the means and variances for salient and non-salient parts of the BMT with respect to each feature, and 3) evaluating feature discriminability based on the proposed two-class variance ratio measure. Details regarding each step are explained below. There are a wide range of low-level visual features (e.g., luminance, color, and shape) that can be employed for saliency detection. It is quite obvious that all of these features have to be considered and integrated efficiently in order to accurately detect a salient region. However, each feature space typically has so many tunable parameters
that the full set of potential features is enormous. Therefore, in this work, only color is considered as a saliency feature, since color contrast is rich in information and plays a very important role in the human visual perception process when compared to other features [19]. A set of candidate features F is composed of linear and non-linear combinations of R, G, and B pixel values. The linear candidate features F_l, with integer coefficients in the range of -2 to 2, are generated as [17]

F_l = \{ c_1 R + c_2 G + c_3 B \mid c_i \in \{-2, -1, 0, 1, 2\} \}.   (4)
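The candidate pool of eq. (4) can be enumerated directly. The sketch below is only an illustration of the counting argument in the next paragraph (it is not the authors' code): it reduces every coefficient triple to a canonical representative, so that triples related by a scalar factor k are kept only once, which leaves the 49 linear features.

import numpy as np
from math import gcd
from itertools import product

def linear_feature_pool():
    # Enumerate eq. (4) and drop the all-zero triple as well as triples
    # (c1, c2, c3) that are scalar multiples k(c1', c2', c3') of a kept one.
    kept = set()
    for c in product(range(-2, 3), repeat=3):
        if c == (0, 0, 0):
            continue
        g = gcd(gcd(abs(c[0]), abs(c[1])), abs(c[2]))
        base = tuple(v // g for v in c)
        first = next(v for v in base if v != 0)
        if first < 0:                      # identify (c1,c2,c3) with -(c1,c2,c3)
            base = tuple(-v for v in base)
        kept.add(base)
    return sorted(kept)

pool = linear_feature_pool()
assert len(pool) == 49                     # the pool size quoted below

def feature_image(rgb, coeffs):
    # Project an H x W x 3 RGB image onto one candidate feature and
    # normalise it to the 0-255 range used in the paper.
    f = rgb.astype(float) @ np.asarray(coeffs, dtype=float)
    f -= f.min()
    return f * (255.0 / f.max()) if f.max() > 0 else f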
After disallowing the feature with all zero coefficients and eliminating redundant coefficients for which k(c'_1, c'_2, c'_3) = (c_1, c_2, c_3), a pool of 49 features remains. Although F_l covers some common features such as raw R, G, and B values, non-linear color features that are designed to approximate human vision also have to be considered for better results. Therefore, 4 non-linear color channels F_nl that are difficult to account for with F_l are added to F; F_nl consists of the a* and b* channels of L*a*b* and the hue and saturation channels of HSV color space. Finally, for every feature f ∈ F, the input image is transformed to a feature image I_f(p). The feature images I_f(p) are then normalized into the range 0 to 255, where p is a pixel site in R. Linear Discriminant Analysis (LDA) is a well-known feature discrimination method that seeks an optimal linear projection yielding the least overlap between classes. LDA has been shown to be successful when the data of the classes follow unimodal distributions, because it intrinsically assumes that the data from each class are normally distributed. Therefore, if the data form multi-modal distributions of colors (such as in this work), LDA cannot be expected to separate the data correctly. Rather than finding only one optimal feature, this problem is overcome by empirically evaluating each candidate feature in order to determine which ones have high discriminating power. To evaluate the discriminating power of each feature, a variance ratio measure VR(f), a simple variant of Fisher's criterion, is formulated as

VR(f) = \frac{|\mu_s^f - \mu_{ns}^f|}{\sigma_{ns}^f},   (5)
Fig. 3. Example images with ranked feature images. Column 1: input image with a labeled salient region (green box). Columns 2–5: feature images corresponding to the highest two, the median, and the lowest ranked features (e.g., 2R–G–B, 2R–2B, Hue, a*), respectively.
where

\mu_s^f = \frac{1}{|R_s|} \sum_{p \in R_s} I_f(p), \qquad \mu_{ns}^f = \frac{1}{|R_{ns}|} \sum_{p \in R_{ns}} I_f(p)

are the mean values in the salient and non-salient parts of the BMT, respectively, and the variance in the non-salient part is

\sigma_{ns}^f = \frac{1}{|R_{ns}|} \sum_{p \in R_{ns}} \left( I_f(p) - \mu_{ns}^f \right)^2.

The idea behind this variance ratio is that a tight cluster of pixel values in the non-salient part (low within-class variance) would be preferred, and the two clusters should ideally be spread apart as much as possible (high total variance). Notice that, in contrast to the Fisher criterion, the within-class variance of the salient part is discarded from the denominator. In the ratio VR(f), only the variance of the non-salient part should be small, for the following reasons. First, in the case presented here, it is quite difficult to minimize the within-class variances of both the salient and non-salient parts due to the inaccurate image decomposition. In other words, many misclassified samples are initially present in the class data, which hinders a correct evaluation of discriminating power. Second, the focus is on minimizing the variance of the non-salient part, since low contrast in the non-salient part allows more human fixation/attention on the salient part. This expectation is supported by the numerator, which enables the salient part to stand out from the non-salient part. Shown in Fig. 3 are sample images with a labeled salient part and feature images with the highest two, median, and lowest variance ratio values. When compared to the median and lowest ranked images, the salient regions in Fig. 3 (yellow flower, red wheel) are distinct in the highly ranked feature images. In addition, the discriminative features differ from image to image.
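A direct transcription of this ranking step is given below as a sketch (an illustration rather than the authors' code). It assumes the salient and non-salient masks R_s and R_ns derived from the BMT are available as boolean images, and it follows eq. (5) as reconstructed above, with σ_ns^f taken to be the variance of the non-salient part.

import numpy as np

def variance_ratio(feature_img, salient_mask, nonsalient_mask, eps=1e-12):
    # Eq. (5): contrast between the class means, penalised only by the spread
    # of the non-salient part (a one-sided variant of Fisher's criterion).
    mu_s = feature_img[salient_mask].mean()
    vals_ns = feature_img[nonsalient_mask]
    return abs(mu_s - vals_ns.mean()) / (vals_ns.var() + eps)

def rank_features(feature_imgs, salient_mask, nonsalient_mask, top_n=5):
    # Return indices of the top_n most discriminative features plus all scores.
    scores = np.array([variance_ratio(f, salient_mask, nonsalient_mask)
                       for f in feature_imgs])
    return np.argsort(scores)[::-1][:top_n], scores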
3
Computing Saliency
After the top-ranked N features are selected from F, N feature maps FMf that represent the spatial contrast in its associated feature are calculated as
FM_f(p) = \left| \mu_{ns}^f - \bar{I}_f(p) \right|,   (6)
where \bar{I}_f is the Gaussian-blurred version of the original feature image I_f, which serves to suppress fine texture details, coding artifacts, and noise. With the help of feature selection, feature images whose pixel values in the salient and non-salient parts are very different can be obtained. Thus, the difference between \mu_{ns}^f and \bar{I}_f is taken as the degree of saliency. A saliency map SM is ultimately generated by combining the computed N feature maps. The map is defined as

SM(p) = \sum_{f=1}^{N} w_f \cdot FM_f(p),   (7)

w_f = \frac{VR(f)}{\sum_{f=1}^{N} VR(f)}.   (8)
In order to increase the contribution of the higher ranked features in the saliency map, the SM is computed as the weighted sum of the FM_f, where the weights (with \sum_{f=1}^{N} w_f = 1) are adjusted by considering the variance ratio of each FM_f.
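Putting eqs. (6)-(8) together, the saliency map can be assembled as in the sketch below. The Gaussian blur width is not specified in the text, so the value used here is an assumption, and the helper names are hypothetical.

import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map(feature_imgs, scores, top_idx, nonsalient_mask, blur_sigma=3.0):
    # feature_imgs: list of H x W feature images scaled to 0..255
    # scores:       variance ratios VR(f) of all candidate features
    # top_idx:      indices of the N selected (top-ranked) features
    weights = scores[top_idx] / scores[top_idx].sum()        # eq. (8)
    sm = np.zeros_like(feature_imgs[0], dtype=float)
    for w, i in zip(weights, top_idx):
        img = feature_imgs[i].astype(float)
        blurred = gaussian_filter(img, blur_sigma)           # blurred feature image
        mu_ns = img[nonsalient_mask].mean()                  # mean of the non-salient part
        sm += w * np.abs(mu_ns - blurred)                    # eq. (6), combined as in eq. (7)
    return sm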
4
Experimental Results
In our experiments, the Achanta set [10], consisting of 1000 images and their accurately labeled ground truth, is considered to evaluate the performance of salient region detection. The evaluation of an algorithm is carried out in terms of Precision, Recall, and F-Measure. Precision is the ratio of correctly detected salient regions to the detected salient regions, and Recall is the ratio of correctly detected salient regions to the ground truth. F-Measure is an overall performance measure and is computed as the weighted harmonic mean of precision and recall with a positive α. Formally, F-Measure is defined as

F_\alpha = \frac{(1 + \alpha)\, \text{Precision} \times \text{Recall}}{\alpha \times \text{Precision} + \text{Recall}},   (9)
where α decides the importance of precision over recall while computing the F-Measure. In this work, α = 0.3 was used to weigh precision more than recall. The proposed method was compared with six leading methods: STB [5], ICL [6], SUN [7], GBVS [8], SR [9] and IG [10]. The proposed method is referred to as DFS. The reason we chose the aforementioned algorithms for comparison is that they are widely cited in the literature (STB, GBVS, and SR), recently proposed (ICL, SUN, IG), based on various theoretical backgrounds, and exhibit good performance. For the five methods that yield lower resolution saliency maps (IG is not included), bilinear interpolation is used to resize the maps to the original image size. Throughout the experiments, th_1 and th_2, used for searching the BMT, are set to 70 and 40, respectively, and the number of selected features N is set to 5. The saliency maps detected by the proposed method and by the other methods are shown in Fig. 4. Let A(m, n) be the image positioned at the m-th row and n-th column. For viewing convenience, the binary ground truths are converted to color images, where non-salient pixels are given a gray value. As shown in Fig. 4, the proposed method outperforms the other methods by generating sharper and more uniformly highlighted salient regions. For example, in column 2, the proposed method correctly detects the salient traffic light regions by selecting features with high discrimination power, while STB and ICL do not distinguish the small color difference and ultimately miss them. Similarly, for A(1, 3), the salient object on the ground has been accurately detected, although the object is not highly distinguishable. The proposed method is also not sensitive to the scale of salient objects, since discriminative features are analyzed and selected based on statistics and the photographic composition of an image. In A(1, 4), the dominant object occupies almost the entire body of the image and thus STB and ICL detect the orange region corresponding to the stamen and pistil of the yellow flower as the salient region, since those methods are based on the rarity concept. In contrast, the proposed method determined the BMT for A(1, 4) as RT_1, and thus a feature that distinguishes between the green color in the non-salient part and both the yellow and orange colors in the salient part is sought. IG also generates well-defined saliency maps A(8, 1) and A(8, 2), whose salient regions are uniformly highlighted. However, IG yields poor
Fig. 4. Visual comparison of saliency maps (rows, top to bottom: input image, ground truth, STB [5], ICL [6], SUN [7], GBVS [8], SR [9], IG [10], and DFS)
saliency maps for images A(8, 3) and A(8, 4), where the contrast between salient and non-salient regions is not distinct or where the salient regions are larger than the non-salient regions. The reason this occurs is that the method computes a saliency map as the difference between a global mean feature vector and each pixel feature. In contrast to IG, the proposed method computes a saliency map as the difference between the mean feature vector calculated only in the non-salient part and each pixel feature. In addition, only the features with high discriminating power are employed in the saliency computation. Thus, the proposed method can effectively deal with such images. In Fig. 5, additional examples are displayed to show the performance of the proposed method. From the figure, one can again confirm that the proposed method effectively detects and segments the salient regions.
Fig. 5. Some sample results of salient region detection: original image (odd columns), saliency map (even columns)
In order to evaluate the quantitative performance of the proposed method, an experiment was performed to measure how well the detected saliency map captures the salient region in the images. For a given saliency map with saliency values in the range [0, 255], the threshold T_f was varied from 0 to 255 so as to obtain a binary mask for the salient object and to compute the precision, recall, and F-Measure at each threshold. This experiment is the simplest way to obtain a binary mask for the salient object. It also allows a reliable comparison of how well the salient object is detected without a segmentation algorithm. In fact, this experiment was conducted to compare the quality of the detected saliency maps themselves; Achanta et al. also carried out this experiment with the same intent. Shown in Fig. 6 are the experimental precision versus recall curves; the proposed DFS method outperforms several existing methods. From the precision-recall curves, the following observations can be made: (a) At maximum recall, i.e., T_f = 0, all methods have the same low precision value of about 0.2, because all pixels in the saliency maps are retained as positives at this threshold regardless of the method used; (b) For a very low recall (< 0.1), the precisions of STB, SUN, and SR drop steeply, because the salient pixels from these methods do not cover the entire salient object; in fact, STB tends to detect only small parts within salient regions, and SUN and SR often detect the boundaries of salient regions; (c) Although SUN has been proven to be a powerful technique in [7], it exhibited poor performance in this study, because we used the natural statistics provided by the researchers instead of training their algorithm on the Achanta data set; (d) For a very high recall (> 0.95), GBVS exhibits the highest precision, because GBVS tends to detect many pixels neighboring the salient object as salient regions, as shown in Fig. 4. In Table 1, the average precision, recall, and F-Measure are listed. The proposed method achieved the highest precision (84%) and F-Measure (51%). Although GBVS achieved the highest recall rate (52%), its F-Measure is lower than that attained with
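The threshold sweep described above can be reproduced with a few lines; the sketch below is an illustration under the stated protocol (binarise at every T_f in 0..255, then apply eq. (9) with α = 0.3), not the evaluation code used by the authors.

import numpy as np

def pr_sweep(saliency, ground_truth, alpha=0.3):
    # saliency: H x W map scaled to 0..255; ground_truth: boolean mask.
    gt = ground_truth.astype(bool)
    curves = []
    for tf in range(256):
        detected = saliency >= tf
        tp = np.logical_and(detected, gt).sum()
        precision = tp / max(int(detected.sum()), 1)
        recall = tp / max(int(gt.sum()), 1)
        f_measure = ((1 + alpha) * precision * recall
                     / max(alpha * precision + recall, 1e-12))   # eq. (9)
        curves.append((tf, precision, recall, f_measure))
    return curves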
Fig. 6. Precision-recall curves for naïve thresholding of saliency maps on the Achanta set

Table 1. Average Precision, Recall, and F-Measure for the Achanta set

            STB    ICL    SUN    GBVS   SR     IG     DFS
Precision   0.65   0.67   0.46   0.64   0.53   0.68   0.84
Recall      0.08   0.19   0.44   0.52   0.24   0.31   0.35
F-Measure   0.18   0.30   0.31   0.45   0.26   0.38   0.51
the proposed method because of its low precision. Of course, by controlling the value of α (> 1, i.e., weighing recall more than precision) in (9), GBVS can yield a larger F-Measure than the proposed method. However, this is not desirable, since the recall rate is not very important in attention detection [11].
5
Conclusions
In this paper, a salient region detection method was presented that consists of two parts: discriminative feature selection and photographic composition identification. For salient region/object detection in natural images, color contrast was considered to be the most important factor among such visual cues as orientation and shape. In order to select features with high discrimination capabilities, each feature was empirically evaluated rather than searching for only one optimal feature. The robustness of the proposed saliency detection method was objectively demonstrated through the use of a large image data set.
References
1. James, W.: The Principles of Psychology. Holt, New York (1890)
2. Treisman, A., Gelade, G.: A feature-integration theory of attention. Cogn. Psych. 12, 97–136 (1980)
3. Koch, C., Ullman, S.: Shifts in selection in visual attention: Toward the underlying neural circuitry. Hum. Neurobiol. 4, 219–227 (1985)
4. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20, 1254–1259 (1998)
5. Walther, D., Koch, C.: Modeling attention to salient proto-objects. Neural Netw. 19, 1395–1407 (2006)
6. Hou, X., Zhang, L.: Dynamic visual attention: Searching for coding length increments. In: NIPS, pp. 681–688 (2008)
7. Zhang, L., Tong, M., Marks, T., Shan, H., Cottrell, G.: SUN: A Bayesian framework for saliency using natural statistics. J. Vision 8, 1–20 (2008)
8. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: NIPS (2006)
9. Hou, X., Zhang, L.: Saliency detection: A spectral residual approach. In: IEEE CVPR, pp. 1–8 (2007)
10. Achanta, R., Hemami, S., Estrada, F., Süsstrunk, S.: Frequency-tuned salient region detection. In: IEEE CVPR, pp. 1597–1604 (2009)
11. Liu, T., Sun, J., Zheng, N., Tang, X., Shum, H.: Learning to detect a salient object. In: IEEE CVPR, pp. 1–8 (2007)
12. Kodak: How to Take Good Pictures: A Photo Guide by Kodak. Ballantine, New York (1995)
13. Yang, S., Kim, S., Ro, Y.M.: Semantic home photo categorization. IEEE Trans. Circuits Syst. Video Technol. 17, 324–335 (2007)
14. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. Alvey Vision Conference, pp. 147–151 (1988)
15. Tatler, B.W.: The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. J. Vis. 7, 1–17 (2007)
16. Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: IEEE ICCV (2009)
17. Collins, R.T., Liu, Y., Leordeanu, M.: Online selection of discriminative tracking features. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1631–1643 (2005)
18. Chen, W.-K.: Linear Networks and Systems. Wadsworth, Belmont (1993)
19. Suh, B., Ling, H., Bederson, B.B., Jacobs, D.W.: Automatic thumbnail cropping and its effectiveness. In: UIST, pp. 95–104 (2003)
20. Achanta, R., Estrada, F., Wils, P., Süsstrunk, S.: Salient region detection and segmentation. In: ICVS, pp. 66–75 (2008)
21. Fu, Y., Cheng, J., Li, Z., Lu, H.: Saliency cuts: An automatic approach to object segmentation. In: ICPR (2008)
22. Fukuda, K., Takiguchi, T., Ariki, Y.: Automatic segmentation of object region using graph cuts based on saliency maps and AdaBoost. In: ISCE, pp. 36–37 (2009)
23. Gao, D., Vasconcelos, N.: Integrated learning of saliency, complex features, and object detectors from cluttered scenes. In: CVPR, pp. 282–287 (2005)
24. Oliva, A., Torralba, A., Castelhano, M., Henderson, J.: Top-down control of visual attention in object detection. In: ICIP, pp. 253–256 (2003)
25. Gao, D., Han, S., Vasconcelos, N.: Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31, 989–1005 (2009)
Image Analysis Applied to Morphological Assessment in Bovine Livestock
Horacio M. González-Velasco, Carlos J. García-Orellana, Miguel Macías-Macías, Ramón Gallardo-Caballero, and Antonio García-Manso
CAPI Research Group, Polytechnic School, University of Extremadura, Av. de la Universidad, s/n. 10003 Cáceres, Spain
[email protected]
Abstract. Morphological assessment is one important parameter considered in conservation and improvement programs of bovine livestock. This assessment process consists of scoring an animal attending to its morphology, and is normally carried out by highly-qualified staff. In this paper, a system designed to provide an assessment based on a lateral image of the cow is presented. The system consists of two main parts: a feature extractor stage, to reduce the information of the cow in the image to a set of parameters, and a neural network stage to provide a score considering that set of parameters. For the image analysis section, a model of the object is constructed by means of point distribution models (PDM). Later, that model is used in the searching process within each image, that is carried out using genetic algorithm (GA) techniques. As a result of this stage, the vector of weights that describe the deviation of the given shape from the mean is obtained. This vector is used in the second stage, where a multilayer perceptron is trained to provide the desired assessment, using the scores given by experts for selected cows. The system has been tested with 124 images corresponding to 44 individuals of a special rustic breed, with very promising results, taking into account that the information contained in only one view of the cow is not complete.
1
Introduction
Regarding the control and conservation of purity in certain breeds of bovine livestock, European legislation imposes the creation of herd-books, records of the information stored about the existing animals of the breed in question. Morphological assessment is one of the parameters stored in those herd-books for each pure-bred animal, and it is of great importance for many tasks performed with them. In general, the morphological assessment consists of scoring an animal considering its external morphology (hence it is also known as conformational or external evaluation) and the similarity of its characteristics to the standard of the breed. This score is usually set between 0 and 100 points,
and is conducted by highly qualified and experienced staff (known as assessors or, more generally, experts). These personnel must behave in a “neutral way” and have to be trained in a centralized manner in order to maintain uniform criteria. There are several methods to carry out the morphological assessment, usually considering measurable traits and traits directly evaluated by visual inspection [10]. However, with rustic meat-producing breeds, normally bred on extensive farms, morphological assessment is based only on visual inspection. This method consists of scoring ten traits, each one including several aspects of the standard of the breed (see Table 1). Those traits are assessed with a continuous value between 1 and 10 points, and the global score is obtained by a weighted sum of the partial scores. The weights are determined by the characteristics of the breed and the objectives to be reached with its control and selection. This is the method applied for the breed considered in this work. A similar method, but in another context, is described in [4].

Table 1. Traits to be individually assessed for the considered breed. The weights for the final sum are included.

Trait to be assessed              Weights (males)   Weights (females)
General aspect and figure         1.6               1.6
Development of the body           1.2               1.2
Head                              0.5               0.5
Neck, chest, withers and back     0.5               0.4
Thorax and belly                  1.0               1.0
Back and loins                    1.5               1.3
Hindquarters and tail             1.2               1.3
Thighs and buttocks               1.3               1.3
Extremities                       0.7               1.0
Genital organs                    0.5               —
Shape and quality of the udder    —                 0.5
With this method, the great amount of subjectivity involved in the morphological assessment is worrisome. This, together with the need for uniformity of criteria, led us to consider the utility of a semiautomatic system to assist morphological assessment, as already discussed in [8,10]. In these works, as well as in the information received from the consulted experts, it is suggested that the assessment could be done quite accurately using three pictures corresponding to three positions of the cow (frontal, lateral and rear), fundamentally by analyzing the profiles of the animal in those images. In this paper we present a first approach to an automatic morphological assessment system, based on images of only one of the views: the lateral position. The proposed method has a classical two-stage structure. First, feature extraction is performed on the image, with the aim of representing the profile
of the cow with a reduced set of parameters. Later, using these parameters, a supervised neural network system is trained with samples of assessments made by the experts. In this way, their knowledge about assessment can be extracted and their assessments can be approximately reproduced. For the first stage we used the approach designed by Hill et al. [2,12], already applied with great success to other types of images (usually grey-scale medical images [5,13]). This approach was adapted for use with the colour images that we obtained outdoors using an automatic digital camera. We propose a first image processing stage consisting of an edge detection and selection similar to that described in [9], so that the new image, together with a simple objective function, lets us carry out an efficient search using genetic algorithms [7,11]. As a result, the extracted contours are represented by a reduced set of parameters: the vector of weights that describes the deviation of the considered shape from the mean in the training set. In the second stage, we propose to use the contours extracted for a selected set of cows, and the assessments made by experts for those cows, to train a supervised neural network (a multilayer perceptron [1]). Once the neural network is trained, the complete system can provide an assessment for each image presented at its input. In Section 2 the technique used to search for the contours within the images is described in detail (shape modelling, objective function, etc.). Next, Section 3 provides details about the neural network used and the training method. The results obtained by applying the whole system to a set of images can be found in Section 4, and finally our conclusions and possible improvements of our work are presented in Section 5.
2
First Stage: Image Analysis
The contour extraction problem can be converted into an optimization problem by considering two steps: the parametrization of the shape we want to search for, and the definition of a cost function that quantitatively determines whether or not a contour is adjusted to the object. The general outline of the system used is shown in Figure 1. As illustrated, for the contour modelling and parametrization the PDM technique [2] has been applied, which lets us restrict all the parameters and precisely define the search space. On the other hand, the proposed objective function is designed to place the contour over areas of the image where edges corresponding to the object are detected.
2.1 Shape Modelling and Search Space
As is shown in Figure 1, the first step consists of representing the desired shape we want to search for, along with its possible variations. In some previous works [12,8] a general method known as PDM has been used, which employs deformable models representing the contours by means of ordered groups of points (Figure 2) located at specific positions on the object, constructed statistically based on a set of examples. As a result of the process, the average shape of our object is obtained, and each contour can be represented mathematically by a vector x such that

x = x_m + P \cdot b,   (1)

where x_m is the average shape, P is the matrix of eigenvectors of the covariance matrix, and b is a vector containing the weights for each eigenvector, which properly defines the contour in our description. Fortunately, by considering only a few eigenvectors corresponding to the largest eigenvalues of the covariance matrix, practically all the variations occurring in the training set can be described. This representation (model) of the contour is, however, in the normalized space. To project instances of this model to the space of our image, a transformation that preserves the shape is required (translation t = (t_x, t_y), rotation θ and scaling s). Thus, every permitted contour in the image can be represented by a reduced set {s, θ, t, b} of parameters.

Fig. 1. General outline of the contour location system using genetic algorithms
Fig. 2. Figure (a) illustrates the images dealt with. Figure (b) shows the model of the cow contour in lateral position.
In order to have the search space precisely defined, the limits for these parameters are required. As stated in [2], to maintain a shape similar to those in the training set, the b_i parameters must be restricted to values -\sqrt{\lambda_i} \le b_i \le \sqrt{\lambda_i}, where \lambda_i are the eigenvalues of the covariance matrix. The transformation parameters can be limited by considering that we approximately know the position of the object within the image: normally the animal is in a horizontal position (angle restriction), not arbitrarily small (restriction in scale) and more or less centered (restriction in t). In our specific case (lateral position), the cow contour is described using a set of 126 points (Figure 2), 35 of which are significant. The model was constructed using a training set of 45 photographs, previously selected from our database of 124 images, trying to cover the variations in position as much as possible. Once the calculations were made, no more than 10 eigenvalues were needed to represent virtually every possible variation (more than 90%).
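For reference, the PDM of eq. (1) and the ±√λ_i limits above can be obtained with a standard eigen-decomposition. The sketch below assumes the training contours are already aligned and stacked row-wise; it is only an outline of the standard procedure, not the authors' implementation.

import numpy as np

def build_pdm(aligned_shapes, n_modes=10):
    # aligned_shapes: (S, 2n) array of contours (x0, y0, ..., xn-1, yn-1).
    x_mean = aligned_shapes.mean(axis=0)
    cov = np.cov(aligned_shapes - x_mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_modes]       # keep the largest eigenvalues
    return x_mean, eigvec[:, order], eigval[order]

def shape_instance(x_mean, P, eigval, b):
    # Eq. (1), with each b_i clipped to the +/- sqrt(lambda_i) search range.
    b = np.clip(b, -np.sqrt(eigval), np.sqrt(eigval))
    return x_mean + P @ b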
2.2 Objective Function and Potential Image Extraction
Once the shape is parameterized, the main step of this method is the definition of the objective function. In several consulted works [13,12] this function is built so that it reaches a minimum when strong edges of similar magnitude are located near the points of a certain instance of the model in the image. We use a similar approach, generating an intermediate grayscale image that acts as a potential image (inverted, so that we try to maximize the “energy”). This potential image could be obtained using a conventional edge detector. However, to avoid many local maxima in the objective function, it is important that the edge map does not contain many more edges than those corresponding to the searched object. This is not always possible because of background objects. For this reason, the method devised by González et al. [9] for edge selection is applied. In this method a multilayer perceptron neural network is trained using a set of images for which the edges of the searched object are known. The selection system acts on a previously obtained maximum edge map, retaining only those edges that most likely correspond to the object we are looking for, according to the learned criterion. With this potential image, we propose the following objective function to be maximized by the GA:

f_L(X) = \left( \sum_{j=0}^{n-2} K_j \right)^{-1} \cdot \sum_{j=0}^{n-2} \left( \sum_{(X,Y) \in r_j} D_g(X, Y) \right),   (2)
where X = (X_0, Y_0; X_1, Y_1; \ldots; X_{n-1}, Y_{n-1}) is the set of points (image coordinates, integer values) that form one instance of our model, and can easily be obtained from the parameters using equation 1 and a transformation; r_j is the set of points (pixels of the image) that form the straight line joining (X_j, Y_j) and (X_{j+1}, Y_{j+1}); K_j is the number of points that such a set contains; and D_g(X, Y) is a Gaussian function of the distance from the point (X, Y) to the
nearest edge in the potential image. This function is defined to have a maximum value I_max for distance 0 and must be decreasing, so that a value less than or equal to I_max/100 is obtained at a given reach distance D_A, in our case 20 pixels. In order to avoid the calculation of distances within the objective function (which is executed a great number of times), a gray-level distance map is generated, over which the instances of our contour are projected.
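One way to realise this potential image and the objective function of eq. (2) is sketched below; it is an assumption-laden outline (edge map as a boolean image, model points inside the image bounds), not the authors' code. The Gaussian width is chosen so that D_g equals I_max/100 at the reach distance D_A = 20 pixels.

import numpy as np
from scipy.ndimage import distance_transform_edt

def potential_image(edge_map, i_max=255.0, d_a=20.0):
    # Distance from every pixel to the nearest selected edge, mapped through
    # a decreasing Gaussian: D_g(0) = I_max and D_g(D_A) = I_max / 100.
    dist = distance_transform_edt(~edge_map)
    sigma2 = d_a ** 2 / (2.0 * np.log(100.0))
    return i_max * np.exp(-dist ** 2 / (2.0 * sigma2))

def objective(points, dg):
    # Eq. (2): average of D_g over the polyline joining consecutive model
    # points, given as an (n, 2) array of integer (X, Y) image coordinates.
    total, count = 0.0, 0
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        k = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.linspace(x0, x1, k).round().astype(int)
        ys = np.linspace(y0, y1, k).round().astype(int)
        total += dg[ys, xs].sum()
        count += k
    return total / max(count, 1)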
2.3 Genetic Algorithm Configuration
After defining the search space and the objective function, two aspects must be determined in order to have an operational GA scheme. On the one hand, the general outline of the GA must be decided (coding, replacement and selection methods, etc.). On the other hand, values must be given for the parameters that determine the algorithm (population size, mutation rate, etc.). As there is no definitive conclusion about the best GA scheme to be used in this kind of problem, two configurations were considered, based on the ideas in [3]. The first is similar to the basic GA in [7], and we call it standard, while the other uses real coding and is called alternative. They present the following characteristics:
– Standard configuration: binary coding (16 bits), roulette selection method (using raw fitness), two-point crossover and mutation operators, and steady-state replacement method.
– Alternative configuration: real coding, binary tournament selection method (using linear normalization to calculate fitness), two-point crossover and mutation operators, and steady-state replacement method.
To determine the most suitable scheme for our problem, along with the best values for the parameters involved (number of modes of variation t, population P, probabilities of crossover and mutation P_c and P_m, and replacement rate S), a secondary GA was applied, whose chromosome included those parameters and whose objective function was related to the efficiency shown by the coded configuration over a concrete set of images, for which the correct position of the contour was known. As a result, the best performance was obtained with the alternative configuration and the following parameters: t = 10 (so we have 14 parameters to represent a contour in the image), P = 2000 individuals, P_m = 0.4, P_c = 0.45 and S = 60%. It is remarkable that the best results are obtained with few modes of variation (with ten modes of variation, around 92% of the variations observed in the training set are explained), though up to 20 were considered. The system described above has the drawback of not assuring a certain rate of success in the search. A similar problem is described in [14], where a method is proposed to guarantee a given probability of success by repeated execution of the GA (repeated GA). In that paper, a method is given to calculate the number N of executions needed to reach a given probability of success P_RGA. In our case, we used this method to calculate the minimum number of executions needed to obtain P_RGA = 0.95, with the result that N must be at least 4 executions.
2.4 Fixing Points to Drive the Search: Method
Even using the repeated GA, results are not as good as expected, so we decided to drive the search in an interactive manner by providing one or two fixed points. In the method described above, we codify the parameters {s, θ, t, b} in the chromosome and, in the process of decoding, we calculate the projection of the contour into the image using those parameters. Let us denote by x = (x_0, y_0, x_1, y_1, \ldots, x_{n-1}, y_{n-1})^T the vector with the points of the contour in the normalized space, and by X = (X_0, Y_0, X_1, Y_1, \ldots, X_{n-1}, Y_{n-1})^T the vector with the points of the contour projected into the image. The vector x can be calculated using equation 1, with the parameters b. In order to obtain the vector X, all the points must be transformed into the image space:

\begin{pmatrix} X_j \\ Y_j \end{pmatrix} = \begin{pmatrix} t_x \\ t_y \end{pmatrix} + \begin{pmatrix} s\cos\theta & -s\sin\theta \\ s\sin\theta & s\cos\theta \end{pmatrix} \begin{pmatrix} x_j \\ y_j \end{pmatrix} = \begin{pmatrix} t_x \\ t_y \end{pmatrix} + \begin{pmatrix} a_x & -a_y \\ a_y & a_x \end{pmatrix} \begin{pmatrix} x_j \\ y_j \end{pmatrix}.   (3)

To include the information of fixed points, our proposal is not to codify some of the parameters into the chromosome, but to calculate them using the equations described above. Therefore, we reduce the dimensionality of the search space, making it easier for the algorithm to find a good solution.

Fixing one or two points. Firstly, consider that we know the position of the point p, (X_p, Y_p). Then we can leave the translation parameters (t_x, t_y) uncoded and calculate them using equations 1 and 3. With equation 1 we calculate the vector x, and applying equation 3 we obtain

X_p = t_x + a_x x_p - a_y y_p, \quad Y_p = t_y + a_y x_p + a_x y_p, \quad \text{and then} \quad t_x = X_p - a_x x_p + a_y y_p, \quad t_y = Y_p - a_y x_p - a_x y_p.   (4)

Consider now that we know the position of the points p and q, (X_p, Y_p) and (X_q, Y_q). In this case we can leave two more parameters uncoded, s and θ, apart from (t_x, t_y). With equation 1 we calculate again the vector x, and applying equation 3 we obtain

X_p = t_x + a_x x_p - a_y y_p, \quad Y_p = t_y + a_y x_p + a_x y_p, \quad X_q = t_x + a_x x_q - a_y y_q, \quad Y_q = t_y + a_y x_q + a_x y_q.   (5)

This is a system of linear equations where t_x, t_y, a_x and a_y are the unknowns, and its resolution is quite straightforward.

Fixing more than two points. We did not consider more than two fixed points for two reasons. First, because if we fix the position of more than two points, we have to leave some of the shape parameters (components of the vector b) uncoded, thus restricting the allowable shapes that can satisfy the fixed points. Second, because results with one or two fixed points were good enough, and considering more than two points would be mathematically complicated, thus slowing down the decoding process of the chromosome.
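The linear systems (4) and (5) are small enough to solve in closed form; the sketch below shows one way to do it with NumPy, under the notation introduced above (a_x = s cos θ, a_y = s sin θ). It is an illustration, not the authors' code.

import numpy as np

def translation_from_one_point(xp, yp, Xp, Yp, ax, ay):
    # Eq. (4): with one fixed point only the translation is recovered;
    # ax and ay stay coded in the chromosome.
    return Xp - ax * xp + ay * yp, Yp - ay * xp - ax * yp

def pose_from_two_points(xp, yp, xq, yq, Xp, Yp, Xq, Yq):
    # Eq. (5): recover (tx, ty, ax, ay) from two model points and their
    # fixed image positions by solving a 4 x 4 linear system.
    A = np.array([[1.0, 0.0, xp, -yp],
                  [0.0, 1.0, yp,  xp],
                  [1.0, 0.0, xq, -yq],
                  [0.0, 1.0, yq,  xq]])
    tx, ty, ax, ay = np.linalg.solve(A, np.array([Xp, Yp, Xq, Yq], dtype=float))
    return tx, ty, ax, ay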
Interactive scheme. We propose to use an approach similar to [6]. First we start running the GA search without fixed points. Then the user decides if the result is acceptable or not. If not, he also has to decide whether to fix two, one or zero points depending on the previous result, fix the points and run the algorithm again. This process can be repeated until an acceptable fitting is reached.
3
Second Stage: Neural Network
Once the contour of the animal in the image has been located and parameterized, a system is needed to provide an assessment as a function of those parameters. Though our contours are characterized by 14 parameters, we discarded four of them (s, θ, t_x and t_y) because they provide information about the position of the animal within the image, but not about its morphology. Hence, we use the vector b as input for a multilayer perceptron [1] with one hidden layer, ten neurons at the input and one neuron at the output, which gives us the value of the assessment. In order to train the network we have a database of images of cows whose morphological assessment made by experts is known. After splitting this set of data into two subsets (80% for training and 20% for validation), the training is carried out using the backpropagation algorithm with momentum, and it stops when the mean squared error (MSE) obtained over the validation set goes through a minimum. Later, we repeat this process by increasing the network size, and we choose the new configuration as optimal if the MSE over the validation set is lower than the previous one.
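A compact stand-in for this second stage is shown below using scikit-learn; the parameter names are scikit-learn's and the particular learning rate is an assumption, so the sketch only mimics the procedure described above (SGD with momentum, validation-based stopping) and is not the network used by the authors.

import numpy as np
from sklearn.neural_network import MLPRegressor

def train_assessor(b_vectors, expert_scores, hidden_units=20, seed=0):
    # One hidden layer, 10 inputs (the vector b), 1 output (the score).
    net = MLPRegressor(hidden_layer_sizes=(hidden_units,),
                       solver='sgd', momentum=0.9, learning_rate_init=0.01,
                       early_stopping=True, validation_fraction=0.2,
                       max_iter=5000, random_state=seed)
    net.fit(np.asarray(b_vectors, dtype=float),
            np.asarray(expert_scores, dtype=float))
    return net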
4
Experiments and Results
We performed experiments to test both stages of our system. On the one hand, we wanted to determine the performance of the image analysis stage. On the other hand, we intended to find out about the ability of the system to reproduce the assessments of the experts.
4.1 First Stage
We tested our method with 79 images (640×480 pixels, taken outdoors using an automatic digital camera) from our database that were not used in the construction of the shape model. Our goal was to test the performance of the system with and without fixed points and to compare the results. In order to evaluate a resulting contour quantitatively, the mean distance between its points and their proper positions has been proposed as a figure of merit. In our case, the distance was estimated by using only seven very significant points of the contour. The points used were {20, 25, 40, 50, 78, 99, 114}, and they are represented in Figure 2. Those points were precisely located for all the images in the database. To test the performance of the method we considered the significant points referred to above and ran the GA search, without fixed points and with all the possible combinations of one and two fixed points (1 + 7 + 21 = 29 cases), over the 79 images. In order to define “success”, we applied a hard limit of ten pixels to the distance, which we consider sufficient for the morphological assessment in which these contours are to be used. As a result, we obtained that without fixed points the rate of success was 61.3%, but with two fixed points, in the best of the cases (fixing points 78 and 114, corresponding to the limbs), the rate of success reached 95.7%. However, fixing points 50 and 99 produces only 58% success. These results indicate that the help of fixed points is very important for the system, but also that the points to be fixed must be well selected by the operator.
4.2 Second Stage: Assessment
To test the morphological assessment system, and considering that we do not have a great amount of data, we have followed a leave-one-out strategy. For each of the 118 images (for which we had the contour extracted) we randomly divided the remaining 117 into two subsets: training (94 images) and validation (23 images). Using these subsets, networks were trained with a number of neurons in the hidden layer between 5 and 40, and we selected the one offering the best results. Finally, the contour from the original image was presented to the network in order to be assessed, and the result was compared with the assessment provided by the expert, to obtain the absolute error.

Table 2. Mean and standard deviation of the absolute value of the error, for the global score and for the individual aspects

Trait   Description                        Mean abs. error   σ
        Global score                       5.662             4.048
1       General aspect and figure          0.666             0.469
2       Development of the body            0.619             0.484
3       Head                               0.768             0.536
4       Neck, chest, withers and back      0.832             0.493
5       Thorax and belly                   0.835             0.530
6       Back and loins                     0.717             0.528
7       Hindquarters and tail              0.881             0.745
8       Thighs and buttocks                0.809             0.659
9       Extremities                        0.671             0.635
10      Shape and quality of the udder     1.235             0.846
This process has been carried out both with the global score and with the individual scores for each of the aspects. In table 2, the mean and the standard deviation of the absolute value of the error are presented, both for the global score and for the individual aspects. Also, in fig. 3 the absolute error for all
the processed images is shown. As we can see, we obtain a mean error of 5.662 points in the global score with a relatively high sigma, indicating that there is high variability in the errors of the assessments. We think there are mainly two causes for these errors. First, we have included the limbs in the contours, but the position of the limbs can change significantly from one image to another of the same (or a different) animal, causing a great difference between the shape parameters of the contours. Also, though many of the traits to be evaluated can be observed in the image of the lateral position, there are others that cannot. The clearest example is trait 10 (shape and quality of the udder), which obtains the highest error, because the udder is not visible in the contour that we have defined.
Fig. 3. Mean absolute error for aspects: the mean absolute error over all the processed images is shown for each of the ten assessed aspects
In any case, our results can be considered quite good and promising, because with only partial input information we obtain a very low error in many cases.
5
Conclusions and Future Research
In this paper, we have presented a system designed to provide a morphological assessment of a cow based on a lateral image of the animal. The system consists of two parts: one dedicated to the analysis of the image and the extraction of the profile of the animal, and the other designed to provide a score given the parameterized contour. As we have shown, the first part performs very well with a little human intervention (fixing points), reaching 95% success. The second stage, based on a neural network, has also given very promising results, taking into account that not all the information required for the morphological assessment can be observed in one view of the animal. In order to improve the method, two lines are being considered at this moment. First, we are considering including the contours of the rear and frontal views
of the cows as inputs to the neural network, complementing the information contained in the lateral view. Also, we intend to eliminate from the profiles all the parts that are affected by a change in the pose of the animal (for instance, the limbs in the lateral position). Acknowledgements. This work has been supported in part by the Junta de Extremadura through projects PDT09A045, PDT09A036 and GRU10018.
References
1. Bishop, C.M., Hinton, G.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford (1995)
2. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models – their training and application. Comp. Vision and Image Understanding 61(1), 38–59 (1995)
3. Davis, L.: Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York (1991)
4. Edmonson, A.J., Lean, I.J., Weaver, L.D., Farver, T., Webster, G.: A body condition scoring chart for Holstein dairy cows. Journal of Dairy Science 72(1), 68–78 (1989)
5. Felkel, P., Mrázek, P., Sýkora, L., Žára, J., Ježek, B.: On segmentation for medical data visualization. In: Slavik, P., van Wijk, J., Felkel, P., Vorlíček, J. (eds.) Proc. of the 7th Eurographics Workshop on Visualization in Scientific Computing, Prague, pp. 189–198 (1996)
6. van Ginneken, B., de Bruijne, M., Loog, M., Viergever, M.: Interactive shape models. In: Sonka, M., Fitzpatrick, J.M. (eds.) Medical Imaging 2003: Image Processing. Proceedings of SPIE, vol. 5032, pp. 1206–1216. SPIE - The International Society for Optical Engineering (May 2003)
7. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading (1989)
8. González, H.M., García, C.J., López, F.J., Macías, M., Gallardo, R.: Segmentation of bovine livestock images using GA and ASM in a two-step approach. Int. Journal of Pattern Recognition and Artificial Intelligence 17(4), 601–616 (2003)
9. González, H.M., García, C.J., Macías, M., López, F.J., Acevedo, M.I.: Neural-networks-based edges selector for boundary extraction problems. Image and Vision Computing 22(13), 1129–1135 (2004)
10. Goyache, F., Del Coz, J.J., Quevedo, J.R., López, S., et al.: Using artificial intelligence to design and implement a morphological assessment system in beef cattle. Animal Science 73, 49–60 (2001)
11. Haupt, R.L., Haupt, S.E.: Practical Genetic Algorithms, 2nd edn. John Wiley, Chichester (2004)
12. Hill, A., Cootes, T.F., Taylor, C.J., Lindley, K.: Medical image interpretation: a generic approach using deformable templates. J. of Med. Informatics 19(1), 47–59 (1994)
13. Hill, A., Taylor, C.J.: Model-based image interpretation using genetic algorithms. Image and Vision Computing 10(5), 295–300 (1992)
14. Yuen, S.Y., Fong, C.K., Lam, H.S.: Guaranteeing the probability of success using repeated runs of genetic algorithm. Image and Vision Computing 19(8), 551–560 (2001)
Quantifying Appearance Retention in Carpets Using Geometrical Local Binary Patterns
Rolando Quinones(1,2,3,*), Sergio A. Orjuela(1,3), Benhur Ortiz-Jaramillo(1,2), Lieva Van Langenhove(2), and Wilfried Philips(1)
(1) Ghent University, Department of Telecommunications and Information Processing (TELIN-IPI-IBBT), St-Pietersnieuwstraat 41, Gent, Belgium
(2) Ghent University, Department of Textiles, Technologiepark 907, 9052 Gent, Belgium
(3) Antonio Narino University, Faculty of Electronic and Biomedical Engineering, Cra. 58A No. 37 - 94, Bogota, Colombia
{rquinone,seraleov,bortiz}@telin.ugent.be, {Lieva.VanLangenhove,Wifried.Philips}@ugent.be, telin.ugent.be
(*) Rolando Quinones is supported by Antonio Narino University (Colombia) in the framework of the WEARTEX project.
Abstract. Quality assessment in carpet manufacturing is performed by humans who evaluate the appearance retention (AR) grade of carpet samples. To quantify the AR grades objectively, different approaches based on computer vision have been developed. Among them, Local Binary Patterns (LBP) and their variations have shown promising results. Nevertheless, the requirements of quality assessment on a wide range of carpets have not been met yet. One of the difficulties is to distinguish between consecutive AR grades in carpets. For this, we adopt an extension of LBP called Geometrical Local Binary Patterns (GLBP) that we recently proposed. The basis of GLBP is to evaluate the grey scale differences between adjacent points defined on a path in a neighbourhood. Symmetries of the paths in the GLBPs are evaluated. The proposed technique is compared with an invariant rotational mirror based LBP technique. The results show that the GLBP technique is better for distinguishing consecutive AR grades in carpets. Keywords: Carpet Wear Assessment, Local Binary Pattern, Texture Inspection, Image Analysis.
1
Introduction
Carpet manufacturers are highly interested in reducing the subjectivity in the current quality assessment method performed by human experts. Carpet quality assessment consists of quantifying the expected appearance of the carpets due to traffic exposure after a predefined time of installation. The traffic exposure is
simulated by a mechanical system which accelerates the wear on samples of new carpet products [1, 2]. The quantification is then assessed by evaluating surface changes following certified standards, where carpets with their original appearance are graded with the number 5 and carpets with a severe overall change are graded with the number 1 [3, 4]. A set of numbers within this range, defined in steps of 0.5, is assigned to grade the appearance changes. These numbers are called ‘Appearance Retention’ (AR) grades. Several studies based on computer vision have aimed to quantify the AR grades objectively [5–9]. They are still not good enough to meet the discrimination of AR grades required by the standards on a sufficiently wide range of carpets [10]. Recently, we have proposed an automatic assessment system based on extracting texture parameters from intensity color and depth (range) images [11]. The range images are obtained using our own scanner, specifically designed for carpets [12]. We have composed our own carpet database following the European standard [3, 4]. While our earlier work showed a definite improvement over the state of the art, it still did not meet the requirements of the standard. Thus, the challenge is to develop an algorithm able to distinguish between texture features corresponding to different AR grades. Algorithms based on Local Binary Pattern techniques have shown promising results towards this goal [9]. Several extensions of LBP have been proposed in other applications for texture classification [13–16]. LBP techniques first describe with binary codes the changes in intensity values around a neighbourhood for each pixel in an image [17]. The binary code of each pixel is defined by thresholding the intensity values of the pixels in the neighbourhood with the intensity value of the evaluated pixel [18, 19]. Assigning code numbers to the binary codes, the texture is statistically characterized by the probability of occurrence of the possible code numbers in the image. Grouping symmetric invariants of binary codes by using points on a circular neighbourhood improves the distinction of similar textures such as those exhibited by consecutive AR grades [9]. Also, relevant changes in appearance are better characterized by computing binary codes at different radii depending on the type of carpet [17]. Therefore, different configurations of LBP techniques, evaluating neighbourhoods at different distances, are necessary to completely describe a particular texture given by a certain carpet type. This results in a high feature dimension, with vectors of sizes equal to the number of neighbouring points used on each circle. Many of these feature vectors contain irrelevant information disturbing the discrimination between similar textures like those given by carpet surface appearance in consecutive AR grades. In our earlier work, the best improvement was achieved using a rotational mirror based LBP technique [9]. In this research we propose to adopt an extension of the LBP technique, called ‘Geometric Local Binary Patterns’ (GLBP), to increase the distinction between consecutive AR grades. This technique improves the performance in discriminating similar textures [20]. The binary codes are computed from a complex neighbourhood defined in a symmetric structure composed of points located on multiple circles with different radii. The intensity changes are explored in the complex
neighbourhood around the pixel. Each pixel is associated with a set of binary code words representing the intensity changes between adjacent points defined on a path in the neighbourhood. Symmetries of the paths in the neighbourhood are evaluated. Thus, the texture is described statistically by the probabilities of occurrence of code numbers in the whole image or in a region of interest. We test the performance of the technique in distinguishing AR grades using range images of textile floor coverings from our carpet database. The GLBP technique is compared with an invariant rotational mirror based LBP technique [9], called symLBP in this paper. The techniques are applied on four sets of carpets from our database. The results show that the GLBP technique correctly distinguishes more textures from consecutive AR grades than symLBP. The results are based on quantifying monotonicity, discriminance and variability in the relation between texture features and AR grades. This paper is organized as follows: we first describe in Section 2 the carpet database to be evaluated by GLBP. Section 3 then presents the proposed GLBP technique. Next, we describe in Section 4 the experiment conducted to evaluate the performance of the GLBP technique. Afterwards, results are reported in Section 5. Finally, conclusions are drawn in Section 6.
2
Materials
For this research, we evaluated the performance of the GLBP technique on range images of carpets to distinguish the different levels of AR grades. We used range images from our database, which represent the texture on the surface of the carpets. The database is composed of images taken from physical samples of carpets following the EN1471:1996 European standard [3]. The surfaces of the carpets have been subjected to accelerated wear using the Vetterman tester [21]. For this research we composed four representative references from the eleven types of carpet references established by the EN1471:1996 European standard. These references have been established based on a combination of characteristics of the carpet construction, such as pile/surface fibre, among others. In each reference, nine degrees of surface degradation are defined. The degrees are specified using AR grades. The AR grades vary from 1 to 5 in steps of half a point, where an AR grade of 1 represents a severe change of a fatigued specimen, whereas an AR grade of 5 represents an original sample not exposed to any traffic exposure mechanism. The database contains range images of a set of carpets with different AR grades. The range images were composed by scanning the carpet samples with the scanner based on structured light described in Orjuela et al. [12]. The surface reconstruction of the range images was based on a wavelet edge detector [22]. Range images represent the digitized 3D structure of the surface of the textile floor coverings. Each pixel in an image represents the depth of the surface in an area of 0.24 mm by 0.24 mm. Cut-outs of range images for transitional changes in appearance from label 1 to 5 are shown in Figure 1.
Fig. 1. Cut-outs of range images for transitional changes in appearance, from wear label 1.0 to 5.0 in steps of 0.5
3
Methods
We define a GLBP structure as a set of points placed on concentric circles with different radii around an evaluated pixel. The points are symmetrically distributed on the circles and their values are calculated by bilinear interpolation [20]. Figure 2a) shows an example with points on three circles with different radii within a 7 × 7 window. We define Γ_(r,N) as a set of N points on a circle of radius r around the evaluated pixel, representing one circular neighbourhood, as follows. Let p = (r, n2π/N) be the polar coordinates of a point n in a circular neighbourhood. A circular neighbourhood is given by the set of points

\Gamma_{(r,N)} = \{ p_n = (r, n\,2\pi/N),\; n = 1, \ldots, N \}.   (1)
Figure 2a) illustrates an example of a neighbourhood with points placed on three circles {Γ_(r1,N1), Γ_(r2,N2), Γ_(r3,N3)}, with corresponding radii defined by {r1, r2, r3} = {0.707, 1.93, 2.97} and corresponding numbers of points on each circle given by {N1, N2, N3} = {4, 8, 12}. In the neighbourhood, a pair of points p_i ∈ Γ_(rn,Nn) and p_j ∈ Γ_(r(n-1),N(n-1)) are called adjacent if p_j is the nearest point of the set Γ_(r(n-1),N(n-1)) to the point p_i. Figure 2b) shows the adjacent points for the example in Figure 2a); the connections of the adjacent points are drawn with arrows. Bits located on the points of the three neighbourhoods represent intensity changes in the image. The bits are calculated by thresholding the corresponding intensities of adjacent points p_i and p_j in the direction of the central pixel as follows. If I_i is the intensity value of the point located at p_i and I_j the intensity value of the adjacent point p_j located on an inner circle, the bit value b_i assigned to the point p_i is computed as

b_i = \begin{cases} 1, & I_i > I_j \\ 0, & \text{otherwise.} \end{cases}   (2)

Figure 2c) shows the bits obtained for the points in Figure 2a). Let a path P = {p_k}, k = 1, \ldots, R, with R the number of circular neighbourhoods, be a set containing pairs of adjacent points, where each pair of points
Fig. 2. Overview of the Geometric Local Binary Pattern technique. a) Points placed on three circles with different radii within a 7 × 7 window. The central pixel as well as the points on the different circular neighbourhoods Γ_r, with r = r_1, r_2, r_3, are identified with different colours. b) Adjacency structure. The connections between adjacent points are drawn with arrows starting from the point that is used as threshold. c) Binary representation.
Fig. 3. a) Paths used in this approach as primitive structures of GLBP. b) Rotation of the path a)I).
has one common point with the next pair. The path P describes the primitive structure of the GLBP. Thus, the structure is completely described by a set of prototype paths under symmetry rules of rotation and complement. There are as many paths as there are points on the exterior circle, N_e. Figure 3 shows the two prototype paths. If b_k represents the bit value assigned to the point p_k ∈ P, the corresponding code c of the intensity changes on the path is computed as follows:

c = \sum_{k=1}^{R} b_k 2^k   (3)
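To make the construction of the code numbers concrete, the following sketch computes the bits of eq. (2) and the code of eq. (3) along one path. It is only an illustration: the path geometry, the bilinear sampling and the handling of image borders are simplified placeholders and do not reproduce the exact prototype paths of the GLBP structure.

```python
import numpy as np

def bilinear_sample(img, y, x):
    # Intensity at a real-valued (y, x) position, by bilinear interpolation of the four surrounding pixels.
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] + (1 - dy) * dx * img[y0, x0 + 1]
            + dy * (1 - dx) * img[y0 + 1, x0] + dy * dx * img[y0 + 1, x0 + 1])

def path_code(img, cy, cx, path):
    # Code c = sum_k b_k 2^k (eq. 3) for one path of points around the pixel (cy, cx).
    # `path` lists (radius, angle) pairs ordered from the inner circle outwards; each point is
    # thresholded against its adjacent point on the next inner circle, the innermost one against
    # the central pixel (eq. 2).
    code = 0
    inner_intensity = float(img[cy, cx])
    for k, (r, theta) in enumerate(path, start=1):
        y, x = cy + r * np.sin(theta), cx + r * np.cos(theta)
        intensity = bilinear_sample(img, y, x)
        bit = 1 if intensity > inner_intensity else 0
        code += bit * 2 ** k
        inner_intensity = intensity  # this point becomes the inner neighbour of the next one on the path
    return code

# toy usage: a hypothetical path crossing the three circles of Figure 2 inside a 7 x 7 window
img = np.random.randint(0, 256, (7, 7)).astype(float)
example_path = [(0.707, np.pi / 4), (1.93, np.pi / 4), (2.97, np.pi / 4)]
print(path_code(img, 3, 3, example_path))
```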
The local texture around a pixel is described with the probability of occurrence of the code numbers resulting from all detected paths around the pixel. Thus, the texture description is obtained by the union of paths describing structures called the GLBP structures. Any combination of the primitive structures results in a predefined GLBP structure with rotational and mirror symmetry invariance. The texture in an image is characterized statistically by the probabilities of
occurrence of each code number c computed with the bits on the points from the GLBP structure, accumulated into one histogram h(c). We use the Kullback-Leibler divergence to quantify the difference between the histograms H_1(c) and H_2(c) of the GLBPs of a non-worn sample (AR grade 5) and a worn sample (AR grade between 1 and 4.5). We denote by i = 1, 2 the index of the histograms H_1(c) and H_2(c) (non-worn and worn) and by b = 1, \ldots, B the index of a bin of a histogram H(c). With this notation, the difference between both histograms is quantified using the symmetric adaptation of the Kullback-Leibler divergence, termed κ, in eq. (4):

\kappa = \sum_{i=1}^{2} \sum_{b=1}^{B} H(i,b) \log H(i,b) - \sum_{b=1}^{B} H_p(b) \log H_p(b), \qquad H_p(b) = \sum_{i=1}^{2} H(i,b)   (4)
One κ value is obtained for each comparison of two textures.
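A minimal sketch of the comparison of two GLBP histograms with the symmetric divergence of eq. (4) is given below (an illustration only; the bin layout, the normalisation of the histograms and the handling of empty bins are our own assumptions, not part of the original method).

```python
import numpy as np

def kappa(h1, h2, eps=1e-12):
    # Symmetric Kullback-Leibler-style divergence of eq. (4).
    # h1, h2: 1-D arrays with the GLBP histograms of the non-worn and worn samples,
    # assumed normalised so that each sums to 1; the small eps implements the 0*log(0) = 0 convention.
    H = np.vstack([h1, h2]).astype(float)   # H[i, b], i = 1, 2
    Hp = H.sum(axis=0)                      # H_p(b) = sum_i H(i, b)
    term1 = np.sum(H * np.log(H + eps))     # sum_i sum_b H(i, b) log H(i, b)
    term2 = np.sum(Hp * np.log(Hp + eps))   # sum_b H_p(b) log H_p(b)
    return term1 - term2

# toy usage with two random 32-bin histograms
rng = np.random.default_rng(0)
a, b = rng.random(32), rng.random(32)
print(kappa(a / a.sum(), b / b.sum()))
```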
4 Experiment
To evaluate the GLBP, we use four sets of carpets with their AR grades from the range images described in Section 2, based on the European carpet-appearance standard EN1471. The sets correspond to the carpet types loop pile (Carpet 1), cut pile (Carpet 2), shaggy (Carpet 3) and loop/cut pile (Carpet 4). The range images are evaluated by GLBP based on their κ values. κ values are computed by comparing textures of samples with appearance changes corresponding to AR grades against samples with the original appearance. Finally, we compare the performance of GLBP against a rotation- and mirror-invariant LBP technique (symLBP) in distinguishing between consecutive AR grades. The assumptions used to evaluate the performance of GLBP and symLBP for distinguishing textures related to AR grades are:
– the mean values of the texture features corresponding to AR grades change in order with respect to the AR grades;
– the texture features corresponding to consecutive AR grades are well discriminated.
These two characteristics are based on the principles of monotonicity and discriminance. Additionally, we measure the variability of the relation between the κ values and the AR grades. These three measures are described as follows; a computational sketch of the three measures is given after the list.
1. The monotonicity (M) evaluates how well the order of the extracted features is preserved with respect to the AR grades. It is computed using the Spearman rank correlation, termed ρ [23]. To compute the Spearman rank correlation, the texture features must first be ordered from small to large; ρ is then computed using eq. (5).
\rho = 1 - \frac{K \sum_{g=1}^{G} d_g^2}{G(G^2 - 1)}   (5)

where d_g is the difference between an assigned rank and an expected rank, g is the index over the G differences, g = 1, \ldots, G, and K = 6 is the constant defined by Spearman.
2. The discriminance (D) evaluates the efficiency in distinguishing between consecutive AR grades. The discriminance is calculated based on the Tukey test [23] by eq. (6) and indicates whether there is a statistically significant difference in the means of the κ values between consecutive AR grades. Defining F as the texture features and S as the number of texture features per grade, with s = 1, \ldots, S, the statistical significance threshold is computed as follows:

\varsigma = q_{(\alpha, G, SG - G)} \sqrt{ \frac{ \sum_{g=1}^{G} \sum_{s=1}^{S} (F_{gs} - \mu_g)^2 }{ (SG - G)\, S } }   (6)

where q_{(\alpha, G, SG - G)} is obtained from the studentized range distribution at a confidence level of 100(1 − α) and μ_g is the mean value of the texture features associated with the AR grade g. Discriminance is finally computed as the number of times that eq. (7) is satisfied:

(\mu_{g+1} - \mu_g) - \varsigma > 0   (7)
3. The variability (V) defines how well the total variation in the AR grades can be explained by the linear relation between the κ values and the AR grades. The adjusted coefficient of determination, termed R_a^2, is used [23] to quantify the variability. R_a^2 is defined as:

R_a^2 = 1 - \frac{n - 1}{n - p - 1} \cdot \frac{ \sum_i (y_i - \hat{y}_i)^2 }{ \sum_i (y_i - \bar{y})^2 }   (8)

where y_i is an AR grade computed from the features, \hat{y}_i is the estimated AR grade assessed by humans and \bar{y} is the mean of the y_i values. p is the total number of parameters in the linear model y = α + βκ and n is the number of κ values per sample.
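The sketch announced above illustrates one way to compute the three evaluation measures. It is not the authors' implementation: the ordering of the grades, the shape of the feature array and the studentized-range quantile q (supplied externally, e.g. from statistical tables) are assumptions made for the example.

```python
import numpy as np
from scipy.stats import spearmanr

def monotonicity(mean_kappa_per_grade):
    # Spearman rank correlation between the ordered AR grades and the mean kappa values (eq. 5).
    grades = np.arange(len(mean_kappa_per_grade))
    rho, _ = spearmanr(grades, mean_kappa_per_grade)
    return rho

def discriminance(features, q):
    # Number of consecutive grade pairs whose mean difference exceeds the Tukey threshold (eqs. 6-7).
    # features: array of shape (G, S) with S kappa values per AR grade (grades in increasing order);
    # q: studentized-range quantile q_(alpha, G, SG-G), a value taken from tables (hypothetical here).
    G, S = features.shape
    mu = features.mean(axis=1)
    sse = np.sum((features - mu[:, None]) ** 2)         # within-grade sum of squares
    threshold = q * np.sqrt(sse / ((S * G - G) * S))    # eq. (6)
    return int(np.sum((mu[1:] - mu[:-1]) - threshold > 0))

def variability(y, y_hat, p=1):
    # Adjusted coefficient of determination R_a^2 of eq. (8) for the linear model y = alpha + beta*kappa.
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - p - 1)

# toy usage with hypothetical numbers: 8 AR grades, 5 kappa values each, q = 4.3 as a placeholder
rng = np.random.default_rng(1)
F = np.sort(rng.random((8, 5)), axis=0)
print(monotonicity(F.mean(axis=1)), discriminance(F, q=4.3), variability([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))
```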
5 Results
We have compared the performance of the GLBP and the symLBP techniques in discriminating AR grades using samples from four carpet types of the European standard. We found that discriminance, monotonicity and variability in the description of AR grades improved in the GLBP method. The results are listed
Table 1. Discriminance (D), Monotonicity (M) and Variability (V) to distinguish AR grades in carpets using symLBP and GLBP

                    symLBP                    GLBP
           D     M      V           D     M      V
Carpet 1   6     0.98   0.80        7     1.00   0.89
Carpet 2   6     0.93   0.79        6     0.93   0.80
Carpet 3   5     0.98   0.80        6     1.00   0.83
Carpet 4   7     1.00   0.88        7     1.00   0.90
in Table 1. Table 1 shows the measurements of discriminance, monotonicity and variability obtained by evaluating the symLBP and GLBP techniques for the 4 carpets with their respective AR grades. The results show that the discriminance between consecutive pairs of AR grades for the carpet types loop pile (Carpet 1: 6 to 7) and shaggy (Carpet 3: 5 to 6) increased using the GLBP technique, achieving a correct description of AR grades based on their κ values (M = 1). For all carpets the variability increased (from 0.81 to 0.86 on average for the four carpets), indicating an improved linear relation between the κ values and the AR grades. Figure 4 illustrates the relation between AR grades and κ values for the loop pile carpet using both techniques. In Figure 4, the κ values corresponding to AR grades are displayed as box plots, where the centre of each box is the mean of the κ values and the two lines at the top and bottom of each box represent its standard deviation.
Fig. 4. Comparison between the symLBP and GLBP techniques for distinguishing consecutive AR grades for the loop pile carpet: κ values (KL-values) are plotted against the appearance retention grades
Figure 4 shows that the monotonicity of GLBP is better than the monotonicity of symLBP. In Figure 4, the highlighted area shows how a correct distinction between AR grades 2.5 and 3 is achieved using the GLBP technique, while the symLBP technique confuses both AR grades.
6 Conclusions
A new GLBP-based structure for distinguishing AR grades in carpets was presented. The technique has been tested on range images from four sets of carpets, computing texture features using the Kullback-Leibler divergence on histograms representing the frequency of the symmetric patterns. We evaluated the performance of the GLBP structure and compared it with symLBP for distinguishing consecutive AR grades. The results show that the GLBP technique improves the performance for distinguishing consecutive AR grades on loop pile, shaggy and cut/loop pile carpets.
References 1. The European Standard. Textile floor coverings. classification of machine-made pile rugs and runners. Textiles Floor Coverings, Standard BS EN 14215 (June 2003) 2. American Society for Testing and Materials. Annual book of astm standards 2010, section 14. general methods and instrumentation (2010) 3. European Committee for standardization. Constructional details of types of textile floor covering available as reference fatigued specimens. Standard EN1471 (1996) 4. The Carpet and Rug Institute. Cri test method - 101: Assessment of carpet surface appearance change using the cri reference scales. Technical Bulletin (2003) 5. Siew, L.H., Hodgson, R.M., Wood, E.J.: Texture measures for carpet wear assessment. IEEE Transactions on Pattern Analysis and Machine Intelligence 10, 92–105 (1988) 6. Wood, E., Hofgson, R.: Carpet texture measurement using image analysis. Textile Research Journal 59, 1–12 (1989) 7. Wu, Y., Pourdeyhimi, B., Spivak, S.M.: Texture evaluation of carpets using image analysis. Textile Research Journal 61, 407–419 (1991) 8. Waegeman, W., Cottyn, J., Wyns, B., Boullart, L., De Baets, B., Van Langenhove, L., Detand, J.: Classifying carpets based on laser scanner data. Engineering Applications of Artificial Intelligence 21(6), 907–918 (2008) 9. Orjuela, S.A., Vansteenkiste, E., Rooms, F., De Meulemeester, S., De Keyser, R., Philips, W.: Evaluation of the wear label description in carpets by using local binary pattern techniques. Textile Research Journal 80(20), 2132–2143 (2010) 10. Van Dale, D., De Meulemeester, S.: Annual report of department of textiles. Technical report, Ghent University (2006) 11. Orjuela, S.A., Vansteenkiste, E., Rooms, F., De Meulemeester, S., De Keyser, R., Philips, W.: Automated wear label assessment in carpets by using local binary pattern statistics on depth and intensity images. In: Proc. of IEEE ANDESCON, pp. 1–5 (2010) 12. Orjuela, S.A., Vansteenkiste, E., Rooms, F., De Meulemeester, S., De Keyser, R., Philips, W.: Feature extraction of the wear label of carpets by using a novel 3d scanner. In: Proc. of the Optics, Photonics and Digital Technologies for Multimedia Applications Conference (2010) 13. Ojala, T., Pietikainen, M., Harwoodm, D.: A comparative study of texture measures with classification based on featured distributions. Pattern Recognition 29, 51–59 (1996)
14. Ojala, T., Pietikainen, M., Harwoodm, D.: Texture discrimination with multidimensional distributions of signed gray-level differences. Pattern Recognition 34, 727–739 (2001) 15. Guo, Z., Zhang, L., Zhang, D.: Rotation invariant texture classification using lbp variance (lbpv) with global matching. Pattern Recognition 43, 706–719 (2010) 16. Varma, M., Zisserman, A.: A statistical approach to texture classification from single images. International Journal of Computer Vision 62 (1-2), 68–81 (2005) 17. Ojala, T., Pietikainen, M.: A comparative study of texture measures with classification based on feature distributions. Pattern Recognition 29(1), 51–59 (1996) 18. Ojala, T., Pietikainen, M., Menp, T.: Multiresolution gray scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002) 19. Ojala, M., Pietikainen, T., Xu, Z.: Rotation-invariant texture classification using feature distributions. Pattern Recognition 33(1), 43–52 (2000) 20. Orjuela, S.A., Rooms, F., Keyser, R.D., Philips, W.: Geometric local binary pattern, a new approach to analyse texture in images. In: Books of Abstracts: The 2010 International Conference on Topology and its Applications, pp. 179–181 (June 2010) 21. International Organization for Standarization. Textile floor coverings. production of changes in appearance by means of vettermann drum and hexapod tumbler testers. ISO 10361:2000 (2000) 22. Orjuela, S.A., Ortiz, B., De Meulemeester, S.J., Garcia, C., Rooms, F., Pizurica, A., Philips, W.: Surface reconstruction of wear in carpets by using a wavelet edge detector. In: Blanc-Talon, J., Bone, D., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2010, Part I. LNCS, vol. 6474, pp. 309–320. Springer, Heidelberg (2010) 23. Kutner, M., Nachtsheim, C.J., Neter, J., Li, W. : Applied Linear Statistical Models, 5th edn. McGraw-Hill/Irwin (2004)
Enhancing the Texture Attribute with Partial Differential Equations: A Case of Study with Gabor Filters

Bruno Brandoli Machado1, Wesley Nunes Gonçalves2, and Odemir Martinez Bruno1,2

1 Institute of Mathematical Sciences and Computing (ICMC)
2 Physics Institute of São Carlos (IFSC)
University of São Paulo (USP)
Av. Trabalhador São-carlense, 400, 13560-970 São Carlos, SP - Brazil
[email protected], [email protected], [email protected]
Abstract. Texture is an important visual attribute used to discriminate images. Although statistical features have been successful, texture descriptors do not capture the richness of details present in the images. In this paper we propose a novel approach for texture analysis based on partial differential equations (PDE) of Perona and Malik. Basically, an input image f is decomposed into two components f = u + v, where u represents the cartoon component and v represents the textural component. We show how this procedure can be employed to enhance the texture attribute. Based on the enhanced texture information, Gabor filters are applied in order to compose a feature vector. Experiments on two benchmark datasets demonstrate the superior performance of our approach with an improvement of almost 6%. The results strongly suggest that the proposed approach can be successfully combined with different methods of texture analysis. Keywords: Texture analysis, anisotropic diffusion, texture modeling.
1 Introduction
Texture plays an important role in image analysis systems. Applications with textures are found in different areas, including aiding diagnoses in medical images [1] and remote sensing [2]. Though texture is easily perceived by humans, it has no precise definition due to its spatial variation. In addition, physical surface properties produce distinct texture patterns. The lack of a formal definition of texture is reflected in the large number of methods for texture analysis. Several methods for texture recognition have been proposed in the literature [3]. They are based on statistical analysis of the spatial distribution (e.g., co-occurrence matrices [4] and local binary patterns [5]), stochastic models (e.g., Markov random fields [6,7]), spectral analysis (e.g., Fourier descriptors [8],
Gabor filters [9] and wavelets transform [10]), structural models (e.g., mathematical morphology [11] and geometrical analysis [12]), complexity analysis (e.g., fractal dimension [13,14]), and agent-based models (e.g., deterministic tourist walk [15]). Although there are effective texture methods, few papers are concerned with enhancing the richness of the texture attribute before computing features. Inspired by biological vision studies, the computer vision community has also shown great interest in representing images using multiple scales. The basic idea is to decompose the original image into a family of derived images [16,17]. The decomposition is obtained by convolving the original image with an image operator; a simple way, for example, is to employ Gaussian kernels. Although Gaussian filtering satisfies the heat equation, its derivatives cause spatial distortion in region boundaries. This implies that the diffusion process acts equally in all directions, that is, the diffusion is linear or isotropic. On the other hand, Perona and Malik [18] formulated a new concept that modified the linear scale-space paradigm to smooth regions while preserving edges. In this paper, we propose a novel framework to enhance the texture attribute that works as a pre-processing step for texture analysis. Here, we decompose an input image using the anisotropic diffusion of Perona and Malik before the feature extraction task. The anisotropic diffusion process is mathematically modelled by partial differential equations (PDEs). The decomposition is applied to extract the texture component, obtained by the difference between the original image and cartoon approximations. Then, Gabor filters are used to extract features from the texture component, which presents more enhanced structures. The main contributions of this paper are:
– We establish a framework to enhance the texture attribute by means of partial differential equations.
– We compare the performance of our method together with the performance of a traditional texture analysis method.
– We conduct a comparative parameter evaluation.
The remainder of this paper is organized as follows. Section 2 presents background information on non-linear diffusion and Gabor filters. Section 3 details our approach to texture analysis. Section 4 presents the results of the experiments performed on two benchmark texture datasets. Finally, conclusions and directions for future research are given in Section 5.
2 Background
In this section, the anisotropic diffusion and Gabor filters are briefly discussed to provide motivation and background for the proposed approach.

2.1 Anisotropic Diffusion
Scale-space theory has been investigated for representing image structures at multiple scales. The idea is to decompose the initial image into a family of derived images. According to [17] and [19], a family of derived images may be
viewed as the solution of the heat equation and described using partial differential equations (PDEs). The successful use of PDEs in image analysis is attributed to their power to model many dynamic phenomena, including diffusion. A new paradigm of nonlinear PDEs for image enhancement was introduced by Perona and Malik [18]. Their formulation, called anisotropic diffusion, uses a nonlinear scheme that smooths images by creating cartoon approximations, while the region boundaries remain sharp. Formally, the discrete formulation of Perona-Malik is defined as:

I^{t+1}_{i,j} = I^{t}_{i,j} + \lambda \, [\, c_N \nabla_N I + c_S \nabla_S I + c_E \nabla_E I + c_W \nabla_W I \,]^{t}_{i,j}   (1)

where 0 ≤ λ ≤ 1/4 is a scalar that controls the numerical stability, ∇I is the gradient magnitude, c is a constant value for the conduction coefficient, and N, S, E and W are the mnemonic subscripts for North, South, East and West. The PDE equation of Perona and Malik can be rewritten as:

I^{t+1}_s = I^{t}_s + \frac{\lambda}{|\xi_s|} \sum_{\rho \in \xi_s} g(\nabla I_{s,\rho}) \, \nabla I_{s,\rho}   (2)

where I^{t}_s is the cartoon approximation, t denotes the number of iterations, s denotes the spatial position of each pixel, ξ_s represents the neighbourhood of pixel s (with |ξ_s| its number of neighbours), and g(∇I) is the conduction function. The magnitude of the gradient is calculated by approximating its norm in a particular direction as follows:

\nabla I_{s,\rho} = I_\rho - I^{t}_s, \quad \rho \in \xi_s   (3)

Perona and Malik proposed two diffusion functions:

g(\|\nabla I\|) = e^{-(\|\nabla I\|/K)^2}   (4)

and

g(\|\nabla I\|) = \frac{1}{1 + (\|\nabla I\|/K)^2}   (5)
The parameter K controls the heat conduction. The first equation favours high contrast edges over low contrast ones, while the latter favours wide regions over smaller ones. Although Perona and Malik proposed two different functions, the smoothed images are quite similar. Figure 1 shows the results of a texture decomposition with anisotropic diffusion. The first row of (b)-(d) shows the family of cartoon approximations derived from an input image I0. We can notice that the content is gradually smoothed. The distribution of heat corresponds to the grey values on the z-axis of the images and the diffusion time is represented by the number of iterations t. For different scales t we obtain different levels of smoothing.
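A compact numerical sketch of the discrete update of eq. (2), with the four-neighbour gradients of eq. (1) and the exponential conduction function of eq. (4), could look as follows. It is illustrative only; the periodic border handling and the parameter values are our simplifications.

```python
import numpy as np

def perona_malik(img, n_iter=40, lam=0.25, K=15.0):
    # Discrete Perona-Malik anisotropic diffusion (eqs. 1-4).
    # Returns the cartoon approximation u after n_iter iterations; the texture component is v = img - u.
    u = img.astype(float).copy()
    for _ in range(n_iter):
        # gradients towards the four neighbours (np.roll gives periodic borders, chosen for brevity)
        north = np.roll(u, -1, axis=0) - u
        south = np.roll(u, 1, axis=0) - u
        east = np.roll(u, -1, axis=1) - u
        west = np.roll(u, 1, axis=1) - u
        # conduction coefficients g(|grad|) = exp(-(|grad|/K)^2), eq. (4)
        cN, cS = np.exp(-(north / K) ** 2), np.exp(-(south / K) ** 2)
        cE, cW = np.exp(-(east / K) ** 2), np.exp(-(west / K) ** 2)
        u = u + lam * (cN * north + cS * south + cE * east + cW * west)
    return u

# toy usage: decompose a random "image" into cartoon (u) and texture (v) components
img = np.random.randint(0, 256, (64, 64)).astype(float)
u = perona_malik(img, n_iter=40, lam=0.25)
v = img - u
```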
2.2 Gabor Filters
Gabor filters have been widely used in texture analysis. A Gabor function is a sinusoidal plane wave modulated by a Gaussian [9,20]. The filters used
Fig. 1. The basic idea of the scale-space representation is to create a family of cartoon approximations. This figure shows an input image I0 (a) that has been successively smoothed with anisotropic diffusion. The family of derived images may be viewed in two rows of (b-c-d).
in image decomposition are created from a "mother" Gabor function of two dimensions, defined in the space domain g(x, y) and the frequency domain G(x, y). Given the "mother" function, a bank of Gabor filters can be obtained in the g(x, y) space domain by operations of dilation and rotation. Initially, the Gabor technique generates a filter bank g_{mn}(x, y) for different scale parameters m = 1, \ldots, K and orientation parameters n = 1, \ldots, S. Texture features are computed by convolving the original image I with the Gabor filter bank, as depicted in Equation (6). By tuning the values of m and n, some aspects of the image's underlying texture structure can be captured. In this work, a total of 40 Gabor features have been computed (8 orientations and 5 scales).

c_{mn}(x, y) = I(x, y) * g_{mn}(x, y)   (6)
The feature vector ψ = [E_{11}, E_{12}, \ldots, E_{KS}] is finally obtained by computing the energy of the filtered images according to Equation (7):

E_{mn} = \sum_{x,y} [c_{mn}(x, y)]^2   (7)
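As an illustration of eqs. (6)-(7), the following sketch builds a small Gabor filter bank and computes the energy features. The kernel parameterisation (wavelengths, bandwidth, kernel size) is one common choice and not necessarily the one used by the authors.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    # Real part of a 2-D Gabor kernel: a sinusoidal plane wave modulated by a Gaussian envelope.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

def gabor_energy_features(img, n_scales=5, n_orient=8):
    # Energy of the filtered images, eq. (7): E_mn = sum_{x,y} c_mn(x, y)^2.
    feats = []
    for m in range(n_scales):
        wavelength = 4.0 * (2 ** m)          # illustrative dyadic scales
        sigma = 0.56 * wavelength
        for n in range(n_orient):
            theta = n * np.pi / n_orient
            g = gabor_kernel(31, wavelength, theta, sigma)
            c = fftconvolve(img, g, mode='same')   # eq. (6): c_mn = I * g_mn
            feats.append(np.sum(c ** 2))
    return np.array(feats)                   # 40-dimensional vector for 5 scales x 8 orientations

img = np.random.rand(128, 128)
print(gabor_energy_features(img).shape)
```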
3 An Approach to Texture Analysis
A widely used strategy to compute texture features using Gabor filters is to construct a bank of filters with different scales and orientations. Statistical measures, such as energy and entropy, are extracted from each Gabor space. Instead of obtaining the Gabor space directly, we first decompose each image by means of
the anisotropic diffusion of Perona and Malik. Handling the multi-scale nature of images enables us to enhance structures perceived at certain levels of decomposition. Anisotropic diffusion works by smoothing images at different scales, while preserving structures that are important in image analysis, including edges and T-junctions. After decomposing an image, we obtain two components: a cartoon component (u) and a texture component (v). The texture component is obtained by subtracting the cartoon approximation from the original image. Figure 2 shows an example of image decomposition using the anisotropic diffusion of Perona and Malik.
Fig. 2. An example of image decomposition applied to the Barbara image (a). At each level of decomposition, a cartoon approximation (b) is generated. The texture component v (c) is then obtained by subtracting the cartoon approximation from the original image. This enhances the texture in the image and produces a richer representation.
This filtering process overcomes the main restriction imposed by linear techniques, namely the spatial distortion (blur) at region boundaries of the image. We then focus our attention on the enhanced texture component (v). It is used to extract Gabor features and is useful for a variety of tasks, for example, texture classification. The diagram of Figure 3 summarizes the approach proposed here.
Fig. 3. Proposed approach for texture analysis
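Putting the pieces together, the pipeline of Figure 3 can be sketched as below. It simply composes the perona_malik() and gabor_energy_features() sketches given in Section 2 (both of which are our own illustrative helpers, assumed to be in scope), so it is a schematic of the proposed approach rather than the authors' code.

```python
import numpy as np

def enhanced_texture_features(img, t=40, lam=0.25):
    # Pipeline of Fig. 3: anisotropic diffusion -> texture component v = f - u -> Gabor energy features.
    u = perona_malik(img, n_iter=t, lam=lam)   # cartoon approximation u (Section 2.1 sketch)
    v = img - u                                # enhanced texture component
    return gabor_energy_features(v)            # 40-dimensional Gabor feature vector (Section 2.2 sketch)

features = enhanced_texture_features(np.random.rand(128, 128) * 255.0)
print(features.shape)
```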
4 Experimental Results
In order to evaluate our approach, experiments are carried out on two image datasets. We first describe the datasets, and then, results are shown.
4.1 Datasets
The Brodatz album [21] is the best-known benchmark for evaluating texture methods. Each class is composed of one image divided into nine new non-overlapping samples. These images have 200 × 200 pixels with 256 grey levels. A total of 100 texture classes with 10 images per class was used. One example of each class is shown in Figure 4.
Fig. 4. One sample of each class of the Brodatz album
Although largely used, the Brodatz dataset is an old benchmark that lacks characteristics such as illumination, viewpoint and scale variation. Thus, in order to analyse these features, we also used the Vistex dataset in the experiments. The Vision Texture dataset [22] (or Vistex) contains a large set of natural colorful textures taken under several scale and illumination conditions. In addition, the images are acquired with different cameras. For this dataset we use a total of 50 texture classes in grey scale. The size of the original images was 512 × 512, but we use the same number of samples as [23]. Each texture was split into 128 × 128 pixel images, with 16 sub-samples per class, resulting in 800 images. Figure 5 shows one example of each texture.
Fig. 5. Image examples of the Vistex dataset
4.2 Performance Evaluation
In the following experiments, we calculated the energy of Gabor filters with 8 orientations and 5 scales, resulting in a feature vector of 40 dimensions.
The K nearest-neighbor (K-NN) classifier was adopted, since it is a simple method and in this way it can better highlight the performance of the method. An initial value of K = 5 was set, using 10-fold cross validation and the Euclidean distance measure. We optimized two parameters of the anisotropic diffusion process: the level of decomposition t (scales) and the parameter λ that controls the stability of the diffusion. The decomposition ranges from 10 to 200 and the value of λ ranges from 0.05 to 0.25. Experiment 1: First, we evaluated our approach on the Brodatz dataset and compared it to the traditional Gabor features. Features are calculated with different levels of decomposition t and values of stability λ. Figure 6(a) shows the classification rates on the y axis, while decompositions are indicated on the x axis. It can be observed that the texture component (v), decomposed using our approach, performed better than the traditional Gabor method. The highest classification rate (t = 40) was 94.29% for the texture component (v) and 91% for the traditional Gabor method. Experimental results also showed that the classification rate of the cartoon approximations (u) decreased as the level of decomposition increased. This result is related to the image smoothing, which degrades the texture attribute. We also evaluated our approach for 5 distinct values of the parameter λ (from 0.05 to 0.25), taking t = 40 as the best value for the decomposition. The results are shown in Figure 6(b). Note that the improved performance for the texture component (v) confirms that the largest stability value achieves better performance. Experiment 2: In this experiment we evaluated our approach on the Vistex dataset. The setting for this experiment was the same as the previous one. In Figure 7(a), the classification rates are presented on the y axis, while the decompositions are presented on the x axis. Our approach achieved the best performance of 88.96% (t = 140) against 83.66% for the traditional Gabor method. It is worth noting that the classification rates for the cartoon component (u) reduced at each iteration. This is associated with the gradual smoothing of the image. We also evaluated the lambda parameter in Figure 7(b), taking t = 140 as the best value for the level of decomposition. As in the previous experiment, the highest classification rate of 88.96% was obtained for λ = 0.25. Table 1 presents the average and standard deviation of the classification rates. It also shows results for K = {3,5,7,9} on the original image, cartoon approximation (u) and texture component (v). As we can see, our approach using the texture component (v) outperformed the others for all values of K on both datasets. Interesting results came out of the cartoon approximation experiments, which is the component discarded in the proposed approach. Classification rates of 67.80% and 31.16% were obtained on the Brodatz dataset and the Vistex dataset, respectively. This clearly shows the poor classification power of the cartoon approximation. To illustrate the potential of our approach, we compared it with three representative operators used for filtering edges: Gaussian, Laplacian and Laplacian of Gaussian (LoG) (we refer to [24] for more details). For all operators, the same procedure as in the proposed approach was performed. In this setting, our approach achieved the highest classification rates for all values of K on both datasets. For
Fig. 6. Comparison of different scales (a) and stability (b) on the Brodatz dataset
the Brodatz dataset, an improvement of 3.31% compared to the Gaussian operator was obtained using K = 9. On the Vistex dataset with K = 3, our approach achieved a classification rate of 88.96%, which is significantly better than the classification rate of 83.63% achieved by the LoG operator. Experimental results demonstrate that our approach is an effective representation for texture modeling. To give a complete picture of the functioning of our approach, we conclude this section by reporting the CPU time on both datasets. The experiments were implemented in the Matlab environment on Linux with a 2.10 GHz Intel Core Duo CPU. In Figure 8, the levels of decomposition (t) are presented on the x axis. Running times in seconds are presented on the y axis on a logarithmic scale. For the traditional Gabor method, the CPU time corresponds to feature description,
Fig. 7. Comparison of different scales (a) and stability (b) on the Vistex dataset

Table 1. Comparison of different values of nearest neighbors on both datasets

Dataset  Component      %(3-NN)        %(5-NN)        %(7-NN)        %(9-NN)
Brodatz  Traditional    92.53 ± 2.52   91.00 ± 2.66   89.04 ± 2.42   87.87 ± 2.53
         Cartoon (u)    71.61 ± 3.53   70.06 ± 3.77   68.96 ± 3.54   67.80 ± 3.54
         Texture (v)    94.88 ± 2.23   94.29 ± 2.15   92.87 ± 2.26   92.40 ± 2.55
         gain (v − f)   2.35           3.29           3.83           4.53
Vistex   Traditional    84.71 ± 3.33   83.66 ± 3.22   83.10 ± 2.88   81.59 ± 3.52
         Cartoon (u)    31.85 ± 3.93   32.95 ± 4.49   32.35 ± 4.05   31.16 ± 3.91
         Texture (v)    89.21 ± 2.84   88.96 ± 3.07   86.65 ± 3.17   84.46 ± 3.30
         gain (v − f)   4.50           5.30           3.55           2.87
Table 2. Comparison of different image operators on both datasets

Dataset  Operator + Gabor  %(3-NN)        %(5-NN)        %(7-NN)        %(9-NN)
Brodatz  Gaussian          92.73 ± 2.54   91.76 ± 2.61   90.22 ± 2.71   89.09 ± 3.08
         Laplacian         91.17 ± 2.26   89.49 ± 2.52   87.93 ± 2.64   85.56 ± 2.92
         LoG               92.78 ± 2.09   90.42 ± 2.61   89.45 ± 2.45   87.19 ± 2.74
         Our approach      94.88 ± 2.23   94.29 ± 2.15   92.87 ± 2.26   92.40 ± 2.55
Vistex   Gaussian          85.14 ± 3.29   82.75 ± 3.69   81.72 ± 3.62   80.74 ± 3.55
         Laplacian         84.56 ± 3.09   82.71 ± 3.40   81.05 ± 3.52   79.45 ± 3.43
         LoG               85.24 ± 3.20   83.63 ± 3.64   82.11 ± 3.65   80.85 ± 3.52
         Our approach      89.21 ± 2.84   88.96 ± 3.07   86.65 ± 3.17   84.46 ± 3.30
Fig. 8. Running time comparison for both datasets
while our approach computes the time of anisotropic diffusion added to the feature description time. It was found that our approach spends an additional time of 0.3s for t = 40 on the Brodatz dataset. For the Vistex dataset, an extra time of 0.43s for t = 140 was observed compared with the traditional Gabor method.
5 Conclusions
In this paper we have presented a novel approach based on PDE of Perona and Malik for texture classification. It can also be considered as a pre-processing to enhance the richness of the texture attribute. We have demonstrated how the Gabor process can be improved by using the methodology proposed. Although traditional methods of texture analysis have provided satisfactory results, the approach proposed here has proved to be superior for characterizing textures. In the experiments we have used two image datasets widely accepted for texture classification. Experiments on the Brodatz dataset indicate that our approach improves classification rate from 87.87% to 92.40% over the traditional method. On the Vistex dataset, results demonstrated that the computed descriptors provide a good quality of discrimination with an improvement of 5.30%. The results support the idea that our approach can be used as a feasible step for
many texture classification systems. In addition, it can be easily adapted to a wide range of texture methods - e.g. from Gabor filters to Markov random fields. As part of the future work, we plan to focus on investigating different non-linear PDEs and texture analysis methods. Acknowledgments. BBM was supported by CNPq. WNG was supported by FAPESP grant 2010/08614-0. OMB was supported by CNPq grants 306628/2007-4 and 484474/2007-3.
References 1. Cheng, H.D., Shan, J., Ju, W., Guo, Y., Zhang, L.: Automated breast cancer detection and classification using ultrasound images: A survey. Pattern Recognition 43(1), 299–317 (2010) 2. Chen, C.H., Peter Ho, P.G.: Statistical pattern recognition in remote sensing. Pattern Recognition 41, 2731–2741 (2008) 3. Zhang, J., Tan, T.: Brief review of invariant texture analysis methods. Pattern Recognition 35(3), 735–747 (2002) 4. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Transactions on Systems, Man and Cybernetics 3(6), 610–621 (1973) 5. Kashyap, R.L., Khotanzad, A.: A model-based method for rotation invariant texture classification. IEEE Trans. Pattern Anal. Mach. Intell. 8, 472–481 (1986) 6. Cross, G.R., Jain, A.K.: Markov random field texture models. IEEE Trans. Pattern Anal. Mach. Intell. 5, 25–39 (1983) 7. Chellappa, R., Chatterjee, S.: Classification of textures using gaussian markov random fields. IEEE Transactions on Acoustics, Speech, and Signal Processing 33(1), 959–963 (1985) 8. Azencott, R., Wang, J.P., Younes, L.: Texture classification using windowed fourier filters. IEEE Trans. Pattern Anal. Mach. Intell. 19, 148–153 (1997) 9. Gabor, D.: Theory of communication. Journal of Institute of Electronic Engineering 93, 429–457 (1946) 10. Daubechies, I.: Ten lectures on wavelets. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (1992) 11. Serra, J.: Image Analysis and Mathematical Morphology. Academic Press, Inc., Orlando (1983) 12. Chen, Y., Dougherty, E.: Gray-scale morphological granulometric texture classification. Optical Engineering 33(8), 2713–2722 (1994) 13. Mandelbrot, B.B.: The Fractal Geometry of Nature. W. H. Freeman and Company, New York (1983) 14. Bruno, O.M., de Oliveira Plotze, R., Falvo, M., de Castro, M.: Fractal dimension applied to plant identification. Information Sciences 178, 2722–2733 (2008) 15. Backes, A.R., Gon¸calves, W.N., Martinez, A.S., Bruno, O.M.: Texture analysis and classification using deterministic tourist walk. Pattern Recogn. 43, 685–694 (2010) 16. Lindeberg, T.: Scale-space. In: Wah, B. (ed.) Encyclopedia of Computer Science and Engineering, EncycloCSE 2008, vol. 4, pp. 2495–2504. John Wiley and Sons, Hoboken (2008) 17. Witkin, A.P.: Scale-space filtering. In: International Joint Conference on Artificial Intelligence, pp. 1019–1022 (1983)
18. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 12, 629–639 (1990) 19. Koenderink, J.J.: The structure of images. Biological Cybernetics 50(5), 363–370 (1984) 20. Bianconi, F., Fern´ andez, A.: Evaluation of the effects of gabor filter parameters on texture classification. Pattern Recognition 40(12), 3325–3335 (2007) 21. Brodatz, P.: Textures: A Photographic Album for Artists and Designers. Dover Publications, New York (1966) 22. Lab, M.M.: Vision texture – vistex database (1995) 23. M¨ aenp¨ aa ¨, T., Pietik¨ ainen, M.: Classification with color and texture: jointly or separately? Pattern Recognition 37(8), 1629–1640 (2004) 24. Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall series in artificial intelligence. Prentice Hall, New Jersey (2003)
Dynamic Texture Analysis and Classification Using Deterministic Partially Self-avoiding Walks

Wesley Nunes Gonçalves and Odemir Martinez Bruno

Institute of Physics of São Carlos (IFSC), University of São Paulo
Av. do Trabalhador Sãocarlense, 400, São Carlos, São Paulo, Brazil
[email protected], [email protected]
Abstract. Dynamic texture has been attracting extensive attention in the field of computer vision in recent years. These patterns can be described as moving textures in which the idea of self-similarity presented by static textures is extended to the spatio-temporal domain. Although promising results have been achieved by recent methods, most of them cannot model multiple regions of dynamic textures and/or both motion and appearance features. To overcome these drawbacks, a novel approach for dynamic texture modeling based on deterministic partially self-avoiding walks is proposed. In this method, deterministic partially self-avoiding walks are performed in three orthogonal planes to combine appearance and motion features of the dynamic textures. Experimental results on two databases indicate that the proposed method improves the correct classification rate compared to existing methods. Keywords: Deterministic partially self-avoiding walks, Dynamic texture.
1 Introduction
In recent years, dynamic texture has gained increasing attention from the computer vision community due to the explosive growth of multimedia databases. Unlike image textures or static textures, dynamic textures in a sequence of images are texture patterns in motion. In this recent field of investigation, the definition of self-similarity of static textures is extended to the spatio-temporal domain. Examples of dynamic textures include real world scenes of smoke, waves, traffic, crowds, and flags blowing. Few methods for dynamic texture modeling have been proposed in the literature. Most of them can be classified into four categories: (i) motion based methods: these methods transfer the dynamic texture analysis to the analysis of a sequence of motion patterns [1–3]; (ii) spatiotemporal filtering and transform based methods: they describe dynamic textures at different scales in space and time through spatiotemporal filters, such as the wavelet transform [4–6]; (iii) model based methods: these methods use a generative process, which provides a representation that can be used in applications of synthesis, segmentation, and
classification [7–9]; (iv) spatiotemporal geometric property based methods: they are based on properties of surfaces of moving contours, where it is possible to extract motion and appearance features based on the tangent plane distribution [10]. Although promising results have been achieved by recent methods, most of them present at least one of the following drawbacks: they do not provide an explicit combination between motion features and appearance features, do not provide features that are robust against image transformations, cannot model multiple dynamic textures, or can be quite expensive. We address these issues with a novel method for dynamic texture modeling based on deterministic partially self-avoiding walks. Recently, a promising method for static texture recognition using deterministic partially self-avoiding walks has been published [11]. In this method, a traveler walks through image pixels using a walk rule and a memory that stores the last steps taken by the traveler. Each trajectory is composed of two parts: (i) transient: the traveler walks freely, exploiting texture characteristics, and (ii) attractor: a sequence of pixels which repeats along the trajectory and from which the traveler cannot escape. By analyzing the distribution of transients and attractors, it is possible to quantify and compare image textures [11, 12]. The method proposed here is an extension of deterministic partially self-avoiding walks that models motion and appearance features of dynamic textures. For this, we show how the traveler can effectively walk on three orthogonal planes of the image sequence: the XY, XT, and YT planes. The first plane captures appearance features and the other two capture motion features. To validate the proposed method, experiments were conducted with the Dyntex database (1230 videos) [13] and a traffic video database (254 videos) [14]. On both databases, the proposed method provides excellent recognition results compared to recent methods. The experimental results are especially interesting for videos in which the relevant information for discrimination relies on both appearance and motion features; modeling these videos is exactly the main advantage of the proposed method. Moreover, the advantages of deterministic partially self-avoiding walks, such as invariance to image transformations and multi-scale analysis, are maintained in the proposed method [11]. This paper is organized as follows. In Section 2, we briefly discuss deterministic partially self-avoiding walks to provide background for the proposed method. A novel approach for dynamic texture modeling is presented in Section 3. Experimental results are given in Section 4, which is followed by the conclusion in Section 5.
2 Deterministic Partially Self-avoiding Walks
In this section we briefly describe deterministic partially self-avoiding walks for image texture classification [11]. Consider an image with size N = w × h pixels, where each pixel p_i has a gray level I(p_i) ranging from 0 to 255. Consider also that p_i has a set of neighbor pixels η(p_i) composed of the pixels p_j for which the
Euclidean distance between p_i and p_j is not larger than \sqrt{2} (8-connectivity). If two pixels are neighbors (p_j ∈ η(p_i)), then the connection weight w_{ij} is given by w_{ij} = |I(p_i) − I(p_j)|. Given the above definitions, the main idea of the method is a traveler walking on the image pixels with a memory M which stores the last μ pixels visited. The steps taken by the traveler follow the deterministic rule: move to a neighbor pixel p_j that has not been visited in the last μ steps. The choice of the neighbor pixel p_j is determined by a criterion of movement din and the memory M. Examples of criteria include din = min, in which the neighbor p_j whose intensity is closest to that of the current pixel p_i is chosen (Equation 1), and din = max, defined analogously. The trajectory can be divided into two parts: an initial part with t steps, called the transient, and a final part, where the traveler is trapped in a cycle of period p, called the attractor. The attractor consists of a group of pixels whose intensities form a path from which travelers cannot escape.

p_j = \arg\min_{j=1,\ldots,N} \{\, w_{ij} \mid p_j \notin M, \; p_j \in \eta(p_i) \,\}   (1)
For each initial situation, the traveler produces a different trajectory. The traveler's behavior depends strictly on the configuration of pixels, the memory M of size μ and the initial pixel. In this sense, each image pixel is taken as an initial condition for a trajectory. Both the transient and the attractor of each trajectory can be combined into a joint distribution S_{\mu,din}(t, p). The joint distribution defines the probability that a trajectory has transient t and attractor period p, according to Equation 2. From the study of these distributions, it is possible to obtain features able to discriminate image textures [11].

S_{\mu,din}(t, p) = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} 1, & \text{if } t_i = t \text{ and } p_i = p \\ 0, & \text{otherwise} \end{cases}   (2)

where t_i and p_i are the transient and period of the trajectory initiated at pixel i.
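The sketch below runs one such walk with the rule din = min and detects the attractor by looking for the first repeated walker state (current pixel plus memory window). This is one plausible reading of the method; the border handling, memory bookkeeping and tie-breaking are choices made for illustration rather than the authors' implementation.

```python
import numpy as np

def tourist_walk(img, start, mu=2):
    # One deterministic partially self-avoiding walk (din = min) started at `start` = (row, col).
    # Returns (transient length t, attractor period p). Since the dynamics are deterministic, the
    # first repeated state (position + memory of the last mu visited pixels) marks the attractor.
    h, w = img.shape
    pos = start
    memory = [start]                      # last mu pixels visited (here taken to include the current one)
    seen = {}
    step = 0
    while True:
        state = (pos, tuple(memory))
        if state in seen:
            t = seen[state]               # first step at which this state occurred
            return t, step - t            # transient, period
        seen[state] = step
        y, x = pos
        # 8-connected neighbours inside the image and not in the memory window
        candidates = [(y + dy, x + dx)
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                      if (dy, dx) != (0, 0)
                      and 0 <= y + dy < h and 0 <= x + dx < w
                      and (y + dy, x + dx) not in memory]
        if not candidates:                # dead end: treated here as a trivial attractor of period 1
            return step, 1
        # din = min: move to the neighbour with the smallest absolute intensity difference (weight w_ij)
        pos = min(candidates, key=lambda q: abs(float(img[q]) - float(img[y, x])))
        memory.append(pos)
        memory = memory[-mu:]             # keep only the last mu visited pixels
        step += 1

img = np.random.randint(0, 256, (32, 32))
print(tourist_walk(img, (5, 5), mu=2))
```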
3 Modeling Dynamic Textures
In this section we describe the proposed method for dynamic texture classification based on deterministic partially self-avoiding walks. First, a sequence of images is modeled in three orthogonal planes and deterministic partially self-avoiding walks are performed in each plane. Then, using the joint distributions, a feature vector is built to characterize the dynamic texture.

3.1 Deterministic Partially Self-avoiding Walks on Sequence of Images
Let I(pi ) | pi = (xi , yi , ti ) be a sequence of images, where xi and yi are the spatial indexes and ti is the temporal index. For modeling appearance and motion
Fig. 1. Three planes obtained from a sequence of images: (a) the XY plane, (b) the XT plane, (c) the YT plane
features, we propose to apply deterministic partially self-avoiding walks on three orthogonal planes. The three planes, namely the XY, XT and YT planes, can be viewed in Figure 1. These planes define the connections between image pixels and consequently the way the traveler walks on them. The XY plane aims to capture appearance features from the sequence of images. In this plane, the traveler can only walk on pixels belonging to the same image frame. In this context, two pixels p_i = (x_i, y_i, t_i) and p_j = (x_j, y_j, t_j) are neighbors if the Euclidean distance between their spatial indexes is not larger than \sqrt{2} and t_i is equal to t_j, according to Equation 3. As a result, the XY plane is equivalent to applying deterministic partially self-avoiding walks on each image separately and then adding the joint distributions.

p_j \in \eta(p_i) \quad \text{if} \quad \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \le \sqrt{2} \;\text{ and }\; t_i = t_j   (3)

The XT and YT planes contain motion features from the sequence of images. On the XT plane, the traveler can only walk on pixels belonging to the same Y plane. Two pixels p_i and p_j are neighbors if the Euclidean distance is not larger than \sqrt{2} and y_i is equal to y_j, according to Equation 4. On the other hand, for the YT plane, the traveler can only walk on pixels belonging to the same X plane. Thus, two pixels are neighbors if the Euclidean distance between them is not larger than \sqrt{2} and x_i is equal to x_j (Equation 5).

p_j \in \eta(p_i) \quad \text{if} \quad \sqrt{(x_i - x_j)^2 + (t_i - t_j)^2} \le \sqrt{2} \;\text{ and }\; y_i = y_j   (4)

p_j \in \eta(p_i) \quad \text{if} \quad \sqrt{(y_i - y_j)^2 + (t_i - t_j)^2} \le \sqrt{2} \;\text{ and }\; x_i = x_j   (5)
To combine appearance and motion features, deterministic partially selfavoiding walks are applied separately on each plane described above. Thus,
three joint distributions are obtained: S^{XY}_{\mu,din}(t, p), S^{XT}_{\mu,din}(t, p) and S^{YT}_{\mu,din}(t, p). From these distributions, the proposed method is able to discriminate dynamic textures.
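The restriction of the walk to each plane amounts to changing the neighbourhood rule of eqs. (3)-(5); a small sketch of those rules is given below (the voxel ordering (x, y, t) and the bounds checking are our own conventions).

```python
def plane_neighbours(p, shape, plane):
    # 8-connected neighbours of voxel p = (x, y, t) restricted to one plane (eqs. 3-5).
    # plane = 'XY' keeps t fixed (appearance); 'XT' keeps y fixed and 'YT' keeps x fixed (motion).
    x, y, t = p
    X, Y, T = shape
    if plane == 'XY':
        deltas = [(dx, dy, 0) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    elif plane == 'XT':
        deltas = [(dx, 0, dt) for dx in (-1, 0, 1) for dt in (-1, 0, 1) if (dx, dt) != (0, 0)]
    else:  # 'YT'
        deltas = [(0, dy, dt) for dy in (-1, 0, 1) for dt in (-1, 0, 1) if (dy, dt) != (0, 0)]
    return [(x + dx, y + dy, t + dt) for dx, dy, dt in deltas
            if 0 <= x + dx < X and 0 <= y + dy < Y and 0 <= t + dt < T]

print(plane_neighbours((10, 10, 5), (320, 240, 50), 'XT'))
```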
3.2 Feature Vector
From the joint distribution, a histogram h^{\phi}_{\mu,din}(t + p) is built to summarize this information (Equation 6, where \phi represents one of the planes). This histogram represents the number of trajectories that have size equal to (t + p) in the joint distribution, where t and p are the transient and period, respectively.

h^{\phi}_{\mu,din}(n) = \sum_{t + p = n} S^{\phi}_{\mu,din}(t, p)   (6)

From the histogram, n features are selected to compose the feature vector \psi^{\phi}_{\mu,din}. The first feature is \mu + 1 because there is no smaller period, since the traveler performs a partially self-avoiding trajectory limited to the memory window \tau = \mu. The feature vector constructed with this strategy is given by Equation 7.

\psi^{\phi}_{\mu,din} = [\, h^{\phi}_{\mu,din}(\mu + 1),\; h^{\phi}_{\mu,din}(\mu + 2),\; \ldots,\; h^{\phi}_{\mu,din}(\mu + n) \,]   (7)

The joint distribution depends on the memory size \mu and the criterion of movement din. To capture information of different scales and sources, a feature vector \varphi^{\phi} considering different values of \mu is shown in Equation 8. This feature vector is composed by concatenating the feature vectors \psi^{\phi}_{\mu_i,din}.

\varphi^{\phi} = [\, \psi^{\phi}_{\mu_1,din},\; \psi^{\phi}_{\mu_2,din},\; \ldots,\; \psi^{\phi}_{\mu_M,din} \,]   (8)

where din is the criterion of movement, which can be max or min. Using multiple memories helps characterize texture patterns and has great potential as a method of image classification [12]. However, it does not capture appearance and motion features together. In order to capture both, a strategy combining the different planes is used. Thus, the feature vector \varphi (Equation 9) contains information extracted from the three planes XY, XT and YT:

\varphi = [\, \varphi^{XY},\; \varphi^{XT},\; \varphi^{YT} \,]   (9)
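A sketch of how eqs. (6)-(9) could be assembled into the final 75-dimensional descriptor is given below. The callable joint_distribution(plane, mu, din), which should return S^φ_{μ,din}(t, p) as an array of counts or probabilities, is a hypothetical placeholder (it could be filled in, for instance, with the walk sketch of Section 2).

```python
import numpy as np

def histogram_features(S, mu, n=5):
    # Eqs. (6)-(7): histogram h(t + p) of the joint distribution S[t, p], keeping n bins from mu + 1.
    T, P = S.shape
    h = np.zeros(T + P)
    for t in range(T):
        for p in range(P):
            h[t + p] += S[t, p]
    return h[mu + 1: mu + 1 + n]                     # psi_{mu,din}

def dynamic_texture_vector(joint_distribution, memories=(1, 2, 3, 4, 5), n=5, din='min'):
    # Eqs. (8)-(9): concatenate the histogram features over memories and over the XY, XT, YT planes.
    feats = []
    for plane in ('XY', 'XT', 'YT'):
        for mu in memories:
            S = joint_distribution(plane, mu, din)   # placeholder returning S^phi_{mu,din}(t, p)
            feats.append(histogram_features(S, mu, n))
    return np.concatenate(feats)                     # 5 bins x 5 memories x 3 planes = 75 features

# toy usage with a random placeholder joint distribution
rng = np.random.default_rng(0)
fake = lambda plane, mu, din: rng.random((20, 20))
print(dynamic_texture_vector(fake).shape)
```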
4 Results and Discussion
In order to validate the proposed method and compare its efficiency with other methods, experiments were carried out using two databases: (i) the Dyntex database and (ii) a traffic video database. The first database consists of 123 dynamic texture classes, each containing 10 samples, collected from the Dyntex database [13]. The videos in the database present high variability, thus posing an interesting challenge for the purpose of modeling and classification. All the videos are at least 250 frames long, while the dimension of the frames is 400 × 300
Fig. 2. Examples of dynamic textures from dyntex database [13]
Fig. 3. Examples of dynamic textures from traffic database [14]
pixels. Figure 2 shows examples of dynamic textures from the first database. The second database, collected from the traffic database [14], consists of 254 videos divided into three classes − light, medium, and heavy traffic. Videos had 42 to 52 frames with a resolution of 320 × 240 pixels. The variety of traffic patterns and weather conditions is shown in Figure 3. The traffic database is used to assess the robustness of the methods with regard to the classification of different motion patterns, since appearance features alone are not sufficient for discrimination. In all experiments, a 10-fold cross-validation strategy was performed to compare the methods using k-nearest neighbor (KNN) and Support Vector Machine (SVM) classifiers.
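This evaluation protocol could be reproduced, for example, with scikit-learn as sketched below; the classifier hyper-parameters (number of neighbours for KNN, the SVM kernel) are not specified in the text and are assumptions of this example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# X: feature vectors (e.g. the 75-dimensional descriptors above), y: class labels (placeholder data here)
X = np.random.rand(200, 75)
y = np.random.randint(0, 10, 200)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in [('KNN', KNeighborsClassifier(n_neighbors=1)),
                  ('SVM', SVC(kernel='linear'))]:
    scores = cross_val_score(clf, X, y, cv=cv)        # 10-fold cross-validation accuracy
    print(name, scores.mean(), scores.std())
```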
Table 1. Comparison results for different dynamic texture methods in the Dyntex database

Method           N. of Features  KNN            SVM            Average
RI-LBP           4115            97.64 (±1.32)  94.63 (±2.68)  96.14
LBP-TOP          768             99.02 (±0.77)  93.66 (±2.29)  96.34
Proposed Method  75              97.56 (±1.20)  97.64 (±1.34)  97.60
Table 2. Comparison results for different dynamic texture methods in the Traffic database

Method           N. of Features  KNN            SVM            Average
RI-LBP           4115            93.31 (±4.34)  93.31 (±4.10)  93.31
LBP-TOP          768             93.70 (±4.70)  94.49 (±4.23)  94.10
Proposed Method  75              93.70 (±5.05)  96.06 (±3.92)  94.88
Previous experiments [11] conducted on the histogram h^{\phi}_{\mu,din} have shown that most texture information is concentrated within the first few elements. Based on these experiments, we have used n = 5 histogram descriptors to compose the feature vectors. Regarding the memory values, experiments have shown that lower values of μ result in higher correct classification rates. Thus, we have chosen five memory values (1 to 5) to compose the feature vector. The same configuration of parameters (histogram descriptors and memories) is used for each plane (XY, YT, XT).

4.1 Comparison with other Methods
The results of the proposed method are compared with reported methods (Tables 1 and 2) in terms of correct classification rate and number of features, using two well-known classifiers. In the experiments with the Dyntex database (Table 1), the proposed method achieved the highest correct classification rate, followed by LBP-TOP and RI-LBP [15]. On this database, the method introduced in this paper shows similar results to the LBP-TOP method proposed by Zhao [15]. However, it should be noted that our method is only 75-dimensional (5 histogram descriptors * 5 memories * 3 planes), compared to a minimum of 768 dimensions for the other methods. Table 2 provides a comparison of correct classification rates for the traffic database. Experimental results indicate that the proposed method achieved a correct classification rate of 94.88%, which is approximately equal to the correct classification rate of 94.10% obtained by the LBP-TOP method. The next highest rate of 93.31% was obtained by RI-LBP, where the classification is done with 4115 features. These experimental results demonstrate that the proposed method is an effective representation, compared to the other methods, for dynamic textures that have similar appearance (vehicles) but different motion patterns (light, medium, heavy traffic).
(a) Velocity from the speed-sensor located on the highway
(b) Features extracted using proposed method and clustered by k-means with k = 2
(c) Features extracted using LBP-TOP and clustered by k-means with k = 2
(d) Features extracted using proposed method and clustered by k-means with k = 3
(e) Features extracted using LBP-TOP and clustered by k-means with k = 3

Fig. 4. Comparison using temporal cluster index and real speed measurements
Fig. 5. Examples of videos of the three clusters formed using features extracted with the proposed method. Each line corresponds to a cluster.
4.2 Clustering Traffic Database
To compare results on real data, we applied the k-means method to cluster the feature vectors extracted from 256 videos of vehicle highway traffic. The videos of this database were collected over a span of about 20 hours across two days. Figure 4 presents the temporal evolution of the cluster index and the average traffic speed measured with an electromagnetic sensor. For k = 2 (Figure 4(b) for the proposed method and Figure 4(c) for the LBP-TOP method), two clusters are formed, related to low and high traffic speeds, respectively. As we can see in the figures, in comparison to the LBP-TOP method, the proposed method presents a better correspondence to the real speed measurements. For k = 3 (Figure 4(d) for the proposed method and Figure 4(e) for the LBP-TOP method), the k-means method forms low, medium and high speed clusters. Samples from the three clusters are illustrated in Figure 5. These results show that the features extracted with the proposed method are in agreement with the perceptual categorization of vehicle traffic.
5 Conclusion
A novel approach for dynamic texture modeling has been presented in this paper. We demonstrated how deterministic partially self-avoiding walks can be performed on three orthogonal planes derived from a sequence of images. This strategy is an efficient way to characterize both appearance and motion features. Moreover, the proposed method is invariant to image transformations and allows multi-scale analysis, because it keeps the main advantages of deterministic partially self-avoiding walks. Promising results have been obtained on public databases of high complexity. On the Dyntex database, experimental results indicate that the proposed method improves recognition performance, e.g. from 96.34% to 97.60%, over the traditional approaches. Experimental results on the traffic database, involving dynamic textures that have similar appearance but different motion patterns, demonstrated that the proposed method is also robust in terms of motion classification. The results were compared with the state of the art, and the proposed method was found to outperform the earlier reported methods on both databases. In addition, our method makes the modeling of dynamic textures feasible and simple, which results in an efficient and low cost implementation. The proposed method is able to successfully handle a wide range of dynamic image applications, from human motion classification to dynamic textures. Our method can also be applied to other machine learning problems such as 3D texture segmentation. As part of the future work, we plan to focus on investigating the impact on recognition performance of different radii of connection between pixels - not only 8-connectivity. Another research issue is to evaluate other strategies to compact the information of the joint distribution. Acknowledgments. WNG was supported by FAPESP grant 2010/08614-0. OMB was supported by CNPq grants 306628/2007-4 and 484474/2007-3.
References 1. Fazekas, S., Chetverikov, D.: Analysis and performance evaluation of optical flow features for dynamic texture recognition. Image Commun. 22, 680–691 (2007) 2. Polana, R., Nelson, R.C.: Temporal texture and activity recognition. In: MotionBased Recognition, ch.5 (1997) 3. Fablet, R., Bouthemy, P.: Motion recognition using nonparametric image motion models estimated from temporal and multiscale cooccurrence statistics. IEEE Trans. Pattern Analysis and Machine Intelligence 25(12), 1619–1624 (2003) 4. Dubois, S., P´eteri, R., M´enard, M.: A comparison of wavelet based spatio-temporal decomposition methods for dynamic texture recognition. In: Araujo, H., Mendon¸ca, A.M., Pinho, A.J., Torres, M.I. (eds.) IbPRIA 2009. LNCS, vol. 5524, pp. 314–321. Springer, Heidelberg (2009) 5. Zhong, H., Shi, J., Visontai, M.: Detecting unusual activity in video. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 819–826 (2004)
Segmentation Based Tone-Mapping for High Dynamic Range Images
Qiyuan Tian1, Jiang Duan2, Min Chen3, and Tao Peng2
1 Fudan University, Shanghai, China
2 Southwestern University of Finance and Economics, Chengdu, China
3 Applied Science and Technology Research Institute, Hong Kong, China
[email protected], [email protected], [email protected], [email protected]
Abstract. In this paper, we present a novel segmentation based method for displaying high dynamic range images. We segment images into regions and then carry out adaptive contrast and brightness adjustment, using a global tone mapping operator in the local regions, to reproduce local contrast and brightness and ensure better quality. We propose a weighting scheme to eliminate the boundary artifacts caused by the segmentation and adaptively decrease the local contrast enhancement in uniform areas to eliminate the noise introduced. We demonstrate that our method is easy to use and that a fixed set of parameter values produces good results for a wide variety of images. Keywords: Tone mapping, tone reproduction, adaptation, high dynamic range, image segmentation.
1 Introduction
The dynamic range of a scene, image or imaging device is defined as the ratio of the highest to the lowest luminance or signal level. The real world scenes we experience in daily life often have a very high dynamic range of luminance values. The human visual system is capable of perceiving scenes over 5 orders of magnitude and can gradually adapt to scenes with dynamic ranges of over 9 orders of magnitude. However, conventional digital image capture and display devices suffer from a limited dynamic range, typically spanning only two to three orders of magnitude. Therefore, high dynamic range (HDR) imaging is designed to better capture and reproduce HDR scenes. Fig. 1 shows an HDR scene with a dynamic range of about 167,470:1. In order to make features in the dark areas visible, most cameras need longer exposure intervals, but this renders the bright areas saturated. On the other hand, using shorter exposure times to capture details in bright areas will obscure features in the darker areas. These limitations are addressed by HDR image sensors [1], or through the use of HDR radiance maps [2-4]. These HDR radiance maps are obtained by merging a sequence of low dynamic range (LDR) images of the same scene taken under different exposure intervals, as shown in Fig. 1, and thus capture the full dynamic range of the scene in a 32-bit floating-point number format.
Fig. 1. Selected multi-exposed image set of the original scene
Although HDR image displaying systems are being developed [5, 6] to display HDR radiance maps, they are expensive or designed for research. We usually display 32-bit HDR radiance maps on conventional LDR reproduction media such as CRT monitors and printers, which are usually only 8-bit per channel, while preserving as much of their visual content as possible. One solution to this problem is to compress the dynamic range of the radiance maps such that the mapped image fits into the dynamic range of the display devices. This process is called tone mapping or tone reproduction. This paper addresses this issue, and our work is motivated by the strong need [7] for fast and automatic tone mapping operators which can accurately and faithfully reproduce HDR maps on LDR devices. The organization of the paper is as follows. In the next section, we briefly review previous work. In Section 3 we describe our segmentation based tone mapping operators. Section 4 presents experimental results, and Section 5 concludes the paper and briefly discusses future work.
2 Review of Tone Mapping Methods
In the literature, tone mapping methods are usually divided into two broad categories, i.e., global and local tone mapping techniques [8]. Global tone mapping techniques apply the same appropriately designed, spatially invariant mapping function across the image. They do not involve spatial processing and are therefore computationally very simple. This is useful in real-time applications such as HDR video [9, 10]. Global tone mapping techniques also preserve the intensity order of the original scene, thus avoiding "halo" artifacts, but tend to lose details. Earlier pioneering work in this category includes [11] and [12]. [11] developed a global tone mapping operator that attempted to match display brightness with real world sensations, while [12] matched perceived contrast between the displayed image and the scene. Later, [13] proposed a technique based on a comprehensive visual model which successfully simulated several important visual effects like adaptation, color appearance, visual acuity and time-course of adaptation. Further, [14] presented a method based on logarithmic compression of luminance values, imitating the human response to light. Recently, [15] formulated the tone mapping problem as a quantization process and employed an adaptive conscience learning strategy to obtain the mapped image. An interesting work was proposed in [16] to visualize an HDR image with a series of its tone mapped versions produced by different monotonic global mappings. However, this work focused on information visualization rather than tone mapping. Perhaps the most comprehensive technique is still that of [17], which first improved histogram equalization and then extended this idea to incorporate models of human contrast sensitivity, glare, spatial acuity, and color sensitivity effects.
Local tone mapping techniques consider local pixel statistics and local pixel contexts in the mapping process for each individual pixel, often at multiple scales, and therefore better preserve the details and local contrast in the images, at a higher computational cost. However, these techniques usually introduce "halo" artifacts because of the violation of the basic monotonic principle, and they are quite difficult to use since there are often too many parameters in the algorithms which have to be set empirically. [18-22] are based on the same principle. They decomposed an image into layers and these layers were compressed differently. Usually, layers with large features were strongly compressed to reduce the dynamic range, while layers of details were left unchanged or even enhanced to preserve details. Afterwards, the compressed layers were reconstructed to obtain the displayable image. These methods mainly differ in the way in which they attempted to decompose the image and compress the different layers. [23] presented a dynamic range compression method based on a multiscale version of the Retinex theory of color vision. [24] attempted to bring traditional photographic technology to the digital domain for the reproduction of HDR images. Their method was based on the well-known photographic practice of dodging-and-burning. [25] manipulated the gradient field of the luminance image by attenuating the magnitudes of large gradients and obtained the displayable image by solving a Poisson equation on the modified gradient field. In [26], the authors presented a multiscale image processing technique to display HDR images. In their method, they used a symmetrical analysis–synthesis filter bank and computed a smooth gain map for multiscale subband images. More recently, [27] derived a tone mapping operator based on the segmentation of an HDR image into frameworks (consistent areas) using the anchoring theory of lightness perception and on the local calculation of lightness values. The final lightness of an image was computed by merging the frameworks proportionally to their strength. In [28], the user can indicate regions of interest interactively using a few simple brush strokes, and the system then adjusts the brightness, contrast and other parameters in these regions automatically. In this paper, we investigate a local tone mapping method for displaying high dynamic range images. Our method is inspired by a visual mechanism called adaptation, which can be explained as the ability to accommodate to the level of a certain visual field around the current fixation point [17]. Thus, our eyes are always adjusted to an optimal status for certain areas of the scene and therefore we can see the details in all parts of the scene. There has been research to model and better understand this mechanism, e.g., [29]. Results in [29] showed that different luminance intervals could result in overlapped reactions on the limited response range of the visual system, thus extending our visual response range to cope with the full dynamic range of high contrast scenes. Based on this visual mechanism, but not computationally building on it, we segment the images into regions and then adaptively adjust the contrast and brightness of each region to a better status. Finally, we propose a weighting scheme to eliminate the boundary artifacts caused by the segmentation and decrease the local contrast enhancement level adaptively in relatively uniform areas to avoid the noise artifacts introduced by excessive enhancement.
3 Algorithm
According to the local adaptation mechanism of the human visual system mentioned above, when we stand in a real scene, we focus our eyes on different parts so that we can see
the full scene. For each visual focus, our visual system is adjusted to an optimal status to accommodate to the light level, in order to see the details better in that particular region. A good tone mapping algorithm should simulate this process. Thus, in our approach, we segment the images into regions and then individually reproduce the contrast and brightness of each region to a better status using a global tone mapping method. Local regions may obtain a larger display dynamic range, and their ranges might overlap with each other. This is consistent with the local adaptation mechanism described before and thus better matches our visual experience.
3.1 Image Segmentation
In the literature, a number of techniques have been developed to segment images into local areas [30-34]. However, there does not appear to be a consensus about which of these algorithms is best. In the most recent work [35], the quantitative performance of several existing segmentation algorithms was evaluated. We did not investigate segmentation methods in depth, but simply selected one segmentation algorithm that meets the needs described below. When we observe a real scene, the visual focus is usually based on seeing objects, so a segmentation method that can roughly divide the image into object-based regions fulfills our purpose. In addition, to avoid interaction among regions during tone mapping in different regions, the segmentation algorithm should divide the image into non-overlapping, closed regions. Due to a combination of reasonable performance and public availability, Cour's Multiscale Ncut segmentation algorithm [32] is chosen as the segmentation method in our tone mapping approach. Fig. 2 shows segmentation results of this method for different numbers of segments R.
Fig. 2. Original images and segmentation results of the Multiscale Ncut segmentation algorithm [32], shown in pseudo-color; from left to right, the second and fifth images have 20 segments while the third and sixth images have 60 segments
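The paper relies on Cour et al.'s multiscale Ncut code; purely for illustration, the sketch below uses scikit-image's SLIC superpixels as a stand-in segmenter that also yields non-overlapping, closed regions. The function, file name, and parameters are assumptions, not the authors' pipeline.

```python
# Illustrative stand-in: any segmenter returning a dense integer label map with
# about R non-overlapping regions can play the role of the multiscale Ncut step.
from skimage import io, segmentation

def segment_into_regions(image_path, n_regions=40):
    """Return (image, label map) with roughly `n_regions` closed regions."""
    img = io.imread(image_path)
    labels = segmentation.slic(img, n_segments=n_regions, compactness=10)
    return img, labels

# img, labels = segment_into_regions("hdr_tonemap_input.png", n_regions=40)
```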
3.2 Local Tone Mapping in Segmented Regions
We have described the image segmentation algorithm used in our tone mapping approach. The next step is to reproduce the contrast and brightness in each individual region to a better status using a global tone mapping method. Since the global tone mapping method will be applied to a number of segmented regions, it should be computationally very efficient in order to keep the whole method fast. In addition, the image resulting from
the global tone mapping method should be of high quality so that the contrast and brightness in each local region will be reproduced to a good status. Duan's fast tone mapping method [36] is an ideal candidate for our purpose in these aspects and thus is chosen in our approach. In their method, they proposed a Histogram Adjustment based Linear to Equalized Quantization (HALEQ) global tone mapping method. HALEQ works by striking a balance between linear compression and histogram equalized compression in the mapping process, which can be approximately expressed as
d(x, y) = β · EC[D(x, y)] + (1 − β) · LC[D(x, y)],    (1)
where D(x, y) is the input luminance calculated from the radiance map and d(x, y) is the output display intensity level. EC is the histogram equalization mapping and LC is the linear mapping, both applied after the initial logarithmic compression [36]. β is set between 0 and 1, and a larger value means larger contrast in the mapped image. Setting β = 0, the mapping is linear. If β = 1, the mapping is histogram equalized. Setting 0 ≤ β ≤ 1, the technique controls the mapping between linear and histogram equalized in a very simple and elegant way. The first and fourth images (from left to right) of Fig. 2 show the mapping results from the HALEQ method. In our approach, we compute a local HALEQ_n (n ∈ R, where R is the number of segmented regions) based on the pixel statistics in each region, in the same way as in the global case described in the method of Duan [36]. We use a common parameter β = 0.6 for all the regions as an initial setting in our local operator; an adaptive method to choose β is introduced later. If we regard HALEQ_n as a mapping function, for an individual pixel luminance value D(x, y), the output integer display level d(x, y) is given by

d(x, y) = HALEQ_n[D(x, y)],  (x, y) ∈ n.    (2)
The left and middle images in Fig. 3 show mapping results from the local HALEQ method. Obviously, these images show more details and local contrast in both dark and bright regions in comparison with the global case shown in the first and fourth images (from left to right) of Fig. 2. However, the direct application of HALEQ in each independent local area causes sharp jumps among different regions. The result is boundary artifacts in the images, as can be seen in the staircase area of the left image and the floor area of the middle image in Fig. 3, making the mapped images unacceptable despite the improvement in detail visibility and local contrast. This is due to the fact that the local HALEQ operators are computed based on very different luminance distributions. Pixels with similar values but on different sides of a local region boundary can be mapped to very different values, thus leading to boundary artifacts. Another problem with the initially reproduced images is that a lot of noise appears in the sky area of the left image and the floor area of the middle one in Fig. 3, which are relatively uniform areas. In the initial step, a common parameter β is applied to the local areas across the image. The effect of contrast enhancement becomes much more obvious for all the regions. However, the strong contrast enhancement in uniform areas means that similar pixels are mapped to quite different values, thus resulting in noise artifacts.
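As a rough sketch of Eqs. (1)-(2) (not Duan's HALEQ implementation), the code below blends a histogram-equalized mapping EC with a linear mapping LC of the log-luminance and applies one such mapping per segmented region; the bin count and output range are assumptions.

```python
import numpy as np

def haleq(log_lum, beta=0.6, out_levels=256, bins=1024):
    """Eq. (1): d = beta * EC[D] + (1 - beta) * LC[D] on log-luminance values."""
    lo, hi = float(log_lum.min()), float(log_lum.max())
    span = max(hi - lo, 1e-12)
    lc = (log_lum - lo) / span * (out_levels - 1)               # linear mapping LC
    hist, edges = np.histogram(log_lum, bins=bins, range=(lo, hi))
    cdf = np.cumsum(hist) / max(hist.sum(), 1)
    ec = np.interp(log_lum, edges[1:], cdf) * (out_levels - 1)  # equalized mapping EC
    return beta * ec + (1.0 - beta) * lc

def local_haleq(log_lum, labels, beta=0.6):
    """Eq. (2): apply one HALEQ mapping per segmented region n."""
    out = np.zeros(log_lum.shape, dtype=float)
    for n in np.unique(labels):
        mask = labels == n
        out[mask] = haleq(log_lum[mask], beta=beta)
    return out
```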
Fig. 3. Left and middle: mapping results directly from the local HALEQ method with the number of segmented regions R=40, right: distance weighting function introduced to eliminate the boundary artifacts. For easy illustration, only 7 regions are used in this figure.
3.3 Eliminate the Boundary Artifacts
To eliminate the boundary artifacts, we introduce a weighting scheme as illustrated in the right image of Fig. 3. For each pixel value D(x, y) in the image, the final mapped pixel value is the weighted average of the results of the tone mapping functions of the N nearest regions, according to a distance weighting function, as follows:
d(x, y) = ( Σ_{n=1}^{N} HALEQ_n[D(x, y)] · w_d(n) ) / ( Σ_{n=1}^{N} w_d(n) ),    (3)

where the distance weighting function w_d is calculated as

w_d(n) = e^{−(d_n / σ_d)}.    (4)
d_n is the Euclidean distance between the current pixel position and the center of each of the regions, and σ_d controls the smoothness of the image. The larger the value of σ_d, the smaller the influence of d_n when calculating w_d, which means that setting σ_d to larger values results in an image free from boundary artifacts but with less local contrast. N is the number of regions used in the weighting operation. A larger N means more regions will be used in the weighted average. In other words, a large N facilitates the elimination of boundary artifacts but produces an image with less local contrast. The left and middle images in Fig. 4 show mapped results after considering the distance weighting function. We can see that good results are obtained and the disturbing boundary artifacts are gone. These images show details and local contrast and give a much more natural appearance than the left and middle images of Fig. 3.
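A minimal sketch of the weighting scheme of Eqs. (3)-(4), reusing per-region HALEQ mappings; `region_maps` is a hypothetical dictionary mapping each region label to its mapping function, and the region centers are taken as label centroids (an assumption).

```python
import numpy as np

def blend_regions(log_lum, labels, region_maps, n_neighbors=15, sigma_d=25.0):
    """Weighted average of the N nearest regions' mappings (Eqs. 3-4)."""
    h, w = log_lum.shape
    ys, xs = np.mgrid[0:h, 0:w]
    region_ids = list(np.unique(labels))
    centers = [(ys[labels == n].mean(), xs[labels == n].mean()) for n in region_ids]

    num = np.zeros((h, w))
    den = np.zeros((h, w))
    # Distance of every pixel to every region center, shape (R, h, w)
    dists = np.stack([np.hypot(ys - cy, xs - cx) for cy, cx in centers])
    order = np.argsort(dists, axis=0)          # nearest region first, per pixel
    for rank in range(min(n_neighbors, len(region_ids))):
        for j, n in enumerate(region_ids):
            mask = order[rank] == j            # pixels whose rank-th nearest region is n
            if not mask.any():
                continue
            w_d = np.exp(-(dists[j][mask] / sigma_d))   # Eq. (4)
            num[mask] += region_maps[n](log_lum[mask]) * w_d
            den[mask] += w_d
    return num / np.maximum(den, 1e-12)
```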
Fig. 4. Left and middle: mapped results after consideration of the distance weighting function with σ_d = 25, N = 15, R = 40; right: detected uniform areas in an example image
3.4 Adaptive Selection of Parameter β in Uniform Areas
In the previous section, a common parameter β was applied to all the regions across the image, and this can introduce noise artifacts in uniform areas. We solve this problem by introducing an adaptive mechanism. More specifically, we detect the uniform areas and then decrease the contrast enhancement in these regions. This can be achieved by decreasing the parameter β of the HALEQ technique. The main challenge is to measure uniformity in order to detect flat areas in the images. Based on the fact that uniform areas exhibit a narrow, peaked histogram after logarithmic mapping, we regard a region as a uniform area if its histogram has a large deviation which is greater than a threshold η. The contrast enhancement in these uniform regions should become moderate. The deviation of the histogram for each region, SD_n, is calculated as
SD_n = ( Σ_{i=0}^{M} | Hist(i) − mean_n | ) / M,    (5)
where M is the bin number, Hist(i) is the pixel population in the i-th bin and mean_n is the mean pixel population per bin. A larger value of SD_n means that the pixel population distribution is further away from the mean pixel population in each bin, corresponding to a more uniform luminance distribution. η is the threshold which determines at what level a region is regarded as a uniform area. A smaller η considers more areas to be uniform. In our experiments, we find that choosing 20 bins (M = 20) and setting the threshold value η empirically for different images gives good results. After detecting the uniform areas in the image, we apply a smaller parameter β in Duan's histogram adjustment technique [36] as
β = 0.6 · [ 1 − e^{−(SD_n^{max} − SD_n)} ].    (6)
Equation (6) means that a large SD_n results in a small β. The effect is that the resulting images show few noise artifacts, at the cost of an overall decrease in local contrast. The
right image in Fig. 4 shows the detected regions by blacking them out, and the left image of Fig. 8 shows the mapping result after considering uniform areas. For this example we can see that the sky has a more natural appearance.
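A short sketch of the uniform-area handling of Eqs. (5)-(6); the bin count M = 20 follows the text, while the way the histogram range is chosen is an assumption.

```python
import numpy as np

def region_deviation(log_lum_region, bins=20):
    """SD_n of Eq. (5): mean absolute deviation of the histogram bin populations."""
    hist, _ = np.histogram(log_lum_region, bins=bins)
    return float(np.mean(np.abs(hist - hist.mean())))

def adaptive_beta(sd_n, sd_max):
    """Eq. (6): beta shrinks towards 0 as SD_n approaches the maximum deviation."""
    return 0.6 * (1.0 - np.exp(-(sd_max - sd_n)))

# Example usage over a label map `labels` and log-luminance `log_lum`:
# sds = {n: region_deviation(log_lum[labels == n]) for n in np.unique(labels)}
# betas = {n: adaptive_beta(sd, max(sds.values())) for n, sd in sds.items()}
```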
4 Experimental Results and Discussion
4.1 Mapping Results and Discussion
As discussed in Section 3.3, a large σ_d and N result in an image free from boundary artifacts but with less local contrast, which can be demonstrated by comparing the left and middle, and the left and right, images in Fig. 5, although not that evidently, especially for the effect of parameter N when comparing the left and right images. In terms of the region number R, the mapping results of an image segmented into more regions obviously have more local contrast, since the full dynamic range of the display can be better utilized in local areas. However, in our experiments we observed that mapped images look almost the same when R is larger than 30 with identical σ_d and N, suggesting that a small R can be used to accelerate the tone mapping in practical use, since a larger R leads to a longer processing time. This will be explained in Section 4.2. For most cases in our experiments, setting σ_d to 25.0, N to 15 and R to 40 leads to good results, and therefore we choose these values as our default parameters in order to overcome the common difficulty of local tone reproduction techniques that there are too many parameters the users have to set.
Fig. 5. Mapping results with different parameters σ_d and N. From left to right: σ_d = 25, N = 15, R = 40; σ_d = 10000, N = 15, R = 40; σ_d = 25, N = 40, R = 40.
Until now, there has been no standard, objective evaluation method available which can determine the superiority of tone mapping algorithms, although many evaluation studies [37-40] have been carried out. Since the main focus of the paper is the development of the tone mapping algorithm, we did not seek a comprehensive way of doing the evaluation. Most existing methods assess their own results by means of subjective evaluation [19, 24, 25], and this approach is adopted in our paper. In addition, the rendering intent of the assessment, such as accuracy compared to an original scene or rendering pleasantness [38], should be specified. The goals of tone mapping can be different depending on the particular application; an example would be to maximize overall visibility for use on HDR medical images. Our operator is designed for general purpose, and in most cases users cannot experience the real scenes. Thus,
we assess the performance in terms of how pleasing the tone mapped image looks. The bilateral filter technique [20] was reported by [38] to consistently perform best in accuracy as well as in overall pleasing rendering. Thus, we conducted a relatively detailed pair comparison between our image segmentation based algorithm and the bilateral filter tone mapping. We stress that the comparison is informal, and it is hard to draw conclusions from only a few different inputs. Fig. 6 shows one image pair rendered by both methods. Both outputs have a photographic look and are visually pleasing. However, it can be seen that our method clearly shows more details, for example in the window region (amplified area in the upper right of Fig. 6). With regard to the performance in preserving local contrast, the bilateral filter approach wins in bright areas, like the mural regions (amplified area in the lower right of Fig. 6), while ours performs more effectively in dark areas, like the top left corner region of the image. This is due to the fact that in the bilateral filter approach, highlights are more saturated, but the blacks are not as deep [41]. In summary, our algorithm is very competitive with the state-of-the-art tone mapping operator in terms of quality. More results produced by our method are shown in Fig. 7.
Fig. 6. Pair comparison of tone mapping results. Left and middle columns: result mapped by our image segmentation based operator and result mapped by bilateral filtering tone mapping [20]; right column: two amplified areas (from top to bottom, the first and third are from our method, the second and fourth are from the bilateral filter technique).
4.2 Computational Efficiency
Computation of our algorithm consists of two parts: image segmentation and local tone mapping in the segmented regions. The former part is the most computationally expensive. We directly used the 1_5 version of the ncut_multiscale software downloaded from [42]. The running time depends linearly on the image size and the number of segments requested [42]. It takes about 928 s to segment the 512×768 pixel Memorial image in Fig. 7 into 40 regions on a T5750 with a 2.0 GHz CPU and 2 GB RAM running Windows
Vista Home Basic. The local HALEQ implementation is relatively efficient. Written in Visual C++ without regard for speed, our code for this part required 6 s to compute a 512×768 pixel image with R equal to 40 on the same computer as described above. Although several strategies can be adopted to accelerate the algorithm, like decreasing the parameter "eigensolver tolerance" in the ncut_multiscale software, from 1e-2 to 1e-3 for example, thus increasing the speed by about three times, the computational speed is still intolerably slow compared with that of some previously published tone mapping methods [20, 24, 25, 43].
5 Conclusion and Future Work
In this paper, we have presented a novel segmentation based method for displaying high dynamic range images. Our tone mapping method better represents the human visual mechanism when dealing with high contrast scenes and thus can well reproduce the visual experience of the real scene. In future work, we will focus on improving the computational efficiency of the algorithm, especially speeding up the image segmentation, which is the main computational bottleneck. Obviously, a more efficient segmentation algorithm could easily and radically solve this problem; [31] is such a candidate. A better implementation of the method would also be useful. In addition, advanced testing and evaluation methodology will be explored, with a wider range of HDR images tested and the mapped outputs evaluated, to assess the performance of our tone mapping technique.
Fig. 7. Various HDR images tone mapped by our proposed segmentation based tone mapping method with default parameters: σ_d = 25.0, N = 15, R = 40
Acknowledgments. Radiance maps used in this paper courtesy of corresponding author(s). We would like to thank the various authors for making their data available on the Internet for experiments. This project was sponsored by the National Natural Science Foundation of China (Grant No. 60903128) and by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
References
1. Spivak, A., Belenky, A., Fish, A., Yadid-Pecht, O.: Wide-dynamic-range CMOS image sensors—comparative performance analysis. IEEE Trans. on Electron Devices 56(11), 2446–2461 (2009)
2. Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: Proc. ACM SIGGRAPH 1997, pp. 369–378 (1997)
3. Mitsunaga, T., Nayar, S.K.: Radiometric self calibration. In: Proceedings of the Computer Vision and Pattern Recognition, vol. 1, pp. 374–380 (1999)
4. Mann, M.S., Picard, R.W.: On being undigital with digital cameras: extending dynamic range by combining differently exposed pictures. In: Proceedings of the IS&T's 48th Annual Conference, Society for Imaging Science and Technology, pp. 422–428 (1995)
5. Seetzen, H., Heidrich, W., Stuerzlinger, W., Ward, G., Whitehead, L., Trentacoste, M., Ghosh, A., Vorozcovs, A.: High Dynamic Range Display Systems. ACM Transactions on Graphics (Siggraph 2004) 23(3), 760–768 (2004)
6. Ferwerda, J.A., Luka, S.: A high resolution high dynamic range display for vision research (abstract/poster). Vision Sciences Society, 8th Annual Meeting, Journal of Vision 9(8), 346a (2009)
7. Bandoh, Y., Qiu, G., Okuda, M., Daly, S., Aachyyy, T., Au, O.C.: Recent Advances in High Dynamic Range Imaging Technology. In: 2010 17th IEEE International Conference on Image Processing, ICIP (2010)
8. Reinhard, E., Ward, G., Pattanaik, S., Debevec, P.: High dynamic range imaging, pp. 223–323. Morgan Kaufmann Publisher, San Francisco (2006)
9. Kang, S.B., Uyttendale, M., Winder, S., Szeliski, R.: High dynamic range video. ACM Transactions on Graphics 22(3), 319–325 (2003)
10. Mantiuk, R., Krawczyk, G., Myszkowski, K., Seidel, H.-P.: Perception-motivated High Dynamic Range Video Encoding. In: Proc. of SIGGRAPH 2004, pp. 733–741 (2004)
11. Tumblin, J., Rushmeier, H.: Tone reproduction for realistic images. IEEE Computer Graphics and Applications 13, 42–48 (1993)
12. Ward, G.: A contrast-based scalefactor for luminance display. In: Graphics Gems IV, pp. 415–421. Academic Press, London (1994)
13. Ferwerda, J.A., Pattanaik, S.N., Shirley, P., Greenberg, D.P.: A model of visual adaptation for realistic image synthesis. In: Proceedings of the SIGGRAPH 1996, pp. 249–258 (1996)
14. Drago, F., Myszkowski, K., Annen, T., Chiba, N.: Adaptive Logarithmic Mapping For Displaying High Contrast Scenes. The Journal of Computer Graphics Forum 22(3), 419–426 (2003)
15. Duan, J., Qiu, G., Finlayson, G.M.D.: Learning to display high dynamic range images. Pattern Recognition 40(10), 2641–2655 (2007)
16. Pardo, A., Sapiro, G.: Visualization of high dynamic range images. IEEE Transactions on Image Processing 12(6), 639–647 (2003)
17. Larson, G.W., Rushmeier, H., Piatko, C.: A visibility matching tone reproduction operator for high dynamic range scenes. IEEE Trans. on Visualization and Computer Graphics 3, 291–306 (1997)
18. Chiu, K., Herf, M., Shirley, P., Swamy, S., Wang, C., Zimmerman, K.: Spatially nonuniform scaling functions for high contrast images. In: Proc. Graphics Interface 1993, pp. 245–253 (1993)
19. Tumblin, J., Turk, G.: LCIS: A boundary hierarchy for detail preserving contrast reduction. In: Proc. of ACM SIGGRAPH 1999, pp. 83–90 (1999)
20. Durand, F., Dorsey, J.: Fast bilateral filtering for the display of high-dynamic-range images. ACM Trans. Graph. (special issue SIGGRAPH 2002) 21(3), 257–266 (2002)
21. Li, X., Lam, K., Shen, L.: An adaptive algorithm for the display of high-dynamic range images. Journal of Visual Communication and Image Representation 18(5), 397–405 (2007)
22. Wang, J., Xu, D., Lang, C., Li, B.: An Adaptive Tone Mapping Method for Displaying High Dynamic Range Images. Journal of Information Science and Engineering (2010)
23. Jobson, D.J., Rahman, Z., Woodell, G.A.: A multiscale Retinex for bridging the gap between color images and the human observation of scenes. IEEE Transactions on Image Processing 6, 965–976 (1997)
24. Reinhard, E., Stark, M., Shirley, P., Ferwerda, J.: Photographic tone reproduction for digital images. In: Proc. ACM SIGGRAPH 2002 (2002)
25. Fattal, R., Lischinski, D., Werman, M.: Gradient domain high dynamic range compression. In: Proc. ACM SIGGRAPH 2002 (2002)
26. Li, Y., Sharan, L., Adelson, E.H.: Compressing and companding high dynamic range images with subband architectures. ACM Transactions on Graphics 24(3), 836–844 (2005)
27. Krawczyk, G., Myszkowski, K., Seidel, H.P.: Computational model of lightness perception in high dynamic range imaging. In: Rogowitz, B.E., Pappas, T.N., Daly, S.J. (eds.) Human Vision and Electronic Imaging XI (2006)
28. Lischinski, D., Farbman, Z., Uyttendaele, M., Szeliski, R.: Interactive local adjustment of tonal values. ACM Transactions on Graphics 22(3), 646–653 (2006)
29. Stevens, S.S., Stevens, J.C.: Brightness function: parametric effects of adaptation and contrast. Journal of the Optical Society of America 53 (1960)
30. Comanicu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 603–619 (2002)
31. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based segmentation algorithm. In: IJCV (2004)
32. Cour, T., Benezit, F., Shi, J.: Spectral Segmentation with Multiscale Graph Decomposition. IEEE International Conference on Computer Vision and Pattern Recognition, CVPR (2005)
33. Ren, X., Fowlkes, C., Malik, J.: Learning probabilistic models for contour completion in natural images. International Journal of Computer Vision 77, 47–63 (2008)
34. Yang, A., Wright, J., Ma, Y., Sastry, S.: Unsupervised segmentation of natural images via lossy data compression. Computer Vision and Image Understanding 110(2), 212–225 (2008)
35. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: From Contours to Regions: An Empirical Evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2009)
36. Duan, J., Qiu, G.: Fast Tone Mapping for High Dynamic Range Images. In: 17th International Conference on Pattern Recognition, ICPR 2004, vol. 2, pp. 847–850 (2004)
37. Yoshida, A., Mantiuk, R., Myszkowski, K., Seidel, H.P.: Analysis of reproducing real-world appearance on displays of varying dynamic range. In: EUROGRAPHICS 2006, vol. 25(3) (2006)
38. Kuang, J., Yamaguchi, H., Liu, C., Johnson, G.M., Fairchild, M.D.: Evaluating HDR rendering algorithms. ACM Transactions on Applied Perception 4(2), 9 (2007)
39. Cadík, M., Wimmer, M., Neumann, L., Artusi, A.: Evaluation of HDR Tone Mapping Methods using Essential Perceptual Attributes. Computers and Graphics (2008)
40. Kuang, J., Heckaman, R., Fairchild, M.D.: Evaluation of HDR tone-mapping algorithms using a high-dynamic-range display to emulate real scenes. Journal of the Society for Information Display 18(7), 461–468 (2010)
41. http://people.csail.mit.edu/fredo/PUBLI/Siggraph2002/index.html#hdr
42. http://www.seas.upenn.edu/~timothee/software/ncut_multiscale/ncut_multiscale.html
43. Duan, J., Bressan, M., Dance, C., Qiu, G.: Tone-mapping high dynamic range images by novel histogram adjustment. Pattern Recognition 43(5), 1847–1862 (2010)
Underwater Image Enhancement: Using Wavelength Compensation and Image Dehazing (WCID)
John Y. Chiang1, Ying-Ching Chen1, and Yung-Fu Chen2
1 Department of Computer Science Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan
[email protected], [email protected]
2 Department of Management Information Systems and Institute of Biomedical Engineering and Material Science, Central Taiwan University of Science and Technology, Taichung, Taiwan
[email protected]
Abstract. Underwater environments often cause color scatter and color cast during photography. Color scatter is caused by haze effects occurring when light reflected from objects is absorbed or scattered multiple times by particles in the water. This in turn lowers the visibility and contrast of the image. Color cast is caused by the varying attenuation of light at different wavelengths, rendering underwater environments bluish. To address distortion from color scatter and color cast, this study proposes an algorithm to restore underwater images that combines a dehazing algorithm with wavelength compensation (WCID). Once the distance between the objects and the camera was estimated using the dark channel prior, the haze effects from color scatter were removed by the dehazing algorithm. Next, the depth of the photographed scene was estimated from the residual energy ratios of each wavelength in the background light of the image. According to the amount of attenuation of each wavelength, reverse compensation was conducted to restore the distortion from color cast. An underwater video downloaded from the YouTube website was processed using WCID, histogram equalization, and a traditional dehazing algorithm. Comparison of the results revealed that WCID simultaneously resolved the issues of color scatter and color cast and enhanced image contrast, producing high quality underwater images and videos. Keywords: Underwater image, Image dehazing, Wavelength compensation.
1 Introduction
Capturing clear images in underwater environments is an important issue in ocean engineering [1]. The effectiveness of applications such as underwater navigational monitoring and environment evaluation depends on the quality of underwater images. Capturing clear images underwater is challenging, mostly due to haze caused by color scatter, in addition to color cast from varying light attenuation at different wavelengths [2]. Color scatter and color cast result in blurred subjects and lowered contrast in
underwater images. In Figure 1, for example, the yellow coral reef at the bottom of the image and the yellow fish in the upper-right corner are indistinct because of color cast; the school of Carangidae, the diver, and the reef in the back are unclear due to scattering.
Fig. 1. Blurry and Bluish Effects from Haze and Color cast in Underwater Images
Haze is caused by suspended particles such as sand, minerals, and plankton that exist in lakes, oceans, and rivers. As light reflected from objects proceeds towards the camera, a portion of the light meets these suspended particles, which absorb and scatter the light (Fig. 2). In environments without blackbody emission [3], scattering often expands to multiple scattering, further dispersing the beam into homogeneous background light.
Fig. 2. Natural Light Illuminates an Underwater Scene Point x and the Reflected Light Travels to the Camera by Direct Transmission and Scattering
The underwater image formed by light after scattering can be expressed as the weighted sum of the directly transmitted reflected light and the scattered background light [4]:
I_λ(x) = J_λ(x) t_λ(x) + B_λ (1 − t_λ(x)),  λ ∈ {R, G, B},    (1)

t_λ(x) = E_O(λ, d(x)) / E_I(λ, 0) = 10^{−β(λ) d(x)} = (Rer(λ))^{d(x)},    (2)
where x is a point in the image; λ is the wavelength of the light; I_λ(x) is the image captured by the camera; and J_λ(x) is the reflected light that is directly transmitted. Light attenuates when passing through a medium [5]; the residual energy ratio (Rer) indicates the ratio of residual energy to initial energy for every unit of distance. Supposing the energy of a light beam before and after it passes through a medium with a length of d(x) is E_I(λ, 0) and E_O(λ, d(x)), respectively, t_λ(x) represents the residual energy ratio of the light beam after passing through the medium. Because t_λ(x) depends on the wavelength λ and on d(x), the distance between x and the camera, it causes both color scatter and color cast. The Rer of different light wavelengths differs in water [6]. As illustrated in Figure 3, red light possesses a longer wavelength and lower frequency, thereby attenuating faster than blue light. This results in the blueness of most underwater images.
Fig. 3. Penetration of Light of Various Wavelengths through Water; Blue Light is the Strongest and Red Light is the Weakest
In addition to wavelength, the residual energy ratio t_λ(x) is also influenced by the salt ratio of the water [7]. Based on the amount of suspended particles and the salt ratio, ocean water falls into three categories: general ocean water (Ocean Type 1), turbid tropical-subtropical water (Ocean Type 2), and mid-latitude water (Ocean Type 3) [7]. For every meter of general ocean water that a light beam passes through, the Rer values of red light (700 nm), green light (520 nm), and blue light (440 nm) are 82%, 95%, and 97.5%, respectively. The Rer in other environments can be adjusted with general ocean water as the standard. Suppose an incident light beam A from the air forms background light B at depth D after attenuation and multiple scattering; the background light corresponds to the brightest portion of the image. The relationship between incident light beam A and background light B can be expressed with an energy attenuation model:
E_B(λ, D) = E_A(λ, 0) × (Rer(λ))^D,  λ ∈ {R, G, B},    (3)
where E_A(λ, 0) and E_B(λ, D) are the energy of the incident light and of the background light with wavelength λ. The Rer values of the various wavelengths are [6]:

Rer(λ) = 0.8–0.85 if λ = 650–750 nm,  0.93–0.97 if λ = 490–550 nm,  0.95–0.99 if λ = 400–490 nm.    (4)
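To make Eqs. (2)-(4) concrete, the small sketch below tabulates the general-ocean-water Rer values quoted above and evaluates the residual energy ratio and the background-light attenuation; the numeric constants are those of the text, everything else is an assumption.

```python
# Representative residual energy ratios per meter of general ocean water,
# for red (~700 nm), green (~520 nm) and blue (~440 nm) light.
RER = {"R": 0.82, "G": 0.95, "B": 0.975}

def residual_energy_ratio(channel, distance_m):
    """t_lambda(x) = Rer(lambda) ** d(x), last form of Eq. (2)."""
    return RER[channel] ** distance_m

def background_energy(incident_energy, channel, depth_m):
    """E_B(lambda, D) = E_A(lambda, 0) * Rer(lambda) ** D, Eq. (3)."""
    return incident_energy * RER[channel] ** depth_m

# After 10 m of water, red light keeps roughly 14% of its energy,
# while blue light keeps about 78%:
# residual_energy_ratio("R", 10) ~= 0.137; residual_energy_ratio("B", 10) ~= 0.776
```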
Conventionally, the processing of underwater images is directed either towards calibrating the distortion from color scatter or that from color cast. Research on improving the former has included applying the properties of polarizers to enhance image contrast and visibility [8], using image dehazing to eliminate haze effects and enhance image contrast [9], and combining point spread functions (PSF) and the modulation transfer function (MTF) in coordination with wavelet decomposition to enhance the high frequency areas in images [10] and increase visibility. Although the approaches above can augment contrast and sharpen images, they cannot solve the issue of color cast. Research regarding the improvement of color cast includes using the properties of light transmission through water to provide energy compensation based on the attenuation differences between various wavelengths [11] and employing histogram equalization on underwater images to balance the luminance distributions of colors [12]. Despite the improvement in the color distortion of objects, these methods cannot repair the image blurriness caused by color scatter. The WCID algorithm proposed in this study combines a dehazing algorithm and energy compensation. The dark channel prior is used to estimate the distance of the objects to the camera, and the dehazing algorithm removes the haze effects caused by color scatter. Once the underwater background light and the Rer values of the various wavelengths of light are used to estimate the depth of the underwater scene, reverse compensation according to each wavelength is carried out to restore the color cast from water depth. With WCID, expensive optical instruments or distance estimation from two images are no longer required; WCID can effectively enhance visibility in underwater images and restore the original colors, obtaining high quality visual effects.
2 Underwater Image Model
The actual environment of underwater photography is shown in Figure 2. Natural light from above the water attenuates while traveling to underwater depth D to illuminate the underwater scene. At x, a point within the scene, the reflected light travels a distance of d(x) to the camera to form the image. Color scatter is a result of light absorption and multiple scattering by suspended particles on the way to the camera; color cast is due to the inconsistent attenuation of light at different wavelengths and occurs over both the depth D and the distance d(x). Within the depth range R, from the top of the image (D) to the bottom (D+R), the degree of attenuation varies in each area of the image, thereby necessitating the estimation of the underwater depth of each point for compensation. In general, to overcome insufficient lighting in an underwater photographic environment, an artificial light source such as L is used to assist photography. While compensating the energy lost to attenuation within the depth range, the luminance contributed by L must be considered to avoid overcompensation. The WCID algorithm follows an underwater image model for reverse compensation by first removing the color scatter and color cast due to the distance d(x) and then restoring the color cast due to the depth D. The amount of energy attenuated within the image range R and the luminance of the artificial light source L are then considered before carrying out appropriate compensation. The following sections discuss the estimation of d(x), depth D, artificial light source L, and depth range R, as well as the procedure for energy compensation.
2.1 Distance between the Camera and the Object: d(x)
Conventional estimation of the distance between an object in an image and the camera requires two images for parallax. In a hazy environment, haze increases with distance; consequently, evaluating the concentration of haze in a single image is sufficient to predict the distance d(x) between the object in the scene and the camera [4]. Using the dark channel prior, d(x) can be derived [14]. The dark channel refers to the phenomenon that there is at least one pixel with a near-zero brightness value in the area Ω(x) surrounding any given point x in an outdoor haze-free image. Therefore the dark channel J^dark of an outdoor haze-free image can be defined as:
J^dark(x) = min_λ ( min_{y∈Ω(x)} J_λ(y) ) ≈ 0,  λ ∈ {R, G, B}.    (5)
Taking the min operation in the local patch Ω(x) on the hazy image Iλ(x) in Eq. (1), we have
min_{y∈Ω(x)} ( I_λ(y) ) = min_{y∈Ω(x)} { J_λ(y) t_λ(y) + B_λ (1 − t_λ(y)) },  λ ∈ {R, G, B},    (6)
since Bλ is the homogeneous background light and the residual energy ratio tλ(y) in a local patch Ω(x) is essentially a constant [13], Eq. (6) can be further simplified as:
min_{y∈Ω(x)} ( I_λ(y) ) ≈ min_{y∈Ω(x)} ( J_λ(y) ) · t_λ(x) + B_λ (1 − t_λ(x)),  λ ∈ {R, G, B}.    (7)
Rearranging the above equation and performing one more min operation among all three color channels gives:

min_{λ∈{R,G,B}} { min_{y∈Ω(x)} ( I_λ(y) ) / B_λ } ≈ min_{λ∈{R,G,B}} { ( min_{y∈Ω(x)} ( J_λ(y) ) / B_λ ) · t_λ(x) } + min_{λ∈{R,G,B}} ( 1 − t_λ(x) ),    (8)
since Jdark is very close to 0 as shown in Eq. (5), the first term on the right-hand side of Eq. (8) can be regarded as 0. After this simplification, Eq. (8) can be rewritten as:
t̃(x) = min_{λ∈{R,G,B}} t_λ(x) ≈ 1 − min_{λ∈{R,G,B}} { min_{y∈Ω(x)} ( I_λ(y) ) / B_λ },    (9)
where t̃(x) changes with d(x), the distance between a point x on an object and the camera. The depth map of Fig. 1 is shown in Fig. 4(a). The calculation of the dark channel prior is block-based, which yields a less accurate depth map. By applying image matting to refine the depth map of Figure 4, the mosaic distortion can be reduced [14] for a better capture of object contours.
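A compact sketch of the block-wise estimate in Eq. (9) (not the authors' code); the patch size, the clipping, and the use of scipy's minimum filter for the local minimum are assumptions.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def estimate_transmission(img, background, patch=15):
    """Coarse transmission of Eq. (9): 1 minus the dark channel of I_lambda / B_lambda.

    img:        float image in [0, 1], shape (h, w, 3)
    background: background light B_lambda per channel, shape (3,)
    """
    ratio = img / np.maximum(background, 1e-6)   # I_lambda / B_lambda
    dark = ratio.min(axis=2)                     # min over the three color channels
    dark = minimum_filter(dark, size=patch)      # min over the local patch Omega(x)
    return np.clip(1.0 - dark, 0.1, 1.0)         # keep t away from zero for Eq. (12)
```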
Fig. 4. (a) Depth Map from Estimating Distance between the Object and the Camera Using Dark Channel Prior; (b) Blow Up of Frame I; (c) Blow Up of Frame II
Image matting requires as input a preliminary partitioned depth map (Fig. 4) and the original image. Objects are detected using the relationship between the mean color value and the covariance of a local area w_x within the image, and then the preliminary partitioned depth map is corrected using the relationships among the objects themselves. Taking the depth map of Figure 4(a) as t_coarse and the improved depth map as t_refined, t_coarse and t_refined are related by:
(L + Λ U) t_refined = Λ t_coarse,    (10)

where U is a unit matrix, Λ is a regularization coefficient, and L represents the matting Laplacian matrix [15]:

L(i, j) = Σ_{x | (i,j)∈w_x} ( δ_ij − (1/|w_x|) ( 1 + (I_i − μ_x)^T ( Σ_x + (ε/|w_x|) U )^{−1} (I_j − μ_x) ) ).    (11)
Suppose the coordinates of a point x in the image are (i, j); I represents the original image; δ_ij is the Kronecker delta; Σ_x is the color covariance of area w_x; μ_x is the mean color value of area w_x; and ε is a regularization coefficient. Figure 5 shows the depth map after the mosaic distortion has been improved by image matting.
Fig. 5. (a) Depth Map after Improvement with Image Matting; (b) Blow Up of Frame I; (c) Blow Up of Frame II; in comparison with Fig. 4(b) and 4(c), the improved depth map captures the contours of image objects more accurately
After correcting the original image to obtain a more accurate d(x), the distance between the object and the camera, Eq. (1) is employed to remove the haze effects and restore a portion of the color cast. The dehazed underwater image J_λ(x) is:

J_λ(x) = ( I_λ(x) − B_λ ) / t_λ(x) + B_λ,  λ ∈ {R, G, B}.    (12)
As can be seen in Fig. 6, subsequent to compensating the haze and color cast caused by the distance between the objects and the camera, a further estimation of the distance between the objects and the water surface D was required to calibrate the color cast caused by water depth.
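Given the (matting-refined) transmission map and the background light, Eq. (12) inverts the haze model; a minimal sketch, with the clipping range as an assumption:

```python
import numpy as np

def dehaze(img, background, t):
    """Eq. (12): J_lambda(x) = (I_lambda(x) - B_lambda) / t_lambda(x) + B_lambda."""
    t3 = np.clip(t, 0.1, 1.0)[:, :, None]        # broadcast the same t to all channels
    dehazed = (img - background) / t3 + background
    return np.clip(dehazed, 0.0, 1.0)
```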
Fig. 6. Underwater Image after Eliminating Haze and Some Color cast Caused by d(x); a bluish offset still exists
2.2 Underwater Depth of the Photographic Scene D
Suppose the energy of each wavelength in light beam A from the air is E_A(R), E_A(G), and E_A(B). After penetrating to underwater depth D, the energy of each wavelength after attenuation becomes the background light B: E_B(R), E_B(G), and E_B(B). To estimate the underwater depth D, the background light B is first detected. The image location corresponding to the background light B can be estimated with the dark channel prior [14]. The depth D is the depth k with the minimum error between the background light energy E_B(R), E_B(G), E_B(B) and the incident light energy from the air E_A(R), E_A(G), E_A(B) after attenuation to depth k (Eq. (3)):

min_k { ( E_B(R, D) − E_A(R, 0) × (Rer(R))^k ) + ( E_B(G, D) − E_A(G, 0) × (Rer(G))^k ) + ( E_B(B, D) − E_A(B, 0) × (Rer(B))^k ) }.    (13)
Once D is determined, the amount of attenuation in each wavelength can be used to compensate the energy differences and restore the color cast distortion from depth D:

E(Ĵ_λ) = E(J_λ) / (Rer(λ))^D,  λ ∈ {R, G, B},    (14)
where Ĵ_λ is the underwater image after haze removal and calibration of the color cast, as shown in Figure 7. However, the depths of the top and bottom of the image are not
the same; using depth D for energy compensation of the entire image results in some color cast remaining at the bottom of the image. Thus, depth estimation for each point in the image is necessary to achieve corresponding energy compensation at the various depths.
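A sketch of the search of Eq. (13) and the compensation of Eq. (14). The candidate depth grid and the airlight energies E_A are assumptions, and a squared residual is used here so that the minimum is well defined, which is a slight strengthening of Eq. (13) as written.

```python
import numpy as np

def estimate_depth(E_B, E_A, rer, depths=np.arange(0.0, 30.0, 0.1)):
    """Eq. (13): find the depth whose attenuated airlight best matches E_B.

    E_B, E_A, rer are dicts keyed by 'R', 'G', 'B'.
    """
    def err(k):
        return sum((E_B[c] - E_A[c] * rer[c] ** k) ** 2 for c in "RGB")
    return min(depths, key=err)

def compensate_depth(J, rer, D):
    """Eq. (14): divide each channel by Rer(lambda) ** D to undo the color cast."""
    factors = np.array([rer["R"] ** D, rer["G"] ** D, rer["B"] ** D])
    return np.clip(J / factors, 0.0, 1.0)
```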
Fig. 7. Underwater Image after Removing Color cast and Color Scatter from Distance d(x) and Depth D; as the depths of the top and bottom of the image are different, color cast distortion still exists at the bottom of the image
2.3 Image Depth Range R
The depth range of the image covers the photographic scene from depth D to D+R, as shown in Figure 2. As light transmits through the depth range R, the different underwater depths induce differences in color cast at the top and bottom of the image, thereby necessitating varied energy compensation that corresponds to the underwater depth of each point to rectify color cast. During depth estimation in the depth range R, the foreground and background must first be separated to prevent the colors of the objects in the image from interfering with the estimation. The background of the image is natural light that travels directly to the camera without reflecting off objects; therefore using the background for estimation allows underwater depths to be calculated more accurately.
type(x) = { foreground if d(x) > σ;  background if d(x) ≤ σ },    (15)
where σ is an adjustable coefficient and d(x) is the distance between the object and the camera. The background light B, which is light that attenuated during its passage to depth D, is located at the very top of the image background. Taking a pixel at the bottom of the image background as d_bottom, the corresponding underwater depth is D+R. The background light B and the point d_bottom are respectively located at the top and bottom of the depth range. The underwater depth of a point x in the image is between D and D+R and can be obtained by linear interpolation between the background light B and the point d_bottom. The corresponding underwater depth of the point d_bottom is derived from Eq. (13). Suppose x is situated on the x_j-th scan line, the background light B is on the b-th scan line, and d_bottom is on the c-th scan line. The underwater depth of x in the actual scene, δ(x_j), can be derived by linear interpolation between the background light B and the point d_bottom:
δ(x_j) = D + ( R × (x_j − b) / (c − b) ).    (16)
Once the underwater depth δ(xj) of any given point x is obtained, the energy attenuation model is employed in energy compensation of the foreground and background in image Iλ(x):
E(Î_λ(x)) = E(I_λ(x)) / (Rer(λ))^{δ(x_j) − D},  λ ∈ {R, G, B}.    (17)
Figure 8 demonstrates the result of fine-tuning the amount of wavelength compensation by deriving the water depth of every image pixel. In comparison with Fig. 7, the color shift suffered at the lower part of the image is greatly reduced. Color balance is restored for the whole image, rather than just the top portion of the frame.
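A sketch of Eqs. (15)-(17): each scan line receives a depth by linear interpolation between the background-light row b and the bottom background row c, and the compensation exponent then varies per row. The foreground/background threshold and the row indices are assumptions.

```python
import numpy as np

def scanline_depths(height, D, R, b_row, c_row):
    """Eq. (16): delta(x_j) = D + R * (x_j - b) / (c - b) for every scan line."""
    rows = np.arange(height)
    return D + R * (rows - b_row) / max(c_row - b_row, 1)

def compensate_depth_range(img, rer, depths, D):
    """Eq. (17): per-row compensation by Rer(lambda) ** (delta(x_j) - D)."""
    factors = np.array([rer["R"], rer["G"], rer["B"]])
    exponents = depths[:, None, None] - D                 # shape (h, 1, 1)
    out = img / factors ** exponents                      # broadcast to (h, w, 3)
    return np.clip(out, 0.0, 1.0)

# Eq. (15): with a distance map d and threshold sigma,
# fg_mask = d > sigma  marks the foreground, the rest is background.
```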
Fig. 8. Underwater Image after Eliminating Color cast from Image Depth Range
2.4 Artificial Light Source L
Artificial light sources are often used to supplement the insufficient lighting commonly encountered in underwater photographic environments, as shown in Fig. 9. If an artificial light source L is employed during the image capturing process, the luminance contributed by L must be deducted first to avoid over-compensation in the subsequent stages, as illustrated in Fig. 10. The existence of artificial lighting can be determined from the difference between the mean luminance of the foreground and that of the background, L_f and L_b. In an underwater image without artificial lighting, the background directly transmits natural light without reflecting off of objects and is therefore the brighter part of the image. A higher mean luminance in the foreground of an image than in the background indicates the existence of an artificial light source. The difference between L_f and L_b is also the luminance of the artificial light source; that is to say, the luminance of the artificial light source is:

L(λ) = L_f(λ) − L_b(λ).    (18)
Supposing the artificial light radiates spherically, the influence of the light is in inverse proportion to the square of the distance. The closer the object, the more it is
affected by the artificial light source, and vice versa. After incorporating the variable of the artificial light source, Eq. (17) can be rewritten as:
E(Î_λ(x)) = ( E(I_λ(x)) − L(λ) × (1 − d(x)^2) ) / (Rer(λ))^{δ(x_j) − D},  λ ∈ {R, G, B}.    (19)
The result following removal of the effects of the artificial light source L is shown in Fig. 11.
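Finally, a sketch of Eqs. (18)-(19): the artificial-light term is estimated as the foreground-minus-background mean and deducted with the (1 − d(x)²) falloff before the depth compensation. Treating the deduction as part of the numerator follows the reconstruction of Eq. (19) above; the masks and clipping are assumptions.

```python
import numpy as np

def artificial_light(img, fg_mask):
    """Eq. (18): per-channel mean luminance difference, foreground minus background."""
    L_f = img[fg_mask].mean(axis=0)
    L_b = img[~fg_mask].mean(axis=0)
    return np.maximum(L_f - L_b, 0.0)

def compensate_with_light(img, rer, depths, D, L, d):
    """Eq. (19): deduct the artificial light, then apply the depth-range compensation."""
    falloff = (1.0 - d ** 2)[:, :, None]          # closer objects are affected more
    deducted = np.maximum(img - L * falloff, 0.0)
    factors = np.array([rer["R"], rer["G"], rer["B"]])
    out = deducted / factors ** (depths[:, None, None] - D)
    return np.clip(out, 0.0, 1.0)
```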
Fig. 9. Illuminated by an artificial light source, the intensity of the foreground appears brighter than that of the background
Fig. 10. When the luminance contributed by an artificial light source is not deducted first, an over-exposed image is obtained after the subsequent compensation stages
Fig. 11. The underwater image obtained after removing the artificial lighting present in Fig. 10
3 Experiment
Figure 12(a) was captured from an underwater video on the YouTube website filmed by the Bubble Vision Co. [16]. This AVI format video is 350 seconds long with a resolution of 720p. Figure 12(b) shows the result after processing with a traditional dehazing algorithm [9]. Although the contrast of the image has increased, the color cast is more apparent, since the attenuated energy was not individually compensated according to wavelength. Figure 12(c) shows the image after histogram equalization processing [12], in which haze effects and color cast still remain. Figure 12(d) shows the image after processing with the WCID method proposed in this study. The process effectively restored the image colors and removed the haze effects, giving the image the original color and definition it would have had if it were not underwater. The Rer values used in the WCID in this study for red light, green light, and blue light were 82%, 95%, and 97.5%, respectively.
Fig. 12. (a) Original Input Image; the frame corresponds to the automatically estimated background light; (b) Image after Processing with Traditional Dehazing Algorithm; (c) Image after Processing with Histogram Equalization; (d) Image after Processing with WCID; the underwater depth of the scene in the image ranges approximately from 10 to 14 meters
4 Conclusion
The WCID algorithm proposed in this study can effectively restore image color and remove haze. However, the salt ratio and amount of suspended particles in ocean
water vary with time, location, and season, making accurate estimation of the rate of energy attenuation problematic. In addition, the artificial light source is presumed to radiate spherically, unlike the surface light sources generally used in underwater photography, so its luminance is also difficult to estimate accurately. Future researchers may wish to obtain samples of the ocean water and data on the artificial light sources used in order to further improve the restoration process.
References
1. Junku, Y., Michael, W.: Underwater Robotics. J. Advanced Robotics 15(5), 609–639 (2001)
2. Ronald Zaneveld, J., Pegau, W.: Robust underwater visibility parameter. J. Optics Express 11, 2997–3009 (2003)
3. Van Rossum, M.C.W., Nieuwenhuizen, T.M.: Multiple scattering of classical waves: microscopy, mesoscopy and diffusion. J. Rev. Mod. Phys. 71(1), 313–371 (1999)
4. Houghton, J.T.: The Physics of Atmospheres. Cambridge University Press, London (2001)
5. McFarland, W.N.: Light in the sea - correlations with behaviors of fishes and invertebrates. J. American Scientist Zoology 26, 389–401 (1986)
6. Duntley, S.Q.: Light in the Sea. J. Optical Society of America 53(2), 214–233 (1963)
7. Jerlov, N.G.: Optical Oceanography. Elsevier Publishing Company, Amsterdam (1968)
8. Schechner, Y.Y., Karpel, N.: Clear Underwater Vision. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 536–543. IEEE Press, USA (2004)
9. Chao, L., Wang, M.: Removal of Water Scattering. In: International Conference on Computer Engineering and Technology, vol. 2, pp. 35–39. IEEE Press, India (2010)
10. Weilin, H., Gray, D.J., Weidemann, A.D., Fournier, G.R., Forand, J.L.: Automated underwater image restoration and retrieval of related optical properties. In: International Geoscience and Remote Sensing Symposium, vol. 1, pp. 1889–1892. IEEE Press, Spain (2007)
11. Yamashita, A., Fujii, M., Kaneko, T.: Color Registration of Underwater Images for Underwater Sensing with Consideration of Light Attenuation. In: International Conference on Robotics and Automation, pp. 4570–4575. IEEE Press, Italy (2007)
12. Iqbal, K., Abdul Salam, R., Osman, A., Zawawi Talib, A.: Underwater image enhancement using an integrated color model. J. Computer Science 34, 2–12 (2007)
13. Fattal, R.: Single Image Dehazing. In: International Conference on Computer Graphics and Interactive Techniques, vol. (72), pp. 1–9. ACM SIGGRAPH Press, USA (2008)
14. He, K., Sun, J., Tang, X.: Single image haze removal using Dark Channel Prior. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 1956–1963. IEEE Press, USA (2009)
15. Levin, A., Lischinski, D., Weiss, Y.: A closed form solution to natural image matting. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 61–68. IEEE Press, USA (2006)
16. Bubble Vision underwater video, http://www.youtube.com/user/bubblevision
Video Stippling
Thomas Houit and Frank Nielsen
Ecole Polytechnique
Abstract. In this paper, we consider rendering color videos using a non-photo-realistic art form technique commonly called stippling. Stippling is the art of rendering images using point sets, possibly with various attributes like sizes, elementary shapes, and colors. Producing nice stippling is attractive not only for the sake of image depiction but also because it yields a compact vectorial format for storing the semantic information of media. In order to create stippled videos, our method improves over the naive scheme by considering dynamic point creation and deletion according to the current scene semantic complexity. Furthermore, we explain how to produce high quality stippled “videos” (e.g., fully dynamic spatio-temporal point sets) for media containing various fading effects. We report on the practical performance of our implementation, and present several stippled video results rendered on-the-fly using our viewer, which allows both spatial and temporal dynamic rescaling (e.g., vectorially upscaling the frame rate). Keywords: Non-photo-realistic rendering, Voronoi tessellations, Video, Stippling, Vectorization.
1 Introduction
Historically, stippling was primarily used in the printing industry to create dot patterns with various shade and size effects. This technique has proven successful at conveying visual information, and was widely adopted by artists, who called this rendering art pointillism.1 Informally speaking, the main idea is that many dots carefully drawn on paper can fairly approximate different tones perceived through local differences of density, as exemplified in Figure 1. The main difference from dithering and half-toning methods is that points are allowed to be placed anywhere, and not only on a fixed regular grid. In this work, we present a method to generate stippled videos. Generating videos with stipples or other marks is very interesting but has not yet been fully solved. Our method produces high quality videos without point flickering artifacts. Each point can easily be tracked during an entire shot sequence. Our method allows us to deal robustly with an entire video without having to cut it down into pieces using shot detection algorithms. Secord [10] designed a non-interactive technique to create quality stippled still images in 2002. Based on the celebrated Lloyd's k-means method [6], Secord
1 See http://www.randyglassstudio.com/ for some renderings by award-winning artist Randy Glass.
Fig. 1. Stippling a photo by rendering the bitmap (left) as a point set (right) distributed uniformly according to a prescribed underlying image density: the grey-level intensity (here, 600 points)
creates stippling adapted to a density function and improves the result by considering various point sizes. We adopted this approach because it fits the goal of our algorithm: being able to render a high quality output for every type of video without requiring any other data. Other approaches exist, such as the use of multi-agent systems to distribute stipples, explored by Schlechtweg [9]. It is also possible to generate stipple patterns on 3D objects by triangulating the surface and positioning the dots on the triangle vertices [8]. This is a very fast method but the resulting patterns are not optimal. It has been improved by Vanderhaeghe [11], but it still cannot take every type of data as input. Like all computer-generated point distributions, we are also looking for stippled images that have the so-called blue noise characteristics, with large mutual inter-distances between points and no apparent regularity artifacts. An interesting approach to stippling with blue noise characteristics based on Penrose tiling is described by Ostromoukhov [7]. More recently, a method based on Wang tiles, described by Kopf [5], enabled real-time stippling with blue noise characteristics. Balzer [1] developed yet another way to create Voronoi diagrams with blue noise characteristics. Another way to render a stipple drawing is to mix different types of points instead of using only contrasted and colored points. The work of Hiller [4] explored this possibility by positioning small patterns in a visually pleasing form in the stippled image. We use patterns in our algorithm to improve the rendering of edges. Indeed, a major drawback of the previous stippling methods is that while they handle fine details and textures nicely and efficiently, they fall short when dealing with objects with sharp edges. We overcome this drawback and propose a frequency-based approach to detect image edges and enhance their support and rendering. This frequency approach is further improved by the use of patterns that replace some points and partially reconstruct the edges of the image [3]. We adapted this recent set of methods to video and solved the problems that appeared while doing so.
The roadmap of the paper is as follows: the basic ingredient of our approach is the use of centroidal Voronoi diagrams (a fundamental structure of computational geometry) to produce good-looking distributions of points, as explained in Section 2. This generative method is extended to compute a full video in Section 3.1. To improve the rendering, one then needs to adjust some parameters to obtain good fading effects, as explained in Section 3.2. An improved rendering must also consider both color information and point contrast; the small adjustments required to get such an output are described in Section 3.3. Finally, we implemented a tailored scheme to handle the stippling of sharp edges: Section 3.4 explains our solution, which considers both high and low frequencies to place points accordingly and uses oriented patterns to reconstruct sharp edges in the output, and describes the obtained results.
2 Voronoi Diagrams
In order to create a stippled image, we adapted Lloyd's k-means method [6] to obtain a Voronoi diagram that adapts to an underlying image density. To do this, we need two sets of points, drawn depending on the image intensity (a point has more chance of being drawn on a darker pixel):

1. The Sites, which are the points visible in the stippled image and also represent the centers of mass of the Voronoi cells.
2. The generator points, or support points. We draw about 10^3 times more of these points than Sites. They are fixed and not visible in the stippled image.

Every generator point is then identified with the closest Site according to a given metric. The common choice is the Euclidean L2 distance

d(P1, P2) = ||P1 − P2|| = sqrt( (x1 − x2)² + (y1 − y2)² ),

where Pi = (xi, yi) are two given points. Each set of generator points identified with a particular Site forms a Voronoi region. We can define each Voronoi region Vi as

Vi = { x ∈ Ω : ||x − xi|| ≤ ||x − xj|| for j = 1, ..., n and j ≠ i },

where ||.|| denotes the Euclidean distance derived from the L2 norm. The following update step is performed to obtain a good-looking random distribution of Sites: we move each Site to the center of mass of its associated generator points. The center of mass is computed in a discrete way, using only the generator points rather than the whole area surrounding a Site; this is the main difference with respect to Lloyd's method, and it allows us to obtain a Voronoi diagram adapted to an underlying density. We then redo the identification for each generator point. Iterating these two steps lets us converge to a centroidal Voronoi diagram [2], a partition of the domain adapted to a given image density.
The convergence of Lloyd's algorithm to a centroidal Voronoi diagram on continuous domains has been proven for the one-dimensional case. Although the higher dimensional cases seem to converge similarly in practice, no formal proof has been reported yet. Notice that here we fully discretize the space by considering support points. The criterion we use to define convergence is a numerical comparison of the position of each Site between two consecutive iterations.
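As a concrete illustration, the following Python sketch implements the density-adapted Lloyd iteration described above. It is only a sketch under stated assumptions: the rejection sampling used to draw points from the grey-level density, the helper names, and the convergence threshold are our own choices, not code from the authors.

```python
import numpy as np

def sample_points(density, n, rng):
    """Rejection-sample n points, favouring high-density (dark) pixels."""
    h, w = density.shape
    pts = []
    while len(pts) < n:
        y, x = rng.integers(0, h), rng.integers(0, w)
        if rng.random() < density[y, x]:
            pts.append((x, y))
    return np.array(pts, dtype=float)

def lloyd_cvt(density, n_sites, n_support=None, tol=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    n_support = n_support or 1000 * n_sites          # ~10^3 support points per Site
    sites = sample_points(density, n_sites, rng)
    support = sample_points(density, n_support, rng)  # fixed generator points
    while True:
        # identify every generator point with its closest Site (L2 metric)
        d2 = ((support[:, None, :] - sites[None, :, :]) ** 2).sum(-1)
        owner = d2.argmin(axis=1)
        new_sites = sites.copy()
        for i in range(n_sites):
            members = support[owner == i]
            if len(members):                          # move Site to the discrete centroid
                new_sites[i] = members.mean(axis=0)
        if np.abs(new_sites - sites).max() < tol:     # positions stable: converged
            return new_sites
        sites = new_sites
```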
3 CVT Method for Video Stippling

3.1 From Images to Video
To compute a full video stippling, we first extract an image sequence from the video in order to be able to use image stippling methods. We then need to keep the information about the stipple points of the first N − 1 images before computing the N-th image. If we fail to do so, no correlation between the images of the sequence will appear and undesirable flickering effects will be visible in the synthesized video. The information we need for the remainder of the algorithm is the current overall density, the current number of generator points and the current number of Sites. We start the computation of the N-th image by keeping the generator points and Sites found for the (N − 1)-th image. Then we look for all the generator points and Sites that are no longer needed in this new image. To decide whether to keep points or not, we generate a difference image Pdiff− whose pixel intensities are set as follows:

Pdiff−(x, y) = 0, if P_{N−1}(x, y) − P_N(x, y) ≤ 0,
Pdiff−(x, y) = P_{N−1}(x, y) − P_N(x, y), otherwise,

where P_N(x, y) is the color of the pixel at coordinates (x, y) of frame number N. Each generator point or Site placed on a pixel where the value of the difference image is different from 0 has a probability proportional to that value of being deleted. We do the same to find where we have to add new points by calculating another difference image Pdiff+:

Pdiff+(x, y) = 0, if P_N(x, y) − P_{N−1}(x, y) ≤ 0,
Pdiff+(x, y) = P_N(x, y) − P_{N−1}(x, y), otherwise.

We thus obtain two difference images, as explained in Figure 2. We then need to link the global density of the image with the total number of points. For each image we therefore calculate its global density,

ρ = ( Σ_{i=1}^{n} P(i) ) / (255 n),

where P(i) is the grey value of pixel i and n is the total number of pixels. The user enters the initial number of points. This number is linked to the initial density and serves as a reference during the processing of the full video sequence. Our algorithm preserves the same ratio between the number of points
Fig. 2. From two consecutive frames 1 and 2 (left), we generate two difference images (right): one with the blue zone (point suppression zone) and one with the red zone (point addition zone)
and the image density during all the operations, by adding or suppressing points based on the underlying image intensity until the right number of points is reached. To optimize the rendering, we then apply the update step described in Section 2 until convergence. We repeat these addition/suppression/update steps for all the images of the video sequence. When this is done, we store, for each Site and at each frame, its current position and color (black at this stage, but color can be added as described later in Section 3.3). With this information we generate a text file to which we also add other important data such as the total number of Sites that will appear during the whole shot, the original size of the video (i.e., its width and height), and the total number of frames. To read and “play” the output file, we have developed an in-house Java application that renders the stippled video. With the information provided in the text file, the viewer lets us resize the video dynamically on demand, change the “contrast” of the Sites (the difference in size between a Site in a high density region and one in a low density region), and activate or deactivate the different options (color, point size, patterns, ...). Moreover, our application can generate intermediate time frames by interpolating the position, size, and color of each Site between two consecutive frames, thus yielding a true spatio-temporal vectorization of the video.
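The sketch below shows how the two difference images and the density-proportional point budget described above could be computed with NumPy. The function names, the neutral comments about which map drives suppression versus addition, and the rounding of the budget are our own assumptions, not the authors' implementation.

```python
import numpy as np

def difference_images(prev, curr):
    """prev, curr: per-pixel density maps of two consecutive frames, in [0, 255].
    Returns (p_diff_minus, p_diff_plus): maps driving point suppression/addition."""
    p_diff_minus = np.clip(prev - curr, 0, None)   # value dropped: candidate for suppression
    p_diff_plus = np.clip(curr - prev, 0, None)    # value rose: candidate for addition
    return p_diff_minus, p_diff_plus

def point_budget(frame, m_initial, rho_initial):
    """Keep the user-defined ratio between point count and global density."""
    rho = frame.sum() / (255.0 * frame.size)       # global density of the frame
    return int(round(m_initial * rho / rho_initial))
```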
3.2 Image Differences and Drift Adjustment of Sites
Handling fading effects. The method described above is particularly well adapted to shots where objects appear and disappear instantly. The problem is that objects often disappear through a fading effect. If we use the previous formula to calculate the image differences, we do not obtain a proper rendering for those fading shots. Indeed, if an object disappears progressively over three images (say by 33% each time), we suppress 33% of its generator points and Sites at each step; but 100 × (1 − 0.33)³ > 0, so generator points and Sites remain in white zones of the image after the full removal of the object, which is of course not the expected result. To correct this, we need a difference formula that converges towards 0 when
the destination image becomes fully white. We implemented the following formula in our program. If P_{N−1}(i) < P_N(i), we store in the first difference image

( P_N(i) − P_{N−1}(i) ) × 255 / (255 − P_{N−1}(i)).

If P_{N−1}(i) > P_N(i), we store in the other difference image

( P_{N−1}(i) − P_N(i) ) × 255 / P_{N−1}(i),

using the same notation as before. This yields a faithful image-to-vector transcoding of the fading effect during a shot sequence. A worked-out example is shown in Figure 3.
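A minimal sketch of this fading-aware difference, assuming the same frame representation as in the previous snippet; the small epsilon guarding against division by zero is our addition.

```python
import numpy as np

def fading_difference_images(prev, curr, eps=1e-6):
    """Difference maps that converge to 0 as the destination value reaches 255."""
    rising = curr > prev
    p_diff_plus = np.where(rising,
                           (curr - prev) * 255.0 / (255.0 - prev + eps), 0.0)
    p_diff_minus = np.where(~rising,
                            (prev - curr) * 255.0 / (prev + eps), 0.0)
    return p_diff_minus, p_diff_plus
```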
Fig. 3. The text “Duffy” (left) fades out progressively to let the small character appear (right). The point density of the Voronoi tessellations adapts automatically and progressively to this fading effect. (3000 points + 1500 frequency points)
Drift correction. Another major drawback observed is the drift of the Sites that lie on an edge of an object in the image. After moving the Sites to the center of mass of their associated generator points, some Sites end up slightly on the wrong side of the boundary, often in a white zone. To completely suppress this annoying artifact, we need to repeat the suppression and relocation of the Sites several times, until the number of deleted Sites falls under a threshold. In practice, we noticed that repeating this step only one more time is already very efficient and that we can enforce everywhere the exact correlation between Site density and image color. The result is shown in Figure 4.
Fig. 4. (Left) Some points drift from the original shape because of their identification with the centroid of their Voronoi region. (Right) our method removed this undesirable effect. (2000 points)
3.3 Color and Size
It would seem that to faithfully represent a grey-scale image with 256 levels, our method would require 256 support points per pixel of the original image. This is fortunately not the case. When rendering a stippled image we do lose some information owing to the limited number of support points, but the stippling method works on areas rather than on individual pixels, and the most important information is the tone of each area of the image. We can further add properties to each Site: it is easy to pick up the color of the pixel of the reference image located at the same position as the Site. With this information we can display the Site in its original color and scale its size in proportion to its grey value. This simple operation considerably improves the perceived quality of the output stippled video and lets the user distinguish more details in videos with low contrast (see Figure 5).
Fig. 5. (Left) Stippling result without contrast and color. (Middle) stippling with contrast. (Right) stippling with color to enhance the rendering. (3000 points)
3.4 Frequency Considerations
In order to detect edges and enhance their support and rendering, we considered a novel rendering approach, carried out in parallel with the previous video processing. We first compute the discrete gradient at every pixel of the image (using the Sobel operator) and threshold the result to set low frequencies to zero. We then apply the same algorithm with a number of Sites that can vary and that is adapted to the number of pixels with high frequencies. We obtain a distribution of Sites located on the edges of the image. These edge points let us improve the overall rendering of the shapes in the image. To get a good rendering, we merely add those Sites to the ones previously calculated; to get an even better rendering, we reduce the size of these points by 33% compared with the other points. The frequency-based stippling result for Lena is presented in Figure 6.
Fig. 6. We extract two density maps from the original source image: one is the color map and the other the frequency map. Then we apply our stippling algorithm and finally add both size and color information. We end the process by summing these two contributions — the classical (3000 points) and frequency approaches (6000 points).
After placing the patterns, we have to orient them perpendicularly to the local gradient of the image. To do so, we store for each image the local Sobel gradients along the x and y axes, and estimate the local gradient angle with the following formula:
θ(x, y) = arctan( Δx / Δy ).

Once this is done, we can associate each frequency Site with its orientation. The result obtained on the Lena image is shown in Figure 7. Contrary to the previous frequency approach, we need fewer points to obtain a pleasing rendering, which also increases the speed of our algorithm.
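A possible NumPy/SciPy sketch of this orientation step; the use of scipy.ndimage.sobel and the sampling of the angle at the Site coordinates are our assumptions about how it could be implemented, not the authors' code.

```python
import numpy as np
from scipy import ndimage

def pattern_orientations(image, edge_sites):
    """image: 2-D grey-level array; edge_sites: (N, 2) array of (x, y) Sites.
    Returns one angle per Site so the segment pattern lies across the gradient."""
    grad_x = ndimage.sobel(image, axis=1)          # derivative along x
    grad_y = ndimage.sobel(image, axis=0)          # derivative along y
    xs = edge_sites[:, 0].astype(int)
    ys = edge_sites[:, 1].astype(int)
    # theta(x, y) = arctan(dx / dy), as in the formula above
    return np.arctan2(grad_x[ys, xs], grad_y[ys, xs])
```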
4 Experiments and Results

4.1 Blue Noise Characteristics
Figure 8 shows a representative distribution of 1024 random points that was generated with Lloyd's method and extracted with the method of Balzer et al. [1]. The distribution clearly exhibits regularities, underlined by the FFT analysis on the right of the figure. Our method generates a random set of points (here 3700 points) with fewer regularities. The result is substantially better and is preserved over a whole video. Two sets were generated and are shown
Fig. 7. Use of patterns to describe sharp edges in the image — 2000 segments
Fig. 8. Lloyd’s method generates point distributions with regular structures if it is not stopped manually. The example set of 1024 points was computed with Lloyd’s method to full convergence and contains regularities. The two other examples have been generated by our algorithm, the first directly and the second after the computation of a whole video until the last frame. They present less regularity.
in Figure 8. The first was obtained on a single image, by direct point placement with our algorithm, and presents good characteristics. The second was built within a video sequence by successive point additions, frame by frame, until the final count of 3700 points. We notice that this distribution has the same characteristics as the previous one. These blue noise characteristics are interesting because generating a good random point set is a challenge. Our algorithm behaves well because of the large role of randomness in the initialization, suppression and creation of Sites. Since we look for good-looking images, we can use high thresholds for the Voronoi tessellation termination criteria, which lets us keep a large part of the randomness while improving the quality of the rendered image.
4.2 Algorithm's Complexity
Our method and software allow one to convert a video into a stippled video, namely a time-varying point set with dynamic color and size attributes. The complexity of our method depends only on the number of support/Site points used to stipple the video. The complexity of the algorithm is quadratic, O(n²), where n is the number of Sites. We have carried out several
time measurements to confirm this overall complexity and to identify variations due to other parameters. The timing measurements were done on a video containing 10 frames, representing a black disk whose size progressively decreases by 10% each frame. The graph in Figure 9 sums up the results and plots the time consumed to stipple a video as a function of the number of Sites required.
Fig. 9. Graph representing the time required to calculate a 10-frame video with varying number of Sites
To estimate the time required to compute a video, we also need to take into account the number of point additions during the shot, an operation which increases the overall computation time. Of course, one can accelerate our algorithm by porting it to graphics processing units (GPUs): the calculation of Voronoi diagrams is known to adapt well to the GPU and is far quicker and more efficient there [12] than with CPU-based algorithms.
4.3 Size of the Output File
We measured, for 3 different video shots of 91 frames, the number of Sites as a function of the frame density. We noticed that our algorithm maintains a perfect correlation between those two parameters, following the initial ratio asked by the user. This observation implies that we can estimate the average number of points needed to compute each frame with the formula

M_initial + Σ_{i=0}^{Number of frames} ( ρ_diff(i) × M_initial / ρ_initial ),

where M represents the number of Sites and ρ the density. With this method we can quickly estimate the final size of the output video file. After ZIP compression we observed that we need about 200 bytes per frame per 1000 Sites. Thus a stippled video at 25 frames per second that lasts 1 hour is encoded in about 20 Mbytes.
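As a quick sanity check of this estimate, the following sketch computes the expected file size from the figures quoted above; the helper name and the assumption of a constant 1000 Sites per frame are ours.

```python
def estimated_size_bytes(n_frames, n_sites, bytes_per_frame_per_1000_sites=200):
    # 200 bytes per frame per 1000 Sites, measured after ZIP compression
    return n_frames * (n_sites / 1000.0) * bytes_per_frame_per_1000_sites

# one hour at 25 fps with roughly 1000 Sites per frame
size = estimated_size_bytes(n_frames=25 * 3600, n_sites=1000)
print(size / 1e6, "MB")   # about 18 MB, in line with the ~20 Mbytes quoted above
```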
5 Discussion and Concluding Remarks
Stippling is an engaging art relating to the non-photo-realistic depiction of images and videos by point sets rendered with various color and size attributes. Besides the pure artistic and aesthetic interest of producing such renderings, the stippling process has several other advantages of its own:

– The output video is fully vectorized, both on the spatial and temporal axes. Users can interactively rescale the video to fit the device screen resolution and upscale the frame rate as desired for fluid animations. Stippling could thus be useful for web designers who have storage capacity constraints and yet would like to provide video content that yields the same appearance on various screen resolutions (be it a PDA, laptop or TV).

– Another characteristic of the stippling process is that it produces a video with only a few pixel color transitions compared to the original medium. Indeed, a usual video has Width × Height × NumberOfFrames color transitions, while our output stippled video has only roughly NumberOfPoints × NumberOfFrames. This is advantageous in terms of energy savings, for e-book readers for instance: stippling may significantly increase the battery life of devices based on e-ink, which consumes energy only when flipping colors.

– To improve the stippling process and correct the loss of intensity of stippled images (due to the averaging performed by the human eye), we can consider the area of each Voronoi cell and use this measure to extract a multiplicative normalization factor for the Site of the corresponding cell. With this normalization factor, we can nicely correct the loss of intensity: the smaller a Voronoi cell, the bigger the normalization factor and the darker the color setting.

Figure 10 presents some images extracted from a stippled video.
Fig. 10. These snapshots were extracted from a stippled video. This particular sequence does not take into account the frequency information of images.
References
1. Balzer, M., Schlömer, T., Deussen, O.: Capacity-constrained point distributions: a variant of Lloyd's method. ACM Trans. Graph. 28(3) (2009)
2. Du, Q., Gunzburger, M., Ju, L.: Advances in studies and applications of centroidal Voronoi tessellations (2009)
3. Gomes, R.B., Souza, T.S., Carvalho, B.M.: Mosaic animations from video inputs. In: PSIVT, pp. 87–99 (2007)
4. Hiller, S., Hellwig, H., Deussen, O.: Beyond stippling - methods for distributing objects on the plane. Comput. Graph. Forum 22(3), 515–522 (2003)
5. Kopf, J., Cohen-Or, D., Deussen, O., Lischinski, D.: Recursive Wang tiles for real-time blue noise. ACM Trans. Graph. 25(3), 509–518 (2006)
6. Lloyd, S.P.: Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–136 (1982)
7. Ostromoukhov, V., Donohue, C., Jodoin, P.-M.: Fast hierarchical importance sampling with blue noise properties. ACM Trans. Graph. 23(3), 488–495 (2004)
8. Pastor, O.M., Freudenberg, B., Strothotte, T.: Real-time animated stippling. IEEE Computer Graphics and Applications 23(4), 62–68 (2003)
9. Schlechtweg, S., Germer, T., Strothotte, T.: RenderBots - multi-agent systems for direct image generation. Comput. Graph. Forum 24(2), 137–148 (2005)
10. Secord, A.: Weighted Voronoi stippling. In: NPAR, pp. 37–43 (2002)
11. Vanderhaeghe, D., Barla, P., Thollot, J., Sillion, F.: Dynamic point distribution for stroke-based rendering (2007), http://artis.imag.fr/Publications/2007/VBTS07a
12. Vasconcelos, C.N., Sá, A.M., Carvalho, P.C.P., Gattass, M.: Lloyd's algorithm on GPU. In: ISVC (1), pp. 953–964 (2008)
Contrast Enhanced Ultrasound Images Restoration
Adelaide Albouy-Kissi1, Stephane Cormier2, Bertrand Zavidovique3, and Francois Tranquart4
1 Clermont Université, Université d'Auvergne, ISIT, BP 10448, 63000 Clermont Ferrand, France
2 Université de Reims Champagne Ardenne, CReSTIC, Dept. Math-Info, B.P. 1039, 51687 Reims Cedex 2, France
3 Université Paris XI, The Institute of Fundamental Electronics, 91405 Orsay Cedex, France
4 Bracco Suisse SA, Research Centre, Bracco Imaging BV, 31 Route de la Galaise, 1228 Plan-les-Ouates, Switzerland
[email protected], [email protected]
Abstract. In this paper, we propose a new anisotropic diffusion scheme to restore contrast enhanced ultrasound images for a better quantification of liver arterial perfusion. We exploit the image statistics to define a new edge stopping function. The method has been tested on liver lesions. The results show that the assessment of lesion vascularization from our process can potentially be used for the diagnosis of liver carcinoma. Keywords: Contrast Enhanced Ultrasound, Anisotropic Diffusion, Coherence, Liver Imaging, Image Restoration.
1 Introduction
Contrast enhanced ultrasound is a modality, based on intravenously injected contrast agents, that allows blood perfusion to be imaged. The quantification of perfusion is based on the analysis of the video signal intensity, related to the contrast agents' concentration, to obtain the lesion perfusion kinetics. Typically, the physician manually draws and moves regions in which he calculates the mean signal intensity over time. This method is time consuming and subject to errors due to intra- and/or inter-observer variability [1]. A solution consists in automatically detecting and repositioning the region of interest by segmenting it [2], [3]. However, the image analysis is subject to errors due to the noisy nature of the images, so a preprocessing step is needed. In this context, we present in this paper a new method to restore contrast enhanced ultrasound images: a coherence enhancing diffusion with a new automatic edge stopping function based on image statistics. The method has been tested on five patients and the results show that the assessment of lesion vascularization from our segmentation process can potentially be used for the diagnosis of liver carcinoma.
2 Noise Removal
One of the main challenges involved in the segmentation of ultrasound images is the noise introduced by the acquisition method, called speckle. It leads to fluctuations in time of the signal amplitude, limiting the possibilities of quantification through image processing. In contrast enhanced ultrasound, the collected signals come from microbubbles and surrounding tissues. The scatterers' response varies with the emitted frequency and with their size, physical properties and concentration. As a result, there are intensity changes due to both variations of agent concentration and speckle noise. Restoration methods are therefore applied to represent the physiological structures of interest (often lesions) with a better resolution and a higher signal-to-noise ratio. The first restoration approach proposed in the literature consists in subtracting “the background component” of the image. Since the level of contrast enhancement, at high acoustical power, is a few decibels above the background, Kaul enhances the blood pool by subtracting images from their mean value [4], [5]. This method does not, however, take into account the noise component that creates subtle changes of backscattering, nor the fact that the contrast agents can also diffuse into normal tissue. Furthermore, at the maximum of perfusion, the contrast level of the lesion is frequently equal to that of the surrounding tissue, limiting the subtraction effect. Nonlinear anisotropic diffusion is also used for noise reduction. This approach arises from an analogy with fluid diffusion, which states that the image intensity I, seen as a fluid concentration, evolves towards an equilibrium state according to

∂I/∂t = ΔI,    (1)

where I(x, y, t) defines the heat at point (x, y) at time t and I(x, y, 0) = 0. Perona and Malik introduced in [6] an edge stopping function, or diffusivity, a nonnegative monotonically decreasing function that encourages intra-region smoothing while preventing the distortion of region boundaries. The preceding equation becomes

∂t I = div( c(∇I) ∇I(x, y) ),    (2)

where c is the edge stopping function (diffusivity) and I(x, y) is the gray level of pixel (x, y). Common diffusivity functions include the Lorentzian and Leclerc functions given by Perona and the Tukey bi-weight function used by Black in [7]. More recently, Weickert proposed the coherence enhancing diffusion scheme [8] to eliminate speckle by iterating

∂t I = div( D(∇Iσ ∇Iσ^T) ∇Iσ ),    (3)
where Iσ = Kσ ∗ I and Kσ is a Gaussian with standard deviation σ (window size over which the information is averaged). The parameter D is the diffusion tensor that can be decomposed into eigenvectors ei and eigenvalues λi . Thus, the diffusion process, controlled by the diffusion tensor D, consists in smoothing
the image along the local structures while adapting the eigenvalues to enhance the local coherence, which is related to the difference between the eigenvalues of the structure tensor, also called scatter matrix or interest operator. The structure tensor allows us to separate the image into constant areas, corners and straight edges according to its number of non-zero eigenvalues. It is defined by

J0 = ∇Iσ ∇Iσ^T = [ Ix²  Ix·Iy ;  Ix·Iy  Iy² ].    (4)

The convolution with a Gaussian Kσ finally gives

Jσ = Kσ ∗ J0 = [ j11  j12 ;  j12  j22 ].    (5)
The corresponding eigenvalues of this tensor are

λ1,2 = (1/2) ( j11 + j22 ± sqrt( (j11 − j22)² + 4 j12² ) ).    (6)
The + sign corresponds to the first eigenvalue. These elements characterize the image structure: the eigenvectors give the structure orientation and the eigenvalues the vectorial variations of the local structures in the image. They also characterize the type of structure, since their difference, called coherence, is related to the local tissue anisotropy. Typically, a region corrupted with speckle noise is characterized by a small difference between eigenvalues, whereas a highly structured area is characterized by a high anisotropy [9]. Coherence enhancing diffusion smooths the image while adapting the eigenvalues of the diffusion tensor to enhance the local coherence, related to the difference between eigenvalues, such that

μ1 = c1,
μ2 = c1, if λ1 = λ2,
μ2 = c1 + (1 − c1) · exp( −c2 / (λ1 − λ2)² ), otherwise,    (7)
where c1 is a constant and (λ1, λ2) are the eigenvalues of the local structure tensor Jσ. In our context, the results obtained with the eigenvalues given in the last equation are not satisfactory and not relevant: no significant improvement over classical algorithms is achieved. So, the idea is to adapt the distribution to the gradient via a statistical parameter which characterizes the images and, in our case of study, the state of perfusion.
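For illustration, here is a minimal NumPy sketch of the eigenvalue computation of Eq. (6) and the classical coherence weights of Eq. (7). The smoothing scale and the constants c1 and c2 are placeholder assumptions; the paper replaces these fixed weights with histogram-driven ones in the next section.

```python
import numpy as np
from scipy import ndimage

def coherence_weights(image, sigma=2.0, c1=0.02, c2=1.0):
    """Structure-tensor eigenvalues (Eq. 6) and Weickert-style weights (Eq. 7)."""
    ix = ndimage.sobel(image, axis=1)
    iy = ndimage.sobel(image, axis=0)
    j11 = ndimage.gaussian_filter(ix * ix, sigma)
    j12 = ndimage.gaussian_filter(ix * iy, sigma)
    j22 = ndimage.gaussian_filter(iy * iy, sigma)
    root = np.sqrt((j11 - j22) ** 2 + 4.0 * j12 ** 2)
    lam1 = 0.5 * (j11 + j22 + root)
    lam2 = 0.5 * (j11 + j22 - root)
    mu1 = np.full_like(image, c1, dtype=float)
    diff2 = (lam1 - lam2) ** 2
    # mu2 = c1 where lambda1 == lambda2, otherwise the exponential coherence term
    mu2 = np.where(diff2 > 0,
                   c1 + (1.0 - c1) * np.exp(-c2 / np.maximum(diff2, 1e-12)),
                   c1)
    return lam1, lam2, mu1, mu2
```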
3 Our Approach
Using an existing edge stopping function is not sufficient, so we propose to use the image statistics in the diffusion process. This has two advantages: first, the inherent thresholding problem is overcome, and second, the processing fits the data precisely.
The eigenvalues are proportional to the gradient, so the eigenvalue histogram depicts the overall distribution of the gradient. In order to fit the distribution at each pixel, the global behaviour is modified locally. Therefore, we solve for our diffusion coefficient using the function that describes the histogram. Scharr et al. proposed to determine this function in 3 steps [10]:

1. Extraction of the eigenvalues λ1 and λ2 from the structure tensor Jσ, with the same method as above.
2. Calculation of the histograms of the eigenvalues λ1 and λ2.
3. Fitting of the eigenvalues of the diffusion tensor.

As described before, we calculate for each pixel of the sequence the diffusion tensor, for each tensor the two eigenvalues, and finally the two histograms corresponding to the eigenvalue distributions. An example of the result is shown in Fig. 1.
Fig. 1. Example of a histogram computed for an image sequence
In contrast enhanced ultrasound images the noise can be modelled by a Rayleigh distribution and the horizontal and vertical derivatives can be modelled by a t-distribution:

p(μ) = ω (1 + μ²/σ1²)^(−t) + (1 − ω) (μ/σ2²) exp( −μ²/(2σ2²) ),    (8)

where the first term is the t-distribution component and the second the noise component.
This model leads to these parameters:
- a mixture parameter ω, 0 < ω < 1,
- the degree t,
- the scale parameters σ1 and σ2.
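A small sketch of how this mixture density could be evaluated and fitted to an eigenvalue histogram; the initial parameter values and the use of scipy.optimize.curve_fit are illustrative assumptions only, not the authors' fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def mixture_pdf(mu, omega, t, sigma1, sigma2):
    """Eq. (8): t-distribution-like term plus a Rayleigh noise term."""
    t_part = (1.0 + (mu / sigma1) ** 2) ** (-t)
    noise_part = (mu / sigma2 ** 2) * np.exp(-mu ** 2 / (2.0 * sigma2 ** 2))
    return omega * t_part + (1.0 - omega) * noise_part

def fit_histogram(bin_centers, hist):
    """Fit the mixture to a normalized eigenvalue histogram."""
    p0 = (0.5, 1.5, 1.0, 1.0)        # initial guess: omega, t, sigma1, sigma2
    params, _ = curve_fit(mixture_pdf, bin_centers, hist, p0=p0, maxfev=10000)
    return params
```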
With these two histograms (from λ1 and λ2), we can determine μ1 and μ2 and hence the diffusion tensor. If one of those two values is a constant, a preferential direction (vertical or horizontal) is obtained for the diffusion process:

μ1 = K · f(λ1),  μ2 = T,    (9)

μ1 = T,  μ2 = K · f(λ2),    (10)
where T is a constant, K is a constant used to enhance the diffusion, and f(λk) is the approximation function of the k-th eigenvalue obtained in the previous fitting step. Combining these two approaches, we obtain the final diffusion, where no direction is preferred. An explicit scheme based on rotationally optimal filters is used [11]. An example is presented in Fig. 2.
Fig. 2. The filtering results: the original image (a), the process given in Eq. (9) (b), the process given in Eq. (10) (c), and the final result (d)
4 Results
To illustrate the performance of the described process, two applications on focal liver lesions are presented below. Five sequences of 160 uncompressed images were obtained using a Siemens Acuson Sequoia 512 with a 4C1 probe, in CPS mode, after SonoVue injection (Bracco SpA Imaging, Italy) with MI = 0.21. In the diffusion process, the different parameters are:
- the diffusion time,
- the number of iterations,
- the parameters of the smoothing filter,
- the multiplication coefficient K.
n=5
n=20
n=30
n=60
Fig. 3. The filtering results with n number of iteration (diffusion time=2.5)
402
A. Albouy-Kissi et al.
n=5
n=10
n=15
n=20
Fig. 4. The filtering results with n number of iteration (diffusion time=5)
We notice that the greater the number of iterations increases over the process more the image diffusion is. Furthermore, for the diffusion time and K, the system remains stable regardless of the number of iterations. For a sequence, it is possible firstly to estimate the histogram for each image according the algorithm described previously and secondly to superimpose all the histograms. An example of the results is presented Fig.5. The histograms of the sequence are very close to each other. Our experiments show that the same histogram can be used during all the sequence and that we obtain similar results with a faster process. This is a very interesting approach and a significant advantage due to the inherent complexity of the diffusion algorithms. The obtained results confirm the interest of this method in order to improve the quantification of perfusion in enhanced liver ultrasound images. The image statistics, in the diffusion process, increase significantly the performance of the results and is investigated for further work.
Contrast Enhanced Ultrasound Images Restoration
403
Fig. 5. Superimposition of all eigenvalues histograms of the sequence
5 Conclusion
We have presented a novel framework dedicated to the restoration of contrast enhanced ultrasound images. By modeling the image with a mixture of a Rayleigh and a t-distribution, we have designed a new automatic edge stopping function for coherence enhancing diffusion. The inclusion of such a method should improve the quantification of perfusion. Studies are currently being conducted to generalize our method to other liver diseases.
References
1. Quaia, E.: Assessment of tissue perfusion by contrast-enhanced ultrasound. European Radiology 21, 604–615 (2011)
2. Kissi, A.A., Cormier, S., Pourcelot, L., Tranquart, F.: Hepatic lesions segmentation in ultrasound nonlinear imaging. In: Progress in Biomedical Optics and Imaging - Proceedings of SPIE, vol. 5750, pp. 366–377 (2005)
3. Kissi, A., Cormier, S., Pourcelot, L., Tranquart, F.: Automatic lesions segmentation in ultrasound nonlinear imaging. In: ICIP (1), pp. 1153–1156 (2005)
4. Kaul, S., Pandian, N.G., Okada, R.D.: Contrast echocardiography in acute myocardial ischemia: I. In vivo determination of total left ventricular 'area at risk'. Journal of the American College of Cardiology 4, 1272–1282 (1984)
5. Kaul, S., Pandian, N.G., Gillam, L.D.: Contrast echocardiography in acute myocardial ischemia. III. An in vivo comparison of the extent of abnormal wall motion with the area at risk for necrosis. Journal of the American College of Cardiology 7, 383–392 (1986)
6. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 629–639 (1990)
7. Black, M., Sapiro, G.: Edges as outliers: Anisotropic smoothing using local image statistics. In: Nielsen, M., Johansen, P., Fogh Olsen, O., Weickert, J. (eds.) Scale-Space 1999. LNCS, vol. 1682, pp. 259–270. Springer, Heidelberg (1999)
8. Weickert, J.: Theoretical foundations of anisotropic diffusion in image processing. Computing (suppl. 11), 221–236 (1996)
9. Abd-Elmoniem, K.Z., Youssef, A.M., Kadah, Y.M.: Real-time speckle reduction and coherence enhancement in ultrasound imaging via nonlinear anisotropic diffusion. IEEE Transactions on Biomedical Engineering 49, 997–1014 (2002)
10. Scharr, H., Black, M.J., Haussecker, H.W.: Image statistics and anisotropic diffusion. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2, pp. 840–847 (2003)
11. Weickert, J., Scharr, H.: A scheme for coherence-enhancing diffusion filtering with optimized rotation invariance. Journal of Visual Communication and Image Representation 13, 103–118 (2002)
Mutual Information Refinement for Flash-no-Flash Image Alignment
Sami Varjo1, Jari Hannuksela1, Olli Silvén1, and Sakari Alenius2
1 Machine Vision Group, Infotech Oulu and Department of Electrical and Information Engineering, P.O. Box 4500, FI-90014 University of Oulu, Finland
{sami.varjo,jari.hannuksela,olli.silven}@oulu.fi
2 Nokia Research Center, Tampere, Finland
[email protected]
Abstract. Flash-no-flash imaging aims to combine ambient light images with the details available in flash images. Flash can alter color intensities radically, leading to changes in gradient directions and strengths; natural shadows may be removed and new ones created. This makes flash-no-flash image pair alignment a challenging problem. In this paper, we present a new image registration method utilizing mutual information driven point matching accuracy refinement. For a phase correlation based method, the accuracy improvement through the suggested point refinement was over 40%. The new method also performed better than the reference methods SIFT and SURF by 3.0% and 9.1%, respectively, in alignment accuracy. Visual inspection also confirmed that in several cases the proposed method succeeded in registering flash-no-flash image pairs where the tested reference methods failed. Keywords: registration, illumination, flash.
1 Introduction
Computational imaging is used not only to create, but also to improve existing digital images. Among the goals of image fusion research are a higher dynamic range for more natural coloring of the scene [12], a larger field of view via image mosaicking [30], and increased information content for super resolution images [18]. All methods combining data from several sources require the input images to be spatially aligned; failing to do so usually results in anomalies, like ghosting effects, in the final result. Typically, capturing images with low ambient light requires long exposure times. This easily leads to image blurring resulting from small movements of the camera if a tripod is not used. Other options are to increase the camera sensor sensitivity, to increase the aperture size, or to use external lighting. Increasing the sensor sensitivity easily leads to noisy images, and aperture size adjustment is often limited by the optics. Flash is often used to provide extra illumination on the scene to reduce the required exposure time. While flash images appear sharp, flash renders colors that are often unnatural compared to ambient light. Dark areas may appear too
light and light areas too dark. This also often alters the directions and strengths of gradients in the two frames. A directional flash also has limited power, often yielding uneven lighting of the scene: the foreground and center of the scene are illuminated more than the background and the edges. Flash also reflects from shiny surfaces like glass or metal, producing artifacts. These problems can lead to a situation where the ambient light image has more information in certain areas than the flash image, and vice versa. The automatic camera focus may also differ depending on whether the foreground or background of the scene is illuminated, resulting in out-of-focus blurring in different areas of the images. The differences between ambient light images and flash images make flash-no-flash image pair alignment a challenging problem. However, these same differences make the fusion of flash-no-flash images desirable. While fusion methods [19,2] and the removal of artifacts introduced by flash have already been addressed [1], the flash images have typically been obtained using a tripod, or the alignment is assumed to be solved and is not studied. For example, Eisemann and Durand utilize a gradient extraction-based method for image alignment when describing their flash image fusion, with a note that more advanced approaches could be used [5]. Since flash-no-flash image alignment is a new research area, we discuss in the following chapters several popular alignment approaches and their suitability for the problem. We propose the use of mutual information for refining the point pair matching and a new alignment method based on that concept. In the experiments, we show that the refinement step can improve the alignment accuracy considerably, and that the new method performs better than the selected reference methods in alignment accuracy with flash-no-flash image pairs.
2 Alignment Methods
The goal of image alignment is to solve the transformation between two or more captured images. Alignment methods can be roughly divided into feature based and image based approaches. In the former, interest points are extracted from the images and matched based on selected features; the matching point sets are then used to solve the transformation. The image based methods use the image data directly for solving the global transformation.
2.1 Feature Based Methods
In feature based approaches, interest points are extracted from input images and suitable descriptors are calculated for each of these points. The interest points and the descriptors should be invariant to possible transformations so that matching points can be found. The interest points are typically visually distinct in the images, such as corners or centroids of continuous areas. The mapping between images can be solved using least squares type minimization approaches, if the point correspondences are well established. Usually nonlinear optimization methods are used to solve transformations in over determined point sets to achieve more accurate results.
The Harris corner detector has been widely utilized as an interest point detector [10]. Other well known methods for interest points and descriptors are the Smallest Univalue Segment Assimilating Nuclei (SUSAN) [23], the Scale Invariant Feature Transform (SIFT) [15] and the Speeded Up Robust Features (SURF) [3]. The above methods rely on spatial image information. Frequency domain approaches such as phase congruency also exist [13]. Maximally Stable Extremal Regions (MSER) [16] is an example of interest area detectors. Figure 1 presents examples illustrating the problems associated with aligning flash-no-flash images. The Harris corner value for a point is calculated using the gradients to estimate the eigenvalues for the points’ autocorrelation matrix in a given neighborhood. Flash clearly alters the gradients in the scenes. Foreground features are highlighted with flash while the ambient light image picks up the features in the background or behind transparent objects like windows. Interest point selection based on gradients can lead to point sets where no or only a few matches exist. SUSAN is based on assigning a gray value similarity measure for a pixel observed in a circular neighborhood [23]. This non-linear filtering method gives a high response for points where neighborhood intensities are similar to the center pixel value. Interest points like corners and edges have low values which can be found by thresholding. With flash-no-flash image pairs, prominent edges can differ, leading to false matching. Even when there are some corresponding edges, the feature points match poorly due to changes in image intensities, and the approach fails in practice.
Fig. 1. Left: Harris feature response for a flash image (top) and no flash image (bottom), Center: SUSAN responses and interest points (yellow) for a flash image (top) and no flash image (bottom), Right: MSER features in a flash (top) and no flash (bottom) images. With MSER images the yellow areas are the found stable regions and the red marks seed points.
MSER can be considered to be based on the well known watershed segmentation algorithm [17]. Compared to watershed, instead of finding segmentation boundaries, MSER seeks regions which are stable across different watershed levels. The local minimum and the thresholding level define a maximally stable extremal region. Here too, however, flash alters the scene too much for this approach to perform reliably. The example image pair in Fig. 1 shows that the extracted areas vary not only in shape and size but also in location. SIFT key points are located at the maxima and minima of difference-of-Gaussian filtered images in scale space. The descriptors, based on gradient histograms, are normalized using the local maximum orientation. SURF also relies on gradients when locating and calculating descriptors for the key points. While the flash may relocate shadow edges, the gradient directions may also change since light areas can appear darker or vice versa. The heavy utilization of gradient strengths and directions affects their robustness for flash-no-flash image pair alignment.
2.2 Image Based Methods
There are several methods where no key points are extracted for image alignment, but the whole image, or a large part of it, is used. Normalized cross correlation and phase correlation based approaches are the best known image based alignment methods. The Fourier-Mellin transformation has been widely utilized for solving image translations and rotations [22]. Mutual information (MI) has proven to be useful for registering multi-modal data in medical applications [20]. MI is an entropy based measure describing the shared information content between two signals, originating from Shannon's information theory. The typical approach is to approximate the derivative of the images' mutual information with respect to all transformation parameters and apply a stochastic search to find the optimal parameters [28]. Hybrid methods combining image based and interest point approaches can also be constructed. Coarse alignment using an image based method, followed by a feature based approach for refining the result, has been used for enhancing low-light images on mobile devices [25] and also for panorama stitching [21].
3 Proposed Method
We propose a method where interest point matches found using block phase correlation are refined using mutual information as the improvement criteria to overcome the problems discussed earlier. MI considers the joint distributions of gray values in the inputs instead of comparing the gray values directly. The mutual information can therefore describe the similarity of the two image patches enabling alignment even with multi-modal imaging cases [20]. Pulli et al. have presented a similar approach but without the MI driven point refinement step [21]. The alignment is initialized by dividing one of the input images into subwindows. A single point is selected in each sub-window as the reference point
that is matched in the other image using phase correlation. This gives the initial set of matching points with good control over the number of interest points, and distributes them uniformly over the images. Iterative mutual information based refinement is used to find more accurate point matching in each of the sub-windows prior to solving the projective mapping between the point sets. The method is described and discussed in more detail below:

1. Solve a rough prealignment from low scale images
2. Divide the input images into sub-windows
3. Find a matching point for each sub-window using phase correlation
4. Refine the point pair matching using mutual information
5. Apply RANSAC to remove the outliers
6. Estimate the global transformation
Prealignment is required since the phase correlation applied for point matching is done in fairly small windows. These windows must overlap at least by half for the phase correlation to work reliably. For example, when point matching in the sub-windows is done using 128 pixel correlation windows, the overlap must be at least 64 pixels. Because the tolerance is fairly large, the prealignment can be done on a rough scale; here down sampling by a factor of 4 was used. The prealignment rotation is solved with the method introduced by Vandewalle et al. [26]. Their method calculates the frequency content of the images as a function of the rotation angle by integrating the signal over radial lines in the Fourier domain. The rotation can then be efficiently solved with a one-dimensional phase correlation, while the translation is solved thereafter with conventional 2-D phase correlation.

Sub-Windowing of the input images for point matching is done in two phases. First, one image is divided over an evenly spaced grid and a single interest point is selected for each sub-window defined by the grid. In the second step, for point matching, the phase correlation windows are established around the found point coordinates in both inputs. This approach divides the points evenly over the image surface. This way no strong assumptions about the underlying transformation are made, yet it is likely that the point neighborhood contains some useful information that can be utilized later on in the mutual information based refinement step. Here, strong interest points in the flash image's initial grid are located using the Harris corner detector; other point selection techniques, like a strong Sobel response, might be utilized as well. Note also that the point is selected in only one input image, and the point matching is done in the other image in the vicinity of the same coordinates in the next step.

Matching Points with phase correlation is based on the Fourier shift theorem. Let image f2(x, y) = f1(x + Δx, y + Δy) and let f̂ be the Fourier transform of f. Then a shift in the spatial domain appears as a phase shift in the frequency domain (1). This relationship can be used to solve the translation between two image blocks [14]. The cross power spectra of f̂2 and
fˆ1 contain the needed phase correlation information, and the translation can be solved either in the spatial or the Fourier domain. Taking the inverse transform of the normalized cross-power spectrum gives the correlation surface C, and finding its maximum yields the sought translation in the spatial domain (2)-(3); f^* denotes the complex conjugate. Phase correlation is capable of producing sub-pixel translation accuracy [9].

\hat{f}_2(w_x, w_y) = \hat{f}_1(w_x, w_y)\, e^{j(w_x \Delta x + w_y \Delta y)},   (1)

C(\tilde{x}, \tilde{y}) = \mathcal{F}^{-1}\left\{ \frac{\hat{f}_1 \hat{f}_2^{*}}{|\hat{f}_1 \hat{f}_2^{*}|} \right\},   (2)

(\Delta x, \Delta y) = \arg\max_{(\tilde{x}, \tilde{y})} C(\tilde{x}, \tilde{y}).   (3)
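As a concrete illustration of (1)-(3), the following sketch estimates the translation between two equally sized image blocks from the normalized cross-power spectrum. It is a minimal NumPy example of standard phase correlation, not the authors' implementation; in particular it returns only integer shifts and omits the sub-pixel refinement of [9].

```python
import numpy as np

def phase_correlation(block1, block2, eps=1e-8):
    """Estimate the (dy, dx) shift between two equally sized blocks, Eqs. (1)-(3)."""
    F1 = np.fft.fft2(block1)
    F2 = np.fft.fft2(block2)
    cross_power = F1 * np.conj(F2)
    cross_power /= np.abs(cross_power) + eps        # normalization of Eq. (2)
    c = np.real(np.fft.ifft2(cross_power))          # correlation surface C
    peak = np.array(np.unravel_index(np.argmax(c), c.shape), dtype=float)
    # Peaks beyond half the block size correspond to negative (wrapped) shifts.
    wrap = peak > np.array(c.shape) / 2.0
    peak[wrap] -= np.array(c.shape, dtype=float)[wrap]
    return peak, c.max()                            # shift estimate and peak correlation
```

The peak correlation value returned together with the shift can be used directly for rejecting weak matches, as described below.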
Point correspondences where the correlation is below a threshold value are rejected from the point set. Low correlation can originate from flat image areas, as in the well-known aperture problem. Since phase correlation utilizes the phase difference between images instead of gray values for finding the translation, it is less affected by the intensity changes induced by the flash than approaches using pixel data directly. Further, zero-mean unit-variance scaling can be used to reduce the lighting variations in the input image blocks [29]. Point Set Refining is applied to the initial point sets to achieve more accurate point matching. The MI is a measure based on the joint distribution of gray values in two inputs. Jacquet et al. discuss in detail the behavior of the probabilities defined by the joint distribution, stating that when two images are well aligned but the intensities of the same structure differ, the set of probabilities will not change; those probabilities may merely be assigned to shifted gray value combinations [11]. Hence, the Shannon entropy will not be affected, since it is invariant under permutation of elements. This makes the MI an appealing distance measure for cases where the image contents are heavily distorted by lighting or other non-linear transformations. The MI of two discrete signals X and Y is defined in (4), where p(x) and p(y) are the marginal probabilities and p(x, y) is the joint probability mass function.

MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}.   (4)
In practice, the MI can be calculated by forming a 2-D joint distribution histogram for the gray values of the two input images. The histogram is normalized by the interest area size to give probability distribution p(x, y). Now the row and column sums of the histogram give the marginal gray value probabilities p(x) and p(y) for both the input images. These probability estimates can be used to calculate the respective entropies, and finally the mutual information as presented in (5)–(8). The MI value can be normalized by dividing it by the minimum of entropies H(X) and H(Y ) [6] or by their sum [24].
H(X, Y) = \sum_{x \in X} \sum_{y \in Y} -p(x, y) \log_2 p(x, y),   (5)

H(X) = \sum_{x \in X} -p(x) \log_2 p(x),   (6)

H(Y) = \sum_{y \in Y} -p(y) \log_2 p(y),   (7)

MI(X, Y) = H(X) + H(Y) - H(X, Y).   (8)
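The histogram-based computation of (4)-(8) described above can be written compactly as follows. This is a generic NumPy sketch rather than the authors' code; the number of gray-level bins is a free parameter, and the normalization by the minimum of the entropies follows [6].

```python
import numpy as np

def mutual_information(patch_x, patch_y, bins=64):
    """Normalized MI of two equally sized gray-value patches via a 2-D joint histogram."""
    joint, _, _ = np.histogram2d(patch_x.ravel(), patch_y.ravel(), bins=bins)
    p_xy = joint / joint.sum()          # joint probability mass function p(x, y)
    p_x = p_xy.sum(axis=1)              # marginal p(x): row sums of the histogram
    p_y = p_xy.sum(axis=0)              # marginal p(y): column sums of the histogram

    def entropy(p):
        p = p[p > 0]                    # convention: 0 * log2(0) = 0
        return -np.sum(p * np.log2(p))

    h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(p_xy)
    mi = h_x + h_y - h_xy               # Eq. (8)
    return mi / max(min(h_x, h_y), 1e-12)
```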
For refining the point matching, the mutual information is evaluated between the patch around the reference point in one input image and patches centred at each location in the 8-neighborhood of the current matching point in the other image. The matching point location is adjusted by one pixel if higher mutual information is found in the 8-neighborhood. Iterating this refining step several times allows us to find improved point matching within the given search radius. Here the refining was limited to 8 iterations, while usually only a couple of iterations (2-6) were needed. Global Transformation, described by a homogeneous 3x3 matrix with 8 degrees of freedom, was solved with non-linear optimization. Despite pruning the point set by a correlation threshold and the mutual information refining, the point sets may contain outliers that must be rejected beforehand. Here RANSAC [8] was used to produce the final point set prior to solving the projective transformation.
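A minimal sketch of the iterative 8-neighborhood refinement and of the final RANSAC-based homography estimation is given below, reusing mutual_information from the previous sketch. The patch extraction, border handling and the OpenCV call are illustrative choices and not necessarily those of the authors.

```python
import numpy as np
import cv2

def refine_point_mi(img_ref, img_other, p_ref, p_other, win=75, max_iter=8):
    """Move p_other one pixel at a time towards higher MI within an 8-neighborhood."""
    def patch(img, p):
        y, x, r = int(round(p[0])), int(round(p[1])), win // 2
        return img[y - r:y + r + 1, x - r:x + r + 1]

    best = np.array(p_other, dtype=float)
    best_mi = mutual_information(patch(img_ref, p_ref), patch(img_other, best))
    for _ in range(max_iter):                       # usually converges in 2-6 iterations
        candidates = [best + np.array((dy, dx))
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
        scores = [mutual_information(patch(img_ref, p_ref), patch(img_other, c))
                  for c in candidates]
        if max(scores) <= best_mi:
            break
        best_mi, best = max(scores), candidates[int(np.argmax(scores))]
    return best

def estimate_homography(pts_other, pts_ref):
    """RANSAC outlier rejection and projective transform estimation (OpenCV)."""
    H, inliers = cv2.findHomography(np.float32(pts_other), np.float32(pts_ref),
                                    cv2.RANSAC, 3.0)
    return H, inliers
```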
4 Results and Discussion
Test images were taken with a Nikon D80 digital camera using the internal flash. The original images, with a size of 3872x2592 pixels, were down-sampled by a factor of two for the Matlab implementations. The effects of the phase correlation window size for point matching, as well as the effect of the MI refinement window size, were studied to optimize the proposed method. The proposed method's accuracy in flash-no-flash image alignment was also compared to three state-of-the-art methods. Figure 2 presents an example of matched points in a flash-no-flash image pair. The resulting image after fusing the aligned images using the method described in [2] is also presented.

4.1 Effects of the Window Sizes
The mutual information window size had no clearly predictable effect on the resulting image pair registration accuracy. It appears that the MI window size does not have to be very large to achieve the refinement effect. Without MI refinement, the approach improved the global mutual information (GMI) by 20.2% on average, while the refinement increased the GMI by 30.2 to 31.3% when the MI window size was varied between 25 and 150 pixels (Table 1). The best result was achieved with a 75 pixel window size, which was used later on.
Fig. 2. Matched points in a flash-no-flash image pair using blockwise phase correlation with mutual information-based point refining, and the fused result image

Table 1. Effect of window size in mutual information refinement on GMI. The best values are in bold numbers.
image | Unregistered | No MI | MI window 25 | 50 | 75 | 100 | 150 | 200
1  | 0,1247 | 0,1562 | 0,1550 | 0,1546 | 0,1558 | 0,1559 | 0,1560 | 0,1560
2  | 0,1216 | 0,1514 | 0,1532 | 0,1540 | 0,1547 | 0,1542 | 0,1544 | 0,1546
3  | 0,1157 | 0,1535 | 0,1529 | 0,1538 | 0,1539 | 0,1528 | 0,1508 | 0,1508
4  | 0,1073 | 0,1513 | 0,1516 | 0,1520 | 0,1498 | 0,1466 | 0,1460 | 0,1469
5  | 0,1072 | 0,1278 | 0,1278 | 0,1278 | 0,1282 | 0,1281 | 0,1281 | 0,1281
6  | 0,1649 | 0,1804 | 0,1840 | 0,1841 | 0,1840 | 0,1850 | 0,1851 | 0,1853
7  | 0,1444 | 0,1736 | 0,1720 | 0,1721 | 0,1720 | 0,1705 |        |
8  | 0,2901 | 0,3600 | 0,4107 | 0,4136 | 0,4120 | 0,4142 | 0,4137 | 0,4116
9  | 0,1171 | 0,1196 | 0,1236 | 0,1231 | 0,1227 | 0,1236 | 0,1206 | 0,1190
10 | 0,0834 | 0,0875 | 0,0943 | 0,0978 | 0,0981 | 0,0949 | 0,0991 | 0,0964
11 | 0,1076 | 0,1151 | 0,1141 | 0,1163 | 0,1158 | 0,1149 | 0,1161 | 0,1149
12 | 0,1007 | 0,1221 | 0,1247 | 0,1244 | 0,1255 | 0,1275 | 0,1277 | 0,1277
13 | 0,1254 | 0,1461 | 0,1456 | 0,1453 | 0,1463 | 0,1460 | 0,1460 | 0,1467
14 | 0,0508 | 0,1246 | 0,1239 | 0,1234 | 0,1272 | 0,1267 | 0,1228 | 0,1202
15 | 0,1182 | 0,1858 | 0,1863 | 0,1866 | 0,1871 | 0,1833 | 0,1862 | 0,1848
The phase correlation window size had a substantial effect on the accuracy of the basic method. Phase correlations using window sizes of 16, 32, 64, 128 and 256 were tested. The GMI improved by 1.0, 2.3 and 7.8 percent when doubling the window from 16 up to 128 pixels. Doubling the window further did not improve the results. It is also worth noting that the computational time increases quadratically when the correlation window size is increased. Here a window of 128 pixels was selected for phase correlation based point matching.

4.2 Comparison to State-of-the-Art Methods
The reference SIFT implementation was the vlSIFT from Vedaldi and Fulkerson [27]. For SURF, a Matlab implementation based on OpenSURF was used [7]. The same RANSAC and homography estimation methods were used as with the
Table 2. Comparison of the proposed method (BlockPhc+MI) to a method without a mutual information refinement step (BlockPhc), SIFT, SURF, and the method by Pulli et al. (PA). The best values are in bold numbers.
image | Unregistered | PA | BlockPhc | BlockPhc+MI | SIFT | SURF   (values are RMSE)
1  | 0,24499 | 0,24173 | 0,23991 | 0,23949 | 0,23981 | 0,24457
2  | 0,24433 | 0,24069 | 0,23994 | 0,23849 | 0,23924 | 0,24005
3  | 0,24980 | 0,24466 | 0,24859 | 0,23854 | 0,23925 | 0,24362
4  | 0,25386 | 0,24997 | 0,25541 | 0,24238 | 0,23919 | 0,24199
5  | 0,24582 | 0,24143 | 0,24064 | 0,23970 | 0,23994 | 0,24019
6  | 0,36013 | 0,35976 | 0,36163 | 0,36021 | 0,36041 | 0,36039
7  | 0,30986 | 0,30938 | 0,30767 | 0,30762 | 0,30798 | 0,30795
8  | 0,48984 | 0,48410 | 0,48943 | 0,48378 | 0,48389 | 0,48547
9  | 0,25545 | 0,25535 | 0,26122 | 0,25663 | 0,25541 | 0,25475
10 | 0,32164 | 0,31953 | 0,32058 | 0,31831 | 0,32050 | 0,31865
11 | 0,25342 | 0,25278 | 0,25262 | 0,25223 | 0,25289 | 0,25208
12 | 0,25861 | 0,24678 | 0,25088 | 0,24210 | 0,24336 | 0,24356
13 | 0,25789 | 0,25137 | 0,25475 | 0,25027 | 0,25023 | 0,25016
14 | 0,35242 | 0,35349 | 0,32054 | 0,31685 | 0,31965 | 0,31751
15 | 0,27389 | 0,23896 | 0,24216 | 0,23799 | 0,23729 | 0,23711
proposed method. The described method without the mutual information point refinement step was also tested (BlockPhc). In addition, the method by Pulli et al. presented in [21] was applied (PA). Table 2 contains the alignment accuracy results. The average root mean square error (RMSE) for images registered with the proposed method without MI refinement, the proposed method, SIFT, SURF, and PA was 0.2857, 0.2816, 0.2819, 0.2825 and 0.2860, respectively. The average RMSE for unregistered image pairs was 0.2915. The addition of a mutual information refinement step improves the accuracy of the block based phase correlation method by 41.6%. The improvement over PA was 44.4%. The new method also yielded more accurate results than SIFT or SURF, by 3.0% and 9.1% respectively. In two thirds of the test cases, the proposed method was ranked the best. In a quarter of the cases, SURF was the best, and the remaining best rankings were divided between SIFT and PA. Visual inspection of the aligned images was also used to confirm the results. The proposed approach, SIFT and SURF all yielded visually quite similar results. The obvious misalignments were only of the order of a few pixels in most of the cases. There were, however, a few cases where the reference methods failed considerably more, as shown in Figure 3. The grayscale versions of the aligned no-flash images have been subtracted from the reference flash image. The images show that there is considerable misalignment present with both SIFT and SURF. PA failed completely in this case. The computational complexity of the proposed method can be estimated to be similar to SIFT and SURF. It might also be of interest to notice that the image size affects the computational load only in the prealignment phase. In
Fig. 3. Example of visual comparison of alignment results for image pair 14: (left) the proposed method, (middle) SIFT, and (right) SURF
the subsequent steps, the number of handled points with the utilized window sizes has a more pronounced impact on the execution time than the image size. Compared to SIFT and SURF, the number of handled points is fairly small and well controlled.
5 Conclusions and Discussion
The presented method, which applies mutual information to improve phase correlation based interest point matching, is a promising approach for the improvement of flash-no-flash image alignment. The achieved improvement in alignment accuracy using mutual information refinement was 41.6%. The approach also works generally better than the tested reference methods. The relative improvement of the proposed method over SIFT was 3.0%, over SURF 9.1%, and over PA 44.4%. The average RMSE values suggest that the proposed method is better than either SIFT or SURF for flash-no-flash image pair alignment. The best ranking was achieved in two thirds of the cases. Visual inspection also revealed that none of the tested methods performed ideally in every case. Although the crafted alignment algorithm was applied to phase correlation for interest point extraction, there is no reason why the mutual information based
refinement approach would not work also with other point matching methods when registering flash-no-flash images. The main requirement is that the correspondence is readily established. However, while mutual information based point refining improves the registration accuracy, it also requires considerable computational resources. One approach to ease the computational load could be parallelization, since with the proposed block based approach each point pair can be handled as a separate case. Modern accelerators may contain tens or even hundreds of processing units that might be utilized to process point pairs simultaneously to achieve remarkable speedup in processing time. Since mutual information is independent of input sources, the approach might also be suitable for rigid multi-modal image alignment, like for infrared-visible image pairs.
References 1. Agrawal, A., Raskar, R., Nayar, S.K., Li, Y.: Removing photography artifacts using gradient projection and flash-exposure sampling. ACM Trans. Graph. 24, 828–835 (2005) 2. Alenius, S., Bilcu, R.: Combination of multiple images for flash re-lightning. In: Proc. of IEEE 3rd Int. Symp. Commun. Contr. Sig., pp. 322–327 (2008) 3. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404– 417. Springer, Heidelberg (2006) 4. Chum, O., Matas, J.: Geometric Hashing with Local Affine Frames. In: IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 879– 884 (2006) 5. Eisemann, E., Durand, F.: Flash photography enhancement via intrinsic relighting. ACM Trans. Graph. 23, 673–678 (2004) 6. Est´evez, P.A., Tesmer, M., Perez, C.A., Zurada, J.M.: Normalized mutual information feature selection. Trans. Neur. Netw. 20, 189–201 (2009) 7. Evans, C.: Notes on the opensurf library. Tech. Rep. CSTR-09-001, University of Bristol (2009) 8. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981) 9. Foroosh, H., Zerubia, J., Berthod, M.: Extension of phase correlation to subpixel registration. IEEE Trans. Image Process. 11(3), 188–200 (2002) 10. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of the 4th Alvey Vision Conference, pp. 147–151 (1988) 11. Jacquet, W., Nyssen, W., Bottenberg, P., Truyen, B., de Groen, P.: 2D image registration using focused mutual information for application in dentistry. Computers in Biology and Medicine 39, 545–553 (2009) 12. Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R.: High dynamic range video. ACM Trans. Graph. 22, 319–325 (2003) 13. Kovesi, P.: Image features from phase congruency. Journal of Computer Vision Research, 2–26 (1999) 14. Kuglin, C., Hines, D.: The phase correlation image alignment method. In: IEEE Proc. Int. Conference on Cybernetics and Society, pp. 163–165 (1975)
15. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 91–110 (2004) 16. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally stable extremal regions. In: Proceedings of the British Machine Vision Conference, pp. 384–393 (2002) 17. Nist´er, D., Stew´enius, H.: Linear time maximally stable extremal regions. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 183–196. Springer, Heidelberg (2008) 18. Park, S.C., Park, M.K., Kang, M.G.: Super-resolution image reconstruction: a technical overview. IEEE Signal Process. Mag. 20, 21–36 (2003) 19. Petschnigg, G., Szeliski, R., Agrawala, M., Cohen, M., Hoppe, H., Toyama, K.: Digital photography with flash and no-flash image pairs. ACM Trans. Graph. 23, 664–672 (2004) 20. Pluim, J.P.W.: Mutual-information-based registration of medical images: A survey. IEEE Trans. Med. Imag. 22, 986–1004 (2003) 21. Pulli, K., Tico, M., Xiong, Y.: Mobile panoramic imaging system. In: Sixth IEEE Workshop on Embedded Computer Vision, ECVW 2010 (2010) 22. Reddy, B.S., Chatterji, B.N.: An FFT-based Technique for Translation, Rotation and Scale-Invariant Image Registration. IEEE Trans. Im. Proc. 5, 1266–1271 (1996) 23. Smith, S.M., Brady, J.M.: Susan – a new approach to low level image processing. Int. J. Comput. Vis. 23, 47–78 (1997) 24. Studholme, C., Hill, D.L.G., Hawkes, D.J.: Overlap invariant entropy measure of 3D medical image alignment. Pattern Recognition 32(1), 71–86 (1999) 25. Tico, M., Pulli, K.: Low-light imaging solutions for mobile devices. In: Conference Record of the Forty-Third Asilomar Conference on Signals, Systems and Computers, pp. 851–855 (2009) 26. Vandewalle, P., S¨ usstrunk, S., Vetterli, M.: A frequency domain approach to registration of aliased images with application to super-resolution. EURASIP J. Appl. Signal Process., p. 233 (2006) 27. Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer vision algorithms (2008), http://www.vlfeat.org/ 28. Viola, P., Wells III, W.M.: Alignment by maximization of mutual information. Int. J. Comp. Vis., 137–154 (1997) 29. Xie, X., Lam, K.-M.: An efficient illumination normalization method for face recognition. In: Pattern Recognition Letters, pp. 609–617 (2006) 30. Xiong, Y., Pulli, K.: Fast panorama stitching on mobile devices. In: IEEE Digest of Technical Papers: Int. Conference on Consumer Electronics, pp. 319–320 (2010)
Virtual Restoration of the Ghent Altarpiece Using Crack Detection and Inpainting

Tijana Ružić1, Bruno Cornelis2, Ljiljana Platiša1, Aleksandra Pižurica1, Ann Dooms2, Wilfried Philips1, Maximiliaan Martens3, Marc De Mey4, and Ingrid Daubechies5

1 Ghent University, TELIN-IPI-IBBT, Ghent, Belgium
2 Vrije Universiteit Brussel, ETRO-IBBT, Brussels, Belgium
3 Ghent University, Faculty of Arts and Philosophy, Dept. of Art, Music and Theatre, Ghent, Belgium
4 The Flemish Academic Centre for Science and the Arts (VLAC), Brussels, Belgium
5 Duke University, Mathematics Department, Durham, NC, USA
[email protected]
Abstract. In this paper, we present a new method for virtual restoration of digitized paintings, with a special focus on the Ghent Altarpiece (1432), one of Belgium's greatest masterpieces. The goal of the work is to remove cracks from the digitized painting, thereby approximating how the painting looked before ageing for nearly 600 years and aiding art historical and palaeographical analysis. For crack detection, we employ a multiscale morphological approach, which can cope with greatly varying thickness of the cracks as well as with their varying intensities (from dark to light). Due to the content of the painting (with extremely many fine details) and the complex type of cracks (including inconsistent whitish clouds around them), the available inpainting methods do not provide satisfactory results on many parts of the painting. We show that patch-based methods outperform pixel-based ones, while still leaving much room for improvement in this application. We propose a new method for candidate patch selection, which can be combined with different patch-based inpainting methods to improve their performance in crack removal. The results demonstrate improved performance, with fewer artefacts and better preserved fine details. Keywords: Patch-based inpainting, crack detection, digital restoration, Ghent Altarpiece.
1 Introduction
Breaking of the paint layer, called craquelure or cracks, is one of the most common deteriorations in old paintings. The formation and the extent of cracks are influenced by factors such as ageing, drying of the paint, movement of the support (caused by changes in relative humidity) and physical impacts (such as vibrations during transport, etc.). These cracks form an undesired pattern that is, however, inherent to our appreciation of these paintings as old and valuable.
Yet, for specialists in visual perception for example, it is of interest how our perception of the painting is affected when observing it before the ageing process. Moreover, the crack patterns not only make art historical analysis more difficult, but also in the example of the inscribed text (see Section 3.2), the palaeographical deciphering. Therefore, an important task for the restoration of digitized paintings is the detection and removal of the cracks. Different crack detection techniques include simple thresholding, the use of multi-oriented Gabor filters and various morphological filters [1]. Removal of cracks can be seen as a special class of a more general inpainting problem. While many general inpainting methods exist, like [2–5], further research is needed to improve their performance in the difficult problem of crack inpainting. Specialized methods for crack detection and inpainting include [6–9]. The method from [6] is semi-automatic: the user manually selects a point on each crack to be restored. In [7, 8] cracks are first detected by thresholding the output of the morphological top-hat transform and then the thin dark brush strokes, which have been misidentified as cracks, are separated through neural network classification with hue and saturation as features. Finally, the cracks are inpainted using order statistics filtering or anisotropic diffusion [7] or patch-based texture synthesis [8]. In this work, we focus on the difficult problem of crack inpainting in the Ghent Altarpiece (see Fig. 1). The polyptych consisting of 12 panels, dated by inscription 1432, was painted by Jan and Hubert van Eyck and is considered as one of Belgium’s most important masterpieces known all over the world. It is still located in the Saint Bavo Cathedral in Ghent, its original destination. As in most 15th century Flemish paintings on Baltic oak, fluctuations in relative humidity, acting over time on the wooden support, caused age cracks. These cracks are particular in a number of ways. First of all, their width ranges from very narrow and barely visible to larger areas of missing paint. Furthermore, depending on the painting’s content, they appear as dark thin lines on a bright background or vice versa, bright thin lines on a darker background. Also, since this masterpiece contains many details and some of the brush strokes are of similar color as the cracks, it is difficult to make a distinction between them in some parts of the image. Finally, the bright borders that are present around some of the cracks cause incorrect and visually disturbing inpainting results. These borders are caused by either the reflection of light on the inclination of the paint caused by the crack on the varnish layer or the exposure of the underlying white preparation layer due to the accidental removal of the surface paint due to wear or after cleaning. In addition, the images we use are acquired under different conditions, i.e. different lighting circumstances, chemicals used to develop the negatives, as well as scanners to scan them. Therefore, the quality of the images varies significantly making a general, automatic crack detection and inpainting a very difficult problem. To deal with these specific problems of cracks and the difficulties of the dataset, we present a novel method for crack detection and inpainting. We perform crack detection by thresholding the outputs of the multiple morphological
Fig. 1. Test images: (a) the piece of jewellery in the God the Father panel, (b) Adam’s face in the Adam panel, (c) the book in the Annunciation to Mary panel
top-hat transforms, applied by using structuring elements of different sizes. The resulting binary images at different scales are then combined in a novel fashion in order to deal with cracks of different properties and to separate them from other similar structures in the image. Due to the content of the painting (with extremely many fine details) and complex type of cracks (including inconsistent whitish clouds around them), the available inpainting methods do not provide satisfactory results on many parts of the painting. We show that patchbased methods outperform pixel-based ones, while still leaving much room for improvements in this application. We propose a new method for candidate patch selection, that we call constrained candidate selection, which can be combined with different patch-based inpainting methods to improve their performance in crack removal. The results demonstrate improved performance, with less artefacts and better preserved fine details. The comparison is for now performed visually, since there is no ground truth. However, we hope that the feedback from art historians on palaeographical deciphering (see Section 3.2) will help to further evaluate the effectiveness of our method. The paper is organized as follows. Section 2 describes the proposed crack detection algorithm, with multiscale crack map formation and crack map refinement. Section 3 gives a brief overview of state-of-the-art inpainting methods and introduces our novel patch selection method. Finally, we present conclusions in Section 4.
2 Crack Detection
Cracks are typically elongated structures of low intensity, which usually appear as local intensity minima [7]. Recent studies [7–10] indicated the effectiveness of the morphological top-hat transformation [11] for crack detection. As we generally need to detect dark cracks on a lighter background we use the black top-hat
(or closing top-hat) transform TH_B(A), defined as the difference between the morphological closing φ_B(A) of an input image A by a structuring element B and the image itself. The morphological closing operation is defined as a dilation followed by an erosion. The transformation yields a grayscale image with enhanced details, which is further thresholded to create a binary image of details that are most likely to be cracks.

2.1 Multiscale Crack Map Formation
Since the size of the cracks in our application ranges from very small hairline cracks to larger areas of missing paint, we develop a multiscale morphological crack detection approach, which is the main difference with respect to crack detection in [7, 8]. A small structuring element will extract fine scale details while larger structuring elements will gradually extract coarser details. By thresholding these results we obtain different binary images, which we refer to as crack maps. The range of sizes of the structuring elements, depends on the resolution at which the painting was acquired. In particular, we used a square structuring element with size ranging from 2×2 to 10×10 pixels. The resulting crack maps are further cleaned up by bridging pixel gaps and removing isolated or small groups of pixels. Fig. 2b and Fig. 2c depict the crack maps obtained by choosing a very small structuring element (a 3×3 square) and a larger one (a 8×8 square). We obtain the final crack map, which we call dark crack map (Fig. 2d), by combining all the crack maps from different scales in a fine-to-coarse manner, as explained below. In Fig. 2a we observe that in areas where dark colours are used, cracks can manifest themselves as thin bright lines. In order to detect these, we use the white top-hat (or opening top-hat ) transformation, which is defined as the difference between the input image and its opening by some structuring element B. Recall that the morphological opening operation is defined as erosion followed by dilation. Just as for the closing top-hat transform the outputs of the white top-hat transform, constructed by using structuring elements of different sizes, are thresholded, cleaned and combined to form a single crack map called bright crack map (Fig. 2e). Finally, the dark and bright crack maps are combined as explained below and additional morphological dilation is performed to improve their connectivity. The resulting final crack map, that we call inpainting map, is shown in Fig. 2f and will be used as the input for the inpainting algorithm. One possibility to combine the crack maps from multiple scales is to use logical OR operation. In our experiments, this simple approach works well in practice. However, in some parts of the painting it can happen that the detection misidentifies objects as cracks due to their similar size and structure. To avoid this problem, we construct the multiscale crack map in a fine-to-coarse manner, as follows. We first define a base map as a combination of cleaned crack maps at the two finest scales (obtained with the two smallest structuring elements). As we move to coarser scales, we gradually add objects (i.e. groups of connected pixels) from those scales that are connected to the base map. The advantage of this method is that in our final crack map most misidentified structures (that
Fig. 2. Crack maps for the central part of Fig. 1a: (a) original, (b) 3×3 square, (c) 8×8 square, (d) dark crack map, (e) bright crack map, and (f) inpainting map
are detected at larger scales) are eliminated, while cracks are still allowed to expand through the scales. Since the letters of the book in Fig. 1c have similar properties to the larger cracks (i.e. dark elongated structures on a light background) and misdetections can have a negative impact on the further art historical analysis (the palaeographical deciphering), we used this novel approach for the construction of the dark crack map. We apply anisotropic diffusion [12] as a preprocessing step in order to reduce the noise in the images while still preserving the edges. Furthermore, since the images we work on are acquired under different conditions, as mentioned earlier, the quality of the images varies significantly. Therefore, for crack detection we choose the colour plane where the contrast between cracks and non-cracks is the highest. For example, in the case of the jewel depicted in Fig. 2, we use the green plane of the RGB colour space for the detection of dark cracks and the blue plane for the detection of bright ones.
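A sketch of the multiscale detection described in this section, using OpenCV, is given below. The threshold value, the overlap-based test for "connected to the base map" and the omission of the gap-bridging/pruning clean-up are simplifications for illustration; they are not the exact settings used for the Altarpiece images. Switching to the white top-hat operator yields the bright crack map in the same way.

```python
import numpy as np
import cv2

def multiscale_crack_map(gray, sizes=range(2, 11), thr=30, dark=True):
    """Fine-to-coarse combination of thresholded top-hat responses (Section 2.1)."""
    op = cv2.MORPH_BLACKHAT if dark else cv2.MORPH_TOPHAT    # dark vs. bright cracks
    maps = []
    for s in sizes:
        se = cv2.getStructuringElement(cv2.MORPH_RECT, (s, s))
        maps.append(cv2.morphologyEx(gray, op, se) > thr)     # binary map at this scale
    base = maps[0] | maps[1]                                  # base map: two finest scales
    for coarse in maps[2:]:
        # Keep only coarse-scale objects that overlap the current base map.
        n, labels = cv2.connectedComponents(coarse.astype(np.uint8))
        keep = np.unique(labels[base & coarse])
        base |= np.isin(labels, keep[keep > 0])
    return base
```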
2.2 Crack Map Refinement
Some of the digitized paintings contain whitish/bright borders along the cracks (see enlarged parts on Fig. 3a and Fig. 7a). The reason is that the paint is pushed
Fig. 3. Separation of brush strokes misidentified as dark cracks on part of Fig. 1b: (a) original, (b) dark crack map, (c) dark crack map after separation of misidentified brush strokes, and (d) removed outliers
upwards and forms a small inclination. Light reflections from the varnish and the ridges make them appear brighter than their immediate surroundings. Also, due to wear or during previous restorations the surface paint on these elevated ridges may have been accidentally removed, revealing parts of the underlying white preparation layer. While these white borders are problematic for the crack inpainting, they can serve as an additional feature for separating cracks from other structures in the image. When observing the HSV and RGB colour spaces of most of the images, we notice that bright borders have high values in the blue plane of the RGB colour space. On the other hand, other dark elongated structures (e.g. Adam’s eyebrows on Fig. 1b or letters in the book on Fig. 1c) have a very high saturation value in the HSV colour space. These two features are used to further filter the dark crack map obtained with the multiscale closing top-hat transform. For each crack pixel in the dark crack map (Fig. 3b) a weighted average (of the blue and saturation values) of the surrounding pixels is computed. Each object in the binary crack map is given a blue and saturation value. Based on these we are able to detect outliers and remove undesired false positives. The resulting crack map is shown in Fig. 3c and the removed outliers are shown on Fig. 3d.
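The per-object filtering on blue and saturation values can be sketched as follows. The decision rule (median plus a multiple of the standard deviation), the constant k and the use of per-object means instead of the per-pixel weighted averages described above are assumptions made only for illustration; the paper does not prescribe a specific outlier test.

```python
import numpy as np
import cv2

def remove_false_cracks(crack_map, bgr_img, k=2.5):
    """Drop crack-map objects whose colour statistics look like brush strokes, not cracks."""
    sat = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2HSV)[:, :, 1].astype(float)
    blue = bgr_img[:, :, 0].astype(float)
    n, labels = cv2.connectedComponents(crack_map.astype(np.uint8))
    if n <= 1:
        return crack_map
    sat_mean = np.array([sat[labels == i].mean() for i in range(1, n)])
    blue_mean = np.array([blue[labels == i].mean() for i in range(1, n)])
    # Real cracks tend to have bright borders (high blue values) and low saturation;
    # flag components that deviate strongly from the typical values as misdetections.
    outlier = (sat_mean > np.median(sat_mean) + k * sat_mean.std()) | \
              (blue_mean < np.median(blue_mean) - k * blue_mean.std())
    cleaned = crack_map.copy()
    for i in np.where(outlier)[0] + 1:
        cleaned[labels == i] = False
    return cleaned
```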
3 Crack Inpainting
Crack inpainting methods considered in the literature so far include order statistics filtering [7, 9], anisotropic diffusion [7] and interpolation [6], and remove cracks one pixel at a time. In our experiments, these pixel-wise methods do not always perform sufficiently well (see Fig. 5b and Fig. 6) and hence we explore the use of patch-based techniques. These methods have recently demonstrated potential in other inpainting applications, and here we shall adapt and improve them for crack inpainting.
Fig. 4. Schematic representation of a patch-based inpainting algorithm
3.1 Patch-Based Inpainting
Present state-of-the-art general inpainting methods are typically patch-based [3–5]. These methods fill in the missing (target ) region patch-by-patch by searching for similar patches in the known (source) region and placing them at corresponding positions. One can divide patch-based methods into two groups: “greedy” and global. The basic idea of the “greedy” method [3] is the following: for each patch at the border of the missing region (target patch), find only the best matching patch from the source region (source patch) and replace the missing pixels with corresponding pixels from that match, until there are no more missing pixels (Fig. 4). The matching criterion is usually the sum of squared differences between the known pixels in the target patch and the corresponding pixels in the source patch. In this way, both texture and structure are replicated. Preserving structures, i.e. lines and object contours, is achieved by defining the filling order. Priority is given to the target patches that contain less missing pixels and object boundaries. In the case of digitized paintings, the object boundaries are usually difficult to determine due to painting technique (incomplete brush strokes), scanning artefacts, etc. Therefore, we define priority based only on the relative number of existing pixels within the target patch. Results of this method and its superior performance compared to anisotropic diffusion, that was used e.g. in [7], are shown in Fig. 5. The global patch-based method [4] poses inpainting as a global optimization problem. For each target patch in the missing region several candidate patches are found based on the known pixels and/or neighbouring context. Then one of the candidates is chosen for each position so that the whole set of patches (at all positions) minimizes the global optimization function. We applied this method and the result is shown in Fig. 5d. We can see that this complex global method performs similarly to the simpler greedy one for this kind of problem. In our experiments, the patch-based methods outperform pixel-based anisotropic diffusion (see Fig. 5), but still leave much room for improvement in this difficult application. To improve their performance, some specifics of the problem need to be tackled. We treat some of these in the next subsection. One more obvious problem is the presence of whitish clouds around the cracks. Most inpainting algorithms fill in gaps based on pixel values from their immediate surroundings, in this case the whitish borders around a crack. Hence, the missing
Fig. 5. Inpainting results for the central part of Fig. 1a: (a) original, (b) diffusion result, (c) result of “greedy” patch-based method, and (d) result of global patch-based method
regions will likely be filled with incorrect content and the positions of the cracks remain visible after inpainting (see the results on the left of Fig. 6). We solve this problem by using the bright crack map described in Section 2 to extend the dark crack map with the corresponding bright regions. Additionally, this crack map also marks cracks that appear completely bright, and results in an overall better detection performance. The benefit of using this map, with white borders and bright cracks included, is evident in all cases: in the results on the right of Fig. 6 more cracks are detected and inpainted and the appearance is visually more pleasing.
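The matching step of the greedy method described earlier in this section, with the filling priority defined by the fraction of known pixels, can be sketched as follows. The exhaustive strided search is shown only for clarity; practical implementations restrict or accelerate the search, and this is not the exact code used for the results reported here.

```python
import numpy as np

def best_source_patch(image, known, target_tl, size=9, stride=2):
    """Top-left corner of the source patch minimizing SSD over the known target pixels."""
    ty, tx = target_tl
    target = image[ty:ty + size, tx:tx + size].astype(float)
    mask = known[ty:ty + size, tx:tx + size].astype(float)
    if image.ndim == 3:
        mask = mask[..., None]                       # broadcast over colour channels
    best, best_cost = None, np.inf
    H, W = image.shape[:2]
    for y in range(0, H - size + 1, stride):
        for x in range(0, W - size + 1, stride):
            if not known[y:y + size, x:x + size].all():
                continue                             # source patches must be fully known
            diff = (image[y:y + size, x:x + size].astype(float) - target) * mask
            cost = float(np.sum(diff * diff))
            if cost < best_cost:
                best, best_cost = (y, x), cost
    return best

def patch_priority(known, tl, size=9):
    """Filling order: target patches with more known pixels are inpainted first."""
    y, x = tl
    return known[y:y + size, x:x + size].mean()
```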
3.2 Improved Patch-Based Inpainting for Crack Filling
To further improve the crack inpainting results we introduce a novel method that involves two contributions: a novel approach to patch candidate selection and use of adaptive patch size. This method, that we call constrained candidate
Fig. 6. Influence of white borders on inpainting result. Left: Dark crack map as an input. Right: Combined dark and bright crack map as an input. The first row shows the original image of Adam’s nose and mouth overlapped with the crack maps. The second and third row show inpainting results with anisotropic diffusion and patch-based method, respectively.
selection, aims at performing context-aware inpainting by constraining the search to certain parts of the image, depending on the position of the current patch. For that we need to segment the image. The simplest case is foreground/background segmentation. The method is also beneficial in the parts of the panels with pronounced whitish borders around the cracks. There are three main steps: 1. Exclusion of damaged pixels Although we use the inpainting map to deal with the problem of whitish borders around the cracks, some damaged pixels still remain. These pixels are either too distant from the crack, belong to the non-detected cracks or appear in the source region not related to the cracks. We detect these pixels based on their high values in the blue plane and we treat them as missing ones. Additionally, we do not use the patches from the source region containing damaged pixels as possible matches. 2. Label constrained matching In the results from Fig. 7b it can be seen that patch-based inpainting occasionally introduces some artefacts. Small parts of letters appear erroneously in the background and the other way around, parts of letters get “deleted”, i.e. replaced by background. This can happen when the known part of the target patch is not distinctive enough to find the right source patch. To minimize these errors we first segment the image in two classes: foreground and background. We use the k-means segmentation algorithm on the values of the red plane because the difference between the two classes is most visible there. Once we have the segmented image, we constrain the search for candidate patches accordingly. When inpainting a part of the background, i.e. when all the known pixels in the target patch are labelled as background, we only accept source patches that belong completely to the background as candidate patches. Otherwise, we search through all possible candidates. 3. Adaptive patch size Instead of using a fixed patch size, as most inpainting methods do, we make the patch size adaptive to the local content. We start from the maximal patch size and check if the target patch completely belongs to the background. If this is the case, we constrain the search to the background, as in the previous step. If not, we reduce the patch size and repeat the same procedure (e.g. we reduce the patch size from 17×17 to 9×9). Finally, if even this smaller patch only partially belongs to the background, we search for the match of the target patch of the maximal size at all possible locations. The proposed constrained candidate selection approach can be applied together with any patch-based method (both “greedy” and global one). However, the global method results in a very high computational load due to the high resolution of the scans, making it impractical for processing of larger areas. On the other hand, limiting the method on small areas can jeopardize finding the right match. Therefore, we apply our method only together with the “greedy” patchbased method from [3]. Note also that our label constrained matching is applicable on more complex images containing more than two segments. However, in that case, more sophisticated segmentation techniques are required (e.g. [13]).
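A sketch of the label-constrained search with adaptive patch size is given below, reusing best_source_patch from the previous sketch. The background segmentation is assumed to be a precomputed boolean map (e.g. obtained with k-means on the red plane, as described above), the two patch sizes follow the 17x17/9x9 example in the text, and the fallback logic is a simplification of the procedure.

```python
import numpy as np

def constrained_match(image, known, background, target_tl, sizes=(17, 9)):
    """Constrain the source search to the background when the target patch is background."""
    for size in sizes:                               # adaptive patch size: large, then small
        y, x = target_tl
        patch_bg = background[y:y + size, x:x + size]
        patch_known = known[y:y + size, x:x + size]
        if patch_known.any() and patch_bg[patch_known].all():
            # All known target pixels are background: accept only background sources.
            return best_source_patch(image, known & background, target_tl, size=size)
    # Mixed content: fall back to an unconstrained search with the largest patch size.
    return best_source_patch(image, known, target_tl, size=sizes[0])
```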
Fig. 7. Inpainting results of the part of the book: (a) original image, (b) result of patch-based method [3] with inpainting crack map, (c) improvement due to proposed patch-based inpainting (with constrained candidate selection and constant patch size), and (d) additional improvement due to adaptive patch size
The effects of the proposed constrained candidate selection are illustrated in Fig. 7c for the constant patch size and in Fig. 7d for the adaptive patch size. Fig. 7d has fewer artefacts in the background, meaning that the adaptive patch size approach can better locate the target patches belonging to the background (see the parts circled in black). Also, some letters are better inpainted. In comparison with the results of the method from [3] in Fig. 7b, the letters are better inpainted and the whole image contains fewer visually disturbing white regions. More results can be found on http://telin.ugent.be/~truzic/Artwork/.
4 Conclusion
In this paper, we explored the use of patch-based inpainting of cracks in digitized paintings and highlighted some specific problems using the Ghent Altarpiece as a case study. We introduced a multiscale morphological filtering approach for the
detection of dark and bright cracks, and we presented improvements to the patch-based inpainting methods for crack filling. The results demonstrated improved performance, with fewer artefacts and overall visually more pleasing results. Acknowledgements. The Van Eyck images are based on photographic negatives (b45, g09, 39-19) from the Dierickfonds made available to Ghent University by the family of the late Alfons Dierick. We thank Saint Bavo cathedral, Lukas Art in Flanders and the Dierickfonds for permission to use these materials in this research report.
References 1. Abas, F.: Analysis of craquelure patterns for content-based retrieval. Ph.D. thesis, University of Southampton (2004) 2. Bertalmio, M., Sapiro, G.: Image inpainting. In: SIGGRAPH, pp. 417–424 (2000) 3. Criminisi, A., Perez, P., Toyama, K.: Region filling and object removal by exemplarbased image inpainting. IEEE Trans. Image Proc. 13(9), 1200–1212 (2004) 4. Komodakis, N., Tziritas, G.: Image completion using efficient belief propagation via priority scheduling and dynamic pruning. IEEE Trans. Image Proc. 16(11), 2649–2661 (2007) 5. Xu, Z., Sun, J.: Image inpainting by patch propagation using patch sparsity. IEEE Trans. Image Proc. 19(15) (2010) 6. Barni, M., Bartolini, F., Cappellini, V.: Image processing for virtual restoration of artworks. IEEE MultiMedia 7(2), 34–37 (2000) 7. Giakoumis, I., Nikolaidis, N., Pitas, I.: Digital image processing techniques for the detection and removal of cracks in digitized paintings. IEEE Trans. Image Proc. 15(1), 178–188 (2006) 8. Spagnolo, G., Somma, F.: Virtual restoration of cracks in digitized image of paintings. J.of Physics: Conference Series 249(1) (2010) 9. Solanki, S.V., Mahajan, A.R.: Cracks inspection and interpolation in digitized artistic picture using image processing approach. Int. Journal of Recent Trends in Eng. 1(2), 97–99 (2009) 10. Abas, F., Martinez, K.: Classification of painting cracks for content-based analysis. In: IS&T/SPIE Elect. Imag. 2003: Mach. Vis. App. in Ind. Inspection XI (2003) 11. Meyer, F.: Iterative image transforms for an automatic screening of cervical smears. J. Histoch. Cytochem. 27, 128–135 (1979) 12. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. on Pattern Anal. and Machine Intel. 12(7), 629–639 (1990) 13. Malik, J., Belongie, S., Leung, T., Shi, J.: Contour and Texture Analysis for Image Segmentation. Int. J. Comput. Vision 43(1), 7–27 (2001)
Image Sharpening by DWT-Based Hysteresis

Nuhman ul Haq1, Khizar Hayat1, Neelum Noreen1, and William Puech2

1 COMSATS Institute of IT, Abbottabad 22060, Pakistan
{nuhman,khizarhayat,neelumnoreen}@ciit.net.pk
2 LIRMM, Université Montpellier 2, France
[email protected]
Abstract. Improvement of edge details in an image is basically a process of extracting high frequency details from the image and then adding this information to the blurred image. In this paper we propose an image sharpening technique in which high frequency details are extracted using wavelet transforms and then added to the blurred image to enhance the edge details and visual quality. Before this addition, we perform some spatial domain processing on the high pass images, based on hysteresis, to suppress pixels which may not belong to the edges but are retained in the high-pass image. Keywords: Image sharpening, Discrete wavelet transform (DWT), High pass filtering, Unsharp mask, Deblurring, Hysteresis.
1 Introduction
Image sharpening is the process of enhancing the edge information in images that are supposed to be blurred. Blurring occurs mainly during image transmission and acquisition by different imaging instruments. To enhance image quality, special methods should be employed to remove blurring. Many techniques have been proposed for image deblurring in the spatial and frequency domains. In the spatial domain, algorithms like unsharp masking (UM) [11] and Laplace filtering, whose details can be found in [5], have been proposed for image deblurring. In the UM technique the processed low pass filtered image is subtracted from the original signal to enhance the image, and the same effect can be achieved by adding the high frequency components to the original input signal [11]. In the frequency domain, techniques like Fourier series and wavelets [3,2,4] have gained a lot of popularity due to their better results. The discrete wavelet transform is an effective mathematical tool whose principal aim is to visualize an image at different scales and orientations. A wavelet domain deblurring and denoising approach for image resolution improvement is given in [8], which adopts a maximum a posteriori (MAP) approximation to deal with poorly conditioned problems. Chand and Chan [2] propose a wavelet algorithm for high resolution image reconstruction using blurred images. The work of Huang and Tseng [6] applies a Teager energy operator to select those energy coefficients closely related to the image, and those selected coefficients are enhanced to avoid the
noise problem. A technique is proposed in [12] for the enhancement of X-ray images wherein wavelet coefficients obtained from the input image decomposition are non-linearly mapped to a new set of coefficients. The inverse discrete wavelet transform is then applied to the resulting new coefficients to obtain the sharpened image. Banham and Katsaggelos [1] propose a spatially adaptive multi-scale wavelet based technique for deblurring images. A spatially adaptive image deblurring algorithm adopts the Abdou operator for deblurring [7]. Donoho and Raimondo [3] propose a fast nonlinear fully adaptive wavelet algorithm for image deblurring. This algorithm considers the characteristics of the convolution operator in the Fourier domain and Besov classes in the wavelet domain. Donoho [4] employs a thresholding method, called wavelet-vaguelette deconvolution, for image deblurring. Hybrid Fourier-wavelet methods can be found in the literature, such as the one based on regularized deconvolution [10]. In [9] an expectation maximization (EM) algorithm has been employed for image deblurring. This algorithm combines the discrete wavelet transform (DWT) and Fourier transform properties. In this work, we adopt a wavelet based strategy for image sharpening. We subject all the high frequency coefficients from the level-2 transformed blurred image in the DWT domain to the inverse DWT. This is the proposed high pass image H1, with details from the level-1 and level-2 sub bands. Alongside, we do the same for a subset of these high frequency coefficients, including only the level-1 sub bands. This gives another high pass image H2. With the resultant two high pass images, we complete the one with lesser detail (H2) with the help of the other one (H1), via hysteresis. The resultant image is then added to the original blurred image to get the sharpened image. The rest of the paper is arranged as follows. Section 2 presents the proposed strategy in a stepwise manner. The method is applied to suitable examples in Section 3 and the results are elaborated there. Finally we conclude the paper in Section 4.
2 The Proposed Method
Wavelets provide frequency information and space localization, as well as high frequency details in an image, viz. the horizontal, the vertical and the diagonal details. Multiresolution analysis in wavelets provides information about high frequency details at different levels of decomposition. As we increase the number of levels of image decomposition, there is a risk that some lower frequencies are added to the detail subbands. This may restrict us to fewer levels of decomposition, because the lower frequencies would become part of the high pass image and reduce the effective edge information in it. Conversely, decomposition of the image with a level-1 DWT may provide less high frequency information, which may not produce the desired sharpening results due to missing edge information. To produce better sharpening results, we need to complete the edge information. Taking the two extremes into account, we are hereby proposing an algorithm that extracts the high pass image from the wavelet domain. It then completes the edge information in the spatial domain,
Fig. 1. The Proposed Method
that too with the help of details from the high pass coefficients resulting from the higher level details in the DWT domain. Our technique is a six-step process for image sharpening, illustrated in Fig. 1. The steps involved are:
1. Transformation into the wavelet domain. An input image is transformed into the wavelet domain using level-1 and level-2 decompositions to get two wavelet decomposed images. The level-1 coefficients provide the finest high pass detail, and the level-2 decomposition contains the level-1 information as well as more detail information at level-2.
2. Inverse wavelet transformation. By setting the coefficients in the lowest level sub band to zero in both decompositions, we apply the inverse wavelet transform to get two detail images, i.e. the level-1 details (H2), as stated above, and the level-2 details (H1), respectively. H2 has incomplete edge information and H1 has thicker than expected edges. Therefore, an unsharp mask is required to complete the edge information with thin edge details. To obtain the required mask, we perform hysteresis in the two steps that follow.
3. Apply a threshold. A threshold of about k% of the maximum value of the grayscale image is applied to both the level-1 and level-2 detail images, yielding two thresholded images.
4. Complete edge information by hysteresis. The missing information in the level-1 thresholded image is completed by looking at neighbors in the level-2 thresholded image to get the mask.
5. Multiply the mask with H1. The level-2 detail image is considered to be a superset of the level-1 detail image, so we multiply the level-2 detail image with the mask to obtain the desired image details (the high pass image).
6. Add the high pass image to the input image. In the final step, we add the high pass image to the blurred image to obtain the sharpened image.
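Steps 1-3 can be sketched with PyWavelets as follows. The wavelet family used here is an arbitrary choice made for illustration (the paper does not prescribe one), and the threshold follows the "k% of the grayscale maximum" rule stated above.

```python
import numpy as np
import pywt

def detail_images(gray, wavelet="db2"):
    """Steps 1-2: level-1 (H2) and level-2 (H1) high pass detail images."""
    cA2, d2, d1 = pywt.wavedec2(gray.astype(float), wavelet, level=2)
    zero_a = np.zeros_like(cA2)
    # H1: keep all detail sub-bands of the level-2 decomposition, zero the approximation.
    H1 = pywt.waverec2([zero_a, d2, d1], wavelet)
    # H2: keep only the level-1 detail sub-bands (equivalent to a level-1 decomposition
    # with its approximation sub-band set to zero).
    H2 = pywt.waverec2([zero_a, tuple(np.zeros_like(c) for c in d2), d1], wavelet)
    h, w = gray.shape
    return H1[:h, :w], H2[:h, :w]

def threshold_details(detail, ref_max=255.0, k=10):
    """Step 3: keep detail pixels whose magnitude exceeds k% of the grayscale maximum."""
    return np.abs(detail) > (k / 100.0) * ref_max
```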
3 Experimental Results
We have applied our method to a large set of images from a standard image database1. Due to space limitations, we mention just a few. One such image is Barbara, shown in Fig. 2a. We applied a 5x5 Gaussian blurring convolution mask to get the blurred version given in Fig. 2b. When our method was applied to this image, it resulted in the sharpened image shown in Fig. 2c. For comparison we have processed the image in Fig. 2b with two alternative methods from the literature in Fig. 3. Both methods are well known, i.e. the fast
Fig. 2. Barbara: (a) original image, (b) blurred image, (c) sharpened by our method

Fig. 3. Image in Fig. 2b sharpened by (a) the FFT method and (b) the Laplacian method

1 http://decsai.ugr.es/crgkh/base.htm
Fig. 4. Stepwise illustration of the proposed strategy as applied to Fig. 2b: (a) level-1 detail, (b) level-2 detail, (c) image in (a) thresholded at 10% of max, (d) image in (b) thresholded at 10% of max, (e) mask obtained by the hysteresis, (f) high pass image after multiplying (b) and (e)
Fourier transform (FFT) and the Laplacian based method. For the details of these methods one can consult [5]. Fig. 3a illustrates the sharpening results at a cut-off frequency of 64 when done through the FFT. Using a 3x3 Laplacian mask to extract edges and then adding them to the original blurred image resulted in the image given in Fig. 3b. For our method, it is worth illustrating the stepwise images involved in obtaining the sharp image from the one in Fig. 2b:
1. The input image was transformed into the wavelet domain by a level-1 decomposition as well as a level-2 decomposition. Then the inverse wavelet transform was applied to the wavelet detail components of both the level-1 and level-2 decompositions, by setting the lowest sub band coefficients of both decompositions to zero. The resultant images are the level-1 detail and level-2 detail images shown in Fig. 4a and Fig. 4b, respectively.
2. The level-2 detail image includes higher frequency details as well as lower frequency detail, so it has complete information but not the sharp edge details, as can be seen from the images. To obtain the actual edge detail, we applied thresholds, at k% of the maximum, to both images to get the level-1 thresholded image
Fig. 5. Wall: (a) original image, (b) blurred image, (c) sharpened image

Fig. 6. Zigzag: (a) original image, (b) blurred image, (c) sharpened image

Fig. 7. Grass: (a) original image, (b) blurred image, (c) sharpened image
and the level-2 thresholded image, as shown in Fig. 4c and Fig. 4d, respectively. Note that for the thresholded images in this particular example we have set k = 10.
3. The missing information was then completed in the level-1 thresholded image, via hysteresis, by observing the corresponding neighborhood of each of its coefficients in the level-2 thresholded image. As a result we obtained a mask, like the one shown in Fig. 4e.
4. The mask was multiplied with the level-2 detail image to obtain the desired high pass edge-detail image shown in Fig. 4f.
5. The final step was to add the edge-detail image to the original blurred image, which, for the example in perspective, results in the sharpened image shown in Fig. 2c.
We provide three more examples of blurred images sharpened by our method in Fig. 5, Fig. 6 and Fig. 7. Since the images are selected on the basis of the variety of details they have, these examples throw more light on the effectiveness of our method. In the first example (Fig. 5a), vertical details are more emphasized than others; it can be seen that the image obtained by the enhancement of Fig. 5b with the proposed method has better edge enhancement, as shown in Fig. 5c. To examine an image with vertical, horizontal and diagonal details we have provided the zigzag image of Fig. 6, and for more complex details the grass example shown in Fig. 7. The results for these two examples speak for themselves.
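The hysteresis completion used in steps 3-5 above can be implemented as an iterative dilate-and-intersect loop, which is one standard way to grow a seed inside a mask; the sketch below also shows the final masking and addition, assuming grayscale inputs. It is an illustration of the described procedure, not the authors' Matlab code.

```python
import numpy as np
from scipy import ndimage

def hysteresis_mask(seed_lvl1, mask_lvl2, max_iter=100):
    """Step 3/4: grow the level-1 seed into 8-connected pixels of the level-2 mask."""
    structure = np.ones((3, 3), dtype=bool)          # 8-neighborhood
    current = seed_lvl1 & mask_lvl2
    for _ in range(max_iter):
        grown = ndimage.binary_dilation(current, structure=structure) & mask_lvl2
        if np.array_equal(grown, current):
            break                                    # no more pixels can be added
        current = grown
    return current

def sharpen(blurred, H1, mask):
    """Steps 4/5: mask the level-2 detail image and add it to the blurred input."""
    high_pass = H1 * mask
    return np.clip(blurred.astype(float) + high_pass, 0, 255).astype(np.uint8)
```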
4 Conclusion
In this paper we presented a wavelet based image sharpening technique that extracts the high frequency components from the image in the wavelet domain and then adds the details to the blurred image to obtain a sharp image. It uses a mask that is developed from two thresholded images obtained respectively from the level-1 and level-2 approximate high pass images. The latter is then multiplied with the mask and the resultant image is added to the original blurred image to form the sharp image. The results have been interesting, and when we compared them with the contemporary methods, our method was by no means inferior. In fact our method showed better performance. The wavelet transform provides frequency information and space localization, and also provides horizontal, vertical, and diagonal details through the LH, HL, and HH decomposition coefficients, respectively. In future, these directional details can be used to link the pixels in the mask to achieve the best candidates for the high frequency details that are added to the initial image to get the sharp image. In addition, the threshold selection process needs to be investigated in detail in order to standardize it. In this context, factors like the energy of the Laplacian may prove important.
References 1. Banham, M.R., Katsaggelos, A.K.: Spatially Adaptive Wavelet-Based Multiscale Image Restoration. IEEE Transactions on Image Processing 5(4), 619–634 (1996) 2. Chand, R.H., Chan, T.C.: A Wavelet Algorithm for High Resolution Image Reconstruction. Society for Industrial and Applied Mathematics 24, 100–115 (1995) 3. Donoho, D.L., Raimondo, M.E.: A Fast Wavelet Algorithm for Image Deblurring. In: May, R., Roberts, A.J. (eds.) Proc. 12th Computational Techniques and Applications Conference, CTAC 2004, vol. 46, pp. C29–C46 (March 2005) 4. Donoho, D.L.: Nonlinear Solution of Linear Inverse Problems by WaveletVaguelette Decomposition. Applied and Computational Harmonic Analysis 2, 101– 126 (1992) 5. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice Hall, Upper Saddle River, New Jersey, USA (2002) 6. Huang, M., Tseng, D., Liu, M.S.C.: Wavelet Image Enhancement Based on Teager Energy Operator. In: Proc. 16th International Conference on Pattern Recognition, vol. 2, pp. 993–996 (2002) 7. Jianhang, H., Jianzhong, Z.: Spatially Adaptive Image Deblurring Algorithm Based on Abdou Operator. In: Proc. 4th International Conference on Image and Graphics, ICIG 2007, pp. 67–70. IEEE Computer Society, Washington, DC (2007), http://dx.org/10.1109/ICIG.2007.172 8. Li, F., Fraser, D., Jia, X.: Wavelet Domain Deblurring and Denoising for Image Resolution Improvement. In: Proc. 9th Biennial Conference on Digital Image Computing Techniques and Applications, DICTA2007, pp. 373–379. Australian Pattern Recognition Society, Adelaide, Australia (December 2007) 9. Figueiredo, M.A.T., Nowak, R.D.: An EM Algorithm for Wavelet-Based Image Restoration. IEEE Transactions on Image Processing 12(8), 906–916 (2003) 10. Neelamani, R., Choi, H., Baraniuk, R.G.: ForWaRD: Fourier-Wavelet Regularized Deconvolution for Ill-Conditioned Systems. IEEE Transactions on Signal Processing 52(2), 418–433 (2004) 11. Ramponi, G., Polesel, A.: Rational Unsharp Masking Technique. Journal of Electronic Imaging 7(2), 333–338 (1998) 12. Tsai, D., Lee, Y.: A Method of Medical Image Enhancement using WaveletCoefficient Mapping Functions. In: Proc. of the International Conference on Neural Networks and Signal Processing, ICNNSP 2003, vol. 2, pp. 1091–1094 (2003)
Content Makes the Difference in Compression Standard Quality Assessment Guido Manfredi, Djemel Ziou, and Marie-Flavie Auclair-Fortier Centre MOIVRE, Universite de Sherbrooke, Sherbrooke(QC), Canada [email protected]
Abstract. In traditional compression standard quality assessment, compressor parameters and performance measures are the main experimental variables. In this paper, we show that the image content is an equally crucial variable which still remains unused. We compare JPEG, JPEG2000 and a proprietary JPEG2000 on four visually different datasets. We base our comparison on PSNR, SSIM, time and bits rate measures. This approach reveals that the JPEG2000 vs. JPEG comparison strongly depends on compressed images visual content.
1
Introduction
The number of still images is increasing with the growing use of mobile phones, personal computers and digital cameras. They are stored, accessed, shared and streamed. However, storage space and bandwidth are limited, which is why these images need to be compressed. Therefore several compression standards are widely used, such as JPEG, JPEG2000 and JPEG-XR (also known as the HD-Photo format). However, compression comes with side effects, like quality degradation and additional processing time. Given a specific application, it is hard to decide which standard should be used to minimize these drawbacks. To make a clear decision, users must rely on a trial-and-error process or on conclusions drawn from subjective quality assessments. The automatic version of such a procedure is difficult to implement unless the subjective criteria used by humans to assess image quality are replaced by objective criteria. Such an assessment is carried out using quality metrics, whose calculation is fast and cost-effective, but less trustworthy [10]. A typical compression standard quality assessment experiment depends on an image dataset, compressor parameters and performance measures. The images in the dataset are characterized by the classes to which they belong. Compressor parameters can be algorithms such as JPEG or AVC/H.264, or options such as lossy or not, progressive or not. Finally, the performance measures are the criteria used to measure the influence of parameter sets on compressors [1,3,5]; they are, for example, quality measures, compressor complexity or functionalities (e.g., region of interest coding or multiscaling). Three central works [4,7,9], carried out between 2002 and 2008 in the field of compression standards objective quality assessment, have influenced our work. In these three studies, the authors extensively tuned the compressors' parameters
Table 1. Experimental variables of three existing works along with ours

Ebrahimi et al. [4]: Live database, 29 images, no classes; algorithms compared: JPEG, JPEG2000; compression parameters: bit rate, quality required, implementation; performance measures: blockiness, blur, MOS prediction, PSNR, compression rate.
De Simone et al. [9]: Microsoft test set, 10 images, no classes; algorithms compared: JPEG2000, H.264, JPEG-XR; compression parameters: bit rate; performance measures: PSNR, SSIM.
Santa-Cruz et al. [7]: JPEG2000 test set, 7 images, no classes; algorithms compared: JPEG, JPEG-LS, MPEG4-VTC, SPIHT, PNG, JPEG2000; compression parameters: bit rate, lossless/lossy, progressive or not; performance measures: PSNR, error resilience, compression rate, speed.
Ours: various images, 259 images, 4 classes; algorithms compared: JPEG, JPEG2000, proprietary JPEG2000; compression parameters: bit rate, quality required; performance measures: PSNR, SSIM, speed.
and tried various criteria. They covered some of the most recent compression algorithms. They observed the effects of wavelet and cosine transforms and of various coding algorithms like EBCOT, SPIHT and Huffman coding, and considered lossiness. They used various quality metrics and image artifacts as performance measures. Finally, they assessed not only compressor speed but also error resilience (i.e., robustness to transmission errors) and whether ROI coding and multiscaling are supported. According to these studies, JPEG2000 and AVC/H.264 perform better than JPEG-XR. Moreover, JPEG2000 outperforms JPEG in terms of image quality for medium and strong compression. Unfortunately, it lacks speed when compared to JPEG. JPEG is the best algorithm for weak compression in terms of quality and speed. For lossless compression, JPEG-LS stands as the best choice regardless of the criteria. Table 1 summarizes experimental information about these works. Those studies greatly improved our knowledge about the most popular compression algorithms. However, the datasets used in these experiments do not exceed 30 images, which cannot show the influence of image content on the compression process. Indeed, tests are made on a whole dataset and conclusions are drawn from the means of some performance measures. In this work, we address these issues by adding the image content as a new criterion to the standard comparison framework. By varying the image classes and using a greater number of images, we show that new results can be obtained. We give an example of using this framework with a well-known and extensively studied comparison, the JPEG-JPEG2000 comparison. Only two standards are compared in order to keep the example simple; still, it allows understanding the benefits of our approach. Moreover, we propose to go beyond the standards
specifications by adding the wavelet transform window size as a new parameter to JPEG2000. Varying the images' visual characteristics with regard to this new criterion reveals interesting results for medical image compression. The next section describes the experimental protocols used in our standards comparison, along with three experiments realized following these protocols. The third section shows the results of the comparisons. We conclude by proposing an add-on to standards comparison frameworks.
2
Experimental Protocols
It is a known fact that compression depends on image content [8]. For quality assessment experiments, we propose to highlight the effects of image content according to compressor parameters and some performance measures. Because the measures will be the same for all experiments, let us first define them once and for all. Let us choose objective quality metrics. In order to draw conclusions coherent with the related works, we use the Structural SIMilarity (SSIM) [11] and the Peak Signal to Noise Ratio (PSNR). In addition to these metrics, we define the bit rate (BR) in bits per pixel (bpp) as

BR(X) = (size in bits of compressed image X) / (size in bits of image X) × image depth in bpp.
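As a hedged illustration of the measures defined above, the sketch below computes the bit rate of a compressed file and the PSNR between two images in Python. It only computes the measures; the actual compression in the paper is done with JASPER and libjpeg. The function names and the way the compressed size is obtained (file size on disk) are assumptions for the sake of the example, and SSIM is omitted (it would typically come from a library such as scikit-image).

```python
import os
import numpy as np

def bit_rate(compressed_path, width, height, depth_bpp=24):
    """BR(X): compressed size over raw size, times the image depth in bpp."""
    compressed_bits = os.path.getsize(compressed_path) * 8
    raw_bits = width * height * depth_bpp
    return compressed_bits / raw_bits * depth_bpp

def psnr(reference, distorted, peak=255.0):
    """Peak Signal-to-Noise Ratio between two images (arrays of equal shape)."""
    err = reference.astype(np.float64) - distorted.astype(np.float64)
    mse = np.mean(err ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```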
The lower the bit rate, the more compressed the image and vice versa. Finally, the computational time is used as performance measure. Tests were carried out on a 2.66GHz Pentium 4 with 1GB of RAM. We used the lossy compressors JASPER 1.900.1 [2] for JPEG2000, and libjpeg7 [6] for JPEG. We claim that image content influences compressed image quality. The content of images can be defined by features such as contrast, textures, and shapes. Different image types will have different quality loss during compression. Indeed, as some high frequencies are lost during the compression process, an image composed mainly of high frequencies will suffer a heavy quality loss. In order to explore this assumption, we gather four datasets. In our case, spatial and frequency content distinguish datasets’ visual content. The whole dataset is formed by 259 uncompressed 24 bits depth color images, except for texture and medical images which are grayscale 8 bits depth. All color images are in the sRGB color space. Table 2 summarizes and describes the four classes used in this work. Please note that the classes chosen will be dependent on the information you need to put forward. In this case the classes have been chosen for their frequential content in order to put forward differences between JPEG and JPEG2000, but other choices are possible. The evaluation framework can be summarized as follows. We first compute the three measures at fixed bit rates for both JPEG and JPEG2000: PSNR, SSIM and computation time. For a given class we compute the mean PSNR obtained with JPEG2000 and JPEG over all images. Then we compute the difference
Table 2. Four visually different datasets used in this experiment

aerial: IR sensor; 38 images; resolutions 512 × 512, 1024 × 1024, 2250 × 2250 pixels; high frequencies with little spatial extent.
outdoor: RGB camera; 129 images; resolution 800 × 530 pixels; highly textured and low frequencies with large spatial extent.
texture: digital microscope camera; 64 images; resolutions 512 × 512, 1024 × 1024 pixels; pure high frequencies.
medical: X-rays; 28 images; resolutions 256 × 256, 768 × 575 pixels; low frequencies with almost no high frequencies.
Table 3. Correspondence between JPEG quality and JPEG2000 BR in bpp

JPEG quality    aerial    outdoor    texture    medical
6               0.23      0.29       0.37       0.07
25              0.62      0.76       0.93       0.13
50              1.02      1.18       1.42       0.22
75              1.62      1.77       2.09       0.27
100             8.58      8.03       7.36       0.75
between the mean PSNR(JPEG2000) and the mean PSNR(JPEG). If the difference is positive, JPEG2000 performs better than JPEG; if it is negative, JPEG outperforms JPEG2000. We want to compare JPEG and JPEG2000 at the same bit rate. Note that the JPEG compressor is not parameterized by a given bit rate but by a quality factor expressed as a percentage. We choose five quality levels: 6, 25, 50, 75 and 100. In order to use the same bit rates for JPEG and JPEG2000, the bit rate is estimated from each of these quality values and used for JPEG2000. However, as we have no direct control on the bit rate, the values are different for each class (Tab. 3). That is why, in Fig. 2, 3, 4 and 5, classes span different bit rate intervals. For the second experiment we compare JPEG2000 to a proprietary JPEG2000, named WJPEG2000 for Windowed JPEG2000. In order to obtain a greater spatial resolution, in WJPEG2000 the wavelet transform is windowed. We apply the wavelet transform on non-overlapping fixed-size square windows (see Fig. 1). Unlike the tiling procedure of JPEG2000, only the DWT step is run separately for each window; the quantization and coding are performed on the whole set of resulting wavelet coefficients. The modification leads to a compromise between spatial and frequency resolution. Indeed, windowing the transform increases its spatial resolution. On the other hand, the transform blocks are smaller, so they have a decreased frequency resolution compared to a holistic transform. The transform window size is a new parameter of the compressor. In order to highlight the impact of this parameter on compression, we compress our dataset with various window sizes from 2 to 64 pixels in width.
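The sketch below illustrates only the windowed transform step of the WJPEG2000 idea: a single-level 2D DWT applied independently on non-overlapping windows, with the pooled coefficients left for later quantization and coding (not shown). It is an assumed stand-in written with PyWavelets, not the proprietary implementation used in the experiments; the wavelet and window size are illustrative.

```python
import numpy as np
import pywt

def windowed_dwt(image, window=16, wavelet="db4"):
    """Single-level 2D DWT computed independently on non-overlapping windows.

    Returns a dict mapping window coordinates to (cA, (cH, cV, cD)) tuples.
    """
    h, w = image.shape
    coeffs = {}
    for i in range(0, h - h % window, window):
        for j in range(0, w - w % window, window):
            block = image[i:i + window, j:j + window].astype(np.float64)
            coeffs[(i, j)] = pywt.dwt2(block, wavelet)
    return coeffs
```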
Fig. 1. The windowed DWT
3
Experimental Results
According to Fig. 2, JPEG2000 reaches the best SSIM for high bit rates (> 4 bpp). Let us zoom in on lower bit rates (Fig. 3). We see that the difference is hardly measurable for three out of four classes, with values between 0.01 and −0.01. Even for aerial images, the difference is not far from 0.01. However, at these bit rates such differences are difficult to interpret. That is why we must use the PSNR for further analysis. For the rightmost points of the curves (Fig. 2), JPEG2000 outperforms JPEG, except for some points of the medical image class. At bit rates higher than 3.8 bpp, for the texture class, JPEG outperforms JPEG2000. Now let us focus on the interval 0 to 2 bpp (Fig. 3), because the majority of image bit rates fall within this interval. As we can see, the maximum JPEG2000/JPEG difference in PSNR is 8 dB. Only one point is over a difference of 6 dB, regardless of bit rate. Above 1 bpp, the maximum difference lowers to 4 dB. Table 4 sums up the results of this comparison. For bit rates between 0 and 1 bpp, JPEG2000 shows better quality than JPEG, except for the smooth images of the medical class. For bit rates between 1 and 3.8 bpp, JPEG2000 outperforms JPEG for three out of four classes. At bit rates above 3.8 bpp and for textured images, JPEG has better quality than JPEG2000. Table 4 puts forward the difference in the comparison depending on the chosen image class. For aerial and outdoor images, which contain details of various sizes, the multiscale decomposition of JPEG2000 shows its superiority over the fixed-size windows used in JPEG. However, JPEG gives better results for smooth images. This can be due to the quantization used in JPEG, which allows preserving representative transform coefficients even if their value is low. On the contrary, the default JPEG2000 implementation uses a simple quantization scheme which could prevent the accurate reconstruction of a smooth signal composed of numerous coefficients.

Table 4. Best compressor depending on bit rate and image content

BR (bpp)    aerial      outdoor     texture     medical
0 - 1       JPEG2000    JPEG2000    JPEG2000    JPEG
1 - 3.8     JPEG2000    JPEG2000    JPEG2000    JPEG
3.8+        JPEG2000    JPEG2000    JPEG        JPEG
Fig. 2. JPEG2000/JPEG difference in PSNR and SSIM
Fig. 3. Zoom of JPEG2000/JPEG difference in PSNR and SSIM in Fig. 2
Fig. 4. JPEG2000/WJPEG2000 difference in PSNR and SSIM for BR = 0.72/0.24 bpp
Finally, textured images seem to suffer from the inverse problem compared to medical images. At high bit rates, the quantization used in JPEG2000 is not strong enough, so too many coefficients are kept, leading to a poor bit rate. Now let us compare WJPEG2000 with JPEG2000. Once more we observe differences in terms of PSNR and SSIM in order to see whether windowing the wavelet transform affects the quality. The first experiment is carried out at a bit rate of 0.72 bpp for color images and 0.24 bpp for greyscale ones (0.72/0.24 bpp). The PSNR difference is over 6 dB whatever the window size (Fig. 4). The SSIM is coherent with the PSNR analysis and shows that JPEG2000 gives better quality than WJPEG2000. The comparison becomes more interesting at higher bit rates. The next scores are obtained with a BR of 7.2/2.4 bpp. In Fig. 5 the difference in terms of SSIM is too small to help us (< 0.01). Therefore we look at the PSNR. For medical images WJPEG2000 outperforms JPEG2000. The difference in terms of PSNR is mostly over 15 dB and reaches 18 dB for the window of 17 pixels width. For the other classes, JPEG2000 outperforms WJPEG2000 by more than 5 dB. Therefore JPEG2000 has better quality than WJPEG2000 for three classes out of four. However, for medical images, WJPEG2000 shows a huge improvement over JPEG2000. In fact, windowing the wavelet transform greatly
Fig. 5. JPEG2000 - WJPEG2000 difference in PSNR and SSIM for BR = 7.2/2.4 bpp
Fig. 6. Compression time for JPEG, JPEG2000 and WJPEG2000 for each class using BR = 7.2/2.4 bpp
improves the compression of medical images. This provides an interesting direction to explore in future work. Indeed, the main feature of medical images is their smoothness. We conclude this section by pointing out that the separation of our dataset into classes allowed us to see this result, which would have gone unnoticed in a non-classified dataset. Regarding the mean computational time, Fig. 6 shows that for BR = 7.2/2.4 bpp, JPEG is the most efficient algorithm, being between two (medical) and five times (texture) faster than JPEG2000 and WJPEG2000. WJPEG2000 is up to 1.14 times slower than JPEG2000. For BR = 0.72/0.24 bpp, the results are similar: JPEG is two to five times faster than the others, and WJPEG2000 is up to 1.36 times slower than JPEG2000. Note that the computational time is largely independent of image content.
4
Conclusion
Most compression standard comparisons do not take image content into account. In this paper, we advocate an add-on to the regular compression standard quality assessment framework: we use the image content as an additional criterion in the assessment. By splitting the dataset into various classes and analyzing each one independently, we provide some insight into the differences between JPEG2000 and JPEG. Furthermore, this method allowed us to bring out an interesting result, namely that windowing the wavelet transform in JPEG2000 increases its quality in terms of PSNR and SSIM for medical images. Future work includes gathering a larger and more structured dataset in order to compare various compression standards. Moreover, the windowing of a wavelet transform for compression purposes invites further exploration.
References
1. Ahumada, A.J.: Computational image quality metrics: a review. In: SID International Symposium, Digest of Technical Papers, vol. 24, pp. 305–308 (1993)
2. Adams, M.: Jasper jpeg2000 codec (2007), http://www.ece.uvic.ca/~mdadams/jasper/
3. Avcibas, I., Sankur, B., Sayood, K.: Statistical evaluation of image quality measures. Journal of Electronic Imaging 11, 206–223 (2002)
4. Ebrahimi, F., Chamik, M., Winkler, S.: JPEG vs. JPEG2000: an objective comparison of image encoding quality. In: Proceedings of SPIE Applications of Digital Image Processing, vol. 5558, pp. 300–308 (2004)
5. Eskicioglu, A.M.: Quality measurement for monochrome compressed images in the past 25 years. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2000, vol. 6, p. 1907 (2000)
6. IJG: Jpeglib jpeg codec (2011), http://www.ijg.org/
7. Santa-Cruz, D., Grosbois, R., Ebrahimi, T.: JPEG 2000 performance evaluation and assessment. Signal Processing: Image Communication 17(1), 113–130 (2002)
8. Serra-Sngrista, J., Auli, F., Garcia, F., Gonzalez, J., Guiturt, P.: Evaluation of still image coders for remote sensing applications. In: IEEE Sensor Array and Multichannel Signal Processing Workshop (2004)
9. Simone, F.D., Ticca, D., Dufaux, F., Ansorge, M., Ebrahimi, T.: A comparative study of color image compression standards using perceptually driven quality metrics (2008), http://infoscience.epfl.ch/record/125933
10. Wang, Z., Bovik, A., Lu, L.: Why is image quality assessment so difficult? In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2002, vol. 4, pp. IV-3313–IV-3316 (2002)
11. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
A Bio-Inspired Image Coder with Temporal Scalability Khaled Masmoudi1 , Marc Antonini1 , and Pierre Kornprobst2 1
2
I3S laboratory–UNS–CNRS Sophia-Antipolis, France [email protected] http://www.i3s.unice.fr/~kmasmoud NeuroMathComp Team Project–INRIA Sophia-Antipolis, France
Abstract. We present a novel bio-inspired and dynamic coding scheme for static images. Our coder aims at reproducing the main steps of the visual stimulus processing in the mammalian retina taking into account its time behavior. The main novelty of this work is to show how to exploit the time behavior of the retina cells to ensure, in a simple way, scalability and bit allocation. To do so, our main source of inspiration will be the biologically plausible retina model called Virtual Retina. Following a similar structure, our model has two stages. The first stage is an image transform which is performed by the outer layers in the retina. Here it is modelled by filtering the image with a bank of difference of Gaussians with time-delays. The second stage is a time-dependent analog-to-digital conversion which is performed by the inner layers in the retina. Thanks to its conception, our coder enables scalability and bit allocation across time. Also, our decoded images do not show annoying artefacts such as ringing and block effects. As a whole, this article shows how to capture the main properties of a biological system, here the retina, in order to design a new efficient coder. Keywords: Static image compression, bio-inspired signal coding, retina.
1
Introduction
Intensive efforts have been made during the past two decades for the design of lossy image coders, yielding several standards such as JPEG and JPEG2000 [1,3]. These compression algorithms mostly followed the same conception schema, though improving considerably the performance in terms of cost and quality. Yet, it has become clear that little is still to be gained if no shift is made in the philosophy underlying the design of coders. In this paper, we propose a novel image codec based on visual system properties: our aim is to set a new framework for coder design. In this context, neurophysiologic studies “have demonstrated that our sensory systems are remarkably efficient at coding the sensory environment” [8], and we are convinced that an interdisciplinary approach would improve coding algorithms.
We focused on the complex computations that the mammalian retina operates to transform the incoming light stimulus into a set of uniformly-shaped impulses, also called spikes. Indeed, recent studies such as [7] confirmed that the retina is doing non-trivial operations to the input signal before transmission, so that our goal here is to capture the main properties of the retina processing for the design of our new coder. Several efforts in the literature reproduced fragments of this retina processing through bio-inspired models and for various vision tasks, for example: object detection and robot movement decision [9], fast categorization [19,20], and regions of interest detection for bit allocation [13]. But most of these approaches do not account for the precise retina processing. Besides, these models overlooked the signal recovery problem which is crucial in the coding application. Attempts in this direction were done making heavy simplifications at the expense of biological relevance [14] or restricting the decoding ability within a set of signals in a dictionary [15]. Here, the originality of our work is twofold: we focus explicitly on the coding application and we keep our design as close as possible to biological reality considering most of the mammalian retina processing features. Our main source of inspiration will be the biologically plausible Virtual Retina model [23] whose goal was to find the best compromise between the biological reality and the possibility to make large-scale simulations. Based on this model, we propose a coding scheme following the architecture and functionalities of the retina, doing some adaptations due to the application. This paper is organized as follows. In Section 2 we revisit the retina model called Virtual Retina [23]. In Section 3, we show how this retina model can be used as the basis of a novel bio-inspired image coder. The coding pathway is presented in a classical way distinguishing two stages: the image transform and the analog-to-digital (A/D) converter. In Section 4 we present the decoding pathway. In Section 5 we show the main results that demonstrate the properties of our model. In Section 6 we summarize our main conclusions.
2
Virtual Retina: A Bio-Inspired Retina Model
The motivation of our work is to investigate the retina functional architecture and use it as a design basis to devise new codecs. So, it is essential to understand the main functional principles of the retina processing. The literature in computational neuroscience dealing with the retina proposes different models. These models are very numerous, ranging from detailed models of a specific physiological phenomenon to large-scale models of the whole retina. In this article, we focus on the category of large-scale retina models, as we are interested in a model that gathers the main features of the mammalian retina. Within this category, we considered the retina model called Virtual Retina [23]. This model is one of the most complete in the literature, as it encompasses the major features of the actual mammalian retina. It is mostly state-of-the-art, and the authors confirmed its relevance by accurately reproducing real cell recordings for several experiments.
Fig. 1. (a) Schematic view of the Virtual Retina model proposed by [23]. (b) and (c): Overview of our bio-inspired codec. Given an image, the static DoG-based multiscale transform generates the sub-bands {Fk }. DoG filters are sorted from the lowest frequency-band filter DoG0 to the highest one DoGN−1 . Each sub-band Fk is delayed using a time-delay circuit Dtk , with tk < tk+1 . The time-delayed multi-scale output is then made available to the subsequent coder stages. The final output of the coder is a set of spike series, and the coding feature adopted will be the spike count nkij (tobs ) recorded for each neuron indexed by (kij) at a given time tobs .
The architecture of the Virtual Retina model follows the structure of mammalian retina as schematized in Figure 1(a). The model has several interconnected layers and three main processing steps can be distinguished: – Outer layers: The first processing step is described by non-separable spatiotemporal filters, behaving as time-dependent edge detectors. This is a classical step implemented in several retina models. – Inner layers: A non-linear contrast gain control is performed. This step models mainly bipolar cells by control circuits with time-varying conductances. – Ganglionic layer: Leaky integrate and fire neurons are implemented to model the ganglionic layer processing that finally converts the stimulus into spikes. Given this model as a basis, our goal is to adapt it to conceive the new codec presented in the next sections.
3
The Coding Pathway
The coding pathway is schematized in Figure 1(b). It follows the same architecture as Virtual Retina. However, since we have to define also a decoding pathway, we need to think about the invertibility of each processing stage. For this reason some adaptations are required and described in this section. 3.1
The Image Transform: The Outer Retina Layers
In Virtual Retina, the outer layers were modelled by a non-separable spatiotemporal filtering. This processing produces responses corresponding to spatial or temporal variations of the signal because it models time-dependent interactions between two low-pass filters: this is termed center-surround differences. This stage has the property that it responds first to low spatial frequencies and later to higher frequencies. This time-dependent frequency integration was shown for Virtual Retina (see [24]) and it was confirmed experimentally (see, e.g., [17]). This property is interesting as a large amount of the total signal energy is contained in the lower frequency sub-bands, whereas high frequencies bring further details. This idea already motivated bit allocation algorithms to concentrate the resources for a good recovery on lower frequencies. However, it appears that inverting this non-separable spatio-temporal filtering is a complex problem [24,25]. To overcome this difficulty, we propose to model differently this stage while keeping its essential features. To do so, we decomposed this process into two steps: The first one considers only center-surround differences in the spatial domain (through differences of Gaussians) which is justified by the fact that our coder here gets static images as input. The second step reproduces the time-dependent frequency integration by the introduction of time-delays. Center-Surround Differences in the Spatial Domain: DoG. Neurophysiologic experiments have shown that, as for classical image coders, the retina encodes the stimulus representation in a transform domain. The retinal stimulus transform is performed in the cells of the outer layers, mainly in the outer plexiform layer (OPL). Quantitative studies such as [6,16] have proven that the OPL cells processing can be approximated by a linear filtering. In particular, the authors in [6] proposed the largely adopted DoG filter which is a weighted difference of spatial Gaussians that is defined as follows: DoG(x, y) = wc Gσc (x, y) − ws Gσs (x, y),
(1)
where wc and ws are the respective weights of the center and surround components of the receptive fields, and σc and σs are the standard deviations of the Gaussian kernels Gσc and Gσs . In terms of implementation, as in [20], the DoG cells can be arranged in a dyadic grid to sweep all the stimulus spectrum as schematized in Figure 2(a). Each layer k in the grid, is tiled with DoGk cells having a scale sk and generating
Fig. 2. (a) Input Lena Image. (b) Example of a dyadic grid of DoG’s used for the image analysis (from [20]). (c) Example on image (a) of DoG coefficients generated by the retina model (the sub-bands are shown in the logarithmic scale).
a transform sub-band F_k, where σ_{s_{k+1}} = (1/2) σ_{s_k}. So, in order to measure the degree of activation \bar{I}^{opl}_{kij} of a given DoG_k cell at the location (i, j) with a scale s_k, we compute the convolution of the original image f by the DoG_k filter:

\bar{I}^{opl}_{kij} = \sum_{x,y=-\infty}^{\infty} DoG_k(i − x, j − y) f(x, y).    (2)
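As a hedged illustration of equations (1) and (2), the sketch below computes a DoG response as the difference of two Gaussian-blurred images and stacks such responses for a dyadic set of scales. It is a simplification of the scheme above: the sub-bands are kept at full image resolution instead of being placed on the paper's dyadic grid, the low-pass scaling function and the time-delay circuits are omitted, and the weights and scale values are illustrative.

```python
import numpy as np
from scipy import ndimage

def dog_response(image, sigma_c, sigma_s, w_c=1.0, w_s=0.9):
    """Center-surround activation: w_c * G_{sigma_c} * f  -  w_s * G_{sigma_s} * f."""
    img = image.astype(np.float64)
    return (w_c * ndimage.gaussian_filter(img, sigma_c)
            - w_s * ndimage.gaussian_filter(img, sigma_s))

def dog_pyramid(image, n_scales=4, sigma_c0=0.5, ratio=2.0):
    """Dyadic set of DoG sub-bands, coarse to fine: sigma is halved at each step."""
    sub_bands = []
    sigma_c = sigma_c0 * ratio ** (n_scales - 1)   # start at the coarsest scale
    for _ in range(n_scales):
        sub_bands.append(dog_response(image, sigma_c, 2.0 * sigma_c))
        sigma_c /= ratio                           # next (finer) scale
    return sub_bands
```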
This generates a set of (4/3) N^2 − 1 coefficients for an N^2-sized image, as it works in the same fashion as a Laplacian pyramid [2]. An example of such a bio-inspired multi-scale decomposition is shown in Figure 2(b). Note here that we added to this bank of filters a Gaussian low-pass scaling function that represents the state of the OPL filters at the time origin. This yields a low-pass coefficient \bar{I}^{opl}_{000} and enables the recovery of a low-pass residue at the reconstruction level [5,12].

Integrating Time Dynamics through Time-Delay Circuits. Of course, the model described in (2) has no dynamical properties. In the actual retina, the surround G_{σ_s} in (1) appears progressively across time, driving the filter passband from low frequencies to higher ones. Our goal is to reproduce this phenomenon, which we called time-dependent frequency integration. To do so, we added in the coding pathway of each sub-band F_k a time-delay circuit D_{t_k}. The value of t_k is specific to F_k and is linearly increasing as a function of k. The t_k-delay causes the sub-band F_k to be transmitted to the subsequent stages of the coder starting from the time t_k. The time-delayed activation coefficient I^{opl}_{kij}(t) computed at the location (i, j) for the scale s_k at time t is now defined as follows:

I^{opl}_{kij}(t) = \bar{I}^{opl}_{kij} \, \mathbb{1}_{\{t \geq t_k\}}(t),    (3)

where \mathbb{1}_{\{t \geq t_k\}} is the indicator function such that \mathbb{1}_{\{t \geq t_k\}}(t) = 0 if t < t_k and 1 otherwise.

3.2
The A/D Converter: Inner and Ganglionic Layers
The retinal A/D converter is defined based on the processing occurring in the inner and ganglionic layers, namely a contrast gain control, a non-linear rectification
and a discretization based on leaky integrate-and-fire (LIF) neurons [10]. A different treatment will be performed for each delayed sub-band, and this produces a natural bit allocation mechanism. Indeed, as each sub-band F_k is presented at a different time t_k, it will be subject to a transform according to the state of our dynamic A/D converter at t_k.

Contrast Gain Control. The retina adjusts its operational range to match the input stimuli magnitude range. This is done by an operation called contrast gain control, mainly performed in the bipolar cells. Indeed, real bipolar cell conductance is time varying, resulting in a phenomenon termed shunting inhibition. This shunting avoids the system saturation by reducing high magnitudes. In Virtual Retina, given the scalar magnitude \bar{I}^{opl}_{kij} of the input step current I^{opl}_{kij}(t), the contrast gain control is a non-linear operation on the potential of the bipolar cells. This potential varies according to both the time and the magnitude value \bar{I}^{opl}_{kij}, and will be denoted by V^{b}_{kij}(t, \bar{I}^{opl}_{kij}). This phenomenon is modelled, for a constant value of \bar{I}^{opl}_{kij}, by the following differential equation:

c^b \frac{dV^{b}_{kij}(t, \bar{I}^{opl}_{kij})}{dt} + g^{b}(t) V^{b}_{kij}(t, \bar{I}^{opl}_{kij}) = I^{opl}_{kij}(t),  for t ≥ 0,    (4)
g^{b}(t) = E_{\tau^b} \ast_t Q(V^{b}_{kij}(t, \bar{I}^{opl}_{kij})),

where Q(V^{b}_{kij}) = g^{b}_0 + \lambda^b \left( V^{b}_{kij}(t) \right)^2 and E_{\tau^b}(t) = \frac{1}{\tau^b} \exp\left( \frac{-t}{\tau^b} \right), for t ≥ 0. Figure 3(a) shows the time behavior of V^{b}_{kij}(t, \bar{I}^{opl}_{kij}) for different magnitude values \bar{I}^{opl}_{kij} of I^{opl}_{kij}(t).
Non-Linear Rectification. Then, the potential V^{b}_{kij}(t, \bar{I}^{opl}_{kij}) is subject to a non-linear rectification yielding the so-called ganglionic current I^{g}_{kij}(t, \bar{I}^{opl}_{kij}). Virtual Retina models it, for a constant scalar value \bar{I}^{opl}_{kij}, by:

I^{g}_{kij}(t, \bar{I}^{opl}_{kij}) = N\left( T_{w^g,\tau^g}(t) \ast V^{b}_{kij}(t, \bar{I}^{opl}_{kij}) \right),  for t ≥ 0,    (5)

where w^g and \tau^g are constant scalar parameters, T_{w^g,\tau^g} is the linear transient filter defined by T_{w^g,\tau^g} = \delta_0(t) − w^g E_{\tau^g}(t), and N is defined by:

N(v) = \begin{cases} \dfrac{i^g_0}{i^g_0 − \lambda^g (v − v^g_0)}, & \text{if } v < v^g_0 \\ i^g_0 + \lambda^g (v − v^g_0), & \text{if } v \geq v^g_0, \end{cases}

where i^g_0, v^g_0, and \lambda^g are constant scalar parameters. Figure 3(b) shows the time behavior of I^{g}_{kij}(t, \bar{I}^{opl}_{kij}) for different values of \bar{I}^{opl}_{kij}. As the currents \bar{I}^{opl}_{kij} are delayed with times {t_k}, our goal is to catch the instantaneous behavior of the inner layers at these times {t_k}.
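Below is a hedged sketch of the two inner-layer operations just described: a forward-Euler integration of the gain-control equation (4), with the exponential kernel replaced by an equivalent first-order low-pass update, and the piecewise rectifier N of equation (5) as written in the text. The transient filter T is not applied, and all numerical constants are normalized placeholders rather than the paper's parameter values.

```python
def gain_control_potential(i_opl, dt=1e-3, t_end=0.5,
                           c_b=1.0, g_b0=1.0, lam_b=0.5, tau_b=0.05):
    """Euler sketch of Eq. (4): bipolar potential under shunting feedback."""
    v = 0.0
    g_feedback = g_b0
    for _ in range(int(t_end / dt)):
        q = g_b0 + lam_b * v * v                     # Q(V) = g0 + lambda * V^2
        g_feedback += dt / tau_b * (q - g_feedback)  # leaky average ~ E_tau * Q(V)
        v += dt / c_b * (i_opl - g_feedback * v)     # c dV/dt + g(t) V = I
    return v

def rectify(v, i_g0=1.0, v_g0=0.5, lam_g=2.0):
    """Piecewise rectifier N(v) of Eq. (5), as written above."""
    if v < v_g0:
        return i_g0 / (i_g0 - lam_g * (v - v_g0))
    return i_g0 + lam_g * (v - v_g0)
```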
Fig. 3. (a): V^{b}_{kij}(t) as a function of time for different values of \bar{I}^{opl}; (b): I^{g}_{kij} as a function of time for different values of \bar{I}^{opl}; (c): the functions f^{g}_{t_k} that map \bar{I}^{opl}_{kij} into \bar{I}^{r}_{kij} for different values of t_k; (d): the functions f^{n}_{t_{obs}} that map \bar{I}^{r}_{kij} into n_{kij} for different values of t_{obs}
This amounts to inferring the transforms f^{g}_{t_k}(\bar{I}^{opl}_{kij}) that map a given scalar magnitude \bar{I}^{opl}_{kij} into a rectified current \bar{I}^{r}_{kij} as the modelled inner layers would generate it at t_k. To do so, we start from the time-varying curves of I^{g}_{kij}(t, \bar{I}^{opl}_{kij}) in Figure 3(b) and we make a transversal cut at each time t_k: we show in Figure 3(c) the resulting maps f^{g}_{t_k} such that I^{g}_{kij}(t_k, \bar{I}^{opl}_{kij}) = f^{g}_{t_k}(\bar{I}^{opl}_{kij}). As for I^{opl}_{kij}(t) (see (3)), we introduce the time dimension using the indicator function \mathbb{1}_{\{t \geq t_k\}}(t). The final output of this stage is the set of step functions I^{r}_{kij}(t) defined by:

I^{r}_{kij}(t) = \bar{I}^{r}_{kij} \, \mathbb{1}_{\{t \geq t_k\}}(t),  with  \bar{I}^{r}_{kij} = f^{g}_{t_k}(\bar{I}^{opl}_{kij}).    (6)
This non-linear rectification is analogous to a widely-used telecommunication technique: the companding [4]. Companders are used to make the quantization steps unequal after a linear gain control stage. Though, unlike A − law or μ−law companders that amplify low magnitudes, the inner layers emphasize high
magnitudes in the signal. Besides, the inner layers stage has a time-dependent behavior, whereas a usual gain controller/compander is static, and this makes our A/D converter go beyond the standards.

Leaky Integrate-and-Fire Quantization. The ganglionic layer is the deepest one tiling the retina: it transforms a continuous signal I^{r}_{kij}(t) into discrete sets of spike trains. As in Virtual Retina, this stage is modelled by leaky integrate-and-fire (LIF) neurons, which is a classical model. One LIF neuron is associated to every position in each sub-band F_k. The time behavior of a LIF neuron is governed by the fluctuation of its voltage V_{kij}(t). Whenever V_{kij}(t) reaches a predefined threshold δ, a spike is emitted and the voltage goes back to a resting potential V_{R0}. Between two spike emission times, t^{(l)}_{kij} and t^{(l+1)}_{kij}, the potential evolves according to the following differential equation:

c^l \frac{dV_{kij}(t)}{dt} + g^l V_{kij}(t) = I^{r}_{kij}(t),  for t ∈ [t^{(l)}_{kij}, t^{(l+1)}_{kij}],    (7)
where g^l is a constant conductance, and c^l is a constant capacitance. In the literature, neuron activity is commonly characterized by the count of spikes emitted during an observation time bin [0, t_{obs}], which we denote by n_{kij}(t_{obs}) [22]. Obviously, as n_{kij}(t_{obs}) encodes the value of I^{r}_{kij}(t), there is a loss of information because n_{kij}(t_{obs}) is an integer. The LIF is thus performing a quantization. If we observe the instantaneous behavior of the ganglionic layer at different times t_{obs}, we get a quasi-uniform scalar quantizer that refines in time. We can show this by a process similar to the one described in the previous paragraph: we show in Figure 3(d) the resulting maps f^{n}_{t_{obs}} such that n_{kij}(t_{obs}) = f^{n}_{t_{obs}}(\bar{I}^{r}_{kij}). Based on the set {n_{kij}(t_{obs})} measured at the output of our coder, we describe in the next section the decoding pathway to recover the initial image f(x, y).
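The sketch below illustrates the LIF spike-count quantizer of equation (7) for a constant driving current, integrated with a forward Euler step. The constants are normalized placeholders, not the paper's parameter values; the point of the example is only that a larger observation time t_obs yields more spikes and hence a finer quantization of the input current.

```python
def lif_spike_count(i_r, t_obs, dt=1e-4, c_l=1.0, g_l=1.0, delta=0.05, v_rest=0.0):
    """Count spikes of a leaky integrate-and-fire neuron driven by a constant
    current i_r over [0, t_obs]; the count is the coded (integer) value."""
    v = v_rest
    spikes = 0
    for _ in range(int(t_obs / dt)):
        v += dt / c_l * (i_r - g_l * v)   # Eq. (7), forward Euler
        if v >= delta:                    # threshold crossing: emit a spike
            spikes += 1
            v = v_rest                    # reset to the resting potential
    return spikes
```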
4
The Decoding Pathway
The decoding pathway is schematized in Figure 1(c). It consists in inverting, step by step, each coding stage described in Section 3. At a given time t_{obs}, the coded data is the set of ((4/3) N^2 − 1) spike counts n_{kij}(t_{obs}); this section describes how we can recover an estimation \tilde{f}_{t_{obs}} of the N^2-sized input image f(x, y). Naturally, the recovered image \tilde{f}_{t_{obs}}(x, y) depends on the time t_{obs}, which ensures time-scalability: the quality of the reconstruction improves as t_{obs} increases. The ganglionic and inner layers are inverted using look-up tables constructed off-line, and the image is finally recovered by a direct reverse transform of the outer layers processing.

Recovering the Input of the Ganglionic Layer. First, given a spike count n_{kij}(t_{obs}), we recover \tilde{I}^{r}_{kij}(t_{obs}), the estimation of I^{r}_{kij}(t_{obs}). To do so, we compute off-line the look-up table n_{t_{obs}}(\bar{I}^{r}_{kij}) that maps the set of current magnitude values \bar{I}^{r}_{kij} into spike counts at a given observation time t_{obs} (see Figure 3(d)).
The reverse mapping is done by a simple interpolation in the reverse look-up table denoted LUT^{LIF}_{t_{obs}}. Here we draw the reader's attention to the fact that, as the input of the ganglionic layer is delayed, each coefficient of the sub-band F_k is decoded according to the reverse map LUT^{LIF}_{t_{obs} − t_k}. Obviously, the recovered coefficients do not match exactly the original ones due to the quantization performed in the LIFs.

Recovering the Input of the Inner Layers. Second, given a rectified current value \tilde{I}^{r}_{kij}(t_{obs}), we recover \tilde{I}^{opl}_{kij}(t_{obs}), the estimation of I^{opl}_{kij}(t_{obs}). In the same way as for the preceding stage, we infer the reverse “inner layers mapping” through the pre-computed look-up table LUT^{CG}_{t_{obs}}. The current intensities \tilde{I}^{opl}_{kij}(t_{obs}), corresponding to the retinal transform coefficients, are passed to the subsequent retinal transform decoder.

Recovering the Input Stimulus. Finally, given the set of (4/3) N^2 − 1 coefficients {\tilde{I}^{opl}_{kij}(t_{obs})}, we recover \tilde{f}_{t_{obs}}(x, y), the estimation of the original image stimulus f(x, y). Though the dot product of every pair of DoG filters is approximately equal to 0, the set of filters considered is not strictly orthonormal. We proved in [11] that there exists a dual set of vectors enabling an exact reconstruction. Hence, the reconstruction estimate \tilde{f} of the original input f can be obtained as follows:

\tilde{f}_{t_{obs}}(x, y) = \sum_{\{kij\}} \tilde{I}^{opl}_{kij}(t_{obs}) \, \widetilde{DoG}_k(i − x, j − y),    (8)

where {kij} is the set of possible scales and locations in the considered dyadic grid and \widetilde{DoG}_k are the duals of the DoG_k filters obtained as detailed in [11]. Equation (8) defines a progressive reconstruction depending on t_{obs}. This feature makes the coder time-scalable.
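The sketch below illustrates only the table-based inversion of the ganglionic stage: a spike-count table is built for a range of candidate currents (reusing the lif_spike_count sketch from Section 3.2 above) and inverted by interpolation. The inner-layer look-up table and the dual-filter reconstruction of Eq. (8) are not shown, and the current range and observation time are illustrative assumptions.

```python
import numpy as np

def build_lif_lut(t_obs, currents):
    """Tabulate current -> spike count at time t_obs (the map of Fig. 3(d))."""
    return np.array([lif_spike_count(i, t_obs) for i in currents])

def invert_spike_count(n_spikes, t_obs, currents):
    """Estimate the driving current from a spike count by interpolating
    in the pre-computed (monotonically non-decreasing) look-up table."""
    counts = build_lif_lut(t_obs, currents)
    return np.interp(n_spikes, counts, currents)

# Example: tabulate 200 candidate currents, code one value, then decode it.
currents = np.linspace(0.0, 2.0, 200)
n = lif_spike_count(1.3, t_obs=0.5)
i_hat = invert_spike_count(n, 0.5, currents)
```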
5
Results
We show examples of image reconstruction using our bio-inspired coder at different times¹. Then, we study these results in terms of quality and bit-cost. Quality is assessed by classical image quality criteria (PSNR and mean SSIM [21]). The cost is measured by the Shannon entropy H(t_{obs}) over the population of {n_{kij}(t_{obs})}. The entropy computed in bits per pixel (bpp), for an N^2-sized image, is defined by:

H(t_{obs}) = \frac{1}{N^2} \sum_{k=0}^{K-1} 2^{2k} H\left( \left\{ n_{s_k ij}(t_{obs}),\ (i, j) \in [0, 2^k − 1]^2 \right\} \right),

where K is the number of analyzing sub-bands. Figure 4 shows two examples of progressive reconstruction obtained with our new coder. The new concept of time scalability is an interesting feature as it
In g0b ig0 gL
all experiments, the model parameters are set to biologically realistic values: g^b_0 = 8·10^{-10} S, τ^b = 12·10^{-3} s, λ^b = 9·10^{-7}, c^b = 1.5·10^{-10} F, v^g_0 = 4·10^{-3} V, i^g_0 = 15·10^{-12} A, w^g = 8·10^{-1}, τ^g = 16·10^{-3} s, λ^g = 12·10^{-9} S, δ = 2·10^{-3} V, g^L = 2·10^{-9} S, V_{R0} = 0 V, t_k = 5·10^{-3} + k·10^{-3} s.
Fig. 4. Progressive image reconstruction of Lena and Cameraman using our new bioinspired coder. The coded/decoded image is shown at: 20 ms, 30 ms, 40 ms, and 50 ms. Rate/Quality are computed for each image in terms of the triplet (bit-cost in bpp/ PSNR quality in dB/ mean SSIM quality). Upper line: From left to right (0.07 bpp/ 20.5 dB/ 0.59), (0.38 bpp/ 24.4 dB/ 0.73), (1.0 bpp/ 29.1 dB/ 0.86), and (2.1 bpp/ 36.3 dB/ 0.95). Lower line: From left to right (0.005 bpp/ 15.6 dB/ 0.47), (0.07 bpp/ 18.9 dB/ 0.57), (0.4 bpp/ 23 dB/ 0.71), and (1.2 bpp/ 29.8 dB/ 0.88).
introduces time dynamics in the design of the coder. This is a consequence of the mimicking of the actual retina. We also notice that, as expected, low frequencies are transmitted first to get a first approximation of the image, then details are added progressively to draw its contours. The bit-cost of the coded image is slightly high. This can be explained by the fact that Shannon entropy is not the most relevant metric in our case as no context is taken into consideration, especially the temporal context. Indeed, one can easily predict the number of spikes at a given time t knowing nkij (t − dt). Note also that no compression techniques, such that bit-plane coding, are yet employed. Our paper aims mainly at setting the basis of new bio-inspired coding designs. For the reasons cited above, the performance of our coding scheme in terms of bit-cost have still to be improved to be competitive with the well established JPEG and JPEG2000 standards. Thus we show no comparison in this paper. Though primary results are encouraging, noting that optimizing the bitallocation mechanism and exploiting coding techniques as bit-plane coding [18] would improve considerably the bit-cost. Besides, the image as reconstructed with our bio-inspired coder shows no ringing and no block effect. Finally our codec enables scalability in an original fashion through the introduction of time dynamics within the coding mechanism. Note also that differentiation in the processing of sub-bands, introduced through time-delays in the retinal transform, enables implicit but still not optimized bit-allocation. In particular the non-linearity in the inner layers stage
amplifies singularities and contours, and these provide crucial information for the analysis of the image. The trade-off between the emphasis put on high frequencies and the time delay before their coding starts is still an issue to investigate.
6
Conclusion
We proposed a new bio-inspired codec for static images. The image coder is based on two stages. The first stage is the image transform as performed by the outer layers of the retina. In order to integrate time dynamics, we added to this transform time delays that are sub-band specific so that, each sub-band is processed differently. The second stage is a succession of two dynamic processing steps mimicking the deep retina layers behavior. The latter perform an A/D conversion and generate a spike-based, invertible, retinal code for the input image in an original fashion. Our coding scheme offers interesting features such as (i) time-scalability, as the choice of the observation time of our codec enables different reconstruction qualities, and (ii) bit-allocation, as each sub-band of the image transform is separately mapped according to the corresponding state of the inner layers. Primary results are encouraging, noting that optimizing the bit-allocation and using coding techniques as bit-plane coding would improve considerably the cost. This work is at the crossroads of diverse hot topics in the fields of neurosciences, brain-machine interfaces, and signal processing and tries to lay the groundwork for future efforts, especially concerning the design of new biologically inspired coders.
References 1. Antonini, M., Barlaud, M., Mathieu, P., Daubechies, I.: Image coding using wavelet transform. IEEE Transactions on Image Processing (1992) 2. Burt, P., Adelson, E.: The Laplacian pyramid as a compact image code. IEEE Transactions on Communications 31(4), 532–540 (1983) 3. Christopoulos, C., Skodras, A., Ebrahimi, T.: The JPEG2000 still image coding system: An overview. IEEE Transactions on Consumer Electronics 16(4), 1103– 1127 (2000) 4. Clark, A., et al.: Electrical picture-transmitting system. US Patent assigned to AT& T (1928) 5. Crowley, J., Stern, R.: Fast computation of the difference of low-pass transform. IEEE Transactions on Pattern Analysis and Machine Intelligence (2), 212–222 (2009) 6. Field, D.: What is the goal of sensory coding? Neural Computation 6(4), 559–601 (1994) 7. Gollisch, T., Meister, M.: Eye smarter than scientists believed: Neural computations in circuits of the retina. Neuron. 65(2), 150–164 (2010) 8. Graham, D., Field, D.: Efficient coding of natural images. New Encyclopedia of Neuroscience (2007)
9. Linares-Barranco, A., Gomez-Rodriguez, F., Jimenez-Fernandez, A., Delbruck, T., Lichtensteiner, P.: Using FPGA for visuo-motor control with a silicon retina and a humanoid robot. In: Proceedings of ISCAS 2007, pp. 1192–1195. IEEE, Los Alamitos (2007) 10. Masmoudi, K., Antonini, M., Kornprobst, P.: Another look at the retina as an image scalar quantizer. In: Proceedings of ISCAS 2010, pp. 3076–3079. IEEE, Los Alamitos (2010) 11. Masmoudi, K., Antonini, M., Kornprobst, P.: Exact reconstruction of the rank order coding using frames theory. ArXiv e-prints (2011), http://arxiv.org/abs/ 1106.1975v1 12. Masmoudi, K., Antonini, M., Kornprobst, P., Perrinet, L.: A novel bio-inspired static image compression scheme for noisy data transmission over low-bandwidth channels. In: Proceedings of ICASSP, pp. 3506–3509. IEEE, Los Alamitos (2010) 13. Ouerhani, N., Bracamonte, J., Hugli, H., Ansorge, M., Pellandini, F.: Adaptive color image compression based on visual attention. In: Proceedings of IEEE ICIAP, pp. 416–421. IEEE, Los Alamitos (2002) 14. Perrinet, L.: Sparse Spike Coding: applications of Neuroscience to the processing of natural images. In: Proceedings of SPIE, the International Society for Optical Engineering, number ISSN (2008) 15. Pillow, J., Shlens, J., Paninski, L., Sher, A., Litke, A., Chichilnisky, E., Simoncelli, E.: Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature 454(7207), 995–999 (2008) 16. Rodieck, R.: Quantitative analysis of the cat retinal ganglion cells response to visual stimuli. Vision Research 5(11), 583–601 (1965) 17. Sterling, P., Cohen, E., Smith, R., Tsukamoto, Y.: Retinal circuits for daylight: why ballplayers don’t wear shades. Analysis and Modeling of Neural Systems, 143–162 (1992) 18. Taubman, D.: High performance scalable image compression with ebcot. IEEE Transactions on Image Processing 9(7), 1158–1170 (2000) 19. Thorpe, S., Gautrais, J.: Rank order coding. Computational Neuroscience: Trends in Research 13, 113–119 (1998) 20. Van Rullen, R., Thorpe, S.: Rate coding versus temporal order coding: What the retinal ganglion cells tell the visual cortex. Neural Computation 13, 1255–1283 (2001) 21. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004), http://www.cns.nyu.edu/~zwang/ 22. Gerstner, W., Kistler, W.: Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, Cambridge (2002) 23. Wohrer, A., Kornprobst, P.: Virtual retina: A biological retina model and simulator, with contrast gain control. Journal of Computational Neuroscience 26(2), 219–249 (2009) 24. Wohrer, A., Kornprobst, P., Antonini, M.: Retinal filtering and image reconstruction. Research Report RR-6960, INRIA (2009), http://hal.inria.fr/ inria-00394547/en/ 25. Zhang, Y., Ghodrati, A., Brooks, D.: An analytical comparison of three spatiotemporal regularization methods for dynamic linear inverse problems in a common statistical framework. Inverse Problems 21, 357 (2005)
Self-Similarity Measure for Assessment of Image Visual Quality Nikolay Ponomarenko1, Lina Jin2, Vladimir Lukin1, and Karen Egiazarian2 1
National Aerospace University, Dept. of Transmitters, Receivers and Signal Processing, 17 Chkalova St, 61070 Kharkov, Ukraine [email protected], [email protected] 2 Tampere University of Technology, Institute of Signal Processing, P.O. Box-553, FIN-33101 Tampere, Finland {lina.jin,karen.egiazarian}@tut.fi
Abstract. An opportunity of using self-similarity in evaluation of image visual quality is considered. A method for estimating self-similarity for a given image fragment that takes into account contrast sensitivity function is proposed. Analytical expressions for describing the proposed parameter distribution are derived, and their importance to human vision system based image visual quality full-reference evaluation is proven. A corresponding metric is calculated and a mean squared difference for the considered parameter maps in distorted and reference images is considered. Correlation between this metric and mean opinion score (MOS) for five largest openly available specialized image databases is calculated. It is demonstrated that the proposed metric provides a correlation at the level of the best known metrics of visual quality. This, in turn, shows an importance of fragment self-similarity in image perception. Keywords: full reference image visual quality, human visual system, human perception, discrete cosine transform, image self-similarity.
1 Introduction

The task of objective evaluation of image visual quality [1] has attracted considerable attention in recent years since it plays an important role in digital image processing. For example, it is desirable to adequately characterize image visual quality in lossy compression of still images and video [2], watermarking [3], and image denoising and its efficiency analysis for various applications [4]. A common feature of all these applications is the availability of a reference image (or frame sequence) with respect to which the negative influence of distortions is evaluated. Metrics intended for such an approach to visual quality evaluation are called full-reference quality metrics [5]. Although a great number of such metrics have been proposed so far, none of them corresponds to human perception perfectly. To test (verify) new and known metrics, several large image databases have been created, such as TID [6], LIVE [7], etc. Even the best known metrics, such as MSSIM [8], reach a Spearman rank correlation with MOS (Mean Opinion Score) of only about 0.85 (for the TID database). Certainly, such correspondence is high compared to conventional metrics such as PSNR and
MSE (Spearman rank correlation for them is only about 0.5). Benefits of MSSIM and other good metrics are mainly due to incorporating several known and wellestablished peculiarities of human vision system [9]. Another class of metrics includes the so-called no-reference ones [10]. These metrics are mainly exploited when image quality has to be evaluated without having the corresponding reference at a disposal. This can be, e.g., evaluation of visual quality of just-made photo, image indexing by search engines in Internet, etc. Design and testing of such metrics are even harder and less studied tasks. Recent investigations [11] have shown that Spearman correlation between MOS and best existing metrics is only at the level of 0.6. As mentioned above, further improvement of metric performance should deal with more careful consideration of peculiarities of a human perception. In this sense, the best studied and important feature is contrast sensitivity function incorporated in many metrics as MSSIM, PSNR-HVS-M [12], etc. Moreover, it is already employed in image processing applications such as the lossy compression by standard JPEG [13]. Other important features used in visual quality assessment are luminance and contrast masking [9]. Other valuable features of image perception by humans remain less understood and, thus, not taken into account in metric design and image processing applications. In this paper, we propose to use a known self-similarity property of natural images for characterizing their visual quality. To the best of our knowledge, this is an initial attempt to analyze and evaluate importance of this property although it is known that humans use to take into account image self-similarity in their analysis and characterizing distortions of different types. The remaining part of the paper is organized as follows. Section 2 describes the method for self-similarity estimation for a given image fragment. The main properties of this method are analyzed. Section 3 deals with a simple full-reference metric of image visual quality based on self-similarity. Its correspondence to human perception is analyzed.
2 Self-Similarity Measure

Any image fragment can be represented as a sum of a spatial information component (true image) and noise. The information component can be considered as something that a human can compare to fragments seen earlier in other images, or in the given image but at some other location. The error corresponding to the difference between a given block and the most similar one can then be considered as a noise component. Note that below we study image self-similarity from the viewpoint of a human's ability to find a correspondence between a given image fragment and some similar image fragment.

2.1 Basic Definitions

Denote by A a given image block (patch) of size N×M pixels. Let us find the difference between A and another patch B of the same size:
Diff(A, B) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} (A_{ij} − B_{ij})^2.    (1)
A characteristic of noise for the image fragment occupied by the patch A can be determined as:

Dissim(A) = P(n, k) \min_{D \in U} \left( Diff(A, D) \right),    (2)
where U is an image area in some neighborhood of A, n denotes the number of patches D in the area U, k = N×M, and P(n, k) is a correcting factor. Let us define an image block self-similarity parameter for the patch A as:
(3)
σA2
is the image local variance in A, T denotes a factor that determines a where minimal ratio σA2/Dissim(A) for which smaller self-similarity for a given patch is considered zero. Recommendations for setting the factors P and T will be given in subsection 2.2. As it follows from (3), the parameter Sim(A) has minimal and maximal values equal to 0 and 1, respectively. Its value approaches unity when noise component for a given patch diminishes. Let’s consider parameters (1) and (3) in more details. Suppose that a given image contains only noise and it has a Gaussian distribution with zero mean and variance σ2. Then it is easy to show that Diff(A,B) is a random variable with the distribution χ2(k). Taking into account central limit theorem, for a large degree of freedom (NxM>30) the distribution χ2(k) can be approximately considered Gaussian and having the mean equal to 2σ2 and the variance 8σ4/k. As it is seen, variance of the parameters Dissim(A) and Sim(A) can be decreased by using blocks with larger size (then parameter k for the distribution χ2 becomes larger). However, for blocks of larger size it is less probable to find similar image blocks, this can lead to overestimated parameters Dissim(A) for informative image fragments. Thus, below we will analyze 8x8 pixel blocks (k=64). Such block size is “convenient” due to simplicity of calculating discrete cosine transform (DCT) in blocks (necessity and usability of this will be explained below). In general, we recommend to use block size from 8x8 to 32x32 pixels (for images of larger size the block size can be larger). 2.2 Calculation of Correcting Factors P and T
Distribution of Dissim(A) for image noise component is defined by distribution of the first order statistic (minimum) for a sample of n Gaussian random variables, its mean decreases with increasing n. Fig. 1 shows a dependence of the mean of min(Diff(A,B)) on n obtained by simulations for i.i.d noise. This dependence is obtained for k=64 and the image that has fixed (constant) level homogeneous fragments corrupted by noise with σ2=100. It is seen that for n=1 the considered parameter mean equals to 2σ2 and the mean decreases with further increase of n till the level at about σ2. It is possible to approximate it by a polynomial axb+c. Taking into account that the variance of estimates Diff(A,B) is equal to 8σ4/k and is inversely proportional to k, P(n,k) can be written as:
Fig. 1. Dependence of the mean of the parameter min(Diff(A, B)) on the number of analyzed blocks n
P(n, k) = \frac{1}{1 - (4.415 - 4.332\, n^{-0.2572}) / \sqrt{k}} .   (4)

For k = 64, expression (4) simplifies to

P(n, 64) = \frac{1}{0.5415\, n^{-0.2572} + 0.4481} .   (5)
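The correcting factor of (4)-(5) can be expressed as a small helper; the square root of k in the denominator is read off from the k = 64 special case (5) and should be treated as a reconstruction.

```python
def correcting_factor(n, k=64):
    # Approximation (4); for k = 64 it reduces to expression (5).
    return 1.0 / (1.0 - (4.415 - 4.332 * n ** -0.2572) / k ** 0.5)
```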
Although expressions (4) and (5) are only approximations of P(n, k), their accuracy is high enough for the tasks considered below. However, if estimates of Dissim(A) are used for blind evaluation of the noise variance [14], we recommend using a table of P(n, k) obtained experimentally in advance, or an experimentally obtained fixed value of P(n, k) if n and k do not vary for a given task.

We would like the self-similarity of homogeneous blocks corrupted by noise to be zero. The mean of the estimates Dissim(A) is equal to twice the variance of the noise component of a given fragment. However, random values of this parameter can vary within rather wide limits depending upon k and n. If the correcting factor T in (3) is set equal to 0.5, the mean of \sigma_A^2 - Dissim(A) \cdot T equals 0 and about half of the noisy blocks will have Sim(A) larger than zero. Let us analyze what value of T keeps the number of blocks with a non-zero estimate of self-similarity small. For n = 1 and k = 64, the estimates \sigma_A^2 - Dissim(A) \cdot T are Gaussian random variables with variance \sigma^4/32 + T^2 \sigma^4/8 and mean \sigma^2 - 2\sigma^2 T. Then for T = 1, the probability P_{pos} that the considered random variable is positive equals 0.6%. Our experiments have demonstrated that with increasing n the distribution of Dissim(A) becomes non-Gaussian [15] and its variance decreases (approximately equal to \sigma^4/40 for n = 500) and then increases. For n = 100, this variance becomes approximately equal to that of \sigma_A^2, i.e., \sigma^4/32. Let us estimate P_{pos} for different T. For T = 0.5, one has a distribution with zero mean and variance \sigma^4 (1/32 + 1/128). Such a random variable will be positive in
almost 50% of cases. For T = 1, one has a distribution with mean -\sigma^2 and variance \sigma^4/16, and the probability P_{pos} is very close to zero. In turn, for T = 0.7 one has a distribution with mean -0.4\sigma^2 and variance \sigma^4 (1/32 + 1/64); then the probability P_{pos} equals 3.2%. Thus, the conclusion is that T should be larger than 0.5, where T = 1 leads, in the worst case, to positive estimates of self-similarity for about 0.6% of homogeneous blocks but provides underestimated self-similarity. Values of T from 0.5 to 1 produce quite accurate estimates of the self-similarity Sim(A) for heterogeneous (informative) image blocks but can lead to overestimated self-similarity in homogeneous blocks. In the future it may be useful to introduce some nonlinear dependence of T on \sigma_A^2 / Dissim(A). Figure 2 shows examples of maps of Sim(A) (white corresponds to Sim(A) = 1 and black to Sim(A) = 0). It is clear that noise significantly decreases the self-similarity level of an image, especially for low-contrast details.
Fig. 2. Example of maps of the Sim(A) parameter (T = 1, n = 100): a) for the test image Barbara, the mean value of Sim(A) is 0.26; b) for the noisy image (\sigma^2 = 100), the mean Sim(A) is 0.14
2.3 Accounting for Human Visual System
It is known [4] that spatially correlated noise degrades image visual quality more than white noise with the same variance (this is explained by the properties of the contrast sensitivity function (CSF) [16]). Thus, the self-similarity of heterogeneous blocks corrupted by spatially correlated noise should change more than for blocks corrupted by i.i.d. noise. This property can be taken into account by introducing the CSF into (1): the similarity between blocks is calculated in the discrete cosine transform (DCT) domain, and each element of the sum is multiplied by the corresponding CSF coefficient. Then, for 8×8 pixel blocks, expression (1) can be rewritten as:
Diff(A, B) = \frac{1}{64} \sum_{i=1}^{8} \sum_{j=1}^{8} \big( (DCT(A)_{ij} - DCT(B)_{ij}) \, Tc_{ij} \big)^2 ,   (6)
where DCT(A) and DCT(B) are the DCTs of the patches A and B, respectively, and Tc denotes the matrix of CSF-based correcting factors given in Table 1.

Table 1. The matrix of correcting factors Tc
1.6084  2.3396  2.5735  1.6084  1.0723  0.6434  0.5046  0.4219
2.1446  2.1446  1.8382  1.3545  0.9898  0.4437  0.4289  0.4679
1.8382  1.9796  1.6084  1.0723  0.6434  0.4515  0.3730  0.4596
1.8382  1.5138  1.1698  0.8874  0.5046  0.2958  0.3217  0.4151
1.4297  1.1698  0.6955  0.4596  0.3785  0.2361  0.2499  0.3342
1.0723  0.7353  0.4679  0.4021  0.3177  0.2475  0.2277  0.2797
0.5252  0.4021  0.3299  0.2958  0.2499  0.2127  0.2145  0.2548
0.3574  0.2797  0.2709  0.2626  0.2298  0.2574  0.2499  0.2600
Note that the matrix Tc is normalized in such a manner that calculation of Diff(A,B) according to (1) and (6) leads to the same result for i.i.d. noise. For spatially correlated noise the values Diff(A,B) for (6) are larger than those calculated according to (1).
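A hedged sketch of the CSF-weighted block difference (6) is given below; the use of SciPy's orthonormal 2-D DCT is an assumption of this sketch, not necessarily the transform normalization used by the authors.

```python
import numpy as np
from scipy.fft import dctn

def diff_dct(a, b, tc):
    """CSF-weighted 8x8 block difference of eq. (6).

    a, b: 8x8 patches; tc: the 8x8 matrix of correcting factors of Table 1."""
    da = dctn(a.astype(float), norm='ortho')
    db = dctn(b.astype(float), norm='ortho')
    return np.mean(((da - db) * tc) ** 2)
```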
3 Visual Quality Metric Based on the Proposed Self-Similarity Parameter

Let us analyze the importance of the proposed self-similarity parameter in image perception by means of a new HVS-based full-reference metric. Denote by I a reference image and by Id the corresponding distorted image. For both images, let us determine maps of the noise components (the parameter Dissim) and denote them DMap for the reference image and DMapd for the distorted one. The parameter Diff is calculated according to (6); the search area U is selected so that a band of width 4 pixels on all sides of the analyzed block is excluded from the search (to ensure correct parameter calculation in the case of spatially correlated noise). The analyzed block size is 8×8 pixels and n is fixed at 96. Half-overlapping of blocks is used (i.e., the mutual shift between neighboring blocks is 4 pixels). Color images are first converted into the YCbCr color space and only the intensity component Y is analyzed. The source code in Delphi and the Matlab code can be found at [17]. For the obtained maps DMap and DMapd, the new metric is defined as:

MSDDM = - \frac{1}{QW} \sum_{i=1}^{Q} \sum_{j=1}^{W} \left( DMap_{ij} - DMapd_{ij} \right)^2 ,   (7)
where Q and W denote the image dimensions. As for many other metrics, larger distortions correspond to larger magnitudes of MSDDM; for identical images, MSDDM equals zero.
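A minimal sketch of (7), assuming the two Dissim maps have already been computed on half-overlapping 8×8 blocks as described above.

```python
import numpy as np

def msddm(dmap_ref, dmap_dist):
    # Eq. (7): negative mean squared difference between the two Dissim maps
    # (dmap_ref and dmap_dist are assumed to be equally sized 2-D arrays).
    d = dmap_ref.astype(float) - dmap_dist.astype(float)
    return -np.mean(d ** 2)
```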
3.1 Analysis of MSDDM Efficiency for Specialized Databases

There are several known databases of distorted images on which the performance of HVS-metrics is verified and compared. Let us use five such databases: the largest ones, TID and LIVE, as well as IVC [18], CSIQ [19], and Toyama [20]. Values of MSDDM have been calculated for all pairs of reference and distorted images in these test image databases. Then, Spearman rank order correlation coefficients (SROCCs) [21] between the obtained arrays of MSDDM values and MOS (averaged observers' assessments) for these images have been calculated. Table 2 presents the SROCC values for the proposed metric and for the known metrics MSSIM, PSNR, SSIM [5], UQI [22], NQM [23], and PSNR-HVS-M. The most informative results and conclusions can be obtained for the database TID2008, since it is the largest one in terms of the number of distorted images and types of distortions. Because of this, the results obtained for this database are analyzed below in more detail. It is also worth mentioning that for the known metrics the results presented in Table 2 have been obtained using [24]; because of this, they can differ slightly from the results presented in some other references.

Table 2. SROCC values for different metrics and databases
          MSDDM   MSSIM   PSNR    SSIM    UQI     NQM     PSNR-HVS-M
TID       0.8049  0.8520  0.5250  0.7735  0.6000  0.6240  0.5590
LIVE      0.8794  0.9445  0.8754  0.9106  0.8916  0.9076  0.9339
IVC       0.8403  0.8798  0.6892  0.7797  0.8217  0.8381  0.7662
CSIQ      0.8126  0.9065  0.8053  0.8752  0.8012  0.7315  0.8178
Toyama    0.8598  0.8920  0.6099  0.7651  0.7149  0.6099  0.8674
Average   0.8394  0.8952  0.7010  0.8208  0.7659  0.7422  0.7889
As follows from the analysis of the data in Table 2, the proposed metric, which takes into account only the self-similarity of the reference and distorted images and the CSF, provides rather high correlation with human perception. For the database TID2008, only the metric MSSIM provides a larger SROCC than MSDDM. This speaks in favor of the great importance of self-similarity in image perception by humans. It is worth noting that the results for MSDDM are only slightly better than those for PSNR on the databases LIVE and CSIQ. This is explained by the fact that these databases contain many images distorted by JPEG/JPEG2000 compression, and it is exactly to these types of distortions that the proposed metric is not very sensitive (see details in subsection 3.2). Figures 3, 4, and 5 present scatter-plots of MSDDM versus MOS for the three largest databases: TID (1700 distorted images), CSIQ (866 images), and LIVE (779 images).
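The SROCC values of Table 2 can be reproduced, given arrays of metric values and MOS, with a small helper such as the following (SciPy is an assumed dependency).

```python
from scipy.stats import spearmanr

def srocc(metric_values, mos):
    """Spearman rank-order correlation between metric values and MOS."""
    rho, _ = spearmanr(metric_values, mos)
    return rho
```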
For convenience of analysis, the values of MSDDM are represented on a logarithmic scale as 10 log_{10}(-255^2 / MSDDM). Recall that for a good metric the corresponding scatter-plot should be a compact cloud (without outliers) with a clear monotonic dependence (increase or decrease) on the considered metric. Also note that MOS is calculated in a different way for each of the considered databases; for TID2008, a larger MOS (which can reach 8) corresponds to better visual quality.
Fig. 3. Scatter-plot of MSDDM wrt MOS for TID
Fig. 4. Scatter-plot of MSDDM wrt MOS for the CSIQ database
As can be seen from this analysis, MSDDM correlates well with human perception for most distorted images. However, the database TID contains some distorted images for which MSDDM does not perform well (there is a group of images for which 10 log_{10}(-255^2 / MSDDM) lies between 25 and 40). To understand the reasons, let us carry out a more thorough analysis of the proposed metric for TID.
Fig. 5. Scatter-plot of MSDDM wrt MOS for the LIVE database
3.2 Analysis of MSDDM Efficiency for Different Types of Distortions in TID
For this purpose, let us obtain and analyze the dependence of MOS on MSDDM for white Gaussian noise (the first type of distortion in TID2008), considering this curve as a reference ("etalon") curve (EC). If, for a given type of distortion, the corresponding curve goes entirely below the EC, this means that MSDDM overestimates image visual quality for that type of distortion. If a curve goes above the EC, the metric MSDDM underestimates visual quality for that type of distortion.
Fig. 6. Dependence of MOS on -MSDDM for different types of noise (white Gaussian, masked, spatially correlated, high-frequency, and quantization noise)
Fig. 6 presents the curves for images in TID corrupted by different types of noise. As follows from the analysis of the curves, MSDDM adequately estimates distortions due to
high-frequency noise and overestimates visual quality in the cases of spatially correlated and masked noise. Assessments for quantization noise, which does not distort self-similarity much, are considerably overestimated.
Fig. 7. Dependences of MOS on -MSDDM for different types of smearing (white Gaussian noise, Gaussian blur, image denoising, JPEG compression, JPEG2000 compression)
Fig. 7 presents the plots for several types of distortions that lead to image smearing in one sense or another. It is seen that Gaussian blur and distortions due to image denoising are estimated adequately. However, MSDDM overestimates visual quality for images distorted by lossy JPEG and JPEG2000 compression when these distortions are large (observed for large compression ratios). Future research is needed to understand the reasons for these effects.
Fig. 8. Dependences of MOS on -MSDDM for "exotic" types of distortions (white Gaussian noise, non-eccentricity pattern noise, local block-wise distortions, mean shift, contrast change)
Fig. 8 presents the plots for rarely met ("exotic") types of distortions such as mean shift, contrast change, local block-wise distortions, and non-eccentricity pattern noise. It is seen that MSDDM reacts well to mean shift distortions and to hardly noticeable textural distortions (non-eccentricity pattern noise). Meanwhile, assessments for block-wise distortions are overestimated. Quality assessments for contrast change distortions are considerably underestimated and the dependence of MOS on MSDDM is nonlinear. This is due to the fact that humans prefer increased contrast to distortions that compress the dynamic range (reduce contrast); MSDDM is unable to take this peculiarity into account.
4 Conclusions

The paper presents a new self-similarity parameter for images that can be used in image visual quality assessment. We also expect that it can be useful for other image processing applications such as blind estimation of noise variance. The properties of the parameter are analyzed, and it is demonstrated that a visual quality metric based on the self-similarity parameter corresponds well with human perception. The metric shows a very high level of correlation with mean opinion scores, second only to the MSSIM metric. At the same time, our metric considers only self-similarity, while the other compared metrics take many different factors into account. This shows that it is worth taking self-similarity into account when designing adequate models of the human visual system.
References 1. Keelan, B.W.: Handbook of Image Quality. Marcel Dekker, Inc., New York (2002) 2. Ponomarenko, N., Krivenko, S., Lukin, V., Egiazarian, K.: Lossy Compression of Noisy Images Based on Visual Quality: A Comprehensive Study. EURASIP Journal on Advances in Signal Processing, 13 (2010), doi:10.1155/2010/976436 3. Carli, M.: Perceptual Aspects in Data Hiding. Thesis for the degree of Doctor of Technology, Tampere University of Technology (2008) 4. Fevralev, D., Lukin, V., Ponomarenko, N., Abramov, S., Egiazarian, K., Astola, J.: Efficiency analysis of DCT-based filters for color image database. In: Proceedings of SPIE Conference Image Processing: Algorithms and Systems VII, San Francisco, vol. 7870 (2011) 5. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004) 6. Ponomarenko, N., Lukin, V., Zelensky, A., Egiazarian, K., Carli, M., Battisti, F.: TID2008 - A Database for Evaluation of Full-Reference Visual Quality Assessment Metrics. In: Advances of Modern Radioelectronics, vol. 10, pp. 30–45 (2009) 7. Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A Statistical Evaluation of Recent Full Reference Image Quality Assessment Algorithms. IEEE Transactions on Image Processing 15(11), 3441–3452 (2006) 8. Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multi-scale structural similarity for image quality assessment. In: IEEE Asilomar Conference on Signals, Systems and Computers, pp. 1398–1402 (2003)
9. Watson, A.B., Solomon, J.A.: Model of visual contrast gain control and pattern masking. J. Opt. Soc. Am. A 14(9), 2379–2391 (1997) 10. Sheikh, H.R., Bovik, A.C., Cormack, L.K.: No-reference quality assessment using natural scene statistics: JPEG2000. IEEE Transactions on Image Processing 14(11), 1918–1927 (2005) 11. Ponomarenko, N., Eremeev, O., Egiazarian, K., Lukin, V.: Statistical evaluation of noreference image visual quality metrics. In: Proceedings of EUVIP, Paris, p. 5 (2010) 12. Ponomarenko, N., Silvestri, F., Egiazarian, K., Carli, M., Astola, J., Lukin, V.: On between-coefficient contrast masking of DCT basis functions. In: Proc. of the Third International Workshop on Video Processing and Quality Metrics, USA, p. 4 (2007) 13. Wallace, G.K.: The JPEG Still Picture Compression Standard. Comm. of the ACM 34, 30–44 (1991) 14. Ponomarenko, N.N., Lukin, V.V., Egiazarian, K.O., Astola, J.T.: A method for blind estimation of spatially correlated noise characteristics. In: Proceedings of SPIE Conference Image Processing: Algorithms and Systems VII, San Jose, p. 12 (2010) 15. Arnold, B.C., Balakrishnan, N., Nagaraja, H.N.: A First Course in Order Statistics. A Wiley-Interscience Publication, NY (1992) 16. Mannos, J.L., Sakrison, D.J.: The Effects of a Visual Fidelity Criterion on the Encoding of Images. IEEE Transactions on Information Theory 20(4), 525–535 (1974) 17. MSDDM metric page, http://ponomarenko.info/msddm.htm 18. Ninassi, A., Le Callet, P., Autrusseau, F.: Pseudo No Reference image quality metric using perceptual data hiding. In: SPIE Human Vision and Electronic Imaging, San Jose, vol. 6057-08 (2006) 19. Larson, E.C., Chandler, D.M.: Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging 19(1), 011006 (2010) 20. Horita, Y., Kawayoke, Y., Parvez Sazzad, Z.M.: Image quality evaluation database (2011), http://160.26.142.130/toyama_database.zip 21. Kendall, M.G.: The advanced theory of statistics, vol. 1, p. 457. Charles Griffin & Company limited, London (1945) 22. Wang, Z., Bovik, A.: A universal image quality index. IEEE Signal Processing Letters 9, 81–84 (2002) 23. Damera-Venkata, N., Kite, T., Geisler, W., Evans, B., Bovik, A.: Image Quality Assessment Based on a Degradation Model. IEEE Transactions on Image Processing 9, 636–650 (2000) 24. Murthy, A.V., Karam, L.J.: A MATLAB Based Framework For Image and Video Quality Evaluation. In: International Workshop on Quality of Multimedia Experience (QoMEX), pp. 242–247 (2010)
An Intelligent Video Security System Using Object Tracking and Shape Recognition

Sang Hwa Lee1, Siddharth Sharma1, Linlin Sang2, Jong-Il Park2, and Yong Gyu Park3

1 Dept. of Electrical Eng., Seoul National University, Seoul, S. Korea
2 Dept. of Computer Eng., Hanyang University, Seoul, S. Korea
3 Laice Electronics Co. Ltd., Seoul, S. Korea
Abstract. This paper deals with an intelligent video surveillance system using object tracking and recognition techniques. The proposed system integrates object extraction, human recognition, face detection, object tracking, and camera control. First, the object in the video frame is extracted using background subtraction. Then, the object region is examined to determine whether it is human or not. For this recognition, the region-based shape descriptor, the angular radial transform, is used to model human shapes. When the object is determined to be human, face detection is optionally performed to capture clear face images. Finally, the face or object region is tracked in the video frames, and the pan/tilt/zoom (PTZ) controllable camera also tracks the moving object. The tracking filter updates the histogram information of the object region at every frame so that the moving object is tracked well even when the pose and size of the object vary. Since the PTZ parameters can be transformed into camera parameters such as rotation angles and focal length, we estimate the 3-D locations of the moving object with multiple PTZ cameras. This paper constructs a test system with multiple PTZ cameras and their communication protocol. According to the experiments, the proposed system is able to track a moving person automatically, not only in the image domain but also in real 3-D space. The proposed system improves surveillance efficiency using ordinary PTZ cameras. Keywords: Visual surveillance, PTZ camera, object extraction, object tracking, shape recognition.
1 Introduction

Video surveillance and security systems have been studied for a long time and have become a vital part of our everyday lives. Surveillance and security systems usually consist of closed circuit TV (CCTV) cameras and some sensors to detect or investigate objects [1], [2]. Multiple cameras are required to reduce blind spots and occluded regions [3], [6]. The video of every CCTV camera is continuously recorded and displayed on the central monitoring system. Human supervisors inspect the video and sensing signals to determine whether the situation they indicate is unusual and suspicious. In this classical concept of security
systems, the efficiency of the security system depends on the human supervisors. Since there are many cameras and displays, the supervisors sometimes miss some video scenes. Moreover, there are many sensing errors, which cause unnecessary alarms and decrease system reliability.

Recently, intelligent surveillance and security systems have been developed using pattern recognition and computer vision techniques such as face detection and recognition, fingerprint and iris recognition, object tracking, gesture recognition, and so on [1], [7], [8]. Some of them are now commercialized and popular, but most are used for access control systems. Beyond entrance systems, intelligent algorithms have not been widely introduced into commercial security systems, since the pattern recognition algorithms are not perfect in many cases and the systems are usually specific to limited situations. Consequently, intelligent algorithms need more time to be adopted by the real surveillance and security industries.

This paper deals with an intelligent visual surveillance system using object tracking and recognition. The goal of the proposed system is that CCTV cameras recognize and track a human object automatically without any supervised inspection. We develop several computer vision techniques and combine them with an ordinary CCTV camera network. The rest of the paper is organized as follows. The overall functions of the proposed system are first summarized in Section 2. Then, the computer vision techniques used in the proposed system are explained: object extraction by background subtraction is described in Section 3, and human object recognition is explained in Section 4. Object tracking in the 2-D image domain and the 3-D spatial domain is described in Section 5. The system integration and experimental results are shown in Section 6, and the paper is concluded in Section 7.
2 Overall Structure
Fig. 1 shows the overall structure of the proposed system. First, the object region is extracted using background subtraction techniques, and the object region is expressed as a binary image. Then, the object is examined to determine whether it is human or not. There are many erroneous alarms in video security systems because of illumination changes, wind, animals, and so on. These errors increase the cost and decrease the reliability of security systems. Thus, it is important to make sure that the detected object is human before investigation processes or alarms are triggered. This paper exploits the angular radial transform (ART) of MPEG-7 to recognize human shapes from the extracted object region. When the object is determined to be human, the proposed system tries to detect the human face in the object region. Since the human face is crucial information in surveillance systems, if a face is detected, the face region is zoomed up to a desirable resolution using the estimated size of the face. The color histogram information of the face or object region is updated for each frame to track the moving object. There are two modes to track the moving person in this paper. One is to track the person's face using face detection and a histogram model of skin color.
Fig. 1. The structure of the proposed system. The left blocks show the functions, and the right side shows the related techniques used to implement them.
The other uses the histogram model of the object region, since faces are frequently not detected due to occlusion by masks or caps. The proposed system tries to find the face periodically while tracking the object. The histogram model and motion information of the object region are updated at every frame while tracking the object in the video. This makes it possible to track the moving object continuously even though the pose and size of the object vary. Finally, the object or face region is tracked by the pan/tilt/zoom (PTZ) controlled camera. The motion information of the moving object is converted into calibrated PTZ parameters, which drive the actual panning, tilting, and zooming. The parameters are packetized and transferred to the PTZ control module embedded in the local CCTV camera through a CCTV communication protocol. Using PTZ camera tracking, the proposed system keeps watching the moving object with fewer cameras. When the moving person disappears into an occluded area that the camera cannot see, object tracking is handed over to an appropriate neighboring camera; the topological relation among the cameras is set in advance. Furthermore, the 3-D positions of the object are estimated using the PTZ parameters, since they can be transformed into the rotation angles and focal length of the PTZ camera. The proposed system thus tracks moving persons not only in the image domain but also in real 3-D space.
3 Object Extraction

3.1 Background Subtraction
The object in the video scene is extracted by the background subtraction (BS) technique. The BS technique is popular and works well for extracting foreground objects in a static scene [4], [5]. CCTV cameras are usually installed facing a static background, so the background images can be modeled in advance. The pixel colors of the background are modeled from multiple static images without any objects. Using multiple images reduces the noise and makes it possible to model the background image statistically. For each pixel i, the means and the variances of the color components (RGB) are calculated from the multiple images. To extract object regions, we need distance models between the background image and the observed one. We first define a color-based distance,

D_C(i) = \min_{\lambda_i} \| I(i) - \lambda_i M_B(i) \|^2 ,   (1)
where I is the color in the observed image and M_B is the mean vector of the chromatic components in the background image. In (1), we introduce a parameter \lambda_i, which adjusts for the different brightness between the two images and reduces the shaded region in the extracted object. The difference of brightness between images is a main source of error in background subtraction, so the parameter \lambda_i compensates for the brightness change by minimizing (1). Next, we consider the different distributions of the chromatic components when calculating (1). The means and variances of the chromatic components are generally different, so we normalize the distributions of the chromatic components by their respective variances. Thus, the distance in (1) is formulated as

D_C(i) = \left( \frac{I_R(i) - \lambda_i m_R(i)}{\sigma_R(i)} \right)^2 + \left( \frac{I_G(i) - \lambda_i m_G(i)}{\sigma_G(i)} \right)^2 + \left( \frac{I_B(i) - \lambda_i m_B(i)}{\sigma_B(i)} \right)^2 ,   (2)

where m_K and \sigma_K (K = R, G, B) are the mean and standard deviation of the chromatic components, respectively. By differentiating (2) with respect to \lambda_i, the parameter \lambda_i is derived as

\lambda_i = \frac{ \dfrac{I_R(i) m_R(i)}{\sigma_R(i)^2} + \dfrac{I_G(i) m_G(i)}{\sigma_G(i)^2} + \dfrac{I_B(i) m_B(i)}{\sigma_B(i)^2} }{ \left( \dfrac{m_R(i)}{\sigma_R(i)} \right)^2 + \left( \dfrac{m_G(i)}{\sigma_G(i)} \right)^2 + \left( \dfrac{m_B(i)}{\sigma_B(i)} \right)^2 } .   (3)
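An illustrative NumPy version of the brightness-compensated color distance (1)-(3); array shapes and names are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def color_distance(obs, mean_bg, std_bg):
    """Per-pixel brightness-compensated color distance of eqs. (1)-(3).

    obs, mean_bg, std_bg: H x W x 3 arrays (observed frame, per-pixel
    background means and standard deviations of the RGB components)."""
    obs = obs.astype(float)
    lam = (obs * mean_bg / std_bg ** 2).sum(axis=2) / \
          ((mean_bg / std_bg) ** 2).sum(axis=2)                       # Eq. (3)
    dc = (((obs - lam[..., None] * mean_bg) / std_bg) ** 2).sum(axis=2)  # Eq. (2)
    return dc, lam
```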
We define another distance based on the image gradient to reduce the errors caused by shadows. A shaded region is usually darker in intensity but has an intensity gradient similar to that of the background model image. We exploit the gradient of the gray image to prevent shaded regions from being extracted as part of the object. The gradient of the gray image I(i) is calculated in the x and y directions, respectively:

g_x = \frac{\partial I(i)}{\partial x} , \quad g_y = \frac{\partial I(i)}{\partial y} ,   (4)
475
(5)
where f(i) is the gray image of the background model. Finally, the distance used to extract object pixels is defined as

D(i) = \gamma D_C(i) + (1 - \gamma) D_G(i) ,   (6)
where \gamma is a weighting parameter between the two distances. When the distance (6) is greater than a threshold, the pixel is a candidate object pixel. After determining the candidate pixels of the object region, we apply morphological filters (dilation and erosion) to eliminate erroneous pixels. The morphological filters are useful for eliminating small holes in the object and small numbers of object pixels in the background.

3.2 Boundary Smoothing
For robust recognition, we should normalize the image format and reduce the noise components in the object images. We first apply the morphological filters to the binary object images. Dilation removes small erroneous regions, and erosion fills holes in the object. This processing yields a complete object region. However, the boundary of the object also plays an important role in shape recognition. We apply a boundary smoothing algorithm based on the chain code [9] to reduce boundary noise. The boundary of an object can be traced and represented by its chain code. Suppose a boundary P consists of a sequence of N points. Then we have P = {p_0, p_1, ..., p_i, ..., p_{N-1}}, and its chain code is represented by c_i = \overrightarrow{p_i p_{i+1}}, where c_i is the 8-direction Freeman chain code of the i-th point. The smoothing rules are summarized as follows. Suppose p_{i-1}, p_i, and p_{i+1} are three consecutive points.

– 1) If p_i is a corner point and the following conditions are satisfied, then p_i is deleted: (1) c_{i-1} ≠ 0 and c_i ≠ 0, (2) |c_i − c_{i-1}| = 2 or 6.
– 2) If p_i is a spurious point and the angle between the segments \overrightarrow{p_i p_{i+1}} and \overrightarrow{p_{i-1} p_i} is 45° (this can be determined from the chain codes c_{i-1} and c_i), then p_i is deleted.
– 3) If |c_i − c_{i-1}| = 4, and p_{i-1} and p_{i+1} correspond to the same point in the image, then p_i is deleted.
– 4) If c_i = 1 and the angle between the segments \overrightarrow{p_i p_{i+1}} and \overrightarrow{p_{i-1} p_i} is 90° (this can also be determined from the chain codes c_{i-1} and c_i), then p_i is replaced by the neighboring point that lies on the line p_{i-1} p_{i+1}.
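A minimal sketch of the Freeman chain code used by the smoothing rules above; the direction numbering below (code 0 pointing in the +x direction, counter-clockwise) is an assumed convention.

```python
# 8-direction Freeman codes: index -> (dx, dy)
FREEMAN = [(1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1)]

def chain_code(boundary):
    """Freeman chain code of an ordered list of 8-connected boundary points (x, y);
    raises ValueError if consecutive points are not neighbors."""
    codes = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:]):
        codes.append(FREEMAN.index((x1 - x0, y1 - y0)))
    return codes
```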
4 Human Recognition

4.1 Overview of ART
ART is a 2-D complex transform defined on the unit disk whose basis consists of complete orthonormal sinusoidal functions in polar coordinates [10], [11]. The ART transform is defined as
F_{nm} = \int_{0}^{2\pi} \int_{0}^{1} V_{nm}(\rho, \theta) \, f(\rho, \theta) \, \rho \, d\rho \, d\theta ,   (7)
where F_{nm} is an ART coefficient of order n and m, f(\rho, \theta) is an image function in polar coordinates, and V_{nm}(\rho, \theta) is the ART basis function, which is separable along the angular and radial directions:

V_{nm}(\rho, \theta) = A_m(\theta) R_n(\rho) .   (8)
The angular and radial basis functions are defined as follows:

A_m(\theta) = \frac{1}{2\pi} \exp(jm\theta) ,   (9)

R_n(\rho) = \begin{cases} 1, & n = 0, \\ 2\cos(\pi n \rho), & n \neq 0. \end{cases}   (10)
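The ART descriptor of (7)-(10) can be approximated by direct summation over the pixels of a normalized binary patch, as in the following sketch (no symmetry-based speed-up; the use of the complex conjugate of the basis and the omission of the constant pixel-area factor are assumptions of this sketch).

```python
import numpy as np

def art_descriptor(patch, n_radial=3, n_angular=12):
    """Magnitudes |F_nm| of eq. (7) for a 2-D patch mapped onto the unit disk."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x = (xs - (w - 1) / 2.0) / (w / 2.0)
    y = (ys - (h - 1) / 2.0) / (h / 2.0)
    rho = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    f = patch.astype(float) * (rho <= 1.0)      # keep only pixels inside the disk
    feats = []
    for n in range(n_radial):
        rn = np.ones_like(rho) if n == 0 else 2.0 * np.cos(np.pi * n * rho)
        for m in range(n_angular):
            basis = rn * np.exp(1j * m * theta) / (2.0 * np.pi)   # eqs. (8)-(10)
            feats.append(abs((np.conj(basis) * f).sum()))
    return np.asarray(feats)  # 36-D for 3 radial x 12 angular functions
```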
To describe a shape, all pixels constituting the shape are transformed with ART, and the transformed coefficients form the ART descriptor. Usually, 12 angular and 3 radial functions are used; thus, the magnitudes of the complex ART coefficients {F_{nm}} become a 36-D vector describing the shape of the region. A limitation of the original ART is that the transformation involves many sinusoidal computations, so computing the ART is time consuming. In order to save computational cost, we applied a fast ART computation method. It has been proved that the basis functions of ART are symmetric/antisymmetric with respect to the x and y axes and the origin, so it is possible to reproduce the complete basis functions of ART by computing only the first quadrant of each basis function.

4.2 Construction of Database
It is important to build a database of various human shapes to model the human recognition system. For the purposes of surveillance systems, we chose human body shapes corresponding to normal behaviors such as standing, walking, jogging, running, and so on. These shapes are closely related to the visual surveillance situation. We collected binary images of human shapes from illustrations, from objects extracted by background subtraction, and from the MPEG data set. Each sample image includes only the object part, as tightly cropped as possible. Finally, all the sample images are binarized and normalized to 32x32 pixels before applying ART.

4.3 Modeling of ART Vectors
The similarity of two shapes is determined by the Euclidean distances between the representative ART vectors and the ART vectors of test images. The representative ART vectors are modeled from the database set using mean shift clustering. First, we classify the ART vectors of the human shapes in the database using mean shift clustering. Mean shift clustering is a general non-parametric mode search procedure, and it has shown good results in data clustering [14].
Fig. 2. Classification result of human shapes
Since the variation of human shapes is very wide, it is not desirable to model the distribution of human shapes using a single mode. Mean shift clustering enables us to model the 36-D ART vector space of human shapes more accurately. The mode vector of each cluster is the representative ART vector describing the center of the cluster. Fig. 2 shows the classification results of human shapes using the mean shift algorithm. Each specific pose is well classified by the mean shift clustering, and the proposed method reflects the various distributions of human shapes through the clustering. Next, we design the classifier using the representative ART vectors of human shapes and the Euclidean vector distance. For each cluster of 36-D ART vectors of human shapes, we calculate the mean and variance of the Euclidean distances between the representative vector and every ART vector in the cluster,

m_d = E[ \| M_{ART} - DB_{ART} \| ] , \quad \sigma_d^2 = V[ \| M_{ART} - DB_{ART} \| ] ,   (11)
where M_{ART} and DB_{ART} are the representative ART vector of a cluster and the ART vectors of the human shapes in the same cluster, respectively. We then define a threshold for each cluster to determine the human shapes as

Th_c = m_d^c + \alpha_c \sigma_d^c ,   (12)
where m_d^c and \sigma_d^c are the mean and standard deviation of the distances between the representative ART vector and the ART vectors of the human shapes in the cluster c. The parameter \alpha_c is a weighting factor that adapts to the variation of the ART vectors. When the shapes of the human bodies in the cluster are widely distributed, \alpha_c and the threshold are increased, since the distance between the representative vector and each ART vector becomes large. In this case, the false-positive recognition errors may increase because of the large threshold. On the other hand, \alpha_c and the threshold are decreased when the shapes of the human bodies
Fig. 3. Recognition results of human shapes. (a) Recognized human shapes, (b) false-negative errors, (c) false-positive errors.
in the cluster are locally distributed. In this case, the false-negative recognition errors may increase. The parameter thus adjusts for the variation of human shapes according to the characteristics of the cluster. We set the threshold for each cluster such that large variations of human shapes in the database are not recognized. Finally, the recognition process is performed as follows. The ART is applied to the binary object image whose boundary has been smoothed. Then, the distances between the 36-D ART vector of the object image and all the representative ART vectors of the clusters are calculated. If the minimum distance is within the threshold of a cluster, the object is considered human. However, if the minimum distance is not within any cluster's threshold, the object is considered non-human. This distance-based decision method differs from the usual mean shift classification, in which one finds a mode starting from the input ART vector and determines the input object to be human if the estimated mode converges to one of the representative ART vectors of human shapes. However, since we have no distribution of non-human shapes in the database, any ART vector (human or non-human) may converge to the representative ART vectors. Thus, we model a distance-based classifier for multiple clusters. When we need to reduce false-negative errors, we simply increase the parameter \alpha_c; on the contrary, we decrease \alpha_c when we need to reduce false-positive errors. In a visual security system, decreasing
false-negative errors is more desirable than decreasing false-positive errors, since it is important not to miss human intruders. Fig. 3 shows some recognition results for human and non-human objects. The non-human images are from the MPEG data set, and the human objects were extracted by background subtraction. Note that the false-negative errors are usually caused by incorrect object extraction results. Thus, we need to improve the object extraction algorithm further to reduce recognition errors.
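A compact sketch of the distance-based decision of (11)-(12); the data structures holding the per-cluster statistics are assumptions of this sketch.

```python
import numpy as np

def is_human(art_vec, cluster_centers, cluster_means, cluster_stds, alphas):
    """Distance-based decision of eqs. (11)-(12).

    cluster_centers: representative ART vectors (one per mean-shift cluster);
    cluster_means/cluster_stds: per-cluster statistics of eq. (11);
    alphas: per-cluster weighting factors."""
    dists = np.linalg.norm(cluster_centers - art_vec, axis=1)
    c = int(np.argmin(dists))
    threshold = cluster_means[c] + alphas[c] * cluster_stds[c]  # Eq. (12)
    return dists[c] <= threshold
```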
4.4 Face Detection
Since faces are crucial information in video surveillance systems, it is important for the system to obtain clear face images. When the extracted object is determined to be human, the proposed system tries to find the face in the video frame and zooms in on the detected face to a desirable resolution. If no face is detected in the human object region, the object region itself is zoomed in on and modeled for tracking. Face detection is a classical pattern recognition problem in computer vision; the face detection algorithm in this paper is implemented based on 2-D Haar patterns and the AdaBoost method [12], [13].
5 Human Tracking
When a person is detected, the face or object region is modeled for every video frame. Since the pose and size of the face or object region vary, we need to update the image information to track the moving person continuously. We model the image information of the object region as the histogram of the hue component in the HSV color space, and we exploit a mean shift based tracking filter to track the moving object [14]. The hue histogram of the object region is treated as a probability distribution, and the color of each pixel is converted to a probability value by the histogram. Then, the mean (x_m, y_m) and variance of the pixel locations are calculated using these probabilities,

x_m = \sum x \, p(I_H(x, y)) , \quad y_m = \sum y \, p(I_H(x, y)) ,   (13)

where I_H(x, y) is the hue component at pixel (x, y) and p(\cdot) is the histogram acting as the probability function. This process is performed for various window sizes until the mean location converges. The mean location is the center of the object/face region, and the variance determines the size of the object region. Once the object region is tracked, the hue histogram is updated for the next tracking step. Whenever the object is tracked, the motion of the object region is converted into the parameters that control the amount of pan/tilt/zoom of the CCTV camera. We calibrate the parameters and motion information from empirical observation. The parameters are packetized into a protocol [15] and transferred to the corresponding CCTV camera. The moving object is thus tracked not only in the image domain (mean shift tracking) but also in 3-D space (PTZ camera tracking). We would like to mention that the object's location and size should
Fig. 4. Examples of object region based tracking
be adjusted according to the PTZ parameters. When the PTZ camera moves to track the object, the location and size of the object in the image domain also change. Thus, we update the position and size of the object according to the PTZ parameters. For example, if the camera zooms in, the object size should become larger, and the search window and location are shifted by the amount of the panning and tilting angles. The proposed tracking system takes this PTZ movement into account while tracking the object. Finally, if the object moves to where the camera cannot track it, a neighboring camera is selected to take over the object tracking. For this hand-over, the topological relation of the cameras is calibrated in advance. The neighboring camera also shares the object information (size, location, hue histogram) through the calibrated camera topology, so the object extraction and recognition processes can be skipped.
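One update step of the histogram-based tracking of (13) can be sketched as follows; the hue image is assumed to be already quantized to the bin indices of the histogram model.

```python
import numpy as np

def track_step(hue, hist, window):
    """One mean-shift-style update of eq. (13) inside a search window.

    hue: 2-D integer hue-bin image; hist: 1-D hue histogram of the tracked
    region used as a probability model; window: (top, left, height, width)."""
    top, left, h, w = window
    patch = hue[top:top + h, left:left + w]
    prob = hist[patch].astype(float)              # back-projection p(I_H(x, y))
    prob = prob / (prob.sum() + 1e-12)
    ys, xs = np.mgrid[top:top + h, left:left + w]
    xm, ym = (xs * prob).sum(), (ys * prob).sum()  # weighted mean location
    spread = (((xs - xm) ** 2 + (ys - ym) ** 2) * prob).sum()  # region size cue
    return (xm, ym), spread
```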
6 Experimental Results
We used a personal computer (PC) to run the proposed algorithms and ordinary analog CCTV cameras for tracking. The PTZ control signals are generated through the PC's serial port (RS-232) and converted into a commercial CCTV protocol; the signals are then transferred to the CCTV camera through an RS-485 port. We trained the classifier to recognize human shapes using the database of more than 2,700 human shapes shown in Fig. 2. The proposed algorithm recognizes various human shapes in real images. We set the thresholds somewhat large to reduce false-negative errors. The recognition rate is 89.3% (2664/2982) for human objects and 97.5% (2602/2667) for non-human objects, thus 93.2%
Fig. 5. Examples of face region based tracking
(5266/5649) in total. Figs. 4 and 5 show the results of human object tracking. In each image in Figs. 4 and 5, the background regions vary because the camera moves with respect to the moving object. The object region is updated by the local histogram at every frame, so the camera can track the moving object despite changes in object pose, size, and color distribution. As can be seen, the person is tracked well by the PTZ controlled camera. Note that the background subtraction process is not used during object tracking. Fig. 5 shows the results of face tracking. The face detection algorithm is performed on the first frame, and the histogram model of the face region is also updated. When the face region is too small, it is automatically zoomed in on by PTZ control to obtain clear face images; the zoomed faces are shown in Fig. 5. According to the various experiments, the CCTV camera is able to track the moving object automatically by controlling the PTZ operation. The proposed system improves the efficiency of visual surveillance by reducing malfunctions and watching a much wider area with a small number of cameras.
7 Conclusion
This paper has proposed an intelligent video security system using computer vision techniques. The proposed system enables common CCTV cameras to track a moving object by integrating object extraction, human object recognition, face detection, object tracking, and PTZ camera control. The experimental results show that the proposed system tracks and watches the moving human object effectively. Since each camera covers a much wider area through PTZ movement, the proposed system improves surveillance efficiency with a small number of
cameras. False alarms are also reduced by the human recognition. It is expected that the proposed system will be very useful for video surveillance and security systems in indoor environments and local outdoor areas. Acknowledgment. This work was supported by the 2009 R&BD program of the Korea Institute for Advancement of Technology (KIAT), the Ministry of Knowledge and Economy of Korea.
References 1. Valera, M., Velastin, S.: Intelligent distributed surveillance systems: A review. IEEE Proc. Vis. Image, Signal Process. 152(2), 192–204 (2005) 2. Hampapur, A., et al.: Smart video surveillance: Exploring the concept of multiscale spatiotemporal tracking. IEEE Sig. Proc. Mag. 22(2), 38–51 (2005) 3. Micheloni, C., Rinner, B., Foresti, G.L.: Video analysis in pan-tilt-zoom camera networks. IEEE Sig. Proc. Mag, 78–90 (September 2010) 4. Piccardi, M.: Background subtraction techniques: a review. In: IEEE International Conf. on Systems, Man and Cybernetics, pp. 3099–3104 (2004) 5. Seki, M., Wada, T., Fujiwara, H., Sumi, K.: Background subtraction based on cooccurrence of image variations. In: Proc. of CVPR 2003, vol. 2, pp. 65–72 (2003) 6. Qureshi, F.Z., Terzopoulos, D.: Planning ahead for PTZ camera assignment and handoff. In: Proc. ACM/IEEE Conf. Distributed Smart Cameras, pp. 1–8 (2009) 7. Rinner, B., Wolf, W.: Introduction to distributed smart cameras. Proc. IEEE 96(10), 1565–1575 (2008) 8. Soto, C., Song, B., Roy-Chowdhury, A.: Distributed multi-target tracking in a self-configuring camera network. In: Proc. IEEE CVPR, pp. 1486–1493 (2009) 9. Yu, D., Hu, J., Yan, H.: A multiple point boundary smoothing algorithm. Pattern Recognition Letters, 657–668 (June 1998) 10. Jeannin, S.: MPEG-7 visual part of experimentation model version 9.0, in ISO/IEC JTC1/SC29/WG11/N3914 (January 2001) 11. Coeurjolly, D., Ricard, J., Baskurt, A.: Generalizations of angular radial transform for 2D and 3D shape retrieval. Pattern Recognition Letters 26(14), 2174–2186 (2005) 12. Yang, M.-H., Kriegman, D.J., Ahuja, N.: Detecting faces in images: A survey. IEEE Trans. on PAMI 24(1), 34–58 (2002) 13. Hsu, R.-L., Mohamed, A.-M., Jain, A.K.: Face detection in color images. IEEE Trans. on PAMI 24, 696–706 (2002) 14. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. on PAMI 24(5), 603–619 (2002) 15. PELCO-D Protocol manual
3D Facial Expression Recognition Based on Histograms of Surface Differential Quantities

Huibin Li1,2, Jean-Marie Morvan1,3,4, and Liming Chen1,2

1 Université de Lyon, CNRS
2 Ecole Centrale de Lyon, LIRIS UMR5205, F-69134, Lyon, France
3 Université Lyon 1, Institut Camille Jordan, 43 blvd du 11 Novembre 1918, F-69622 Villeurbanne Cedex, France
4 King Abdullah University of Science and Technology, GMSV Research Center, Bldg 1, Thuwal 23955-6900, Saudi Arabia
{huibin.li,liming.chen}@ec-lyon.fr, [email protected]
Abstract. 3D face models accurately capture facial surfaces, making it possible to describe facial activities precisely. In this paper, we present a novel mesh-based method for 3D facial expression recognition using two local shape descriptors. To characterize the shape information of the local neighborhood of facial landmarks, we calculate the weighted statistical distributions of surface differential quantities, including the histogram of mesh gradient (HoG) and the histogram of shape index (HoS). A normal cycle theory based curvature estimation method is employed on 3D face models, along with the common cubic fitting curvature estimation method for the purpose of comparison. Based on the basic fact that different expressions involve different local shape deformations, the SVM classifier with both linear and RBF kernels outperforms the state-of-the-art results on the subset of the BU-3DFE database with the same experimental setting. Keywords: 3D facial expression recognition, normal cycle theory, curvature tensor, histogram of surface differential quantities, SVM classifier.
1 Introduction

Facial expression recognition (FER) has been an intensively studied subject for the last three decades because of its usefulness in many applications such as human-computer interaction and the analysis of conversation structure [1]. Acknowledging that facial expressions are actuated by contraction of facial muscles, Ekman et al. [2] introduced the Facial Action Coding System (FACS) and evidenced six universal prototypic facial expressions, namely happiness, sadness, anger, fear, surprise and disgust, along with neutral. There currently exists an impressive body of results on FER, first for static 2D images, then for dynamic 2D videos [3], and more recently for static 3D scans [4,5,6,7,8,9,10,11] and dynamic 3D videos [12,13]. With 3D imaging systems readily available, the use of 3D
facial data for FER has attracted increasing interest, since 3D data are theoretically pose invariant and robust to illumination changes. Furthermore, they also capture accurate geometric information that is closely sensitive to expression variations. Existing methods for FER based on static 3D data can be categorized into two streams, i.e., feature based and model based. The first category claims that the distributions of facial surface geometric information, such as gradient, curvature [5], distances between pairs of interest landmarks [6] and local shapes near landmarks [10,11], are closely related to expression categories. This geometric information is then extracted as features and fed to various classifiers, such as linear discriminant analysis (LDA), support vector machines (SVM) or neural networks, for FER. The main drawback of this kind of approach is that it requires a set of accurately located landmarks. This explains why most of these works made use of the 83 landmarks manually labeled on the BU-3DFE dataset. The second category tries to simulate the physical process of generating expressions and explores a generic elastically deformable face model, which can generate universal expressions by adjusting parameters [7]. In general, this kind of method needs alignment and normalization steps to find one-to-one correspondences among 3D faces; the shape deformations between each pair of faces can then be represented by model parameters [7] or feature vectors [9], which are further used to perform FER. The downside of these approaches is, first, their high computational cost. Furthermore, the fitting process also requires some landmarks for initialization and hardly converges when the mouth is open in facial expressions.

In this paper, we propose a new feature based method for FER using two local shape descriptors. To characterize the shape information of the local neighborhood of facial landmarks, weighted statistical distributions of surface differential quantities, including the histogram of mesh gradient (HoG) and the histogram of shape index (HoS), are extracted as local shape descriptors from facial landmarks. These local shape descriptors can be considered an extension of the popular 2D SIFT feature [14] (Scale Invariant Feature Transform) to 3D-mesh based discrete surfaces [15]. A normal cycle theory based curvature estimation method [17,18,19] is employed for the first time on 3D face models, along with the popular cubic fitting curvature estimation method for the purpose of comparison. The local shape descriptors of 60 landmarks are then concatenated into a global shape descriptor according to the manually labeled order. Based on the basic fact that different expressions involve different local shape deformations, the global shape descriptors are fed to an SVM classifier using both linear and RBF kernels to perform facial expression recognition. The proposed approach was tested on the BU-3DFE database with the same experimental setting as its main competitors, and the experimental results show that the proposed approach outperforms the state of the art.

The remainder of this paper is organized as follows: differential quantities estimated on mesh-based facial models are introduced in Section 2. Section 3 presents the local shape descriptors. Experimental results are discussed in Section 4. Section 5 concludes the paper.
2 Estimating Differential Quantities on Triangular Meshes

2.1 Estimating Curvature by Normal Cycle Theory Based Method
There are many approaches to calculating curvature on triangular meshes based on estimation of the curvature tensor. Taubin [16] introduced a 3D curvature tensor from which the principal curvature directions can be estimated from two of the three eigenvectors, and the principal curvatures can be computed as linear combinations of two of the three eigenvalues. Cohen-Steiner and Morvan [17,18] introduced a discrete definition of the mean curvature, the Gaussian curvature and the curvature tensor based on normal cycle theory, and proved that the estimated curvature tensors converge to the true ones of the smooth surface under specific sampling conditions. The basic idea and its discrete form can be stated as follows [19]: for every edge e of the mesh, there is an obvious minimum (i.e., along the edge) and maximum (i.e., across the edge) curvature. A natural curvature tensor can therefore be defined at each point along an edge, named the generalized curvature [18]. This line density of tensors can now be integrated over an arbitrary region B by summing the different contributions from B, leading to the simple expression:

T(v) = \frac{1}{|B|} \sum_{\text{edges } e} \beta(e) \, |e \cap B| \, \bar{e} \, \bar{e}^{\,t}   (1)
where v is an arbitrary vertex of the mesh and |B| is the surface area around v over which the tensor is estimated. \beta(e) is the signed angle between the normals of the two oriented triangles incident to the edge e (positive if convex, negative if concave), |e ∩ B| is the length of e ∩ B (always between 0 and the length of e), and \bar{e} is a unit vector in the same direction as e. In our experiments, we estimate the tensor at every vertex location v for a 2-ring neighborhood B. The principal curvatures k_{min} and k_{max} at v can then be estimated from the two maximum eigenvalues of T(v). Fig. 1 shows a schematic of this method.
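A simplified per-vertex accumulation of the tensor in (1); clipping |e ∩ B| is approximated by keeping only edges with both endpoints in the neighborhood, which is an assumption of this sketch rather than the exact construction of [17,18,19].

```python
import numpy as np

def curvature_tensor(verts, edges, beta, region_verts, region_area):
    """Approximate T(v) of eq. (1) over a neighborhood B of a vertex.

    verts: V x 3 vertex positions; edges: list of (i, j) index pairs;
    beta: signed dihedral angle of each edge; region_verts: vertex indices of
    B (e.g. the 2-ring); region_area: surface area |B| of the neighborhood."""
    T = np.zeros((3, 3))
    for (i, j), b in zip(edges, beta):
        if i in region_verts and j in region_verts:
            e = verts[j] - verts[i]
            length = np.linalg.norm(e)
            T += b * length * np.outer(e / length, e / length)
    return T / region_area

# The principal curvatures can then be read off, e.g. via np.linalg.eigvalsh(T).
```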
Fig. 1. Illustration of normal cycle theory based curvature estimation method (equation (1) [19])
Fig. 2. First row, from left to right: κmax , κmin and shape index estimated by normal cycle theory based method. Second row, from left to right: κmax , κmin and shape index estimated by cubic fitting based method, (model M0044-DI03).
The shape index, which expresses different shape classes by a single number ranging from 0 to 1, can then be estimated by the following equation:

S = \frac{1}{2} - \frac{1}{\pi} \arctan\left( \frac{\kappa_{max} + \kappa_{min}}{\kappa_{max} - \kappa_{min}} \right)   (2)
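Equation (2) translates directly into a small helper (undefined at umbilic points where the two principal curvatures coincide).

```python
import math

def shape_index(k_max, k_min):
    # Shape index of eq. (2); caller must ensure k_max != k_min.
    return 0.5 - (1.0 / math.pi) * math.atan((k_max + k_min) / (k_max - k_min))
```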
The first row of Fig. 2 shows an example of the maximum and minimum curvatures and the shape index estimated by this method on a 3D face model.

2.2 Estimating Curvature by Local Cubic Fitting Based Method
In this paper, we also compare the performance of our shape descriptors using different curvature estimation methods. For this purpose, we adopt the local cubic fitting method [20] to estimate curvatures. The basic idea of this method is the following: for each vertex p of the 3D mesh, a local coordinate system is defined by taking the vertex p as the origin and the normal vector n_p = (n_x, n_y, n_z)^T as the z axis. Two orthogonal axes, x and y, are randomly chosen in the tangent plane perpendicular to the normal vector. The local neighborhood points (the 2-ring in our paper) and their corresponding normal vectors are first transformed to the local coordinate system and then used for fitting a cubic function and its normal, respectively. The cubic function and its normal have the following forms:

z(x, y) = \frac{A}{2} x^2 + Bxy + \frac{C}{2} y^2 + Dx^3 + Ex^2 y + Fxy^2 + Gy^3   (3)
Fig. 3. Illustration of the estimated normal vectors (model M0044-DI03)
(z_x, z_y, -1) = \left( Ax + By + 3Dx^2 + 2Exy + Fy^2 ,\; Bx + Cy + Ex^2 + 2Fxy + 3Gy^2 ,\; -1 \right)   (4)

By using the least-squares fitting method to solve the fitting equations (3) and (4), the Weingarten matrix at a vertex can be computed as:

W = \begin{pmatrix} \frac{\partial^2 z(x,y)}{\partial x^2} & \frac{\partial^2 z(x,y)}{\partial x \partial y} \\ \frac{\partial^2 z(x,y)}{\partial x \partial y} & \frac{\partial^2 z(x,y)}{\partial y^2} \end{pmatrix} = \begin{pmatrix} A & B \\ B & C \end{pmatrix}   (5)
The maximum curvature k_{max} and the minimum curvature k_{min} can then be estimated as the eigenvalues of the Weingarten matrix. The second row of Fig. 2 shows an example of the maximum and minimum curvatures and the shape index estimated by the cubic fitting method on the same 3D face model.
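A sketch of the cubic fitting of (3)-(4) and the Weingarten matrix (5) via linear least squares; inputs are assumed to be already expressed in the local frame of the vertex.

```python
import numpy as np

def weingarten_from_fit(pts, normals):
    """Fit eqs. (3)-(4) by least squares and return (k_max, k_min) from eq. (5).

    pts: N x 3 neighbor coordinates in the local frame (z along the normal);
    normals: N x 3 neighbor normals in the same frame."""
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    nx, ny, nz = normals[:, 0], normals[:, 1], normals[:, 2]
    zero = np.zeros_like(x)
    # Rows for the surface values z(x, y) of eq. (3): unknowns [A,B,C,D,E,F,G]
    A_z = np.column_stack([x**2 / 2, x*y, y**2 / 2, x**3, x**2*y, x*y**2, y**3])
    # Rows for the two normal components of eq. (4): z_x and z_y
    A_zx = np.column_stack([x, y, zero, 3*x**2, 2*x*y, y**2, zero])
    A_zy = np.column_stack([zero, x, y, zero, x**2, 2*x*y, 3*y**2])
    A = np.vstack([A_z, A_zx, A_zy])
    rhs = np.concatenate([z, -nx / nz, -ny / nz])
    coeff, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    a, b, c = coeff[0], coeff[1], coeff[2]
    W = np.array([[a, b], [b, c]])            # Weingarten matrix of eq. (5)
    k1, k2 = np.linalg.eigvalsh(W)
    return max(k1, k2), min(k1, k2)
```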
2.3 Mesh Gradient Estimation
Let the normal vector at p be n_p = (n_x, n_y, n_z)^T, which can be estimated by averaging the normal vectors of the one-ring faces. According to (3) and (4), the gradient direction and the gradient magnitude can be estimated as follows:

\theta = \arctan\left( \frac{n_y}{n_x} \right)   (6)

\| \nabla z(x, y) \| = \sqrt{ \left( -\frac{n_x}{n_z} \right)^2 + \left( -\frac{n_y}{n_z} \right)^2 }   (7)
3 3.1
Local Shape Descriptors Landmarks Selection
A set of landmarks, e.g. eye corners, nose tip, etc. is used for the purpose of FER. For each landmark, two local shape descriptors, namely HoG and HoS, are
488
H. Li, J.-M. Morvan, and L. Chen
Fig. 4. From left to right, 60 selected manual landmarks, local neighborhood points of the left mouth corner (M0044-DI04 and M0044-DI03)
extracted from its neighborhood. In this work, a neighborhood with a geodesic disk is considered. The experiments were carried out on the BU-3DFE database using the first 60 landmarks defined on the regions of eyebrows, eyes, nose and mouth. These landmarks are a subset of 83 manually labeled landmarks with a specified order defining in the dataset. The radius of the disk is equal to 22 mm in our experiments. The 60 selected landmarks and an example of local neighborhood of the left mouth corner are shown in Fig. 4. 3.2
Local Coordinate System and Orientation Assignment
In practice, we first transform the points within a local neighborhood to a local coordinate system, in which the landmark point is the origin and its normal vector is along the positive z axis. Two perpendicular vectors x and y axis are randomly chosen in the tangent plane. In order to make the descriptor invariant to rotation, each landmark point is assigned one or several canonical orientations according to the dominant direction(s) of gradients in the local tangent plane with 360 bins. Once the canonical orientations are assigned, the local coordinate system rotates in the local tangent plane, making each canonical orientation as new x axis. Now y axis can be computed by cross product of z and x. In this new local coordinate system, we project the neighbor points to the tangent plane of its corresponding landmark. Eight projected points along to eight quantized directions starting from canonical orientation with a distance of r1 to the landmark point are fixed. Nine circles centered at the landmark point and its eight neighbors with a radius r2 can be further located. Fig.5 shows this arrangement. 3.3
3.3 Feature Vectors Computed as Histograms of Surface Differential Quantities
In each circle, we calculate a histogram of the surface gradient (hog_c) and a histogram of the shape index (hos_c). For hog_c, we compute the histogram of the gradient angle weighted by the gradient magnitude. This histogram has 8 bins representing
Fig. 5. Canonical orientation (arrow), landmark point (o) and its 8 neighborhood vertices (+) assigned with 9 circles
8 main orientations ranging from 0 to 360 degrees. For hos_c, the values of the shape index, ranging from 0 to 1, are also quantized into 8 bins. Then, all the values of the histograms are weighted by a Gaussian kernel with the Euclidean distance to the center point of the circle as the standard deviation. The resulting histogram of length 72 is then normalized and forms the feature vector of a single landmark point. Formally, hog_p and hos_p are defined as follows:

hog_p = (hog_1^c, hog_2^c, ..., hog_9^c)    (8)

hos_p = (hos_1^c, hos_2^c, ..., hos_9^c)    (9)
The final feature vectors of each face model, HoG and HoS, are then obtained by simply concatenating the corresponding histograms of all 60 manual landmarks in a fixed order, giving a vector of 60 × 72 = 4320 bins. They can be represented as follows:

HoG = (hog_1^p, hog_2^p, ..., hog_60^p)    (10)

HoS = (hos_1^p, hos_2^p, ..., hos_60^p)    (11)
Using an early fusion strategy, the fusion of these two local shape descriptors, HoG+HoS, is a simple concatenation of these two vectors.
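The following Python sketch illustrates, under several assumptions, how the per-landmark descriptor of Section 3.3 can be assembled: each of the 9 circles yields an 8-bin histogram of gradient orientation (weighted by magnitude) and an 8-bin histogram of shape index, and the two 72-dimensional vectors are normalized. The Gaussian weighting and the L2 normalization are parameterized here by assumption, since the paper leaves those details open; the inputs points, angles, mags and shape_idx are hypothetical arrays of tangent-plane quantities.

import numpy as np

def landmark_descriptor(points, angles, mags, shape_idx, centers, r2, canon, sigma):
    """hog_p and hos_p of Eqs. (8)-(9) for one landmark (simplified sketch)."""
    hog, hos = [], []
    for c in centers:                                   # 9 circle centers
        d = np.linalg.norm(points - c, axis=1)
        m = d < r2                                      # points inside the circle
        w = np.exp(-d[m] ** 2 / (2.0 * sigma ** 2))     # Gaussian weighting (assumed form)
        rel = np.mod(angles[m] - canon, 2 * np.pi)      # angle w.r.t. canonical orientation
        h1, _ = np.histogram(rel, bins=8, range=(0, 2 * np.pi), weights=mags[m] * w)
        h2, _ = np.histogram(shape_idx[m], bins=8, range=(0, 1), weights=w)
        hog.append(h1)
        hos.append(h2)
    hog_p = np.concatenate(hog).astype(float)           # length 9 x 8 = 72
    hos_p = np.concatenate(hos).astype(float)
    hog_p /= np.linalg.norm(hog_p) + 1e-12              # normalization (L2 assumed)
    hos_p /= np.linalg.norm(hos_p) + 1e-12
    return hog_p, hos_p

The face-level HoG and HoS vectors of Eqs. (10)-(11) are then simple concatenations of these 72-dimensional vectors over the 60 landmarks, and HoG+HoS is their further concatenation.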
4 Experimental Results
In our experiments, the BU-3DFE database [4], consisting of 2500 textured 3D face models of 100 subjects of different gender, race and age, is used. Each subject has one neutral model and six universal non-neutral expressions: happiness, sadness, anger, fear, surprise and disgust. Each non-neutral expression is displayed at four different intensities: low, middle, high and highest. Fig. 6 shows some samples of the database with six different expressions at high and highest intensities. For a fair comparison, we used the same experimental settings as [9]. A subset of 60 subjects was randomly selected with two high-intensity models for each of
Fig. 6. Examples of the six universal expressions, from left to right: anger, disgust, fear, happiness, sadness, surprise. First row: high intensity; second row: highest intensity.
the six facial expressions. In total, 60 × 12 = 720 3D mesh-based face models were selected. Then, 54 and 6 subjects were randomly selected as the training set (648 models) and the test set (72 models), respectively. In practice, since the number of models in the test set is very limited (only 12 models per expression), and since people of different race, gender and age may exhibit different facial surface changes when performing the same expression, the average recognition accuracy obtained over 10 or 20 random experiments varies greatly, from about 50% to more than 90% [9] with the same feature set, classifier and parameter setup. To obtain stable average recognition accuracies, we ran all of our experiments 1000 times independently. To calculate the descriptors, we set r1 and r2 equal to 15 mm and 7 mm, respectively. With a dimension of 4320 for both HoG and HoS, it is hard to get information about their distributions in such a high-dimensional space. We made use of the SVM classifier (LIBSVM [21]) with both a linear kernel and an RBF (radial basis function) kernel for the final FER. The parameter 'gamma' of the RBF kernel was set to 0.04 by 8-fold cross-validation. The results are shown in Tables 1-5. AN, DI, FE, HA, SA, SU designate anger, disgust, fear, happiness, sadness, and surprise, respectively. Table 1 shows the average confusion matrix obtained by the HoG feature using the SVM classifier with linear and RBF kernels. We find that both the linear and RBF kernels achieve relatively high recognition accuracies for happiness and surprise and lower accuracies for the other expressions. We observe the same phenomenon already evidenced in other work: anger and fear have comparatively lower classification rates. Anger is confused mainly with sadness, while fear is confused mainly with happiness and surprise. The linear kernel works slightly better for anger and slightly worse for disgust and happiness compared to the RBF kernel. Both kernels achieve almost the same average recognition rate of about 76.5% over all six expressions. Table 2 shows the average confusion matrix obtained by the HoS feature (estimated by the normal cycle theory based method) and the SVM classifier with linear and RBF kernels. All the results are better than the accuracies in Table 1 except the one obtained by the RBF kernel for sadness (72% vs 72.6%). The linear kernel
Table 1. Average confusion matrix obtained by HoG

Linear kernel
%    AN    DI    FE    HA    SA    SU
AN   74.0  8.3   0.8   1.6   15.2  0.1
DI   5.9   76.5  10.4  3.7   2.6   0.9
FE   5.4   11.7  63.6  9.7   5.4   4.2
HA   1.1   0.9   16.0  82.0  0     0
SA   16.6  2.7   8.2   0     72.2  0.3
SU   1.0   2.3   4.9   0.3   1.5   90.0
Average: 76.4

RBF kernel
%    AN    DI    FE    HA    SA    SU
AN   66.0  12.0  3.0   0.9   18.1  0
DI   2.3   80.7  8.5   3.8   3.2   1.5
FE   3.4   8.4   63.2  13.2  5.5   6.2
HA   0.1   1.4   12.6  85.6  0     0.3
SA   15.6  4.1   5.7   0     72.6  2.0
SU   0     2.6   4.1   1.6   0.9   90.7
Average: 76.5
Table 2. Average confusion matrix obtained by normal cycle based HoS

Linear kernel
%    AN    DI    FE    HA    SA    SU
AN   77.0  8.5   0.9   0     13.3  0.4
DI   7.6   80.0  6.2   2.8   3.4   0
FE   5.2   5.9   70.9  8.9   5.9   3.2
HA   0.4   0.9   4.8   93.2  0     0
SA   15.5  1.8   7.3   0     74.4  0.9
SU   0.2   1.9   1.9   0.4   0     95.6
Average: 81.9

RBF kernel
%    AN    DI    FE    HA    SA    SU
AN   71.5  10.4  2.2   0     15.2  0.7
DI   3.6   82.0  6.5   3.4   4.4   0
FE   4.4   8.1   65.6  12.0  6.1   4.0
HA   0.6   1.0   6.5   91.2  0     0.7
SA   14.1  2.4   10.1  0     72.0  1.5
SU   0.1   1.6   1.4   0.2   0     96.8
Average: 79.9
works better than the RBF kernel except for disgust (80.0% vs 82.0%) and surprise (95.6% vs 96.8%). The average recognition rates over all six expressions are 81.9% and 79.9% for the linear and RBF kernels, respectively. Table 3 shows the average confusion matrix obtained by the HoS feature (estimated by the local cubic fitting method) and the SVM classifier with linear and RBF kernels. All the results are better than the corresponding results in Table 1 except the one obtained by the linear kernel for anger (73.1% vs 74.0%). For the linear kernel, the performances are worse than the ones in Table 2 except for disgust (80.6% vs 80.0%), fear (72.3% vs 70.9%) and sadness (76.3% vs 74.4%), while for the RBF kernel the performances are slightly better than in Table 2 (80.6% vs 79.9%), especially for sadness (76.2% vs 72%). The two curvature estimation methods achieve comparable results, which is in contrast to the conclusion made in [5]. On the other hand, comparing the computational complexity of the two curvature estimation methods, the local cubic fitting method, which needs to solve a linear system of equations at each point, incurs a higher computational cost. Table 4 presents the average confusion matrix obtained by fusing the features HoG and HoS into a single feature according to a simple early fusion scheme, denoted HoG+HoS (with curvatures estimated by normal cycle theory and cubic fitting), and an SVM classifier with a linear kernel. Compared to the results in the left column of Table 2 and those of Table 3, also achieved by an SVM with a linear kernel, the fused feature HoG+HoS, when estimated using the normal cycle theory for the
Table 3. Average confusion matrix obtained by cubic fitting based HoS

Linear kernel
%    AN    DI    FE    HA    SA    SU
AN   73.1  7.7   2.7   0.2   16.2  0
DI   5.1   80.6  9.3   3.0   1.7   0.5
FE   5.4   4.8   72.3  8.7   6.0   2.7
HA   0.8   3.9   4.2   90.5  0     0.7
SA   16.3  0.6   5.9   0     76.3  0.9
SU   0.1   2.3   3.0   0     0.8   93.9
Average: 81.1

RBF kernel
%    AN    DI    FE    HA    SA    SU
AN   72.3  9.9   2.6   0     15.2  0
DI   3.3   81.3  8.5   2.9   2.7   1.2
FE   5.5   7.3   66.6  9.5   6.7   4.4
HA   0.9   2.2   5.2   91.1  0     0.7
SA   12.8  1.5   8.6   0     76.2  0.9
SU   0.1   2.3   1.1   0     0.6   95.9
Average: 80.6
Table 4. Average confusion matrix obtained by the HoG+HoS descriptor using a linear kernel with the SVM

Normal cycle based
%    AN    DI    FE    HA    SA    SU
AN   76.8  7.6   2.1   0     13.5  0
DI   7.6   78.1  6.6   2.1   5.0   0.7
FE   4.6   7.6   73.2  7.3   5.1   2.3
HA   0.5   0.5   6.8   91.4  0     0.8
SA   14.5  1.1   8.3   0     75.5  0.6
SU   0     1.7   2.0   0.9   0.8   94.5
Average: 81.6

Cubic fitting based
%    AN    DI    FE    HA    SA    SU
AN   76.4  8.0   1.8   0     13.7  0
DI   4.4   80.2  10.2  2.8   2.0   0.5
FE   5.1   6.2   73.6  8.0   5.3   1.7
HA   0.8   2.2   6.5   90.4  0     0.8
SA   14.7  0.5   6.2   0     77.8  0.8
SU   0     2.0   3.3   0.1   1.0   93.6
Average: 82.0
computation of curvature, improves the recognition accuracy for sadness and fear while showing a slight performance drop for the other expressions. When estimated using the cubic fitting technique for the computation of curvature, the fused feature HoG+HoS improves the recognition accuracies for sadness and fear, along with anger this time, and also shows a slight drop for the remaining expressions. Overall, there is no significant increase in recognition accuracy, either using normal cycles or cubic fitting, in the case of the early-fused feature vector HoG+HoS. This may be caused on the one hand by the huge dimension (8640 × 1) of the fused descriptor, and on the other hand by the different nature of the geometric information captured by HoG and HoS. It may be interesting to study other fusion strategies, for instance a late fusion strategy combining similarity scores. Table 5 compares the experimental results achieved by the proposed method (HoG with the RBF kernel, HoS by normal cycle with the linear kernel, HoG+HoS by cubic fitting with the linear kernel) with the ones reported in [9] and [11]. In fact, the results of the approaches proposed by Gong et al. (Gong) [9], Wang et al. (Wang) [5], Soyel et al. (Soyel) [6], and Tang et al. (Tang) [8] were obtained using the same experimental setting. In Berretti et al. (Berretti) [11], 60 subjects were selected randomly from experiment to experiment. It can be seen from the table that the proposed approach using the HoG feature achieves results comparable to the others, while the HoS and HoG+HoS features outperform all the other methods.
Table 5. Comparison of the proposed method with the state of the art [11], [9], [5], [6], [8]

      HoG     HoS     HoG+HoS  Berretti  Gong    Wang    Soyel   Tang
AVE   76.48%  81.86%  82.01%   77.54%    76.22%  61.79%  67.52%  74.51%
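As an illustration of the classification protocol of Section 4, the sketch below averages confusion matrices over random subject splits. It uses scikit-learn's SVC (a LIBSVM wrapper) instead of the LIBSVM tools used in the paper, random placeholder data instead of the real descriptors, and only 20 repetitions instead of 1000; it is a sketch of the protocol, not the authors' code.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Placeholder data: 60 subjects x 12 models, 4320-dim HoG or HoS descriptors.
X = np.random.rand(720, 4320)
y = np.random.randint(0, 6, 720)                 # AN, DI, FE, HA, SA, SU -> 0..5
subjects = np.repeat(np.arange(60), 12)

def one_run(seed, kernel="rbf", gamma=0.04):
    rng = np.random.default_rng(seed)
    test_subj = rng.choice(60, size=6, replace=False)   # 54 train / 6 test subjects
    test = np.isin(subjects, test_subj)
    clf = SVC(kernel=kernel, gamma=gamma)                # kernel="linear" for the linear case
    clf.fit(X[~test], y[~test])
    return confusion_matrix(y[test], clf.predict(X[test]), labels=np.arange(6))

# Average confusion matrix (%) over repeated random splits (1000 runs in the paper).
cm = sum(one_run(s) for s in range(20)).astype(float)
cm = 100.0 * cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)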
5 Conclusion and Future Work
In this paper, we have developed a mesh-based 3D facial expression recognition approach and evaluated it on the BU-3DFE database. The proposed approach is based on two local shape descriptors, namely HoG and HoS, obtained by computing histograms of the surface gradient and of the shape index within the local neighborhoods of landmarks, respectively. In this work, we selected the first 60 of the 83 ordered, manually labeled landmarks provided in the BU-3DFE database. Curvatures are estimated using a normal cycle theory based method and a cubic fitting method. Both linear and RBF kernels of the SVM are employed for classification. The experimental results show that the proposed approach outperforms the state of the art and demonstrate its effectiveness. In future work, we will investigate alternative fusion schemes beyond the simple early fusion scheme used here. Furthermore, we also want to study the stability of the proposed approach when landmarks are located automatically, for instance by the statistical technique SFAM proposed in [22].
References

1. Otsuka, K., Sawada, H., Yamato, J.: Automatic inference of cross-modal nonverbal interactions in multiparty conversations: "who responds to whom, when, and how?" from gaze, head gestures, and utterances. In: ICMI, pp. 255–262. ACM Press, New York (2007)
2. Ekman, P.: Universals and cultural differences in facial expressions of emotion. In: Nebraska Symposium on Motivation, Lincoln, NE, pp. 207–283 (1972)
3. Fasel, B., Luettin, J.: Automatic facial expression analysis: a survey. In: Pattern Recognition, pp. 259–275 (2003)
4. Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.: A 3D Facial Expression Database For Facial Behavior Research. In: The 7th International Conference on Automatic Face and Gesture Recognition (FG), pp. 211–216. IEEE Computer Society, Los Alamitos (2006)
5. Wang, J., Yin, L., Wei, X., Sun, Y.: 3D facial expression recognition based on primitive surface feature distribution. In: CVPR, pp. 1399–1406 (2006)
6. Soyel, H., Demirel, H.: Facial expression recognition using 3D facial feature distances. In: Int. Conf. on Image Analysis and Recognition, pp. 831–838 (2007)
7. Mpiperis, I., Malassiotis, S., Strintzis, M.G.: Bilinear models for 3-D face and facial expression recognition. IEEE Transactions on Information Forensics and Security 3(3), 498–511 (2008)
8. Tang, H., Huang, T.S.: 3D facial expression recognition based on automatically selected features. In: Int. Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
9. Gong, B., Wang, Y., Liu, J., Tang, X.: Automatic facial expression recognition on a single 3D face by exploring shape deformation. In: Int. Conf. on Multimedia, pp. 569–572 (2009)
10. Maalej, A., Ben Amor, B., Daoudi, M., Srivastava, A., Berretti, S.: Local 3D Shape Analysis for Facial Expression Recognition. In: ICPR, pp. 4129–4132. IEEE, Los Alamitos (2010)
11. Berretti, S., Bimbo, A.D., Pala, P., Ben Amor, B., Daoudi, M.: A Set of Selected SIFT Features for 3D Facial Expression Recognition. In: ICPR, pp. 4125–4128. IEEE, Los Alamitos (2010)
12. Yin, L., Chen, X., Sun, Y., Worm, T., Reale, M.: A High-Resolution 3D Dynamic Facial Expression Database. In: IEEE Int. Conference on Automatic Face and Gesture Recognition, pp. 1–6 (2008)
13. Sun, Y., Yin, L.: Facial Expression Recognition Based on 3D Dynamic Range Model Sequences. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 58–71. Springer, Heidelberg (2008)
14. Lowe, D.G.: Distinctive image features from scale invariant keypoints. IJCV, 91–110 (2004)
15. Fabry, M.C., et al.: Feature detection on 3D face surfaces for pose normalisation and recognition. In: BTAS (2010)
16. Taubin, G.: Estimating the tensor of curvature of a surface from a polyhedral approximation. In: ICCV, p. 902. IEEE Computer Society, USA (1995)
17. Cohen-Steiner, D., Morvan, J.M.: Restricted delaunay triangulations and normal cycle. In: Proceedings of the Nineteenth Annual Symposium on Computational Geometry, pp. 312–321. ACM, New York (2003)
18. Morvan, J.M.: Generalized Curvatures. Springer, Berlin (2008)
19. Alliez, P., Cohen-Steiner, D., Devillers, O., Lévy, B., Desbrun, M.: Anisotropic polygonal remeshing. ACM Trans. Graph. 22(3), 485–493 (2003)
20. Goldfeather, J., Interrante, V.: A novel cubic-order algorithm for approximating principal direction vectors. ACM Trans. Graph., 45–63 (2004)
21. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
22. Zhao, X., Szeptycki, P., Dellandrea, E., Chen, L.: Precise 2.5D Facial Landmarking via an Analysis by Synthesis approach. In: 2009 IEEE Workshop on Applications of Computer Vision (WACV 2009), Snowbird, Utah (2009)
Facial Feature Tracking for Emotional Dynamic Analysis

Thibaud Senechal¹, Vincent Rapp¹, and Lionel Prevost²

¹ ISIR, CNRS UMR 7222, Univ. Pierre et Marie Curie, Paris
{rapp,senechal}@isir.upmc.fr
² LAMIA, EA 4540, Univ. of Fr. West Indies & Guyana
[email protected]
Abstract. This article presents a feature-based framework to automatically track 18 facial landmarks for emotion recognition and emotional dynamic analysis. With a new way of using multi-kernel learning, we combine two methods: the first matches facial feature points between consecutive images and the second uses an offline learning of the facial landmark appearance. Matching points results in a jitter-free tracking and the offline learning prevents the tracking framework from drifting. We train the tracking system on the Cohn-Kanade database and analyze the dynamic of emotions and Action Units on the MMI database sequences. We perform accurate detection of facial expressions temporal segment and report experimental results. Keywords: Facial feature tracking, Emotion recognition, emotional dynamic, Multi-Kernel SVM.
1 Introduction
A current challenge in designing computerized environments is to place the user at the core of the system. To be able to fully interact with human beings, robots or human-centered interfaces have to recognize the user's affective state and interpret gestures, voice and facial movements. While much work has been done on recognizing emotions, only a few studies extract the emotional dynamics. In particular, the emotion temporal segment, which delimits the emotion display, is crucial for a system waiting for a specific reaction from its user. It is even more important for complex facial expression detectors, which need to know when and how an expression appears before actually recognizing it. We propose here a facial feature tracking system dedicated to emotion recognition and emotional dynamic analysis. There is much prior work on detecting and tracking landmarks. Appearance-based methods use generative linear models of face appearance such as Active Appearance Models [1], used in [2] and [3], 3D Morphable Models [4] or Constrained Local Models [5]. Although the appearance-based methods utilize much
knowledge of the face to achieve effective tracking, these models are limited by some common assumptions, e.g., a nearly frontal-view face and moderate facial expression changes, and they tend to fail under large pose variations or facial deformations in real-world applications. These models introduce too strong constraints between points. Feature-based tracking methods [6,7] usually track each landmark point by performing a local search for the best matching position, around which the appearance is most similar to the one in the initial frame. Tian et al. [8,9] use multiple-state templates to track the landmarks. Landmark point tracking together with masked edge filtering is used to track the upper landmarks. Over short sequences, feature-based tracking methods may be accurate and jitter-free. Over long ones, these methods often suffer from error accumulation, which produces drift, and cannot deal with severe aspect changes. As we want to track landmarks during the display of facial expressions, we have to take into account the strong deformation of facial shapes such as the eyes, brows and mouth. Hence, we localize 18 landmarks independently (4 per eye, 3 per brow and 4 for the mouth). In a sequence, we use the detections in previous images to detect landmarks in the current one by matching patches from preceding images with patches of the current image. The main problem is that if a given detection is not accurate, the following matching will lead to poorer detections, resulting in a drift as in other feature-based tracking methods. To solve this problem, we propose to incorporate prior knowledge on the appearance of the searched landmark. To this end, we use multi-kernel algorithms in an original way to combine the temporal matching of patches between consecutive images and the hypotheses given by a static facial feature detector. To assess the performance of the tracking system, two tasks are carried out. We recognize emotions on the Cohn-Kanade database [10] and we detect the temporal segments of emotions and Action Units (AUs) on the MMI database [11]. A comparison with a state-of-the-art method for AU temporal segmentation is provided. The paper is organized as follows. Section 2 describes the tracking method. Section 3 details the setup used to train and test the system. Section 4 reports experimental results for the emotion recognition task. Section 5 deals with emotion and AU temporal segmentation. Finally, Section 6 concludes the paper.
2 Facial Features Tracking Method

2.1 Overview
Fig. 1 gives an overview of the proposed system. To detect landmarks in a sequence image, we first use the previous detections to define a region of interest (ROI) containing candidate pixels that may belong to a given landmark. Then, we create two sets of features for each candidate pixel. The first set of features is called static, as it just considers the patch (the region surrounding the candidate pixel) extracted from the current image. The second one consists of dynamic features. It measures how well this given patch matches those extracted in the previous images. These two sets feed a Multi-Kernel Support Vector Machine (SVM). Hence, for each candidate pixel, the SVM output gives a confidence
Fig. 1. Overview of the proposed method
index of its being the searched landmark or not. Finally, we check the information arising from the confidence index of each candidate pixel for each landmark with statistical models. We use five point distribution models: two four-point models for the eyes, two three-point models for the brows and one four-point model for the mouth. Model parameters are estimated on expressive faces.
2.2 Features
Two types of information are extracted for each candidate pixel i, as shown in Fig. 2:
– Static features are the intensities g_i of the neighboring pixels. We extract an 11x11 patch centered on the candidate pixel from facial images that have an inter-ocular distance of 50 pixels. The patch intensity is normalized after mean and standard deviation estimation.
– Dynamic features are the correlations between a patch of the current image I_t and patches extracted in each previous image (I_{t-1}, I_{t-2}, I_{t-3}, ..., I_{t-N}) and centered on the detected landmark. In this way, we compute N correlation maps. The features are the (X, Y) coordinates (p_i^{t-1}, p_i^{t-2}, p_i^{t-3}, ..., p_i^{t-N}) of each candidate pixel relative to the maximum of each correlation map. Thus, the best matching point has the position (0,0).
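A plain-numpy sketch of the dynamic feature computation: each previously detected 11x11 patch is correlated against the current search region, and every candidate pixel is described by its offset to the correlation maximum. Normalized cross-correlation is assumed here; the paper does not specify the exact correlation variant.

import numpy as np

def normxcorr(patch, region):
    """Normalized cross-correlation of a small patch over a search region."""
    ph, pw = patch.shape
    p = (patch - patch.mean()) / (patch.std() + 1e-12)
    out = np.zeros((region.shape[0] - ph + 1, region.shape[1] - pw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            w = region[r:r + ph, c:c + pw]
            out[r, c] = np.mean(p * (w - w.mean()) / (w.std() + 1e-12))
    return out

def dynamic_features(region, prev_patches):
    """One (X, Y) offset map per previous frame: p^{t-1}, p^{t-2}, p^{t-3}."""
    maps = []
    for patch in prev_patches:
        corr = normxcorr(patch, region)
        by, bx = np.unravel_index(np.argmax(corr), corr.shape)
        ys, xs = np.mgrid[0:corr.shape[0], 0:corr.shape[1]]
        maps.append(np.stack([xs - bx, ys - by], axis=-1))  # (0, 0) at the best match
    return maps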
Fig. 2. Static and dynamic features extraction
2.3 Multi-Kernel Learning
The system has to discriminate target samples (candidate pixels that belong to the searched landmark) from non-target samples (candidate pixels that do not belong to the searched landmark). Given samples x_i = (g_i, p_i^{t-1}, ..., p_i^{t-N}) associated with labels y_i ∈ {−1, 1} (target or non-target), the classification function of the SVM associates a score s with a new sample (or candidate pixel) x = (g, p^{t-1}, ..., p^{t-N}):

s = \sum_{i=1}^{m} \alpha_i k(x_i, x) + b    (1)
with α_i the dual representation of the hyperplane's normal vector [12]. k is the kernel function resulting from the dot product in a transformed high-dimensional feature space. In the case of a multi-kernel SVM, the kernel k can be any convex combination of semi-definite functions. In our case, we have one kernel function per feature:

k = \beta_g k_g(g_i, g) + \sum_{j=1}^{N} \beta_j k_j(p_i^{t-j}, p^{t-j}),   with \beta_j \geq 0, \sum_j \beta_j = 1    (2)
The weights α_i and β_j are set so as to obtain an optimal hyperplane in the feature space induced by k. This optimization problem has been proven to be jointly convex in α_i and β_j [13]. Therefore, there is a unique global minimum that can be found efficiently. β_g represents the weight accorded to the static features and β_1, ..., β_N are the weights of the dynamic ones. Thus, by using a learning database, the system is able to find the best combination of these two types of features that maximizes the margin. This is a new way of using multi-kernel learning. Instead of combining different kinds of kernel functions (for example, radial basis with polynomial), we combine different features corresponding to different kinds of information. The first one, represented by the function k_g, corresponds to a local landmark detector which is able to localize these points without drift but can sometimes lead to inaccurate detections. The second one, represented by the functions k_1, ..., k_N, tries to match hypotheses between consecutive images. It is much more stable and will rarely result in bad detections, but a drift can appear along the sequence. Combining both kinds of information leads to accurate detections with no drift. Among all candidate pixels, we need to choose one representing the searched landmark. In the perfect case, we would have a positive SVM output s if the candidate pixel is close to the landmark and a negative one otherwise. In the general case, when we have zero or more than one candidate pixel with a positive score, we use the value of s to take a decision. This score can be seen as a confidence index of the candidate pixel belonging to the landmark.
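The sketch below illustrates the kernel combination of Eq. (2) with precomputed Gram matrices and an off-the-shelf SVM. The weights β are fixed by hand here purely for illustration; in the paper they are learned jointly with the SVM by SimpleMKL [14]. The feature dimensions and data are toy placeholders.

import numpy as np
from sklearn.svm import SVC

def combined_kernel(Ka, Kb_list, beta_g, betas):
    """Convex combination of one appearance kernel and N matching kernels, Eq. (2)."""
    K = beta_g * Ka
    for b, Kb in zip(betas, Kb_list):
        K += b * Kb
    return K

# Toy example: 40 samples with a 121-dim patch feature g and a 2-dim offset
# feature p for one previous frame (N = 1).
rng = np.random.default_rng(0)
G = rng.normal(size=(40, 121))
P = rng.normal(size=(40, 2))
y = rng.integers(0, 2, 40) * 2 - 1            # labels in {-1, +1}
K_g = G @ G.T                                  # linear kernel on gray-level patches
K_p = (P @ P.T + 1.0) ** 2                     # 2nd-order polynomial kernel on offsets
K = combined_kernel(K_g, [K_p], beta_g=0.3, betas=[0.7])
clf = SVC(kernel="precomputed").fit(K, y)
scores = clf.decision_function(K)              # confidence index s of Eq. (1)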
2.4 Statistical Validation
We estimate five Gaussian Point Distribution Models (PDMs) representing the eyes, brows and mouth by using the EM algorithm on the training dataset. For each of these models, we tune a threshold T such that 95% of the training shapes have a distance to the model lower than T. During tracking, the SVM scores each candidate pixel. Then we have to choose among these candidates, for each landmark, the one that leads to a valid facial shape. The first hypothesis is the shape having the highest SVM scores. It is considered valid if its distance to the model is lower than T. Otherwise, another hypothesis is built considering the next best combination of SVM scores, and so on.
3 Experimental Setup
The tracking system has been trained using the Cohn-Kanade database [10]. This is a representative, comprehensive and robust test-bed for comparative studies of facial expression. It contains 486 sequences (lasting from 10 to 50 frames) starting with the neutral expression and ending with the expression apex.
3.1 Training Database
We have manually labeled the landmark positions for the first and last images of each sequence. Yet, we need the facial feature positions in all the images of a sequence to compute the correlation maps. Instead of manually labeling them, we trained a special detector using the first and last images as prior knowledge. This has the advantage of being less time-consuming. Moreover, it leads to a more robust tracker because it is trained with correlation maps computed using noisy detections. To build the training database, for each landmark, we proceed as follows:
– We resize all images to have an inter-ocular distance of 50 pixels.
– For each sequence, we use the last image and the position of the given landmark (ground truth) to create training samples.
– We compute correlation maps between the ROI extracted in the last image and the patches surrounding the ground truth in the previous images.
– We choose as target samples the 9 points closest to the ground truth and as non-target samples 8 other points at a distance of 5 pixels from the ground truth.
– We repeat this process with the first image of each sequence, using the following images to compute the correlation maps (as though the sequence were reversed). This way we train the tracking system with sequences in which the expression disappears from the face.
This results in 18 target samples and 16 non-target samples per sequence.
Table 1. Mean weights of points belonging to the same facial feature

Facial features     βg      β1      β2      β3
Brows (6 points)    0.1300  0.6314  0.1774  0.0612
Eyes (8 points)     0.3142  0.4625  0.1477  0.0756
Mouth (4 points)    0.6918  0.1730  0.0822  0.0529
3.2 Multi-Kernel Learning
We use the SimpleMKL algorithm [14] to train the multi-kernel SVMs. For the gray-level patches, we use a linear kernel. For the positions of candidate pixels relative to the correlation map maxima, we use a second-order polynomial kernel. We choose this kernel because target samples lie close to the maxima of the correlation maps, so the boundary between target and non-target samples is roughly circular. The algorithm found that matching with the previous images I_{t-4}, I_{t-5}, ... is not useful for the detection and set β_4, β_5, ... to zero. So we train the SVMs using only k_g, k_1, k_2, k_3 and we find one set of weights β_g, β_1, β_2, β_3 for each facial landmark. The mean weights learned for the points belonging to the same landmark set are reported in Table 1. We first notice that we always have β_1 > β_2 > β_3. This means that a larger weight is associated with the matching to the most recent images. Moreover, the points that are difficult to detect in static images have the weakest coefficient β_g, meaning that the system does not overly rely on the static features. The brows are the most difficult to detect because the ground truth is not well defined. The eyes, particularly in the Cohn-Kanade database, suffer from some illumination problems, and the eye contour is not always very clear for the tracking system. On the contrary, the mouth has the most salient points. Therefore, the weight values tuned by the SimpleMKL algorithm are in agreement with our intuition.
3.3 Testing Phase
To test the tracking system, we use a 2-fold validation setup in which half of the sequences is used for training and cross-validation, and the other half is used as an independent test set. We take care not to have sequences of the same subject in the training and testing sets. This way, we obtain experimental results, i.e., landmark coordinates given by the tracking system, for all the sequences. During the test phase, we start by detecting landmarks in the first image with a facial landmark detector [15]. The following image is resized using the inter-ocular distance computed from the previous detections. For each landmark, we proceed as follows. First, we select an ROI of 15x15 pixels (30% of the inter-ocular distance) surrounding the last detection. This allows the landmark point to move quickly from one image to another. Then, we test each candidate pixel of the ROI with the SVM classifier, leading to one score per candidate. Finally, we combine candidate pixels to create shape hypotheses and use the PDMs to validate them, as described in Section 2.4.
4 Experiments on the Cohn-Kanade Database

4.1 Performance Measures
As a first performance measure, we compute the localization error on the last image of each sequence. This is the mean Euclidean distance between the 18 detected points and the true (labeled) landmark points, normalized by the inter-ocular distance. The Mean Localization Error is then computed over the whole test dataset. However, the main objective of our system is to track an emotion. Some detection inaccuracies can reduce the emotion intensity and be harmful to emotion recognition. On the contrary, they will be harmless if they amplify the emotion intensity. Hence, the second performance measure is the Emotion Recognition Accuracy. 400 sequences of the Cohn-Kanade database are labeled with one of five basic emotions (anger, fear, joy, relief, sadness). We want to verify that the tracking system allows these emotions to be recognized accurately. To do so, we compute the difference between the 18 detected landmarks at the beginning and at the end of each sequence. These feature vectors are used to train five one-versus-all binary classifiers, each one dedicated to an emotion, using as positives all the samples corresponding to this emotion and all the other samples as negatives. During testing, at the end of the sequence, the feature vector is computed and fed to the five classifiers. The emotion corresponding to the classifier with the highest decision function output is assigned to the sequence. To test the generalization to new subjects, we use a leave-one-subject-out cross-validation setup in which all the data from one subject are excluded from the training database and used for testing.
4.2 Experimental Results
Table 2 details the accuracy of the tracking system. If we only use the static features (function k_g), the Mean Localization Error (MLE) is as high as 11% of the inter-ocular distance. Using only the dynamic features (functions k_1, k_2, k_3), the error decreases to 5.7%. Combining both kinds of features achieves a better result, with an error of 5.3%. Finally, the PDMs mainly correct the outliers and reduce the error standard deviation (SD). These local models do not unduly constrain the points, allowing expressive shapes, and do not change the emotion recognition performance. We notice that even if the function k_g alone does not achieve good results, it is still useful when combined with the other functions using matching with previous images. It provides information about the searched landmark and prevents the tracking system from drifting. Moreover, the Cohn-Kanade sequences are relatively short, and the kernel k_g would be even more useful on longer sequences. The emotion recognition accuracy (ERA) increases in the same way, from 60.2% with the sole static information to 78.0% when using the full tracking system. Finally, let us notice that the detections reached by the tracking system at the end of the sequences lead to an emotion recognition score (78%) close to the one reached when using the ground truth (80%). This shows that the system is well suited to track emotions through sequences.
Table 2. Experimental results achieved on the Cohn-Kanade database

                            MLE     SD      ERA
Using kg only               11.4%   2.81%   60.2%
Using k1, k2, k3 only       5.7%    1.91%   76.2%
Without stat. validation    5.3%    1.86%   78.0%
Full tracking system        5.1%    1.81%   78.0%
Ground truth                0%      0%      80.0%

4.3 Sensitivity to Initialization
In Fig. 3, we investigate the robustness of the tracking system to the initialization (the detection in the first image). In this experiment, we do not use a facial landmark detector but the ground truth of the first image to initialize the tracker. We artificially add uniform noise to this ground truth. As some shapes are not statistically possible, we choose the closest shape, among the noisy detections, validated by the PDM. The mean localization error of this shape, normalized by the inter-ocular distance, is reported as the initialization error. We notice that for an initialization error greater than 4% of the inter-ocular distance, the localization error increases. This means that landmarks are tracked with less accuracy. But even with these inaccurate detections, the emotion is still correctly detected. With an initialization error of 8%, the emotion recognition score only decreases from 78% to 74%. To measure the influence of poor detections in the first image, we perform another experiment. We use the labeled landmarks (ground truth) of the last image. We notice that the tracking system leads to landmark localizations on the apex expression image that are more useful for the emotion recognition task than the ground truth.
Fig. 3. Sensitivity to the initialization error: localization error and emotion recognition accuracy (for the tracking system and the ground truth) as a function of the mean initialization error (%)
5 Temporal Analysis of Expressions
Since the first image of the Cohn-Kanade sequences represents the neutral face and the last the expression apex, a dynamic analysis of these sequences can be biased. In this section, we evaluate the ability of the proposed system to detect an emotion temporal segment. We call onset the temporal beginning of the facial movement and offset its ending. To evaluate the tracker's ability to (1) follow subtle facial movements and (2) generalize to other sequences, we decided to test the tracking system on another challenging database.
5.1 MMI Database
The MMI Facial Expression database [11] holds videos of about 50 subjects displaying various facial expressions on command. We apply our tracking system to the 197 sequences labeled with one of the six basic expressions and, to compare with other works, to 256 AU-labeled sequences. In these sequences, AUs can appear alone or in combination. We have chosen the AU sequences in which the onset and offset are already labeled. We also manually labeled the onset and offset of the basic expression sequences. Contrary to the Cohn-Kanade database, the subjects can wear glasses and beards. Sequences last several seconds, recorded at a rate of 25 frames per second.
5.2 Temporal Segment Detection
Our goal is to detect the temporal segment using only the landmark detections along the sequence. In this way, we can check whether the system is accurate enough to track subtle movements. To detect the onset and the offset in each sequence, we proceed as follows (Fig. 4):
– For each frame, we express the landmark coordinates in a barycentric basis. To detect emotions, we use the coordinates X_i^p, Y_i^p of all the landmarks p and frames i. To detect upper AUs, we only use the brow and eye landmarks (14 points). To detect lower AUs, we use only the mouth landmarks (4 points).
– We first try to separate each sequence into two halves, each half containing either the onset or the offset. We compute the Euclidean distance D(i) between the coordinates in frame i and the coordinates in the first frame, and its derivative d(i):

D(i) = \sqrt{ \sum_p (X_i^p - X_1^p)^2 + (Y_i^p - Y_1^p)^2 }    (3)

d(i) = D(i) - D(i-1)    (4)
We cut the sequence at the frame i_c such that we maximize:

\max_{i_c} ( \sum_{i=1}^{i_c} d(i) - \sum_{i=i_c}^{end} d(i) )    (5)
This way, the onset of the expression is likely to be before i_c and the offset after i_c.
Fig. 4. Detection of the start and the end of the emotion (panels: Euclidean distance D(i) between the landmarks of frame i and frame 1; derivative d(i); gradient magnitude G(i); detected temporal segment of the emotion)
– We compute G(i), the sum of the gradient magnitudes over 6 frames. This represents the global movement of the facial landmarks:

G(i) = \sum_p \sqrt{ ( \sum_{k=0}^{2} X_{i+k}^p - \sum_{k=1}^{3} X_{i-k}^p )^2 + ( \sum_{k=0}^{2} Y_{i+k}^p - \sum_{k=1}^{3} Y_{i-k}^p )^2 }    (6)
– The expression onset corresponds to the maximum of G(i) for i < i_c and the expression offset corresponds to the maximum of G(i) for i > i_c (a code sketch of this procedure follows the list).
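A compact numpy sketch of this onset/offset procedure (Eqs. (3)-(6)), assuming the tracked landmark coordinates are available as a (T, P, 2) array already expressed in a barycentric basis.

import numpy as np

def detect_onset_offset(coords):
    """Onset/offset detection from tracked landmarks, Eqs. (3)-(6)."""
    T = coords.shape[0]
    # Eq. (3): Euclidean distance of frame i to the first frame.
    D = np.sqrt(((coords - coords[0]) ** 2).sum(axis=(1, 2)))
    d = np.diff(D, prepend=D[0])                          # Eq. (4)
    # Eq. (5): cut point ic maximizing sum(d[:ic]) - sum(d[ic:]).
    crit = [d[:ic].sum() - d[ic:].sum() for ic in range(1, T)]
    ic = 1 + int(np.argmax(crit))
    # Eq. (6): gradient magnitude over a 6-frame window.
    G = np.zeros(T)
    for i in range(3, T - 2):
        dx = coords[i:i + 3, :, 0].sum(axis=0) - coords[i - 3:i, :, 0].sum(axis=0)
        dy = coords[i:i + 3, :, 1].sum(axis=0) - coords[i - 3:i, :, 1].sum(axis=0)
        G[i] = np.sqrt(dx ** 2 + dy ** 2).sum()
    onset = int(np.argmax(G[:ic]))
    offset = ic + int(np.argmax(G[ic:]))
    return onset, offset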
5.3 Segmentation of Basic Emotions
We report the mean difference between the true label and the detected label in Table 3. We detect the temporal segment of the emotion with an accuracy of 5 frames, or 0.2 seconds. As the emotion in these sequences lasts between 40 and 120 frames, we can say that the tracking system leads to a good segmentation of the emotion.

Table 3. Detection of the emotion temporal segment: mean error and standard deviation in number of frames (record speed: 25 frames/second)

          Mean error   Standard deviation
Onset     4.5          5.1
Offset    5.5          4.9

5.4 Action Units Segmentation: Comparison with Other Works
To the best of our knowledge, there is no tracking system specifically addressing the problem of emotion temporal segmentation. We decide to compare our work
Fig. 5. Mean error in number of frames for the detection of the AU temporal segment (onset and offset detection, this work vs. Koelstra et al., per AU and on average)
with an appearance-based system to check whether the tracking of facial landmarks is accurate enough to lead to an accurate temporal segmentation. The only results are reported by Valstar & Pantic [16] and Koelstra et al. [17]. In the latter, the AU segments are detected on 264 sequences of the MMI database, and the temporal error (in frames) of the onset and offset is reported for each AU. In the same way, we report results in Fig. 5 for each AU to perform a fair comparison. However, as we do not know which sequences they used for their experiments, a direct comparison is not possible. We can notice that the proposed tracker reaches the same overall accuracy as an appearance-based system. Such results can be obtained only if we can track very subtle facial movements. The worst results are for the upper AUs, particularly AU5 (upper lid raiser), AU6 (cheek raiser) and AU7 (lid tightener), which code the eye movements. These AUs are more visible through appearance features in the higher cheek region (like wrinkles) than through the eyelid motion. So, it is not surprising that the tracker is less accurate on these AUs. Good detections are reached for the lower AUs (AUs 10, 12, 15, 20, 23, 24, 25, 27). Using only the mouth points, we can detect temporal segments more accurately than the state of the art.
6 Conclusion
We have presented in this paper a fully automatic tracking system for 18 facial landmarks. Advanced multi-kernel algorithms are applied in an original way to combine point matching between consecutive images with prior knowledge of the facial landmarks. The system is suited for real-time applications, as we are able to process five frames per second with non-optimized Matlab code. This system, tested on the Cohn-Kanade database, has been able to track facial emotions even with inaccurate facial feature localizations. Its localizations lead
to an emotion recognition performance almost as good as the one achieved with the ground truth. This confirms that our tracker is well suited for facial feature tracking during emotional displays. Successful temporal segmentation of emotions and AUs on the MMI database has been achieved. Experiments show that lower AU temporal segments are detected as well as by state-of-the-art methods. Results for the upper AUs are promising too, but seem to require more than the eyelid movements to be detected accurately. In future work, we will combine the landmark coordinates with the texture around these points to improve the results.
References

1. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In: Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1407, p. 484. Springer, Heidelberg (1998)
2. Al Haj, M., Orozco, J., Gonzalez, J., Villanueva, J.: Automatic face and facial features initialization for robust and accurate tracking. In: ICPR 2008, pp. 1–4 (2008)
3. Zhou, M., Liang, L., Sun, J., Wang, Y., Beijing, C.: AAM based face tracking with temporal matching and face segmentation. In: CVPR 2010, pp. 701–708 (2010)
4. Blanz, V., Vetter, T.: Face recognition based on fitting a 3D morphable model. In: PAMI (2003)
5. Cristinacce, D., Cootes, T.: Feature detection and tracking with constrained local models. In: BMVC 2006, pp. 929–938 (2006)
6. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: IJCAI 1981, vol. 3, pp. 674–679 (1981)
7. Zhu, Z., Ji, Q., Fujimura, K., Lee, K.: Combining Kalman filtering and mean shift for eye tracking under active IR illumination. Pattern Recognition 4 (2002)
8. Tian, Y., Kanade, T., Cohn, J.: Dual-state parametric eye tracking. In: FG 2000 (2000)
9. Tian, Y., Kanade, T., Cohn, J.: Recognizing upper face action units for facial expression analysis. In: CVPR 2000, pp. 294–301 (2000)
10. Kanade, T., Tian, Y., Cohn, J.: Comprehensive database for facial expression analysis. In: FG 2000, p. 46 (2000)
11. Pantic, M., Valstar, M., Rademaker, R., Maat, L.: Web-based database for facial expression analysis. In: ICME 2005, p. 5 (2005)
12. Scholkopf, B., Smola, A.J.: Learning with kernels. MIT Press, Cambridge (2002)
13. Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. JMLR 5, 27 (2004)
14. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: SimpleMKL. JMLR (2008)
15. Rapp, V., Senechal, T., Bailly, K., Prevost, L.: Multiple kernel learning SVM and statistical validation for facial landmark detection. In: FG 2011 (to appear, 2011)
16. Valstar, M., Pantic, M.: Fully automatic facial action unit detection and temporal analysis. In: CVPRW 2006, p. 149. IEEE, Los Alamitos (2006)
17. Koelstra, S., Pantic, M., Patras, I.: A dynamic texture based approach to recognition of facial actions and their temporal models. In: PAMI (2010)
Detection of Human Groups in Videos

Selçuk Sandıkcı, Svitlana Zinger, and Peter H.N. de With

Eindhoven University of Technology, The Netherlands
Abstract. In this paper, we consider the problem of finding and localizing social human groups in videos, which can form a basis for further analysis and monitoring of groups in general. Our approach is motivated by the collective behavior of individuals which has a fundament in sociological studies. We design a detection-based multi-target tracking framework which is capable of handling short-term occlusions and producing stable trajectories. Human groups are discovered by clustering trajectories of individuals in an agglomerative fashion. A novel similarity function related to distances between group members, robustly measures the similarity of noisy trajectories. We have evaluated our approach on several test sequences and achieved acceptable miss rates (19.4%, 29.7% and 46.7%) at reasonable false positive detections per frame (0.129, 0.813 and 0.371). The relatively high miss rates are caused by a strict evaluation procedure, whereas the visual results are quite acceptable.
1 Introduction
Automated detection, monitoring and activity analysis of human groups is highly relevant for security surveillance and relates to crowd modeling and multi-object tracking. Groupings of individuals can be exploited to improve the data association and motion models in multi-object tracking [1], [2]. A simple solution for finding human groups is to utilize foreground segmentation and classify foreground regions into individuals and human groups [3]. However, deciding on the detection of groups/individuals based on a single frame, as in [3], does not fit the group definition proposed by McPhail and Wohlstein [4]. Their paper states that two people belong to the same group if they stay within a proximity of less than 2.13 meters, have the same speed to within 0.15 meters per second, and travel in the same direction to within 3 degrees. Spatial proximities and the spatio-temporal coherence of trajectories are employed to discover small groups in shops and metro stations in [5] and [6], respectively. These studies are only partially consistent with the definition of [4], as they ignore the velocity and direction-of-movement features. Currently, there is an increasing interest in adopting methods from social network analysis in computer vision research. Ge et al. [7] proposed an automatic social grouping algorithm inspired by [4]. They design a bottom-up clustering algorithm that iteratively merges sub-groups with the strongest inter-group closeness. A symmetric Hausdorff distance derived from velocity and spatial distances in the image domain measures the inter-group closeness. Image domain distances
are not invariant under changing perspective, so they may yield inaccuracies in the estimation of the inter-group closeness. In addition, recomputation of the Hausdorff distance is required at each clustering iteration, which can be computationally expensive for large surveillance scenes. Alternatively, as a top-down approach, the Modularity algorithm from [8] is recursively applied in [9] to a graph of persons which is constructed from multi-camera tracking results and a face recognition engine. Eigenvalue decomposition is carried out to divide a group into two subgroups in the Modularity algorithm, which may introduce a computational burden for a large number of small groups. In this paper, we automatically detect and localize groups of people in videos based on the group definition of McPhail and Wohlstein. Our main contribution is that we propose a robust trajectory similarity measure to construct the group structure. We combine positional, velocity and directional similarities of trajectories into a single similarity metric, which fulfills the requirements of the group definition in [4]. All similarity metrics are computed in real-world units to handle videos with strong projective scaling. A computationally inexpensive clustering algorithm is sufficient to discover groups from the robustly constructed group structure. In addition, we design a fully automatic multi-target tracking system to extract trajectories, which combines a state-of-the-art human detector [10] with mean-shift tracking [11]. We compare the quantitative and qualitative performance of our method to earlier work, using real-world datasets of varying difficulty with varying numbers of groups involving at most 10-12 people. The remainder of this paper is organized as follows. In Section 2, we first give an overview of our system and provide the details of the method. Section 3 describes the experiments and the results obtained with our system and compares it to other work. Finally, we conclude the paper and briefly discuss ideas for future work in Section 4.
2 Framework Overview and Group Detection Algorithm
An overview of our system is illustrated in Fig. 1. The input to our system is a video sequence, and detected human groups in the image domain are the outputs. Let us first briefly explain the three principal blocks of our system.
1. Pre-processing. Frames in the input sequence are processed by a human detector, and detections with an inappropriate size are removed using camera calibration information.
2. Multi-target Tracking. Trackers are automatically initialized using the human detection results. We perform explicit inter-object occlusion reasoning to improve the precision of mean-shift tracking during occlusions. Trackers are matched to detections by a data-association block. They are then updated by the matching detections. Finally, a trajectory generation module performs smoothing and motion prediction.
3. Human Group Detection. We construct a trajectory similarity matrix encapsulating the connection strengths between individuals. Clustering of similar trajectories results in a grouping of individuals.
Fig. 1. System overview of the group detection framework
2.1 Pre-processing
We detect humans in each frame to initialize, guide and terminate individual person trackers (in the following abbreviated to "trackers"). We apply two recent object detectors: Histograms of Oriented Gradients (HOG) [12] and Multi-scale Deformable Part Models (DPM) [10]. HOG represents humans by a set of histograms of gradient orientations computed in a dense manner; the detector itself is a classifier which is applied to an image at all locations and scales. DPM represents humans as a combination of a coarse root filter, which is quite similar to HOG, several fine-scale part filters and a deformable spatial model which encapsulates the relative position of each part w.r.t. the root. The major advantage of DPM over HOG is a richer object representation and better robustness to pose variations. Each detection score in HOG is computed as in [12], and for DPM the detection score is found as in [10]. Let us denote each human detection by {BB_d, sc}, where BB_d is the detection bounding box, specified by its center, width and height, and sc is the score of the detection. We employ camera calibration to geometrically remove detections with an inappropriate size. For each detection with a score sc < 1.5, we approximate the ground-plane position by the bottom-middle point of BB_d. At this position, we generate a human hypothesis B̃B_d whose height is 1.80 m and width is 0.6 m in real-world units. Then, we compute the overlap between BB_d and B̃B_d by

o(BB_d, B̃B_d) = area(BB_d ∩ B̃B_d) / area(BB_d ∪ B̃B_d).    (1)
If the overlap is smaller than 0.7, we discard the detection. The threshold value was empirically found. 2.2
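A minimal sketch of this geometric filter. The bounding boxes are assumed to be given as (center_x, center_y, width, height), and project_to_image stands for a hypothetical calibration routine that maps the 1.80 m x 0.6 m human hypothesis at the detection's foot point back to an image box.

def bbox_overlap(a, b):
    """area(a ∩ b) / area(a ∪ b), Eq. (1); boxes are (cx, cy, w, h)."""
    def corners(bb):
        cx, cy, w, h = bb
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    ax1, ay1, ax2, ay2 = corners(a)
    bx1, by1, bx2, by2 = corners(b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def keep_detection(bb_d, score, project_to_image, overlap_thr=0.7):
    """Geometric filtering of weak detections (score < 1.5), Section 2.1."""
    if score >= 1.5:
        return True
    foot = (bb_d[0], bb_d[1] + bb_d[3] / 2.0)          # bottom-middle point
    bb_hyp = project_to_image(foot, height_m=1.80, width_m=0.6)  # hypothetical helper
    return bbox_overlap(bb_d, bb_hyp) >= overlap_thr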
Multi-target Tracking
Track Initialization and Termination. We automatically initialize trackers by checking overlapping detections in consecutive frames. A tracker is initialized with three data components: a bounding box BBtr , a target template patch and a color-based appearance model hm . We terminate a tracker if the average
510
S. Sandıkcı, S. Zinger, and P.H.N. de With
score of detections matched to it or average mean-shift tracking score (i.e. Bhattacharyya coefficient) in the past 25 frames is below 0 and 0.80, respectively, where both values are found after some experiments. Occlusion Reasoning. In typical surveillance videos, humans move on a common ground plane and the camera is located above the human head level. We adopt the inter-object occlusion reasoning method of [13] and its assumptions. We determine inter-object occlusion relationships and construct an occupancy map, by modeling each target by an ellipse tightly fitted to its tracking box BBtr . The visibility information of each human (target) is expressed by a binary mask V . Mean-shift Tracking with Occlusion Handling. We employ the mean-shift algorithm [11] to track the visible part of the target obtained from the visibility mask V . Our motivation is to reduce temporary drifts during occlusions, which may cause tracking failures. The target appearance is modeled with an RGB color histogram hm , which is extracted from the visible part of the target template. The Epanechnikov kernel is used in mean-shift tracking. Data Association. We perform a one-to-one assignment between detections i and trackers. Given the mean-shift tracking results {BBtr }i=1,...,M , correspondj j ing to M targets and N detection results {BBd , sc }j=1,...,N , we first construct a matching cost matrix SM ×N using spatial proximity and appearance similari ity as metrics. Spatial proximity is measured with the overlap o(BBtr , BBdj ) as given in Eq. (1), and the appearance similarity is the Bhattacharyya coefficient j i ρ(htr , hd ) = k hitr (k) · hjd (k) (see [11]) between the color histograms hitr and i hjd , extracted from the regions defined by BBtr and BBdj . The matching cost j i Sij between BBtr and BBd is defined by i Sij = − loge ρ(hitr , hjd ) · o(BBtr , BBdj ) . (2)
Then, the Hungarian algorithm is applied to S to extract tracker-detection pairs which have the minimum total matching cost. i Let BBdj and BBtr be a matching detection-tracker pair. We linearly blend j i BBd and BBtr using a weighted scheme to obtain the detection-based tracking output BBo , which is specified by i i i BBo = wdj ∗ BBdj + wtr ∗ BBtr / wdj + wtr , (3) i where we use blending weights wdj = max(1, scj )·ρ(hjd , hm ) and wtr = ρ(hitr , hm ). These blending weights are proportional to the Bhattacharyya coefficients between the histogram hitr and target appearance model hm , and between the histogram hjd and target appearance model hm . We give more weight to BBdj , if its detection score is larger than unity, indicating a strong detection. We also update the target template patch and appearance model hm , if the matched detection has a detection score higher than unity.
Trajectory Generation. The detection and tracking described above cannot provide smooth and accurate trajectories. Therefore, we need a trajectory smoothing and refining technique. The Double-Exponential Smoothing (DES) filter is reported to have a predictive performance equivalent to the Kalman filter while executing faster [14]. We adopt the DES filter to smooth and refine the center, width and height of BB_o. We gradually increase the degree of smoothing as the smoothed estimates stabilize.
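A sketch of double-exponential smoothing of the tracking output (cx, cy, w, h). Holt's linear-trend formulation is used here for concreteness; the exact variant and smoothing schedule of [14] and of the paper (which gradually increases the degree of smoothing) may differ.

import numpy as np

class DESFilter:
    """Double-exponential smoothing of a bounding-box state (cx, cy, w, h)."""

    def __init__(self, alpha=0.5, trend=0.5):
        self.alpha, self.trend = alpha, trend
        self.s = None      # smoothed state
        self.b = None      # trend (per-frame change)

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if self.s is None:
            self.s, self.b = x.copy(), np.zeros_like(x)
            return self.s
        prev = self.s
        self.s = self.alpha * x + (1 - self.alpha) * (prev + self.b)
        self.b = self.trend * (self.s - prev) + (1 - self.trend) * self.b
        return self.s

    def predict(self, k=1):
        """k-frame-ahead prediction, usable to place the next search region."""
        return self.s + k * self.b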
2.3 Human Group Detection Algorithm
We first project the image-domain trajectories onto the ground plane. In order to detect groups at a given time t_i, we construct a trajectory similarity matrix A_{N×N}, where N is the number of actively tracked humans at time t_i. Then, an agglomerative clustering algorithm is applied to detect groups based on the trajectory similarities.

Homography Mapping. The ground-plane positions of individuals in the image domain are assumed to be the bottom-middle points of their smoothed tracking bounding boxes. We use camera calibration information for the projection [15].

Trajectory Similarity. Given N trajectories at a reference time-discrete moment t_0, we define a temporal interval T = [t_0 − w/2, t_0 + w/2] of length (w + 1) frames and use the trajectory samples falling into the interval T to construct the trajectory similarity matrix A. Let us denote the set P_n of samples involved with the n-th trajectory in the interval T by

P_n = {p_n(t_1), ..., p_n(t_i)},   ∀ t_i ∈ T, 1 ≤ i ≤ w + 1,    (4)

where p_n(t_i) is the ground-plane location of the n-th human at time t_i. The trajectory similarity matrix A is constructed by combining individual similarity matrices for positional, velocity and directional similarities, A^p, A^v and A^d. Given a pair of sets of trajectory samples, P_n and P_m, let T_OV be the temporal overlap between the trajectory sample sets P_n and P_m. This overlapping interval lies within our reference time interval T, hence T_OV ⊂ T. The overlapping sample moments are indexed with j, where t_0 − w/2 ≤ j ≤ t_0 + w/2, and equality holds for maximum overlap. The positional trajectory similarity is measured by

A^p(m, n) = \exp( 1 − \sum_{∀j} ||p_n(t_j) − p_m(t_j)|| / (|T_OV| C^p) ),   t_j ∈ T_OV,    (5)

where C^p is a scale factor for tuning the average positional distance. We compute the average positional distance between P_n and P_m over T_OV and divide it by C^p. An average positional distance equal to C^p results in a positional similarity of unity. The value of C^p has to be selected according to the spatial proximity value stated in [4]. Averaging the positional distance over time increases the robustness to noisy trajectories. The positional similarity decays exponentially. Individuals walking closer to each other therefore yield a higher positional
Similarly, we quantify the velocity similarity with a matrix A^v(m, n), which is specified by

    A^v(m, n) = exp( 1 − ||v_n − v_m|| / C^v ),   (6)

where C^v is a scale factor for scaling the velocity difference and v_n is the average velocity within P_n, defined as in [16]. Similar to the positional similarity case, C^v has to be set with respect to the speed difference value given in [4]. Humans walking with similar velocities will lead to a high velocity similarity. Finally, we quantify the directional similarity between P_n and P_m by the inner product of their heading vectors [16],

    A^d(m, n) = ψ_m • ψ_n,   (7)

where ψ_n and ψ_m are unit-length heading vectors indicating the direction of the movement. The direction estimation for stationary humans may have significant fluctuations due to tracking inaccuracies. Therefore, we discard the directional similarity for humans whose velocities are below a threshold. To this end, the combined trajectory similarity is defined as the element-wise product of the individual similarity matrices discussed above, which results in

    A(m, n) = A^p(m, n) · A^v(m, n) · A^d(m, n)   if v_n > C^v and v_m > C^v,
    A(m, n) = A^p(m, n) · A^v(m, n)               otherwise.   (8)

Trajectory Clustering. We employ a simple agglomerative clustering algorithm [17] to extract groups from the similarity matrix A. We choose agglomerative clustering because it is simple and does not require the number of clusters as an input. Initially, every individual is considered a separate cluster. We then hierarchically construct larger groups at each clustering iteration by merging the two clusters whose inter-cluster similarity is the largest among all clusters and also larger than a predefined threshold. The inter-cluster similarity between two groups is defined as the largest trajectory similarity among all trajectory similarities between the members of the two groups, so that it can be found by max_{m ∈ Group1, n ∈ Group2} A(m, n), where Group1 and Group2 are subsets of indexes defined by the clustering algorithm. Although various metrics are possible, we take the maximum as the inter-cluster similarity because of its simplicity.
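As an illustration of Eqs. (5)-(8) and the clustering step, the following sketch assumes that the trajectories have already been projected onto the ground plane and resampled on the common overlap interval T_OV, that velocities are average 2-D velocity vectors, and that headings are unit vectors; it is a simplified sketch, not the authors' implementation.

    import numpy as np

    def trajectory_similarity(positions, velocities, headings, Cp, Cv):
        # positions[n]: (L, 2) array of ground-plane samples of trajectory n over T_OV
        # velocities[n]: average 2-D velocity vector; headings[n]: unit heading vector.
        N = len(positions)
        A = np.zeros((N, N))
        for m in range(N):
            for n in range(N):
                d = np.linalg.norm(positions[m] - positions[n], axis=1).mean()
                Ap = np.exp(1.0 - d / Cp)                                              # Eq. (5)
                Av = np.exp(1.0 - np.linalg.norm(velocities[m] - velocities[n]) / Cv)  # Eq. (6)
                A[m, n] = Ap * Av
                if np.linalg.norm(velocities[m]) > Cv and np.linalg.norm(velocities[n]) > Cv:
                    A[m, n] *= float(headings[m] @ headings[n])                        # Eqs. (7)-(8)
        return A

    def detect_groups(A, threshold):
        # Agglomerative clustering; the inter-cluster similarity is the maximum
        # trajectory similarity between members of the two clusters.
        clusters = [{i} for i in range(len(A))]
        while len(clusters) > 1:
            best, pair = -np.inf, None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    s = max(A[m, n] for m in clusters[a] for n in clusters[b])
                    if s > best:
                        best, pair = s, (a, b)
            if best <= threshold:
                break
            a, b = pair
            clusters[a] |= clusters[b]
            del clusters[b]
        return clusters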
3 Experiments and Results
We have evaluated our approach with three sequences from publicly available datasets (PETS 2004 [18], PETS 2006 [19] and PETS 2009 [20]). PETS 2004 and PETS 2006 sequences involve small groups of 2-3 people, and the PETS 2009 sequence presents a crowded scene containing groups of 10-12 people. We have manually annotated the human groups by bounding boxes in each frame. The annotation is performed by one person and verified by two other persons.
Table 1. CLEAR MOT results of our tracking system using the DPM and HOG detectors and BPF on the PETS 2006 S7-T6-B sequence

    Tracking Method   Prec.   Accur.  F.Neg   F.Pos   ID Sw.
    Ours with DPM     79.6%   88.4%   7.2%    4.3%    7
    Ours with HOG     75.0%   64.7%   14.2%   21.0%   8
    BPF with DPM      78.5%   88.7%   7.0%    4.2%    21
We perform experiments with both the HOG and DPM detectors, where our implementation of HOG is trained on the INRIA Person Dataset [12]. For DPM, the publicly available implementation of [10] and the human model trained on the INRIA Person Dataset are employed. We have selected detections whose score is larger than −0.5 in order to obtain a high recall. The tracking framework, as described in Section 2.2, was applied to the test sequences.

Experiments with Tracking of Humans. Sample tracking results using DPM are shown in Fig. 2. In the PETS 2004 sequence, targets close to the camera are successfully tracked, whereas targets with small sizes are challenging for both the detector and the tracker. All of the targets in the PETS 2006 sequence are detected and tracked successfully. In the PETS 2009 sequence, we can detect and track targets experiencing light occlusions, but heavily occluded targets are not detected or suffer from fragmented trajectories. We have quantitatively compared our tracking framework with both the DPM and HOG detectors. Furthermore, we also compare these combinations with DPM combined with a publicly available tracking algorithm [21] which uses Boosted Particle Filters (BPF). We have used the PETS 2006 sequence for this comparison. To improve its performance, we have modified the BPF-based combination such that it does not terminate its tracking operation during occlusions (which occurs in the original implementation). The CLEAR MOT scores [22] of the tested combinations are provided in Table 1. All of the combinations have similar precision scores. The DPM-based tracking achieves higher accuracy than the HOG-based tracking due to more sophisticated human detection. Despite our simpler tracking algorithm, our DPM-based tracking and the BPF-DPM combination perform almost identically. Furthermore, we produce significantly fewer ID switches due to our more robust data association and precise tracker initialization and termination rules. Our tracking method has a lower computational complexity than the BPF-DPM combination, since mean-shift tracking is employed instead of particle filters.

Experiments with Human Group Detection. We have detected human groups by applying our human group discovery method described in Section 2.3 to the DPM-based tracking results. The total test set involves three different PETS datasets with different environments and cameras. For this reason, we have optimized the tuning parameters (C^p, C^v) related to trajectory similarity for each dataset. We have found the following optimal settings: (C^p, C^v) = (1250, 6.25) for the PETS 2004 sequence, (C^p, C^v) = (2000, 10) for the PETS
Fig. 2. Sample tracking results using the DPM human detector on PETS 2004 (left), PETS 2006 (center) and PETS 2009 (right)
Fig. 3. Quantitative evaluation of our group detection framework. The curves in (a) and (b) are combination plots of (FPPF, miss rate) operation points as a function of the clustering threshold, which varies with steps of 0.05 in the specified intervals. (a): Group detection performance on the PETS 2004 sequence (clustering threshold ∈ [0.05, 1] and w = 100), the PETS 2006 sequence (clustering threshold ∈ [0.3, 1.5] and w = 100) and the PETS 2009 sequence (clustering threshold ∈ [0.05, 1] and w = 40); (b): The influence of window size w on the group detection performance for the PETS 2006 sequence.
2006 sequence, and (C^p, C^v) = (1000, 14) for the PETS 2009 sequence. These (C^p, C^v) values correspond to (1.25 m, 0.15 m/s), (2 m, 0.25 m/s) and (1 m, 0.1 m/s) respectively, which are comparable to the values (2.13 m, 0.15 m/s) given by McPhail and Wohlstein [4]. The differences are caused by local optimizations only. For the reference combination, using BPF with DPM, we have used the same settings as for the combination of DPM and our tracking. Examples of group detections are shown in Fig. 4. The visual results on the PETS 2004 sequence (upper row of sub-figures) show that if the humans in the groups become too small, the detection process deteriorates. The results for the PETS 2006 sequence (in the middle row) are convincing, while in the bottom row (the PETS 2009 sequence) groups are initially detected well, but after merging, a double sub-group detection occurs, even for a sequence of frames.
We have applied strict rules for defining correct detections. A detected group is counted as a true detection only if its detection bounding box BB_dt significantly overlaps with the ground-truth bounding box BB_gt, i.e., according to Eq. (1), the overlap o(BB_dt, BB_gt) > 0.75. This value is somewhat arbitrarily chosen, provided that sufficient overlap is achieved for group detection. Insufficiently overlapping detections, such as over-divided groups or partially detected groups, give a penalty to both the false positives and the miss rate. This explains why relatively high miss rates can still be called acceptable for a running video sequence. The overall performance of the system is measured by evaluating the miss rate and the number of False Positive detections Per Frame (FPPF). The miss rate is the ratio between the total number of missed groups and the total number of true groups throughout the sequence. We explore the influence of changing the window size w and the clustering threshold on the group detection performance. These influences are shown by the curves for the PETS 2006 sequence in Fig. 3(b). The miss rate curves shift downwards (a lower miss rate means better performance) with increasing w for the PETS 2006 sequence (this is similar for the PETS 2004 sequence). For the PETS 2009 sequence, the operating frame rate is not 25 Hz (for which w = 100 is chosen), but 7 Hz, so that w = 40 is a better choice. Let us now discuss the miss rate results for all sequences indicated in Fig. 3(a). The high miss rate (46.7%) for the PETS 2004 sequence is due to noisy and fragmented trajectories of small-sized humans. Summarizing, the video quality of PETS 2004 is poor, which gives a negative performance offset in the miss rate. In the PETS 2009 sequence, groups are over-divided into smaller groups with increasing clustering threshold, resulting in a steep increase in both the miss rate and the FPPF. The strong variation in the curves indicates that the performance of our system depends heavily on the scene contents, and that tuning of the parameters is certainly required.

Experiments with Human Group Detection: Influence of Tracking on Performance. We measure the group detection performance using different trajectory inputs to assess the effect of the tracking (see Table 2 for a comparison on the PETS 2006 sequence). Trajectories produced by our tracking framework with the DPM and HOG detectors and by the BPF-DPM combination are supplied to our human group discovery method of Section 2.3. We have set the parameters of the human group discovery module to (C^p, C^v) = (2000, 10) for all trajectory inputs. In fact, this parameter setting results in the lowest miss rate and FPPF for all tested tracking methods. The DPM-based tracking achieves the lowest miss rate and FPPF value. The high false negative and false positive rates of the HOG-based tracking are directly translated into a higher miss rate and FPPF value. The higher number of ID switches of the BPF-DPM combination results in a higher miss rate.

Experiments with Human Group Detection: Comparison with Earlier Work. We implement the distance metrics and the clustering algorithm of [7] and apply them to our DPM-based trajectories. We call this approach hierarchical
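A possible implementation of this evaluation protocol is sketched below; the greedy matching between detected and ground-truth group boxes and the IoU-style overlap are simplifications, not the authors' exact procedure.

    def iou(a, b):
        # a, b: boxes as (x, y, w, h); IoU stands in for the overlap of Eq. (1).
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2 = min(a[0] + a[2], b[0] + b[2])
        y2 = min(a[1] + a[3], b[1] + b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-12)

    def miss_rate_and_fppf(detections, ground_truth, min_overlap=0.75):
        # detections, ground_truth: per-frame lists of group bounding boxes.
        missed = false_pos = total_gt = 0
        for dets, gts in zip(detections, ground_truth):
            total_gt += len(gts)
            matched = set()
            for d in dets:
                hits = [k for k, g in enumerate(gts)
                        if k not in matched and iou(d, g) > min_overlap]
                if hits:
                    matched.add(hits[0])
                else:
                    false_pos += 1
            missed += len(gts) - len(matched)
        return missed / max(total_gt, 1), false_pos / max(len(detections), 1)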
Fig. 4. Groups detected by our approach on the PETS 2004 ThreePastShop2cor sequence (first row), the PETS 2006 S7-T6-B View 4 sequence (second row) and the PETS 2009 S1-L1 Time 13-57 View 1 sequence (third row).

Table 2. Group detection results of different tracking methods on the PETS 2006 sequence

    Tracking Method   Miss rate   FPPF
    Ours with DPM     19.4%       0.129
    Ours with HOG     36.3%       0.261
    BPF with DPM      30.5%       0.130
Clustering based on the symmetric Hausdorff Distance (CHD). The key parameters τ_s and τ_v from [7] are tuned towards optimal group detection performance for each individual test set, as done above. We found that the best settings are (τ_s, τ_v) = (30, 4.5) for the PETS 2004 sequence, (τ_s, τ_v) = (80, 1) for the PETS 2006 sequence and (τ_s, τ_v) = (175, 10) for the PETS 2009 sequence. The optimal values of w for CHD are found to be w = 100 for the PETS 2004 sequence, w = 74 for the PETS 2006 sequence and w = 40 for the PETS 2009 sequence. Table 3 presents the best miss rate and FPPF values for the CHD framework and our group detection framework. Our group detection method consistently achieves very similar or lower miss rates and FPPF values compared to the CHD, demonstrating the robustness of our trajectory similarity metric. Our method is more robust than the CHD to strong projective scaling in the image domain, which occurs in
Table 3. Comparison of our human group detection method with group detection based on the Hausdorff Distance (CHD)

    Dataset             Miss rate   FPPF
    PETS 2004 (Ours)    46.7%       0.371
    PETS 2004 (CHD)     66.3%       1.106
    PETS 2006 (Ours)    19.4%       0.129
    PETS 2006 (CHD)     19.1%       0.620
    PETS 2009 (Ours)    29.7%       0.813
    PETS 2009 (CHD)     26.6%       0.811
the PETS 2006 and the PETS 2004 sequences, since we use real-world measurements. Both methods have reasonably low miss rates on the PETS 2006 sequence, since the extracted trajectories are accurate and continuous.
4 Conclusions and Future Work
This paper describes a fully automatic approach for detecting and localizing human groups in videos. The framework can serve as a prior step for high-level group analysis. Our proposed algorithm involves tracking individual humans and combining their trajectories if they are similar in position, velocity and direction. The choice of these features is motivated by sociological studies and appears to be robust for group detection. The decision on defining a group is taken after considering a time interval of the individual tracks. The group detection is based on bundling groups of similar tracks. The similarity is based on defining similarity matrices for the above features and combining those matrices into one overall similarity matrix. We have found that extracting accurate and continuous trajectories is the key point for successful detection of human groups. The results for a crowded scene are lower, because trajectories are more fragmented and noisy. We achieve reasonable miss rates (19.4%, 29.7% and 46.7%) at acceptable numbers of false positive detections per frame (FPPF) for challenging video sequences. It was found that individual human detection and the associated tracking can be based on a relatively simple mean-shift tracking algorithm. We plan to improve the accuracy and stability of the multi-target tracking part of our approach by considering social interactions (e.g. collision avoidance) in the motion prediction and by replacing the features used in mean-shift tracking with more robust appearance models, such as HOG-based models and region covariances.
References 1. Choi, W., Savarese, S.: Multiple target tracking in world coordinate with single, minimally calibrated camera. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 553–567. Springer, Heidelberg (2010) 2. French, A.P., Naeem, A., Dryden, I.L., Pridmore, T.P.: Using social effects to guide tracking in complex scenes. In: Advanced Video and Signal Based Surveillance, pp. 212–217 (2007)
3. Kilambi, P., Ribnick, E., Joshi, A.J., Masoud, O., Papanikolopoulos, N.: Estimating pedestrian counts in groups. Computer Vision and Image Understanding 110(1), 43–59 (2008) 4. McPhail, C., Wohlstein, R.: Using film to analyze pedestrian behavior. Sociological Methods and Research 10(3), 347–375 (1982) 5. Haritaoglu, I., Flickner, M.: Detection and tracking of shopping groups in stores. In: IEEE CVPR, vol. 1, p. 431 (2001) 6. Cupillard, F., Br´emond, F., Thonnat, M.: Tracking group of people for video surveillance. In: Proc. of the 2nd European Workshop on Advanced Video-Based Surveillance System (2001) 7. Ge, W., Collins, R.T., Ruback, B.: Automatically detecting the small group structure of a crowd. In: IEEE Workshop on Applications of Computer Vision (2009) 8. Newman, M.: Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74(3) (2006) 9. Yu, T., Lim, S.-N., Patwardhan, K., Krahnstoever, N.: Monitoring, recognizing and discovering social networks. In: IEEE CVPR, pp. 1462–1469 (2009) 10. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. on Pattern Analysis and Machine Intell. 32(9), 1627–1645 (2010) 11. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. on Pattern Analysis and Machine Intell. 35(5), 564–575 (2003) 12. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE CVPR, vol. 1(2), pp. 886–893 (2005) 13. Wu, B., Nevatia, R.: Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors. International Journal of Computer Vision 75(2), 247–266 (2007) 14. Laviola, J.: An experiment comparing double exponential smoothing and Kalman filter-based predictive tracking algorithms. In: Proc. IEEE Int. Conf. Virtual Reality, pp. 283–284 (2003) 15. Hartley, R.I., Zisserman, A.: Projective transformations. In: Multiple View Geometry in Computer Vision, 2nd edn., pp. 32–37. Cambridge University Press, Cambridge (2004) 16. Blunsden, S., Fisher, R.B.: The BEHAVE video dataset: ground truthed video for multi-person behavior classification. Annals of the BMVA, 1–12 (2010) 17. Duda, R.O., Hart, P.E., Stork, D.G.: Hierarchical clustering. In: Pattern Classification, 2nd edn., pp. 550–556. John Wiley and Sons, Inc., Chichester (2001) 18. http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/ 19. http://www.cvg.rdg.ac.uk/PETS2006/index.html 20. http://www.pets2009.net 21. Okuma, K., Taleghani, A., de Freitas, N., Little, J.J., Lowe, D.G.: A boosted particle filter: Multitarget detection and tracking. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 28–39. Springer, Heidelberg (2004) 22. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP J. on Image and Video Processing 2008, 1–10 (2008)
Estimation of Human Orientation in Images Captured with a Range Camera

Sébastien Piérard, Damien Leroy, Jean-Frédéric Hansen, and Marc Van Droogenbroeck

INTELSIG Laboratory, Montefiore Institute, University of Liège, Belgium
Abstract. Estimating the orientation of the observed person is a crucial task for some application fields like home entertainment, man-machine interaction, or intelligent vehicles. In this paper, we discuss the usefulness of conventional cameras for estimating the orientation, present some limitations, and show that 3D information improves the estimation performance. Technically, the orientation estimation is solved in terms of a regression problem and supervised learning. This approach, combined with a slicing method of the 3D volume, provides mean errors as low as 9.2° or 4.3° depending on the set of considered poses. These results are consistent with those reported in the literature. However, our technique is faster and easier to implement than existing ones.
1 Introduction
The real-time interpretation of video scenes is a crucial task for a large variety of applications. As most scenes of interest contain people, analyzing their behavior is essential. It is a challenge because humans can take a wide variety of poses and appearances. In this paper, we deal with the problem of determining the orientation of persons observed laterally by a single camera (in the following, we use the term side view). To decrease the sensitivity to appearance, we propose to rely on geometrical information rather than on colors or textures. There are many applications of the estimation of the orientation of the person in front of the camera: estimating the visual focus of attention for marketing strategies and effective advertisement methods [11], clothes-shopping [17], intelligent vehicles [5,6], perceptual interfaces, facilitating pose recovery [8], etc. In most applications, it is preferable to observe the scene from a side view. Indeed it is not always possible to place a camera above the observed person. Most ceilings are not high enough to place a camera above the scene and to observe a wide area. The use of fisheye lenses raises a lot of difficulties as silhouettes then depend on the precise location of a person inside the field of view. Moreover, in the context of home entertainment applications, most existing applications (such as games) already require a camera located on top of or below the screen. Therefore, in this paper, we consider a single camera that provides a side view.
This paper explains that there is an intrinsic limitation when using a color camera. Therefore, we consider the use of a range camera, and we infer the orientation from a silhouette annotated with a depth map. We found that depth is an appropriate clue to estimate the orientation. The estimation is expressed in terms of regression and supervised learning. Our technique is fast, easy to implement, and produces results competitive with the known methods. The paper is organized as follows. The remainder of this introduction describes related works, defines the concept of orientation, and elaborates on how we evaluate methods estimating the orientation. In Section 2, we discuss the intrinsic limitation of a color camera. Section 3 details our method based on a range camera. In Section 4, we describe our experiments and comment on the results. Finally, Section 5 concludes the paper. 1.1
Related Works
Existing methods that estimate the orientation differ in several aspects: number of cameras and viewpoints, type of the input (image or segmentation mask), and type of the output (discrete or continuous, i.e. classification or regression). As explained hereinbefore, it is preferable to use a side view and to rely on geometric information only. Accordingly, we limit this review of the literature to methods satisfying these requirements. Using a single silhouette as input, Agarwal et al. [1] encode the silhouette with histogram-of-shape-contexts descriptors [3], and evaluate three different regression methods. They obtained a mean error of 17°, which is not accurate enough to meet the requirements of most applications. We will explain, in Section 2, that there is an intrinsic limitation when only a single side view silhouette is used. To get more discriminant information, some authors [4,9] consider the dynamics and assume that the orientation is given by the walking direction. This direction is estimated on the basis of the temporal evolution of the foreground blob location and size. Therefore, those methods require that the person is in motion to determine his orientation. Obviously, there is no way to ensure it. Multiple simultaneous silhouettes can also be used to improve the orientation estimation. Rybok et al. [14] establish that the use of several silhouettes leads to better results. They consider shape contexts to describe each silhouette separately and combine individual results with a Bayesian filter framework. Peng et al. [12] use two orthogonal views. The silhouettes are extracted from both views, and processed simultaneously. The decomposition of a tensor is used to learn a 1D manifold. Then, a nonlinear least square technique provides an estimate of the orientation. Gond et al. [8] use the 3D visual hull to recover the orientation. A voxel-based Shape-From-Silhouettes method is used to recover the 3D visual hull. All these methods require the use of multiple sensors, which is inconvenient. 1.2
Our Definition of the Orientation
The cornerstone observation for orientation estimation is that orientation is independent of the pose: the orientation is related to the coordinate system of the
scene, whereas the pose is specific to the body shape. To achieve this independence, it is convenient to define the orientation of a person with respect to the orientation of one rigid body part. In this paper, we use the orientation of the pelvis. The orientation θ = 0° corresponds to the person facing the camera (see Figure 1).
Fig. 1. Defining the orientation of a person requires the choice of a body part. In this paper, we use the orientation of the pelvis. This figure depicts three examples of configurations corresponding to an orientation θ = 0°. Note that the positions of the feet, arms, and head are not taken into account to define the orientation.
According to our definition, evaluating the orientation of the pelvis is sufficient to estimate the orientation of the observed person. But, evaluating the orientation of the pelvis is not a trivial task, even if a range camera is used. As a matter of fact, one would have first to locate the pelvis in the image, and then to estimate its orientation from a small number of pixels. Therefore, we need to get information from more body parts. Unfortunately, it is still an open question to determine the set of body parts that can be used as clues for the orientation. Therefore, we simply use the whole silhouette. 1.3
Criterion Used for Evaluation
The criterion we use to evaluate our method is the mean error on the angle. The error Δθ is defined as the smallest rotation between the true orientation θ and the estimated orientation θ̂. The angle Δθ is insensitive to the rotation direction and, therefore, Δθ ∈ [0°, 180°]. Δθ is the angle between the two vectors (cos θ, sin θ) and (cos θ̂, sin θ̂). If • denotes the scalar product,

    cos Δθ = (cos θ, sin θ) • (cos θ̂, sin θ̂) = cos(θ̂ − θ).   (1)

Since the error is comprised within Δθ ∈ [0°, 180°], it is computed as

    Δθ = cos^{-1}( cos(θ̂ − θ) ).   (2)
In this paper, we present results as the mean error made by the orientation estimator.
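For completeness, a small helper computing the mean of Δθ over a set of test samples (a sketch, not the authors' evaluation code):

    import numpy as np

    def mean_angular_error(theta_true_deg, theta_est_deg):
        # Implements Eqs. (1)-(2): the error is the smallest rotation between the
        # true and estimated orientations, and therefore lies in [0°, 180°].
        diff = np.radians(np.asarray(theta_est_deg) - np.asarray(theta_true_deg))
        err = np.degrees(np.arccos(np.clip(np.cos(diff), -1.0, 1.0)))
        return err.mean()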
2 The Intrinsic Limitation of a Color Camera
With a color camera, it would be mandatory to decrease the sensitivity to appearance by relying on shapes instead of colors or textures. The purpose of background subtraction algorithms is precisely to detect silhouettes by determining objects in motion. Edge detection techniques are less reliable as the detected edges relate to colors and textures. Therefore, the most reliable (which does not mean that it is robust!) information that can be used with a color camera is the silhouette. Most authors share this point of view. But, there is not enough information in a silhouette to recover the orientation of the observed person, as illustrated in Figure 2.
Fig. 2. There is not enough information in a sole silhouette to recover the orientation. With a side view, the intrinsic limitation is the possible confusion between θ and 180° − θ. The two configurations depicted on the right hand side correspond to θ = 30° and θ = 150°, with two mirror poses.
If several authors wrongly believe that a 180° ambiguity is inherent [12], Figure 2 clearly establishes that the intrinsic limitation is not to confuse the orientations θ and θ +180°, but to confuse θ and 180°−θ. In mathematical words, it is only possible to estimate sin θ, not cos θ. This can be proved if one assumes that (1) the rotation axis of the observed person is parallel to the image plane (that is, we have a side view) and the projection is orthographic (the demonstration of this property is beyond the scope of this paper). In other words, the perspective effects should be negligible. Given the intrinsic limitation previously mentioned, the best that an estimator can do is to choose randomly between θˆ = θ and θˆ = 180° − θ. Assuming that all orientations are equally likely, the expected mean error is 45° for the optimal estimator. However, in practice, the mean error is lower than 45°, but still substantial. As shown in our experimental results (see Section 4.1), this is because the perspective effects become significant (and thus a source of information) when the observed person is getting closer to the camera. For example, it is easier to distinguish the direction of the feet when the camera is close enough to the observed person, because they are viewed from above. Indeed, when the
person stands at 3 meters from the camera (which is a typical distance for home entertainment applications), a vertical opening angle of 37° is required. Clearly, the assumption of a near orthographic projection is invalid. From the above discussion, it follows that there are apparently four ways to address the problem of estimating the orientation.
1. If we could assume that the angle always lies in the [−90°, 90°] interval, then using a sole silhouette would permit recovering the orientation. But this assumption does not hold in most applications.
2. If we could place two cameras to get orthogonal views, it would be possible to estimate sin(θ) and cos(θ) independently from those two views, and to recover the orientation in a simple post-processing step. However, the use of two cameras is a huge constraint because it is not convenient.
3. The use of perspective effects to overcome the intrinsic limitation could be considered. However, perspective effects most often consist in small details of the silhouettes, and it does not seem a good idea to rely on small details because noise could ruin them in a real application.
4. Another possible solution to the underdetermination is to use a range camera. It is a reliable way to get more geometric information with a single sensor. This is the approach followed in this paper. It should be noted that some manufacturers have recently developed cheap range cameras for the general public (see Microsoft's Kinect).
3 Our Method

3.1 Data
We found it inappropriate and intractable to use real data for training the orientation estimator. Hand-labeling silhouettes with the orientation ground-truth is an error prone procedure (it is difficult to obtain an uncertainty less than 15° [17]). An alternative is to use motion capture to get the ground-truth. However, it is easy to forget a whole set of interesting poses, leading to insufficiently diversified databases. Moreover, using a motion capture system (and thus sequences) has the drawback to statistically link the orientation with the pose. Therefore, our experiments are based on synthetic data instead of real data. In order to produce synthetic data, we used the avatar provided with the open source software MakeHuman [2,15] (version 0.9). The virtual camera (a pinhole camera without any lens distortion) looks towards the avatar, and is placed approximately at the pelvis height. For each shooting, a realistic pose is chosen [13], and the orientation is drawn randomly. We created two different sets of 20, 000 human silhouettes annotated with depth: one set with a high pose variability and the other one with silhouettes closer to the ones of a walker. They correspond to the sets B and C of [13] and are shown in Figure 3. Each of these sets is divided into two parts: a learning set and a test set.
Fig. 3. Human synthetic silhouettes annotated with depth, with a weakly constrained set of poses (upper row) and a strongly constrained set of poses (lower row)
3.2 Silhouette Description
In order to use machine learning algorithms, silhouettes have to be summarized in a fixed amount of information called attributes, which can be numerical or symbolic. Therefore, we need to describe the silhouettes annotated with depth. We want the descriptors to be insensitive to uniform scaling, to translations, and to small rotations (this guarantees that the results remain identical even if the camera used is slightly tilted). The most common way to achieve this insensitivity consists in applying a normalization in a pre-processing step: input silhouettes are translated, rescaled, and rotated before computing their attributes. The normalization runs as follows. We use the centroid for translation, a size measure (such as the square root of the silhouette area) for scaling, and the direction of the first principal component (PCA) for rotation. Note that the depth information is not considered here. Once the pre-processing step has been applied, we still have to describe a silhouette annotated with depth. Most shape descriptors proposed in the literature are only applicable to binary silhouettes. Consequently, we extract a set of binary silhouettes from the annotated silhouette. Each silhouette is further described separately, and all the descriptors are put together. The binary silhouettes (hereafter named slices) are obtained by thresholding the depth map. Let S be the number of slices, s ∈ {1, 2, . . . , S} the slice index, and th(s) the threshold for the slice s. There are several ways to obtain S slices with thresholds. In this paper, we compare the results obtained with slicing methods based on the surface, on moments, and on the extrema. Note that all our slicing methods produce ordered silhouettes, that is, if s ≤ r are two indices, then the s slice is included in the r slice. Slicing based on surface. Slicing based on the surface produces silhouettes whose surface increases linearly. th(s) is such that the ratio between the area of the s slice and the binary silhouette is approximately equal to s/S. Figure 4 presents the result for 5 slices.
Fig. 4. Binary silhouettes produced by a surface-based slicing with 5 slices (covering approximately 20%, 40%, 60%, 80% and 100% of the silhouette area)
Slicing based on moments. Let μ denote the mean of the depth, and σ the standard deviation of the depth. Then th(s) = μ − 2σ + 4σ·(s − 1)/(S − 1).

Slicing based on extrema. Let m and M denote, respectively, the minimum and the maximum values of the depth. Then th(s) = m + (s/S)·(M − m).

It should be noted that the number of slices has to be kept small. Indeed, increasing the number of slices involves a higher computational cost, and a larger set of attributes which may be difficult to manage for regression algorithms. After the slicing process, each slice is described separately, and all the descriptors are put together. A wide variety of shape descriptors have been proposed for several decades [10,16]. In a preliminary work, we have derived the orientation on the basis of a single binary silhouette (in the [−90°, 90°] interval). It can be proved that the descriptor should distinguish between mirrored images. Therefore, the descriptors insensitive to similarity transformations are not suited for our needs. Several shape descriptors were evaluated, and we have found that the use of the Radon transform or the use of one shape context offer the best performances. These descriptors are briefly described hereafter.

Descriptors based on the Radon transform. We have used a subset of the values calculated by a Radon transform as attributes. The Radon transform consists in integrating the silhouettes over straight lines. We have used 4 line directions, and 100 line positions for a given direction.

Descriptors based on the shape context. Shape contexts have been introduced by Belongie et al. [3] as a means to describe a pixel by the location of the surrounding contours. A shape context is a log-polar histogram. In our implementation, we have a single shape context centered on the gravity center (of the binary silhouette) which is populated by all external and internal contours. We have used a shape context with 5 radial bins and sectors of 30 degrees (as in [3]).
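The three slicing strategies can be sketched as follows; the assumption that a slice keeps the silhouette pixels whose depth does not exceed th(s) is ours, and the moments-based variant is undefined for a single slice, which is consistent with the missing entries in Tables 1 and 2.

    import numpy as np

    def slice_thresholds(depth, mask, S, method="surface"):
        # depth: depth map, mask: binary silhouette; returns th(1), ..., th(S).
        d = depth[mask > 0]
        if method == "surface":
            # th(s) such that roughly s/S of the silhouette area is kept.
            return [np.quantile(d, s / S) for s in range(1, S + 1)]
        if method == "moments":
            mu, sigma = d.mean(), d.std()   # requires S >= 2
            return [mu - 2 * sigma + 4 * sigma * (s - 1) / (S - 1) for s in range(1, S + 1)]
        if method == "extrema":
            m, M = d.min(), d.max()
            return [m + (s / S) * (M - m) for s in range(1, S + 1)]
        raise ValueError(method)

    def depth_slices(depth, mask, S, method="surface"):
        # Ordered binary silhouettes: slice s keeps pixels with depth <= th(s).
        return [(mask > 0) & (depth <= th)
                for th in slice_thresholds(depth, mask, S, method)]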
3.3 Regression Method
The machine learning method selected for regression is the ExtRaTrees [7]. It is a fast method which does not require optimizing parameters (we do not have to set up a kernel, nor to define a distance), and which intrinsically avoids overfitting.
In practice, it is not possible to estimate the orientation directly. Indeed, it is possible to find two silhouettes annotated with depth such that (1) they are almost identical, and (2) their orientations are θ − 180° and θ + 180°. Therefore, the function that maps the silhouette annotated with depth to the orientation presents discontinuities. In general, discontinuities are a problem for regression methods. For example, the ExtRaTrees use an averaging that leads to erroneous values at the discontinuity. Our workaround to maintain continuity consists in the computation of two regressions: one regression estimates sin θ and the other one estimates cos θ. The estimate θ̂ is derived from

    θ̂ = tan2^{-1}( ŝ, ĉ ),   (3)

where ŝ and ĉ denote the regression estimates of sin θ and cos θ, and tan2^{-1} is the two-argument arctangent. The same approach was also followed by Agarwal et al. [1]. Note that ŝ² + ĉ² is not guaranteed to be equal to 1.
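A sketch of the two-regressor workaround using scikit-learn's ExtraTreesRegressor; the descriptor matrix X, the number of trees and the degree convention are illustrative assumptions, not taken from the paper.

    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor

    def fit_orientation(X, theta_deg):
        # Train one regressor for sin(theta) and one for cos(theta).
        t = np.radians(theta_deg)
        reg_sin = ExtraTreesRegressor(n_estimators=100).fit(X, np.sin(t))
        reg_cos = ExtraTreesRegressor(n_estimators=100).fit(X, np.cos(t))
        return reg_sin, reg_cos

    def predict_orientation(reg_sin, reg_cos, X):
        # Eq. (3): recombine the two estimates with the two-argument arctangent.
        return np.degrees(np.arctan2(reg_sin.predict(X), reg_cos.predict(X)))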
4 Experiments
Prior to the orientation estimation in images captured with a range camera, we analyzed the impact of perspective effects. We have avoided the range normalization problem by slicing the 3D volume of a human isolated from the background. But perspective effects cannot be ignored.

4.1 The Impact of Perspective Effects
In our first experiment, we ignore the depth information, and try to estimate the orientation from the binary silhouette. The purpose of this experiment is twofold. Firstly, we want to highlight the impact of perspective effects in the orientation estimation. Secondly, we want to show that perspective effects cannot (alone) lead to satisfactory results. Figure 5 shows that the mean error on the orientation depends on the distance between the camera and the observed person. When the camera moves away from the avatar, the vertical opening angle is decreased to keep the silhouettes at approximately the same size. For example, the opening angle is 50° at 3 meters, and 8° at 20 meters. The larger the distance from the camera to the observed person is, the more the projective model approximates an orthographic projection. Clearly, the perspective plays a significant role. Despite these effects, binary silhouettes do not suffice to achieve an acceptable mean error, even for the set of strongly constrained poses (13° at 3 meters). Our second experiment shows that better results are obtained from the 3D information. However, the fact that the results depend on the distance between the person and the camera leads to the following question: which distance(s) should we choose to learn the model? We did not find the answer to this question. On the one hand, we would like to fill the learning database with samples corresponding to the operating conditions (typically a distance of about 3 meters). On the other hand, we want to avoid learning details that are not visible in practice because of noise, and therefore to use a large distance in order to reduce perspective effects.
Fig. 5. The mean error on the orientation estimation depending on the distance between the camera and the observed person, for the Radon and shape context descriptors with weakly and strongly constrained sets of poses
4.2 The Role Played by 3D in Orientation Estimation
In our second experiment, we evaluate the performance that can be reached from 3D information. The virtual camera is placed at a distance of 3 meters from the avatar. We report the results obtained with the descriptors that have been previously mentioned. The mean error results are provided in Tables 1 and 2 for, respectively, the sets of strongly and weakly constrained poses. The following conclusions can be drawn:
1. The mean error is lower for a strongly constrained set of poses than for a weakly constrained set of poses. The diversity of the poses in the learning set has a negative impact on the results.
2. Although increasing the number of slices always improves the performance, the number of slices only affects the performance slightly when it exceeds 2 or 3. Thus there is no need for a high resolution of the distance values in the depth map.
3. The surface-based slicing method systematically outperforms the two other slicing strategies.
4. We are able to obtain mean errors as low as 9.2° or 4.3° on a 360° range of orientations, depending on the set of poses considered.
So the role played by 3D in orientation estimation is much more important than the one played by perspective effects. Moreover, these results demonstrate that several viewpoints (as used in [8] and [14]) are useless for the orientation estimation. It is difficult (if not impossible) to compare our results with those reported for techniques based on a classification method (such as the one proposed by Rybok et al. [14]) instead of a regression mechanism. Therefore, we limit our comparison to results expressed in terms of an error angle. However, one should keep in mind that a perfect comparison is impossible because the set of poses used has never been reported by previous authors. Agarwal et al. [1] obtained a mean error of 17° with a single viewpoint. Gond et al. [8] obtained a mean error of 7.57° using several points of view. Peng et al. reported 9.56° when two orthogonal
Table 1. Mean errors obtained with a strongly constrained set of poses

    Radon:            surface-based slicing  moments-based slicing  extrema-based slicing
      1 slice         13.0°                  —                      13.1°
      2 slices        6.4°                   10.6°                  8.6°
      3 slices        6.1°                   6.4°                   7.9°
      4 slices        5.8°                   6.3°                   7.6°
      5 slices        5.7°                   6.1°                   7.3°
    shape context:
      1 slice         18.2°                  —                      18.1°
      2 slices        4.8°                   12.2°                  8.3°
      3 slices        4.5°                   4.9°                   7.3°
      4 slices        4.4°                   5.1°                   6.8°
      5 slices        4.3°                   4.7°                   6.7°

Table 2. Mean errors obtained with a weakly constrained set of poses

    Radon:            surface-based slicing  moments-based slicing  extrema-based slicing
      1 slice         28.9°                  —                      28.8°
      2 slices        11.1°                  24.4°                  19.4°
      3 slices        9.9°                   11.7°                  15.5°
      4 slices        9.4°                   11.5°                  14.0°
      5 slices        9.2°                   10.7°                  13.2°
    shape context:
      1 slice         32.8°                  —                      32.6°
      2 slices        11.1°                  28.1°                  23.7°
      3 slices        10.0°                  12.4°                  18.8°
      4 slices        9.5°                   12.5°                  16.4°
      5 slices        9.2°                   11.3°                  15.2°
views are used. All these results were obtained with synthetic data, and can thus be compared to our results. The results reported by Gond et al. and Peng et al. are of the same order of magnitude as ours, but our method is much faster and simpler to implement. In contrast with existing techniques, we do not need complex operations such as camera calibration, shape from silhouettes, tensor decomposition, or manifold learning.

4.3 Observations for a Practical Application in Real Time
We applied our method to a real application driven by a Kinect. A simple background subtraction method has been used to extract the silhouettes annotated with depth. We did not filter the depth signal or the slices. The estimated
orientation has been applied in real time to an avatar, and projected on a screen in front of the user. A light temporal filtering has been applied to the orientation signal to avoid oscillations of the avatar. This allowed a qualitative assessment. The models have been learned from synthetic data without noise. It appears that the model learned with the descriptor based on the Radon transform is efficient, and that it outperforms the model learned with the descriptor based on the shape context. Without any kind of filtering, it seems thus that region-based descriptors are preferable to limit the impact of noise.
5 Conclusions
Estimating the orientation of the observed person is a crucial task for a large variety of applications including home entertainment, man-machine interaction, and intelligent vehicles. In most applications, only a single side view of the scene is available. To be insensitive to appearance (color, texture, . . . ), we rely only on geometric information (the silhouette and a depth map) to determine the orientation of a person. Under these conditions, we explain that the intrinsic limitation with a color camera is to confuse the orientations θ and 180° − θ. When the camera is close enough to the observed person, the perspective effects bring valuable information which helps to overcome this limitation. But, despite perspective effects, performances remain disappointing in terms of the mean error on the estimated angle. Therefore, we consider the use of a range camera and provide evidence that 3D information is appropriate for orientation estimation. We address the orientation estimation in terms of regression and supervised learning with the ExtRaTrees method and show that mean errors as low as 9.2° or 4.3° can be achieved, depending on the set of poses considered. These results are consistent with those reported in the literature. However, our technique has many advantages: it requires only one point of view (and therefore a single sensor), and it is fast and easy to implement.

Acknowledgments. S. Piérard has a grant funded by the FRIA, Belgium.
References 1. Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(1), 44–58 (2006) 2. Bastioni, M., Re, S., Misra, S.: Ideas and methods for modeling 3D human figures: the principal algorithms used by MakeHuman and their implementation in a new approach to parametric modeling. In: Proceedings of the 1st Bangalore Annual Compute Conference, pp. 10.1–10.6. ACM, Bangalore (2008) 3. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002)
4. Bhanu, B., Han, J.: Model-based human recognition: 2D and 3D gait. In: Human Recognition at a Distance in Video. Advances in Pattern Recognition, ch.5, pp. 65–94. Springer, Heidelberg (2011) 5. Enzweiler, M., Gavrila, D.: Integrated pedestrian classification and orientation estimation. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, USA, pp. 982–989 (June 2010) 6. Gandhi, T., Trivedi, M.: Image based estimation of pedestrian orientation for improving path prediction. In: IEEE Intelligent Vehicles Symposium, Eindhoven, The Netherlands (June 2008) 7. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Machine Learning 63(1), 3–42 (2006) 8. Gond, L., Sayd, P., Chateau, T., Dhome, M.: A 3D shape descriptor for human pose recovery. In: Perales, F.J., Fisher, R.B. (eds.) AMDO 2008. LNCS, vol. 5098, pp. 370–379. Springer, Heidelberg (2008) 9. Lee, M., Nevatia, R.: Body part detection for human pose estimation and tracking. In: IEEE Workshop on Motion and Video Computing (WMVC), Austin, USA (February 2007) 10. Loncaric, S.: A survey of shape analysis techniques. Pattern Recognition 31(8), 983–1001 (1998) 11. Ozturk, O., Yamasaki, T., Aizawa, K.: Tracking of humans and estimation of body/head orientation from top-view single camera for visual focus of attention analysis. In: International Conference on Computer Vision (ICCV), Kyoto, Japan, pp. 1020–1027 (2009) 12. Peng, B., Qian, G.: Binocular dance pose recognition and body orientation estimation via multilinear analysis. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, USA (June 2008) 13. Piérard, S., Van Droogenbroeck, M.: A technique for building databases of annotated and realistic human silhouettes based on an avatar. In: Workshop on Circuits, Systems and Signal Processing (ProRISC), Veldhoven, The Netherlands, pp. 243– 246 (November 2009) 14. Rybok, L., Voit, M., Ekenel, H., Stiefelhagen, R.: Multi-view based estimation of human upper-body orientation. In: IEEE International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, pp. 1558–1561 (August 2010) 15. The MakeHuman team: The MakeHuman (2007), http://www.makehuman.org 16. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognition 37(1), 1–19 (2004) 17. Zhang, W., Matsumoto, T., Liu, J., Chu, M., Begole, B.: An intelligent fitting room using multi-camera perception. In: International Conference on Intelligent User Interfaces (IUI), pp. 60–69. ACM, Gran Canaria (2008)
Human Identification Based on Gait Paths

Adam Świtoński 1,2, Andrzej Polański 1,2, and Konrad Wojciechowski 1,2
1 Polish-Japanese Institute of Information Technology, Aleja Legionów 2, 41-902 Bytom, Poland
{aswitonski,apolanski,kwojciechowski}@pjwstk.edu.pl
2 Silesian University of Technology, ul. Akademicka 16, 41-100 Gliwice, Poland
{adam.switonski,andrzej.polanski,konrad.wojciechowski}@polsl.pl
Abstract. Gait paths are spatial trajectories of selected body points during a person's walk. We have proposed and evaluated features extracted from gait paths for the task of person identification. We have used the following gait paths: skeleton root element, feet, hands and head. In our motion capture laboratory we have collected a human gait database containing 353 different motions of 25 actors. We have proposed four approaches to extract features from motion clips: statistical, histogram, Fourier transform and timeline. We have prepared motion filters to reduce the impact of the actor's location and actor's height on the gait path. We have applied supervised machine learning techniques to classify gaits described by the proposed feature sets. We have prepared feature selection scenarios for every approach and iterated classification experiments. On the basis of the obtained classification results we have discovered the most remarkable features for the identification task. We have achieved almost 97% identification accuracy for normalized paths. Keywords: motion capture, human identification, gait recognition, supervised learning, feature extraction, feature selection, biometrics.
1 Introduction

Biometrics is the discipline of recognizing humans based on their individual traits. There are numerous areas in which it is used. We can enumerate crime, civil and consumer identification, authorization and access control, work time registration, monitoring and supervision of public places, border control and many others. Biometric methods are most often based on: finger, palm and foot prints, face, ear, retina and iris recognition, the way of typing, speech, DNA profile matching, spectral analysis, hand geometry, and gait. The great advantage of gait identification is the fact that it does not require the awareness of the identified human. Unfortunately, it is not as accurate as, for instance, fingerprint or DNA methods. Gait identification is useful when very high efficiency is not required. It can be used for the preliminary detection or selection of suspicious or wanted persons. It could also be used in customer identification. If a customer is identified and the profile of his interests is evaluated on the basis of earlier visits, a special offer or special care can be addressed to him by a salesman, by displaying a
banner or by playing the recordings of his favorite type of music in the music shop. Possible applications are very wide.
2 Motion Capture

The gait can be defined as a coordinated, cyclic combination of movements which results in human locomotion [5]. It means that even a short fragment of the gait is representative and has common features with its remaining part. The gait can be captured by the traditional two-dimensional video cameras of monitoring systems or by much more accurate motion capture systems. A motion capture system acquires motion as a time sequence of poses. There are numerous formats for representing a single pose. In the basic C3D format, without the skeleton model, we obtain only the direct coordinates of the markers located on the human body and tracked by the specialized cameras. Comparison and processing of such data is difficult, because there is no direct, easily interpretable meaning of the C3D markers; they are only identified by different labels. The C3D "raw" data has to be further processed to estimate the pose, in which we have direct information on the location and state of the body parts. The skeleton model has to be applied to properly transform the labeled C3D markers to the interpretable coordinates of body parts, which define the pose of an individual. A well known format of the pose description is ASF/AMC. It describes the pose by a skeleton tree-like structure with measured bone lengths. The root object is placed on the top of the tree and is described by its position in the global coordinate system. Child objects are connected to their parents and carry information of rotation relative to the parents, represented by Euler angles. Direct applications of motion capture (mocap) systems to human identification tasks are limited because of the inconvenience of the capturing process. The identified human has to put on a special suit with attached markers and can move only within a narrow bounded region monitored by the mocap cameras. However, there is one great advantage of motion capture - the precision of measurements. It minimizes the influence of capturing errors and allows for discovery of the most remarkable features of the human gait. Thus, using motion capture in the development phase of a human identification system is reasonable. It makes it possible to focus on evaluating the individuality of the motion features and then to work on detecting only those features from the 2D images.
3 Related Work

It is believed that a human is able to recognize people by gait. However, experimental verification of this hypothesis, reported in the literature, is rather sparse. In the experimental study presented in [6], the gait data were captured and recorded for a group of six students who knew each other. The capture form was a moving light display of human silhouettes. Afterwards, the students tried to identify randomly presented gaits. Their performance was 38%, which is twice as good as guessing, but still quite poor. In general, the number of all gait features is too great to be effectively analyzed by the human brain. A human is probably able to recognize only some features of gaits, which are most characteristic for the observed persons.
Gait identification methods can be divided into two categories: model based methods and motion or appearance based methods. In the motion based approaches we have only the outline of a human extracted from 2D image called a silhouette. In [9] silhouettes of walking humans were extracted from images by using background subtraction method, then silhouettes skeletons were computed and finally modified ICA was applied to project the original gait features from a high dimensional measurement space to a low-dimensional space. Subsequently the L2 norm was used to compare the transformed gaits. Similar approach was proposed in [10], based on the PCA reductions technique instead of ICA. In [11] recognition was performed by using temporal correlation of silhouettes. To track silhouettes the authors used optical flow methods or calculated the special images - motion energy and motion history [5]. In the model based approaches we define model of the observed human and capture its configurations in the subsequent time instants. The above mentioned ASF/AMC format is often applied as the skeleton model of the observed human. There are many methods to estimate this model directly from 2D images available in the literature. In [7] the authors use the particle swarm optimization algorithm to find optimal configuration of particles, corresponding to the model parts, which match the image in the best way. In [12] time sequences of all model configuration parameters are transformed into frequency domain and the first two Fourier components are chosen. Finally, the obtained description is reduced by PCA method. The comparison of time sequences, directly applicable to the sequence of motion frames can be performed by dynamic time warping [13]. It requires developing a suitable method of calculating the similarity between motion frames. The authors of the [14] propose 3D cloud point distance measure. First they build cloud points for compared frames and their temporal context. Further, they find global transition to match both clouds and finally calculate the sum of distances corresponding points of matched clouds. For the configuration coded by the unit quaternions, the distance can be evaluated as sum of quaternion distances. In [15] the frame distance is the total weighted sum of quaternion distances. Binary relational motion features proposed in [8] and [13] give a new opportunity of motion description. Binary relational feature can be applied if some given joints and bones are in the defined relation, for example the left knee is behind the right knee or the right ankle is higher than the left knee. However, it is very difficult to prepare a single set of features which is applicable to the recognition of every gait. Features are usually dedicated to specialized detections because of their relatively easy interpretation. We can generate large features vectors from generic features set proposed by [8], but because of the difficulty in pointing significant features, it leads to long pose description and redundant data, difficult to interpret. We have not found any comprehensive study based on the features calculated by the precise motion capture system and relatively easy to be extracted from 2D video recordings, focused on evaluating most remarkable features. We have addressed this problem in the present study.
4 Collected Database of Human Gaits
We used the PJWSTK laboratory with a Vicon motion capture system [1] to acquire human gaits. We collected a database of 353 gaits performed by 25 different males
aged 20 to 35 years. The gait route was specified as a straight line 5 meters long. The acquisition process started and ended with a T-letter pose, as required by the Vicon calibration process. An exemplary gait collected in one of the experiments is presented in Fig. 1.
Fig. 1. Example of a gait
The actor walks along the Z axis; the Y axis has the default up-down orientation, and the perpendicular X axis registers slight deviations from the specified route. We defined two motion types, slow gait and fast gait, without strict rules for the actors; slow and fast were interpreted individually. A typical slow gait lasts up to 5 seconds and contains several steps; a fast gait usually lasts up to 4 seconds. The motions are stored in the ASF/AMC format. The gait path can be defined as a time sequence of three-dimensional coordinates:
P : [1{:}T] \to (X, Y, Z) \subset \mathbb{R}^3 \qquad (1)
It can be estimated by the location of the root element of the ASF/AMC frames, which marks the lower end of the spine. Examples of two motions of different actors with the line of root positions plotted are presented in Fig. 2. As we can notice, the root position strongly depends on the height of the actor, more precisely on the length of his legs. In such a case identification based on this path would strongly depend on the actors' heights instead of only on the gait path itself. To minimize the influence of the actor's height we can apply a simple transformation of the path by translating it relative to the first motion frame:
P_{\mathrm{Translated}} = P - P(1) \qquad (2)
In fact this can be done not only for the Y attribute; translating the X and Z attributes in the same way makes the gait path independent of the position of the captured gait in the global coordinate system. Another way to reduce the dependency of the gait path on the height of the actor and on the location of the gait is to normalize the attributes to a specified range. This can be done linearly; the transformation for the default range (0, 1) is presented below:
P_{\mathrm{scaled}} = \left( \frac{X - X_{\min}}{X_{\max} - X_{\min}},\; \frac{Y - Y_{\min}}{Y_{\max} - Y_{\min}},\; \frac{Z - Z_{\min}}{Z_{\max} - Z_{\min}} \right) \qquad (3)
where X_min, X_max, Y_min, Y_max, Z_min and Z_max are respectively the minimum and maximum values of the X, Y and Z attributes in the given motion path.
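As a concrete illustration of the two normalizations, the sketch below (our own Python/NumPy, not the authors' code) assumes a gait path given as a T x 3 array of (X, Y, Z) positions and applies the translation of Eq. (2) and the linear scaling of Eq. (3).

```python
import numpy as np

def translate_to_first_frame(path):
    """Eq. (2): shift the path so that its first frame lies at the origin."""
    return path - path[0]

def scale_to_unit_range(path, eps=1e-9):
    """Eq. (3): scale each attribute of the path linearly to the range (0, 1)."""
    mins = path.min(axis=0)
    maxs = path.max(axis=0)
    return (path - mins) / (maxs - mins + eps)

# Synthetic example: T = 100 frames of a root path walking along the Z axis.
if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 100)
    path = np.stack([0.05 * np.sin(8 * np.pi * t),        # X: small lateral sway
                     1.0 + 0.02 * np.cos(8 * np.pi * t),  # Y: root height
                     5.0 * t], axis=1)                    # Z: walking direction
    print(translate_to_first_frame(path)[:2])
    print(scale_to_unit_range(path)[:2])
```

The small eps guards against division by zero for attributes that stay constant over the whole path; the original formulation does not need it.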
The above transformation seems to work better than translation. Regardless of the global location of the path, the actor's height still has an impact on the path variations, and in contrast to normalization, translation relative to a chosen frame does not reduce this dependency. Moreover, the common range makes paths indistinguishable with respect to their length, which merely reflects the duration of the capture process. An evident way to obtain a gait path is to track the movements of the feet. In this case we have to transform the pose representation from the kinematic chain of the ASF/AMC format to point clouds and take the proper point from each frame. It is disputable whether to choose the left or the right foot; to take both into consideration we can calculate the midpoint between them - we will call such a path the center foot path.
Fig. 2. Example of collected gaits with plotted gait paths
In Fig. 2 we present two example gaits of different actors with their gait paths plotted. The first image contains the raw root paths, the second the root paths with the Y attribute translated relative to the first frame, the third the left and right foot paths, and the fourth the center foot path.
Fig. 3. Main cycle detection
As described above, each motion starts with the T pose, which itself contains some individual features of the actors. These could be derived from each actor's ability to stand in a static pose: keeping the right angle between the spine and the hands, differences between the hands, slight movements of the hands, the straightness of the hands and legs, the distance between the feet, and many others. However, the T pose is not a natural pose during a typical gait. Thus, gait identification relying on features of a pose that is normally absent would artificially improve the results.
Fig. 4. Example gait paths for five randomly selected actors: a) raw root paths, b) root paths translated relative to the first frame, c) raw left foot paths, d) left foot paths scaled to the default range, e) left foot trajectories of the Y attribute scaled to the default range, f) left foot trajectories of the Z attribute scaled to the default range.
In order to eliminate artifacts related to the initial and terminal phases of motion in the laboratory environment, we prepared a filter detecting the main cycle of the gait. A gait can be represented as a repeated sequence of steps with the left and right legs. Successive steps of a given leg are almost identical, hence global gait features can be calculated from only two adjacent steps. To detect the beginning and the end of the main gait cycle it is sufficient to track the distance between the two feet and analyze its extremes: the largest distance occurs when the current step ends and the next one starts, while the smallest distance marks the middle phase of a step.
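One possible reading of this filter is sketched below (illustrative Python; function and parameter names are ours, not from the paper). It assumes the left- and right-foot positions are available as T x 3 arrays, takes the maxima of the inter-feet distance as step boundaries, and keeps only boundaries at which the left foot is in front along the walking axis, so that every extracted main cycle starts with the same leg.

```python
import numpy as np
from scipy.signal import argrelextrema

def main_gait_cycle(left_foot, right_foot, walking_axis=2, order=5):
    """Return (start, end) frame indices of the main cycle (two adjacent steps).

    left_foot, right_foot: arrays of shape (T, 3) with foot positions per frame.
    walking_axis: index of the walking direction (the Z axis in our setup).
    """
    dist = np.linalg.norm(left_foot - right_foot, axis=1)
    # Local maxima of the feet distance mark the transitions between steps.
    boundaries = argrelextrema(dist, np.greater, order=order)[0]
    # Keep only boundaries where the left foot leads, so that every
    # main cycle starts with a step of the same leg.
    left_leads = [i for i in boundaries
                  if left_foot[i, walking_axis] > right_foot[i, walking_axis]]
    if len(left_leads) < 2:
        raise ValueError("fewer than two step boundaries with the left leg in front")
    return left_leads[0], left_leads[1]
```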
In Fig. 3 we visualize the main cycle detection, which contains two adjacent steps, for a randomly chosen motion. The left chart presents the distance between the right and left legs in subsequent motion frames; the right figure shows the analyzed motion with the detected main cycle marked by the green line. There is one more issue to consider in main cycle detection: to compare main cycles of motions directly, they should start with a step of the same leg, which means we should choose the proper minima of the leg distances. If we assume that the first step starts with the left leg in front and the right leg in the back, we have to remove those minima for which the left leg is closer to the starting point than the right leg. In Fig. 4 we present fifteen randomly chosen gait paths of five actors, with the paths of different actors plotted in different colors. The first chart presents the raw ASF/AMC root paths. We can notice significant differences between actors, especially for the Y coordinate and, to a smaller extent, for the X coordinate; the actors have different heights and walked along different lines in the laboratory room. The second chart presents the root paths after translating all attributes relative to the first frame; the differences are much smaller than without translation. The height of the actors does not have such an impact on the position of the feet, hence we can easily notice differences only for the X coordinate, similarly to the root paths. It is difficult to state simple, general rules for recognizing the actors from the paths with the proposed filtering applied. For the trajectories of the Y attribute we can notice loops which reflect the subsequent steps; for the Z attribute there are no loops, because the actors move along the Z axis. The fragments of the plots corresponding to the T poses can be easily detected, and again they are quite specific to different actors. However, as mentioned, these fragments are later ignored by our person identification system, which analyses only the main cycle of the gait.
5 Experiments, Results and Conclusions
On the basis of the gait paths we have tried to identify the actors. In the experiment we chose paths of the following body parts: root, left, right and center foot, head, and left and right hand. The root and feet paths seem to describe the manner of walking most directly and should therefore carry some individual features. The reason for testing the head paths is the relative simplicity of their detection in 2D video images. The extraction of the hands from video images also does not seem very complicated and, in addition, we intuitively expected that their movements could provide information useful for the identification task. The head and hand paths are detected in the same way as the feet paths: they are obtained from the point cloud representation by choosing the appropriate points. We truncated all paths by cutting each motion to the main gait cycle window. In the next stage all frame sequences of the paths were transformed by applying the previously described filters: translation relative to the first frame and linear scaling of each attribute to the default range (0, 1).
The complexity of the problem and the difficulty of proposing general rules for identifying the gait paths presented above inclined us to choose supervised machine learning techniques. The crucial problem was to prepare a proper set of features describing each motion that would allow the actors to be identified. We propose four different approaches: statistical, histogram, Fourier transform and timeline. In the statistical approach we calculate the mean value and variance of each pose attribute. In the histogram-based approach we build a separate histogram for each attribute with different numbers of bins: five, ten, twenty, fifty and one hundred; this gives five different histogram representations of every gait. In the Fourier approach we transform the motion into the frequency domain and take the first twenty harmonic components with the lowest frequencies. The number of harmonic components was chosen based on motion reconstruction with the inverse Fourier transform: twenty components are sufficient to restore the motion in the time domain without visible damage. The feature set includes the amplitude of each harmonic component, which measures the total intensity at the given frequency, and the phase, which indicates its time shift. We expected the Fourier transform to be useful only for the gait representation with main cycle detection, since only in that case do the same Fourier components store similar information and remain directly comparable. Because of the different gait speeds in our experiments, we decided to build an additional representation by linearly scaling the time domain to an equal number of frames. In this approach, applied in numerous studies, the time scale (measured in seconds) is replaced by a gait phase scale (measured as a percentage of the gait cycle). The last approach is called the timeline: the feature set stores the values of every attribute as a time sequence, and the moments at which attribute values are sampled are determined by dividing the motion into a given number of intervals. For the same reason as in the previous approach, timeline feature sets are expected to be most informative for motions with main cycle detection. We prepared timeline motion representations with sequences of five, ten, twenty, fifty and one hundred time moments. In the statistical approach we calculated the velocities and accelerations along the paths and included them in the feature set in the same way as the coordinate values; the statistical feature set thus contains means and variances calculated for the path coordinates, the velocities and the accelerations. As described below, the results with velocities and accelerations included were much better than without them, so we repeated the tests with velocities and accelerations added to the Fourier and timeline approaches, treating them in the same way as the coordinates: we calculated Fourier components for them and sampled their values at the selected time moments. The number of features depends strongly on the chosen approach. The entire motion is described by seven separate three-dimensional gait paths and each path can be divided into three time sequences: coordinates, velocities and accelerations. For the statistical approach, each dimension of a path is described by a mean and a variance, which gives 126 features.
In the histogram-based approach, which does not use velocities and accelerations, there are 105 features for five-bin histograms and 2100 for one-hundred-bin histograms. The Fourier sets contain 2520 features, and the number of timeline features ranges from 315 to 6300, depending on the number of time moments.
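To make the representations concrete, the fragment below (our own illustrative Python, not the authors' code) builds the statistical, Fourier and timeline feature vectors for a single 3-D path; velocities and accelerations are approximated by finite differences and treated exactly like the coordinates.

```python
import numpy as np

def with_derivatives(path):
    """Stack coordinates, velocities and accelerations: (T, 3) -> (T, 9)."""
    vel = np.gradient(path, axis=0)
    acc = np.gradient(vel, axis=0)
    return np.hstack([path, vel, acc])

def statistical_features(path):
    """Mean and variance of every attribute (18 features per path)."""
    data = with_derivatives(path)
    return np.concatenate([data.mean(axis=0), data.var(axis=0)])

def fourier_features(path, n_components=20):
    """Amplitudes and phases of the lowest-frequency harmonic components."""
    data = with_derivatives(path)
    spectrum = np.fft.rfft(data, axis=0)[:n_components]
    return np.concatenate([np.abs(spectrum).ravel(), np.angle(spectrum).ravel()])

def timeline_features(path, n_moments=50):
    """Attribute values sampled at evenly spaced moments of the cycle."""
    data = with_derivatives(path)
    idx = np.linspace(0, len(data) - 1, n_moments).astype(int)
    return data[idx].ravel()
```

Concatenating the statistical features of the seven paths gives 7 x 18 = 126 features, matching the count given above.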
It seems that we do not need such a great number of features to identify the actors; this concerns especially the Fourier and timeline datasets. Some of the features are probably useless and contain noise, which may worsen the classification results. More importantly, such a huge feature set does not allow researchers to evaluate the features easily. To verify the hypothesis of useless features and to discover the most remarkable ones, we prepared feature selection scenarios, separately for every dataset type; after applying the selection we repeated the classification and analyzed the results. At the current stage we have not used automatic selection techniques [2] based on attribute subset evaluation, because of the complexity of the problem, and attribute ranking methods [2] appear too naive for the task. We believe that manual selection gives clearer results. The selection scenarios are as follows. In all cases we selected every combination of attributes associated with:
• the axes of the global coordinate system: X, Y and Z,
• the gait paths: root, left foot, right foot, center foot, left hand, right hand, head,
• positions, velocities and accelerations.
For the statistical datasets we made additional combinations by selecting means and variances, and for the Fourier datasets we limited the number of Fourier components and selected the moduli and phases of the complex numbers. The number of experiments to execute was very large, so we could not apply classifiers that are slow to train and test, and we were restricted to classification methods suitable for multiclass discrimination. In the introductory step we used two statistical classifiers: k-Nearest Neighbour [3] and Naive Bayes [4]. For the nearest neighbour classifier we varied the number of analyzed neighbours from 1 to 10; for Naive Bayes we used a normal distribution of the attributes and a distribution estimated by a kernel-based method. We tested every combination of the applied preprocessing filters, feature set calculation approaches with all their feature selection scenarios, classifiers and their parameters. This gives almost three million different experiments and, because the leave-one-out method [2] was used to split the dataset into training and test parts, over one billion training cycles and tests. In Figs. 5, 6, 7, 8 and 9 we visualize the aggregated classification results. We report classifier efficiency, i.e. the percentage of correctly identified gaits; in the aggregation we chose the highest efficiency among the experiments performed for the specified approaches and attributes.
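The evaluation loop itself is straightforward. The sketch below (our own Python; we use scikit-learn purely for illustration and do not claim it is the authors' toolchain) performs leave-one-out testing of a k-nearest-neighbour classifier on a gait feature matrix X with actor labels y, reporting efficiency as the percentage of correctly identified gaits.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def loo_efficiency(X, y, k=3):
    """Percentage of correctly identified gaits under leave-one-out testing."""
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X[train_idx], y[train_idx])
        correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
    return 100.0 * correct / len(y)

# Toy example with random data standing in for the 353 gait feature vectors.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 126))      # e.g. statistical feature vectors
    y = rng.integers(0, 5, size=40)     # actor identities
    print(f"LOO efficiency: {loo_efficiency(X, y):.1f}%")
```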
Fig. 5. Total classification results
Fig. 6. Classification results for normalized datasets, with translation relative to the first frame and linear scaling of all attributes to the default range (0,1)
Fig. 7. Evaluation of coordinate value, velocity and acceleration attributes
Fig. 8. Evaluation of X,Y and Z attributes
Fig. 9. Evaluation of Fourier components
For the raw paths the most informative are the hand paths, with the feet slightly worse. Surprisingly, the root path, which stores information about the actor's height, is less informative than the feet. The best total efficiency is 96.6%, achieved by the timeline approach with 50 time points and main cycle detection. Main cycle detection not only makes the results more reliable but also improves them noticeably; as we expected, this is observed for the timeline and Fourier approaches, which obtained the highest efficiencies.
The normalization of the paths, which removes the information about the actor's height and the gait location, worsens the results. For the stronger normalization with attribute scaling the best efficiency is 93.5%, and for the weaker one with translation it is 94.3%. In contrast to the previous case, the Fourier approach is slightly better than the timeline one, and again the statistical and histogram approaches are the worst. The normalization causes a significant loss of information in the root, hand and head paths. In the evaluation of the attributes and Fourier components presented in Figs. 7, 8 and 9 we took into consideration only the reliable paths, normalized by attribute scaling with main cycle detection. There is another surprising observation: the velocities and accelerations contain more individual information than the coordinate values, which is particularly noticeable for the root and feet paths. We can conclude that how energetic the movements are matters more than their shape. The reason for not repeating the tests for the histogram approach with velocities and accelerations was the preliminary tests with only the root paths: the histogram approach obtained much lower efficiencies than the Fourier and timeline ones, so we regarded it as less promising. On the basis of the Fourier components calculated for the coordinate values we can reconstruct the entire sequence of the original paths, which means that they contain indirect information about velocities and accelerations. However, this knowledge is hidden, and the simple classifiers applied were not able to exploit it; it was necessary to add direct features representing velocities and accelerations to improve the results. As we expected, the most informative directions are those of the Z and Y axes, corresponding to the main direction of the gait and the up-down direction; this means that the actor should be observed from a side view. Despite quite good quality, sufficient for 80% efficiency, the X attributes contain some noise, and adding them to the Y and Z attributes worsens the results in most cases. The most informative are the feet paths, except for the non-normalized paths, which favour including the height data and the hand paths. Unfortunately, the head paths, which are relatively easy to extract from 2D video recordings, obtained the worst results; the second worst are the root paths. Root and head paths are static and reflect the general gait path, in contrast to the feet and hand paths, which have greater variation and therefore allow individual features to be extracted more easily. What is more, the feet and hand paths contain information about step length, the height of feet lifting and hand waving, which are surely individual. The best result for paths normalized by linear scaling of the attribute values was obtained by the Fourier approach with 5 components of the Y and Z directions and the feet, root and hand paths with the complete description: coordinate values, velocities and accelerations. It reaches 93.5% classifier efficiency, which means that 23 motions out of 353 were misclassified. Substituting the head path for the root path causes only one additional mistake, and removing the root path five additional mistakes. The individual features are thus not concentrated in a single path but dispersed over the movements of different body parts, and more details have to be tracked to achieve high accuracy. This probably explains the difficulty humans have in recognizing gait, reported in [6] and mentioned above.
For the best discovered feature set of normalized paths we tested a more sophisticated functional classifier with greater computational cost, a multilayer perceptron [2]. We iterated the tests over different network structure complexities,
learning rates and numbers of learning cycles. The multilayer perceptron halved the number of errors: it reaches 96.9% classifier efficiency, which means only 11 mistakes in 353 tests.
Acknowledgement. This paper has been supported by the project "System with a library of modules for advanced analysis and an interactive synthesis of human motion" co-financed by the European Regional Development Fund under the Innovative Economy Operational Programme - Priority Axis 1. Research and development of modern technologies, measure 1.3.1 Development projects.
References
1. Webpage of PJWSTK Human Motion Group, http://hm.pjwstk.edu.pl
2. Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco (2005)
3. Aha, D., Kibler, D.: Instance-based learning algorithms. Machine Learning (1991)
4. John, G.H., Langley, P.: Estimating Continuous Distributions in Bayesian Classifiers. In: Eleventh Conference on Uncertainty in Artificial Intelligence, San Mateo, pp. 338–345 (1995)
5. Boyd, J.E., Little, J.J.: Biometric Gait Identification. In: Tistarelli, M., Bigun, J., Grosso, E. (eds.) Advanced Studies in Biometrics. LNCS, vol. 3161, pp. 19–42. Springer, Heidelberg (2005)
6. Cutting, J.E., Kozlowski, L.T.: Recognizing friends by their walk: gait perception without familiarity cues. Bulletin of the Psychonomic Society (1977)
7. Krzeszowski, T., Kwolek, B., Wojciechowski, K.: Articulated body motion tracking by combined particle swarm optimization and particle filtering. In: Bolc, L., Tadeusiewicz, R., Chmielewski, L.J., Wojciechowski, K. (eds.) ICCVG 2010. LNCS, vol. 6374, pp. 147–154. Springer, Heidelberg (2010)
8. Muller, M., Roder, T.: A Relational Approach to Content-based Analysis of Motion Capture Data. Computational Imaging and Vision, vol. 36, ch. 20, pp. 477–506 (2007)
9. Pushpa Rani, M., Arumugam, G.: An Efficient Gait Recognition System For Human Identification Using Modified ICA. International Journal of Computer Science and Information Technology 2(1) (2010)
10. Liang, W., Tieniu, T., Huazhong, N., Weiming, H.: Silhouette Analysis-Based Gait Recognition for Human Identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12) (2003)
11. Sarkar, S., Phillips, J., Liu, Z., Vega, I.R., Grother, P., Bowyer, K.: The HumanID Gait Challenge Problem: Data Sets, Performance, and Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(2) (2005)
12. Zhang, Z., Troje, N.F.: View-independent person identification from human gait. Neurocomputing 69 (2005)
13. Roder, T.: Similarity, Retrieval, and Classification of Motion Capture Data. PhD thesis, Massachusetts Institute of Technology (2006)
14. Kovar, L., Gleicher, M., Pighin, F.: Motion graphs. ACM Trans. Graph. (2002)
15. Johnson, M.: Exploiting Quaternions to Support Expressive Interactive Character Motion. PhD thesis, Massachusetts Institute of Technology (2003)
Separating Occluded Humans by Bayesian Pixel Classifier with Re-weighted Posterior Probability Daehwan Kim, Yeonho Kim, and Daijin Kim Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-Dong, Nam-Gu, Pohang, 790-784, Korea {msoul98,beast,dkim}@postech.ac.kr
Abstract. This paper proposes a Bayesian pixel classification method with re-weighted posterior probability for separating multiple occluded humans. We separate the occluded humans by treating the occlusion region as a pixel classification problem. First, we detect an isolated human using a human detector. Then we divide it into three body parts (head, torso, and legs) using a body part detector, and model the color distributions of each body part with a naive Bayes classifier. Next, we detect an occlusion region by associating the occluded humans in consecutive frames. Finally, we identify the pixels associated with a human or body part in the occlusion region using the Bayesian pixel classifier with re-weighted posterior probability, which classifies them more accurately. Experimental results show that our proposed method can classify pixels in an occlusion region and separate multiple occluded humans. Keywords: Occlusion separation, Bayesian pixel classification, Re-weighted posterior probability, naive Bayes classifier.
1 Introduction
Vision-based human tracking is a very important research topic in the computer vision field, since it can be applied to many applications such as surveillance, mobile robot systems, object tracking and augmented reality. Although many human tracking methods have been proposed, several problems remain, such as pose variations, appearance changes caused by illumination changes, and temporary or partial occlusion. In particular, tracking occluded humans is one of the most difficult problems due to their small visible area and vague appearance within the occlusion area. There are many methods to separate multiple occluded humans. Chang et al. [1] used the fusion of multiple cameras to obtain depth. Mittal and Davis [3] separated multiple occluded humans in a cluttered scene using multiple calibrated cameras. However, these methods require multiple cameras to generate depth and additional time to calibrate the cameras. Wu and Nevatia [4] proposed a method for human detection in a crowded scene from static images by introducing and learning an edgelet feature. They built part detectors based on the edgelet feature and used them to detect the body parts of the occluded
humans. Sabzmeydani and Mori [5] introduced an algorithm for learning shapelet features to detect pedestrians in still images; they dealt with partial occlusion by using the learned pedestrian detector based on the shapelet feature. The edge-based methods provide good detection and segmentation results, but they cannot detect or segment occluded humans in cases of severe occlusion. Elgammal and Davis [7] performed occlusion reasoning to recover relative depth information in a general probabilistic framework. They devised a scenario to initialize the human appearance models before occlusion, and many appearance-based methods have since followed their initialization scenario. Lin et al. [9] described an interactive human segmentation approach encoding both spatial and color information in nonparametric kernel density estimation; they also proposed handling multiple humans under occlusion using a layered occlusion model and probabilistic occlusion reasoning. Hu et al. [10] proposed a new approach for reasoning about multiple occluded people using a spatial-color mixture of Gaussians appearance model, deducing the occlusion relationships between the current states of the objects and the current observations. Appearance is a good feature for occlusion separation due to its persistence over time; in particular, it is less sensitive to partial occlusion or to exterior influences in an occlusion view. We consider separating occluded humans as a pixel classification problem, which makes separation possible even when only a few pixels in an occlusion region are visible. First, we detect an isolated human using a human detector based on the Histogram of Oriented Gradients (HOG) [14], one of the most commonly used detectors [13]; we detect humans with the HOG at every frame. Second, we divide the detected human into three body parts (head, torso, and legs) using a body part detector based on the Local Self-Similarity Descriptor (LSSD) [15], and then build the Bayesian pixel classifier by modeling the color distributions of each body part. Third, we detect an occlusion region where two or more humans are associated within one human in the current frame. Fourth, we identify the human label of the pixels in the occlusion region using the Bayesian pixel classifier with re-weighted posterior probability. We use a parametric approach to model the color distribution of the detected body parts. This is possible because a human is divided into three separable parts, so we do not need to assume that clothing has a constant color. To build the Bayesian pixel classifier we use a naive Bayes approach, which estimates its parameters easily and works quite well even with a small amount of training data. Although the Bayesian pixel classifier can classify pixels in an occlusion region well, it often fails where the occluded humans wear clothes with similar colors; Fig. 1 shows an example of such a misclassified result. Therefore, we propose a method to re-weight the posterior probability in the Bayesian pixel classifier: the posterior probability of a pixel is multiplied by the mean of the posterior probabilities of the spatially relative region. Re-weighting the posterior probability makes the pixels separable. Fig. 2 describes the overall process for separating multiple occluded humans in human tracking.
Fig. 1. An example of a misclassified result when using the traditional pixel classifier. In the left figure, the faces in the green boxes have similar colors (skin and black) and the colors of the pants and jacket in the red boxes are similar (black). In the right figure, the pixel classification results for the face of the right person and the pants of the left person are wrong.
Fig. 2. The overall process for separating the multiple occluded humans in human tracking
There are two advantages. First, the method does not require a large number of visible pixels to separate multiple occluded humans, because classification is done at the pixel level; humans can therefore be separated even under severe occlusion. Second, it does not need motion information to handle occlusion, which is very difficult to obtain under occlusion. The remainder of this paper is organized as follows. Section 2 describes the method to separate multiple occluded humans. Section 3 shows the experimental results. Finally, Section 4 presents our conclusions and suggests future work.
2 Separating Occluded Humans by Bayesian Pixel Classifier
The proposed method to separate occluded humans classifies the pixels in the occlusion region. To classify the pixels, we build a Bayesian pixel classifier that is trained by modeling the color distributions of a human and estimating their parameters while the human is detected in isolation, before occlusion. We classify the pixels
in the occlusion region using the Bayesian pixel classifier, and then separate the occluded humans by localizing each human from the classified pixels. However, the Bayesian pixel classifier often fails to classify pixels in the occlusion region because the occluded humans are usually dressed in similar colors. To overcome this problem, we make the pixels correctly classifiable by re-weighting the posterior probability using structural information about the human body.
2.1 Training the Bayesian Pixel Classifier
We use the multinomial naive Bayes (NB) model for learning the Bayesian pixel classifier. A pixel is represented by an RGB color: x = (f_1, f_2, f_3) = (r, g, b). The posterior probability of a pixel x in class c is computed as
P(c|x) \propto P(c) \prod_{1 \le d \le N_d} P(f_d|c) \qquad (1)
where P(f_d|c) is the conditional probability of attribute f_d occurring in a pixel of class c and N_d is the dimension of a pixel. We use P(f_d|c) as an observation measurement given a class c, and P(c) is the prior probability of a pixel occurring in class c. We can estimate the parameters P(c) and P(f_d|c) using the maximum likelihood (ML) technique; we write P̂(c) and P̂(f_d|c) because these are estimated values of P(c) and P(f_d|c). The prior probability is obtained by counting the relative number of pixels:
\hat{P}(c) = \frac{N_k}{N} \qquad (2)
where N_k is the number of pixels of a body part and N is the total number of pixels of all parts in an occlusion region. We estimate the conditional probability P̂(f_d|c) as the relative frequency of attribute f_d in pixels belonging to class c:
\hat{P}(f_d|c) = \frac{N_{c f_d}}{\sum_{f_d \in F_d} N_{c f_d}} \qquad (3)
where N_{c f_d} is the number of occurrences of attribute f_d in training pixels from class c. However, the ML estimate has the problem that the attribute-class probability P̂(f_d|c) can be zero if the attribute does not occur in the training data. To avoid this, we use the Laplace smoothing (LS) technique, which simply adds one to both the numerator and the denominator:
\hat{P}(f_d|c) = \frac{N_{c f_d} + 1}{\sum_{f_d \in F_d} (N_{c f_d} + 1)} \qquad (4)
Eq. 4 gives the smoothed probability of the occurrence of an attribute f_d.
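A compact sketch of this training and scoring step is given below (illustrative Python, not the authors' implementation). It builds Laplace-smoothed per-channel count tables for each body-part class from the pixels of an isolated detection, following Eqs. (1)-(4).

```python
import numpy as np

class NaiveBayesPixelClassifier:
    """Multinomial naive Bayes over RGB values (Eqs. 1-4), one class per body part."""

    def fit(self, pixels, labels, n_classes, n_levels=256):
        """pixels: (N, 3) integer RGB values; labels: (N,) body-part class per pixel."""
        self.log_prior = np.full(n_classes, -np.inf)
        self.log_cond = np.zeros((n_classes, 3, n_levels))
        for c in range(n_classes):
            pc = pixels[labels == c]
            if len(pc) == 0:
                continue
            self.log_prior[c] = np.log(len(pc) / len(pixels))     # Eq. (2)
            for d in range(3):
                counts = np.bincount(pc[:, d], minlength=n_levels)
                # Eq. (4): Laplace-smoothed conditional probability per channel.
                self.log_cond[c, d] = np.log((counts + 1) / (counts.sum() + n_levels))
        return self

    def log_posterior(self, pixels):
        """Unnormalised log P(c | x) for every pixel and class (Eq. 1)."""
        n_classes = len(self.log_prior)
        return np.stack(
            [self.log_prior[c] + sum(self.log_cond[c, d, pixels[:, d]] for d in range(3))
             for c in range(n_classes)], axis=1)
```

Working in log space avoids numerical underflow when the per-channel probabilities of Eq. (1) are multiplied.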
2.2 Re-weight Posterior Probability
Even if the learned pixel classifier classifies individual pixels quite well, it is often not sufficient for occluded humans who wear clothes of similar colors. Furthermore, occluded head areas are not separable at all, because their color distributions are almost the same.
To overcome this problem, we improve the pixel classification performance by re-weighting the posterior probability using structural information about the human body, which forms a geometric structure of several organically linked parts. This geometric structure means that the region at a spatially relative location from a part always represents the same part in a predefined body pose. From this we know that the color distributions at spatially relative locations between body parts should stay constant. To represent each body part we model it as an ellipse, where each ellipse (head, torso and legs) is parameterized by its major radius h_H, h_T, h_L, its minor radius w_H, w_T, w_L, and its center point C_H, C_T, C_L. A_HT is the articulation point between head and torso and A_TL is that between torso and legs. The angles θ_HT and θ_TL, between the vectors C_H A_HT and A_HT C_T and between the vectors C_T A_TL and A_TL C_L, are given by
\theta_{HT} = \arccos\left( \frac{\overrightarrow{C_H A_{HT}} \cdot \overrightarrow{A_{HT} C_T}}{|\overrightarrow{C_H A_{HT}}|\,|\overrightarrow{A_{HT} C_T}|} \right), \qquad \theta_{TL} = \arccos\left( \frac{\overrightarrow{C_T A_{TL}} \cdot \overrightarrow{A_{TL} C_L}}{|\overrightarrow{C_T A_{TL}}|\,|\overrightarrow{A_{TL} C_L}|} \right).
To represent the structural information of the human body, we define the Structurally Linked Region (SLR), a region defined relative to a pixel. The SLR depends on the area in which the pixel lies: the SLR of a pixel in the head, torso or leg region is, respectively, its torso, leg or torso area. We obtain the SLR of a pixel as follows.
1. Calculate the center point Ĉ of the SLR of pixel x:
– if x ∈ head, Ĉ = C_H + \overrightarrow{C_H A_{HT}} + \overrightarrow{A_{HT} C_T},
– if x ∈ torso, Ĉ = C_T + \overrightarrow{C_T A_{TL}} + \overrightarrow{A_{TL} C_L},
– if x ∈ leg, Ĉ = C_L + \overrightarrow{C_L A_{TL}} + \overrightarrow{A_{TL} C_T}.
2. Obtain the SLR R_x as an ellipse:
– if x ∈ head, R_x is a region rotated by 180 − θ_HT with Ĉ, h_T and w_T,
– if x ∈ torso, R_x is a region rotated by 180 − θ_TL with Ĉ, h_L and w_L,
– if x ∈ leg, R_x is a region rotated by 180 − θ_TL with Ĉ, h_T and w_T.
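As a concrete illustration of step 1, the fragment below (our own sketch; the ellipse centres and articulation points are assumed to come from the body-part detector) computes the SLR centre Ĉ for a pixel of a given part.

```python
import numpy as np

def slr_center(part, C_H, C_T, C_L, A_HT, A_TL):
    """Centre of the Structurally Linked Region for a pixel of the given part.

    C_H, C_T, C_L: centres of the head, torso and leg ellipses (2-D points).
    A_HT, A_TL: articulation points head-torso and torso-legs.
    """
    C_H, C_T, C_L = np.asarray(C_H), np.asarray(C_T), np.asarray(C_L)
    A_HT, A_TL = np.asarray(A_HT), np.asarray(A_TL)
    if part == "head":    # SLR of a head pixel lies on the torso
        return C_H + (A_HT - C_H) + (C_T - A_HT)
    if part == "torso":   # SLR of a torso pixel lies on the legs
        return C_T + (A_TL - C_T) + (C_L - A_TL)
    if part == "leg":     # SLR of a leg pixel lies on the torso
        return C_L + (A_TL - C_L) + (C_T - A_TL)
    raise ValueError("unknown body part: " + str(part))
```

Written out, the vector sums reduce to the centre of the linked part's ellipse (the torso centre for head and leg pixels, the leg centre for torso pixels); the SLR itself is then the ellipse of that part rotated as described in step 2.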
From the structural information of the human body we know that the color distributions between body parts are constant with respect to their spatial relation. To apply this fact to the occlusion problem, we propose a re-weighting method. First, we build the posterior probability map (PPM) of each class, where the entry at each pixel location x_i in a class c_k is the posterior probability P̂(c_k|x_i). Second, we set the SLR R_{c_k x_i} of each pixel x_i in each PPM P̂(c_1^T|x_i), P̂(c_1^L|x_i), ..., P̂(c_K^T|x_i), P̂(c_K^L|x_i). Third, we update all posterior probabilities P̂(c_k^H|x_i), P̂(c_k^T|x_i) and P̂(c_k^L|x_i) by multiplying them by the mean of all posterior probabilities in the SLR, as shown in Eqs. 5, 6 and 7:
\bar{P}(c_k^H|x_i) = M(R_{c_k^T x_i}) \cdot \hat{P}(c_k^H|x_i) \qquad (5)
\bar{P}(c_k^T|x_i) = M(R_{c_k^L x_i}) \cdot \hat{P}(c_k^T|x_i) \qquad (6)
\bar{P}(c_k^L|x_i) = M(R_{c_k^T x_i}) \cdot \hat{P}(c_k^L|x_i) \qquad (7)
where M(·) is the mean of its argument.
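The sketch below applies Eqs. (5)-(7) to dense posterior probability maps (our own Python; slr_mask is a hypothetical helper returning the elliptical SLR of a pixel as a boolean mask, built as in the previous step, and we assume Eq. (5) links head pixels to their torso region, consistently with the SLR definition).

```python
import numpy as np

def reweight_ppm(ppm_head, ppm_torso, ppm_leg, slr_mask):
    """Re-weight the per-part posterior probability maps of one person (Eqs. 5-7).

    ppm_*: 2-D arrays holding P^(c_k^part | x_i) for every pixel x_i.
    slr_mask(y, x, part): boolean mask of the SLR of pixel (y, x) w.r.t. `part`.
    A direct, unoptimised transcription of the equations.
    """
    out_head = np.zeros_like(ppm_head)
    out_torso = np.zeros_like(ppm_torso)
    out_leg = np.zeros_like(ppm_leg)
    h, w = ppm_head.shape
    for y in range(h):
        for x in range(w):
            torso_region = slr_mask(y, x, "torso")
            leg_region = slr_mask(y, x, "leg")
            out_head[y, x] = ppm_torso[torso_region].mean() * ppm_head[y, x]   # Eq. (5)
            out_torso[y, x] = ppm_leg[leg_region].mean() * ppm_torso[y, x]     # Eq. (6)
            out_leg[y, x] = ppm_torso[torso_region].mean() * ppm_leg[y, x]     # Eq. (7)
    return out_head, out_torso, out_leg
```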
2.3 Separating Occluded Humans Using a Bayesian Pixel Classifier
We can determine the class label of the pixels in an occlusion region by using a maximum a posteriori (MAP) rule with the re-weighted posterior probability P̄(c_j|x_i):
C_{best} = \arg\max_{c_j} \bar{P}(c_j|x_i) \qquad (8)
where x_i is a pixel in the occlusion region O_r and c_j is a class label joined to O_r. Finally, we convert the class labels at the body part level into human class labels, because our goal is to separate multiple occluded humans at the human level; the body part class labels c_k^H, c_k^T, c_k^L are combined into one human label c_k. From the class-labelled pixels in the occlusion region we can estimate each human region by applying Mean-shift with the predefined whole-body ellipse. The Mean-shift algorithm moves the center point of the ellipse by averaging the entries x_i(c_k) at the pixel locations with the same class label c_k. First, compute the initial location y_0 as the average of the entries at pixel locations with the same class label after classification. Second, compute the new location ŷ_1 from the initial location y_0:
\hat{y}_1 = \frac{1}{N_E} \sum_{x_i \in E} x_i(c_k) \qquad (9)
where N_E is the number of pixel entries within the ellipse E. Third, stop the iteration if |ŷ_1 − y_0| < ε is satisfied; otherwise set y_0 = ŷ_1 and go to the second step. Fig. 3 shows an example of the process used to separate the occluded humans. After re-weighting the posterior probability, the PPMs change substantially, allowing the body parts to be classified; using these PPMs, we can separate the occluded humans.
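A minimal version of this localisation step is sketched below (our own Python, with an assumed fixed whole-body ellipse); it iterates Eq. (9) over the pixels carrying a given human label until the centre moves less than ε.

```python
import numpy as np

def localize_human(label_map, k, half_axes=(60, 25), eps=0.5, max_iter=50):
    """Mean-shift style localisation of person k from the class-labelled pixels.

    label_map: 2-D integer array of human labels per pixel.
    half_axes: (vertical, horizontal) half-axes of the whole-body ellipse in pixels.
    """
    ys, xs = np.nonzero(label_map == k)
    pts = np.stack([ys, xs], axis=1).astype(float)
    y0 = pts.mean(axis=0)                      # initial location: mean of labelled pixels
    a, b = half_axes
    for _ in range(max_iter):
        d = ((pts[:, 0] - y0[0]) / a) ** 2 + ((pts[:, 1] - y0[1]) / b) ** 2
        inside = pts[d <= 1.0]                 # entries x_i(c_k) inside the ellipse E
        if len(inside) == 0:
            break
        y1 = inside.mean(axis=0)               # Eq. (9)
        if np.linalg.norm(y1 - y0) < eps:      # stop when |y1 - y0| < eps
            return y1
        y0 = y1
    return y0
```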
Fig. 3. An example of the process used to separate the occluded humans. Before re-weighting the posterior probability there are many incorrectly classified pixels, but after re-weighting they are classified quite well.
3 Experimental Results
The proposed method was implemented on a Windows PC platform with a 2.83 GHz Intel Core 2 Quad CPU and 8 GB RAM in the Matlab 9.0 environment. We evaluated our algorithm on four occlusion examples captured with a web camera in our laboratory and recorded for real human tracking. We included an example in which the occluded humans wear clothes of multiple colors (example 2); this differs from the occlusion examples used by Elgammal and Davis [7], whose examples featured humans dressed in simple colors. Table 1 summarizes the description of our experiments. The first example consists of over 1000 images (320(w)x240(h)) with two occluded humans: one wore a blue shirt and brown pants, the other a black shirt and gray pants. The two humans moved from side to side and were occluded at the center of the images. The second example consists of 104 images (about 70(w)x80(h)) with two occluded humans; this image sequence was cut from the original sequence to focus only on the occlusion areas, so the humans are relatively smaller than in the first example. They wore shirts with very complex colors and were severely occluded in the 55th frame. The two humans walked forward (toward the camera) simultaneously, crossed, and then separated. The third and fourth examples consist of over 600 images (320(w)x240(h)) with two occluded humans. The humans in the third and fourth examples wore the same color shirts (purple) and the same color pants (black).

Table 1. Description of the occlusion examples

              Example 1          Example 2            Example 3            Example 4
# of humans   2                  2                    2                    2
Head color    P1(black, skin)    P1(black, skin)      P1(black, skin)      P1(black, skin)
              P2(black, skin)    P2(black, skin)      P2(black, skin)      P2(black, skin)
Torso color   P1(blue)           P1(brown, beige)     P1(dark purple)      P1(blue)
              P2(black)          P2(black, white)     P2(light purple)     P2(gray, white)
Leg color     P1(brown)          P1(black)            P1(gray)             P1(black)
              P2(gray)           P2(blue)             P2(black)            P2(black)
We used background subtraction to obtain the foreground pixels; the holes in the experimental examples are due to the background subtraction. To assess the occlusion performance we compared two methods: one used the Bayesian pixel classifier without re-weighting of the posterior probability (this is very similar to the method of Elgammal and Davis [7], although they used a Gaussian mixture model for modeling the body part colors), and the other was the Bayesian pixel classifier with re-weighting of the posterior probability. The body parts of each human were detected while the humans were isolated, and they were modeled by training the Bayesian pixel classifier. We detected an occlusion region where two or more humans were associated with one human in the current frame but were detected separately in the previous frame.
Fig. 4. Occlusion separation result of example 1. (a) Input images (b) Pixel classification result without re-weighting of the posterior probability (c) Pixel classification result with re-weighting of the posterior probability and human localization result.
Fig. 5. Occlusion separation result of example 2. (a) Input images (b) Pixel classification result without re-weighting of the posterior probability (c) Pixel classification result with re-weighting of the posterior probability and human localization result.
Fig. 6. Pixel classification results of example 3 (a) and 4 (b)
Our experimental results show the pixel classification and human separation performance of our algorithm. We present the pixel classification results by assigning a different color to each human. Furthermore, we draw detection boxes on the image by applying Mean-shift after re-weighting in order to localize each human. Fig. 4 shows the occlusion separation results for example 1. The subjects in this example wore clothes with simple colors, so, as expected, the traditional method without re-weighting of the posterior probability gave good pixel classification results. However, the head area and many small pixel blobs were not classified correctly, because the colors of the two heads and one torso were very similar. Our method, on the other hand, classified them well; the pixels in a few blobs were classified incorrectly, but this did not affect the human localization result. From this result we see that the re-weighting method can rectify incorrectly classified pixels. Fig. 5 shows the occlusion separation result for example 2 under severe occlusion. In example 2 there were many misclassified pixels in the head and torso areas before re-weighting the posterior probability, while after re-weighting we obtained good pixel classification results: many incorrectly classified pixels and blobs were rectified. Even when the woman was severely occluded in the 51st and 55th frames, most pixels were classified quite well. However, the pixels of the man's head area were classified incorrectly because the two humans stood in a straight line and the structural information of the human body was not sufficiently reflected in the re-weighting.
Fig. 6 shows the pixel classification results under occlusion for examples 3 and 4. The results show that our method can classify the pixels quite well even when the colors of some body parts are the same. In the third example the subjects' torso colors were the same, but the pixels were classified well because their pants' colors differed and the pixels were rectified by re-weighting the posterior probability. Similarly, in the fourth example the leg colors were rectified because the shirt colors differed. From our experiments we found that the proposed method can reliably separate the occluded humans by localizing each human.
4 Conclusion and Future Work
We have presented a method to separate multiple occluded humans using a Bayesian pixel classifier with a posterior probability re-weighted by structural information about the human body. The proposed method identifies the occluded humans adequately even if they wear clothes of different colors, because we use the multinomial naive Bayes model and the re-weighted posterior probability. Our experimental results also show that the proposed method can separate multiple occluded humans adequately. From our experiments we found that not all pixels in an occlusion region are classified exactly; we will use temporal information as well as other cues to improve the pixel classification performance.
References
1. Chang, T.H., Gong, S., Ong, E.J.: Tracking Multiple People under Occlusion using Multiple Cameras. In: Proc. of British Machine Vision Conference (2000)
2. Zhao, T., Nevatia, R.: Tracking Multiple Humans in Crowded Environment. In: Proc. of IEEE Computer Vision and Pattern Recognition, vol. 2, pp. 406–413 (2004)
3. Mittal, A., Davis, L.S.: M2Tracker: A Multi-view Approach to Segmenting and Tracking People in a Cluttered Scene using Region-based Stereo. International Journal of Computer Vision 51(3), 189–203 (2003)
4. Wu, B., Nevatia, R.: Detection and Segmentation of Multiple, Partially Occluded Objects by Grouping, Merging, Assigning Part Detection Responses. International Journal of Computer Vision 82(2), 185–204 (2009)
5. Sabzmeydani, P., Mori, G.: Detecting Pedestrians by Learning Shapelet Features. In: Proc. of IEEE Computer Vision and Pattern Recognition, pp. 1–8 (2007)
6. Lin, Z., Davis, L.S., Doermann, D., Dementhon, D.: Hierarchical Part-template Matching for Human Detection and Segmentation. In: Proc. of IEEE International Conference on Computer Vision, pp. 1–8 (2007)
7. Elgammal, A., Davis, L.: Probabilistic Framework for Segmenting People under Occlusion. In: Proc. of IEEE International Conference on Computer Vision, pp. 145–152 (2001)
8. Senior, A., Hampapur, A., Tian, Y., Brown, L.: Appearance Models for Occlusion Handling. Image and Vision Computing 24(11), 1233–1243 (2006)
9. Lin, Z., Davis, L.S., Doermann, D., Dementhon, D.: An Interactive Approach to Pose-assisted and Appearance-based Segmentation of Humans. In: Proc. of IEEE International Conference on Computer Vision, pp. 1–8 (2007)
10. Hu, W., Zhou, X., Min, H., Maybank, S.: Occlusion Reasoning for Tracking Multiple People. IEEE Trans. on Circuits and Systems for Video Technology 19(1), 114–121 (2009)
11. Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and People-detection-by-tracking. In: Proc. of IEEE Computer Vision and Pattern Recognition, pp. 1–8 (2008)
12. Zhang, L., Wu, B., Nevatia, R.: Detection and Tracking of Multiple Humans with Extensive Pose Articulation. In: Proc. of IEEE International Conference on Computer Vision, pp. 1–8 (2007)
13. Dollar, P., Wojek, C., Schiele, C., Perona, P.: Pedestrian Detection: A Benchmark. In: Proc. of IEEE Computer Vision and Pattern Recognition, pp. 304–311 (2009)
14. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: Proc. of IEEE Computer Vision and Pattern Recognition, pp. 886–893 (2005)
15. Shechtman, E., Irani, M.: Matching Local Self-similarities across Images and Videos. In: Proc. of IEEE Computer Vision and Pattern Recognition, pp. 1–8 (2007)
An Edge-Based Approach for Robust Foreground Detection Sebastian Gruenwedel, Peter Van Hese, and Wilfried Philips Ghent University TELIN-IPI-IBBT, Sint Pietersnieuwstraat 41, 9000 Gent, Belgium Tel.: +32 9 264 34 12, Fax: +32 9 264 42 95 [email protected]
Abstract. Foreground segmentation is an essential task in many image processing applications and a commonly used approach to obtain foreground objects from the background. Many techniques exist, but due to shadows and changes in illumination the segmentation of foreground objects from the background remains challenging. In this paper, we present a powerful framework for the detection of moving objects in real-time video processing applications under various lighting changes. The novel approach is based on a combination of edge detection and recursive smoothing techniques. We use edge dependencies as statistical features of foreground and background regions and define the foreground as regions containing moving edges. The background is described by short- and long-term estimates. Experiments prove the robustness of our method in the presence of lighting changes, compared to other widely used background subtraction techniques. Keywords: foreground detection, foreground edge detection, background subtraction, video surveillance, video processing.
1 Introduction
Foreground/background segmentation is a crucial pre-processing step in many applications, aimed at separating moving objects (the foreground) from the expected scene (the background). Many techniques use this operation as part of their work flow; for instance, tracking algorithms may focus on foreground regions to detect moving objects and therefore speed up object matching [13]. There are many techniques to detect moving objects in indoor and outdoor sequences [3]. Nevertheless, most of them perform poorly when lighting changes suddenly. Especially in indoor scenarios, it is difficult to distinguish between foreground and background regions when sudden and/or partial lighting changes occur; a robust detection of foreground objects under such circumstances is therefore needed. The Gaussian Mixture Model (GMM) method of [14] uses a variable number of Gaussians to model the color value distribution of each pixel as a multi-modal signal. This parametric approach adapts the model parameters to statistical changes and as such can adapt to lighting changes. However, this adaptation
requires several frames, during which the performance is generally very poor. A very similar approach is presented in [5], wherein color and gradient information are explicitly modeled as time-adaptive Gaussian mixtures. The recently published ViBe [1] is a sample-based approach for modeling the color value distribution of pixels. The sample set is updated according to a random process that substitutes old pixel values for new ones; it exploits spatial information by using the neighborhood of the current pixel as well as adaptation to lighting changes. This method is more robust to noise than GMM, as described in [1]. However, the GMM-based method adapts poorly to fast local and/or global lighting changes due to the slow adaptation of the background model. The performance of ViBe is already better for such changes, but is still not robust enough. Adaptation to a lighting change takes several frames, and after a few changes in a short time period both methods lose their ability to distinguish between foreground and background; moving objects are then no longer detected and the performance is insufficient. The method in [7] divides the scene into overlapping square patches and builds intensity and gradient kernel histograms for each patch; the paper shows that contour-based features are more robust than color features with regard to changes in illumination. In [4], a region-based method describing local texture characteristics is presented as a modification of Local Binary Patterns [8]: each pixel is modeled as a group of adaptive local binary pattern histograms calculated over a circular region around the pixel. Similar to this approach is the method described in [12], which uses texture analysis in combination with invariant color measurements in RGB space to detect foreground objects. The two aspects are linearly combined, resulting in a multi-layer background subtraction method which is modeled and evaluated similarly to GMM. These models are particularly robust to shadows. In this paper we propose a new method to separate foreground (FG) from background (BG) by detecting moving edges in real-time video processing applications, in particular for tracking. We use edge dependencies as statistical features of foreground and background regions and define the foreground as regions containing moving edges, and the background as regions containing static edges of a scene. In particular, we are interested in finding edges on moving objects. The proposed method estimates static edges, in contrast to the changes in intensity used by GMM and ViBe. The novelty is the background modeling, which uses gradient estimates in the x- and y-direction. The x and y components of the gradient are estimated independently for each pixel using adaptive recursive smoothing techniques. Based on the gradient estimates, the detection of moving edges becomes feasible. An edge is defined as a sharp change in the image intensity function. We threshold the current gradient estimates against our background models to obtain foreground edges. Edge detection in general involves the relationship to neighboring pixels and is in theory independent of lighting changes [2]. Even if in practice this is not the case, lighting changes only affect the edge strength; no adaptation is needed and the method therefore copes better with local and global lighting changes.
We compare the results of the proposed method with those of two state-of-the-art FG/BG segmentation techniques. To do so, we artificially fill the interior of moving objects by clustering edges and filling those clusters with a convex hull technique. The results are obtained from several indoor sequences in the presence of local and global lighting changes. In particular, we choose the Gaussian Mixture Model (GMM) [10] based method of [14] and the sample-based approach ViBe [1] for comparison with the proposed method. We show that our method performs best on sequences with changes in illumination. As an evaluation measure we compare the positions of moving people obtained by [6] to ground truth data. This paper is structured as follows. In Section 2 the proposed method, including the background model, is explained. In Section 3 the experimental results are discussed in detail and we show that our method performs best for the tested sequences in the presence of lighting changes.
2 Background Subtraction Using Moving Edges
As is often the case, edges are detected by computing the edge strength, usually a first-order derivative expression such as the gradient magnitude, and searching for local maxima. In our method, we define foreground as regions of moving edges and use the first-order derivatives in the x- and y-direction as input, treating each direction independently. We estimate the x and y components of the gradient per pixel over time using a recursive smoothing technique. The smoothing is applied with a low learning factor and estimates the background of the scene; we refer to it as the long-term background edge model. Due to the low learning factor, changes in the gradient estimates are incorporated slowly by the background models. By comparing the background edge models to the recent gradient estimates, we might detect more edges than are actually present because of the low learning factor. This situation is prevented by a second smoothing approach, referred to as the short-term background edge model, based on recursive smoothing with a higher learning factor. The two models per direction are used jointly to obtain a foreground gradient estimate per direction, containing only the regions where motion occurs in the image. Figure 1 shows the block scheme of our method. First, we calculate the gradient estimates in the x- and y-direction, represented by two matrices G_{x,t} and G_{y,t}, for the input image of frame t using a discrete differentiation operator (e.g. the Sobel operator). In the next step, we compare our long-term background edge models with the current gradient estimates for each direction and obtain two binary foreground masks, F^l_{x,t} and F^l_{y,t}, using hysteresis thresholding with two thresholds T_low and T_high. The same procedure is applied to the short-term models, resulting in two binary masks, F^s_{x,t} and F^s_{y,t}, using only the threshold T_low. The comparison per model uses the signed differences between the background edge models in the x- and y-direction and the x and y components of the gradient, instead of the absolute values of the differences, which results in a better detection of moving edges. We define G^f_{x,t} and G^f_{y,t} as the foreground gradient estimates in the x- and y-direction, respectively.
[Figure 1: block scheme of the proposed method. The grayscale input image is differentiated to obtain G_{x,t} and G_{y,t}; each direction feeds a long-term background edge model (B^l) and a short-term edge model (B^s) with corresponding detection steps producing the masks F^l and F^s, which are combined into the foreground edge map.]
Fig. 1. The input image is passed to a differentiation operator, resulting in the x and y components of the gradient. The gradient is compared to our background models, resulting in the foreground gradient estimates.
in x- and y-direction, respectively. The foreground gradient estimate in x, G^f_{x,t}, contains the x component of the gradient, G_{x,t}, in foreground regions if, and only if, the binary masks F^s_{x,t} and F^l_{x,t} are one, and zero values otherwise; and analogously in y-direction. In the final step, edges are extracted from the foreground gradient estimates using a non-maximum suppression technique together with the two thresholds T_low and T_high. The resulting moving edges of our method are exemplified in Figure 2. The model updates are performed using the recursive smoothing (running average) technique [9] with two different learning factors α_s and α_l for the short-term and long-term model, respectively. In order to meet the real-time criteria we choose the simplest form of exponential smoothing, i.e. the mean value is a cumulative frame-by-frame estimate. In summary, we use four different parameters for our method: α_s and α_l for the short-term and long-term models in x- and y-direction, as well as T_low and T_high. A detailed discussion of the short-term and long-term models can be found in Sections 2.1 and 2.2. We only focus on the explanation of the x-direction, since the calculations for the y-direction are analogous.
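The following sketch illustrates the gradient-estimation step described above. It is a minimal example under stated assumptions (grayscale float input, SciPy's Sobel operator as the discrete differentiation operator); variable names and the use of SciPy are ours and not taken from the authors' implementation.

```python
# Sketch of the per-frame gradient estimation G_{x,t}, G_{y,t} (not the authors' code).
import numpy as np
from scipy import ndimage

def gradient_estimates(frame: np.ndarray):
    """frame: 2-D grayscale image. Returns the x and y gradient estimates."""
    g = frame.astype(np.float32)
    g_x = ndimage.sobel(g, axis=1)  # derivative along columns (x-direction)
    g_y = ndimage.sobel(g, axis=0)  # derivative along rows (y-direction)
    return g_x, g_y
```

Each direction is then processed independently by the short-term and long-term models with the four parameters α_s, α_l, T_low and T_high.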
2.1 Short-Term Model
The short-term models of x and y are responsible for smoothing the x and y components of the gradient over a recent number of frames according to the learning factor α_s. The model is needed, in combination with the long-term model, to suppress noise and to robustly detect moving edges. The model update is performed using a recursive smoothing technique with the learning rate α_s, which is higher than the learning rate α_l of the long-term background edge models.

Fig. 2. Segmentation result of an input frame (a) using the proposed method (b)

We define the difference between the averaged gradient estimate, B^s_{x,t}, and the current gradient estimate, G_{x,t}, as d^s_{x,t}(x, y) = G_{x,t}(x, y) − B^s_{x,t}(x, y) at location (x, y). Formally, the model B^s_{x,t} is updated according to:

B^s_{x,t}(x, y) = B^s_{x,t−1}(x, y) + α_s d^s_{x,t}(x, y)    (1)

where α_s ∈ [0, 1] is the learning rate. The learning rate α_s is constant and usually around 0.1. Frame differencing is a special case of the short-term model with α_s = 1 and the simplest form of motion detection; α_s < 1 models the smoothing over a recent number of frames rather than only the last one. To obtain the binary mask F^s_{x,t}, we threshold the difference between B^s_{x,t} and the gradient estimate G_{x,t}. Formally, we get

F^s_{x,t}(x, y) = 1 if d^s_{x,t}(x, y) > T_low, and 0 otherwise    (2)

where F^s_{x,t} represents a binary mask specifying the presence of motion with 1 and its absence with 0 for each pixel. The threshold T_low is the same as in the long-term modeling.
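A minimal numpy sketch of the short-term update (Eq. 1) and the binary motion mask (Eq. 2) is given below; the array and parameter names are ours, and the threshold value is left to the caller since the paper does not fix it.

```python
# Sketch of Eqs. (1) and (2) for one gradient direction (not the authors' code).
import numpy as np

def update_short_term(B_s: np.ndarray, G: np.ndarray, alpha_s: float, T_low: float):
    d_s = G - B_s                          # d^s_{x,t} = G_{x,t} - B^s_{x,t}
    B_s_new = B_s + alpha_s * d_s          # Eq. (1): recursive smoothing
    F_s = (d_s > T_low).astype(np.uint8)   # Eq. (2): 1 where motion is present
    return B_s_new, F_s
```

With α_s = 1 this reduces to plain frame differencing of the gradient estimates, as noted in the text.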
2.2 Long-Term Background Edge Model

The long-term model B^l_{x,t} basically contains averaged gradient estimates over a long time period and therefore describes the static edges in the background. The model uses the same running average technique as the short-term model, but with a very low learning factor α_l ∈ [0, 1] (around 0.01). The difference between the long-term gradient estimate, B^l_{x,t}, and the current gradient estimate, G_{x,t}, is defined as d^l_{x,t}(x, y) = G_{x,t}(x, y) − B^l_{x,t}(x, y) at location (x, y). The update is calculated as follows:

B^l_{x,t}(x, y) = B^l_{x,t−1}(x, y) + α_l d^l_{x,t}(x, y)  if F^s_{x,t}(x, y) = F^l_{x,t}(x, y) = 0, and B^l_{x,t}(x, y) = B^l_{x,t−1}(x, y) otherwise    (3)

where B^l_{x,t}(x, y) corresponds to the long-term model in x-direction in the t-th input frame at location (x, y). The model B^l_{x,t} is selectively updated according to the binary masks F^s_{x,t} and F^l_{x,t}, i.e., the model is only updated in regions where moving edges are not detected. This makes sure that moving edges are not included in the long-term model. The learning factor α_l ∈ [0, 1] is constant and describes the adaptation speed of the long-term model. Selective updating could cause a propagation of false detections in time because the model is not updated in regions of moving edges. This situation is prevented using a second model, the short-term model.
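The selective update of Eq. (3) can be sketched as follows; the masks F^s and F^l are the binary arrays from the short-term and long-term detection steps, and the names are ours.

```python
# Sketch of the selective long-term update (Eq. 3): the model only adapts
# where neither mask reports a moving edge (not the authors' code).
import numpy as np

def update_long_term(B_l: np.ndarray, G: np.ndarray,
                     F_s: np.ndarray, F_l: np.ndarray, alpha_l: float = 0.01):
    d_l = G - B_l
    background = (F_s == 0) & (F_l == 0)            # no moving edge detected
    return np.where(background, B_l + alpha_l * d_l, B_l)
```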
The binary mask F^l_{x,t} is determined by the comparison of the background model B^l_{x,t} and the input gradient estimates G_{x,t}. Formally, we calculate the mask F^l_{x,t} as follows:

F^l_{x,t} = hyst(d^l_{x,t}, T_low, T_high)    (4)

The resulting mask contains 1 for foreground and 0 for background regions for each pixel. The function hyst(·) corresponds to a hysteresis thresholding of the absolute value of the difference between the long-term gradient estimate and the current gradient estimate. All pixel values larger than T_high are immediately accepted as foreground; values smaller than T_low are immediately rejected. Pixel values in between the two thresholds are accepted if they are in the neighborhood (8-connected) of a pixel that has a larger value than T_high.

Fig. 3. Detected moving edges of the proposed method in a partly illuminated scene. The first row shows the input frames and the second row the segmentation results. Our method produces reliable results even in dark regions.

2.3 Detection of Moving Edges
In the final step, moving edges are generated from the foreground gradient estimates G^f_{x,t} and G^f_{y,t}. G^f_{x,t} is found by setting foreground regions according to the gradient estimate in x-direction, G_{x,t}, and the binary masks F^s_{x,t} and F^l_{x,t}, and zero otherwise. In this stage we make sure that we take only foreground regions into account. The calculation is defined as follows:

G^f_{x,t}(x, y) = G_{x,t}(x, y) if F^s_{x,t}(x, y) = F^l_{x,t}(x, y) = 1, and 0 otherwise    (5)

where G^f_{x,t} contains the gradient estimate at location (x, y) and zero values otherwise. To obtain thin edges, we calculate the edge map for time t using the non-maximum suppression technique used in many edge detection algorithms [2] by combining the foreground gradient estimates in x- and y-direction, G^f_{x,t} and G^f_{y,t}. Non-maximum suppression searches for the local maximum in the gradient direction. As already illustrated in Figure 2, our method produces edges describing moving objects.
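The masking step of Eq. (5) amounts to a per-pixel logical AND of the two masks; a minimal sketch is given below (the non-maximum suppression that follows is omitted here, since any Canny-style implementation can be plugged in). Names are illustrative.

```python
# Sketch of Eq. (5): keep the gradient only where both masks agree on foreground.
import numpy as np

def foreground_gradient(G: np.ndarray, F_s: np.ndarray, F_l: np.ndarray):
    mask = (F_s == 1) & (F_l == 1)
    return np.where(mask, G, 0.0)
```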
3 Results and Discussion
We first tested the performance of the proposed method under different lighting conditions and visually compared the results to the Gaussian Mixture Model [14] and the ViBe technique [1]. In the second step, we performed an evaluation of these three background segmentation methods by using the foreground silhouettes from each method as input for constructing occupancy maps by Dempster-Shafer reasoning in a multi-camera network according to [6]. The soundness of the maps per time instance is then used as an evaluation measure for the different FG/BG methods. In particular, these maps are useful for monitoring the activities of people and for tracking applications, with the intention of finding the correct trajectory of a person. The data set we used for comparison consists of indoor sequences which were captured by a network of four cameras (780x580 pixels at 20 FPS) with overlapping views in an 8.8m by 9.2m room. Recordings were taken for about one minute, during which the ground truth positions of each person were annotated at one-second intervals. We also tested the proposed method on outdoor sequences with similar results. For the sequences we use a fixed learning factor of α_s = 0.1 for the short-term and α_l = 0.01 for the long-term models. The proposed framework is not very sensitive to the learning factor α_l, provided that the factor is reasonably small (0.01 to 0.05). However, since the short-term model is responsible for the smoothing of recent activities in the foreground, the learning rate α_s is important and specifies the adaptation speed to changes in the foreground. Usually this factor is about ten times bigger than α_l.

3.1 Visual Evaluation of the Proposed Method under Different Lighting Conditions
In the first step we compared the performance visually on three exemplary sequences, with evaluation measures of how well moving people are segmented for GMM and ViBe, and how well moving edges are detected on people.
Fig. 4. Exemplary frames of a global lighting change. Black pixels correspond to foreground regions. First column: input frames; second column: results of GMM [14]; third column: results of ViBe [1]; fourth column: results of our proposed method. The second row shows the scene directly after a lighting change. Our method is not affected by this change.
In Figure 3, the segmentation result of a partially illuminated sequence is shown. Our method produces reliable results even in dark regions, i.e., edges are still found on moving people. The results of our method are similar to those of change detection techniques, but differ favorably in the presence of lighting changes, as shown in Figures 4 and 5. In Figure 4, exemplary segmentation results of the whole sequence for a global lighting change are shown. The second column shows the scene directly after a lighting change. It is clearly visible that our method is not affected by this change. The detection of edges on the walking person is still reliable, even in poorly illuminated parts of the scene. GMM and ViBe suffer from the adaptation to the lighting change, and ViBe in particular fails to segment the person in the scene. Figure 5 shows an example of global and local lighting changes. In this sequence, four people are moving around with the light changing at first globally and then locally in the scene. This makes it difficult to find a proper segmentation of moving people. The first column illustrates the results of all methods at the beginning of the sequence. In the second column a global illumination change has occurred and GMM and ViBe suffer from this lighting change, while our proposed method still provides some edges on moving people. The third and fourth columns contain local lighting changes.
Fig. 5. Exemplary frames of global and local lighting changes. Black pixels correspond to foreground regions. First row: input frames; second row: results for GMM [14]; third row: results for ViBe [1]; fourth row: results for the proposed method. The proposed method is less influenced by lighting changes.
ViBe fails completely in this case because its adaptation to lighting changes is quite slow. Even GMM has problems adapting to these changes and finding a good segmentation. Our method performs best in these cases and provides a good segmentation of moving edges. Due to the local lighting changes, shadows are partially segmented by our method, as depicted in the last column. As shown in the example sequences, our method is less influenced by lighting changes and hence more robust. Under different lighting conditions, the results of our method only suffer from fewer detections of edges on the objects or from the partial detection of shadows. Fewer edges are detected due to poor lighting on the objects, which results in too small intensity differences.

3.2 Numerical Evaluation of the Proposed Method for the Construction of Occupancy Maps
To quantitatively compare all described methods, we used an exemplary sequence which includes local and global lighting changes (example frames are shown in Figure 5). We performed an evaluation based on the foreground silhouettes from each method as input for constructing occupancy maps by Dempster-Shafer reasoning in a multi-camera network [6]. This comparison is especially of interest for tracking applications, as it is a measure of how often the tracking might be lost. Since edges cannot be compared directly with the foreground masks of FG/BG segmentation techniques, we clustered edges using a nearest-neighbor technique and combined them by a convex hull to represent silhouettes of moving people. The convex hull is constructed around a cluster of
Fig. 6. Comparison of GMM, ViBe and the proposed method for each frame in the sequence. In (a) (1 − n) and in (b) (1 − p) for each method is shown. Higher values indicate better performance. The proposed method outperforms the other two methods.
edges and usually results in a sub-optimal solution for constructing the silhouette of a person. A convex hull for a set of edge points is generally the minimal convex set containing these points. However, it is only used here for comparison with FG/BG methods to construct an occupancy map. We used occupancy maps (i.e. a top view of the scene) together with Dempster-Shafer reasoning, as explained in [6], to obtain each person's position in the scene. An occupancy map is calculated using different camera views and fusing the foreground silhouettes onto the ground plane. In this sequence we have four different views of the scene and manually annotated ground truth positions of each person at one-second intervals. For each occupancy map the positions of people were compared to ground truth data. To evaluate the soundness of all maps per time instance we use two measures, n and p, as described in [11]: n represents a measure of evidence at a person's position (within a radius of 10cm; n = 0 is the ideal case) and p a measure of evidence outside the positions (p = 0 is the ideal case). For p, we choose a radius of 70cm around the person's position. These measures provide a reasonable evaluation of FG/BG methods, as stated in [11], e.g. for tracking applications. In the ideal case a method should achieve n = 0 and p = 0, which means that all objects are detected and the evidence of a person is concentrated around the ground truth position. In Figure 6 the evaluation over all 1800 frames (90s) is shown. Figure 6(a) shows the measure (1 − n) and Figure 6(b) the measure (1 − p). Ideally, for each frame the measures (1 − n) and (1 − p) should be close to one. The results show that after lighting changes occur (frame 600), our method performs best for this sequence. The mean of (1 − n) and (1 − p) is shown in Figure 7(a). Our method has a performance of 60%, which is double that of GMM. Although for some frames the results are still not satisfying, because only half of the people are segmented, resulting in lower evidences of occupancy.
[Figure 7, panel (a): bar chart of the mean of (1−n) and (1−p) over all frames for the proposed method, GMM and ViBe (legend: 1−n: evidence at players' positions; 1−p: no evidence elsewhere). Panel (b): processing time per frame [ms]: GMM 66.2, ViBe 18.4, proposed method 43.2.]
Fig. 7. Comparison of GMM, ViBe and the proposed method: (a) the mean of (1 − n) and (1 − p) over all frames, (b) the processing time of all methods calculated in full resolution (780x580 pixels). The proposed method clearly outperforms the other methods and is able to run in real-time.
However, GMM and ViBe fail almost completely in the presence of lighting changes for this sequence. The processing time of our method is higher than that of ViBe, but still lower than that of GMM-based methods (Figure 7(b)). To sum up, our method performs best in the presence of lighting changes compared to GMM and ViBe, because our model is based on the detection of edges, which are much less influenced by lighting changes and therefore more robust.
4 Conclusion
In this paper, we presented a novel approach for background subtraction using moving edges. We showed that our method produces results similar to state-of-the-art foreground/background methods [14,1], but performs much better in the presence of lighting changes. The parameters of our method (the short-term and long-term learning factors) do not need fine tuning, since the results are satisfying in a wide range of environments with a fixed set of parameters. The problem of changing light conditions is still a critical issue for foreground segmentation techniques and needs further investigation; however, our proposed method based on edge information addresses this drawback and is a step towards robustness against illumination changes. This edge-based approach can be used to model the lighting changes and thus help to find a better segmentation of foreground objects. A minor drawback of the proposed method is that the thresholds are not yet fully insensitive to lighting; further work on automatically adapting the thresholds to the light changes is required. Furthermore, tracking approaches could make use of moving edges, because edges are a common feature of choice in these applications.
References
1. Barnich, O., Van Droogenbroeck, M.: ViBe: a powerful random technique to estimate the background in video sequences. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 945–948 (2009)
2. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8(6), 679–698 (1986)
3. Cristani, M., Farenzena, M., Bloisi, D., Murino, V.: Background subtraction for automated multisensor surveillance: a comprehensive review. Journal on Advances in Signal Processing, 24 (2010)
4. Heikkila, M., Pietikainen, M.: A texture-based method for modeling the background and detecting moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4), 657–662 (2006)
5. Klare, B., Sarkar, S.: Background subtraction in varying illuminations using an ensemble based on an enlarged feature set. In: Computer Vision and Pattern Recognition Workshop, pp. 66–73 (2009)
6. Morbee, M., Tessens, L., Aghajan, H., Philips, W.: Dempster-Shafer based multi-view occupancy maps. Electronics Letters 46(5), 341–343 (2010)
7. Noriega, P., Bernier, O.: Real time illumination invariant background subtraction using local kernel histograms. In: British Machine Vision Conference, BMVC (2006)
8. Ojala, T., Pietikainen, M., Harwood, D.: Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. In: Proceedings of the 12th IAPR International Conference on Pattern Recognition. Conference A: Computer Vision & Image Processing, vol. 1, pp. 582–585. IEEE, Los Alamitos (1994)
9. Piccardi, M.: Background subtraction techniques: a review. In: IEEE International Conference on Systems, Man and Cybernetics, vol. 4, pp. 3099–3104. IEEE, Los Alamitos (2004)
10. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Mach. Intell. 22, 747–757 (2000)
11. Van Hese, P., Grünwedel, S., Niño Castañeda, J., Jelaca, V., Philips, W.: Evaluation of background/foreground segmentation methods for multi-view occupancy maps. In: Proceedings of the 2nd International Conference on Positioning and Context-Awareness (PoCA 2011), p. 37 (2011)
12. Yao, J., Odobez, J.: Multi-layer background subtraction based on color and texture. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8. IEEE, Los Alamitos (2007)
13. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Comput. Surv. 38(4), 13 (2006)
14. Zivkovic, Z.: Improved adaptive Gaussian mixture model for background subtraction. In: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 2, pp. 28–31 (2004)
Relation Learning - A New Approach to Face Recognition

Len Bui, Dat Tran, Xu Huang, and Girija Chetty

University of Canberra, Australia
{Len.Bui,Dat.Tran,Xu.Huang,Girija.Chetty}@canberra.edu.au
Abstract. Most current machine learning methods used in face recognition systems require sufficient data to build a face model or face data description. However, insufficient data is currently a common issue. This paper presents a new learning approach to tackle this issue. The proposed learning method employs not only the data in facial images but also relations between them to build relational face models. Preliminary experiments performed on the AT&T and FERET face corpora show a significant improvement in face recognition rate when only a small facial data set is available for training. Keywords: Face Identification, Similarity Score, Relation Learning, Support Vector Machine.
1 Introduction

The goal of pattern recognition is to classify unknown objects into classes or categories. These objects are represented by a set of measurements called patterns or feature vectors. There is a strong growth of practical applications based on pattern recognition, and face recognition [1] has attracted many researchers because it has many potential applications. Many machine learning methods have been developed for face recognition; however, there are still some challenges. The biggest challenge is data insufficiency. Most current machine learning methods require a lot of objects for each class to build a face model or face data description. However, only a few objects per class are available in practice. For example, in the FERET database [2, 3] each person has a limited number of facial images. According to [4-9], statistical and discriminative methods are currently the two main approaches to solving the data insufficiency problem in pattern classification. In the statistical approach, the training data is used to estimate the class-conditional density p(x | C_k) for the k-th class C_k, where x is the training data and k = 1, ..., K, with K the number of classes. This density is then combined with the prior class probability p(C_k) to obtain the posterior class probability p(C_k | x). Finally, we can use the estimated posterior class probabilities to assign a class label to an unknown object in the test data set. The class-conditional densities can be calculated using a parametric method (e.g. Gaussian Mixture Model) or a non-parametric method (e.g. Kernel Density Estimation). Clearly, the main advantage of the statistical approach is that it can provide a statistical distribution of the training data. However, estimation methods in
the statistical approach require a lot of data to estimate the data distribution, and the data collection stage becomes a very difficult task in any pattern recognition application. In face recognition, collecting sufficient data is a real problem, since only a few facial images per person can be collected in practice. The well-known FERET database is a good example of this problem. Unlike the statistical approach, the discriminative approach does not need to estimate statistical distributions. We can directly find a discriminant function f(x) which maps a new object x onto a class label y. There is a variety of methods to determine the discriminant function, such as k-Nearest Neighbors (k-NNs), Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs). The k-NN method is the simplest and oldest method in pattern recognition and has been widely applied in face recognition systems. However, this method is very sensitive to noise and variations. In particular, it does not perform well if only a small training data set is available. The ANN method has become a powerful method in practical applications such as optical character recognition. However, it is prone to over-fitting and sensitive to noise and variations. Recently, the SVM method has become prominent because of its robustness. However, the SVM method also requires a large training data set. An alternative approach to avoid the insufficient data problem is the use of similarity learning methods. However, these methods have some weak points, which we discuss in detail in the next section. We propose a new relation learning approach to deal with the insufficient data problem. We define a relation between two objects and apply this definition to determine all possible relations in the training data set. The relation between two objects is used to measure the similarity between them. Instead of considering objects in their data space, as in other methods, a relation space is used. The paper is organized as follows. In the next section, we present related studies and our proposed relation learning method. We discuss how to apply our method to face recognition in the third section. We present experiments on face recognition and discussions of our approach in the fourth section. Finally, we summarize our approach and suggest some extensions.
2 Distance-Based Classification Methods

In early studies [10, 11], most researchers used the k-NN method for classification. A simple distance is used to measure the dissimilarity between two objects; the most popular distances are L1, L2, Cosine and Mahalanobis. The nearest neighbor rule is used for making decisions. Although this method is very simple and efficient, it provides low classification accuracy if there is not sufficient training data. Consider the following example. Figure 1 presents an object x_1 ∈ class A, an object x_2 ∈ class B and an unknown object x. The nearest neighbor rule will decide that x ∈ B, when in fact it is a member of class A. This error can be avoided in Vector Quantization (VQ), where class centers are used to calculate distances. However, this method still has a weak point, as seen in Figure 2. Since class B's region is smaller than class A's, VQ will classify the unknown object x into class B although it belongs to class A.
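The two decision rules contrasted above can be written down in a few lines; the following toy sketch is our own illustration (not taken from the paper) of 1-NN versus class-center (VQ) classification with the L2 distance.

```python
# Toy sketch of the two decision rules discussed above (illustrative only).
import numpy as np

def nn_label(x, samples, labels):
    """1-NN: label of the closest training sample."""
    d = np.linalg.norm(samples - x, axis=1)
    return labels[np.argmin(d)]

def vq_label(x, samples, labels):
    """VQ: label of the closest class center (mean of each class)."""
    centres = {c: samples[labels == c].mean(axis=0) for c in np.unique(labels)}
    return min(centres, key=lambda c: np.linalg.norm(centres[c] - x))
```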
Fig. 1. Misclassification for x using k-NN
Fig. 2. Misclassification for x using Vector Quantization
Fig. 3. Misclassification for pairs using subtraction of two objects as learning metric
To overcome the limitations of k-NN and VQ, learning metrics [12-14] have been applied. Pairs of objects in the same class are included in a positive set and pairs of objects in different classes in a negative set. A probability metric is used to measure the similarity. Good recognition results have been reported. However, this approach has some hidden risks. First, if the metric is defined as the subtraction of the two objects in a pair, information about the two objects is lost. Second, the subtraction is a many-to-one mapping, so it can cause an overloading phenomenon. Figure 3 shows three pairs of objects that have the same subtraction value; however,
two of them are positive and the other one is negative. In addition, a decision on a pair (x_1, x_2) will be inferred for all pairs (x_1 + v, x_2 + v) with an arbitrary vector v.
3 Proposed Relation Learning Method

Let x_i and x_j be feature vectors in ℝ^n, where n is the dimension of the data space. A relation R defined on x_i and x_j is denoted as x_i R x_j. Examples of relations are "same person", if x_i and x_j are from facial images of the same person, or "same gender", if x_i and x_j are from facial images of two males or two females. In this paper we use the concatenation of x_i and x_j to represent the relation as follows:

x_i R x_j : ℝ^n × ℝ^n → ℝ^{2n}, (x_i, x_j) ↦ x_i x_j    (1)

where x_i = (x_{1i}, x_{2i}, ..., x_{ni}), x_j = (x_{1j}, x_{2j}, ..., x_{nj}) and x_i x_j = (x_{1i}, x_{2i}, ..., x_{ni}, x_{1j}, x_{2j}, ..., x_{nj}).
It is noted that this representation is a one-to-one mapping. A relation defines two categories, "have relation" and "have no relation", on a given data set. Therefore any binary discriminant function with positive and negative values can be used to determine the two categories. Let f be a discriminant function on ℝ^{2n}; a relation function r is defined for relation R such that

r : ℝ^n × ℝ^n → ℝ, (x_i, x_j) ↦ r(x_i, x_j) = f(x_i R x_j) = f(x_i x_j)    (2)

where r(x_i, x_j) ≥ 0 if x_i and x_j have the relation and r(x_i, x_j) < 0 if x_i and x_j have no relation. The robustness ρ of a discriminant function f is a value such that all perturbations of patterns ε ∈ ℝ^n with ‖ε‖ < ρ in the training set do not change the sign of the discriminant function, i.e. the categories of the patterns. These perturbations can be considered as noises or variations of the patterns:

sign(f(x + ε)) = sign(f(x))    (3)

The robustness ρ of a relation function r is a value such that all perturbations of patterns ε_i, ε_j ∈ ℝ^n with ‖ε_i‖ < ρ, ‖ε_j‖ < ρ in the training set do not change the sign of the relation function, i.e. the relation of the patterns:

sign(r(x_i + ε_i, x_j + ε_j)) = sign(r(x_i, x_j))    (4)
Proposition 1: Suppose that a relation function r is defined by a discriminant function f. If f is robust then r is robust, and if ρ is the robustness of f then ρ/√2 is the robustness of r.

Proof. From equation (2), we have r(x_i, x_j) = f(x_i x_j) and r(x_i + ε_i, x_j + ε_j) = f(x_i x_j + ε_i ε_j). Assume that ‖ε_i‖, ‖ε_j‖ < ρ/√2; then

‖ε_i ε_j‖² = ‖ε_i‖² + ‖ε_j‖² < ρ²/2 + ρ²/2 = ρ².

From the assumption of the proposition, we have sign(f(x_i x_j)) = sign(f(x_i x_j + ε_i ε_j)). Therefore,

sign(r(x_i, x_j)) = sign(r(x_i + ε_i, x_j + ε_j)).
Fig. 4. Illustration for the case n = 1
To find a robust relation function r is thus to find a robust discriminant function f. We can see that the function f representing the decision hyperplane in Support Vector Machines (SVMs) can be used as the relation discriminant function for our proposed approach. The function in SVMs is of the form

f(x) = w^T Φ(x) + b    (5)
where w is the normal vector of the hyperplane, Φ(·) is a (kernel-induced) mapping and b is a constant. Calculation details of these parameters can be found in [15, 16].
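The following sketch, written under stated assumptions, shows how the relation of Eq. (1) and the SVM decision function of Eqs. (2) and (5) fit together in practice: pairs are represented by concatenation and a binary SVM is trained on positive and negative pairs, with its decision function used as r. It uses scikit-learn's SVC; the RBF parameters are placeholders, not the paper's tuned values, and the helper names are ours.

```python
# Sketch of relation learning with an SVM decision function (not the authors' code).
import numpy as np
from sklearn.svm import SVC

def make_pairs(X, y):
    """Build concatenated pairs x_i x_j and +1/-1 relation labels."""
    pairs, rel = [], []
    n = len(X)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            pairs.append(np.concatenate([X[i], X[j]]))   # Eq. (1)
            rel.append(1 if y[i] == y[j] else -1)        # same individual or not
    return np.asarray(pairs), np.asarray(rel)

def train_relation_function(X, y, C=1.0, gamma=0.01):   # C, gamma are placeholders
    pairs, rel = make_pairs(X, y)
    svm = SVC(kernel="rbf", C=C, gamma=gamma).fit(pairs, rel)
    # r(x_i, x_j) = f(x_i x_j): signed distance to the decision hyperplane, Eqs. (2), (5)
    return lambda xi, xj: svm.decision_function(
        np.concatenate([xi, xj]).reshape(1, -1))[0]
```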
4 Relation Learning for Face Recognition

In this section, we present the technique for applying relation learning to face recognition. First, we review the FERET evaluation protocol. We select this protocol because it is a de facto standard widely accepted by the research community. Second, we present our way of building the classifier.

4.1 FERET Evaluation Protocol
The goal of the FERET evaluation protocol is to provide a standard method to assess an algorithm. The evaluation design cannot be too hard or too easy. In the protocol, an algorithm is given two sets of images, a target set T (training set) and a query set Q (testing set). The algorithm reports the similarity score s(q_i, t_j) between all query images q_i in the query set Q and all target images t_j in the target set T. In face identification, the query is not always "Is the top match correct?" but "Is the correct answer in the top n matches?", much as the Google search engine responds to a query by listing one or more possible answers. For our method, we assume that a bigger similarity score implies a closer match. Finally, the performance statistics are reported as cumulative match scores.

4.2 Relation Function for the Training Set
Let a training set T = {t_1, t_2, ..., t_N} be a set of faces of M individuals; we build a relation function from this set. Two face images have a relation if they are faces of the same individual. From this relation, we build two classes. The first one, called the positive class D^+, contains all pairs of faces which belong to the same individuals. The second one, called the negative class D^−, contains all pairs of patterns which belong to different individuals. Then, we use the SVM algorithm to find the relation function r (the discriminant function for the two classes).

4.3 Identification on the Testing Set
In the identification task, let q be a query image; we compute the relation value between it and all target images t_j. As mentioned above, the relation has the reflection property, so we have to compute the left relation r(q, t_j) and the right relation r(t_j, q). There are two ways to compute similarity scores:

s(q, t_j) = r(q, t_j) + r(t_j, q)    (6)

s(q, t_j) = max(r(q, t_j), r(t_j, q))    (7)
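A minimal sketch of the two scoring rules, assuming a relation function r as returned by the training sketch above (names are ours):

```python
# Sketch of Eqs. (6) and (7): score a query against all targets and rank them.
import numpy as np

def rank_targets(r, q, targets, mode="sum"):
    scores = []
    for t in targets:
        left, right = r(q, t), r(t, q)               # left and right relations
        scores.append(left + right if mode == "sum" else max(left, right))
    return np.argsort(scores)[::-1]                  # target indices, best match first
```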
After computing the similarity scores, these values are sorted in descending order. The first element has the highest similarity and the last one the lowest. However, we have to deal with a big problem: the number of elements of the negative class is very large. For example, if a training set has 100 images of 100 individuals, with one image per individual, the number of elements of the negative class is 9900, a
huge number. It takes a lot of time to find a discriminant function. To solve this problem, we use a divide-and-conquer strategy. We divide the negative class into K negative subclasses:

D^− = D^−_1 ∪ D^−_2 ∪ ... ∪ D^−_K, where D^−_i ∩ D^−_j = ∅ for i ≠ j    (8)

Each negative subclass is combined with the positive class to create a sub relation, so there are K sub relations:

D_i = {D^+, D^−_i}, i = 1, ..., K    (9)

Each sub relation has a relation function. The relation function for the overall relation is defined as follows:

r(x_i, x_j) = min{ r_1(x_i, x_j), r_2(x_i, x_j), ..., r_K(x_i, x_j) }    (10)

Proposition 2: Suppose that a relation function r is defined by sub relation functions {r_1, r_2, ..., r_K}. If {r_1, r_2, ..., r_K} are robust then r is robust, and if ρ_k is the robustness of r_k then ρ = min{ρ_1, ..., ρ_K} is the robustness of r.

Proof. Assume that ‖ε_i‖, ‖ε_j‖ < min{ρ_1, ρ_2, ..., ρ_K}.

Case 1: x_i x_j ∈ D^+. Then r_k(x_i + ε_i, x_j + ε_j) ≥ 0 for all k, so r(x_i + ε_i, x_j + ε_j) = min_k{ r_k(x_i + ε_i, x_j + ε_j) } ≥ 0.

Case 2: x_i x_j ∈ D^−, so x_i x_j ∈ D^−_k for some k. Then r_k(x_i + ε_i, x_j + ε_j) < 0, so r(x_i + ε_i, x_j + ε_j) = min_k{ r_k(x_i + ε_i, x_j + ε_j) } < 0.
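A sketch of the divide-and-conquer training of Eqs. (8)-(10), under the assumption that positive and negative pairs are already available as arrays of concatenated feature vectors; the random split and the SVC parameters are illustrative choices, not the paper's exact setup.

```python
# Sketch of Eqs. (8)-(10): K sub-relation SVMs combined by a minimum (not the authors' code).
import numpy as np
from sklearn.svm import SVC

def train_sub_relations(pos_pairs, neg_pairs, K=10, **svm_kw):
    """pos_pairs, neg_pairs: 2-D arrays of concatenated pairs. Returns K SVMs."""
    subsets = np.array_split(np.random.permutation(len(neg_pairs)), K)  # Eq. (8)
    models = []
    for idx in subsets:
        X = np.vstack([pos_pairs, neg_pairs[idx]])                      # Eq. (9)
        y = np.concatenate([np.ones(len(pos_pairs)), -np.ones(len(idx))])
        models.append(SVC(kernel="rbf", **svm_kw).fit(X, y))
    return models

def relation_value(models, pair):
    pair = pair.reshape(1, -1)
    return min(m.decision_function(pair)[0] for m in models)            # Eq. (10)
```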
4.4 Experiments on Face Recognition
In this section, we present experiments using our approach. They were conducted on the two public AT&T and FERET datasets. The AT&T face database was collected at AT&T Laboratories. It contains 400 images (92-by-112) of 40 individuals; each one has 10 images. The FERET database was collected at George Mason University between August 1993 and July 1996. It contains 14,126 images of 1199 individuals; each one has one or more images. Figure 5 shows some faces from the two datasets. A face recognition system includes four main modules: detection, alignment, feature extraction and classification. Our study focuses only on the fourth task, classification. This means that all methods except the baseline method in our experiments use the same feature vectors. We choose this setup because recognition rates strongly depend on the quality of the feature vectors [1].
Fig. 5. a) faces from AT&T b) faces from FERET
Experiments on AT&T

Table 1. Experimental results on the AT&T data set

Method     Feature  P1    P2    P3    P4    Accuracy rate (%)
Baseline   No       91    92    89    88.5  90.13
L1         PCA      88    90.5  84.5  87.5  87.63
L2         PCA      89.5  90.5  85.5  86    87.88
L1+Mah     PCA      86.5  90    84    87.5  87.00
L2+Mah     PCA      90.5  90    87.5  88    89.50
SVM1       PCA      92    92    90    91.5  91.50
SVM2       PCA      91    92.5  89.5  89.5  90.50
Bayes      PCA      81    89.5  83.5  83    84.00
Relation   PCA      91.5  93    93.5  91.5  92.38
There are four random partitions P1, P2, P3 and P4 of the database. Each partition includes one training set (target set) and one testing set (query set). Each set contains 200 images; each person has 5 images, and the two sets are completely separate. No image processing routine is applied to these images. To obtain feature vectors, we simply use the PCA subspace proposed by Turk and Pentland [10]. All algorithms use the same feature vectors of dimension n = 10. We choose full image matching as the baseline algorithm. There are four nearest neighbor algorithms, using L1, L2, L1 + Mahalanobis (L1+Mah) and L2 + Mahalanobis (L2+Mah). For conventional support vector machines, we run two SVMs using the strategies one-against-one (SVM1) and one-against-all (SVM2). There is only one generative method, the Bayes classifier. The final algorithm uses our approach. In our method, we create 40,000 pairs from the 200 faces of the training set. There are 1000 pairs in the positive class and 39,000 pairs in the negative class. Next, we randomly divide this relation into 10 sub relations; each of them contains 1000 positive pairs and 3900 negative pairs. The kernel function is the Radial Basis Function K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²). The values of the cost parameter C are set to 2^−5, ..., 2^15 and the values of the kernel parameter γ are set to 2^−14, ..., 2^4. Table 1 shows the results of these experiments.

Experiments on FERET

We randomly select 200 images from the fa set as the training set and 200 images from the fb set as the testing set. From ground-truth information about the locations of the left and right
eyes and the mouth center, we crop, scale and mask the images in the training and testing sets (no further image processing). Then, we extract feature vectors and keep the first 10 eigenfaces. We again choose full image matching as the baseline algorithm. There are two nearest neighbor algorithms, using the metrics L2 and L2+Mah, and only one SVM algorithm, SVM2. The final algorithm uses our approach. We create 40,000 pairs from the 200 faces of the training set. There are 200 pairs in the positive class and 39,800 pairs in the negative class. Next, we randomly divide this relation into 50 small sub relations. Each of them contains 200 positive pairs and 786 negative pairs. We also use the Radial Basis Function kernel. Figure 6 shows the experimental results for the 20 top ranks.
Fig. 6. Experimental results on FERET
Discussion

In general, the experimental results show that there is an improvement in recognition rates. Our approach achieves the highest average rate in the experiments on AT&T (92.38±1.03) and a slightly better result in the experiments on FERET. According to the experiments, there is no difference between the two score methods (see equations (6) and (7)). However, our rate is lower than SVM1's in partition P1 of AT&T. We think that a relation function depends on its sub relation functions: if they have good quality then the relation function gives good results, and vice versa. Therefore, in this case we obtained a bad set of sub relation functions. For the FERET dataset, we cannot apply our method to the complete fa set because it contains 1196 images, which means there are 1,430,416 pairs in a relation set. This number is too large for any discriminative algorithm, even when it is divided into small subsets. This is one of the challenges for our method. The second is variations caused by imaging conditions such as illumination and pose, or the expression of the face. In our opinion, if the variations are too large (their values exceed the robustness value of the relation function) then our method becomes unpredictable. Therefore, we think that the relation learning method should be combined with image processing techniques or face modeling such as Active
Appearance Models (AAM) [17] and 3D Morphable Model (3DMM) [18]. These methods could reduce the effects of illumination and pose and extract good feature vectors with small variations.
5 Conclusion

In summary, we proposed a new method called relation learning for classification. Preliminary experiments performed on the AT&T and FERET databases show good results. However, there are still some challenges for this method, such as the size of the databases and the time performance. We believe these could be solved in the future due to the rapid development of the computer industry. In our opinion, one of the interesting aspects of the method is that it can be extended to any relation in real life. We think we can apply the method to answer questions like "Who is his father?" instead of "Who is he?" in the future.
References
1. Li, S.Z., Jain, A.K.: Handbook of face recognition. Springer, Heidelberg (2005)
2. Phillips, P., et al.: The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1090–1104 (2002)
3. Phillips, P.J., et al.: FRVT 2006 and ICE 2006 large-scale results. National Institute of Standards and Technology, NISTIR, vol. 7408 (2007)
4. Bishop, C.: Neural networks for pattern recognition. Oxford University Press, USA (1995)
5. Bishop, C.: Pattern recognition and machine learning. Springer, New York (2006)
6. Duda, R.O., et al.: Pattern Classification, 2nd edn. Wiley-Interscience, Hoboken (2000)
7. Fukunaga, K.: Introduction to statistical pattern recognition. Academic Press, New York (1990)
8. Kuncheva, L.: Combining pattern classifiers: methods and algorithms. Wiley Interscience, Hoboken (2004)
9. Webb, A.: Statistical pattern recognition. A Hodder Arnold Publication (1999)
10. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 1991, pp. 586–591 (1991)
11. Bartlett, M.S., et al.: Face recognition by independent component analysis. IEEE Transactions on Neural Networks 13, 1450–1464 (2002)
12. Guillaumin, M., et al.: Is that you? Metric learning approaches for face identification (2009)
13. Chopra, S., et al.: Learning a similarity metric discriminatively, with application to face verification, pp. 539–546 (2005)
14. Phillips, P., et al.: Support vector machines applied to face recognition, Citeseer (1998)
15. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20, 273–297 (1995)
16. Vapnik, V.: The nature of statistical learning theory. Springer, Heidelberg (1995)
17. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In: Burkhardt, H., Neumann, B., et al. (eds.) ECCV 1998. LNCS, vol. 1407, p. 484. Springer, Heidelberg (1998)
18. Blanz, V., Vetter, T.: Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1063–1074 (2003)
Temporal Prediction and Spatial Regularization in Differential Optical Flow

Matthias Hoeffken, Daniel Oberhoff, and Marina Kolesnik

Fraunhofer Institute FIT, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
Abstract. In this paper we present an extension to the Bayesian formulation of multi-scale differential optical flow estimation by Simoncelli et al. [1]. We exploit the observation that optical flow is consistent in consecutive time frames and thus propagating information over time should improve the quality of the flow estimation. This propagation is formulated via the insertion of additional Kalman filters that filter the flow over time by tracking the movement of each pixel. To stabilize these filters and the overall estimation, we insert a spatial regularization into the prediction lane. Through the recursive nature of the filter, the regularization has the ability to perform filling-in of missing information over extended spatial extents. We benchmark our algorithm, which is implemented in the nVidia CUDA framework to exploit the processing power of modern graphical processing units (GPUs), against a state-of-the-art variational flow estimation algorithm that is also implemented in CUDA. The comparison shows that, while the variational method yields somewhat higher precision, our method is more than an order of magnitude faster and can thus operate in real-time on live video streams.
1 Introduction
One of the main tasks in computer vision is reliable, robust and time-efficient motion estimation in video streams. Since in most scenarios estimating the true (2D) motion of objects is not feasible, dense optical flow maps of the entire scene become a (not necessarily equivalent) surrogate. Probably the first approaches were those developed by Lucas and Kanade [2] and Horn and Schunck [3]. Since then several other methods and many better performing modifications have been developed, and we refer interested readers to surveys such as [4], which additionally provides benchmark results for many state-of-the-art algorithms and implementations of optical flow estimation. They are compared in terms of the angular error, the endpoint error and the (normalized) interpolation error on natural and artificial ground-truth data sets. The majority of these competing algorithms try to outperform each other by minimizing these error statistics, neglecting other criteria such as robustness and time efficiency.
The research leading to these results has received funding from the European Community's Seventh Framework Programme under grant agreement no. 215866, project SEARISE.
However, these criteria are crucial for many application fields (e.g. robotics, video surveillance, etc.) in which real-time processing of visual information is required and therefore GPGPU-based implementations are employed. To our knowledge only a few real-time capable algorithms are available that can handle ten or more reasonably sized frames per second with a decent quality of the resulting optical flow. One such algorithm is the "Bayesian multiscale differential optical flow" approach [1], whose GPGPU implementation is freely available [5]. Among the algorithms listed in [4] only the approaches [6] and [7] seem to come close to the real-time requirement. These algorithms also take advantage of GPGPUs, but their implementations are not public. The aim of this paper is to introduce a qualitative improvement to the "Bayesian multiscale differential optical flow" approach [1] in terms of signal-to-noise ratio and errors due to the aperture problem, while retaining its real-time capability. In the first section we outline the basic ideas of the approach [1]. Next, we describe an algorithmic extension of the approach which considerably improves its robustness while keeping its excellent runtime performance. We exemplify our algorithm with snapshots of the optical flow computed on real-world video streams and finally proceed to the conclusion.
2 Bayesian Multi-scale Differential Optical Flow
In the approach [1] the brightness constancy constraint of the Horn and Schunck method is combined with an uncertainty model in the form of two-dimensional Gaussian distributions over spatio-temporal gradients and the resulting velocity estimates. Regularization is naturally achieved by introducing a prior on the parameters of the Gaussian. Additionally, a scale pyramid is utilized to handle large and small velocities equally well and to speed up computations. The different pyramid scales are combined via Kalman filters at each image pixel, starting at the coarsest scale. The prediction of the Kalman filters is used to warp the image at the next scale before the gradient operator is applied, followed by the combination with the measurement to yield a new estimate (Fig. 1).
Fig. 1. Left: The velocity estimation is the result of Kalman filtering through the scales, from coarsest to finest. Right: The gradient at a given scale is computed from three pixels in consecutive frames, where the choice of pixels is guided by the prediction to avoid aliasing.
Fig. 2. Illustration of the information flow in the original algorithm and our extension
The Kalman filters naturally propagate and combine uncertainty from each scale. This has a stabilizing effect on the estimation and preserves important information such as the presence of the aperture problem.
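At its core, the per-pixel combination of a predicted and a measured Gaussian velocity estimate is a precision-weighted (Kalman-style) update. The following is our own illustrative sketch of that fusion step for a single pixel, not the authors' CUDA implementation.

```python
# Illustrative sketch: fusing two 2-D Gaussian velocity estimates at one pixel
# (a Kalman update without dynamics), not the authors' code.
import numpy as np

def fuse_gaussians(mu_pred, cov_pred, mu_meas, cov_meas):
    P_pred = np.linalg.inv(cov_pred)          # precision of the prediction
    P_meas = np.linalg.inv(cov_meas)          # precision of the measurement
    cov = np.linalg.inv(P_pred + P_meas)      # fused covariance
    mu = cov @ (P_pred @ mu_pred + P_meas @ mu_meas)   # fused mean
    return mu, cov
```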
3 Temporal Prior
Our extension to Simoncelli's approach utilizes the observation that motion vectors in two consecutive video frames are mostly consistent, meaning that sudden changes in the motion of a pixel are relatively rare (by changes we mean changes in time when moving along the motion vector, not changes in space or changes at a fixed image position). We exploit this fact by introducing additional Kalman filters over time to predict the motion in one video frame based on the preceding frame at each scale of the pyramid. Furthermore, motion in small neighborhoods tends to be roughly constant. Thus, before the prediction is generated, we integrate information from small local neighborhoods using a soft intersection of constraints (IOC) to increase the quality of the prediction. The recursive nature of the predictive process combined with this spatial integration also has a diffusive effect: information is constantly propagated through the image space, thus filling in the information in regions with weak texture or texture suffering from the aperture problem. Figure 2 shows an illustration of the computational steps of the original algorithm and our extended version.
Fig. 3. Taking the geometric mean of several distributions enhances the region in which the distributions agree, thereby implementing a probabilistic intersection of constraints. The case illustrated here could arise at a corner between two orthogonal image edges belonging to the same object, which is moving diagonally with respect to both edges: the motion estimations from both edges suffer from the aperture problem, but if they are combined using the geometric average the resulting estimate predicts the diagonal motion correctly.
The details of the newly introduced steps are explained in the following sub-sections.

3.1 Generating the Prediction
We generate a prediction for the optical flow at a given point in time and space based on the optical flow map from the preceding frame. For this we assume that the motion direction and velocity of objects in the scene do not change within one frame step. Under this assumption the prediction becomes an inverse problem: the motion of a pixel at some position in the current frame is predicted to be that of a pixel in the previous frame which, when translated according to its own motion vector, ends up at this position. Inferring the prediction is thus equal to inferring the probabilities p_ij that a pixel at position j in the last frame was translated to a pixel at position i in the current frame. Since the pixels have a finite spatial extent, evaluating these probabilities requires marginalization of the velocity distributions over the source and target pixel regions:

p_ij = ∫_{x_i−ϵ}^{x_i+ϵ} ∫_{x_j−ϵ}^{x_j+ϵ} ∫_{y_i−ϵ}^{y_i+ϵ} ∫_{y_j−ϵ}^{y_j+ϵ} N(v_ij; μ_i, Σ_i) dx̃_i dx̃_j dỹ_i dỹ_j,  with v_ij = x̃_j − x̃_i    (1)

where x_i, y_i denote the position around the source pixel, x_j, y_j denote the position around the target pixel, bold letters are used to signify vectors (i.e. x̃_i ≡ (x̃_i, ỹ_i)), the tilde is used to differentiate integration variables from pixel coordinates, N(v_ij; μ_i, Σ_i) is a normal distribution in v_ij with mean μ_i and covariance Σ_i, which are taken from the motion estimate at the source pixel, and ϵ is the pixel radius. Unfortunately this formulation of p_ij would involve the integration of truncated Gaussians, which requires an expensive numerical
evaluation of the Gaussian error function (Erf). On the other hand, approximating the integral by evaluating the Gaussians only for the motion vectors connecting the centers of the two pixels raises the risk of missing much of the probability mass. In order to avoid this while keeping the computation simple, we approximate the rectangular pixel domain with a Gaussian kernel:

p_ij = ∫ N(v_ij; μ_i, Σ_i) N(x̃_i; μ = x_i, σ = ϵ) N(x̃_j; μ = x_j, σ = ϵ) dx̃_i dx̃_j    (2)

which has the form of a standard Gaussian integral and can be evaluated analytically in one step (this approximation introduces some bias towards the center of the pixel and some blurring between neighboring pixels, but we assume that the approximation error can be neglected). The set of p_ij with a common destination pixel defines the un-normalized predictive distribution for the motion in this pixel. To avoid having to deal with large look-up tables and to arrive at a continuous distribution, we convert this directly into a Gaussian via moment matching:

N_i^t = ∑_j p_ij    (3)

μ̂_i^t = (1/N_i^t) ∑_j p_ij v_j^{t−1}    (4)

Σ̂_i^t = (1/N_i^t) ∑_j p_ij (v_j^{t−1} − μ̂_i)(v_j^{t−1} − μ̂_i)^T    (5)

p(ṽ_i^t | v^{t−1}) ≈ N_p(v_i^t; μ̂_i, Σ̂_i)    (6)
p(vit | vt−1 ) = ⎣ p(˜ vjt | vjt−1 )νij ⎦ (7) j
and re-normalize again to obtain a Gaussian predictive distribution. For νk we use a two-dimensional binomial kernel around the target pixel position. Effectively this operation implements a probabilistic intersection of constraints (IOC) (see [8] for a discussion of IOC) concerning the possible velocity vectors within the kernel covering image patch (see Figure 3 for an illustration). Its impact is comparable to the cortical MT area along the dorsal stream, which plays an important role for early visual motion processing [9]. It models MT by integrating information to a coarser resolution. 2
This approximation introduces some bias towards the center of the pixel and some bluring between neighboring pixels, but we assume that the approximation error can be neglected.
Prediction and Regularization in Differential Optical Flow
3.2
581
Incorporating Predictions
On each resolution level of the multi-scale pyramid we obtain two velocity field predictions: One from the next coarser resolution in the current frame and another one from the same resolution in the previous frame. Both predictions comprise information about their uncertainty in terms of two-dimensional Gaussian distributions for each pixel position. In order to bring these two sources of information together, we first convert both distributions into histograms that have a resolution corresponding to the resolution of the current pyramid level. Next, we combine these histograms with a soft gating mechanism that is inspired by [9]: ωi (x) ≡ N (x; μci , Σic ) · 1 + b · p(vit | vt−1 ) (8) where μci , Σic are given by the prediction from the next coarser scale. This gating term selectively boosts the probability of velocity vectors on which both distributions agree. The intensity of this boosting is controlled by the parameter b > 0. When the predictions diagree strongly, the gating term essentially falls back to the measurement based density and the prediction from the previous frame has no effect. The final distribution is then again obtained by moment-matching using the following weighted statistics: Ni = ωi (x) (9) x
1 μi = ωi (x)x Ni x 1 T Σi = ωi (x) (x − μi ) (x − μi ) Ni x
(10) (11)
The resulting Gaussian is then substituted for the prediction from the next coarser scale alone in the original algorithm.
4
Results
To illustrate the effect of our extension to the Bayesian optical flow algorithm and to compare it against state-of-the-art variational optical flow estimation we apply it to unlabeled real world video data. This data is more complex than available labeled motion data and the characteristics of each algorithm are readily apparent visually. Also to be able to exploit temporal prediction a longer sequence of images is needed. We use video data from two surveilance settings with significantly different space scales: One video stream shows a large block of the spectator seats during a soccer game (Stadium). The other is taken inside a car tunnel (Tunnel ). These two datasets are representative for two disctinct kinds of scenarios: One with large consistently moving objects and one with small complex local motion, and reflect our experience with other data; the reason we chose to do a qualitative presentation on representative examples only is
582
M. Hoeffken, D. Oberhoff, and M. Kolesnik
Table 1. Speed of the three compared algorithms in seconds per frame
Tunnel Stadium
image resolution BOF extended BOF VF [7] 704x576 0.026 0.089 0.137 808x614 0.035 0.11 0.176
Fig. 4. Top left: An image from the Stadium sequence. Bottom Left: The motion estimate by the original Bayesian optical flow algorithm. Bottom right: The results with our extensions. Top right: Results of the variational optical flow algorithm under real time constraints.
because we deemed the available benchmark data insufficient to demonstrate the strengths of our algorithm. As state-of-the-art flow we chose the implementation of variational flow (VF) in [7] as one of a few real-time capable implementations of variational optical flow. The parameter settings appropriate for real-time operation were suggested by the original authors in personal communication. We use the HLS color code for visualization of the flow, with the hue channel indicating motion direction and the saturation channel indicating motion speed (though strongly compressed to allow for relatively large speed variations within one image). Additionally, we use the available confidence estimates to fade the pixel color from solid white to the motion-based coloring, with a slight luminance reduction for high confidence estimates. Thus regions with no estimation
Fig. 5. Top left: An image from the Tunnel sequence. Bottom Left: The motion estimate by the original BOF algorithm. Bottom right: The results with our extensions. Top right: Results of the VF algorithm under real time constraints.
confidence (usually due to missing texture) appear white while regions with an estimated motion of zero appear light grey. Table 1 shows the timing information for the original Bayesian optical flow (BOF) of Simoncelli, our extension, and the VF algorithm in [7] with parameters set up for real-time operation. It is apparent that while our extension makes processing about three times slower, it is still significantly faster than the VF. Figure 4 shows the results on a frame of the Stadium video. The most significant movements here are waving flags, a synchronized, continuous up-and-down jumping of the spectators, and two persons walking up the edge of the playing field on the left. Simoncelli’s BOF algorithm captures most of these motions quite well, except for the larger structures, which show quite sparse estimations. Comparing our extended algorithm to the BOF shows that our extension succeeds in filling in the sparse estimations of larger structures (i.e. the flags). Yet to avoid blurring we had to make the influence of the prediction relatively weak. Nevertheless some of the motion detail in the ranks of jumping spectators seems to be lost. VF, on the other hand, seems to capture the important motions well, but introduces much more noise than both Bayesian optical flow variants. Figure 5 shows the results on a
frame of the Tunnel video. Here the situation is entirely different: our extended algorithm corrects many of the wrong motion estimates on the car and performs filling in to an extent that gives motion estimation for the whole car without any excessive blurring. Compared to this the VF introduces a lot of noise, the car appears more blurred, and the person walking down the side of the tunnel can hardly be distinguished from the noise.
5 Conclusion
We presented a modification to the “Bayesian multiscale differential optical flow” by Simoncelli et al. [1] that exploits the sequential consistency in video sequences in a purely local fashion and implemented it in CUDA based on a freely available implementation of the original algorithm [5]. We have compared this modification against both the original algorithm as well as a real-time capable variational flow algorithm that is based on the global optimization of a variational cost function [7]. The results computed on real-world videos show that our algorithm, while a few times slower than the unmodified algorithm, is still real-time capable and significantly faster than the variational algorithm, which is also implemented in CUDA. In terms of quality the results show that in scenes with a dense and complex local structure, such as long-range views of a spectator crowd in a stadium, our extended algorithm suffers from the smoothing in space and time and improves over the results of Simoncelli’s original algorithm only for larger consistently moving structures. On the other hand, in scenes with relatively isolated, large-scale moving objects and persons, our modifications significantly improve the results of Simoncelli’s original algorithm, first by reducing false estimations caused by noise and the aperture problem and second by performing filling in. Our extended algorithm also clearly outperforms the variational flow operating in real-time, due to the consistent rejection of uncertain estimates. Two possible avenues for further improvement of the Bayesian optical flow are the active reduction of blurring in space and time by detecting and handling object borders and a more complex local integration scheme. The former is standard in modern variational optical flow algorithms and inspiration could be drawn from there. A more complex local integration scheme could, for example, be developed from the adjustable linear filtering approach developed by M. Chessa et al. [10].
References

1. Simoncelli, E.P., Jahne, B., Haussecker, H., Geissler, P.: Bayesian Multi-Scale Differential Optical Flow, vol. 2, pp. 397–422. Academic Press, San Diego (1999)
2. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of Imaging Understanding Workshop, pp. 121–130 (1981)
3. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
4. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M., Szeliski, R.: A database and evaluation methodology for optical flow. In: IEEE 11th International Conference on Computer Vision, ICCV 2007, pp. 1–8 (October 2007)
5. Hauagge, D.C.: Homepage, http://www.liv.ic.unicamp.b/~hauagge/Daniel_Cabrini_Hauagge/Home_Page.html
6. Rannacher, J.: Realtime 3d motion estimation on graphics hardware, Master’s thesis at Heidelberg University (2009)
7. Werlberger, M., Trobin, W., Pock, T., Wedel, A., Cremers, D., Bischof, H.: Anisotropic Huber-L1 Optical Flow. In: Proceedings of British Machine Vision Conference (BMVC) (September 2009)
8. Ferrera, V.P., Wilson, H.R.: Perceived direction of moving two-dimensional patterns. Vision Research 30(2), 273–287 (1990)
9. Bayerl, P., Neumann, H.: A fast biologically inspired algorithm for recurrent motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(2), 246–260 (2007)
10. Chessa, M., Sabatini, S.P., Solari, F., Bisio, G.M.: A recursive approach to the design of adjustable linear models for complex motion analysis. In: Proceedings of the Fourth Conference on IASTED International Conference: Signal Processing, Pattern Recognition, and Applications, pp. 33–38. ACTA Press, Anaheim (2007)
Parallel Implementation of the Integral Histogram

Pieter Bellens1, Kannappan Palaniappan4, Rosa M. Badia1,3, Guna Seetharaman5, and Jesus Labarta1,2

1 Barcelona Supercomputing Center, Spain
2 Universitat Politecnica de Catalunya, Spain
3 Intelligence Research Institute (IIIA), Spanish National Research Council (CSIC), Spain
4 Dept. of Computer Science, University of Missouri, Columbia, Missouri, USA
5 Air Force Research Laboratory, Information Directorate, Rome, New York, USA
Abstract. The integral histogram is a recently proposed preprocessing technique to compute histograms of arbitrary rectangular gridded (i.e. image or volume) regions in constant time. We formulate a general parallel version of the integral histogram and analyse its implementation in Star Superscalar (StarSs). StarSs provides a uniform programming and runtime environment and facilitates the development of portable code for heterogeneous parallel architectures. In particular, we discuss the implementation for the multi-core IBM Cell Broadband Engine (Cell/B.E.) and provide extensive performance measurements and tradeoffs using two different scan orders or histogram propagation methods. For 640 × 480 images, a tile or block size of 28 × 28 and 16 histogram bins, the parallel algorithm is able to reach greater than real-time performance of more than 200 frames per second.
1 Introduction
Regional histograms are widely used in a variety of computer vision tasks including object recognition, content-based image retrieval, segmentation, detection and tracking. Sliding window search methods using histogram measures produce high-quality results but have high computational cost. The integral histogram is a recently proposed preprocessing technique that abates this cost and enables the construction of histograms of arbitrary rectangular gridded (i.e. image or volume) regions in constant time. The integral histogram effectively enables exhaustive global search using sliding window-based histogram optimization measures to yield high-quality results [1]. Fast histogram computation using the integral histogram speeds up the sequential implementation by up to five orders of magnitude. However, the overall cost remains prohibitive for real-time applications with large images, large search window sizes and a large number of histogram bins. For example, a 512 × 512 image search using a 1000-bin feature histogram requires about 1Gb of memory and takes about one second [2]. The computation of such dense confidence maps
remains infeasible for these types of applications. A parallel implementation of the integral histogram would enable methods using global optimization of histogram measures to be competitive with or faster than other approaches in terms of speed. There are a variety of commodity multicore architectures currently available for the parallelization of image- and video-processing algorithms, including IBM’s Cell/B.E., GPUs from NVidia and AMD and many-core CPUs from Intel [3]. The dramatic growth of digital video content has been a driving force behind active research into exploiting heterogeneous multi-core architectures for computationally intensive, multimedia analysis tasks. These include real-time (and super-real-time) object recognition, object tracking in multi-camera sensor networks, stereo vision, information fusion, face recognition, biometrics, image restoration, compression, etc. [4–10]. In this paper we focus on a fast parallel integral histogram computation to improve performance for real-time applications. Section 2 defines the integral histogram and two propagation methods. Next, the definition is subjected to a block data layout in Section 3, where we identify parallel tasks and task precedence. This description of the parallel integral histogram can easily be encoded in the Star Superscalar (StarSs) programming model (Sections 4 and 5). We discuss some performance results for the Cell Broadband Engine (Cell/B.E.) in Section 6 and elaborate on future directions of this work in Section 7.
2 Computation of the Integral Histogram
In accordance with the original formulation in [11] we define an image as a function f over a two-dimensional Cartesian space R2 such that x → f(x) for a pixel x ∈ R2. f can be single-valued or multi-valued. This affects the binning function but not the algorithm itself. The binning function Q(f(x), b) evaluates to 1 if f(x) ∈ b for the bin b, otherwise its value equals 0. For a sequence of pixels S = x0, x1, . . . , xp and some S_{x_p} ⊆ S the integral histogram H(x_p, b) for bin b is defined as

H(x_p, b) = Σ_{x ∈ S_{x_p}} Q(f(x), b)   (1)
S forms the scan order. This definition states that the value for a bin b at a pixel xp in the integral histogram can be found by applying the binning function to a subset of the pixels preceding xp in the scan order. The structure of the computation resembles the propagation of pixel values throughout the image f . To emphasize the two-dimensional nature of f we can identify the vector x with its spatial coordinates (i, j). This carries over to H and f , that become H(i, j, b) and f (i, j) in this notation. The integral histogram fixes the scan order so that the computation of the histogram for a rectangular region T of f becomes computationally inexpensive. In that context the computation of the integral histogram, or propagation,
Fig. 1. (a) Intersection or computation of the histogram for the region T defined by {(k, m), (k, n), (l, m), (l, n)}, k < l, m < n. (b) The cross-weave scan results in two passes over the image. The rows and the columns can be updated independently. (c) The wavefront scan, corresponding to Porikli’s active set of points, requires just one pass over the image but has a more complex access pattern which is harder to parallelize.
precedes the computation of a histogram or intersection. We only consider scan orders that result in

H(i, j, b) = Σ_{x=0}^{i} Σ_{y=0}^{j} Q(f(x, y), b)   (2)
This means that the integral histogram at (i, j) reflects the values of all the pixels above and to the left of (i, j). The intersection for the region T (Figure 1(a)) delimited by the points {(k, m), (k, n), (l, m), (l, n)}, k < l, m < n then reduces to the combination of four integral histograms:

H(T, b) = H(k − 1, m − 1, b) + H(l, n, b) − H(k − 1, n, b) − H(l, m − 1, b)   (3)
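As a concrete illustration of Eq. (3), the C sketch below returns the histogram of a rectangular region in constant time per bin; the row-major layout H[(i·w + j)·bins + b] and the convention that indices −1 evaluate to zero are assumptions of the sketch, not part of the paper's implementation.

#include <stddef.h>

/* H is assumed to be stored as H[(i * w + j) * bins + b]; entries with a
 * negative row or column index are treated as zero. */
static unsigned int ih_at(const unsigned int *H, int w, int bins,
                          int i, int j, int b)
{
    return (i < 0 || j < 0) ? 0u : H[((size_t)i * w + j) * bins + b];
}

/* Histogram of the region {(k,m),(k,n),(l,m),(l,n)}, k < l, m < n (Eq. (3)). */
void region_histogram(const unsigned int *H, int w, int bins,
                      int k, int l, int m, int n, unsigned int *out)
{
    for (int b = 0; b < bins; ++b)
        out[b] = ih_at(H, w, bins, k - 1, m - 1, b)
               + ih_at(H, w, bins, l,     n,     b)
               - ih_at(H, w, bins, k - 1, n,     b)
               - ih_at(H, w, bins, l,     m - 1, b);
}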
Porikli [11] describes two scan orders or algorithms for propagation: a string scan and a scan using an “active set of points”. The string scan does not satisfy condition (2) and gives rise to a different intersection. We formulate a variant here. Same as the original string scan, our cross-weave scan requires two passes over the image but it satisfies condition (2) and can be parallelized. The method using an active set of points corresponds to the wavefront scan in this paper. We use data flow equations to describe the propagation steps instead of a traditional algorithmic description. In the following definitions the left-hand side can be computed only if all of its terms or components have been evaluated previously. The cross-weave scan processes the image in each dimension separately and accumulates the results in the Y- and the X-direction:

(1) H(i, j, b) = 0
(2) H(i, j, b) = Q(f(i, j), b) + H(i, j − 1, b), j > 0
(3) H(i, j, b) = H(i, j, b) + H(i − 1, j, b), i > 0
The horizontal pass (step (2)) updates the histograms independently for each row, as does the vertical pass (step (3)) for each column. These passes can be reordered. After the last step condition (2) holds. Propagation using a wavefront scan performs a single pass over the input image. The integral histogram is computed by propagating an anti-diagonal wavefront calculation. The histogram for each pixel combines the histograms of its top, left and top left neighbor, while incrementing the associated bin:

(1) H(i, j, b) = 0
(2) H(i, j, b) = H(i − 1, j, b) + H(i, j − 1, b) − H(i − 1, j − 1, b) + Q(f(i, j), b)

This minor diagonal, pixel-level scan does not have a straightforward parallel interpretation like the cross-weave scan.
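A minimal sequential C sketch of the cross-weave propagation is shown below (assuming an 8-bit image, a uniform binning function and the same row-major layout as above); the two loop nests correspond to steps (2) and (3). It is illustrative only and ignores the tiling introduced in the next section.

#include <stddef.h>
#include <string.h>

static int bin_of(unsigned char v, int bins) { return v * bins / 256; }

void integral_histogram_crossweave(const unsigned char *img, int h, int w,
                                   int bins, unsigned int *H)
{
    memset(H, 0, (size_t)h * w * bins * sizeof *H);              /* step (1) */
    for (int i = 0; i < h; ++i)                                  /* step (2): rows */
        for (int j = 0; j < w; ++j) {
            unsigned int *cur = &H[((size_t)i * w + j) * bins];
            if (j > 0)                       /* copy the row prefix from the left */
                for (int b = 0; b < bins; ++b)
                    cur[b] = H[((size_t)i * w + j - 1) * bins + b];
            cur[bin_of(img[i * w + j], bins)] += 1;              /* add Q(f(i,j),b) */
        }
    for (int i = 1; i < h; ++i)                                  /* step (3): columns */
        for (int j = 0; j < w; ++j)
            for (int b = 0; b < bins; ++b)
                H[((size_t)i * w + j) * bins + b] +=
                    H[((size_t)(i - 1) * w + j) * bins + b];
}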
3 Parallelization Using Tiles and Scan Propagation
We derive a tiled or block-based version of the integral histogram that accesses data in d-dimensional chunks, where d = 2 for images. The calculations on tiles or blocks and the associated data transfers can then be expressed as a set of partially ordered tasks for the StarSs programming model (Section 4). The block data layout enables scalable and generalizable access to large multidimensional datasets in a regular manner with predictable and consistent performance. Figure 2 illustrates the tiled or block data layout for the image and the integral histogram. We preserve the usual coordinate system, but now a coordinate pair identifies a tile instead of individual pixels or histograms. An image of dimensions w × h has an integral histogram of (w × h) × bc bins, with bc the number of bins in a histogram. It admits a division into blocks of bw × bh pixels, whereas the integral histogram can be decomposed in tiles of bw × bh histograms. Conversely, each tile of the integral histogram holds bw × bh × bc bins. The image can be padded to eliminate boundary conditions. In the block data layout the image as well as the integral histogram consist of wB × hB blocks with wB = w/bw and hB = h/bh. We distinguish between blocks and individual elements using a slightly different notation. Blocks fi,j and Hi,j correspond to the tiles at row i and column j in the block data layout of the image and the integral histogram respectively, while (i, j) designates a pixel or a histogram depending on the context.

Fig. 2. Tiled image (integral histogram) or block data layout for a w × h image. Each tile contains bw × bh pixels (histograms).
The chunking of image data into equal-sized 2D tiles requires propagation of partial integral histogram information inside and between tiles. The propagation for the parallel implementation naturally extends the cross-weave pattern or the wavefront pattern from Section 2. The wavefront scan processes the elements from left to right and from top to bottom within a block Hi,j and iterates over the bins for each histogram. On the level of tiles the computation of Hi,j requires fi,j as input, together with Hi,j−1 , Hi−1,j and Hi−1,j−1 . In the first pass the cross-weave scan computes Hi,j with Hi,j−1 and fi,j as input. The second pass updates Hi,j using Hi−1,j . We will restrict our attention to the case of the wavefront scan. The development of the algorithm for the cross-weave scan is analogous.
Fig. 3. (a) Inter-block dependencies for block Hi,j with communication halos or aprons for propagation between tiles. For some histograms we detailed the inter-block dependencies with arrows. (b) Tiled layout for the integral histogram. Each block contains bw × bh histograms. The halos duplicate the histograms at the borders and are used to pass histograms to the neighboring blocks.
The aforementioned dependence between blocks stems from the dependence between elements contained in those blocks. Figure 3(a) identifies three sets or halos for a tile Hi,j, namely H^d_{i,j}, H^h_{i,j} and H^v_{i,j}. The subscript identifies the tile whose edges require the histograms in these halos. Halos or aprons enable the flow of information between adjacent tiles [12]. The singleton H^d_{i,j} contains the histogram (bh − 1, bw − 1) of the diagonally opposite block Hi−1,j−1, while H^v_{i,j} = {(l, bw − 1) ∈ Hi,j−1 | l = 0, . . . , bh − 1} and H^h_{i,j} = {(bh − 1, l) ∈ Hi−1,j | l = 0, . . . , bw − 1}. Element (0, 0) of Hi,j depends on H^d_{i,j}, elements (k, 0), k = 0, . . . , bh − 1 need H^v_{i,j} and finally the propagation for elements (0, k), k = 0, . . . , bw − 1 uses H^h_{i,j}. We can consequently identify each tile fi,j (or Hi,j) with a unit of computation or a task ti,j. This map defines the task precedence via the tile dependencies: ti,j is eligible for execution if all tasks ti−k,j−l with 0 < k ≤ i, 0 < l ≤ j have
been computed. In particular, ti,j accepts as input arguments the tile fi,j and the halos H^h_{i,j}, H^v_{i,j} and H^d_{i,j}. Its output consists of the integral histogram block Hi,j and the halos H^h_{i+1,j}, H^v_{i,j+1} and H^d_{i+1,j+1}. Figure 3(b) explicitly allocates buffers for the halos to illustrate the data flow of the computation. Strictly considered these halos serve no practical purpose: they are conceptual constructs rather than actual data structures. We will see in Section 5 however that a physical instantiation of the halos helps in the implementation of the algorithm in StarSs.
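The following C sketch illustrates the work performed by one wavefront task t_{i,j} under assumed tile parameters (BW = BH = 28, BINS = 16): it consumes the three incoming halos, fills its integral-histogram tile and emits the halos for its right, bottom and diagonal neighbours. Boundary tiles would receive zero-filled halos. This is an illustration of the data flow, not the actual StarSs task code.

#include <string.h>

#define BW   28
#define BH   28
#define BINS 16

typedef unsigned int hist_t[BINS];

void tile_task(const unsigned char f[BH][BW],
               const hist_t halo_h[BW],   /* bottom row of H_{i-1,j}   */
               const hist_t halo_v[BH],   /* right column of H_{i,j-1} */
               const hist_t halo_d,       /* corner of H_{i-1,j-1}     */
               hist_t H[BH][BW],
               hist_t out_h[BW], hist_t out_v[BH], hist_t out_d)
{
    for (int r = 0; r < BH; ++r)
        for (int c = 0; c < BW; ++c)
            for (int b = 0; b < BINS; ++b) {
                unsigned int up   = r ? H[r-1][c][b] : halo_h[c][b];
                unsigned int left = c ? H[r][c-1][b] : halo_v[r][b];
                unsigned int diag = (r && c) ? H[r-1][c-1][b]
                                  : (r ? halo_v[r-1][b]      /* c == 0       */
                                  : (c ? halo_h[c-1][b]      /* r == 0       */
                                       : halo_d[b]));        /* r == c == 0  */
                /* wavefront recurrence: up + left - diag + Q(f(r,c), b) */
                H[r][c][b] = up + left - diag
                           + (b == f[r][c] * BINS / 256 ? 1u : 0u);
            }
    /* halos consumed by t_{i+1,j}, t_{i,j+1} and t_{i+1,j+1} */
    for (int c = 0; c < BW; ++c) memcpy(out_h[c], H[BH-1][c], sizeof(hist_t));
    for (int r = 0; r < BH; ++r) memcpy(out_v[r], H[r][BW-1], sizeof(hist_t));
    memcpy(out_d, H[BH-1][BW-1], sizeof(hist_t));
}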
4 Star Superscalar (StarSs) Parallel Programming Model
The StarSs programming model [13, 14] provides a convenient way to parallelize sequential code for various parallel architectures. It generally suffices that the user adds pragmas to the original code to mark the functions (or tasks) intended to execute on the parallel resources. The StarSs source-to-source compiler converts these pragmas into calls to the StarSs runtime library. As the application advances, the StarSs runtime executes the tasks in parallel as dictated by the data dependencies present in the original program. The main thread of a StarSs application executes the sequential code and switches to the StarSs libraries when it encounters a call to a function marked with a pragma. The StarSs runtime does not immediately execute this function or task. Instead it analyzes the task arguments to find the true dependencies that define task precedence. StarSs avoids output dependencies and anti-dependencies by renaming arguments. The main thread returns control to the user application and the StarSs runtime records the task in the Task Dependency Graph (TDG). Simultaneously the runtime schedules ready tasks (or tasks without outstanding dependencies in the TDG) to the resources or workers. Workers remove finished tasks from the TDG and update the state of dependent tasks. Ultimately this pruning of dependencies turns dependent tasks into ready tasks that again become scheduling candidates. In StarSs, dependence analysis, renaming, scheduling and updates to the TDG take place concurrently at run-time. These facilities convert a sequential stream of tasks generated by a single thread into a TDG and, from there, into a parallel execution of tasks on multiple resources. This model has a strong resemblance to dynamic scheduling in superscalar processors. The pipeline there decodes instructions in order but schedules them to multiple units in parallel to the extent allowed by the data dependencies.

Fig. 4. Overview of the structure of a StarSs application
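To give a flavour of the model, the sketch below annotates a trivial function as a task. The pragma spelling follows the CellSs/SMPSs dialect of StarSs (#pragma css task) and may differ in other StarSs instances; the example is illustrative and does not appear in the implementation described here.

#define BLK 1024

/* CellSs/SMPSs-style task annotation (assumed spelling); the compiler turns
 * the pragma into runtime calls, and the runtime tracks the dependencies
 * carried by the input/output arguments. */
#pragma css task input(a, b) output(c)
void vector_add(const float a[BLK], const float b[BLK], float c[BLK])
{
    for (int i = 0; i < BLK; ++i)
        c[i] = a[i] + b[i];
}

void add_all(const float (*A)[BLK], const float (*B)[BLK], float (*C)[BLK],
             int nblocks)
{
    /* The main thread only generates tasks; they may execute out of order
     * on the workers as soon as their inputs are available. */
    for (int k = 0; k < nblocks; ++k)
        vector_add(A[k], B[k], C[k]);
#pragma css barrier   /* assumed synchronization pragma: wait for all tasks */
}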
5 Implementation Using Tile-Level Halos
The block algorithms of Section 3 can be implemented in a straightforward manner in StarSs. The main function is a sequential implementation of the respective scan order. For the wavefront scan the code steps through the blocks in the diagonals parallel to the minor diagonal of the blocked integral histogram. At run-time this generates the sequence of tasks t0,0, t1,0, t0,1, t2,0, t1,1, t0,2, . . . It suffices to ensure that the data dependencies as exposed via the task arguments completely define the task precedence. To that end we allocate physical buffers for the halos as in Figure 3(b). For example, a separate location in main memory duplicates the elements of halo H^h_{i,j} from block Hi−1,j. We refer to such a buffer with the name of the halo it contains. Our implementation explicitly models the data flow between the tiles of the integral histogram. StarSs then detects the data dependencies between the tasks of the integral histogram. Although this data representation is slightly redundant, it is a small price to pay in the light of automatic parallelization. Both the image and the integral histogram are represented in block data layout (Section 3). They consist of wB × hB tiles. The halos H^v_{i,j} occupy an additional hB × bh × wB × bc bins, the halos H^h_{i,j} take up wB × bw × hB × bc bins and the H^d_{i,j} hB × wB × bc bins. Task ti,j reads the halos H^h_{i,j}, H^v_{i,j} and H^d_{i,j} and, aside from Hi,j, it produces H^h_{i+1,j}, H^v_{i,j+1} and H^d_{i+1,j+1}, which in turn are read by ti+1,j, ti,j+1 and ti+1,j+1. These chains of halo production and consumption establish the required data dependencies. Figure 5 depicts the TDG for the integral histogram for a small image size for both scan methods. The cross-weave scan visits each tile Hi,j twice, once during the horizontal pass and next in the vertical pass, and generates twice as many tasks as the wavefront scan. The storage requirements can be reduced by noting that the horizontal halos can be recycled per row, the vertical halos per column and the diagonal halos per diagonal. The lifetime of H^h_{i,j} ends before H^h_{i+1,j} is produced because task ti,j executes and finishes before ti+1,j, ∀j = 0, . . . , wB − 1. The task precedence for this application guarantees that the accesses to H^h_{i,j} do not overlap in time for fixed j. Similar observations hold for the vertical and diagonal halos. Hence there is no reason to separate the input and output halos of a task: H^h_{i,j}, H^v_{i,j} and H^d_{i,j} can occupy the same memory as H^h_{i+1,j}, H^v_{i,j+1} and H^d_{i+1,j+1} respectively. With this reduction the horizontal halos occupy wB × bw × bc additional bins, the vertical halos hB × bh × bc bins and the diagonal halos (wB + hB − 1) × bc bins. As a side-effect the task can be defined with fewer parameters, because the halos that are read and written are one and the same. At run-time this translates to fewer arguments per StarSs task, less dependency analysis and less runtime overhead.
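The task-generation loop of the main thread can be pictured as follows (an illustrative sketch; the actual task call is only indicated by a comment). For a 4 × 4 tiling the printed visit order reproduces the sequence t0,0, t1,0, t0,1, t2,0, t1,1, t0,2, . . . given above.

#include <stdio.h>

/* Walk the tiles along anti-diagonals d = i + j; at the marked point the
 * corresponding StarSs task t(i,j) would be generated. */
void generate_wavefront_order(int hB, int wB)
{
    for (int d = 0; d <= hB + wB - 2; ++d)
        for (int i = (d < hB ? d : hB - 1); i >= 0; --i) {
            int j = d - i;
            if (j >= wB)
                continue;              /* tile falls outside the grid          */
            printf("t(%d,%d) ", i, j); /* here: spawn the task for tile (i, j) */
        }
    printf("\n");
}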
6 Experiments
We implemented the cross-weave scan and the wavefront scan in StarSs according to the specification in Section 5. The programming model ensures portability
Fig. 5. Task Dependency Graph for the tiled integral histogram (wB = hB = 4) with (a) the cross-weave scan and (b) the wavefront scan. The tasks are numbered according to program order, which is identical to the scan order in our implementation. The cross-weave scan has two different types of tasks, represented by the two different colors of its nodes. In (c) the tiles in the block data layout (Figure 2) are matched up with the corresponding tasks of the wavefront scan. The top left corner contains the coordinates of the tile and the bottom right corner the task number.
across a variety of platforms, but owing to the page limit we restrict the scope to the Cell/B.E. We chose this particular architecture, first, because it belongs to the family of heterogeneous multi-core architectures, which together with GPGPU shapes the form of high-performance computing today, and second, because for this architecture peak performance is notoriously hard to achieve despite its impressive potential. All measurements were performed on a QS20 Blade at the Barcelona Supercomputing Center. The default image size is 640 × 480 and the number of bins is set to 16, 32, 64 or 128.

6.1 Integral Histogram for Single Images
Figure 6 summarizes the performance of the cross-weave scan and the wavefront scan for a single image. The StarSs implementation accepts the block size as an input argument, so we repeated the measurements for different values. This implementation proves to be very practical, because the block size for best performance is not uniform across different bin counts. For example, the cross-weave scan for 16 bins performs best for blocks of 28 × 28 elements, whereas 128 bins require smaller blocks of 11 × 11. In general two factors determine the performance of a Cell/B.E. application. Computations on an SPE must be well balanced, so that code execution can seamlessly overlap the latency of the associated data transfers. This requirement relates to the granularity of tasks. If dynamic analysis (i.e. task creation, dependence analysis, scheduling, etc.) takes place while
Fig. 6. Performance of the cross-weave (a,b,c,d) and wavefront scan (e,f,g,h) in CellSs on a 640 × 480 image for different block sizes and different numbers of bins
Fig. 7. Comparison of the cross-weave scan in (a) with the wavefront scan in (b) for varying numbers of histogram bins and 8 SPEs
the SPEs execute, run-time overhead influences the performance as well. Larger tiles divide the image or the integral histogram in fewer parts, which results in fewer tasks and less overhead at execution time. The results in Figure 6 can be interpreted and understood as balancing granularity versus run-time overhead. The wavefront scan (175 fps and 20 fps for 16 and 128 bins resp., speedup between 5 and 6 for 8 SPEs) generates fewer tasks than the cross-weave scan (120 fps and 35 fps for 16 and 128 bins resp., speedup between 4 and 5 for 8 SPEs) for the same block size and consistently outperforms the latter. For both propagation methods the performance improves as the tasks become larger and fewer, until the task granularity becomes prohibitive. Note that the TDG for the wavefront scan has limited parallelism at the top and the bottom (Figure 5(b)); this algorithm inherently scales poorly. Figure 8 summarizes the performance and scalability for the wavefront scan for the optimal tile sizes (Figure 6) on different image sizes and for different bin counts.

6.2 Integral Histogram for Sequences of Images
Practical applications tend to process sequences of images but the results in Section 6.1 measure the frame rate on an isolated image. This practice fails to capture the performance of our implementation in production conditions. When incorporated into an image processing pipeline we expect performance to improve. For such a sequence the initialization and the shutdown of the StarSs libraries are amortized over multiple images, while for a single image these inherently sequential operations preclude good scalability. The wavefront scan additionally suffers from a TDG that allows limited parallelism at the beginning and the end of the execution (Figure 5(b)). The lack of ready tasks in the TDG of image i can be compensated by tasks from image i + 1. Figure 7 compares the performance for single images with the performance over multiple images for multiple bin counts. The tile size for each bin count corresponds to the optimal tile size from Figure 6 for the associated scan method.
Fig. 8. Performance and speedup of the wavefront scan for different image sizes and bin counts, for the optimal tile sizes from Figure 6
Fig. 9. Otsu-thresholded images using the integral histogram
7 Conclusions and Future Work
We described two parallel implementations of the integral histogram based on a block data layout, preserving the characteristic propagation of histograms in the original formulation of the algorithm. Each block corresponds to one (wavefront scan) or two (cross-weave scan) tasks. Inside the tiles we process the elements in row-major order, with halos passing histograms between tiles. This formulation has the advantage that it cleanly models the data flow and scales well for images of different sizes. As a result the implementation in StarSs was straightforward and easy. The cross-weave scan reaches 120 fps for histograms of 16 bins for images with dimensions 640 × 480. The wavefront scan has a more symmetric, but also a severely more restricted TDG. The reduction in tasks results in 220 fps for the same image size and bin count. However, it must be stressed that the code is portable and is being evaluated on all platforms supported by StarSs, including SMP and GPU. We are also in the process of developing applications based on the integral histogram. For example, Figure 9 illustrates the results from Otsu-thresholding using our implementation of the integral histogram on the motion energy filtering output produced by the flux tensor algorithm [15–17] using a sequence of surveillance images [18]. In order to improve the absolute performance for larger image sizes such as 1920 × 1080, image tiles could be distributed to multiple Cell/B.E. processors in a cluster configuration, or the processing of different images could be overlapped in time.

Acknowledgements. The authors acknowledge the support of the Spanish Ministry of Science and Innovation (contract no. TIN2007-60625), the European Commission in the context of the HiPEAC Network of Excellence (contract no. IST-004408), and the MareIncognito project under the BSC-IBM collaboration agreement. This research was partially supported by grants to KP from the U.S. Air Force Research Laboratory (AFRL) under agreements FA8750-10-10182 and FA8750-11-1-0073. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of AFRL or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. The FPSS video sequences were provided by Dr. Alex Chan at the U.S. Army Research Laboratory.
References

1. Aldavert, D., de Mantaras, R.L., Ramisa, A., Toledo, R.: Fast and robust object segmentation with the integral linear classifier. In: IEEE Conf. Computer Vision and Pattern Recognition, pp. 1046–1053 (2010)
2. Wei, Y., Tao, L.: Efficient histogram-based sliding window. In: IEEE Conf. Computer Vision and Pattern Recognition, pp. 3003–3010 (2010)
3. Blake, G., Dreslinski, R.G., Mudge, T.: A survey of multicore processors. IEEE Signal Processing Magazine 26(6), 26–37 (2009)
4. Lin, D., Huang, X., Nguyen, Q., Blackburn, J., Rodrigues, C., Huang, T., Do, M.N., Patel, S.J., Hwu, W.-M.W.: The parallelization of video processing. IEEE Signal Processing Magazine 26(6), 103–112 (2009)
5. Shams, R., Sadeghi, P., Kennedy, R., Hartley, R.: A survey of medical image registration on multicore and the GPU. IEEE Signal Processing Magazine 27(2), 50–60 (2010)
6. Palaniappan, K., Bunyak, F., Kumar, P., Ersoy, I., Jaeger, S., Ganguli, K., Haridas, A., Fraser, J., Rao, R., Seetharaman, G.: Efficient feature extraction and likelihood fusion for vehicle tracking in low frame rate airborne video. In: 13th Int. Conf. Information Fusion (2010)
7. Mehta, S., Misra, A., Singhal, A., Kumar, P., Mittal, A., Palaniappan, K.: Parallel implementation of video surveillance algorithms on GPU architectures using CUDA. In: 17th IEEE Int. Conf. Advanced Computing and Communications, ADCOM (2009)
8. Kumar, P., Palaniappan, K., Mittal, A., Seetharaman, G.: Parallel blob extraction using the multi-core cell processor. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2009. LNCS, vol. 5807, pp. 320–332. Springer, Heidelberg (2009)
9. Grauer-Gray, S., Kambhamettu, C., Palaniappan, K.: GPU implementation of belief propagation using CUDA for cloud tracking and reconstruction. In: 5th IAPR Workshop on Pattern Recognition in Remote Sensing (ICPR), pp. 1–4 (2008)
10. Zhou, L., Kambhamettu, C., Goldgof, D., Palaniappan, K., Hasler, A.F.: Tracking non-rigid motion and structure from 2D satellite cloud images without correspondences. IEEE Trans. Pattern Analysis and Machine Intelligence 23(11), 1330–1336 (2001)
11. Porikli, F.: Integral histogram: A fast way to extract histograms in Cartesian spaces. In: IEEE Conf. Computer Vision and Pattern Recognition, pp. 829–836 (2005)
12. Podlozhnyuk, V.: Image convolution with CUDA. Technical report, NVIDIA Corp., Santa Clara, CA (2007)
13. Planas, J., Badia, R.M., Ayguadé, E., Labarta, J.: Hierarchical task-based programming with StarSs. Int. J. High Perform. Comput. Appl. 23(3), 284–299 (2009)
14. Perez, J.M., Bellens, P., Badia, R.M., Labarta, J.: CellSs: Making it easier to program the Cell Broadband Engine processor. IBM J. Res. Dev. 51(5), 593–604 (2007)
15. Palaniappan, K., Ersoy, I., Nath, S.K.: Moving object segmentation using the flux tensor for biological video microscopy. In: Ip, H.H.-S., Au, O.C., Leung, H., Sun, M.-T., Ma, W.-Y., Hu, S.-M. (eds.) PCM 2007. LNCS, vol. 4810, pp. 483–493. Springer, Heidelberg (2007)
16. Bunyak, F., Palaniappan, K., Nath, S.K., Seetharaman, G.: Flux tensor constrained geodesic active contours with sensor fusion for persistent object tracking. J. Multimedia 2(4), 20–33 (2007)
17. Bunyak, F., Palaniappan, K., Nath, S.K., Seetharaman, G.: Geodesic active contour based fusion of visible and infrared video for persistent object tracking. In: 8th IEEE Workshop Applications of Computer Vision (WACV 2007), Austin, TX, pp. 35–42 (February 2007)
18. Chan, A.L.: A description on the second dataset of the U.S. Army Research Laboratory Force Protection Surveillance System. Technical Report ARL-MR-0670, Army Research Laboratory, Adelphi, MD (2007)
System on Chip Coprocessors for High Speed Image Feature Detection and Matching

Marek Kraft, Michal Fularz, and Andrzej Kasiński

Poznań University of Technology, Institute of Control and Information Engineering, Piotrowo 3A, 60-965 Poznań, Poland
[email protected]
Abstract. Successfully establishing point correspondences between consecutive image frames is important in tasks such as visual odometry, structure from motion or simultaneous localization and mapping. In this paper, we describe the architecture of compact, energy-efficient dedicated hardware processors enabling fast feature detection and matching.1
1 Introduction
Feature detection and matching is an important step preceding higher-level algorithms of the image processing pipeline. The performance of these higher level algorithms, such as robot navigation based on simultaneous localization and map building or visual odometry, 3D reconstruction, object tracking and recognition, image mosaicking etc. is directly influenced by the quality of matches and the speed of the matching process. An increasingly popular approach to object feature detection is to recognize distinctive, natural image features instead of artificial landmarks. The progress in this field resulted in development of feature detectors that can successfully cope with realistic application scenarios. The progress in the field of natural feature detection is naturally followed by the progress in the field of image feature matching. Simple approaches, based on direct feature neighborhood correlation using sum of absolute differences (SAD) or sum of squared differences (SSD), have been ruled out by more robust approaches like e.g. zero normalized cross correlation (ZNCC). Such approaches use the direct values of feature neighborhood pixel intensities. The most recent approach, allowing for even more robustness, is to use feature descriptors, encoding the distinctive properties of image feature neighborhood. The performance of recently developed algorithms comes oftentimes at the cost of increased complexity, involving increased use of computational resources for implementation. The requirement of real-time performance is therefore hard to satisfy, especially in mobile, power- and resource-constrained applications. 1
This project is partially funded by the Polish Ministry of Science and Higher Education, project number N N514 213238.
In this paper, we describe dedicated coprocessors for system-on-a-chip (SoC) architectures, allowing robust image feature detection and matching. The coprocessors are based on the FAST (Features from Accelerated Segment Test) image feature detector and the SURF (Speeded-Up Robust Features) feature descriptor. The algorithms have been tailored to enable efficient hardware implementation. The use of FPGA circuits reduces the cost, size and power consumption of the resulting systems, offers the flexibility to modify the definitions of features and allows new functionality to be incorporated in new system hardware revisions.
2 Description of the Implemented Algorithms

2.1 FAST Feature Detector
FAST is a recently proposed feature detection method [11][10]. As claimed in [11], the algorithm offers better repeatability when compared to other single-scale feature detectors, e.g. Harris [4], DoG (difference of Gaussians), or SUSAN [13]. However, as a single-scale algorithm, FAST does not offer the robustness to scale change available e.g. with the SIFT (scale invariant feature transform) [7], SURF [2] or CenSurE (center surround extrema) [1] detectors. Unfortunately, a major drawback of these multiscale methods is their computational complexity. The relatively good performance and conceptual simplicity of FAST make it a good candidate for hardware implementation. In order to indicate whether the pixel p with a specific intensity value Ip is a corner, the FAST detector performs a so-called ’segment test’. A 16-pixel Bresenham circle surrounding p is analyzed, and positive detection is declared if n points of this circle form a contiguous segment which is either darker than the center point minus a given threshold t, or brighter than the center point plus t (see Fig. 1).
Fig. 1. The illustration of segment test. The contiguous segment of pixels that satisfy the threshold condition is marked with the dashed line.
The detector performs best for n = 9 [11]. The order in which the points on the Bresenham circle are analyzed is critical for the performance of the algorithm, as it allows for quick elimination of candidate points that do not satisfy the segment test. To find the best decision tree for the given training data (the images
of the scene or environment on which the algorithm will be applied) Rosten et al. used the ID3 learning algorithm [9]. Recently, a faster and more robust method of feature detection based on the segment test, called AGAST (adaptive and generic accelerated segment test), was proposed in [8]. The main difference between AGAST and FAST lies in the method used to construct the decision tree. To reduce the occurrence of adjacent positive responses, non-maximal suppression is applied. As the segment test is a Boolean function, an additional measure is needed. The corner score function V, defined as the sum of absolute differences between the intensity of the central pixel and the intensity values of the pixels on the contiguous arc, is introduced. The corner score for all positive responses is computed and those with V lower than their neighbors are discarded. Let us denote the pixels brighter than Ip + t by Sbright, and the pixels darker than Ip − t by Sdark. The complete equation for the corner score is given in (1).

V = max( Σ_{x∈Sbright} (|Ip→x − Ip| − t), Σ_{x∈Sdark} (|Ip − Ip→x| − t) )   (1)
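For reference, a plain-C sketch of the segment test and of the corner score of Eq. (1) is given below. The circle offsets and the n = 9 contiguity check follow the published description of FAST; the function is a software illustration, not the parallel VHDL datapath described in Section 3.

/* Offsets of the 16-pixel Bresenham circle of radius 3, matching Fig. 1. */
static const int CX[16] = { 0, 1, 2, 3, 3, 3, 2, 1, 0,-1,-2,-3,-3,-3,-2,-1};
static const int CY[16] = {-3,-3,-2,-1, 0, 1, 2, 3, 3, 3, 2, 1, 0,-1,-2,-3};

/* Returns the corner score V of Eq. (1) if pixel (x, y) passes the n = 9
 * segment test with threshold t, and 0 otherwise.  img is 8-bit grayscale
 * with row stride `stride`; the caller must keep a 3-pixel border. */
int fast9_score(const unsigned char *img, int stride, int x, int y, int t)
{
    int ip = img[y * stride + x];
    int sum_b = 0, sum_d = 0;
    unsigned int is_bright = 0, is_dark = 0;

    for (int k = 0; k < 16; ++k) {
        int v  = img[(y + CY[k]) * stride + (x + CX[k])];
        int db = v - ip - t;               /* > 0 for 'bright' pixels */
        int dd = ip - v - t;               /* > 0 for 'dark' pixels   */
        if (db > 0) { is_bright |= 1u << k; sum_b += db; }
        if (dd > 0) { is_dark   |= 1u << k; sum_d += dd; }
    }

    /* contiguity: 9 consecutive positions set on the wrap-around circle */
    int ok = 0;
    for (int s = 0; s < 16 && !ok; ++s) {
        unsigned int run_b = (is_bright >> s) & 1u;
        unsigned int run_d = (is_dark   >> s) & 1u;
        for (int k = 1; k < 9; ++k) {
            int idx = (s + k) & 15;
            run_b &= (is_bright >> idx) & 1u;
            run_d &= (is_dark   >> idx) & 1u;
        }
        ok = run_b || run_d;
    }
    return ok ? (sum_b > sum_d ? sum_b : sum_d) : 0;
}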
The corner score is applied only to a fraction of the image points, namely those that successfully passed the segment test, so the further processing time can be kept low. As FPGAs are inherently well suited to perform parallel operations, our implementation of the algorithm does not rely on the decision tree. Instead, the full segment test and the computation of the corner score function are performed by a dedicated hardware processor, with extensive use of parallelization and pipelining.

2.2 SURF Feature Descriptor
The SURF feature detector and descriptor proposed recently in [2] is derived from the SIFT feature detection and description method [7]. According to the authors, SURF offers accuracy that is on par with or better than the accuracy of SIFT while being several times faster. Both SIFT and SURF use multiscale feature descriptors, which encode the distribution of pixel intensities in the neighborhood of the feature. This approach offers better performance than SAD, SSD or ZNCC methods, which use raw pixel intensities for matching. The SURF descriptor uses gradient responses computed in the x and y direction using Haar wavelets to describe the neighborhood of the feature. The use of Haar wavelets in conjunction with integral images results in a very efficient implementation, allowing for fast computation of the gradient response regardless of the size of the mask [6]. This allows the scale space to be constructed by resizing the masks instead of resizing the original image, as is the case with SIFT. Computation of the descriptor for a given feature is divided into two steps. In the first step, the dominant orientation is assigned to each feature to achieve rotation invariance. To this end, the Haar wavelet responses in the x and y directions are computed at each point located within the circle with the radius of 6s. The circle is centered at the interest point, and s denotes the scale at which the feature has been detected. The wavelet masks also need to be resized
to correspond with the scale. The length of their sides is thus set to 4s. All the resulting responses are then weighted with a Gaussian centered at the detected interest point, the standard deviation of the Gaussian being σ = 2s. The response at each of the points can be marked as a point in 2D vector space. The magnitude of the x-response is the value on the abscissa, and the magnitude of the y-response is the value on the ordinate. To assign the dominant orientation, a rotating circle sector covering an angle of π/3 rad is placed at the interest point. The responses (the x and y gradient values) in each segment are summed and form a resultant vector. The angle of the longest resultant vector is taken as the dominant orientation for a given interest point. The second step is the computation of the descriptor itself. The computation begins with placing a square window (side length of 20s), so that the center of the window is aligned with the interest point to be described. The angle of rotation (orientation) of the window is taken from the previous step. The window is divided into 4 × 4 square subregions for a total of 16 subregions. Inside each of the subregions, 5 × 5 regularly spaced sample points are selected – see Fig. 2 for an illustration.
Fig. 2. Structure of the window used for the computation of the SURF descriptor for a given feature (as used in SURF64)
Subsequently, Haar wavelet responses along two principal directions (denoted dx and dy) are computed for each sample point. The masks have a side length of 2s. The responses are then weighted with a Gaussian, centered at the interest point (σ = 3.3s), in order to increase the robustness to geometric deformations and localization errors. The complete descriptor for each subregion is formed by summing all the respective responses and their absolute values calculated at each one of the sample points (see equation 2).

DESC_sub = [Σ dx, Σ dy, Σ |dx|, Σ |dy|]   (2)

The inclusion of absolute values of the responses into the descriptor allows for encoding of more complex intensity patterns and increases the distinctiveness. As every subregion brings 4 elements to the descriptor vector and the number of subregions is 16, the overall descriptor length is 64. The name of the descriptor is hence SURF64. The invariance to contrast changes is achieved by normalization of the descriptor vector to unit length. The authors proposed also a simpler, reduced version of the descriptor [2]. In this variant, the square
window that is placed on the interest point is divided into 3 × 3 square subregions. This results in reduced dimensionality of the description vector – it contains 36 elements and is called SURF36. If rotational invariance is not required, the upright version of the descriptor, called USURF, can be used. Skipping the dominant orientation assignment step yields a further reduction of the time required for computation. The authors report that such a simplified version is robust to in-plane rotations in the range of ±15°. For hardware implementation, we have chosen to use USURF36. As shown in [12], the high-performance SURF feature descriptors can be paired with fast, single-scale feature detectors like Harris [4] or FAST [11] with good results in applications like mobile robot navigation or tracking. Additional simplifications were also introduced to the original algorithm. The first is the abandonment of Gaussian weighting of responses in the processed window and subregions. The second simplification was to switch to a fixed-point fractional representation of the description vector elements. The original SURF descriptor uses an array of 32-bit floating point numbers for feature description. Switching from floating-point to fixed-point arithmetic allows the complexity and resource cost of algorithm implementations on custom computing machines to be reduced [3]. The last change introduced to the SURF descriptor was aimed at reducing the complexity of the descriptor normalization. Scaling to a unit-length vector requires calculation of the square root and 36 or 64 multiplications. Instead, we have decided to normalize the results with respect to the maximum value in the descriptor vector. As we decided to use a fixed-point, fractional representation of the descriptor elements, integer dividers can be used to perform normalization. Integer dividers are relatively resource-efficient and can be implemented as systolic, pipelined structures. As shown in [5], the simplifications do not introduce any significant matching performance penalty when compared with the full SURF implementation in the aforementioned applications.
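A minimal C sketch of this modified normalization is shown below; the 10-bit fractional scale factor of 511 and the component ordering are assumptions made for the illustration and are not taken from the actual design.

/* desc[]: the 36 accumulated components, grouped per subregion as
 * [sum dx, sum dy, sum |dx|, sum |dy|]; out_q10[]: 10-bit fixed-point result. */
void normalize_descriptor(const int desc[36], int out_q10[36])
{
    long long maxval = 1;                       /* guard against division by 0 */
    for (int i = 0; i < 36; ++i)
        if ((i & 3) >= 2 && desc[i] > maxval)   /* only the |dx|, |dy| sums */
            maxval = desc[i];
    /* every component lies in [-maxval, maxval], so the quotient fits in a
     * signed 10-bit fraction after scaling by 511 (assumed format) */
    for (int i = 0; i < 36; ++i)
        out_q10[i] = (int)(((long long)desc[i] * 511) / maxval);
}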
3 Description of the Implemented Coprocessors

3.1 Feature Detection and Description Processor
The coprocessors were implemented using the VHDL hardware description language and are adapted for use with the Microblaze microprocessor from Xilinx. Microblaze is equipped with dedicated FSL (fast simplex link) interfaces that allow for fast, unidirectional communication of the microprocessor with custom hardware peripherals using dedicated instructions. The outline of the architecture of the feature detection and description coprocessor is given in Fig. 3. The architecture is composed of two main datapaths – one for feature detection, and one for computation of the local feature descriptor. The datapaths eventually meet in the block that is responsible for output data normalization and formatting. To process data in parallel, the designed architecture requires simultaneous access to all pixels under investigation (the 16 pixels placed on the Bresenham circle and the central pixel). This requires constant access to a 7 × 7
Fig. 3. Block diagram of the implemented feature detection and description coprocessor
processing window. To achieve this goal, 6 BlockRAM memories along with address generation logic were used as FIFO delay buffers. The FIFO depth is equal to the horizontal resolution of the image. An additional register file was used to store pixel intensity values in the investigated window. The intensity values of the pixels on the Bresenham circle and the central pixel are then passed to the thresholder module. The block diagram of the module is given in Fig. 4.
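The following C model mimics the line-buffer arrangement in software: six one-line FIFOs plus a register file expose a 7 × 7 window around the current pixel while the image is streamed in raster order. It is a behavioural illustration only (with an assumed maximum line width), not the BlockRAM-based hardware.

#include <string.h>

#define WIN      7
#define MAX_COLS 4096            /* assumed upper bound on image width */

/* Zero-initialise the structure (e.g. with memset) before use. */
typedef struct {
    int width, col, rows_done;
    unsigned char lines[WIN - 1][MAX_COLS];  /* six one-line FIFOs */
    unsigned char win[WIN][WIN];             /* current 7x7 window */
} LineBuf;

/* Push the next pixel (raster order); returns 1 when win[][] holds a valid
 * 7x7 neighbourhood whose bottom-right element is the pixel just pushed. */
int linebuf_push(LineBuf *lb, unsigned char px)
{
    int valid = lb->rows_done >= WIN - 1 && lb->col >= WIN - 1;

    /* shift the window one column to the left */
    for (int r = 0; r < WIN; ++r)
        memmove(lb->win[r], lb->win[r] + 1, WIN - 1);
    /* new rightmost column: the six buffered rows above, then the new pixel */
    for (int r = 0; r < WIN - 1; ++r)
        lb->win[r][WIN - 1] = lb->lines[r][lb->col];
    lb->win[WIN - 1][WIN - 1] = px;
    /* advance the line FIFOs at this column position */
    for (int r = 0; r < WIN - 2; ++r)
        lb->lines[r][lb->col] = lb->lines[r + 1][lb->col];
    lb->lines[WIN - 2][lb->col] = px;

    if (++lb->col == lb->width) { lb->col = 0; lb->rows_done++; }
    return valid;
}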
Fig. 4. Block diagram of the thresholder block
The thresholder module is composed of two groups of subtracters and multiplexers for performing the ’dark’ and the ’bright’ pixel test. Its function is to distinguish whether or not the pixels on the Bresenham circle have an intensity value greater than the center pixel intensity value plus the threshold (’bright’ pixels), or lower than the center pixel intensity value minus the threshold (’dark’ pixels). If this is the case, the result of the subtraction is positive and is passed on for further corner score calculation. If the result is negative, it is replaced by a zero value, so it does not contribute to the corner score function. Additionally, a respective bit in the appropriate binary output vector (is_bright or is_dark) is set. The results from this block are then passed to the contiguity tester and the corner score computation block. The contiguity tester checks if the is_bright and is_dark vectors satisfy the segment test criterion. The corner score blocks consist mainly of two adder trees. The subtraction results (2 groups of 16 values, one for the bright and one for the dark pixels) are added and the larger value is selected and passed to the output, as given in equation 1. The local corner score value and the output of the segment test are then passed to another set of FIFO delay buffers. This is done to assure parallel access to the 7 × 7 window for the non-maximum suppression block. The block consists of 48 parallel comparators. If the corner score value for the central pixel is the maximum in the currently processed window and the segment test criterion for this pixel is satisfied, the pixel is labeled as a feature and the corresponding line is set for one clock cycle. The line also serves as the write-enable line for the intermediate FIFO (see Fig. 3). The values of synchronized counters are also passed to the FIFO to give the information on the image coordinates of the detected feature. The datapath for the SURF descriptor calculation also begins with FIFO delay lines. The lines, along with the register file, form a 6 × 6 processing window. The
window corresponds to one subregion of the descriptor. This allows for parallel computation of the responses at each one of the 25 sample points. The corresponding response components (25 for each of dx, |dx|, dy, |dy|) are then added in four pipelined adder trees. The block diagram of the input stage of the SURF description coprocessor is given in Fig. 5. The four outputs from the adder trees
Fig. 5. Block diagram of the SURF descriptor coprocessor input block
form a complete descriptor for a single subregion. The resulting subregion descriptors are then passed to the FIFO delay buffers with a register file. This allows a complete descriptor to be formed for 9 subregions, even if the computations are in fact performed for only one subregion at a time. Such an approach results in significant savings of FPGA resources. The complete descriptor is passed to the intermediate FIFO. The maximum value selected from the |dx| and |dy| components is passed alongside for normalization. The SURF and FAST datapaths meet at the intermediate FIFO. The FIFO composes all the data that are then normalized and passed as the complete descriptor of a detected feature, with coordinates, to the microprocessor or other accompanying circuit. The FIFO, composed of dual-port BlockRAM memories, is used to allow for safe crossing of the clock domains, as the input SURF and FAST stages can be clocked at a different frequency than the output data formatting block. Using two different clock frequencies avoids the situation in which the input data is fed to the formatting block at a rate that it cannot handle. The block diagram of the output block of the coprocessor is given in Fig. 6. The state machine controls the data flow. The output data for each feature consists of 13 32-bit words. The first 12 words encode the fractional values (results of division by the maximum component) of the descriptor elements. Each of these words is comprised of three 10-bit values. The last word gives the information on the feature coordinates. Three parallel radix-2 integer dividers are used for normalization of the descriptor data.
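A possible software counterpart of this output format is sketched below; the exact bit positions within each word are assumptions chosen only to match the stated layout of twelve words with three 10-bit values each plus one coordinate word.

#include <stdint.h>

/* Pack 36 ten-bit descriptor values three per 32-bit word (12 words),
 * followed by one word with the (x, y) feature coordinates.  Bit positions
 * and the coordinate layout are assumed for this illustration. */
void pack_feature(const int q10[36], uint16_t x, uint16_t y, uint32_t out[13])
{
    for (int w = 0; w < 12; ++w) {
        uint32_t a = (uint32_t)q10[3 * w]     & 0x3FF;
        uint32_t b = (uint32_t)q10[3 * w + 1] & 0x3FF;
        uint32_t c = (uint32_t)q10[3 * w + 2] & 0x3FF;
        out[w] = a | (b << 10) | (c << 20);
    }
    out[12] = ((uint32_t)y << 16) | x;        /* assumed coordinate layout */
}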
3.2 Feature Matching Processor
The matching coprocessor is an IP core compatible with the FSL bus. It searches the dataset (consisting of the descriptors of all the features found in an image)
Fig. 6. Block diagram of the SURF descriptor coprocessor output block
for the element that is the closest match to a feature descriptor serving as the pattern. To this end, the SAD between the vectors of values (forming the descriptors) is computed. The descriptors consist of 36 10-bit signed values. The matching coprocessor is a scalable core, which allows a parallel search for one or more pattern descriptors to be performed simultaneously. The input data is given in the following order: the number of descriptors to be compared (equal to the number of features found in the image), one or more (depending on settings) pattern descriptors and a series of descriptors to compare with the patterns. A general outline of the matching coprocessor is given in Fig. 7. It consists of one FSL bus controller and one or more matching cores (the image shows a version with 4 matching cores). Compatibility with the FSL bus allows for easy connection with the MicroBlaze microprocessor and enables fast data transfers.
Fig. 7. Matching coprocessor schematic
The FSL controller is a state machine which reads data from the FSL bus, decodes the command (coded in the two most significant bits) and prepares the data for the matching cores. The schematic of a single matching core is presented in Fig. 8. The input data for a single core consists of the input data (360 bits wide), the number of vectors to be compared with the pattern (11 bits wide) and the flags indicating whether the input data is a pattern or the series of descriptors to compare with the pattern (2 x 1 bit). The matching coprocessor returns the following data: the smallest SAD between the pattern and the elements of the
Fig. 8. Matching core schematic
image descriptor dataset (17 bits wide), the index of the best matching dataset element (11 bits wide) and the flag indicating that the compare process is finished and the results are ready to be read. When both the pattern and the element of the dataset are sent to the matching core, the index of the currently processed element is propagated alongside. Each of the 36 values in the pattern is subtracted from the corresponding value of the dataset element. The absolute values of the results are summed in parallel. The final result is a 17-bit value of the sum of the absolute differences between the corresponding values in both of the descriptors. It is then compared to the currently lowest value. If the most recent SAD is lower than this value, it replaces the currently lowest value. The index of the currently processed dataset element is delayed in the FIFO queue to synchronize it with the corresponding SAD value. The matching coprocessor can include more than one matching core, and the cores are completely independent and can work in parallel. This is thanks to the FPGA's inherent capability of parallel processing. The number of matching cores is limited by the amount of available FPGA resources, and the achieved speedup is roughly proportional to the number of matching cores.
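A sequential C reference of what a single matching core computes may clarify the data flow (this is a sketch, not the hardware description): it returns the index of the dataset descriptor with the smallest SAD with respect to the pattern, together with that SAD value. The function name and types are illustrative; in hardware the 36 differences are formed and summed in parallel.

#include <stdint.h>
#include <stdlib.h>

int match_descriptor(const int16_t pattern[36],
                     const int16_t dataset[][36], int n,
                     uint32_t *best_sad)
{
    uint32_t best = UINT32_MAX;
    int best_idx = -1;
    for (int i = 0; i < n; i++) {
        uint32_t sad = 0;
        for (int k = 0; k < 36; k++)          /* summed in parallel in hardware */
            sad += (uint32_t)abs(pattern[k] - dataset[i][k]);
        if (sad < best) {                     /* keep the smallest SAD so far   */
            best = sad;
            best_idx = i;
        }
    }
    *best_sad = best;                         /* fits in 17 bits (36 x 1023)    */
    return best_idx;                          /* index of the best match        */
}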
4
Performance Evaluation
The correct operation of the coprocessors was verified by simulation using time-accurate models and tests with artificial images. Table 1 summarizes the resource
Table 1. Resource usage of the implemented design (designations: FFs – flip-flops, LUTs – lookup tables, BRAMs – block RAM memory blocks). The values in percent are given with respect to all corresponding resources available in the XC6SLX150T device.

                     BRAMs      FFs            LUTs           fclk (MHz)
FAST with SURF       55 (21%)   10835 (10%)    8956 (10%)     150
Matching (4 cores)   0 (0%)     6258 (6%)      9861 (11%)     148
Matching (1 core)    0 (0%)     2042 (2%)      2450 (3%)      151
usage for the feature detection and description coprocessor and two variants of the matching coprocessor – with one and with four matching cores. Tests performed with the system containing the MicroBlaze microprocessor and the coprocessors clocked at 50 MHz indicate that the system will be capable of performing the full feature detection and matching (with 4 matching cores) of VGA frames containing 500 features at a speed of 8 frames per second (FPS). However, the feature detection and description coprocessor is capable of working directly with the image data source (e.g. the CMOS sensor). In this case, the latencies caused by sending the image data to the coprocessor would be eliminated, increasing the speed to 11 FPS with a 50 MHz clock. Increasing the number of comparator cores in the matching processor to 8 increases the frame rate to 21 FPS. The matching coprocessor can easily be scaled to include even more matching cores without a dramatic increase in resource usage. Another way of achieving a speedup is to increase the frequency of the system clock (even up to 130-140 MHz – see Table 1). This results in a directly proportional performance gain. Higher frequencies can be achieved by using faster FPGAs, e.g. from the Virtex-6 line. The power consumption of the system at 50 MHz was about 8 W. The same algorithm, running on a 2.4 GHz Core 2 Duo processor (P8400), reaches a processing speed of 5 FPS. Comparing the implemented coprocessors with other existing hardware architectures is a difficult task. This is because we either chose to modify the existing algorithms in a way that in our opinion would be the best fit for the hardware implementation, or selected ones that are not popular (like the FAST corner detector). An additional difficulty is that comparing algorithm implementations based on different reconfigurable devices is not fair. This is especially true when considering older designs, because the field of reconfigurable logic evolves rapidly in terms of speed, amount of available resources, development tools etc. A side-by-side comparison of other solutions with ours is therefore practically impossible, and an in-depth analysis would be beyond the scope of this paper.
5
Conclusions
The article presents in detail the implementation of the feature detection, description and matching coprocessors. The complete system created with the coprocessors can fit in a single FPGA. The results obtained are satisfactory and promising. The availability of a relatively inexpensive, low-power, scalable,
small-footprint solution for feature detection and matching is desirable in many applications, such as unmanned aerial vehicles, mobile robots, driver assistance systems etc. Future work will focus on testing the implemented system. The results have already been tested for correctness, but more functional tests are needed to evaluate the usability of the system in real-life scenarios. The design is highly scalable and configurable, and can easily be adapted for connection with other functional blocks. Additionally, it shows potential for further performance gains, both by tailoring the architecture and by using faster FPGA devices. These opportunities will also be explored in the forthcoming research.
Fast Hough Transform on GPUs: Exploration of Algorithm Trade-Offs Gert-Jan van den Braak, Cedric Nugteren, Bart Mesman, and Henk Corporaal Dept. of Electrical Engineering, Electronic Systems Group Eindhoven University of Technology, The Netherlands {g.j.w.v.d.braak,c.nugteren,b.mesman,h.corporaal}@tue.nl
Abstract. The Hough transform is a commonly used algorithm to detect lines and other features in images. It is robust to noise and occlusion, but has a large computational cost. This paper introduces two new implementations of the Hough transform for lines on a GPU. One focuses on minimizing processing time, while the other has an input-data independent processing time. Our results show that optimizing the GPU code for speed can achieve a speed-up over naive GPU code of about 10×. The implementation which focuses on processing speed is the faster one for most images, but the implementation which achieves a constant processing time is quicker for about 20% of the images.
1
Introduction
Computer vision applications are used more and more in everyday life, for example in industrial applications like traffic surveillance [2], but also in consumer applications like augmented reality on mobile phones [11]. Detecting shapes like lines and circles is an important and often computationally intensive part of these computer vision applications. Since the end of 2006, with the release of “CUDA” by NVIDIA and “Close to Metal” by AMD, Graphical Processing Units (GPUs) have become more programmable and more usable for applications other than computer graphics. Since then, many computer vision applications have been implemented on GPUs [1]. The Hough transform is a popular technique to locate shapes in images. It is mostly used to find straight lines and circles in images, but it can also be used to detect arbitrary shapes. The Hough transform is a robust technique that works well even in the presence of noise and occlusion. It is used in many computer vision and image processing applications, like robot navigation [4], industrial inspection and object recognition [14]. A complete application for detecting shapes in images usually consists of several steps: a) edge detection; b) thresholding; c) voting in Hough space; d) Hough space post-processing; e) displaying detected lines. These steps are illustrated in Fig. 3. In this paper we will only focus on the third step: voting in the Hough space. The first step can be a convolution-based edge detection, e.g. Sobel edge detection, as can be found in the NVIDIA CUDA SDK. For the second step Otsu
thresholding can be used. Otsu thresholding makes a histogram of the edge image, processes the histogram and finds the best threshold value. An efficient GPU implementation of making a histogram can be found in [8]. In the Hough space post-processing stage the maximum in the Hough space is located. This maximum can be used to draw the most dominant line in the original image. This paper is organized as follows. First, two parameterizations for lines and their corresponding Hough transforms are presented in Section 2. In Section 3 the benchmark setup can be found, together with a brief description of GPU hardware and programming. The different GPU implementations of the Hough transform can be found in Section 4, and the results and evaluation can be found in Section 5. Related work is discussed in Section 6. Finally, conclusions and future work are presented in Section 7.
2
Hough Transform for Lines
The Hough transform for lines [6] is a voting procedure where each feature (edge) point in an image votes for all possible lines passing through that point. All votes are stored in the so-called Hough space, which is two-dimensional for the Hough transform for lines. The size of the Hough space is determined by the size of the input image and the required accuracy for the parameterization of the lines. Two different parameterizations for lines and their corresponding Hough transforms are described in this section.
2.1
Cartesian Hough Transform
A straight line can be described in a Cartesian coordinate system with a slope a and an intercept b with the vertical axis by the following equation:

y = ax + b    (1)
In the Hough transform, the straight line is not characterized by its image points (xi, yi), but instead in terms of its parameters a and b. Therefore Eq. 1 can be rewritten as:

b = yi − xi · a    (2)

For each image point (xi, yi) a line of votes is placed in the Hough space for a range of angles θ. Parameter a is calculated as a = tan(θ), and the corresponding values for b are calculated with Eq. 2. In Fig. 1(a) the two points (xp, yp) and (xq, yq) form a line. The two corresponding lines in the Hough space are shown in Fig. 1(b). At the intersection of these two lines the (best approximated) values for the parameters a and b can be found. The parameters can become infinite when the line is vertical. Therefore the Hough space is usually divided into two parts: one part for angles between −45° and 45°, which uses Eq. 2, and one part for angles between 45° and 135°, which uses Eq. 3.

b = xi − yi · a    (3)
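As a small illustration (not taken from the paper's code), the following C function casts the votes of a single edge pixel (xi, yi) in both Cartesian Hough spaces, using Eq. 2 for line angles between −45° and 45° and Eq. 3 for 45° to 135°; the 1-degree step, the flat array layout and the offset b_offset are assumptions of this sketch.

#include <math.h>

void vote_point(int xi, int yi, int *HS1, int *HS2, int b_range, int b_offset)
{
    for (int t = -45; t <= 45; t++) {
        double a  = tan(t * M_PI / 180.0);       /* slope a = tan(theta)        */
        int    b1 = (int)lround(yi - xi * a);    /* Eq. 2: b = y - x*a          */
        int    b2 = (int)lround(xi - yi * a);    /* Eq. 3: b = x - y*a          */
        HS1[(t + 45) * b_range + b1 + b_offset] += 1;  /* space for -45..45 deg */
        HS2[(t + 45) * b_range + b2 + b_offset] += 1;  /* space for 45..135 deg */
    }
}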
Fig. 1. (a) A line through two points in an image. (b) Corresponding two lines in Hough space.
2.2
Polar Hough Transform
In the polar representation a line is parameterized with ρ and θ [3], as shown in Fig. 2. Parameter ρ represents the distance between the line and the origin, and θ the angle of the vector from the origin to this closest point, as given by Eq. 4. Eq. 1 and Eq. 4 are related by Eq. 5.

ρ = x cos(θ) + y sin(θ)    (4)

a = −1 / tan(θ),    b = ρ / sin(θ)    (5)

In this polar parameterization the parameters ρ and θ are bounded. The angle θ ranges from 0° to 180° and the radius ρ ranges from 0 to √(W² + H²), where W and H are the width and height of the image, respectively.
Fig. 2. Polar representation of a line
3
Benchmark Setup
To measure the performance of the different parameterizations for lines in the Hough transform, a number of benchmarks are performed. The CPU used in these benchmarks is an Intel Core i7 930 with four cores running at 2.8 GHz. The CPU implementations use OpenMP to utilize all cores and calculate the Hough transform by iterating over all pixels in the binary input image. If a pixel value is equal to ‘1’, this pixel value is used in the voting process, otherwise it is discarded. In all implementations the trigonometric functions are pre-calculated and stored in an array. The GPU used in our setup is an NVIDIA GTX 470 with 448 CUDA cores running at 1.2 GHz and has 1280 MB of off-chip global memory. In NVIDIA’s latest architecture (Fermi) [9], 32 CUDA cores are grouped into a cluster. Each cluster has an on-chip shared memory (about 48 kB) and a cache.
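The CPU baseline described above can be sketched as follows (a simplified reference, not the authors' exact code): OpenMP parallelizes the loop over the image rows, the sine and cosine values are pre-calculated and stored in arrays, and each edge pixel votes in the polar Hough space according to Eq. 4. The atomic update is a simplification of this sketch; a real implementation may instead use per-thread Hough spaces, and votes falling outside the assumed ρ range are simply dropped here.

#include <math.h>
#include <string.h>

#define W      1920
#define H      1080
#define NTHETA 180                       /* 1-degree angle resolution          */

void hough_cpu(const unsigned char *bin, int *hough, int rho_max)
{
    static double cs[NTHETA], sn[NTHETA];
    for (int t = 0; t < NTHETA; t++) {   /* trigonometric tables, pre-calculated */
        cs[t] = cos(t * M_PI / 180.0);
        sn[t] = sin(t * M_PI / 180.0);
    }
    memset(hough, 0, sizeof(int) * NTHETA * rho_max);
    #pragma omp parallel for
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            if (!bin[y * W + x])         /* only edge pixels (value '1') vote  */
                continue;
            for (int t = 0; t < NTHETA; t++) {
                int rho = (int)lround(x * cs[t] + y * sn[t]);   /* Eq. 4       */
                if (rho >= 0 && rho < rho_max) {
                    #pragma omp atomic
                    hough[t * rho_max + rho]++;                 /* shared bins */
                }
            }
        }
    }
}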
Fig. 3. Test image with resulting line (red) after Hough transform (a); intermediate images after edge detection (b) and thresholding (c); final Hough spaces (d)
The code executed on a GPU is called a kernel. Kernels run on the GPU in thousands or even millions of threads. Each thread executes the same program, but not necessarily the same instruction at the same time. Threads are organized into thread blocks. All threads in a thread block are executed on the same processing cluster and can communicate via its shared memory. Threads within a thread block are arranged in warps of (at most) 32 threads, and each thread in a warp executes the same instruction at the same clock cycle [10]. All images used in the measurements in this paper are gray scale images and have a resolution of 1920 × 1080 pixels. The chosen resolution for the parameters in the Hough space is one degree for the angle parameter (a in Eq. 1 and θ in Eq. 4) and 1 pixel for the intercept parameter (b in Eq. 1 and ρ in Eq. 4). First Sobel edge detection and Otsu thresholding are applied to the images. An example test image can be found in Fig. 3. The number of edge pixels after thresholding in this image is 5.9%. Averaged over the 2550 unique pictures in the Nistér and Stewénius benchmark set [7], the number of edge pixels after applying Otsu thresholding is 9.6%. The distribution of the number of edge pixels in an image in this benchmark set is shown in Fig. 4.
Fig. 4. Distribution of the number of edge pixels in the images of the Nistér and Stewénius benchmark set [7]
4
GPU Implementations
The difference in execution time of the two parameterizations of the Hough transform described in Section 2 is measured by two implementations, on a CPU and on a GPU. The GPU versions are based on the fast GPU implementation as described in Section 4.2 below. A comparison between these CPU and GPU implementations is made in Section 5. Next to the versions for the different parameterizations, three implementations of the Cartesian parameterization of the Hough transform have been made to explore the trade-offs between execution speed, predictability and code complexity. First a very basic implementation is described, which is used as the reference for the other implementations. The second implementation focuses on processing speed, at the cost of more complex code and a higher memory utilization. The last implementation achieves a constant processing time, which makes its processing time independent of the input image. In all three GPU implementations the trigonometric functions are calculated jointly by all threads in a thread block and stored in an array in the on-chip shared memory.
4.1
GPU Implementation 1 - Basic
The first GPU implementation is based on the CPU implementation, without GPU-specific optimizations. The pseudo code for this implementation is given in Fig. 5. First the Hough space in off-chip global memory has to be reset to all zeros. Then a kernel is started with one thread for each pixel in the input image. If the value of the pixel is a ‘1’, the thread places a vote in both Hough spaces (the Hough transform in the Cartesian parameterization consists of two Hough spaces, see Section 2.1) for each possible angle. The Hough spaces are located in the global (off-chip) memory of the GPU and atomic operations have to be used for the voting process. Measurements (see Section 5) show that this implementation is just a bit slower than an optimized CPU implementation. The most time consuming step in this implementation is the atomic additions on the Hough spaces in global memory, since atomic operations which modify values in the same location and which are executed by threads in a warp are all serialized, and global memory latency is typically 400 - 800 clock cycles [10].
pixel_value = image[x,y]
if(pixel_value > threshold) {
  for i=0:N {
    a1 = A1[i]; b1 = y - x*a1    // tan() calculations are
    a2 = A2[i]; b2 = x - y*a2    // stored in arrays A1 and A2
    atomicAdd(HS1[(a1,b1)], 1)
    atomicAdd(HS2[(a2,b2)], 1)
  }
}
Fig. 5. Pseudo code for voting in the Hough space for a single pixel in the basic GPU implementation of the Cartesian Hough transform
4.2
GPU Implementation 2 - Fast
The second implementation focuses on processing speed. As mentioned in Section 3, less than 10% of the pixels are actually used in the voting process. This means that most of the threads in the previous solution are waiting for a few threads to finish. Therefore this second solution starts by making an array of all pixels that need to be processed. A second kernel processes this array to create the Hough space. This two-step process is illustrated in Fig. 6.
Fig. 6. Fast implementation of the Hough transform on GPU. Each thread block in the first kernel converts a part of the image to an array of pixel coordinates in the shared (on-chip) memory (a). The part of the array is added to the main array in global (off-chip) memory (b). In the second kernel the array of pixel coordinates is processed by a thread block to create one Hough line in each of the two Hough spaces in the shared memory (c). When the complete array of coordinates has been processed, the Hough line is copied to the corresponding Hough space in global memory (d).
Creating the Array. The creation of the array is inspired by the work in [8], where a histogram for each warp in a thread block is made. For the Hough transform an array of only the pixels which have to be used in the voting process is desired. To build this array in a parallel way on the GPU (step a in Fig. 6), small arrays are made at warp-level granularity. How an array per warp is made is summarized in pseudo-code in Fig. 7. Note that all threads in a warp execute the same instruction at the same time in parallel, but some threads may be disabled due to branching conditions.
1  pixel_value = image[x,y]
2  if(pixel_value > threshold) {
3    do {
4      index++
5      SMEM_index = index
6      SMEM_array[index] = (x,y)
7    } while(SMEM_array[index] != (x,y))
8  }
9  index = SMEM_index
Fig. 7. Building an array of coordinates of edge pixels in shared memory (SMEM) (step a in Fig. 6)
Each thread in a warp reads a pixel from the input image (line 1, Fig. 7). If the pixel value is larger than a threshold value (line 2), the pixel coordinates need to be added to the array. The index of where these coordinates are to be stored is increased by one (line 4), to ensure no previously stored coordinates are erased. The new index is also stored in the on-chip shared memory (line 5), so threads in the warp which do not have to store coordinates can update their index value after all coordinates in this iteration have been added to the array (line 9). Now each thread tries to write its coordinate pair (x, y) to the array in shared memory at location index (line 6). Only one thread will succeed (line 7), and the others have to retry to write to the next location in the array (lines 3-7). On average this loop has to be executed three times before all threads in the warp (32 threads in total) have added their coordinate pair (x, y) to the array, since only 10% of the pixels are above the threshold, as shown in Section 3. There is a trade-off in the number of pixels each thread has to process. More pixels per thread result in fewer arrays to combine later, but too many pixels per thread means that there are not enough active threads to keep the GPU fully utilized. Also, the maximum number of pixels in each small array is limited by the amount of shared memory available. Once all small arrays in the shared memory have been made, they have to be combined into one array in the off-chip global memory (step b in Fig. 6). First one thread in each thread block sums the lengths of all warp-arrays of the thread block. This sum is added by this single thread to the global length of all arrays by a global atomic operation. This operation returns the value of the global length before the sum was added. This global length value is then used to tell each warp at which index in the global array its warp-array can be stored. Voting in Hough Space. A second kernel is used to vote in the Hough space. Since atomic operations to the off-chip global memory are slow, the voting implementation is improved compared to the voting implementation in Section 4.1. A single thread block is used to create a single line (one value for the angle parameter) in each of the two Hough spaces simultaneously (step c in Fig. 6). The number of lines in the Hough space is determined by the required accuracy of the angle parameterization. This implies that the entire array will be read as many times as there are values for the angle parameter. Each Hough line is first put together in the shared memory, and later copied to the global memory to create the complete Hough space (step d in Fig. 6). This also removes the requirement that the Hough space in global memory has to be reset to zero, as was the case in the implementation in Section 4.1.
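For clarity, the structure of this fast scheme can be summarized by a sequential C sketch (the actual implementation consists of two CUDA kernels and is not reproduced here): the edge pixels are first compacted into a coordinate array, and the array is then traversed once per angle while one Hough line is accumulated in a small buffer before being written out. Only the first of the two Cartesian Hough spaces is shown, and the buffer size, names and layout are assumptions.

#include <string.h>
#include <math.h>

typedef struct { short x, y; } pt_t;

int compact_edges(const unsigned char *bin, int w, int h, pt_t *arr)
{
    int n = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (bin[y * w + x])
                arr[n++] = (pt_t){ (short)x, (short)y };   /* first kernel */
    return n;
}

void vote_fast(const pt_t *arr, int n, int *HS1, int b_range, int b_offset)
{
    for (int t = -45; t <= 45; t++) {          /* one "thread block" per angle  */
        int line[8192] = { 0 };                /* Hough line in "shared memory",
                                                  assumes b_range <= 8192        */
        double a = tan(t * M_PI / 180.0);
        for (int i = 0; i < n; i++) {          /* whole array re-read per angle */
            int b = (int)lround(arr[i].y - arr[i].x * a) + b_offset;
            if (b >= 0 && b < b_range)
                line[b]++;
        }
        memcpy(&HS1[(t + 45) * b_range], line, sizeof(int) * b_range);
    }
}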
4.3
GPU Implementation 3 - Constant
For the third implementation the relative number of pixels to be processed does not influence the processing time. A graphical representation of this implementation is shown in Fig. 8. In this implementation, all threads in a thread block will together copy a couple of lines of the input image to the on-chip shared memory (step b in
Fig. 8. Input-data independent implementation of the Hough transform on GPU. First the image is rotated by the first kernel (a). In a second kernel each thread block copies a part of the input image from the global (off-chip) memory to the shared (on-chip) memory (b). Then one Hough line is calculated in shared memory based on the part of the image in the shared memory (c). This line is stored in a sub-Hough space in global memory (d). This step is repeated to calculate the next Hough line, until one entire sub-Hough space is filled by each thread block. A third kernel sums all sub-Hough spaces together to make the final Hough space (e). The last two kernels are implemented in two versions, one for the original image in landscape orientation and one for the rotated image in portrait orientation.
Fig. 8). Then all threads read this part of the input image pixel by pixel, and together produce one Hough line (step c in Fig. 8). Here atomic operations are not required, since consecutive threads vote for consecutive bins in the shared memory (since consecutive threads process consecutive pixels). This is only true if threads are working on the same image line, as can be seen in Eq. 3. If threads work on different image lines, they vote for the same value of b and atomic operations would be required. So all threads in a thread block need to synchronize after processing an image line, to remove the need for atomic operations. This method is most efficient when the fewest synchronizations are required, i.e. when the width of the input line is as large as possible. After one line in the Hough space is created, it is written to the off-chip global memory (step d in Fig. 8) and the next Hough line is generated in the same way. After all lines are generated and copied to global memory, a second kernel combines all sub-Hough spaces of parts of the image into one Hough space of the entire image (step e in Fig. 8). To create the second Hough space, the image is first rotated in another kernel (step a in Fig. 8). This makes it possible to read the image in a coalesced manner and vote for consecutive bins, since consecutive pixels are read according to Eq. 2. Then the same algorithm is used as described above, but now there are more lines which are smaller (since the image now has a portrait orientation instead of a landscape orientation). This means that creating this second Hough space takes more time than creating the first Hough space.
This implementation is limited by the amount of on-chip shared memory in the GPU. To reduce the number of sub-Hough spaces, a thread block should process a part of the image that is as large as possible. Since after the thresholding stage the pixels can only have two values (0 or 1, below or above the threshold value), each pixel can be packed into a single bit. This means that more pixels can be stored in the shared memory (in comparison to the original approach where each pixel is stored in one byte), and the number of sub-Hough spaces (which have to be added later) is reduced. A second benefit is that the reading of the input image is faster, since the number of bytes required to read the complete image is reduced. Packing the image from bytes to bits can be done in the rotating stage (step a in Fig. 8), so it does not take much extra processing time (about 2% of the total processing time).
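The packing step itself is straightforward; a C sketch of packing one thresholded image row is given below (an illustration, not the actual kernel). Eight pixels share one byte, so eight times more of the image fits in the shared memory and eight times fewer bytes have to be read.

#include <stdint.h>

void pack_row(const unsigned char *bin_row, int w, uint8_t *packed_row)
{
    for (int x = 0; x < w; x += 8) {
        uint8_t byte = 0;
        for (int k = 0; k < 8 && x + k < w; k++)
            if (bin_row[x + k])                 /* pixel above threshold */
                byte |= (uint8_t)(1u << k);     /* one bit per pixel     */
        packed_row[x / 8] = byte;
    }
}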
5
Results
In this section the results of the CPU and GPU implementations for the two different parameterizations are discussed first. Then the results of the three different GPU implementations (basic, fast and constant) are discussed. All GPU timing measurements only include the execution time of the kernels; data transfer times to and from the GPU are not included, since the pre- and post-processing steps are also executed on the GPU. The performance of the (fast) GPU implementation of the different parameterizations can be found in Table 1. These results are compared to an optimized CPU implementation, where all four cores in the test system are used.

Table 1. Results of the two different parameterizations on the test image

Hough version   CPU time   GPU time   Speed-up
Cartesian       18.2 ms    2.6 ms     7.0×
Polar           23.3 ms    3.3 ms     7.0×
As can be seen in Table 1, both parameterizations have about the same performance, both on the CPU and on the GPU. The polar parameterization is a bit slower, since each vote by a pixel requires two (floating point) multiplications instead of one. The average speed-up of the GPU implementations over the CPU implementations is seven times. For all three implementations (basic, fast and constant) of the Cartesian parameterization as described in Section 4, a GPU implementation has been made. The results can be found in Table 2, which shows that the basic implementation is about 20% slower than the optimized CPU implementation. By optimizing the GPU code for speed, the performance can be increased by almost a factor of 10, but at the cost of more complex code. The code is not only more complex in the number of source lines of code, but also in how easily its parameters, like image size and Hough space size (which controls the quality of the result), can
be adjusted. The code for the constant implementation is even more complex. The input image size is fixed (only multiples of the current image size are easy to implement) to make all optimizations possible.

Table 2. Timing results of the three different GPU implementations on the test image. The results of the fast and constant implementations are split over the different kernels. The number of source lines of code only includes the GPU kernel code.

Implementation                        GPU time   Source lines of code
1. Basic                              21.6 ms    29
2. Fast                               2.6 ms     97
   a. Array building                  0.3 ms
   b. Voting in Hough space           2.3 ms
3. Constant                           10.6 ms    193
   a. Rotate and pack image           0.5 ms
   b. Voting in sub-Hough spaces 1    3.9 ms
   c. Summing sub-Hough spaces 1      0.4 ms
   d. Voting in sub-Hough spaces 2    4.9 ms
   e. Summing sub-Hough spaces 2      0.9 ms
The constant implementation, which takes bits as input for the pixels, executes in 10.6 ms, as shown in Table 2. Without packing each pixel into a single bit, but leaving it in a byte, the execution time increases to 18.6 ms. Packing the pixels from bytes to bits only takes 0.2 ms extra in the rotating stage, which makes it well worth the effort, although it also increases the program complexity. A trade-off can be made between accuracy and processing speed. By reducing the number of angles in the Hough space, the execution time is reduced, but so is the accuracy. The execution times of the fast and the constant implementation with a 3 degree and a 6 degree accuracy can be found in Fig. 9. This figure shows that the execution time of the fast implementation scales linearly with the number of edge pixels in the image. Above some value (about 16% of all pixels being edge pixels), the constant implementation is even faster than the fast implementation.
(a) Accuracy 3 degrees
(b) Accuracy 6 degrees
Fig. 9. Execution time of the Fast and Constant GPU implementation of the Hough Transform (HT) with 3 degrees and 6 degrees of accuracy
6
Related Work
An OpenGL implementation of the Hough transform on a GPU is presented in [5]. Unfortunately no performance measurements are given, but it is mentioned that an array of all edge pixels is made on the CPU. The circle Hough transform has been implemented on a GPU by [13], also in OpenGL. Both papers use the rendering functions of OpenGL to calculate the Hough space. With the availability of CUDA nowadays, using OpenGL to program GPUs for general purpose computations has fallen into disuse. One CUDA implementation of the Hough transform can be found in CuviLib [12], a proprietary computer vision library. It uses the polar representation of a line for the Hough transform. In addition to calculating the Hough space, it also finds the maxima in the Hough space at the same time.
7
Conclusion
In this paper we have introduced two new implementations of the Hough transform on a GPU, a fast version and an input-data independent version. We have shown that the parameterization (Cartesian or polar) used for lines in images does not influence the processing speed of the Hough transform significantly. Optimizing the GPU code for speed does result in a significant improvement. Another way to optimize the GPU code is to make it input-data independent. Our results show that the fast implementation is the quicker of the two for about 80% of the images. The input-data independent implementation has the same processing speed for every image, and is faster if the number of edge pixels exceeds a certain threshold (about 16% in our case). While the effort for making a basic GPU implementation is about an hour, creating the fast implementation can take a couple of days, and the constant implementation even weeks. The program code for the input-data independent implementation is so complex that it is very hard to make any changes to parameters like image size. The fast implementation does not suffer from this drawback. Therefore, and because it is the quicker solution for about 80% of the input images, it is advisable to select the fast implementation of the Hough transform in every case where the processing time does not have to be fixed. The input-data independent implementation shows that packing the input data from bytes to bits can result in a large speed-up of the application. The GPU used in this paper already supports packing standard data types (char, int, float) into vectors of two, three or four elements. Packing bytes into bits would make a good addition to this, at little extra hardware cost. Future work will include the Hough transform for circles. The corresponding Hough space is much larger than the Hough space for lines, since it has three dimensions instead of two. This will create a new trade-off between the fast and the input-data independent approach. For the input-data independent implementation the image no longer has to be rotated, which would save over half of the processing time in the Hough transform for lines. But the final Hough space is much larger when detecting circles, which will limit the number of sub-Hough spaces that can be generated and makes them more costly to add together.
References
1. Allusse, Y., Horain, P., Agarwal, A., Saipriyadarshan, C.: GpuCV: An Open-Source GPU-Accelerated Framework for Image Processing and Computer Vision. In: 16th ACM International Conf. on Multimedia, MM 2008, pp. 1089–1092. ACM, New York (2008)
2. Bramberger, M., Brunner, J., Rinner, B., Schwabach, H.: Real-Time Video Analysis on an Embedded Smart Camera for Traffic Surveillance. In: Proceedings of 10th IEEE Symposium on Real-Time and Embedded Technology and Applications, RTAS 2004, pp. 174–181 (2004)
3. Duda, R.O., Hart, P.E.: Use of the Hough Transformation to Detect Lines and Curves in Pictures. Commun. ACM 15 (January 1972)
4. Forsberg, J., Larsson, U., Wernersson, A.: Mobile Robot Navigation using the Range-Weighted Hough Transform. IEEE Robotics Automation Magazine (1995)
5. Fung, J., Mann, S.: OpenVIDIA: Parallel GPU Computer Vision. In: Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 849–852. ACM, New York (2005)
6. Hough, P.: Method and Means for Recognising Complex Patterns. US Patent No. 3,069,654 (1962)
7. Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 2161–2168 (June 2006)
8. Nugteren, C., van den Braak, G.J., Corporaal, H., Mesman, B.: High Performance Predictable Histogramming on GPUs: Exploring and Evaluating Algorithm Trade-offs. GPGPU 4 (2011)
9. NVIDIA Corporation: NVIDIA's Next Generation CUDA Compute Architecture: Fermi (2009)
10. NVIDIA Corporation: NVIDIA CUDA C Programming Guide - Version 3.1 (2010)
11. Takacs, G., et al.: Outdoors Augmented Reality on Mobile Phone using Loxel-Based Visual Feature Organization. In: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, MIR 2008, pp. 427–434. ACM, New York (2008)
12. TunaCode (Limited): Cuda Vision and Imaging Library, http://www.cuvilib.com/
13. Ujaldón, M., Ruiz, A., Guil, N.: On the computation of the Circle Hough Transform by a GPU rasterizer. Pattern Recognition Letters 29(3), 309–318 (2008)
14. Wang, Y., Shi, M., Wu, T.: A Method of Fast and Robust for Traffic Sign Recognition. In: Fifth International Conference on Image and Graphics, ICIG 2009 (2009)
Feasibility Analysis of Ultra High Frame Rate Visual Servoing on FPGA and SIMD Processor Yifan He, Zhenyu Ye, Dongrui She, Bart Mesman, and Henk Corporaal Eindhoven University of Technology, The Netherlands {y.he,z.ye,d.she,b.mesman,h.corporaal}@tue.nl
Abstract. Visual servoing has been proven to obtain better performance than mechanical encoders for position acquisition. However, the often computationally intensive vision algorithms and the ever growing demands for higher frame rates make its realization very challenging. This work performs a case study on a typical industrial application, organic light emitting diode (OLED) screen printing, and demonstrates the feasibility of achieving ultra high frame rate visual servoing applications on both field programmable gate array (FPGA) and single instruction multiple data (SIMD) processors. We optimize the existing vision processing algorithm and propose a scalable FPGA implementation, which processes a frame within 102 μs. Though a dedicated FPGA implementation is extremely efficient, a lack of flexibility and a considerable amount of implementation time are two of its clear drawbacks. As an alternative, we propose a reconfigurable wide SIMD processor, which balances efficiency, flexibility, and implementation effort. For input frames of 120 × 45 resolution, our SIMD processor can process a frame within 232 μs, sufficient to provide a throughput of 1000 fps with less than 1 ms latency for the whole visual servoing system. Compared to the reference realization on MicroBlaze, the proposed SIMD processor achieves a 21× performance improvement. Keywords: Visual Servoing, FPGA, Reconfiguration, Wide SIMD.
1
Introduction
Visual servoing applies image sensors instead of mechanical encoders for position acquisition. On one hand, it reduces the number and accuracy requirements of the encoders, and has been proven to obtain better performance than encoders in several applications, e.g., inkjet printing [2,9]. On the other hand, it also dramatically increases the computing workload due to the often computationally intensive vision algorithms. The ever growing demand for higher frame rates makes the realization of visual servoing systems even more challenging [3,4]. To address the issue of limited computing power, special purpose hardware is often used. Among these, the field programmable gate array (FPGA) is one of the most cost effective options. Though a dedicated FPGA implementation is usually extremely efficient, a lack of flexibility and a considerable amount of
implementation effort are two of its clear drawbacks. On the other hand, wide single instruction multiple data (SIMD) processors are very popular among the stream processors for vision/image processing [1,6,7], because (i) the massive number of processing elements (PEs) inside an SIMD processor potentially renders very high throughput; (ii) massive parallelism in streaming applications typically shows up as data-level parallelism (DLP), which is naturally supported by SIMD architectures; and (iii) SIMD is a low-power architecture, as it applies the same instructions to all PEs. The low-power feature is crucial to an embedded system. For a non-embedded system, this feature is still very important, as the design of the heat removal system can be greatly simplified. All these merits make the SIMD architecture a very interesting candidate for visual servoing. In order to demonstrate the feasibility of achieving ultra high frame rate visual servoing on both FPGA and SIMD processors, a typical industrial application, organic light emitting diode (OLED) screen printing, is analyzed in detail. Firstly, we improve the existing OLED center detection algorithm developed by Roel [9]. The proposed vision pipeline is not only more robust, but also more friendly to embedded processors and FPGA/ASIC realization. Only 32-bit fixed-point operations are used, while rendering sub-pixel accuracy. Moreover, the processing time is deterministic, which is crucial for latency-oriented applications. After developing the vision pipeline, we propose an FPGA implementation with a processing time of only 102 μs for an input image size of 120 × 45. However, realizing a dedicated visual servoing application implementation in FPGA requires a considerable amount of effort, and a tiny change in the algorithm may cause a re-design of the whole circuit. To balance efficiency, flexibility, and implementation effort, we also propose a highly-efficient SIMD architecture for visual servoing applications, which is based on our previous design [6]. The number of PEs in this proposed SIMD processor can be dynamically reconfigured to match the resolution of the input frame and/or the performance requirement of the application. For input frames of size 120 × 45, our SIMD processor can process a frame within 232 μs, sufficient to meet the throughput requirement of 1000 fps with a latency of less than 1 ms for the whole visual servoing system. Compared to the reference realization on MicroBlaze [11], the proposed SIMD processor achieves a 21× performance improvement. The remainder of this paper is organized as follows. In Section 2, we show the visual servoing setup and our proposed vision pipeline. In Section 3, we elaborate on the proposed FPGA implementation and performance analysis. The proposed reconfigurable wide SIMD processor architecture and the mapping of the vision pipeline are presented in detail in Section 4. Finally, we draw the conclusions of this work in Section 5.
2
Visual Servoing System and Algorithm Development
The experimental visual servoing setup, which is described in [2], is shown in Fig. 1(a). The camera and lights are fixed at the top of the setup. The OLED structures are mounted on the moving X-Y table, which moves on a 2D plane.
(a) Visual Servoing Setup
(b) System Architecture
Fig. 1. The Experimental Setup and System Architecture
Fig. 2. System Delay Breakdown (1000 fps)
The system architecture is described in Fig. 1(b). At each frame interval, the camera takes an image of the moving OLED structures. The image is then transferred to the vision processing platform through an Ethernet or Camera Link interface. In the vision processing step, which is the main focus of this paper, the vision processing platform processes the input image and localizes the centers of the OLED structures. The data acquisition is realized by using EtherCAT, where DAC, I/O, and ADC modules are installed to drive the current amplifiers of the motors. Based on the relative positions of the detected OLED centers, the X-Y table is then driven to a proper position. In visual servoing systems, encoders are typically sampled at 1 kHz [10]. To ensure stability, we set the same sample rate (1000 fps) as the basic requirement for our visual servoing system. The timing breakdown of the complete visual servoing system is shown in Fig. 2, which consists of four components: (i) exposure of the image sensor; (ii) data readout from the image sensor; (iii) vision pipeline computing; and (iv) the control algorithm. The required exposure time is measured on a real setup with the OLED structure. The exposure time depends on the lighting condition and the type of surface of the plate. It can vary from 10 μs for paper [2] to 400 μs for an OLED wafer [9]. The image readout time is measured on the CameraLink interface. The control algorithm takes a relatively small amount of time, which is common in industrial applications. To reduce the delay of the system, only the Region Of Interest (ROI) of the image taken by the camera is read out and processed. A typical ROI size for our OLED substrate localization application is 120 × 45 pixels or 160 × 55 pixels. The exposure time and image readout time are deterministic for a specific ROI
size, lighting source, and camera interface. The timing of the control part is also deterministic given a specific mechanical setup. Therefore, the major source of reduction in the delay can only come from the vision processing component. In order to achieve 1000 fps throughput as well as 1 ms latency, the timing budget remaining for vision processing is only about 350 μs. In order to meet this tight budget, we first optimize the existing vision pipeline, which is based on a previous PC-based implementation [9]. This PC-based implementation is suboptimal for FPGA/ASIC and embedded processor realization. On one hand, the use of a contour tracing algorithm in the PC-based vision pipeline has several drawbacks: (i) the processing time is not deterministic; (ii) it is less robust due to a higher false detection rate, e.g., it fails to detect the OLEDs with scratches on them in the bottom image of Fig. 3(a); and (iii) it is less efficient on parallel architectures. By utilizing the characteristics of the repetitive structures, this paper proposes an erosion-projection method (Fig. 4) to replace contour tracing and solve the aforementioned issues. The input of the pipeline is a region of interest (ROI), and the outputs of the pipeline are the coordinates of the centers of the OLED structures. The input image is binarized with the optimum threshold calculated by the OTSU algorithm [8]. After binarization, noise and unrelated patterns are removed through the erosion step, leaving only the dominant structures (i.e., OLEDs). The number of erosion iterations is determined by the
(a) OLED Structures
(b) Detected Centers
Fig. 3. OLED Structures and Detected Centers
Fig. 4. Proposed Vision Pipeline for OLED Center Detection
feature to be detected, the size of the unrelated patterns, and the quality of the picture. In our case, two iterations are applied to a 120 × 45 input frame. The reduction of the segmented OLED structures into two vectors is performed by horizontal and vertical projection. The rough centers of the OLED structures are found by searching the two vectors obtained from the projections. The accurate OLED centers are finally located by the weighted center-of-gravity inside each bounding box (Fig. 3(b)). Every stage of this new pipeline has a deterministic execution time, which leads to determinism of the complete vision pipeline. On the other hand, floating-point operations are usually too costly for embedded processor and dedicated hardware realization. Therefore, a floating-point to fixed-point transformation is applied carefully. The new algorithm only uses 32-bit fixed-point operations, yet still renders sub-pixel accuracy. This new approach reduces the complexity of the algorithm, provides improved robustness, and is also more friendly to embedded processors and dedicated hardware realization.
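To make the projection and center-of-gravity steps concrete, a plain C sketch is given below (it is not the FPGA or SIMD code of the paper). The bounding boxes are assumed to be provided by the find-rough-center search over the two projection vectors, and the Q16 fixed-point output format is an assumption; the sketch also uses 64-bit intermediates for simplicity, whereas the paper restricts itself to 32-bit operations.

#include <stdint.h>

void project(const uint8_t *eroded, int w, int h,
             uint32_t *col_proj, uint32_t *row_proj)
{
    for (int x = 0; x < w; x++) col_proj[x] = 0;
    for (int y = 0; y < h; y++) row_proj[y] = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (eroded[y * w + x]) {        /* reduce the image to two vectors */
                col_proj[x]++;
                row_proj[y]++;
            }
}

/* weighted center of gravity of the grey-level image inside one bounding
   box, returned with sub-pixel accuracy as Q16 fixed point */
void weighted_cog(const uint8_t *grey, int w,
                  int x0, int y0, int x1, int y1,
                  int32_t *cx_q16, int32_t *cy_q16)
{
    int64_t sum = 0, sx = 0, sy = 0;
    for (int y = y0; y <= y1; y++)
        for (int x = x0; x <= x1; x++) {
            int v = grey[y * w + x];
            sum += v;
            sx  += (int64_t)v * x;
            sy  += (int64_t)v * y;
        }
    if (sum > 0) {
        *cx_q16 = (int32_t)((sx << 16) / sum);   /* sub-pixel x coordinate */
        *cy_q16 = (int32_t)((sy << 16) / sum);   /* sub-pixel y coordinate */
    } else {
        *cx_q16 = *cy_q16 = -1;                  /* empty bounding box     */
    }
}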
3
Feasibility Analysis of FPGA Realization
Before implementing the vision pipeline on FPGA, we first analyze the sequential reference implementation on MicroBlaze. MicroBlaze is a simple soft-core with a RISC instruction set [11]. It is chosen because it represents the typical general purpose processors (GPP) in embedded systems. Also, MicroBlaze is configurable and easy to verify on FPGA. Although its performance is relatively low, it can still serve as a good reference for a dedicated FPGA implementation. The MicroBlaze used here has the following configuration:
– 5-stage pipeline.
– 32-bit multiplier and 32-bit divider.
– All instructions and data are in local memory with one-cycle latency.
For a 120 × 45 resolution input frame, the execution time of the MicroBlaze implementation is almost 5 ms at 125 MHz, which is over 14 times the 350 μs
Fig. 5. Cycle Breakdown on MicroBlaze (4.92 ms/frame for image size of 120 × 45)
Fig. 6. Block Diagram of the vision pipeline implemented on FPGA
budget available for vision processing. Fig. 5 shows the cycle count breakdown of the vision pipeline. To address the issue of the limited computing power of the soft-core on FPGA, we propose a dedicated implementation (Fig. 6). Each stage of the vision pipeline is realized with dedicated modules, which run as a synchronous systolic array. Frames are streamed through the pipeline, and intermediate values between the stages are buffered in Block Random Access Memory (BRAM). Table 1 shows the detailed timing breakdown for an image of size 120 × 45. This implementation can run at 160 MHz on a Virtex II-Pro FPGA (XC2VP30). The resource utilization is less than 26%. We can see that the proposed FPGA implementation achieves a speed-up of 48× (compared to the reference MicroBlaze implementation), resulting in an execution time of 102 μs, which is far below the 350 μs budget. Since our FPGA implementation is a parameterized design, it is easy to adapt it to input images of different sizes too. Table 2 presents the performance at different resolutions, where w is the image width and h is the image height. It shows that the proposed design has very good scalability. We conclude that FPGA is a feasible choice to achieve ultra high frame rate visual servoing. The vision pipeline can be further accelerated by utilizing more FPGA hardware resources. However, since the vision pipeline is no longer a bottleneck, further acceleration has diminishing returns in reducing the delay.
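For reference, the two OTSU kernels in Table 1 (the histogram with its cumulative sums, and the search for the maximum between-class variance σB²) together compute the standard Otsu threshold, sketched below in plain C. The sketch uses floating point for brevity, whereas the implementations in this paper use fixed-point arithmetic.

#include <stdint.h>

int otsu_threshold(const uint8_t *img, int n)
{
    uint32_t hist[256] = { 0 };
    for (int i = 0; i < n; i++) hist[img[i]]++;          /* histogram         */

    double total_sum = 0.0;
    for (int v = 0; v < 256; v++) total_sum += (double)v * hist[v];

    double w0 = 0.0, sum0 = 0.0, best = -1.0;
    int best_t = 0;
    for (int t = 0; t < 256; t++) {                      /* cumulative sums   */
        w0   += hist[t];
        sum0 += (double)t * hist[t];
        double w1 = (double)n - w0;
        if (w0 == 0.0 || w1 == 0.0) continue;
        double mu0 = sum0 / w0;                          /* class means       */
        double mu1 = (total_sum - sum0) / w1;
        double sigma_b2 = w0 * w1 * (mu0 - mu1) * (mu0 - mu1);
        if (sigma_b2 > best) { best = sigma_b2; best_t = t; }
    }
    return best_t;                                       /* optimum threshold */
}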
4
Feasibility Analysis of SIMD Realization
We have shown in Section 3 that a dedicated FPGA implementation is very suitable for ultra high frame rate visual servoing. However, a lack of flexibility and a considerable amount of implementation effort are two of its clear drawbacks. As an alternative, we explore the feasibility of realizing ultra high frame rate visual servoing on a single instruction multiple data (SIMD) processor. Wide SIMD processors are very popular among the stream processors for vision/image processing [1,6,7]. The massive number of processing elements (PEs) in such a processor potentially renders very high throughput, and massive parallelism in streaming applications typically shows up as data-level parallelism (DLP), which
Table 1. Cycle Breakdown of FPGA Implementation (image size of 120 × 45)

Kernel                       MicroBlaze (125 MHz)   Proposed FPGA (160 MHz)   Speed-up
Initialize                   2819                   256                       11.01×
OTSU: Hist. & CH/CIA         74797                  5659                      13.22×
OTSU: Max. σB²               19936                  312                       63.90×
Binarization                 70201                  5401                      13.00×
Erosion                      284819                 97                        2936×
Find-Rough-Center            78832                  170                       463.72×
Weighted Center of Gravity   83790                  4501                      18.62×
Total Cycles                 615194                 16396                     37.5×
Time                         4.92 ms                102 μs                    48.2×
Table 2. Performance Scalability on the Proposed FPGA Implementation

Kernel                       Complexity   120 × 45   160 × 55
Initialize                   O(1)         256        256
OTSU: Hist. & CH/CIA         O(wh)        5659       9059
OTSU: Max. σB²               O(1)         312        312
Binarization                 O(wh)        5401       8801
Erosion                      O(h)         97         117
Find-Rough-Center            O(w+h)       170        220
Weighted Center of Gravity   O(wh)        4501       7335
Total Cycles                              16396      26100
Time                                      102 μs     163 μs
is naturally supported by SIMD architectures. Moreover, SIMD is a low-power architecture, which is crucial to embedded systems. All these merits make the SIMD architecture a very interesting candidate for visual servoing. If we look at the cycle breakdown on MicroBlaze (Fig. 5) in more detail, we find that over 95% of the total execution time is spent on five kernels: histogram + Cumulative Histogram and Cumulative Intensive Area (CH/CIA), binarization, erosion, find-rough-center, and weighted center-of-gravity. The computational parts of these kernels are mostly pixel-wise operations with few dependencies (thus DLP), which makes them very suitable for SIMD processing.
4.1
Proposed Reconfigurable Wide SIMD Architecture
The proposed reconfigurable wide SIMD processor for visual servoing is based on Xetal-Pro, our ultra low-energy and high-throughput SIMD processor [6]. Fig. 7 presents its block diagram. The control processor (CP) is a 32-bit MIPS-like processor, equipped with a 32-bit 1-cycle multiplier and a 32-bit 16-cycle pipelined divider. The main task of the CP is to control the program flow, to handle interrupts, to configure other blocks, and to communicate with the outside world. The processing elements (PEs) and their corresponding scratchpad memory (SM) and frame memory (FM) banks are partitioned into tiles. Each tile consists of 8 PEs. This is based on the reconfiguration granularity requirement
Fig. 7. Block Diagram of the Reconfigurable Wide SIMD Processor
as well as the layout constraints. In the current implementation, there are 320 PEs in total (40 tiles). The 128-bit × 1024 pseudo-dual-port SRAM per PE constitutes the frame memory (FM). The relatively large capacity of the FM allows on-chip storage of multiple VGA frames or images with a higher resolution. The SM is a 32-entry scratchpad memory used to exploit the often available data locality and to reduce the energy consumption of accessing the large FM. The communication network between the PEs and SMs enables each PE to directly access data from its left and right neighbors. The whole system runs at 125 MHz with a 1.2 V supply voltage in TSMC 65 nm technology, and offers a peak throughput of 80 GOPS (counting multiply and add operations only). With 320 PEs, the proposed SIMD processor is able to provide a tremendous amount of processing power. However, in practice, applications may not require or cannot fully utilize its entire capability. For example, in our OLED substrate localization application, two typical image resolutions are 120 × 45 pixels and 160 × 55 pixels, which lead to natural vector sizes of 120 and 160, respectively. Thus, only 120 PEs (15 tiles) or 160 PEs (20 tiles) are required in each case. The processor must be configured in such a way that the number of active PEs and the corresponding communication network meet this requirement. Another important motivation for a reconfigurable SIMD processor is the power consumption. When not all PEs are required, the unused PEs can be fully shut down to save power. The number of active tiles of the proposed SIMD processor is dynamically configurable to meet various vector lengths or performance requirements. In order to enable this feature, two types of tiles are designed, which differ only slightly. The Basic Tile (Fig. 8(a)) composes the minimal system (an 8-PE SIMD) when MUX0 is configured to choose only among an immediate number (imm), the constant ‘0’, and data read from PE7's own scratchpad memory. Augmented Tiles (Fig. 8(b)) can be enabled/disabled according to the application. To configure an SIMD processor with M+1 tiles, MUX0 ∼ MUXM−1 are fixed to their right neighbor (i.e., data from the next PE0's SM) and MUXM has the freedom to choose among the three inputs other than its right neighbor. The configuration is done by setting the control registers (CTRL0 ∼ CTRLM) via the CP at run time. Each PE has a two-stage pipeline and shares the instruction fetch and decode stage of the CP. 16-bit ADD/SUB, MUL, MAC, and logical operations are supported. All instructions are executed in a single cycle. The global sum of the
Fig. 8. (a) Basic Tile; (b) Augmented Tile
ACCU registers (one per PE) is calculated by an adder tree. The latency of the adder tree is three cycles. Another main difference between the proposed SIMD and our previous Xetal-Pro is that local indirect addressing is supported with the local address generator. It has been shown that local indirect addressing can significantly improve the performance of some applications (e.g., Histogram, Hough Transform) in a massively-parallel SIMD processor [5].
Visual Servoing on the Proposed Wide SIMD Processor
Based on the analysis of the kernels in the vision pipeline, the parts that can benefit from vectorization are identified. With this information, the kernels of the vision pipeline are partitioned into (i) the vector part, which is executed on the PE array of the proposed SIMD processor; and (ii) the scalar part, which is processed on the CP of the proposed SIMD processor. There are two main sources of inefficiency in the sequential implementation. Firstly, the fundamental limitation in operation throughput makes it impossible to achieve high performance. Secondly, the overheads such as loop control and address calculation greatly reduce the effective computational throughput. On the proposed architecture, the PE array provides a peak throughput of 2 × num of P Es operations per cycle if MAC instruction is utilized. In addition, the concurrent execution of the CP and PE array exploits the instruction-level parallelism (ILP) and the overhead is reduced by overlapping the execution of control and computation operations. The distribution of kernels between PE array and CP is shown in Fig. 9(a). With this mapping, the data communication between CP and PE array is minimized.
632
Y. He et al.
,
!"#$%%
"&'(# )*%+ -% $%! %%./0 1
+ 23 4
(a)
!
(b)
Fig. 9. (a) Kernel Assignment; (b) Frame Memory Mapping
Algorithm 1. Erosion 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Input : w × h binary image at i base in FM Output : w × h binary image at i base in FM temp base ← address for a two-entry buffer in FM on PE 0 to w − 1 do for i ← 1 to h − 1 do CP: temp ← temp base + (i mod 2) mem[i base + i − 2] ← mem[temp] accu ← mem[i base + i] accu ← accu + mem[i base + i − 1] accu ← accu + mem[i base + i + 1] accu ← accu + l mem[i base + i] accu ← accu + r mem[i base + i] f lag ← accu < 5 mem[temp] ← f lag?0 : 1 end mem[i base + h − 2] ← mem[temp] CP: temp ← 1 − temp mem[i base + h − 3] ← mem[temp] end
To process an image of size w × h, where w is the image width and h is the image height, w PEs are required (thus w/8 tiles are enabled in our proposed SIMD processor), which indicates that the PE array can provide a peak throughput of 2w operations per cycle. When a frame is captured and ready for processing, the processor uses the input shift register to get the frame to the frame memory (FM) line by line. Each PE processes one column of the frame. The unused tiles are shut down to save power. Each PE needs 2h + 512 FM entries to process a column (h entries for the space of the grey-level input image, h entries for the binary image, 256 entries for the shared space of the distributed histogram and CH, and 256 entries for the space of distributed CIA). The typical frame size in our case is 120×45 or 160 × 55, so the capacity of the FM is sufficient to process the whole frame. The memory mapping of the frame memory for a 120 × 45 resolution input is shown in Fig. 9(b). Algorithm 1 (pseudo code) gives an example of how to program the proposed SIMD processor. The erosion step in Fig. 4 uses a cross kernel. To calculate one output pixel, a PE needs to get the four neighboring pixels, two of which are
Feasibility Analysis of Ultra High Frame Rate Visual Servoing on FPGA
633
Table 3. Cycle Breakdown of SIMD Implementation (image size of 120 × 45) MicroBlaze Proposed SIMD Speed-up (125 MHz) (125 MHz) Initialize 2819 1280 2.20× OTSU: Hist. & CH/CIA 74797 4047 18.5× 2 OTSU: Max. σB 19936 15840 1.26× Binarization 70201 225 312× Erosion 284819 1038 274× Find-Rough-Center 78832 4601 17.1× Weighted Center of Gravity 83790 1971 42.5× Total Cycles 615194 29002 21.2× Time 4.92 ms 232 μs 21.2× Kernel
Table 4. Performance Scalability on the Proposed SIMD Implementation Kernel
120 × 45 160 × 55
Initialize 1280 OTSU: Hist. & CH/CIA 4047 2 OTSU: Max. σB 15840 Binarization 225 Erosion 1038 Find-Rough-Center 4601 Weighted Center of Gravity 1971 Total Cycles Time
29002 232 μs
1280 O(1) 4097 O(h) 15840 O(1) 275 O(h) 1278 O(h) 5770 O(w+h) 2295 O(h) 30835 247 μs
located in the neighborhood PE s’ FM. These two pixels can be accessed using the neighborhood communication in the proposed processor. For OLED center detection, erosion is called twice. As indicated by Fig. 5, the erosion kernel is the most time consuming part in the MicroBlaze implementation. Table 3 shows that after vectorization, speedup of 274× is achieved for a 120 × 45 resolution input. And we can also see that it is no longer a bottleneck on the proposed SIMD processor implementation. The performance of the complete implementation for a 120 × 45 resolution input is shown in Table 3. The wide SIMD implementation is able to achieve a speed-up of 21× (comparing to the reference MicroBlaze implementation), resulting in an execution time of 232 μs, which is well below the 350 μs budget 2 for vision processing. In contrast to Fig. 5, finding max σB is now most time consuming, because it is sequential and can only be done on the CP. Table 4 shows the results of input images with different sizes. We can see that the wide SIMD implementation has even better scalability than the dedicated FPGA implementation. The result also shows that it is feasible to achieve > 1000 fps and < 1 ms latency visual servoing on the proposed wide SIMD processor.
5
Conclusions and Future Work
This work performed a detailed analysis of achieving ultra high frame rate visual servoing on both FPGA and SIMD processor. A typical industrial application,
634
Y. He et al.
organic light emitting diode (OLED) screen printing, was chosen in our analysis. We optimized the existing vision pipeline for this application so that it is more robust and more friendly for hardware implementation. Through a proposed FPGA implementation, we shown that it is very efficient and feasible to achieve ultra high frame rate visual servoing on FPGA. However, a dedicated FPGA implementation is usually lack of flexibility, and requires considerable amount of implementation effort. As an alternative, we also explored the feasibility analysis on the popular SIMD processor. The result shows that our proposed SIMD processor is very suitable for ultra high frame rate visual servoing. It achieved a proper balance among efficiency, flexibility, and implementation effort. Compared to the reference realization on MicroBlaze, a 21× reduction on the processing time is gained, which greatly enables the performance improvement for visual servoing applications. For the future work, we would like to measure and compare the energy consumption in detail. We would also like to enable the fault-tolerance features of our SIMD processor to deal with the increasingly severe manufacturing variability issue.
References 1. Abbo, A., et al.: Xetal-II: a 107 GOPS, 600 mW massively parallel processor for video scene analysis. IEEE Journal of Solid-State Circuits 43(1), 192–201 (2008) 2. de Best, J., et al.: Direct dynamic visual servoing at 1 khz by using the product as 1.5d encoder. In: ICCA 2009, pp. 361–366 (December 2009) 3. Furukawa, N., et al.: Dynamic regrasping using a high-speed multifingered hand and a high-speed vision system. In: Proceedings of IEEE International Conference on Robotics and Automation, pp. 181–187 (2006) 4. Ginhoux, R., et al.: Beating heart tracking in robotic surgery using 500 Hz visual servoing, model predictive control and an adaptive observer. In: Proceedings of IEEE International Conference on Robotics and Automation, pp. 274–279 (2004) 5. He, Y., Zivkovic, Z., Kleihorst, R.P., Danilin, A., Corporaal, H., Mesman, B.: RealTime Hough Transform on 1-D SIMD Processors: Implementation and Architecture Exploration. In: Blanc-Talon, J., Bourennane, S., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2008. LNCS, vol. 5259, pp. 254–265. Springer, Heidelberg (2008) 6. He, Y., et al.: Xetal-Pro: An Ultra-Low Energy and High Throughput SIMD Processor. In: Proceedings of the 47th Annual Design Automation Conference (2010) 7. Kyo, S., et al.: IMAPCAR: A 100 GOPS In-Vehicle Vision Processor Based on 128 Ring Connected Four-Way VLIW Processing Elements. Journal of Signal Processing Systems, 1–12 (2008) 8. Otsu, N.: A threshold selection method from gray-level histograms. Automatica 11, 285–296 (1975) 9. Pieters, R., Jonker, P., Nijmeijer, H.: Real-Time Center Detection of an OLED Structure. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2009. LNCS, vol. 5807, pp. 400–409. Springer, Heidelberg (2009) 10. Pieters, R., et al.: High performance visual servoing for controlled m-positioning. In: WCICA, pp. 379–384 (2010) 11. Xilinx, Inc., http://www.xilinx.com/tools/microblaze.htm
Calibration and Reconstruction Algorithms for a Handheld 3D Laser Scanner Denis Lamovsky and Aless Lasaruk FORWISS, Universit¨ at Passau, 94030 Passau, Germany [email protected], [email protected]
Abstract. We develop a calibration algorithm and a three-dimensional reconstruction algorithm for a handheld 3D laser scanner. Our laser scanner consists of a color camera and a line laser oriented in a fixed relation to each other. Besides the three-dimensional coordinates of the observed object our reconstruction algorithm returns a comprehensive measure of uncertainty for the reconstructed points. Our methods are computationally efficient and precise. We experimentally evaluate the applicability of our methods on several practical examples. In particular, for a calibrated sensor setup we can estimate for each pixel a human-interpretable upper bound for the reconstruction quality. This determines a “working area” in the image of the camera where the pixels have a reasonable accuracy. This helps to remove outliers and to increase the computational speed of our implementation.
1
Introduction
Active projection methods constitute a simple and computationally efficient way to obtain the three-dimensional structure of real world objects exposed to a video camera. Besides the methods based on stripe projection laser-based projection methods became popular in the last decades [1–3]. Here, the object of interest is illuminated with single points or lines produced by a color laser. From the mutual displacement of the points or from the deformation of the laser lines in the image of the camera the shape of the object can be reconstructed. The popularity of laser-based methods is motivated by relatively low costs for the laser component compared to the costly stripe projection equipment. A major disadvantage of laser-based projection methods is the sparsity of the reconstructed surface data. The three-dimensional coordinates are available only for a small set of points in the image of the camera. A natural way to compensate for this sparsity is to move the main sensor components, the laser and the camera, to obtain structure redundancy from motion. There exist solutions based on moving the laser but letting the camera position with respect to the object fixed [1, 4]. These methods require the relation between the camera and the object to be estimated by means of calibration to obtain the three-dimensional points. A more flexible solution to the sparsity problem is to fix the relation between the laser and the camera. This makes one sensor from the two previously spatially independent sensor components. We call such a sensor a handheld 3D J. Blanc-Talon et al. (Eds.): ACIVS 2011, LNCS 6915, pp. 635–646, 2011. c Springer-Verlag Berlin Heidelberg 2011
636
D. Lamovsky and A. Lasaruk
Fig. 1. Our prototype of a handheld 3D laser scanner
laser scanner. Once the relation between the camera and the laser is computed, the laser scanner is able to reconstruct the three-dimensional shape of the observed surface independently of the exact spatial relation between the laser scanner and the real world object. The price to pay for the obtained flexibility origins from the ill posed problem of determining the movement between the sensor and the observed object to refine the reconstructed surface. Absence of sufficiently stable algorithms for motion recovery from video sequences is probably the reason, why there are still no commercial multipurpose handheld laser scanners available on the market. Instead of this, commercial solutions use guide arms, markers, or magnetic sensors to compute the movement of the sensor head relatively to the object. This increases the costs of the resulting sensor system by magnitudes. The state-of-the-art laser scanner solutions mostly discard the question of accuracy and reliability of the calibration and spatial reconstruction results. At best, the properties of the reconstructed surface are compared with known geometry [4, 5]. This incautious approach discounts the systematic errors inherent to computing with measurements corrupted by noise. Furthermore, it does not help to estimate the quality of new sensor measurements at run-time, since this reconstruction quality highly depends on the precision of the detected laser points in the image of the camera and on the quality of the sensor calibration. On the other hand, the uncertainty information about the reconstructed space points is crucial for the subsequent structure from motion algorithms. In this paper we present the first steps for an accurate and efficient handheld laser scanner solution: a calibration algorithm of the spatial relation between the laser and the camera and a reconstruction algorithm of the surface points illuminated by the laser. Our prototype of a laser scanner is depicted in Figure 1. We focus on the case in which the laser illuminates a straight line. Our original contribution in this work is to introduce a systematic uncertainty propagation to the calibration and to the reconstruction process of our sensor system. For that we combine techniques from [6] and [7]. This not only amounts to a more precisely calibrated sensor compared to the related approaches. We compute a comprehensive uncertainty measure for the reconstructed space points
Calibration and Reconstruction Algorithms
637
Xi
x pi z
Li
P
0 y
Fig. 2. Sensor geometry: The viewing ray Li corresponding to the image point pi intersects the laser plane in the point Xi , which is incident to the visible surface of the object. Notice that there is no visible intersection between the laser plane P and the image plane of the camera.
at run-time. More precisely, for each point we obtain a covariance matrix of its coordinates. This matrix encodes the principal directions and magnitudes of the uncertainty dissemination of the point. We precisely point at the computational steps, where our algorithms introduce a systematic error – the so-called bias – arising as a price to pay for the computational efficiency. Our calibration procedure uses normalization techniques from [6] to reduce the bias. Our research is aimed to increase the interest in consequent uncertainty propagation in geometric computations within the vision community. We consider the uncertainty considerations in the style of our current work an important step towards high precision sensor systems. The paper is organized as follows: In Section 2 we introduce basic geometric relations between the sensor components. In Section 3 we collect abstract theoretical results needed for uncertainty propagation in our algorithms. Section 4 then turns to the details of calibration and reconstruction. We evaluate our method in Section 5 and discuss some efficiency aspects of our implementation. We conclude our results and point to the future work in Section 6.
2
Sensor Geometry
The geometry of the sensor is depicted in Figure 2. The camera observes an object which is illuminated by a color line laser. The set of the three-dimensional points lit by the laser forms a plane P in space. We call this plane the laser plane. We call an image point pi corresponding to a surface point Xi lit by the laser a laser point. For each laser point there is a corresponding viewing ray Li . The intersection of each viewing ray Li and the laser plane P yields the three-dimensional coordinates Xi of the objects visible surface.
638
D. Lamovsky and A. Lasaruk
Ci
Xij
x
pij z
Lij
P
0 y Fig. 3. Calibration setup for the i-th view: From the chessboard pattern image we obtain the pose of the chessboard plane Ci . From the intersection of the viewing rays Lij corresponding to the laser points pij with the chessboard plane we obtain sample laser plane points Xij .
We fix a reference coordinate system for the camera and the laser. All we need for the three-dimensional reconstruction besides the intrinsic parameters and the pose of the camera is the location of the laser plane in this coordinate system. The basic idea for obtaining the plane is to sample sufficiently many incident points from multiple views. For that we expose a chessboard plate to the sensor system from different poses. The geometric setup for our calibration procedure for one view is shown in Figure 3. (a) For each view i = 1, . . . , v we compute the pose of the chessboard plane Ci and the laser points pij together with estimates of the precision. (b) For each view i we compute the viewing rays Lij corresponding to the laser points pij . (c) For each view i and for each laser point pij we compute the intersection point Xij of the corresponding viewing ray Lij and the chessboard plane. (d) We use all the computed Xij to obtain the parameters of the laser plane P . Estimation tasks in step (a) are theoretically well studied problems [6, 8, 9]. The implementation of the three relevant algorithms – chessboard pattern detection, intrinsic calibration of the camera, and reconstruction of the chessboards plane equation – can, for example, be found in the open source library OpenCV [10]. Notice that the obtained information in step (a) is sufficient to simultaneously estimate the intrinsic parameters and the pose of the camera.
3
Uncertainty in Sensor Geometry
Following the fundamental work in [6] and [7] we model uncertainty of the sensor geometry by considering all occurring geometric entities as Gaussian random
Calibration and Reconstruction Algorithms
639
variables. More precisely, each geometric object x modeled by a vector ranging over Rd is considered as a random variable with the probability density function 1 1 p(x) = exp − (x − μx )t Σx−1 (x − μx ) . 2 (2π)d det(Σx ) We then write shortly x ∼ N (μx , Σx ). The parameters μx ∈ Rd and Σx ∈ Rd×d are called the expectation and the covariance matrix of x respectively. We denote random variables by bold letters and allow us to use the sloppy notation x ∈ Rd to indicate the domain of x. We furthermore allow singular covariance matrices, for which the theoretical foundations are explained in [7]. We are going to approximately propagate the expectation and the covariance matrices of the geometric objects through our calibration and reconstruction algorithms. The quality of this uncertainty propagation highly depends on geometry representation. In this section we describe theoretical backgrounds for doing the propagation precisely and efficiently. The key observation is the fact that all of required geometric operations can be expressed by bi-linear functions or by the so-called eigenvector fit problems. A function f : Ri × Rj → Rk is called bi-linear, if there exist matrix-valued functions U : Ri → Rk×j and V : Rj → Rk×i such that for all x ∈ Ri and y ∈ Rj we have f (x, y) = U (x)y = V (y)x. (1) Let us consider a bi-linear function f : Ri × Rj → Rk . Suppose now, x and y are independent Gaussian random variables with x ∼ N (μx , Σx ) and y ∼ N (μy , Σy ). Then z = f (x, y) is a random variable. To approximate the expectation μz and the covariance matrix Σz of z we linearize f at (μx , μy ) with V (μy ) 0 x − μx f (x, y) ≈ f (μx , μy ) + . (2) 0 U (μx ) y − μy The expectation μz of z is then approximated by μz ≈ f (μx , μy ) = U (μx )μy = V (μy )μx .
(3)
Since bi-linear functions are quadratic forms, the computation of the expectation is essentially as efficient as matrix multiplication and also numerically stable. Note that according to Jensen’s inequality the approximation of μz introduces a bias, i. e. a systematic error, to results computed with Equation (3). However, due to the bi-linearity of f the bias has a moderate magnitude [11]. From Equation (2) it is not hard to see that the covariance matrix Σz is approximately given by Σz ≈ V (μy )Σx V (μy )t + U (μx )Σy U (μx )t .
(4)
Equations (3) and (4) are our basic uncertainty propagation equations. Whenever y is a “certain entity”, we may simply set Σy = 0, which consistently yields the exact propagation law for linear transformations μz = f (μx , μy ) = V (μy )μx
and
Σz = V (μy )Σx V (μy )t .
(5)
640
D. Lamovsky and A. Lasaruk
In Section 4 we specify operators in the style of Equation (1) for the intersection between a line in space and a plane in space and back-projection of a point to a viewing ray by a pinhole camera. Estimation of the plane equation from a set of points in space in presence of uncertainty is more involving. The state-of-the-art technique for that within the vision community is to solve ||Ax||2 → min
subject to
||x||2 = 1.
(6)
The solution to this least-squares eigenvector fit problem problem without perturbations of A is given by the eigenvector corresponding to the least eigenvalue of At A. Suppose that each row ai ∈ Rd of A for i = 1, . . . , n is additively corrupted by random Gaussian noise. In other words, each ai is an outcome of a random variable ai ∼ N (μi , Σi ), where μi ∈ Rd is the unknown true expectation. Then the solution x to the eigenvector fit problem is a random variable. It is now erroneous to consider the eigenvector corresponding to the smallest eigenvalue of At A as a solution, since eigenvector fit problems are biased in presence of perturbations of A. Kanatani [6] suggests instead to solve the normalized problem xt
n
ωi (μx )(ai ati − ε2 Σi )x → min
subject to
||x||2 = 1,
i=1
where the weights ωi (μx ) are reciprocally proportional to μtx Σi μx and depend on the true expectation of x. In addition, the unknown common noise level ε ∈ R is estimated simultaneously, since it is usually heavy to predict in the practice. The algorithm proposed by Kanatani iteratively refines estimates xk ∈ Rd of μx and γk ∈ R, which are closely related to ε2 as we will see in Equation (8). We set γ0 = 0 and ω0,i = 1 for all i = 1, . . . , n. The iteration step to obtain xk+1 given γk and ωk,1 , . . . , ωk,n is then to solve the eigenvector fit problem xtk+1 (M − γk N )xk+1 → min subject to ||xk+1 ||2 = 1, (7) n n where M = i=1 ωk,i ai ati and N = i=1 ωk,i Σi . As the smallest eigenvalue λd of M − γk N gets close to zero the iteration can be terminated with μx ≈ xk and ε2 ≈
γk , 1 − (d − 1)/(rn)
(8)
where r ∈ N is the number of non-zero ωi,k . Otherwise, γk and the weights ωk,i for i = 1, . . . , n are updated as γk+1 = γk +
λd , xtk+1 N xk+1
and ωi,k+1 =
1 , xtk+1 Σi xk+1
if xtk+1 Σi xk+1 > 0 and ωi,k+1 = 0 else respectively. On convergence the above algorithm is an optimal unbiased consistent estimator of μx and an unbiased estimator of ε2 . Notice that γi for all i ∈ N can be set to constant 1 provided
Calibration and Reconstruction Algorithms
641
that Σi are estimated precisely. An approximation of the covariance matrix of the solution x obtained by iteratively solving the problem in Equation (7) amounts to ([6], p. 285., Equation (9.100)) Σx = ε
2
d−1 i=1
vi vit , λi − λd
(9)
where λ1 > · · · > λd ∈ R are the eigenvalues and v1 , . . . , vd ∈ Rd are the corresponding eigenvectors of M − γk N . We allow us to denote eigenvector fit problems with occurring random variables in a sloppy notation to indicate that A is a random matrix of the above style ||Ax||2 → min subject to ||x||2 = 1. We have now collected sufficient uncertainty propagation theory to present our calibration and reconstruction algorithms.
4
Calibration and Reconstruction
We start to specify the matrix-valued functions U and V in terms of Equation (1) by discussing the intersection of a line in space with a plane in space. A coordinate-free representation of a line in space suitable for our bi-linear uncertainty propagation are the so-called Pluecker coordinates (see [7] for details). A line is encoded by a homogeneous vector L = (l1 , . . . , l6 ) ∈ R6 , which is best explained by the construction procedure from two Euclidean points X, Y ∈ R3 : X −Y L= . X ×Y We encode the calibration plane by a homogeneous vector P = (n, d) ∈ R4 , where n = (nx , ny , nz ) ∈ R3 is the normal vector of the plane and d ∈ R is the weighted distance of the plane to the origin. The intersection point of a line and a plane is then given by the bi-linear expression ⎛ ⎜ ⎜ ⎜ Π(P ) = ⎜ ⎜ ⎜ ⎝
d 0 0 d 0 0 0 −nz nz 0 −ny nx
X = Π(P )t L = Γ (L)t P, ⎞ 0 −nx ⎛ 0 −ny ⎟ 0 ⎟ ⎜−l6 d −nz ⎟ ⎟ , and Γ (L) = ⎜ ⎝ l5 ny 0 ⎟ ⎟ ⎠ −nx 0 l1 0 0
(10) l6 0 −l4 l2
−l5 l4 0 l3
⎞ −l1 −l2 ⎟ ⎟. −l3 ⎠ 0
Without loss of generality we assume that camera parameters are estimated precisely and that the lens distortions are already compensated by un-distorting the original images. Consequently, we can restrict to the finite pinhole camera model. A pinhole camera is given by a matrix K ∈ R3×4 . For each homogeneous
642
D. Lamovsky and A. Lasaruk
pixel coordinates p ∈ R3 and each corresponding homogeneous world point X ∈ R4 we have a λ ∈ R with λp = KX. The back-projection of the image point p to the corresponding viewing ray L ∈ R6 in space in Pluecker coordinates is given by (11) L = Q− (K)p, where the inverse camera matrix Q− ∈ R6×3 is given according to [7] by 0 I3 Π(k2 )k3 Π(k3 )k1 Π(k1 )k2 . Q− (K) = I3 0 Here I3 is the 3 × 3 identity matrix and k1 , k2 , k3 ∈ R4 are the three rows of K. To obtain the laser plane P ∈ R4 in space from a set of homogeneous points X1 , . . . , Xn ∈ R4 in space we minimize the squared sum of the incidence residuals Xti P with respect to the normalization constraint ||P||2 = 1. By defining A ∈ Rn×4 where the rows of A are outcomes of Xi this amounts to the eigenvector fit problem n
(Xti P)2 = ||AP||2 → min,
subject to
||P||2 = 1.
(12)
i=1
Notice that μXi are in fact obtained in the practice by averaging and uncertainty propagation. We, hence, tacitly use μXi as the outcomes of Xi in Equation (12). We can now formalize the calibration algorithm for our laser scanner. (a) Detect in each calibration view i = 1, . . . , v the laser points pi1 , . . . , pini and the chessboard plane Ci with uncertainty estimates pij ∼ N (μpij , Σpij ) and Ci ∼ N (μCi , ΣCi ). (b) For each view i compute the viewing rays Li1 , . . . , Lini for the laser points pi1 , . . . , pini by using Equation (11) and the error propagation in Equation (5) μLij = Q− (K)μpij , ΣLij = Q− (K)Σpij Q− (K)t . (c) For each view i compute the intersection points Xi1 , . . . , Xini between the chessboard plane and the viewing rays Li1 , . . . , Lini using Equation (10) and the error propagation in Equations (3) and (4) μXij ≈ Π(μCi )t μLij = Γ (μLij )t μCi , ΣXij ≈ Π(μCi )t ΣLij Π(μCi ) + Γ (μLij )t ΣCi Γ (μLij ). (d) Compute the plane equation of the laser plane from the points X11 , . . . , Xvnv in all views by solving the eigenvector fit problem in Equation (12) ni v
(Xtij P)2 = ||AP||2 → min,
subject to
i=1 j=1
by applying methods in Equation (7) and (9).
||P|| = 1
Calibration and Reconstruction Algorithms
643
Notice that the only source of bias in the above algorithm is the intersection of the viewing rays Lij with the chessboard planes Ci . Reconstruction of spatial points is formalized by the following algorithm. (a) Detect laser points p1 , . . . , pn in the image with uncertainty estimates pi ∼ N (μpi , Σpi ). (b) Compute the viewing rays L1 , . . . , Ln for the laser points using Equation (11) μLi = Q− (K)μpi , ΣLi = Q− (K)Σpi Q− (K)t (c) Compute the intersection points X1 , . . . , Xn between the laser plane P and the viewing rays L1 , . . . , Ln using Equation (10) μXi ≈ Π(μP )t μLi = Γ (μLi )t μP ΣXi ≈ Π(μP )t ΣLi Π(μP ) + Γ (μLi )t ΣP Γ (μLi ). Notice that an expression for μXi involving only a single matrix multiplication is derived from the above algorithm as μxi ≈ Π(μP )t Q− (K)μpi .
(13)
Again, the only source of bias is the intersection computation between the viewing rays and the laser plane.
5
Experimental Results
We have implemented the algorithms described in this paper. Our prototype depicted in Figure 1 uses a 650nm red color laser David LE 650-5-3-F with 5mW power and a Logitech color web camera running at resolution 640 × 480. We consciously did not use a high quality industrial camera to show that our methods are applicable also for cheap sensors. Sample input images of the camera together with the corresponding measured points are illustrated in Figure 4. It is a common point of criticism to statistical methods in geometry that the uncertainty of the measured data can not be retrieved in practice with sufficient precision. Unfortunately, a detailed discussion of methods for assessing these hardware-dependent parameters is out of scope of the current paper. To assess the metric quality of the reconstructed points we use objects with known geometry as exemplified in Figure 1. The height of the normed block and the plate is a representative measure for the reconstruction quality of the sensor. Our experiments show relative height errors of 0.28% − 4.3%. This corresponds to 0.017cm - 0.26cm error in a working distance of 20cm - 40cm when measuring a real block height of 6cm (see Figure 5), which is a fairly good result for the used hardware. To assess the quality of the covariance matrices computed by our method we have repeatedly reconstructed points of a fixed scene by using Equation (13)
644
D. Lamovsky and A. Lasaruk
Fig. 4. Experimental results: (a) section of the real camera images in the reconstruction process (b) reconstructed coordinates of the surface points (in m) after transforming them into the coordinate system orthogonal to the measurement table plate
without applying any uncertainty propagation. From the measurements x1 , ..., xn collected thorough a longer period of time we have computed the empirical expectation and covariance matrices n
μ ¯=
1 xi n i=1
n
and
¯= Σ
1 (xi − μ ¯)(xi − μ ¯)t . n − 1 i=1
(14)
An interesting observation sketched in Figure 5 is the relation between the standard deviation of distance measurements and the distance between the sensor and the object. The empirical deviation increases with the distance. In contrast to that the computed deviation has its minimum at the distance of the mean of the laser plane points used for calibration (ca. 0.35m). We conclude that empirical covariance matrices computed using the mean value for the laser plane do not reflect the real uncertainty situation properly as they do not cope for the uncertainty of the laser plane. Furthermore, our handheld laser scanner evidently obtains an optimal working distance which can be adjusted during the calibration process. We increase the computation speed of our implementation and remove a large portion of outliers by discarding image points, for which the reconstruction will unlikely yield precise results. For each image point p we start with the distribution p ∼ N (p, ε2 diag(1, 1, 0)). With ε = 0.25 we simulate an anisotropic laser point which is precise up to a fourth of a pixel size. We now apply the
Calibration and Reconstruction Algorithms
645
Fig. 5. Size of the normed block in Figure 1 (in m) and its standard deviation (in m) in dependency of the working distance (in m): Empirical deviation is depicted above the graph. Computed deviation is depicted below the graph.
reconstruction algorithm in Section 4 which yields an X ∼ N (μX , ΣX ). The square root σ ∈ R of the largest eigenvalue of the Euclidean part of ΣX is an upper bound for the coordinates dissemination. If σ is large the reconstructed point X is computed in-precisely, whenever the noise of our laser point detector is larger than ε. It makes sense to discard such points for the entire algorithm. An example for the discarded region is depicted in Figure 6. We conclude that the laser line in our special setup (laser below the camera) must be directed in the top region of the image to maximize the size of the working image area.
0.1 0
Contours 0.01 0.001 0.0004 0.0003
100
0.08 0.06 0.04 0.02
200
0 -0.02
300
-0.04 -0.06
400
-0.08 500
-0.1 0
100
200
300
400
500
600
700
800
Fig. 6. Distribution of σ (in cm) over the image: The gray regions above the 0.01contour have more than 1cm standard deviation. The black region in the upper-left image part has small standard deviations in all directions but the points are reconstructed behind the camera. All points above the 0.01-contour are discarded.
646
6
D. Lamovsky and A. Lasaruk
Conclusions and Future Work
We developed a calibration algorithm and a reconstruction algorithm for a handheld laser scanner which provide a comprehensive measure of uncertainty for the computed results. Our procedures efficient and precise but biased. The bias is, however, moderate in magnitude. Integration of efficient bias correction will be a part of our future work. Uncertainty interpretation of reconstructed points helps to remove outliers and supports subsequent reconstruction algorithms. We consider uncertainty propagation at the low level discussed in the paper a key feature for successive three-dimensional surface reconstruction from motion. The next step in our future work is to join several measurements obtained by moving the laser scanner around the object. We claim that this will amount to a precise surface reconstruction algorithm for a handheld laser scanner. Acknowledgments. This work is partially funded by the Bayerische Forschungsstiftung (BFS).
References 1. Franca, J., Gazziro, M., Ide, A., Saito, J.: A 3d scanning system based on laser triangulation and variable field of view. In: IEEE International Conference on Image Processing, ICIP 2005, vol. 1, pp. I – 425–428 (September 2005) 2. Rovid, A., Hashimoto, T.: Improved vision based 3d shape measurement method. In: IEEE International Conference on Computational Cybernetics, ICCC 2007, pp. 245–250 (October 2007) 3. Arthaya, B., Gunawan, I., Gunawan, H.: Point clouds construction algorithm from a home-made laser scanner. In: 8th International Conference on Intelligent Systems Design and Applications, ISDA 2008, vol. 1, pp. 570–575 (November 2008) 4. Kawasaki, H., Furakawa, R., Nakamura, Y.: 3d acquisition system using uncalibrated line-laser projector. In: 18th International Conference on Pattern Recognition, ICPR 2006, vol. 1, pp. 1071–1075 (2006) 5. Boehnen, C., Flynn, P.: Accuracy of 3d scanning technologies in a face scanning scenario. In: Fifth International Conference on 3-D Digital Imaging and Modeling, 3DIM 2005, pp. 310–317 (June 2005) 6. Kanatani, K.: Statistical Optimization for Geometric Computation: Theory and Practice. Elsevier Science Inc., New York (1996) 7. Corrochano, E.B., F¨ orstner, W.: Uncertainty and projective geometry. In: Handbook of Geometric Computing, pp. 493–534. Springer, Heidelberg (2005) 8. Zhang, Z.: A Flexible new technique for camera calibration. Technical report, Microsoft Research, Technical Report MSR-TR-98-71 (1998) 9. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 10. Bradski, G., Kaehler, A.: Learning OpenCV: Computer Vision with the OpenCV Library. O’Reilly, Cambridge (2008) 11. Heuel, S.: Statistical Reasoning in Uncertain Projective Geometry for Polyhedral Object Reconstruction. PhD thesis, Institute of Photogrammetry, University of Bonn (2002). Springer, Heidelberg (2004)
Comparison of Visual Registration Approaches of 3D Models for Orthodontics Raphaël Destrez1,3, Benjamin Albouy-Kissi2, Sylvie Treuillet1, Yves Lucas1, and Arnaud Marchadier1,3 1
Laboratoire PRISME, Polytech’Orléans 12 rue de Blois, 45067 Orléans, France 2 Université d'Auvergne, ISIT, BP 10448, 63000 Clermont-Ferrand, France 3 UsefulProgress, 23 rue d’Anjou, 75008 Paris, France {raphael.destrez,arnaud.marchadier}@gmail.com [email protected] [email protected] [email protected]
Abstract. We propose to apply vision techniques to develop a main tool for orthodontics: the virtual occlusion of two dental casts. For that purpose, we process photos of the patient mouth and match points between these photos and the dental 3D models. From a set of 2D/3D matches of the two arcades, we calculate the projection matrix, before the mandible registration under the maxillary through a rigid transformation. We perform the mandible registration minimizing the reprojection errors. Two computation methods, depending on the knowledge of camera intrinsic parameters, are compared. Tests are carried out both on virtual and real images. In the virtual case, assumed as perfect, we evaluate the robustness against noise and the increase of performance using several views. Projection matrices and registration efficiency are evaluated respectively by reprojection errors and the differences between the rigid transformation and the reference pose, recorded on the six degrees of freedom. Keywords: Visual registration, 3D models, projection matrix, orthodontics.
1 Introduction Orthodontics is a dental specialty concerned with the correction of bad positions of jaws and teeth to optimize dental occlusion for functional and esthetical purposes. The first classification of malocclusions was established by E.H Angle in 1898[1]. Today, the treatment can involve a chirurgical intervention or dental device implantation. Commonly, the orthodontist uses dental casts of patient’s arcades to plan the treatment and manually achieve an attempt of occlusion. As in many other medical applications, this manual practice can be replaced by new planning and diagnostic imaging tools. Dental casts can be completed by dentition photos [2] or radiographies [3]. Other imaging tools can be used like CT-scan [4][5] or face scan devices [6][7][8][9] that provide additional information about soft tissues and face appearance, but still requiring dental casting. Anyway these techniques are not widespread because there are expensive and bulky. J. Blanc-Talon et al. (Eds.): ACIVS 2011, LNCS 6915, pp. 647–657, 2011. © Springer-Verlag Berlin Heidelberg 2011
648
R. Destrez et al.
A real contribution of imaging is the possibility to create 3D numerical models of dental casts from surface laser scans [10] or 100µm resolution CT-scan [16]. These techniques differ on speed and cost but none enables an adjustment of the dental arcades corresponding to the real occlusion of patient. This information can be obtained with a silicone device used during the teeth printing [11] or another reference [7] but it requires additional operation for the practitioner. CT-scan or heavy devices [12] can be used to obtain the pose of the mandible relative to the maxillary but these techniques are not widespread. In practice, after the scan of dental casts, the search for the optimal occlusion is computerized [16]. A technician moves manually the mandible to modify the occlusion of the dental arcades 3D models. For this task, he relies on dedicated software and photos of mouth patient (Fig. 2a above). This process is very tedious and the result is operator dependent. So, our goal is to perform automatically most of the numerical occlusion setting. Thereby, it will be faster and independent of the operator. We chose to use real photos of the dentition to capture the natural occlusion of the arcades and to map it on the virtual model. This strategy avoids adding another step to the current orthodontist practice. In this paper, we carry out a comparative study of two methods to evaluate the relevance and precision of this approach. These studies are carried out for different configurations of viewpoints and 2D/3D correspondences.
2 Model Registration from Mouth Images The 3D models of the two dental arcades are obtained separately, preventing any expression in a shared work coordinate system with the original fitting of the arcades. Consequently, it is necessary to determine a rigid transformation setting in occlusion the mandible against the maxillary. We propose to use colors photos of mouth patient that display the real “in vivo” occlusion for several viewpoints (Fig. 2a). We choose a set of correspondences between 3D points on the surface model and their 2D projections in images to estimate this rigid transformation. The projection matrix obtained from homogenous coordinates can be expressed by the relation between 3D point, Mj, and its projection in the image i, mij: ∝
.
(1)
where ∝ means « by a scaling factor ». The 3 × 4 projection matrix Pi can be decomposed into a product of two matrices. These contain intrinsic parameters of the camera and extrinsic parameters: |
.
(2)
where ti and Ri are respectively the translation and the rotation defining the pose of the camera and K is the camera calibration matrix where corresponds to the pixel slope, and are the frame principal point coordinates, and horizontal and vertical sensor resolutions and is the focal camera distance:
Comparison of Visual Registration Approaches of 3D Models for Orthodontics
0 0
(3)
. 0
649
1
For current cameras, pixels are commonly considered as squares and the principal , , , point at the center of pictures, leading to the relations with little impact on the precision. Moreover, it allows the calculation of the matrix K from the picture size and the focal distance given in pixels. From a set of 2D/3D correspondences for the two arcades, we propose a two step resolution of the problem: firstly we estimate the projection matrix for each view from maxillary correspondences, before the registration of the mandible with analytical (DeMenthon) [13] or minimization [14] methods. Two situations can be encountered, depending on the knowledge of the focal distance. If the camera is calibrated, we must only determinate its pose (position and orientation), embedding six degrees of freedom (dof in the following). Conversely, if we don’t know the focal length, we calculate the complete projection matrix with 11 dof. In addition, the rigid registration of the mandible provides 6 dof of more. Several views can be used jointly during the registration. 2.1 Estimation of Pi With K Knowledge (WiK) In this method called « WiK », the determination of the projection matrix of the image i amounts to the extrinsic parameters estimation: three Euler’s angles defining the orientation Ri and three translation components of the ti vector. Firstly, we use the POSIT algorithm of DeMenthon [13] for a rough estimation of the six parameters, from correspondences established on the maxillary. Then, from this initialization, the six parameters are optimized by the Levenberg-Marquardt algorithm minimizing the sum of the squared reprojection error: ∑
∑
².
(4)
where i, j, and k are respectively picture, point and coordinate indexes, mijk refers to a refers to a the reprojection of the reference point coordinate in the picture and corresponding 3D point by the Pi matrix. For evaluation in experiments presented in section 3, we use the reprojection mean squared error (MSE): ∑
∑
².
(5)
2.2 Etimation of Pi Without K Knowledge (WoK) When K is unknown, the projection matrix Pi is estimated by least square minimization on the maxillary 2D/3D correspondences. The last element of Pi is fixed to 1. Then, the 11 parameters of the Pi matrix are optimized with the LevenbergMarquardt algorithm by minimizing sei (equation 4). This approach is called « WoK ».
650
R. Destrez et al.
2.3 Registration of the Mandible M One projection matrix estim mated, the mandible is registrated under the maxillaryy in occlusion position. The reeference coordinate system of the maxillary is illustraated figure 1: the origin is the grravity center of the point cloud of the maxillary, the Y aaxis is facing up and the Z axis is parallel to the rows of molars. The reference coordinnate system of the mandible refference is similar but placed on the gravity center of the mandible point cloud.
Fig. 1. Reeference coordinate system of the registration
The registration is carrieed out by calculating the rigid transformation T accordding to the fact that 3D point projections of the mandible are superimposed with the projections selected in thee pictures. Projections are obtained using the matrixx Pi estimated from the maxillaary points where a 3D point Mmand on the mandible andd its projection mi,mand on the pho oto is linked by: ,
∝
.
(6)
The transformation T emb beds 6 dof: three Euler’s angles and three translatiions estimated by minimizing the reprojection error (equation 4) using Levenbeerga identity transform as initial conditions. Marquardt algorithm with an
3 Experiments To validate the two previou us steps, tests have been carried out both on virtual and rreal images. From a “perfect” virtual v case, we evaluate the robustness against noise and the increase of performancee using several views. The virtual pictures are obtainedd by screenshots from the VTK K rendering of 3D models (Fig. 2a). So, several pointss of view can be simulated controlling the acquisition settings (focal, view angle, etc… …). Such virtual pictures are disstortion free but we can raise and control the noise levell on the point coordinates to sim mulate real watching process. Real pictures are taken witth a numerical 8 mega pixels camera (Canon EOS 350D), a 60 mm lens and an annuular flash (Fig. 2a). We use OpeenCV for the calculation functions like POSIT [13] and the levmar functions [15] for th he implementation of the Levenberg-Marquardt algorithm m.
Comparison of Visual Registration R Approaches of 3D Models for Orthodontics
a)
651
b)
d side view (right) (top) real image (1300 × 867) (down) virrtual Fig. 2. (a)Front view (left) and views (1300 × 867). (b) Initiall location of the 3D models before the registration.
3.1 Projection Matrix Esstimation To evaluate the estimation of projection matrix, the tests are carried out on 3D virttual ood occlusion position decided by orthodontics speciallist. models registrated in a go Ten features are manually pointed p on the 3D models of the maxillary, and ten othhers on the mandible one. The tw wo previous methods are compared on virtual data and rreal pictures. Estimated projection matrices are evaluated with the reprojection errrors (equation 5). We have donee this test for two different points of view: a front view and a side view. This second point p of view is one of the most lateral views that we can obtain. The figure 2a repressents a typical configuration of this point of view. 3.1.1 Virtual Data While perfect projectionss are simulated by using the VTK screen, 2D//3D correspondences can be co onsidered as perfect (Fig. 3). From this “perfect” case, we add an uniform noise on th he 2D point coordinates to compare the robustness of W WiK and WoK methods against noise. n The tables 1 and 2 present the average ± the standdard deviation, and the maximall values of the reprojection error (equation 5) respectivvely for maxillary and mandiblee points, after the optimization step. For each noise level, we average a hundred ran ndom realizations. To evaluate the error introduced bby a human operator, we also tesst a manual 2D point picking. Regarding the mandible reprojection errors, we can observe that the WiK methhod oK method for the estimation of the projection mattrix. surpasses globally the Wo It remains that the non-perfect estimation of K used in the WiK method coould vation for the maxillary errors. As only 2D/3D maxilllary explain the opposite observ correspondences are used to t estimate the projection matrix, the maxillary errors are lower than mandible errorss. The errors increase linearly with the noise level (frrom zero in the perfect case). It should be noticed that these errors remain small compaared to the size of pictures expreessed in pixels. Concerning the manual selection, mandiible MSE for the side view reacches 21.6 pixels (0.93 mm) for the WoK method while iit is 10.3 pixels for the WiK meethod. The manual picking errors are equivalent to a nooise around 6 pixels. This demonstrates that capturing manually the 2D reprojection witth a high precision is really a triicky task.
652
R. Destrez et al.
Fig. 3. (left) Virtual 2D picturre obtained by 3D model projection on the screen with 3D pooints projections (right) dentition 3D D model with manually selected 3D points Table 1. Reprojection MS SE (in pixels) for the WiK method and two virtual viewpoints
Noise (pixels) 2 4 Manual
MSE maxillarry 1.37±0.1 19 2.73±0.3 39 3.48±1.5 53
2 4 Manual
1.36±0.1 18 2.72±0.3 35 4.00±1.9 93
Front view Max. error MSE maxillary mandible 2.17 1.92±0.31 4.34 3.83±0.61 5.81 5.66±2.96 Side view 2.18 1.98±0.32 4.37 3.96±0.63 7.50 6.35±2.86
Max. error mandible 3.06 6.12 10.7 3.17 6.34 10.3
Table 2. Reprojection MS SE (in pixels) for the WoK method and two virtual viewpoints
Noise (pixels) 2 4 Manual
MS SE maxilllary 1.08± ±0.22 2.16± ±0.44 2.90± ±1.36
2 4 Manual
1.09± ±0.20 2.18± ±0.39 3.00± ±1.57
Front view Max. error MSE maxillary mandible 1.81 5.22±2.16 3.61 10.45±4.36 5.20 14.8±4.90 Side view 1.92 3.69±1.30 3.84 7.37±2.62 6.14 12.9±6.00
Max. error mandible 8.45 16.94 22.1 6.04 12.09 21.6
3.1.2 Real Case The estimation of the projeection matrix has been tested on two real images (Fig.2a) with a manual point selectiion. Five tests have been carried out to display a tendenncy. In table 3, we can observe, that mandible reprojection errors are still more importtant than maxillary ones. In thiss case, 1 mm corresponds to 11,5 pixels. We note again the superiority of the WiK method. m Especially in the manual case, we can see tthat reprojection errors are higher than on virtual images for the WoK method. This cann be mplexity for the manual picking in the real case, while the explained by a higher com
Comparison of Visual Registration Approaches of 3D Models for Orthodontics
653
color picture had not the same appearance (colors, texture, different lightings…) than the 3D model (shadows, relief…). In the virtual case, the simulated images are really similar to the 3D model so that ease the selection of correspondences. Table 3. Reprojection MSE (in pixels) for the WiK and WoK methods on two real pictures
Method WiK WoK
MSE on the maxillary 5.24±2.72 3.77±1.29
WiK WoK
5.44±2.68 4.54±2.19
Front view Max. error MSE on the maxillary mandible 9.57 8.38±2.90 5.93 36.6±11.1 Side view 10.6 7.70±3.08 8.70 16.8±4.50
Max.error mandible 13.6 53.5 12.2 24.0
3.2 Mandible Registration To test the registration both for virtual and real images, the 3D models of arcades are positioned in the same initial configuration, corresponding to a practical configuration. The registration has been done separately with each viewpoint and then using both images together. From the reference pose, the rigid transformation between mandible and maxillary is defined by the Euler’s angles α = −15°, β = −6 ° and γ = 6° and by the translation components X = −5 mm, Y = −24 mm and Z = −11 mm. This configuration is presented in the figure 2b. The quality of the final registration is evaluated by the differences on the six parameters of the rigid transformation. The reference pose is the one obtained by the orthodontist expert with the current technique. We present only the best results, obtained for a registration involving two viewpoints together. It includes also the reprojection error occurring after the registration step because this is the actual criteria minimized by the LM algorithm during the registration. As previously, a uniform noise is added on the perfect data to evaluate algorithm robustness and the errors are averaged on 100 random tests. Moreover, we carry out a set of manual selection tests. Currently, in orthodontic practice, there are no quantitative criteria to estimate the quality of a registration. We considered that a position error of 0.1 mm and an orientation error of 0.5 mm are relevant bounds compared to teeth dimensions. 3.2.1 Virtual Data Concerning the reprojection errors observed after the registration, they are lower than in table 3. In the manual selection case with the WiK method using only one picture, reprojection errors are 3.24 pixels for the front view and 3.80 for the side view. In the same configuration, for the WoK method, errors are 4.50 pixels for the front view and 4.65 for the side view. We can explain that by the fact that the minimized criterion is the mandible reprojection error. Moreover, the errors are slightly lower if the registration is done with one view rather than with two (in the same configuration for a registration using two points of view,
654
R. Destrez et al.
reprojection error is 4.27 pixels for the front view). In this latter case, optimization is a compromise between the two sets of points. This difference is more important for the WoK method. The figures 4, 5 and table 4 present the 3D differences after the registration relative to the reference positioning provided by the orthodontist expert. Globally, the errors with the WiK method are less important than with the WoK one with a linear dependence to noise. In the “perfect” case, the registration using two viewpoints provide very good results. In the graphs, horizontal straight lines correspond to the manual selection case (MS).
Errors (mm)
0.3 0.25 0.2 0.15 0.1 0.05 0 0
1
X axis X axis MS
2 3 Noise Y axis Y axis MS
4
5 Z axis Z axis MS
Angle errors (degrees)
Fig. 4. Gravity center errors on the virtual data (WiK method)
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0
1 α α MS
2
Noise β β MS
3
4
Fig. 5. Angle errors on the virtual data (WiK method)
5 γ γ MS
Comparison of Visual Registration Approaches of 3D Models for Orthodontics
655
Firstly, we have examined the positioning of the gravity center (Fig. 4 and table 4).The Z translation is the less precise because this direction corresponds to the depth in the pictures. The Y translation is the more precise because this axis is perpendicular to the camera axis. For the WiK method, this tendency is not respected, but all errors are low. For the manual selection the maximal error is 0.17 mm. For the WoK method, this tendency is respected but errors can be more important with a maximum of 0.38 mm for the manual selection. Secondly, we focused on the angle errors (Fig. 5 and table 4). The β and α angles seem to be the most difficult to evaluate with respectively 0.48° and 0.33° errors for the manual selection with the WiK method. Whatever the method, the γ angle is the easiest to estimate. Another time, WoK method has higher errors than WiK method. Table 4. 3D errors on the virtual data for the WoK method (in pixels for the gravity center errors and in degrees for the angle errors)
Gravity center errors 2 MS Angles errors 2 MS
X Y Z 3D 0.10±0.068 0,08±0,064 0,16±0,116 0.21 0,37 0,23 0,38 0,57 α β γ 0,43±0,31 0,42±0,32 0,26±0,17 1.36 0,75 0,32
3.2.2 Real Case For the 3D registration errors (table 5),WiK method produces again globally better results than WoK method. The errors are not very high, but they cannot always afford the precision required for a real dental occlusion. The tests carried out with a unique view reveal sometimes some aberrations. In that case, the registration is not realistic and some 3D errors can be high (until Z = 73 mm for the WoK method). By combining the two viewpoints, the registration is clearly improved and generally the aberrations almost disappear. Nevertheless, 3D errors are higher than with the virtual case. In average, the translation error is maintained below 0.8 mm and the rotation error below 1.2 degrees for the WiK method (respectively 2 mm and 4° for the WoK method). Table 5. 3D errors relative to the translation (mm) and rotation (degrees) after registration in the real case
WiK WoK WiK WoK
X 0.27±0.21 0.94±0.46 α 1.26±0.86 1.20±1.00
Y Z 3D 0.56±0.22 0.14±0.16 0.68±0.2 1.27±0.63 0.58±0.42 1.81±0.5 β γ 0.95±0.743 0.27±0.20 3.74±0.77 1.77±1.07
656
R. Destrez et al.
As in the virtual case, we observe an increase of the reprojection error by using the both picture. Nevertheless, these errors are higher than for the virtual case and the WiK method is still more effective (average of 6.86 pixels) against the WoK method (average of 12.7 pixels). Repetitive tests on real images show that registration is very sensitive to the point selection even if we try to click carefully the same features. Sometimes, when the registration is moderate with one view and sharp with the other one, the quality of the registration combining the two views is tempered by the poor results with the first view.
4 Conclusion These first experiments demonstrate that the automatic registration of two dental arcades can be achieved from images of mouth patient. The proposed methods are effective in the « perfect » virtual case, both for the projection matrix estimation and model registration. Nevertheless, the performance of these tests decreases when adding noise, uncertainties or with real shot constrains. It will be interesting to complete this study observing the impact of distortions on virtual cases. Globally, WiK method provides the better results than WoK method for all tested aspects even if in the manual cases, reprojection errors and 3D errors remain significant after registration. For the registration, reprojection error doesn’t seem the unique and the best criteria to minimize: relative distance between several critical points or 3D distances obtained from the calculation of triangulated 3D points could be more relevant. In the future, the most challenging problem is to realize an automatic matching between singular points or lines on 3D models of dental arcades and their corresponding features in the color pictures of the mouth. Our strategy will consist in bringing together the 2D and 3D representations to simplify the matching process. For this goal, we will carry out a 3D reconstruction from two or more views to extract and describe 3D features (bending radius) or reciprocally take advantage of 2D features (colorimetric gradient, texture). Moreover, we will enhance the 3D model with texture mapping. Acknowledgements. We thank the Ortho-Concept laboratory (http://www.orthoconcept.com/) and more especially Daniel Julie for providing us most of numerical data.
A Space-Time Depth Super-Resolution Scheme for 3D Face Scanning

Karima Ouji 1, Mohsen Ardabilian 1, Liming Chen 1, and Faouzi Ghorbel 2

1 LIRIS, Lyon Research Center for Images and Intelligent Information Systems, Ecole Centrale de Lyon, 36, av. Guy de Collongue, 69134 Ecully, France
2 GRIFT, Groupe de Recherche en Images et Formes de Tunisie, Ecole Nationale des Sciences de l'Informatique, Tunisie
Abstract. Current 3D imaging solutions are often based on rather specialized and complex sensors, such as structured light camera/projector systems, and require explicit user cooperation for 3D face scanning under more or less controlled lighting conditions. In this paper, we propose a cost-effective 3D acquisition solution with a 3D space-time super-resolution scheme which is particularly suited to 3D face scanning. The proposed solution uses a low-cost and easily movable hardware involving a calibrated camera pair coupled with a non-calibrated projector device. We develop a hybrid stereovision and phase-shifting approach using two shifted patterns and a texture image, which not only takes advantage of the assets of stereovision and structured light but also overcomes their weaknesses. We carry out a new super-resolution scheme to correct the 3D facial model and to enrich the 3D scanned view. Our scheme performs the super-resolution despite facial expression variation using a CPD non-rigid matching. We demonstrate both visually and quantitatively the efficiency of the proposed technique.
1 Introduction
3D shape measurement plays an important role in a wide range of applications such as manufacturing, surgical operation simulation, medical tracking, facial recognition and animation. 3D face capture has its specific constraints such as safety, speed and the natural deformable behavior of the human face. Current 3D imaging solutions are often based on rather specialized and complex sensors, such as structured light camera/projector systems, and require explicit user cooperation for 3D face scanning under more or less controlled lighting conditions [1,4]. For instance, in projector-camera systems, depth information is recovered by decoding patterns of a projected structured light. These patterns include gray codes, sinusoidal fringes, etc. Current solutions mostly utilize more than three phase-shifted sinusoidal patterns to recover the depth information, thus impacting the acquisition delay; they further require a projector-camera calibration whose accuracy is crucial for the phase-to-depth estimation step; and finally, they also need an unwrapping stage which is sensitive to ambient light, especially when the number of patterns decreases [2,5]. Otherwise, depth can
be recovered by stereovision using a multi-camera system as proposed in [3,6]. Correspondence between stereo images is performed by a stereo matching step to retrieve the disparity information [7]. When a number of light patterns are successively projected onto the object, the algorithm assigns to each pixel in the images a special codeword which allows the stereo matching to converge, and the 3D depth information is obtained by optical triangulation [3]. Meanwhile, the model computed in this way is generally quite sparse, and one needs to resort to other techniques to densify it. Recently, researchers have looked into super-resolution techniques as a solution to upsample and denoise depth images. Kil et al. [11] were among the first to apply super-resolution to laser triangulation scanners, by regular resampling from aligned scan points with associated Gaussian location uncertainty. Super-resolution was especially proposed for time-of-flight cameras, which have very low data quality and very high random noise, by solving an energy minimization problem [12,15]. In this paper, we propose a cost-effective 3D acquisition solution with a 3D space-time super-resolution scheme, using a calibrated stereo rig coupled with a non-calibrated projector device, which is particularly suited to 3D face scanning, i.e. rapid, easily movable and robust to ambient lighting conditions. The proposed solution is a hybrid stereovision and phase-shifting approach which not only takes advantage of the assets of stereovision and structured light but also overcomes their weaknesses. According to our method, first a 3D sparse model is estimated from stereo matching with a fringe-based resolution. To achieve that, only two π-shifted sinusoidal fringes are used to sample right and left candidates, with subpixel precision. Then, the projector vertical axis is automatically estimated. A dense 3D model is recovered by the intra-fringe phase estimation, from the two sinusoidal fringe images and a texture image, independently from the right and left cameras. The left and right 3D dense models are fused to produce the final 3D model, which constitutes a spatial super-resolution. We carry out a new super-resolution scheme to correct the 3D facial model and to enrich the 3D scanned view. Our scheme considers the deformable aspect of the face and performs the super-resolution despite facial expression variation using a CPD non-rigid matching. In contrast to conventional methods, our method is less affected by the ambient light thanks to the use of stereo in the first stage of the approach, replacing the phase unwrapping stage. Also, it does not require a camera-projector offline calibration stage, which constitutes a tedious and expensive task. Moreover, our approach is applied only to the region of interest, which decreases the whole processing time. Section 2 presents an overview of the offline and online preprocessing preceding the 3D model generation. Section 3 details the 3D sparse model generation and the projector axis estimation. In Section 4, we highlight the spatial super-resolution principle which densifies the 3D sparse model. Section 5 describes the super-resolution process through the time axis and explains how the 3D space-time super-resolution is carried out. Section 6 discusses the experimental results and Section 7 concludes the paper.
2 Offline and Online Preprocessing
First, an offline strong stereo calibration computes the intrinsic and extrinsic parameters of the cameras, estimates the tangential and radial distortion parameters, and provides the epipolar geometry as proposed in [10]. In the online process, two π-shifted sinusoidal patterns and a third white pattern are projected onto the face. A set of three couples of left and right images is captured, undistorted and rectified. The proposed model is defined by the system of equations (1); it constitutes a variant of the mathematical model proposed in [5]:

$$I_p(s,t) = I_b(s,t) + I_a(s,t)\,\sin(\phi(s,t)),$$
$$I_n(s,t) = I_b(s,t) + I_a(s,t)\,\sin(\phi(s,t) + \pi),$$
$$I_t(s,t) = I_b(s,t) + I_a(s,t). \qquad (1)$$
At time t, I_p(s,t) is the intensity of the pixel s in the positive image, I_n(s,t) is the intensity of s in the negative image and I_t(s,t) is the intensity of s in the texture image. According to our proposal, a π-shift between the first two patterns is optimal for a stereo scenario; the left-right samples used in stereo matching are located with a subpixel precision. As to the role of the last pattern, it is twofold: it allows to normalize the phase information, but it is also used to texture the 3D model. This model is defined in order to decorrelate the sinusoidal signals I_a(s,t)·sin(φ(s,t)) and I_a(s,t)·sin(φ(s,t)+π) distorted on the face from the non-sinusoidal term I_b(s,t). In fact, I_b(s,t) represents the texture information and the lighting effect and constitutes a contamination signal for the sinusoidal component. I_a(s,t) is the intensity modulation. φ(s,t) is the local phase defined at each pixel s. Solving (1), I_b(s,t) is computed as the average intensity of I_p(s,t) and I_n(s,t). I_a(s,t) is then computed from the third equation of the system (1) and the phase value φ(s,t) is estimated by equation (2):

$$\phi(s,t) = \arcsin\!\left(\frac{I_p(s,t) - I_n(s,t)}{2\,I_t(s,t) - I_p(s,t) - I_n(s,t)}\right). \qquad (2)$$

When projecting the sinusoidal pattern with the light source onto the object, gamma distortion makes the ideal sinusoidal waveforms non-sinusoidal, such that the resulting distortion is phase dependent. The gamma correction step is crucial to obtain an accurate sinusoidal component and is performed using a look-up table as proposed in [4], and the localization of the face is carried out as described in [7].
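To make the model of equations (1)–(2) concrete, the following NumPy sketch (our illustration, not part of the original paper; function and variable names are ours) decomposes the three captured images into the texture term, the intensity modulation and the wrapped phase:

```python
import numpy as np

def decompose_patterns(I_p, I_n, I_t, eps=1e-6):
    """Recover I_b, I_a and the wrapped phase phi from the positive, negative
    (pi-shifted) and texture images, following Eqs. (1) and (2)."""
    I_p, I_n, I_t = (img.astype(np.float64) for img in (I_p, I_n, I_t))

    I_b = 0.5 * (I_p + I_n)        # non-sinusoidal term: texture + lighting
    I_a = I_t - I_b                # intensity modulation, third equation of (1)

    # Eq. (2); the ratio is clipped so that arcsin stays defined under noise.
    ratio = (I_p - I_n) / np.maximum(2.0 * I_t - I_p - I_n, eps)
    phi = np.arcsin(np.clip(ratio, -1.0, 1.0))
    return I_b, I_a, phi
```

Gamma correction would be applied to the input images before this step, as described above.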
3 3D Sparse Model and Projector Axis Estimation
The sparse 3D model is generated through a stereovision scenario. It is formed by the primitives situated on the fringe change-over, which is the intersection of the sinusoidal component of the positive image and the π-shifted sinusoidal component of the negative one [7]. Therefore, the localization has a sub-pixel precision. Corresponding left and right primitives necessarily have the same Y-coordinate in the rectified images. Thus, the stereo matching problem is solved in
each epiline separately using dynamic programming. The 3D sparse facial point cloud is then recovered by computing the intersection of the optical rays coming from each pair of matched features. When projecting vertical fringes, the video projector can be considered as a set of vertically adjacent light sources. Such a consideration provides, for each epiline, a light source point O_Prj situated on the corresponding epipolar plane. The sparse 3D model is a series of adjacent 3D vertical curves obtained from the fringe intersections of the positive and negative images. Each curve describes the profile of a projected vertical fringe distorted on the 3D facial surface. We propose to estimate the 3D plane containing each distorted 3D curve separately. As a result, the light source vertical axis of the projector is defined as the intersection of all the computed 3D planes. This estimation can be performed either offline or online, unlike conventional phase-shifting approaches where the projector is calibrated offline and cannot change its position when scanning the object.
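The plane fitting and axis estimation described in this section can be sketched as follows (a minimal least-squares version written by us, not the authors' implementation):

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane n.x = d through an (N, 3) point set, via SVD."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    n = vt[-1]                      # normal = direction of least variance
    return n, float(n @ centroid)

def projector_axis(fringe_curves):
    """Estimate the projector's vertical axis as the common line of the fringe
    planes: its direction is orthogonal to every plane normal, and a point on
    it solves all plane equations in the least-squares sense."""
    normals, offsets = zip(*(fit_plane(curve) for curve in fringe_curves))
    N = np.vstack(normals)
    d = np.asarray(offsets)
    _, _, vt = np.linalg.svd(N)
    direction = vt[-1]              # smallest singular vector of the normals
    point, *_ = np.linalg.lstsq(N, d, rcond=None)
    return direction / np.linalg.norm(direction), point
```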
4 3D Spatial Super-Resolution
Here, the idea is to find the 3D coordinates of each pixel situated between two successive fringes in either the left or the right camera images, so that both cameras participate separately in the elaboration of the 3D model. Therefore, we obtain a left 3D point cloud from the left images and a right 3D point cloud from the right images. The spatial super-resolution consists of merging the left and right 3D point clouds. The 3D coordinates of each pixel are computed using phase-shifting analysis. Conventional phase-shifting techniques estimate the local phase in [0..2π] for each pixel of the captured image. Local phases are called wrapped phases. Absolute phases are obtained by a phase unwrapping procedure, which consists of determining the unknown integer multiple of 2π to be added at each pixel of the wrapped phase map to make it continuous. The algorithm considers a reference axis for which the absolute phase value is equal to 0. In the proposed approach, the sparse model replaces the reference axis and lets us retrieve the 3D intra-fringe information directly from the wrapped phases, in contrast to conventional approaches. Each point of the sparse model is used as a reference point and contributes to the extraction of the intra-fringe information from both the left and right images. In fact, each point Pi of the sparse model constitutes a reference point for all pixels situated between Pi and its next neighbor Pi+1 on the same epiline of the sparse model. For a pixel Pk situated between Pi(Xi, Yi, Zi) and Pi+1(Xi+1, Yi+1, Zi+1), we compute its local phase value φk using equation (2). We propose to find the depth information for Pk directly from the local phase and avoid the expensive task of phase unwrapping. The phase value of Pi is φi = 0 and the phase value of Pi+1 is φi+1 = π. The phase φk, which belongs to [0..π], has a monotonic variation if [Pi Pi+1] constitutes a straight line on the 3D model. When [Pi Pi+1] represents a curve on the 3D model, the function φk describes the depth variation inside [Pi Pi+1].
Fig. 1. Intra-fringe 3D information retrieval scheme
Therefore, the 3D coordinates (X(φk), Y(φk), Z(φk)) of the 3D point Pk corresponding to the pixel Gk are computed by a geometric reconstruction as shown in Figure 1. The 3D intra-fringe coordinate computation is carried out for each epiline i separately. An epipolar plane is defined for each epiline; it contains the optical centers OL and OR of the left and right cameras respectively, and all the 3D points situated on the current epiline i. Each 3D point Pk is characterized by its own phase value φ(Pk). The light ray coming from the light source to the 3D point Pk intersects the segment [Pi Pi+1] at a 3D point Ck having the same phase value φ(Ck) = φ(Pk) as Pk. To localize Ck, we need to find the distance PiCk. This distance is computed by applying the sine law in the triangle (OPrj Pi Ck), as described in equation (3):

$$\frac{P_i C_k}{\sin(\theta_C)} = \frac{O_{Prj} P_i}{\sin\big(\pi - (\theta_C + \alpha)\big)}. \qquad (3)$$
The distance OPrj Pi and the angle α between (OPrj Pi) and (Pi Pi+1) are known. The angle θ between (OPrj Pi) and (OPrj Pi+1) is also known. Thus, the angle θC is given by equation (4):

$$\theta_C = \frac{\theta}{\pi}\,\phi(C_k). \qquad (4)$$

After localizing Ck, the 3D point Pk is identified as the intersection between (OR Gk) and (OPrj Ck). This approach provides the 3D coordinates of all pixels. Point meshing and texture mapping are then performed to obtain the final 3D face model.
Conventional super-resolution techniques carry out a registration step between the low-resolution data, a fusion step and a deblurring step. Here, the phase-shifting
analysis provides registered left and right point clouds, since their 3D coordinates are computed from the same 3D sparse point cloud provided by stereo matching. Also, the left and right point clouds contain homogeneous 3D data and only need to be merged to retrieve the high-resolution 3D point cloud.
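The intra-fringe localization of equations (3)–(4) can be sketched as follows (our own illustration; it uses the linear phase-to-angle mapping of Eq. (4) as reconstructed above, and the final triangulation of P_k against the camera ray is left out):

```python
import numpy as np

def _angle(v1, v2):
    """Unsigned angle between two 3D vectors."""
    c = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def locate_intra_fringe_point(P_i, P_ip1, O_prj, phi_k):
    """Locate C_k on the segment [P_i, P_ip1] whose phase equals phi_k in [0, pi],
    using the sine law of Eq. (3) and the phase-to-angle mapping of Eq. (4)."""
    P_i, P_ip1, O_prj = (np.asarray(p, dtype=float) for p in (P_i, P_ip1, O_prj))
    theta = _angle(P_i - O_prj, P_ip1 - O_prj)    # fringe angle at the projector
    alpha = _angle(O_prj - P_i, P_ip1 - P_i)      # angle at P_i
    theta_c = theta * phi_k / np.pi               # Eq. (4): linear mapping
    # Eq. (3): |P_i C_k| = |O_prj P_i| * sin(theta_c) / sin(pi - (theta_c + alpha))
    dist = np.linalg.norm(O_prj - P_i) * np.sin(theta_c) / np.sin(np.pi - (theta_c + alpha))
    u = (P_ip1 - P_i) / np.linalg.norm(P_ip1 - P_i)
    return P_i + dist * u
```

The 3D point P_k itself would then be obtained by triangulating the camera ray through G_k with the projector ray through C_k.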
5 3D Space-Time Super-Resolution
A 3D face model can present some artifacts caused by an expression variation, an occlusion or even the facial surface reflectance. To deal with these problems, we propose to apply a 3D super-resolution through the time axis for each couple of successive 3D point sets Mt−1 and Mt at each moment t. First, a 3D non-rigid registration is performed; it is formulated as a maximum-likelihood estimation problem, since the deformation between two successive 3D faces is in general non-rigid. We employ the CPD (Coherent Point Drift) algorithm proposed in [13,14] to register the 3D point set Mt−1 with the 3D point set Mt. The CPD algorithm considers the alignment of two point sets Msrc and Mdst as a probability density estimation problem and fits the GMM (Gaussian Mixture Model) centroids representing Msrc to the data points of Mdst by maximizing the likelihood, as described in [14]. The source point set Msrc represents the GMM centroids and the destination point set Mdst represents the data points. Nsrc is the number of points of Msrc, with Msrc = {sn | n = 1, ..., Nsrc}, and Ndst is the number of points of Mdst, with Mdst = {dn | n = 1, ..., Ndst}. To create the GMM for Msrc, a multivariate Gaussian is centered on each point of Msrc. All Gaussians share the same isotropic covariance matrix σ²I, I being a 3×3 identity matrix and σ² the variance in all directions [13,15]. Hence the whole point set Msrc can be considered as a Gaussian Mixture Model with the density p(d) defined by equation (5):

$$p(d) = \sum_{m=1}^{N_{src}} \frac{1}{N_{src}}\, p(d\,|\,m), \qquad p(d\,|\,m) \propto \mathcal{N}(s_m, \sigma^2 I). \qquad (5)$$
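The Gaussian-mixture formulation of equation (5) can be illustrated with the small NumPy sketch below (written by us; it only evaluates the density and the soft correspondences, and does not implement the CPD M-step that updates the non-rigid transform and σ²):

```python
import numpy as np

def gmm_density_and_posteriors(M_src, M_dst, sigma2):
    """Evaluate the mixture density p(d_n) of Eq. (5) at every destination point
    and the soft correspondences P[m, n] = p(m | d_n), for a GMM whose centroids
    are the source points with equal weights and isotropic covariance sigma2*I."""
    n_src, dim = M_src.shape
    diff = M_src[:, None, :] - M_dst[None, :, :]          # (N_src, N_dst, dim)
    sq_dist = np.sum(diff ** 2, axis=2)
    G = np.exp(-sq_dist / (2.0 * sigma2))                  # unnormalized Gaussians
    norm = (2.0 * np.pi * sigma2) ** (dim / 2.0)
    density = G.sum(axis=0) / (n_src * norm)               # p(d_n), Eq. (5)
    posteriors = G / np.maximum(G.sum(axis=0, keepdims=True), 1e-12)
    return density, posteriors
```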
Once registered, the 3D point sets Mt−1 and Mt, as well as their corresponding 2D texture images, are used as low-resolution data to create a high-resolution 3D point set and its corresponding high-resolution 2D texture image. We apply the 2D super-resolution technique proposed in [16], which solves an optimization problem of the form:

$$\text{minimize}\; E_{data}(H) + E_{regular}(H). \qquad (6)$$
The first term E_data(H) measures the agreement of the reconstruction H with the aligned low-resolution data. E_regular(H) is a regularization or prior energy term that guides the optimizer towards a plausible reconstruction H. The 3D model Mt cannot be represented by only one 2D disparity image, since the points situated on the fringe change-over have a sub-pixel precision. Also, the left and right pixels participate separately in the 3D model, since the 3D coordinates of each pixel are retrieved using only its phase information as
described in Section 4. Thus, we propose to create three left 2D maps defined by the X, Y and Z coordinates of the 3D left points, and likewise three right 2D maps defined by the X, Y and Z coordinates of the 3D right points. The optimization algorithm and the deblurring are applied to compute high-resolution left and right images of X, Y, Z and texture, separately, from the low-resolution left and right maps of X, Y, Z and texture. We obtain a left high-resolution 3D point cloud Lt using the left high-resolution data of X, Y and Z. Likewise, a right high-resolution 3D point cloud Rt is obtained using the right high-resolution data of X, Y and Z. The final high-resolution 3D point cloud is retrieved by merging Lt and Rt, which are already registered since both of them contain the 3D sparse point cloud computed from stereo matching.
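The bookkeeping of this spatial fusion can be sketched as follows (a simplified stand-in written by us; the actual per-channel optimization and deblurring of [16] are not reproduced here):

```python
import numpy as np

def maps_to_cloud(X, Y, Z):
    """Stack per-pixel coordinate maps (X, Y, Z) into an (N, 3) point cloud,
    keeping only pixels with a valid depth."""
    valid = np.isfinite(Z)
    return np.stack([X[valid], Y[valid], Z[valid]], axis=1)

def merge_left_right(left_maps, right_maps):
    """Fuse the left and right high-resolution XYZ maps into one point cloud;
    no extra alignment is needed because both sides share the sparse stereo model."""
    clouds = [maps_to_cloud(*left_maps), maps_to_cloud(*right_maps)]
    return np.vstack(clouds)
```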
6 Experimental Results
The stereo system hardware is formed by two network cameras with a 1200×1600 pixel resolution and an LCD video projector. The projected patterns are digitally generated. The projector vertical axis is defined by a directional 3D vector N_proj and a 3D point P_proj. They are computed by analysing the 3D planes defined by the 3D points situated on the vertical profile of each fringe change-over. The equations of the fringe planes are estimated by a mean square optimization method with a precision of 0.002 mm. The directional vector N_proj is then computed as the vector normal to all the normal vectors of the fringe planes. N_proj is estimated with a deviation error of 0.003 rad. Finally, the point P_proj is computed as the intersection of all the fringe planes using a mean square optimization. Figure 2 presents the primitives extracted from the left and right views of a face with a neutral expression, and Figure 3 presents the reconstruction steps to retrieve the corresponding dense 3D model. Figure 3.a presents the sparse 3D face model of 4486 points. The spatial super-resolution provides a 3D dense model of 148398 points, shown in Figure 3.b before gamma correction and in Figure 3.c after gamma correction. Figure 3.d presents the texture mapping result. Capturing a 3D face requires about 0.6 seconds with an Intel Core2 Duo CPU (2.20 GHz) and 2 GB RAM. The precision of the reconstruction is estimated using a laser 3D face model of 11350 points scanned by a MINOLTA VI300 non-contact 3D digitizer. We perform a point-to-surface variant of the 3D
Fig. 2. Primitives extraction on localized left and right faces: (a) left view; (b) right view
Fig. 3. 3D spatial super-resolution results: (a) sparse point cloud; (b) before gamma correction; (c) after gamma correction; (d) textured model
rigid matching algorithm ICP (Iterative Closest Point) between a 3D face model provided by our approach and a laser 3D model of the same face. The mean deviation obtained between them is 0.3146 mm. Also, a plane with a non-reflective surface is reconstructed to measure the precision quality. Its sparse model contains 14344 points and the phase-shifting process provides a final dense model of 180018 points. To measure the precision, we compute the plane's theoretical equation, which constitutes the ground truth, using three manually marked points on the plane. Computing the orthogonal distance of each vertex of the 3D point cloud to the theoretical plane gives a mean deviation of 0.0092 mm. Figure 4 presents the primitives extracted on the localized left and right views of two successive frames of a moving face with an expression variation, and Figure 5 presents their corresponding dense 3D models computed by spatial super-resolution only. Figures 5.a and 5.b show the reconstructed meshes. Figures 5.c and 5.d present their texture mapping results. At time t, the left and right cameras create two different captured views of the face, which leads to some occluded regions and thus to the creation of artifacts, as shown in Figure 5. Occluded regions are situated on the face border. Also, the left view of the second stereo frame presents an occlusion on the nose region, and the corresponding points situated on the fringe change-over are not localized, as shown in Figure 4.c, which creates an artifact on the computed 3D model as shown
Fig. 4. Left and right primitives of two successive frames: (a) first left view; (b) first right view; (c) second left view; (d) second right view
Fig. 5. 3D spatial super-resolution results for two successive stereo frames: (a) first mesh; (b) second mesh; (c) first textured model; (d) second textured model
Fig. 6. 3D space-time super-resolution results: (a) mesh; (b) textured model
in Figures 5.c and 5.d. To deal with these problems, the 3D information from the first and second 3D models is merged, despite their non-rigid deformation, thanks to the super-resolution approach proposed in Section 5. As shown in Figure 6, our approach enhances the quality of the computed 3D model and also completes its scanned 3D view.
7 Conclusion and Future Work
We proposed in this paper a 3D acquisition solution with a 3D space-time super-resolution scheme which is particularly suited to 3D face scanning. The proposed solution is a hybrid stereovision and phase-shifting approach, using two shifted patterns and a texture image. It is low-cost, fast, easily movable and robust to ambient lighting conditions. The 3D scanning process can generate some 3D artifacts, especially in the presence of facial surface reflectance, an occlusion or an expression variation. Our super-resolution scheme considers the deformable aspect of the face and demonstrates its efficiency in dealing with the artifact problem using a CPD non-rigid matching. When the 3D scanned model presents severe artifacts, the temporal super-resolution fails to correct the 3D face and the artifacts can propagate through the following 3D frames. We aim to enhance the 3D video quality by using more 3D frames in the temporal super-resolution process.

Acknowledgments. This research is supported in part by the ANR project FAR3D under the grant ANR-07-SESU-003.
References
1. Blais, F.: Review of 20 years of range sensor development. J. Electronic Imaging 13, 231–240 (2004)
2. Zhang, S., Yau, S.: Absolute phase-assisted three-dimensional data registration for a dual-camera structured light system. J. Applied Optics 47, 3134–3142 (2008)
3. Zhang, L., Curless, B., Seitz, S.M.: Rapid shape acquisition using color structured light and multipass dynamic programming. In: 3DPVT Conference (2002)
4. Zhang, S., Yau, S.: Generic nonsinusoidal phase error correction for three-dimensional shape measurement using a digital video projector. J. Applied Optics 46, 36–43 (2007)
5. Zhang, S.: Recent progresses on real-time 3D shape measurement using digital fringe projection techniques. J. Optics and Lasers in Engineering 48, 149–158 (2010)
6. Cox, I., Hingorani, S., Rao, S.: A maximum likelihood stereo algorithm. J. Computer Vision and Image Understanding 63, 542–567 (1996)
7. Ouji, K., Ardabilian, M., Chen, L., Ghorbel, F.: Pattern analysis for an automatic and low-cost 3D face acquisition technique. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2009. LNCS, vol. 5807, pp. 300–308. Springer, Heidelberg (2009)
8. Lu, Z., Tai, Y., Ben-Ezra, M., Brown, M.S.: A Framework for Ultra High Resolution 3D Imaging. In: CVPR Conference (2010)
9. Klaudiny, M., Hilton, A., Edge, J.: High-detail 3D capture of facial performance. In: 3DPVT Conference (2010)
10. Zhang, Z.: Flexible Camera Calibration by Viewing a Plane from Unknown Orientations. In: ICCV Conference (1999)
11. Kil, Y., Mederos, Y., Amenta, N.: Laser scanner super-resolution. In: Eurographics Symposium on Point-Based Graphics (2006)
12. Schuon, S., Theobalt, C., Davis, J., Thrun, S.: LidarBoost: Depth Superresolution for ToF 3D Shape Scanning. In: CVPR Conference (2009)
13. Myronenko, A., Song, X., Carreira-Perpinan, M.A.: Non-rigid point set registration: Coherent Point Drift. In: NIPS Conference (2007)
14. Myronenko, A., Song, X.: Point set registration: Coherent Point Drift. IEEE Trans. PAMI 32, 2262–2275 (2010)
15. Cui, Y., Schuon, S., Chan, D., Thrun, S., Theobalt, C.: 3D Shape Scanning with a Time-of-Flight Camera. In: 3DPVT Conference (2010)
16. Farsiu, S., Robinson, D., Elad, M., Milanfar, P.: Fast and robust multi-frame super-resolution. IEEE Trans. Image Processing (2004)
Real-Time Depth Estimation with Wide Detectable Range Using Horizontal Planes of Sharp Focus

Hiroshi Ikeoka, Masayuki Ohata, and Takayuki Hamamoto

Department of Electrical Engineering, Tokyo University of Science, Tokyo, Japan
{ikeoka,oohata,hamamoto}@isl.ee.kagu.tus.ac.jp
Abstract. We have been investigating a real-time depth estimation technique with a wide detectable range. This technique employs tilted-optics imaging to exploit the variation of the depth of field over horizontal planes of sharp focus. It requires considerably fewer multiple focus images than the conventional passive methods, e.g., the depth-from-focus and depth-from-defocus methods. Hence, our method helps avoid the bottleneck of the conventional methods: the fact that the motion speed of the optical mechanics is significantly slower than that of the image processing parts. Therefore, it is suitable for applications, such as automotive and robotic tasks, that involve depth estimation with a wide detectable range and real-time processing.

Keywords: depth estimation, depth of field, tilted lens, depth-from-focus, depth-from-defocus.
1 Introduction
Computer vision technologies have advanced rapidly and considerable research has been conducted on the development of practical applications using technological advancements such as those in high-performance image sensors. In these technologies, depth estimation is considered to be a very important research field. There are two types of depth estimation methods in which one camera is used for image sensing: the “active” method employs the projection and reflection of artificial light such as infrared and laser light from the imaging device, and the “passive” method uses the reflection of environment light such as sunlight. In general, the composition of the latter is simpler and its detectable range is wider than that of the former. We have been investigating a passive method that possesses a wider detectable range. The “depth-from-focus” (DFF) technique and the “depth-from-defocus” (DFD) technique are well-known passive depth-estimation methods [1][2]. Depth estimation using the DFF technique is achieved by finding the most in-focus position over multiple focus images [1]. In addition, we have investigated a high-speed depth-estimation system that is based on DFF, developed a smart image sensor that performs both edge detection and in-focus judgment, and proposed an interpolation method to determine the depth in textureless regions [3]. These studies have contributed to improving the speed of depth estimation; however, the speed at which
multiple focus images can be captured is considerably slower than that at which the other processing steps can be completed. In order to avoid this bottleneck, we propose a new method in which only a few multiple focus images are used. In the following sections, we first indicate the drawbacks of the conventional method. Next, we describe an imaging method in which we employ the principles of tilted optics and the variance of the depth of field (DOF); these principles are used in our method. We then explain our proposed depth estimation method in which the sharpness curve of the variance of the DOF is taken into consideration. Finally, we present a few experimental results to demonstrate the performance of our method.
2 The Drawback of the Conventional Method with a Single Camera for Real-Time Use
The conventional method requires many “planes of sharp focus” (POFs) that face the image sensor frontally. Generally, as shown in Fig. 1, we obtain a number of multiple focus images by repeatedly changing the focal position in small steps using a mechanical component, or by using spectroscopic optics that has a limited number of divisions. Thus, there is a trade-off between the estimation speed and the estimation accuracy. Therefore, this process is a bottleneck for high-speed depth estimation using a conventional passive method with a single camera, like the DFF and DFD techniques. Moreover, without modification, the conventional method is considered unsuitable for real-time depth estimation. This is a disadvantage for developing a practical depth estimation system. To solve these problems, we estimate the depth value by changing the arrangement of the POFs so that they are horizontal. Several methods using horizontal POFs have been proposed before. However, these methods have disadvantages such as the involvement of mechanical structures, the restriction to target objects that have constant texture, and the inability to estimate depth pixel by pixel [4][5]. Previously, we also developed a method with horizontal POFs that obtains depth by counting the vertical distribution of the overlapping in-focus areas of the DOFs, obtained by an AND operation on two POF images; however, that method also cannot estimate depth pixel by pixel [6]. In contrast, our newly proposed method does not have these disadvantages. Accordingly, it is easy to use in various applications. We explain the details of our depth estimation method in the following sections.
Fig. 1. Conventional methods (DFF, DFD)
3 Depth Estimation by the DOF of a Tilted Optical Lens

3.1 A Tilted POF
Now let us consider the situation depicted in Fig. 2: an image sensor is placed on the left side of a horizontal axis, together with an optical unit (lens) tilted by an angle θ with respect to the image sensor. As a result, the POF is also tilted, unlike with normal optics. (Ly, Lz) is the center of rotation of the lens, H is the distance between the first principal point and the center of rotation, and H′ is the distance between the second principal point and the center of rotation. There are two light rays that reach the POF via each focal point, and a light ray that reaches the POF via the two principal points. Hence, we can obtain the intersection point of these rays on the POF; its z and y coordinates are expressed in terms of the position Y on the image sensor. Furthermore, by eliminating Y from these coordinates, we obtain the following POF expression:

$$y = \frac{1}{\sin\theta\,(L_z - H'\cos\theta)} \Big\{ (L_z\cos\theta - f - H'\cos^2\theta)\, z + f\cos\theta\,(H + H') - L_z(L_z\cos\theta - L_y\sin\theta + H - H') - H'\sin\theta\,(L_z\sin\theta + L_y\cos\theta) + H H'\cos\theta \Big\} \qquad (1)$$
However, if the distance between the image sensor and the target object is very large, we can assume that the lens is thin, for which the following holds:
$$H = H' = 0 \qquad (2)$$
Therefore, substituting the above in (1), we obtain the following expression.
$$y = \frac{1}{\sin\theta}\left(\cos\theta - \frac{f}{L_z}\right) z - \frac{L_z\cos\theta - L_y\sin\theta}{\sin\theta} \qquad (3)$$

Fig. 2. A tilted POF and DOF when using a tilted lens

In this case, the image sensor plane, the lens principal plane, and the POF intersect at the following point on the y-axis:
$$\left( -\frac{L_z\cos\theta - L_y\sin\theta}{\sin\theta},\; 0 \right) \qquad (4)$$
This law is known as the Scheimpflug principle. This condition shows that the POF can be inclined with respect to the horizontal by tilting the lens by the angle θ; our method uses horizontal POFs. A tilted POF can be obtained not only with a tilted lens but also with a tilted image sensor. However, the image produced by the latter setup is distorted. Hence, if the final output of an application is a depth map image, the former setup is better.

3.2 DOF by a Tilted POF
When using a normal optical unit, the plane of the image sensor and the principal plane of the lens are parallel, and the near and far DOF limits are parallel to these planes. However, when we use a tilted POF, the situation is different, as shown in Fig. 2. Both DOF limits, shown by the dashed lines in Fig. 2 for a tilted lens, are still linear, as in the case of normal optics. If ϕ is the angle of the POF with respect to the ground, the gradient of the POF is:

$$\tan\varphi = \frac{L_z\cos\theta - f - H'\cos^2\theta}{\sin\theta\,(L_z - H'\cos\theta)}. \qquad (5)$$
Strictly speaking, because the circle of confusion is oval and its size changes with the distance between the lens and the POF, the depth of focus is variable. However, because this variation is slight, we approximate the depth of focus by a constant value g. Hence, the gradients of the near and far DOF limits are derived as follows [7]:
$$\tan\varphi_+ = \tan\varphi + \frac{g}{f}\left(\frac{1}{\tan\theta} + \tan\varphi\right), \qquad \tan\varphi_- = \tan\varphi - \frac{g}{f}\left(\frac{1}{\tan\theta} + \tan\varphi\right) \qquad (6)$$
Given that the DOF limits have the gradients of (6), the DOF limits are:

$$y = \tan\varphi_+ \cdot z - \frac{f}{\tan\theta} - (L_z + H\cos\theta)\tan\varphi_+ + (L_y - H\sin\theta),$$
$$y = \tan\varphi_- \cdot z - \frac{f}{\tan\theta} - (L_z + H\cos\theta)\tan\varphi_- + (L_y - H\sin\theta). \qquad (7)$$
Furthermore, if we assume that a thin lens is used, we can substitute (2) into (7). The vertical DOF size at a given depth is then obtained as the difference between the two equations in (7).
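A small numerical sketch of equations (5)–(7), using the forms reconstructed above (parameter and function names are ours), shows how the vertical DOF size at a given depth can be evaluated:

```python
import numpy as np

def pof_gradient(theta, f, L_z, H_prime):
    """tan(phi) of the plane of sharp focus, Eq. (5)."""
    return (L_z * np.cos(theta) - f - H_prime * np.cos(theta) ** 2) / (
        np.sin(theta) * (L_z - H_prime * np.cos(theta)))

def dof_limit_gradients(tan_phi, g, f, theta):
    """Gradients of the near and far DOF limits, Eq. (6)."""
    delta = (g / f) * (1.0 / np.tan(theta) + tan_phi)
    return tan_phi + delta, tan_phi - delta

def vertical_dof_size(z, theta, f, g, L_y, L_z, H, H_prime):
    """Vertical extent of the DOF at depth z: difference of the two lines of Eq. (7)."""
    tan_phi = pof_gradient(theta, f, L_z, H_prime)
    tan_p, tan_m = dof_limit_gradients(tan_phi, g, f, theta)
    offset = -f / np.tan(theta) + (L_y - H * np.sin(theta))
    y_near = tan_p * z - (L_z + H * np.cos(theta)) * tan_p + offset
    y_far = tan_m * z - (L_z + H * np.cos(theta)) * tan_m + offset
    return y_near - y_far
```

Setting H = H' = 0 reproduces the thin-lens case of (2).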
3.3 Depth Estimation by a Horizontal POF
In this section, we describe the proposed method by using the vertical DOF size, which is related to the depth value. First, a POF is set parallel to the ground ( ϕ = 0 ). Thus,
$$\tan\varphi = 0. \qquad (8)$$
Furthermore, the gradients of both the near and far DOF limits then follow from (6) as:

$$\tan\varphi_+ = \frac{g}{f\tan\theta}, \qquad \tan\varphi_- = -\frac{g}{f\tan\theta}. \qquad (9)$$

By substituting (9) into (7), we obtain simplified DOF-limit expressions.

3.4 Depth Estimation by Using DOF
The variation of the DOF, which is related to the depth value, is one of the causes of reduced accuracy in conventional methods such as the DFF and DFD methods. However, having noticed that the DOF depends on the depth value, we decided to exploit it actively. When capturing images with the tilted lens, the area of the target object covered by the DOF bounded by (7) depends on the depth value and expands with it. Hence, we can obtain a depth value by observing the in-focus area on the target object plane. Conversely, since the optical unit has a finite field of view, the size of the DOF projected onto the image sensor, in pixels, shrinks with the depth value. Hence, the relation between the gradients of the DOF limits, expressed in (9), and the field of view determines whether the projected pixel size of the DOF increases or decreases with the depth value. Incidentally, the blurring across the DOF changes gradually, and various contrast values on a plane of the target object are mixed. Hence, it is hard to determine the DOF area from a single input image. Therefore, we use images from multiple POFs set at different altitudes.
4 Our Proposal for Depth Estimation Using Horizontal POFs

4.1 Sharpness Curve with Horizontal POFs
In this section, we describe the sharpness curve obtained with horizontal POFs. A plane with a white noise pattern of constant contrast was placed parallel to the image sensor, and we captured three images of the target plane using horizontal POFs that were arranged at even intervals and at different altitudes. One of the captured images is shown in Fig. 3(a). We then calculated the following d-value for these captured images, which is based on a Laplacian:

$$d = \big|\,l_i(x+1, y) + l_i(x-1, y) - 2\,l_i(x, y)\,\big| + \big|\,l_i(x, y+1) + l_i(x, y-1) - 2\,l_i(x, y)\,\big|. \qquad (10)$$
Here, l is the pixel intensity, the subscript i denotes the focal position (i.e., the POF image), and (x, y) are the pixel coordinates. When the d-value is high, the position is judged to be in focus. From these images, we obtained the d-values along one column line of the DOF in the three images, and graphed the variation of the sharpness value (Fig. 3(b)). We recognize that the in-focus position at which each POF is placed corresponds to a peak of this graph. The curve of Fig. 3(c) is the same graph with the offset value removed and processed by a low-pass filter. The d-value range between the maximum and minimum of the d-value histogram was divided into four blocks, and we set the mode value occurring in the smallest d-value block as the offset value. As shown in Fig. 3(d), the sharpness curves are approximated by a single Gaussian curve. Thereby, we consider the standard deviation of the approximated Gaussian curve to be an estimate of the DOF. Incidentally, in this paper, we do not discuss why the approximation by a Gaussian curve is valid; the reason is described elsewhere [6].
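The d-value computation and offset removal described above can be sketched in NumPy as follows (our illustration; the absolute values follow the modified-Laplacian reading of Eq. (10), and the offset rule is a direct rendering of the histogram procedure):

```python
import numpy as np

def d_value(image):
    """Sharpness measure of Eq. (10): sum of the absolute horizontal and
    vertical second differences (border pixels are left at zero)."""
    l = image.astype(np.float64)
    d = np.zeros_like(l)
    d[1:-1, 1:-1] = (
        np.abs(l[1:-1, 2:] + l[1:-1, :-2] - 2.0 * l[1:-1, 1:-1]) +
        np.abs(l[2:, 1:-1] + l[:-2, 1:-1] - 2.0 * l[1:-1, 1:-1]))
    return d

def remove_offset(d, bins=64):
    """Estimate the offset as the mode within the lowest quarter of the
    d-value range and subtract it."""
    lo, hi = d.min(), d.max()
    low_block = d[d < lo + 0.25 * (hi - lo)]
    hist, edges = np.histogram(low_block, bins=bins)
    offset = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])
    return np.clip(d - offset, 0.0, None)
```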
Fig. 3. Variance of sharpness on horizontal planes of sharp focus: (a) an image of white noise when using horizontal POFs; (b) variance of the d-value for one line in (a), for the upper, middle and lower POFs; (c) d-values with the offset removed and filtered with a LPF; (d) estimated Gaussian curves
4.2 Three Horizontal POFs and Sharpness Curves
In the following, we explain the estimation method using the approximated Gaussian curve. Under the assumption that each sharpness curve in Fig. 4 is approximated by a Gaussian curve, only the peak position y_i differs, while the peak value d_p and the standard deviation σ are the same. Hence, the expression is:

$$d_i = d_p \exp\!\left(-\frac{1}{2}\left(\frac{y - y_i}{\sigma}\right)^2\right) \qquad (i = -1, 0, 1). \qquad (11)$$
Fig. 4. Prediction of the sharpness curve (sharpness value, i.e. d-value, versus pixel position) from three planes of sharp focus
Here, i is the POF number, with −1 for the lower POF, 0 for the middle and 1 for the upper. Because the altitude interval between the POFs is Δy, the d-values of the upper and lower POFs are shifted by +Δy and −Δy, so that the d-values of the different curves are arranged on one Gaussian curve related to the middle POF. Therefore,
$$y_{-1} = y - y_0 - \Delta y, \qquad y_0 = y - y_0, \qquad y_1 = y - y_0 + \Delta y. \qquad (12)$$
Hence, the sharpness curve whose center is y_0 is expressed as:

$$d_i = d_p \exp\!\left(-\frac{1}{2}\left(\frac{y_i}{\sigma}\right)^2\right) \qquad (i = -1, 0, 1). \qquad (13)$$
Thus, the expression becomes:

$$d_i = d_p \exp\!\left(-\frac{1}{2}\left(\frac{y - y_0 + i\cdot\Delta y}{\sigma}\right)^2\right) \qquad (i = -1, 0, 1). \qquad (14)$$
Consequently, if we know y_0, Δy and σ, we can determine d_i. The relationships between y_0, Δy, σ and the depth value are fixed when the optical system is constructed. The DOF expands with the depth value, but, because of the field of view, the area of the DOF projected onto the image sensor becomes narrower. Hence, whether σ grows or shrinks depends on the setting parameters of the optical system, while σ itself depends on the depth value.

4.3 Algorithm of Our Proposed Method
If one depth value is determined, then y0 , Δy and σ can also be determined. Hence, a theoretical value is obtained by (14). When the theoretical value di’ is the
closest to the observed value di, the depth value setting is considered to be the correct depth value. In this judgment, we use the following expression.
$$E = (d_{-1} - d'_{-1})^2 + (d_0 - d'_0)^2 + (d_1 - d'_1)^2 \qquad (15)$$
If the contrast at every pixel of an image were constant, the d-values of the pixels would all lie on the Gaussian curve shown in Fig. 4. In reality, however, while the blur values (σ) are equal, the contrast is generally not constant. Hence, we cannot estimate the Gaussian curve from a single image. Therefore, by preparing multiple input images in advance, we can obtain the parameters y_0, Δy and σ of the Gaussian curve pixel by pixel. Hence, our method can estimate the depth value pixel by pixel. The actual algorithm of our proposed method is as follows (a code sketch of the per-pixel search is given below):

[1] We capture three input images from our camera system with the tilted optics.
[2] We calculate the d-value using (10), except for the border pixels, and generate three d-value images with the offset subtracted.
[3] We repeat steps [3-1] to [3-3] for all pixels.
  [3-1] We obtain d_i of the pixel of interest (x, y) from each d-value image.
  [3-2] We repeat steps [3-2-1] to [3-2-2] for the depth value range between the nearest and farthest points.
    [3-2-1] We get the three parameters of the sharpness curve (y_0, Δy, and σ) corresponding to the depth value determined in [3-2].
    [3-2-2] We repeat steps [3-2-2-1] to [3-2-2-2] for d_p ranging between 0 and 1020 (the maximum d-value for an 8-bit monochrome image).
      [3-2-2-1] We substitute y_0, Δy, σ and d_p into (14) and get d'_i.
      [3-2-2-2] We substitute the results of [3-1] and [3-2-2-1] into (15) and get the E-value.
  [3-3] The depth value of (x, y) is the solution for which the E-value is minimal. However, if the E-value is smaller than the prepared threshold value, (x, y) does not provide a depth value.

If we execute the above algorithm directly, it takes a long time. Hence, in our experiment, we reduced the computation needed to find the minimum E-value by adopting the Newton method.
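The per-pixel search can be rendered as the following sketch (written by us; `params_for_depth` stands for the calibrated depth-to-(y0, Δy, σ) relations of Sec. 5.2 and is hypothetical, and the peak value d_p is fitted in closed form instead of being scanned or refined with the Newton method):

```python
import numpy as np

def theoretical_d(y, d_p, y0, dy, sigma):
    """Theoretical values d'_i of Eq. (14) for i = -1, 0, 1."""
    i = np.array([-1.0, 0.0, 1.0])
    return d_p * np.exp(-0.5 * ((y - y0 + i * dy) / sigma) ** 2)

def estimate_depth(y, d_obs, candidate_depths, params_for_depth):
    """Search over candidate depths for the minimum error E of Eq. (15)."""
    best_E, best_z = np.inf, None
    for z in candidate_depths:
        y0, dy, sigma = params_for_depth(z)           # calibrated relations (Sec. 5.2)
        shape = theoretical_d(y, 1.0, y0, dy, sigma)  # Gaussian samples for d_p = 1
        # Least-squares fit of d_p: minimizes sum_i (d_i - d_p * shape_i)^2.
        d_p = float(shape @ d_obs) / max(float(shape @ shape), 1e-12)
        E = float(np.sum((d_obs - d_p * shape) ** 2))
        if E < best_E:
            best_E, best_z = E, z
    return best_z, best_E
```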
4.4 Our Proposed Method and the DFD Method
Our proposed method is similar to the DFD method in that it estimates a Gaussian curve from three images. However, the DFD method uses vertical POFs and estimates the depth value, pixel by pixel, from the peak of the Gaussian curve fitted to the variation of the blur. In contrast, our proposed method focuses on the expanse of the DOF, which depends on the depth value. In the case of the DFD method, the sharpness curve changes rapidly. Our method, using horizontal POFs, is instead easy to use for applications that need a wide detectable range.
5 Experiment with Our Proposed Method

5.1 Experiment Environment
We carried out an experiment to confirm our proposed method. We used the simple camera system shown in Fig. 5, whose primary device specifications are listed in Table 1, and obtained three 8-bit monochrome images. Our camera system was set up according to the conditions (1), (5), and (8). We set three horizontal POFs with the lens tilted at an angle θ = 38.0° and Lz = 116.5 mm. In this case, the exact value of the POF angle is −1.34°; the actual POF angle is thus almost horizontal. The altitudes of the POFs were 10.0 mm, 20.0 mm, and 30.0 mm. In addition, we treated the optical unit as a thick lens, setting H = 23.9 mm and H′ = 5.5 mm. At this time, we could not prepare a spectroscopic device; therefore, we obtained the three images at each depth by simply moving the camera vertically in three steps.
5.2 Relationship between Depth Value and Sharpness Curve
In advance of the main experiment, we investigated the relationship between the depth value and the DOF, and built a table holding the relationships between the depth and each of the parameters y0, Δy, and σ. We obtained y0, Δy and σ for several depth steps by estimation with an image board bearing a noise pattern like that of Fig. 3(a), using the same image sensor and optics shown in Fig. 5. Thus, we could derive the relationships between the depth and y0, Δy, and σ that are shown in Fig. 6. We experimented by using these relationships. These relationships, being based on actual measurements, absorb the influence of complex distortions caused by the tilted lens that are not accounted for by simple optical theory.

Fig. 5. Camera system used in our experiment

Table 1. Primary specifications of the apparatus used in our experiment
Item            Maker & model of product    Specifications
Camera          Canon EOS 50D               Image sensor size: 22.2 × 14.8 mm; effective pixels: approx. 1220M pixels; total pixels: approx. 1240M pixels
Optical bench   HORSEMAN LD                 Minimum flange back: approx. 70 mm; monorail length: 400 mm; raise, fall, shift length: 30 mm; swing, tilt angle: 360°
Lens module     FUJIFILM FUJINON SWD        Focal length: 90 mm; aperture ratio: 5.6; minimum diaphragm: 64

Fig. 7. The dependence of parameters on the depth value: (a) relation between the lower position of the plane of sharp focus y0 and depth z; (b) relation between the distance between the POFs Δy and depth z; (c) relationship between the standard deviation σ and depth z
5.3 The Experiment for Our Proposed Method
We used a toy vehicle as the target object; its back surface measures 40 mm wide by 30 mm high. The distance between the camera and the target object was 1108.0 mm. In this experiment, we obtained three input images like those of Fig. 7(a). Then, we obtained the d-value images of Fig. 7(b) from these input images. We estimated the depth value of each pixel with a resolution of approximately 1 mm. A target pixel must have three d-values, after subtraction of the offset (equal to 10), that are larger than the threshold value of 2. Among the edge pixels of the back surface, 359 pixels obtained a depth value. Fig. 8(a) shows the positions of the estimable pixels, where nearer pixels are shown brighter, and Fig. 8(b) shows a histogram of these depth values. In Fig. 8(b), the correct depth between the camera and the target object is in reality 1108.0 mm. However, the average of the estimated depth values is 1126.28 mm, and the MAE (mean absolute error) is 33.4 mm. The processing time was 30.5 seconds on an Intel Core2 Duo running at 2 GHz. Therefore, we optimized the processing algorithm that finds the minimum E-value by implementing the Newton method. As a result, the speed was increased and the processing time dropped to 46 milliseconds. Furthermore, the MAE was also improved, to 33.06 mm, because d_p can be determined with an accuracy of one decimal place thanks to the processing speed-up.

Fig. 8. Input and d-value images in our experiment: (a) images input by the lower, middle, and upper POFs; (b) d-value images from the lower, middle, and upper POFs

Fig. 9. Results of our experiment: (a) positions of estimable pixels (depth map); (b) histogram of depth values, with the correct position of the object indicated
6 Conclusion
In this paper, we proposed a depth estimation method that uses sharpness curves expressing the variation of the DOF with horizontal POFs. We succeeded in substantially reducing the number of multiple focus images with a sufficiently simple optical system that does not involve mechanical structures. This makes it possible to eliminate the drawback of the conventional passive methods for real-time use. Additionally, we confirmed the feasibility of our proposed method by a simple experiment. We are planning to improve the DOF detection algorithm using pixel values and inter-frame differences. Additionally, we are going to construct an application using our method.
References
1. Krotkov, E.: Focusing. International Journal of Computer Vision 1(3), 223–237 (1987)
2. Nayar, S.K., Nakagawa, Y.: Shape from Focus. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(8), 824–831 (1994)
3. Ikeoka, H., Kashiyama, H., Hamamoto, T., Kodama, K.: Depth Estimation by Smart Image Sensor Using Multiple Focus Images. The Institute of Image Information and Television Engineers 62(3), 384–391 (2008)
4. Krishnan, A., Ahuja, N.: Range Estimation from Focus Using a Non-Frontal Imaging Camera. International Journal of Computer Vision 20(3), 169–185 (1996)
5. Cilingiroglu, U., Chen, S., Cilingiroglu, E.: Range Sensing with a Scheimpflug Camera and a CMOS Sensor/Processor Chip. IEEE Sensors Journal 4, 36–44 (2004)
6. Ikeoka, H., Hamamoto, T.: Real-Time Depth Estimation with Wide Detectable Range Using Variance of Depth of Field by Horizontal Planes of Sharp Focus. The Institute of Image Information and Television Engineers 64(3), 139–146 (2008)
7. Merklinger, H.M.: Focusing the View Camera. MacNab Print, Canada (1993)
Automatic Occlusion Removal from Facades for 3D Urban Reconstruction

Chris Engels 1, David Tingdahl 1, Mathias Vercruysse 1, Tinne Tuytelaars 1, Hichem Sahli 2, and Luc Van Gool 1,3

1 K.U. Leuven, ESAT-PSI/IBBT
2 V.U. Brussel, ETRO
3 ETH Zurich, BIWI
Abstract. Object removal and inpainting approaches typically require a user to manually create a mask around occluding objects. While creating masks for a small number of images is possible, it rapidly becomes untenable for longer image sequences. Instead, we accomplish this step automatically using an object detection framework to explicitly recognize and remove several classes of occlusions. We propose using this technique to improve 3D urban reconstruction from street level imagery, in which building facades are frequently occluded by vegetation or vehicles. By assuming facades in the background are planar, 3D scene estimation provides important context to the inpainting process by restricting input sample patches to regions that are coplanar to the occlusion, leading to more realistic final textures. Moreover, because non-static and reflective occlusion classes tend to be difficult to reconstruct, explicitly recognizing and removing them improves the resulting 3D scene.
1 Introduction
We seek to reconstruct buildings within urban areas from street level imagery. Most earlier approaches to 3D reconstruction have worked solely on low level image data, finding correspondences and backprojecting these into 3D. In contrast, we believe that in order to obtain high quality models, higher level knowledge can best be incorporated into the 3D reconstruction process from the very start, i.e. information of what the image actually represents should be extracted in combination. Here, we focus on one particular example of such top-down, cognitive-level processing: detecting cars or vegetation allows us to remove these occluding objects and their textures from the 3D reconstructions and to focus attention on the relevant buildings behind. On the other hand, as soon as 3D information becomes available, it is helpful in the interpretation of the scene content, as probabilities to find objects at different locations in the scene depend on the geometry, e.g. whether there is a supporting horizontal plane available. So the geometric analysis helps the semantic analysis and vice versa. Obtaining a better understanding of the scene content not only helps the 3D modeling, but can also be useful in its own right,
Fig. 1. Overview of our processing pipeline. Geometric analysis: structure from motion, dense reconstruction, plane fitting. Semantic analysis: vehicle detection, false positive removal, vehicle segmentation, vegetation detection. Both feed into the final occlusion removal.
creating a semantically rich model. Instead of just having the geometry (and possibly texture) of a scene, we now also have a partial interpretation. Ultimately, we aim for a system that knows what it is looking at, recognizing common objects such as doors, windows, cars, pedestrians, etc. Such a semantically rich model can be considered as an example of a new generation of representations in which knowledge is explicitly represented and therefore can be retrieved, processed, shared, and exploited to construct new knowledge. We envision that this will be a key component for the next generation GPS systems, providing a more intuitive user interface and/or additional information (e.g. warnings concerning speed limits, pedestrian crossings, tourist information about specific buildings, etc.). Automatic extraction of such models is crucial if one wants to keep the system up to date without having to employ an army of image annotators. The specific problem we will be focusing on here is the occlusion of building facades by foreground objects. Occluders are often non-static (e.g. vehicles or pedestrians) and therefore usually not relevant from a user perspective. Moreover, they are often difficult to reconstruct accurately due to specular reflections (e.g. cars), non-rigidity (e.g. pedestrians), or very fine structures (e.g. trees). Some measures can be used to mitigate the presence of these occlusions, such as increasing camera height or fusing images from different viewpoints to help complete textures, but some regions of a building facade may simply not be visible from any viewpoint. In such cases, it is possible to estimate the appearance of the occluded region using established inpainting approaches (e.g. [4]). However, these approaches frequently assume that occlusions will be spatially limited and manually masked, which is not feasible for a larger dataset. Instead, we propose automatically finding occlusions using object-specific detectors. In practical situations, occlusions originate almost exclusively from a limited number of known object categories such as cars, pedestrians, or vegetation. These can be detected and recognized using state-of-the-art object detection methods (e.g. [6]). We add a refinement step based on Grab-cut [14] to obtain clean segmentations and show that this allows removing the foreground objects using inpainting without any manual intervention. We compare this method with a baseline scheme where objects are removed based on depth masking. Additionally, we show that superior results can be obtained by exploiting the planar structure of building facades and ground. We first rectify the image
with respect to the plane, so as to reduce the effects of perspective deformation. During the inpainting process, we restrict the input sample patches to regions belonging to the particular plane being completed, leading to more realistic final textures. Finally, we fill in the missing depth information by extending the planes until they intersect. In summary, our processing pipeline goes as follows (see also Fig. 1): given an image sequence, our approach initializes by estimating camera parameters and creating a sparse reconstruction. We estimate an initial dense, multi-view 3D reconstruction, from which facade and ground planes are detected. This geometric analysis is described in Sec. 3. In parallel, we detect vehicles within a detection-by-parts framework, while vegetation is detected with a patch-based texture classifier. The vehicle detections provide only a rough location of occlusions, which we then refine to obtain a final segmentation mask. These steps are described in Sec. 4. We proceed to eliminate occlusions using a patch-based inpainting approach that constrains input samples to a neighboring facade. We replace depths corresponding to occlusions with those of background facades, thereby eliminating the occlusions from the final reconstruction. This is explained in Sec. 5. Finally, in Sec. 6 we show some experimental results, and Sec. 7 concludes the paper.
2 Previous Work
2.1 Cognitive 3D
This approach to 3D scene modeling in a sense comes close to the seminal work of Dick et al. [5], who combine single view recognition with multiview reconstruction for architectural scenes. They use a probabilistic framework that incorporates priors on shape (derived from architectural principles) and texture (based on learned appearance models). A different kind of constraint is used in [21], where a coarse piecewise-planar model of the principal scene planes and their delineations is reconstructed. This method has been extended by fitting shape models to windows [20], recognizing these models in the images using a Bayesian framework similar to the method described in [5]. However, these methods are limited to scenes that are highly constrained geometrically, resulting in a relatively strict model of what the scene may look like. This limits the applicability. Their method cannot easily be relaxed to general street views with trees, pedestrians, moving cars, etc. We want to exploit the recent advances in the recognition of generic object classes [6,12] to apply similar ideas to more general scenes. Note that we will not use strict geometric or probabilistic models during 3D modeling though. Instead, the higher level information is exploited to focus the attention, i.e. to select interesting parts to reconstruct. The opposite scheme, where geometry is exploited to help recognition, has further been explored in [9]. Also worth mentioning here are a series of works that try to estimate depth from a single image [15,8,9]. Finally, [16] investigates how to infer depth from recognition, by transferring depth profiles from training images of a particular object class to a newly detected instance.
2.2 Occlusion Removal
Occlusion removal has been extensively studied within the computer vision and graphics communities, mostly building on advances made in the work on texture synthesis. Most approaches rely on the prior manual annotation of occlusions. Our inpainting strategy is based on the patch exemplar-based technique of Criminisi et al. [4]. Wang et al. [19] extended this approach to also infer depth from stereo pairs. Several works have noted that manual workload can be greatly decreased using interactive methods that allow a user to quickly mark foreground and background areas, while exact segmentations are determined using graph cuts. The PatchMatch algorithm of Barnes et al. [1] allows for interactive rates for inpainting and reshuffling via simple user-defined constraints and efficient nearest neighbor search. Within the context of building facades and urban reconstruction, increased contextual knowledge is available by assuming a structure's planarity and repetition of floors, windows, etc. Konushin and Vezhnevets [11] reconstruct building models by detecting floors and estimating building height. Occlusions are removed by cloning upper floors and propagating them downward. Rasmussen et al. [13] use multiple views and median fusion to remove most occlusions, requiring inpainting only for smaller regions. Benitez et al. [2] rely on a LIDAR point cloud to find and remove occlusions closer than detected planes by combining image fusion and inpainting. Xiao et al. [22] semantically segment street-side scenes into several classes, including vegetation and vehicles, but do not actively fill in missing data. Instead, they rely on the missing information being available from other views.
3 Geometric Analysis
Planes are the most dominant geometric primitives found in urban environments and form a natural way of representing metropolitan scenes in 3D. Parameterizing a scene into planes not only gives a lightweight representation of the scene, but also provides the information necessary for geometric algorithms such as image rectification and occlusion detection. This section describes how we obtain a 3D reconstruction from a set of images from which the dominant planes are extracted.
3.1 3D Reconstruction with ARC3D
A sparse 3D reconstruction does not typically provide enough geometry for plane extraction. The ground plane can be especially difficult due to weak texture containing very few salient feature points to match between the images. However, many dense reconstruction methods can do better here, as the search space between corresponding image pixels is limited once the cameras are calibrated. To this end, we use the publicly available ARC3D web service [18], which computes
dense depth maps from a set of input images. It is composed of an uncalibrated structure-from-motion pipeline together with a dense stereo matcher. The user uploads images to a server via a client tool, which then computes the depth maps and notifies the user via email when the computation is done.
3.2 Plane Extraction
The depth maps from ARC3D are merged into a 3D point cloud which is used as input to a plane detection algorithm. A maximum likelihood RANSAC scheme is employed to find the most dominant plane in the point cloud. We use a thresholded squared Euclidean distance between the candidate plane and a test point to determine whether the test point is an inlier, and keep the candidate plane with the highest number of inliers. Subsequent planes are found by iteratively running RANSAC on the outliers from the previous detection.
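The iterative RANSAC plane extraction just described can be sketched as follows. This is a minimal illustration assuming the merged depth maps are available as an N x 3 NumPy array of points; the inlier threshold and iteration count are illustrative placeholders rather than the values used in the paper.

```python
import numpy as np

def fit_plane(p1, p2, p3):
    """Plane (n, d) with n.x + d = 0 through three points, or None if degenerate."""
    n = np.cross(p2 - p1, p3 - p1)
    norm = np.linalg.norm(n)
    if norm < 1e-12:
        return None
    n = n / norm
    return n, -np.dot(n, p1)

def ransac_plane(points, thresh=0.05, iters=1000, seed=None):
    """Return (n, d) of the dominant plane and a boolean inlier mask."""
    rng = np.random.default_rng(seed)
    best_plane, best_inliers = None, None
    for _ in range(iters):
        idx = rng.choice(len(points), 3, replace=False)
        plane = fit_plane(*points[idx])
        if plane is None:
            continue
        n, d = plane
        sq_dist = (points @ n + d) ** 2          # squared point-to-plane distance
        inliers = sq_dist < thresh ** 2
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_plane, best_inliers = (n, d), inliers
    return best_plane, best_inliers

def extract_planes(points, n_planes=3, **kw):
    """Iteratively detect dominant planes, re-running RANSAC on the outliers."""
    planes, remaining = [], points
    for _ in range(n_planes):
        plane, inliers = ransac_plane(remaining, **kw)
        planes.append(plane)
        remaining = remaining[~inliers]
    return planes
```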
3.3 Image Rectification
Projective distortion may cause unwanted artifacts from the inpainting algorithm. Thus we need to rectify the input images such that each imaged plane is seen from a fronto-parallel viewpoint. This can be achieved by applying the homography

$$H = K R K^{-1} \tag{1}$$

to each pixel in the input image [7], where K is the camera calibration matrix and

$$R = \begin{pmatrix} r_1^T \\ r_2^T \\ r_3^T \end{pmatrix} \tag{2}$$

is a rotation matrix. With $\pi_F$ and $\pi_G$ as the facade and ground plane normals in the camera frame, the rotation is formed as follows. First, we need to align the camera viewing direction (z-axis) with the plane normal, thus $r_3 = \pi_F$. Further, $r_1$ is selected to align the image x-axis with the intersection between $\pi_F$ and $\pi_G$, $r_1 = \pi_F \times \pi_G$. Finally, we set $r_2 = r_1 \times r_3$ to complete the orthogonal basis.
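A small sketch of how the rectifying rotation and homography of Eqs. (1)-(2) can be assembled, assuming the plane normals are given as unit vectors in the camera frame. Sign conventions and any re-orthonormalisation are left out, and the final warp would be applied with a standard perspective-warp routine.

```python
import numpy as np

def rectifying_homography(K, n_facade, n_ground):
    """H = K R K^{-1} mapping the image to a fronto-parallel view of the facade.
    n_facade, n_ground: plane normals expressed in the camera frame."""
    r3 = n_facade / np.linalg.norm(n_facade)
    r1 = np.cross(n_facade, n_ground)      # aligns image x-axis with the facade/ground intersection
    r1 = r1 / np.linalg.norm(r1)
    r2 = np.cross(r1, r3)                  # completes the orthogonal basis, as in the text
    R = np.vstack([r1, r2, r3])            # rows r1^T, r2^T, r3^T, as in Eq. (2)
    return K @ R @ np.linalg.inv(K)

# The homography can then be applied with any standard perspective warp,
# e.g. cv2.warpPerspective(img, H, (width, height)).
```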
4 Semantic Analysis
4.1 Vehicle Detection
To detect foreground occluding objects such as cars or pedestrians we use a local features-based approach, namely the Implicit Shape Model (ISM) [12]. Here, interest points are matched to object parts and vote for possible locations of the object center using a probabilistic generalized Hough transform. Models for different viewpoints are integrated using the method of [17]. If needed, false detections can be removed in an extra processing step, exploiting the 3D and geometric constraints. Here, we use the fact that cars should always be supported by the ground plane.
Fig. 2. Left: source image. Center: Initial segmentation mask. Right: refined segmentation mask.
4.2 Vehicle Segmentation
Since we know which interest points contributed to the object detection, we can use them to obtain a rough segmentation of the detected object, by transferring segmentations available for the training images (see [12]). However, the interest points typically do not cover the entire object and as a result these segmentations often contain holes. As we will show later, this has a detrimental effect on the inpainting results. Therefore, we propose to add an extra step to refine the segmentation. To this end, we build on the work of [14,3] for interactive scene segmentation. We replace the interactive selection of foreground with the initial segmentation based on object detection. This results in significantly cleaner segmentations, while the method remains fully automatic, as illustrated in Fig. 2. This segmentation results in an occlusion mask $M_{\text{car}}$.
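A possible realisation of this refinement step using OpenCV's GrabCut with mask initialisation is sketched below. The particular seeding scheme (probable foreground inside the ISM segmentation, a probable-background band around it, definite background elsewhere) is our assumption; the paper does not specify how the detection output is mapped to GrabCut labels.

```python
import cv2
import numpy as np

def refine_with_grabcut(image, ism_mask, band=15, iters=5):
    """Refine a rough ISM-derived vehicle mask with GrabCut (mask initialisation).
    ism_mask: array that is non-zero where the detector marked the vehicle."""
    mask = np.full(image.shape[:2], cv2.GC_BGD, np.uint8)
    kernel = np.ones((band, band), np.uint8)
    band_region = cv2.dilate(ism_mask.astype(np.uint8), kernel)
    mask[band_region > 0] = cv2.GC_PR_BGD   # uncertain band around the detection
    mask[ism_mask > 0] = cv2.GC_PR_FGD      # detector output is probable foreground

    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image, mask, None, bgd, fgd, iters, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```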
4.3 Vegetation Detection
Unlike vehicles, vegetation tends to have a more random structure, which we detect using a patch-based texture classifier. For each patch, we construct a 13-dimensional vector containing mean RGB and HSV values, a five-bin hue histogram, and an edge orientation histogram containing orientation and number of modes. The classifier uses an approximate k-nearest neighbor search to match the patch to vegetation or non-vegetation descriptors from a training set, which is supplied by a separate set of manually segmented images. Finally, we perform morphological closing on the detections to refine the vegetation occlusion mask $M_{\text{veg}}$ and combine the vegetation mask with the segmented vehicle detections to obtain an occlusion mask $M = M_{\text{veg}} \cup M_{\text{car}}$.
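The following sketch illustrates one plausible reading of the 13-dimensional patch descriptor and the nearest-neighbour classification. The exact edge-orientation statistics and the neighbour count are assumptions, and an approximate nearest-neighbour index would replace the brute-force classifier in practice.

```python
import numpy as np
import cv2
from sklearn.neighbors import KNeighborsClassifier

def patch_descriptor(patch_bgr):
    """13-D descriptor: mean RGB + mean HSV (6), 5-bin hue histogram (5),
    dominant edge orientation and number of orientation modes (2)."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    mean_rgb = patch_bgr.reshape(-1, 3).mean(0)
    mean_hsv = hsv.reshape(-1, 3).mean(0)
    hue_hist, _ = np.histogram(hsv[..., 0], bins=5, range=(0, 180), density=True)
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    ori_hist, _ = np.histogram(np.arctan2(gy, gx), bins=8, range=(-np.pi, np.pi))
    dominant = ori_hist.argmax()
    n_modes = np.sum(ori_hist > 0.5 * ori_hist.max())
    return np.hstack([mean_rgb, mean_hsv, hue_hist, dominant, n_modes])

# Training data would come from manually segmented vegetation / non-vegetation patches:
# knn = KNeighborsClassifier(n_neighbors=5).fit(train_descriptors, train_labels)
# is_vegetation = knn.predict(patch_descriptor(patch)[None, :])
```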
5 Occlusion Removal
5.1 Inpainting
Our approach to occlusion removal closely follows that of Criminisi et al. [4]. They assume a user-provided occlusion mask M and a source region $\Phi = \bar{M}$ from which to sample patches that are similar to a patch Ψ on the boundary of Φ. Given the local neighborhood, the patch $\hat{\Psi} \in \Phi$ minimizing some distance
function $d(\hat{\Psi}, \Psi)$ is used to fill the occluded pixels in Ψ. The authors recommend defining d as the sum of squared differences of observed pixels in CIE Lab color space. The key insight of that work is that the order in which patches are filled is important. By prioritizing patches containing linear elements oriented perpendicularly to the boundary of M, the algorithm preserves linear structures on a larger scale and leads to a more convincing replacement of the occlusion. Rather than sampling over the entire image, we segment the image by planes and perform the inpainting in the rectified images produced in Sec. 3.3. We limit both the source and mask regions to the area lying on the facade. This has the effect of excluding vegetation, sky, other facades, etc. from the fill region. We examine the effects of this sampling strategy in Sec. 6.1.
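A brute-force sketch of the plane-constrained source sampling is given below: candidate patches are restricted to known pixels on the facade plane and compared with the target patch by SSD in CIE Lab over its observed pixels. The fill-order priority of Criminisi et al. and any acceleration of the search are omitted; this is only an illustration of the restricted search, not the full inpainting loop.

```python
import numpy as np
import cv2

def best_source_patch(img_bgr, target_center, known, source_mask, psize=9):
    """Find the source patch (fully inside source_mask and fully known) that best
    matches the observed pixels of the target patch, using SSD in CIE Lab."""
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    h, w = lab.shape[:2]
    r = psize // 2
    ty, tx = target_center
    tgt = lab[ty - r:ty + r + 1, tx - r:tx + r + 1]
    tgt_known = known[ty - r:ty + r + 1, tx - r:tx + r + 1]

    best, best_cost = None, np.inf
    for y in range(r, h - r):
        for x in range(r, w - r):
            sl = (slice(y - r, y + r + 1), slice(x - r, x + r + 1))
            if not source_mask[sl].all() or not known[sl].all():
                continue                      # stay on the facade plane and outside the mask
            diff = (lab[sl] - tgt)[tgt_known] # compare only observed target pixels
            cost = np.square(diff).sum()
            if cost < best_cost:
                best, best_cost = (y, x), cost
    return best
```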
5.2 Removing Occlusions in 3D
After the textures are inpainted, we still need to remove the geometric aspect of the occlusion. Rather than simply discarding the 3D vertices in the occluded areas, we again make use of the planes, this time to fill in the occluded areas. All 3D points that project into the combined occlusion mask M are part of the occluding geometry and must be dealt with. For each 3D point M, a line l is formed between the camera center C and M. This line is then extended to find the first plane it intersects:

$$\Pi_M = \operatorname*{arg\,min}_{\Pi_i} \big( d(C, \Pi_i) \big), \tag{3}$$

where d denotes the Euclidean distance between a point and a plane. The new position of M is selected as the point of intersection between l and $\Pi_M$. The effect of this is shown in Fig. 3.
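The point transfer of Eq. (3) can be sketched as follows, reading "the first plane the extended line intersects" as the nearest positive ray-plane intersection; planes are represented as (n, d) with n.x + d = 0, an assumed convention.

```python
import numpy as np

def transfer_to_plane(point, cam_center, planes):
    """Move an occluded 3D point onto the closest background plane along the
    viewing ray from the camera center through the point."""
    ray = point - cam_center
    best_depth, best_pos = np.inf, point
    for n, d in planes:
        denom = np.dot(n, ray)
        if abs(denom) < 1e-9:            # ray (nearly) parallel to the plane
            continue
        t = -(np.dot(n, cam_center) + d) / denom
        if t <= 0:                       # intersection behind the camera
            continue
        hit = cam_center + t * ray
        depth = np.linalg.norm(hit - cam_center)
        if depth < best_depth:
            best_depth, best_pos = depth, hit
    return best_pos
```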
Fig. 3. Left: Original point cloud. Right: The geometric occlusions have been removed by transferring each occluded point to its corresponding plane.
5.3 Mesh Generation
We create a mesh from the filled point cloud using the Poisson reconstruction algorithm [10]. The method casts the remeshing into a spatial Poisson problem and produces a watertight, triangulated surface from a dense 3D point cloud. Its resilience to noise makes it a favorable approach for image based 3D reconstruction. The Poisson reconstruction algorithm assumes a complete and closed surface, and covers areas of low point density with large triangles. We clean the mesh by removing all triangles with a side larger than 5 times the mean value of the mesh. Finally, we texture the mesh with the inpainted texture.
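A minimal sketch of this mesh cleaning step; interpreting "the mean value of the mesh" as the mean edge length is our assumption.

```python
import numpy as np

def clean_mesh(vertices, faces, factor=5.0):
    """Drop triangles whose longest edge exceeds `factor` times the mean edge length."""
    v = np.asarray(vertices, dtype=float)
    f = np.asarray(faces, dtype=int)
    edges = np.stack([v[f[:, 1]] - v[f[:, 0]],
                      v[f[:, 2]] - v[f[:, 1]],
                      v[f[:, 0]] - v[f[:, 2]]], axis=1)   # (n_faces, 3, 3)
    lengths = np.linalg.norm(edges, axis=2)               # (n_faces, 3)
    keep = lengths.max(axis=1) <= factor * lengths.mean()
    return f[keep]
```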
6 Results
Fig. 4 shows the same scene as Fig. 3 after the mesh has been generated and the updated texture has been added. Although an air conditioner is falsely classified as a vehicle and removed, the remaining artifacts are still minor compared to those created by the difficult reconstruction of reflective cars.
Fig. 4. Reconstructed facade before and after occlusion removal
6.1 Planar Constraint on Inpainting
Fig. 5 shows an example of the effects of using knowledge of the facade and ground planes to rectify and constrain the source and masked regions. On the left is essentially the approach of Criminisi, where $\Phi = \bar{M}$, i.e. the source region is simply the complement of the occlusion mask. Both approaches have difficulty with the ground due to the cars' shadows not being included in the mask. However, our approach's constraints prevent the inpainted content from appearing to grow out of the facade.
6.2 Mask Selection
As discussed earlier, selecting the correct region for occlusion masking is critical. Leaving occluding pixels labeled as background means the boundary of the masked region M is already inside the occlusion, which prevents the inpainting algorithm from even initializing correctly. Creating too large a mask allows for
Fig. 5. Example inpainted region without (left) and with (right) planar constraints
Fig. 6. Occlusion masks from ISM segmentation before (above) and after (below) refinement with Grab-cut
a reasonable initialization but may cause the system to miss critical structures that would otherwise provide local cues to the unobserved facade. Fig. 6 shows a comparison between the raw ISM mask and the refined mask, while Fig. 7 shows the resulting images. Because ISM searches for parts and centers of an object, it is possible that boundaries or non-discriminative sections of an object may not be captured. The refinement enlarges the area out to stronger edges in the image which are more likely to form the object boundary.
Fig. 7. Effects of segmentation refinement. Left: completed texture image; right: inset. From top to bottom: Original image; inpainted image with ISM mask; inpainted image with refined mask.
7 Conclusion
In this work we have demonstrated the use of cognitive level information about the scene to improve the quality of 3D reconstructions of urban environments. In particular, we have investigated how occluding objects such as cars or vegetation can be removed automatically without the need for manual intervention. As future work, we want to further evaluate the proposed method on a larger scale, as well as integrate more semantic information into the 3D model, including occlusion filling.
Acknowledgments. This work was supported by the Flemish FWO project Context and Scene dependent 3D Modeling (FWO G.0301.07N) and the European project V-City (ICT-231199-VCITY).
References
1. Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.B.: PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH) 28(3) (August 2009)
2. Benitez, S., Denis, E., Baillard, C.: Automatic production of occlusion-free rectified facade textures using vehicle-based imagery. In: Photogrammetric Computer Vision and Image Analysis, p. A:275 (2010)
3. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(9), 1124–1137 (2004)
4. Criminisi, A., Perez, P., Toyama, K.: Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing 13, 1200–1212 (2004)
5. Dick, A.R., Torr, P.H.S., Cipolla, R.: Modelling and interpretation of architecture from several images. Int. J. Comput. Vision 60, 111–134 (2004)
6. Felzenszwalb, P., Girshick, R., McAllester, D.: Cascade object detection with deformable part models. In: Computer Vision and Pattern Recognition (2010)
7. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) ISBN: 0521540518
8. Hedau, V., Hoiem, D., Forsyth, D.: Recovering the spatial layout of cluttered rooms. In: International Conference on Computer Vision (2009)
9. Hoiem, D., Efros, A.A., Hebert, M.: Putting objects in perspective. International Journal of Computer Vision 80(1), 3–15 (2008)
10. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Proceedings of the Fourth Eurographics Symposium on Geometry Processing, SGP 2006, pp. 61–70. Eurographics Association, Aire-la-Ville (2006)
11. Konushin, V., Vezhnevets, V.: Automatic building texture completion. Graphicon (2007)
12. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision 77(1-3), 259–289 (2008)
13. Rasmussen, C., Korah, T., Ulrich, W.: Randomized view planning and occlusion removal for mosaicing building facades. In: IEEE International Conference on Intelligent Robots and Systems (2005), http://nameless.cis.udel.edu/pubs/2005/RKU05
14. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23, 309–314 (2004)
15. Saxena, A., Chung, S.H., Ng, A.Y.: 3-D depth reconstruction from a single still image. International Journal of Computer Vision, IJCV 76 (2007)
16. Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., Van Gool, L.: Shape-from-recognition: Recognition enables meta-data transfer. Computer Vision and Image Understanding 113(12), 1222–1234 (2009)
17. Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., Schiele, B., Van Gool, L.: Towards multi-view object class detection. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2006, vol. 2, pp. 1589–1596. IEEE Computer Society, Washington, DC, USA (2006)
18. Vergauwen, M., Van Gool, L.: Web-based 3D reconstruction service. Mach. Vision Appl. 17(6), 411–426 (2006)
19. Wang, L., Jin, H., Yang, R., Gong, M.: Stereoscopic inpainting: Joint color and depth completion from stereo images. In: Conference on Computer Vision and Pattern Recognition (2008)
20. Werner, T., Zisserman, A.: Model selection for automated reconstruction from multiple views. In: British Machine Vision Conference, pp. 53–62 (2002)
21. Werner, T., Zisserman, A.: New techniques for automated architectural reconstruction from photographs. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 541–555. Springer, Heidelberg (2002)
22. Xiao, J., Fang, T., Zhao, P., Lhuillier, M., Quan, L.: Image-based street-side city modeling. ACM Trans. Graph. 28, 114:1–114:12 (2009)
hSGM: Hierarchical Pyramid Based Stereo Matching Algorithm
Kwang Hee Won and Soon Ki Jung
School of Computer Science and Engineering, College of IT Engineering, Kyungpook National University, 1370 Sankyuk-dong, Buk-gu, Daegu 702-701, South Korea
Abstract. In this paper, we propose a variant of Semi-Global Matching, hSGM, which is a hierarchical pyramid based dense stereo matching algorithm. Our method aggregates the matching costs from coarse to fine scale in multiple directions to determine the optimal disparity for each pixel. It has several advantages over the original SGM: a low space complexity and an efficient implementation on the GPU. We show several experimental results to demonstrate that our method is efficient and obtains a good quality of disparity maps. Keywords: stereo matching, disparity map, hierarchical SGM, GPU.
1 Introduction
According to a recent tendency of supporting the "Single-Instruction in Multiple-Cores with Multiple-Data" concept in processing units, many existing stereo matching algorithms are being revisited in terms of parallelism, which can be accomplished through multi-core CPUs and the general-purpose utilization of GPUs. Besides, the rapid increase of image data, such as full HD resolution videos for 3D TV or satellite imagery for city modeling, also accelerates the development of high-performance parallel stereo matching algorithms. Moreover, there are many computer vision applications that require stereo matching at real-time performance. The up-to-date stereo algorithms perform a global optimization by defining an objective function and optimizing it. The time complexities of those optimizations are, however, considerably high, even though computing power is constantly increasing. Another issue of global optimization is its enormous memory consumption, which can cause page faults that degrade the overall performance of an algorithm, especially for high-resolution input data. Hirschmüller suggests Semi-Global Matching (SGM) to reduce the computation cost while maintaining the quality of the matching result by substituting multidirectional 1D optimization for 2D global optimization [1]. The matching costs of a certain pixel for candidate disparities are computed locally and aggregated along all specified directions, which are 8 or 16 directions as suggested by the author, up to the current pixel. In consequence, the required memory is
proportional to the number of pixels, the disparity range, and the number of directions. Furthermore, the optimal disparities of the pixels are not determined until the aggregated costs from all directions have been assigned. This computational dependency, together with the relatively large memory usage, is a hindrance to the maximum use of parallelism, because conventional GPUs are designed to utilize on-chip memory, which is fast but limited to 16 KB to 48 KB per independent logical block. In this paper, we propose a different version of multi-directional optimization performed on pyramid images. Our method aggregates the matching costs from coarse to fine scales in multiple directions to determine the optimal disparity for each pixel. The original SGM aggregates the matching costs along a 1-pixel-wide path, while the suggested algorithm aggregates matching costs down a cone-shaped volume in scale-space, as shown in Fig. 1. We name our algorithm hierarchical Semi-Global Matching (hSGM).
Fig. 1. The matching cost aggregation in SGM (left) and hSGM (right)
hSGM has several advantages over the original SGM. First, hSGM has a lower space complexity, because the matching cost aggregation only requires storage at the resolution of the higher pyramid level. Second, hSGM can be easily implemented on the current GPU architecture, since the hierarchical approach resolves the computational dependency of the 1D path optimization and thus maximizes the use of parallelism. The remainder of this paper consists of the following sections. In Section 2, stereo matching algorithms focused on parallelism are briefly introduced. In Section 3, we present the semi-global optimization process on a hierarchical pyramid and introduce its GPU implementation in detail. In Section 4, we perform a couple of experiments to evaluate our algorithm in comparison to the original SGM and its GPU versions. In Section 5, we conclude the paper with some future work.
2 Related Work
In the famous Middlebury ranking for stereo algorithms, many of the top-ranked algorithms contain a global optimization process with an objective function as an important part. Nevertheless, as we mentioned in the previous section, they
still suffer from expensive computation costs and inefficient memory utilization. The computation time and memory space complexities of many stereo matching algorithms are analyzed well in the literature by Guerra-Filho [2]. The original SGM tried to reduce the computational expense of 2D global optimizations, but the aggregated matching costs along each path are dependent on the costs of the previous pixel on the path. As we noted, this dependency can negatively affect a parallel implementation. Many variants of SGM have been proposed, most of which try to reduce computation time and memory consumption. Ernst et al. [3] implemented SGM on the GPU by aggregating matching costs on multiple paths simultaneously. Gibson and Marques [4] also implemented it on the GPU with a similar strategy but optimized their implementation using parallel sorting, caching in shared memory, and so on. Both implementations are much faster than the original SGM, but they still require the same amount of memory space as the original. Humenberger et al. [5] suggested a modified SGM that divides the input image into many horizontal strips to reduce the required memory size, assuming that each strip is a whole image. This scheme can be applied to a GPU implementation by decreasing the strip (or block) size to fit the on-chip memory; however, the assumption fails at the boundaries of each strip, producing discontinuities. Gehrig et al. [6] proposed to utilize a low-resolution disparity map to enhance the computation speed of the current-resolution disparity map. hSGM instead makes use of the aggregated matching costs from the higher level rather than the low-resolution disparity. Moreover, the matching costs are aggregated down to the current level with an adaptively determined penalty according to the similarity between pixels of the current and higher level. Humenberger et al. [7] compared the performance of stereo matching algorithms on a CPU, a Digital Signal Processor (DSP), and two GPUs. They employed local aggregation of matching costs within local windows of a specified size, by which parallelism can be accomplished more easily. However, local optimization often produces outliers in weakly textured areas that are larger than the window size. Our algorithm also aggregates matching costs locally on the current level of the pyramid, but utilizes the aggregated matching costs from the higher level. Therefore, hSGM is easily parallelized on current-generation GPUs while maintaining the quality of a disparity map resulting from global optimization.
3 Semi-Global Optimization on the Hierarchical Pyramid
3.1 Propagation of Aggregated Matching Costs
After the matching cost C(p, d) between a pixel p of one image and a pixel p + d of the other is computed, the multi-path aggregation is performed to determine the optimal disparity in the original SGM. The aggregated cost at pixel p on the directional path r is denoted by
$$
L_r(p, d) = C(p, d) + \min\!\Big( L_r(p - r, d),\; L_r(p - r, d \pm 1) + P_1,\; \min_i L_r(p - r, i) + P_2 \Big) - \min_k L_r(p - r, k), \tag{1}
$$
where d is the disparity, p − r represents the preceding pixel of p on the path r, and $P_1$ and $P_2$ ($\geq P_1$) are the penalties for small and larger (> 1) disparity changes along the path, respectively. Similarly, the aggregated matching cost on the hierarchical pyramid, relating the current level $l_n$ and the upper (low-resolution) level $l_{n-1}$, is computed using Equation 2:

$$
L_r(l_n, p, d) = C(l_n, p, d) + \min\!\Big( L_r(l_{n-1}, p - r, d),\; L_r(l_{n-1}, p - r, d \pm 1) + P_1,\; \min_i L_r(l_{n-1}, p - r, i) + P_2 \Big) - \min_k L_r(l_{n-1}, p - r, k). \tag{2}
$$
In Fig. 2, the precedence constraints implied by Equations 1 and 2 are illustrated. In SGM, $L_r(p_0, d_i)$ is required to evaluate $L_r(p_1, d_i)$. However, $L_r(l_n, p_0, d_i)$ and $L_r(l_n, p_1, d_i)$ can be obtained independently along the path from the higher level of the hierarchy (dashed arrow). The aggregation process from the higher level to the current level often fails to support small structures that are lost in the lower resolution. To prevent this drawback, we adaptively determine the penalty $P_2$ as follows:

$$
P_2 = P_1 + \frac{P_2 - P_1}{\left| I(l_{n-1}, p - r) - I(l_n, p) \right|}, \tag{3}
$$

where $I(l_n, p)$ is the value of pixel p in the current level $l_n$, and $I(l_{n-1}, p)$ is the value of the corresponding pixel in the upper level. The optimal disparity at pixel p is determined by computing the total cost for each disparity d from all paths r that pass through the pixel p:

$$
\operatorname*{arg\,min}_d \sum_r L_r(l_n, p, d). \tag{4}
$$
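A minimal NumPy sketch of the per-pixel recurrence of Eqs. (2)-(4) for a single path direction is given below. Costs are assumed to be stored as 1-D arrays over disparity, the division guard in the adaptive penalty is our addition, and the block/thread layout of Sec. 3.2 is not modelled.

```python
import numpy as np

def aggregate_pixel(C_cur, L_up_prev, I_cur_p, I_up_prev, P1, P2):
    """Eqs. (2)-(3) for one pixel and one direction.
    C_cur:     matching costs C(l_n, p, d) over all disparities at the current level
    L_up_prev: aggregated costs L_r(l_{n-1}, p-r, d) of the predecessor, upper level
    I_cur_p, I_up_prev: intensities used for the adaptive penalty of Eq. (3)."""
    L_up_prev = np.asarray(L_up_prev, dtype=float)
    diff = max(abs(float(I_up_prev) - float(I_cur_p)), 1.0)   # guard against division by zero
    P2_adapt = P1 + (P2 - P1) / diff
    m = L_up_prev.min()
    up   = np.concatenate(([np.inf], L_up_prev[:-1])) + P1    # L(d-1) + P1
    down = np.concatenate((L_up_prev[1:], [np.inf])) + P1     # L(d+1) + P1
    best_prev = np.minimum.reduce([L_up_prev, up, down,
                                   np.full_like(L_up_prev, m + P2_adapt)])
    return np.asarray(C_cur, dtype=float) + best_prev - m

def select_disparity(L_all_paths):
    """Eq. (4): sum the aggregated costs over all paths (shape: n_paths x n_disp)
    and take the arg-min disparity."""
    return int(np.argmin(np.sum(L_all_paths, axis=0)))
```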
Fig. 2. The aggregation paths and the precedence constraint in SGM and hSGM
Finally, outliers are removed by post-processing such as median filtering and a left-right consistency check.
3.2 The GPU Implementation
There are several reasons why hSGM is well suited for a parallel implementation on the GPU. First, a scalable kernel can be designed and evaluated on each scale of the pyramid without modification. Second, our algorithm divides the input image into many precedence-independent local regions on each scale. A local region can be allocated to a block with multiple threads in the CUDA architecture. The overall performance is increased by assigning frequently accessed variables to shared (on-chip) memory in each block. After the hierarchical pyramid is constructed, the aggregated costs of each level are computed by scalable invocations of the same kernel and saved in global memory. For each kernel invocation, the input image of level $l_n$ is divided into many local regions, each assigned to a block. The blocks are assigned to multiple cores for block-level parallelism, and each block provides thread-level parallelism with thread synchronization within the block. Fig. 3 shows the processing steps in a single block. Here h, w, and $n_d$ represent the height and width of a local window and the number of disparities, respectively. First, the image patches for each block are copied into shared memory and the matching costs are computed by $h \times n_d$ threads. Next, the matching costs are aggregated. This process starts from the $L_r(l_{n-1}, p - r, d_i)$, which reside in global memory, using Equation 2. After that, the matching costs of the current level are aggregated along all specified paths using Equation 1. The aggregated costs for each pixel are accumulated in shared memory, and finally the optimal disparity is selected using Equation 4 for every pixel in the window. The required size of shared memory S for local aggregation of matching costs in each block is

$$S = w^2 + w^2 \cdot n_d + 2w \cdot n_d + w^2 n_d, \tag{5}$$
Fig. 3. The GPU implementation of hSGM: the processing steps in each block
where w × w is the resolution of the local window, $n_r$ is the number of aggregation directions, and $n_d$ is the disparity range. In Equation 5, the second term represents the size of the matching costs for each pixel and all disparities. The third term represents the accumulation buffer of the current pixel for all $n_d$ disparities on w paths, plus the same size for the previous pixels of each path. The block dimension will be $w \times n_d$. The last term is for the total of the aggregated costs in Equation 4. The size of the local window is determined by inverting Equation 5 with the shared memory set to its maximum (16 KB or 48 KB). The local window at a lower-resolution level is larger than that of the current scale because the disparity range is reduced by the scale factor (< 1.0).
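The window-size selection can be sketched as follows, assuming 4-byte cost entries (an assumption about the implementation); the 16 KB / 48 KB budgets correspond to the per-block shared memory sizes mentioned in the text.

```python
def shared_mem_elems(w, n_d):
    """Shared-memory elements per block following Eq. (5): image patch,
    per-pixel matching costs, per-path accumulation buffers, and the
    summed aggregated costs used in Eq. (4)."""
    return w * w + w * w * n_d + 2 * w * n_d + w * w * n_d

def largest_window(n_d, budget_bytes=48 * 1024, bytes_per_elem=4):
    """Largest local window w whose Eq. (5) footprint fits the shared-memory budget."""
    w = 1
    while shared_mem_elems(w + 1, n_d) * bytes_per_elem <= budget_bytes:
        w += 1
    return w
```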
4 Experimental Results
We compared hSGM with several different implementations of SGM: SGM on the CPU, SGM on the GPU with global aggregation of matching costs, and SGM on the GPU with local aggregation in overlapped windows. We first implemented SGM and hSGM on the CPU to compare the quality of the resulting disparity maps, because the quality of the results from each GPU version is the same as that of its CPU implementation. Both CPU implementations consist of the hierarchical computation of Mutual Information (HMI) as the matching cost [12], the multi-path aggregation step, and a post-processing step including median filtering and a left-right consistency check. Sub-pixel disparity is not supported, and the other parameters, such as the penalties P1 and P2, are selected experimentally. For hSGM, we use a 20 × 20 local window and a three-level hierarchical pyramid with a scale factor of 0.5; a bigger local window requires more local memory, while a much smaller window degrades the quality of the computed disparity map. The Middlebury stereo set is used as input and the results are evaluated through the webpage [8]. The values in Table 1 represent the percentage of bad pixels compared to the ground truth. The threshold for the disparity error is 1, and bad pixels are not evaluated in partially occluded and boundary regions. The values for SGM differ from those of the original paper [1,8], partially because our implementation does not support sub-pixel accuracy and may use different penalty values and hole-filling strategy. However, those elements can be applied to both methods to improve the quality of the results.
Table 1. Evaluation of SGM and hSGM (percent of bad pixels)
Method    Tsukuba    Venus    Teddy    Cones
SGM       6.02       3.98     11.9     7.47
hSGM      5.41       3.69     15.1     9.53
Fig. 4. The resulting disparity maps of SGM (left column) and hSGM (right column)
The quality of the disparity map produced by hSGM is similar to that of SGM, as shown in Table 1 and Fig. 4, while the required memory is about 30 percent of that of the original SGM. For the next experiment, we implemented and compared a GPU version of hSGM and two GPU versions of SGM using CUDA. The first GPU version of SGM is similar to the implementations of Ernst et al. [3] and Gibson and Marques [4]. Global memory is allocated to accumulate the aggregated matching costs of each path, and we also use on-chip memory as a cache for the input image and intermediate aggregated costs. The other GPU implementation of SGM divides the input image into many overlapped local windows and finds the optimal disparity within each local window. This implementation is much faster than the previous one because precedence between local areas is not considered, but the tile effect, a drawback of local optimization as discussed in Section 2, can occur as shown in Fig. 5. The local window size is determined for each scale using Equation 5 with a fixed on-chip memory size. In our experiment, we use an Nvidia GTX 580 and images of 450 × 375 pixels with 32 disparity levels. Table 2 shows the computation times of the various implementations of SGM and hSGM in milliseconds. Contrary to the local window version of SGM,
Table 2. The performance of GPU implementations
Method                          Computation time (ms)
SGM (CPU, global memory)        22300.0
SGM (GPU, global memory)        108.9
SGM (GPU, local window)         46.1
hSGM (GPU, local window)        54.0
Fig. 5. The tile effect of local optimization
hSGM removes the tile defects between two adjacent windows by aggregating the matching costs from the higher level of the hierarchy, at the cost of an additional 17% of computation time. hSGM used about 1/8 of the memory space of SGM in this experiment, because hSGM aggregates the matching costs of the current level from the aggregated costs of the upper level. For reference, the computation time of CPU-hSGM is similar to that of CPU-SGM.
5 Conclusion and Future Work
We proposed a hierarchical pyramid based stereo matching algorithm. Our algorithm differs from other hierarchical algorithms in that it utilizes the aggregated costs themselves instead of the disparity values determined at the higher level of the hierarchy. We achieve the effect of global optimization by combining independent local optimizations with the aggregated matching costs of the higher level of the image pyramid. As a result, an intuitive parallelization using the current generation of GPUs is possible. hSGM requires less memory space and is faster than other GPU implementations while achieving a good quality of disparity map. Our future work will be the application of our algorithm to computer vision problems such as object detection or tracking that require real-time execution and accurate range images simultaneously.
Acknowledgement. This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2011-0006132).
References
1. Hirschmüller, H.: Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 807–814 (2005)
2. Guerra-Filho, G.: An Optimal Time-Space Algorithm for Dense Stereo Matching. Journal of Real-Time Image Processing, 1–18 (2010)
3. Ernst, I., Hirschmüller, H.: Mutual Information Based Semi-Global Stereo Matching on the GPU. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part I. LNCS, vol. 5358, pp. 228–239. Springer, Heidelberg (2008)
4. Gibson, J., Marques, O.: Stereo Depth with a Unified Architecture GPU. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (2008)
5. Humenberger, M., Engelke, T., Kubinger, W.: A Census-Based Stereo Vision Algorithm Using Modified Semi-Global Matching and Plane Fitting to Improve Matching Quality. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 6th Workshop on Embedded Computer Vision (2010)
6. Gehrig, S.K., Rabe, C.: Real-Time Semi-Global Matching on the CPU. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 6th Workshop on Embedded Computer Vision (2010)
7. Humenberger, M., Zinner, C., Kubinger, W.: Performance Evaluation of a Census-Based Stereo Matching Algorithm on Embedded and Multi-Core Hardware. In: Proceedings of the 6th Int. Symposium on Image and Signal Processing and Analysis (2009)
8. http://vision.middlebury.edu/stereo/
9. Kolmogorov, V., Zabih, R.: Computing Visual Correspondence with Occlusions using Graph Cuts. In: Int. Conference on Computer Vision, vol. 2, pp. 508–515 (2001)
10. Sun, J., Li, Y., Kang, S., Shum, H.-Y.: Symmetric Stereo Matching for Occlusion Handling. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 399–406 (2005)
11. Hirschmüller, H.: Stereo Vision in Structured Environments by Consistent Semi-Global Matching. In: IEEE Conference on Computer Vision and Pattern Recognition (2006)
12. Kim, J., Kolmogorov, V., Zabih, R.: Visual Correspondence Using Energy Minimization and Mutual Information. In: IEEE International Conference on Computer Vision (2003)
Surface Reconstruction of Rotating Objects from Monocular Video
Charlotte Boden and Abhir Bhalerao
Department of Computer Science, University of Warwick, Coventry, CV4 7AL, UK
Abstract. The ability to model 3D objects from monocular video allows for a number of very useful applications, for instance: 3D face recognition, fast prototyping and entertainment. At present there are a number of methods available for 3D modelling from this and similar data. However many of them are either not robust when presented with real world data, or tend to bias their results to a prior model. Here we use energy minimisation of a restricted circular motion model to recover the 3D shape of an object from video of it rotating. The robustness of the algorithm to noise in the data and deviations from the assumed motion is tested and a 3D model of a real polystyrene head is created. Keywords: Structure from Motion, 3D Modelling, Face Modelling, Turntable Sequence Reconstruction.
1 Introduction
The ability to model the shape of a 3D object from monocular video would allow for many useful applications to be developed, for example: 3D face recognition, fast prototyping for industry and entertainment, and transmission of exact 3D information over the internet for video conferencing or reproduction. As much work has been completed on this and related areas such as multiview stereo, with great success, one might suppose that this is a solved problem. However in practical situations a robust unbiased solution is often difficult to obtain. The method proposed here is practically applicable and does not overbias the reconstructed shape towards a strong prior. Points are tracked from frame to frame and their 3D positions found by minimising a cost function based on the assumption that the object being modelled is rotating about an axis perpendicular to the optical axis. The internal camera parameters are known a priori. The shape of the object is not constrained. The many modelling techniques which exist already will be discussed in section 2. In section 3 the method proposed here involving minimising a cost function obtained from a circular motion model will be described. This method will be used to create models from synthetic and real data of rotating heads and the reconstructions obtained will be compared with ground truth in section 4.
Fig. 1. A camera, with focal length f , is placed a distance d away from the axis of rotation of an object that is rotating at angular velocity ω
2 Previous Work
There are many ways of reconstructing a surface from images. Feature points or textured patches can be tracked from frame to frame and their motions used to find the 3D structure of the object using such methods as [8], [6] or [10]. These methods use autocalibration techniques and so very little has to be known a priori. However such techniques can produce unpredictable results due to ‘critical motions’, which are very common in practice, for which there are multiple possible solutions. It is possible to avoid this issue by using prior information either about the structure of the object, as in the case of 3D Morphable Models [1] and appearance models [3], about the motion that is being undertaken [4,5], about the cameras internal parameters, or about both the motion and internal parameters of the camera, which is similar to multi-view stereo [9]. For point based methods, once a good initial estimate has been found the solution can be refined by minimising the reprojection error, i.e. bundle adjustment (see [11]), by such means as sparse Levenberg-Marquardt minimisation. In some cases, for instance when the object to be modelled is untextured, it is difficult to track points. In such cases an alternative approach may be to find the silhouette of the object in the various images and deduce what 3D shape could have produced such an outline [12] [5]. One such method [5], which requires the object to be rotating and all the internal parameters of the camera apart from the focal length to be known, can produce excellent results. However if only a limited number of views are available, or if it is difficult to obtain a silhouette, then the results will be more limited. This may arise if the object being modelled
is a head turning, in which case there will be a limited number of views, or if the object is a component part of another object, in which case it may be hard to obtain a silhouette. Another related approach is space-carving. This method fills space and then carves away voxels until it is possible for the remaining shape to produce images which are consistent with those input (by reprojection). In this way a ‘photohull’ is produced. This technique can be very successful when the motion of the object is known. When it is easily possible to control the lighting environment, shape from shading can also be used to find the shape of the object. This can work well but is not always applicable if the lighting is not easy to control, for instance for outdoor cloudy scenes, or where there are multiple light sources and reflections. This work uses a circular motion model such as that used by [12] for silhouettes and by [4] for point tracking. The method is based on point tracking and bears some similarity to the works of Fitzgibbon et al. [4]. However ours uses internally precalibrated cameras and uses all frames and all points rather than merging triplets of frames. We assume that the object is rotating about an axis which is perpendicular to the optical axis (see fig. 1). This obviously is not useful for reconstructing objects from archive footage or from surveillance videos, but is easily achievable for cooperative subjects. It is also assumed that a video camera is used so that there is necessarily continuity of motion between frames. As we are primarily concerned with the surface reconstruction problem rather than the camera calibration problem, these assumptions are not unreasonable and allow for a more accurate solution to be found without over biasing the result.
3 Method
The camera is placed on the z axis at a distance d from the origin, at which it points. The intrinsic parameters are known, the focal length being f and the aspect ratio a (see fig. 1). The object to be reconstructed is placed at the origin and allowed to rotate about the Y axis. A point on the object, $\mathbf{X}_i = (X_i, Y_i, Z_i, 1)^T$, will therefore be projected onto a point in the image, $\mathbf{x}_i = (x_i, y_i, 1)^T$, as follows:

$$\mathbf{x}_i = \begin{pmatrix} f & 0 & 0 \\ 0 & af & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} -1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & -d \end{pmatrix} \begin{pmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{pmatrix}, \tag{1}$$

in other words:

$$x_i = \frac{f X_i}{d - Z_i}, \tag{2}$$

$$y_i = \frac{a f Y_i}{d - Z_i}, \tag{3}$$

or in cylindrical polar coordinates:

$$x_i(t) = \frac{f r_i \sin(\theta_i(t))}{d - r_i \cos(\theta_i(t))}, \tag{4}$$

$$y_i(t) = \frac{a f h_i}{d - r_i \cos(\theta_i(t))}. \tag{5}$$

Fig. 2. At time t the point pi lies on a rotating rigid object at a distance ri from the centre of rotation at angle θi + φ(t). The point is imaged by a camera a distance d away with focal length f.
The object is constrained to rotate about the Y axis. Therefore r is constant and θ is a function of time. As the object is rotating rigidly, the change in θ from frame to frame will be the same for all points, so $\theta_i(t)$ can be expressed as $\theta_i + \phi(t)$ (see fig. 2). Therefore:

$$x_i(t) = \frac{f r_i \sin(\theta_i + \phi(t))}{d - r_i \cos(\theta_i + \phi(t))}, \tag{6}$$

which gives

$$\frac{dx_i}{dt} = \frac{d\phi}{dt}\left( \frac{x_i(t)}{\tan(\theta_i + \phi(t))} - \frac{x_i(t)^2}{f} \right). \tag{7}$$

This expression has a singularity at $\theta_i = t = 0$; however, it can be rearranged to give

$$\frac{dx_i}{dt}\left(\frac{d\phi}{dt}\right)^{-1} = \frac{f \tan(\theta_i + \phi(t))\, x_i(t)}{f - x_i \tan(\theta_i + \phi(t))}. \tag{8}$$

The following energy function can then be formed:

$$E = \sum_i \sum_j \left( \left.\frac{dx_i}{dt}\right|_{t=j} \left(\left.\frac{d\phi}{dt}\right|_{t=j}\right)^{-1} - \frac{f \tan(\theta_i + \phi(j))\, x_i(j)}{f - x_i(j) \tan(\theta_i + \phi(j))} \right)^{2}, \tag{9}$$

which can then be minimised with respect to $\theta_i$ and φ to give the most likely solution.
C. Boden and A. Bhalerao
In practice points are seeded at random positions in the foreground of an initial frame. Points are tracked backwards and forwards using pyramid based normalised cross correlation. The points are normalised such that the principal point is at zero. Each θi is then initialised at θi = π8 if xi is greater than 0, or at θi = −π otherwise. The θi and φ(j) values are then adjusted using 8 the MATLAB implementation of the interior-point algorithm [2] until the cost function is minimised. Point positions, the focal length and the distance to the camera must be input to the minimisation procedure. Once θi has been found an estimate of ri (j) and hi (j) can then be calculated at each frame as follows: ri (j) =
xi (j) d xi (j) cos(θi + φ(j)) + f sin(θi + φ(j))
(10)
yi (j) d − yi (j) ri (j) cos(θi + φ(j)) af
(11)
hi (j) =
Reconstruction Error as a Fraction of Model Height
The final estimate of ri and hi is the median over all frames of these estimates. These can then be converted into cartesian coordinates if desired.
0.06 0.05 0.04 0.03 0.02 0.01 0 −0.01 0
2
4
6
8
10
Tilt Angle (degrees)
Fig. 3. Vertices of a 3D face model were projected into a virtual camera and their projections used to reconstruct the 3D positions of the vertices. Various angular deviations of the rotation axis from the plane perpendicular to the optical axis were used. Here the error in the reconstruction is plotted against these tilt angles.
Rotating Object Modelling
4
707
Experimental Results
Reconstruction Error as a Fraction of Model Height
In order to judge the effectiveness of this algorithm, ground truth data were generated from the vertices of a laser-scanned head obtained from the Basel face database [7]. The positions of the projections of these vertices onto a virtual camera were then used as input to the algorithm. To test the robustness of the algorithm to noise in the data further tests were completed where Gaussian noise was added to the projected positions. The error was then computed as the mean Euclidean distance between the reconstructed points and their corresponding ground truth positions. A graph showing these errors is shown in figure 4. Another likely problem when applying this technique to real data is that the axis is not perpendicular to the optical axis. To test the robustness of the algorithm to this scenario, sequences where the Basel data were tilted towards the camera were generated and the error computed. The results for these tests are shown in figure 3. The algorithm was used to reconstruct a model from real data of a polystyrene head (‘Poly’) on a turntable, see figure 6 for sequence and 7 for the reconstruction. The video was shot with a standard consumer video camera, a Sony DCRSR90, with a resolution of 640×480 pixels. Five hundred points were tracked. 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0 0
0.5
1
1.5
2
Standard Deviation of Gaussian Noise (pixels)
Fig. 4. Perfect data were generated by projecting vertices of a face model onto a virtual camera. Gaussian noise of various standard deviations was then added to these projections. Here the error in the reconstructed locations of the vertices (which is the mean Euclidean distance between reconstructed and ground truth positions) is plotted against the standard deviation of this noise.
708
C. Boden and A. Bhalerao
Fig. 5. The surface shown is the ground truth face from the ‘Basel’ database [7]. The points are the vertices of the reconstructed model. See http://www.dcs.warwick.ac. uk/~cboden/reconstruction.html for a video of the entire sequence.
Fig. 6. ‘Poly’ sequence: Video sequence of a model head, ‘Poly’, rotating on a turntable. 200 frames were taken with a Sony DCR-SR90 camcorder at a resolution of 640×480 pixels. Feature points on these images were tracked and used to reconstruct the surface (see fig. 7).
The location of these points in five frames in the middle of the sequence were then used as the input to the reconstruction algorithm (so the profile and close to profile views shown in fig. 7 were not used to make the reconstruction). Fig. 3 shows that the reconstruction error increases increasingly for increasing tilt angle. The error at 10◦ being 3.8 times the error at 0◦ . This shows that the reconstruction does degrade with increasing tilt angle as expected, however for small tilt angles the error is not great and so the reconstruction may still be useable, depending on the desired application. Fig. 4 shows that the error generally increases as the noise level increases, the error at a noise level of 2 pixels being ≈ 7.3 times the error for zero noise. For a model with height 100 pixels the reconstruction error at a plausible track error of 0.5 pixels standard deviations is approximately 2 pixels. Here Gaussian noise was added to every point independently at each frame. However in practice errors are usually non-random and so these values are perhaps not an ideal indication of the visual result and do not show how the effect of noise may vary depending on track length, speed of motion, or other characteristics.
Rotating Object Modelling original
mesh
709
textured
Fig. 7. Reconstruction of Poly: The first column shows the original video sequence; the second shows the reconstructed mesh (created by Delaunay Triangulation from the middle view); and the third a textured version of the reconstructed mesh. The middle row is a frame which was used in the reconstruction, the other two rows show frames which are far from those used in the reconstruction. See http://www.dcs.warwick.ac. uk/~cboden/reconstruction.html for video of entire sequence.
In fig. 5 the reconstruction of ground truth tracks generated from the model face from the ‘Basel’ database is shown. The spheres represent the reconstructed points and the mesh is the input mesh. This shows that for good data the reconstruction algorithm is very accurate. The reconstruction of ‘Poly’ shown in fig. 7 shows that this technique performs promisingly on real data. A limited number of close to frontal frames were used as input and the reconstruction has the correct shape and looks convincing from
710
C. Boden and A. Bhalerao
the profile view. However the reconstruction is quite noisy and so from certain angles looks quite poor. In further work we will attempt to remedy this by introducting a smoothing term to the cost function.
5
Conclusions
Here a method was devised for generating 3D models of rotating objects from video sequences. It was assumed that either the cameras were internally calibrated or that the height and width of the object being modelled was known. The motion of the object was restricted to be a rotation about an axis perpendicular to the optical axis. The sensitivity of the algorithm to noise in the data and tilts of the axis towards the camera was tested. Reconstructions were made of synthetic ground truth tracks from a model head and of real video data of a polystyrene head rotating (the ‘Poly’ sequence). It has been found that the performance of the algorithm degrades reasonably slowly when the rotation axis tilts away from its assumed position. Therefore for situations where the motion can be controlled the method will produce reasonable results even if it is difficult to align the axis exactly. However it will not be of use for reconstructing points which rotate about an arbitrary axis relative to the camera. The method performs well at modelling rotating objects when there is little or no noise in the tracks. When there is a greater level of noise the reconstruction is adversely affected. This could be remedied by including a smoothing term in the energy function, for instance a shape prior or surface constraints. Further work will involve attempting to make such improvements in a manner that does not introduce too much bias to the result. The reconstruction of ‘Poly’ highlights the above. A good reconstruction was obtained from five frames. However the obtained model was slightly noisy and so improvements could be made by using a smoothing term. Reconstructions of real heads will be created in future work.
References 1. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1999, pp. 187–194. ACM Press/Addison-Wesley Publishing Co., New York (1999) 2. Byrd, R.H., Gilbert, J.C., Nocedal, J.: A trust region method based on interior point techniques for nonlinear programming. Mathematical Programming 89, 149– 185 (2000), doi:10.1007/PL00011391 3. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 23, 681–685 (2001) 4. Fitzgibbon, A.W., Cross, G., Zisserman, A.: Automatic 3D model construction for turn-table sequences. In: Koch, R., Van Gool, L. (eds.) SMILE 1998. LNCS, vol. 1506, pp. 155–170. Springer, Heidelberg (1998)
5. Hernandez, C., Schmitt, F., Cipolla, R.: Silhouette coherence for camera calibration under circular motion. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(2), 343–349 (2007)
6. Nister, D.: Untwisting a projective reconstruction. International Journal of Computer Vision 60, 165–183 (2004), doi:10.1023/B:VISI.0000029667.76852.a1
7. Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3D face model for pose and illumination invariant face recognition. In: Proceedings of the 6th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) for Security, Safety and Monitoring in Smart Environments, Genova, Italy. IEEE, Los Alamitos (2009)
8. Pollefeys, M., Koch, R., Van Gool, L.: Self-calibration and metric reconstruction in spite of varying and unknown intrinsic camera parameters. International Journal of Computer Vision 32, 7–25 (1999), doi:10.1023/A:1008109111715
9. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 519–528 (2006)
10. Triggs, B.: Autocalibration and the absolute quadric. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, p. 609 (1997)
11. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment – A modern synthesis. In: Triggs, B., Zisserman, A., Szeliski, R. (eds.) ICCV-WS 1999. LNCS, vol. 1883, pp. 298–375. Springer, Heidelberg (2000)
12. Wong, K.-Y.K., Cipolla, R.: Reconstruction of sculpture from its profiles with unknown camera positions. IEEE Transactions on Image Processing 13, 381–389 (2004)
Precise Registration of 3D Images Acquired from a Hand-Held Visual Sensor

Benjamin Coudrin1,2,3,4,5, Michel Devy2,3, Jean-José Orteu4,5, and Ludovic Brèthes1

1 NOOMEO; rue Galilée, BP 57267, 31672 Labège CEDEX, France
2 CNRS; LAAS; 7 avenue du Colonel Roche, F-31077 Toulouse, France
[email protected]
3 Université de Toulouse; UPS, INSA, INP, ISAE; LAAS-CNRS; F-31077 Toulouse, France
4 Université de Toulouse; Mines Albi; ICA; Campus Jarlard, F-81013 Albi, France
5 Ecole des mines Albi, Campus Jarlard, F-81013 Albi, France
Abstract. This paper presents a method for the precise registration of 3D images acquired from a new sensor for 3D digitization that is moved manually by an operator around an object. The system is equipped with visual and inertial devices and with a speckle pattern projector. The method has been developed to address the problem that a speckle pattern moving during a sequence prevents the correlation of points between images acquired from two successive viewpoints. Several solutions are therefore proposed, based on images acquired with a moving speckle pattern. The approach improves the ICP-based methods classically used for the precise registration of two clouds of 3D points.
1 Introduction

Digitizing 3D objects is a sequential process of geometric modeling from sensory data acquired while moving a sensor in front of the object, or moving the object in front of the sensor. Registration of 3D images must be performed to merge the 3D images acquired from multiple viewpoints, so that the fused model is consistent with the reality of the scene to be modeled, and is accurate. Precision is an important consideration in most applications, ranging from non-destructive testing to medical imagery. 3D digitization has generated numerous works over the last twenty years: several products exist on the market, exploiting, for 3D data acquisition, telemetric laser sensors or visual technologies, with or without light projection. This paper presents a vision-based hand-held 3D scanner allowing acquisition of geometry and texture from a scene, representing it first by clouds of 3D points, before building a textured mesh. The registration of 3D images has been addressed for a long time. Besl et al. introduced the ICP (Iterative Closest Points) algorithm [2], which is the principal method for estimating the rigid transformation between two clouds of 3D points. Numerous variations [13] have since improved the convergence of ICP. Johnson et al. [7] proposed surface descriptors to obtain strong pairings between patches extracted from the two images to be registered, which reduces the dependence on the quality of the initial estimate.
Sandhu et al. [12] recently proposed a multi-hypothesis registration method combining ICP with a particle filter approach. Some works are also ongoing on registration robustness, aiming to provide a guaranteed registration [8]. Registration in visual-based modeling is generally addressed using bundle adjustment [19], exploiting fine image correlation methods [5,16]. Viola et al. introduced a method based on information theory [20] to design a similarity measure, which has been extended in [15,17] and applied mostly to registration between 3D and 2D data. Segal et al. [14] proposed a method very similar to our approach, but with different matching constraints. Visual methods can exploit the projection of a fixed speckle pattern to make correlation possible even on uniform surfaces [16]. Being hand-held and compact, our sensing system has been equipped with an embedded projection device, so registration methods based on fixed speckle patterns cannot be applied here. We therefore had to develop specific criteria for the registration of 3D images based on visual information that cannot be correlated directly. The sensor is first described in section 2, and an overview of the classical registration methods used in a modeling process is presented in section 3. Our registration criterion is detailed in section 4 and experimental results obtained with this approach are discussed in section 5. Finally, section 6 presents perspectives for our future work.
2 The Sensor

With demand growing in reverse engineering (acquisition of CAD models of existing objects) and photomechanical applications (3D measurements for dimensional control or defect detection), several 3D digitization systems have emerged in recent years, especially contactless optical measurement systems based on image processing and computer vision and, very often, on structured light projection. Although very efficient, these systems are generally expensive and bulky, and they need a complex setup that must remain fixed on a tripod during capture. These characteristics limit their use to specifically equipped places, which considerably reduces the field of applications. Recently, some portable 3D digitization systems have appeared. These systems need a magnetic reference to be placed in the digitization environment, or the object to be equipped with markers, to ease the registration process. These setup constraints are often problematic and prohibitive for some applications, particularly for the digitization of artworks, statues, archeological or other precious objects. With our system, we propose an innovative solution for ultra-portable 3D digitization with no object preparation. Lightweight and compact, this hand-held sensor allows 3D shapes to be digitized from shots taken by an operator moving the sensor around the object. Our vision-based sensor achieves 3D modeling of an object from successive registrations of partial 3D reconstructions of the scene. To localize the sensor's position in the scene, it integrates a camera dedicated to localization and an inertial measurement unit. It is optimized for the digitization of objects contained in a 1 m3 volume. The device comes in the form of a pistol fitted with a simple push button used to trigger the scan. In use, the operator points the sensor at the object. Clicking on the trigger activates two laser pointers placed in the housing that help the user maintain the sensor at the best working distance. A long push on the trigger causes a sequence acquisition
Fig. 1. Optinum
Fig. 2. (a) Registered 3D point clouds; (b) A textured mesh
until the button is released. The generated sequence consists of 3D point clouds with adjustable density, acquired at a variable frequency. Between 3D acquisitions, 2D images can be shot by the localization camera. Inertial measurements acquired at high frequency allow the sensor's attitude to be determined. All these data are exploited here to fuse successive 3D views (figure 2 (left)) and to capture appearance information (grayscale or color), so that, depending on the application, the system can generate meshed and textured models (figure 2 (right)). In a hand-held sensor context, an important challenge concerns the registration of partial data acquired from different viewpoints. Moreover, the precision of the generated 3D data is an essential consideration for numerous applications; our objective is to guarantee a 0.1 mm accuracy. The next section presents our algorithms and strategies for precise registration with no prior preparation of the object or specific material for our sensor.
3 Registration

Many approaches for the registration of two 3D images $V_k$ and $V_{k+l}$, acquired respectively at instants $t_k$ and $t_{k+l}$, are based on the Iterative Closest Points (ICP) algorithm [2] [13]. This method aims to align two sets of 3D measurements through geometric optimization. Though ICP methods can achieve good results in finding the rigid transformation $(R, t)$ that brings $V_{k+l}$ into the reference frame of $V_k$, they need to be fed with a good initial estimate of this transformation. Moreover, these methods are geometry-based and are therefore dependent on the density of the data.

The ICP Algorithm

Supposing that a good initial guess of the rigid transformation between the two 3D images $V_k$ and $V_{k+l}$ exists, the ICP method minimizes an inter-point distance criterion to align the two models. It is an iterative algorithm using a set of points $\{p_i^k\}$ selected in scan $V_k$. Each iteration is divided into:
– pairing points $p_i^k$ with their nearest neighbours in scan $V_{k+l}$ using a k-d tree structure,
– weighting and rejecting pairs,
– estimating the rigid transformation minimizing the Sum of Squared Distances (SSD) between paired points.

The estimated transformation is then applied to align $V_{k+l}$ on $V_k$. These steps are iterated until a distance convergence criterion is achieved or a maximum number of iterations is reached. Rejecting pairs is done using two filters. The first one ensures unicity of the pairs: one point in $V_{k+l}$ can be paired with only one point of $V_k$. During the pairing step, a list of candidates in $V_{k+l}$ is built for each point $p_i^k$ from $V_k$. These lists are then browsed to ensure both unicity of the pairs and optimality of their choice, in terms of 3D distance. The second filter exploits a statistical criterion based on the distance distribution between paired points [21]. We therefore have a set of $N$ noisy or imperfect pairs $(p_i^k, p_i^{k+l})$. Estimating the rigid transformation to align $V_k$ and $V_{k+l}$ involves minimizing the SSD score:

$\sum_{i=1}^{N} \| p_i^k - (R\, p_i^{k+l} + t) \|^2$

Arun et al. [1] proposed a least-squares minimization method based on SVD decomposition to solve this problem in closed form.

The Initialization Step

Any local optimization method is exposed to the initialization problem, i.e. these methods are sensitive to the quality of the initial guess. When such a guess is not available, a coarse alignment of the scans has to be made first; we exploit measurements from inertial sensing and 2D images [4]. Inertial sensing gives the attitude of the sensor at the acquisition instants; it provides a good estimate of the rotation component $R$ of the transformation, computed by the internal filter of the IMU device. Making more profit of the IMU measurements has not been considered, due to the requirement on the registration accuracy. Exploiting the 2D images that our sensor provides, we can achieve a robust matching of interest points. The translation component $t$ of the transformation is then computed using the 3D points corresponding to the matched interest points.

Density Dependence

Many works in the state of the art have proven the capacity of the ICP algorithm for data alignment in Euclidean space. However, this method tries to pair points and gives an estimate based on noisy measurements. The result is necessarily a compromise of pairing errors. With perfect matchings, the rigid transformation could be found exactly, but scans are discrete: paired points $p_i^{k+l}$ and $p_i^k$ do not correspond exactly. So the ICP result is strongly dependent on the sensor resolution. Figure 3 shows that as the 3D scans are downsampled, precision (and quickly convergence itself) cannot be guaranteed. The next section shows how a method exploiting the 2D images provided by our sensor can improve scan alignment based on the ICP algorithm.
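To connect the steps above to an implementation, the sketch below shows a bare-bones ICP loop with the closed-form SVD estimation of Arun et al. [1]. It is an illustrative sketch rather than the authors' code: the k-d tree pairing, the purely statistical pair rejection and the convergence test are simplified stand-ins for the unicity and distance-distribution filters described above.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(P, Q):
    # Closed-form least-squares (R, t) mapping points Q onto points P (Arun et al.).
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (Q - cq).T @ (P - cp)                       # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cp - R @ cq

def icp(V_k, V_kl, R0=np.eye(3), t0=np.zeros(3), iters=50, tol=1e-8):
    # Align V_kl onto V_k starting from the coarse guess (R0, t0).
    R, t = R0, t0
    tree = cKDTree(V_k)
    prev_err = np.inf
    for _ in range(iters):
        moved = V_kl @ R.T + t                      # V_kl expressed in V_k's frame
        d, idx = tree.query(moved)                  # nearest-neighbour pairing
        keep = d < d.mean() + 2.5 * d.std()         # crude statistical rejection
        dR, dt = best_rigid_transform(V_k[idx[keep]], moved[keep])
        R, t = dR @ R, dR @ t + dt                  # compose the increment
        err = np.mean(d[keep] ** 2)
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R, t
```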
Fig. 3. ICP error according to scan density
4 Image-Based Registration

The ICP-based registration method can give an inaccurate result due to the pairing method. Working exclusively on point sets is convenient in terms of coding simplicity, but the accuracy depends on the scan density. To solve this problem, several methods have been proposed, for instance using a point-to-plane metric [9] or exploiting reverse calibration methods [10]. Our sensor exploits 2D images. A classical method, like bundle adjustment [19], consists in optimizing the rigid transformation and the reconstructed 3D points by directly correlating projections in successive images. Using our stereo setup, this could give precisely matched pixels between four images, $I_k^{left}$ and $I_k^{right}$ acquired at time $t_k$, and $I_{k+l}^{left}$ and $I_{k+l}^{right}$ acquired at time $t_{k+l}$. But the scene (including light) must remain invariant between acquisition instants $t_k$ and $t_{k+l}$. In our case, a speckle pattern is projected to help the stereo-correlation phase. Being hand-held, our sensor imposes the light projector to be embedded on the sensor, and therefore to move with it. It is consequently impossible to achieve a direct correlation between images $I_k$ and $I_{k+l}$ since the lighting conditions have changed greatly.

4.1 Image-Based Pairing
Let us consider two pairs of stereo images $(I_k^{left}, I_k^{right})$ and $(I_{k+l}^{left}, I_{k+l}^{right})$, together with a rough estimate $(R, t)$ of the sensor motion between $t_k$ and $t_{k+l}$. Images are rectified (distortion and epipolar rectification) and the transformation between the left and right camera is fixed and calibrated. The process of 3D reconstruction based on stereo vision [6], [5] has given two clouds of 3D points $V_k$ and $V_{k+l}$. We need to feed the ICP algorithm with a set of paired 3D points $(p_i^k, p_i^{k+l})$.
Fig. 4. Matching using our method. The surface Sk (resp. Sk+l ) is reconstructed from the image pair acquired at time tk (resp. tk+l ). The matching point pk+l of pk is the projection of pk+l on Sk from the view point at time tk . The weighting function applied to this pair is related to the distance from pk to the normal vector at point pk+l .
Our approach looks first for matchings between points selected in $V_k$ and pixels in the images acquired at time $t_{k+l}$. Figure 4 illustrates this process. Firstly, a set of points $\{p_i^k\}$ of $V_k$ is selected from a sampling grid applied on image $I_k^{left}$ according to a given sampling rate. Using the estimated sensor motion, a point $p_i^k$ can be expressed in the reference frame of the sensor at its position at time $t_{k+l}$. Let us denote this estimated matched point of $p_i^k$ by $\hat{p}_i^{k+l} = R^{-1}(p_i^k - t)$.

The classical ICP algorithm selects in $V_{k+l}$ the closest point to $\hat{p}_i^{k+l}$. Instead, a new correlation step is performed between $I_{k+l}^{left}$ and $I_{k+l}^{right}$ in order to improve this matching. At first, $\hat{p}_i^{k+l}$ is projected into $I_{k+l}^{left}$ using the pinhole camera model:

$(su \;\; sv \;\; s)^T = K_0 \, \hat{p}_i^{k+l}$

where $K_0$ is the calibration matrix of the rectified left camera, and $u$ and $v$ are the coordinates (with sub-pixel accuracy) of a pixel in image $I_{k+l}^{left}$. Then the stereo corresponding point of $(u, v)$ is found in $I_{k+l}^{right}$, using interpolation for the correlation [18]. This operation gives a disparity $d$ between matched pixels in the left and right images. A 3D point $(x, y, z)$ is reconstructed from $(u, v)$ and $d$:

$(dx \;\; dy \;\; dz)^T = Q \, (u \;\; v \;\; 1)^T$
with $Q$ the reconstruction matrix, which is known a priori from the stereo baseline and the intrinsic parameters of the rectified stereo sensor. $p_i^{k+l} = (x \; y \; z)^T$ will be used for ICP as the matched point of $p_i^k$ of $V_k$. If $(R, t)$ is the exact transformation, then $p_i^{k+l} = \hat{p}_i^{k+l}$ and the following equation should be verified: $p_i^k = R\, p_i^{k+l} + t$.
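A compact sketch of the pairing just described is given below. It is only an illustration of the equations of section 4.1 under the paper's notation; the sub-pixel stereo correlation itself is left as a hypothetical callback `correlate`, since its implementation (interpolation-based correlation [18]) is not detailed here.

```python
import numpy as np

def predict_and_match(p_k, R, t, K0, Q, correlate):
    """Project the prediction of p_i^k into the left image at t_{k+l}, then
    rebuild the matched 3D point from the measured disparity.

    `correlate(u, v)` is a placeholder returning the disparity d obtained by
    sub-pixel correlation between the rectified left and right images.
    """
    p_hat = np.linalg.inv(R) @ (p_k - t)        # estimated match in frame k+l
    s = K0 @ p_hat                              # pinhole projection (s*u, s*v, s)
    u, v = s[0] / s[2], s[1] / s[2]
    d = correlate(u, v)                         # disparity at (u, v)
    q = Q @ np.array([u, v, 1.0])               # (d*x, d*y, d*z) = Q (u, v, 1)^T
    return q / d                                # reconstructed point p_i^{k+l}
```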
4.2 Transformation Estimation

Classical transformation estimation uses the minimisation of an SSD score based on the paired points. To help filter bad pairs and expand the convergence basin, some additional constraints are taken into account by associating a weight to every pair:

$\sum_{i=1}^{N} \phi(i)\, \| p_i^k - (R\, p_i^{k+l} + t) \|^2$

The function $\phi(i)$ is the weighting function applied to each pair of points. The weighting chosen in our method is inspired by the normal intersection search method of Chen [3]. For each point $p_i^{k+l}$, the distance $d_i$ from it to the normal vector of its matching point $p_i^k$ (figure 4) is computed. We set $\phi(i) = 1 - d_i / d_{max}$, with $d_i$ being the Euclidean distance between the point $p_i^{k+l}$ and the normal vector $n_i$ of the matching point $p_i^k$, and $d_{max}$ the maximal distance over all pairs $(p_i^k, p_i^{k+l})$.

4.3 Pyramidal Approach

When the initial estimate of the transformation is not perfect, the selection method described in section 4.1 will not give proper results. Figure 5 (top) illustrates this error. Sampled points from $V_k$ are drawn with black circles, reconstructed points with green circles, and exact theoretical corresponding points with empty red circles. It is shown here that pairs do not match the exact corresponding points, sometimes leading to significant errors. To solve this problem we propose to reconstruct several candidates to be matched with $p_i^k$. They are also obtained by correlation between $I_{k+l}^{left}$ and $I_{k+l}^{right}$, in the neighbourhood of the pixel $(u, v)$, the projection of $\hat{p}_i^{k+l}$ (figure 6). The size and resolution of
the neighbourhood are adapted according to the convergence of the algorithm.
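Before detailing the pyramidal refinement, the following sketch makes the weighted estimation step of section 4.2 concrete: the Chen-style weights φ(i) and a weighted closed-form solution of the rigid transformation. It is an illustrative reading of the equations, assuming unit-length normals n_i are available for the points of V_k; it is not the authors' implementation.

```python
import numpy as np

def chen_weights(P_k, N_k, P_kl):
    # phi(i) = 1 - d_i / d_max, where d_i is the distance from p_i^{k+l} to the
    # normal line through its matching point p_i^k (normals N_k of unit length).
    v = P_kl - P_k
    d = np.linalg.norm(v - np.sum(v * N_k, axis=1, keepdims=True) * N_k, axis=1)
    return 1.0 - d / d.max()

def weighted_rigid_transform(P_k, P_kl, w):
    # Minimise sum_i phi(i) ||p_i^k - (R p_i^{k+l} + t)||^2 (weighted Arun/Kabsch).
    w = w / w.sum()
    cp = (w[:, None] * P_k).sum(axis=0)
    cq = (w[:, None] * P_kl).sum(axis=0)
    H = (P_kl - cq).T @ (w[:, None] * (P_k - cp))
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cp - R @ cq
```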
Fig. 5. (top) Errors with poor estimations, and (bottom) after our refinement
Fig. 6. During the matching process, part of the neighbourhood of the considered image point is reconstructed. Final matchings are determined using a point-to-point distance.
An iterative pyramidal approach is applied. At the start of the algorithm, we choose a large neighbourhood window but a sparse resolution, e.g. one can choose to reconstruct one pixel out of three in a 21 × 21 window around the projected pixel $(u, v)$ (Figure 5 (bottom)). These parameters are adapted during the algorithm to help convergence, following a given strategy. When a set of parameters leads to convergence of the estimated motion, one can reduce the neighbourhood window (from 21 × 21 to 3 × 3) and increase the resolution (from one pixel out of three to one out of 0.5). The key point of the method is that refining the search to sub-pixel resolution allows more precise pairs of points to be found. Theoretically, with infinitesimal resolution, the exact corresponding points would be found. In practice, the results are limited by the precision of the interpolation method used for sub-pixel correlation. We show in section 5 the results of our experiments using this method in comparison with classical geometric approaches.
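The coarse-to-fine strategy can be summarised by a small schedule of (window size, sampling step) pairs. The sketch below is only a schematic of the idea: the 21 × 21 and 3 × 3 settings come from the text, the intermediate stage is an assumed example, and `refine` stands for one pass of candidate reconstruction, weighting and transformation estimation.

```python
# (window size in pixels, reconstruction step in pixels); the middle entry is
# an illustrative intermediate stage, not a value given in the paper.
SCHEDULE = [(21, 3.0), (11, 1.0), (3, 0.5)]

def pyramidal_registration(refine, R, t, tol=1e-6):
    # `refine(R, t, window, step)` is a hypothetical helper returning the
    # updated (R, t) and the current registration error.
    for window, step in SCHEDULE:
        prev = None
        while True:
            R, t, err = refine(R, t, window, step)
            if prev is not None and abs(prev - err) < tol:
                break                      # converged for this setting
            prev = err
    return R, t
```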
5 Results

The results presented here are based on two data sets. The first one is made of scans of a mechanical object. The second set is a sequence acquired on a standard cylinder with a known internal diameter of 70.004 mm. Rectified pairs of images from the mechanical object sequence are shown in figure 7 and those from the cylinder sequence are presented in figure 8. The key features of our method are illustrated through these experiments. The mechanical object sequence is used to show the convergence of the method when registering two 3D views with a reduced number of points. The cylinder sequence
Fig. 7. Stereo acquisitions of a mechanical object
Fig. 8. Stereo acquisitions of a standard cylinder
is used to illustrate the precision gain brought by the method. More experimental results are available in [4].

5.1 Mechanical Object

For this evaluation we registered the point clouds from the mechanical object sequence using three methods: the original ICP [2], the Park method [11] and our approach. Figure 9 shows the evolution of the ICP error over 50 iterations for the three methods. The Park method converges quickly, as was shown in [13], and gives a better result than the original ICP. Our method also converges quickly in its first strategy. The first strategy change (iteration 14) allows it to reach a precision similar to the Park method. The second strategy change (iteration 40) reaches sub-pixel precision and allows a better matching, reducing the final error. The point clouds registered with each method are shown in figure 10. The initial position is also given in the figure. The classical ICP and Park methods reach a good registration, but a bias can be observed. Indeed, these methods tend to converge using areas where the point density is higher or more uniform; here, they favor the alignment of the dominant planes. Our method reaches a more homogeneous registration thanks to our weighting constraints and because it can converge using fewer points (Table 1), making it less sensitive to dominant areas.
Fig. 9. RMS registration error for the mechanical test object
Fig. 10. Registration results. (a) Initial estimate (b) ICP (c) Park method (d) Our method.
The registered models have been compared to the CAD model of the object. Table 1 summarizes the principal error measurements extracted from this comparison.

5.2 Cylinder Sequence

We present in figure 11 a comparison of results on a standard cylinder between a classical geometric method and ours. The registered sequence of twenty-five acquisitions has been fitted to the exact CAD model of the cylinder; the projection error map, based on the point-to-surface distance between the built cloud of points and the theoretical CAD model, is built using the Geomagic Qualify v12 software. The first observation is that the error span is narrowed with our method. More than 99% of the points are included in a [−154 μm, 154 μm] range, with a large majority included in [−50 μm, 50 μm]. With a classical approach a large number of points are outside this range and are mostly included in [−100 μm, 100 μm]. One can note that with a classical geometric approach, due to error propagation, points on the border are not of good quality. With our approach, thanks to the precision gain, error still propagates (since we only perform series of pairwise registrations) but the divergence effect is smoothed over the cloud.
Table 1. Registration evaluation. (a) ICP (b) Park method (c) Our method.

                     A         B         C
Nb points          119906    119906    21378
Mean (mm)          −0.017    −0.027     0.024
Std dev. (mm)       0.277     0.213     0.208
Max error (mm)      1.922     1.315     1.264
Min error (mm)     −1.645    −1.724    −1.864
Fig. 11. Image-based registration: (left) classical ICP, (right) our refined approach
6 Conclusion

We presented in this article an approach for the precise registration of 3D images acquired from a hand-held sensor based on a moving speckle pattern projector. We illustrated the weakness of classical geometric methods when exploiting highly sampled data sets. A combination of geometric and image processing solutions has been proposed for a precise refinement step after a coarse initial estimate has been provided. The accuracies given by our method are compliant with the requirements of most industrial applications. The method still needs to be improved to reduce the total number of iterations by adapting the strategies efficiently. The method has only been validated for the registration between two successive acquisitions; future work will focus on the adaptation of this method to multiple-view registration.
References
1. Arun, K.S., Huang, T.S., Blostein, S.D.: Least-squares fitting of two 3-D point sets. IEEE Trans. on PAMI 9 (1987)
2. Besl, P., McKay, N.: A method for registration of 3-D shapes. IEEE Trans. on PAMI 14 (1992)
3. Chen, Y., Medioni, G.: Object modelling by registration of multiple range images. Image Vision Comput. 10(3), 145–155 (1992)
4. Coudrin, B., Devy, M., Orteu, J.J., Brèthes, L.: An innovative hand-held visual digitizing system for 3D modelling. Optics and Lasers in Engineering (2011)
5. Devernay, F., Faugeras, O.: Shape from stereo using fine correlation: Method and error analysis. Technical report, INRIA Sophia-Antipolis (2000)
6. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000), ISBN 0521623049
7. Johnson, A., Hebert, M.: Surface registration by matching oriented points. In: Proc. Int. Conf. on 3-D Digital Imaging and Modeling (3DIM) (1997)
8. Li, H., Hartley, R.I.: Five-point motion estimation made easy. In: Proc. Int. Conf. on Pattern Recognition (ICPR) (2006)
9. Low, K.: Linear least-squares optimization for point-to-plane ICP surface registration. Technical report, University of North Carolina at Chapel Hill (2004)
10. Neugebauer, P.: Geometrical cloning of 3D objects via simultaneous registration of multiple range images. In: Proc. Int. Conf. on Shape Modeling and Applications (1997)
11. Park, S.Y., Subbarao, M.: A fast point-to-tangent plane technique for multi-view registration. In: Proc. 4th Int. Conf. on 3-D Digital Imaging and Modeling, pp. 276–284 (2003)
12. Sandhu, R., Dambreville, S., Tannenbaum, A.: Particle filtering for registration of 2D and 3D point sets with stochastic dynamics. In: Proc. Conf. Computer Vision and Pattern Recognition (2008)
13. Rusinkiewicz, S., Levoy, M.: Efficient variants of the ICP algorithm. In: Proc. Int. Conf. on 3-D Digital Imaging and Modeling (3DIM) (2001)
14. Segal, A., Haehnel, D., Thrun, S.: Generalized-ICP. In: Proc. Conf. Robotics: Science and Systems (RSS), Seattle, USA (June 2009)
15. Studholme, C., Hill, D., Hawkes, D.: An overlap invariant entropy measure of 3D medical image alignment. Pattern Recognition 32 (1999)
16. Sutton, M.A., Orteu, J.J., Schreier, H.: Image Correlation for Shape, Motion and Deformation Measurements: Basic Concepts, Theory and Applications. Springer Publishing Company, Incorporated (2009)
17. Suveg, I., Vosselman, G.: Mutual information based evaluation of 3D building models. In: Proc. Int. Conf. on Pattern Recognition (ICPR), vol. 3 (2002)
18. Thévenaz, P., Blu, T., Unser, M.: Interpolation revisited. IEEE Trans. on Medical Imaging 19(7), 739–758 (2000)
19. Triggs, B., McLauchlan, P., Hartley, R., Fitzgibbon, A.: Bundle adjustment – a modern synthesis. In: Proc. Int. Workshop on Vision Algorithms, with ICCV 1999 (1999)
20. Viola, P., Wells, W.M.: Alignment by maximization of mutual information. Int. Journal on Computer Vision (IJCV) (1997)
21. Zhang, Z.: Iterative point matching for registration of free-form curves and surfaces. Int. Journal on Computer Vision (IJCV) 13 (1992)
A 3-D Tube Scanning Technique Based on Axis and Center Alignment of Multi-laser Triangulation

Seung-Hae Baek and Soon-Yong Park

School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea
[email protected], [email protected]
Abstract. This paper presents a novel 3D tube scanning technique based on multi-laser triangulation. A multi-laser and camera module, which will be mounted in front of a mobile robot, captures a sequence of 360 degree shapes of the inner surface of a cylindrical tube. In each scan of the sequence, a circular shape, which is composed of four partial ellipses, is reconstructed with a multi-laser triangulation technique. To reconstruct a complete shape of the tube, the center and axis of the circular shape in each scan are aligned to a common tube model. To overcome inherent alignment noise due to off-axis robot motion, sensor vibration, etc., we derive and apply a 3D Euclidean transformation matrix in each scan. In the experimental results, we show that the proposed technique reconstructs very accurate 3D shapes of a tube even in the presence of motion vibration.
1 Introduction
Tube inspection based on laser triangulation is one of the NDT (Non-Destructive Testing) techniques and its inspection accuracy is very high. A laser-based inspection system usually consists of one or more lasers and vision cameras. This tube inspection technique has wide application areas such as gas pipes, plant boiler pipes, sewer pipes, and so on. These facilities are important for our lives; therefore, there should be periodic inspections to ensure their safe operation. Conventional laser-based inspection systems are mostly applied to large-diameter tubes such as sewer pipes, gas pipes and so on. In these cases, the size of the inspection sensor device does not need to be restricted by the tube size. For example, a long omni-directional laser with an illuminating device has been developed. An investigation of the reconstruction of the inner surface of a large-size tube was done by Matsui et al. [6]. The pose of a vision camera is estimated by using corresponding point sets which are acquired from illuminated images. In reference [4], as a similar case, only video images are used in a sewer reconstruction system, and corresponding point sets are used for the measurement of camera motion. If a laser sensor obtains a feature from the image of the illuminated inner surfaces, camera motion can be easily measured. Using the camera motion data, several tube models are merged with an axis alignment. There are also some laser inspection systems which do not consider either camera motion or the direction of the tube axis. In references [7] and [5], several tube models
are reconstructed without any information about camera motion. Assuming that the direction of the tube axis is equal to the camera motion, the reconstructed models from all images are merged into a complete tube model. A tube inspection system for a small-size tube has been investigated by Frey [2]. In this system, four laser devices are attached to the four sides of a vision sensor. The angle between each laser is 90 degrees. By moving the laser sensor along the tube axis, a 3D model is reconstructed. In that work, the tube axis is assumed to be the same as the direction of camera motion. In this paper, we propose a novel 3D tube scanning technique. A complete 3D tube model which represents the inner surface of a cylindrical tube is reconstructed accurately by using a multi-laser triangulation technique. We are especially interested in the 3D shape reconstruction of a very small-size tube, whose diameter is about 100 mm. Therefore, the size of our scanning device is only 70 mm in height and 120 mm in length. A sequence of laser images is captured by a compact vision camera. In each image frame, a circular shape representing a partial tube surface is reconstructed. To reconstruct a complete 3D tube model, we align the axis and center of the partial circular shapes. Two geometric constraints are employed. One is that the profile of the tube is a perfect circle. The other is that the camera moves with a constant velocity. The center and axis of each circular shape are aligned to a common tube model by deriving translation and rotation transformations. This paper consists of the following sections. In Section 2, we summarize the proposed tube scanning system. In Section 3, the procedures of tube reconstruction using calibration data and captured laser images are described. Section 4, the core of this paper, describes the alignment scheme for the tube center and axis. In Section 5, we show two experimental results and an error analysis. Conclusions are given in Section 6.
2 Overview of a Tube Scanning Module

2.1 Camera-Laser Scanning Module
Our tube scanning module consists of a vision camera equipped with a wide-angle lens and a multi-laser holder (Fig. 1). The multi-laser holder firmly holds four compact line lasers and four tiny mirrors. The holder is then inserted and fixed around the camera lens. To reduce the volume of the scanning module, the line lasers are installed so that they emit their light frontward. Four mirrors mounted in front of the line lasers, respectively, reflect the laser light by exactly 22.5 degrees. Thus, the angle between the light plane of a laser and the camera's optical axis is almost 45 degrees. The vision camera captures the laser light reflected on the inner surfaces of a cylindrical tube. Once the camera captures an image, four curves are present in the image, which are the laser projections onto the tube surface. The laser curves are extracted and the 3D shapes of the curves are computed using the camera-laser calibration parameters. The calibration is assumed to be done in advance. From a single image, we can obtain a circular shape of the tube from the four laser curves. To reconstruct a complete shape of the tube, the scanning module should move forward to scan the whole inner surface of the tube. Suppose we can obtain a single circular shape in
each scan of the module. Then the shape is represented with respect to the camera coordinate system of the same scan. Since each scan has independent camera coordinates, the simple collection of all scan results cannot be the exact shape of the tube, unless the robot motion and calibration are perfect. If there is any vibration or off-center error in the robot motion, the scans cannot be accurately registered to a common coordinate system. From this point of view, the problem of constructing a complete shape of a tube is similar to the problem of 3D shape registration. Therefore, we need exact transformations to register each scan's data to a reference coordinate system. In this paper, we solve the problem by aligning the tube center and axis. We calculate the tube center and axis in each scan and align the centers and axes from all scans. To align the centers we derive a translation vector in each scan. Similarly, to align the tube axis, we derive a rotation matrix.
Fig. 1. A tube inspection robot and camera-laser scanning module : (clockwise) tube inspection robot, side and front views of an experimental module, the real scanning module
2.2 Camera and Laser Calibration
To use our camera-laser scanning module, it is necessary to calibrate the module. The camera and laser parameters are acquired in separate calibration steps. In computer vision, the perspective projection camera model is widely used. Using the perspective projection model, we can get the external and internal camera parameters. We use Zhang's calibration method for camera calibration [9]. Several images of a checkerboard pattern are acquired by changing the pose of the pattern. Because we use a wide-angle lens with a 3.5 mm focal length, the lens distortion is also calibrated. For laser calibration, we use a general laser calibration method. A line laser can be represented by a plane equation in 3D space: the projection of a line laser in 3D space generates a virtual plane. The laser plane can be represented with three or more points. An equation ax + by + cz + d = 0 represents a plane, where a, b, and c are the elements of the normal vector of the laser plane. In our scanning module, four line lasers are used. Therefore, the plane equation of each laser is obtained one by one. Using the same checkerboard pattern used in the camera calibration, we project the line pattern from a laser and capture the 3D coordinates of the intersections of the line with the checkerboard edges. Three or more intersection points are enough to
calibrate the corresponding laser plane. The same procedures are applied to all line lasers, but independently.
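As an illustration of this calibration step, a least-squares plane can be fitted to the measured intersection points with a simple SVD; the sketch below is a generic plane fit under that assumption, not the authors' exact procedure, and the sample points are made-up values.

```python
import numpy as np

def fit_laser_plane(points):
    # Least-squares plane a*x + b*y + c*z + d = 0 through three or more points.
    P = np.asarray(points, dtype=float)
    centroid = P.mean(axis=0)
    # the normal is the right singular vector with the smallest singular value
    _, _, Vt = np.linalg.svd(P - centroid)
    n = Vt[-1]
    d = -n @ centroid
    return n, d                    # normal (a, b, c) and offset d

# example with hypothetical intersection points measured on the checkerboard
pts = [(0.00, 0.00, 0.50), (0.10, 0.00, 0.52), (0.00, 0.10, 0.51), (0.10, 0.10, 0.53)]
normal, offset = fit_laser_plane(pts)
```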
3 A Circular Shape Reconstruction by Multi-laser Triangulation

We reconstruct the 3D shapes of the laser curves by using a multi-laser triangulation method with the calibration parameters. Laser curves in a captured image are extracted and their 3D coordinates are obtained.

3.1 Laser Curve Extraction and Fitting
The projection of a line laser onto the inner surface of the tube forms a 2D curve in the captured image. To calculate accurate 3D coordinates of the curve, we first remove the image distortion using the lens distortion parameters. In the undistorted image, the intensities of the four curves are very high; therefore, the pixels which belong to the curves are easily extracted with a threshold value. Fig. 2 shows four laser curves in a captured image. In the undistorted image, the curves can be easily extracted. However, it is not easy to tell which laser a curve is projected from. To classify the laser curves (which are composed of bright pixels) into the four lasers, we use a curve fitting method, as sketched below. First, some pixels on a laser curve are sampled. Second, a fourth-order curve equation is obtained by least-squares minimization over the sampled pixels. Each curve then has its own fitted curve equation. Finally, we classify all laser pixels into the four curve equations by grouping the laser pixels based on their distance to the fitted curves. This curve fitting scheme reduces the time needed to identify and classify laser pixels and improves the accuracy of detecting the intersections between laser curves. In Fig. 2, the right-top image is a zoomed part of the red rectangle in the undistorted image. In the right-bottom, the red points are parts of a fitted laser curve and the white points are extracted laser pixels classified into the red curve.
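The fitting-and-grouping step can be sketched as follows. This is a simplified illustration: how the seed pixels for each laser are chosen, and the exact point-to-curve distance used, are not specified in the text, so the vertical residual to a fitted fourth-order polynomial is used here as a stand-in.

```python
import numpy as np

def classify_laser_pixels(pixels, seed_groups, order=4):
    # Fit v = f(u) (fourth-order) to the sampled pixels of each laser, then
    # assign every bright pixel to the closest fitted curve.
    fits = [np.polyfit(g[:, 0], g[:, 1], order) for g in seed_groups]
    labels = np.empty(len(pixels), dtype=int)
    for i, (u, v) in enumerate(pixels):
        residuals = [abs(v - np.polyval(c, u)) for c in fits]
        labels[i] = int(np.argmin(residuals))
    return fits, labels
```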
Fig. 2. Laser line extraction : (clockwise) original image, un-distortion image, zoomed image in red rectangle, curve fitting and laser center extraction, Gaussian distribution of yellow line
For more accurate 3D reconstruction, the extracted laser pixels are refined to a sub-pixel level. The Gaussian distribution method is used to get sub-pixel laser positions [6]. The left-bottom of Fig. 2 shows the Gaussian distribution of the yellow line in the zoomed image part. Here, p is the y coordinate along the yellow line and f(p) is the intensity at p. The sub-pixel position along the yellow line is defined as (i + d), where d is determined by the following equation:

d = [ln(f(i−1)) − ln(f(i+1))] / 2[ln(f(i−1)) − 2 ln(f(i)) + ln(f(i+1))]   (1)
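Equation (1) translates directly into a few lines of code. The sketch below refines the peak position of a 1D intensity profile; the sample profile is made-up data used only for illustration.

```python
import numpy as np

def subpixel_peak(f, i):
    # Gaussian (log-parabola) peak refinement of Equation (1).
    a, b, c = np.log(f[i - 1]), np.log(f[i]), np.log(f[i + 1])
    d = (a - c) / (2.0 * (a - 2.0 * b + c))
    return i + d

profile = np.array([3.0, 10.0, 40.0, 140.0, 90.0, 25.0, 6.0])   # toy intensities
peak = subpixel_peak(profile, int(np.argmax(profile)))          # approx. 3.24
```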
3.2 3D Reconstruction of Laser Curves
Using the four laser curves in an undistorted image and the camera-laser parameters, we can calculate the 3D positions of the curves with respect to the camera coordinate system. The optical triangulation technique [8] is used, as shown in Fig. 3. In this figure, n is the normal vector of a laser plane π0 and Pa is a point on the plane π0. P = [X, Y, Z]T is a 3D point on the reconstructed curve and p = [x, y]T is the projection of P. In addition, r = [x, y, f]T is a vector whose origin is at the camera center and which points toward p. Therefore, P lies on the extension of the vector r and also on the plane π0.
Fig. 3. Optical triangulation
Equation 2 shows the relationship between P, Pa, and n. Equation 3 means that P is on the extension line of r. Equation 4 is derived from Equations 2 and 3.

(P − Pa) · n = 0   (2)
P = μr   (3)
P = r (Pa · n)/(r · n)   (4)
Because the laser pixels are projections of 3D curves which lie on the laser planes, we can obtain 3D coordinates of the points from Equation 4. A captured laser image has four laser curves, therefore, the tube model reconstructed from one laser image
Fig. 4. Tube reconstruction from a single image frame. Left: front view. Right: side view.
consists of four separate 3D arcs in the camera coordinate system. Fig. 4 shows an example of the reconstruction of the four arcs. Different laser curves are shown in different colors.
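A minimal sketch of Equation 4 is given below. The pixel coordinates, focal length and laser plane values are made-up numbers used only to show the call; in the real system they come from the undistorted image and the calibration of Section 2.2.

```python
import numpy as np

def triangulate_laser_pixel(x, y, f, n, Pa):
    # Equation 4: intersect the viewing ray r = [x, y, f]^T with the laser
    # plane of normal n passing through the point Pa (camera frame).
    r = np.array([x, y, f], dtype=float)
    return r * (np.dot(Pa, n) / np.dot(r, n))

# hypothetical values: a laser pixel, the focal length, and one calibrated plane
P = triangulate_laser_pixel(12.0, -3.0, 350.0,
                            n=np.array([0.0, 0.707, -0.707]),
                            Pa=np.array([0.0, 0.0, 0.4]))
```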
4 Complete Tube Scanning Based on Axis and Center Alignment
Because the four laser devices and the vision camera are fixed in the camera-laser module, their geometric relationship is unchanged. In an ideal case, there is a unique tube axis and center even though four laser curves are reconstructed from a single scan image. In a real case, however, the tube axis and center are not unique due to inherent systematic errors such as calibration error, undistortion error, and sub-pixel error. In addition, as mentioned in an earlier section, there is also an alignment error between different scans. In this section, we address the problem, and a solution, of aligning the tube axis and center to reconstruct a complete 3D model of a cylindrical tube. In previous research, the inner texture of a tube has been used to align the tube axis. However, this method cannot be used in our case because there is little texture in the tube. Instead of using texture information, we use some geometric constraints. The first one is that the reconstruction target is a cylindrical tube. The second constraint is that the tube is straight. The third one is that the robot motion between two consecutive scans is small. The proposed alignment method consists of the following steps. First, the tube center and axis are calculated by using an ellipse fitting method. Next, the fitted centers are aligned by a translation transformation and the fitted tube axes are also rotated to be aligned with a basis axis. The tube axis of the first scan is defined as the basis axis. Finally, the transformed tube axes and centers of all scans are aligned.

4.1 Calculation of Tube Center and Axis
One of the constraints in our method is that the profile of the tube is a perfect circle. Due to this constraint, the 3D curve formed by the intersection of a line laser and the tube's inner surface is an ellipse in 3D space. Fig. 5 shows an intersection between a laser plane and the circular tube. In this figure, the dotted line W denotes a vector along the tube axis, C denotes the tube center, and π denotes the laser plane. Here, the tube center is on the laser plane. The red-dotted closed curve is the ellipse generated by the intersection of π and the tube. In the figure, a and b are vectors whose lengths equal the major and minor axes of the ellipse, respectively.
The starting point of the two vectors is at C. In addition, θ is the angle between W and a, and it can be determined by the following equation:

θ = cos−1 (|a| / |b|)   (5)
All vectors are represented with respect to the camera coordinate system Oc.
Fig. 5. Intersection of a laser plane with a circular tube
The tube axis W and center C are derived by using an ellipse fitting method. Among many 3D ellipse fitting methods [1][3], we use a least-squares fitting method. An ellipse in 3D space is rotated so that it becomes parallel to the xy-plane of the camera coordinate system, and all parameters of the ellipse (center, major axis, and minor axis) are obtained by ellipse fitting in the 2D xy-plane. After that, the rotated ellipse is transformed back to the original laser plane. Then we can obtain the ellipse parameters in 3D space. By rotating a by θ around the pivot axis b, W is derived and normalized. Using the same procedure described above, we obtain four ellipse models. Since the lasers and mirrors are fixed at the same angle and distance, in an ideal case all W and C of the four laser planes should be the same. However, in a real case, they are not the same due to inherent noise and fitting errors. Therefore, we need to determine a new and unique tube axis and center, which represent the axis and center of the tube model to be reconstructed. Fig. 6 shows the four laser planes and their centers and axes. CEk and WEk (k = 0, 1, 2, 3) denote the center and axis obtained from the k-th laser plane. We define the new center Cn and axis Wn by averaging CEk and WEk. They represent the center and axis of the tube model in frame n. Wn is also normalized to a unit vector.
Fig. 6. Average axis and center, Cideal and Wideal are the ideal center and axis vector
4.2 Tube Center and Axis Alignment
We align all tube models with the basis model. First of all, all tube centers have to lie on the basis axis. This also means that each tube axis should lie on the basis axis. If there is a translation from the n-th tube center to the basis center, the 3D ellipse in the n-th frame translates together with it. Fig. 7 shows the translation of a tube center. We define the first tube model as the basis tube model. C0 and W0 denote the basis center and axis, respectively. Cn and Wn are the n-th tube center and axis. C'n is the projection of Cn onto the basis axis W0. tn is a translation vector from the n-th tube model to the basis axis. We define the translation from Cn to C'n, not to the basis center C0. This is because most alignment errors occur in the x and y directions, not in the z direction, due to the sensor motion in the horizontal and vertical directions. In Equation 6, the vector tn is derived by using C0, Cn, and W0. The length of (C'n − Cn) is the inner product of Cn − C0 and W0.

tn = C'n − Cn = C0 − Cn + ((Cn − C0) · W0) W0   (6)
After the tube center alignment, all centers lie on the basis axis. However, the tube axes do not yet coincide, as shown in Fig. 7. Therefore, we also need to align the tube axes. By deriving a rotation matrix Rn for each tube axis, we align all axes. Fig. 7 shows the axis rotation transform. Suppose φ is the angle between W0 and Wn. To align Wn with W0, we first define their cross product. The red-colored vector denotes the cross product between Wn and W0; it becomes the pivot axis of the rotation from Wn to W0 by angle φ. To obtain the rotation matrix, we need the third axis, which is the cross product of (Wn⊗W0) and W0. Using the three orthogonal axes, we derive Rn, which rotates Wn to W0 by angle φ.
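The two alignment steps can be summarised as follows. The translation implements Equation 6 directly; for the rotation, the sketch uses Rodrigues' formula about the pivot axis Wn × W0, which is an equivalent way of building the matrix Rn described above (the paper constructs it from the three orthogonal axes).

```python
import numpy as np

def center_translation(C0, W0, Cn):
    # Equation 6: t_n moves the n-th centre onto the basis axis (W0 unit length).
    return C0 - Cn + np.dot(Cn - C0, W0) * W0

def axis_rotation(Wn, W0):
    # Rotation R_n taking axis Wn onto the basis axis W0 (Rodrigues' formula).
    Wn, W0 = Wn / np.linalg.norm(Wn), W0 / np.linalg.norm(W0)
    pivot = np.cross(Wn, W0)
    s, c = np.linalg.norm(pivot), np.dot(Wn, W0)
    if s < 1e-12:
        return np.eye(3)                     # axes already aligned
    k = pivot / s
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)
```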
Fig. 7. Center and axis alignment
4.3 Complete Tube Reconstruction
In each image frame, as described in the previous sections, we obtain a circular shape after transforming the four ellipse models. To reconstruct a complete 3D tube model, we then need to merge all scans. Merging all scans is a simple task. Since we assume that the velocity of the laser-camera motion is constant, each scan is merged into a global coordinate system. The first camera coordinate frame is defined as the global coordinate system and all other scans are aligned with respect to the first tube axis W0.
Let Pn be a 3D point in the n-th image frame. Equation 7 gives the aligned point P'n corresponding to a point Pn of the n-th image frame, where Pn is the point before alignment. Rn is the axis rotation from the n-th image axis to the basis axis W0. The vector tn is the center translation from the center of the n-th image frame to the basis center. The constant v is the velocity of the laser-camera module's motion.

P'n = RnPn + tn + vnW0   (7)
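Applying Equation 7 to a whole frame is then a one-line operation; the sketch below assumes the per-frame Rn and tn computed as above and a known constant velocity v, and is given only as an illustration of the merging step.

```python
import numpy as np

def merge_scan(P_n, R_n, t_n, n, v, W0):
    # Equation 7: P'_n = R_n P_n + t_n + v * n * W0, applied to all points (rows).
    return P_n @ R_n.T + t_n + v * n * W0

# toy usage: one frame of three points merged into the global model
aligned = merge_scan(np.zeros((3, 3)), np.eye(3), np.zeros(3),
                     n=2, v=0.01, W0=np.array([0.0, 0.0, 1.0]))
```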
5 Experiment Results
Two experiments were done to reconstruct complete tube models. The first experiment was done using the camera-laser module installed on a motorized linear stage system; the module moves forward in the direction of the tube axis. The second experiment was done with a handheld module: we hold and move the module slowly into the tube to simulate the serious vibration of an inspection robot. In each case, we compare the reconstruction results for three types of axis alignment. First, a tube model is reconstructed without any alignment. Second, we apply only the center translation. Third, we apply both the center translation and the axis rotation. In the first case, all scans are aligned along the z-axis of the first camera coordinates. To evaluate the accuracy of the reconstruction results, we measure the tube radius from the basis axis as shown in the following equation [1]:

(X − C0)T (I − W0W0T) (X − C0) = r2   (8)
Equation 8 is the measurement equation of the tube's radius. X denotes a measured point, C0 and W0 are the tube alignment center and axis, I is the 3-by-3 identity matrix, and r is the tube radius. We measured the distance from the tube axis to every point and calculated the average and standard deviation of the absolute difference between the real radius and the measured radius. Fig. 8 shows the experiment system. The left image shows the linear stage with the camera-laser module. The center image shows the module inserted into a cylindrical PVC tube of diameter 108 mm. The right image shows the laser reflections inside the tube.
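The evaluation of Equation 8 reduces to measuring, for every reconstructed point, its distance to the aligned axis and comparing it with the known radius. The sketch below computes the statistics reported in Tables 1 and 2; it is a straightforward reading of the equation, not the authors' evaluation code.

```python
import numpy as np

def radius_error_stats(points, C0, W0, r_true):
    # Distance of every point X to the axis (C0, W0), cf. Equation 8, and the
    # mean / standard deviation of |measured radius - real radius|.
    W0 = W0 / np.linalg.norm(W0)
    d = points - C0
    radial = d - np.outer(d @ W0, W0)        # (I - W0 W0^T)(X - C0)
    r = np.linalg.norm(radial, axis=1)
    err = np.abs(r - r_true)
    return err.mean(), err.std()
```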
Fig. 8. Experiment setup: linear stage and image capture module (left), insertion of the linear stage into a PVC tube (middle), reflection of the laser on the tube wall (right)
5.1 Experiment 1

In the first case, we carried out an experiment with the linear stage: the laser module was moved by the linear stage. The stage velocity was set to be constant and little vibration occurred during the motion of the laser module.
Fig. 9 shows the results of the first experiment. We acquired 100 image frames. The left image shows the result without any alignment. The middle model was reconstructed with the center translation only. In the right figure, a tube model was reconstructed with center and axis alignment. The appearance of the reconstructed models looks similar because there is little motion vibration.
Fig. 9. 3D reconstruction results using a linear stage: From left, original, center alignment, and center and axis alignment
Table 1 shows the average and standard deviation of the absolute difference between the measured radius and the real radius. After center and axis alignment, the shape of the reconstructed tube model is more accurate than in the other cases.

Table 1. Comparison of tube radius (mm)

Case                          Average   Standard deviation
Original model                  3.983    1.105
Center alignment                1.867    0.477
Center and axis alignment       1.722    0.431

5.2 Experiment 2
In the second experiment, we held and moved the camera-laser module into the tube by hand. While moving the module, we captured 70 frames. Because of the unstable motion of the hand, the tube reconstruction in each scan has a different center and axis. In Fig. 10, the top row shows the results of the tube modeling. The bottom row is the projection of the tube models onto the xy-plane of the first camera coordinates. In the top and bottom figures, we easily notice the variation of the tube centers without any alignment. The reconstructed tube model has serious shape distortions due to the hand motion. After center and axis alignment, the tube model is accurately reconstructed. Table 2 compares the radius measurements in this experiment.

Table 2. Comparison of tube radius (mm)

Case                          Average   Standard deviation
Original model                  6.241    1.121
Center alignment                2.160    0.568
Center and axis alignment       1.852    0.661
Fig. 10. 3D reconstruction results of handheld motion. Top: from left, original model, center alignment only, center and axis alignment. Bottom: projection onto the xy-plane of the first camera system; near (green) to far (blue) ellipses from the first camera. From left, original model, center alignment only, center and axis alignment.
6 Conclusion and Future Works
In this paper, we addressed the problem of accurate shape reconstruction of a cylindrical tube using a multi-laser and camera module. The module scans the 360 degree shape of the inner surface of a cylindrical tube. For accurate shape reconstruction, the ellipse models reconstructed by independent multi-laser triangulations are aligned. The center and axis of the ellipses are aligned and all scan models are merged to reconstruct a complete 3D tube model. Experimental results show that the proposed tube scanning technique yields very accurate 3D tube models. Even with serious vibration of the scan module, we can reconstruct a cylindrical tube model. In the future, we will install the scan module at the front of a tube inspection robot. More practical experiments and error analysis will be done. In addition, by employing a velocity or acceleration sensor, each scan model can be merged into the global model more accurately.

Acknowledgement. This work was supported by the Korea Institute of Energy Technology Evaluation and Planning (KETEP), funded by the Ministry of Knowledge Economy (MKE).
References
1. Eberly, D.: Fitting 3D Data with a Cylinder. Geometric Tools, LLC (2008), http://www.geometrictools.com
2. Frey, C.W.: Rotating Optical Geometry Sensor for Fresh Water Pipe Inspection. In: The Seventh IEEE Conf. on Sensors, pp. 337–340 (2008)
3. Jiang, X., Cheng, D.: Fitting of 3D circles and ellipses using a parameter decomposition approach. In: Fifth Int. Conf. on 3-D Digital Imaging and Modeling (2005)
4. Kannala, J., Brandt, S.S., Heikkilä, J.: Measuring and modelling sewer pipes from video. Machine Vision and Applications 19, 73–83 (2008)
5. Klingajay, M., Jitson, T.: Real-time Laser Monitoring based on Pipe Detective Operation. World Academy of Science, Engineering and Technology 42, 121–126 (2008)
6. Matsui, K., Yamashita, A., Kaneko, T.: 3-D Shape Measurement of Pipe by Range Finder Constructed with Omni-Directional Laser and Omni-Directional Camera. In: Int. Conf. on Robotics and Automation (2010)
7. Ouellet, M., Senecal, C.: Measuring and modelling sewer pipes from video. North America Society for Trenchless Technology (2004)
8. Winkelbach, S., Molkenstruck, S., Wahl, F.M.: Low-cost laser range scanner and fast surface registration approach. In: Franke, K., Müller, K.-R., Nickolay, B., Schäfer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 718–728. Springer, Heidelberg (2006)
9. Zhou, F., Zhang, G.: Complete calibration of a structured light stripe vision sensor through planar target of unknown orientations. Image and Vision Computing 23, 59–67 (2005)
Combining Plane Estimation with Shape Detection for Holistic Scene Understanding

Kai Zhou, Andreas Richtsfeld, Karthik Mahesh Varadarajan, Michael Zillich, and Markus Vincze

Automation and Control Institute, Vienna University of Technology, Gusshausstr. 27-29, A-1040 Vienna, Austria
{zhou,ari,varadarajan,zillich,vincze}@acin.tuwien.ac.at
Abstract. Structural scene understanding is an interconnected process wherein modules for object detection and supporting structure detection need to co-operate in order to extract cross-correlated information, thereby utilizing the maximum possible information rendered by the scene data. Such an inter-linked framework provides a holistic approach to scene understanding, while obtaining the best possible detection rates. Motivated by recent research in coherent geometrical contextual reasoning and object recognition, this paper proposes a unified framework for robust 3D supporting plane estimation using a joint probabilistic model which uses results from object shape detection and 3D plane estimation. Maximization of the joint probabilistic model leads to robust 3D surface estimation while reducing false perceptual grouping. We present results on both synthetic and real data obtained from an indoor mobile robot to demonstrate the benefits of our unified detection framework. Keywords: unified probabilistic model, plane estimation, shape detection
1 Introduction
Recent research in image understanding has been geared towards unified probabilistic frameworks. These have been successfully applied in many computer vision tasks such as object detection, object recognition, image segmentation, surface layout recovery, traffic surveillance, visual navigation, geometrical contextual cue extraction, etc. [1–6]. While the conventional 3D object detection pipeline segments objects from the scene based on assumptions of planarity of the supporting scene, these supporting planes (or the background in the segmentation process) are estimated independently of the object detection process. Hence, the quality of the segmentation process depends to a large extent on the background extraction, which can be less than perfect. The combination of isolated object detection and geometrical contextual analysis provides a robust and efficient solution to this typical chicken-and-egg problem of locating objects and reasoning about the 3D scene layout simultaneously. One approach to solve the problem is by using a joint probabilistic framework for the inter-linked modules. However,
Fig. 1. Overview of the proposed 3D scene understanding system (pipeline blocks: stereo images, edge detection, point cloud generation, perceptual grouping / shape detection, plane fitting, joint probability maximization, 3D scene representation)
such a framework requires the incorporation of confidence coefficients from both the object recognizer and the scene geometry estimation modules. While it is possible to have confidence measures returned from most conventional object detectors, this is untrue of most common geometrical scene estimators. In order to handle this issue, [1, 2] assume a uniform distribution while [7, 8] learn this information from training data sets. There have also been attempts at using inference models based on stereo imagery to address this issue. However, high-level 3D scene understanding methods bootstrapped by these stereo algorithms are still straightforward and simple in many cases. Given the requirement of obtaining the best possible inference on scene geometry and object detection, it is necessary to refine the detailed 3D structure of the scene and the object detection process using a unified probabilistic framework. Figure 1 shows the overall processing schema of the scene understanding architecture. The following sections detail the components of the architecture. In Section 2 we review current algorithms and techniques. In Section 3 we elaborate on our approach to perceptual organization for object detection and plane fitting. As mentioned earlier, a unified probabilistic framework requires confidence coefficients from both the object representation and 3D structure estimation modules. By building measures of confidence for the inter-linked basic shape estimation and plane detection modules, we demonstrate a reduction in false positive detection rates. The detailed mathematical framework of our unified detection model is outlined in Section 4. Subsequent sections present the experimental results, evaluations, conclusions and future work.
2 Background and Related Work
Comprehensive scene understanding has been an ultimate goal of computer vision research for more than five decades [9]. Early work relied on sophisticated and ambitious heuristic algorithms that attempted to recover 3D scene structure from a single image, and in recent years much progress has been made in this field by taking the spatial relationship between objects and scene geometry into account [1–6, 10, 11]. Most recently, [1] models the problem of jointly recovering the scene layout and recognizing objects from a single still image. A prior model for scene understanding using 15 typical 3D scene geometries is introduced in [10]. Li et al. proposed a Feedback Enabled Cascaded Classification Model, which maximizes the joint likelihood of several related computer
vision sub-tasks, in the machine learning domain [12]. The scene geometries, which serve as a first approximation in all the methods mentioned above, have been limited in accuracy and number of modalities. These limitations have been noticed by several researchers recently, leading to attempts to utilize live stereo reconstruction instead of hard-coded modalities or heuristics to provide 3D scene understanding [11, 13]. Because the scene geometry comes with almost infinite variation in appearance, we limit ourselves to the simple but common planar model of the scene structure. Hence it is desirable to design a robust multi-plane fitting algorithm as well as an evaluation criterion for the estimated planes. This multiple model estimation problem has received widespread attention in recent literature [14–18]. The most popular approach to this problem is the RANSAC [19] family of algorithms. However, the random selection approaches were originally designed for single model estimation, and the modifications and extensions capable of tackling multi-model estimation [14, 20] require careful tuning of the weight parameters of each modality. We draw inspiration from the recent success of [21], considering only the largest connected set of inliers to evaluate the fitness of a candidate plane. Differing from CC-RANSAC, which uses only the connectivity of points to form the plane, we combine both connectivity and continuation by considering the Euclidean distance as well as the distance between normals. This combination improves the robustness of plane estimation and simultaneously provides the confidence value tied to the estimated plane. Inspired by the work of [1], who used assumptions about spatial abstraction to improve the performance of object detectors, this work is based on the principle of improving plane estimation using geometric shape primitives. Detection of geometric shape primitives based on perceptual grouping of edges is a well-known topic in computer vision with an abundance of literature since the eighties. [22] introduced geons as a means of describing objects of arbitrary shape by components and [23] discusses the potential of geons in computer vision systems. [24] presents a method based on joint statistical constraints defined on complex wavelet transforms to represent and detect geons. [25] uses perceptual grouping of edge segments to reduce the complexity of detecting 3D models in edge images and shows impressive results on highly cluttered images; this however requires precise CAD-like 3D models given a priori. Approaches such as [26], [27] and [28] use groups of edge fragments to detect learned classes of shapes and show impressive results on databases. [29] combines qualitative and quantitative object models to detect and track just a few object primitives; that system is still brittle in the presence of texture and clutter. To this end, [30] describe objects beyond simple shapes but still of limited complexity (cups, hats, desks) with a qualitative, part-based shape abstraction based on a vocabulary of 2D part models corresponding to closed contours of various shapes. Their system can extract such representations from images containing textured objects as well as complex backgrounds. A major challenge in perceptual grouping is the combinatorial explosion when identifying possible groups of image features, as the number of possible groups
grows exponentially with group size. [25] addresses the problem of exponential run-time complexity using a grid overlaid on the image, indexed by line endpoints. A typical problem of indexing is the appropriate choice of bin size. [31] uses curve parameters to construct index spaces of higher-parametric models and addresses the problem of bin size and of indices close to bin boundaries. [32] proposes to use indexing in the image space, where search lines emanating from the ends of image edges are used to find collinearities, junctions and finally closed contours. Search lines are grown incrementally over processing time, thus avoiding problematic distance thresholds.
3 Basic Shape and Plane Estimation
In the following we show how to use a perceptual grouping system to detect basic geometric features in 2D, how to calculate the associated 3D information and how to obtain representative confidence values. The estimation of planes, together with their confidence values, is also addressed in this section.

3.1 Basic Shape Estimation Based on Perceptual Organization
The proposed perceptual grouping system is based on the work of [32] and [33], which provides an anytime solution for finding associations between features, avoiding the need for arbitrary thresholds. The results of perceptual grouping relevant for the work presented here are junctions between lines, closed contours and rectangles. Processing up to this point is purely 2D, i.e. we detect projections of visual features onto the image plane. To estimate 3D information, we match features from both images of the stereo pair and use line-based stereo matching of specific points to calculate the 3D geometric information. For matching of lines, the MSLD descriptor of [34] together with the epipolar constraint of the calibrated stereo camera setup is used. For closures and rectangles, matching of feature corners is sufficient to get the correct match. To assess a confidence value for stereo-matched lines, we take into account that lines almost parallel to the epipolar line, as well as lines pointing away from the viewpoint, typically have higher errors in 3D reconstruction. We therefore use the angles between the epipolar line and the corresponding line in the left and right image (θ_{2Dl}, θ_{2Dr}) as well as the angle between the line and the z-axis in 3D space (θ_{3Dz}), after normalization between 0 and 1:

$$P(f_{line}) = \frac{\theta_{2Dl}}{\pi/2} \cdot \frac{\theta_{2Dr}}{\pi/2} \cdot \frac{\theta_{3Dz}}{\pi/2} \qquad (1)$$
For the confidence value of closures, we use the lengths of the left and right lines (N_{ll}, N_{lr}) with respect to the lengths of the gaps in the left and right image (N_{gl}, N_{gr}), counted in number of pixels. This value represents the pixel support of the closure in the edge image:

$$P(f_{closure}) = \frac{N_{ll}}{N_{ll}+N_{gl}} \cdot \frac{N_{lr}}{N_{lr}+N_{gr}} \qquad (2)$$
The confidence value for rectangles is composed of the pixel support and the parallelism of opposing line pairs. This parallelism is represented by the difference of the angles of opposing line pairs in the monocular image. For the confidence value, only the smaller value of the left and right image is used (θ_{pl}, θ_{pr}), again normalised between 0 and 1:

$$P(f_{rectangle}) = \frac{N_{rl}}{N_{rl}+N_{gl}} \cdot \left(1 - \frac{\theta_{pl}}{\pi/2}\right) \cdot \frac{N_{rr}}{N_{rr}+N_{gr}} \cdot \left(1 - \frac{\theta_{pr}}{\pi/2}\right) \qquad (3)$$
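The three confidence measures above are simple products of normalized terms and can be computed directly from the 2D grouping results. The sketch below is a minimal illustration of Eqs. (1)–(3), assuming the angles are already given in radians and the pixel counts come from the grouping stage; the function and variable names are ours, not part of the authors' implementation.

```python
import math

def line_confidence(theta_2d_left, theta_2d_right, theta_3d_z):
    """Eq. (1): product of angles normalized by pi/2 (angles assumed in [0, pi/2])."""
    norm = math.pi / 2.0
    return (theta_2d_left / norm) * (theta_2d_right / norm) * (theta_3d_z / norm)

def closure_confidence(n_line_left, n_gap_left, n_line_right, n_gap_right):
    """Eq. (2): pixel support of the closure in the left and right edge images."""
    return (n_line_left / (n_line_left + n_gap_left)) * \
           (n_line_right / (n_line_right + n_gap_right))

def rectangle_confidence(n_line_left, n_gap_left, theta_par_left,
                         n_line_right, n_gap_right, theta_par_right):
    """Eq. (3): pixel support combined with parallelism of opposing line pairs."""
    norm = math.pi / 2.0
    support_l = n_line_left / (n_line_left + n_gap_left)
    support_r = n_line_right / (n_line_right + n_gap_right)
    return support_l * (1.0 - theta_par_left / norm) * \
           support_r * (1.0 - theta_par_right / norm)

# Example: a nearly parallel rectangle with good pixel support scores high.
print(rectangle_confidence(90, 10, 0.05, 85, 15, 0.08))
```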
3.2 Plane Estimation
It has been verified in [21, 35] that using RANSAC-based approaches to determine the inliers from pre-clustered data instead of from the entire point cloud significantly improves performance in the plane fitting task. We adopt the Connected Component RANSAC (CC-RANSAC) [21] as the underlying plane estimator and assign a confidence value to the estimated plane by calculating the average normal vector of the connected points. This confidence value is used for the joint probability maximization and is addressed in detail in Section 4. We start from the RANSAC hypothesis generation and evaluate each hypothesis only on a set of points $C = \{c_i, i = 1, 2, \ldots, n\}$ that belong to the same connected planar component, as in [21]. Consider three points $X_{C_i}, X_{C_j}, X_{C_k}$; the normal vector of the plane generated by these three points is $r^t_{ijk} = \vec{V}_{L_{ij}} \times \vec{V}_{L_{jk}}$, where $\vec{V}_{L_{ij}}$ is the vector joining $X_{C_i}$ and $X_{C_j}$. The points $X_{C_i}, X_{C_j}, X_{C_k}$ are removed from $C$, the next three neighboring points are considered and $r^{t+1}_{ijk}$ is calculated; the procedure continues until fewer than three points remain in $C$. The average normal vector $\bar{r}$ of all the points in $C$ is computed from the collection $\{r^1_{ijk}, \ldots, r^t_{ijk}, \ldots\}$. We define $\theta_{CS}$ as the angle between the average normal vector $\bar{r}$ and the normal vector of the estimated plane $S$; the confidence value for the plane containing the inliers $C$ is then

$$p(S) = \left(1 - \frac{\theta_{CS}}{\pi/2}\right) \cdot \frac{n}{N} \qquad (4)$$
where n denotes the number of inliers belonging to the estimated plane and N is the number of points in the entire dataset. Eq. (4) in essence represents the continuation and connectivity of all the inliers belonging to the estimated plane: higher confidence values denote better quality of the estimated plane.
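As a concrete illustration, the plane confidence of Eq. (4) can be computed from the connected inlier set and the fitted plane normal as sketched below. This is a minimal sketch assuming the connected component is ordered so that consecutive points are neighbours and that the plane normal is already available (e.g. from a CC-RANSAC implementation); the helper names are ours, not the authors' code.

```python
import numpy as np

def plane_confidence(inliers, plane_normal, total_points):
    """Eq. (4): (1 - theta_CS / (pi/2)) * n / N for a connected inlier set.

    inliers      : (n, 3) array of points in one connected planar component C
    plane_normal : unit normal of the estimated plane S
    total_points : N, number of points in the entire dataset
    """
    normals = []
    pts = list(inliers)
    # Walk through consecutive point triples and collect their normals.
    while len(pts) >= 3:
        a, b, c = pts.pop(0), pts.pop(0), pts.pop(0)
        n = np.cross(b - a, c - b)
        if np.linalg.norm(n) > 1e-9:
            normals.append(n / np.linalg.norm(n))
    if not normals:
        return 0.0
    r_bar = np.mean(normals, axis=0)
    r_bar /= np.linalg.norm(r_bar)
    # Angle between the average triple normal and the plane normal;
    # the absolute value guards against the sign ambiguity of the normals.
    cos_theta = abs(float(np.dot(r_bar, plane_normal)))
    theta_cs = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return (1.0 - theta_cs / (np.pi / 2.0)) * len(inliers) / float(total_points)
```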
4 Joint Probabilistic Model
As described in Section 3, the estimation of planar surfaces and the detection of object shapes can be performed independently, but the estimates can be improved if the interaction between objects and supporting planes is taken into account. The likelihoods representing the correct detection of scene elements are p(S) and p(E|O), which denote the prior on the scene information and the image evidence
produced by the object candidates. As in Section 3.2, the scene information $S = \{n, h\}$ is represented by the normal vectors of the planar surfaces $n = \{n_i\}$ and the distances $h = \{h_i\}$ of each plane to the origin of the coordinate system (i.e. the left camera). According to Bayes' theorem, $p(E|O) = p(O|E)p(E)/p(O)$, where $p(O|E)$ is the detection confidence returned by the detector as in Section 3.1. Both $p(E)$ and $p(O)$ can be assumed to be uniformly distributed, therefore $p(E|O) \propto p(O|E)$. For each object candidate $o_i$ we introduce a boolean flag $t_i$, where $t_i = 1$ denotes a true-positive detection of the object and vice versa. The object detections can therefore be represented by the combination of detection results and assigned flags, formulated as $O = \{o_i\} = \{f_i, t_i\}$, where $f$ is the collection of feature detection results $\{f_1, \ldots, f_M\}$. With this probabilistic representation of planes and objects, we formulate the joint probability model of the holistic scene as follows:

$$p(O, E, S) = p(S) \prod_{j=1}^{M} p(O|S)\, p(E|O) = \prod_{i=1}^{K} p(n_i, h_i) \prod_{j=1}^{M} p(f_j, t_j \,|\, S)\, p(e_j \,|\, f_j, t_j) \qquad (5)$$
where K and M are the number of plane estimates and object candidates, respectively. $p(f_j, t_j|S)$ is the probability of object detection given the underlying geometry, and denotes the relation between supporting planes and detected objects. Since the boolean flag $t_j$ is determined by both the scene geometry $S$ and the feature detection results $f = \{f_1, \ldots, f_M\}$, and the feature detection process is independent of the scene geometry, we have $p(f_j, t_j|S) = p(t_j|f_j, S)\,p(f_j|S) \propto p(t_j|f_j, S)$. Consequently, Eq. (5) can be rewritten as

$$p(O, E, S) \propto \prod_{i=1}^{K} p(n_i, h_i) \prod_{j=1}^{M} p(t_j \,|\, f_j, S)\, p(f_j, t_j \,|\, e_j) \qquad (6)$$
To sum up, our joint probabilistic model consists of three parts: (1) the confidence value of the plane estimation, (2) the confidence value of the detected lines returned by the perceptual grouping algorithm, and (3) the likelihood of a true-positive object detection given the underlying plane estimate and the current line detection. The first and second confidence values are returned by the plane estimator and the line detector as described in Sections 3.1 and 3.2. The third confidence value is determined by the distance and angle between the detected lines and the plane:

$$p(t_j = 1 \,|\, f_j, S) = \begin{cases} |\cos 2\theta_j| \cdot \dfrac{\alpha\varepsilon}{d_j} & \text{if } 0 \le \theta_j < \dfrac{\pi}{4} \\[2mm] |\cos 2\theta_j| \cdot \dfrac{\varepsilon}{d_j} & \text{if } \dfrac{\pi}{4} \le \theta_j < \dfrac{\pi}{2} \end{cases} \qquad (7)$$

where $\theta_j$ is the angle between line $j$ and the estimated plane, and $d_j$ denotes the distance of the mid-point of line $j$ to the plane. As defined in RANSAC, the
“scale” parameter ε filters points that are at a distance larger than ε from the estimated plane. Eq. (7) in essence gives a higher confidence value to lines which are parallel or perpendicular to the estimated plane, as well as to lines which are geometrically close to the plane. Since approximately parallel lines are more likely to be found on top of objects, the distances of these lines to the estimated plane are usually larger than those of the approximately perpendicular lines. Hence, we use a weight parameter α (empirically set to 2) to trade off these two kinds of lines. To maximize the joint probability, we formulate the optimization problem as
$$\arg\max_{h_i, n_i, t_j} \; \sum_{i=1}^{N} \ln p(n_i, h_i) + \sum_{j=1}^{M} \left[ \ln p(t_j|f_j, S) + \ln p(f_j, t_j|e_j) \right] \qquad (8)$$
where the flags $t_j$ are the parameters to be estimated. We select the plane with the highest confidence value among all plane estimation results and consider only this plane as the scene geometry for the optimization of the joint probabilistic model. The first part of Eq. (8) is then a constant, and the second part can be evaluated independently for the M 3D matched lines by comparing $\ln p(t_j = 0|f_j, S) + \ln p(f_j, t_j = 0|e_j)$ with $\ln p(t_j = 1|f_j, S) + \ln p(f_j, t_j = 1|e_j)$. After labeling all the objects, the pose of the plane with the highest confidence is refined by searching the nearby space; the refined pose should maximize the number of objects parallel or orthogonal to it. Then, with $S$ fixed, the M comparisons are repeated to obtain the updated true-positive list of the objects.
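The labeling step described above reduces, for a fixed plane hypothesis, to an independent per-line comparison of two log-probability terms. The following sketch illustrates that decision rule using Eq. (7) for $p(t_j = 1|f_j, S)$; the complementary probability, the clipping constants and the parameter values are placeholders of our own choosing, not the authors' implementation.

```python
import math

def line_plane_likelihood(theta, dist, eps, alpha=2.0):
    """Eq. (7): confidence that a detected line is consistent with the plane."""
    factor = alpha * eps if theta < math.pi / 4 else eps
    return abs(math.cos(2.0 * theta)) * factor / max(dist, 1e-9)

def label_objects(lines, eps=0.02, alpha=2.0):
    """Assign t_j in {0, 1} to each line by comparing the two log-probability sums.

    Each line is a dict with:
      'theta' : angle to the estimated plane (radians, in [0, pi/2))
      'dist'  : distance of the line mid-point to the plane
      'p_det' : detection confidence p(f_j, t_j = 1 | e_j) from Section 3.1
    """
    labels = []
    for ln in lines:
        p1 = line_plane_likelihood(ln['theta'], ln['dist'], eps, alpha)
        # Placeholder: treat the complementary hypothesis as 1 - p, clipped away from 0.
        p1 = max(min(p1, 1.0), 1e-9)
        p0 = max(1.0 - p1, 1e-9)
        score_tp = math.log(p1) + math.log(max(ln['p_det'], 1e-9))
        score_fp = math.log(p0) + math.log(max(1.0 - ln['p_det'], 1e-9))
        labels.append(1 if score_tp > score_fp else 0)
    return labels
```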
5 Experiments
In order to compare the performance of the proposed joint probabilistic approach with CC-RANSAC, we generate a synthetic dataset with noisy 3D points. A simple scene consisting of one supporting plane and object clutter is used. All points belonging to the dominant plane detected using RANSAC (points shaded red in the left image of Fig. 2) have been manually removed and replaced with two synthetic supporting planar patches (parallel to the original plane), modeling two supporting surfaces at different heights. This synthetic scene facilitates a qualitative comparison of CC-RANSAC and the proposed method under different scales of inlier noise. The planar patches have been generated with 15000 points (7500 each), corrupted by Gaussian noise of standard deviation σ. The colored points (the three objects comprise 8039 points in total) in the right image of Fig. 2 represent the objects. In Fig. 3 we compare the results of RANSAC, CC-RANSAC and the proposed approach on the synthetic dataset. The red points represent typical results of inliers belonging to the detected planes (as seen from the side view); the proposed method clearly outperforms RANSAC and CC-RANSAC. The estimated plane using CC-RANSAC is tilted towards the objects because of the
Fig. 2. Synthetic data generation: points of the real plane are removed and two synthetic planes are added
Fig. 3. Comparison of RANSAC, CC-RANSAC and the proposed method using synthetic data (side view): (a) RANSAC, (b) CC-RANSAC, (c) proposed approach. The inlier noise scale σ of each plane is set to 0.01, the height between the two planes is 0.05. The typical estimation results of the three methods are illustrated with red points.
Fig. 4. Quantitative comparison of RANSAC, CC-RANSAC and the proposed method for various inlier noise scales: (a) recall rate (%) and (b) precision rate (%), plotted against the scale of inlier noise (0.01–0.05).
higher density of points in that area. The isolated plane estimation with CC-RANSAC is also worse because RANSAC-based methods always converge to the largest plane near the optimum, which in this case is the diagonal plane. We quantitatively compare RANSAC, CC-RANSAC and the proposed holistic method on synthetic data with different inlier noise scales; each method is given 20 trials and the average results are collected. The recall rate measures the
Fig. 5. Processed stereo features (a), the reconstructed 3D scene (b) with the removed false matchings (c), and the results of the three different approaches: sequential RANSAC (d), Connected-Components-RANSAC (e) and the proposed approach (f)
Fig. 6. Processed stereo features (a), the reconstructed 3D scene (b) with the removed false matchings (c), and the results of the three different approaches: sequential RANSAC (d), Connected-Components-RANSAC (e) and the proposed approach (f)
proportion of estimated inliers among the actual inliers of the model, and the precision rate measures the proportion of correctly estimated inliers among all estimated inliers. From Fig. 4 we see that, with increasing inlier noise scale, the proposed method achieves the best performance in terms of accuracy and stability. Figures 5 and 6 show results of our approach in comparison to the sequential RANSAC and CC-RANSAC approaches. We use a pair of Point-Grey FL2G-13S2CC cameras for the stereo setup, mounted on a Pioneer P3-DX mobile robot (1.2 m height, 45° down-tilt), to acquire the raw images. Two challenging scenes are shown in Fig. 5 and Fig. 6. The first scenario contains a multi-rack shelf with objects sparsely distributed inside. It can be seen in Fig. 5d that a wrong vertical plane (green dots) is detected with seqRANSAC (RANSAC applied sequentially, removing the inliers from the dataset after each run), and in Fig. 5e this wrongly detected plane is split into two planar patches (green and pink dots) with CC-RANSAC, which is still incorrect. This false estimation is caused by the missing connectivity check (seqRANSAC) or continuation check (CC-RANSAC). Fig. 6 contains a stair scene with a walking person, which makes CC-RANSAC separate one floor into two parts; our proposed approach averts this error while obtaining better estimation accuracy (see the boundaries between steps in Fig. 6f and 6e).
6 Conclusion
An alternative approach using a joint probabilistic model for detecting object features and estimating supporting planes has been presented in this paper. The proposed approach has been shown to provide superior performance in comparison to seqRANSAC and CC-RANSAC by using planarity cues obtained from perceptual grouping of object features. This improvement is achieved by finding a set of parameters that maximizes the joint probability of a unified model. Experimental results using synthetic and real data have demonstrated the validity of our method. This work can be considered an initial and simplified step towards holistic scene understanding using both sophisticated feature detectors and spatial reasoning. Future work will build upon this framework to obtain a better 3D parameterization. The work can be improved by using genetic programming for multi-model classification and underlying structure estimation, which can provide a better representation of the confidence values of plane estimation as well as automatically detect the number of modalities. Alternative shape detectors and their relationship with the geometrical scene also need to be investigated in the future.

Acknowledgements. The research leading to these results has received funding from the EU FP7 Programme [FP7/2007-2013] under grant agreement No. 215181, CogX.
References 1. Bao, S.Y.-Z., Sun, M., Savarese, S.: Toward coherent object detection and scene layout understanding. In: CVPR, pp. 65–72 (2010) 2. Sun, M., Bao, S.Y.-Z., Savarese, S.: Object detection with geometrical context feedback loop. In: BMVC, pp. 1–11 (2010) 3. Li, L.J., Fei-Fei, L.: What, where and who? classifying events by scene and object recognition. In: IEEE 11th International Conference on Computer Vision, ICCV 2007, pp. 1–8 (2007) 4. Saxena, A., Chung, S., Ng, A.: 3-d depth reconstruction from a single still image. International Journal of Computer Vision 76, 53–69 (2008) 5. Hoiem, D., Efros, A., Hebert, M.: Recovering surface layout from an image. International Journal of Computer Vision 75, 151–172 (2007) 6. Cornelis, N., Leibe, B., Cornelis, K., Van Gool, L.: 3d urban scene modeling integrating recognition and reconstruction. International Journal of Computer Vision 78, 121–141 (2008) 7. Hoiem, D., Efros, A., Hebert, M.: Putting objects in perspective. In: CVPR 2006, vol. 2, pp. 2137–2144 (2006) 8. Hoiem, D., Rother, C., Winn, J.: 3d layoutcrf for multi-view object class recognition and segmentation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 9. Roberts, L.G.: Machine Perception of 3D Solids. PhD thesis, Dept. of Electrical Engineering, Massachusetts Institute of Technology (1963) 10. Nedovic, V., Smeulders, A., Redert, A., Geusebroek, J.M.: Stages as models of scene geometry. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1673–1687 (2010) 11. Helmer, S., Lowe, D.: Using stereo for object recognition. In: IEEE International Conference on Robotics and Automation, ICRA 2010, pp. 3121–3127 (2010) 12. Congcong, L., Adarsh, K., Ashutosh, S., Tsuhan, C.: Towards holistic scene understanding: Feedback enabled cascaded classification models. In: Twenty-Fourth Annual Conference on Neural Information Processing Systems (2010) 13. Gavrila, D.M., Munder, S.: Multi-cue pedestrian detection and tracking from a moving vehicle. Int. J. Comput. Vision 73, 41–59 (2007) 14. Zuliani, M., Kenney, C.S., Manjunath, B.S.: The multiransac algorithm and its application to detect planar homographies. In: IEEE International Conference on Image Processing (2005) 15. Toldo, R., Fusiello, A.: Robust multiple structures estimation with J-linkage. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 537–547. Springer, Heidelberg (2008) 16. Zhang, W., Kosecka, J.: Nonparametric estimation of multiple structures with outliers. In: Dynamical Vision Workshop (2006) 17. Chin, T., Wang, H., Suter, D.: Robust fitting of multiple structures: The statistical learning approach. In: ICCV (2009) 18. Delong, A., Osokin, A., Isack, H.N., Boykov, Y.: Fast approximate energy minimization with label costs. In: CVPR (2010) 19. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 381–395 (1981) 20. Stewart, C.V.: Bias in robust estimation caused by discontinuities and multiple structures. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 818–833 (1997)
21. Gallo, O., Manduchi, R., Rafii, A.: CC-RANSAC: Fitting planes in the presence of multiple surfaces in range data. Pattern Recogn. Lett. 32, 403–410 (2011) 22. Biederman, I.: Recognition-by-components: A theory of human image understanding. Psychological Review 94, 115–147 (1987) 23. Dickinson, S., Bergevin, R., Biederman, I., Eklund, J., Munck-Fairwood, R., Jain, A.K., Pentland, A.: Panel report: The potential of geons for generic 3-d object recognition. Image and Vision Computing 15, 277–292 (1997) 24. Tang, X., Okada, K., Malsburg, C.v.d.: Represent and Detect Geons by Joint Statistics of Steerable Pyramid Decomposition. Technical Report 02-759, Computer Science Department, University of Southern California (2002) 25. Lowe, D.G.: Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence 31, 355–395 (1987) 26. Nelson, R.C., Selinger, A.: A cubist approach to object recognition. Technical Report TR689, Dept. of Computer Science, Univ. of Rochester (1998) 27. Ferrari, V., Fevrier, L., Jurie, F., Schmid, C.: Groups of adjacent contour segments for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 36–51 (2008) 28. Ommer, B., Malik, J.: Multi-Scale Object Detection by Clustering Lines. In: International Conference on Computer Vision (2009) 29. Dickinson, S., Metaxas, D.: Integrating Qualitative and Quantitative Object Representations in the Recovery and Tracking of 3-D Shape. In: Harris, L., Jenkin, M. (eds.) Computational and Psychophysical Mechanisms of Visual Coding, pp. 221–248. Cambridge University Press, New York (1997) 30. Sala, P., Dickinson, S.: Model-Based Perceptual Grouping and Shape Abstraction. In: The Sixth IEEE Computer Society Workshop on Perceptual Grouping in Computer Vision, POCV (2008) 31. Sarkar, S., Boyer, K.L.: A computational structure for preattentive perceptual organization: Graphical enumeration and voting methods. IEEE Transactions on System, Man and Cybernetics 24, 246–266 (1994) 32. Zillich, M., Vincze, M.: Anytimeness Avoids Parameters in Detecting Closed Convex Polygons. In: The Sixth IEEE Computer Society Workshop on Perceptual Grouping in Computer Vision, POCV (2008) 33. Richtsfeld, A., Vincze, M.: Basic object shape detection and tracking using perceptual organization. In: International Conference on Advanced Robotics, ICAR, pp. 1–6 (2009) 34. Wang, Z., Wu, F., Hu, Z.: Msld: A robust descriptor for line matching. Pattern Recognition 42, 941–953 (2009) 35. Yasovardhan, R., Hemanth, K., Krishna, K.: Estimating ground and other planes from a single tilted laser range finder for on-road driving. In: International Conference on Advanced Robotics, ICAR 2009, pp. 1–6 (2009)
Simple Single View Scene Calibration Bronislav Přibyl and Pavel Zemčík Brno University of Technology, Faculty of Information Technology, Graph@FIT, Božetěchova 2, 612 66 Brno, Czech Republic {ipribyl,zemcik}@fit.vutbr.cz http://www.fit.vutbr.cz/research/groups/graph
Abstract. This paper addresses automatic calibration of images, where the main goal is to extract information about objects and relations in the scene based on the information contained in the image itself. The purpose of such calibration is to enable, for example, determination of object coordinates, measurement of distances or areas between objects in the image, etc. The idea of the work presented here is to detect objects in the image whose size is known (e.g. traffic signs in the presented case) and to exploit their relative sizes and positions in the image in order to perform the calibration, under some assumptions about the possible spatial distribution of the objects (e.g. their positioning on a plane in the presented case). This paper describes related research and the method itself. It also shows and discusses the results and proposes possible extensions. Keywords: Scene calibration, scene reconstruction, Euclidean reconstruction, single view, automatic parameters estimation, principal component analysis.
1 Introduction
The quality of cameras as well as the computational power of devices has been increasing steadily in the last few years, which has led to a growing interest in computer vision methods and their utilisation in a variety of applications. Many applications exist which do not need to work with calibrated images; however, some applications demand scene calibration in order to perform successfully. Mobile robot navigation, forensic engineering, object parameter estimation, and different kinds of scene reconstruction are typical classes of algorithms which benefit from a calibrated scene. A dependence exists between the precision of the calibration process and the amount and quality of input information required for it. On one side of the scale lie approaches such as image-based rendering, which do not extract much geometric information. They are not very precise in terms of the reconstructed 3D structure, but they do not have special demands on the input data. On the other side of the scale lie more precise methods, such as stereometry, which can be very accurate but can also demand complex and exact input data (i.e. precise camera locations and their internal parameters). A similar dependence often arises between the precision and complexity of the calibration process and,
further, between the complexity and the type of structure which can be extracted from the scene. Methods generating a projective description of the scene tend to be simpler than methods working with an affine description; Euclidean reconstruction, however, usually requires far more complex methods [1]. We developed a simple method of scene calibration which does not have extensive demands on the input data. It requires only one image of the scene, without knowledge of either external or internal camera parameters. However, objects of a known size must be present in the scene and their distance from a planar surface or ground-plane must be known. Also, some basic assumptions about the camera are made. Our goal is to obtain a partial Euclidean description of the scene. More specifically, we want to estimate certain parameters of objects in the scene, such as their size, distances between them, or the sizes of specific areas in the scene. This can be achieved by estimating the parameters of the ground-plane and the real object positions.

1.1 Related Work
As uncalibrated images allow only projective reconstruction, either camera or scene calibration is needed to obtain an affine or Euclidean scene description. Scene calibration is the task of setting up a relation between real-world coordinates and image coordinates by exploiting scene constraints. This task has often been confused with (internal or external) camera calibration, which is a different task altogether and is generally not needed for scene calibration. Methods of scene calibration vary considerably, mainly between single-view and multiple-view approaches. When using two views of a scene, enough information is provided for depth reconstruction. More than two views help to reduce uncertainty or allow checking the consistency of features matched between individual images [2]. When the camera motion between views is known, we speak about stereo vision, and the principles of epipolar geometry can be exploited. Such methods are able to produce a Euclidean description of the scene but often require knowledge of internal camera parameters in addition to external ones; e.g. Salman and Yvinec [3] produce a highly accurate scene representation in the form of a triangle mesh. When the camera motion between views is unknown, we speak about the structure-from-motion problem, where epipolar geometry is also very useful. Szeliski and Kang [4] do not perform camera calibration, which makes only projective reconstruction possible, but their achievement was to recover a dense depth map of the scene from multiple views. Koenderink and Van Doorn [5] use just two views for affine structure recovery, but they require the internal camera parameters to be known. Christy and Horaud [6] showed in their work that it is possible to obtain even a Euclidean scene description efficiently. Multiple views of the scene are often not available in practice, so single-view techniques are needed. In general, one view alone does not provide enough information for a complete 3D reconstruction. The ambiguity has to be resolved by using some a priori information (e.g. by utilising geometric relations of objects in the real world). Avitzour assumes that objects rest on a planar ground and that the camera internal
parameters are known in his calibration procedure [7]. Based on these preconditions, it is possible to estimate the parameters of the ground-plane. Criminisi et al. do not need to know anything about the camera; they only need some special geometric primitives, such as parallel lines, to be present in the image. It is then possible to determine the vanishing line and vanishing points with the use of affine geometry and perform affine measurements in the scene [8]. Similarly, Masoud and Papanikolopoulos use regular geometric primitives, such as directional roadway markings, to calibrate traffic scenes [9]. Huynh does not work with either external or internal camera parameters; however, the approach is specialised to scenes which contain symmetrical objects or object configurations [10]. Our method, described in the following sections, works with only a single view of the scene and does not require knowledge of any camera parameters. It is able to calibrate scenes without regular geometric primitives, in contrast to the majority of other single-view methods.
2 Proposed Calibration Method
Let us suppose that objects whose approximate size is known exist in the image and that they rest on a planar surface, or that their distance from that surface is known. The goal of scene calibration is to find the optimal projection parameters in the described case. These parameters affect the backprojection of the image coordinates of captured objects onto real-world coordinates. The optimal projection parameters are achieved when the backprojected points form a plane. Because 3 points in any configuration in space define a plane, at least 4 points (i.e. objects of known size) are needed to determine the optimal parameters of the ground-plane and to successfully perform the calibration.

2.1 Projection Model
We use a model of perspective projection because images acquired by most cameras are produced through naturally occurring perspective projection. Since the dimensions of the sensing element are not known, it is modelled as a planar lattice with an arbitrary aperture which can be arbitrarily positioned on the optical axis. As the freedom of both of these variables is redundant, we choose to fix the aperture, thus working with the position of the screen in our calculations. We presume that the rays coming from the scene converge in the centre of projection and draw an image on the screen. Three types of entities are used: entities related to real-world objects (denoted by symbols without any index, e.g. A, x); entities related to planar images of objects (denoted by symbols with one apostrophe, e.g. A′, x′); and entities related to spherical images of objects (denoted by symbols with two apostrophes, e.g. A″, x″). A right-handed Cartesian coordinate system, which is attached to the camera, has been used to determine world coordinates. Its origin is identical with the centre of projection, where the x-axis goes horizontally from left to right, the y-axis goes vertically from top to bottom and the z-axis goes perpendicularly
through the screen into the scene, as shown in Fig. 1. We assume that the z-axis is identical with the optical axis of the camera (i.e. the principal point of the camera is in the middle of the image). In practice, the principal point is often displaced by a few pixels from the centre of the image; however, experiments show that the inaccuracy caused by this is insignificant. The origin of the image coordinate system is in the middle of the image and the x′-axis and y′-axis are parallel to their real-world counterparts, the x-axis and y-axis respectively. The transition from world coordinates to image coordinates is therefore simply done by discarding the z-coordinate.
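Under this model, a world point is projected onto the screen at distance k from the centre of projection by the usual perspective division. The snippet below is a minimal sketch of that mapping in the coordinate convention described above; the function name and the example values are ours, not taken from the paper.

```python
def project_to_screen(x, y, z, k):
    """Perspective projection of a world point (x, y, z), given in camera
    coordinates, onto a screen at distance k along the optical (z) axis."""
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (k * x / z, k * y / z)

# Example: a point 10 m away and 1 m to the right, with the screen 1 cm from the centre,
# lands 1 mm from the image centre.
print(project_to_screen(1.0, 0.0, 10.0, 0.01))   # -> (0.001, 0.0)
```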
Fig. 1. Object image A′ on the planar screen is backprojected using rays r_1 and r_2 which belong to a conic beam. The real distance of the star-shaped object A from the centre of projection O is determined by the position in which the object clips precisely into the beam.
Real-world object coordinates are computed from image coordinates using backprojection together with the known size of the real objects. The procedure can be visualised by casting rays from the centre of projection through the border points of an object image. Such rays form a conic beam. The real object must be situated inside this beam. The distance of the object from the centre of projection can be determined by exploiting the fact that the real object must clip precisely into the beam (see Fig. 1 for an illustration). In practice, only one ray per object has to be cast to determine its distance. This ray should go through the same point of each object (e.g. centre of gravity, bottom left corner, etc.). In order to determine the object coordinates, it is necessary to know the scale of the map between the real object and its spherical image. The scale s is computed as the ratio of the object height h and the height of its spherical image h″:

$$s = \frac{h}{h''} \qquad (1)$$

Sizes of both planar and spherical images of the object depend on the out-of-plane rotation of the object (with respect to the image plane). If the chosen
object detector does not provide information about out-of-plane rotation, it is necessary to introduce an abstraction, for example to assume that bounding spheres of objects are detected instead of the objects themselves. In cases where the chosen object detector does not detect bounding spheres of objects, it is possible to simulate such behaviour simply: based on the knowledge of the width and height of the real object, it is possible to decide which of its dimensions is less distorted in the image. The other dimension can then be appropriately enlarged so that the distortion is virtually equal in both dimensions. This abstraction has exactly the same consequence as if all the detected objects were oriented towards the camera. Sizes of images of object bounding spheres which are of equal size and at equal distance from the centre of projection depend on the distance of the object from the optical axis: objects further from the optical axis have larger planar images. It is possible to eliminate this dependency by converting the planar image into a spherical image. That is why we work with the sizes of spherical images instead of planar images (see Fig. 2 for an illustration and [11] for a proof of this phenomenon).
Fig. 2. Perspective projection on a plane and on a sphere. Objects A and B and their bounding spheres of equal size h_A = h_B are at equal distance from the centre of projection O. The size h′_A of the planar image and the size h″_A of the spherical image of object A are nearly equal, because A is situated near the optical axis z. However, this is not true for the size h′_B of the planar image and the size h″_B of the spherical image of object B. The sizes h′_A and h′_B of the planar images of the two objects differ significantly, whereas the spherical images of both objects are of the same size h″_A = h″_B.
The height of a spherical image h″ is given by the top and bottom y-coordinates y′_t and y′_b of the planar image and also by the distance k of the screen from the centre of projection:

$$h'' = 2\pi k \cdot \frac{\arctan\frac{y'_t}{k} - \arctan\frac{y'_b}{k}}{2\pi} = k \cdot \left( \arctan\frac{y'_t}{k} - \arctan\frac{y'_b}{k} \right). \qquad (2)$$
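Eq. (2) is the arc length, on a sphere of radius k, spanned by the two rays through the top and bottom of the planar image. A minimal sketch of this conversion is given below; the function names are ours, the y-coordinates are assumed to be in the same metric units as k, and the absolute value is our own guard against the sign convention of the downward-pointing y-axis.

```python
import math

def spherical_image_height(y_top, y_bottom, k):
    """Eq. (2): height of the spherical image from the planar image's
    top/bottom y-coordinates and the screen distance k."""
    return abs(k * (math.atan(y_top / k) - math.atan(y_bottom / k)))

def map_scale(real_height, y_top, y_bottom, k):
    """Eq. (1): scale s between the real object and its spherical image."""
    return real_height / spherical_image_height(y_top, y_bottom, k)
```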
The vector v′ going from the centre of projection O to the planar image point A′ is created afterwards:

$$\mathbf{v'} = \overrightarrow{OA'}. \qquad (3)$$

Consequently, this vector is normalised to the magnitude k (which corresponds to the conversion from the planar image to the spherical image) and multiplied by the map scale s:

$$\mathbf{v} = \frac{k \cdot \mathbf{v'}}{\|\mathbf{v'}\|} \cdot s. \qquad (4)$$
The resulting vector v has its initial point in the centre of projection O and its terminal point in some point A of the real-world object. Thus, the object is at coordinates

$$A = O + \mathbf{v}. \qquad (5)$$
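Putting Eqs. (1)–(5) together, the 3D position of an object follows from its planar image coordinates, its real height, and the screen distance k. The sketch below is a minimal illustration under the stated assumptions (centre of projection at the origin, screen at distance k on the z-axis); the function name and argument layout are ours, not the authors' implementation.

```python
import numpy as np

def backproject_object(x_img, y_img, y_top, y_bottom, real_height, k):
    """Estimate the 3D position of an object of known real height.

    (x_img, y_img)   : a reference point of the planar image (e.g. its centre)
    (y_top, y_bottom): top/bottom y-coordinates of the planar image
    real_height      : known height h of the real object
    k                : distance of the screen from the centre of projection
    """
    # Eqs. (1)-(2): map scale between the real object and its spherical image.
    h_spherical = abs(k * (np.arctan(y_top / k) - np.arctan(y_bottom / k)))
    s = real_height / h_spherical
    # Eq. (3): vector from the centre of projection O = (0, 0, 0) to the planar image point.
    v_img = np.array([x_img, y_img, k])
    # Eq. (4): normalise to magnitude k and scale by s.
    v = (k * v_img / np.linalg.norm(v_img)) * s
    # Eq. (5): the object lies at O + v.
    return v
```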
The described procedure is only valid if all objects are elevated at the same known height e above the surface. If this height is nonzero, it is necessary to shift the computed plane vertically downwards by e to approximate the real ground-plane. Although the vertical direction in the scene is not known, it can be estimated well from the average top-down orientation of all detected objects. If the assumption of no out-of-plane rotation of objects is made, as stated above, the vertical direction can be estimated as the average in-plane rotation angle ᾱ of all detected objects, where α is provided by the object detector. If the objects are elevated at different heights above the surface, it is first necessary to unify their heights in order to be able to estimate the ground-plane. The unification can be done, e.g., by a vertical projection of the centre of each object onto the surface. Such a point will be called the “foot” of the object A and will be referred to as A_f. The image coordinates A′_f of the foot must be estimated by shifting the image coordinates A′ of the object by a vector d′ given by the known object elevation e and its in-plane rotation angle α:

$$A'_f = A' + \mathbf{d'}, \qquad (6)$$

where

$$\mathbf{d'} = \left( \sin\alpha \cdot e \cdot \frac{h'}{h} \;;\; \cos\alpha \cdot e \cdot \frac{h'}{h} \right). \qquad (7)$$
The image coordinates A′_f of the foot will replace the image coordinates A′ of the object itself in further calculations, assuming the foot is approximately at the same distance from the camera as the object itself.
2.2 Finding Optimal Solution
Since it is possible to determine the real-world 3D coordinates of objects of known size for
a certain value of k, it is possible to search for the optimal value of the projection parameter k. The assumption that all objects rest on the ground, or are situated at a known height above it, is exploited here. This means we search for such a k that a plane can be laid through the cluster of 3D points with minimal effort (minimising an error function). The parameters of the plane can be determined by means of Principal Component Analysis (PCA) [12], because we actually search for the subspace (i.e. the plane) onto which the 3D points can be orthogonally projected with minimal error, expressed as the sum of squared distances of the points from the plane. First, the mean $[\bar{x}, \bar{y}, \bar{z}]^T$ of all 3D coordinates is subtracted from each coordinate to eliminate the bias of the coordinate set. This mean is the point through which the resulting plane will pass. Then, a $3 \times n$ matrix M is constructed,

$$M = \begin{bmatrix} x_0 & x_1 & \cdots & x_{n-1} \\ y_0 & y_1 & \cdots & y_{n-1} \\ z_0 & z_1 & \cdots & z_{n-1} \end{bmatrix}, \qquad (8)$$

where n is the number of objects whose 3D coordinates have been computed. Each column of the matrix contains the coordinates of one object. It is then possible to create the covariance matrix Σ of matrix M,

$$\Sigma = M M^T. \qquad (9)$$
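A compact way to realise this PCA step is an eigen-decomposition of the 3 × 3 covariance matrix: the eigenvector of the smallest eigenvalue is the plane normal, and that eigenvalue is the fitting error, as described next. The sketch below is a minimal illustration of Eqs. (8)–(9) and the eigen-analysis; the function name is ours.

```python
import numpy as np

def fit_plane_pca(points):
    """Fit a plane to a set of 3D points (one point per row).

    Returns (centroid, normal, error), where error is the smallest eigenvalue
    of the covariance matrix, i.e. the sum-of-squared-distances criterion
    used to score a given projection parameter k.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)                 # point the plane passes through
    M = (pts - centroid).T                      # Eq. (8): 3 x n matrix of centred coordinates
    sigma = M @ M.T                             # Eq. (9): 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(sigma)    # eigenvalues in ascending order
    normal = eigvecs[:, 0]                      # eigenvector of the smallest eigenvalue
    return centroid, normal, eigvals[0]
```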
Afterwards, a triplet of eigenvalues and eigenvectors of the 3 × 3 matrix Σ is computed by means of PCA. The eigenvectors associated with the two largest eigenvalues lie in the sought plane. The third eigenvector, associated with the smallest eigenvalue, is perpendicular to both of the other eigenvectors and is a normal vector of the plane. The smallest eigenvalue is equal to the sum of squared distances of all 3D points from the given plane; thereby, it expresses the error of the solution for a given k. Because we work with a non-linear system, analytical calculation of the eigenvalues and eigenvectors is very complex. Therefore, we search for the optimal projection parameter k by searching for the minimum of an error function. It has been experimentally found that the solution error as a function of k has a typical behaviour, as depicted in Fig. 3. It usually has several smooth maxima alternating with sharp minima for small values of k. For large values of k, the function asymptotically approaches a value dependent on the input parameters of the calibration problem. Based on this behaviour of the error function, it is possible to find its global minimum and thereby also the optimal projection parameter k. The function value can be sampled, for example, with an exponentially growing step. If a minimum is found, its location is refined by repeatedly dividing the sampling step. Afterwards, the step size is reinitialised and the search for another minimum takes place. In every examined case in which the function had a global minimum, this approach located it.
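The search over k sketched above can be implemented as a coarse scan with an exponentially growing step followed by local refinement of each detected minimum. The code below is a simplified sketch of such a search built on the PCA error from the previous snippet; the step sizes, bounds and the backprojection callback are illustrative placeholders, not the authors' actual settings.

```python
def find_optimal_k(backproject_all, k_start=1e-4, k_max=1.0, growth=1.1, refine_iters=20):
    """Scan k with an exponentially growing step and refine local minima.

    backproject_all(k) must return the list of backprojected 3D object points
    for a given k (e.g. using backproject_object above); the error of a k is
    the smallest eigenvalue of the PCA plane fit.
    """
    def error(k):
        _, _, err = fit_plane_pca(backproject_all(k))
        return err

    best_k, best_err = None, float('inf')
    prev_k, prev_err = k_start, error(k_start)
    descending = False
    k = k_start * growth
    while k <= k_max:
        err = error(k)
        if descending and err > prev_err:      # prev_k brackets a local minimum
            lo, hi = prev_k / growth, k
            for _ in range(refine_iters):      # shrink the bracket around the minimum
                m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
                if error(m1) < error(m2):
                    hi = m2
                else:
                    lo = m1
            k_ref = 0.5 * (lo + hi)
            if error(k_ref) < best_err:
                best_k, best_err = k_ref, error(k_ref)
        descending = err < prev_err
        prev_k, prev_err = k, err
        k *= growth
    return best_k, best_err
```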
Fig. 3. Typical behaviour of the solution error (logarithmic scale) as a function of k (the distance of the screen from the centre of projection). For small values of k, several smooth maxima alternating with sharp minima appear. For large values of k, the function asymptotically approaches a value dependent on the input parameters of the calibration problem. The global minimum occurs at the optimal value k_o of the projection parameter k.
3 Experimental Results
The described method has been implemented in C++ and tested on a set of artificial as well as real scenes. Traffic signs were used as objects of known size because they appear quite frequently in urban scenes, most of them are of a unified size, and good traffic sign detectors are available. The examined scenes typically contained 4 – 7 traffic signs. Whenever the surface in the scene was planar, the minimum of the error function was found. The error of the optimal solution was usually of the order of 10^{-5} – 10^{-17}. If the scene surface was not planar but curved, the error function did not contain any minima, and thus the method could not be applied. Examples of both artificial and real calibrated scenes follow in Fig. 4 and Fig. 5. The whole procedure suffers from small inaccuracies, the main sources being geometric distortion of the image caused by the acquisition process and imperfect detection of objects. The inaccuracies are further amplified when backprojecting object images onto real-world coordinates, which are later utilised in the estimation of the surface plane. If the parameters of the camera are not known, geometric distortion of the image cannot easily be dealt with. However, the accuracy of the object detection can be controlled very well. The effect of detection inaccuracy on the calculation of object real-world coordinates is considerable, especially when the images of objects are small (i.e. far or small objects). See Fig. 6 for an illustration of the displacement. For example, a detection inaccuracy of 1 px causes approximately 28 % displacement for an object whose image is only 10 px large. The detection inaccuracy affects objects with larger images much less, because linear growth of the object image size causes quadratic growth of the maximal tolerable detection inaccuracy that will cause a constant displacement. When the maximal acceptable displacement is,
Fig. 4. Artificial scene. Top: scheme of the captured image; circles represent detected traffic signs with a diameter of 0.7 m on 2.27 m long rods. The image has been constructed using perspective projection of the scene onto a screen with a pixel size of 1.651 µm, which was positioned 1 cm away from the centre of projection. Each traffic sign is marked (from top to bottom) with the actual coordinates of the contact point with the ground-plane (the foot), its computed coordinates and the displacement relative to the real size of each object. Bottom: course of the error function for the depicted scene. Our method has found the minimum at the value k = 0.0103192 m = 1.03192 cm with error 5.82 · 10^{-6}, which is very close to the actual distance of the screen from the centre of projection.
Fig. 5. Real scene with a roundabout. Top: image of the scene with 4 detected traffic signs (black rectangles). Computed distances between signs are stated (black lines). Bottom: course of the error function of the solution for the depicted scene. Our method has found the global minimum at the value k = 0.00305137 m with error 2.86114 · 10^{-14} (note that the error function has two minima in this case). However, the ground truth is unknown in this case.
for example, 5 %, the largest tolerable detection inaccuracy is only a fraction of a pixel for object images which are about 10 px large, 3 px for images that are about 40 px large and 10 px for images which are about 70 px large. Thus, it is desirable to exploit the presence of bigger and nearer objects when calibrating a scene. A crucial aspect of automatic calibration is also the choice of an accurate object detector, or accurate manual marking of the objects when using semi-automatic calibration.
Fig. 6. Object displacement (in %) is the relative error of the computed real-world coordinates with respect to object size. The displayed chart shows how displacement depends on the size of the object image and on the error of object detection. Objects with small images suffer from immense displacement, while objects with mid-sized and big images are much more resistant to object detection inaccuracies.
4 Conclusions
We developed a simple scene calibration method which demands only a single view of the scene. The basic idea of the described method is to exploit the relative sizes and positions of known-sized objects in the image under the assumption that the objects lie in a plane. The method is able to work semi-automatically, if objects are marked manually, as well as automatically, if object detectors are used. The calibrated scene makes it possible to determine the coordinates of objects lying on the ground-plane, to measure distances between the objects, or to measure areas between them. Our approach is limited by the fact that it works only with planar scenes; despite this limitation, the method is practically usable in various environments (e.g. urban environments). The prerequisite is the presence of known-sized objects in the image and the knowledge of their positions with respect to the ground-plane.
The method has been tested on a set of artificial and real scenes. In the scenes with a planar surface, a solution has always been found. The precision of the solution is sufficient for many applications which demand a calibrated scene. Future research includes processing of scenes with a non-planar surface and a deeper sensitivity analysis of the method.

Acknowledgements. This work was supported by the European Commission under the contract FP7-215453 “WeKnowIt” and by the Ministry of Education, Youth and Sports of the Czech Republic by projects MSMT 2B06052 “Biomarker” and MSM 0021630528 “Security-Oriented Research in Information Technology”.
References 1. Faugeras, O.: Stratification of 3-D vision: projective, affine, and metric representations. Journal of the Optical Society of America 12, 465–484 (1995) 2. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2003) 3. Salman, N., Yvinec, M.: High Resolution Surface Reconstruction From Overlapping Multiple-Views. In: Proceedings of the 25th Annual Symposium on Computational Geometry, pp. 104–105. ACM, New York (2009) 4. Szeliski, R., Kang, S.B.: Direct Methods for Visual Scene Reconstruction. In: Proceedings of the IEEE Workshop on Representation of Visual Scenes, pp. 26–33. IEEE, Los Alamitos (2002) 5. Koenderink, J.J., Van Doorn, A.J.: Affine Structure from Motion. Journal of the Optical Society of America 8, 377–385 (1991) 6. Christy, S., Horaud, R.: Euclidean Shape and Motion from Multiple Perspective Views by Affine Iterations. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 1098–1104 (2002) 7. Avitzour, D.: Novel Scene Calibration Procedure for Video Surveillance Systems. IEEE Transactions on Aerospace and Electronic Systems 40, 1105–1110 (2004) 8. Criminisi, A., Reid, I., Zisserman, A.: Single View Metrology. International Journal of Computer Vision 40, 1105–1110 (2000) 9. Masoud, O., Papanikolopoulos, N.P.: Using Geometric Primitives to Calibrate Traffic Scenes. Transportation Research Part C: Emerging Technologies 15, 361–379 (2007) 10. Huynh, D.Q.: Affine Reconstruction from Monocular Vision in the Presence of A Symmetry Plane. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, pp. 476–482. IEE, Los Alamitos (2002) 11. Sohon, F.W.: The Stereographic Projection. Chemical Publishing Company, Brooklyn (1941) 12. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (2002)