Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4678
Jacques Blanc-Talon Wilfried Philips Dan Popescu Paul Scheunders (Eds.)
Advanced Concepts for Intelligent Vision Systems 9th International Conference, ACIVS 2007 Delft, The Netherlands, August 28-31, 2007 Proceedings
Volume Editors Jacques Blanc-Talon DGA/D4S/MRIS, CEP/GIP 16 bis, rue Prieur de la côte d’or, 94114 Arcueil, France E-mail:
[email protected] Wilfried Philips Ghent University, Telecommunications and Information Processing (TELIN) St.-Pietersnieuwstraat 41, 9000 Ghent, Belgium E-mail:
[email protected] Dan Popescu CSIRO ICT Centre, Macquarie University Campus Herring Road, North Ryde, NSW 2113, Australia E-mail:
[email protected] Paul Scheunders University of Antwerp, Vision Lab Universiteitsplein 1 (N Building), 2610 Antwerp, Belgium E-mail:
[email protected]
Library of Congress Control Number: 2007933316
CR Subject Classification (1998): I.4, I.5, I.3, I.2.10
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-74606-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-74606-5 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12115747 06/3180 543210
Preface
This volume collects the papers accepted for presentation at the Ninth International Conference on “Advanced Concepts for Intelligent Vision Systems” (ACIVS 2007). The ACIVS conference was established in 1999 in Baden-Baden (Germany) as part of a large multiconference. Since then ACIVS has developed into an independent scientific event and has maintained the tradition of being a single-track event with oral presentations of 25 minutes each, even though the number of participants has been growing steadily every year. The conference currently attracts computer scientists from more than 20 countries, mostly from Europe, Australia and Japan, but also from the USA, Asia and the Middle East.

Although ACIVS is a conference on all areas of image and video processing, submissions tend to cluster within some major fields of interest. More than a quarter of the selected papers deal with image and video coding, motion estimation, moving object detection and other video applications. This year, topics related to biometrics, pattern recognition and scene understanding for security applications (including face recognition) constitute about a fifth of the conference. Image processing – which has been the core of the conference over the years – loses slightly in volume, while more than a third of the selected papers deal with computer vision, scene interpretation and many dedicated applications.

We would like to thank the invited speakers James Crowley (INRIA/GRAVIR), André Gagalowicz (INRIA/MIRAGES), Ron Kimmel (Technion Haifa) and Peter Centen (Thomson Grass Valley) for enhancing the technical program with their presentations.

A conference like ACIVS would not be feasible without the concerted effort of many people and the support of various institutions. The paper submission and review procedure was carried out electronically and a minimum of three reviewers were assigned to every paper. From 221 submissions, 45 were selected for oral presentation and 55 as posters. A large and energetic Program Committee, helped by additional referees (about 220 people) – listed on the following pages – completed the long and demanding reviewing process. We would like to thank all of them for their timely and high-quality reviews.

Also, we would like to thank our sponsors, Philips Research, Barco, Eurasip, the IEEE Benelux Signal Processing Chapter and the Flemish FWO Research Community on Audiovisual Systems, for their valuable support.

Last but not least, we would like to thank all the participants who trusted us in organizing this event for the ninth time. We hope they attended a stimulating scientific event and enjoyed the atmosphere of the ACIVS social events in the historic city of Delft.

July 2007
J. Blanc-Talon D. Popescu W. Philips P. Scheunders
Organization
ACIVS 2007 was organized by the Technical University of Delft and Ghent University.
Steering Committee Jacques Blanc-Talon (DGA/MRIS, Arcueil, France) Wilfried Philips (Ghent University, Ghent, Belgium) Dan Popescu (CSIRO, Sydney, Australia) Paul Scheunders (University of Antwerp, Wilrijk, Belgium)
Organizing Committee Pieter Jonker (Delft University of Technology, Delft, The Netherlands) Mandy Jungschlager (Delft University of Technology, Delft, The Netherlands) Wilfried Philips (Ghent University, Ghent, Belgium) Paul Scheunders (University of Antwerp, Wilrijk, Belgium)
Sponsors
ACIVS 2007 was sponsored by the following organizations:
– Philips Research
– NXP Semiconductors
– The IEEE Benelux Signal Processing Chapter
– Eurasip
– Barco
– DSP Valley
– The FWO Research Community on Audiovisual Systems (AVS)
The ACIVS 2007 organizers are especially grateful to NXP Semiconductors for their financial sponsorship.
Program Committee Hamid Aghajan (Stanford University, Stanford, USA) Fritz Albregtsen (University of Oslo, Oslo, Norway) Marc Antonini (Université de Nice Sophia Antipolis, Nice, France) Kenneth Barner (University of Delaware, Newark, USA) Attila Baskurt (INSA Lyon, Villeurbanne, France) Laure Blanc-Feraud (CNRS, Sophia-Antipolis, France)
Philippe Bolon (University of Savoie, Annecy, France) Salah Bourennane (Ecole Centrale de Marseille, Marseille, France) Patrick Bouthemy (IRISA/INRIA, Rennes, France) Jocelyn Chanussot (INPG, Grenoble, France) Pamela Cosman (University of California at San Diego, La Jolla, USA) Yves D’Asseler (Ghent University, Ghent, Belgium) Jennifer Davidson (Iowa State University, Ames, USA) Arturo de la Escalera Hueso (Universidad Carlos III de Madrid, Leganes, Spain) Ricardo de Queiroz (Universidade de Brasilia, Brasilia, Brazil) Christine Fernandez-Maloigne (Universit´e de Poitiers, Chasseneuil, France) Don Fraser (University of New South Wales, Canberra, Australia) Theo Gevers (University of Amsterdam, Amsterdam, The Netherlands) J´erˆome Gilles (CEP, Arcueil, France) Georgy Gimel’farb (The University of Auckland, Auckland, New Zealand) Daniele Giusto (University of Cagliari, Cagliari, Italy) Dimitris Iakovidis (University of Athens, Athens, Greece) John Illingworth (University of Surrey, Guildford, UK) Fr´ed´eric Jurie (CNRS - INRIA, Saint Ismier, France) Andrzej Kasinski (Poznan University of Technology, Poznan, Poland) Richard Kleihorst (NXP Semiconductors Research, Eindhoven, The Netherlands) Murat Kunt (EPFL, Lausanne, Switzerland) Hideo Kuroda (Nagasaki University, Nagasaki, Japan) Kenneth Lam (The Hong Kong Polytechnic University, Hong Kong, China) Peter Lambert (Ghent University, Ledeberg-Ghent, Belgium) Bangjun Lei (China Three Gorges University, Yichang, China) Henri Maitre (Ecole Nationale Sup´erieure des T´el´ecommunications, Paris, France) Xavier Maldague (Universit´e de Laval, Qu´ebec, Canada) Eric Marchand (IRISA/INRIA, Rennes, France) G´erard Medioni (USC/IRIS, Los Angeles, USA) Fabrice M´eriaudeau (IUT Le Creusot, Le Creusot, France) Alfred Mertins (Universit¨ at zu L¨ ubeck, L¨ ubeck, Germany) Rafael Molina (Universidad de Granada, Granada, Spain) Adrian Munteanu (Vrije Universiteit Brussel, Brussels, Belgium) Vittorio Murino (Universit` a degli Studi di Verona, Verona, Italy) Laurent Najman (ESIEE, Paris, France) Edgard Nyssen (Vrije Universiteit Brussel, Brussels, Belgium) Nikos Paragios (Ecole Centrale de Paris, Chatenay-Malabry, France) Jussi Parkkinen (University of Joensuu, Joensuu, Finland) Fernando Pereira (Instituto Superior T´ecnico, Lisbon, Portugal) Stuart Perry (Canon Information Systems Research Australia, Sydney, Australia) B´eatrice Pesquet-Popescu (ENST, Paris, France) Matti Pietik¨ ainen (University of Oulu, Oulu, Finland)
Aleksandra Pižurica (Ghent University, Ghent, Belgium) Gianni Ramponi (Trieste University, Trieste, Italy) Paolo Remagnino (Faculty of Technology, Kingston University, Surrey, UK) Joseph Ronsin (IETR, Rennes, France) Luis Salgado Álvarez de Sotomayor (Universidad Politécnica de Madrid, Madrid, Spain) Hugues Talbot (ESIEE, Noisy-le-Grand, France) Kenneth Tobin (Oak Ridge National Laboratory, Oak Ridge, USA) Frederic Truchetet (Université de Bourgogne, Le Creusot, France) Dimitri Van De Ville (EPFL, Lausanne, Switzerland) Iris Vanhamel (Vrije Universiteit Brussel, Brussels, Belgium) Ewout Vansteenkiste (Ghent University, Ghent, Belgium) Peter Veelaert (University College Ghent, Ghent, Belgium)
Reviewers Arnaldo Abrantes (ISEL, Lisbon, Portugal) Hamid Aghajan (Stanford University, Stanford, USA) Alexandre Alahi (Swiss Federal Institute of Technology, Lausanne, Switzerland) Fritz Albregtsen (University of Oslo, Oslo, Norway) David Alleyson (Grenoble University, Grenoble, France) Jesus Angulo (Ecole des Mines de Paris, Fontainebleau, France) Marc Antonini (Universit´e de Nice Sophia Antipolis, Nice, France) Didier Auroux (Universit´e Paul Sabatier, Toulouse, France) Tuncer Aysal (McGill University, Montreal, Canada) Attila Baskurt (INSA Lyon, Villeurbanne, France) Rik Bellens (Ghent University, Ghent, Belgium) Gilles Bertrand (ESIEE, Marne-la-Vall´ee, France) Jens Bialkowski (Universit¨ at Erlangen-N¨ urnberg, Erlangen, Germany) Jacques Blanc-Talon (DGA/MRIS, Arcueil, France) Wayne Blanding (University of Connecticut, USA) Isabelle Bloch (Ecole Nationale Sup´erieure des T´el´ecommunications, Paris, France) Philippe Bolon (University of Savoie, Annecy, France) Patrick Bonnin (Universit´e de Versailles, Velizy, France) Alberto Borghese (University of Milan, Milan, Italy) Salah Bourennane (Ecole Centrale de Marseille, Marseille, France) Patrick Bouthemy (IRISA/INRIA, Rennes, France) Salim Bouzerdoum (University of Wollongong, Australia) Ralph Braspenning (Philips Research, Eindhoven, The Netherlands) Alice Caplier (INPG, Grenoble, France) Douglas Chai (Edith Cowan University, Australia) Jocelyn Chanussot (INPG, Grenoble, France) Jean-Marc Chassery (INPG, Grenoble, France) Kacem Chedi (ENSSAT, Lannion, France) Sei-Wang Chen (National Taiwan Normal University, Taipei, Taiwan)
Olivier Colot (University of Lille, Villeneuve d’Ascq, France) Pamela Cosman (University of California at San Diego, La Jolla, USA) Emmanuel D’Angelo (CEP, Arcueil, France) Nicola D’Apuzzo (Homometrica Consulting, Zurich, Switzerland) Yves D’Asseler (Ghent University, Ghent, Belgium) Matthew Dailey (Asian Institute of Technology, Klong Luang, Thailand) Jennifer Davidson (Iowa State University, Ames, USA) Steve De Backer (University of Antwerp, Wilrijk, Belgium) Johan De Bock (Ghent University, Ghent, Belgium) Arturo de la Escalera Hueso (Universidad Carlos III de Madrid, Leganes, Spain) Lieven De Lathauwer (ENSEA, Cergy, France) Ricardo de Queiroz (Universidade de Brasilia, Brasilia, Brazil) Herv´e Delingette (INRIA, Sophia-Antipolis, France) Patrice Delmas (The University of Auckland, Auckland, New Zealand) Claude Delpha (SUPELEC, Gif, France) Kamil Dimililer (Near East University, Nicosia, Cyprus) Karen Drukker (University of Chicago, Chicago, USA) Touradj Ebrahimi (EPFL, Lausanne, Switzerland) Abir El abed (Laboratoire d’Informatique de Paris 6, Paris, France) Ahmet Elgammal (Rutgers University, USA) Valentin Enescu (Vrije Universiteit Brussel, Brussels, Belgium) Fr´ed´eric Falzon (ALCATEL-ALENIA, Cannes, France) Aly Farag (University of Louisville, USA) Dirk Farin (TU-Eindhoven, Eindhoven, The Netherlands) Hamed Fatemi (Eindhoven University, Eindhoven, The Netherlands) Christine Fernandez-Maloigne (Universit´e de Poitiers, Chasseneuil, France) David Filliat (ENSTA, Paris, France) James Fowler (Mississipi State University, Starkville, USA) Don Fraser (University of New South Wales, Canberra, Australia) Hans Frimmel (CSIRO e-health Centre, Brisbane, Australia) Andr´e Gagalowicz (INRIA, Rocquencourt, France) ShaoShuai Gao (NIST, USA) Sidharta Gautama (Ghent University, Ghent, Belgium) Theo Gevers (University of Amsterdam, Amsterdam, The Netherlands) J´erˆome Gilles (CEP, Arcueil, France) Daniele Giusto (University of Cagliari, Cagliari, Italy) Bart Goossens (Ghent University, Ghent, Belgium) D.S. Guru (University of Mysore, Mysore, India) Allan Hanbury (Vienna University of Technology, Vienna, Austria) Rachid Harba (Universit´e d’Orl´eans, Orl´eans, France) Mark Hedley (CSIRO ICT Centre, Sydney, Australia) Mark Holden (CSIRO ICT Centre, Sydney, Australia) Dimitris Iakovidis (University of Athens, Athens, Greece) J´erˆome Idier (IRCCyN, Nantes, France) Fr´ed´eric Jurie (CNRS - INRIA, Saint Ismier, France)
Martin Kampel (Vienna University of Technology, Vienna, Austria) Stavros Karkanis (Technological Educational Institute (TEI) of Lamia, Lamia, Greece) Andrzej Kasinski (Poznan University of Technology, Poznan, Poland) Scott King (Texas A&M University - Corpus Christi, Corpus Christi, USA) Richard Kleihorst (NXP Semiconductors Research, Eindhoven, The Netherlands) Pertti Koivisto (Tampere University of Technology, Finland) Stephan Kopf (Mannheim University, Mannheim, Germany) Murat Kunt (EPFL, Lausanne, Switzerland) Matthias Kunter (Technische Universit¨ at Berlin, Berlin, Germany) Hideo Kuroda (Nagasaki University, Nagasaki, Japan) Arijit Laha (Institute for Development and Research in Banking Technology, Hyderabad, India) Kenneth Lam (The Hong Kong Polytechnic University, Hong Kong, China) Peter Lambert (Ghent University, Ledeberg-Ghent, Belgium) Guillaume Lavoue (INSA, Lyon, France) Jean-Pierre Lecadre (IRISA, Rennes, France) Kuang-chih Lee (Riya Photo Search, USA) Bangjun Lei (China Three Gorges University, Yichang, China) Martin Lettner (Vienna University of Technology, Vienna, Austria) Rongxin Li (CSIRO ICT Centre, Epping, NSW, Australia) Chia-Wen Lin (National Chung Cheng University, Chiayi, Taiwan) Hiep Luong (Ghent University, Ghent, Belgium) Henri Maitre (Ecole Nationale Sup´erieure des T´el´ecommunications, Paris, France) Dimitrios Makris (Kingston University) Xavier Maldague (Universit´e de Laval, Qu´ebec, Canada) Antoine Manzanera (ENSTA, Paris, France) Eric Marchand (IRISA/INRIA, Rennes, France) Tom Matth´e (Ghent University, Ghent, Belgium) G´erard Medioni (USC/IRIS, Los Angeles, USA) Bernard Merialdo (EURECOM, France) Fabrice M´eriaudeau (IUT Le Creusot, Le Creusot, France) Alfred Mertins (Universit¨ at zu L¨ ubeck, L¨ ubeck, Germany) Maurice Milgram (Jussieu Universit´e, Paris, France) Ali Mohammad-Djafari (CNRS, Gif-sur-Yvette, France) Rafael Molina (Universidad de Granada, Granada, Spain) Greg Mori (Simon Fraser University, Burnaby, Canada) Chantal Muller (CREATIS LRMN - UMR CNRS 5220 - U630 INSERM - INSA Lyon, Villeurbanne, France) Adrian Munteanu (Vrije Universiteit Brussel, Brussels, Belgium) Vittorio Murino (Universit` a degli Studi di Verona, Verona, Italy)
Mike Nachtegael (Ghent University, Ghent, Belgium) Laurent Najman (ESIEE, Paris, France) Loris Nanni (University of Bologna, Bologna, Italy) Mai Nguyen-Verger (ENSEA, Cergy, France) Mark Nixon (University of Southampton, Southampton, UK) Edgard Nyssen (Vrije Universiteit Brussel, Brussels, Belgium) Daniel Ochoa (Escuela Superior Polit´ecnica del Litoral, Guayaquil, Ecuador) Matthias Odisio (University of Illinois at Urbana-Champaign, Urbana, USA) Nikos Paragios (Ecole Centrale de Paris, Chatenay-Malabry, France) Miu Kyu Park (Yonsei University, Seoul, Korea) Jussi Parkkinen (University of Joensuu, Joensuu, Finland) Fernando Pereira (Instituto Superior T´ecnico, Lisbon, Portugal) Stuart Perry (Canon Information Systems Research Australia, Sydney, Australia) B´eatrice Pesquet-Popescu (ENST, Paris, France) Sylvie Philipp-Foliguet (ETIS, Cergy, France) Wilfried Philips (Ghent University, Ghent, Belgium) Aleksandra Pizurica (Ghent University, Ghent, Belgium) Dan Popescu (CSIRO, Sydney, Australia) Gianni Ramponi (Trieste University, Trieste, Italy) Ilse Ravyse (Vrije Universiteit Brussel, Brussel, Belgium) Philippe R´efr´egier (Ecole Centrale de Marseille, Marseille, France) Paolo Remagnino (Faculty of Technology, Kingston University, Surrey, UK) Daniel Riccio (University of Salerno, Fisciano, Italy) Joost Rombaut (Ghent University, Ghent, Belgium) Joseph Ronsin (IETR, Rennes, France) Simon Rusinkiewicz (Princeton University, USA) ´ Luis Salgado Alvarez de Sotomayor (Universidad Polit´ecnica de Madrid, Madrid, Spain) Matilde Santos Pe˜ nas (University of Madrid, Spain) Paul Scheunders (University of Antwerp, Wilrijk, Belgium) Stefan Schulte (Ghent University, Ghent, Belgium) Daming Shi (Nanyang Technological University, Singapore, Singapore) Jan Sijbers (University of Antwerp, Wilrijk (Antwerpen), Belgium) Tadeusz Sliwa (IUT Le Creusot, Le Creusot, France) Peter Sturm (INRIA, France) Hugues Talbot (ESIEE, Noisy-le-Grand, France) Jean-Philippe Thiran (Swiss Federal Institute of Technology Lausanne, Lausanne, Switzerland) Kenneth Tobin (Oak Ridge National Laboratory, Oak Ridge, USA) Frederic Truchetet (Universit´e de Bourgogne, Le Creusot, France) Gabriel Tsechpenakis (University of Miami, USA) Dimitri Van De Ville (EPFL, Lausanne, Switzerland)
Gert Van de Wouwer (University of Antwerp, Wilrijk, Belgium) Iris Vanhamel (Vrije Universiteit Brussel, Brussels, Belgium) Ewout Vansteenkiste (Ghent University, Ghent, Belgium) Peter Veelaert (University College Ghent, Ghent, Belgium) Anne Wansek (CEP, Arcueil, France) A.M. Wink (University of Cambridge, UK) Marcel Worring (University of Amsterdam, Amsterdam, The Netherlands) Emmanuel Zenou (SUPAERO, Toulouse, France) Yue-Min Zhu (INSA, Lyon, France)
Table of Contents
Computer Vision A Framework for Scalable Vision-Only Navigation . . . . . . . . . . . . . . . . . . . ˇ Siniˇsa Segvi´ c, Anthony Remazeilles, Albert Diosi, and Fran¸cois Chaumette
1
Visual Tracking by Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Valentin Enescu, Ilse Ravyse, and Hichem Sahli
13
A New Approach to the Automatic Planning of Inspection of 3D Industrial Parts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J.M. Sebasti´ an, D. Garc´ıa, A. Traslosheros, F.M. S´ anchez, S. Dom´ınguez, and L. Pari
25
Low Latency 2D Position Estimation with a Line Scan Camera for Visual Servoing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peter Bri¨er, Maarten Steinbuch, and Pieter Jonker
37
Optimization of Quadtree Triangulation for Terrain Models . . . . . . . . . . . Refik Samet and Emrah Ozsavas
48
Analyzing DGI-BS: Properties and Performance Under Occlusion and Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pilar Merch´ an and Antonio Ad´ an
60
Real-Time Free Viewpoint from Multiple Moving Cameras . . . . . . . . . . . . Vincent Nozick and Hideo Saito
72
A Cognitive Modeling Approach for the Semantic Aggregation of Object Prototypes from Geometric Primitives: Toward Understanding Implicit Object Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peter Michael Goebel and Markus Vincze A Multi-touch Surface Using Multiple Cameras . . . . . . . . . . . . . . . . . . . . . . Itai Katz, Kevin Gabayan, and Hamid Aghajan
84
97
Fusion, Detection and Classification Fusion of Bayesian Maximum Entropy Spectral Estimation and Variational Analysis Methods for Enhanced Radar Imaging . . . . . . . . . . . . Yuriy Shkvarko, Rene Vazquez-Bautista, and Ivan Villalon-Turrubiates
109
A PDE-Based Approach for Image Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . Sorin Pop, Olivier Lavialle, Romulus Terebes, and Monica Borda Improvement of Classification Using a Joint Spectral Dimensionality Reduction and Lower Rank Spatial Approximation for Hyperspectral Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. Renard, S. Bourennane, and J. Blanc-Talon
121
132
Learning-Based Object Tracking Using Boosted Features and Appearance-Adaptive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bogdan Kwolek
144
Spatiotemporal Fusion Framework for Multi-camera Face Orientation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chung-Ching Chang and Hamid Aghajan
156
Independent Component Analysis-Based Estimation of Anomaly Abundances in Hyperspectral Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexis Huck and Mireille Guillaume
168
Unsupervised Multiple Object Segmentation of Multiview Images . . . . . . Wenxian Yang and King Ngi Ngan
178
Image Processing and Filtering Noise Removal from Images by Projecting onto Bases of Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bart Goossens, Aleksandra Piˇzurica, and Wilfried Philips
190
A Multispectral Data Model for Higher-Order Active Contours and Its Application to Tree Crown Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P´eter Horv´ ath
200
A Crossing Detector Based on the Structure Tensor . . . . . . . . . . . . . . . . . . Frank G.A. Faas and Lucas J. van Vliet
212
Polyphase Filter and Polynomial Reproduction Conditions for the Construction of Smooth Bidimensional Multiwavelets . . . . . . . . . . . . . . . . . Ana Ruedin
221
Multidimensional Noise Removal Method Based on Best Flattening Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Damien Letexier, Salah Bourennane, and Jacques Blanc-Talon
233
Low-Rank Approximation for Fast Image Acquisition . . . . . . . . . . . . . . . . . Dan C. Popescu, Greg Hislop, and Andrew Hellicar
242
A Soft-Switching Approach to Improve Visual Quality of Colour Image Smoothing Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Samuel Morillas, Stefan Schulte, Tom M´elange, Etienne E. Kerre, and Valent´ın Gregori Comparison of Image Conversions Between Square Structure and Hexagonal Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiangjian He, Jianmin Li, and Tom Hintz
254
262
Biometrics and Security Action Recognition with Semi-global Characteristics and Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Catherine Achard, Xingtai Qu, Arash Mokhber, and Maurice Milgram
274
Patch-Based Experiments with Object Classification in Video Surveillance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rob Wijnhoven and Peter H.N. de With
285
Neural Network Based Face Detection from Pre-scanned and Row-Column Decomposed Average Face Image . . . . . . . . . . . . . . . . . . . . . . Ziya Telatar, Murat H. Sazlı, and Irfan Muhammad
297
Model-Based Image Segmentation for Multi-view Human Gesture Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chen Wu and Hamid Aghajan
310
A New Partially Occluded Face Pose Recognition . . . . . . . . . . . . . . . . . . . . Myung-Ho Ju and Hang-Bong Kang
322
Large Head Movement Tracking Using Scale Invariant View-Based Appearance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gangqiang Zhao, Ling Chen, and Gencai Chen
331
Robust Shape-Based Head Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yunshu Hou, Hichem Sahli, Ravyse Ilse, Yanning Zhang, and Rongchun Zhao
340
Evaluating Descriptors Performances for Object Tracking on Natural Video Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mounia Mikram, R´emi M´egret, and Yannick Berthoumieu
352
A Simple and Efficient Eigenfaces Method . . . . . . . . . . . . . . . . . . . . . . . . . . Carlos G´ omez and B´eatrice Pesquet-Popescu
364
A New Approach to Face Localization in the HSV Space Using the Gaussian Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohamed Deriche and Imran Naseem
373
Gait Recognition Using Active Shape Models . . . . . . . . . . . . . . . . . . . . . . . . Woon Cho, Taekyung Kim, and Joonki Paik
384
Statistical Classification of Skin Color Pixels from MPEG Videos . . . . . . Jinchang Ren and Jianmin Jiang
395
A Double Layer Background Model to Detect Unusual Events . . . . . . . . . Joaquin Salas, Hugo Jimenez-Hernandez, Jose-Joel Gonzalez-Barbosa, Juan B. Hurtado-Ramos, and Sandra Canchola
406
Realistic Facial Modeling and Animation Based on High Resolution Capture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hae Won Byun
417
Image Processing and Restoration Descriptor-Free Smooth Feature-Point Matching for Images Separated by Small/Mid Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ping Li, Dirk Farin, Rene Klein Gunnewiek, and Peter H.N. de With A New Supervised Evaluation Criterion for Region Based Segmentation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Adel Hafiane, S´ebastien Chabrier, Christophe Rosenberger, and H´el`ene Laurent A Multi-agent Approach for Range Image Segmentation with Bayesian Edge Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Smaine Mazouzi, Zahia Guessoum, Fabien Michel, and Mohamed Batouche
427
439
449
Adaptive Image Restoration Based on Local Robust Blur Estimation . . . Hao Hu and Gerard de Haan
461
Image Upscaling Using Global Multimodal Priors . . . . . . . . . . . . . . . . . . . . Hiˆep Luong, Bart Goossens, and Wilfried Philips
473
A Type-2 Fuzzy Logic Filter for Detail-Preserving Restoration of Digital Images Corrupted by Impulse Noise . . . . . . . . . . . . . . . . . . . . . . . . . M. T¨ ulin Yildirim and M. Emin Y¨ uksel
485
Contrast Enhancement of Images Using Partitioned Iterated Function Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Theodore Economopoulos, Pantelis Asvestas, and George Matsopoulos A Spatiotemporal Algorithm for Detection and Restoration of Defects in Old Color Films . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bekir Dizdaroglu and Ali Gangal
497
509
Medical Image Processing Categorizing Laryngeal Images for Decision Support . . . . . . . . . . . . . . . . . . Adas Gelzinis, Antanas Verikas, and Marija Bacauskiene
521
Segmentation of the Human Trachea Using Deformable Statistical Models of Tubular Shapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Romulo Pinho, Jan Sijbers, and Toon Huysmans
531
Adaptive Image Content-Based Exposure Control for Scanning Applications in Radiography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Helene Schulerud, Jens Thielemann, Trine Kirkhus, Kristin Kaspersen, Joar M. Østby, Marinos G. Metaxas, Gary J. Royle, Jennifer Griffiths, Emily Cook, Colin Esbrand, Silvia Pani, Cristian Venanzi, Paul F. van der Stelt, Gang Li, Renato Turchetta, Andrea Fant, Sergios Theodoridis, Harris Georgiou, Geoff Hall, Matthew Noy, John Jones, James Leaver, Frixos Triantis, Asimakis Asimidis, Nikos Manthos, Renata Longo, Anna Bergamaschi, and Robert D. Speller
543
Shape Extraction Via Heat Flow Analogy . . . . . . . . . . . . . . . . . . . . . . . . . . . Cem Direko˘glu and Mark S. Nixon
553
Adaptive Vision System for Segmentation of Echographic Medical Images Based on a Modified Mumford-Shah Functional . . . . . . . . . . . . . . . Dimitris K. Iakovidis, Michalis A. Savelonas, and Dimitris Maroulis
565
Detection of Individual Specimens in Populations Using Contour Energies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel Ochoa, Sidharta Gautama, and Boris Vintimilla
575
Logarithmic Model-Based Dynamic Range Enhancement of Hip X-Ray Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Corneliu Florea, Constantin Vertan, and Laura Florea
587
A New Color Representation for Intensity Independent Pixel Classification in Confocal Microscopy Images . . . . . . . . . . . . . . . . . . . . . . . . Boris Lenseigne, Thierry Dorval, Arnaud Ogier, and Auguste Genovesio Colon Visualization Using Cylindrical Parameterization . . . . . . . . . . . . . . . Zhenhua Mai, Toon Huysmans, and Jan Sijbers Particle Filter Based Automatic Reconstruction of a Patient-Specific Surface Model of a Proximal Femur from Calibrated X-Ray Images for Surgical Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guoyan Zheng and Xiao Dong
597
607
616
Video Coding and Processing Joint Tracking and Segmentation of Objects Using Graph Cuts . . . . . . . . Aur´elie Bugeau and Patrick P´erez
628
A New Fuzzy Motion and Detail Adaptive Video Filter . . . . . . . . . . . . . . . Tom M´elange, Vladimir Zlokolica, Stefan Schulte, Val´erie De Witte, Mike Nachtegael, Aleksandra Piˇzurica, Etienne E. Kerre, and Wilfried Philips
640
Bridging the Gap: Transcoding from Single-Layer H.264/AVC to Scalable SVC Video Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jan De Cock, Stijn Notebaert, Peter Lambert, and Rik Van de Walle Improved Pixel-Based Rate Allocation for Pixel-Domain Distributed Video Coders Without Feedback Channel . . . . . . . . . . . . . . . . . . . . . . . . . . . Marleen Morb´ee, Josep Prades-Nebot, Antoni Roca, Aleksandra Piˇzurica, and Wilfried Philips Multiview Depth-Image Compression Using an Extended H.264 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yannick Morvan, Dirk Farin, and Peter H.N. de With Grass Detection for Picture Quality Enhancement of TV Video . . . . . . . . Bahman Zafarifar and Peter H.N. de With Exploitation of Combined Scalability in Scalable H.264/AVC Bitstreams by Using an MPEG-21 XML-Driven Framework . . . . . . . . . . . Davy De Schrijver, Wesley De Neve, Koen De Wolf, Davy Van Deursen, and Rik Van de Walle Moving Object Extraction by Watershed Algorithm Considering Energy Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kousuke Imamura, Masaki Hiraoka, and Hideo Hashimoto Constrained Inter Prediction: Removing Dependencies Between Different Data Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yves Dhondt, Stefaan Mys, Kenneth Vermeirsch, and Rik Van de Walle
652
663
675
687
699
711
720
Performance Improvement of H.264/AVC Deblocking Filter by Using Variable Block Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Seung-Ho Shin, Duk-Won Oh, Young-Joon Chai, and Tae-Yong Kim
732
Real-Time Detection of the Triangular and Rectangular Shape Road Signs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Boguslaw Cyganek
744
High-Resolution Multi-sprite Generation for Background Sprite Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Getian Ye
756
Motion Information Exploitation in H.264 Frame Skipping Transcoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qiang Li, Xiaodong Liu, and Qionghai Dai
768
Joint Domain-Range Modeling of Dynamic Scenes with Adaptive Kernel Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Borislav Anti´c and Vladimir Crnojevi´c
777
Competition Based Prediction for Skip Mode Motion Vector Using Macroblock Classification for the H.264 JM KTA Software . . . . . . . . . . . . Guillaume Laroche, Joel Jung, and B´eatrice Pesquet-Popescu
789
Efficiency of Closed and Open-Loop Scalable Wavelet Based Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Manuel F. L´ opez, Vicente Gonzalez Ruiz, and Inmaculada Garc´ıa
800
Spatio-temporal Information-Based Simple Deinterlacing Algorithm . . . . Gwanggil Jeon, Fang Yong, Joohyun Lee, Rokkyu Lee, and Jechang Jeong
810
Image Interpretation Fast Adaptive Graph-Cuts Based Stereo Matching . . . . . . . . . . . . . . . . . . . Michel Sarkis, Nikolas D¨ orfler, and Klaus Diepold
818
A Fast Level-Set Method for Accurate Tracking of Articulated Objects with an Edge-Based Binary Speed Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cristina Darolti, Alfred Mertins, and Ulrich G. Hofmann
828
Real-Time Vanishing Point Estimation in Road Sequences Using Adaptive Steerable Filter Banks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marcos Nieto and Luis Salgado
840
Self-Eigenroughness Selection for Texture Recognition Using Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jing-Wein Wang
849
Analysis of Image Sequences for Defect Detection in Composite Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . T. D’Orazio, M. Leo, C. Guaragnella, and A. Distante
855
Remote Sensing Imagery and Signature Fields Reconstruction Via Aggregation of Robust Regularization with Neural Computing . . . . . . . . . Yuriy Shkvarko and Ivan Villalon-Turrubiates
865
A New Technique for Global and Local Skew Correction in Binary Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Makridis, Nikos Nikolaou, and Nikos Papamarkos
877
System for Estimation of Pin Bone Positions in Pre-rigor Salmon . . . . . . Jens T. Thielemann, Trine Kirkhus, Tom Kavli, Henrik Schumann-Olsen, Oddmund Haugland, and Harry Westavik
888
Vertebral Mobility Analysis Using Anterior Faces Detection . . . . . . . . . . . M. Benjelloun, G. Rico, S. Mahmoudi, and R. Pr´evot
897
Image Processing Algorithms for an Auto Focus System for Slit Lamp Microscopy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Gierl, T. Kondo, H. Voos, W. Kongprawechon, and S. Phoojaruenchanachai
909
Applying Image Analysis and Probabilistic Techniques for Counting Olive Trees in High-Resolution Satellite Images . . . . . . . . . . . . . . . . . . . . . . J. Gonz´ alez, C. Galindo, V. Arevalo, and G. Ambrosio
920
An Efficient Closed-Form Solution to Probabilistic 6D Visual Odometry for a Stereo Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F.A. Moreno, J.L. Blanco, and J. Gonz´ alez
932
Color Image Segmentation Based on Type-2 Fuzzy Sets and Region Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Samy Tehami, Andr´e Bigand, and Olivier Colot
943
Image Interpretation ENMIM: Energetic Normalized Mutual Information Model for Online Multiple Object Tracking with Unlearned Motions . . . . . . . . . . . . . . . . . . . Abir El Abed, S´everine Dubuisson, and Dominique B´er´eziat
955
Geometrical Scene Analysis Using Co-motion Statistics . . . . . . . . . . . . . . . Zolt´ an Szl´ avik, L´ aszl´ o Havasi, and Tam´ as Szir´ anyi
968
Cascade of Classifiers for Vehicle Detection . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel Ponsa and Antonio L´ opez
980
Aerial Moving Target Detection Based on Motion Vector Field Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Carlos R. del-Blanco, Fernando Jaureguizar, Luis Salgado, and Narciso Garc´ıa
990
Image Coding Embedding Linear Transformations in Fractal Image Coding . . . . . . . . . . 1002 Michele Nappi and Daniel Riccio
Digital Watermarking with PCA Based Reference Images . . . . . . . . . . . . . 1014 Erkan Yavuz and Ziya Telatar JPEG2000 Coding Techniques Addressed to Images Containing No-Data Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1024 Jorge Gonz´ alez-Conejero, Francesc Aul´ı-Llin` as, Joan Bartrina-Rapesta, and Joan Serra-Sagrist` a A New Optimum-Word-Length-Assignment (OWLA) Multiplierless Integer DCT for Lossless/Lossy Image Coding and Its Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1037 Somchart Chokchaitam and Masahiro Iwahashi On Hybrid Directional Transform-Based Intra-band Image Coding . . . . . 1049 Alin Alecu, Adrian Munteanu, Aleksandra Piˇzurica, Jan Cornelis, and Peter Schelkens Analysis of the Statistical Dependencies in the Curvelet Domain and Applications in Image Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1061 Alin Alecu, Adrian Munteanu, Aleksandra Piˇzurica, Jan Cornelis, and Peter Schelkens A Novel Image Compression Method Using Watermarking Technique in JPEG Coding Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1072 Hideo Kuroda, Shinichi Miyata, Makoto Fujimura, and Hiroki Imamura Improved Algorithm of Error-Resilient Entropy Coding Using State Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1084 Yong Fang, Gwanggil Jeon, Jechang Jeong, Chengke Wu, and Yangli Wang Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1097
A Framework for Scalable Vision-Only Navigation

Siniša Šegvić, Anthony Remazeilles, Albert Diosi, and François Chaumette

INRIA/IRISA, Campus de Beaulieu, F-35042 Rennes Cedex, France
Abstract. This paper presents a monocular vision framework enabling feature-oriented appearance-based navigation in large outdoor environments containing other moving objects. The framework is based on a hybrid topological-geometrical environment representation, constructed from a learning sequence acquired during a robot motion under human control. The framework achieves the desired navigation functionality without requiring a global geometrical consistency of the underlying environment representation. The main advantages with respect to conventional alternatives are unlimited scalability, real-time mapping and effortless dealing with interconnected environments once the loops have been properly detected. The framework has been validated in demanding, cluttered and interconnected environments, under different imaging conditions. The experiments have been performed on many long sequences acquired from moving cars, as well as in real-time large-scale navigation trials relying exclusively on a single perspective camera. The obtained results imply that a globally consistent geometric environment model is not mandatory for successful vision-based outdoor navigation.
1 Introduction
The design of an autonomous mobile robot requires establishing a close relation between the perceived environment and the commands sent to the low-level controller. This necessitates complex spatial reasoning relying on some kind of internal environment representation [1]. In the mainstream model-based approach, a monolithic environment-centred representation is used to store the landmarks and the descriptions of the corresponding image features. The considered features are usually geometric primitives, while their positions are expressed in coordinates of the common environment-wide frame [2,3]. During the navigation, the detected features are associated with the elements of the model, in order to localize the robot, and to locate previously unobserved model elements. However, the success of such approach depends directly on the accuracy of the underlying model. This poses a strong assumption which impairs the scalability and, depending on the input, may not be attainable at all. The alternative appearance-based approach employs a sensor-centred representation of the environment, which is usually a multidimensional array of sensor
This work has been supported by the French national project Predit Mobivip, by the project Robea Bodega, and by the European MC IIF project AViCMaL.
J. Blanc-Talon et al. (Eds.): ACIVS 2007, LNCS 4678, pp. 1–12, 2007.
© Springer-Verlag Berlin Heidelberg 2007
readings. In the context of computer vision, the representation includes a set of key-images which are acquired during a learning stage and organized within a graph [4]. Nodes of the graph correspond to key-images, while the arcs link the images containing a required number of common landmarks. This is illustrated in Figure 1. The navigation between two neighbouring nodes is performed using
Fig. 1. Appearance-based navigation: the sketch of a navigation task (a), and the set of first eight images from the environment representation forming a linear graph (b). Note that the graph has been constructed automatically, as described in 3.1.
well developed techniques from the field of mobile robot control [5]. Different types of landmark representations have been considered in the literature, from the integral contents of a considered image [6] and global image descriptors [4], to more conventional point features such as Harris corners [2,7]. We consider the latter feature-oriented approach, in which the next intermediate key-image is reached by tracking common features from the previous key-image. Here, it is critical to recognize landmarks which recently entered the field of view, or regained a normal appearance after occlusion, motion blur or illumination disturbances. Estimating locations of invisible features (feature prediction) is therefore an essential capability in feature-oriented navigation. We present a novel framework for scalable mapping and localization, enabling robust appearance-based navigation in large outdoor environments. The framework is presented in a broader frame of an envisioned long-term architecture, while more details can be found in [8,9]. Mapping and navigation are considered separately as an interesting and not completely solved problem. The employed hierarchical environment representation [4,10] features a graph of key-images at the top, and local 3D reconstructions at the bottom layer. The global topological representation ensures an outstanding scalability, limits the propagation of association errors and simplifies consistency management in interconnected environments. On the other hand, the local geometric models enable accurate feature predictions. We strive to obtain the best predictions possible, and favour local over global consistency by avoiding a global environment model. The results of demanding robot control experiments demonstrate that a globally consistent 3D reconstruction is not required for a successful large-scale vision-based navigation. An appearance-based navigation approach with feature prediction has been described in [11]. Simplifying assumptions with respect to the motion of the robot
have been used, while the prediction was implemented using the intersection of the two epipolar lines, which has important limitations [12]. The need for feature prediction has been alleviated in [7], where the previously unseen features from the next key-image are introduced using wide-baseline matching [13]. A similar approach has been proposed in the context of omnidirectional vision [14]. In this closely related work, feature prediction based on point transfer [12] has been employed to recover from tracking failures, but not for feature introduction. However, wide-baseline matching [14,7] is prone to association errors due to ambiguous landmarks. In our experiments, substantially better feature introduction has been achieved by exploiting the point transfer predictions. In comparison with model-based navigation approaches such as the one described in [3], our approach does not require global consistency. By posing weaker requirements, we increase the robustness of the mapping phase, likely obtain better local consistencies, can close loops regardless of the extent of the accumulated drift, and have better chances to survive correspondence errors. Notable advances have been recently achieved in model-based SLAM [15]. Nevertheless, current implementations have limitations with respect to the number of mapped points, so that a prior learning step still seems a necessity in realistic navigation tasks. Our approach has no scaling problems: experiments with 15000 landmarks have been performed without any performance degradation. The paper is organized as follows. The envisioned architecture for vision-based navigation is described in Section 2. Details of the current implementation are given in Section 3. Section 4 provides the experimental results, while the conclusion is given in Section 5.
2 The Envisioned Architecture
The presented work is an incremental step towards a system for appearance-based navigation in interconnected structured environments, which is a long-term research goal in our laboratory [16]. The desired autonomous system would be capable of autonomously navigating in a previously mapped environment, towards a goal specified by a desired goal-image. The devised architecture assumes operation in three distinct phases, as illustrated in Figure 2(a). The mapping phase creates a topological–geometrical environment representation from a learning sequence acquired during robot motion under human control. The key-images are selected from the learning sequence and organized within a graph in which the arcs are defined between nodes sharing a certain number of common features. The matching features in the neighbouring nodes are used to recover a local 3D reconstruction, which is assigned to the corresponding arc. These features are considered for tracking whenever the robot arrives close to the viewpoints from which the two key-images were acquired. The task preparation phase is performed after the navigation task has been presented to the navigation system in the form of a desired goal-image, as illustrated in Figure 2(b). The initial topological localization corresponds to locating the current and the desired images in the environment graph by content-based
Fig. 2. The envisioned architecture for feature-oriented appearance-based navigation (a). The entries which are considered and implemented in this work are typeset in bold. The illustration of the three procedures from the task preparation phase (b).
image retrieval [16]. The two images are consequently injected into the graph using the correspondences obtained by wide-baseline matching. Finally, the optimal topological path is determined using a shortest path algorithm. The nodes of the determined path denote intermediate milestones through which the robot is supposed to navigate towards the desired goal. The navigation phase involves a visual servoing processing loop [17], in which the point features from images acquired in real-time are associated with their counterparts in the key-images. Thus, two distinct kinds of localization are required: (i) explicit topological localization, and (ii) implicit fine-level localization through the locations of the tracked landmarks. Topological location corresponds to the arc of the environment graph incident to the two key-images having most content in common with the current image. It is extremely important to maintain an accurate topological location as the navigation proceeds, since that defines the landmarks considered for localization. During the motion, the tracking may fail due to occlusions, motion blur, illumination effects or noise. Feature prediction makes it possible to deal with this problem and to resume the feature tracking on the fly while minimizing the chances for correspondence errors.
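As an illustration of the path planning step of the task preparation phase, the following sketch (in Python, not part of the original system) computes the optimal sequence of intermediate key-images with Dijkstra's algorithm; the arc costs are a hypothetical choice, since the paper does not specify the weighting used by its shortest path algorithm.

```python
import heapq

def shortest_topological_path(arcs, start, goal):
    """Dijkstra over the key-image graph.

    `arcs` maps a node index to a list of (neighbour, weight) pairs; the
    weights are illustrative (e.g. 1.0 per arc, or a cost that decreases
    with the number of shared landmarks |M_i|). Returns the list of
    key-image indices from `start` to `goal`, or None if unreachable.
    """
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in arcs.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None

# Example: a small circular graph of five key-images with unit arc costs.
arcs = {0: [(1, 1.0), (4, 1.0)], 1: [(0, 1.0), (2, 1.0)],
        2: [(1, 1.0), (3, 1.0)], 3: [(2, 1.0), (4, 1.0)],
        4: [(3, 1.0), (0, 1.0)]}
print(shortest_topological_path(arcs, 0, 3))   # -> [0, 4, 3]
```

With uniform arc costs the planner simply minimizes the number of intermediate key-images; a cost that decreases with |Mi| would instead favour well-connected arcs.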
3 Scalable Mapping and Localization
In the broader context presented in Section 2, we mainly address the mapping and the navigation phase, which have been implemented within the mapping and localization components of the framework. Both components rely on feature tracking and two-view geometry. The devised multi-scale differential tracker with warp correction and checking provides correspondences with few outliers. Bad tracks are identified by a threshold R on RMS residual between the warped current feature and the reference appearance. The employed warp includes isotropic scaling and affine contrast compensation [18]. The two-view geometry is recovered in a calibrated context by random sampling, with the five-point algorithm [19] as the hypothesis generator.
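For concreteness, the calibrated two-view estimation step can be approximated with off-the-shelf tools; the sketch below uses OpenCV's essential-matrix estimator, which combines the five-point algorithm with RANSAC. This is only an illustration of the technique, not the authors' implementation, and the parameter choices (confidence, inlier threshold) are assumptions.

```python
import numpy as np
import cv2

def relative_pose(pts_prev, pts_curr, K):
    """Estimate a calibrated two-view geometry (R, t with |t| = 1) from tracked
    point correspondences, using RANSAC with the five-point algorithm as the
    hypothesis generator (here via OpenCV, for illustration only).
    pts_prev, pts_curr: Nx2 float arrays of matched pixel coordinates.
    K: 3x3 camera calibration matrix."""
    E, inliers = cv2.findEssentialMat(pts_prev, pts_curr, cameraMatrix=K,
                                      method=cv2.RANSAC, prob=0.999,
                                      threshold=1.0)
    # Resolve the fourfold decomposition ambiguity by cheirality
    # (reconstructed points must lie in front of both cameras).
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_curr, cameraMatrix=K,
                                 mask=inliers)
    return R, t, inliers.ravel().astype(bool)
```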
For simplicity, the actual implementation allows only linear or circular topological representations. This obviates the need for the localization and planning procedures, which we have addressed previously [16]. The resulting implementation of the task preparation phase is described together with the localization component.

3.1 The Mapping Component
The mapping component constructs a linear environment graph and annotates its nodes and arcs with precomputed information. The nodes of the graph are formed by choosing the set of key-images Ii . The same indexing is used for arcs as well, by defining that arc i connects nodes i − 1 and i (cf. Figure 3). If the graph is circular, arc 0 connects the last node n − 1 with the node 0. Each node is assigned the set Xi of features from Ii , denoted by distinctive identifiers. Each arc is assigned an array of identifiers Mi denoting landmarks located in the two incident key-images, and annotated with the recovered two-view geometries Wi .
Fig. 3. The linear environment graph. Nodes contain images Ii , extracted features Xi and scale factors si . Arcs contain match arrays Mi and the two-view geometries Wi . The figure also shows the current image It , which is considered in 3.2. If the topological location is i + 1, the features considered for tracking belong to Wi , Wi+1 and Wi+2 .
The elements of Wi include motion parameters Ri and ti (|ti| = 1), and metric landmark reconstructions Qi. The two-view geometries Wi are deliberately not put into an environment-wide frame, since contradicting scale sequences can be obtained along the graph cycles. The scale ratio si between the incident geometries Wi and Wi+1 is therefore stored in the common node i. Neighbouring pairs of geometries Wi+1 and Wi+2 need to have some features in common, Mi+1 ∩ Mi+2 ≠ ∅, in order to enable the transfer of features from the next two key-images (Ii+1, Ii+2) on the path (cf. 3.2). Quantitatively, a particular arc of the map can be evaluated by the number of correspondences |Mi| and the estimate of the reprojection error σ(Wi) [12]. Different maps of the same environment can be evaluated by the total count of arcs in the graph |{Mi}|, and by the parameters of the individual arcs |Mi| and σ(Wi). It is usually favourable to have fewer arcs, since that ensures a smaller difference in lines of sight between the relevant key-images and the images acquired during navigation.
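A minimal data layout for this representation (our own Python sketch; the field names are illustrative and do not come from the paper) could look as follows:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Node:
    """Graph node: a key-image with its features and the scale ratio s_i
    linking the two incident local geometries (naming follows Fig. 3)."""
    image: np.ndarray            # key-image I_i
    features: dict               # X_i: landmark identifier -> 2D location
    scale_ratio: float = 1.0     # s_i between W_i and W_{i+1}

@dataclass
class Arc:
    """Graph arc: landmarks shared by the two incident key-images, and the
    local metric two-view geometry W_i expressed in the arc's own frame."""
    matches: list                             # M_i: identifiers of common landmarks
    R: np.ndarray = None                      # rotation of W_i
    t: np.ndarray = None                      # translation of W_i, |t| = 1
    Q: dict = field(default_factory=dict)     # metric reconstructions Q_i
    sigma: float = 0.0                        # reprojection error estimate sigma(W_i)

# A linear map is then simply two parallel lists of nodes and arcs, with
# arc i connecting nodes i-1 and i (and arc 0 closing the loop when circular).
```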
The devised mapping solution uses the tracker to find the most stable point features in a given subrange of the learning sequence. The tracker is initiated with all Harris points in the initial frame of the subrange. The features are tracked until the reconstruction error between the first and the current frame of the subrange rises above a predefined threshold σ. Then the current frame is discarded, while the previous frame is registered as the new node of the graph, and the whole procedure is repeated from there. This is similar to visual odometry [20], except that we employ larger feature windows and more involved tracking [18] in order to achieve more distinctive features and longer feature lifetimes. To ensure a minimum number of features within an arc of the graph, a new node is forced when the absolute number of tracked points falls below n. The above matching scheme can be complemented by wide-baseline matching [13] when there are discontinuities in the learning sequence caused by a large moving object, or a "frame gap" due to bad acquisition. Such events are reflected by a general tracking failure in the second frame of a new subrange. Wide-baseline matching is also useful for connecting a cycle in the environment graph. To test whether the learning sequence is acquired along a circular physical path, the first and the last key-image are subjected to matching: a circular graph is created on success, and a simple linear graph otherwise. In case of a monolithic geometric model, the loop closing process would need to be followed by a sophisticated map correction procedure, in order to try to correct the accumulated error. Due to the topological representation at the top level, this operation proceeds reliably and smoothly, regardless of the extent of the drift.
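The key-image selection loop can be summarized by the following sketch (Python, our own rendering of the procedure described above, not the authors' code); `track` and `reconstruction_error` stand in for the multi-scale tracker and the two-view reconstruction, and the thresholds play the roles of σ and n.

```python
def select_key_images(frames, track, reconstruction_error, sigma_max, n_min):
    """Sketch of the key-image selection loop.

    track(key_frame, frame, tracks) returns the surviving feature tracks
    (re-initialised with all Harris points when tracks is None), and
    reconstruction_error(tracks) the two-view reprojection error.
    Returns the indices of the selected key-images."""
    key_images = [0]
    tracks = None
    k = 1
    while k < len(frames):
        tracks = track(frames[key_images[-1]], frames[k], tracks)
        if reconstruction_error(tracks) > sigma_max or len(tracks) < n_min:
            if k - 1 == key_images[-1]:
                # tracking failed right after a key-image (a "frame gap"):
                # the paper falls back to wide-baseline matching here; the
                # sketch simply promotes the current frame to keep progressing
                key_images.append(k)
                k += 1
            else:
                # discard frame k, register the previous frame as a new node,
                # and restart tracking from that node (frame k is re-examined)
                key_images.append(k - 1)
            tracks = None
        else:
            k += 1
    return key_images
```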
3.2 The Localization Component
In the proposed framework, the tracked features belong either to the actual arc (topological location), or to the two neighbouring arcs, as illustrated in Figure 3. We focus on the on-line facets of the localization problem: (i) robust fine-level localization relying on feature prediction, and (ii) maintenance of the topological location as the navigation proceeds. Nevertheless, for completeness, we first present a minimalistic initialization procedure used in the experiments. The initialization procedure. The navigation program is started with the following parameters: (i) map of the environment, (ii) initial topological location of the robot (index of the actual arc), and (iii) calibration parameters of the attached camera. This is immediately followed by wide-baseline matching [13] of the current image with the two key-images incident to the actual arc. From the obtained correspondences, the pose is recovered in the actual geometric frame, which makes it possible to project the mapped features and to bootstrap the processing loop. Feature prediction and tracking resumption. The point features tracked in the current image It are employed to estimate the current two-view geometries Wt:i (Ii, It) and Wt:i+1 (Ii+1, It) towards the two incident key-images, using the same procedure as in 3.1. An accurate and efficient recovery of the three-view geometry is devised by a decomposed approach related to [21]. The approach
relies on recovering the relative scale between the two independently recovered metric frames, by enforcing the consistency of the common structure. The main advantages with respect to the "gold standard" method [12] are the utilization of pairwise correspondences (which is of particular interest for forward motion), and real-time performance. Thus, the three-view geometry (It, Ii, Ii+1) is recovered by adjusting the precomputed two-view geometry Wi+1 towards the more accurate (in terms of reprojection error) of Wt:i and Wt:i+1 (see Figure 3). The geometry (It, Ii+1, Ii+2) is recovered from Wi+2 and Wt:i+1, while (It, Ii−1, Ii) is recovered from Wi and Wt:i. Current image locations of landmarks mapped in the actual arc i + 1 are predicted by the geometry (It, Ii, Ii+1). Landmarks from the previous arc i and the next arc i + 2 are transferred by the geometries (It, Ii−1, Ii) and (It, Ii+1, Ii+2), respectively. Point transfer is performed only if the estimated reprojection error of the employed current geometry is within the safety limits. The predictions are refined (or rejected) by minimizing the residual between the warped current feature and the reference appearance. As in tracking, the result is accepted if the procedure converges near the predicted location, with an acceptable residual. An analogous procedure is employed to check the consistency of the tracked features, which occasionally "jump" to the occluding foreground. Maintaining the topological location. Maintaining a correct topological location is critical in sharp turns, where the tracked features die quickly due to contact with the image border. An incorrect topological location implies a suboptimal introduction of new features and may be followed by a failure due to insufficient features for calculating Wt:i and Wt:i+1. The best results have been obtained using a geometric criterion: a transition is taken when the reconstructed camera location overtakes the next key-image Ii+1. This can be expressed as ⟨−Ri+1 · ti+1, tt:i+1⟩ < 0. The decision is based on the geometry related to the next key-image, Wt:i+1, which is geometrically closer to the hypothesized transition. Backward transitions can be analogously defined in order to support reverse motion of the robot. After each transition, the reference appearances (references) are redefined for all relevant features in order to achieve better tracking. For a forward transition, references for the features from the actual geometry Wi+1 are taken from Ii+1, while the references for the features from Wi+2 are taken from Ii+2 (cf. Figure 3). Previously tracked points from the geometries Wi+1 and Wi+2 are instantly resumed using their previous positions and new references, while the features from Wi are discontinued.
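The forward-transition criterion amounts to a single dot-product test; a possible helper, assuming the rotation and translation conventions of the two-view geometries above:

```python
import numpy as np


def forward_transition(R_next, t_next, t_cur):
    """Return True when the camera has overtaken the next key-image I_{i+1}.

    R_next, t_next : rotation R_{i+1} and unit translation t_{i+1} of W_{i+1}
    t_cur          : translation t_{t:i+1} of the current geometry W_{t:i+1}
    Implements the test <-R_{i+1} t_{i+1}, t_{t:i+1}> < 0.
    """
    return float(np.dot(-R_next @ t_next, t_cur)) < 0.0
```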
4 Experimental Results
The performed experiments include mapping, off-line localization, and navigation (real-time localization and control). Off-line sequences and real-time images have been acquired with the robotic car Cycab, under human and automatic control.
4.1 Mapping Experiments
We first present quantitative mapping results obtained on the learning sequence ifsic5, corresponding to the reverse of the path shown in Figure 1(a). The analysis was performed in terms of the geometric model parameters introduced in 3.1: (i) |Mi|, (ii) σ(Wi), and (iii) |{Mi}|. Figure 4(a) shows the variation of |Mi| and σ(Wi) along the arcs of the created environment graph. A qualitative illustration of the inter-node distance (and |{Mi}|) is presented in Figure 4(b) as the sequence of recovered key-image poses (a common global scale has been enforced for visualisation purposes). The figure suggests that the mapping component adapts the density of key-images to the inherent difficulty of the scene. The dense nodes 7-14 correspond to the first difficult moment of the learning sequence: approaching the traverse building and passing underneath it. Nodes 20 to 25 correspond to the sharp left turn, while passing very close to a building. The hard conditions persisted after the turn due to large featureless bushes and a reflecting glass surface: this is reflected in dense nodes 26-28, cf. Figure 4(c). The number of features in arc 20 is exceptionally high, while the incident nodes 19 and 20 are very close. The anomaly is due to a large frame gap causing most feature tracks to terminate instantly. Wide-baseline matching succeeded in relating key-image 19 and its immediate successor, which consequently became key-image 20. The error peak in arc 21 is caused by another gap, which has been successfully bridged by the tracker alone.
Fig. 4. The mapping results on the sequence ifsic5 containing 1900 images acquired along a 150 m path: counts of mapped point features |Mi| and reprojection errors σ(Wi) (a), the reconstructed sequence of camera poses (b), and the 28 resulting key-images (c)
The second group of experiments concerns the learning sequence loop, taken along a circular path of approximately 50 m. We investigate the sensitivity of the mapping algorithm with respect to the three main parameters described in 3.1: (i) the minimum count of features n, (ii) the maximum allowed reprojection error
σ, and (iii) the RMS residual threshold R. The reconstructions obtained for 4 different parameter triples are presented in Figure 5. The presence of node 0’ indicates that the cycle at the topological level has been successfully closed by wide-baseline matching. Ideally, nodes 0’ and 0 should be very close; the extent of the distance indicates the magnitude of the error due to the accumulated drift. Reasonable and usable representations have been obtained in all cases, despite the smooth planar surfaces and vegetation which are visible in Figure 5(bottom). The experiments show that there is a direct coupling between the number of arcs |{Mi }| and the number of mapped features |Mi |. Thus, it is beneficial to seek the smallest |{Mi }| ensuring acceptable values for σ(Wi ) and |Mi |. The last map in Figure 5 (top-right) was deliberately constructed using suboptimal parameters, to show that our approach essentially works even in cases in which enforcing the global consistency is difficult. The navigation can smoothly proceed despite a discontinuity in the global geometric reconstruction, since the local geometries are “elastically” glued together by the continuous topological representation.
[Figure 5 panels: reconstructed poses for the parameter triples (n=100, σ=1, R=4), (n=50, σ=2, R=6), (n=50, σ=4, R=6) and (n=25, σ=2, R=6), each showing nodes 0 and 0'.]
Fig. 5. Reconstructed poses obtained on sequence loop, for different sets of mapping parameters (top). Actual key-images of the map obtained for n = 50, σ = 4, R = 6 (bottom). This map will be employed in localization experiments.
4.2 Localization Experiments
In the localization experiments, we measure the quantitative success in recognizing the mapped features. The results are summarized in Figure 6, where the counts of tracked features are plotted against the arcs of the employed map. We first present the results of performing the localization on two navigation sequences obtained for similar robot motion but under different illumination. Figure 6(a) shows that the proposed feature prediction scheme enables large-scale appearance-based navigation, as far as pure geometry is concerned. Figure 6(b) shows that useful results can be obtained even under different lighting conditions, when the feature loss at times exceeds 50%.
[Figure 6 plots: per-arc counts of total points, tracked max and tracked avg for (a) and (b), and average tracked features in the first and second round for (c).]
Fig. 6. Quantitative localization results: processing ifsic5 (a) and ifsic1 (b) on a map built on ifsic5, and using the map from Figure 5 over two rounds of loop (c)
The capability of the localization component to traverse cyclic maps was tested on a sequence obtained for two rounds roughly along the same circular physical path. This is quite a difficult scenario since it requires continuous and fast introduction of new features due to persistent changes of viewing direction. The first round was used for mapping (this is the sequence loop, discussed in Figure 5), while the localization is performed along the combined sequence, involving two complete rounds. During the acquisition, the robot was manually driven so that the two trajectories were more than 1 m apart on several occasions during the experiment. Nevertheless, the localization was successful in both rounds, as summarised in Figure 6(c). All features have been successfully located during the first round, while the outcome in the second round depends on the extent of the divergence between the two trajectories.
4.3 Navigation Experiments
In the navigation experiments, the Cycab was controlled in real time by visual servoing. The steering angle ψ was determined from the average x components of the current feature locations (xt, yt) ∈ Xt and of their correspondences in the next key-image (x*, y*) ∈ Xi+1: ψ = −λ (xt − x*), where λ ∈ R+. One of the large-scale navigation experiments involved a reference path of approximately 750 m, offering a variety of driving conditions including narrow sections, slopes and driving under a building. An earlier version of the program was used, allowing a control frequency of about 1 Hz. The navigation speed was set accordingly to 30 cm/s in turns and to 80 cm/s otherwise. The map was built on a learning sequence previously acquired under manual control. The robot smoothly completed the path despite a passing car occluding the majority of the features, as shown in Figure 7. Several similar encounters with pedestrians have been
Fig. 7. Images obtained during the execution of a navigation experiment. The points used for navigation re-appear after being occluded and disoccluded by a moving car.
processed in a graceful manner too. The system succeeded in mapping features (and subsequently in finding them) in seemingly featureless areas where the road and the grass occupied most of the field of view. The employed environment representation is not very accurate from the global point of view. Nevertheless, the system succeeds in performing large autonomous displacements, while also being robust to other moving objects. We consider this a strong indication of the potential for real applications of vision-based autonomous vehicles.
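The steering law used in these experiments reduces to a proportional control on the difference of average horizontal image coordinates; a minimal sketch (the default gain value is an arbitrary assumption of this illustration):

```python
def steering_angle(current_points, target_points, lam=0.5):
    """psi = -lambda * (mean x of tracked points - mean x of the same points in the next key-image).

    current_points, target_points : lists of (x, y) image coordinates of corresponding features
    lam                           : positive gain lambda
    """
    x_cur = sum(x for x, _ in current_points) / len(current_points)
    x_ref = sum(x for x, _ in target_points) / len(target_points)
    return -lam * (x_cur - x_ref)
```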
5 Conclusion
The paper described a novel framework for large-scale mapping and localization, based on point features mapped during a learning session. The purpose of the framework is to provide 2D image measurements for appearance-based navigation. The tracking of temporarily occluded and previously unseen features can be (re-)started on-the-fly due to feature prediction based on point transfer. 2D navigation and 3D prediction smoothly interact through a hierarchical environment representation. The navigation is concerned with the upper, topological level, while the prediction is performed within the lower, geometrical level. In comparison with the mainstream approach involving a monolithic geometric representation, the proposed framework enables robust large-scale navigation without requiring a geometrically consistent global view of the environment. This point has been demonstrated in the experiment with a circular path, in which the navigation bridges the first and the last node of the topology regardless of the extent of the accumulated error in the global 3D reconstruction. Thus, the proposed framework is applicable even in interconnected environments, where a global consistency may be difficult to enforce. The localization component requires imaging and navigation conditions such that enough of the mapped landmarks have recognizable appearances in the acquired current images. The performed experiments suggest that this can be achieved even with very small images, for moderate-to-large changes in imaging conditions. The difficult situations include featureless areas (smooth buildings, vegetation, pavement), photometric variations (strong shadows and reflections), and deviations from the reference path used to perform the mapping, due to control errors or obstacle avoidance. In the current implementation, the mapping and localization throughput on 320 × 240 gray-level images is 5 Hz and 7 Hz, respectively, using a notebook computer with a CPU roughly equivalent to a Pentium 4 at 2 GHz. Most of the processing time is spent within the point feature tracker, which uses a three-level image pyramid in order to be able to deal with large feature motion in turns. The computational complexity is an important issue: with more processing power we could deal with larger images and map more features, which would result in even greater robustness. Nevertheless, encouraging results in real-time autonomous robot control have been obtained even on very small images. In the light of future increases in processing performance, this suggests that the time of vision-based autonomous transportation systems is drawing near.
References

1. DeSouza, G.N., Kak, A.C.: Vision for mobile robot navigation: a survey. IEEE Trans. PAMI 24(2) (2002)
2. Burschka, D., Hager, G.D.: Vision-based control of mobile robots. In: Proc. of ICRA, Seoul, South Korea, pp. 1707–1713 (2001)
3. Royer, E., Lhuillier, M., Dhome, M., Chateau, T.: Localization in urban environments: Monocular vision compared to a differential GPS sensor. In: Proc. of CVPR, Washington, DC, vol. 2, pp. 114–121 (2005)
4. Gaspar, J., Santos-Victor, J.: Vision-based navigation and environmental representations with an omni-directional camera. IEEE Trans. RA 16(6), 890–898 (2000)
5. Samson, C.: Control of chained systems: application to path following and time-varying point stabilization. IEEE Trans. AC 40(1), 64–77 (1995)
6. Matsumoto, Y., Inaba, M., Inoue, H.: Exploration and navigation in corridor environment based on omni-view sequence. In: Proc. of IROS, Takamatsu, Japan, pp. 1505–1510 (2000)
7. Chen, Z., Birchfield, S.T.: Qualitative vision-based mobile robot navigation. In: Proc. of ICRA, Orlando, Florida, pp. 2686–2692 (2006)
8. Šegvić, S., Remazeilles, A., Diósi, A., Chaumette, F.: Large scale vision based navigation without an accurate global reconstruction. In: Proc. of CVPR, Minneapolis, Minnesota (2007)
9. Diósi, A., Remazeilles, A., Šegvić, S., Chaumette, F.: Experimental evaluation of an urban visual path following framework. In: Proc. of IFAC Symposium on IAV, Toulouse, France (2007)
10. Bosse, M., Newman, P., Leonard, J., Soika, M., Feiten, W., Teller, S.: An atlas framework for scalable mapping. In: Proc. of ICRA, Taiwan, pp. 1899–1906 (2003)
11. Hager, G.D., Kriegman, D.J., Georghiades, A.S., Ben-Shahar, O.: Toward domain-independent navigation: dynamic vision and control. In: Proc. of ICDC, Tampa, Florida, pp. 1040–1046 (1998)
12. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2004)
13. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors. Int. J. Comput. Vis. 60(1), 63–86 (2004)
14. Goedemé, T., Nuttin, M., Tuytelaars, T., Gool, L.V.: Omnidirectional vision based topological navigation. Int. J. Comput. Vis. (to appear)
15. Davison, A.: Real-time simultaneous localisation and mapping with a single camera. In: Proc. of ICCV, Nice, France, pp. 1403–1410 (2003)
16. Remazeilles, A., Chaumette, F., Gros, P.: 3D navigation based on a visual memory. In: Proc. of ICRA, Orlando, Florida, pp. 2719–2725 (2006)
17. Chaumette, F., Hutchinson, S.: Visual servo control, part I: Basic approaches. IEEE Robotics and Automation Magazine 13(4), 82–90 (2006)
18. Šegvić, S., Remazeilles, A., Chaumette, F.: Enhancing the point feature tracker by adaptive modelling of the feature support. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, Springer, Heidelberg (2006)
19. Nistér, D.: An efficient solution to the five-point relative pose problem. IEEE Trans. PAMI 26(6), 756–770 (2004)
20. Nistér, D., Naroditsky, O., Bergen, J.: Visual odometry. In: Proc. of CVPR, Washington, DC, pp. 652–659 (2004)
21. Lourakis, M., Argyros, A.: Fast trifocal tensor estimation using virtual parallax. In: Proc. of ICIP, Genoa, Italy, pp. 169–172 (2005)
Visual Tracking by Hypothesis Testing

Valentin Enescu, Ilse Ravyse, and Hichem Sahli

Vrije Universiteit Brussel (VUB), Interdisciplinary Institute for BroadBand Technology (IBBT), Department of Electronics & Informatics (ETRO), Pleinlaan 2, 1050 Brussel
{venescu,icravyse,hsahli}@etro.vub.ac.be
Abstract. A new approach for tracking a non-rigid target is presented. Tracking is formulated as a Maximum A Posteriori (MAP) segmentation problem where each pixel is assigned a binary label indicating whether it belongs to the target or not. The label field is modeled as a Markov Random Field whose Gibbs energy comprises three terms. The first term quantifies the error in matching the object model with the object’s appearance as given by the current segmentation. Coping with the deformations of the target while avoiding optical flow computation is achieved by marginalizing this likelihood over all possible motions per pixel. The second term penalizes the lack of continuity in the labels of the neighbor pixels, thereby encouraging the formation of a smoothly shaped object mask, without holes. Finally, for the sake of increasing robustness, the third term constrains the object mask to assume an elliptic shape model with unknown parameters. MAP optimization is performed iteratively, alternating between estimating the shape parameters and recomputing the segmentation using updated parameters. The latter is accomplished by discriminating each pixel via a simple hypothesis test. We demonstrate the efficiency of our approach on synthetic and real video sequences.
1 Introduction
Object tracking is an important task for many computer vision applications. Simple tracking techniques consider that the target has a primitive shape (ellipse or rectangle) and that its motion can be described by a parametric model (translation, rotation, affine) [1,2,3]. These trackers model the object appearance using image templates, color histograms, or joint spatial-color histograms and rely on a small number of parameters, which enables them to reach real-time operation. Good tracking performance is achieved as long as the target preserves its appearance and shape. Thus, challenging conditions such as non-rigid motion, partial occlusion, illumination variations, and out-of-plane rotation are not handled by these trackers and require models that explicitly cater for these factors. Tracking a deforming/articulated target can be achieved by encompassing the object region with an active contour (pioneered by [4]) and fitting it to the object region. It is worth noting that many tracking methods based on contour evolution [5,6,7,8,9] hinge upon segmentation methods that isolate the object from
the background based on some appearance features: image gradient [4,10,11], color/intensity properties of the object and background regions such as distribution functions [12] and homogeneity [13], or combinations thereof. Indeed, although tracking may additionally incorporate optical flow [6,7], it essentially segments the object in the current frame using a prior model (the region appearance) extracted from the previous frame. Evolution of the contour is governed by an energy functional which defines the smoothness of the contour as well as the contribution of various image features. Critical for the performance of these methods is the choice of the image-driven energy terms and the contour representation. The most recent algorithms use an implicit representation (based on level sets as proposed by [10,11]) that has numerous advantages over the explicit representation (based on control points) in snakes [4]. However, level set-based trackers have a complex mathematical formulation which makes their numerical implementation rather cumbersome and slow. Recently, shape constraints have been enforced on level set methods for explicitly handling occlusion [9,14]. A viable alternative to contour-based tracking is to recover the object mask in the current frame by maximum a posteriori (MAP) segmentation of the image into two regions, object and background, given a prior model of the object appearance. The binary label field corresponding to this segmentation is usually modeled as a Markov Random Field (MRF) [15]. In general, two types of constraints are encoded with the MRF models: the data constraint, aiming to reduce the discrepancy between the object model and its appearance induced by a given segmentation of the current image, and the smoothness constraint, which specifies that the object is a spatially coherent entity. Nevertheless, the general form of the Gibbs energy [15] associated with the MRF model enables the accommodation of a range of additional constraints, from shape constraints [16,17] to contrast [18] and motion continuity constraints [19] for pairs of adjacent pixels. All these constraints are enforced by minimizing the Gibbs energy, which is equivalent to computing the MAP label field. Currently popular MRF optimization methods for binary labeling include the Iterated Conditional Modes (ICM) algorithm [20] and the maximum flow (graph cut) algorithm [21]. Example applications are provided in [20,22] and [18,16,17,19], respectively. In this paper, we propose a new approach for tracking a non-rigid target based on the MAP-MRF paradigm. The novelty of our approach consists of three main points: i) the deployment of a new data-constraint (likelihood) term which takes into account the target structure and its possible deformations while avoiding optical flow computation, ii) the empirical estimation of the parameters of an elliptic shape model which constrains the segmentation solution, and iii) the derivation of a fast iterative optimization algorithm based on a simple probabilistic test. Among the works related to ours, we cite [19] and [17], on which we partially draw. The sequel of this paper is organized as follows. In Section 2, we formulate the tracking problem and elaborate on the three components of the tracking model. In Section 3, we present the iterative optimization algorithm for tracking, while in Section 4 some experimental results and discussions are provided. Section 5 concludes the paper.
2 Problem Formulation
Let x_k be an image mask consisting of a set of binary labels {x_k^i}, where x_k^i assigns the pixel i at time k to one of the following classes: object of interest (x_k^i = 1) and background (x_k^i = 0). Let c_k^i represent the color information of the pixel i at time k. We formulate tracking as a segmentation problem where, given two video frames, I_k = {c_k^i} (current frame) and I_{k-1} = {c_{k-1}^i} (initial frame), and the initial object mask, x_{k-1}, the goal is to determine the current object mask, x_k. In doing so, we want to obtain a smooth object mask and also stimulate the segmentation to partially obey an elliptic shape model with unknown parameters Θ. To this end, we cast the problem in a Bayesian framework where x_k and Θ can be found by optimizing the posterior probability given by the Gibbs distribution

p(x_k, \Theta \mid I_k, D_{k-1}) = \frac{1}{Z} \exp\big(-E(x_k, \Theta)\big),    (1)

where D_{k-1} \triangleq \{x_{k-1}, I_{k-1}\} and Z is the partition function (a normalizing constant that does not depend on x_k and Θ). E(x_k, Θ) is an energy function

E(x_k, \Theta) = E_{\mathrm{data}}(I_k \mid x_k, D_{k-1}) + E_{\mathrm{smooth}}(x_k) + E_{\mathrm{shape}}(x_k \mid \Theta),    (2)

defined as the summation of three energy terms encoding various constraints, as detailed in the following. A uniform prior term for Θ could be added in (2), but we prefer to estimate Θ in a heuristic manner rather than probabilistically.
2.1 Data Term
The energy E_data penalizes the mismatch between the current image and a given segmentation:

E_{\mathrm{data}}(I_k \mid x_k, D_{k-1}) = -\log p(I_k \mid x_k, D_{k-1}),    (3)

where p(I_k | x_k, D_{k-1}) is the image likelihood. Assuming the colors of the pixels are conditionally independent, we can decompose E_data as

E_{\mathrm{data}}(I_k \mid x_k, D_{k-1}) = \sum_{i=1}^{N} V(c_k^i \mid x_k^i, D_{k-1}),    (4)

where V(c_k^i | x_k^i, D_{k-1}) is a potential function defined as

V(c_k^i \mid x_k^i, D_{k-1}) = -\log p(c_k^i \mid x_k^i, D_{k-1}),    (5)

and N is the number of pixels in the scan region (whose meaning will be defined shortly). We assume that a pixel cannot move more than Nm pixels horizontally or vertically between two frames. Thus, a pixel i in the current frame may correspond to a pixel j in the initial frame that belongs to a circular neighborhood N^i_{k-1} of radius Nm, centered on the position of pixel i. Alternatively, if pixel i is disoccluded in the current frame, then it has no correspondence in the initial frame. Since a pixel cannot move more than Nm pixels between two successive frames, to find the object mask in the current frame we do not need to scan the whole frame, but only a region obtained by dilating the initial object mask with Nm pixels. Henceforth, this region is referred to as the scan region. An example can be viewed in Fig. 2(b), where the marked scan region corresponds to the second frame of the sequence and is based on the object mask in the first frame. Now, instead of computing the pixel correspondences (which is the difficult process of optical flow estimation), we prefer to compute the pixel likelihood p(c_k^i | x_k^i, D_{k-1}) by marginalizing the joint probability of the pixel's color and the potential correspondences i → j, including the event of correspondence to none:

p(c_k^i \mid x_k^i, D_{k-1}) = \sum_{j \in N^i_{k-1} \cup \{\mathrm{none}\}} p(c_k^i, i \to j \mid x_k^i, D_{k-1})    (6)

Using the Bayes theorem and the chain rule for the summation term in (6) yields

p(c_k^i, i \to j \mid x_k^i, D_{k-1}) = \frac{P(x_k^i \mid c_k^i, i \to j, D_{k-1})\; p(c_k^i \mid i \to j, D_{k-1})\; P(i \to j \mid D_{k-1})}{P(x_k^i \mid D_{k-1})}    (7)

The first multiplicand in the numerator of (7) is found by observing that the label x_k^i depends only on the correspondence i → j and the label x_{k-1}^j:

P(x_k^i \mid c_k^i, i \to j, D_{k-1}) = \begin{cases} P(x_k^i \mid i \to \mathrm{none}) & j = \mathrm{none}, \\ P(x_k^i \mid i \to j, x_{k-1}^j) & j \in N^i_{k-1}, \end{cases}    (8)

where P(x_k^i | i → j, x_{k-1}^j) is the probability of the label at pixel i when its corresponding pixel in the initial frame, along with its segmentation label, are known. Since the segmentation for the initial frame may contain errors, this probability can be specified as [19]

P(x_k^i \mid i \to j, x_{k-1}^j) = \begin{cases} P_{\mathrm{error}} & x_k^i \neq x_{k-1}^j, \\ 1 - P_{\mathrm{error}} & x_k^i = x_{k-1}^j, \end{cases}    (9)

where P_error is a constant that approximates the probability of a segmentation label being incorrect. P(x_k^i | i → none) is the probability of the label at pixel i with no corresponding pixel in the initial frame. Since this occurs when a pixel is disoccluded, this probability is set to

P(x_k^i \mid i \to \mathrm{none}) = \begin{cases} P_{\mathrm{dis}} & x_k^i = 1, \\ 1 - P_{\mathrm{dis}} & x_k^i = 0, \end{cases}    (10)

where P_dis is another constant. The second multiplicand in the numerator of (7) reduces to p(c_k^i | i → j, c_{k-1}^j) as the color of the pixel i in the current frame depends solely on the color of the corresponding pixel j in the initial frame. The color of the pixel i in the current frame, c_k^i, is modeled as normally distributed with mean equal to the color of the corresponding pixel j in the initial frame, or as uniformly distributed for pixel i corresponding to none [19]:

p(c_k^i \mid i \to j, c_{k-1}^j) = \begin{cases} U(c_k^i) & j = \mathrm{none}, \\ G(c_k^i; c_{k-1}^j, C) & j \in N^i_{k-1}, \end{cases}    (11)

where G(x; x̄, C) is a normal distribution of mean x̄ and covariance C (C is a diagonal matrix with the same variance, σ², for all color components, matching the illumination variation between frames), and U is a uniform distribution on the color space (RGB or normalized RGB in our implementation). The third multiplicand in the numerator of (7) is the prior probability of the event that pixel i in the current frame corresponds to pixel j or to none:

P(i \to j) = \begin{cases} P_{\mathrm{none}} & j = \mathrm{none}, \\ \frac{1 - P_{\mathrm{none}}}{|N^i_{k-1}|} & j \in N^i_{k-1}, \end{cases}    (12)

where |N^i_{k-1}| is the number of pixels of the circular neighborhood N^i_{k-1} and P_none is a constant which reflects the probability of having no correspondence. Invoking the law of total probability for the denominator in (7) yields

P(x_k^i \mid D_{k-1}) = \sum_{u \in N^i_{k-1} \cup \{\mathrm{none}\}} P(x_k^i \mid i \to u, D_{k-1})\, P(i \to u \mid D_{k-1}) = \sum_{u \in N^i_{k-1} \cup \{\mathrm{none}\}} P(x_k^i \mid i \to u, x_{k-1}^u)\, P(i \to u),    (13)

where the two terms inside the sum are given by (8) and (12), respectively. Finally, by substituting (13) into (6), we derive the likelihood of pixel i as

p(c_k^i \mid x_k^i, D_{k-1}) = \frac{\sum_{j \in N^i_{k-1} \cup \{\mathrm{none}\}} P(x_k^i \mid i \to j, x_{k-1}^j)\; p(c_k^i \mid i \to j, c_{k-1}^j)\; P(i \to j)}{\sum_{u \in N^i_{k-1} \cup \{\mathrm{none}\}} P(x_k^i \mid i \to u, x_{k-1}^u)\; P(i \to u)}.    (14)
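Equation (14) can be evaluated independently for every pixel of the scan region. The sketch below is one possible implementation using the parameter values reported in Section 4; the function and argument names are ours, and the uniform density U is assumed to be defined over an 8-bit RGB cube:

```python
import numpy as np


def pixel_likelihood(c_i, x_i, neighbours, P_none=0.2, P_error=0.05, P_dis=0.2,
                     sigma=10.0, n_levels=256, n_channels=3):
    """Marginalised likelihood p(c_k^i | x_k^i, D_{k-1}) of Eq. (14).

    c_i        : colour of pixel i in the current frame (array-like of length n_channels)
    x_i        : hypothesised label of pixel i (0 background, 1 object)
    neighbours : list of (c_prev, x_prev) pairs for the pixels j in N^i_{k-1}
    """
    n = len(neighbours)
    uniform = 1.0 / (n_levels ** n_channels)         # U(c) over an assumed 8-bit colour cube

    def label_prob(x, x_prev):                       # Eq. (9)
        return 1.0 - P_error if x == x_prev else P_error

    def gaussian(c, c_prev):                         # Eq. (11), diagonal covariance sigma^2 I
        d2 = float(np.sum((np.asarray(c, float) - np.asarray(c_prev, float)) ** 2))
        norm = (2.0 * np.pi * sigma ** 2) ** (n_channels / 2.0)
        return np.exp(-0.5 * d2 / sigma ** 2) / norm

    # the "none" event contributes P(x|none) * U(c) * P_none to the numerator (Eqs. 10-12)
    num = (P_dis if x_i == 1 else 1.0 - P_dis) * uniform * P_none
    den = (P_dis if x_i == 1 else 1.0 - P_dis) * P_none
    for c_prev, x_prev in neighbours:
        p_corr = (1.0 - P_none) / n                  # Eq. (12)
        num += label_prob(x_i, x_prev) * gaussian(c_i, c_prev) * p_corr
        den += label_prob(x_i, x_prev) * p_corr      # denominator P(x_k^i | D_{k-1}), Eq. (13)
    return num / den
```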
2.2 Smoothness Term
The term E_smooth(x_k) in (2) penalizes the lack of continuity in the labels of the neighbor pixels, thereby encouraging the formation of a smoothly shaped object mask, without holes:

E_{\mathrm{smooth}}(x_k) = \sum_i \sum_{j \in N_i} V(x_k^i, x_k^j),    (15)

where N_i is an 8-neighborhood of pixel i and the potential V(x_k^i, x_k^j) takes the form of a generalized Ising model [15,22]

V(x_k^i, x_k^j) = \begin{cases} \dfrac{\lambda_1}{\mathrm{dist}^2(i,j)} & \text{if } x_k^i \neq x_k^j, \\ 0 & \text{if } x_k^i = x_k^j. \end{cases}    (16)

The quantity dist(i, j) gives the distance between the pixels i and j, and λ_1 is a constant used to control the smoothness.
2.3 Shape Term
For the sake of increasing robustness, the term E_shape(x_k | Θ) in (2) constrains the object mask to assume an elliptic shape model with unknown parameters:

E_{\mathrm{shape}}(x_k \mid \Theta) = \sum_i V(x_k^i \mid \Theta),    (17)

with V(x_k^i | Θ) being the shape potential of pixel i,

V(x_k^i \mid \Theta) = \begin{cases} \lambda_2 & \text{if } e_i \neq x_k^i, \\ 0 & \text{if } e_i = x_k^i, \end{cases}    (18)

where λ_2 is a constant that controls the compliance of the object mask with an elliptic shape, and e = {e_i} is a mask image where the pixel i, of position p_i, has an associated binary label

e_i = \begin{cases} 1 & \text{if } i \in E(\Theta), \\ 0 & \text{if } i \notin E(\Theta), \end{cases}    (19)

which indicates whether or not that pixel belongs to the elliptic region E(Θ) of parameters Θ = (μ, Σ), with μ being the ellipse center and Σ a covariance-like matrix:

E(\Theta) = \{\, i : (p_i - \mu)^T \Sigma^{-1} (p_i - \mu) \leq 4 \,\}.    (20)

Constraining the shape of the object mask prevents the "leaking" effect which occurs when the data likelihood increases due to noise and clutter. Contrary to [17], we do not fit the ellipse to the contour of the object mask, but estimate the shape parameters in a fast, heuristic manner, as explained next.
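The membership test of Eqs. (19)-(20) is a Mahalanobis-style threshold; a possible vectorised form (the function name and array layout are assumptions of this sketch):

```python
import numpy as np


def ellipse_mask(positions, mu, Sigma):
    """Binary labels e_i of Eq. (19): 1 inside the elliptic region E(Theta), 0 outside.

    positions : (N, 2) array of pixel positions p_i
    mu, Sigma : ellipse centre and covariance-like matrix of Theta = (mu, Sigma)
    """
    d = np.asarray(positions, float) - np.asarray(mu, float)
    # (p_i - mu)^T Sigma^{-1} (p_i - mu) for every pixel, Eq. (20)
    m = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)
    return (m <= 4.0).astype(np.uint8)
```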
3 Iterative Estimation
We wish to estimate the segmentation field x_k and the shape parameters Θ by maximizing the posterior distribution p(x_k, Θ | I_k, D_{k-1}), which is equivalent to minimizing the energy function

E(x_k, \Theta) = \sum_{i=1}^{N} \Big( V(c_k^i \mid x_k^i, D_{k-1}) + V(x_k^i \mid \Theta) + \sum_{j \in N_i} V(x_k^i, x_k^j) \Big),    (21)

obtained by plugging (4), (15), and (17) into (2). We perform the optimization of (21) by iterating over the following two steps:

1. Update the segmentation field x_k given the best estimate of the shape parameters, Θ*. This step involves the minimization of the energy function E(x_k, Θ*), which can be carried out through a deterministic relaxation of the ICM type [20] by performing local hypothesis tests as explained in the sequel. Assuming the segmentation field x_k is known with the exception of label x_k^i, we can estimate x_k^i by performing the following hypothesis test:

E\big(x_k^i = 0, \{x_k^{i'}\}_{i' \neq i}, \Theta^{*}\big) \;\underset{x_k^i = 0}{\overset{x_k^i = 1}{\gtrless}}\; E\big(x_k^i = 1, \{x_k^{i'}\}_{i' \neq i}, \Theta^{*}\big).    (22)

Combining (21) with (22) and eliminating the common term of the two energy factors in (22), determined by {x_k^{i'}}_{i' \neq i}, leads us to a simplified hypothesis test involving only potentials related to the pixel i and its neighbors (with known labels):

V(c_k^i \mid x_k^i = 0, D_{k-1}) + V(x_k^i = 0 \mid \Theta^{*}) + \sum_{j \in N_i} V(x_k^i = 0, x_k^j) \;\underset{x_k^i = 0}{\overset{x_k^i = 1}{\gtrless}}\; V(c_k^i \mid x_k^i = 1, D_{k-1}) + V(x_k^i = 1 \mid \Theta^{*}) + \sum_{j \in N_i} V(x_k^i = 1, x_k^j).    (23)

Thus, it is possible to relabel (refine) a given initial segmentation field by sequentially applying the decision rule (23) for all pixels. After relabeling a pixel, its new label updates the initial segmentation field that is used for testing the remaining pixels in the scan region. The refined field, denoted by x_k', serves to compute a better estimate for Θ (see the next step).

2. Update the shape parameters Θ given an estimate of the segmentation field, x_k'. Since solely the shape energy depends on Θ, this step reduces to minimizing E_shape(x_k', Θ) as given by (17). In practice, the direct minimization of E_shape is difficult to achieve due to the special form of the shape potential (18). An alternative solution can be found by observing that the penalty on the shape energy is minimal when the elliptic mask controlled by Θ = (μ, Σ) coincides with the elliptic idealization of the object mask. Thus, the parameters Θ can be estimated by fitting an ellipse to the object region as given by x_k'. To this end, it suffices to assign to μ and Σ the first two moments of the position of the pixels belonging to the object:

\mu = \frac{\sum_{i=1}^{N} p_i\, \chi_{i,1}}{\sum_{i=1}^{N} \chi_{i,1}},    (24)

\Sigma = \frac{\sum_{i=1}^{N} p_i p_i^T\, \chi_{i,1}}{\sum_{i=1}^{N} \chi_{i,1}} - \mu \mu^T,    (25)

where p_i is the position vector of the pixel i and χ_{i,1} is an indicator function which takes the value 1 if the label of pixel i in the refined field equals 1, and 0 otherwise. Following the estimation of Θ, we proceed with the first step to refine x_k. Repeatedly performing these two optimization steps amounts to a local-descent procedure that gradually approaches a local minimum of E(x_k, Θ). With a greedy initialization of the segmentation field, where a likelihood-based decision rule is applied for labeling each pixel,

V(c_k^i \mid x_k^i = 0, D_{k-1}) \;\underset{x_k^i = 0}{\overset{x_k^i = 1}{\gtrless}}\; V(c_k^i \mid x_k^i = 1, D_{k-1}), \qquad i = 1, \ldots, N,    (26)
the convergence of the optimization procedure is reached after a few scans. Note that the likelihood factor has to be computed only once as it does not depend on the shape parameters and the local configuration of pixel labels, which get modified during the iterative process. The optimization procedure may be terminated when the number of changed labels per scan falls below a pre-specified level. This solution provides a good trade-off between the segmentation quality and the computational cost.
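The alternation between the relabeling scan of (23) and the moment-based update of (24)-(25) can be organised as follows (a schematic sketch: the three potentials are passed in as callables and the pixel bookkeeping is reduced to flat arrays):

```python
import numpy as np


def fit_ellipse_moments(mask, positions):
    """Eqs. (24)-(25): ellipse centre mu and covariance-like matrix Sigma
    from the first two moments of the object pixels (mask == 1)."""
    pts = np.asarray(positions, float)[np.asarray(mask) == 1]
    mu = pts.mean(axis=0)
    Sigma = (pts.T @ pts) / len(pts) - np.outer(mu, mu)
    return mu, Sigma


def icm_scan(mask, data_potential, shape_potential, smooth_potential):
    """One relabeling scan of the hypothesis test (23) over the scan region.

    data_potential(i, x)   : V(c_k^i | x_k^i = x, D_{k-1})
    shape_potential(i, x)  : V(x_k^i = x | Theta*)
    smooth_potential(i, x) : sum over neighbours j of V(x_k^i = x, x_k^j)
    The callables are placeholders for the potentials of Section 2; they are
    expected to read the current labels in `mask`, which is updated in place so
    that relabeled pixels influence the tests of the remaining pixels.
    """
    changed = 0
    for i in range(len(mask)):
        e0 = data_potential(i, 0) + shape_potential(i, 0) + smooth_potential(i, 0)
        e1 = data_potential(i, 1) + shape_potential(i, 1) + smooth_potential(i, 1)
        new_label = 1 if e1 < e0 else 0
        if new_label != mask[i]:
            mask[i] = new_label
            changed += 1
    return changed   # terminate when this count falls below a pre-specified level
```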
4 Experimental Results
In this section we present two synthetic and one real-world tracking example to illustrate the performance of the proposed algorithm. Throughout the experiments the model parameters were set as follows: Pnone = 0.2, Perror = 0.05, Pdis = 0.2, Nm = 8, σ = 10, λ1 = 10, and λ2 = 0.5. The target appearance model is built from the first frame of each sequence, given the initial object mask. After the current object mask is computed using the proposed optimization algorithm, we find its contour by subtracting from the mask its morphologically eroded version, and we overlay it in green on the corresponding frame. Our unoptimized C++ implementation of the tracker running on a Pentium 4 3 GHz processor delivers a performance of 2–5 frames/second, depending on the target size. Fig. 1 shows the tracking results of the "Fish" video sequence (100x100, 400 frames), where a synthetic fish swims on a black background and changes its movement direction several times. The first two rows of Fig. 1 capture several challenging frames of the sequence where the deformation of the target is extreme. Despite the non-rigid character of the target, the tracker is able to closely follow its contour. The internal state of the tracker (in one frame) after the convergence of the optimization algorithm is visualized in the last row of Fig. 1: (a) the elliptic shape mask, (b-c) the images of the data likelihood (14) multiplied with the shape probability, exp(−V(x_k^i | Θ)) (see (18)), assuming a pixel belongs to the background and to the object, respectively, and (d-e) the images of the smoothness probability, exp(−Σ_{j∈N_i} V(x_k^i, x_k^j)) (see (16)), assuming a pixel belongs to the background and to the object, respectively. These probability images are calculated only for the pixels inside the scan region, with black representing the highest probability and white the lowest. One can easily notice that, for Figs. 1(b) and (d), the probability is high and uniform outside the object region, whereas for Figs. 1(c) and (e) the probability is high and uniform inside the object region. This is in accordance with the fact that (b) and (d) are built on the hypothesis that a pixel belongs to the background, while (c) and (e) are built assuming the reverse. Moreover, the uniformity in these probability images can be explained by the lack of clutter in the background and the relative color homogeneity of the fish. Fig. 2 shows the tracking results of the "Ellipse" video sequence (100x100, 5 frames), where a synthetic ellipse (textured in 4 colors) undergoes translation and rotation motions on a background that has the same color distribution as
Fig. 1. Video sequence "Fish": the first two rows display the tracking results for several challenging frames where the target undergoes severe deformations; the third row (a)-(e) visualizes the internal state of the tracker for the frame in the upper-left corner (as explained in Section 4)
the ellipse. Despite the cluttered background, our tracker isolates the target well. This is in stark contrast with the approaches used in [9,18,16,8,12,5,13], which are based on data likelihoods using the color densities of the two classes or region homogeneity measures such as mean color and variance. Obviously, in this case, these approaches would have failed as such data likelihoods are identical for the object and background pixels. On the contrary, the proposed likelihood (14) is discriminative enough as it is based on the local color structure and not on the color statistics of an image region. This can be seen in Figs. 2(a)-(e), which visualize the internal state of the tracker for the second frame in the sequence (the meaning of the images is the same as for Figs. 1(a)-(e)). Indeed, the data likelihoods corresponding to the two hypotheses, depicted in Figs. 2(b) and (c) respectively, clearly identify the object in the scanning region. Fig. 3 shows the results of tracking a human face in a real video, the "Tom" sequence (352x288, 446 frames), where a person approaches and departs from the camera, moves around to the window, rotates his head, and touches his nose with his hand. Obviously, this induces challenging conditions such as target scaling, illumination changes, out-of-plane rotations, partial occlusion, and cluttered background (the hand and the face have similar colors). Even with so many difficulties, our tracker still delivers a good, quasi real-time performance.
Fig. 2. Video sequence ”Ellipse”, where the ellipse and the background color distributions are identical. The first row depicts all the five frames of the sequence and the tracking results; the second row (a)-(e) visualizes the internal state of the tracker for the second frame of the sequence (as explained in Section 4)
Fig. 3. Video sequence ”Tom”: the images show the results of tracking a human face in challenging conditions such as target scaling, illumination changes, out-of-plane rotations of the head, and cluttered background
5 Conclusion
This paper proposed an efficient and robust approach for tracking non-rigid moving objects. We have formulated the tracking problem as the MAP estimation of a binary label field that partitions the current frame into object and background regions based on the object appearance in the previous frame. An MRF model was used to enforce data, region smoothness, and elliptic shape constraints. Based on the local color structure of the target, the data constraint (likelihood) enables the tracker to handle target deformations by integrating all the possible motions of a pixel in a small neighborhood. This imparts to the tracker a good discriminative power in cluttered backgrounds, as opposed to the color statistics-based approaches. We have shown how MAP optimization can be carried out efficiently in an iterative manner, by alternating between computing the shape parameters and estimating the segmentation based on a simple hypothesis test. The experimental results have shown that the proposed algorithm performs very well in a variety of challenging conditions.
Acknowledgement. This work has been done in the framework of a) the VIN project, funded by the Interdisciplinary Institute for Broadband Technology (IBBT) (founded by the Flemish Government in 2004), and b) the SERKET project, co-funded by the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT) and the involved companies (Barco).
References

1. Hager, G., Belhumeur, P.: Efficient region tracking with parametric models of geometry and illumination. IEEE T-PAMI 20, 1025–1039 (1998)
2. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE T-PAMI 25, 564–575 (2003)
3. Zhang, H., Huang, W., Huang, Z., Li, L.: Affine object tracking with kernel-based spatial-color representation. Comp. Vision and Pattern Recog. (CVPR) 1, 200–293 (2005)
4. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. Intl. Journal Comp. Vis. 1, 321–332 (1988)
5. Moelich, M., Chan, T.: Tracking objects with the Chan-Vese algorithm. Technical report 03-14, Computational Applied Mathematics, UCLA, Los Angeles (2003)
6. Mansouri, A.: Region tracking via level set PDEs without motion computation. IEEE T-PAMI 24, 947–961 (2002)
7. Paragios, N., Deriche, R.: Variational principles in optical flow estimation and tracking. In: Osher, S., Paragios, N. (eds.) Geometric Level Set Methods in Imaging, Vision, and Graphics, pp. 299–317. Springer, Heidelberg (2003)
8. Freedman, D., Zhang, T.: Active contours for tracking distributions. IEEE T-IP 13, 518–526 (2004)
9. Yilmaz, A., Li, X., Shah, M.: Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE T-PAMI 26, 1531–1536 (2004)
10. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. Intl. Journal Comp. Vis. 22, 61–79 (1997)
11. Malladi, R., Sethian, J., Vemuri, B.: Shape modeling with front propagation: A level set approach. IEEE T-PAMI 17, 158–175 (1995)
12. Zhu, S., Yuille, A.: Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE T-PAMI 18, 884–900 (1996)
13. Chan, T., Vese, L.: Active contours without edges. IEEE T-IP 10, 266–277 (2001)
14. Cremers, D.: Dynamical statistical shape priors for level set-based tracking. IEEE T-PAMI 28, 1262–1273 (2006)
15. Li, S.Z.: Markov Random Field Modeling in Computer Vision. Springer, Heidelberg (1995)
16. Freedman, D., Zhang, T.: Interactive graph cut based segmentation with shape priors. Comp. Vision Pattern Recog. (CVPR) 1, 755–762 (2005)
17. Slabaugh, G., Unal, G.: Graph cuts segmentation using an elliptical shape prior. In: Intl. Conf. Image Proc., pp. 1222–1225 (2005)
18. Boykov, Y., Jolly, M.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: Intl. Conf. Comp. Vision (ICCV), vol. 1, pp. 105–112 (2001)
19. Leichter, I., Lindenbaum, M., Rivlin, E.: Bittracker - a bitmap tracker for visual tracking under very general conditions. Technion, Computer Science Department, Technical Report CIS-2006-03.revised (2006)
20. Besag, J.: On the statistical analysis of dirty pictures. J. Royal Stat. Soc. B 48, 259–302 (1986)
21. Greig, D., Porteous, B., Seheult, A.: Exact maximum a posteriori estimation for binary images. J. Royal Stat. Soc. B 51, 271–279 (1989)
22. Aach, T., Kaup, A.: Bayesian algorithms for adaptive change detection in image sequences using Markov random fields. Signal Proc. Image Communic. 7, 147–160 (1995)
A New Approach to the Automatic Planning of Inspection of 3D Industrial Parts

J.M. Sebastián, D. García, A. Traslosheros, F.M. Sánchez, S. Domínguez, and L. Pari

Departamento de Automática, Ingeniería Electrónica e Informática Industrial (DISAM), Escuela Técnica Superior de Ingenieros Industriales, Universidad Politécnica de Madrid, C/ José Gutiérrez Abascal, 2, 28006 Madrid, España
{jsebas,altrami,sergio,lpari}@etsii.upm.es,
[email protected]
Abstract. This article describes a novel planning algorithm for automatically carrying out the high-precision dimensional inspection of three-dimensional features of manufactured parts. The method is general with respect to the complexity of the part and the points at which measurements must be taken, so the range of application of the system is not limited. To this end, the analysis discretizes both the configuration space of the part positioning system and the surface of the part itself. All the techniques presented here have been tested and validated on a real inspection system based on stereoscopic cameras equipped with a laser light plane. Keywords: 3D inspection, automatic planning, quality control.
1 Introduction

This study focuses on the visual inspection of machine parts with three-dimensional characteristics for quality control tasks. Three-dimensional inspection is influenced by numerous factors which make it quite different from other types of inspection. Some aspects such as the presence of occlusions, reflections or shadows introduce many inconveniences that make the analysis very difficult. Our work focuses on inspecting metal parts in order to improve accuracy and tolerances. As is well known, tolerance checking is one of the most demanding tasks that can be carried out in an industrial environment in what concerns precision in measurements [1], [2]. The comparison between the real measurements and the ideal ones makes it necessary to have such information previously available, usually in the form of a computer aided design (CAD) model. The use of CAD models involves specific working methods and organization of data that differ from other commonly adopted techniques. Also, the materials employed in the manufacturing of such parts are usually metals, with specular properties that require special methods of analysis. On the other hand, the most motivating aim in the development of an inspection system is to get the system to be able to find by itself the best strategy to perform the job in terms of optimizing some criteria. This study handles all these problems, focusing on the search for methods that improve precision in measurements. Our aim is not to limit this study to specific configurations of
the inspection system, but rather to build a system which is capable of performing three-dimensional measurements, in a similar way to coordinate measuring machines (CMM). In [3], [4], [5], [6], or [7] different approaches to the planning problem are shown, although their solutions depend too much on the architecture of their inspection systems. This work has been developed using an inspection system called INSPECTOR-3D. In previous works, the characteristics of the system [8] and some early approaches to automatic planning [9] were introduced. The content of this paper unfolds as follows. Section 2 covers a brief description of the inspection system employed in this study. Section 3 describes our approach to analyzing the part. Section 4 clarifies some preliminary aspects necessary to understand Section 5, where our approach to inspection planning is explained in detail. Section 6 shows some common examples, while Section 7 presents some conclusions about this work.
2 INSPECTOR-3D System Description

The INSPECTOR-3D system consists of two fixed converging cameras, a laser plane for surface scanning, a part positioning device with 3 degrees of freedom (2 rotational and 1 linear), and a workstation that controls the whole inspection process. Figure 1 shows an image of the system. All the degrees of freedom of the image acquisition system have been eliminated in order to simplify the camera calibration process and minimize the uncertainties of the final measurements. It is easy to demonstrate that calibrating the axes of the positioning device is much simpler and more precise than the dynamic calibration of the cameras.
Fig. 1. Image of the 3D system Inspector
Fig. 2. Functional architecture of the system
Referring to the functional architecture of the system, the inspection procedure consists of two stages, as shown in Figure 2. The first stage takes place off-line. The user analyses a CAD model of the part and selects the elements to inspect (called “entities”, as described later). This information, together with the calibration models, constitutes the input data to the planning algorithm. The output information consists
of the set of states that the system needs to follow in order to complete the inspection process. In the online stage, small areas of the part are scanned in a sequential way according to the inspection plan. As a result, a cloud of points is obtained. These points are classified and referred to a common reference system. Finally, a comparison of this measurement with the tolerance zones is accomplished. Two important aspects need to be mentioned. In the first place, planning has only been considered as an off-line problem prior to any type of measurement. In the second place, the fact of working with high optical resolution normally implies that only a small part of each feature is visible. Therefore, data acquisition requires successive operations of orienting the part and scanning small areas. Besides allowing us to deal with the planning problem, this system has been used as an excellent test bench for studies related to precision in measurement, calibration and the evaluation of feature extraction algorithms.
3 Digitalizing the Part

The available information on the inspection process comes from two different sources: the inspection system itself and the part to be inspected. Regarding the inspection system, by calibrating both cameras, the laser plane and the part positioning device, it is possible to obtain a complete model of the system and use it to calculate the projection of the part on both images during inspection.
Fig. 3. Digitalizing a part in triangles
As far as the part is concerned, there are different ways of representing the geometric information, such as spatial enumeration (octrees), set-theoretic modeling (constructive solid geometry, or CSG for short) and boundary representations [2]. Nevertheless, our approach to data representation is based on digitalizing the surface of the part in triangles [10], as shown in Figure 3. This technique, although widely used in computer graphics applications, has not been used as a basis for the analysis of the inspection planning problem. If we know the position of the part and the equation of the laser plane, it is easy to calculate the intersection of the plane with the triangles and to project such an intersection on both images. With this approach, several advantages can be obtained. On the one hand, we can reduce the analysis to areas around the projections, decreasing the
calculation time and avoiding errors; on the other hand, it is easy to associate each digitized point with a triangle of the CAD model, avoiding later processing. Finally, as the calibration models of both cameras and the laser plane are known, two independent and redundant measurements can be calculated in order to detect the presence of outliers. As a result, a better performance of the system is obtained, reducing the presence of digitization errors. However, there are still some situations in which important errors in the measurement process can appear. These errors are basically due to multiple configurations for inspecting a single feature, the presence of internal reflections, and the direct visualization of specular reflections. Although some of these effects can be minimized by controlling the dynamic range of digitization, the power of the laser unit or the aperture of the camera lens, there are still many unacceptable situations for inspection, which makes it necessary to find mechanisms for the automatic selection of the best conditions of inspection, that is, an inspection planning process.
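The basic geometric operation behind this analysis, intersecting the laser plane with a triangle of the model, can be sketched as follows (a plain geometric illustration; the projection onto the calibrated cameras is omitted, and degenerate cases such as a vertex lying exactly on the plane are ignored):

```python
import numpy as np


def plane_triangle_intersection(n, d, tri):
    """Intersect the laser plane {p : n.p + d = 0} with a triangle.

    n   : plane normal, 3-vector
    d   : plane offset
    tri : (3, 3) array with the triangle vertices as rows
    Returns the two endpoints of the intersection segment, or None when the
    plane does not cross the triangle.
    """
    tri = np.asarray(tri, float)
    dist = tri @ np.asarray(n, float) + d      # signed distances of the three vertices
    points = []
    for a, b in ((0, 1), (1, 2), (2, 0)):
        if dist[a] * dist[b] < 0:              # this edge crosses the plane
            t = dist[a] / (dist[a] - dist[b])
            points.append(tri[a] + t * (tri[b] - tri[a]))
    return tuple(points) if len(points) == 2 else None
```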
4 Preliminary Aspects of the Planning Problem

Our approach can be classified among the well-known generation and verification techniques [11]. These techniques analyze every possible configuration of the system in a sequential way, considering only those configurations that allow the parts to be measured (applying visibility restrictions) and selecting among them the most adequate one in terms of a specific metric of quality (named reliability) [5]. In our system, the metric of quality has been set in terms of the behavior of the feature detection algorithm which analyses the intersection of the laser plane with the part as seen from both cameras. In order to analyze a finite set of possible states of the system, the range of variation of every degree of freedom has been digitalized. Each combination of discrete values of the degrees of freedom of the system will be named a configuration of the system. The analysis is repeated sequentially for every triangle and for every configuration of the system until all features have been analyzed. However, we first need to consider some related concepts.

4.1 Definition of Entities

In the first place, it is important to clarify the concept of entity. Since a discrete representation of the part is being used and the aim of the system is to measure specific features, a way of relating these concepts has been established by means of a new concept called entity. An entity is defined as the set of triangles associated with the areas of the part to inspect. Besides the geometrical information of the triangles, an entity usually includes information related to tolerance zones and reference systems (in some cases called simulated reference systems) [12]. Therefore, various analyses such as the inspection of parallelism between two faces or of the cylindricity of a surface are now considered as a problem of inspecting entities or triangles. At this point, two aspects need to be clarified. On one hand, although an approximate representation of the surface of the part has been used, the actual comparison has been performed between the measurements and the exact dimensions of the part. Thus, the discrete representation
has only been used as a useful approach to the analysis of the problem. On the other hand, the definition of entities and the process of entering information on tolerances have been done manually using a friendly user interface.

4.2 Configuration Space

Another concept to take into account is the way in which the degrees of freedom of the system have been considered. As mentioned before, the DOF of the whole system are those of the part positioning device: two for rotating the part and one for displacing the area to inspect under the laser plane. In the following analysis, a clear difference between these two types of degrees of freedom will be established. In fact, the space of analysis will be reduced to a 2-dimensional space taking into account only the rotational axes of the positioning system, and hence considering displacement as a property associated with each configuration. The reason, as will be explained later, is that the analysis of visibility and reliability depends fundamentally on the orientation of the part. The result of the analysis will be represented in a diagram of discrete states of the system named the configuration diagram; each state of this diagram represents a possible configuration of the system. In this diagram, the level of digitalization of each degree of freedom depends on the level of detail aimed at. Very high levels of digitalization imply more accurate solutions, although also a higher number of states to analyze. As is clear, the analysis has focused on the degrees of freedom of the system instead of considering other solutions such as the study of all the points of view around the part using a digitalized sphere [13]. The reason is that those approaches that analyze large sets of points of view do not have to be physically realizable with the system, as opposed to the configuration space approach.

4.3 Visibility

The first set of restrictions assures that a specific triangle is visible to the cameras. We use a definition of the concept of visibility that involves both the cameras and the laser plane. A triangle is considered visible when a range of displacement of the part positioning device exists which assures that the intersection of the laser plane with the triangle is visible by both cameras at all times during the complete scanning of this triangle. Therefore, if a triangle is visible under a specific configuration of the system, it means it is possible to record a range of valid displacement of the part for that configuration. In order to optimize the implementation of the definition of visibility, the following restrictions have been consecutively applied in the INSPECTOR-3D system (a sketch of this analysis follows below):

• Orientation: The triangle is oriented in such a way that its external face is visible by both cameras.
• Field of view: The projection of the intersection lies inside the image.
• Occlusions: No other triangle occludes the vision of the one being analyzed.

The verification of the previous restrictions allows a specific configuration of the system to comply with the definition of visibility. The result is a set of valid configurations in which the triangle can be digitalized through laser scanning.
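As referenced above, the visibility analysis can be organised as a double loop over triangles and discretised configurations; the sketch below is schematic, and the signatures of the three restriction tests are assumptions:

```python
def visibility_map(triangles, configurations, oriented, field_of_view, unoccluded):
    """Enumerate, for every triangle, the configurations under which it is visible.

    configurations         : discretised rotation pairs of the positioning device
    oriented(t, cfg)       : True when the external face of triangle t faces both cameras
    field_of_view(t, cfg)  : valid displacement range, or None if the projection leaves the image
    unoccluded(t, cfg, r)  : True when no other triangle occludes t over the range r
    The three callables stand for the restrictions applied consecutively in the
    INSPECTOR-3D system.
    """
    result = {}
    for ti, tri in enumerate(triangles):
        valid = {}
        for cfg in configurations:
            if not oriented(tri, cfg):
                continue
            rng = field_of_view(tri, cfg)
            if rng is None or not unoccluded(tri, cfg, rng):
                continue
            valid[cfg] = rng          # displacement range recorded for this configuration
        result[ti] = valid
    return result
```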
4.4 Reliability Once the condition of visibility is verified, a quality metric will be associated with every visible configuration of the system. The aim is to be able to select, among the visible configurations, the most adequate one for the measurement process. The criterion for the selection will be established in terms of the quality of the image being observed. This criterion depends on the behavior of the feature extraction algorithm. In this case, the algorithm extracts the peak position of the laser plane in the image. In order to measure the quality of each configuration, a metallic planar surface was oriented sequentially, sweeping the range of variation of the rotation axes of the positioning system. At each configuration the resulting images have been stored and analyzed. Based on the type of laser reflection obtained, four different cases have been distinguished, as indicated in Figure 4:
Fig. 4. Different types of intersections (NOT VISIBLE, GAUSSIAN, SATURATION, SPECULAR)
• Not visible intersection (NOT VISIBLE). The intersection could not be seen under this configuration. • Gaussian intersection (GAUSSIAN). The laser intersection is not saturated. Subpixel techniques may be employed to improve precision [14]. • Saturated intersection (SATURATION). The laser intersection is saturated. Subpixel algorithms cannot be applied. Instead, the mass center of the saturated intersection is calculated. • Specular reflection (SPECULAR). The reflection of the laser plane hits the sensor directly, making it impossible to process the image. The occurrence of each case is strongly related to the relative orientation between the reflected laser plane, the metal part and the camera. Figure 5 shows the different elements involved in the analysis. The triangle is represented by the vector n perpendicular to it; the reflected laser plane by the vector r, resulting from the intersection of this plane with the plane perpendicular to the original laser plane; and the axes of both cameras by the vectors v1 and v2. In this context, the cosine of the angle between vectors r and v* (v1 or v2) constitutes a reliable measure to indicate the type of intersection that can be seen by each camera. We have defined specific thresholds to differentiate each of the four cases and the transitions between them, obtaining seven different states as shown in Table 1.
Table 1. Weights associated with every possible type of image
TYPE OF INTERSECTION       cos(r, v)   WEIGHT
Specular                   1           0
Saturation – Specular      0.975       0.25
Saturation                 0.95        0.5
Gaussian – Saturation      0.9         0.75
Gaussian                   0.8         1
Not visible – Gaussian     0.7         0.5
Not visible                0.5         0
Fig. 5. Laser reflection with respect to the camera position
Moreover, we have associated a weight in the range between 0 and 1 to each of the possible seven states according to how favorable these states are with respect to inspection. Therefore, as the orientation of the part with respect to the cameras and to the laser plane is always known, it is possible to detect the type of intersection being visualized and to get a value of reliability for that configuration. Thus, every valid configuration of the diagram has two associated values: a range of displacement for the digitized triangle (obtained in the visibility analysis) and a reliability measure obtained from Table 2.
Table 2. Weights associated with the cosine between the camera axis and the laser reflection
COSINE   0.0   ...   0.5   0.6    0.7   0.8   0.9    1.0
WEIGHT   0.0   0.0   0.0   0.25   0.5   1.0   0.75   0.0
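The sketch below illustrates how such a cosine-to-weight lookup could be implemented from the values of Table 2; interpolating linearly between the tabulated values and taking the worse of the two cameras as the configuration reliability are assumptions not stated in the text.

```python
import numpy as np

# Cosine -> weight breakpoints taken from Table 2 (camera axis vs. laser reflection).
COSINE = np.array([0.0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
WEIGHT = np.array([0.0, 0.0, 0.25, 0.5, 1.0, 0.75, 0.0])

def reliability(r, v1, v2):
    """Reliability of one configuration from the reflected laser direction r
    and the two camera axes v1, v2 (all unit 3-vectors)."""
    weights = []
    for v in (v1, v2):
        c = float(np.clip(np.dot(r, v), 0.0, 1.0))   # cosine of the angle between r and v
        # Assumption: piecewise-linear interpolation between the tabulated values.
        weights.append(float(np.interp(c, COSINE, WEIGHT)))
    # Assumption: the configuration is only as good as the worse of the two cameras.
    return min(weights)
```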
4.5 Level of Digitalization of the Part An important aspect to consider in the following studies is the level of digitalization of the triangles. Since there are two visibility restrictions (field of view and occlusions) affected by the size of the triangles, it is important to ensure that their size is not so large that it invalidates many configurations. However, if the size of the triangles is too small, there is a risk of excessive processing. In our approach, the part is initially digitalized into triangles using conventional techniques [15]. Next, these
triangles are divided recursively, using the middle point of their sides, into another four triangles until the projection of the maximum dimension of all the triangles is smaller than 40% of the dimensions of the image.
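A minimal sketch of this subdivision rule is given below; the representation of triangles as vertex tuples and the projected-size callback are illustrative assumptions.

```python
def subdivide(tri, projected_size, image_dim, max_fraction=0.4):
    """Recursively split a triangle at the midpoints of its sides until the
    projection of its largest dimension is below max_fraction of the image size.

    tri            : three vertices, each an (x, y, z) tuple
    projected_size : function returning the largest projected dimension of a triangle
    image_dim      : smallest image dimension, in the same units as projected_size
    """
    if projected_size(tri) < max_fraction * image_dim:
        return [tri]
    a, b, c = tri
    mid = lambda p, q: tuple((pi + qi) / 2.0 for pi, qi in zip(p, q))
    ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
    # One triangle per corner plus the central triangle formed by the midpoints.
    children = [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    out = []
    for child in children:
        out.extend(subdivide(child, projected_size, image_dim, max_fraction))
    return out
```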
5 Planning Algorithm The procedure starts with a first stage in which the possibility of inspecting every single triangle of an entity is analyzed. Once this analysis is done, it is possible to know whether a specific entity can be inspected with the system. The only condition that must be verified is that every triangle of the entity has at least one reliable configuration. However, if the analysis ended here and the best configuration for inspecting each triangle were selected, a set of very different configurations would probably be obtained. This would lead to a large number of successive operations of orienting and scanning of small areas. Instead, an additional fusion of the results of the triangles of one entity will be performed in order to unify their conditions of inspection, in those cases in which common inspection configurations exist. As a result, it will be possible to inspect groups of neighboring triangles under the same configuration. This set of triangles will be named a Group of Inspection. 5.1 Planning on Individual Triangles By applying the previous considerations, a value of reliability will be obtained for every visible configuration of a specific triangle. 5.2 Fusing the Diagrams of the Triangles of the Same Entity In order to fuse the configuration diagrams of different triangles, the following property will be used: neighboring triangles with similar orientation show small variations in their configuration diagrams. At this point, it is important to clarify the use of the concept of similarity. Two triangles are considered similar when the intersection between their configuration diagrams is not empty. This definition is logical, since the intersection of both diagrams implies that common inspection conditions exist for both triangles. A very useful approach is based on representing the information of the part by means of some kind of structure that reflects the proximity or closeness between two different triangles. According to this, a new graph-based representation, called the proximity graph, has been defined. In general terms, this graph consists of a set of nodes where each node represents a triangle of the part. A node is related to another by means of a link if both triangles share a side. Additionally, a weight has been associated with every link in the graph. But instead of using a single value, as is usually done, a complete configuration diagram has been used. This configuration diagram is obtained from the intersection between the two diagrams of the triangles of the nodes (Figure 6). Therefore, the proximity graph is a new representation that combines
Fig. 6. Configuration diagram associated with one link
information of similarity and proximity. The study of fusing different triangles to create groups of inspection is based on the analysis of the proximity graph. 5.3 Initial Definitions During the analysis of the proximity graph, a triangle (or node) can be in two possible states: classified, when assigned to a group of inspection, and unclassified, when not assigned to any group of inspection. If a triangle is not classified, it is considered as a possible new member of a group and is labeled as a candidate. As a result, in order to study the triangles, three different lists will be maintained: the list of groups, the list of candidate triangles and the list of unclassified triangles. Initially, the list of unclassified triangles is filled with all the triangles. The search process ends when the list of unclassified triangles is empty. The results are stored in the list of groups. Creation of new groups: When a new group is created, the list of candidates is emptied. Every group has a single configuration diagram associated with it. Initially, the diagram of the group is set equal to the diagram of the triangle considered as the seed of the group, and it will be modified according to the diagrams of the new triangles added to the group. The criterion for modifying the configuration diagram of the group consists of taking the minimum of the diagram of the group and the diagram of the triangle to be added. Selection of the seed triangle: The selection of a node as a seed is based on choosing the one with the largest number of neighbors in the proximity graph and the largest number of valid configurations in the configuration diagram of the node. The aim is not only to start from a configuration diagram with many valid configurations but also
to avoid nodes which correspond to triangles on the borders of an entity. Such nodes usually have few neighbors and are located in areas where occlusions occur. Analysis of neighbors: Once a node is selected as a seed, it is classified as a member of an inspection group and it is removed from the list of unclassified triangles. Besides, all unclassified neighbors of the seed node are included in the candidate list. Next, all candidates are analyzed in search for the triangle whose reliability diagram has the largest intersection with the reliability diagram of the group. Therefore, the impact of adding a triangle to a group is minimal, due to the small reduction of reliable configurations that the reliability diagram of the group suffers. Once a candidate is added to a group, it becomes the new seed for the analysis, and therefore all its neighbors are added to the list of candidates. Group Completion: A group is considered completed when the list of candidates is emptied or when none of the configuration diagrams of the candidates intersect with the configuration diagram of the group (Figure 7). The configuration used for inspection is obtained selecting the configuration with the highest value in the configuration diagram. The range of displacement of the part is obtained from the union of the ranges of displacement of all the triangles of the group.
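The sketch below illustrates this group-growing strategy on the proximity graph, assuming that each configuration diagram is stored as an array of reliabilities (zero meaning not inspectable) and that the intersection of two diagrams is their element-wise minimum; the latter is stated for the group update, and reusing it for the intersection test is an assumption.

```python
import numpy as np

def grow_groups(diagrams, neighbors):
    """Group triangles that share inspection configurations.

    diagrams  : dict node -> 2-D numpy array of reliabilities (0 = not inspectable)
    neighbors : dict node -> set of adjacent nodes (triangles sharing a side)
    Returns a list of (set_of_nodes, group_diagram).
    """
    unclassified = set(diagrams)
    groups = []
    while unclassified:
        # Seed: most neighbours in the graph, then most valid configurations.
        seed = max(unclassified,
                   key=lambda n: (len(neighbors[n] & unclassified),
                                  int(np.count_nonzero(diagrams[n]))))
        group, gdiag = {seed}, diagrams[seed].copy()
        unclassified.discard(seed)
        candidates = set(neighbors[seed]) & unclassified
        while candidates:
            # Overlap = configurations still valid after adding the candidate.
            def overlap(n):
                return int(np.count_nonzero(np.minimum(gdiag, diagrams[n])))
            best = max(candidates, key=overlap)
            if overlap(best) == 0:            # no common configuration left: group complete
                break
            group.add(best)
            gdiag = np.minimum(gdiag, diagrams[best])   # update the group diagram
            unclassified.discard(best)
            candidates.discard(best)
            candidates |= set(neighbors[best]) & unclassified
        groups.append((group, gdiag))
    return groups
```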
Fig. 7. Entities in the proximity graph
6 Example Figure 8 shows the result of applying the planning algorithm to an angular entity. As observed, three inspection groups are obtained, each one associated with a different angular configuration. Therefore, the inspection will consist of three different stages. In each stage, the part will be oriented and displaced according to the configuration diagrams of the groups (Figure 9). It is important to point out that when the planning algorithm does not find a complete solution for the whole entity, it provides a partial solution for the triangles that have valid configurations. This allows one to understand the reasons why a complete inspection of the part cannot be performed.
Fig. 8. Three inspection groups obtained from an angular entity
Fig. 9. Digitization process based on three inspection groups
7 Conclusions This study describes a new planning algorithm to perform dimensional inspection of a metal part with three dimensional characteristics in an automatic way. The algorithm follows a generation and verification strategy and it works directly in the configuration space of the inspection system. It uses a discrete representation of both the degrees of freedom of the system and the information of the part, represented as a set of triangles. The geometrical information of the features to inspect has been grouped into entities (sets of triangles). Each entity has been represented using a graph-based diagram called proximity graph, very appropriate for this discrete analysis. Our approach has several advantages; we have not imposed any restrictions to the complexity of the features to inspect or to the types of measurements to perform. Limitations are only those associated to the use of a visual measurement system (lack of visibility) and to the limited degrees of freedom of the system which only allow
one to orient the part under specific configurations. Moreover, the planning algorithm provides partial solutions to the problem being solved. When the entities analyzed cannot be completely inspected according to specifications, it is still possible to obtain solutions for individual triangles, which can constitute interesting information to help orient the inspection process. The performance of this planning algorithm has been extensively tested using a set of more than twenty complex mechanical parts from the automobile industry, and the results have been quite satisfactory.
References 1. Rivera-Rios, A.H., Shih, F-L., Marefat, M.: Stereo Camera Pose Determination with Error Reduction and Tolerance Satisfaction for Dimensional Measurements. In: Proceedings of the International Conference on Robotics and Automation, April 2005, Barcelona, Spain (2005) 2. Malamasa, E.N., Petrakisa, E.G.M., Zervakisa, M., Petito, L., Legatb, J-D.: A survey on industrial vision systems, applications and tools. Image and Vision Computing 21, 171– 188 (2003) 3. Chen, S.Y., Li, Y.F.: Vision Sensor Planning for 3-D Model Acquisition. IEEE Trans. On Systems, Man. And Cybernetics 35(5) (2005) 4. Kosmopoulos, D., Varvarigou, T.: Automated inspection of gaps on the automobile production line through stereo vision and specular reflection. Computers in Industry 46, 49– 63 (2001) 5. Trucco, E., Umasuthan, M., Guallance, A., Roberto, V.: Model based planning of optimal sensor placement for inspection. IEEE Trans. on Robotics and Automation 13(2) (1997) 6. Chen, S.Y., Li, Y.F.: Automatic Sensor Placement for Model-Based Robot Vision. IEEE Trans. On Systems, Man. And Cybernetics 34(1) (2004) 7. Reed, M.K., Allen, P.K., Stamos, I.: Automated model acquisition from range images with view planning. In: Conference on Computer Vision and Pattern Recognition (1997) 8. Garcia, D., Sebastian, J.M., Sanchez, F.M., Jiménez, L.M., González, J.M.: 3D inspection system for manufactured machine parts. In: Proceedings of SPIE. Machine Vision Applications in Industrial Inspection VII, vol. 3652, pp. 25–29 (1999) 9. Sebastian, J.M., Garcia, D., Sanchez, J.M., Gonzalez, J.M.: Inspection system for machine parts with three-dimensional characteristics. In: Proceedings of SPIE. Machine Vision Systems for Inspection and Metrology VIII vol. 3836 (1999) 10. Farin, G.: Curves and surfaces for computer aided geometric design. A practical guide. Academic Press, London (1993) 11. Maver, J., Bajcsy, R.: Oclussions as a guide for planning the next view. IEEE Transactions on Pattern Analysis ans Machine Intelligence 15(5) (1993) 12. Modayur, B.R., Shapiro, L.G., Haralick, R.M.: Visual inspection of machine parts. In: Sanz (ed.) Advances in Image Processing, Multimedia and Machine Vision, Springer, Heidelberg (1996) 13. Yi, S., Haralick, R.M., Shapiro, L.G.: Optimal sensor and light source positioning for machine vision. Computer Vision and Image Understanding 1 (1995) 14. Ficher, R.B., Naidu, D.K.: A comparison of algorithms for subpixel peak detection. In: Sanz (ed.) Advances in Image Processing, Multimedia and Machine Vision, Springer, Heidelberg (1996) 15. Velho, L., Figueiredo, L.H.D., Gomes, J.: A unified approach for hierarchical adaptative tessellation of surfaces. ACM Transactions on Graphics 18(4), 329–360 (1999)
Low Latency 2D Position Estimation with a Line Scan Camera for Visual Servoing Peter Briër1,2, Maarten Steinbuch1, and Pieter Jonker1 1 Department of Mechanical Engineering Section Dynamics and Control Technology, Technical University Eindhoven, P.O. Box 513, 5600 MB Eindhoven, The Netherlands {p.brier,m.steinbuch,p.p.jonker}@tue.nl 2 OTB Group B.V., Luchthavenweg 10, 5657 EB Eindhoven, The Netherlands
[email protected]
Abstract. This paper describes the implementation of a visual position estimation algorithm, using a line-scan sensor positioned at an angle over a 2D repetitive pattern. An FFT is used with direct interpretation of the phase information at the fundamental frequencies of the pattern. The algorithm is implemented in a FPGA. The goal is to provide fast position estimation on visual data, to be used as feedback information in a dynamic control system. Traditional implementations of these systems are often hampered by low update rates (<100 Hz) and/or large latencies (>10 msec). These limit the obtainable bandwidths of the control system. Presented here is an implementation of an algorithm with a high update rate (30kHz) and low latency (100 μsec). This system can be used for a range of repetitive structures and has a high robustness. Resolutions of less than 0.1 μm have been demonstrated on real products with 210x70 μm feature size.
1 Introduction Using visual information to determine the position of objects relative to each other is a universal task. In image processing this relates to the “Image registration” problem: finding the position, orientation and scaling of (parts of) a reference image inside another image. Many methods and implementations have been described to perform this task [1], all with merits and shortcomings in terms of their general use, complexity, computational efficiency, robustness and performance. One of the commonly used algorithms is Phase Correlation (PC) in the Frequency Domain (FD). It is an attractive method of measuring displacements because it is highly invariant to changes in illumination. Such a method is described by [2]. One of the limitations of FD-PC is the fact that sub-pixel displacements are not measured. Various techniques have been proposed to overcome this limitation, for instance in [3] and [4]. However, many of these require multiple steps with transformations between domains in order to derive the translation values. If one would like to perform these steps at high speed and with high resolution this translates into high resource usage. This paper focuses on the “simplest possible” approach to extract this sub-pixel translation value at high speed, for a narrowly specified class of (real existing) objects, with pre-existing constraints on the realization of the solution (in terms of accuracy, speed, size and cost price). J. Blanc-Talon et al. (Eds.): ACIVS 2007, LNCS 4678, pp. 37–47, 2007. © Springer-Verlag Berlin Heidelberg 2007
1.1 Problem Definition In semiconductor (display) manufacturing it is often required to position a mechanical system relative to a product. These products contain repeating structures (e.g. pixels). In displays these structures typically have a periodicity in the range of 10..1000 μm and the overall product dimensions can be up to several meters [5]. Subsequent process steps need to be aligned with the existing structures within small tolerances. An example of such a structure is presented in figure 1.
Fig. 1. Typical display product with periodicity Sx and Sy. The product is actuated in X and Y direction. Also drawn in this picture is a sensor array with N sensor elements and total length L at angle α.
By using a camera system mounted on the mechanical system, the measured position can be used to guide the motion system. In this closed loop control setup the measurement update-rate, latency and jitter are limiting the closed loop system performance [6]. The overall system setup is shown in figure 2, showing the closed loop structure.
Fig. 2. System setup
In order to achieve adequate system performance for this application, a sensor system is required that has the specifications as noted in table 1. Table 1. System specification requirements
Specification                      Value       Unit
Update rate                        > 10,000    Hz
Latency                            < 100       μsec
Jitter                             < 10        μsec
Resolution                         < 1         μm
Price, Size, Power consumption     Comparable to standard (video) camera systems
One could use full 2D image information plus the aforementioned algorithms to extract this position information, but one is faced with a number of challenges: the availability of a camera system with adequate resolution and frame-rate; the cost, size and power consumption of a system to process the huge amount of full-frame data in a timely manner. Smart cameras or application-specific smart cameras based on design space exploration [7,8] could bring a solution, especially if they have an SIMD processor on board [9,10] and/or an FPGA [11], but at this moment this technology is not yet available as a commodity product. Consequently, we have chosen to implement our algorithm directly in an FPGA. This paper describes a method based on line-scan sensor data. Line-scan sensors offer several benefits compared to area scan sensors when used on large, irregularly shaped or moving objects [12]. They have been successfully used in various inspection systems and equipment, most notably in the printing, steel and semiconductor industry [13]. 1.2 Measurement Principle For objects containing repetitive structures, sub-pixel displacement information is readily available by measuring the phase of the fundamental frequency (or any harmonic) of the structure in the FD. If a scan line is drawn at an angle with the principal axis of a 2D repetitive structure, there will be two fundamental frequencies present in this scan line: f1 and f2. They correspond to the projection of Sx and Sy on the scan line that is oriented at an angle α with the principal axis of the product (see figure 1).
f1 = 1 / (Sx ⋅ cos(α)) ,  f2 = 1 / (Sy ⋅ sin(α)) .   (1)
In this manner, the phase (and thus the position) of the structure can be measured in two directions using a single scan line, under the assumption that f1 and f2 do not coincide with each other or with one of their harmonics (2).
f2 ≠ n ⋅ f1 ,  n ∈ ℕ .   (2)
Based upon the actual values of f1 and f2 there are some optimal values for α where the resulting frequencies are co-prime and have a large separation. The phases can be calculated from the N sensor data points using a Discrete Fourier Transform (DFT):
X(k) = Σ n=0..N−1  x(n) ⋅ e^(−j·2π·k·n / N) ,  k = 0, …, N−1 .   (3)
The X and Y phases, Px and Py, are found by calculating
Px = (cos(α) ⋅ Sx / 2π) ⋅ arg(Xa) ,  Py = (sin(α) ⋅ Sy / 2π) ⋅ arg(Xb) ,   (4)
where Xa is the DFT element corresponding to f1 and Xb is the DFT element corresponding to f2. When the object is moved over distances beyond one period (a phase change of more than 2π), this displacement needs to be accumulated and added to the measured phase value, yielding X and Y. This simple algorithm enables efficient implementation in hardware, such as an ASIC or FPGA. 1.3 Implementation Specific Details In order to avoid aliasing when applying the DFT, the sensor length should be an integer multiple of the projected Sx and Sy. It is impractical to vary the physical size of the sensor to match. However: from a given sensor a variable number of pixels can be selected and the remainder discarded, yielding NX pixels. For efficient implementation of the DFT algorithm 2^N data points are required. For N=10, 1024 points are required. This may not be equal to NX. In this case the NX points are interpolated to 1024 points. This can be done with a linear, spline or any other interpolation algorithm. The actual values of NX and N depend on the application. N typically equals 10 to 16. NX is usually in the range of 512 to 8192. In general, more data points yield more resolution. However, the finite word length and truncation errors of the algorithm should also be taken into account. Image data points are typically sampled with 8 to 16 bit intensity resolution. Processing may take place at 8 to 64 bits resolution. Higher bit depth requires more resources (sometimes the resource usage scales with N^2), so the selection of the minimum required bit depth at each step of the algorithm is important. Also: the bits may be used for fixed point or floating point representation of numerical values. The number of bits used for the mantissa and exponent can also be selected. The overall system performance can be calculated mathematically from the algorithm. While this is generally true for resolution and accuracy, this may be less valid for resource usage. A possible approach to reach the required system performance is to implement all algorithms with configurable bit depth, and study the overall performance and resource usage as a function of the selected bit depths, starting with pre-calculated values. This also enables quick alteration of the implementation when the requirements change.
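The following NumPy sketch reproduces the measurement principle of Eqs. (1)–(4) on a synthetic scan line with two exact-bin fundamentals (5/N and 7/N, as in the measured example of Section 3). The structure periods, angle, amplitudes and noise level are illustrative only, the position scaling follows the reconstructed form of Eq. (4), and this models the algorithm rather than the FPGA implementation.

```python
import numpy as np

N = 1024                      # resampled pixels per scan line
k1, k2 = 5, 7                 # integer cycle counts of the two fundamentals over the line
phi1, phi2 = 1.0, -0.7        # "true" phases to recover (radians)

n = np.arange(N)
line = (np.cos(2 * np.pi * k1 * n / N + phi1) +
        0.6 * np.cos(2 * np.pi * k2 * n / N + phi2) +
        0.05 * np.random.randn(N))           # synthetic scan line with a little noise

X = np.fft.rfft(line)                         # Eq. (3): DFT of the scan line
arg_Xa, arg_Xb = np.angle(X[k1]), np.angle(X[k2])

# Eq. (4): convert the two phases to positions (Sx, Sy and alpha are illustrative values).
Sx, Sy, alpha = 210e-6, 70e-6, np.deg2rad(20)
Px = np.cos(alpha) * Sx / (2 * np.pi) * arg_Xa
Py = np.sin(alpha) * Sy / (2 * np.pi) * arg_Xb
print(f"recovered phases: {arg_Xa:+.3f}, {arg_Xb:+.3f}  ->  Px={Px:.3e} m, Py={Py:.3e} m")
```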
2 Implementation As a first step in implementing this system, a standard digital camera mounted on a microscope was used to capture images. These images were analyzed with Matlab™
using an implementation of the proposed algorithm. Various synthetic images were constructed to test the performance of the algorithm. Adding (synthetic) noise, defocus and distortion to the images yielded insight into the behavior of the algorithm, which proved useful during the test and debugging stage of the hardware implementation. Secondly, the algorithm was implemented on the FPGA as a "system on a chip" design (figure 3) where different modules are interconnected using busses. The image data flow is separated from the control flow via different channels. For the signal processing only (fast) on-chip static memory is used. Additional off-chip SDRAM is used for testing and monitoring purposes only (e.g. storing measured or simulated data). The final implementation only requires an FPGA and some external components to interface to the line-scan camera.
Fig. 3. FPGA “system on a chip” design
2.1 Hardware The system was implemented with a Fairchild™ 2K line scan CCD sensor and an Altera™ Stratix™ FPGA. The optical system uses coaxial illumination with a strobed LED (3W red Luxeon™ Lumiled™). The LED is strobed to reduce motion blur and boost the peak intensity of the LED during the exposure period of the line-scan sensor. Strobe and exposure times can vary from 1 to 100 μsec. The typical field of view is 1 to 10 mm (depending on the optics). A schematic diagram of the optical system is shown in figure 4.
Fig. 4. Optical setup with coaxial lighting: Object (O), Lens (L), Beamsplitter (BS), Sensor (P) and LED (I)
2.2 System Components The system has various components and algorithm steps. A brief explanation of these components is presented here: Acquisition, Intensity correction, Crop and resample, DFT, Phase extraction and unwrapping, Filtering and Quadrature encoder output. The FPGA system clock fsystem equals 80 MHz. From this clock all other clocks are derived. 2.2.1 Acquisition This performs the transfer of the pixel elements from the sensor to the FPGA. This is done synchronized to a line clock (tline = 33 μsec). The line clock is linked to the integration time of the sensor tint. During this time light reaching the pixel elements is integrated. At the same time, 2048 pixels, each sampled at 12 bits, are transferred using 2 parallel taps from the sensor to the FPGA. In the FPGA the data is mixed and transferred as one 12 bits wide data stream at 80 MHz. At the time of arrival of a full line in the FPGA the data is delayed by 2tline (66 μsec). 2.2.2 Intensity Correction The acquired image is normalized on a per-pixel basis to counteract any illumination differences along the scan-line as well as sensor gain and offset differences. The correction table is pre-programmed using a low-pass filtered version of the line scan data. The image is normalized around a fixed DC level (zero), to match the scaling of the DFT algorithm. The DC level is measured by averaging all pixels of each line. This process introduces a delay of up to 8tsystem (100 nsec). 2.2.3 Crop and Resample From all possible 2048 pixels, a smaller number of pixels is selected. The number of pixels depends on the FOV and feature size. This sub-image is stretched to 1024 pixels, using linear interpolation. This process introduces a delay of up to 8tsystem (100 nsec). 2.2.4 DFT The DFT is implemented as a parallel array of butterfly units. It produces a complex FFT with limited word length real and imaginary output values (typically 8 to 16 bits). There is a global scaling value for all output values. The Altera reference implementation is used [14]. In the experiment the data length was set to 1024 points and the word length to 12 bits. It is configured in a streaming mode, which means that the result is available immediately after all data points are clocked in. A data point is clocked in each system clock cycle, so after 1024 cycles = 12.8 μsec the result is available. This would add a minimal delay of 12.8 μsec. Due to implementation considerations this data is transferred at the next line clock cycle, so the total delay of this step equals 1tline (33 μsec). 2.2.5 Phase Extraction and Unwrapping The real and imaginary values corresponding to the X and Y components are extracted from the DFT data. These values are converted to amplitude and phase. The phase is determined using a CORDIC algorithm [15]. This conversion is performed in 5 steps and therefore adds a delay of 5tsystem (62.5 nsec).
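The text only states that the phase is obtained with a 5-step CORDIC; the sketch below is a generic vectoring-mode CORDIC in floating point, shown only to make the step count and the gain correction concrete. It is not the FPGA implementation, and the iteration count and scaling are illustrative.

```python
import math

def cordic_vectoring(x, y, iterations=5):
    """Generic vectoring-mode CORDIC: rotate (x, y) onto the x-axis while
    accumulating the rotation angle. Returns (amplitude, phase). The 5 iterations
    mirror the 5-step figure quoted in the text; more iterations give more phase bits."""
    angle = 0.0
    if x < 0:                              # pre-rotate into the right half-plane
        x, y, angle = -x, -y, math.pi if y >= 0 else -math.pi
    for i in range(iterations):
        d = 1.0 if y < 0 else -1.0         # drive y towards zero
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        angle -= d * math.atan(2.0**-i)
    gain = math.prod(math.sqrt(1 + 4.0**-i) for i in range(iterations))
    return x / gain, angle
```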
2.2.6 Filtering The position data is sent to a low-pass filter with an adjustable cut-off frequency. If the filter is disabled, it introduces a delay of a few system clock periods (t < 100 nsec). However, if the filter is enabled, it provides a first order low-pass characteristic. This will add a phase lag to the data that is dependent on the set cut-off frequency. Reducing the bandwidth can reduce high frequency noise that may be present in the signal. The filter is updated at a 4 MHz clock frequency. This way it also provides smooth transitions, at the output, between the measured positions delivered at the input of this filter at each tline clock period. 2.2.7 Quadrature Encoder Output In order to connect this system to a motion control system, the position information is converted into industry standard quadrature encoder signals. Two digital outputs (A, B) provide 90 degree phase shifted signals that change state depending on the measured position change (a position increment or decrement). These two outputs are available for both the X and Y position (AX, BX and AY, BY). Using two zero pulses (ZX and ZY) the zero crossing of the structure period can be communicated to the motion controller, providing it with the absolute position within the periodicity.
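As a hedged illustration of the quadrature output, the snippet below maps a stream of integer position counts onto the four A/B states; the particular state table and the omission of the zero-pulse logic are simplifications, not the hardware design.

```python
# Gray-coded A/B states for one electrical cycle; successive entries differ in one bit.
QUAD_STATES = [(0, 0), (1, 0), (1, 1), (0, 1)]

def quadrature_stream(positions):
    """Yield (A, B) output levels for a stream of integer position counts.
    Each increment or decrement of the count advances the A/B pair by one state,
    so the receiving motion controller can reconstruct the position by 4x decoding."""
    for p in positions:
        yield QUAD_STATES[p % 4]

# Example: moving forward three counts and back two.
print(list(quadrature_stream([0, 1, 2, 3, 2, 1])))
```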
3 Measurements The performance of this algorithm is analyzed on real-world image data. Figure 1 shows a typical image of a repetitive structure found in flat-panel display production (pixel elements). Figure 5a shows the pixel intensities on the scan-line and figure 5b the amplitude of the Fourier transform of this intensity data. It clearly shows the two amplitude peaks at frequencies 5/N and 7/N. Additional peaks are present, and these are harmonics of the base frequencies or due to other patterns in the image. They are not used in the current implementation of the measurement system. Some useful data may be present in these peaks and they could be combined with the base frequency information. Whether this would yield an increase in signal-to-noise ratio or linearity that is on par with the added complexity is not known at the moment.
Fig. 5. Typical scan line sensor data (a) and FFT amplitude data (b)
3.1 Hardware Implementation and Verification With the DFT resolution set at 12 bits, all communication and SDRAM measurement memory options enabled, the design consumes approximately 80% of the available resources of the FPGA and is just capable of functioning at fsystem = 80 MHz. Using synthetic images and test patterns the internal functioning and timing of the design is verified and proved to function as expected. For the subsequent measurements a TFT display substrate with a 210x70 μm feature size is mounted on a linear motor stage with a high resolution optical encoder. The optics is adjusted and the intensity information is calibrated. Figure 6 shows the measured sensor data from an arbitrary scan line and the scaled data that is sent to the DFT.
Fig. 6. Raw sensor data (thin dashed line) and scaled, intensity calibrated data with DC offset removed (thick line) of 70x210 μm substrate. The total FOV is 4.85 mm.
3.2 Resolution When the periodic signal is at maximum strength and delivers a full scale signal (at 12 bits this equals 4096 DN), the minimal measurable step with the given substrate, in the 70μm direction is 17.09 nm (5).
70 μm / 2^12 = 17.09 nm   (5)
It is unlikely however that the signals are present at maximum strength. After adjustment of the focus a signal amplitude was measured of approximately 900 (DN), corresponding to a 77 nm resolution. During motion this signal amplitude varies between 800 and 1000, this is depending on the exact alignment and substrate quality. Scratches and dust particles are present on the test substrate and the mechanical mounting of the sensor and substrate can move several microns as the setup uses normal linear guides instead of high accuracy (air) bearings. 3.3 Linearity The substrate is scanned at a velocity of 100mm/sec in the direction of the 70 μm feature size (corresponding to the Y direction in figure 1).
The measured position is logged at 30kHz and compared to the measured linear position (figure 7). This measurement shows a periodic non-linearity. The period equals the periodicity of the 70μm feature. Multiple scans over the same substrate showed this deviation to be invariant over time (short term < 1 minute) and repeatable to within 1μm. Apart from the periodic deviation, other sources of errors seem to be present. At this time it is not yet known if these are systematic measurement errors or have a mechanical origin related to the experimental setup.
Fig. 7. Measured positions and deviation from actual position
3.4 Robustness The sensor system is, by nature, highly invariant to global or low frequency intensity variations. Small local variations in the pattern (damage, particles) do not immediately degrade the measurement, as multiple pixels over a large area are included in the measurement. Qualitative experiments were performed to determine the effect of various disturbances (defocus, light intensity variations and contaminated structures). These disturbances generally lead to reduced signal amplitudes and thus a reduced signal to noise ratio. They do not however lead to catastrophic loss of position information, big jumps in the measured value, variable delays or accumulated measurement errors. This characteristic is favorable if the measured data is used as feedback information in a motion system. Disturbances may lead to reduced overall system performance, but not to a complete failure of the system. During these experiments it was also observed that the signal amplitude is a very good measure for the correct focus distance. It can thus be used to measure on a 3rd axis (depth).
4 Conclusions It has been shown that the proposed method can obtain high resolution (<1μm) , high speed (30 KHz), low latency (< 100 μs) position information from display substrates. The system can be implemented with relatively cheap hardware and it is robust for variations in lighting and defocus. A major limitation of this approach is that it requires periodic structures that can be sampled with an appropriate resolution. However, this includes a large range of
products, especially in the semiconductor and flat-panel display industry. The characteristics of the sensor system make it possible to use it in a closed loop motion control system. Linearity is not optimal and how it is influenced by the object geometry, mechanical alignment and measurement parameters is not yet fully understood. Experiments showed that measurements can also be performed in the Z-direction (depth) by analyzing the signal amplitude. Table 2. System performance measurements
Specification                      Value    Unit    Notes
Update rate                        30       kHz
Latency                            99       μsec    3 lines, can be minimized to < 2 lines after optimization
Jitter                             < 0.1    μsec    All timing fixed to system clock cycle. No variable timing.
Resolution                         < 0.1    μm      For 70 μm structure. Depending on signal amplitude
Linearity                          +/- 5    μm      Significant periodic error
Reproducibility                    < 1      μm      Short term (< 1 minute)
Price, Size, Power consumption                      Not yet within specification due to the use of "off the shelf" components. However: all parts can be manufactured at low cost (total BOM < 750 US$).
5 Future Work The sensor system performance needs to be analysed in more detail, the accuracy and linearity should be quantified for a number of different objects, at different scales. The influences of certain misalignments and maladjustments need to be analysed. The performance of the sensor system in a real closed loop control system will be studied. Extraction of additional information from the measurements may be possible. Using the amplitude information of the peaks, height can be measured. With adequate sensor length and resolution, it may also be possible to provide feedback on the rotation, as it results in a (small) frequency shift of the peaks.
References 1. Brown, L.B.: A survey of image registration techniques. ACM Computing Surveys 24(4), 325–376 (1992) 2. Kuglin, C.D., Hines, D.C.: The phase correlation image alignment method. In: Proc. Int. Conf. on Cybernetics and Society, pp. 163–165. IEEE, Los Alamitos (1975) 3. Humblot, F., Collin, B., Mohammad-Djafari, A.: Evaluation and practical issues of subixel image registration using phase correlation methods. In: Proceedings PSIP (2005) 4. Hoge, W.C.: A Subspace Identification Extension to the Phase Correlation Method. IEEE Trans. Medical Imaging 22(2), 277–280 (2003) 5. SEMI, Flat panel display sizes, Information on SEMI website http://www.semi.org
6. Franklin, G., Powell, J.D., Emami-Naeini, A.: Feedback Control of Dynamic Systems. Prentice Hall, Englewood Cliffs (2005) 7. Jonker, P.P., Caarls, W.: Application Driven Design of Embedded Real-Time Image Processors. In: Proceedings of Acivs 2003, Advanced Concepts for Intelligent Vision Systems, Ghent University, Ghent, B, September 2-5, pp. 1–8 (2003) 8. Caarls, W., Jonker, P.P., Corporaal, H.: Skeletons and Asynchronous RPC for Embedded Data- and Task Parallel Image Processing. IEICE Transactions on Information and Systems E89-D(7), 2036–2043 (2006) 9. Abbo, A.A., Kleihorst, R.P., Sevat, L., Wielage, P., van Veen, R., op de Beekck, M.J.R., van der Avoird, A.: A Low-power Parallel Processor IC for Digital Video Cameras 10. Kyo, S.: A 51.2 GOPS Programmable Video Recognition Processor for Vision base Intelligent Cruise Control Applications. In: Proceedings of MVA 2002 (IAPR Workshop on Machine Vision Applications) (2002) 11. van der Horst, J., van Leeuwen, R., Broers, H., Kleihorst, R., Jonker, P.: A Real-Time Stereo SmartCam, using FPGA, SIMD and VLIW. In: Proc. Of the 2nd Workshop on Applications of Computer Vision, Graz, Austria, May 12, 2006, (2006) 12. Zographos, A.N., Evans, J.P.O., Godber, S.X., Robinson, M.: Line-scan system for allround inspection of objects. In: Proc. SPIE vol. 3174, pp. 274–282 13. Kim, J.-H., Ahn, S., Jeon, J.W., Byun, J.-E.: A high-speed high-resolution vision system for the inspection of TFT LCD. In: IEEE International Symposium on Industrial Electronics, vol. 1, pp. 101–105 (2001) 14. Altera Corporation: FFT MegaCore Function User Guide, Document UG-FFT-3.0 (2006) Altera corporation website www.altera.com 15. Andraka, R.: A survey of CORDIC algorithms for FPGAs. In: ACM/SIGDA sixth international symposium on Field programmable gate arrays (1998)
Optimization of Quadtree Triangulation for Terrain Models Refik Samet and Emrah Ozsavas Ankara University, Engineering Faculty, Computer Engineering Department 06100 Tandogan, Ankara, Turkey
[email protected],
[email protected]
Abstract. The aim of the study is to increase the accuracy of a terrain triangulation while maintaining or reducing the number of triangles. To this end, a nontrivial algorithm for quadtree triangulation is proposed. The proposed algorithm includes: i) a resolution parameters calculation technique and ii) three error metric calculation techniques. Simulation software is also devised to apply the proposed algorithm. Initially, a data file is read to obtain the elevation data of a terrain. After that, a 3D mesh is generated by using the original quadtree triangulation algorithm and the proposed algorithm. For each of the algorithms, two situations are analyzed: i) the situation with fixed resolution parameters and ii) the situation with dynamically changing resolution parameters. For all of the cases, terrain accuracy value and number of triangles of 3D meshes are calculated and evaluated. Finally, it is shown that dynamically changing resolution parameters improve the algorithms’ performance.
1 Introduction Interactive visualization of very large scale terrain data imposes several efficiency problems. To best exploit the rendering performance, the scene complexity must be reduced as much as possible without leading to an inferior visual representation. Therefore, the geometric simplification must be controlled by an approximation error threshold. Additionally, different parts of the visible terrain can be rendered at different Level-of-Detail (LOD) to increase rendering performance [1]. Multiresolution terrain modelling is an efficient approach to improve the speed of 3D terrain modelling [2]. The concept of multiresolution refers to the possibility of using different representations of a spatial entity, having different levels of terrain accuracy and complexity [3]. The existing algorithms and models for constructing multiresolution terrain models can be divided into two categories: 1) Grid-based algorithms and 2) Triangulated Irregular Network (TIN)-based algorithms. Grid-based algorithms include the quadtree triangulation algorithm [4], [5] and the triangle bisect algorithm [2]. The quadtree triangulation algorithm constructs multiresolution models based on a bottom-up approach which is easy to implement [4]. The triangle bisect algorithm constructs multiresolution models based on a top-down approach [2]. Many algorithms have been developed based on the triangle bisect algorithm, such as the adaptive quadtree [6], the ROAMing algorithm [7], Right TIN model [8], the longest edge bisection algorithm [9] and dynamic adaptive meshes J. Blanc-Talon et al. (Eds.): ACIVS 2007, LNCS 4678, pp. 48–59, 2007. © Springer-Verlag Berlin Heidelberg 2007
[10]. The most common drawback of the grid-based algorithms is that large, flat surfaces may require the same polygon density as do small, rough areas [6]. TIN-based algorithms are inefficient because TINs are non-uniform in nature, and consequently surface following and intersection are hard to handle efficiently due to the lack of a spatial organization of the mesh polygons [6]. TIN-based algorithms can produce near optimal results in the number of triangles needed to satisfy a particular error threshold, but most do not operate in real-time [11]. Quadtree-based multiresolution triangulations have been shown to be exceptionally efficient for grid digital terrain data [12]. The purpose of this study is to increase the terrain accuracy while maintaining or reducing the number of triangles in order to reduce the graphics load, using quadtree triangulation. We propose a dynamic calculation technique for the resolution parameters and three new error metric calculation techniques for the subdivision criteria of the quadtree triangulation algorithm. For each of the four techniques (the original algorithm's technique and three techniques proposed in this paper), two situations are analyzed: i) the situation with fixed resolution parameters and ii) the situation with dynamically changing resolution parameters. For all of the cases, the terrain accuracy and number of triangles values of 3D meshes are calculated and evaluated. This is done by applying the quadtree triangulation algorithm on data files consisting of the terrain elevation data.
2 Quadtree Triangulation Algorithm In grid-based algorithms, a terrain is also called a height field because it consists of an NxM field of height values. Height fields are usually stored as NxM gray scale images. The color of each pixel of the images represents the height value (0-255) for the corresponding location in the terrain. These height fields can be generated automatically or they can come in the form of Digital Elevation Maps (DEMs) that describe actual regions of the Earth’s surface [13]. Terrain accuracy, process time, memory requirement, support for real-time processing, and number of triangles of the generated mesh are the parameters for evaluating the techniques used in the 3D mesh generation step [13]. From a GIS point of view the evaluation should be based on accuracy of the approximated terrain and number of triangles required to draw the terrain. Terrain accuracy is a measure of how close the approximated terrain resembles the original height field. This measure is calculated using the vertical distance between corresponding points in the rendered terrain and the height field. Once the vertical elevation difference has been computed for each point in the height field, the actual terrain accuracy as a percent is calculated as follows [13]: accuracy = 100 * ( ( totalHeight – totalDelta ) / totalHeight )
(1)
where the totalHeight is the sum of the heights of all the points in the height field and the totalDelta is the sum of the vertical differences of the points. The number of triangles in the 3D model is easily found. The quadtree triangulation algorithms have problems with terrain accuracy and the triangulation of the 3D mesh is never optimal. This is due to the sensitivity to localized, high frequency data within large, uniform resolution areas of lower complexity [6].
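A direct transcription of Eq. (1) is shown below, assuming the rendered terrain has already been resampled onto the same grid as the height field and that the vertical differences are taken as absolute values.

```python
import numpy as np

def terrain_accuracy(height_field, rendered):
    """Accuracy of an approximated terrain as a percentage, following Eq. (1).
    Both arguments are 2-D arrays of elevations sampled on the same grid."""
    total_height = float(np.sum(height_field))
    total_delta = float(np.sum(np.abs(height_field - rendered)))
    return 100.0 * (total_height - total_delta) / total_height
```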
To increase the terrain accuracy, much more triangles must be drawn in the 3D mesh and this causes an extra graphics load that is undesirable for real-time applications. For example, terrain data of an area of 50km2 with 5-meter ground elevation sampling resolution produces a grid of size 10000x10000. After triangulation the total number of triangles is about 200 million, which will cause problems in real-time applications. There are two construction approaches to generating a 3D mesh from a set of points on a regular grid. One approach is performed bottom-up and the other topdown to generate the hierarchy [12]. 2.1 Split Metric Calculation The important part is deciding what the correct level of detail should be at each node. The basic idea is to render sections of the terrain at a higher level of detail when they are close to the viewpoint and at lower levels of detail when they are farther away from the viewpoint. In addition to this, the error of the terrain should be taken into account to ensure that flat areas use fewer triangles since less detail is required and bumpier areas use more triangles to show more detail. Once error values are stored, deciding on the correct level of detail for each node is left up to a split metric calculation. The following equation is used for the split metric variable f [4]: f = L / ( d * C * max ( c * d2 , 1 ) )
(2)
where L is the distance from the current node to the viewpoint, C is the minimum global resolution, c is the desired global resolution, d is the width of the node and d2 is the error metric for the node. Here, C and c are user-configurable parameters and L and d are calculated at terrain vertices. If the condition (f < 1) is satisfied, that node is subdivided into four children nodes. Calculation of the error metric d2 is described below. 2.2 Calculation of the Error Metric d2 of the Original Algorithm (Technique 1) To calculate the error metric at a given node, the original algorithm [4], [12], [14] determines the elevation differences (error values) between the actual terrain height and the displayed terrain height at the centers of each of the four edges and the two diagonals of the node [4]. We now demonstrate the calculation steps at a given node using the original algorithm. Here, V(i), i=0,1,...,8, is a vertex of a node. E(i) denotes the error and H(i) denotes the height of the vertex V(i). Step 1. Determine the maximum of the error values at the two diagonals.
E(i) = max( abs( (H(i+1) + H(i+5)) / 2 – H(i) ), abs( (H(i+3) + H(i+7)) / 2 – H(i) ) ) , i = 0   (3)
Step 2. Determine the error values at the centers of each of the four edges.
E(i) = abs( (H(i–1) + H(i+1)) / 2 – H(i) ) , i = 2, 4, 6, 8   (4)
Step 3. The error of the node is the maximum of the five error values.
d2 = max( E(i) ) , i = 0, 2, 4, 6, 8   (5)
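The three steps can be written compactly as below; the vertex numbering (centre at index 0, corners at the odd indices, edge midpoints at the even indices, with index 9 wrapping back to 1) is an assumption made to match the index arithmetic of Eqs. (3)–(5).

```python
def d2_original(H):
    """Error metric of Technique 1 (Eqs. 3-5) for one node.
    H is a list of the nine vertex heights H(0)..H(8), with H(0) the centre,
    H(1), H(3), H(5), H(7) the corners and H(2), H(4), H(6), H(8) the edge midpoints
    (an assumed numbering consistent with the index arithmetic in the equations)."""
    # Step 1: maximum interpolation error over the two diagonals (Eq. 3).
    e0 = max(abs((H[1] + H[5]) / 2.0 - H[0]),
             abs((H[3] + H[7]) / 2.0 - H[0]))
    # Step 2: interpolation error at the centre of each edge (Eq. 4).
    edges = [abs((H[i - 1] + H[i + 1]) / 2.0 - H[i]) for i in (2, 4, 6)]
    edges.append(abs((H[7] + H[1]) / 2.0 - H[8]))   # edge 8 wraps around to vertex 1
    # Step 3: the node error is the maximum of the five values (Eq. 5).
    return max([e0] + edges)
```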
The algorithm explained above is a fast and moderately simple way to display height fields with continuous levels of detail. However, this algorithm suffers from problems of terrain accuracy and total number of triangles. The number of triangles is increased by increasing the terrain accuracy. The increase in number of triangles causes extra graphics load. It may be virtually impossible to create 3D models in realtime for increasingly large terrains. However, reduction in the number of triangles causes a loss in terrain accuracy.
3 The Proposed Quadtree Triangulation Algorithm The work presented here is based on the bottom-up quadtree triangulation approach. The proposed algorithm includes a technique to calculate the resolution parameters and 3 different techniques to calculate the d2 value for each node in the height field. 3.1 Resolution Parameters Calculation Technique Before the calculation of the resolution parameters for each node, the number of height intervals and the start and the end height values of each height interval should be determined. Step 1. Calculate the average height of 9 vertices of a node. avg = ∑ H ( i ) / 9 , i = 0 , 1 , … , 8
(6)
The calculated avg value is the height of that node. Step 2. Find the height interval number that has a smaller start height than the node’s height and greater end height than the node’s height. The found height interval number is used as c value for error metric (d2) calculation for that node. Step 3. Find the level of the node. The level of a node is calculated by taking the logarithm of the width of the node. The C parameter for that node is then equal to the level value and calculated as follows: C = log 2 ( d )
(7)
3.2 Calculation of the Error Metric d2 of the First Proposed Technique (Technique 2) Step 1. Calculate the avg value (Eq. 6) of a node. Step 2. Calculate the differences between these 9 vertices and their average. E ( i ) = abs ( H ( i ) – avg ) , i = 0 , 1 , ... , 8
(8)
Step 3. Find the maximum difference value. This is the error metric for that node.
d2 = max( E(i) ) , i = 0, 1, …, 8   (9)
3.3 Calculation of the Error Metric d2 of the Second Proposed Technique (Technique 3) Step 1. Find the maximum of the 9 vertices for a node.
max = max ( H ( i ) ) , i = 0 , 1 , … , 8
(10)
Step 2. Find the minimum of the 9 vertices for a node. min = min ( H ( i ) ) , i = 0 , 1 , … , 8
(11)
Step 3. Calculate the difference between the maximum and minimum values. This is the error metric for that node. d2 = max - min
(12)
3.4 Calculation of the Error Metric d2 of the Third Proposed Technique (Technique 4) Step 1. Find the median value of the 9 vertices for a node. The median is the middle value of the given numbers when arranged in ascending order. For an even number of elements the median is the average value of the two middle elements. med = median ( H ( i ) ) , i = 0 , 1 , … , 8
(13)
Step 2. Calculate the differences between the median value and 9 vertices. E ( i ) = abs ( H ( i ) – med ) , i = 0 , 1 , ... , 8
(14)
Step 3. Find the maximum of the difference values. This is the error metric for that node. d2 = max ( E ( i ) ) , i = 0 , 1 , … , 8
(15)
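For comparison, the three proposed error metrics reduce to a few lines each; H is again the list of the nine vertex heights of a node, and the code follows Eqs. (6)–(15) directly.

```python
import statistics

def d2_technique2(H):            # Eqs. (6), (8), (9): maximum deviation from the mean
    avg = sum(H) / 9.0
    return max(abs(h - avg) for h in H)

def d2_technique3(H):            # Eqs. (10)-(12): height range of the node
    return max(H) - min(H)

def d2_technique4(H):            # Eqs. (13)-(15): maximum deviation from the median
    med = statistics.median(H)
    return max(abs(h - med) for h in H)
```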
4 Simulation Procedure We have developed a simulator in C to generate a mesh and evaluate the terrain accuracy and number of triangles of each of the error metric calculation techniques. OpenGL functions [15] are used to draw the rendered terrain. The simulation procedure is realized in three steps: Step 1: In the initial step of the simulation, the elevation data of a terrain are read from a raw file into a matrix of size (terrain width)x(terrain height). This matrix is called the height field matrix. Step 2: The quadtree data structure is created to represent terrains of size (2^N+1)x(2^N+1). In this tree structure every node has either four or zero children nodes. Nodes with zero children are the leaf nodes. In order to represent the quadtree structure, a numeric matrix of size (terrain width)x(terrain height) is used. Each node in the tree corresponds to one value in the quadtree matrix, which is the subdivision metric. Calculating the quadtree matrix values involves recursively descending the tree and, at each node, establishing whether the node is at the correct level of detail or whether it should be subdivided into four children, in which case each child is then processed by the recursive algorithm.
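A simplified sketch of this recursive descent, combined with the split metric of Eq. (2), is given below; measuring L from the node centre in grid units and the integer (x, y, size) node representation are illustrative assumptions, not the simulator's data structures.

```python
import math

def build_quadtree(node, viewpoint, C, c, error_metric, min_size=1):
    """Recursive descent sketch: subdivide a node while the split metric of
    Eq. (2) is below 1. `node` is (x, y, size) on the height-field grid and
    `error_metric(node)` returns the d2 value of that node. The distance L is
    taken from the node centre to a 2-D viewpoint (a simplifying assumption)."""
    x, y, d = node
    cx, cy = x + d / 2.0, y + d / 2.0
    L = math.hypot(cx - viewpoint[0], cy - viewpoint[1])
    f = L / (d * C * max(c * error_metric(node), 1.0))     # Eq. (2)
    if f >= 1.0 or d <= min_size:
        return [node]                                       # keep this node as a leaf
    half = d // 2
    children = [(x, y, half), (x + half, y, half),
                (x, y + half, half), (x + half, y + half, half)]
    leaves = []
    for child in children:
        leaves.extend(build_quadtree(child, viewpoint, C, c, error_metric, min_size))
    return leaves
```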
After determining the height intervals, 3D meshes are generated by using the original algorithm and the proposed algorithm with fixed and dynamically changing resolution parameters. Step 3: Total triangle number and terrain accuracy values are calculated for all the 3D meshes. These values are automatically saved to a text file. In calculating the terrain accuracy, vertices that are corners of triangles have vertical difference, and thus delta value, equal to zero. To calculate delta values for other vertices in the rendered terrain, a bounding rectangle of all triangles is determined in 2D and the plane equation of each triangle is calculated [16]. The plane equation is used to check if a vertex is inside the triangle or not, for all vertices in the bounding rectangle. If a vertex is inside a triangle then its delta value is calculated.
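The sketch below shows one way such a per-vertex delta could be computed, using 2-D edge tests for the inside check and the triangle's plane equation for the interpolated height; the exact tests used in the simulator may differ.

```python
def vertex_delta(px, py, pz, tri):
    """Delta of one height-field vertex (px, py, pz) against a rendered triangle.
    `tri` is three (x, y, z) corners. Returns None if the vertex is outside the
    triangle's 2-D footprint, otherwise |actual height - interpolated height|."""
    (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) = tri
    # Inside test: the point must be on the same side of all three edges.
    def side(ax, ay, bx, by):
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    s0, s1, s2 = side(x0, y0, x1, y1), side(x1, y1, x2, y2), side(x2, y2, x0, y0)
    if not ((s0 >= 0 and s1 >= 0 and s2 >= 0) or (s0 <= 0 and s1 <= 0 and s2 <= 0)):
        return None
    # Plane through the three corners: a*x + b*y + c*z + d = 0.
    ax, ay, az = x1 - x0, y1 - y0, z1 - z0
    bx, by, bz = x2 - x0, y2 - y0, z2 - z0
    a, b, c = ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx
    if c == 0:                      # degenerate (vertical) triangle
        return None
    d = -(a * x0 + b * y0 + c * z0)
    z_on_plane = -(a * px + b * py + d) / c
    return abs(pz - z_on_plane)
```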
5 Implementation Results and Evaluation We used two raw files ‘test1.raw’ (513*513) and ‘test2.raw’ (2049*2049) as data files to the simulator. 5.1 Implementation 1: The Situation with Fixed Resolution Parameters Initially, the original algorithm and the three error metric calculation techniques of the proposed algorithm were applied to all of the nodes with fixed parameter values. In this implementation, the resolution parameters were taken from the user and not changed during the triangulation process. To be able to compare the results of this implementation with the results of Implementation 2, the counts of the numbers of triangles should be equal or nearly equal. Thus the terrain accuracy values can be compared and evaluated. In order to have much more opportunity for evaluation, we have tested different values of resolution parameters. 5.2 Implementation 2: The Situation with Height Intervals In this implementation, we used height intervals in order to determine the desired global resolution parameter c as described in Section 3.1. We tried different numbers of height intervals, but in this paper we present the results of the trials with 2, 4 and 8 height intervals separately in order to shorten the presentation. In the triangulation process, for each node of the height field, the interval number that has a smaller start height than the node’s height and greater end height than the node’s height is found. The found height interval number is used as the c value for error metric (d2) calculation for that node. The corresponding c values and the start-end height values of height intervals are in reverse order. The reason of using such a reverse order is to have greater c values, smaller f values and many more triangles in the low parts of the terrain. The C parameter is calculated for every node separately during the triangulation process as described in Section 3.1. 5.3 Evaluations The results show that dynamically changing parameters make the four techniques give greater terrain accuracy values with a decreasing or nearly same number of triangles.
Fig. 1. Results of Technique 1 for test1.raw
Fig. 2. Results of Technique 2 for test1.raw
Fig. 3. Results of Technique 3 for test1.raw
Fig. 4. Results of Technique 4 for test1.raw
Fig. 5. Results of Technique 1 for test2.raw
Fig. 6. Results of Technique 2 for test2.raw
Fig. 7. Results of Technique 3 for test2.raw
Fig. 8. Results of Technique 4 for test2.raw
For example, for test1.raw, Technique 1 gives 98.023% terrain accuracy with 25055 triangles using fixed resolution parameters (C=2, c=4), while the same technique gives greater terrain accuracy (98.198%) with a smaller number of triangles (24521) using two height intervals. This result also holds for different viewpoint locations and for different values of the C and c parameters. We propose three different techniques in order to show more clearly the effect of using dynamically changing resolution parameters. A user may choose which of the techniques to apply according to need and available technical resources. Figures 1-8 compare, for the two data files, the results of the four techniques with dynamically changing resolution parameters (the situation with height intervals) against fixed resolution parameters (the situation without height intervals). Each figure contains two curves: the dashed one shows the results of the situations with 2, 4 and 8 height intervals and therefore has three points; the other shows the results of Implementation 1 and also has three points, each corresponding to one group of fixed resolution parameters. Each figure shows the results of one of the four techniques for one of the files, hence the eight figures.
6 Conclusion
We have presented a new algorithm that includes a technique to calculate the resolution parameters and three new techniques to calculate the error metric for the split operation of the quadtree triangulation. This algorithm results in greater terrain accuracy with fewer triangles than the original quadtree triangulation algorithm. The most powerful graphics card available for workstations at this time can handle around 100 million triangles per second with 256 MB of graphics memory. The importance of the contribution of the proposed algorithm becomes much clearer with increasing file sizes. This is very important for real-time applications because of memory and processing capacity limits. We are currently investigating the effect of using different numbers of height intervals. Finally, as future work, new techniques to better calculate the error metric may need to be developed and evaluated.
References 1. Mello, F., Strauss, E., Oliviera, A., Gesualdi, A.: Non-Uniform Mesh Simplification Using Adaptative Merge Procedures (2000) 2. Yang, B., Shi, W., Li, O.: An integrated TIN and Grid method for constructing multiresolution digital terrain models. International Journal of Geographical Information Science 19, 1019–1038 (2005) 3. Magillo, P., Bertocci, V.: Managing Large Terrain Data Sets with a Multiresolution Structure. In: 11th International Workshop on Database and Expert Systems Applications Proceedings, pp. 894–898 (2000) 4. Röttger, S., Heidrich, W., Slusallek, P., Seidel, H.: Real-Time Generation of Continuous Levels of Detail for Height Fields (1998) 5. Lee, M., Samet, H.: Navigating through triangle meshes implemented as linear quadtrees. ACM Transactions on Graphics 19, 79–121 (2000) 6. Lindstrom, P., Koller, D., Ribarsky, W., Hodges, L., Faust, N., Turner, G.: Real-time continuous level of detail rendering of height fields. Computer Graphics 20, 109–118 (1996)
7. Duchaineau, M., Wolinsky, M., Sigeti, D.: ROAMING terrain: real-time optimally adapting meshes. In: Proceedings of IEEE Visualization’97, Phoenix, Arizona, pp. 81–88. IEEE Computer Society Press, Los Alamitos (1997) 8. Evans, W., Kirkpatrick, D., Townsend, G.: Right triangular irregular Networks. Technical Report 97-09, Department of Computer Science, University of Arizona, Tucson, Arizona (1997) 9. Lindstrom, P., Pascucci, V.: Terrain simplification simplified: a general framework for view-dependent out-of-core visualization. IEEE Transactions on Visualization and Computer Graphics 8, 239–254 (2002) 10. Cignoni, P., Ganovelli, F., Gobbetti, E., Marton, F., Ponchio, F., Scopigno, R.: BDAM: batched dynamic adaptive meshes for high performance terrain visualization. Computer Graphics Forum 22, 505–514 (2003) 11. Cine, D., Egbert, P.: Terrain Decimation Through Quadtree Morphing. IEEE Transactions on Visualization and Computer Graphics 7, 62–69 (2001) 12. Pajarola, R.: Overview of Quadtree-based Terrain Triangulation and Visualization. UCIICS Technical Report No. 02-01 (2002) 13. Lanthier, M., Bradley, D.: Evaluation of Real-Time Continuous Terrain Level of Detail Algorithms, Honours Project, Carleton University, Ottawa (2003) 14. Pajarola, R.: Large Scale Terrain Visualization Using The Restricted Quadtree Triangulation (1998) 15. Wright, R., Sweet, M., Lipchak, B.: OpenGL SuperBible, 3rd edn. (2004) 16. Plane Equation, MathWorld, Web Page Source, http://mathworld.wolfram.com/Plane.html
Analyzing DGI-BS: Properties and Performance Under Occlusion and Noise
Pilar Merchán1 and Antonio Adán2
1 Universidad de Extremadura, Escuela de Ingenierías Industriales, Badajoz, Spain
[email protected]
2 Universidad de Castilla La Mancha, Escuela Superior de Informática, Ciudad Real, Spain
[email protected]
Abstract. This paper analyzes a new 3D recognition method for occluded objects in complex scenes. The technique uses the Depth Gradient Image Based on Silhouette (DGI-BS) representation and addresses the identification-pose problem under occlusion and noise conditions. DGI-BS synthesizes both surface and contour information, avoiding restrictions concerning the layout and visibility of the objects in the scene. First, the paper presents the main properties of this method compared with a set of known techniques and briefly explains the key concepts of the DGI-BS representation. Second, the performance of this strategy in real scenes under occlusion and noise is presented in detail.
1 Previous Works and Comparison with Ours
When a single image of a scene is available, the information about the objects that compose the scene may be insufficient to solve recognition problems. This circumstance occurs frequently in real environments, where the objects are placed randomly in the scene. Usually, an object is occluded by one or several other objects and, consequently, only a small portion of its surface can be sensed, so recognition and pose calculation become closely related problems. In the last two decades the 3D recognition problem has been addressed in a wide range of environments and under a wide range of requirements. In general, two research lines can be found: surface correspondence and surface registration. Surface correspondence is the process that establishes which portions of two surfaces overlap. By means of the surface correspondence, registration computes the best transformation that aligns the two surfaces. The correspondence between surfaces is the more interesting line for us. Several kinds of classifications of 3D recognition can be found in the literature. The classification founded on the object features to be matched distinguishes between extrinsic and intrinsic algorithms [1]. Intrinsic features concern the surface itself, whereas extrinsic ones concern the space external to the surface. The classification related to the 3D representation (Mamic et al. [2]) distinguishes methods based on object-centered representations (curve and contour representations, axial representations, surface representations and volumetric
representations) and methods based on observer-centered representations (aspect graphs, techniques based on silhouettes, principal components, etc.). From another point of view, the techniques can be classified according to the kind of algorithm they use. Thus, we can find: iterative methods (ICP [3]); correspondence search methods through points [4], curves [5, 6] and regions [7]; and exhaustive search (pose clustering [8, 9] and geometric hashing methods [10, 11, 12]). Many of the 3D recognition techniques developed in recent years are based on surface representation models. We pay particular attention to this sort of strategy because it is close to ours. Next, the most important and best-known works in this area and a comparison among them (which includes our technique) are presented. In the cases where the authors did not report explicit numerical estimations or details concerning the parameters, the symbol NR (not reported) appears. In other cases, some results have been simplified and some values averaged. Also, some characteristics have been deduced or estimated from the information included in each paper. The referenced techniques are: SAI (Simplex Angle Image) [13], Splash [6], COSMOS (Curvedness-Orientation-Shape Map On Sphere) [14], Point Signature [15], Spin Image [16], Harmonic Shape Image [17], Surface Signature [18], Directional Histogram [19], Cone-Curvature Models [20] and the DGI-BS (Depth Gradient Image Based on Silhouette) representation [21], our strategy. From now on they will be denoted T1 to T10. The comparison is summarized in Tables 1 and 2. Table 1 presents a set of key aspects of the representation models: kind of objects supported, mesh topology, occlusion, global/local representation, memory requirements, mapping and completeness of the model, and features. Table 2 shows the properties and characteristics of each technique as well as several data related to the experimentation: type of matching, type of scene, main restrictions (shape, occlusion and noise), selection of initial points, number of views, need for a previous processing of the data, scale problems and experimentation (number of objects in the database, synthetic/real objects and kind of sensor). Among these techniques, few can accomplish 3D recognition in complex scenes including a set of disordered objects, contact, occlusion, noise and non-textured objects. Most of the methods impose some kind of restriction related to the sensed 3D information or the scene itself. The main restrictions, limitations or weak points are: several views of the scene are needed (T7, T8); only isolated objects are permitted (T1, T3, T8); shape restrictions (T1, T4, T7, T8); occlusion restrictions (in most of the works, except for T1, T5, T6 and T10, the performance of the method under occlusion is not specifically dealt with); high sensitivity to noise (T1, T3, T4 and T5); points or zones of the object must be selected in advance (T4, T5, T6, T7, T9); pre-processing stages, like segmentation, are needed (T3, T7, T9, T10); only synthetic scenes (T2, T7, T8); small databases of real objects (<15 objects) in the experimental work T1 (NR), T2, T3, T6 (NR), T7. It is really difficult to make a comparison between these different techniques. Before choosing the most appropriate one, several factors should be taken into account, depending on the specific application and environment.
Table 1. Properties of 3D representation models

Technique | Objects | Mesh topology | Occlusion | Representation | Memory requirement | Mapping | Completeness | Feature
T1 SAI | Spherical topology | Spherical | Yes | Global | High | Spherical mapping of surface curvature at all points | Complete. Shape and continuity of the surface represented | Curvature
T2 Splash | Free shape | Polygonal | Yes | Local | NR | Gaussian map of surface normals along a geodesic circle | Partial. Surface continuity is not represented | Angle-based features
T3 COSMOS | Free shape (no polyhedral) | Polygonal | No | Global | NR | Spherical mapping of orientation of CSMPs (constant-shape maximal patches) | Partial for non-convex objects | Shape spectral function (SSF)
T4 Point Signature | Free shape (no polyhedral) | Polygonal | Yes | Local | Medium | Profile of distances depending on a reference angle | NR | Distance and angle-based features
T5 Spin Image | Free shape | Polygonal | Yes | Local | High | 2D histogram of distances to the reference tangent plane and surface normal at all points | Partial | Surface histogram
T6 Harmonic Shape Image | Free shape | Polygonal | Yes | Local | NR | Harmonic map of the underlying surface onto a unit disc; surface curvature is stored on the map for all points | Complete | Curvature
T7 Surface Signature | Free shape | Simplex mesh | Yes | Local or global | Medium | Simplex angles map depending on the distance between points and angles with regard to the surface normal | Partial | Distance and angle-based features
T8 Directional Histogram | Free shape | Polygonal | No | Global | NR | Histogram of distances | NR | Distance-based features
T9 Cone-Curvature | Spherical topology | Spherical | Yes | Local or global | Medium | Mapping of Cone-Curvature values on a MWS structure | Complete | Cone curvature
T10 DGI-BS | Free shape | No meshes are used | Yes | Local | Medium/Low | Depth gradient values with regard to the silhouette points | Complete | Depth gradient
Table 2. Properties of different 3D recognition methods

Technique | Matching | Scene | Shape restrictions | Occlusion percentage | Robustness to noise | Initial points or zones must be selected | One/several view(s) of the scene | Preprocessing | Invariant to scale | Database size | Real/Synthetic objects | 3D sensor
T1 SAI | Distance between SAIs. Local matches grouped | Isolated objects | Yes | 70% | NR | No | One | No | NR | NR | NR | NR
T2 Splash | Similarity measure | Several objects | No | NR | High | No | One | Scene smoothing and holes filling | Yes | 9 | Real models. Synthetic scenes | Laser range finder
T3 COSMOS | Shape spectral moments comparison | Isolated objects | No | NR | NR | No | One | Segmentation | No | 10 | Real | Laser range scanner
T4 Point Signature | Local matches grouped and registration spawned | Several objects | Yes. Polyhedral surfaces | Only if the partial view is in the database | High, depending on the tolerance range | Yes. Arbitrary | One | No | No | 15 | Real | Range finder
T5 Spin Image | Local matches grouped and registration spawned | Several objects | No | 70% | Low | Yes. Arbitrary | One | Filtering followed by smoothing and sampling of the mesh to make equal resolutions | No | 20 | Real | Structured light range finder
T6 Harmonic Shape Image | Similarity between representations | Several objects | No | Occlusion angle limitation 67.5° | High | Yes. Arbitrary | One | Objects not in contact. Segmentation | NR | NR | Real | NR
T7 Surface Signature | Local matches grouped and registration spawned | Several objects | Yes. Surfaces of revolution | NR | High | Yes. Specific | Several in a complex scene | Segmentation | Yes | 10 | Real and Synthetic | Laser scanner
T8 Directional Histogram | Similarity measure | Isolated objects | No | 0 | High | Yes. Sampling directions | Several | No | Yes | 500 | Synthetic | NR
T9 Cone-Curvature | Similarity measure | Several objects | Yes | NR | High | Yes. Arbitrary | One | Segmentation and partial modeling | Yes | 70 | Real | Gray range finder
T10 DGI-BS | Similarity between DGI-BS representations | Several objects | No | 83% | High | No | One | Segmentation | Yes | 27 | Real | Gray range finder
In this work, the problem of correspondence and registration of surfaces is tackled by using a new strategy that synthesizes both surface and contour information in 3D scenes, avoiding most of the usual restrictions. In other words: only one view of the scene is necessary, multi-occlusion is allowed, no initial points or regions need to be chosen and no shape restrictions are imposed. The technique is based on a new 3D representation called Depth Gradient Image Based on Silhouette (DGI-BS). The main restriction of this method is that it needs a previous segmentation of the scene. Through a simple DGI-BS matching algorithm, the surface correspondence problem is solved and an initial alignment between objects is obtained. This paper has a double aim: to show the main properties of DGI-BS compared with related methods and to analyze its performance under noise and occlusion. Therefore, just a short explanation of the DGI-BS concept will be presented here. A deeper study of it and of the recognition strategy can be found in [21].
2 A Brief Review About Recognition Using DGI-BS
2.1 The DGI-BS Model
The DGI-BS representation synthesizes the surface information (through the depth image) and the shape information (through the contour) of the whole object in a single image smaller than 1 megapixel (640,000 pixels in our case). The building of a DGI-BS model is summarized in Figure 1.
Fig. 1. DGI-BS concept
Let us suppose a range image of an isolated object in the scene. Suppose also that ν is the viewpoint from which the object is observed. The first step consists of extracting the depth of the points belonging to the object and generating a depth image Z. After that, a simple image processing on Z allows the silhouette S to be extracted and, for every pixel P of S, the normal direction to the silhouette in the image to be defined. Then, the depth values of a set of equally spaced pixels of Z along the normal direction are compared with the depth of the corresponding pixel of S. When this gradient is computed for all pixels of S, a matrix of dimensions p × t is obtained, p being the silhouette length and t being the number of sampled pixels for each normal. That is what is called the DGI-BS of the object from ν. When the object is viewed from k different viewpoints, we obtain a whole DGI-BS model, which consists of an image of dimension (p, k*t). Thus, the whole model characterizes the complete object. In practice, to build a complete DGI-BS representation, a synthesized high-resolution geometric model of the object (which is obtained offline) and a tessellated sphere defining the k viewpoints are used.
2.2 Recognition in Complex Scenes
Let us now suppose a complex scene sensed by a 3D device and suppose that the object to be recognized is occluded by others in it. Since only partial information about the object is given, a partial DGI-BS representation is obtained. In this case, the DGI-BS is built by using the portion of the silhouette corresponding to the non-occluded part of the object, which is called the "real silhouette" of the object in the scene. Therefore, it is necessary to perform a preliminary range image processing consisting of two phases: segmentation and real silhouette labeling. Since this paper is not focused on presenting such processes, and due to length limitations, only a brief description of these issues is given below. Segmentation means that the scene is split into a set of disjoint surface portions belonging to the objects of the scene. Segmentation can be accomplished by discovering depth discontinuities and separating sets of 3D points in the range image. We have used the technique by Merchán and Adán [22], which is based on establishing a set of suitable data exploration directions to perform a distributed segmentation. Once the scene has been segmented, the silhouette labeling is an easy task. The procedure is basically as follows. Firstly, the silhouette of every object separated in the scene is extracted and its parts are labeled as connected (adjacent) or not connected with other silhouettes. Secondly, the parts that are occluded and those that are not are identified by using the scene depth image. These parts are respectively labeled as false and real silhouettes. If the silhouette of an object has several occlusions, a partial DGI-BS representation can be built for every one of the real silhouette parts of the object. In practice, one or, at most, two real parts per object (which correspond to one or two occlusions) are expected. The partial DGI-BS representation is built as explained in Section 2.1. It is a sub-matrix of the non-occluded version. Figure 2 (left) shows the extracted segments of the scene with the real silhouettes marked on them as well as their DGI-BS representations. The best matching point between the occluded and non-occluded (related to the model in the database) DGI-BS versions can be seen.
Fig. 2. Recognition and pose results for scene A. Left: Depth images of the extracted segments and real silhouettes labeling. Middle: Surfaces correspondence and scene-model matching results in the DGI-BS space. Right: Models posed in the scene.
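To complement the description in Section 2.1, the following C++ fragment is a minimal sketch of how a (partial) DGI-BS matrix could be assembled from a depth image and a (real) silhouette. The data structures, the precomputed silhouette normals and the sampling spacing are assumptions made for illustration, not the authors' code. A complete model would simply concatenate k such matrices, one per viewpoint of the tessellated sphere.

```cpp
#include <vector>
#include <cmath>

// One silhouette pixel with its unit inward normal in the image.
struct Pixel { int x, y; float nx, ny; };

// Build a p x t DGI-BS matrix: for every silhouette pixel, t equally spaced
// depth samples are taken along the normal and stored as depth differences
// with respect to the silhouette pixel itself.
std::vector<std::vector<float>>
buildDGIBS(const std::vector<Pixel>& silhouette,   // p (real) silhouette pixels
           const std::vector<float>& depth,        // depth image Z, row major
           int width, int height, int t, float spacing)
{
    std::vector<std::vector<float>> dgibs(silhouette.size(), std::vector<float>(t, 0.0f));
    for (size_t i = 0; i < silhouette.size(); ++i) {
        float z0 = depth[silhouette[i].y * width + silhouette[i].x];
        for (int k = 0; k < t; ++k) {
            int x = (int)std::lround(silhouette[i].x + silhouette[i].nx * spacing * (k + 1));
            int y = (int)std::lround(silhouette[i].y + silhouette[i].ny * spacing * (k + 1));
            if (x < 0 || x >= width || y < 0 || y >= height) continue;  // outside the image
            dgibs[i][k] = depth[y * width + x] - z0;  // depth gradient w.r.t. the silhouette
        }
    }
    return dgibs;
}
```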
Finally, a segment of the scene (which corresponds to a portion of the complete object) is recognized and posed by means of a simple matching algorithm in the DGI-BS space. Thus, when a surface portion Θ is segmented in a complex scene, DGI-BSΘ is searched for in the whole DGI-BS model database, obtaining a list of candidates L. For the best matching (the first of L), two coordinates (k, p) are determined: the best view of the model (index k) and the best fitting point (index p). Let us assume that the best matching corresponds to the DGI-BS of the object M, DGI-BSM, at coordinates (k*, p*). The association DGI-BSΘ–DGI-BSM implies a point-to-point correspondence in the depth images ZΘ and ZM and, consequently, the same point-to-point correspondence is maintained for the 3D points CΘ and CM. Hence, an initial transformation T* between CΘ and CM can be easily calculated. After that, a refined transformation that definitively aligns the two surfaces (scene and model) in a common coordinate system is computed by using the well-known ICP registration technique and a threshold error emax. Thus, through the ICP error eICP, the goodness of the coarse transformation T* and the validity of the DGI-BS matching can be evaluated: if eICP > emax the surface correspondence is considered wrong and the next candidate of the list L is taken in order to compute and evaluate a new value of eICP.
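The verification loop described above can be summarized by the following C++ sketch. Candidate, coarseTransform and runICP are hypothetical placeholders standing in for the DGI-BS matching output, the coarse transformation T* and the ICP refinement; their bodies are only stubs here.

```cpp
#include <vector>

struct Candidate { int model; int k; int p; };      // model id, best view, best fitting point
struct Transform3D { /* rotation + translation */ };

// Placeholders for the real pipeline (not part of this sketch).
Transform3D coarseTransform(const Candidate&) { return {}; }   // T* from the DGI-BS correspondence
double runICP(const Candidate&, Transform3D&)  { return 0.0; } // refines the pose, returns e_ICP

// Accept the first candidate of L whose ICP residual is below e_max.
bool recognizeSegment(const std::vector<Candidate>& L, double eMax,
                      Candidate& chosen, Transform3D& pose)
{
    for (const Candidate& cand : L) {               // list L, best DGI-BS match first
        Transform3D refined = coarseTransform(cand);
        double eICP = runICP(cand, refined);
        if (eICP < eMax) {                          // surface correspondence accepted
            chosen = cand;
            pose = refined;
            return true;
        }
        // otherwise the correspondence is considered wrong: try the next candidate
    }
    return false;                                   // no candidate verified
}
```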
3 Experimental Analysis of the DGI-BS Method
3.1 DGI-BS Under Occlusion
One important aspect in the computer vision area concerns occlusion. Occlusion arises when the whole information about the object or, in a general sense, about the scene, is not available to the user. Thus, there is occlusion in a single view of an isolated object because only part of the object can be sensed. We call this self-occlusion. If there are several objects in the scene, the occlusion degree may increase because the objects may occlude each other. In this case we say that multi-occlusion exists. For instance, in Figure 3 self-occlusion appears for object 3 of scene A, whereas objects 1, 2, 4 and 5 correspond to multi-occlusion cases.
Fig. 3. Scenes used in our experimentation. Recognition and pose of multi-occluded objects.
Table 3 presents a set of occlusion properties and recognition results concerning a set of scenes sensed by a gray range finder in our lab. These scenes are shown in Figure 3. They are composed of polyhedral and free-form objects without pose restrictions and have high multi-occlusion percentages: 60.5% of the objects are occluded by others. The occlusion has been calculated in two respects. Firstly, with respect to the surface, the occlusion percentage O1 is the one proposed by Johnson and Hebert [16]:
$O_1 = 1 - \frac{S_S}{S_M}$   (1)
where $S_S$ is the surface of the segment in the scene and $S_M$ is the total surface of the object in the model. Columns 3, 4 and 5 give the values of SS, SM, in square millimeters, and O1. It is important to point out that the occlusion percentages range from low (20.89%, C4) up to very high values (83.46%, F4). Maximum occlusion percentages reported in the referenced works are up to 60-70%; in our experimentation, objects with harder surface occlusion (>70%) were recognized and posed correctly. Since the DGI-BS representation also depends on the non-occluded contour portion of the object (which we have called the real silhouette), an occlusion measurement with respect to the contour of the object has also been considered. This is interesting information that has not been treated by any of the authors referenced in Section 1. Such a measure is defined as follows:
$O_2 = 1 - \frac{L_S}{L_M}$   (2)
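As a small worked example of the two measures, the helpers below compute O1 and O2 directly from Eqs. (1) and (2); for object 1 of scene A in Table 3 they give O1 = 1 − 32.16/119.84 ≈ 73.2% and O2 = 1 − 145/216 ≈ 32.9%.

```cpp
// Surface occlusion O1: S_S is the segment surface in the scene,
// S_M the total surface of the object in the model.
double surfaceOcclusion(double segmentArea_SS, double modelArea_SM)
{
    return 1.0 - segmentArea_SS / modelArea_SM;
}

// Silhouette occlusion O2: L_S is the real silhouette length in the scene,
// L_M the silhouette length of the non-occluded model from the same viewpoint.
double silhouetteOcclusion(double realSilhouetteLen_LS, double modelSilhouetteLen_LM)
{
    return 1.0 - realSilhouetteLen_LS / modelSilhouetteLen_LM;
}
```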
Table 3. Occlusion percentages in the scenes and recognition results

Scene | Object | Total area (mm²) | Segment area (mm²) | Surface occlusion (%) | Silhouette length (pixels) | Real silhouette length (pixels) | Silhouette occlusion (%) | Candidate | Recog./pose (R/E)
A | 1 | 119.84 | 32.16 | 73.17 | 216 | 145 | 32.87 | 9º | R
A | 2 | 198.06 | 54.52 | 72.47 | 244 | 141 | 42.21 | 1º | R
A | 3 | 196.00 | 97.43 | 50.29 | 255 | 255 | 0.00 | 1º | R
A | 4 | 184.55 | 49.49 | 73.18 | 275 | 220 | 20.00 | 1º | R
A | 5 | 152.83 | 34.90 | 77.16 | 202 | 104 | 48.51 | 8º | R
B | 1 | 89.37 | 44.43 | 50.29 | 152 | 152 | 0.00 | 1º | R
B | 2 | 64.81 | 25.76 | 60.26 | 175 | 118 | 32.57 | 1º | R
B | 3 | 172.72 | 32.62 | 81.11 | 256 | 127 | 50.39 | 1º | R
B | 4 | 139.72 | 43.02 | 69.21 | 253 | 183 | 27.67 | 5º | R
B | 5 | 81.39 | 34.80 | 57.24 | 193 | 193 | 0.00 | 1º | R
C | 1 | 254.47 | 95.13 | 62.62 | 299 | 113 | 62.21 | 2º | E
C | 2 | 64.25 | 42.11 | 34.46 | 156 | 146 | 6.41 | 2º | R
C | 3 | 133.48 | 82.42 | 38.25 | 196 | 145 | 26.02 | 6º | R
C | 4 | 119.84 | 94.81 | 20.89 | 233 | 233 | 0.00 | 1º | R
C | 5 | 110.29 | 69.48 | 37.00 | 212 | 212 | 0.00 | 3º | R
D | 1 | 139.72 | 103.77 | 25.73 | 247 | 247 | 0.00 | 1º | R
D | 2 | 139.33 | 104.47 | 25.02 | 224 | 169 | 24.55 | 1º | R
D | 3 | 89.37 | 49.80 | 44.28 | 169 | 55 | 67.46 | 1º | R
D | 4 | 234.32 | 147.87 | 36.89 | 292 | 292 | 0.00 | 3º | R
E | 1 | 102.01 | 69.36 | 32.00 | 136 | 49 | 63.97 | 1º | R
E | 2 | 110.68 | 74.28 | 32.89 | 215 | 215 | 0.00 | 1º | R
E | 3 | 92.41 | 58.86 | 36.31 | 165 | 165 | 0.00 | 2º | R
E | 4 | 172.72 | 109.04 | 36.87 | 256 | 161 | 37.11 | 1º | R
E | 5 | 72.35 | 33.39 | 53.85 | 96 | 96 | 0.00 | 1º | R
F | 1 | 102.22 | 27.13 | 73.46 | 204 | 131 | 35.78 | 1º | R
F | 2 | 110.68 | 36.81 | 66.74 | 197 | 155 | 21.32 | 3º | R
F | 3 | 92.41 | 35.42 | 61.67 | 192 | 192 | 0.00 | 1º | R
F | 4 | 110.29 | 18.25 | 83.45 | 225 | 114 | 49.33 | 1º | R
G | 1 | 102.22 | 68.62 | 32.87 | 213 | 194 | 8.92 | 1º | R
G | 2 | 68.35 | 31.12 | 54.47 | 145 | 96 | 33.79 | 1º | R
G | 3 | 257.19 | 135.65 | 47.26 | 295 | 89 | 69.83 | 1º | R
G | 4 | 198.06 | 134.95 | 31.86 | 248 | 213 | 14.11 | 1º | R
G | 5 | 89.37 | 44.91 | 49.75 | 159 | 159 | 0.00 | 1º | R
H | 1 | 140.50 | 74.40 | 47.04 | 247 | 164 | 33.60 | 3º | R
H | 2 | 120.40 | 61.56 | 48.87 | 240 | 81 | 66.25 | 2º | R
H | 3 | 123.97 | 72.65 | 41.40 | 181 | 104 | 42.54 | 4º | R
H | 4 | 73.21 | 50.31 | 31.28 | 162 | 162 | 0.00 | 2º | R
H | 5 | 64.81 | 49.50 | 23.62 | 182 | 182 | 0.00 | 1º | R
Now, LS means the length of the real silhouette of the object in the scene, whereas LM is the length of the silhouette in the model (without occlusion) from the camera viewpoint. The values of LS, LM, given in pixels, and O2 are given in columns 6, 7 and 8. Note that around 27% of the objects have contour occlusions higher than 40% and that percentages above 60% are even found. Of course, the total surface of the model, SM, and the length of the silhouette of the object in the model, LM, can be computed because a high-resolution geometric model of each object has been built in an off-line process. As said before, the DGI-BS matching process yields a list of candidates L that is checked through eICP. The column 'candidate' shows the first candidate of the list L that verifies eICP < emax and, therefore, that is chosen. For 68.4% of the cases the first candidate was the first of the list L and the average position was 1.97. Thus, the optimum candidate is found in the first or second position in L. Finally, the whole recognition-pose process is evaluated as right (R) or erroneous (E) in the last column. In summary, the recognition-pose success rate was 97.4% (37/38). Consequently, we can conclude that our approach is highly effective. Figure 3 depicts examples of recognition and pose of objects with multi-occlusion. The rendered models of these objects, located in their respective poses, are superimposed on the 3D points of the scene.

3.2 DGI-BS Under Noise

In this section the performance of the recognition method when Gaussian noise is injected into the range image of the scene is analyzed. In practice, noise arises for several reasons; for instance, spurious and erroneous 3D points may appear in shaded or highlighted regions. Random Gaussian noise N(0, σ²) is added to the x, y and z coordinates of the scene points and the signal-to-noise ratio is evaluated as

$SNR = 20 \log \frac{S}{N}$ db,

where S corresponds to the 3D sensor precision and N is the standard deviation of the injected noise. We have tested our method from 7 db to 42 db. Specifically, the injected SNR values were 7 db, 10 db, 13 db, 16 db, 19 db, 22 db, 27 db, 32 db, 37 db and 42 db. The noise has been added to every object in the scenes A to H, 38 objects in total.

Table 4. Recognition results under noise

SNR (db) | Standard deviation (mm) | Noise (%) | Recogn/pose (%) | First candidate in L (%) | Mean position in L
100 | – | – | 97.4 | 68.4 | 1.97
42 | 0.008 | 0.8 | 97.4 | 65.8 | 2.26
37 | 0.014 | 1.4 | 97.4 | 60.5 | 2.08
32 | 0.025 | 2.5 | 97.4 | 52.6 | 2.18
27 | 0.045 | 4.5 | 94.7 | 52.6 | 2.16
22 | 0.079 | 8 | 94.7 | 55.3 | 2.24
19 | 0.112 | 11 | 94.7 | 57.9 | 2.38
16 | 0.158 | 16 | 94.7 | 50.0 | 2.38
13 | 0.224 | 22 | 81.6 | 36.8 | 3.13
10 | 0.316 | 32 | 76.3 | 31.6 | 3.39
7 | 0.447 | 45 | 68.4 | 18.4 | 4.0
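The noise injection of Section 3.2 can be sketched in C++ as follows; the point structure and the seeding are assumptions. Note that with a sensor precision S of about 1 mm this formula reproduces the standard deviations listed in Table 4.

```cpp
#include <random>
#include <vector>
#include <cmath>

struct Point3D { double x, y, z; };

// Add zero-mean Gaussian noise whose standard deviation is derived from the
// desired SNR, with SNR = 20*log10(S/N) db and S the 3D sensor precision.
void injectGaussianNoise(std::vector<Point3D>& points, double sensorPrecision_S,
                         double snr_dB, unsigned seed = 0)
{
    double sigma_N = sensorPrecision_S / std::pow(10.0, snr_dB / 20.0);
    std::mt19937 gen(seed);
    std::normal_distribution<double> noise(0.0, sigma_N);
    for (Point3D& p : points) {                     // noise added to x, y and z
        p.x += noise(gen);
        p.y += noise(gen);
        p.z += noise(gen);
    }
}
```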
Table 4 shows the SNR values, the corresponding standard deviation in millimeters and the percentage of added noise in the first three columns. Figure 5 illustrates the appearance of segment A4 when the noise is introduced: it shows the 3D points with injected noise, the depth images, the real silhouettes and the DGI-BS representations. Notice that the real silhouette of the object is modified (reduced) as the noise increases and that a realistic noisy environment corresponds to values above 16/19 db. Table 4 and Figure 4 summarize the results obtained after applying our recognition approach to all segments of scenes A to H. The fourth column of Table 4 reports the average recognition/pose success percentage over all segments of the scenes. Note that high success rates (>90%) are maintained for SNR values above 16 db (realistic cases). Even for extreme noise levels (7 db) the success is close to 70%. This gives an idea of the robustness of DGI-BS in noisy environments. The percentage of first candidates in the list L chosen as correct is shown in the fifth column; this percentage is higher than 50% for SNR above 16 db. Finally, the last column presents the average position in L of the chosen candidate. As expected,
Recognition percentage
80
First-candidate-in-L percentage 4
60 3.5
50
Mean Position
Percentage
70
40
3
Mean Position in L
2.5
30 2
20 1.5
10
0
10
20
30
40
0
50
10
60
20
30
40 50 60 SNR (db)
70
80
70
90
80
90
100
100
SNR (db)
Fig. 4. Recognition results under noise SNR(db) 3D points
Depth Images Real silhouettes
DGI-BS
SNR(db) 3D points
Depth Images Real silhouettes
100 19 42 16 37 13 32 10 27 7 22
Fig. 5. DGI-BS under noise
DGI-BS
this position increases as the SNR decreases, but the results are acceptable. Note that for very low noise the first or second candidate in L is chosen, and that for realistic noise levels (>13 db) the mean position is close to 3. This is another fact that proves the low sensitivity of DGI-BS to noise.
4 Conclusions
A method for recognizing and positioning objects in complex scenes has been analyzed in this paper. The method is based on the DGI-BS (Depth Gradient Image Based on Silhouette) representation. After comparing it with a set of current and representative techniques, we can state that DGI-BS has very interesting properties. The method is simple because only one range image is used and an easy DGI-BS matching procedure solves both recognition and pose. It is also rather general and widely applicable because most of the usual restrictions are avoided. Thus, it works without shape restrictions, with hard occlusion and with noisy data, and no initial points or zones have to be marked in advance. Finally, DGI-BS is a complete and compact 3D representation that integrates the whole surface and contour information of an object into an image of less than 1 Mpixel. This property allows us to solve the ambiguity problem when different objects appear similar from a specific viewpoint. Consequently, many errors in recognition can be corrected.
Acknowledgements This work has been supported by the Spanish projects PBI05-028 JCCLM and DPI2006-14794-C02.
References 1. Planitz, B.M., Maeder, A.J., Williams, J.A.: The correspondence framework for 3D surface matching algorithms. Computer Vision and Image Understanding 97, 347–383 (2005) 2. Mamic, G., Bennamoun, M.: Representation and Recognition of 3D Free-Form Objects. Digital Signal Processing 12(1), 47–76 (2002) 3. Besl, P., McKay, N.: A method for registration of 3-D shapes. IEEE TPAMI 14(2), 239–256 (1992) 4. Thirion, J.P.: Extremal points: definition and application for 3D image registration. In: Proc. of the IEEE Conference on CVPR, pp. 587–592. IEEE Computer Society Press, Los Alamitos (1994) 5. Krsek, P., Pajdla, T., Hlavac, V.: Differential invariants as the base of triangulated surface registration. Computer Vision and Image Understanding 87(1-3), 27–38 (2002) 6. Stein, F., Medioni, G.: Structural indexing: efficient 3D object recognition. IEEE TPAMI 14(2), 125–145 (1992) 7. Besl, P.J., Jain, R.C.: Invariant surface characteristics for 3D object recognition in range images. Comput. Vision Graph. Image Process 33, 33–80 (1986) 8. Linnainmaa, S., Harwood, D., Davis, L.: Pose determination of a three-dimensional object using triangle pairs. IEEE TPAMI 10(5), 634–647 (1988)
9. Olson, C.F.: Efficient pose clustering using a randomized algorithm. Int. J. Comput. Vision 23(2), 131–147 (1997) 10. Chen, C.S., Hung, Y.P., Cheng, J.B.: RANSAC- based DARCES: a new approach to fast automatic registration of partially overlapping range images. IEEE TPAMI 21(11), 1229– 1234 (1999) 11. Mokhtarian, F., Khalili, N., Yuen, P.: Multi-scale free-form 3D object recognition using 3D models. Image Vision Comput. 19, 271–281 (2001) 12. Wolfson, H.J., Rigoutsos, I.: Geometric hashing: an overview. IEEE Comput. Sci. Eng. 4(4), 10–21 (1997) 13. Hebert, M., Ikeuchi, K., Delingette, H.: A Spherical Representation for Recognition of Free-Form Surfaces. IEEE TPAMI 17(7), 681–690 (1995) 14. Dorai, C., Jain, A.K.: COSMOS-A Representation Scheme for 3-D Free-Form Objects. IEEE TPAMI 19(10), 1115–1130 (1997) 15. Chua, C.S., Jarvis, R.: Point Signatures: A new representation for 3D Object Recognition. International Journal of Computer Vision 25(1), 63–85 (1997) 16. Johnson, A.E., Hebert, M.: Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes. IEEE TPAMI 21(5), 433–449 (1999) 17. Zhang, D.: Harmonic Shape Images: A 3D free-form surface representation and its applications in surface marching. Ph.D. Thesis, CMU-RI-TR-99-41, CMU, Pittsburgh (1999) 18. Yamany, S.M., Farag, A.A.: Surface Signatures: An Orientation Independent free-form Surface Representation Scheme for the Purpose of Objects Registration and Matching. IEEE TPAMI 24(8) (2002) 19. Liu, X., Sun, R., Kang, S.B., Shum, H.: Directional Histogram Model for ThreeDimensional Shape Similarity. In: Proc. of the IEEE Conference on CVPR, IEEE Computer Society Press, Los Alamitos (2003) 20. Adán, A., Cerrada, C., Feliu, V.: Modeling Wave Set: Definition and Application of a New Topological Organization for 3D Object Modeling. Computer Vision and Image Understanding 79(2), 281–307 (2000) 21. Merchán, P., Adán, A., Salamanca, S.: Depth Gradient Image Based On Silhouette: A Solution for Reconstruction of Scenes in 3D Environments. In: Proc. of the 3DPVT (2006) 22. Merchán, P., Adán, A.: Exploration trees on highly complex scenes: A new approach for 3D segmentation. Pattern Recognition 40(7), 1879–1898
Real-Time Free Viewpoint from Multiple Moving Cameras
Vincent Nozick1,2 and Hideo Saito2
1 Gaspard Monge Institute, UMR 8049, Marne-la-Vallée University, France
2 Graduate School of Science and Technology, Keio University, Japan
{nozick,saito}@ozawa.ics.keio.ac.jp
Abstract. In recent years, some Video-Based Rendering methods have advanced from off-line rendering to on-line rendering. However, very few of them can handle moving cameras while recording. Moving cameras enable the user to follow an actor in a scene, come closer to get more details or just adjust the framing of the cameras. In this paper, we propose a new Video-Based Rendering method that creates new views of the scene live from four moving webcams. These cameras are calibrated in real-time using multiple markers. Our method fully uses both CPU and GPU and hence requires only one consumer-grade computer.
1 Introduction
Video-Based Rendering (VBR) is an emerging research field that proposes methods to compute new views of a dynamic scene from video streams. VBR techniques are divided into two families. The first one, called off-line methods, focuses on visual quality rather than on computation time. These methods usually use a large number of cameras or high-definition devices and sophisticated algorithms that prevent them from live rendering. First, the video streams are recorded. Then the recorded data is processed off-line to extract 3D information. Finally, the rendering step creates new views of the scene, usually in real-time. This three-step approach (record - compute - render) provides high-quality visual results but the computation time can be long compared to the length of the input video. The methods from the second family are called on-line methods. They are fast enough to extract information from the input videos, create and display a new view several times per second. The rendering is then not only real-time but also live. Almost all VBR techniques use calibrated input cameras, and these calibrated cameras must remain static during the shot sequence. Hence it is impossible to follow a moving object with a camera. Using one or more moving cameras allows the user to come closer to an actor to get more details. This technique can also be used to adjust the framing of the cameras. Furthermore, if someone involuntarily moves a camera, no calibration update is required. Hence this technique provides more flexibility in the device configuration.
In this article, we present a new live VBR method that handles moving cameras. For every new frame, each camera is calibrated using multiple markers laid on the scene. Contrary to most Augmented Reality applications, the multiple markers used in our method do not need to be aligned since their respective positions are estimated during the calibration step. Our method uses four input webcams connected to a single computer. The lens distortion correction and the new views are computed on the GPU, while the video stream acquisition and the calibration are performed by the CPU. This configuration fully exploits both CPU and GPU. Our method follows a plane-sweep approach and, contrary to concurrent methods, a background extraction is not required. Therefore this method is not limited to rendering a single object. Our method provides good-quality new views using only one computer, while concurrent methods usually need several computers. To our knowledge, this is the first live VBR method that can handle moving cameras. In the following parts, we propose a survey of previous work on both recent off-line and on-line Video-Based Rendering techniques. Then we explain the plane-sweep algorithm and our contribution. Finally, we detail our implementation and discuss experimental results.
2 Video-Based Rendering: Previous Work
2.1 Off-Line Video-Based Rendering
The Virtualized Reality presented by Kanade et al. [1] is one of the first methods dealing with VBR. The proposed device first records the video streams from 51 cameras and then computes, frame by frame, a depth map and a reconstruction for every input camera. Finally, the new views are created using the reconstruction computed from the most appropriate cameras. Considering the amount of data to compute, the depth map and reconstruction process can take a very long time. Therefore this method is hardly compatible with live rendering. Goldlucke et al. [2] follow the same off-line approach using 100 cameras. Zitnick et al. [3] also use this approach but with around ten high-definition cameras. The depth maps are computed using a powerful but time-consuming segmentation method and the rendering is performed with a layered-image representation. This method finally provides high-quality new views in real-time. The Stanford Camera Array presented by Wilburn et al. [4] uses an optical flow approach and provides real-time rendering from 100 cameras. Franco and Boyer [5] provide new views from 6 cameras with a Visual Hulls method. VBR methods designed to handle moving cameras are very rare. Jarusirisawad and Saito [6] propose a projective grid space method that can use uncalibrated cameras. This method does not use markers, but the cameras' movements are limited to pure rotation and zooming. It also needs a significant amount of time to compute a new view and thus is not well suited to real-time rendering. All these VBR methods use a three-step approach (record - compute - render) to create new views. Naturally, they can take advantage of the most powerful algorithms even if they are time-consuming. Furthermore, since most of these
methods use a large amount of information, the computing process becomes very long, but the visual result is usually excellent.
2.2 Live Video-Based Rendering
Contrary to off-line methods, live (or on-line) methods are fast enough to extract information from the input videos, create and display a new view several times per second. However, powerful algorithms such as global optimization are not suited for real-time implementation, and thus we cannot expect equivalent accuracy and visual results from off-line and on-line methods. Furthermore, only a few VBR methods reach on-line rendering and, to our knowledge, none of them handles moving cameras. And since live rendering imposes severe constraints on the choice of the algorithms used, it becomes difficult to adapt a live method for moving cameras. Currently, the Visual Hulls algorithm is the most popular live VBR method. This method first performs a background extraction on every frame so that only the main "object" of the scene remains in the input images. The 3D shape of this object is then approximated by the intersection of the projected silhouettes. Several on-line implementations have been proposed and most of them are described in [7]. The most significant method is probably the Image-Based Visual Hulls proposed by Matusik et al. [8]. This method reaches real-time and live rendering using four cameras connected to a five-computer cluster. The easiest method to implement is very likely the Hardware-Accelerated Visual Hulls presented by Li et al. [9]. The main drawback of these methods is that they cannot handle the background of the scene since only one main "object" can be rendered. Furthermore, the Visual Hulls methods usually require several computers, which makes their use more difficult. On the other hand, these methods have the ability to place the input cameras far from each other, for example around the main object. However, this advantage becomes a big constraint for real-time calibration since few cameras could see common calibration markers. Yang et al. [10] propose a distributed Light Field using a 64-camera device based on a client-server scheme. The cameras are controlled by several computers connected to a main server. Only those image fragments needed to compute the new view are transferred to the server. This method provides live rendering but requires at least eight computers for 64 cameras and additional hardware. Technically, this method can probably be adapted to moving cameras, but it may become difficult to move 64 cameras correctly. Finally, some plane-sweep methods reach on-line rendering. Yang et al. [12] compute new views live from five cameras using four computers. Geys et al. [13] combine a plane-sweep algorithm with a 3D shape optimization method and provide live rendering from three cameras and one computer. Since our method belongs to the latter family, we will present the basic plane-sweep algorithm and the contributions of [12,13] in the next section. Then we will detail our rendering method and the camera calibration step.
3 Plane-Sweep Algorithm
3.1 Overview
Given a small set of calibrated images from video cameras, we wish to generate a new view of the scene from a new viewpoint. Considering a scene where objects are exclusively diffuse, we first place the virtual camera camx and define a near plane and a far plane such that every object of the scene lies between these two planes. Then, we divide the space between the near and far planes into parallel planes Di in front of camx, as shown in Fig. 1.
Fig. 1. Plane-sweep: geometric configuration
Let us consider a visible object of the scene lying on one of these planes Di at a point p. This point will be seen by every input camera with the same color (i.e. the object color). Consider now a point p that lies on a plane but not on the surface of a visible object. As illustrated in Fig. 1, this point will probably not be seen by the input cameras with the same color. Therefore, points on the planes Di whose projections on every input camera provide a similar color potentially correspond to the surface of an object of the scene. A usual way to create a new image is to process the planes Di in a back-to-front order. For each pixel p of each plane Di, a score and a color are computed according to the matching of the projected colors (Figure 2). When every pixel p of a plane has been computed, every score and color are projected on the virtual camera camx. The final image is computed in a z-buffer style: consider a point p projected on a pixel of the virtual image; this pixel's color will be updated only if the score of p is better than the current score. We note that, thanks to this plane approach, the method is well suited for implementation on graphics hardware.
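To make the geometry concrete, the following C++ fragment is a minimal sketch (not the authors' code) of how a 3D point lying on a plane Di is projected into each input camera with its 3×4 projection matrix; the matrix layout and helper names are assumptions.

```cpp
#include <array>
#include <vector>

using Mat34 = std::array<double, 12>;               // row-major 3x4 projection matrix

struct Pixel2D { double u, v; bool valid; };

// Project a 3D point (X, Y, Z) with one camera's projection matrix.
Pixel2D project(const Mat34& P, double X, double Y, double Z)
{
    double x = P[0]*X + P[1]*Y + P[2]*Z  + P[3];
    double y = P[4]*X + P[5]*Y + P[6]*Z  + P[7];
    double w = P[8]*X + P[9]*Y + P[10]*Z + P[11];
    if (w == 0.0) return {0.0, 0.0, false};
    return {x / w, y / w, true};                    // image coordinates of p in this camera
}

// For one point of plane D_i, collect its projections in all input cameras;
// the colors c_j read at these pixels feed the scoring stage.
std::vector<Pixel2D> projectOnInputCameras(const std::vector<Mat34>& cameras,
                                           double X, double Y, double Z)
{
    std::vector<Pixel2D> pixels;
    for (const Mat34& P : cameras)
        pixels.push_back(project(P, X, Y, Z));
    return pixels;
}
```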
Fig. 2. Left: all input images are projected on the current plane. A score and a color are computed for every point of this plane. Right: these computed scores and colors are projected on the virtual camera.
3.2 Previous Implementation
The plane-sweep algorithm was first introduced by Collins [11]. Yang et al. [12] propose a real-time implementation using register combiners. They first place the virtual camera camx and choose, among the input cameras, a reference camera that is closest to camx. For each plane Di, they project the input images, including the reference image. During the scoring stage, they compute for every pixel p of Di a score by adding the Sum of Squared Differences (SSD) between each projected image and the projected reference image. This method provides real-time and on-line rendering using five cameras and four computers; however, the input cameras have to be close to each other and the navigation of the virtual camera should lie between the viewpoints of the input cameras, otherwise the reference camera may not be representative of camx. Lastly, discontinuities may appear in the computed video when the virtual camera moves and changes its reference camera. Geys et al. [13] combined Yang et al.'s plane-sweep implementation with an energy minimization method based on a graph cut algorithm to create a 3D triangle mesh. This method provides real-time and on-line rendering using three cameras and only one computer. However, it requires a background extraction and only computes a 3D mesh for non-background objects.
4 Our Scoring Method
The score computation is a crucial step in the plane-sweep algorithm: both the visual quality and the computation speed depend on it. Previous methods compute
scores by comparing the input images with a reference image. Our method aims to avoid the use of such a reference image, which is usually not representative of the virtual view. We also try to use every input image together rather than computing images by pairs. However, since the scoring stage is performed by the graphics hardware, only simple instructions are supported. An appropriate solution is then to use variance and average tools. Consider a point p lying on a plane Di. The projection of p on each input image j provides a color cj. We propose to set the score as the variance computed from every cj and the final color as the average color of the cj. If the input colors cj all match, this method provides a small variance, which corresponds to a high score; furthermore, the average color is then highly representative of the cj. If the input colors cj mismatch, the provided score will be low since the computed variance will be high. In the latter case, the average color will not be representative of the input colors cj, but since the score is low, this color will very likely not be selected for the virtual image computation. Finally, our plane-sweep implementation can be summarized as follows:

◦ reset the scores of the virtual camera
◦ for each plane Di from far to near
  • for each point (fragment) p of Di
    → project p on the n input images; cj is the color obtained from this projection on the j-th input image
    → compute the average and the variance of {cj}j=1...n
    → set the color and the score of p to the computed average and variance
  • project Di's scores and colors on the virtual camera
  • for each pixel q of the virtual camera
    → if the projected score is better than the previous ones, then update the score and the color of q
◦ display the computed image

This method does not require any reference image and all input images are used together to compute the new view. The visual quality of the computed image is thus noticeably increased. Moreover, this method avoids the discontinuities that could appear in the virtual video when the virtual camera moves and changes its reference camera. Finally, this method handles dynamic backgrounds.
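The scoring step above can be illustrated by the following CPU reference in C++ (the paper performs it in a fragment shader); the color structure and the per-channel variance summation are assumptions.

```cpp
#include <vector>

struct Color { float r, g, b; };

// For one plane point: average of the n projected colors c_j and their
// variance (summed over the three channels). A small variance is a good score.
void scoreFragment(const std::vector<Color>& c, Color& avg, float& variance)
{
    avg = {0.0f, 0.0f, 0.0f};
    for (const Color& cj : c) { avg.r += cj.r; avg.g += cj.g; avg.b += cj.b; }
    float n = (float)c.size();
    avg.r /= n; avg.g /= n; avg.b /= n;

    variance = 0.0f;
    for (const Color& cj : c) {
        float dr = cj.r - avg.r, dg = cj.g - avg.g, db = cj.b - avg.b;
        variance += dr * dr + dg * dg + db * db;
    }
    variance /= n;
}
```

In the GPU version described in the implementation section, the variance is written to the fragment depth and the average color to the fragment color, so the standard z-test keeps, for every pixel of the virtual view, the plane point with the smallest variance.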
5 On-Line Calibration from Multiple Markers
Like most VBR techniques, our method requires calibrated cameras. Furthermore, since the cameras are allowed to move, the calibration parameters must be updated for every frame. The calibration should be accurate but also real-time for multiple cameras. To satisfy these constraints, we opted for a marker-based approach. The markers we used are 2D patterns drawn in black squares. They are
Fig. 3. Estimation of the relationship between every marker by homography
detected and identified by ARtoolkit [14], a very popular tool for simple on-line Augmented Reality applications. Using only one marker is usually not enough to calibrate a camera efficiently. Indeed, the marker detection may not be accurate, or the marker may not be detected at all (detection failure or occlusion). Multiple markers reduce the detection failure problem and provide better results. In most multiple-marker applications, the markers are aligned and their respective positions must be known. To decrease the constraints on the marker layout, some methods like [15] or [16] use multiple markers with arbitrary 3D positions. In our case, the camera viewpoints can change at any time, so it is easier to increase the number of markers seen by a camera if they are close to each other. Thus a coplanar layout is well suited for our VBR method. In the following part, we present a method using multiple markers with arbitrary positions and sizes. In this method, ARtoolkit is used only to provide the marker positions in image coordinates, not for the calibration itself. First, the cameras' internal parameters should be computed beforehand. Then, the full calibration part can begin. The user sets some markers in a planar configuration. They can have different sizes and any layout is satisfactory. A reference marker should be chosen to be the origin of the scene referential. Then one of the input cameras takes a picture containing all the markers, so that the geometrical relationship between the markers can be estimated. Indeed, a homography H between this picture and the reference marker is computed (see Figure 3). Applying H to the pixel coordinates of every detected marker provides its position in the scene referential. Then, every moving camera can be calibrated in real-time. At least one marker should appear in an image to compute a calibration matrix. First, every detected marker is processed independently: a projection matrix is estimated by Zhang's method [17] using correspondences between the marker pixel coordinates and its position in the scene referential computed previously. Then the final projection matrix is set as the average of the projection matrices computed from every marker. Thus, in this method, both rotation and translation are handled. To make the use of our VBR method easy, we propose to define the far plane as the plane containing the markers.
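The off-line registration of the markers can be sketched as follows; H is assumed to be a row-major 3×3 homography and applyHomography/registerMarkers are illustrative helpers, not ARtoolkit functions.

```cpp
#include <array>
#include <vector>

using Mat33 = std::array<double, 9>;                // row-major 3x3 homography

struct Point2D { double x, y; };

// Map a pixel of the set-up picture to the reference-marker (scene) plane.
Point2D applyHomography(const Mat33& H, const Point2D& pixel)
{
    double x = H[0]*pixel.x + H[1]*pixel.y + H[2];
    double y = H[3]*pixel.x + H[4]*pixel.y + H[5];
    double w = H[6]*pixel.x + H[7]*pixel.y + H[8];
    return {x / w, y / w};                          // position in the scene referential
}

// Scene-plane coordinates of the four corners of every detected marker;
// these positions are later used as 3D references (the markers are coplanar).
std::vector<std::array<Point2D, 4>>
registerMarkers(const Mat33& H, const std::vector<std::array<Point2D, 4>>& cornersInImage)
{
    std::vector<std::array<Point2D, 4>> cornersInScene;
    for (const auto& m : cornersInImage) {
        std::array<Point2D, 4> out;
        for (int i = 0; i < 4; ++i) out[i] = applyHomography(H, m[i]);
        cornersInScene.push_back(out);
    }
    return cornersInScene;
}
```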
6 Implementation
We implemented this method in C++ with OpenGL. The video stream acquisition is performed using Video for Linux. During the calibration step, we use ARtoolkit [14] only for the marker detection; the full calibration is then computed as explained in Section 5. The plane that contains all the markers is taken as the far plane, so the user only has to define the depth of the scene. Nevertheless, these two planes can also be set automatically using a precise stereo method as described in [13]. Naturally, the calibration parameters can be locked and unlocked during the shooting. Thus, even if all the markers are occluded, the camera calibration remains correct, but in that case the cameras have to remain static during this time interval. Our method is especially well suited to be used with webcams. However, this kind of camera is usually subject to lens distortion. Since the markers used for the calibration step can appear anywhere in the input images, and not only in their central part, lens distortion correction is indispensable. In our method, we only focus on the radial distortion correction [18]. Our experiments show that this correction cannot be done by the CPU in real-time. Indeed, the CPU is already fully exploited by the video stream acquisition, the marker detection, the camera calibration and other tasks related to the plane-sweep algorithm. Hence the correction is performed by the GPU using fragment shaders. This step is done off-screen in one pass for each input image using Frame Buffer Objects. Implementation indications can be found in [19]. Skipping the radial distortion correction would affect both the calibration accuracy and the score computation, and the visual result would then be slightly degraded. Concerning the new view computation, the user should define the number k of planes Di used for the rendering; the new view computation requires k passes. Each plane Di is drawn as a multi-textured GL_QUADS primitive. Multi-texturing provides access to every texture simultaneously during the scoring stage. The scores are computed by fragment shaders using the algorithm described in Section 4. The scores are stored in gl_FragDepth and the colors in gl_FragColor. Then we let OpenGL select the best scores with the z-test and update the color in the frame buffer. To summarize, the CPU performs the video stream acquisition, the camera calibration and the virtual camera control, while the GPU corrects the lens distortion and creates the new view. This configuration fully uses the capability of both CPU and GPU. We tried other configurations but the result was not real-time.
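As an illustration of the distortion correction mentioned above, the following CPU reference applies a simple one-coefficient radial model; the actual shader-based correction of [18] may use a different model, so cx, cy and k1 are assumptions obtained from the preliminary internal calibration.

```cpp
struct Point2D { double x, y; };

// First-order radial distortion correction around the center (cx, cy);
// in the paper this mapping is evaluated per pixel in a fragment shader.
Point2D undistort(const Point2D& p, double cx, double cy, double k1)
{
    double dx = p.x - cx, dy = p.y - cy;
    double r2 = dx * dx + dy * dy;
    double factor = 1.0 + k1 * r2;                  // first-order radial term only
    return {cx + dx * factor, cy + dy * factor};
}
```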
7 Results
We tested our method with an Intel Core2 1.86 GHz with a GeForce 7900 GTX. The video acquisition is performed by four USB Logitech Fusion webcams and reaches 15 frames per second with a 320×240 resolution. The computation time needed to create a new view is linearly dependent on the number of planes used, on the number of input images, and on the resolution
Fig. 4. Images computed live from four cameras
of the virtual view. The number of planes required depends on the scene. In our tests, experiments showed that with fewer than 10 planes the visual result became unsatisfying, while above 60 planes the visual result is not improved. Since the bottleneck of our method is the video stream acquisition, we used 60 planes in our experiments. Finally, we set the virtual image resolution to 320×240. With this configuration, our method reaches 15 frames per second. A typical baseline configuration for our system is roughly a 30° angle between the extreme cameras and the center of the scene. The cameras do not have to be aligned. Figure 4 shows nine views corresponding to an interpolation between the four cameras. Figure 5 depicts a virtual view taken at mid-distance between two adjacent cameras with the same camera configuration as in Figure 4. The middle image corresponds to the real image and the right image is the difference between the virtual and the real image. Despite some mismatching at object borders, our method provides good accuracy. Whatever the marker layout is, our real-time calibration method provides accurate calibration for our plane sweep method as long as every input camera can detect at least one or two markers. Thanks to this method, the cameras can move in the scene without any repercussion on the visual result. Figure 6 shows two images created during the same sequence where cameras have been moved. The markers can have different sizes, but all of them should be detected and identified in a single image during the first calibration step.
Fig. 5. Left: new view created at mid-distance between two adjacent cameras; Middle: real image; Right: difference between the real image and the computed view
Fig. 6. During the same video sequence, the input cameras can be moved
In our tests, the bottleneck of the method is the webcam acquisition frame rate, but other webcams provide higher frame rates. Our application speed would then be limited by the plane-sweep method itself, and especially by the virtual view resolution. Currently, four webcams is our upper limit for real-time new view computation. Using more cameras would increase the visual quality, and a more powerful GPU would probably help to increase the number of cameras, but the real-time video stream acquisition would then become a problem. Our experiments also show that using only three cameras slightly decreases the visual quality and restricts the camera configuration to smaller baselines.
8
Conclusion
In this article we present a live Video-Based Rendering method that handles moving cameras and requires only a consumer grade computer. The new view is computed from four input images and our method follows a plane sweep approach. Both the CPU and the GPU are fully exploited. The input cameras are
calibrated using multiple markers. These markers must be coplanar, but their layout and their size do not have to be known in advance, so the user can choose the most adequate configuration. Our method achieves live rendering with four webcams. These cameras can be moved to follow an actor or to focus on a specific part of the scene. Our implementation shows that it is possible to combine live Video-Based Rendering with moving cameras using markers. Concerning future work, we intend to enhance our method with real-time calibration without markers.
References
1. Kanade, T., Narayanan, P.J., Rander, P.: Virtualized reality: concepts and early results. In: proc. of the IEEE Workshop on Representation of Visual Scenes, p. 69 (1995)
2. Goldlucke, B., Magnor, M.A., Wilburn, B.: Hardware accelerated Dynamic Light Field Rendering. In: proc. of Modelling and Visualization VMV 2002, aka, Berlin, Germany, pp. 455–462 (2002)
3. Zitnick, C.L., Kang, S.B., Szeliski, R.: High-quality video view interpolation. In: proc. ACM SIGGRAPH 2004, pp. 600–608. ACM Press, New York (2004)
4. Wilburn, B., Joshi, N., Vaish, V., Talvala, E.-V., Antunez, E., Barth, A., Adams, A., Horowitz, M., Levoy, M.: High Performance Imaging Using Large Camera Arrays. In: proc. of ACM SIGGRAPH 2005, pp. 765–776. ACM, New York (2005)
5. Franco, J.-S., Boyer, E.: Fusion of Multi-View Silhouette Cues Using a Space Occupancy Grid. In: proc. of International Conference on Computer Vision ICCV'05, pp. 1747–1753 (2005)
6. Jarusirisawad, S., Saito, H.: New Viewpoint Video Synthesis in Natural Scene Using Uncalibrated Multiple Moving Cameras. In: International Workshop on Advanced Imaging Techniques IWAIT 2007, pp. 78–83 (2007)
7. Magnor, M.A. (ed.): Video-Based Rendering. A K Peters Ltd.
8. Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-Based Visual Hulls. In: proc. ACM SIGGRAPH 2000, pp. 369–374. ACM, New York (2000)
9. Li, M., Magnor, M., Seidel, H.-P.: Hardware-Accelerated Visual Hull Reconstruction and Rendering. In: proc. of Graphics Interface GI'03, Halifax, Canada, pp. 65–71 (2003)
10. Yang, J.C., Everett, M., Buehler, C., McMillan, L.: A real-time distributed light field camera. In: proc. of the 13th Eurographics Workshop on Rendering, Italy, pp. 77–86 (2002)
11. Collins, R.T.: A Space-Sweep Approach to True Multi-Image Matching. In: proc. Computer Vision and Pattern Recognition Conf., pp. 358–363 (1996)
12. Yang, R., Welch, G., Bishop, G.: Real-Time Consensus-Based Scene Reconstruction using Commodity Graphics Hardware. In: proc. of Pacific Graphics, pp. 225–234 (2002)
13. Geys, I., De Roeck, S., Van Gool, L.: The Augmented Auditorium: Fast Interpolated and Augmented View Generation. In: proc. of European Conference on Visual Media Production, CVMP'05, pp. 92–101 (2005)
14. Billinghurst, M., Campbell, S., Chinthammit, W., Hendrickson, D., Poupyrev, I., Takahashi, K., Kato, H.: Magic Book: exploring transitions in collaborative AR interfaces. In: proc. of SIGGRAPH 2000, p. 87 (2000)
15. Uematsu, Y., Saito, H.: AR registration by merging multiple planar markers at arbitrary positions and poses via projective space. In: proc. of ICAT 2005, p. 4855 (2005)
16. Yoon, J.-H., Park, J.-S., Kim, C.: Increasing Camera Pose Estimation Accuracy Using Multiple Markers. In: Pan, Z., Cheok, A., Haller, M., Lau, R.W.H., Saito, H., Liang, R. (eds.) ICAT 2006. LNCS, vol. 4282, pp. 239–248. Springer, Heidelberg (2006)
17. Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1330–1334 (2000)
18. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004)
19. Pharr, M., Fernando, R.: GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley Professional, Reading (2005)
A Cognitive Modeling Approach for the Semantic Aggregation of Object Prototypes from Geometric Primitives: Toward Understanding Implicit Object Topology Peter Michael Goebel and Markus Vincze Automation & Control Institute Faculty of Electrical Engineering and Information Technology Vienna University of Technology, A-1040 Vienna {gp,vincze}@acin.tuwien.ac.at http://www.acin.tuwien.ac.at
Abstract. Object recognition has developed into the most common approach for detecting arbitrary objects based on their appearance, where viewpoint dependency, occlusions, algorithmic constraints and noise often hinder successful detection. Statistical pattern analysis methods, which are able to extract features from appearing images and enable the classification of the image content, have reached a certain maturity and achieve excellent recognition on rather complex problems. However, these systems do not seem directly scalable to human performance in a cognitive sense, and appearance does not contribute to understanding the structure of objects. Syntactical pattern recognition methods are able to deal with structured objects, which may be constructed from primitives that were generated from extracted image features. Here, an eminent problem is how to aggregate image primitives in order to (re-)construct objects from such primitives. In this paper, we propose a new approach to the aggregation of object prototypes by using geometric primitives derived from features out of image sequences acquired from changing viewpoints. We apply syntactical rules for forming representations of the implicit object topology of object prototypes by a set of fuzzy graphs. Finally, we find a superposition of a prototype graph set, which can be used for updating and learning new object recipes in a hippocampal-like episodic memory that paves the way to cognitive understanding of natural scenes. The proposed implementation is exemplified with an object similar to the Necker cube. Keywords: Cognitive Modeling, Cognitive Representation, Fuzzy & Planar Graphs, SubGraph Matching, Image Primitives.
This work was supported by project S9101 ”Cognitive Vision” of the Austrian Science Foundation.
1
Introduction
The field of Cognitive Systems, a multidisciplinary field, is concerned with high-level advanced cognitive capabilities that are enablers for the achievement of more intelligent goals such as scene understanding and autonomous navigation in complex cluttered environments. Vision, as a key perceptual capability, is related to rather difficult problems, such as visual object recognition, representation and scene understanding [Pin05]. A projection of a 3D scene of observed objects onto a 2D sensor array is commonly used for image generation. Statistical pattern analysis methods, which are able to extract features from appearing images and enable the classification of the image content, have reached a certain maturity and achieve excellent recognition on rather complex problems. However, these systems do not seem directly scalable to human performance in a cognitive sense, and appearance does not contribute to understanding the structure of objects. The interpretation of an image depends on a-priori semantic and contextual knowledge; hence an abstract mapping of numerical data (e.g. pixels) is required for the representation of physical objects. Thus, semantic interpretation defines the process of transforming a (syntactic) representation into a logical form that represents its meaning, which requires much background knowledge in order to resolve ambiguities within the observed scene. Image understanding is a knowledge-based process aiming for task-oriented reconstruction and high-level semantic interpretation of a scene by means of images. Human perception, in accordance with Gestalt theory [Kof35], tends to inherently assume the simplest and most regular organization that is consistent with a given image. Geometric relationships, such as collinearity and parallelism, are non-accidental constant properties of projections of collinear or parallel edges of the visible layout. This tendency underlies the organization of visible surfaces into objects [WS02]. Therefore, syntactical approaches utilizing these relationships were developed: e.g., in 1982 Marr and Nishihara [Mar82] presented a model of recognition restricted to the set of objects that can be described as generalized cones; Biederman [HB06] introduced geon theory in 1987, arguing that complex objects are made up of arrangements of basic component parts (i.e. geons that represent cubes, cylinders, spheres, etc.); and Riesenhuber and Poggio [RP02] proposed a hierarchy of local features such as lines and vertexes. Christou et al. [CTB99] studied whether contextual information regarding an observer's location within a familiar scene could influence the identification of objects. Results suggest that object recognition can be supported by knowledge of where we are in space and in which direction we are looking. Johnson-Laird's [JL83] mental model theory proposes reasoning as a semantic process of construction and manipulation of models in a working memory of limited capacity. It provides a unified account of deductive, probabilistic and modal reasoning. Representation of object geometry includes complete explicit mathematical descriptions, parametric forms, algebraic surfaces, superquadrics, generalized cylinders, polygonal meshes, nonrigid deformable objects and syntactic descriptions (see [CF01] for a survey).
Early geometric models, based on three-dimensional point or line descriptions specifying points relative to the medial axis, could not properly map real data to a model. Wire frame or edge models perform better but have to deal with the ambiguity of edge connections. Surface models describe the shape of objects with spline or polynomial approximations, where polygonal patches represent surfaces very well but give no conceptual structure. Volumetric models represent solid components in relation to each other or to the whole object [Fau01, Fis89]. Eckes et al. [ETM06] presented a system for the automatic interpretation of cluttered scenes containing multiple, partly occluded objects in front of unknown, complex backgrounds. Their approach is based on extended elastic graph matching, using stereo and color cues for the analysis. It differs from our approach in that it utilizes matching of object graphs only and does not go beyond statistical pattern recognition. Another approach, proposed by Hyundo et al. [HMCT06], targets semi-autonomous learning of objects without the need for manual labeling of training images: a "teacher" simply shows an object to be trained to the system. Although this approach focuses on fully autonomous learning, it again does not go beyond statistical pattern recognition. In this work, however, we propose a deeper insight into the aggregation layer of a recently proposed vision model [GV07b], concerning the aggregation of extracted object line and point primitives into objects, inspired by psycho-biophysical processes in the mammalian brain. A major asset of our approach is the ability to change the viewpoint, which is an important issue for learning object representations. Thus, reasoning on the superposition of several different 3D views of an object yields a robust probabilistic object representation. In particular, we use recipes derived from geon-like polyhedral object definitions in order to support the concatenation of line and point primitives to build object prototypes, which are stored in a hippocampal-like memory [GV07a]. The paper is organized as follows: Section 2 gives an overview of the recently proposed modeling framework; Section 3 explains our approach for representing objects with graphs; Section 4 presents the grammatical approach for defining object recipes; and in Section 5 we conclude, together with an outlook on our further work.
2
Overview of the Modeling Framework
Findings of psychological vision perception inspired our recent work for presenting a new practical concept for the modeling of visual object representations with the aim to close the gap between appearance based image processing and cognitive vision models. Within this chapter, we review in brief the conjectured part of the cognitive constructivist framework of Peschl [Pes06] and give an overview of the developed vision model concept, which was integrated in the framework by the authors in the works [GV07b, GV07a] and we refer to these for details.
2.1
The Cognitive Constructivists Framework
Cognition is directly connected to learning and to strategies for understanding the surrounding environment. Peschl [Pes06] reinterprets the understanding of knowledge-based teaching and learning in the light of individual and collective knowledge construction and knowledge creation by a constructivist framework for developing modes of knowing, which is organized in five cognitive levels:
(I) Level of executive behavior: realized with a list of observations, it describes a phenomenon on its behavioral level. On this level, the cognitive vision model concept contributed by the authors is embedded in five layers, where the layers (i) physical, (ii) sensory, and (iii) activation provide early processing facilities and deliver object primitives of the scene at hand. The aggregation of these primitives into object prototypes is done in layer (iv), and this is the focus of this paper. Finally, the session layer (v) is used for updating the knowledge base of the framework.
(II) Level of hidden patterns of behavior: the patterns are the result of more or less complex inductive and constructive processes. Scientific explanations are situated on this level; they offer cognitive, mental, or even physical mechanisms that make explicit the relationship between hidden (theoretical) structures and observed phenomena.
(III) Level of causes: this level concerns the exploration and the construction of causes; the resulting knowledge is the source for a deeper understanding of a phenomenon.
(IV) Level of potentiality: this level changes the perspective from the mode of (constructive) perception to the mode of externalization; new physical realities are created or existing (physical) realities are changed.
(V) Level of reflection: this step has the potential to fundamentally question the knowledge that has been constructed so far by reflecting on the knowledge, its premises, and on the construction and learning processes that have led to that knowledge; completely unexpected results and new perspectives can be brought up that have never been considered before.
2.2
Processing in the Vision Module
Figure 1 shows layers (iv) and (v) of the proposed vision model with a focus on the aggregation layer. Suppose a cube is observed by a trinocular stereo camera; after digitalization in the physical layer (i) of the model, the sensory layer (ii) provides the spatiotemporal processing of a 4D-temporal feature map, similar to the map in cortex area V1 and V2 of the mammalian brain [GV07b]. Then, the activation layer (iii) is triggered by an activation function and generates line and point primitives of the cube object under investigation. In aggregation layer (iv) the object primitives are concatenated by object recipes. These recipes use a fuzzy graph builder to generate a fuzzy planar graph in a construction with four states, which is shown in Figure 2: (a) it starts with the detection and classification of junctions3 , using proximity balls at the endings
Fig. 1. The model layers (4) and (5), with focus on the aggregation layer as the contribution of this work. Line and point primitives, generated by the early processing layers (1..3) (not shown), are concatenated by object recipes in aggregation layer (4). The object prototypes are matched together and a superposition representation is generated for the LTM update within session layer (5).
Fig. 2. The four states of the construction: (a) defining proximity balls for detection and classification of junctions’; (b) grouping by collinearity hulls in 1D, and (c) 2D; (d) grouping of 2D planes in 3D
of line primitives to tolerate skew lines; (b) perceptual grouping uses collinearity, with support from A* search, to find corresponding junctions in 1D; (c) junction correspondences are found between the results from state (b); and (d) the 2D planes are grouped in 3D. Both the A* search for the next path and the fuzzy planar graph partitioning method use an estimate from the Hough [DH72] space of extracted line segments, as shown in Figure 3. The A* heuristic search strategy applies lowest-cost-first and best-first search together, optimizing path cost as well as heuristic information in its selection of the current best path [PMG98]. The object with the best fitness per viewpoint is selected for storage in short-term memory (STM). Thus, object candidates, concatenated by the fuzzy graph builder in Figure 1 using the geon-recipe grammar and stored as a sequence in STM, are matched to the scene graph. This matching yields the superposition of the outcomes from all object candidate views into the representation of the reconstructed object in Figure 4. With this strategy, possible perturbations within single object views, such as missing vertices or edges, are superseded by the information provided by other object candidate views. From the long-term memory (LTM) representation, the geon library can be updated, which yields learning of objects by their structural description. Hence, the rehearsal of presented object prototypes from
different viewpoints as time series for learning is used with the selected object prototypes for storage in LTM. In fact, new objects can be represented and used in a playground by higher cognitive processes.
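A minimal sketch of the proximity-ball grouping of state (a) follows, under the assumption that junction candidates are formed by greedily merging segment endpoints that fall within a fixed radius; this is our simplification, not the authors' implementation.

import numpy as np

def group_junctions(endpoints, radius):
    # Greedy proximity-ball grouping: an endpoint within `radius` of an
    # existing cluster centre is merged into that junction candidate.
    clusters = []
    for p in endpoints:
        p = np.asarray(p, dtype=float)
        for c in clusters:
            if np.linalg.norm(p - c["centre"]) <= radius:
                c["members"].append(p)
                c["centre"] = np.mean(c["members"], axis=0)   # update the cluster centre
                break
        else:
            clusters.append({"centre": p, "members": [p]})
    return clusters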
3
The Planar Graph Approach
Graph theoretical approaches appear especially efficient for the description of abstract structural relations and topology of objects. In this chapter we develop the graph representation for our aggregation approach and investigate what role isomorphism can play during object construction.
Fig. 3. The convex hull of the cube, specified by the lines extracted using the 2D Hough space of the line primitives. The lines are reprojected into 3D space by the crossing-plane method. This method also eases the solution of the correspondence problem in trinocular stereo vision.
Fig. 4. The reconstruction of the suggested cube. The object graphs are connected and the ambiguities known from 2D projections appear largely resolved; the topology of the object is preserved. The method may be applied to other polytopes as well.
3.1
About (Fuzzy) Graphs
An undirected graph (see Figure 5) G = (V, E) consists of a set V of vertices and a set E of edges whose elements are unordered pairs of vertices. The edge e = (u, v) ∈ E is said to be incident with vertices u and v, where u and v are the end points of e, and these two vertices are called adjacent. The set of vertices adjacent to v is written as A(v), and the degree of v is the number of vertices adjacent to v, denoted |A(v)|. The minimum δ (maximum Δ) degree of the graph G is the minimum (maximum) degree over all vertices of G. If all vertices have the same degree d, the graph is d-regular (e.g. a 3-regular graph is a cubic graph). The graph in Figure 5 is defined with V = {a, b, c}, n = |V| = 3; E = {e1, e2, e3}, m = |E| = 3 edges; and A(v) = {a, b}, |A(v)| = 2. When the edges E are assigned weights w(ei), the graph becomes a weighted graph
G = (V, E, w). A partition (X, X′) is defined as a pair of proper disjoint subsets of V. The complement of X ⊆ V is denoted X′ = V − X. The open neighborhood of X is defined as Γ(X) = {v ∈ X′ | (u, v) ∈ E for some u ∈ X}; the subgraph induced by X is the graph H = (X, F), where F = {(u, v) ∈ E | u, v ∈ X}. An alternating sequence of distinct adjacent vertices and their incident edges is called a path; when a u . . . v path exists, the graph G is connected; otherwise G splits into a number of subgraphs; G\{u} denotes G with the vertex u deleted. To change the graph in Fig. 5 into a fuzzy graph, we add membership functions as weights α, β, . . . , μ, and define V = {(a, α), (b, β), (c, γ)}, n = |V| = 3 for the fuzzy vertices, and E = {(e1, μ1), (e2, μ2), (e3, μ3)}, m = |E| = 3 for the fuzzy edges. The membership functions are chosen in accordance with the specific task at hand.
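For illustration, a fuzzy graph in the above sense can be encoded as plainly as the following Python dictionary; the membership values are arbitrary examples, not values from the paper.

# Each vertex and each edge carries a membership value in (0, 1].
fuzzy_graph = {
    "vertices": {"a": 0.9, "b": 0.8, "c": 0.7},
    "edges": {("a", "b"): 0.85, ("b", "c"): 0.6, ("a", "c"): 0.75},
}

def degree(g, v):
    # |A(v)|: number of vertices adjacent to v, ignoring memberships
    return sum(1 for (u, w) in g["edges"] if v in (u, w))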
3.2
Relational Graph Matching
In object recognition, the comparison between the object under consideration and the model to which the object could be related is crucial. When structure information is represented by graphs, a correspondence between the vertices and the edges of two distinct graphs can be found by graph matching, ensuring that similar substructures in one graph are mapped to similar substructures in the other (we refer to Conte [CFSV04] for a comprehensive review). Processing of natural images with intrinsic variabilities of patterns, noise, and occlusions often yields incomplete graph representations, to which the matching has to be tolerant. Matching on graphs generally leads to NP-hard problems; however, the compactness of information finally provides advantages over other representations. In this paper we follow the combinatorial subgraph matching by the semidefinite-program convex relaxation approach of Schellewald and Schnörr [SS05]. In their approach (see Fig. 6), model graphs GK (shown at the left) representing object views are matched to scene graphs GL (shown at the right) by bipartite matching. We extend their approach to graphs representing scenes in 3D space for pattern completion in hippocampal-like memory as described in [GV07a].
Fig. 5. An example graph, with m = 3, n = 3; e2 incident to u and v; and the partition (X, X′)
Fig. 6. A subgraph matching: object graph (l) is matched to the scene graph (r) [SS05]
3.3
Developing Planar Graphs
A graph G that may be drawn without edge crossings in a plane is said to be planar, which is also referred to as a planar embedding [Lie01]. The main advantage is that intractable problems may become tractable when they are restricted to planar graphs; hence, planar graph algorithms are proven to have better asymptotic time complexity than the best known algorithms for general graphs [Nis88]. Approaches for transforming a nonplanar graph into a planar one can be categorized into: (i) local methods – such as vertex splitting, or graph editing (i.e. deleting some vertices and edges); (ii) global methods – such as partitioning the graph into several planar layers; and (iii) subgraph splitting – looking for the largest planar subgraph, even if restricted to induced subgraphs.

Table 1. Euler relations between the number of vertices, edges and faces of convex regular and quasi-regular polyhedra

CONVEX REGULAR POLYHEDRA
Object (Type)    Vertices n   Faces f   Edges m
Tetrahedron      4            4         6
Cube             8            6         12
Octahedron       6            8         12
Icosahedron      12           20        30
Dodecahedron     20           12        30

CONVEX QUASI-REGULAR POLYHEDRA
Object (Type)    Vertices n   Faces f              Edges m
2n-Prism         2n           2{ngons}+n{4}        3n
2n-Antiprism     2n           2{ngons}+n{3}        4n
Cuboctahedron    12           8{3}+6{4}            24
Buckminsterball  60           12{5}+20{6}          90
(ngons ... polygon with n sides)

Thus, every
planar graph has an embedding in which the edges are straight line segments; moreover, each connected subset of the plane that is delimited by a closed subgraph of G is called a face of the embedding. For a connected plane graph G with n vertices, m edges and f faces, Euler (1750) found the correspondence n − m + f = 2, which, together with the observation that in a planar graph with n ≥ 3 vertices having as many edges as possible each face is incident to exactly three vertices, yields the relation m ≤ 3n − 6 (see Table 1 for a list of Euler relations on polyhedra). Hence, applying Steinitz's (1922) theorem (footnote 1) and restricting our attention to 3-connected graphs yields planar graphs that are well related to 3-dimensional polytopes. More generally, planar graphs are embeddable in the 2-dimensional plane, which means each plane is a topological space representing a compact 2-manifold (footnote 2).
Footnote 1: The edge graph of a 3D polytope P is defined as GP = (VP, EP), where VP is the set of vertices of P, and EP is the set of edges of P.
Footnote 2: When X is a set of points in Rn, (Ui ⊂ X) an open set, ϕi : Ui ∩ X → Rd, d < n, a projection, and iff, whenever Ui ∩ Uj≠i ≠ ∅, the two mappings ϕi · ϕj−1, ϕj · ϕi−1 : Rd → Rd are infinitely differentiable, then X is a manifold C∞ of dimension d [Fau01].
Hence, in our approach, we declare the observations from distinct viewpoints as 2-manifold mappings of the real object and follow the graph partitioning method. Since we have restricted ourselves to polyhedral objects, we then get
quasi planar graphs as intersection graphs of faces appearing as subgraph embeddings in the scene graph of the observation at hand. This means that we can use the metric distances between extracted primitives in 3D as topological properties for finding their corresponding 2-dimensional embedding planes; thus, we get the entire scene graph partitioned into planar (hopefully closed) subgraphs, representing the faces of the object under investigation.
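As a quick sanity check of Table 1, the following snippet verifies Euler's relation n − m + f = 2 for the fixed polyhedra listed there; the face counts of the quasi-regular solids are expanded from the table's face column.

polyhedra = {
    # name: (n vertices, m edges, f faces)
    "Tetrahedron": (4, 6, 4),
    "Cube": (8, 12, 6),
    "Octahedron": (6, 12, 8),
    "Icosahedron": (12, 30, 20),
    "Dodecahedron": (20, 30, 12),
    "Cuboctahedron": (12, 24, 14),    # 8 triangles + 6 squares
    "Buckminsterball": (60, 90, 32),  # 12 pentagons + 20 hexagons
}
for name, (n, m, f) in polyhedra.items():
    assert n - m + f == 2, name       # Euler's relation holds for all entries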
Fig. 7. A plane graph derived from four distinct viewpoints. The vertex labels are given according to the spatial object coordinates. Junction class labels are used for orientation of the plane graph. It is obvious that the pairs a-c and b-d are instances from the same equivalence class.
Figure 7 shows a cube with the quasi-plane graph partitioned into two planes of vertices, V1 = {100, 101, 111, 110} and V2 = {000, 001, 011, 010}, as seen from four distinct viewpoints (a..d) in association with the manifold. In the top row, the corner labellings (000...111) and junction classifications (L, W, Y, T; see footnote 3) of the solid cube are given, where the hidden lines and corners, which of course are not observable, are shown grayed for illustration purposes only. In the bottom row, the corresponding quasi-plane graphs representing the subgraph embeddings are shown; the hidden vertices are grayed. Neglecting possible errors from the perspective view, in Figure 7 the sections (a...d) of the planar graphs appear as follows – case a: vertex {000} of plane V2 is hidden when looking across corner {111}, with three surfaces visible; case b: looking across the fold (see footnote 4) between vertices {101, 111} causes two surfaces to be visible; case c: appears as the mirrored situation of (a), and causes vertex {010} to be hidden; and finally case d: is a more extreme view, but results in just a rotation of case (b).
Footnote 3: (L) means an occluding line and denotes a blade, where two surfaces meet, just one visible; (Y) means a 3-junction, where three surfaces intersect and the angles between each pair are < 180 degrees; (T) means a 3-junction where one of the angles is exactly 180 degrees; (W) means a 3-junction with one angle > 180 degrees.
Footnote 4: A fold denotes where two surfaces meet, both visible.
Here, the question about the absolute number of possible nonequivalent equivalence classes per partition of the plane graph can be answered by applying Burnside's theorem (footnote 5) to find the cycle index depending on the maximal number of vertices in the graph [KSS07]; the calculation with two possible colorings (a = 2) yields the following table:

Vertices/Partition n   Number of Equivalence Classes   Number of graphs in the pattern inventory (0, 1, ..., m edges)
3                      4                               1 1 1 1
4                      11                              1 1 2 3 2 1 1
5                      34                              1 1 2 4 6 6 6 4 2 1 1
6                      156                             1 1 2 5 9 15 21 24 24 21 15 9 5 2 1 1
It follows that for our example with four vertices per partition, eleven distinct subgraphs are possible; thus, one can see that partitioning the object graph also has the advantage of reducing the number of vertices and therefore the number of equivalence classes, all of which have to be considered in the matching stage, where the topology of the prototype graphs is checked against the object repository by relational graph matching (see section 3.2). Here, the isomorphism of the resulting subgraphs of these 2-manifolds is used, together with attributes in the form of the spatial coordinate information of the primitives, for a redundant reconstruction of the underlying original 3D object types.
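The equivalence-class counts in the table above can be reproduced directly with Burnside's lemma; the following sketch averages a^c(π) over all vertex permutations π, where c(π) is the number of cycles of the action induced on vertex pairs and a = 2 colorings encode edge present/absent.

import math
from itertools import permutations, combinations

def count_graph_classes(n, a=2):
    # Burnside count of graphs on n vertices up to isomorphism.
    edges = list(combinations(range(n), 2))
    index = {e: i for i, e in enumerate(edges)}
    total = 0
    for perm in permutations(range(n)):
        # permutation induced on the edge set
        mapping = [index[tuple(sorted((perm[u], perm[v])))] for u, v in edges]
        seen, cycles = set(), 0
        for start in range(len(edges)):
            if start in seen:
                continue
            cycles += 1
            j = start
            while j not in seen:
                seen.add(j)
                j = mapping[j]
        total += a ** cycles
    return total // math.factorial(n)

print([count_graph_classes(n) for n in (3, 4, 5, 6)])   # [4, 11, 34, 156]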
4
Defining Object Recipes
Recipe knowledge is situated at abstraction level II in Peschl's cognitive model [Pes06]; in our present approach we use predefined recipes for concatenating the object primitives into object prototypes. We utilize herein well-performing structures from linguistic research for symbolic and semantic object processing.
4.1
Knowledge, Grammar and Formal Language
Language is defined as a set of (finite or infinite) sentences of finite length, constructed from a finite alphabet of symbols; knowledge about context and structure is crucial for understanding and interpreting a particular sentence [All89]. A grammar is the formal specification of allowable structures in a language, and its knowledge is characterized by: (i) phonetic and phonological forms – i.e. how words are composed from sounds; (ii) morphology – i.e. how words are constructed from morphemes (see footnote 6); (iii) syntax – i.e. how words can be put together to form correct sentences; (iv) semantics – i.e. what words mean and how these meanings combine in sentences to form sentence meanings; (v) pragmatics – i.e. how sentences are used in different contexts and how context affects the interpretation of a sentence; and finally (vi) world knowledge – that language speakers must have about the general structure of the world, besides knowledge about the partner's beliefs and goals, in order to maintain a conversation.
Footnote 5: Suppose a group G acts on a set Z, and ∀π ∈ G we denote the subset of elements in Z that are fixed by π as Fix(π) = {x ∈ Z | π(x) = x}; then the number of orbits in Z is (1/|G|) Σπ∈G |Fix(π)|. Moreover, if there are kπ disjoint cycles in the representation of π and a colors are available (e.g. vertex is present/absent), then |Fix(π)| = a^kπ.
Footnote 6: i.e. basic meaning units, e.g. such as root and suffix.
Thus, a formal regular grammar [Cho56] is defined by G := {N, T, P, S}, a quadruple with: (N) the finite set of nonterminal symbols; (T) the finite set of terminal symbols; (P) the finite set of production rules; and (S) the start symbol. An alphabet Σ is defined to be the disjunctive conjunction Σ := N ∪ T ∪ S. Hence, in our approach we construct object grammars by using isomorphism to associate: (i) words with primitives; (ii) morphology with corners and edges; (iii) syntax with object graphs; and (iv) semantics with object recipes.
4.2
An Example Recipe
For specifying recipes, we have chosen a probabilistic finite state automaton approach [KD02] and define the automaton by (Q, Σ, δ, τ, S0, F), where Q is the finite set of states, Σ the alphabet, δ : Q × Σ → Q the transition function, τ : Q × Σ → ]0, 1] the transition probability, S0 the initial state and F : Q → [0 . . . 1] the probability for a state to be final. Thus, according to [Cho56], we define a finite-state grammar G as a Markov chain [Dyn61], i.e. a system with a finite number of states S0, . . . , Sq, a set Σ = {σijk | 0 ≤ i, j ≤ q; 1 ≤ k ≤ Nij ∀i, j} of transition symbols, and a set C = {(Si, Sj)} of certain pairs of states of G that are said to be connected. Therefore, when the system moves from state Si to Sj, it produces symbols σijk ∈ Σ. Then we get Sα1, . . . , Sαm as a cyclic sequence of states in G with α1 = αm = 0, and (Sαi, Sαi+1) ∈ C ∀i < m. Hence, as the system moves from Sαi to Sαi+1, it produces the symbol σαiαi+1k, for some k ≤ Nαiαi+1. Moreover, the sequence S0, . . . , Sq generates all sentences for all appropriate choices of ki: {σα1α2k1 ∘ σα2α3k2 ∘ . . . ∘ σαm−1αmkm−1}. Thus, from our finite state language, we get the recipe according to the graphs in Figure 7 as sentences given by state sequences (Sαi, Sαi+1) for m = 4 that describe the planar graph:
Partition No. 1: states 100, 101, 111, 110 — emitted symbols +L, +W, +Y, +L
Partition No. 2: states 000, 001, 011, 010 — emitted symbols +Y, +L, +W, +L
Interconnection states (+1, +2, C): the two partitions are linked by the edges e, e, e, e between corresponding vertices.
These sentences are cyclic and read as: ”Node 100 alias L connects to 101 alias W connects to 111 alias Y connects to 110 alias L connects to 100 alias L”; and also ”100 alias L connects to 000 alias Y connects to 100 alias L...”. In fact, the recipe yields a description of the object topology and links together structural information from distinct viewpoints. The partition of the object graph corresponds directly with the Hough plane [DH72] (see Section 2.2).
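For illustration, the cyclic sentence of partition 1 can be generated by a toy deterministic version of such a recipe automaton; the state names and emissions follow the table above, while the data structure and the omission of transition probabilities are our simplifications.

recipe = {
    "emit": {"100": "L", "101": "W", "111": "Y", "110": "L"},   # junction class per vertex
    "next": {"100": "101", "101": "111", "111": "110", "110": "100"},  # cyclic transitions
}

def generate_sentence(recipe, start="100"):
    s, out = start, []
    while True:
        out.append((s, recipe["emit"][s]))
        s = recipe["next"][s]
        if s == start:
            break
    return out   # [('100', 'L'), ('101', 'W'), ('111', 'Y'), ('110', 'L')]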
5
Conclusion
In this work, the building of object prototypes, supported by a recently proposed vision model, was presented with a focus on the aggregation of line and point primitives into objects. Primitives stemming from image sequences are detected and aggregated into internal object representations by applying fuzzy graph constraints and guidance from geon recipes of an object repository. As an example, a solid Necker cube was given and simulated in a 3D simulation setup. We have shown that planar graphs are advantageous in that they utilize implicit subgraph splitting, which results in lower complexity of the graphs in the split plane. Furthermore, the ability to collect information of views from different viewpoints enables us to deal with implicit object topology. The modeling approach will be utilized in future work for the representation and recognition of other object types with occlusions in cluttered and noisy environments.
References
[All89] Allen, J.: Introduction to Natural Language Understanding. Benjamin/Cummings Publishing (1989)
[CF01] Campbell, R.J., Flynn, P.J.: A survey of free-form object representation and recognition techniques. Comp. Vis. Image Underst. 81 (2001)
[CFSV04] Conte, D., Foggia, P., Sansone, C., Vento, M.: Thirty years of graph matching in pattern recognition. Int. J. of Patt. Rec. and AI (2004)
[Cho56] Chomsky, N.: Three models for the description of language. IRE Transactions on Information Theory 2, 113–124 (1956)
[CTB99] Christou, C.G., Tjan, B.S., Bülthoff, H.H.: Viewpoint information provided by a familiar environment facilitates object identification. TR Max-Planck Institute f. biol. cybernetics 68 (1999)
[DH72] Duda, R., Hart, P.: Use of the Hough transformation to detect lines and curves in pictures. Comm. ACM 15 (1972)
[Dyn61] Dynkin, E.B.: Theory of Markov Processes. Prentice-Hall, Englewood Cliffs (1961)
[ETM06] Eckes, C., Triesch, J., Malsburg, C.: Analysis of cluttered scenes using an elastic matching approach for stereo images. Neural Computation 18, 1441–1471 (2006)
[Fau01] Faugeras, O.: Three-Dimensional Computer Vision. MIT Press, Cambridge (2001)
[Fis89] Fisher, R.B.: From Surfaces to Objects. Wiley, Chichester (1989)
[GV07a] Goebel, P.M., Vincze, M.: Implicit modeling of object topology with guidance from temporal view attention. In: Proceedings of the International Cognitive Vision Workshop, ICVW 2007, Bielefeld University (D) (2007)
[GV07b] Goebel, P.M., Vincze, M.: Vision for cognitive systems: A new compound concept connecting natural scenes with cognitive models. In: INDIN 2007, Vienna, Austria. LNCS, Springer, Heidelberg (to appear)
[HB06] Hayworth, K.J., Biederman, I.: Neural evidence for intermediate representations in object recognition. Vision Research (in press) (2006)
[HMCT06] Hyundo, K., Murphy-Chutorian, E., Triesch, J.: Semi-autonomous learning of objects. In: Proceedings of CVPR 2006 (2006)
[JL83] Johnson-Laird, P.N.: Mental Models: Towards a Cognitive Science of Language, Inference and Consciousness. Harvard Press (1983)
[KD02] Kermorvant, C., Dupont, P.: Stochastic grammatical inference with multinomial tests. In: Adriaans, P., Fernau, H., van Zaanen, M. (eds.) ICGI 2002. LNCS (LNAI), vol. 2484, Springer, Heidelberg (2002)
[Kof35] Koffka, K.: Principles of Gestalt Psychology. Harcourt, NY (1935)
[KSS07] Klima, R.E., Sigmon, N.P., Sitzinger, E.L. (eds.): Applications of Abstract Algebra with Maple and Matlab. CRC Press, Boca Raton, USA (2007)
[Lie01] Liebers, A.: Planarizing graphs – a survey and annotated bibliography. Journal of Graph Algorithms and Applications 5, 1–74 (2001)
[Mar82] Marr, D.: Vision: A Computational Approach. Freeman & Co., San Francisco (1982)
[Nis88] Nishizeki, T.: Planar Graphs: Theory and Algorithms. Elsevier, Amsterdam (1988)
[Pes06] Peschl, M.F.: Modes of knowing and modes of coming to know. Constructivist Foundations 1(3), 111–123 (2006)
[Pin05] Pinz, A.: Object categorization. Foundations and Trends in Computer Graphics and Vision 1(4), 257–353 (2005)
[PMG98] Poole, D., Mackworth, A., Goebel, R.: Computational Intelligence. Oxford Press, Oxford (1998)
[RP02] Riesenhuber, M., Poggio, T.: Neural mechanisms of object recognition. Current Opinion in Neurobiology 12, 162–168 (2002)
[SS05] Schellewald, C., Schnörr, C.: Probabilistic subgraph matching approach based on convex relaxation. In: Rangarajan, A., Vemuri, B., Yuille, A.L. (eds.) EMMCVPR 2005. LNCS, vol. 3757, pp. 171–186. Springer, Heidelberg (2005)
[WS02] Wang, R.F., Spelke, E.S.: Human spatial representation: insights from animals. Trends in Cognitive Sciences 6(9) (2002)
A Multi-touch Surface Using Multiple Cameras Itai Katz, Kevin Gabayan, and Hamid Aghajan Dept. of Electrical Engineering, Stanford University, Stanford, CA 94305 {katz,gabayan,aghajan}@stanford.edu
Abstract. In this paper we present a framework for a multi-touch surface using multiple cameras. With an overhead camera and a side-mounted camera we determine the fingertip positions. After determining the fundamental matrix that relates the two cameras, we calculate the three dimensional coordinates of the fingertips. The intersection of the epipolar lines from the overhead camera with the fingertips detected in the side camera image provides the fingertip height. Touches are detected when the measured fingertip height from the touch surface is zero. We interpret touch events as hand gestures which can be generalized into commands for manipulating applications. We offer an example application of a multi-touch finger painting program.
1 Introduction
Traditional human input devices such as the keyboard and mouse are not sufficiently expressive to capture more natural and intuitive hand gestures. A richer gesture vocabulary may be enabled through devices that capture and understand more gesture information. Interactive tables that detect fingertip touches and interpret these events as gestures have succeeded in providing a rich input alternative. Multi-touch input devices have been explored by Bill Buxton [1] as early as 1984. Touch position sensing in touchpads is a common laptop input device [2], measuring the capacitance created between a user's finger and an array of charged plates under the touch surface. Larger touch-sensitive surfaces have been constructed from similar sensor arrays. The Mitsubishi DiamondTouch display [3] sends a unique electrical signal through the seats of each of its users, and an antenna array under the touch surface localizes and identifies unique touches. Computer vision techniques for human-computer interfaces have provided a high-bandwidth alternative to traditional mechanical input devices as early as 1969 in Krueger's Videoplace [4]. While cameras provide a high rate of input information, algorithms that efficiently reduce this data to the desired parameters are necessary to minimize the interface latency. Modifications to the operating conditions of an optical interface that isolate desirable features can simplify the complexity of the latent image analysis. Another touch detection method is based on forcing a finger position to illuminate on contact. The Coeno Office of Tomorrow touch-sensitive surface [5] uses a scanning laser diode to produce a thin layer of laser light over the touch-sensitive surface.
Touches that contact the surface reflect the light, simplifying fingertip localization to the identification of bright spots in an image.
Fig. 1. The multi-touch surface prototype with multiple positions for the side camera
Jeff Han revisited touch surfaces based on frustrated total internal reflection (FTIR) in a novel multi-touch sensitive surface. In his implementation, an acrylic pane lined with infrared LEDs acts as an optical waveguide whose property of total internal reflection is disturbed by the mechanical disturbances of finger presses [6]. A camera behind the acrylic pane views the fingertip presses as they illuminate with infrared light, and connected components analysis provides the positions of multiple finger presses. The system does not provide information regarding where fingertips hover or which fingertips belong to which hand. Touch-like gestures may estimate actual surface contact in situations where one may be unable to directly measure fingertip proximity to a surface. The steerable projector and camera in IBM's Everywhere Displays [7] senses touch-like gestures performed upon flat surfaces in a room. The Smart Canvas [8] project classifies completed strokes seen from a monocular camera as being either touch or transitional strokes. Pauses performed between strokes in the Smart Canvas system make drawing gestures unnatural, yet simplify stroke segmentation. A solution requiring minimal hardware, using a stereo pair of cameras and a black paper background, is offered by Malik [9], which performs hand contour analysis for fingertip detection and orientation measurement, and measures fingertip altitude from stereo disparity. The use of hand shadow analysis for the extraction of touch and hover information has been explored by Wilson [10], in which the author subjectively estimated the best touch detection threshold altitude of the PlayAnywhere system to be 3–4 mm. Touch detection estimates in overhead camera-based systems suffer in depth resolution, causing a system to report touches when a user's finger hovers near the touch surface. The Smart Canvas paper [8] suggests the use of side cameras aligned with the touch surface to detect touches, but does not explore an implementation. We have chosen to explore the use of side cameras (see Fig. 1) as a means of improving vision-based touch detection accuracy. In our multi-touch surface we strive to build a system that is low-cost, scalable, and improves existing vision-based touch-table accuracy.
2 System Setup The hardware supporting the proposed technique consists of a flat table surface mounted within a constructed frame (see Fig. 2). Attached to the frame are support struts which house a downward-pointing overhead camera approximately two feet above the surface. A side-mounted camera supported by a tripod is stationed two feet away and is level with the surface.
Fig. 2. The touch surface and camera rig
The software consists of a pipeline divided into three layers: image processing, interpretation, and application. By dividing up the system in this fashion, each module can be developed independently without affecting other modules. This method also allows for greater generalization, as different applications can be easily substituted.
3 Image Processing Layer The image processing layer is responsible for converting the raw images into a descriptor suitable for gesture interpretation. At each time step an image is acquired from both the overhead and side-mounted cameras. This layer consists of hand segmentation and contour extraction. 3.1 Hand Segmentation After image acquisition, the hand is segmented in both cameras. For the overhead camera, segmentation is required for fingertip detection. For the side-mounted camera, segmentation is required for contact detection. Initially, we assumed that having a uniform, matte background would allow us to implement an extremely simple background subtraction and thresholding algorithm. Further examination showed this to be ineffective for several reasons. First, since webcams typically have poor noise characteristics, the black background regions are exceptionally noisy. Thus, black regions deviate strongly from a learned background,
resulting in misclassification. Second, having the hand in close proximity to the table surface results in highly conspicuous shadows, which are also misclassified by the aforementioned algorithm. These drawbacks led us to a more sophisticated method given by Morris [11]. Rather than acquiring multiple background images and noting only the per-pixel mean, we also note the per-pixel standard deviation (see Fig. 3). Upon receiving an input image, we determine how many standard deviations each pixel is from its corresponding mean.
Fig. 3. Mean and standard deviation images
Morris [11] observes that the color of a shadow is always the same as the surface it is projected onto. Furthermore, the luminosity of a shadow is always lower than that of the surface. Therefore, we can effectively remove the hand’s shadow by applying a luminosity threshold after the segmentation phase. In general, shadow removal is an open problem. With our simplifying assumptions (the uniform background), this simple heuristic works surprisingly well. One difficulty remains when the hand color closely resembles the background color. This occurs when parts of the hand are in shadow, and results in those pixels being falsely classified as background. We solve this problem by supplying additional lighting in the form of lamps attached to the support structure. 3.2 Contour Extraction The final step in the image processing layer is to find the contour of the hand silhouette. After finding the contour of all foreground blobs, all but the longest contour is eliminated. These eliminated contours correspond to regions where the segmentation algorithm misclassified. Thus the segmentation phase can misclassify portions of the image without adversely affecting the end result.
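As an illustration of the segmentation pipeline just described (per-pixel mean/std background model, luminosity-based shadow suppression, longest-contour selection), a minimal OpenCV sketch follows; the thresholds and the OpenCV 4 findContours signature are our assumptions, not the paper's values.

import cv2
import numpy as np

def segment_hand(frame_gray, bg_mean, bg_std, k=3.0, shadow_lum=40):
    # Foreground if the pixel deviates from the background mean by more than
    # k standard deviations; dark pixels are treated as shadow and suppressed.
    deviation = np.abs(frame_gray.astype(np.float32) - bg_mean) / (bg_std + 1e-6)
    mask = (deviation > k).astype(np.uint8) * 255
    mask[frame_gray < shadow_lum] = 0            # crude shadow removal by luminosity
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return mask, None
    return mask, max(contours, key=len)          # keep only the longest contour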
4 Interpretation Layer The interpretation layer is responsible for converting the pixel information from the previous layer into semantic knowledge to be passed up to the application layer. This layer consists of modules that detect fingertip positions and share information between the two cameras. It also handles events that include contact with the table surface, gestures, and occlusions.
4.1 Fingertip Detection Initially, every point on the hand’s contour is a potential fingertip. The objective of the fingertip detection module is to eliminate irrelevant points. The first step is to identify the palm center as the hand silhouette pixel whose distance to the nearest background pixel is the largest. When inspecting the distance between each contour pixel and the palm center, pixels that do not exhibit a local maximum are ignored as fingertip candidates. The remaining points include the fingertips (see Fig. 4), but also the knuckles of the thumb, the outside edge of the palm, and any peaks that are the result of noise. To eliminate these false positives, we introduce the following heuristic: a neighborhood of 60 contour points, centered about the point in question is analyzed to determine its local centroid. The distance between the point in question and the local centroid is thresholded. True fingertips will have a characteristically high distance.
Fig. 4. Fingertip Detection
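A compact sketch of the fingertip heuristic described above (distance-transform palm centre, local maxima of the contour-to-palm distance, local-centroid test over a 60-point neighbourhood); the min_dist threshold is illustrative and not taken from the paper.

import cv2
import numpy as np

def detect_fingertips(mask, contour, neighborhood=60, min_dist=20.0):
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    palm = np.unravel_index(np.argmax(dist), dist.shape)[::-1]   # palm centre (x, y)
    pts = contour.reshape(-1, 2).astype(np.float32)
    d = np.linalg.norm(pts - palm, axis=1)                       # distance to palm centre
    tips = []
    for i in range(len(pts)):
        if d[i] < d[i - 1] or d[i] < d[(i + 1) % len(pts)]:
            continue                                             # not a local maximum
        idx = np.arange(i - neighborhood // 2, i + neighborhood // 2) % len(pts)
        centroid = pts[idx].mean(axis=0)                         # local centroid of the neighbourhood
        if np.linalg.norm(pts[i] - centroid) > min_dist:         # true tips stick out far
            tips.append(tuple(pts[i]))
    return palm, tips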
Finally, we account for a few special cases. Points close to the image border are eliminated, as are multiple fingertip points clustered together. The performance of this module is excellent; the performance of the pipeline to this point seems to be limited by the capabilities of the segmentation phase, as described earlier. 4.2 Coordinate Transform Once the fingertip positions are acquired in overhead camera coordinates, they must be converted into side-mounted camera coordinates for contact detection. In this section we describe two methods of performing this transformation. The first method is a generalization of the common bilinear interpolation algorithm, extended to non-orthogonal data points. It is assumed that there are four marker points viewable in the overhead camera, denoted Q11, Q12, Q21, and Q22, with known x, y values (see Fig. 5). For each marker point, the corresponding x value in the side-mounted camera is also known. These values are denoted by F(Q11), etc.
Fig. 5. Modified bilinear interpolation
First the weight for R1 is computed:
y_R1 = (y_Q21 − y_Q11) · (x_P − x_Q11) / (x_Q21 − x_Q11) + y_Q11
L2 = sqrt((y_Q21 − y_Q11)² + (x_Q21 − x_Q11)²)
F(R1) = [sqrt((y_R1 − y_Q11)² + (x_P − x_Q11)²) / L2] · F(Q21) + [sqrt((y_Q21 − y_R1)² + (x_Q21 − x_P)²) / L2] · F(Q11)    (1)
After computing the weight for R2 in a similar manner, interpolation is performed along the y axis:
F(P) = [(y_R2 − y_P) / (y_R2 − y_R1)] · F(R1) + [(y_P − y_R1) / (y_R2 − y_R1)] · F(R2)    (2)
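To make Eqs. (1)–(2) concrete, here is a small Python sketch, under the assumption that R2 is obtained along the Q12–Q22 edge (the text only says "in a similar manner"); the variable names and the dictionary F mapping marker points to their side-camera values are ours.

import numpy as np

def interp_nonorthogonal(P, Q11, Q21, Q12, Q22, F):
    def along(Qa, Qb, x_p):
        # Eq. (1): linear interpolation along segment Qa-Qb at column x_p
        y_r = (Qb[1] - Qa[1]) * (x_p - Qa[0]) / (Qb[0] - Qa[0]) + Qa[1]
        L = np.hypot(Qb[1] - Qa[1], Qb[0] - Qa[0])
        w_b = np.hypot(y_r - Qa[1], x_p - Qa[0]) / L     # weight on F(Qb)
        w_a = np.hypot(Qb[1] - y_r, Qb[0] - x_p) / L     # weight on F(Qa)
        return y_r, w_b * F[Qb] + w_a * F[Qa]
    y_r1, f_r1 = along(Q11, Q21, P[0])
    y_r2, f_r2 = along(Q12, Q22, P[0])
    # Eq. (2): interpolation along the y axis between R1 and R2
    return ((y_r2 - P[1]) * f_r1 + (P[1] - y_r1) * f_r2) / (y_r2 - y_r1)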
This method works well, but can only interpolate for values of P that lie within the convex hull of the data points. A less restrictive alternative transforms coordinates using the fundamental matrix, at the cost of more complex calibration. To calculate the fundamental matrix, we manually choose 16 corresponding points in the overhead and side images. These points are chosen so as to provide maximal coverage in the two planes. Although only 8 points are strictly required, doubling the number of correspondences and using RANSAC estimation [12] returns a more accurate matrix. The resulting matrix F can be used with a fingertip point m to determine the epipolar line l in the side image:
Fm = l    (3)
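A minimal sketch of this calibration step with OpenCV, assuming the 16 hand-picked correspondences are available as pixel lists; the RANSAC flag mirrors the estimation described above, and the normalization of the line is our choice.

import cv2
import numpy as np

def epipolar_line(pts_overhead, pts_side, fingertip_xy):
    F, _ = cv2.findFundamentalMat(np.float32(pts_overhead), np.float32(pts_side), cv2.FM_RANSAC)
    m = np.array([fingertip_xy[0], fingertip_xy[1], 1.0])
    l = F @ m                          # line coefficients (a, b, c): a*x + b*y + c = 0
    return l / np.linalg.norm(l[:2])   # normalize so distances are in pixels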
4.3 Fingertip Association For meaningful gesture interpretation, the position of individual fingers must be tracked through multiple frames. This involves assigning a label to each finger and maintaining the consistency of that label for as long as the finger is visible in both
cameras. Note that it is not necessary to assign a label based on the type of the finger (e.g. index finger be assigned label 1, etc.). For each input image, we assign labels by comparing the current frame to the previous frame. For each fingertip in the current frame, we look for the nearest neighbor fingertip in the previous frame. If the hand is in motion, it is possible for fingertips to be misclassified but experimentation has shown this to be a rare occurrence, since the time between frames is generally short. If the number of fingertips between two successive frames is different, new labels are registered or old labels are deleted as necessary. 4.4 Contact Detection By calculating the height of each finger from the touch surface, i.e., its distance along the z-axis, we can define a contact event as when the fingertip is at z=0. A priori, we manually register camera parameters (position, orientation, and principal point) and the endpoints of the table edge in the side camera image. During runtime, we search the camera images to establish a correspondence, then project the corresponding points to localize the fingertip in three dimensions. Our method of contact detection requires that each side camera be placed such that its principal axis lies within the touch surface plane to optimally view the intersection of a fingertip with the touch surface. This camera configuration also simplifies our localization calculations.
Fig. 6. (a) Detected fingertips with associated epipolar lines. (b) Detected fingertips labeled with height values
After fingertips have been detected in the overhead view, corresponding fingertip locations are searched for in the side camera binary image. Using the fundamental matrix between the two views, an epipolar line is defined in the side camera image for each fingertip detected in the overhead view. The corresponding fingertip location is searched for along this epipolar line beginning at the intersection of the epipolar line and the table edge line. Traveling up the epipolar line from this intersection point, a fingertip is detected as the first ON pixel encountered (see Fig. 6). The angular offset of any pixel from the principal axis of a camera may be calculated as a function of the principal point, pixel offset, camera field of view, and image dimensions:
θ_offsetH = (x − x_principal) · θ_HFOV / W    (4)
θ_offsetV = (y − y_principal) · θ_VFOV / H    (5)
where x and y are the coordinates in question, xprincipal and yprincipal are the coordinates of the principal point, θHFOV and θVFOV are the camera’s horizontal and vertical fields of view, and W and H are the image width and height in pixels, respectively. With the fingertip localized in both cameras, the angular offsets of each fingertip are determined for each camera view. The vertical angular offset θ of the fingertip in the side camera view defines a plane T (see Fig. 7), in which the fingertip exists, with a relative normal vector as:
n = (0, d · tan θ, d)    (6)
in which d is an arbitrary constant. The horizontal and vertical angular offsets θ and φ from the overhead camera O at (Ox, Oy, Oz) project a point P on the touch surface at the coordinates:
P = (Ox + h · tan θ, Oy + h · tan φ, 0)    (7)
Fig. 7. Calculation of the fingertip height from the touch surface (overhead camera O, side camera S, surface point P, fingertip Q)
The fingertip lies along line segment OP at point Q where the line segment intersects with plane T:
k = n · (S − O) / n · (P − O)    (8)

Q = O + k (P − O)    (9)

where S is the position of the side camera.
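Equations (4)–(9) combine into a short localization routine. The sketch below is our own illustration; it assumes the cameras' fields of view and principal points have been registered beforehand, that the touch surface is the plane z = 0, and that the overhead camera height h equals its z-coordinate O[2].

```python
import numpy as np

def angular_offsets(px, py, principal, fov_h, fov_v, width, height):
    """Equations (4)-(5): angular offsets of a pixel from the principal axis."""
    th_h = (px - principal[0]) * fov_h / width
    th_v = (py - principal[1]) * fov_v / height
    return th_h, th_v

def fingertip_3d(O, S, overhead_offsets, side_vertical_offset):
    """Equations (6)-(9): intersect the overhead ray with the side-camera plane T.

    O, S                 : 3-vectors, overhead and side camera positions.
    overhead_offsets     : (theta, phi) horizontal/vertical offsets seen from O.
    side_vertical_offset : vertical offset theta seen from the side camera.
    """
    O, S = np.asarray(O, float), np.asarray(S, float)
    theta, phi = overhead_offsets
    # (7): the ray from O (height h = O[2] above the surface) hits z = 0 at P
    P = np.array([O[0] + O[2] * np.tan(theta), O[1] + O[2] * np.tan(phi), 0.0])
    # (6): plane T through the side camera, tilted by the side-view offset
    d = 1.0
    n = np.array([0.0, d * np.tan(side_vertical_offset), d])
    # (8)-(9): point Q where segment OP crosses plane T
    k = n.dot(S - O) / n.dot(P - O)
    Q = O + k * (P - O)
    return Q   # Q[2] is the fingertip height above the surface
```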
The error of the measured z-value corresponding to one pixel in the side camera image can be simply calculated with a right triangle, knowing the camera’s vertical field of view θVFOV, image height H in pixels, and the fingertip distance d from the side camera:
Δz = d · tan(θVFOV / 2) / (H / 2)    (10)
4.5 Occlusion Detection

Occlusion of detected fingertips can be expected to occur in any side camera (see Fig. 8). Detections that are occluded by touches occurring nearer to the side camera should not be evaluated for contact. Our coarse approximation of the fingertips as points of contact on a planar surface is convenient for fingertip detection, but it ignores the volume that each finger occupies in space. Under this assumption we may report that a detected touch is occluded when another touch is detected along the line between it and the side camera. The side-camera touch positions conveniently correspond to the horizontal angular offsets of the detected fingertips. When fingertip coordinates from the overhead camera are transformed to the same angular offset in the side camera, the fingertips are evaluated for occlusions. All touches that share the same angular offset, except the touch nearest to the side camera, are reported as occluded.
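In code, this test is a comparison of side-view angular offsets. The following sketch is illustrative only; the angular threshold is our own parameter (the paper uses a threshold as well but does not give its value).

```python
import numpy as np

def mark_occlusions(touch_positions, side_cam_xy, angle_thresh=np.radians(2.0)):
    """touch_positions: list of (x, y) contact points on the surface.
    side_cam_xy: (x, y) position of the side camera in the surface plane.
    Returns the set of indices reported as occluded for this camera."""
    cam = np.asarray(side_cam_xy, dtype=float)
    pts = [np.asarray(p, dtype=float) for p in touch_positions]
    angles = [np.arctan2(*(p - cam)[::-1]) for p in pts]   # horizontal angular offsets
    dists = [np.linalg.norm(p - cam) for p in pts]
    occluded = set()
    for i in range(len(pts)):
        for j in range(len(pts)):
            # a nearer touch on (almost) the same ray hides touch i
            if i != j and abs(angles[i] - angles[j]) <= angle_thresh and dists[i] > dists[j]:
                occluded.add(i)
    return occluded
```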
Fig. 8. (a) Occluding virtual buttons that are near to each other or overlap positions in the side camera view. (b) Non-occluding virtual buttons that are distant from one another.
In reality the volume occupied by each finger gives some width to its profile in a side camera view, and so it occludes touches that are not perfectly collinear with itself and the camera position. Touches in the side camera that are within a threshold angular offset of another touch are therefore evaluated for occlusions.

4.6 Gesture Recognition

Touch detection and tracking provide the current and previous states of the touch positions, which may be interpreted as gestures. Our gestures (see Fig. 9) are detected simply as a change of state from the previous frame to the current frame; this incremental output is well suited to applications such as map viewers that are navigated via incremental motions.
If only a single touch is detected and the touch has translated more than a threshold distance, a pan command is issued with the detected motion vector as its parameter. If only two touches are detected, the distance between them is measured, and angular orientation is measured by drawing a line between the touch positions and measuring its angle against the positive x-axis of the overhead camera view. If the angular orientation of two touches changes beyond a threshold angle, a rotate command is issued with the angle change as its parameter. If the distance between the two touches increases or decreases beyond a threshold, a zoom-in or zoom-out is performed, respectively.
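These rules translate directly into a small state-change classifier. The sketch below is our own illustration; it assumes the two-touch lists are ordered consistently across frames (e.g. by the labels of Sect. 4.3) and uses illustrative thresholds.

```python
import numpy as np

def classify_gesture(prev, curr, pan_thresh=5.0, angle_thresh=0.1, zoom_thresh=8.0):
    """prev, curr: lists of (x, y) touch positions in two successive frames."""
    if len(prev) == 1 and len(curr) == 1:
        motion = np.subtract(curr[0], prev[0])
        if np.linalg.norm(motion) > pan_thresh:
            return ('pan', motion)                      # motion vector as parameter
    elif len(prev) == 2 and len(curr) == 2:
        v_prev = np.subtract(prev[1], prev[0])
        v_curr = np.subtract(curr[1], curr[0])
        d_angle = np.arctan2(v_curr[1], v_curr[0]) - np.arctan2(v_prev[1], v_prev[0])
        d_dist = np.linalg.norm(v_curr) - np.linalg.norm(v_prev)
        if abs(d_angle) > angle_thresh:
            return ('rotate', d_angle)                  # angle change as parameter
        if d_dist > zoom_thresh:
            return ('zoom_in', d_dist)
        if d_dist < -zoom_thresh:
            return ('zoom_out', d_dist)
    return ('none', None)
```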
Fig. 9. Single and double-touch gestures that are suitable for a map viewing application
5 Application Layer

A simple paint program (see Fig. 10) has been developed that takes advantage of the multi-touch output of our interface. It paints a different color on a black background for each detected fingertip press, allowing a user to draw freehand and leave rainbow-colored stripes on the virtual canvas by dragging an open hand across the multi-touch surface. An unfilled outline of the user's hand is displayed on the canvas to help the user place strokes at the desired locations on the screen.
Fig. 10. A multi-touch paint program screenshot
6 Conclusions and Future Work

We have demonstrated a multi-touch surface built from commodity hardware that explores the use of a side camera for touch detection. Gestures suitable for map viewing are recognized, and a paint application that takes advantage of multiple touches has been developed. The slow frame rate (qualitatively under 10 Hz) introduces a latency between the update of virtual button positions in the side camera image and the true position of the finger, forcing the user to perform gestures slowly. This issue may be resolved by improving the multiple-camera image acquisition rate in OpenCV. We encounter occasional false positives in gesture recognition among many true positive identifications, which may be resolved through spatiotemporal smoothing. The addition of more side cameras is expected to markedly reduce the number of reported occlusions. A larger surface that can better fit two hands and incorporate more side cameras would provide a larger workspace and would have the potential to detect hand ownership of touches. Future work aims to improve the robustness of each component. The current method of segmentation relies on a static background; improved implementations will dispense with this requirement, possibly by using adaptive background subtraction. More sophisticated fingertip detection methods could also be of great value in eliminating occasional false positives.
References
1. Buxton, B.: Multi-Touch Systems That I Have Known And Loved (Accessed March 21, 2007), http://billbuxton.com/multitouchOverview.html
2. Synaptics Technologies: "Capacitive Position Sensing" (Accessed March 21, 2007), http://www.synaptics.com/technology/cps.cfm
3. Mitsubishi DiamondTouch (Accessed March 21, 2007), http://www.merl.com/projects/DiamondTouch/
4. Levin, G.: Computer Vision for Artists and Designers: Pedagogic Tools and Techniques for Novice Programmers. AI & Society 20(4), 462–482 (2006)
5. Haller, M., Leithinger, D., Leitner, J., Seifried, T.: Coeno-Storyboard: An Augmented Surface for Storyboard Presentations. In: Mensch & Computer 2005, Linz, Austria, September 4-7, 2005 (2005)
6. Han, J.: Low-Cost Multi-Touch Sensing through Frustrated Total Internal Reflection. In: ACM Symposium on User Interface Software and Technology (UIST), ACM, New York (2005)
7. Kjeldsen, R., Levas, A., Pinhanez, C.: Dynamically Reconfigurable Vision-Based User Interfaces. In: Crowley, J.L., Piater, J.H., Vincze, M., Paletta, L. (eds.) ICVS 2003. LNCS, vol. 2626, Springer, Heidelberg (2003)
8. Mo, Z., Lewis, J.P., Neumann, U.: SmartCanvas: A Gesture-Driven Intelligent Drawing Desk System. In: Proceedings of Intelligent User Interfaces (IUI ’05), January 9-12, 2005 (2005)
9. Malik, S., Laszlo, J.: Visual Touchpad: A Two-handed Gestural Input Device. In: International Conference on Multimodal Interfaces (ICMI ’04), October 13-15, 2004 (2004)
10. Wilson, A.: PlayAnywhere: A Compact Interactive Tabletop Projection-Vision System. In: ACM Symposium on User Interface Software and Technology (UIST), ACM, New York (2005)
11. Morris, T., Eshehry, H.: Real-Time Fingertip Detection for Hand Gesture Recognition. In: Advanced Concepts for Intelligent Vision Systems (ACIVS ’02), Ghent University, Belgium, September 9-11 (2002)
12. Torr, P.H.S., Murray, D.W.: The development and comparison of robust methods for estimating the fundamental matrix. Intl. J. of Computer Vision 24(3), 271–300 (1997)
Fusion of Bayesian Maximum Entropy Spectral Estimation and Variational Analysis Methods for Enhanced Radar Imaging Yuriy Shkvarko, Rene Vazquez-Bautista, and Ivan Villalon-Turrubiates CINVESTAV Jalisco, Avenida Científica 1145, Colonia El Bajío, 45010, Zapopan Jalisco, México, Telephone (+52 33) 3770-3700, Fax (+52 33) 3770-3709 {shkvarko,fvazquez,villalon}@gdl.cinvestav.mx http://www.gdl.cinvestav.mx
Abstract. A new fused Bayesian maximum entropy–variational analysis (BMEVA) method for enhanced radar/synthetic aperture radar (SAR) imaging is addressed as required for high-resolution remote sensing (RS) imagery. The variational analysis (VA) paradigm is adapted via incorporating the image gradient flow norm preservation into the overall reconstruction problem to control the geometrical properties of the desired solution. The metrics structure in the corresponding image representation and solution spaces is adjusted to incorporate the VA image formalism and RS model-level considerations; in particular, system calibration data and total image gradient flow power constraints. The BMEVA method aggregates the image model and system-level considerations into the fused SSP reconstruction strategy providing a regularized balance between the noise suppression and gained spatial resolution with the VA-controlled geometrical properties of the resulting solution. The efficiency of the developed enhanced radar imaging approach is illustrated through the numerical simulations with the real-world SAR imagery.
1 Introduction

The Bayesian approach to high-resolution radar image formation is detailed in many works; here we refer to [1] – [3], where such an approach is adapted to the remote sensing (RS) applications considered in this paper. A further information-theoretic development of the Bayesian imaging paradigm, which employs the maximum entropy (ME) robust regularization of the nonlinear image reconstruction inverse problem, was developed recently in [4] – [6], where it is referred to as the Bayesian maximum entropy (BME) method. On the other hand, an alternative approach to image enhancement and noise suppression was proposed and detailed in [7] – [10], where the variational analysis (VA) paradigm was employed to incorporate a priori information regarding the image geometrical properties specified by its gradient flow over the image frame, while no particular model of the imaging system was employed. In view of this, the VA paradigm may be classified as a system model-free image enhancement approach [7] – [10]. Some second-order partial differential equation (PDE) models for
specifying the gradient flow over the image frame were employed in different VA approaches to incorporate the intrinsic image geometry properties into the enhancement procedures [7] – [10]. On the one hand, a considerable advantage of the VA paradigm is its flexibility in designing the desirable error metrics in the corresponding image representation and reconstruction spaces by defining different variational cost functionals and the relevant PDE in the overall VA optimization problem [8], [10]. On the other hand, the crucial limitation of all VA-based methods lies in their descriptive, system-model-free, deterministic regularization nature: these methods do not employ statistical optimization strategies, i.e., they do not consider a particular RS system model and do not incorporate image and noise statistics into the VA enhancement strategy. In contrast, the BME approach is based on the statistical optimization paradigm [4], [5], [11] adapted for a particular RS system model and robust a priori information about the statistics of the noise and the desired image. The latter is associated with the spatial spectrum pattern (SSP) of the wavefield backscattered from the probing surface. As the SSP represents the power distribution in the RS environment, the power non-negativity constraint is incorporated implicitly in the BME strategy, but this strategy does not incorporate specific VA geometrical properties of the image, e.g., its gradient flow over the scene/frame. In view of this, the following problem arises: how can the statistically optimal BME method be aggregated with the VA formalism for enhanced RS imaging so as to incorporate the advantages of both the VA and the BME approaches? An approximation to this problem was initially proposed in [12], where it was considered in the context of alleviating the ill-posed nature of the VA techniques that employ the anisotropic diffusion PDE as an optimization criterion [7], [8]. In this paper, we address a new balanced statistical-regularization fusion paradigm that leads to a new method referred to as the fused Bayesian maximum entropy variational analysis (BMEVA) technique. The VA paradigm is adapted via incorporating the image gradient norm preservation into the overall reconstruction problem to control the geometrical properties of the desired solution.
2 Problem Statement

Following [1], [3], [4], we define the model of the observation wavefield u by specifying the stochastic equation of observation (EO) in operator form

u = Se + n;   e ∈ E;   u, n ∈ U;   S : E → U ,    (1)
in Hilbert signal spaces E and U with the metrics structures induced by the inner products, [u1, u2]U and [e1, e2]E, respectively, where the Gaussian zero-mean random fields e, n, and u correspond to the initial coherent backscattered field, noise and observed wavefield, respectively. Next, taking into account the experiment design (ED) theory-based projection formalism [4], [5] we proceed from the operator form EO (1) to its conventional finite-dimensional vector form, U=SE+N ,
(2)
where E, N and U define the zero-mean vectors composed of the coefficients Ek , Nm , and Um of the numerical approximations (sample decomposition [4]) of the relevant
operator-form EO (1), i.e. E represents the K-D vector composed with the coefficients {Ek=[e,gk]E, k=1,…,K} of the K-D approximation, e(K)(r)=(PE(K)e)(r)=∑Ekgk(r), of the initial backscattered wavefield e(r) distributed over the RS scene (image frame) R∋r [4], and PE(K) is a projector onto the K-D signal approximation subspace E(K)=PE(K)E=Span{gk} spanned by some properly chosen set of K basis functions {gk(r)} [5], [11]. The M-by-K matrix S that approximates the signal formation operator (SFO) in (2) is given now by [4] Smk=[Sgk,ϕm]U; m=1,…,M; k=1,…,K ,
(3)
where the set of base functions {ϕm(y)} that span the finite-dimensional spatial observation subspace U(M) = PU(M)U = Span{ϕm} defines the corresponding projector PU(M) induced by these array spatial response characteristics {ϕm(y)} [11]. The ED aspects of the SSP estimation inverse problem, involving the analysis of how to choose (finely adjust) the basis functions {gk(r)} that span the signal representation subspace E(K) = PE(K)E = Span{gk} for a given observation subspace U(M) = Span{ϕm}, were investigated in more detail in previous studies [5], [14]. Here, we employ the pixel-format basis [5], [11] and the ED considerations regarding the metrics structure in the solution space, defined by the inner product

‖B‖²B(K) = [B, MB] ,    (4)
where the matrix M is referred to as the metrics-inducing operator [4], [5]. Hence, the selection of M provides additional geometrical degrees of freedom in the problem model. In this study, we incorporate the model of M that corresponds to a matrix-form approximation of the second-order Tikhonov stabilizer numerically designed in [4]. The RS imaging problem under consideration is to find a BMEVA-optimal estimate B̂(r) of the SSP B(r) distributed over the scene (image frame) R ∋ r by processing whatever values of the discrete measurements U of the data signals (2) are available, in a way that also incorporates non-trivial image model information into the estimation strategy. Thus, the purpose of our study is to develop a generalization of the BME estimator [4], [5], adapted for the high-resolution SSP reconstruction problem, that aggregates the prior image model considerations induced by the adopted metrics structure (4) in the image representation and solution spaces with geometrical considerations invoked from the VA formalism [7], [10].
3 Generalized BME Estimator for SSP

The objective of a statistical BME estimator is to obtain a unique and stable estimate B̂ by processing the measurement data U in an optimum fashion, "optimum" being understood in the sense of the Bayes minimum risk strategy [4], [13]. Note that the ill-posed nature of such an inverse problem results in an ill-conditioned SFO [3], [11]. The ME principle [13] provides a well-grounded way to alleviate the problem ill-posedness. According to the ME paradigm [13], the whole image is viewed as a composition of a great number of elementary discrete speckles (pixels), with the elementary "pixel brightness" normalized to the elementary unit of the adopted image
representation scale, e.g. 256 grades of gray in the conventional gray-scale image formats [6], [11]. Following the ME approach developed in [4], the a priori probability density function (pdf) p(B) of the discrete-form image B is to be defined via maximization of the entropy of the image probability that satisfies also the constraints imposed by the prior knowledge [5]. The vector B is viewed as an element of the nonnegative set BC of the K-D vector space BC∋B with the squared norm induced by the inner product (4). In addition, the physical factors of the experiment can be generalized via imposing the physically obvious ED constraint that bounds the average squared norm of the SSP by some preserved constant total power E0, i.e.
∫BC [B, MB] p(B) dB = E0 ,    (5)
which specifies the calibration constraint for the SSP reconstruction. Thus, the a priori pdf p(B) is to be found as the solution to the Lagrange entropy maximization problem with Lagrange multipliers α and λ, specified as follows [4]:

− ∫BC ln p(B) p(B) dB − α ( ∫BC [B, MB] p(B) dB − E0 ) − λ ( ∫BC p(B) dB − 1 ) → max over p(B) ,    (6)
for B ∈ BC, and p(B) = 0 otherwise. Routinely following the variational scheme detailed in [4], we obtain the solution to (6), which yields the Gibbs-type a priori pdf

p(B | α) = exp{ − ln Σ(α) − α [B, MB] } ,    (7)
where ∑(α) represents the so-called Boltzmann statistical sum [4], in which the normalization parameter α must be adjusted to satisfy the calibration constraint (5). Next, we define the log-likelihood function of the desired vector B [4] given the Gaussian measurements U specified by the EO (2)
Λ(B | U) = ln p(B | U) = − ln det{SD(B)S⁺ + RN} − [U, (SD(B)S⁺ + RN)⁻¹ U] .    (8)
Using (7) and (8), the BME strategy for SSP reconstruction can now be stated as the following nonlinear optimization problem [15]:

B̂ = arg min over B, α { −Λ(B | U) − ln p(B | α) } .    (9)
The optimization problem (9) is structurally similar to the one considered in [4]. The modifications incorporated in this particular study include the redefined metrics structure (4) and the boundary constraint (6). Thus, following the approach developed in [4], the desired solution to (9), in the form of a nonlinear procedure, is

B̂ = W(B̂) [ V(B̂) − Z(B̂) ] .    (10)
Here, V(B̂) = {F(B̂) U U⁺ F⁺(B̂)}diag represents a vector that has the statistical meaning of a sufficient statistics (SS) for the SSP estimator; the operator F(B̂) = D(B̂)(I + S⁺RN⁻¹ S D⁻¹(B̂))⁻¹ S⁺ RN⁻¹ is referred to as the SS formation operator; the vector Z(B̂) = {F(B̂) RN F⁺(B̂)}diag represents the shift (bias) vector; and W(B̂) = (T(B̂) + 2αD²(B̂)M)⁻¹ has the statistical meaning of a solution-dependent (i.e., adaptive) regularizing window operator with the stabilizer T(B̂) = diag{{S⁺ F⁺(B̂) F(B̂) S}diag}. Adaptation is to be performed over both the current SSP estimate B̂ and the normalization constant α, adjusted to satisfy the calibration constraint (5).
4 VA Formalism for SSP Enhancement

The goal of adapting the VA formalism is to enhance the overall quality of the SSP reconstructed via the BMEVA procedure (10). The purpose of the VA is to perform the simultaneous extraction and synthesis of the geometrical image model information from a sequence of evolutionarily innovated image reconstructions (the frames, in the VA terminology [10]) by incorporating an additional quality control functional (termed the VA energy function) into the overall BMEVA fusion strategy. The fusion process is dynamic, with the fusion rate driven by some anisotropic diffusion gain function [7], [10] dictated by the enhancement goals. In our particular study, we limit ourselves to the control of the spatial gradient flow functional, which results in the following VA energy minimization problem [10]:
EVA(B(r)) = ∫R ρ²(|∇B(r)|, r) dr → min over B(r) ,    (11)
where ∇ = (∂/∂x, ∂/∂y)ᵀ defines the spatial differential operator [11] in the Cartesian coordinate system r = (x, y)ᵀ ∈ R that, when applied to the SSP B(r), returns its gradient distribution ∇B(r) over the image frame R. Following the conventional definition for the VA energy function proposed in [7], [8], we adopt here the Lorentzian model of the error functional ρ²(·) in (11), i.e.

ρ²(r) = σ² log[ 1 + (1/2) ( |∇B(r)| / σ )² ] ,    (12)
which does not have an explicit dependence on the SSP B(r), and where σ is a normalizing constant. With the Lorentzian error functional (12), the variational procedure δEVA(B(r)) = 0 leads to the following Euler-Lagrange PDE as the VA optimization criterion [10]:

∂B(r, t)/∂t = ∇ · [ ( 1 + |∇B(r, t)|² / (2σ²) )⁻¹ ∇B(r, t) ] ,    (13)
where t represents the evolution time, translated into the iteration step number in the numerical reformulation of the PDE. Equation (13) defines the so-called Perona-Malik
anisotropic diffusion equation [7]. It is a nonlinear PDE that has no analytic solution. Hence, the VA problem (11) can be solved only numerically employing some efficient iterative techniques [11], [12]. We next have to proceed with the fusion of the VA optimization problem (11) with the generalized BME strategy (9).
5 BMEVA Method and Numerical Implementation Technique

In the proposed BMEVA method, we aggregate the VA and BME approaches in the fused strategy

B̂BMEVA = arg min over B, α | γ { −Λ(B | U) − ln p(B | α) + γ ‖∇B‖²L } ,    (14)
where ∇B defines the numerical (pixel-format) approximation to the gradient vector, ‖∇B‖²L represents the numerical approximation to the VA energy function (11) with the adopted Lorentzian error functional (12), and γ is referred to as the regularization parameter that balances the VA and BME criteria in the fused BMEVA strategy. In fact, (14) is an NP-hard optimization problem [11], i.e., ill-posed in a computational sense [8], [11]. This problem has no analytic solution in polynomial time [8]; hence, it must be solved numerically, employing some practically reasonable regularization [11] to alleviate its ill-posedness. Here, we adopt the robust regularization approach based on logarithm series tools [8]. Pursuing this technique, we first substitute ‖∇B‖²L in (14) by its second-order logarithm series approximation

‖∇B‖²L ≈ σ² Σ (n = 1 to 2) [ (−1)ⁿ⁺¹ / n ] ( ‖∇B‖ / (σ√2) )²ⁿ = [B, QB] ,    (15)
where

Q = (1/2) L + τ L L   and   τ = − 1 / (8σ²) ,    (16)
are the composed VA-regularized weighting matrix and the relaxation parameter, respectively, and the matrix L represents the numerical approximation [11] to the Laplacian second-order spatial differential operator ∇². With the approximations (15), (16), the strategy (14) can be transformed into

B̂BMEVA = arg min over B, α | γ { −Λ(B | U) − ln p(B | α) + γ [B, QB] } .    (17)
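To make the approximation (15)–(16) concrete, the matrix Q can be assembled directly from any discrete Laplacian. The sketch below is our own illustration and uses a basic 1-D second-difference matrix; the paper's actual 2-D pixel-format operator is not reproduced here.

```python
import numpy as np

def laplacian_matrix(K):
    """Standard 1-D second-difference approximation of the Laplacian (illustrative)."""
    L = -2.0 * np.eye(K)
    L += np.diag(np.ones(K - 1), 1) + np.diag(np.ones(K - 1), -1)
    return L

def va_weighting_matrix(K, sigma):
    """Equation (16): Q = (1/2) L + tau * L L with tau = -1 / (8 sigma^2)."""
    L = laplacian_matrix(K)
    tau = -1.0 / (8.0 * sigma ** 2)
    return 0.5 * L + tau * (L @ L)
```

The quadratic form [B, QB] built from this Q then replaces the Lorentzian VA energy in the fused criterion (17).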
Due to the performed robust regularization, the modified strategy (17) now corresponds to a convex-type optimization problem [8], [11]; thus, it can be solved numerically in polynomial time [8]. The variational technique [4], [5] applied to the problem (17) yields the following numerical variational equation for the desired SSP:

TB + Z − V + 2αD²MB + 2γQB = 0 .    (18)
Last, routinely solving (18) with respect to B and exposing the dependence of T(B), D(B), V(B), and Z(B) on the solution B̂, we obtain the desired BMEVA estimator

B̂ = W(B̂) [ V(B̂) − Z(B̂) ] ,    (19)
where

W(B̂) = ( T(B̂) + 2αD²(B̂)M + 2γQ )⁻¹    (20)
represents the adaptively regularized, VA-balanced nonlinear spatial window operator. The derived BMEVA estimator (19), (20) can be converted into an efficient iterative algorithm using the Seidel fixed-point iteration method [11]. Pursuing such an approach [11], we refer to the SSP estimate on the right-hand side of (19) as the current estimate B̂(t) at the t-th iteration step, and associate the entire right-hand side of (19) with the rule for forming the estimate B̂(t+1) for the next iteration step (t+1), which yields

B̂(t+1) = W(B̂(t)) [ V(B̂(t)) − Z(B̂(t)) ] .    (21)
Due to the performed regularized windowing (20), the iterative algorithm (21) converges in polynomial time [8] regardless of the choice of the balance factor γ within the prescribed normalization interval 0 ≤ γ ≤ 1. Note that in the simulations reported in the next section, forty iterations were sufficient to provide a 1% convergence error rate (i.e., ‖B̂(t+1) − B̂(t)‖² ≤ 10⁻² for all t > 40) of the developed iterative BMEVA algorithm (21) for all considered simulation scenarios.
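Since V, Z, and W in (19)–(20) all depend on the current estimate, the numerical solver is a plain fixed-point loop. The sketch below is our own illustration; it treats those operators as user-supplied callables (their internal construction follows (10) and (20) and is not reproduced here), and the non-negativity projection is an assumption added for robustness.

```python
import numpy as np

def bmeva_fixed_point(B0, make_W, make_V, make_Z, tol=1e-2, max_iter=40):
    """Iterative BMEVA estimator, equation (21).

    B0                     : initial SSP estimate (e.g. the MSF image, vectorized).
    make_W, make_V, make_Z : callables returning W(B), V(B), Z(B) for a given B.
    """
    B = np.asarray(B0, dtype=float)
    for _ in range(max_iter):
        B_next = make_W(B) @ (make_V(B) - make_Z(B))   # equation (21)
        B_next = np.maximum(B_next, 0.0)               # SSP is a power distribution (our added projection)
        if np.sum((B_next - B) ** 2) <= tol:           # convergence rule quoted in the text
            return B_next
        B = B_next
    return B
```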
6 Simulations and Discussions

In the simulations, we considered the SAR with partially (fractionally) synthesized aperture as an RS imaging system [6], [14]. The SFO was factorized along two axes in the image frame: the azimuth (horizontal axis) and the range (vertical axis). Following the common practically motivated technical considerations [3], [6], [14] we modeled a triangular shape of the SAR range ambiguity function (AF) of 3 pixels width of the 256-by-256 frame pixel format, and two side-looking SAR azimuth AFs for two typical scenarios of fractionally synthesized apertures: (i) azimuth AF of a Gaussian shape of 5 pixels width at 0.5 of its maximum level associated with the first system model and (ii) azimuth AF of a |sinc|2 shape of 7 pixels width at the zero crossing level associated with the second system model, respectively. In the simulations, the developed BMEVA method was implemented iteratively (21) and compared with the conventional matched spatial filtering (MSF) low-resolution image formation method [2], [3] and the previously proposed high-resolution BME and VA approaches to illustrate the advantages of the fused strategy. The results of the simulation experiment indicative of the reconstruction quality are reported in Figures 1 thru 4 for two different RS scenes borrowed from the real-world RS imagery of the Metropolitan area of Guadalajara city, Mexico [16]. Figures 1.a. thru 4.a show the
Fig. 1. Simulation results for the first scene, first system model: (a) original super-high resolution scene; (b) image formed with the MSF method; (c) image post-processed with the VA method; (d) SSP reconstructed with the BME method; (e) SSP reconstructed with the BMEVA method (γ=1); (f) SSP reconstructed with the BMEVA method (γ=0.25)
Fig. 2. Simulation results for the second scene, first system model: (a) original super-high resolution scene; (b) image formed with the MSF method; (c) image post-processed with the VA method; (d) SSP reconstructed with the BME method; (e) SSP reconstructed with the BMEVA method (γ=1); (f) SSP reconstructed with the BMEVA method (γ=0.25)
Fig. 3. Simulation results for the first scene, second system model: (a) original super-high resolution scene; (b) image formed with the MSF method; (c) image post-processed with the VA method; (d) SSP reconstructed with the BME method; (e) SSP reconstructed with the BMEVA method (γ=1); (f) SSP reconstructed with the BMEVA method (γ=0.25)
Fig. 4. Simulation results for the second scene, second system model: (a) original super-high resolution scene; (b) image formed with the MSF method; (c) image post-processed with the VA method; (d) SSP reconstructed with the BME method; (e) SSP reconstructed with the BMEVA method (γ=1); (f) SSP reconstructed with the BMEVA method (γ=0.25)
original super-high resolution test scenes (not observable in the simulation experiments with partially synthesized SAR system models). Figures 1.b thru 4.b present the results of SSP imaging with the conventional MSF algorithm [2]. Figures 1.c thru 4.c present the SSP frame enhanced with the VA method [7]. Figures 1.d thru 4.d show the images reconstructed with the BME method [6]. Figures 1.e thru 4.e show the images reconstructed applying the proposed BMEVA technique for equally balanced criteria in the fused strategy, i.e. γ=1 [15]. Figures 1.f thru 4.f present the BMEVA reconstruction results for the experimentally adjusted balance factor γ=0.25 [15]. Finally, the quantitative performance enhancement metrics, evaluated as the improvement in the output signal-to-noise ratio (IOSNR) [4], were calculated for the simulations with different input SNRs (μ), and the resulting IOSNRs are reported in Tables 1 and 2. The qualitative simulation results presented in Figures 1 thru 4 and the corresponding quantitative performance metrics reported in Tables 1 and 2 manifest the considerably enhanced reconstruction performances achieved with the proposed BMEVA method in comparison with the previously developed BME and VA approaches, which do not employ the fusion strategy.

Table 1. IOSNR values [dB] provided with different reconstruction methods. Results are reported for different SNRs μ for the first test scene and the two simulated SAR systems.

SNR μ [dB] | System 1: VA | System 1: BME | System 1: BMEVA (γ=1) | System 1: BMEVA (γ=0.25) | System 2: VA | System 2: BME | System 2: BMEVA (γ=1) | System 2: BMEVA (γ=0.25)
10 | 0.811 | 3.671 | 4.551 | 4.898 | 2.012 | 6.208 | 8.581 | 9.021
15 | 0.813 | 3.641 | 4.606 | 4.900 | 2.009 | 6.232 | 8.667 | 9.141
20 | 0.812 | 3.629 | 4.673 | 4.906 | 1.999 | 6.264 | 8.628 | 8.968
25 | 0.815 | 3.626 | 4.669 | 4.901 | 2.012 | 6.319 | 8.704 | 8.970
30 | 0.813 | 3.627 | 4.643 | 4.912 | 2.011 | 6.350 | 8.739 | 9.067
Table 2. IOSNR values [dB] provided with different reconstruction methods. Results are reported for different SNRs μ for the second test scene and the two simulated SAR systems.

SNR μ [dB] | System 1: VA | System 1: BME | System 1: BMEVA (γ=1) | System 1: BMEVA (γ=0.25) | System 2: VA | System 2: BME | System 2: BMEVA (γ=1) | System 2: BMEVA (γ=0.25)
10 | 0.726 | 3.220 | 7.630 | 7.871 | 1.923 | 4.402 | 10.761 | 11.301
15 | 0.728 | 3.849 | 7.638 | 7.880 | 1.913 | 4.812 | 10.783 | 11.356
20 | 0.728 | 4.933 | 7.652 | 7.977 | 1.947 | 5.445 | 10.796 | 11.354
25 | 0.725 | 5.930 | 7.669 | 7.981 | 1.921 | 6.393 | 10.843 | 11.356
30 | 0.725 | 6.932 | 7.685 | 7.980 | 1.923 | 7.434 | 10.802 | 11.422
Qualitatively, the enhancement results in better detailed inhomogeneous regions with better preserved edges between the homogeneous zones. Also, the imaging artifacts typical of the reconstructions performed with inversion techniques are
considerably suppressed. The achieved enhancement effects can be explained as a result of incorporating the balanced control of the adaptive regularization with preservation of the image geometrical features, as performed by the BMEVA technique.
7 Concluding Remarks

In summary, we may conclude that the proposed BMEVA method provides considerably improved image reconstruction, achieved by performing adaptive (i.e. nonlinear) regularized windowing in the flat regions with enhanced preservation of the edge features. The new approach also incorporates some adjustable parameters viewed as regularization degrees of freedom; these are invoked from the BME and VA methods. The BMEVA method aggregates the image model and system-level considerations into the fused SSP reconstruction strategy, providing a regularized balance between the noise suppression and the gained spatial resolution, with the VA-controlled geometrical properties of the resulting solution. The reported simulations demonstrate the efficiency of the developed method.
References
1. Falkovich, S.E., Ponomaryov, V.I., Shkvarko, Y.V.: Optimal Reception of Space-Time Signals in Channels with Scattering. Radio i Sviaz, Moscow (1989)
2. Wehner, D.R.: High-Resolution Radar, 2nd edn. Artech House, Boston (1994)
3. Henderson, F.M., Lewis, A.V.: Principles and Applications of Imaging Radar. In: Manual of Remote Sensing, 3rd edn. Wiley, New York (1998)
4. Shkvarko, Y.V.: Estimation of Wavefield Power Distribution in the Remotely Sensed Environment: Bayesian Maximum Entropy Approach. IEEE Transactions on Signal Processing 50, 2333–2346 (2002)
5. Shkvarko, Y.V.: Unifying Regularization and Bayesian Estimation Methods for Enhanced Imaging with Remotely Sensed Data. Part I - Theory. IEEE Transactions on Geoscience and Remote Sensing 42, 923–931 (2004)
6. Shkvarko, Y.V.: Unifying Regularization and Bayesian Estimation Methods for Enhanced Imaging with Remotely Sensed Data. Part II - Implementation and Performance Issues. IEEE Transactions on Geoscience and Remote Sensing 42, 932–940 (2004)
7. Black, M., Sapiro, G., Marimont, D.H., Hegger, D.: Robust Anisotropic Diffusion. IEEE Trans. Image Processing 7(3), 421–432 (1998)
8. Starck, J.L., Murtagh, F., Bijaoui, A.: Image Processing and Data Analysis: The Multiscale Approach. Cambridge University Press, Cambridge (1998)
9. Ben Hamza, A., Krim, H., Unal, B.G.: Unifying Probabilistic and Variational Estimation. IEEE Signal Processing Magazine 19, 37–47 (2002)
10. John, S., Vorontsov, M.: Multiframe Selective Information Fusion From Robust Error Estimation Theory. IEEE Trans. Image Processing 14(5), 577–584 (2005)
11. Barrett, H.H., Myers, K.J.: Foundations of Image Science. Wiley, New York (2004)
12. Vazquez-Bautista, R.F., Morales-Mendoza, L.J., Shkvarko, Y.V.: Aggregating the Statistical Estimation and Variational Analysis Methods in Radar Imagery. In: IEEE International Geoscience and Remote Sensing Symposium, IGARSS, Toulouse, France, vol. 3, pp. 2008–2010. IEEE, Los Alamitos (2003)
13. Erdogmus, D., Principe, J.C.: From Linear Adaptive Filtering to Nonlinear Information Processing. IEEE Signal Processing Magazine 23, 14–33 (2006)
14. Franceschetti, G., Iodice, A., Perna, S., Riccio, D.: Efficient Simulation of Airborne SAR Raw Data of Extended Scenes. IEEE Transactions on Geoscience and Remote Sensing 44, 2851–2860 (2006)
15. Morales-Mendoza, L.J., Vazquez-Bautista, R.F., Shkvarko, Y.V.: Unifying the Maximum Entropy and Variational Analysis Regularization Methods for Reconstruction of the Remote Sensing Imagery. IEEE Latin America Transactions 3, 60–73 (2005)
16. Space Imaging. In: GeoEye Inc. (2007), http://www.spaceimaging.com/quicklook
A PDE-Based Approach for Image Fusion

Sorin Pop¹,², Olivier Lavialle², Romulus Terebes¹, and Monica Borda¹

¹ Technical University of Cluj-Napoca, 26-28 Baritiu Street, 400027 Cluj-Napoca, Romania
² Equipe Signal et Image, LAPS-IMS UMR 5218, 351 Cours de la Liberation, F-33405 Talence, France
Abstract. In this paper, we present a new general method for image fusion based on a Partial Differential Equation (PDE). We propose to combine pixel-level fusion and diffusion processes through one single powerful equation. The insertion of the relevant information contained in the sources into the fused image is achieved by reversing the diffusion process. To solve the well-known instability problem of an inverse diffusion process, a regularization term is added. One of the advantages of such an original approach is the improved quality of the results in the case of noisy input images. Finally, a few examples and comparisons with classical fusion models demonstrate the efficiency of our method on both blurred and noisy images.
1 Introduction
Image fusion is a process which consists in combining different sources to increase the quality of the resulting images. In case of pixel-level fusion, the value of the pixels in the fused image is determined from a set of pixels in each source image. In order to obtain output images which contain better information, the fusion algorithms must fulfil certain requirements: (i) the algorithm must not discard the relevant information contained in the input images; (ii) it must not create any artifacts or inconsistencies in the output images. In the last decade, many studies were dedicated to image-level fusion methods [1]. Among the classical methods, we can note the well known methods based on pyramid decompositions [2] and [3], wavelet transform [4], or different weighted combinations [5]. These techniques were applied in a wide variety of application fields including remote sensing [6], medical imagery [7] and defect detection [8]. The most popular fusion methods are based on a multiscale decomposition. These approaches consist in performing a multiscale transform on each source image to obtain a composite multiscale representation. Then, by defining a selective scheme, the fused image is obtained through the use of an inverse multiscale transform. In this paper, we propose an original image-level approach based on the use of a Partial Differential Equation. The PDE formulation is inspired by the works dedicated to the non-linear diffusion filters.
Initially proposed by Perona and Malik [9], the non-linear diffusion filters have been widely used in edge preserving and enhancement filtering. The gray levels of an image (U) are diffused according to:

∂U/∂t = div[ c(x, y, t) ∇U ]    (1)
The scalar diffusivity c(x, y, t), in a pixel of coordinates x, y, is chosen as a non-increasing function (g) of the gradient. It governs the behavior of the diffusion process. A typical choice for the diffusivity function g is [9]:

c(x, y, t) = g(|∇U|) = 1 / ( 1 + (|∇U| / λ)² )    (2)
with λ some gradient threshold. Practical implementations of the P-M filter give impressive results: noise is eliminated and edges are kept or even enhanced, provided that their gradient value is greater than the threshold λ. Equation (1) can be put in terms of second-order derivatives taken in the directions of the gradient vectors (η) and in the orthogonal ones (ξ):

∂U/∂t = g(|∇U|) Uξξ + [ g(|∇U|) + g′(|∇U|) |∇U| ] Uηη    (3)
This expression allows an easier interpretation of the original equation, which acts like a low-pass filter along the edge directions and, selectively, can enhance edges by approaching a backward diffusion for |∇U| ≥ λ. In [10], Catté et al. show that the P-M filter is ill-posed and can also enhance noise. By simply replacing the original image in the diffusivity function by a Gaussian-smoothed one, Uσ = Gσ ∗ U, the authors establish the existence, uniqueness and regularity of the solution of their improved filter:

∂U/∂t = div[ g(|∇Gσ ∗ U|) ∇U ]    (4)
The regularized anisotropic diffusion equation does not have a directional interpretation; however, from a practical point of view, the authors noticed results similar to the P-M filter. Shock filters constitute another successful class of PDE-based filters. In order to sharpen an image, these filters, initially proposed by Osher and Rudin [11], employ an inverse diffusion equation. The well-known stability problem of the inverse heat equation is solved in the discrete domain by means of the minmod function. Other important theoretical and practical contributions were brought by Weickert [12] and [13]. The proposed EED (Edge Enhancing Diffusion) and CED (Coherence Enhancing Diffusion) models are anisotropic diffusion methods, often called tensor-based diffusion. The purpose of a tensor-based approach is to steer the smoothing process according to the directional information contained in the image structure.
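For reference, one explicit time step of the regularized filter (4) can be written in a few lines. The sketch below is our own illustration; it uses central differences via np.gradient rather than the forward/backward scheme discussed later, and the parameter values are purely illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def catte_step(U, dt=0.1, sigma=0.8, lam=2.5):
    """One explicit iteration of equation (4): U <- U + dt * div( g(|grad G_sigma*U|) grad U )."""
    Us = gaussian_filter(U, sigma)                     # U_sigma = G_sigma * U
    gy, gx = np.gradient(Us)
    g = 1.0 / (1.0 + (np.hypot(gx, gy) / lam) ** 2)    # diffusivity, equation (2)
    uy, ux = np.gradient(U)
    flux_x, flux_y = g * ux, g * uy
    div = np.gradient(flux_y, axis=0) + np.gradient(flux_x, axis=1)
    return U + dt * div
```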
In the next section, we will introduce a PDE formulation considering the source images as initial states of a diffusion process. We will extract fused versions of the images depending on the temporal evolution of the process. At each step the PDE formulation will allow to take into account the information contained in each source image to lead to a more convenient set of resulting images. In order to ensure the stability of the process, two additional constraints are introduced. To deal with noisy inputs, we propose a fusion-diffusion scheme by adding a diffusion term in the PDE. In section 3, we will show some results obtained by our fusion approach on blurred images and we will compare these results with those provided by some classical approaches. Then we will illustrate the efficiency of our approach in case of noisy source images. Conclusions and perspectives are given in section 4.
2 PDE-Based Fusion

2.1 Fusion Term
In pixel-based fusion, we consider that each source image provides a part of the relevant information we want to obtain in the output. The source images are supposed to be already registered. We propose to apply a PDE-based evolution process for each source image. At each step of the process, we are interested in keeping the relevant information contained in the current source while adding the information provided by each pixel in the other images. To achieve this task, we propose a PDE process involving an inverse diffusion process. The general continuous evolution equation of a source can be formalized as:

∂(Ui)/∂t = −βi div[ gF(|∇U|max) ∇Umax ]    (5)
where i represents the current source, max denotes the source corresponding to the maximum absolute value of the gradient, and βi is a positive weight parameter:

βi = 0 if i = max;   βi = β ∈ [0, 1] otherwise.    (6)

The weight parameter β sets the importance of the fusion. Even if equation (5) describes the evolution of a single image (i), the principle of our approach is to perform the process on each of the input images. The images are updated in parallel at each time step. The aim is to inject into the current image the relevant information from the other sources. We consider that in each location the relevant information is provided by the image corresponding to the maximum absolute value of the gradient.
Looking for the maximum of the absolute gradient value leads to detecting the edges. We search for the maximum of the gradient for each pixel. When the maximum gradient occurs in the current image, the current pixel remains unchanged (βi = 0). Otherwise, if the maximum is detected in another source, the edge is injected by reversing a diffusion process. The quantity of the fusion can be modulated by a function gF of the absolute gradient value. In this paper we adopt the constant positive function (gF = 1), which provides an isotropic behavior for the fusion process. Thus, the fusion process is a linear inverse diffusion process, which is similar to a Gaussian de-convolution. The use of a diffusion equation in a discrete image domain requires an appropriate numerical scheme. We adopt an explicit time scheme and forward and backward approximations for the spatial derivatives. The maximum gradient absolute value is evaluated over the nearest neighborhood (4 pixels in the 2D case). We present the numerical scheme for the 1D case when the fusion function is constant (gF = 1):

∂(Ui)/∂t = −βi [ Dx⁺(Umax) − Dx⁻(Umax) ]    (7)

where

Dx±(U) = ± [ U(x ± dx) − U(x) ] / dx    (8)
for both terms inside the brackets; max denotes the source corresponding to the maximum absolute value of the gradient. The major drawbacks of this type of process are instability, noise amplification and oscillations [14]. We limit these undesirable effects by imposing bounds on the gray level of each pixel:

mink(Uk^(t=0)) ≤ Ui ≤ maxk(Uk^(t=0))    (9)
where Uk, with 1 ≤ k ≤ K, is the k-th of the K sources. The limits are fixed considering the maximum and minimum values through all sources and are applied at each time step. Thus the oscillations are limited between the minimum and maximum for each sample (see Fig. 1). So, the gray-level constraint limits the oscillations and maintains the outputs in the dynamic range of the inputs (the minimum-maximum principle). In addition, we are interested in avoiding any oscillations of our model. To solve this problem, we propose a regularization term. The aim is to force the difference between two neighboring pixels to be limited by the maximum of the differences observed in the input images (neighborhood constraint). For the 1D case, this can be written as:

mink[ Dx⁻(Uk^(t=0)), 0 ] ≤ Dx⁻(Ui) ≤ maxk[ Dx⁻(Uk^(t=0)), 0 ]
mink[ Dx⁺(Uk^(t=0)), 0 ] ≤ Dx⁺(Ui) ≤ maxk[ Dx⁺(Uk^(t=0)), 0 ]    (10)
A PDE-Based Approach for Image Fusion
125
Fig. 1. Evolution of two 1D signals limited by the gray level contraint
in the input sources) or zero and the upper bound is positive (if there exists a positive step in the inputs) or zero. For 2D case, two other limits corresponding to North and South differences are added to the East and West differences of 1D case. These limits are integrated as a regularization term in the equation (5). The PDE becomes: ∂(Ui ) = −βi div [gF (|∇U |max )∇Umax ] + γdiv gR (∇Ui , ∇Ukt=0 )∇Ui (11) ∂(t) where γ is a positive weight regularization parameter, which sets the importance of the regularization term and gR is a function which is different from zero when the constraint (10) is not respected. gR will be defined in (13) for the discrete version of the PDE. In order to have a compact discrete version of the equation (11), we present in equation (12) the 1D case, but the extension in 2D case is obvious: ∂(Ui ) + − ∂(t) = −βi [Dx (Umax ) − Dx (Umax )] + +γ gR Dx (Ui ), Dx+ (Ukt=0 ) Dx+ (Ui ) − gR Dx− (Ui ), Dx− (Ukt=0 ) Dx− (Ui ) (12)
where gR function for Dx+ (Ui ): ⎧ + + Dx (Ui )−mink [Dx (Ukt=0 ),0] ⎪ if Dx+ (Ui ) < mink Dx+ (Ukt=0 ), 0 ⎪ + ⎨ Dx (Ui ) + + Dx (Ui )−maxk [Dx (Ukt=0 ),0] gR () = if Dx+ (Ui ) > maxk Dx+ (Ukt=0 ), 0 (13) + ⎪ D (U ) ⎪ i x ⎩ 0 otherwise So, gr consists in minimizing the differences between the gradient at time t and the maximum gradient at t = 0. If the maximum (respectively minimum)
126
S. Pop et al.
gradient value at t = 0 is greater (respectively less) than 0, this value is considered as the upper (respectively lower) limit for the actual gradient value. In the previous 1D example, shown in 1, the aim is to obtain at the end of the process a ’Signal A’ identical to input ’Signal B’ and to preserve the ’Signal B’. The transitions of the impulsion in ’Signal B’ are injected in ’Signal A’ by means of fusion term described above. But the flat zone between the 13th and 17th samples are obtained in ’Signal A’ after a time t = 4.8, by the means of regularization term. The cause of the fusion term for the flat zone, ’Signal B’ tries to follow ’Signal A’ (’Signal A’ presents a high gradient value detected by the fusion term and injected in ’Signal B’). The time of convergence depends on the width of impulsion and on the weight regularization parameter (γ). In the frequency domain, the flat zones are characterized by the low frequency. This regularization term can be viewed as the fusion of low frequency. Figure 2 shows the results obtained with equation 12, where the time step dt was set at 0.1 and gF = 1, while γ = 1.
Fig. 2. Evolution of two 1D signals limited by the regularization term and gray level contraint (12)
A study of the influence of γ on the convergence of the process will be the subject of a further work. Contrary to the classical fusion methods, our algorithm provides one output for each source signal. Obviously, the aim is to obtain similar outputs while the relevant information is preserved. In practice, we can observe a convergence of the process: the distance (i.e. RMSE) between the fused images decreases in time. The stopping time, like in diffusion case, is chosen by the human operator; nevertheless a criterion based on a distance measure or a quality factor calculation can be proposed.
A PDE-Based Approach for Image Fusion
2.2
127
Diffusion Term
One of the benefits of our model is the possibility to add a denoising process during the fusion process. This denoising process can be achieved by adding another term to equation (5): ∂(Ui ) = div [gD (|∇Uσ |i )∇U ] − βi div [gF (|∇U |max )∇Umax ] ∂(t) + γdiv gR (∇Ui , ∇Ukt=0 )∇Ui (14) In 14, we propose to use a diffusion term based on the Catt model [10]. Uσ denotes the Gaussian smoothed version of U and gD is the diffusion function (2). The diffusion term works on current image (i), independently of the other input images. The gray level constraint is maintained. In order to avoid the persistence of noise at starting time, the maximum and minimum of the gray level constraint are evaluated at each time step. In this way, the noise at t = 0 is not taken into account. In classical fusion approaches the noise is detected as relevant information and is injected in the fused images. Thus the obtaining of a noise-free output image requires a preprocessing step for denoising the input data.
3
Results
We choose to examine the efficiency of our 2D model in an out-of-focus image problem. Figures 3(a) and 3(b) show the details of two known images with different zones of focus. We present in Fig. 3(g),(h) the corresponding fused images obtained after 1200 iterations with a time step dt = 0.1 and a weight parameter β = 0.05 and the regularization parameter γ = 1. Let us compare the results provided by our method with some classical fusion scheme results. Among the classical fusion methods implemented in the free Matlab tool: fusetool conceived by Rockinger [15], we evaluate the Laplacian (LAP) pyramid method [16] and the Shift Invariant Discrete Wavelet Transform (SIDTW) method (with Haar function) [4]. Figures 3(e) and 3(f) illustrate the results obtained after 6 decomposition levels for Laplacian pyramid respectively 3 decomposition levels in the case of SIDTW. In both cases, the choose-max selection scheme was applied for the high-pass combination and the average of inputs for the low-pass combination. For a visual comparison, we also present the results obtained with the PCA (Principal Component Analysis) method (Fig. 3(c)) and by average the inputs (Fig. 3(d)). For a quantitative comparison of the fusion methods, we adopt the weighted fusion quality measure proposed by Piella [17]:
QW (u, v, f ) = c(w) [pu (w)Q(u, f |w) + pv (w)Q(v, f |w)] (15) w
where Q is the Wang and Bovik quality factor [18] computed in the window w. The Wang and Bovik quality factor quantifies the structural distortion between
Fig. 3. (a),(b): The input out-of-focus images (detail); Results: (c) Average (QW = 0.869); (d) PCA (QW = 0.867); (e) LAP pyramid - 6 decomposition levels (QW = 0.941); (f) SIDWT - 3 decomposition levels (QW = 0.942); (g),(h): fusion results (Equation 11) (QW = 0.941, QW = 0.941)
two images. It is composed of three factors: correlation, distortion of mean luminance and distortion of contrast:

Q(u, v) = [ σuv / (σu σv) ] · [ 2 ū v̄ / (ū² + v̄²) ] · [ 2 σu σv / (σu² + σv²) ]    (16)
σu² and σuv stand for the variance and covariance, respectively, and ū denotes the mean luminance of u. In (15), pu(w) quantifies the importance of input u relative to input v; c(w) is the overall saliency of a window. These measures employ salience information such as variance, entropy, contrast or gradient norm. We chose the variance computed in a 7-by-7 pixel window as salience information. The variance acts as an edge detector, which is desirable in this specific fusion problem. Note that the quality factor was computed on the detail images.

The PCA and average results have low quality factors, QW = 0.867 and QW = 0.869 respectively, which reflects their poor visual quality. Our proposed approach obtained a quality factor (QW = 0.941) similar to the LAP pyramid method (QW = 0.941) or the SIDWT method (QW = 0.942). Among the fusetool methods, these last two provide the best results for this application. In addition, a visual comparison confirms that our results are comparable with the images produced by the best fusion methods. The in-focus zones are detected by the absolute gradient value and are injected by the inverse diffusion equation into the output images. The high quality factor confirms that the saliency information (edges here) is well injected from the inputs into the output images. To quantify the similarity between two output images we use the root-mean-square error (RMSE):

RMSE(UA, UB) = sqrt( (1/n) Σx,y [ UA(x, y) − UB(x, y) ]² )    (17)

where n denotes the total number of pixels. We observe that the RMSE has a strongly decreasing slope. The RMSE between the input images is equal to 14.82 and it is drastically reduced to 0.74 at the end of the process. Thus, the output images are quite similar. In order to have one single output image at the end of the process, the average of the outputs or a simple selection based on the quality factor are possible.

It is well known that PDE-based algorithms are extremely costly in terms of processing time. However, the main advantage of our approach is the possibility to deal with noisy inputs: by making the degree of fusion and diffusion dependent on the local context, the approach proves to be efficient in preserving the relevant details. Thus, the proposed technique can be successfully used in offline applications for which noise is a problem that needs to be solved. To illustrate the efficiency of our approach in the case of noisy inputs, we added to the original out-of-focus images a Gaussian noise of σN = 15 (to obtain a signal-to-noise ratio SNR ≈ 9 dB for both images). In Fig. 4 we present the noisy input images as well as the fused images.
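The two scalar measures used above are easy to reproduce. The sketch below is our own illustration of the global, single-window versions of (16) and (17).

```python
import numpy as np

def wang_bovik_q(u, v):
    """Universal image quality index, equation (16), computed over one window."""
    u, v = u.astype(float).ravel(), v.astype(float).ravel()
    mu_u, mu_v = u.mean(), v.mean()
    var_u, var_v = u.var(), v.var()
    cov_uv = np.mean((u - mu_u) * (v - mu_v))
    return (cov_uv / np.sqrt(var_u * var_v)) \
        * (2 * mu_u * mu_v / (mu_u ** 2 + mu_v ** 2)) \
        * (2 * np.sqrt(var_u * var_v) / (var_u + var_v))

def rmse(ua, ub):
    """Root-mean-square error between two output images, equation (17)."""
    ua, ub = ua.astype(float), ub.astype(float)
    return np.sqrt(np.mean((ua - ub) ** 2))
```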
Fig. 4. (a),(b): The noised input images; (c),(d): fused images
The fused images are obtained with equation (14) after 100 iterations with a time step dt = 0.1, a weight parameter β = 1 and a regularization parameter γ = 1. In addition, the parameters specific to the diffusion are set to σ = 0.8 and λ = 2.5 as the threshold. The RMSE is reduced from 25.01 to 2.40 at the end of the process. As can be observed, the noise is discarded from the input images while the in-focus zones are well injected and preserved in the output images. We can note that the vertical lines in the images are not well preserved. This drawback can be avoided by introducing a directional diffusion [19] in (14) instead of the Catté diffusion. The possibility of choosing the diffusion process underlines another advantage of our approach. In the case when we dispose of noise-free input images, the weighted fusion quality factor can be useful to discriminate between the outputs. But in real noisy cases, such a quality factor depends on the saliency measure, which incorporates the noise as well as the pertinent information. A noise-free saliency measure is also the object of further studies.
4 Conclusions and Perspectives
In this paper we propose a new approach for image fusion based on an inverse diffusion process. The proposed formulation makes it possible to deal with noisy inputs through the use of a diffusion process along with the fusion process. The advantage of such an approach lies in the possibility to adapt the fusion and diffusion processes to different types of applications.
In further work we would like to propose an optimal stopping criterion for the process. In addition, we will concentrate on finding different powerful anisotropic functions for the fusion (gF). Finally, a study on the convergence of the outputs will be carried out.
References
1. Blum, R.S., Xue, Z., Zhang, Z.: An overview of image fusion. In: Blum, R.S., Liu, Z. (eds.) Multi-Sensor Image Fusion and Its Applications, Signal and Image Processing Series, M. Dekker/CRC Press, Boca Raton, USA (2005)
2. Burt, P.J., Kolczynski, R.J.: Enhanced image capture through fusion. In: 4th Intl. Conf. on Computer Vision, pp. 173–182 (1993)
3. Piella, G.: A general framework for multiresolution image fusion: from pixels to regions. Information Fusion 9, 259–280 (2003)
4. Rockinger, O.: Image Sequence Fusion Using a Shift-Invariant Wavelet Transform. In: International Conference on Image Processing ICIP 1997, vol. III, pp. 288–292 (1997)
5. Rockinger, O., Fechner, T.: Pixel-level image fusion: the case of image sequences. Proc. SPIE 3374, 378–388 (1998)
6. Simone, G., Farina, A., Morabito, F.C., Serpico, S.B., Bruzzone, L.: Image fusion techniques for remote sensing applications. Information Fusion 3(1), 3–15 (2002)
7. Pattichis, C.S., Pattichis, M.S., Micheli-Tzanakou, E.: Medical image fusion applications: an overview. Systems and Computers 2, 1263–1267 (2001)
8. Reed, J.M., Hutchinson, S.: Image fusion and subpixel parameter estimation for automated optical inspection of electronic components. IEEE Transactions on Industrial Electronics 43(3), 346–354 (1996)
9. Perona, P., Malik, J.: Scale space and edge detection using anisotropic diffusion. IEEE Transactions on PAMI 12(7), 629–639 (1990)
10. Catté, F., Lions, P.L., Morel, J.M., Coll, T.: Image selective smoothing and edge detection by nonlinear diffusion I. SIAM Journal on Numerical Analysis 29(1), 182–193 (1992)
11. Osher, S., Rudin, L.: Feature-oriented image enhancement with shock filters. SIAM Journal on Numerical Analysis 27(3), 919–940 (1990)
12. Weickert, J.: Coherence enhancing diffusion filtering. In: Hlavac, V., Sara, R. (eds.) Computer analysis of images and patterns, pp. 230–237. Springer, Heidelberg (1995)
13. Weickert, J.: Multiscale texture enhancement. International Journal of Computer Vision 31, 111–127 (1999)
14. Gilboa, G., Sochen, N., Zeevi, Y.: Forward-and-Backward Diffusion Processes for Adaptive Image Enhancement and Denoising. IEEE Trans. Image Processing 11(7), 689–703 (2002)
15. Fusetool by O. Rockinger at http://www.metapix.de
16. Burt, P.J., Adelson, E.H.: The Laplacian Pyramid as a Compact Image Code. IEEE Transactions on Communications COM-31(4), 532–540 (1983)
17. Piella, G.: New quality measures for image fusion. In: Intl. Conference on Information Fusion, pp. 542–546 (2004)
18. Wang, Z., Bovik, A.C.: A universal image quality index. IEEE Signal Processing Letters 9(3), 81–84 (2002)
19. Terebes, R., Lavialle, O., Baylou, P., Borda, M.: Directional anisotropic diffusion. In: European Signal Processing Conference EUSIPCO 2002, vol. 2, pp. 266–269 (2002)
Improvement of Classification Using a Joint Spectral Dimensionality Reduction and Lower Rank Spatial Approximation for Hyperspectral Images
N. Renard1, S. Bourennane1, and J. Blanc-Talon2
1 Univ. Paul Cézanne, Centrale Marseille, Institut Fresnel (CNRS UMR 6133), Dom. Univ. de Saint Jérôme, F-13013 Marseille cedex 20, France
2 DGA/D4S/MRIS, Arcueil, France
[email protected],
[email protected]
Abstract. Hyperspectral images (HSI) are multidimensional and multicomponent data with a huge number of spectral bands providing spectral redundancy. To improve the efficiency of the classifiers, the principal component analysis (PCA), referred to as PCAdr, the maximum noise fraction (MNF) and, more recently, the independent component analysis (ICA), referred to as ICAdr, are the most commonly used techniques for dimensionality reduction (DR). But in HSI, and in general when dealing with multi-way data, these techniques are applied to the vectorized images, yielding two-way data. The spatial representation is lost and the spectral components are selected using only spectral information. As an alternative, in this paper, we propose to consider HSI as array data, or tensors, instead of matrices, which offers multiple ways to decompose the data orthogonally. We develop two new DR methods based on multilinear algebra tools, which perform the DR using PCAdr for the first one and ICAdr for the second one. We show that the result of spectral angle mapper (SAM) classification is improved by taking advantage of joint spatial and spectral information and by performing simultaneously a dimensionality reduction on the spectral way and a projection onto a lower dimensional subspace of the two spatial ways.
1 Introduction
The emergence of hyperspectral images (HSI) implies the exploration and the collection of a huge amount of data. Hyperspectral imaging sensors provide a huge number of spectral bands, typically up to several hundred. It is well known that HSI contain many highly correlated bands, providing a considerable amount of spectral redundancy. This unreasonably large dimension not only increases computational complexity but also degrades classification accuracy [1]. Indeed, the estimation of the statistical properties of classes in a supervised classification process requires the number of training samples to increase exponentially with the number of data dimensions if the classifier is non-parametric.
The relation is linear for a linear classifier and quadratic in the dimensionality for a quadratic classifier [2]. In HSI, only a small amount of training data is available, and previous research has demonstrated that high-dimensional data spaces are mostly empty, indicating that the data structure involved exists primarily in a subspace. Dimensionality reduction (DR) is often employed for band decorrelation and data dimension reduction by extracting features in a transformed space, and as a result it increases classification and detection efficiency. Due to its simplicity and ease of use, the most popular DR algorithm is the PCA, referred to as PCAdr, which maximizes the amount of data variance by orthogonal projection. A refinement of PCAdr is the independent component analysis (ICA), referred to as ICAdr [3,4], which uses higher-order statistics. But the use of these matrix algebra methods requires a preliminary step which consists in vectorizing the images. Therefore they rely on the spectral properties of the data only, thus neglecting the spatial arrangement. To overcome this weakness, [5] proposes a feature extraction method based on a multichannel mathematical morphology operator which incorporates the image representation. In this paper, we propose to use multilinear algebra tools for the DR problem, performing a spectral and spatial decorrelation simultaneously. This strategy requires considering HSI as multi-way data. As was pointed out in [6], the intuitive representation of a collection of images is a three-dimensional array, or third-order tensor, rather than a matrix of vectorized images. Hence, instead of adapting the data to classical matrix-based algebraic techniques (by rearrangement or splitting), multilinear algebra (the algebra of higher-order tensors) provides a powerful mathematical framework for analyzing the multifactor structure of data. The Tucker3 tensor decomposition has been developed with the aim of generalizing the matrix singular value decomposition (SVD). The Tucker3 model thus achieves a multimode PCA, also known as the higher-order SVD (HOSVD) [7], and the lower rank-(K1, K2, K3) tensor approximation (LRTA-(K1, K2, K3)) [8,9]. These multilinear tools have been applied to blind source separation, to the separation of seismic waves, and in image processing to noise filtering in color images [10] and to face recognition [11]. We propose two novel multilinear tools for the DR problem to improve classification efficiency in the hyperspectral context. They perform jointly a dimensionality reduction of the spectral way (by extracting D3 spectral components) and a lower spatial (K1, K2)-rank approximation. The latter process is a projection onto a lower dimensional subspace which makes it possible to spatially decorrelate the data. The first proposed method extracts the spectral components using PCAdr and is referred to as LRTAdr-(K1, K2, D3); the second one uses ICAdr and is referred to as LRTA-ICAdr-(K1, K2, D3). As a result, these multimodal methods take advantage of both spatial and spectral information. The remainder of the paper is organized as follows: Section 2 presents the multi-way model and a short overview of its major properties. Section 3 introduces the multimode PCA. While reviewing the classical DR methods (PCAdr and ICAdr), Section 4 introduces our multilinear-based methods, the LRTAdr-(K1, K2, D3) and the LRTA-ICAdr-(K1, K2, D3).
Section 5 contains some comparative results of classification performance after dimensionality reduction of hyperspectral images.
2 Multi-way Modelling and Properties
In this paper we consider a three-way array as a third-order tensor. We define a tensor of order 3 as 3-way data, the entries of which are accessed via 3 indices. It is denoted by X ∈ R^{I1×I2×I3}, with elements arranged as x_{i1 i2 i3}, i1 = 1, ..., I1; i2 = 1, ..., I2; i3 = 1, ..., I3, where R is the real manifold. Each index is called a way or mode, and the number of levels in a mode is called the dimension of that mode. Mode n is built on a vector space E^{(n)} of dimension In, which is the number of data samples in the physical way associated with mode n. Each way of this multidimensional array is associated with a physical quantity. For instance, in multivariate image analysis, an HSI is a sample of I3 images of size I1 × I2; we have three indices and the data can be geometrically arranged in a box of dimension I1 × I2 × I3. HSI data can thus be represented as a three-way array: two modes for rows and columns and one mode for the spectral channel. First, let us give a brief review of tensor rank definitions, which can be found in [8]. The n-mode rank of tensor data X ∈ R^{I1×I2×I3}, denoted by Rank_n(X), is the dimension of its n-mode vector space E^{(n)}, composed of the In-dimensional vectors obtained from X by varying index i_n and keeping the other indices fixed. X is called a rank-(K1, K2, K3) tensor if Rank_n(X) = Kn for n = 1, 2, 3. This multi-way, or tensor, modelling makes it possible to treat multivariate data as an inseparable whole, which involves a joint processing of each mode without a separability assumption, rather than splitting the data or processing only the vectorized images. This model naturally calls for processing techniques based on multilinear algebra. The Tucker3 model [12] is the most commonly used tensor decomposition model; it permits the approximation by a lower rank-(K1, K2, K3) tensor, the LRTA-(K1, K2, K3).
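As an illustration of the n-mode vector space and n-mode rank definitions above, the following minimal NumPy sketch (not from the paper; the unfolding convention and variable names are our own assumptions) computes the rank of each n-mode unfolding of a small cube.

```python
import numpy as np

def unfold(X, n):
    """n-mode unfolding: rows are the n-mode vectors of X (n in {0, 1, 2})."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def n_mode_rank(X, n):
    """Dimension of the n-mode vector space, i.e. the rank of the n-mode unfolding."""
    return np.linalg.matrix_rank(unfold(X, n))

X = np.random.rand(8, 6, 5)                      # toy I1 x I2 x I3 cube (rows x columns x bands)
print([n_mode_rank(X, n) for n in range(3)])     # [8, 6, 5] for generic random data
```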
3 Multimode PCA: LRTA-(K1, K2, K3)
Following the Tucker3 model, any 3-way data X ∈ R^{I1×I2×I3} can be decomposed as:

X = C ×_1 U^{(1)} ×_2 U^{(2)} ×_3 U^{(3)},   (1)

where the U^{(n)} are orthogonal matrices holding the Kn eigenvectors associated with the Kn largest eigenvalues, C ∈ R^{I1×I2×I3} is the core tensor and ×_n is the n-mode product; these properties can all be found in [8]. An example of the Tucker3 three-way decomposition model is illustrated in Fig. 1.
Fig. 1. Tucker3 decomposition model
Given real-valued three-way data X ∈ R^{I1×I2×I3}, the LRTA-(K1, K2, K3) problem consists in finding the lower rank-(K1, K2, K3) multi-way data X̂, with Kn < In for n = 1 to 3, which minimizes the quadratic Frobenius norm ||X − X̂||²_F. The best lower rank-(K1, K2, K3) multi-way approximation of X in the least-squares sense is

X̂ = X ×_1 P^{(1)} ×_2 P^{(2)} ×_3 P^{(3)},   (2)

with

P^{(n)} = U^{(n)} U^{(n)T},   (3)

where P^{(n)} is the projector onto the Kn-dimensional subspace of E^{(n)} that minimizes the above criterion. In a vector or matrix formulation, the projector onto the signal subspace is defined from the eigenvectors associated with the largest eigenvalues of the covariance matrix of the set of observation vectors. By extension, in the tensor formulation, the projectors onto the n-mode vector spaces are estimated by computing the best LRTA-(K1, K2, K3) in the least-squares sense. X̂ ∈ R^{I1×I2×I3} is obtained after convergence of an alternating least squares (ALS) algorithm, which can be summarized in the following steps:

1. Initialization, k = 0: perform the HOSVD [7] to initialize the projectors: for n = 1 to 3, P_0^{(n)} = U_0^{(n)} U_0^{(n)T}, where U_0^{(n)} contains the Kn eigenvectors associated with the Kn largest eigenvalues of the unfolding X_n [13].
2. ALS loop: while ||X − X̂_k||²_F > 10^{-4},
   (a) for n = 1 to 3:
       i. compute X̂_k^{(n)} = X ×_q P_{k+1}^{(q)} ×_r P_{k+1}^{(r)}, with q ≠ n and r ≠ n;
       ii. n-mode unfold X̂_k^{(n)} into the matrix X̂_{n,k};
       iii. compute the matrix C_k^{(n)} = X̂_{n,k} X̂_{n,k}^T;
       iv. compute the SVD of C_k^{(n)}; U_{k+1}^{(n)} ∈ R^{In×Kn} contains the Kn eigenvectors associated with the Kn largest eigenvalues;
       v. compute P_{k+1}^{(n)} = U_{k+1}^{(n)} U_{k+1}^{(n)T};
   (b) compute X̂_{k+1} = X ×_1 P_{k+1}^{(1)} ×_2 P_{k+1}^{(2)} ×_3 P_{k+1}^{(3)};
3. Output: X̂_{k_stop} = X ×_1 P_{k_stop}^{(1)} ×_2 P_{k_stop}^{(2)} ×_3 P_{k_stop}^{(3)}, the best lower rank-(K1, K2, K3) approximation of X.
The LRTA-(K1, K2, K3) uses the intact multi-way structure to derive the n-mode projectors jointly. Indeed, the LRTA-(K1, K2, K3) takes into account the cross-dependency of the information contained in each mode thanks to the ALS algorithm. The next section shows how the LRTA-(K1, K2, K3) can be an interesting tool for hyperspectral images.
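The following is a rough NumPy sketch of the ALS loop described above; helper functions, the stopping test on successive approximations (a practical stand-in for the paper's Frobenius-norm test) and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def fold(M, n, shape):
    rest = [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape([shape[n]] + rest), 0, n)

def mode_n_product(X, U, n):
    new_shape = list(X.shape)
    new_shape[n] = U.shape[0]
    return fold(U @ unfold(X, n), n, new_shape)

def lrta(X, ranks, max_iter=50, tol=1e-4):
    """Lower rank-(K1, K2, K3) approximation of a 3-way array X by ALS."""
    # Step 1: HOSVD-style initialization of the n-mode bases U(n).
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :ranks[n]]
         for n in range(3)]
    X_hat = X
    for _ in range(max_iter):
        for n in range(3):
            # (a) project X on the two other modes, then update U(n) from the
            # Kn dominant eigenvectors of C(n) = X_n X_n^T.
            Y = X
            for q in range(3):
                if q != n:
                    Y = mode_n_product(Y, U[q] @ U[q].T, q)
            Yn = unfold(Y, n)
            U[n] = np.linalg.svd(Yn @ Yn.T)[0][:, :ranks[n]]
        # (b) recompute the approximation with the current projectors P(n) = U(n) U(n)^T.
        X_new = X
        for n in range(3):
            X_new = mode_n_product(X_new, U[n] @ U[n].T, n)
        # Stop when the approximation no longer changes.
        if np.linalg.norm(X_new - X_hat) ** 2 < tol:
            return X_new
        X_hat = X_new
    return X_hat

X = np.random.rand(20, 20, 10)
X_approx = lrta(X, ranks=(5, 5, 3))
print(X_approx.shape)   # (20, 20, 10): same size, but rank-(5, 5, 3)
```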
4 The Dimensionality Reduction (DR) Tools
4.1 Classical DR Methods
In the hyperspectral context, there is great interest in reducing the spectral way by selecting the most significant spectral features to maximize the separation between classes. Suppose that we collect I3 images of full size I1 × I2. Each of the I3 images is transformed into a vector x^T by row concatenation. The tensor X ∈ R^{I1×I2×I3} becomes a matrix X ∈ R^{I3×p} where p = I1 · I2. The aim of the DR is to extract a small number D3 of features, with D3 < I3, called components. In the PCAdr context the extracted components are called principal components (PCs). Each PC is generated by projecting the data space onto the nth eigenvector associated with the nth largest eigenvalue. This orthogonal projection maximizes the amount of data variance. Therefore the D3 spectral PCs generate a reduced matrix Z_PCs ∈ R^{D3×p}. If Λ ∈ R^{D3×D3} is the diagonal matrix of eigenvalues and U ∈ R^{I3×D3} holds their associated eigenvectors, the PCs are given by:

Z_PCs = Λ^{-1/2} U^T X.   (4)
In the ICAdr context [4,3], the extracted components are called independent components (ICs). ICA searches for a linear non-orthogonal transformation which minimizes the statistical dependence between components. The observed signals X are used to estimate the unmixing matrix W ∈ R^{I3×D3} with the FASTICA algorithm [14]. The hyperspectral images are then transformed onto a lower dimensional space, yielding the reduced matrix Z_ICs ∈ R^{D3×p}, which is constructed from the desired D3 materials (sources). The ICs are given by:

Z_ICs = W^T X.   (5)

From the Z_PCs or Z_ICs matrices, the data can be reshaped into a tensor image Z ∈ R^{I1×I2×D3}. Figure 2 a) illustrates the PCAdr and ICAdr strategy in hyperspectral imagery.
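A minimal sketch of the PCAdr strategy just described (vectorize the cube, project onto the D3 leading eigenvectors with whitening as in equation (4), reshape back) is given below; variable names and the use of NumPy are assumptions for illustration only.

```python
import numpy as np

def pca_dr(X, D3):
    """X: hyperspectral cube of shape (I1, I2, I3). Returns Z of shape (I1, I2, D3)."""
    I1, I2, I3 = X.shape
    Xmat = X.reshape(I1 * I2, I3).T                 # I3 x p matrix, p = I1*I2
    Xmat = Xmat - Xmat.mean(axis=1, keepdims=True)  # zero-mean spectral bands
    cov = Xmat @ Xmat.T / (I1 * I2)                 # spectral covariance (I3 x I3)
    eigval, eigvec = np.linalg.eigh(cov)
    idx = np.argsort(eigval)[::-1][:D3]             # D3 largest eigenvalues
    U, lam = eigvec[:, idx], eigval[idx]
    Z = np.diag(1.0 / np.sqrt(lam)) @ U.T @ Xmat    # equation (4): Lambda^{-1/2} U^T X
    return Z.T.reshape(I1, I2, D3)

Z = pca_dr(np.random.rand(64, 64, 148), D3=10)      # (64, 64, 10)
```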
4.2 Multilinear-Based DR Methods
We can easily adapt the well-known LRTA-(K1, K2, K3) (see Section 3) into a spectral dimensionality reduction tool.
Fig. 2. Dimensionality reduction strategy: a) PCAdr and ICAdr; b) LRTAdr-(K1, K2, D3); c) LRTA-ICAdr-(K1, K2, D3).
The main purpose of our multilinear-based methods is to extract D3 spectral components from X ∈ R^{I1×I2×I3} to derive the tensor Z ∈ R^{I1×I2×D3}. The challenge addressed by our two proposed DR methods is to jointly reduce the dimensionality of the spectral way and to project the spatial ways onto a lower dimensional subspace. As for the LRTA-(K1, K2, K3), our DR methods estimate spatial projectors P^{(n)} (equation (3)), with n = 1, 2, which spatially decorrelate the data and approximate it. Our first multilinear-based method, the LRTAdr-(K1, K2, D3), extracts the principal spectral components according to the following model:

Z = X ×_1 P^{(1)} ×_2 P^{(2)} ×_3 Λ^{-1/2} U^{(3)T},   (6)

where U^{(3)} is the matrix holding the D3 eigenvectors associated with the D3 largest eigenvalues, Λ is the diagonal matrix holding the D3 largest eigenvalues, and the P^{(n)} are the n-mode projectors defined as in Section 3. With the same strategy, the ICAdr method is integrated to yield our second proposed multilinear-based DR method, the LRTA-ICAdr-(K1, K2, D3), whose model is defined by:

Z = X ×_1 P^{(1)} ×_2 P^{(2)} ×_3 W^{(3)T},   (7)
where W^{(3)} ∈ R^{I3×D3} is the unmixing matrix, estimated with the FASTICA algorithm. Figures 2 b) and c) illustrate the LRTAdr-(K1, K2, D3) and LRTA-ICAdr-(K1, K2, D3) schemes.
The major attribute of the LRTAdr-(K1, K2, D3) and LRTA-ICAdr-(K1, K2, D3) with respect to PCAdr and ICAdr, respectively, is the use of spatial information to select the components. Indeed, thanks to the ALS loop, the spectral features are estimated iteratively, like the spatial n-mode projectors. Different (K1, K2, D3)-values can be retained for each way. [15] proposes to estimate the D3 dimension by introducing criteria which determine the virtual dimensionality, defined as the minimum number of spectrally distinct signal sources that characterize the hyperspectral data. Concerning the (K1, K2)-dimensional subspaces, [16] proposes to extend the Akaike information criterion (AIC) in order to estimate the signal subspace in the case of Gaussian additive noise. In this paper, we focus on introducing multimodal tools in the hyperspectral context and all (K1, K2, D3)-dimensions are fixed empirically.
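The sketch below illustrates the LRTAdr-(K1, K2, D3) model of equation (6): the two spatial modes are projected onto K1- and K2-dimensional subspaces while the spectral mode is reduced to D3 principal components. For brevity the spatial projectors are estimated in a single HOSVD-style pass rather than jointly in the ALS loop, so this is a simplified illustration under our own assumptions, not the authors' code.

```python
import numpy as np

def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def fold(M, n, shape):
    rest = [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape([shape[n]] + rest), 0, n)

def mode_n_product(X, U, n):
    new_shape = list(X.shape)
    new_shape[n] = U.shape[0]
    return fold(U @ unfold(X, n), n, new_shape)

def lrta_dr(X, K1, K2, D3):
    """X: (I1, I2, I3) cube with the mean of each pixel spectrum removed."""
    # Spatial projectors P(1), P(2) from the leading n-mode singular vectors.
    P = []
    for n, K in ((0, K1), (1, K2)):
        U = np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :K]
        P.append(U @ U.T)
    # Spectral factor Lambda^{-1/2} U(3)^T from the 3-mode covariance (PCA_dr).
    X3 = unfold(X, 2)
    eigval, eigvec = np.linalg.eigh(X3 @ X3.T / X3.shape[1])
    idx = np.argsort(eigval)[::-1][:D3]
    W3 = np.diag(1.0 / np.sqrt(eigval[idx])) @ eigvec[:, idx].T     # D3 x I3
    # Equation (6): Z = X x1 P(1) x2 P(2) x3 Lambda^{-1/2} U(3)^T
    Z = mode_n_product(mode_n_product(mode_n_product(X, P[0], 0), P[1], 1), W3, 2)
    return Z                                                        # (I1, I2, D3)

Z = lrta_dr(np.random.rand(64, 64, 148), K1=40, K2=40, D3=10)
```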
5 Results
The data used in the following experiments are real-world data collected by the HYDICE imaging sensor, with a 1.5 m spatial and 10 nm spectral resolution, including 148 spectral bands (from 435 to 2326 nm), 310 rows and 220 columns. This HSI can be represented as a multi-way array, denoted by X ∈ R^{310×220×148}. For convenience, a preprocessing step removes the mean of each vector pixel of the initial multi-way data X. In this paper, we focus on the classification result obtained after each DR method. Figure 3 a) shows the entire scene used for the experiments. The land cover classes are: field, trees, road, shadow and 3 different targets.
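A small sketch of the preprocessing step mentioned above (removing the mean of each pixel's spectral vector from the 310 × 220 × 148 cube); the array is a random stand-in since loading of the HYDICE data is not shown in the paper.

```python
import numpy as np

X = np.random.rand(310, 220, 148)               # stand-in for the HYDICE cube
X_centered = X - X.mean(axis=2, keepdims=True)  # zero-mean spectrum at every pixel
```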
Classes    Training samples   Test samples   Color
field      1 002              40 811         green 1
forest     1 367              5 537          green 2
road       139                3 226          blue 1
shadow     372                5 036          pink
target 1   128                519            red
target 2   78                 285            blue 2
target 3   37                 223            yellow

Fig. 3. Classes in the HYDICE image RGB (a); information classes and samples (b)
The resulting numbers of training and testing pixels for each class are given in Fig. 3 b). The classification [17,1] is performed with the spectral angle mapper (SAM) algorithm [17], which is widely applied to HSI data. To allow quantitative comparisons, we determine the overall (OA) and individual test accuracies in percentage exhibited by the SAM classifier. OA is defined as follows: OA = (1/M) Σ_{i=1}^{P} a_ii, where M is the total number of samples, P is the number of classes Ci for i = 1, ..., P, and a_ij is the number of test samples
that actually belong to class Ci and are classified into Cj, for i, j = 1, ..., P. In the considered example P = 7. To highlight the advantage of a multi-way method before classification, we first compare the SAM classification results after applying the LRTAdr-(K1, K2, D3) and the PCAdr-(D3) (schematized in Fig. 2), each of which extracts D3 spectral components. The second experiment compares the SAM classification results after applying the LRTA-ICAdr-(K1, K2, D3) and the ICAdr-(D3) (schematized in Fig. 2). For all experiments, the classification results are evaluated for various numbers of retained spectral components, and in each case we empirically test several (K1, K2)-dimensions of the spatial subspaces for the LRTAdr-(K1, K2, D3) and for the LRTA-ICAdr-(K1, K2, D3).
• LRTAdr-(K1, K2, D3) compared to PCAdr-(D3).
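For reference, here is a minimal sketch of the SAM decision rule and of the OA measure used in the comparisons that follow: each test pixel is assigned to the class whose reference spectrum makes the smallest spectral angle with it, and OA is the fraction of correctly classified test samples. Variable names and the toy data are assumptions for illustration.

```python
import numpy as np

def sam_classify(pixels, class_means):
    """pixels: (N, D) spectra; class_means: (P, D) reference spectra per class."""
    p = pixels / np.linalg.norm(pixels, axis=1, keepdims=True)
    m = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    angles = np.arccos(np.clip(p @ m.T, -1.0, 1.0))   # (N, P) spectral angles
    return np.argmin(angles, axis=1)

def overall_accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)                  # OA = (1/M) * sum_i a_ii

rng = np.random.default_rng(0)
means = rng.random((7, 10))                           # P = 7 classes, D-band spectra
labels = rng.integers(0, 7, size=200)
pixels = means[labels] + 0.01 * rng.random((200, 10))
print(overall_accuracy(labels, sam_classify(pixels, means)))
```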
[Figure 4: plot of the overall accuracy (OA), from 84 to 98, versus the number of spectral components D3 (0–150), for PCAdr, LRTAdr-(310,220,D3), LRTAdr-(60,60,D3) and LRTAdr-(40,40,D3).]
Fig. 4. Dimensionality reduction outcome for SAM classification: the overall accuracy with respect to the number of retained spectral components. The OA obtained from the initial tensor image X ∈ R^{310×220×148} is equal to 78.98.
Figure 4 shows the overall accuracy with respect to the number of retained spectral components. Knowing that the OA obtained from the initial tensor image X ∈ R^{310×220×148} is equal to 78.98, Fig. 4 highlights the interest of DR when the aim is classification. Indeed, for the PCAdr method, we note that there is an optimal spectral dimension: using too few or too many components decreases the classification efficiency. We also notice that the LRTAdr-(K1, K2, D3) leads to a better OA than PCAdr-(D3) for every number D3 of spectral components. For each value of D3, the lower the (K1, K2) values (here 40), the better the classification results. The individual class accuracies are reported in Table 1; for convenience, only the results obtained for D3 = 5 and D3 = 10 are reported.
Fig. 5. Dimensionality reduction outcome for classification, 10 spectral features extracted: a) classification result from the initial data, OA = 78.98; b) from PCAdr-(10), OA = 92.73; c) from LRTAdr-(40, 40, 10), OA = 97.32.
It is revealed that the LRTAdr-(K1, K2, D3) yields better classification efficiency by jointly selecting the ten most significant spectral components and reducing the dimensions of the spatial subspaces to 40.

Table 1. Overall (OA) and individual test accuracies in percentage obtained after applying the PCAdr-(D3) and the LRTAdr-(K1, K2, D3)
Class      Initial   ---------- D3 = 5 bands ----------    ---------- D3 = 10 bands ---------
           Image     PCAdr   LRTAdr     LRTAdr    LRTAdr    PCAdr   LRTAdr     LRTAdr    LRTAdr
                             (310,220)  (60,60)   (40,40)           (310,220)  (60,60)   (40,40)
field      88.9      93.2    88.3       95.4      97.7      97.5    98.3       99.7      100
forest     4.4       41.0    54.3       62.3      72.9      64.1    69.5       75.8      82.5
road       85.3      98.7    83.3       94.7      95.5      89.6    95.6       91.1      97.7
shadow     80.1      95.1    95.7       97.4      96.8      93.8    96.1       93.5      95.7
target 1   64.9      67.0    54.9       72.7      76.7      63.6    57.6       79.6      81.3
target 2   80.7      77.9    75.4       67.7      66.3      68.4    73.7       83.2      84.2
target 3   31.5      39.6    44.9       65.8      78.0      42.8    38.5       63.6      51.3
OA         78.98     87.96   84.76      91.83     94.61     92.73   95.36      95.83     97.32
Figure 5 shows the visual classification results obtained from the original multi-way array X and after the two PCAdr-based DR methods, which select D3 = 10 spectral features, the spatial subspace (K1, K2)-dimensions being fixed to 40 for the LRTAdr-(40, 40, 10). Figure 5 makes it possible to visually appreciate the usefulness of DR and shows that, in comparison with PCAdr-(D3), the LRTAdr-(K1, K2, D3) yields more homogeneous classes, and the main areas corresponding to the background and the targets are more identifiable, with fewer unclassified pixels. These quantitative and visual results confirm the ability of the LRTAdr-(K1, K2, D3) as a DR tool for the considered HSI data with the aim of improving classification.
• LRTA-ICAdr-(K1, K2, D3) compared to ICAdr-(D3).
[Figure 6: plot of the overall accuracy (OA), from 84 to 98, versus the number of spectral components D3 (5–40), for ICAdr, LRTA-ICAdr-(310,220,D3), LRTA-ICAdr-(60,60,D3) and LRTA-ICAdr-(40,40,D3).]
Fig. 6. Dimensionality reduction outcome for SAM classification: the overall accuracy with respect to the number of retained spectral components for ICAdr-(D3) and LRTA-ICAdr-(K1, K2, D3). The OA obtained from the initial tensor image X ∈ R^{310×220×148} is equal to 78.98.
The same experiment is performed using the ICAdr-(D3) and LRTA-ICAdr-(K1, K2, D3) as DR methods. Figure 6 shows the overall accuracy with respect to the number of retained spectral components, varying from 5 to 40. Like PCAdr, ICAdr requires an optimal number of spectral components to yield good classification results. Figure 6 shows that the LRTA-ICAdr-(K1, K2, D3) leads to a better OA than ICAdr-(D3) for all D3 spectral components varying from 5 to 40. Table 2 gives more information about the individual class accuracies. Figure 7 shows the visual classification results obtained from the original multi-way array X and after the two ICAdr-based methods, which select D3 = 20 spectral features, the spatial subspace (K1, K2)-dimensions being fixed to 60 for the LRTA-ICAdr-(60, 60, 20). Like the LRTAdr-(K1, K2, D3), the LRTA-ICAdr-(60, 60, 20) yields more homogeneous classes. Moreover, this ICAdr-based multi-way DR method makes it possible to detect all 8 targets of type three (see Fig. 3). These results confirm that the LRTA-ICAdr-(60, 60, 20) is also an effective DR tool for this hyperspectral image and improves classification efficiency. It is well known that the number of retained spectral features has an impact on classification efficiency. The results obtained above with the proposed LRTAdr-(K1, K2, D3) and LRTA-ICAdr-(K1, K2, D3) DR methods show that the dimensions of the spatial subspaces also have a strong impact. This interplay between the parameters (K1, K2) and D3 is not available when PCAdr or ICAdr are used: PCAdr and ICAdr only permit reducing the spectral
Fig. 7. Dimensionality reduction outcome for classification, 20 spectral features extracted: a) classification result from the initial data, OA = 78.98; b) from ICAdr-(20), OA = 95.93; c) from LRTA-ICAdr-(60, 60, 20), OA = 97.99.

Table 2. Overall (OA) and individual test accuracies in percentage obtained after applying the ICAdr-(D3) and the LRTA-ICAdr-(K1, K2, D3)
Class      Initial   -------------- D3 = 10 bands --------------    -------------- D3 = 20 bands --------------
           Image     ICAdr   LRTA-ICAdr  LRTA-ICAdr  LRTA-ICAdr     ICAdr   LRTA-ICAdr  LRTA-ICAdr  LRTA-ICAdr
                             (310,220)   (60,60)     (40,40)                (310,220)   (60,60)     (40,40)
field      88.85     96.28   98.50       98.33       98.64          98.32   98.66       99.56       98.33
forest     4.42      91.46   89.27       98.88       99.69          97.33   95.29       99.69       99.89
road       85.34     99.10   99.16       99.41       99.41          98.54   99.19       99.44       99.35
shadow     80.14     70.29   56.87       76.57       86.36          75.89   69.76       83.16       97.22
target 1   64.93     90.17   88.05       97.50       95.38          90.56   88.44       97.30       99.81
target 2   80.70     80.35   76.14       97.19       98.25          77.50   76.14       95.09       100.00
target 3   31.55     66.84   77.54       80.21       90.91          71.66   69.52       86.63       99.5
OA         78.98     93.37   93.56       96.39       97.61          95.93   95.42       97.99       98
6
Conclusion
Two multi-way data analysis tool referred to as LRT Adr -(K1 , K2 , D3 ) and LRT A-ICAdr -(K1 , K2 , D3 ) have been proposed. Those multilinear based methods take into account the spatial and spectral information to select optimal spectral features. Thanks to the ALS algorithm, the spectral components are extracted jointly with spatial decorrelation. LRT Adr -(K1 , K2 , D3 ) and LRT AICAdr -(K1 , K2 , D3 ) reveal to be quite interesting for classification efficiency of high-dimensional hyperspectral data. Indeed, the classification result depends not only on the number of extracted spectral features but also on the dimension of spatial subspaces.
Improvement of Classification Using a Joint Spectral DR
143
References 1. Landgrebe, D.: Hyperspectral image data analysis as a high dimensional signal processing problem. Special issue of the IEEE Signal Process. Mag. 19, 17–28 (2002) 2. Fukunaga, K.: Introduction to statistical pattern recognition, 2nd edn. Academic Press Professional, Inc. San Diego, CA (1990) 3. Wang, J., Chang, C.: Independent component analysis - based dimensionality reduction with applications in hyperspectral image analysis. IEEE Trans. on Geosc. and Remote Sens. 44, 1586–1588 (2006) 4. Lennon, D., Mercier, G., Mouchot, M., Hubert-Moy, L.: Independant component analysis as a tool for the dimension reduction and the representation of hyperspectral images. Spie Remote Sens. 4541, 2893–2895 (2001) 5. Plaza, A., Martinez, P., Plaza, J., Perez, R.: Dimensionality reduction and classification of hyperspectral image data using sequences of extended morphological transformations. IEEE Trans. on Geosc. and Remote Sensing 43, 466–479 (2005) 6. Shashua, A., Levin, A.: Linear images coding for regression and classification using the tensor-rank principle. In: Proc. of IEEE CVPR’01, vol. 1, pp. 42–49. IEEE, Los Alamitos (2001) 7. De Lathauwer, L., De Moor, B., Vandewalle, J.: A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications 21, 1253–1278 (2000) 8. De Lathauwer, L., De Moor, B., Vandewalle, J.: On the best rank-(r1 , . . . , rN ) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications 21, 1324–1342 (2000) 9. Kroonenberg, P.: Three-mode principal component analysis. DSWO press, Leiden (1983) 10. Muti, D., Bourennane, S.: Survey on tensor signal algebraic filtering. Signal Proc. Journal 87, 237–249 (2007) 11. Vasilescu, M., Terzopoulos, D.: Multilinear image analysis for facial recognition. In: IEEE Int. Conf. on Pattern Recognition (ICPR’02), Quebec city, Canada, vol. 2, IEEE, Los Alamitos (2002) 12. Tucker, L.: Some mathematical notes on three-mode factor analysis. Psychometrika 31(66), 279–311 13. Muti, D., Bourennane, S.: Fast optimal lower-rank tensor approximation. In: IEEE ISSPIT, Marrakesh, Morocco, pp. 621–625. IEEE Computer Society Press, Los Alamitos (2002) 14. Hyvarunen, A., Oja, E.: A fast fixed-point algorithm for independent component analysis. Neural Computation 9, 1483–1492 (1997) 15. Chang, C., Du, Q.: Estimation of number of spectrally distinct signal sources in hyperspectral imagery. IEEE Trans. on Geosc. and Remote Sens. 42, 608–619 (2004) 16. Renard, N., Bourennane, S., Blanc-Talon, J.: Multiway filtering applied on hyperspectral images. Lecture notes in Journal Computer Science. 4179, 127–137 (2006) 17. Manolakis, D., Shaw, G.: Detection algorithms for hyperspectral imaging applications. IEEE Signal Process. 19, 29–43 (2002)
Learning-Based Object Tracking Using Boosted Features and Appearance-Adaptive Models Bogdan Kwolek Rzesz´ ow University of Technology, W. Pola 2, 35-959 Rzesz´ ow, Poland
[email protected]
Abstract. This paper presents a learning-based algorithm for object tracking. During on-line learning we employ most informative and hard to classify examples, features maximizing individually the mutual information, stable object features within all past observations and features from the initial object template. The object undergoing tracking is discriminated by a boosted classifier built on regression stumps. We seek mode in the confidence map calculated by the strong classifier to sample new features. In a supplementing tracker based upon a particle filter we use a recursively updated mixture appearance model, which depicts stable structures in images seen so far, initial object appearance as well as two-frame variations. The update of slowly varying component is done using only pixels that are classified by the strong classifier as belonging to foreground. The estimates calculated by particle filter allow us to sample supplementary features for learning of the classifier. The performance of the algorithm is demonstrated on freely available test sequences. The resulting algorithm runs in real-time.
1
Introduction
Object tracking is a central theme in computer vision and has received considerable attention in the past two decades. The goal of tracking is to automatically find the same object in adjacent frames in a video sequence. To achieve a better quality of tracking many algorithms consider environment and utilize pixels from background [1][2][3]. To cope with changes of observable appearance many of them incrementally accommodate models to the changes of object or environment [4][5]. In such systems, Gaussian mixture models can be used to represent both foreground [6] and background [7]. Detecting and tracking of objects using their appearances play an important role in many applications such as vision based surveillance and human computer interaction [5][8][6]. A learning algorithm can improve the robustness if the observed appearance of a tracked object undergoes complex changes. A learning takes place in recently proposed algorithms built on classification methods such as support vector machines [1] or AdaBoost [2][3]. Obtaining a collection consisting of both positive and negative examples for on-line learning is complex task. The algorithm [9] starts with a small collection of manually labeled data and then generates supplementary examples by J. Blanc-Talon et al. (Eds.): ACIVS 2007, LNCS 4678, pp. 144–155, 2007. c Springer-Verlag Berlin Heidelberg 2007
Learning-Based Object Tracking
145
applying co-training of two classifiers. To avoid hand labeling the use of motion detection in order to obtain the initial training set was proposed in [10]. In our approach, Gentle AdaBoost built on regression stumps combines several classifiers into an accurate one. An algorithm constructs on the fly a training set consisting of promising object and background features. It consists of representative object features from the initial template, the most stable object features seen so far, uniformly subsampled background features without repetition and features maximizing individually the mutual information. Such family of features can be poorly informative and therefore the set also consists of hard to classify examples that provide most new information during object tracking. An on-line method using boosted features and adaptive appearance models is key contribution of this paper to learning based object tracking. This work’s novelty consists in managing several kinds of features, namely describing stable object structures, characterizing two-frame variations and characteristic samples from the initial template to support a data-driven learning of weak classifiers within computationally feasible procedure based on Gentle AdaBoost. We also demonstrate how adaptive appearance models can be integrated with boosted features to improve the performance of tracking. The resulting algorithm considers the temporal coherence between images of object undergoing tracking. The rest of the paper is organized as follows. In the next Section below we refer to learning in object tracking. In Section 3 we discus how regression stumps are utilized in Gentle AdaBoost. The components and details of learning based object tracking using boosted features are discussed in Section 4. The usage of adaptive appearance models in a particle filter is explained in Section 5. We demonstrate also how adaptive appearance models can be integrated with boosted features to improve the performance of tracking. We report and discuss experimental results in Section 6. We draw conclusions in the last Section.
2
Learning in Object Tracking
When learned off-line classifiers are employed the tracking can be realized trough detection of the target. Okuma et al. [11] propose an approach that uses a boosted detector operating on color distributions to construct a proposal distribution for the particle filter. Considering tracking as binary classification, Avidan [1] proposes a support vector based tracker built on the polynomial kernel. In such tracker with learning capabilities the score of support vector machine is maximized for every frame. A system built on the relevance vector machine which employs temporal fusion is described in work of Williams et al. [12]. In work [2] AdaBoost is used in algorithm termed as ensemble tracking to learn the classifier. The appearance model is updated by adding recent features. An approach presented in [13] employs image pairs and temporal dependencies into a learned similarity function instead of learning a classifier to differentiate the object from the background. Some work has been done in the past to enable automatic labeling of training data. Robust automatic labeling is a highly desirable property in any learning
146
B. Kwolek
based tracking system. Levin et al. [9] propose the so called co-training approach which consists in starting with a small training set and increasing it by cotraining of two classifiers, operating on different features. Nair and Clark [10] use the motion detection for constructing the initial training set and then the Winnow as a final classifier. Ensemble methods such as boosting and bagging have demonstrated significant advantages in off-line settings. However little work has been done in exploring these methods in on-line settings. In [14], Oza and Russel propose on-line version of boosting which simulates the bootstrap process through updating each base model using multiple copies of each new example. The algorithm that is proposed in work [2] maintains a list of classifiers that are trained over time. During tracking it removes old classifiers, trains new classifiers using a confidence map generated by the strong classifier and then adds them to the ensemble. However, through removing the oldest classifiers this algorithm omits important information contained in the initial object template [15] as well it is not able to detect features being stable during tracking. The importance of such stable features during tracking has been highlighted by several authors, among others by [6]. In an algorithm described in [3] the selectors are updated when a new training sample is available. This operation needs considerable computations since the strong classifier contains 50 selectors and each can choose from 250 selectors. This in turn can even lead to slower boosting algorithm in comparison with an off-line algorithm applied to learn on-line. The average number of calculations per feature in this algorithm can be far larger than in off-line AdaBoost.
3
Boosting
Boosting originates from a machine learning model known as Probably Approximately Correct (PAC). Boosting algorithms combine simple decision rules into more complex ones. They aim at finding an accurate classifier consisting of many base classifiers, which are only moderately accurate. The boosting algorithm executes the base learning algorithm multiple times to achieve the desired classification performance. During iterations the weights are updated dynamically according to the errors in previous round of learning. The base learning algorithm takes into account a weight coupled with each training instance and attempts to find a learned hypothesis that minimizes the weighted classification error. The learning algorithm generates classification rules that are combined by the boosting algorithm into the final classification rule. In the first step a boosting algorithm constructs an initial distribution of weights over the training set. The weights are greater than zero, sum to one and constitute a distribution over the training set. Using the weighted training set the algorithm searches for a classification rule consisting in a selecting a base classifier that gives the least weighted error. The weights of the data that are misclassified by the selected base classifier are increased. This leads to selection of classifier that performs better on examples misclassified previously. Each weak classifier predicts the label of the data. In consequence, AdaBoost [16], which is the adaptive version of boosting
Learning-Based Object Tracking
147
minimizes the following exponential loss function: J(F ) = E(e−yF (x) ),
(1)
where E denotes the expectation and the strong classifier F (x) is a linear combination of T weak classifiers fi (x): F (x) =
T
αi fi (x),
(2)
i=1
with parameters αi to balance the evidence from each feature. The set of decision rules {fi }Ti=1 and combining coefficients {αi }Ti=1 are learned. 3.1
Gentle AdaBoost
We employ in our tracking algorithm a version of boosting called Gentle AdaBoost [17], because it requires fewer iterations to achieve similar classification performance in comparison with other methods. Given a set of training instances X and a corresponding weight distribution D the boosting algorithm calculates a weak hypothesis f : X → R, where the sign of f determines the predicted label y of the instance x ∈ X . The magnitude |f (x)| expresses the confidence of the prediction. Suppose we have a current ensemble hypothesis F (x) = Tt=1 ft (x) and seek better one F + f by minimizing the following criterion: J(F + f ) = E[e−y[F (x)+f (x)]],
(3)
where E denotes the expectation. Gentle AdaBoost minimizes this equation by employing adaptive Newton steps [17], which corresponds to minimizing at each step a weighted squared error. At each step m the current ensemble hypothesis F is updated as follows F (x) ← F (x) + fm , where fm is selected to minimize a second order Taylor approximation of the cost function. Replacing the weighted conditional expectation E[y |x] in (3) with an empirical expectation over the training data leads to minimizing the weighted squared error: J=
L
wi (yi − fm (xi ))2 ,
(4)
i=1
where wi = e−yi F (xi ) and the summation is over the training exemplars. 3.2
Regression Stumps Based Weak Learner
As weak learners we employ regression stumps of the following form: fm (x) = aδ(x(k) > θ) + b
(5)
where x(k) denotes the k-th coordinate of K dimensional feature vector x, δ is the Kronecker delta function, θ is a threshold, and a, b are regression parameters.
148
B. Kwolek
Such binary regression stumps were employed in [18][19]. To minimize function (4) we should determine in each iteration m four parameters of the regression stump (5), namely a, b, θ and k. First, we calculate parameters a and b with (k) (k) respect to each possible threshold θi = xi , i.e. for i = 1, 2, ..., L and k = 1, 2, ..., K: L (k) bi
=
L
(k) (k) ≤ xi ) j=1 wj yj δ(xj L (k) (k) ≤ xi ) j=1 wj δ(xj
(k) ai
(k)
j=1
= L
wj yj δ(xj
(k)
j=1 wj δ(xj
(k)
> xi ) (k)
> xi )
(k)
− bi . (6)
Then, we determine error according to the following formula: (k)
ei
=
L
(k)
(k)
wj (yj − ai δ(xj
(k)
(k)
> xi ) + bi )2 .
(7)
j=1 (k)
Next, for each dimension k we seek for thresholds θ(k) = xˇi(k) , which minimize the error function given by (7). This can be expressed in the following manner: ˇi(k) = arg
(k)
max {ei }.
i=1,2,...,L
(8)
In the final step of selecting the best regression stump we determine the coordinate kˇ for which the error function (7) takes minimal value: kˇ = arg
max
k=1,2,...,K
(k)
{eˇi(k) }.
(9)
To speed up the selecting θ the computations were conducted using K sorted vectors x. In order to decrease the number of summations during fitting the regression stumps we utilized the cumulative sums of wj and wj yj .
4
Learning-Based Object Tracking Using Boosted Features
The most informative and hard to classify examples are in vicinity of the decision boundary between background and target. In our approach, an on-line AdaBoost focuses on such hard examples that provide more new information than easy ones. Such examples cause the base learner to concentrate on unseen examples. The updated on-line training set consists of also most stable object features seen so far, uniformly subsampled background features without repetition and features maximizing individually the mutual information. In this context, the major difference of our work from relevant research is that weak classifiers are not trained from the same data sets, which are acquired within rectangles covering the object and the surrounding background, but only a small portion of the newly available training sets. It is major difference between our learning based tracking algorithm and algorithms relying on linear adaptation or learning, where the update of the object model is done via all newly extracted pixels.
Learning-Based Object Tracking
149
An on-line learning algorithm does not need all the training data processed so far to calculate a current hypothesis, rather it process data as it become available without the need for storage, through reusing previously learned weak classifier to learn new classifier. In our approach we initially train the classifier on pixels that were labeled in advance and then apply the classifier in each frame to extract the object of interest. An unsupervised learning is done using labeled pixels by the classifier, pixels depicting initial object appearance as well as stable object structures within all past observations. The object and background pixels are extracted using center-surround approach in which an internal rectangle covers the object, while a larger surrounding rectangle represents the background. The weak learner that was described in subsection 3.2 is used in on-line training. Before starting of the tracking the foreground and background pixels are extracted using center-surround approach. The initial object template is constructed on the basis of the internal rectangle covering the object of interest. A number of representative pixels that are sampled from the object of interest are then utilized during tracking. Such pixel collection holds information about initial object appearance and prevents from model drift. A strong classifier is used to label the pixels as either belonging to the object of interest or background. On the basis of the distribution indicated by weights we sample from the current frame a set of foreground pixels that are hardest to classify. Using a histogram holding information about colors of all pixels seen so far in the object rectangle we extract in each frame a set of the most stable pixels and add it to the set representing the current frame. Through such stable pixels the algorithm considers the temporal coherence between images of object undergoing tracking. The background is represented by pixels laying in close to decision boundary as well as collection of uniformly sampled pixels both from the current and previous frame. In order to avoid the weakness of the random sampling we additionally pick features maximizing individually the mutual information to forecast the class. Given Ns samples with the M binary features X1 , ..., XM , and the target classification variable Y , our goal is to select G features Xv(1) , ..., Xv(G) , which accurately characterize Y . The selected features individually maximize the mutual information I(Y ; Xv(l) ) = H(Y ) − H(Y |Xv(l) ), where H() is the entropy. During tracking a simple procedure is responsible for removing the pixels belonging to previous frame and inserting the pixels from the new frame as well as maintaining proportions between the mentioned above ingredients of the training vector at possibly the same level. The length of the list containing training pixels is constant. During boosting iterations the weights that are employed by weak learner are calculated as follows: w ← w exp(−y fm ) (10) The total score produced by AdaBoost is normalized through soft identity function to range between -1 and 1 in the following manner: s = tanh(F (x)) = tanh(
T
m=1
fm (x))
(11)
150
B. Kwolek
Such a normalized score can be used as a measure of prediction confidence [20]. The face location during tracking is computed by CamShift [21] acting on the likelihood images. Since our tracking algorithm should spend small number of CPU cycles, we use similar color cues to those employed in original implementation of CamShift, i.e. RG or HS color components.
5
Adaptive Models for Particle Filtering
Low-order parametric models of the image motion of pixels laying within a template can be utilized to predict the movement in the image plane [22]. This means that by comparing the gray level values of the corresponding pixels within region undergoing tracking, it is possible to obtain the transformation (giving shear, dilation and rotation) and translation of the template in the current image [23]. Therefore, such models allow us to establish temporal correspondences of the target region. They make region-based tracking an effective complement to tracking that is based on classifier distinguishing between foreground and background pixels. In a particle filter the usage of change in transformation and translation Δωt+1 arising from changes in image intensities within the template can lead to reduction of the extent of noise νt+1 in the motion model. It can take the form [6]: ωt+1 = ωˆt + Δωt+1 + νt+1 . 5.1
Adaptive Velocity Model
Let Ix,t denote the brightness value at the location (x1 , x2 ) in an image I that was acquired in time t. Let R be a set of J image locations {x(j) | j = 1, 2, ..., J} (j) defining a template. Yt (R) = {Ix,t | j = 1, 2, ..., J} is a vector of the brightness values at locations x(j) in the template. We assume that the transformations of the template can be modeled by a parametric motion model g(x; ωt ), where (1) (2) (l) x denotes an image location and ωt = {ωt , ωt , ..., ωt } denotes a set of l parameters. The image variations of planar objects that undergo orthographic projection can be described by a six-parameter affine motion models [22]: a d u g(x; ω) = x + 1 = Ax + u, (12) c e u2 where ω = (a, c, d, e, u1 , u2 )T . With these assumptions, the tracking of the object in time t can be achieved by computing ωt+1 such that Yt+1 (g(R; ωt+1 )) = Yˆt (R), where the template Yˆt (R) is in pose determined by the estimated state. (n) (n) Given a set S = {ωt , πt ) | n = 1, ..., N } of weighted particles, which approximate the posterior distribution p(ωt | Y1:t ), the maximum aposteriori estimate (MAP) of the state is calculated according to the following formula: (n)
ωˆt = arg max p (ωt | Y1:t ) ≈ arg max πt ωt
ωt
(13)
The motion parameters in time t + 1 take values according to: ωt+1 = ω ˆ t + At+1 [Yˆt (R) − Yt+1 (g(R; ω ˆ t ))].
(14)
Learning-Based Object Tracking
151
This equation can be expressed as follows: Δωt+1 = At+1 Δyt+1 . Given N measurements we can estimate matrix At+1 from matrices consisting of adjoined vectors Δωt+1 and Δyt+1 [23]: (1)
(1)
(N )
(N )
ΔMt = [ˆ ωt − ωt , ..., ω ˆ t − ωt ] (1) (1) (N ) (N ) ΔYt = [Yˆ − Y , ..., Yˆ −Y ]. t
t
t
t
(15) (16)
Using the least squares (LS) method we can find the solution for At+1 [23]: At+1 = (ΔMt ΔYtT )(ΔYt ΔYtT )−1 .
(17)
Singular value decomposition of ΔYt yields: ΔYt = U W V T . Taking q largest diagonal elements of W the solution for At+1 is as follows: At+1 = ΔMt Vq Wq−1 UqT . The value of q depends on the number of diagonal elements of W , which are below a predefined threshold value. In the particle filter [24] we utilize the following motion model: ωt+1 = ωˆt + Δωt+1 + νt+1 ,
(18)
where νt+1 is zero mean Gaussian i.i.d. noise, independent of state and with covariance matrix Q which specifies the extent of noise. When individual measurements carry more or less weight, the individual rows of Δω = AΔy can be multiplied by a diagonal matrix with weighting factors. If the diagonal matrix is the identity matrix we obtain the original solution. In our approach such row weighting is used to emphasize or de-emphasize image patches according to number of background pixels they contain. 5.2
Appearance Modeling Using Adaptive Models
Our intensity-based appearance model consists of three components, namely, the W -component expressing the two-frame variations, the S-component characterizing the stable structure within all previous observations and F component representing a fixed initial template. The model At = {Wt , St , Ft } represents thus the appearances existing in all observations up to time t − 1. It is a mixture of Gaussians [5] with centers {μi,t | i = w, s, f }, their corresponding variances 2 {σi,t | i = w, s, f } and mixing probabilities {mi,t | i = w, s, f }. The update of the current appearance model At to At+1 is done using the Expectation Maximization (EM) algorithm. For a template Yˆ (R, t) corresponding to the estimated state we evaluate the posterior contribution probabilities as follows: (j) (j) (j) mi,t Iˆx,t − μi,t (j) oi,t = exp − (19) 2 2σi,t 2 2πσi,t where i = w, s, f and j = 1, 2, ..., J. If the considered pixel belongs to back(j) ground, the posterior contribution probabilities are calculated using Iˆx,1 : (j) (j) (j) mi,t Iˆx,1 − μi,t (j) oi,t = exp − . (20) 2 2σi,t 2πσ 2 i,t
152
B. Kwolek
This prevents the slowly varying component from updating by background pix (j) els. The posterior contribution probabilities (with i oi,t = 1) are utilized in updating the mixing probabilities in the following manner: (j)
(j)
(j)
mi,t+1 = γoi,t + (1 − γ)mi,t
| i = w, s, f,
(21)
where γ is accommodation factor. Then, the first and the second-moment images are determined as follows: (j)
(j)
(j) (j)
2,t+1
2,t
s,t
M1,t+1 = (1 − γ)M1,t + γos,t Iˆx,t (j) (j) (j) (j) M = (1 − γ)M + γo (Iˆ )2 .
(22)
x,t
In the last step the mixture centers and the variances are calculated as follows: (j) (j) M1,t+1 M2,t+1 (j) (j) (j) μs,t+1 = (j) , σs,t+1 = − (μs,t+1 )2 (j) ms,t+1
(j) (j) μw,t+1 = Iˆx,t , (j) (j) μf,t+1 = μt,1 ,
ms,t+1
(j) σw,t+1 (j) σf,t+1
= =
(j) σw,1 (j) σf,1 .
(23)
When the considered pixel belongs to background, the mixture center in the component expressing two-frame variations is updated according to: (j) (j) μw,t+1 = Iˆx,l ,
(24)
where index l refers to last non-background pixel. In order to initialize the model A1 the initial moment images are set using 2 the following formulas: M1,1 = ms,1 I(R, t0 ) and M2,1 = ms,1 (σs,1 + I(R, t0 )2 ). The observation likelihood is calculated according to the following equation: (j) (j) (j) J
mi,t Ix,t − μi,t p(Yt | ωt ) = exp − (25) 2 2σi,t 2πσ 2 j=1 i=w,s,f
i,t
Underlying AdaBoost-based tracking algorithms do not take into account of temporal information (except [13]) as they rely on learned binary classifiers that discriminate the target and the background. In our algorithm the data-driven binary classifier learns on-line using features from the initial object template, stable object features within all past observations, features maximizing individually the mutual information, most informative and hard to classify examples, and the features that are sampled from the object rectangle estimated by particle filter. In the particle filter we use a recursively updated mixture appearance model, which depicts stable structures in images seen so far, initial object appearance as well as two-frame variations. The update of slowly varying component is done using only pixels that are classified by the strong classifier as belonging to foreground. In pairwise comparison of object images we employ only non-background pixels and in case of background we use the last foreground pixels. Our probabilistic models differ from those proposed in [6] in that we adapt models using information about background. The outcome of the strong classifier is used to
Learning-Based Object Tracking
153
construct a Gaussian proposal distribution, which guides particles towards most likely locations of the object of interest.
6
Experiments
The tests were done on a sequence1 of images 288 high and 384 pixels wide. In this sequence a tracked pedestrian crosses zones in varying illumination conditions. In tracking experiments with this sequence and a particle filter built only on adaptive appearance models and configured to run with 100 particles, some pixels of the object rectangle are updated by background pixels (for example in frames #1000 and #1200). Despite this undesirable effect the object model can adapt to pedestrian’s side view. However, the update of the model by background pixels leads to considerable jitter of ROI and in consequence the track is lost in frame #1226. In a comparison of the results generated by our on-line learning-based algorithm and an adaptive algorithm, where all pixels laying inside the object rectangle are utilized in an linear adaptation of the model, we observed that our algorithm performs significantly better. In particular, we compared the probability images, which illustrate the potential of algorithms in extraction of the target. The confidence maps generated by the learning-based algorithm picks better the person’s shape over time. In frames that were generated by learningbased algorithm the jitter of rectangular ROI is smaller and it is located near the true location of the target in most frames. Despite similar distribution of background color with the foreground color, the number of background pixels with high confidence in the rectangle surrounding the object is relatively small. The mentioned effect has been achieved using only ten rounds of boosting in on-line learning. Figure 1 shows the behavior of learning-based tracker using boosted features and appearance-adaptive models. It has been initialized and configured in the same manner as the algorithm based on adaptive appearance models. Because the appearance models are updated using only object pixels, the algorithm performs far better than algorithm built on only adaptive appearance models, especially in case of rotations of the pedestrian. The estimates calculated by particle filter were employed to sample additional features for learning of the classifier. Generally speaking, the 2-frame affine tracker can be expected to posses problems with targets that are nor deforming in a roughly affine manner, as well as with small objects. In such a situation the learning based algorithm can support the tracking. The algorithms have different failure modes and complement each other during tracking. Our algorithm is about 2.2 times slower than the algorithm built on adaptive appearance models. It was implemented in C/C++ and runs with 320×240 images at about 10 fps on 2.4 GHz Pentium IV. It can be easily extended to run with other features, for example integral images or orientation histograms. A modification consisting in a replace of the CamShift by a particle filter operating on the confidence maps is also straightforward. 1
Downloaded from site at: http://groups.inf.ed.ac.uk/vision/CAVIAR/
154
B. Kwolek
#700
#1000
#1200
#1263
#1140
#1275
Fig. 1. Pedestrian tracking using learning and adaptive appearance models
7
Conclusions
We have presented an approach for on-line learning during tracking. The major difference of our work from relevant research is that weak classifiers are not trained from the same data but only a portion of newly available pixels. During learning we employ stable object features seen so far, features maximizing individually the mutual information, examples that are in vicinity of the decision boundary between background and target, and uniformly subsampled background features. To avoid drift the on-line training is conducted using pixels of the object template. In a supplementing tracker based on a particle filter we use a recursively updated mixture appearance model, which depicts stable structures in images seen so far, initial object appearance as well as two-frame variations. We accommodate the slowly varying component using only pixels that are classified by the strong classifier as belonging to object. The estimates calculated by particle filter are employed to sample learning features. The two algorithms have different failure modes and complement each other during tracking.
Acknowledgment. This work has been supported by the Polish Ministry of Education and Science (MNSzW) within the projects 3 T11C 057 30 and N206 019 31/2664.
Spatiotemporal Fusion Framework for Multi-camera Face Orientation Analysis Chung-Ching Chang and Hamid Aghajan Wireless Sensor Networks Lab, Stanford University, Stanford, CA 94305 USA
Abstract. In this paper, we propose a collaborative technique for face orientation estimation in smart camera networks. The proposed spatiotemporal feature fusion analysis is based on active collaboration between the cameras in data fusion and decision making using features extracted by each camera. First, a head strip mapping method is proposed based on a Markov model and a Viterbi-like algorithm to estimate the relative angular differences to the face between the cameras. Then, given synchronized face sequences from several camera nodes, the proposed technique determines the orientation and the angular motion of the face using two features, namely the hair-face ratio and the head optical flow. These features yield an estimate of the face orientation and the angular velocity through simple analysis such as the Discrete Fourier Transform (DFT) and Least Squares (LS), respectively. Spatiotemporal feature fusion is implemented via key frame detection in each camera, a forward-backward probabilistic model, and a spatiotemporal validation scheme. The key frames are obtained when a camera node detects a frontal face view and are exchanged between the cameras so that local face orientation estimates can be adjusted to maintain a high confidence level. The forward-backward probabilistic model aims to mitigate error propagation in time. Finally, a spatiotemporal validation scheme is applied for spatial outlier removal and temporal smoothing. A face view is interpolated from the mapped head strips, from which snapshots at the desired view angles can be generated. The proposed technique does not require camera locations to be known a priori, and hence is applicable to vision networks deployed casually without localization.
1 Introduction
The advent of image sensor and embedded processing technologies has enabled novel approaches to the design of security and surveillance networks as well as new application classes such as smart environments. When multiple image sensors view a freely moving (i.e. non-cooperative) person, only a few selective snapshots captured during the observation period may provide an adequate view of the person's face for face recognition and model reconstruction applications. Detection, matching, and recording of those frames would hence be the key to enabling effective facial analysis techniques. In surveillance applications, in addition to the face model reconstruction, capturing the frontal face view of the intruder is often of paramount importance.
Most face recognition algorithms require face images with approximately frontal view to operate effectively. Examples are principal component analysis (PCA) [5], linear discriminant analysis (LDA) [4], and hidden Markov model (HMM) techniques [3]. In order to be robust, the PCA and LDA techniques require a large number of training samples in different face orientation angles. A recent approach is to collect and classify face data in a higher-dimensional space, like the 3D space. However, current robust methods of recognition by stereo vision [7] require large amounts of computation in the 3D reconstruction of the face. The 3D morphable model algorithm [1] [2] greatly reduces the computational complexity of reconstructing a 3D model; however, it requires a frontal view image of the face in the training stage.
Fig. 1. Framework of spatiotemporal feature fusion for face orientation analysis
In a networked camera setting, the desire for a frontal view to pursue an effective face analysis is relaxed due to the distributed nature of the camera views. Instead of acquiring a frontal face image from any single camera, we propose an approach to head view interpolation in a smart camera network by collaboratively collecting and sharing face information spatially. Due to the limited computation power assumed for each camera, in the proposed technique in-node signal processing algorithms are intentionally designed to be lightweight, accepting the fact that the resulting feature estimates in each camera might be erroneous. On the other hand, the camera nodes exchange their soft information with each other, allowing the network to enhance its detection accuracy as well as confidence level, and produce accurate results describing the orientation of each facial view. The proposed collaborative face orientation analysis approach employs selective features and the spatiotemporal relationship between the features in order to offer a low-complexity and robust solution.
2 System Framework Overview
The proposed framework of spatiotemporal feature fusion for face analysis and head view interpolation is shown in Fig. 1. In-node feature extraction in each camera node consists of low-level vision methods to detect features for estimation of face orientation or the angular velocity. These include the hair-face ratio and optical flow, which are obtained through Discrete Fourier Transform (DFT) and Least Squares (LS), respectively. Another feature extracted locally is a set of head strips, which is used to estimate relative angular difference to the face between cameras by a proposed matching technique. A Markov model is designed to exploit the geometric connectivity between strips in two cameras, and a Viterbi-like algorithm is applied to select the most probable displacement between the collection of head strips of the two cameras. The estimated relative angles are useful in several ways. In Section 4, a face view is interpolated via spatial fusion of the extracted and matched head strips. In Section 5, a spatiotemporal feature fusion is implemented via key frame detection, a forward-backward probabilistic model, and spatiotemporal validation. The key frames are obtained when a camera node detects a frontal face view through a hair-face analysis scheme and this event is broadcasted to other camera nodes so that the fusion schemes for face analysis can be adaptively adjusted according to the relative angular estimates, in order to maintain a high confidence level. In this way, spatial collaboration between cameras is pursued through key frame event sharing instead of raw image transfer. The proposed forward-backward probabilistic model aims to mitigate error propagation in time and interpolate orientation estimates between key frames. Finally, the proposed spatiotemporal validation scheme detects spatial outliers and smoothes temporal estimates by minimizing the weighted sum of temporal and spatial distance metrics according to the relative angular estimates.
3 In-Node Feature Extraction
Local data processing algorithms in each camera node consist of low-level vision methods to detect features for estimation of face orientation, including optical flow and hair-face ratio as introduced in the following subsections. These techniques are developed to be of low computational complexity, allowing them to be adopted for in-node processing implementations.
3.1 Optical Flow Estimation
The underlying idea of this analysis is to project the motion of the head into several independent dimensions and estimate the projected vector by least squares estimation. The motion vectors are obtained by finding corresponding strong corners [9] between two consecutive frames with the iterative version of the Lucas-Kanade pyramid method [8][9]. We can decompose the head motion into translation, rotation about the y axis (turn of the head) and rotation about the z axis (tilt of the head). The decomposition model is as follows:
\[
v_i = t + r_1\omega_1\cos(\theta_{i1}) + r_2\omega_2\cos(\theta_{i2})
\;\Longrightarrow\;
\underbrace{\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}}_{v}
=
\underbrace{\begin{bmatrix}
1 & r_1\cos(\theta_{11}) & r_2\cos(\theta_{12}) \\
1 & r_1\cos(\theta_{21}) & r_2\cos(\theta_{22}) \\
\vdots & \vdots & \vdots \\
1 & r_1\cos(\theta_{n1}) & r_2\cos(\theta_{n2})
\end{bmatrix}}_{A}
\underbrace{\begin{bmatrix} t \\ \omega_1 \\ \omega_2 \end{bmatrix}}_{z}
\tag{1}
\]

where $v_i$ is the signed magnitude of the motion vector in the direction orthogonal to the head's vertical axis (a positive sign indicates motion to the right, a negative sign to the left), $t$ is the translation factor, $r_1$ is the transversal radius of the head, $r_2$ is the distance to the bottom of the head, $\omega_1$ is the angular motion about the y axis, $\omega_2$ is the angular motion about the z axis, $r_1\cos(\theta_{i1})$ represents the distance from the point of the motion vector to the longitudinal axis of the head in the 2D image plane, and $r_2\cos(\theta_{i2})$ represents the vertical distance from the point of the motion vector to the bottom of the head in the 2D image plane. Minimizing the mean square error of the motion vectors under this model yields the least squares solution $z_{ls} = (A^T A)^{-1} A^T v$, where the first element of $z_{ls}$ is the translational velocity, the second element is the angular velocity of the head about the y axis, and the third element is the angular velocity of the head about the z axis. Experimental results are shown in Fig. 2, where the slope indicates the angular velocity and the intercept on the y axis indicates the translational velocity.

Fig. 2. Optical flow estimates with high and low confidence. Each example shows image(x,y,t), image(x,y,t+1), image(x,y,t) overlaid with the motion vectors, and the least squares estimate.
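To make the least squares step concrete, the following minimal sketch (our own illustration, not the authors' implementation) solves Eq. (1) with numpy; the inputs v, c1, c2, r1 and r2 are assumed to have been extracted beforehand from the Lucas-Kanade motion vectors and the head geometry.

```python
# A minimal sketch (not the authors' implementation) of the least-squares
# decomposition of Eq. (1).  It assumes the per-corner quantities have already
# been extracted from the Lucas-Kanade motion vectors:
#   v       signed motion components orthogonal to the head's vertical axis
#   c1, c2  cos(theta_i1) and cos(theta_i2) for each tracked point
#   r1, r2  transversal head radius and distance to the bottom of the head
import numpy as np

def decompose_head_motion(v, c1, c2, r1, r2):
    v = np.asarray(v, dtype=float)
    A = np.column_stack([np.ones_like(v),
                         r1 * np.asarray(c1, dtype=float),
                         r2 * np.asarray(c2, dtype=float)])
    # z_ls = (A^T A)^{-1} A^T v, computed with a numerically stable solver
    z_ls, *_ = np.linalg.lstsq(A, v, rcond=None)
    t, w1, w2 = z_ls  # translation, angular velocity about y, angular velocity about z
    return t, w1, w2
```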
3.2 Hair-Face Ratio Estimation
To estimate the hair-face ratio, we first classify the head region into face and hair regions by color [6]. Based on the hair-face classification, face orientation is analyzed in the following procedure as shown in Fig. 3(a)(b). Consider the head as an ellipsoid ball in 3D space, and cut the surface of the ball into N equally spaced strips along its longest axis direction. In each camera frame, we can only see m of the N strips of the ellipsoid. Calculating the ratio of the hair region to the face region in each of the m strips and padding zeros to the strips that cannot be seen in the current frame, we form a ratio sequence of length N . We estimate the face orientation by calculating the phase of the fundamental frequency of the ratio sequence using DFT. This uses the assumption that the hair-face ratio is symmetric in the frontal face and is approximately a sinusoidal curve along
the surface of the ellipsoid. The assumption can be refined after a certain period of observing the subject. Along with the face orientation estimates, we may estimate a more accurate hair-face ratio model by least squares estimation as in Fig. 3(c). Estimation based on this refined hair-face ratio model is the subject of on-going research. Fitting the hair-face ratio curve to a sinusoidal curve illustrates a simple way to find the face orientation using the symmetry of the face. Therefore, as long as the hair-face ratio is symmetric about the frontal face, the estimation is reliable when the view is closer to the frontal face, even if the person has long hair or the curve is not truly sinusoidal. Although we assume the detected head is not bald, the model-based estimation with confidence measurement can also act as a detection algorithm. If the detected head is considered to be bald, a secondary algorithm should be activated to handle such a condition. Our future work includes making estimation based on multiple attributes to make the system robust to this and other conditions.

Fig. 3. (a)(b) Procedure for the hair-face ratio estimation: the head ellipsoid (left) is transformed into a sequence of hair-face categorized image slices (middle) and into a zero-padded ratio sequence (right). (c) Reconstructed hair-face ratio model
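The following sketch (our own illustration, under the sinusoidal assumption stated above) shows how a face orientation estimate can be obtained from the zero-padded hair-face ratio sequence via the phase of the fundamental DFT component; the mapping from phase to a signed angle depends on the strip-indexing convention and is treated here as a calibration choice.

```python
# A sketch (our illustration, under the stated sinusoidal assumption) of the
# hair-face ratio analysis: the ratio sequence over the N strips is zero-padded
# for the unseen strips and the phase of its fundamental DFT component is used
# as the face orientation estimate.
import numpy as np

def face_orientation_from_ratios(visible_ratios, N):
    """visible_ratios: hair/face ratios of the m visible strips; N: total strips."""
    seq = np.zeros(N)
    seq[:len(visible_ratios)] = visible_ratios      # zero-padding for unseen strips
    fundamental = np.fft.fft(seq)[1]                # first (fundamental) DFT bin
    return np.degrees(np.angle(fundamental))        # orientation estimate in degrees
```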
4 Head Strip Matching and Head View Interpolation
Geometrically, if all cameras are deployed at the same horizon, the relative angular difference to the head between two cameras would cause a shift in their observed strips. Therefore, matching the head strips of the two cameras and finding the displacement of the strips give us the (quantized) relative angular difference to the object between the two cameras at a given time. Based on the displacement of the strips and the confidence of the estimated displacement, which is obtained by cross validation of the displacement between cameras, we can reconstruct the face model as described in the following subsections.
4.1 Head Strip Matching
The head strip mapping is based on a Markov model and a Viterbi-like algorithm as illustrated in Fig. 4.
Fig. 4. Illustration of the Markov model and Viterbi-like algorithm. (a) The Viterbi-like model generated by the head strip set in camera C, (b) The trellis of the Viterbi-like algorithm. $s_{m-1}$ in the rightmost column is the state with the minimum cost, and the corresponding trellis is marked with a thick (red) line, (c) Experimental data and the corresponding head view interpolation.
Considering two sets of head strips $Y$ and $Y'$, each sampled with $n$ sample points, corresponding to the head images captured in two cameras $C$ and $C'$, let $Y = [y_1\, y_2 \ldots y_m]$ and $Y' = [y'_1\, y'_2 \ldots y'_m]$, where $y_i, y'_i \in \mathbb{R}^n$ correspond to the $n$ sample points in a single strip. Our problem now is to map the strips in $Y'$ to the strips in $Y$ with the constraint that $y_i, y'_i$ are in some spatial order. We now introduce the concept of the states $S$. Let $S = [s_1\, s_2 \ldots s_N]$ denote all states for the strips of a head ($360^\circ$), for example, $s_1$ representing the strip that includes the nose trail. For each of the captured head images, the corresponding head strips $Y$ should map to a consecutive subset of $S$, denoted by $S_Y$, which is not known a priori and is approximately of length $m$. In other words, $Y$ is a representation of the states $S_Y$. As we scan vertical sampling lines through the head horizontally, we are actually going from state to state, for example, from $s_i$ to $s_{i+1}$. Ideally we will get $y_j$ and $y_{j+1}$ to match each other for a certain $j$. However, due to the fact that the head is not a perfect ellipsoid, we may as well get $y_j$ and $y_{j+k}$ to match each other for a certain $j$ and a small $k \geq 0$, the latter constraint showing that the two states should be near and cannot occur in a reverse order as we scan through the head strips. In other words, the probability $P_{s_i s_{i+1}}$ of going from the current state to the next state as we scan through the head is not necessarily 1. The probability of the transition between states forms a Markov model, as shown in Fig. 4(a). In our experiment, the choice for the probability is

\[
P_{s_i s_{i+k}} = \exp\!\left(-\frac{(k-1)^2}{2\sigma^2}\right)\bigl(u(k) - u(k-4)\bigr)
\tag{2}
\]
where $u$ is the unit step function and $\sigma$ is the so-called bandwidth parameter. As we match the set of strips $Y'$ to $Y$, we first assume that the representation $Y$ is ideal, corresponding to the states $S_Y$ one-by-one. Under this assumption, we transform the Viterbi algorithm, a supervised learning algorithm, into an
unsupervised way of learning, which we call a Viterbi-like algorithm. For each given input $y'_i$, we can sum the cost in each of the previous states and the cost-to-go ($w$) in each branch, and choose the branch with the minimum cost as the path from the previous states to the current states. The cost of the branch is written as $w_{s_i s_{i+k}} = -\ln\bigl(P_{s_i s_{i+k}}\,\gamma(y'_{i+k}; s_i s_{i+k})\bigr)$, where $\gamma(y'_{i+k}; s_i s_{i+k})$ is calculated as the inverse of the mean square error between strips $y_{i+k}$ and $y'_{i+k}$. The initial states are assumed to be equally likely, meaning that the matching can start from any of the states in $S_Y$. The first and the last states in $S_Y$ may be regarded physically as the not-in-Y (not in the current face) states. Therefore, some exceptions to the probability model are made in the first and the last states, where $P_{s_1 s_1}$ is given a higher probability and $P_{s_i s_m}$ is 1 when $i = m$, and zero otherwise. According to the Viterbi algorithm, the path with the smallest cost is chosen. For example, as in Fig. 4, assume $s_{m-1}$ in the rightmost column is the state with the minimum cost, and the corresponding previous paths are marked with thick (red) lines, showing that the path is $[s_1\, s_1\, s_2 \ldots s_{m-1}]$. In Fig. 4, an example of head strip matching is shown; the trellis of the Viterbi-like algorithm is shown in the right figure with blue dots, where red dots represent the minimum branch cost ($w$) in each Viterbi-like step. Notice that the trellis, excluding the entries in states $s_1$ and $s_m$, intersects the x-axis around 10, which means the displacement between the two head images is 10 strips, or 45 degrees in this example.
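A compact reconstruction of this matching procedure is sketched below (our own code under the stated assumptions, not the authors' implementation). The transition probabilities follow Eq. (2), the emission term is taken as the inverse MSE between strips, and the special handling of the first and last "not-in-view" states is omitted for brevity; the strip displacement between the two cameras can then be read off by comparing the returned state indices with the input indices.

```python
# A sketch (our reconstruction, not the authors' code) of the Viterbi-like
# matching of the input strips Y' against the state strips of Y.
import numpy as np

def match_strips(Y, Yp, sigma=1.0):
    """Y: (N, n) state strips; Yp: (m, n) input strips. Returns a state path."""
    N, m = Y.shape[0], Yp.shape[0]

    def trans(k):                       # P_{s_i s_{i+k}} of Eq. (2), k in {0,...,3}
        return np.exp(-(k - 1) ** 2 / (2.0 * sigma ** 2))

    def emit(state, obs):               # inverse MSE between a state and an input strip
        return 1.0 / (np.mean((Y[state] - Yp[obs]) ** 2) + 1e-9)

    cost = np.array([-np.log(emit(s, 0)) for s in range(N)])   # equally likely start
    back = np.zeros((m, N), dtype=int)
    for t in range(1, m):
        new_cost = np.full(N, np.inf)
        for s in range(N):
            for k in range(4):          # allowed forward jumps of 0..3 states
                p = s - k
                if p < 0:
                    continue
                c = cost[p] - np.log(trans(k) * emit(s, t))
                if c < new_cost[s]:
                    new_cost[s], back[t, s] = c, p
        cost = new_cost
    path = [int(np.argmin(cost))]       # backtrack from the minimum-cost state
    for t in range(m - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```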
4.2 Spatial Head View Interpolation
In the previous subsection, the head strip matching is conducted in a peer-to-peer manner. In many cases, the estimated angle differences to the face between different pairs of the cameras may be inconsistent. Let $C_A$, $C_B$, and $C_C$ denote three cameras, and $\angle_E C_A C_C$ denote the estimated angle difference to the face between the two cameras $C_A$ and $C_C$ obtained by mapping $C_A$ to $C_C$, in units of strip displacements. In many cases, it is possible that $\angle_E C_A C_C \neq \angle_E C_A C_B + \angle_E C_B C_C$. Define the confidence measure $K_{AB} = \exp(-|\angle_E C_A C_B + \angle_E C_B C_A|)$. Ideally, $\angle_E C_A C_B = -\angle_E C_B C_A$, and $K_{AB} = 1$, indicating that the estimate is of high confidence. As the two estimates become more and more inconsistent, $K_{AB}$ gets smaller. Based on the confidence level generated by cross-validation between each pair of the cameras, a weighted quadratic refinement is applied. The refinement algorithm is defined as follows:

\[
\begin{aligned}
\text{minimize} \quad & K_{AB}\Delta_{AB}^2 + K_{AC}\Delta_{AC}^2 + K_{BC}\Delta_{BC}^2 \\
\text{subject to} \quad & K_{AB}\Delta_{AB} = K_{AC}\Delta_{AC} = K_{BC}\Delta_{BC} \\
\text{where} \quad & \Delta_{ij} = \angle_R C_i C_j - \angle_E C_i C_j, \qquad i, j \in \{A, B, C\}
\end{aligned}
\]

\[
\Longrightarrow\quad
\Delta_{AB} = \frac{1}{K_{AB}}\cdot
\frac{\angle_E C_A C_B + \angle_E C_B C_C - \angle_E C_A C_C}
{\frac{1}{K_{AB}} + \frac{1}{K_{AC}} + \frac{1}{K_{BC}}}
\]

and $\angle_R C_A C_B$ is the refined estimate of $\angle_E C_A C_B$.
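The refinement can be sketched as follows (our own illustration; the sign conventions simply follow the definitions above and may need adjusting to a particular implementation). The six pairwise estimates, in units of strips, are the only inputs.

```python
# A sketch (our illustration) of the confidence-weighted refinement for three
# cameras A, B, C, following the closed form derived above.
import numpy as np

def refine_pairwise_angles(e_ab, e_ba, e_ac, e_ca, e_bc, e_cb):
    K_ab = np.exp(-abs(e_ab + e_ba))          # cross-validation confidences K_ij
    K_ac = np.exp(-abs(e_ac + e_ca))
    K_bc = np.exp(-abs(e_bc + e_cb))
    denom = 1.0 / K_ab + 1.0 / K_ac + 1.0 / K_bc
    loop_error = e_ab + e_bc - e_ac           # inconsistency of the loop A->B->C vs A->C
    d_ab = loop_error / (K_ab * denom)        # closed-form Delta_AB
    d_ac = (K_ab / K_ac) * d_ab               # from K_AB*D_AB = K_AC*D_AC = K_BC*D_BC
    d_bc = (K_ab / K_bc) * d_ab
    # refined estimates: angle_R = angle_E + Delta, per the definition of Delta_ij
    return e_ab + d_ab, e_ac + d_ac, e_bc + d_bc
```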
Having determined the angular difference to the face between cameras, head view interpolation can be done by shifting and concatenating the face strips. The transition strip from one face image to the other in the overlapping region is determined by choosing the state in the Viterbi-like algorithm with the minimum cost $w_{s_i s_{i+k}}$, representing the MSE, and hence yielding the smoothest transition.
4.3 Temporal Head View Interpolation
Temporal head view interpolation can be implemented according to the above idea, or directly by shifting and concatenating two consecutive frame face images provided the face orientation and angular motion are known. However, in the smart camera networks, due to the limited computation power, those estimates may not be accurate enough without spatiotemporal data exchange. Therefore, the temporal head view interpolation is usually executed after collaborative estimation by data exchange in the networks. The transition in the overlapping region between the two images can be determined by choosing strip pairs with the least MSE among the overlapping strip pairs as stated in the previous subsection for the spatial case. On the other hand, after data collaboration, we may acquire the reconstructed hair-face ratio model as in Fig. 3 and measure the confidence for each head strip in one face image by calculating the squared difference between the hair-face ratio of the strip and of the reconstructed model for each strip. Choosing the strip with the least squares error among the pairs in the overlapping region usually yields a smooth transition between the face images since the hair-face ratio curve itself is usually smooth, too.
5 Spatiotemporal Data Exchange Mechanisms
Collaboration between cameras is achieved by data exchange. A frame with features of very high confidence is called a key frame, and the features from a key frame are broadcast in the network. They are used to validate data in the other cameras. To determine the estimates for the frames in between key frames, we apply a probabilistic model forward and backward. Finally, a spatiotemporal validation is applied to cross-validate and determine the estimates collaboratively.
5.1 Key Frame Detection
Key frames are the frames that include features or estimates with high confidence. The hair-face ratio based on the phase of the fundamental frequency is sensitive to the face angle, especially when the view is approximately symmetric to the face center. In other words, the frontal views can be detected accurately. By linear interpolation between samples, the time of a frontal view, defined as a key frame event, can be determined. In this paper, hair-face ratio estimation, utilizing the symmetric property, gives good estimates when it captures a frontal
view. It is less likely to produce false positives since the hair-face ratio is symmetric only at the frontal and back views, where it is convex and concave, respectively. Therefore, we simply take the hair-face ratio estimates with a small face orientation angle as key frames. Once a key frame is detected, the time of its detection is notified to the other cameras. Since the key frame is associated with relatively high confidence, the other cameras assume the received key frame orientation estimate to be true and calculate their face orientation by adding to it the relative angular difference to the object between themselves and the camera that broadcast the key frame.

Fig. 5. Key frames and probability density function (PDF) propagation in FBPM. Right figure: the leftmost and the rightmost columns show the shifted delta functions corresponding to key frame detection. According to the optical flow estimation, the red curve is propagated forward and the blue curve is propagated backward. Left figure: the results obtained by two cameras between key frames received from the third camera.
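A minimal sketch of this key-frame detection (our own illustration; the threshold is a hypothetical tuning parameter) looks for near-frontal zero crossings of the per-frame orientation estimates and interpolates the frontal-view time linearly:

```python
# A sketch (our illustration) of key-frame detection: a frontal view corresponds
# to a zero crossing of the estimated face orientation, and its time is refined
# by linear interpolation between the two neighbouring samples.
import numpy as np

def frontal_view_times(times, orientations, threshold_deg=10.0):
    t = np.asarray(times, dtype=float)
    a = np.asarray(orientations, dtype=float)
    keys = []
    for i in range(len(a) - 1):
        if abs(a[i]) < threshold_deg and a[i] * a[i + 1] <= 0:
            # linear interpolation of the zero crossing of the orientation
            keys.append(t[i] - a[i] * (t[i + 1] - t[i]) / (a[i + 1] - a[i] + 1e-12))
    return keys
```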
5.2 Forward-Backward Probabilistic Model
We apply the forward-backward probabilistic model (FBPM) to find the probability density function (pdf) of the head orientation for the frames between key frames. Since the optical flow estimate is obtained by a linear fit to the set of motion vectors, the estimate is approximately Gaussian distributed by the central limit theorem, regardless of the actual distribution of the motion vectors. Since key frames are the frames with estimates of relatively high confidence, the pdf of the face orientation at the time of a key frame is nearly a shifted delta function located at the predicted angle. Let $x(t)$ be the orientation estimate at time $t$ and $f(x(t))$ be the corresponding pdf. Since $x(t+1) = x(t) + v_{forward}$ where $P(v_{forward}|x(t)) \sim N(\mu, \sigma)$, we have $f(x(t+1)) = f(x(t)) * N(\mu, \sigma)$, where $\mu$ and $\sigma$ are the mean and variance of the optical flow estimate. In backward propagation, instead of propagating forward with $v_{forward}$, we calculate $P(v_{backward}|x(t+1))$ to propagate backward. We may regard the
sequence in the reversed-time order; then all motion vectors between frames would be in the opposite direction and $P(v_{backward}|x(t+1)) \sim N(-\mu, \sigma)$. The orientation estimate for a frame between two key frames is determined as the maximizer of the sum of the forward and backward pdfs (see Fig. 5). In most cases, when a new key frame is detected, the probabilistic model is applied bilaterally between that frame and the previous key frame. If there is only one key frame detected, for example in the case of the first key frame, the probabilistic model is applied unilaterally.
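The propagation can be sketched as follows (our own illustration). The pdf is represented on a discretized orientation grid assumed symmetric about zero, so that convolving with a Gaussian kernel centred at mu acts as a shift by mu; mus and sigmas are the per-step optical flow estimates between the two key frames.

```python
# A sketch (our illustration) of the forward-backward pdf propagation between
# two key frames over a discretized orientation grid.
import numpy as np

def fbpm(grid, key_start_angle, key_end_angle, mus, sigmas):
    grid = np.asarray(grid, dtype=float)

    def gauss(mu, sigma):
        g = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
        return g / g.sum()

    def delta(angle):                       # near-delta pdf at a key-frame angle
        d = np.zeros_like(grid)
        d[np.argmin(np.abs(grid - angle))] = 1.0
        return d

    fwd = [delta(key_start_angle)]
    for mu, s in zip(mus, sigmas):          # forward propagation
        fwd.append(np.convolve(fwd[-1], gauss(mu, s), mode="same"))
    bwd = [delta(key_end_angle)]
    for mu, s in zip(reversed(mus), reversed(sigmas)):   # backward propagation
        bwd.append(np.convolve(bwd[-1], gauss(-mu, s), mode="same"))
    bwd = bwd[::-1]
    # orientation estimate per frame: peak of the summed forward/backward pdfs
    return [float(grid[np.argmax(f + b)]) for f, b in zip(fwd, bwd)]
```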
5.3 Spatiotemporal Validation
Correlations in the temporal domain can be exploited since face orientation and angular velocity, one being the derivative of the other, are continuous in consecutive frames provided that the time lapse between frames is short. Correlations in the spatial domain can be exploited since for any time instance the captured image in each camera should reflect the same structure and motion in 3D. The spatiotemporal validation formulates this idea as an optimization problem by penalizing the inconsistencies. Let $z^{(1)}$, $z^{(2)}$, and $z^{(3)}$ be the estimates in each camera after FBPM, and $z$ be the decision after validation. We then solve

\[
\min_{z}\; \mu\,\phi_{quad}(z) + \phi_{tv}(z, z^{(j)})
\tag{3}
\]

where $\phi_{quad}(z) = \sum_{i=1}^{n-1} (z_{i+1} - z_i)^2$ is the quadratic temporal smoothing function, and $\phi_{tv}(z, z^{(j)}) = \sum_{i=1}^{n-1} \bigl\|(z_i - z_i^{(1)},\, z_i - z_i^{(2)},\, z_i - z_i^{(3)})\bigr\|_1$ is the L-1 norm that penalizes the inconsistency between cameras, where the subscript $i$ denotes time. The parameter $\mu \geq 0$ gives the relative weight between $\phi_{quad}$ and $\phi_{tv}$. If the time lapse between camera frames is small enough such that the face orientation between frames is continuous, applying quadratic smoothing can efficiently average out the Gaussian measurement noise in each frame. Using the L-1 norm for errors between cameras avoids the effect of estimation outliers.
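One simple way to solve (3) is with a general-purpose, derivative-free optimizer, since the L-1 term is non-smooth; the sketch below (our own illustration, not the authors' solver) starts from the average of the three camera estimates.

```python
# A sketch (our illustration) of solving Eq. (3); mu trades temporal smoothness
# against consistency with the three per-camera estimates.
import numpy as np
from scipy.optimize import minimize

def spatiotemporal_validation(z1, z2, z3, mu=0.5):
    Z = np.vstack([z1, z2, z3]).astype(float)       # (3, n) per-camera estimates
    def objective(z):
        phi_quad = np.sum(np.diff(z) ** 2)          # quadratic temporal smoothing
        phi_tv = np.sum(np.abs(z[None, :] - Z))     # L-1 spatial consistency term
        return mu * phi_quad + phi_tv
    z0 = Z.mean(axis=0)                             # start from the camera average
    return minimize(objective, z0, method="Powell").x
```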
6 Comparative Experiments
The setting of our experiment is as follows. Three cameras are placed approximately on the same horizon. One camera (camera 3) is placed in the frontal direction to the seat, and the other two are at about +42° (camera 2) and −37° (camera 1) deviations from the frontal direction. The experiment is conducted with a person sitting still on a chair with the head turning from right (−50°) to left (+80°) and then to the front (+40°), without much translational movement. The time lapse between consecutive frames in each camera is half a second, and the resolution of the cameras is 320×240 pixels. Fig. 6 shows the result of the in-node orientation and angular motion estimation. The dotted lines in the figures show the ground truth face orientation at each time instance. The hair-face ratio estimates are very accurate when the ratio curve is symmetric, at either the frontal or the back view.
Fig. 6. Estimated face orientation and angular motion by in-node signal processing
Fig. 7. Estimated relative angular differences to the object between cameras and temporal feature fusion in each camera
Fig. 8. Spatiotemporal feature fusion before and after spatiotemporal validation
The optical flow estimates are mostly consistent with each other, with slow motions exhibiting small variance and vice versa. Utilizing FBPM and using local key frame information, temporal feature fusion estimates are given in Fig. 7 (left). Further data exchange between cameras requires the relative angular differences to the face between cameras, which are given by the head-strip mapping algorithm (Fig. 7 (right)).
The results of the collaborative face orientation estimation are shown in Fig. 8. Before validation, the error may propagate in FBPM as we estimate face orientation by accumulating angular motion from the time of a key frame. The spatiotemporal validation successfully corrects the outlier estimates and smoothes the data as we expect.
7 Conclusions
In this paper, we have shown that it is possible to estimate face orientation with preliminary in-node signal processing and spatiotemporal data exchange in a smart camera network, where both computation and bandwidth are limited. Preliminary image processing and estimation methods are intentionally designed to reduce computation cost, accepting that some local estimates may be inaccurate. A spatiotemporal data exchange method is embodied through identification and exchange of key frames spatially, forward-backward propagation of angular motion estimates temporally, and smoothing and outlier rejection spatiotemporally. A head-strip matching method based on a Viterbi-like algorithm predicts the relative angular differences to the face between cameras and reconstructs a face model without having to know the camera locations a priori.
References
1. Bai, X.-M., Yin, B.-C., Shi, Q., Sun, Y.-F.: Face recognition using extended fisherface with 3D morphable model. In: Proc. of the ICMLC, vol. 7, pp. 4481–4486 (2005)
2. Hu, Y., Jiang, D., Yan, S., Zhang, L., Zhang, H.: Automatic 3D reconstruction for face recognition. In: IEEE Conference on FGR. IEEE Computer Society Press, Los Alamitos (2004)
3. Kurata, D., Nankaku, Y., Tokuda, K., Kitamura, T., Ghahramani, Z.: Face recognition based on separable lattice HMMs. In: Proc. of ICASSP (2006)
4. Liu, C., Wechsler, H.: Enhanced Fisher linear discriminant models for face recognition. In: Proc. of ICPR, vol. 2, pp. 1368–1372 (1998)
5. Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cognitive Neuroscience 3(1), 71–86 (1991)
6. Chang, C., Aghajan, H.: A LQR spatiotemporal fusion technique for face profile collection in smart camera surveillance. In: Proc. of ACIVS (2007)
7. Uchida, N., Shibahara, T., Aoki, T.: Face recognition using passive stereo vision. In: Proc. of ICIP (2005)
8. Bouguet, J.-Y.: Pyramidal implementation of the Lucas-Kanade feature tracker: description of the algorithm. Intel Corporation, Microprocessor Research Labs (2000)
9. Intel Corporation: Open Source Computer Vision Library 1.0 (2006)
Independent Component Analysis-Based Estimation of Anomaly Abundances in Hyperspectral Images Alexis Huck and Mireille Guillaume Institut Fresnel UMR 6133 CNRS, Universités Aix-Marseille, France
[email protected] [email protected]
Abstract. Independent Component Analysis (ICA) is a blind source separation method which is exploited for various applications in signal processing. In hyperspectral imagery, ICA is commonly employed for detection and segmentation purposes, but it is often thought to be unable to quantify abundances. In this paper, we propose an ICA-based method to estimate the anomaly abundances from the independent components. The first experiments on synthetic and real-world hyperspectral images are very promising with respect to estimation accuracy and robustness.
1 Introduction
A Hyperspectral Image (HSI) is a set of 2D images of a scene taken at the same time in hundreds of contiguous thin spectral bands, such that each pixel of the HSI is a vector containing the sampled radiance or reflectance spectrum of the local scene. As it represents a great deal of crude information, many applications exist, such as target and anomaly detection, compression and denoising, segmentation and classification. A specificity of HSIs is the spectral dimension, which enables processing leading to subpixel information. ICA [1] [2] is a blind source separation (BSS) method which finds the linear transform that decomposes the HSI into 2D images - the independent components (ICs) or the sources - that are as statistically independent as possible. This method is based on the hypothesis that the pure materials composing the scene have statistically independent presences. In practical cases, the hypothesis of statistical independence is never respected, which leads to two main disturbances:
1. The sources are not the expected abundance maps. The abundance map of a pure material (endmember) is a 2D image whose pixel values, ranging between 0 and 1, indicate the surface proportion of this material's spectrum in each vector pixel.
2. The mixing matrix columns are not the endmember sampled spectra.
However, ICA is often used for HSI analysis, because the hypothesis of statistical independence is considered the least improper a priori assumption when really no knowledge of the scene is reachable a priori. What can be directly exploited from an HSI ICA is the set of sources, which only gives an idea of the different material locations. But in the case of anomalies - i.e. objects whose presence in the scene is rare - it is nevertheless realistic to concede the hypothesis of statistical independence from the other materials. So an ICA usually associates an IC with each kind of anomaly in the scene. In this paper, we propose a method to estimate each anomaly abundance map from its corresponding IC. We call it ICA-EAA, for ICA-based Estimator of Anomaly Abundance. In [3], Wang and Chang proposed an ICA-based Abundance Quantification Algorithm (ICA-AQA) that enables estimation of the endmember abundances, provided each endmember is naturally fully present in at least one vector pixel of the image. Nevertheless, this estimator is not adapted to some practical purposes, which will be made explicit. The ICA-EAA is robust to any anomaly-IC shape and performs very accurate abundance estimations. The paper is organized as follows. In Sect. 2, the principle of ICA adapted to HSI analysis is briefly explained, as well as the mathematical model and the notations. In Sect. 3, we define the ICA-EAA and discuss the definition. The experimental Sect. 4 emphasizes the robustness of our abundance estimator to any anomaly-IC shape; the accuracies of the two estimators are compared through tests on a synthetic image and we apply the estimator to real-world images to quantify small anomaly surfaces. The last section concludes the paper.
2 Mathematical Model: From Linear Mixture Model to ICA
This section introduces ICA of HSIs as a blind source separation method. We start by recalling the physical interpretation of ideal blind source separation in hyperspectral imagery, introducing the explanation from the linear mixture model (LMM). Then, ICA is briefly introduced and its use for HSI analysis is discussed.
2.1 LMM
HSI analysis very often uses the LMM [4], which models the $i$th sampled spectral pixel, contained in the spectral column vector $r^i$, as a linear combination of endmember spectra:

\[
r^i = \gamma_1^i m_1 + \dots + \gamma_J^i m_J + n^i
\tag{1}
\]
\[
\phantom{r^i} = M^T \gamma^i + n^i
\tag{2}
\]

where $J$ is the number of endmembers, $\{m_j\}_{j=1\ldots J}$ are the endmember sampled spectra contained in column vectors, $M = [m_1, \ldots, m_J]^T$, $\gamma^i = [\gamma_1^i, \ldots, \gamma_J^i]^T$ is the column vector containing the set of abundances and $n^i$ is the noise vector. It takes into account the sensor noise, the spectral variability and the atmosphere fluctuations. The coefficients $\{\gamma_j^i\}_{j=1,\ldots,J}$ must satisfy the physical conditions:

\[
\forall j = 1, \ldots, J, \quad \gamma_j^i \geq 0
\tag{3}
\]
\[
\sum_{j=1}^{J} \gamma_j^i = 1 .
\tag{4}
\]

Henceforth, let us consider $R^T$, a matrix-reshaped writing of the HSI, whose lines are the spectral pixels and whose columns correspond to the spectral bands of the image. Thus, it is possible to decompose the whole HSI according to the LMM:

\[
R^T = \Gamma M
\tag{5}
\]

where $R = [r^1, \ldots, r^N]$ and $\Gamma = [\gamma^1, \ldots, \gamma^N]^T$. $N$ is the number of pixels in the HSI. From now on, any 2D image is assumed vector-reshaped in the same manner.
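As an illustration of the LMM (our own sketch, not the authors' code), the following function synthesizes the reshaped HSI of Eq. (5) from endmember spectra and abundances satisfying conditions (3) and (4):

```python
# A sketch (our illustration) of the LMM of Eqs. (1)-(5): every pixel spectrum
# is a non-negative, sum-to-one combination of the endmember spectra plus noise.
import numpy as np

def lmm_synthesize(M, Gamma, noise_std=0.01, seed=0):
    """M: (J, L) endmember spectra as rows; Gamma: (N, J) abundances (rows sum to 1)."""
    assert np.all(Gamma >= 0) and np.allclose(Gamma.sum(axis=1), 1.0)
    rng = np.random.default_rng(seed)
    R_T = Gamma @ M                                  # Eq. (5): R^T = Gamma M
    return R_T + noise_std * rng.standard_normal(R_T.shape)
```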
2.2 ICA: A BSS Technique
In HSI analysis, a way to model an ideal BSS problem consists in decomposing the HSI as follows:

\[
R = A \cdot S + N ,
\tag{6}
\]

where $A = M^T = [m_1, \ldots, m_J]$, $S = \Gamma^T = [\gamma^1, \ldots, \gamma^N]$ and $N$ is the noise matrix. $A$ is called the mixing matrix, and its columns contain the endmember sampled spectra. $S$ is called the source matrix and its lines correspond to the abundance maps. With no more hypotheses than the LMM, the BSS problem is ill-posed. ICA applied to HSIs uses the LMM, assuming the statistical independence of the sources. Performing an HSI ICA consists in:
1. associating a monodimensional random variable (r.v.) $R_\omega$ with each spectral band of the HSI, and considering the corresponding monochromatic 2D image of the HSI as a set of observations of $R_\omega$. Thus, an $L$-dimensional random vector $r_\omega$ is associated with the whole HSI. Its random components are reasonably assumed statistically dependent.
2. finding the linear transform $W_{ICA}$ which maximizes the statistical independence between the components of the $J$-dimensional random vector $s_\omega$:

\[
s_\omega = W_{ICA} \cdot r_\omega .
\tag{7}
\]

Let $A_{ICA}$ be the pseudo-inverse of $W_{ICA}$. Thus, performing an HSI ICA decomposes the HSI as follows:

\[
R = A_{ICA} \cdot S_{ICA} + E ,
\tag{8}
\]

where the lines of $S_{ICA}$ (the ICs, or sources, which are vector-reshaped 2D images) are statistically independent, and $E$ is the reconstruction error matrix.
If the endmembers have statistically independent locations in the scene, we can expect the decompositions given by equations (6) and (8) to be identical:

\[
\begin{cases}
A_{ICA} = A \\
S_{ICA} = S \\
E = N
\end{cases}
\tag{9}
\]

Unfortunately, this case never purely occurs, so the $A_{ICA}$ columns are not endmember spectra and the $S_{ICA}$ lines are not abundance maps. In particular, these matrices contain negative coefficients in all practical cases, even after any transformation such as $A'_{ICA} = A_{ICA} \cdot F$, $S'_{ICA} = F^{-1} \cdot S_{ICA}$, where $F$ is a diagonal $J \times J$ matrix with non-null elements. A physical argument is the natural dependence between materials. For instance, it is more likely to find grass on a muddy soil than on rock. A mathematical rationale is induced by the conditions given in equations (3) and (4), which are a dependence link between endmember spectra. However, it is worth discussing the peculiar case of anomalies. As they are objects whose presence is rare in the image, they can be objects of interest, for some applications such as environment control. For anomalies, the hypothesis of statistical independence is much more plausible. Consequently, when an HSI ICA is performed, an IC is usually attributed to each anomaly. The histogram of each anomaly IC is characterized by a central value around which the background IC pixels are centered, while the IC pixels containing the anomaly have a different value. ICA-EAA is based on this point.
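For illustration, an ICA decomposition of the form (8) can be obtained on a reshaped HSI with any standard ICA implementation; the sketch below (ours, not the authors' code) uses scikit-learn's FastICA.

```python
# A sketch (our illustration, not the authors' code) of performing the ICA
# decomposition of Eq. (8) on a reshaped HSI.
import numpy as np
from sklearn.decomposition import FastICA

def hsi_ica(R_T, n_sources):
    """R_T: (N, L) reshaped HSI, rows = pixels, columns = spectral bands."""
    ica = FastICA(n_components=n_sources)
    S = ica.fit_transform(R_T).T      # (n_sources, N): each row is one IC / source
    A_ica = ica.mixing_               # (L, n_sources): estimated mixing matrix
    return S, A_ica
```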
3 Estimation of Anomaly Abundances: Methods and Discussion
In this section, ICA-EAA is presented. Ideal histogram shapes of anomaly ICs are shown in Fig.1. The left histogram shape refers to an IC with only positive values, whereas the right one refers to an IC with only negative values. Note that in a HSI, most ICs do not correspond to an anomaly. We can expect such histogram shapes because ICs returned by ICA are known apart from a (positive or negative) multiplying factor. So, in this ideal case, ICs are proportional to abundances. They are supposed to have non-gaussian shapes [2], due to the presence of outliers (the anomalies). Figs.1 and 2 illustrate the fact that in a given anomaly-IC, most values are centered around an average one and, by contrast, the IC values corresponding to the pixels containing the associated anomaly are different from this average value. In practical cases, an IC may simultaneously have positive and negative values, due to the independence hypothesis not being fully satisfied. Therefore, more realistic anomaly IC shapes are given in Fig.2.
Fig. 1. Examples of IC histogram shapes
0
pixel number pixel number
IC value pixel number
IC value
0
0 IC value pixel number
IC value
0
IC value
pixel number
0
0
IC value
0
IC value
pixel number
pixel number
0
IC value
Fig. 2. Examples of conceivable shapes of IC histograms
The four left histograms correspond to most real-world cases: most IC values are nearly null and only a few have a higher absolute value. The four right histograms correspond to rare cases of IC histogram shapes. In order to estimate anomaly abundances from ICs, the method consists in defining a linear transform $f$ such that

\[
IC_j(i) \xrightarrow{\;f\;} \gamma_j^i ,
\tag{10}
\]

where $IC_j$ is the $j$th IC, which corresponds to an anomaly, and $\gamma_j^i$ is the estimated abundance of the $j$th endmember in the $i$th spectral pixel. According to condition (3) and to the hypothesis that the anomaly is fully present in at least one pixel, Wang and Chang proposed in [3] an abundance estimator, named ICA-AQA, given by

\[
\hat{\gamma}_j(i) = \frac{|IC_j(i)| - \min_{i\in\{1,\ldots,N\}} |IC_j(i)|}{\max_{i\in\{1,\ldots,N\}} |IC_j(i)| - \min_{i\in\{1,\ldots,N\}} |IC_j(i)|} .
\tag{11}
\]

It is a linear transform whose principle is made explicit in Fig. 3. Now let us propose the following estimator, ICA-EAA, based on the same conditions:

\[
\tilde{\gamma}_j(i) = \frac{IC_j(i) - \mathrm{med}_{i\in\{1,\ldots,N\}}(IC_j(i))}{p_j - \mathrm{med}_{i\in\{1,\ldots,N\}}(IC_j(i))} ,
\tag{12}
\]
Fig. 3. Illustration of the f transform of the IC histogram into abundance histogram
Fig. 4. Illustration of the f˜ transform of the IC histogram into abundance histogram, with ICA-EAA
where the operator med is the median and the value $p_j$, for $IC_j$, is defined as follows. If

\[
\Bigl|\min_{i\in\{1,\ldots,N\}} (IC_j(i)) - \mathrm{med}_{i\in\{1,\ldots,N\}} (IC_j(i))\Bigr| \;\leq\; \Bigl|\max_{i\in\{1,\ldots,N\}} (IC_j(i)) - \mathrm{med}_{i\in\{1,\ldots,N\}} (IC_j(i))\Bigr| ,
\]

then (case 1) $p_j = \max_{i\in\{1,\ldots,N\}} (IC_j(i))$; otherwise (case 2) $p_j = \min_{i\in\{1,\ldots,N\}} (IC_j(i))$. With ICA-EAA, the estimated abundance vector $\tilde{\gamma}_j$ is obtained from an $\tilde{f}$ transform of the IC, as shown in Fig. 4. The considered histogram corresponds to case 2 of (12). Note that some coefficients $\tilde{\gamma}_j^i$ can be slightly negative. The proposed solution for that contingency is to set negative abundances to zero, as shown in Fig. 4. Now, let us present two improvements provided by ICA-EAA. First, in the ideal case, statistical independence of the anomalies is nearly satisfied. Then, the histogram shape looks like Fig. 1 and both estimators work.
Fig. 5. (a) Spectra of the synthetic HSI objects, (b) 10th spectral band of the synthetic HSI
However, it is more intuitive to associate the median of the IC histogram with the null abundance, as most spectral pixels of the HSI do not contain any anomaly. In Sect. 4, an experiment shows that this choice reduces the estimation error, especially when the true abundance is low. Accuracy improvement is therefore the first advantage of ICA-EAA. Secondly, in the general case where the statistical independence of the anomaly is not fully satisfied, considering the two cases in (12) makes ICA-EAA more robust to all the histogram shapes plotted in Fig. 2. Case 1 of definition (12) corresponds to line 1 of Fig. 2. By contrast, ICA-AQA only works if the histogram shape looks like one of the four left histogram shapes of Fig. 2. This point, namely robustness to a departure from the hypothesis of statistical independence, is the second asset of ICA-EAA.
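Both estimators are simple to implement; the sketch below (our own illustration) applies ICA-AQA of Eq. (11) and ICA-EAA of Eq. (12) to one anomaly IC given as a 1-D array, including the clipping of slightly negative abundances mentioned above.

```python
# A sketch (our illustration) of the two abundance estimators applied to one
# anomaly IC given as a 1-D array.
import numpy as np

def ica_aqa(ic):
    a = np.abs(np.asarray(ic, dtype=float))
    return (a - a.min()) / (a.max() - a.min())          # Eq. (11)

def ica_eaa(ic):
    ic = np.asarray(ic, dtype=float)
    med = np.median(ic)
    # p_j: the extreme value lying farther from the median (cases 1 and 2)
    p = ic.max() if abs(ic.min() - med) <= abs(ic.max() - med) else ic.min()
    gamma = (ic - med) / (p - med)                       # Eq. (12)
    return np.clip(gamma, 0.0, None)    # slightly negative abundances set to zero
```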
4 Experimental Results
4.1 Test on a Synthetic HSI
The proposed abundance estimator has been tested on a synthetic HSI. We arbitrarily chose four spectra from the HYDICE Radiance Forest HSI, plotted in Fig. 5(a). The background spectrum corresponds to a forest radiance spectrum. Objects 1, 2 and 3 correspond to ground, vehicle and road radiance spectra, respectively. The 10th spectral band of the studied synthetic HSI is shown in Fig. 5(b). The background is composed of the background spectrum with additive 30 dB Gaussian noise. The objects have been set into lines and columns. To the columns 1-3 correspond the objects 1-3, respectively. The lines are associated with known abundances: to the lines 1-7 correspond the object abundances 100, 80, 60, 40, 20, 10 and 5 percent. ICA has been applied to the synthetic HSI and the 3 endmembers corresponding to the anomaly natures have been selected. In practical cases, this selection step is complicated, because ICA, unlike PCA, is not supposed to classify ICs.
Fig. 6. Left column: ICs of objects 1-3 (lines 1-3); middle column: abundances of objects 1-3 estimated with ICA-AQA; right column: abundances of objects 1-3 estimated with ICA-EAA
Methods to automatically select the anomaly ICs are proposed in [5]. As explained in Sect. 2, the obtained ICs are images whose grey level values can be negative. In Fig. 6, the abundance maps of the three objects, which are anomalies, are estimated with ICA-AQA in the middle column and with ICA-EAA in the right column. The left column represents the ICs corresponding to objects 1-3. We notice that in every case, ICA-EAA returns abundances near 1 where the object is expected in the pixel and approximately null abundances in the background. If we focus on the abundances of the first object (Fig. 6, line 1), whose results seem accurate for both estimators, we can draw up a table (Fig. 7) comparing the estimated abundances with the ground-truth ones. It is interesting to remark that the lower the true abundance, the higher the relative estimation error. This is due to the decreasing signal-to-noise ratio when the anomaly abundance decreases: when the anomaly abundance (signal) is reduced in the pixel, the noisy background (noise) prevails. Another noteworthy point is the generally better accuracy of ICA-EAA. This is due to the accurate correspondence between the median of the IC and the null abundance of the associated anomaly.
4.2 Application on Real-World HSIs: Estimation of Small Anomaly Surfaces
ICA-EAA has been tested on real-world HSIs.
ground truth | ICA-EAA estimation | error (abs) | ICA-AQA estimation | error (abs)
100          | 100                | 0           | 100                | 0
80           | 80.03              | 0.03        | 80.25              | 0.25
60           | 59.93              | 0.07        | 60.37              | 0.37
40           | 39.31              | 0.69        | 39.98              | 0.02
20           | 19.93              | 0.07        | 20.82              | 0.82
10           | 9.74               | 0.26        | 10.74              | 0.74
5            | 5.46               | 0.46        | 6.50               | 1.50
Fig. 7. Abundance estimations of the object 1
Fig. 8. (a) Real-world HSI and (b) surface estimation mean error in terms of the anomaly true surface
As [4] reveals the true panel sizes in the HYDICE Radiance Forest HSI, we propose to use ICA-EAA to estimate these panel sizes and compare the estimates to reality. The panels belong to the HSI selection represented in Fig. 8(a). The image contains anomalies arranged in lines and columns: to each line corresponds a kind of anomaly (spectral nature) and to each column corresponds a size. The left column objects are 3m × 3m panels, the center column objects are 2m × 2m panels and the right column objects are 1m × 1m panels. Knowing the spatial resolution, which is estimated to be 0.85 m, we estimate each panel surface from the abundance estimates of the neighboring pixels. Relative mean errors are given in Fig. 8(b). We notice that the smaller the panel, the higher the relative mean estimation error. This confirms the results shown in Fig. 7 about the estimators' behavior in the case of weak abundance. ICA-EAA, in comparison with ICA-AQA, enables accurate surface estimation of small anomalies. Fig. 9 plots each true and estimated panel surface (y-coordinate) for the five kinds of anomaly (x-coordinate). From left to right, the graphs refer to 9m2, 4m2 and 1m2 panels. Full lines represent the ground truth, (×) represents the ICA-EAA estimate and () represents the ICA-AQA estimate. The dashed line on the two right graphs gives an approximate idea of the bias appearing with ICA-AQA. This bias on the surface estimation causes the poor estimation accuracy in the case of low abundances.
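The surface estimation itself reduces to summing the estimated abundances over the pixels around a panel and scaling by the pixel area; a minimal sketch (our own illustration, using the 0.85 m resolution quoted above):

```python
# A sketch (our illustration) of the surface estimation: the panel surface is
# the abundance-weighted pixel area summed over the pixels around the panel.
import numpy as np

def panel_surface(abundance_map, neighborhood_mask, resolution_m=0.85):
    """abundance_map: 2-D abundance estimates; neighborhood_mask: boolean mask."""
    return float(np.sum(abundance_map[neighborhood_mask])) * resolution_m ** 2
```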
Fig. 9. From left to right, anomaly surface of 9m2 , 4m2 , 1m2 panels; ground truth in full line, ICA-EAA estimation (×), ICA-AQA estimation (); approximate bias of ICA-AQA (·−)
5 Conclusion
We have proposed in this paper a method contributing to hyperspectral image analysis. It is a post-processing step, applied after anomaly detection and extraction with ICA, which enables estimation of anomaly abundances from the independent components. Anomaly abundances are thus accurately estimated, and the results obtained through synthetic and real-world HSI experiments are very promising. In current work, we are evaluating this estimator of anomaly abundances on other real-world images.
References
1. Cardoso, J.F.: Blind signal separation: statistical principles. Proceedings of the IEEE 86, 2009–2025 (1998)
2. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley-Interscience, Chichester (2001)
3. Wang, J., Chang, C.I.: Applications of independent component analysis (ICA) in endmember extraction and abundance quantification for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing 44, 2601–2616 (2006)
4. Chang, C.: Hyperspectral Imaging: Techniques for Spectral Detection and Classification. Kluwer Academic/Plenum Publishers, New York (2003)
5. Chang, C.I.: Estimation of the number of spectral sources in hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing 42 (2004)
Unsupervised Multiple Object Segmentation of Multiview Images Wenxian Yang and King Ngi Ngan Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong {wxyang,knngan}@ee.cuhk.edu.hk
Abstract. In this paper we propose an unsupervised multiview image segmentation algorithm, combining multiple image cues including color, depth, and motion. First, the objects of interest are extracted by computing a saliency map based on the visual attention model. By analyzing the saliency map, we automatically obtain the number of foreground objects and their bounding boxes, which are used to initialize the segmentation algorithm. Then the optimal segmentation is calculated by energy minimization under the min-cut/max-flow theory. There are two major contributions in this paper. First, we show that the performance of graph cut segmentation depends on the user's interactive initialization, while our proposed method provides robust initialization instead of the random user input. In addition, we propose a novel energy function with a locally adaptive smoothness term when constructing the graphs. Experimental results demonstrate that subjectively good segmentation results are obtained.
1 Introduction
In recent years, generation and visualization of dynamic photorealistic environments have become very popular using video-based rendering (VBR) techniques. In [1], range space matching and multiple depth maps rendering methods are proposed to synthesize virtual views from sparse multiview images and to avoid accurate depth estimation. In [2], a multiview video capture system is built and a rendering scheme is proposed using a layered representation with boundary matting. The rendering algorithm in [3] identifies and selects the best quality surface areas from available reference images, and produces virtual views with better perceptual quality. In most existing VBR systems [1,2], the entire image is rendered. However, in some applications, the end-users may desire the capability to render only the object of interest (OOI). A first step towards this goal is semantic object segmentation. In current VBR systems, e.g., [3], blue-screen or homogeneous background settings are applied to avoid segmentation. However, the constraint of the homogeneous background limits the viewing freedom to be within 180◦. In addition, it is not feasible to set up a homogeneous background for typical multiview scenarios such as a football game. A dynamic VBR system
with the all-around viewing capability (360◦) motivates this research on semantic object segmentation of multiview images. Although image segmentation has been extensively studied in the literature, the results are not satisfactory. A major difficulty lies in the fact that semantic objects are not homogeneous with respect to color, motion or texture properties. Fortunately, based on the assumption that the depth values over one object vary smoothly and continuously, the depth information associated with multiview images functions as an important cue for segmentation. However, due to the occlusion problem and the ill-posed nature of matching, errors may occur in the depth map. In addition, the depth values over two touching objects are also distributed as if they were one. To obtain more robust segmentation results for object-level manipulation, integration of depth, color, and other image cues should be considered. Existing multiview image segmentation algorithms have two major drawbacks. First, some algorithms rely on depth models. In [4], the depth of an object is represented by an affine model, and the dense depth map is segmented by energy minimization using iterated conditional modes (ICM) relaxation. The drawback of using the affine model is that it cannot accurately represent the motion of the background, which may contain various structures, and thus leads to over-segmentation of the background. Layered representations [5] are also widely adopted for the depth map, but a depth layer does not necessarily represent a semantic object. In [6], layered dynamic programming and layered graph cut are proposed to segment stereo images, but only bi-layer (foreground/background) segmentation is considered. Second, many multiview image segmentation algorithms segment the depth map and the color image independently and fuse the results to get the final mask. They fail to utilize all the information simultaneously and efficiently, and may lack accuracy or generality, or require expensive computation. In [7], the color image and the depth map are segmented separately using a multiresolution recursive shortest spanning tree (M-RSST) algorithm. The final object mask is obtained by projecting color segments onto the depth segments. The number of foreground objects has to be known a priori. In [8], object segmentation is carried out by combining initial disparity estimates with nonlinear diffusion techniques. Alternatively, in our work, we consider the direct coupling of multiple cues including depth and color in one energy minimization framework, which is more efficient and robust. On the other hand, graph cut [9,10,11,12] has been extensively used in computer vision tasks as a powerful energy minimization technique in this decade. In the field of image segmentation, many variations have evolved, including normalized cut [13], ratio cut [14], and grab cut [15], etc. However, a major drawback of the graph cut based segmentation methods is their dependence on initialization. When segmenting a color image, the user needs to draw a line across the foreground object and another across the background object, so that the initial data models can be built. First, the initialization procedure itself may be annoying to the users, and a fully unsupervised segmentation is desired. Second, graph cut based methods suffer from incomplete initial modeling. For example, if the
foreground object contains several colors while the initial foreground line does not cover all of them, the part of the foreground object with the missing color may never be correctly segmented. In this paper, we propose an unsupervised multiple object segmentation scheme for multiview images. The scheme contains two parts: a visual attention based fully automatic OOI extraction algorithm and a global energy minimization based segmentation scheme. The rest of the paper is organized as follows. In section 2, the visual attention based OOI extraction algorithm is described. A saliency map will be constructed, based on which thresholding is applied to extract the objects. Following a brief discussion of the graph cut, the proposed segmentation algorithm will be described in section 3. In section 4 we present some experimental results and analysis, and section 5 concludes the paper.
2 Visual Attention Based OOI Extraction
Automatically extracting objects of interest from images and videos is a challenging task. Traditionally, user interactions are needed for OOI extraction from still images, and motion information is analyzed for OOI extraction from video sequences. The former is inconvenient, and the latter can only extract moving objects. This problem can be solved by taking into account human visual attention [16]. In the proposed algorithm, multiple image features are extracted and combined into a single topographical saliency map. The saliency value of each location defines its conspicuity, and larger values tend to attract more visual attention. To calculate the saliency map, first, nine spatial scales (labelled scale 0 to 8, from top to bottom) are created using dyadic Gaussian pyramids which progressively low-pass filter and subsample the input image. Then, center-surround differences are computed between scales c ∈ {2, 3, 4} and s = c + δ, with δ ∈ {3, 4}. To compute the center-surround differences, two major operations, namely the across-scale difference ⊖ and the across-scale addition ⊕, are defined. The across-scale difference between two maps is obtained by interpolation to the finer scale and point-by-point subtraction. The across-scale addition between two maps is obtained by reduction of each map to the coarser scale and point-by-point addition. In [16], the features used to calculate the saliency map include intensity, color and orientation, which are the straightforward low-level cues for a single image. However, a semantic object may not conform to any pre-defined properties with regard to intensity, color or orientation. It may contain different intensity levels, even high-contrast components, and the same holds for the color and orientation features. Our assumption is that a semantic object should have a continuous and smooth distribution in both depth and motion. These higher-level features can be extracted from multiview video sequences, and thus we propose to calculate the saliency map based on depth and motion features. Given the depth map D, 6 feature maps are computed:

D(c, s) = |D(c) ⊖ D(s)|.    (1)
The motion field has two components, the horizontal component MV_h and the vertical component MV_v, and the scalar motion value for pixel (i, j) is calculated as

M(i, j) = √( MV_h(i, j)² + MV_v(i, j)² ).    (2)

Given the motion field M, 6 feature maps are computed as

M(c, s) = |M(c) ⊖ M(s)|.    (3)
Based on these 12 feature maps, 2 conspicuity maps are computed as

D̄ = ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} N(D(c, s))    (4)

M̄ = ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} N(M(c, s)).    (5)

Here, N is a map normalization operator. The saliency map is computed as the average of the conspicuity maps:

S = (1/2) ( N(D̄) + N(M̄) ).    (6)
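To make the construction above concrete, the following Python sketch (using NumPy and SciPy, which are assumed dependencies) computes such a saliency map from a depth map and a motion field. For simplicity the dyadic pyramid is approximated at full resolution by Gaussian blurs whose widths play the role of the scales c and s = c + δ, and the normalization operator N is plain peak normalization rather than the iterative operator of [16]; this is an illustrative sketch, not the authors' implementation.

import numpy as np
from scipy import ndimage

def feature_maps(channel):
    # center-surround differences for c in {2,3,4}, delta in {3,4} (Eqns. (1), (3))
    maps = []
    for c in (2, 3, 4):
        for delta in (3, 4):
            center = ndimage.gaussian_filter(channel, 2.0 ** c)
            surround = ndimage.gaussian_filter(channel, 2.0 ** (c + delta))
            maps.append(np.abs(center - surround))
    return maps

def normalize(m):
    # simplified stand-in for the normalization operator N
    return m / (m.max() + 1e-12)

def saliency(depth, mv_h, mv_v):
    motion = np.hypot(mv_h, mv_v)                              # Eqn. (2)
    D_bar = sum(normalize(m) for m in feature_maps(depth))     # Eqn. (4)
    M_bar = sum(normalize(m) for m in feature_maps(motion))    # Eqn. (5)
    return 0.5 * (normalize(D_bar) + normalize(M_bar))         # Eqn. (6)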
Based on the saliency map, attention objects can be located and used to initialize the segmentation algorithm. First, we apply thresholding [17] to the saliency map. Then, small components are removed by morphological erosion and dilation operations. By connected component labelling (CCL), the remaining components are detected as regions of objects of interest and are indexed. Finally, bounding boxes are drawn for the detected objects. To ensure that the bounding boxes cover the entire object, we enlarge them by 1.5 times both in width and height, keeping their centers unchanged. If two objects' bounding boxes overlap during this stretching, the multiplier is automatically reduced along the direction of overlap. The output of the OOI extraction is the number of objects together with their bounding boxes.
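A corresponding sketch of this extraction step is given below; the fixed relative threshold replaces the multithresholding of [17], the structuring-element sizes are illustrative assumptions, and the reduction of the enlargement factor for overlapping boxes is omitted for brevity.

import numpy as np
from scipy import ndimage

def extract_ooi(saliency_map, rel_thresh=0.5, enlarge=1.5):
    mask = saliency_map >= rel_thresh * saliency_map.max()   # thresholding
    mask = ndimage.binary_erosion(mask, iterations=2)        # remove small components
    mask = ndimage.binary_dilation(mask, iterations=2)
    labels, n_objects = ndimage.label(mask)                  # connected component labelling
    H, W = saliency_map.shape
    boxes = []
    for sl in ndimage.find_objects(labels):
        cy, cx = 0.5 * (sl[0].start + sl[0].stop), 0.5 * (sl[1].start + sl[1].stop)
        h, w = enlarge * (sl[0].stop - sl[0].start), enlarge * (sl[1].stop - sl[1].start)
        boxes.append((int(max(cy - h / 2, 0)), int(max(cx - w / 2, 0)),
                      int(min(cy + h / 2, H)), int(min(cx + w / 2, W))))
    return n_objects, boxes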
3 Segmentation Using Multiway Cut

3.1 Graph Cut for Energy Minimization
The image segmentation problem can be naturally formulated in terms of energy minimization based on the Bayesian and Markov Random Field (MRF) theories. Typically, the energy function is of the form

E(f) = Σ_{p∈P} D_p(f_p) + Σ_{{p,q}∈N} V_{p,q}(f_p, f_q),    (7)

where f defines a labelling. The data term D_p(f_p) defines the cost of assigning f_p to pixel p and measures how well label f_p fits pixel p given the observed data. The smoothness term V_{p,q}(f_p, f_q) measures the cost when two interacting pixels p and q are assigned the labels f_p and f_q, respectively, and imposes discontinuity-preserving smoothness on the labelling. N defines the neighborhood system.
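The sketch below evaluates the energy of Eqn. (7) for a given label field on a 4-connected grid; the data costs and the Potts smoothness cost used here are illustrative placeholders, not the terms defined later in this section.

import numpy as np

def mrf_energy(labels, data_cost, smooth_cost):
    # labels: (H, W) integer label field; data_cost: (H, W, n_labels) array;
    # smooth_cost(fp, fq): elementwise pairwise cost for neighbouring labels
    H, W = labels.shape
    e_data = data_cost[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    e_smooth = 0.0
    for (dy, dx) in [(0, 1), (1, 0)]:                 # count each neighbour pair once
        a = labels[:H - dy, :W - dx]
        b = labels[dy:, dx:]
        e_smooth += smooth_cost(a, b).sum()
    return e_data + e_smooth

# Example smoothness: Potts model, V(fp, fq) = [fp != fq]
potts = lambda a, b: (a != b).astype(float)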
If the energy function is regular [11], a graph G = ⟨V, E⟩ can be constructed to represent the energy. V = P ∪ {s, t} is the set of vertices, including all the pixels in the image P and two terminal vertices, the source s and the sink t. E contains t-links and n-links, where a t-link connects a pixel p to a terminal s or t, and an n-link connects two neighboring pixels p and q. The edge weights of t-links and n-links are assigned based on the energy function. An s-t-cut C = (S, T) is a partition of the vertices in V into two disjoint sets S and T by removing edges, such that s ∈ S and t ∈ T. The cost of cut C equals the sum of the weights of all edges that go from S to T:

c(S, T) = Σ_{p∈S, q∈T, (p,q)∈E} c(p, q).    (8)
The minimum s-t-cut problem is to find a cut C with the smallest cost. The s-t-cut is defined for bi-label problems, and the multiway cut is defined for multi-label problems. In a multi-label problem, V = P ∪ L, and there are multiple terminal vertices in the label set L. A subset of edges C ⊆ E is called a multiway cut if the terminals are completely separated in the induced graph G(C) = ⟨V, E − C⟩. Similarly, the cost of the cut C equals the sum of the weights of all the edges removed in the cut. The multiway cut problem is to find the minimum cost multiway cut. Multiway cut can be solved by iterative bi-label s-t-cuts using the α-expansion or α-β swap algorithms [9].
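As a toy illustration of this construction, the sketch below segments a 1-D "image" into two labels with a single s-t cut, using the NetworkX library (an assumed dependency) as the min-cut/max-flow solver; the squared-difference data costs and the constant smoothness weight are arbitrary choices for the example.

import networkx as nx

values = [0.1, 0.2, 0.15, 0.8, 0.9, 0.85]        # toy 1-D observations
lam = 0.3                                        # smoothness weight
G = nx.DiGraph()
for p, v in enumerate(values):
    G.add_edge('s', p, capacity=v ** 2)          # paid if p ends up on the sink (background) side
    G.add_edge(p, 't', capacity=(v - 1.0) ** 2)  # paid if p stays on the source (foreground) side
for p in range(len(values) - 1):                 # n-links between neighbours (both directions)
    G.add_edge(p, p + 1, capacity=lam)
    G.add_edge(p + 1, p, capacity=lam)

cut_value, (S, T) = nx.minimum_cut(G, 's', 't')
labels = ['foreground' if p in S else 'background' for p in range(len(values))]
print(cut_value, labels)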
3.2 Multiple Object Segmentation Via Energy Minimization
Proposed Energy Function. We follow the form of the energy function as defined in Eqn. (7). Similar to grab cut [15], Gaussian mixture models (GMM) are used to model the data distribution. One GMM, which is a full-covariance Gaussian mixture with 5 components, is built for each of the objects. Note that besides the OOIs extracted in the previous stage, we treat the background as one object, and one GMM is built for the background object. The data term of the energy function becomes

D_p(f_p) = −log p(d_p | f_p, k_p) − log π(f_p, k_p),    (9)
where d_p is the depth value of pixel p, k_p ∈ {1, · · · , 5} is the GMM component variable, p(·) is a Gaussian probability distribution, and π(·) are the mixture weighting coefficients. The choice of the smoothness term V_{p,q} is critical for the overall performance of the algorithm. We propose to use locally adaptive weights derived from both depth and color cues for the edge weights of the n-links. First, the n-link weight between pixel p and its neighbor q is initialized as

V⁰(p, q) = γ · dist(p, q)⁻¹ · exp{−diff(z_p − z_q)}.    (10)
Here, dist(p, q) is the coordinate distance between two neighboring pixels p and q. diff(zp − zq ) is the average difference of pixels p and q in terms of normalized depth D and three color components R, G and B as follows.
diff(z_p − z_q) = (1/6) [ 3·β_d·(d_p − d_q)² + β_r·(r_p − r_q)² + β_g·(g_p − g_q)² + β_b·(b_p − b_q)² ]    (11)

Here, d_p, r_p, g_p and b_p represent the depth, red, green and blue values of pixel p, respectively. β is a constant controlling the extent of smoothness, and is defined as

β_d = ( 2⟨(d_p − d_q)²⟩ )⁻¹    (12)
for the depth, where ⟨·⟩ denotes expectation over an image sample. β_r, β_g and β_b are defined in the same way. In our proposed method, we use the second-order neighborhood system, so that each pixel p has 8 neighbors N_p = {q₁, · · · , q₈}. Note that only 4 of them need to be calculated for one pixel, as the graph is undirected. We define the weight of a pixel p as the sum of all the n-link weights associated with it:

w(p) = Σ_{q∈N_p} V⁰(p, q)    (13)
The average weight of the image is calculated as

W = (1/s) Σ_{p∈P} w(p),    (14)

where s is the image size. The initial n-link weights obtained by Eqn. (10) are then normalized by a locally adaptive factor, defined as the average pixel weight W divided by the pixel weight w(p). Thus, the weights of the n-links are updated as

V(p, q) = (W / w(p)) · V⁰(p, q).    (15)
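A vectorized sketch of Eqns. (10)–(15) follows, for an image whose normalized depth and R, G, B channels are stacked into an array z; the value of γ is a placeholder, and only half of the 8-neighbourhood offsets are enumerated since the graph is undirected.

import numpy as np

def adaptive_nlinks(z, gamma=50.0):
    # z: (H, W, 4) array holding normalized depth, R, G, B values per pixel
    H, W, C = z.shape
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]           # half of the 8-neighbourhood
    def slices(dy, dx):                                    # index ranges of p and of its neighbour q
        p = (slice(0, H - dy), slice(max(0, -dx), W - max(0, dx)))
        q = (slice(dy, H), slice(max(0, dx), W - max(0, -dx)))
        return p, q
    sq = {o: (z[slices(*o)[0]] - z[slices(*o)[1]]) ** 2 for o in offsets}
    pooled = np.concatenate([s.reshape(-1, C) for s in sq.values()])
    beta = 1.0 / (2.0 * pooled.mean(axis=0) + 1e-12)       # Eqn. (12), per channel
    V0, wp = {}, np.zeros((H, W))
    for o, s in sq.items():
        w = beta * s
        diff = (3 * w[..., 0] + w[..., 1] + w[..., 2] + w[..., 3]) / 6.0   # Eqn. (11)
        V0[o] = gamma / np.hypot(*o) * np.exp(-diff)       # Eqn. (10)
        p, q = slices(*o)
        wp[p] += V0[o]; wp[q] += V0[o]                     # Eqn. (13): sum of incident n-links
    W_avg = wp.mean()                                      # Eqn. (14)
    return {o: W_avg / wp[slices(*o)[0]] * v for o, v in V0.items()}   # Eqn. (15)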
With the help of this locally adaptive normalization process, the weights of a pixel with high discontinuity distributions in its neighborhood will be suppressed, while the weights of a pixel with low discontinuity distributions in its neighborhood will be enhanced.

Energy Minimization via Multiway Cut. To minimize the proposed energy function, the multiway cut with α-expansion is used. The major steps are described in Table 1. First, in the initialization step, the data models are built and the label field is initialized based on the foreground object bounding boxes. The data in the bounding boxes is used to build the GMM for each foreground object, while the region which is not initially included in any bounding box is labelled as "background" and used to build the GMM for the background object. The set of labels L includes the multiple foreground objects' indices and one index for the background object, i.e., L = {0, 1, · · · , n}, where n is the number of foreground objects.
Table 1. Proposed multiple object segmentation algorithm

1. Assign GMM components to pixels.
2. Learn GMM parameters from input data D.
3. For each label α ∈ L,
   3.1. construct an s-t subgraph,
   3.2. estimate segmentation using min-cut.
4. Update the label field and repeat from step 1 until convergence.
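A sketch of steps 1–2 of Table 1 and of the data term of Eqn. (9) is given below, using scikit-learn (an assumed dependency): one 5-component GMM per label is fitted to the depth values inside the corresponding region (label 0 being the background), and the per-pixel cost is evaluated as −log p(d_p | f_p). Note that score_samples marginalizes over the component variable k_p rather than conditioning on it as in Eqn. (9).

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmms(depth, masks):
    # masks: list of boolean (H, W) arrays, masks[0] being the background region
    gmms = []
    for m in masks:
        g = GaussianMixture(n_components=5, covariance_type='full')
        g.fit(depth[m].reshape(-1, 1))
        gmms.append(g)
    return gmms

def data_term(depth, gmms):
    # returns D of shape (H, W, n_labels): D[..., f] = -log p(d_p | f)
    d = depth.reshape(-1, 1)
    D = np.stack([-g.score_samples(d) for g in gmms], axis=1)
    return D.reshape(depth.shape + (len(gmms),))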
Then, given the current label field f and a label α, an s-t subgraph Gα = ⟨Vα, Eα⟩ is constructed. The source s stands for the label α while the sink t stands for the current label f_p. Equivalently, after the min-cut, if a pixel is connected to the source s its label should be changed to α, while connecting to the sink t means that the pixel should keep its current label f_p, also denoted by ᾱ. The weights of t-links and n-links are defined in Table 2. In Table 2, node a is an auxiliary node as introduced in [9]. After the min-cut, the pixels are connected either to the source or to the sink, and those pixels connected to the source will update their labels to α.

Table 2. Edge weights defined for subgraph Gα

  edge                       weight                               for
  t_p^α,  t̄_p^α              0,  ∞                                f_p^0 ≠ 0, f_p^0 ≠ α
  t_p^α,  t̄_p^α              D_p(α),  D_p(f_p)                    p ∈ P
  e_{p,a}, e_{a,q}, t̄_a^α    V(f_p, α), V(α, f_q), V(f_p, f_q)    {p, q} ∈ N, f_p ≠ f_q
  e_{p,q}                    V(f_p, α)                            {p, q} ∈ N, f_p = f_q
Visually, during α-expansion, label α grabs pixels from those whose current label is not α. In this paper, we assume that the objects do not overlap. Thus, given the current label field, the objects can only grab pixels from the background but not from other objects, while the background can grab pixels from all objects. This is enforced by setting t_p^α to 0 and t̄_p^α to ∞ when f_p^0 ≠ 0 and f_p^0 ≠ α, where f^0 is the initial label field. This assumption, being valid in many natural images, avoids mixing up different objects in case they have similar data distributions, which is possible in a depth image.
4 Experimental Results
We use the 3D video generated by MSR (http://research.microsoft.com/vision/InteractiveVisualMediaGroup/3DVideoDownload/) to test the performance of our proposed algorithm. Sample images are shown in Fig. 5(a) and (d). The depth sequences are associated with the 3D video data, and the motion fields are generated by Lucas & Kanade's algorithm [18], which is provided in the OpenCV library.

4.1 Results for Visual Attention Based OOI Extraction
We compare the saliency maps obtained by high-level cues with the saliency maps obtained by low-level cues, as shown in Fig. 1. Here, by high-level cues we refer to depth and motion, and by low-level cues we refer to intensity, color and orientation.
Fig. 1. The upper row shows the saliency maps obtained using (a) the low-level cues including intensity, color and orientation, and (b) the high-level cues including depth and motion. The bottom row shows the results of thresholding the saliency maps. (c) shows the thresholding result of (a), and (d) shows the thresholding result of (b).
When calculating the saliency map using the low-level features, the regions with contrasting brightness, distinct colors or evident edges can be differentiated, with regard to intensity, color and orientation, respectively. However, these cues cannot differentiate an object which has similar intensity and color to its background neighborhood. The legs of the man in Fig. 5(a) are a good example. In addition, the background of a natural image may not be ideally clean, and thus some trivial objects in the background may also get high attention values.
As demonstrated by Fig. 1, these two problems can be successfully solved using depth and motion cues. With the help of the depth information, we can extract the attention region which has similar intensity and color with the background. In Fig. 1(d), most of the region of the man’s legs gets high attention scores. With the help of the motion information, we can avoid the erroneous inclusion of the background region as an attention region. For example, the stripe region between the wall and the floor gets high attention scores when one or more of the intensity, color or orientation features are considered. It even has high attention values with regard to the depth map, as it shows sharp discontinuites in depth. However, as no motion exists for the background, using the motion cue can suppress such regions. The bounding boxes obtained from the saliency map and the enlarged bounding boxes are shown in Fig. 2.
Fig. 2. (a) The object bounding boxes obtained from the thresholded saliency maps. (b) To ensure that the bounding boxes cover the entire object, they are extended by 1.5 times both in width and height. In case when two objects become overlapping, the stretch parameter is automatically reduced along the direction of overlapping.
4.2 Results for MultiCut Segmentation
For comparison, the results of depth image segmentation using graph cut [12] are shown in Fig. 3, and the results of color image segmentation using grab cut [15] are shown in Fig. 4. For both algorithms, different initializations are tested. The top row shows the user initialization and the bottom row shows the corresponding result. In Fig. 3, the graph cut method erroneously includes the background between the arm and the body of the man as foreground, and part of the foreground objects is segmented into the background. In Fig. 4, as the foreground object contains colors that are similar to colors in the background, the grab cut method fails to extract all the color components of the object. Moreover, neither method is robust, as both require "good" user input to provide satisfactory results; see Fig. 4(e), (f). Fig. 5 shows the segmentation results of the proposed method. The n-links image shows that discontinuities in smooth regions are well enhanced, while those in highly textured regions are suppressed. The proposed method solves the above-mentioned problems and successfully separates the foreground objects from the background object.
Fig. 3. Segmentation of depth image using graph cut. The top row shows different initializations and the bottom row shows the corresponding segmentation results.
Fig. 4. Segmentation of color image using grab cut. The top row shows different initializations and the bottom row shows the corresponding segmentation results.
Fig. 5. The segmentation results of the proposed method. The left column (a) and (d) show the original images, the middle column (b) and (e) are the calculated n-links, and the right column (c) and (f) are the segmentation results.
5 Conclusions
In conclusion, a fully automatic multiview image segmentation algorithm is proposed in this paper and its performance is demonstrated. The algorithm directly couples multiple image cues, including color, depth and motion associated with multiview video data, and finds the optimal segmentation by global energy minimization via multiway cuts. Future work includes extending the proposed algorithm and incorporating it into a video-based rendering system which provides object-level manipulation with all-around viewing freedom. Acknowledgment. This work was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project CUHK415505).
References 1. Kong, D., Tao, H., Gonzalez, H.: Sparse IBR Using Range Space Rendering. In: Proc. British Machine Vision Conf. vol. 1, pp. 181–190 (2003) 2. Zitnick, C.L., Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R.: High-Quality Video View Interpolation using a Layered Representation. ACM Trans. on Graphics 23, 600–608 (2004)
3. Cooke, E., Kauff, P., Sikora, T.: Multi-view Synthesis: A Novel View Creation Approach for Free Viewpoint Video. Signal Processing: Image Communication 21, 476–492 (2006) 4. Fran¸ois, E., Chupeau, B.: Depth-Based Segmentation. IEEE Trans. on Circuits and Systems for Video Technology 7(1), 237–240 (1997) 5. Kang, S.B., Dinh, H.Q.: Multi-Layered Image-Based Rendering. In: Proc. Graphics Interface, pp. 98–106 (1999) 6. Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., Rother, C.: Probabilistic Fusion of Stereo with Color and Contrast for Bilayer Segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(9), 1480–1492 (2006) 7. Doulamis, A.D., Doulamis, N.D., Ntalianis, K.S., Kollias, S.D.: Unsupervised Semantic Object Segmentation of Stereoscopic Video Sequences. In: Proc. International Conf. on Information Intelligence and Systems, pp. 527–533 (1999) 8. Izquierdo, E., Ghanbari, M.: Video Composition by Spatiotemporal Object Segmentation, 3D-Structure and Tracking. In: Proc. IEEE International Conf. on Information Visualization, vol. IV, pp. 194–199 (1999) 9. Boykov, Y., Veksler, O., Zabih, R.: Fast Approximatie Energy Minimization via Graph Cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(11), 1222–1239 (2001) 10. Boykov, Y., Kolmogorov, V.: An Experimental Comparison of Min-Cut/Max-Flow Algorithm for Energy Minimization in Vision. IEEE Trans. on Pattern Analysis and Machine Intelligence 26(9), 1124–1137 (2004) 11. Kolmogorov, V., Zahih, R.: What Energy Functions Can Be Minimized via Graph Cuts? IEEE Trans. on Pattern Analysis and Machine Intelligence 26(2), 147–159 (2004) 12. Boykov, Y., Jolly, M.P.: Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images. In: Proc. International Conf. on Computer Vision, pp. 105–112 (2001) 13. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000) 14. Wang, S., Siskind, J.M.: Image Segmentation with Ratio Cut. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(6), 675–690 (2003) 15. Rother, C., Kolmogorov, V., Blake, A.: GrabCut–Interactive Foreground Extraction using Iterated Graph Cuts. ACM Trans. on Graphics 23(3), 309–314 (2004) 16. Itti, L., Koch, C., Niebur, E.: A Model of Saliency-based Visual Attention for Rapid Scene Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 20(11), 1254–1259 (1998) 17. Strouthopoulos, C., Papamarkos, N.: Multithresholding of Mixed Type Documents. Engineering Application of Artificial Intelligence 13(3), 323–343 (2000) 18. Lucas, B., Kanade, T.: An Iterative Image Registration Technique with an Application to Stereo Vision. In: Proc. of 7th International Joint Conf. on Artificial Intelligence (IJCAI), pp. 674–679 (1981)
Noise Removal from Images by Projecting onto Bases of Principal Components Bart Goossens, Aleksandra Piˇzurica, and Wilfried Philips Ghent University - TELIN - IPI - IBBT Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
Abstract. In this paper, we develop a new wavelet domain statistical model for the removal of stationary noise in images. The new model is a combination of local linear projections onto bases of Principal Components, which perform a dimension reduction of the spatial neighbourhood while avoiding the "curse of dimensionality". The models obtained after projection consist of low dimensional Gaussian Scale Mixtures with a reduced number of parameters. The results show that this technique yields a significant improvement in denoising performance when using larger spatial windows, especially on images with highly structured patterns, like textures.
1 Introduction
Traditional film cameras and digital cameras both produce images contaminated by noise, especially in bad lighting conditions or when the sensors are only shortly exposed to the light. Video sequences transmitted over analogue channels or stored on magnetic tapes can also exhibit high noise levels. During the last decade, large scale digitization of analogue material has been taking place and the removal of noise becomes indispensable, not only to enhance the visual quality but also to improve the compression performance. Recently, multiresolution concepts like wavelets have been used widely due to the sparseness of the representation. In the literature, many wavelet-based methods have been developed for the removal of image noise, e.g. [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Many existing techniques mainly focus on the reduction of white noise, with a flat energy spectral density (ESD). However, in practice the noise is often correlated, due to various post-processing steps in the camera, like Bayer pattern demosaicing and automatic resharpening. Techniques developed for white noise are in general not efficient in this case. Only recently, the GSM-BLS filter has been proposed for dealing with this kind of noise [9, 11]. The GSM-BLS is a vector-based technique that extracts wavelet coefficient vectors in small neighbourhoods (e.g. of size 3 × 3) and models correlations between the components of the vectors. The question arises whether using more of the available local information could improve the denoising performance, e.g. in the presence of structured patterns, like textures. Another, more severe, problem is that the number of samples required for a reliable estimation expands exponentially with the size of the local neighbourhood, or the dimension of the extracted
coefficient vectors. Already for neighbourhoods larger than 3 × 3 (i.e. 10 dimensions or more), this number becomes extremely large. This effect has been termed the "curse of dimensionality" [12], and is a significant obstacle in e.g. neural network training [13] and multivariate density estimation [14]. Methods based on clustering and dimension reduction avoid this problem by finding the manifolds in the high-dimensional space on which the data resides. Principal Component Analysis (PCA) [15] is a popular technique for dimension reduction, and is discussed in most textbooks on multivariate analysis. For a set of observed d-dimensional data vectors, the q principal axes are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the corresponding basis vectors, called Principal Components, are the q dominant eigenvectors of the data sample covariance matrix. By projecting onto this basis, a q-dimensional description of the data is obtained. In this paper, we develop a dimension reduced statistical model in the wavelet domain, by projecting onto a basis of Principal Components. When combining several of these models in a locally adaptive manner, we obtain a higher dimensional model with fewer degrees of freedom than the corresponding non-dimension reduced model. This way, we avoid the "curse of dimensionality", potentially allowing larger spatial windows. The method here is proposed for the wavelet domain, but can be extended to the spatial domain as well (like the PCA-based spatial domain denoising technique in [16]). Working in the wavelet domain allows us to use fixed, relatively small window sizes on each scale, jointly corresponding to a large window in the pixel domain, thus offering computational advantages. This paper is organized as follows: in Section 2.1, we introduce the wavelet domain signal-plus-noise model. In Section 2.2, we introduce the dimension reduced Gaussian Scale Mixture model. We extend this to mixtures in Section 3 and describe an EM algorithm to estimate the mixture model parameters. We derive a Bayesian estimator in Section 4. Results are given in Section 5 and the conclusion in Section 6.
2 Signal-Plus-Noise Model

2.1 Original Gaussian Scale Mixture Model
By the linearity of the wavelet transform, the following relationship holds between the noise-free coefficients x_j, the noise n_j and the observed noisy coefficients y_j on a given scale and orientation:

y_j = x_j + n_j    (1)
where a one-dimensional index j denotes the spatial position (like in raster scanning). The vectors xj , nj and yj , random process realizations of respectively x, n and y, are formed by extracting wavelet coefficients in a local M × M window at position j. Hence the dimension of the model is d = M 2 . We further
assume that the noise n is stationary and Gaussian with known covariance (if the noise covariance matrix for a given wavelet band is not known in advance, it can be estimated from the observed noisy wavelet coefficients using techniques as in [11, 17]), but not necessarily white. It is well known that the discrete wavelet transform does not fully decorrelate the signal. Noise-free wavelet coefficients exhibit strong local correlations (see, e.g., [9]). These correlations are typically the strongest in the direction of the edges on a particular wavelet transform band. Note that there are also interscale dependencies between wavelet coefficients, which have been studied in e.g. [2, 18, 8]. In this work we will focus on characterizing dependencies within the same band. By the sparseness of the representation, marginal pdfs of noise-free wavelet coefficients are typically unimodal, symmetric around the mode and highly kurtotic (i.e. with a sharper peak than the Gaussian). These effects can be modeled using elliptically symmetric distributions, like Gaussian Scale Mixtures (GSM) [9] (see Fig. 1). A random variable x conforms to a GSM model if it can be written (in distribution) as the product of a zero mean Gaussian random vector u and an independent scalar random variable z^(1/2), where z ≥ 0:

x = z^(1/2) u    (2)
Prior models for the hidden variable z involve Jeffrey’s noninformative prior [9], the exponential distribution [19], and the Gamma distribution [20, 17].
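For intuition, the following few lines draw samples from a bivariate GSM, x = z^(1/2) u, using an exponential prior for z (one of the choices listed above); the covariance C_u and the rate are illustrative values only.

import numpy as np

rng = np.random.default_rng(0)
N = 10000
C_u = np.array([[1.0, 0.8],
                [0.8, 1.0]])                 # covariance of the Gaussian factor u
u = rng.multivariate_normal(np.zeros(2), C_u, size=N)
z = rng.exponential(scale=1.0, size=N)       # hidden multiplier z >= 0
x = np.sqrt(z)[:, None] * u                  # GSM samples: heavy-tailed, elliptical contours
print(np.cov(x.T))                           # approximately E(z) * C_u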
Fig. 1. Probability density function of a bivariate Gaussian Scale Mixture
2.2 Dimension Reduced Gaussian Scale Mixture Model
To reduce the dimension of the model, we decompose the observation vector y into two components (see, e.g., [15]):

y = W t + W̄ r    (3)
where t is a q-dimensional zero mean random vector (q < d) with covariance C_t, and the residual r is a (d − q)-dimensional zero mean Gaussian random vector with diagonal covariance Ψ, independent of t. W is a d × q matrix, the columns of which are orthonormal basis vectors of the low-dimensional space W. W̄ is a d × (d − q) matrix, containing the orthonormal basis vectors of the orthogonal complementary subspace W⊥. The random vector t represents the observation in the dimension reduced space, and conforms to the observation model from Section 2.1:

t = v + n = z^(1/2) u + n    (4)

The covariance matrix of the observation vectors can be expressed in terms of the covariance matrices of the projected components t and n:

C_y = W C_t Wᵀ + W̄ Ψ W̄ᵀ    (5)
where C_t = E(z) C_u + C_n. C_u and C_n are the q × q covariance matrices of respectively u and n. Using the orthogonality (i.e., Wᵀ W = I and Wᵀ W̄ = 0), relationship (5) can be inverted:

C_t = Wᵀ C_y W  and  Ψ = W̄ᵀ C_y W̄    (6)
Since Ψ is diagonal, only correlations between coefficients within the dimension reduced space W are considered. In the complementary space, coefficients are assumed to be uncorrelated. This means that we should select the basis vectors of W such that the strongest correlations between the coefficients are captured. We therefore estimate the projection bases from the observed data by maximisation of the log-likelihood function, defined by:

L = log f(y) = log ∫₀^∞ f(y|z) f(z) dz    (7)

The integral in the complete data log-likelihood hampers the direct maximization of (7). Therefore, we apply Jensen's inequality, which results in the lower bound L′ ≤ L to maximize:

L′ = ∫₀^∞ f(z) log f(y|z) dz    (8)
For the assumed model (see Section 2.2), L′ can be written as:

L′ = −(N/2) ∫₀^∞ f(z) [ log|C_{y|z}| + d log(2π) + tr(C_{y|z}⁻¹ S) ] dz    (9)

with C_{y|z} = W(z C_u + C_n)Wᵀ + W̄ Ψ W̄ᵀ and S = (1/N) Σ_{j=1}^N y_j y_jᵀ the sample covariance matrix. To find the orthogonal projection that maximizes L′, we look for the stationary points of L′ by taking the gradient of L′ with respect to W:

∂L′/∂W = −N ∫₀^∞ f_z(z) C_{y|z}⁻¹ ( S C_{y|z}⁻¹ W − W ) (z C_u + C_n) dz = 0    (10)

Unfortunately, due to the dependence of the integrand in (10) on z, a solution where W is orthogonal and Ψ is diagonal is not trivial to find in general.
Therefore, we solve this equation for the most likely z (i.e. E(z)) instead of integrating over z. Equation (10) becomes:

C_y⁻¹ ( S C_y⁻¹ W − W ) C_t = 0    (11)

with solutions given by C_y = S and S C_y⁻¹ W = W (for C_y ≠ S) [21]. Substituting the first solution C_y = S in (6) results in:

C_t = Wᵀ S W  and  Ψ = W̄ᵀ S W̄    (12)
Next, we require that Ψ is diagonal, and solve (12) for W, while minimizing the determinant of Ψ. Applying Singular Value Decomposition (SVD) to the positive definite matrix S = U Λ Uᵀ yields that W̄ᵀ U Λ Uᵀ W̄ must be diagonal, which is satisfied if Uᵀ W̄ = Ī. W̄ must be a matrix with eigenvectors of S on its columns. We can minimize |Ψ| by selecting the eigenvectors that correspond to the smallest eigenvalues. By this choice, W will contain the eigenvectors with the largest eigenvalues, also called Principal Components. The covariance matrices of the projected observed data are found using:

C_t = diag{λ₁, ..., λ_q}  and  Ψ = diag{λ_{q+1}, ..., λ_d}    (13)

where diag{·} constructs a diagonal matrix, and λ_i = Λ_ii, i = 1, ..., d. Finally, the solution C_y = S corresponds to an exact covariance model for the dimension reduced data, which in practice will rarely be the case. In [21], it is shown that the second solution of (11) also leads to the basis of Principal Components for W.
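In code, the selection of W, W̄, C_t and Ψ in Eqns. (12)–(13) amounts to an eigendecomposition of the sample covariance S of the observed coefficient vectors, as in the following NumPy sketch (zero-mean data is assumed):

import numpy as np

def dimension_reduced_basis(Y, q):
    # Y: (N, d) matrix of observed coefficient vectors
    S = Y.T @ Y / Y.shape[0]                 # sample covariance
    lam, U = np.linalg.eigh(S)               # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    W, W_bar = U[:, :q], U[:, q:]            # principal / complementary bases
    C_t = np.diag(lam[:q])                   # Eqn. (13)
    Psi = np.diag(lam[q:])
    return W, W_bar, C_t, Psi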
3 Mixtures of Dimension Reduced Models

3.1 Introduction
The Principal Component Analysis from the previous Section defines a linear projection of the data and may still require a large number of components to store most of the variance. By mixing dimension reduced models, we attempt to retain a greater proportion of the variance using fewer dimensions [21]. Alternatively, this allows us to make the principal component basis spatially adaptive. Consider a set of k = 1, ..., K dimension reduced models conforming to y = W_k t + W̄_k r. Mixtures are obtained by:

f(y) = Σ_{k=1}^K P(H_k) f(y|H_k) = Σ_{k=1}^K P(H_k) f(t|H_k) f(r|H_k)    (14)
where H_k denotes the hypothesis that sub-model k is the "correct" one, i.e. the most likely according to the observed data. The posterior probability of a given sub-model is:

P(H_k|y) = f(y|H_k) P(H_k) / f(y) = P(H_k) f(t|H_k) f(r|H_k) / Σ_{l=1}^K P(H_l) f(t|H_l) f(r|H_l)    (15)

Given the observation at spatial position j, the probability that the projection basis is W_k is P(H_k|y_j), and hence depends indirectly on the spatial position. This shows the locally adaptive character of the mixture model.
3.2 Mixture Model Parameter Estimation
In this Section, we estimate the model parameters (i.e. the projection bases W_k and the model probabilities P(H_k), k = 1, ..., K) using the EM algorithm [22]. The EM algorithm is a general method for finding the maximum likelihood estimate of the model parameters Θ when the data has missing values. In this case, the model choice k is the missing variable. Given an initial estimate of the model parameters Θ^(0), the EM algorithm first finds the expected value of the complete-data log-likelihood function log f(y, k|Θ), with respect to the observed data y (E-step):

Q(Θ, Θ^(i−1)) = E[ log f(y, k|Θ) | y, Θ^(i−1) ]    (16)

Next, the M-step maximizes the expectation computed in the E-step:

Θ^(i) = arg max_Θ Q(Θ, Θ^(i−1))    (17)
These two steps are repeated until the algorithm converges to a local maximum of the likelihood function, since each iteration increases the log-likelihood function of the observed data. We denote the mixing weights as π_k = P(H_k), with the constraint Σ_{k=1}^K π_k = 1. For our model, the M-step consists of:

π̂_k^(i) = (1/N) Σ_{j=1}^N α_{k,j}^(i)    (18)

S_k^(i) = Σ_{j=1}^N α_{k,j}^(i) y_j y_jᵀ / Σ_{j=1}^N α_{k,j}^(i)    (19)
where the posterior probabilities α_{k,j}^(i) = P(H_k | y_j, Θ^(i−1)) are obtained using Bayes' rule:

α_{k,j}^(i) = P(H_k | y_j, Θ^(i−1)) = π_k^(i−1) f(y_j | H_k, Θ^(i−1)) / Σ_{l=1}^K π_l^(i−1) f(y_j | H_l, Θ^(i−1))    (20)
Similarly to Section 2.2, but now conditioned on k, we obtain the following covariances:

C_{t,k}^(i) = W_k^(i)ᵀ S_k^(i) W_k^(i)  and  Ψ_k^(i) = W̄_k^(i)ᵀ S_k^(i) W̄_k^(i)    (21)

The basis vectors of W_k^(i) and W̄_k^(i) are selected using the Singular Value Decomposition of S_k^(i), as explained in Section 2.2.
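A schematic EM iteration implementing Eqns. (18)–(21) is sketched below; the per-model likelihood evaluation is left as a user-supplied function loglik(Y, basis), since it depends on the prior chosen for z, and all names are illustrative.

import numpy as np

def em_step(Y, pi, bases, loglik, q):
    # Y: (N, d) data; pi: (K,) mixing weights; bases: list of dicts with keys 'W','W_bar','C_t','Psi'
    K = len(pi)
    logp = np.stack([np.log(pi[k]) + loglik(Y, bases[k]) for k in range(K)], axis=1)
    logp -= logp.max(axis=1, keepdims=True)                    # numerical stabilization
    alpha = np.exp(logp)
    alpha /= alpha.sum(axis=1, keepdims=True)                  # responsibilities, Eqn. (20)
    new_pi = alpha.mean(axis=0)                                # Eqn. (18)
    new_bases = []
    for k in range(K):
        Sk = np.einsum('n,ni,nj->ij', alpha[:, k], Y, Y) / alpha[:, k].sum()   # Eqn. (19)
        lam, U = np.linalg.eigh(Sk)
        order = np.argsort(lam)[::-1]
        lam, U = lam[order], U[:, order]
        new_bases.append({'W': U[:, :q], 'W_bar': U[:, q:],
                          'C_t': np.diag(lam[:q]), 'Psi': np.diag(lam[q:])})   # Eqn. (21)
    return new_pi, new_bases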
4 Bayesian Estimation of the Noise-Free Coefficients
In this Section, we estimate the noise-free wavelet coefficient vector from an observed noisy coefficient vector (i.e. denoising). The Bayesian approach imposes
a prior distribution on the noise-free wavelet coefficients. In this application the projected data v is modeled using a Gaussian Scale Mixture (see Section 2). If the correct observation model is model k, the Minimum Mean Square Error (MMSE) estimator for the noise-free coefficients in the dimension reduced model is equivalent to that for the observation model in [9]:

v̂_k = E(v | t, H_k) = ∫₀^∞ f(z | t, H_k) z C_{v,k} (z C_{u,k} + C_{v,k})⁻¹ t_k dz    (22)
Here, v̂_k is a weighted average of local Wiener solutions in the dimension reduced space k. If q is very small compared to d, or when a large proportion of the energy still lies in the orthogonal complement of the principal subspace, it may be necessary to estimate in the complementary space as well. If we denote r_k = √z ρ_k + ω_k, with respective covariances Ψ_k, P_k and Ω_k, we estimate ρ_k using:

ρ̂_k = E(ρ | r, H_k) = z P_k (z P_k + Ω_k)⁻¹ r_k    (23)

By the diagonality of the covariance matrices in (23), each component can be estimated independently. To estimate the noise-free wavelet coefficient vector in the observation space, we average over the solutions of all K local models:

x̂ = E(x|y) = Σ_{k=1}^K P(H_k|y) x̂_k = Σ_{k=1}^K P(H_k|y) W_k v̂_k + Σ_{k=1}^K P(H_k|y) W̄_k ρ̂_k    (24)
We note that the basis change using the linear orthonormal transform [W_k W̄_k] is in fact a rotation of the coordinate system. This rotation does not alter the mean squared error metric, thus we obtain the global MMSE solution for our prior model.
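To summarize the estimator, the following sketch applies Eqns. (22)–(24) to a single coefficient vector, with the integral over z collapsed to one representative value z0 for readability (the full estimator averages many such Wiener solutions over f(z|t, H_k)); the covariance names follow the reduced-space signal/noise notation C_u, C_n, which is an assumption about the intended subscripts in Eqn. (22).

import numpy as np

def denoise_vector(y, post, bases, z0=1.0):
    # y: (d,) noisy vector; post: (K,) posteriors P(H_k|y); bases[k] holds
    # W, W_bar, C_u, C_n (reduced-space covariances) and P, Omega (diagonals as 1-D arrays).
    x_hat = np.zeros_like(y)
    for k, b in enumerate(bases):
        t = b['W'].T @ y
        r = b['W_bar'].T @ y
        v_hat = z0 * b['C_u'] @ np.linalg.solve(z0 * b['C_u'] + b['C_n'], t)   # cf. Eqn. (22)
        rho_hat = z0 * b['P'] / (z0 * b['P'] + b['Omega']) * r                 # Eqn. (23), diagonal
        x_hat += post[k] * (b['W'] @ v_hat + b['W_bar'] @ rho_hat)             # Eqn. (24)
    return x_hat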
5 Results
In Fig. 2, we illustrate the prior model fitting to noise-free data for a zebra texture. In Fig. 2.b, this image is filtered vertically using the highpass filter corresponding to the Daubechies’ wavelet of length 4. As initial projection bases for the EM-algorithm from Section 3.2, we use combinations of basis vectors that are rows of the identity matrix. Fig. 2.c shows a scatter plot of 3 neighbouring coefficients, including the basis of two Principle Components. We illustrate this for only three dimensions to allow a visual representation. Fig. 2.d is obtained by projecting Fig. 2.c onto the basis marked in Fig. 2.c. The data clouds are fitted using the GSM model introduced in Section 2. It is clear that the contours of the joint histogram are not elliptically contoured. Therefore, by using mixtures of dimension reduced models (Section 3), we obtain a better fitting in Fig. 2.e. Here we have projected the result on the same basis as in Fig. 2.d, to allow proper comparison. In this case, both the original three-dimensional GSM model (with contours as in Fig. 2.d) and the mixture model (Fig. 2.e) have 6 parameters, although the mixture model provides a better fitting to the data. In Fig. 3, we
Fig. 2. Prior model fitting on a noise-free image (a) the original image (b) highpass wavelet band from (a) (c) three-dimensional scatter plot of neighbouring wavelet coefficients from (b), with the two most dominant eigenvectors (Principle Components) of the data covariance matrix (d) two-dimensional projection of the data from (c) by projecting onto the basis of Principle Components, and the contours of the fitted GSM model (ellipses) (Section 2) (e) fitted model consisting of K = 2 two-dimensional GSM models (see Section 3)
Fig. 3. Denoising results for white noise: crop outs of the Barbara image, for σ = 25. From left to right: the original image, the noisy image, GSM-BLS [9], the proposed method using DT-CWT.
Table 1. Denoising results (PSNR_out in dB) for white noise with standard deviation σ

  Image     σ    PSNR_in [dB]   Pižurica06 [10]   Şendur02 [8]   Portilla03 [9]   Proposed
  Barbara   5    34.15          37.75             37.10          38.28            38.48
            15   24.61          31.46             31.28          32.20            32.65
            25   20.17          28.45             28.63          29.31            29.95
  Lena      5    34.15          38.18             38.01          38.49            38.59
            15   24.61          33.23             33.58          33.89            33.83
            25   20.17          30.87             31.35          31.67            31.51
  House     5    34.15          38.04             38.01          38.66            39.26
            15   24.61          32.69             33.01          33.63            33.65
            25   20.17          30.18             30.74          31.36            31.19
assess the impact of the improvement in modeling accuracy using the Barbara image corrupted by white noise with σ = 25. Our method uses the Dual Tree Complex Wavelet Transform (DT-CWT) from [23], with 6 tap Q-shift filters, local windows of size 5 × 5, dimension reduction parameter q = 16 and K = 4. The visual results show that the edges and textures can be better reconstructed. In Table 1, we compare the proposed method with current wavelet domain stateof-the-art denoising algorithms. The method of [10] uses an undecimated wavelet transform, with the Symmlet of length 16. In [8], the DT-CWT is also used. The GSM-BLS filter from [9] uses Full Steerable Pyramids, with 8 orientations and a 3 × 3 local window. Our method is very competitive to the technique from [9], and performs significantly better in the presence of strong edges or patterns.
6 Conclusion
In this paper we developed a dimension reduced Gaussian Scale Mixture model, that allows a higher dimensionality while avoiding the ”curse of dimensionality”. Combining different dimension reduced models adapts the GSM model to the spatial context. This results in globally non-linear model with relatively few free parameters, while not imposing a too strong constraint on the overall covariance structures of the signal and the noise. The results show that this technique leads to an improvement in denoising performance by using a prior model that better deals with highly structured patterns, like textures.
References 1. Donoho, D.L.: De-Noising by Soft-Thresholding. IEEE Trans. Inform. Theory 41, 613–627 (1995) 2. Crouse, M., Nowak, R., Baraniuk, R.: Wavelet-based statistical signal processing using Hidden Markov Models. IEEE. Trans. Signal Processing 46, 886–902 (1998) 3. Mih¸cak, M.K.: Low-complexity Image Denoising based on Statistical Modeling of Wavelet Coefficients. IEEE Signal Processing Letters 6(12), 300–303 (1999)
4. Chang, S., Yu, B., Vetterli, M.: Spatially Adaptive Wavelet Thresholding with Context Modeling for Image Denoising. IEEE Trans. Image Process. 9, 1522–1531 (2000) 5. Liu, J., Moulin, P.: Complexity-Regularized Image Denoising. IEEE Trans. on Image Processing 10(6), 841–851 (2001) 6. Fan, G., Xia, X.: Image denoising using local contextual hidden Markov model in the wavelet domain. IEEE Signal Processing Letters 8(5), 125–128 (2001) 7. Piˇzurica, A., Philips, W., Lemahieu, I., Acheroy, M.: A joint inter- and intrascale statistical model for Bayesian wavelet based image denoising. IEEE Trans. Image Processing 11(5), 545–557 (2002) 8. S ¸ endur, L., Selesnick, I.: Bivariate Shrinkage with Local Variance Estimation. IEEE Signal Processing Letters 9, 438–441 (2002) 9. Portilla, J., Strela, V., Wainwright, M., Simoncelli, E.: Image denoising using Gaussian Scale Mixtures in the Wavelet Domain. IEEE Trans. Image Processing 12, 1338–1351 (2003) 10. Piˇzurica, A., Philips, W.: Estimating the probability of the presence of a signal of interest in multiresolution single- and multiband image denoising. IEEE Trans. Image Process 15(3), 654–665 (2006) 11. Portilla, J.: Full Blind Denoising through Noise Covariance Estimation using Gaussian Scale Mixtures in the Wavelet Domain. In: Proc. Int. Conf. on Image Processing (ICIP), vol. 2, pp. 1217–1220 (2004) 12. Bellman, R.E.: Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ (1961) 13. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995) 14. Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach toward Feature Space Analysis. IEEE Trans. Pattern Analysis Machine Intell. 24(5), 603–619 (2002) 15. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (1986) 16. Muresan, D.D., Parks, T.W.: Adaptive Principal Components and Image Denoising. In: Proc. Int. Conf. on Image Processing (ICIP) (2003) 17. Goossens, B., Piˇzurica, A., Philips, W.: Noise Reduction of Images with Correlated Noise in the Complex Wavelet Domain. In: IEEE BENELUX/DSP Valley Signal Processing Symposium SPS-DARTS, Antwerp, IEEE, Los Alamitos (2007) 18. Wainwright, M.J., Simoncelli, E.P., Willsky, A.S: Random Cascades on Wavelet Trees and their use in modeling and analyzing natural images. Applied Computational and Harmonic Analysis 11(1), 89–123 (2001) 19. Selesnick, I.W.: Laplace Random Vectors, Gaussian Noise, and the Generalized Incomplete Gamma Function. In: Proc. Int. Conf. on Image Processing (ICIP), pp. 2097–2100 (2006) 20. Srivastava, A., Liu, X., Grenander, U.: Universal Analytical Forms for Modeling Image Probabilities. IEEE Trans. Pattern Analysis and Machine Intelligence 24(9), 1200–1214 (2002) 21. Tipping, M.E., Bishop, C.M.: Mixtures of Probabilistic Principal Component Analysers. Neural Computation 11(2), 443–482 (1999) 22. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 19(1), 1–38 (1977) 23. Kingsbury, N.G.: Complex Wavelets for Shift Invariant Analysis and Filtering of Signals. Journal of Applied and Computational Harmonic Analysis 10(3), 234–253 (2001)
A Multispectral Data Model for Higher-Order Active Contours and Its Application to Tree Crown Extraction P´eter Horv´ath University of Szeged, Institute of Informatics, P.O. Box 652, H-6701 Szeged, Hungary, Fax:+36 62 546 397
[email protected] Ariana (joint research group CNRS/INRIA/UNSA), INRIA, B.P. 93, 06902 Sophia Antipolis, France, Fax:+33 4 92 38 76 43
Abstract. Forestry management makes great use of statistics concerning the individual trees making up a forest, but the acquisition of this information is expensive. Image processing can potentially both reduce this cost and improve the statistics. The key problem is the delineation of tree crowns in aerial images. The automatic solution of this problem requires considerable prior information to be built into the image and region models. Our previous work has focused on including shape information in the region model; in this paper we examine the image model. The aerial images involved have three bands. We study the statistics of these bands, and construct both multispectral and single band image models. We combine these with a higher-order active contour model of a ‘gas of circles’ in order to include prior shape information about the region occupied by the tree crowns in the image domain. We compare the results produced by these models on real aerial images and conclude that multiple bands improves the quality of the segmentation. The model has many other potential applications, e.g. to nanotechnology, microbiology, physics, and medical imaging.
1 Introduction

Successful forestry management depends on knowledge of a number of statistics connected to forest structure. Among these are the number and density of trees in a forest, their average size, and changes in these quantities over time. High-resolution remote sensing images, and in particular colour infrared (CIR) aerial images, can facilitate the acquisition of these statistics by providing images from which tree crowns can be identified and counted, and their areas and shapes analysed. The task of manually extracting this information from aerial images, or worse, measuring the statistics in the field, is, however, labour intensive, which limits the extent to which it can be used. Typically, information can be extracted at tree stand resolution but not below, simply due to the time and cost involved. Image processing methods capable of extracting the same information automatically would therefore be of great use.
This work was partially supported by EU project MUSCLE (FP6-507752), Egide PAI Balaton, OTKA T-046805, and a HAS Janos Bolyai Research Fellowship. We thank the French National Forest Inventory (IFN) for the data.
The problem of inferring automatically the region in the image domain corresponding to tree crowns given the image data is, however, not simple to solve. Like all inference problems, it can be phrased probabilistically. The quantity of interest in P(R|I, K), the probability that region R in the image domain corresponds to tree crowns given the image data I and any prior knowledge K we may choose to include. This is proportional to P(I|R, K)P(R|K), and thus we must construct models of the image to be expected given knowledge of where the tree crowns are, and of the possible regions corresponding to tree crowns in the absence of the data. We discuss the latter first. It might be thought that the region model P(R|K) could be rather generic, but this turns out not to be the case. Trees are not always easily distinguished from the background using the data alone, and so a prior model of R that incorporates enough knowledge to disambiguate these situations is required. Fortunately, we have a great deal of prior knowledge about the type of region to be expected. In this paper, we will focus on plantations, that is, collections of trees that do not often overlap and that are of the same species and roughly the same age. In this case, which is of great importance in practice, the region corresponding to tree crowns will consist of an unknown number of connected components corresponding to different trees, each connected component being a circular shape with a certain radius. Horv´ath et al. [1,2] addressed the extraction of tree crowns from CIR images. They constructed a model of such regions, called the ‘gas of circles’ model, using the higher-order active contour framework proposed in [3]. In this paper we use the same prior model. Horv´ath et al. [1,2] also described a data model. This model described the behaviour of only one band of the three available bands in the CIR images. The model was Gaussian, with the values at different pixels independent, and with different means and variances for tree crowns and the background. While successful, this model, even with the strong region prior, was not capable of extracting accurately the borders of all trees. Some trees were simply too similar to the background. The purpose of this paper is to construct a new data model that makes use of all three bands in the CIR images. We study the improvement or otherwise of the extraction results produced by modelling the three bands as independent or as correlated. As we will see, even at the level of maximum likelihood, the inclusion of ‘colour’ information, and in particular, interband correlations, can improve the results, and in conjunction with the region prior, the full model is considerably better than that based on one band alone. In the next subsection 1.1, we briefly review previous work on tree crown extraction and on region modelling. Then in section 2, we recall notions of higher-order active contours and describe the ‘gas of circles’ model, with the emphasis on a method to fix all but one of its parameters. In section 3, we study four possible ‘colour’ data models and compare them. In section 4, we describe the gradient descent algorithm used to minimize the full active contour energy. In section 5, we show experimental results on CIR images. In section 6, we sum up. 1.1 Previous Work The problem of delineating, locating or counting individual trees in high resolution aerial images has been studied in several papers. 
Different approaches have been proposed based on template matching [4], collections of rules [5], contours [6,7], mathematical
morphology [8] and stochastic geometry [9]. Although the input to many of these approaches consists of multispectral images, usually only the infrared band is used for tree crown extraction. One way to use the multispectral information is through spectral signatures of various types, thoroughly reviewed and compared in [10]. For example, ‘tree colour lines’ approximate the cigar-shaped distribution of tree crown pixels in ‘colour space’ by a line. In this paper we study probabilistic models, based on multispectral histograms and Gaussian distributions. Previous work on region modelling, for example [11,12,13,14], is in general not suitable for the tree crown extraction problem. This is because it focuses on small variations of a region around a small number (usually one) of template regions. This means that regions with high probability lie in a bounded subset of the space of regions close to the template(s). The regions corresponding to tree crowns have an unknown number of connected components, and hence cannot easily be described by such approaches. ‘Higher-order active contours’ (HOACs) [3] provide a complementary approach. In active contour models, higher-order or not, a region R is represented by its boundary ∂R. P(R|I, K) is constructed implicitly, via an energy functional E(∂R) = Eg (∂R) + Ei (∂R, I), where Eg and Ei correspond to prior and likelihood. In classical active contours[15], Eg is constructed from single integrals over the boundary or region. Euclidean invariance then requires that Eg be a linear combination of the length of the boundary and the area of the region, since these are the only Euclidean invariant energies that can be constructed using single integrals over the boundary if the curvature is not used. Thus these energies incorporate only local differential-geometric information. HOACs generalize these classical energies to include multiple integrals over ∂R. Thus HOAC energies explicitly include long-range interactions between tuples of boundary points without the use of a template, which allows the inclusion of sophisticated prior knowledge while permitting the region to have an arbitrary number of connected components, which furthermore may interact amongst themselves. Euclidean invariance is intrinsic, with no pose estimation required. The approach is very general: classical energies are linear functionals on the space of regions; HOAC energies include all polynomial functionals. Rochery et al. [3] applied HOACs to road network extraction, while Horv´ath et al. [1,2] extended the model to describe a ‘gas of circles’, and applied it to tree crown extraction. We describe this model in the next section.
2 The ‘Gas of Circles’ Model

HOAC energies generalize classical active contour energies by including multiple integrals over the contour. The simplest such generalizations are quadratic energies, which contain double integrals. There are several forms that such multiple integrals can take, depending on whether or not they take into account contour direction at the interacting points. The Euclidean invariant version of one of these forms is [3]

E_g(\gamma) = \lambda L(\gamma) + \alpha A(\gamma) - \frac{\beta}{2} \iint dp\, dp'\; t(p) \cdot t(p')\, \Psi(r(p, p')),   (1)

where γ is the contour, a representative map in the equivalence class of embeddings representing ∂R, and thereby R; p and p′ are parameters for γ; L is the length of ∂R; A
Fig. 1. Gradient descent evolution using Eg alone, from initial (left) to final, stable (right) state (r0 = 5, α = 2, β = 1.69 and d = 5)
is the area of R; r(p, p′) = |γ(p) − γ(p′)|; t = γ̇ is the (unnormalized) tangent vector to the contour; and Ψ is an interaction function that determines the geometric content of the model. With an appropriate choice of interaction function Ψ, the quadratic term creates a repulsion between antiparallel tangent vectors. This has two effects. First, for particular ranges of α, β, and d (λ = 1 wlog), circular structures, with a radius r0 dependent on the parameter values, are stable to perturbations of their boundary. Second, such circles repel one another if they approach closer than 2d. Regions consisting of collections of circles of radius r0 separated by distances greater than 2d are thus local energy minima. The model with parameters in the appropriate ranges is called the ‘gas of circles’ model [1]. Via a stability analysis, Horváth et al. [1] found the ranges of parameter values rendering circles of a given radius stable as functions of the desired radius. Stability, however, created its own problems, as circles sometimes formed in places where there was no supportive data. To overcome this problem, in [2], the criterion that circles of a given radius be local energy minima was replaced by the criterion that they be points of inflexion. As well as curing the problem of ‘phantom’ circles, this revised criterion allowed the fixing of the parameters α, β, and d as functions of the desired circle radius, leaving only the overall strength of the prior term, λ, unknown. For energy-based models, parameter adjustment is a problem, so this is a welcome advance. To illustrate the behaviour of the prior model, figure 1 shows the result of gradient descent starting from the region on the left. Note that there is no data term. The parameter values in these experiments render the circles involved stable. With the parameter values calculated in [2], they would disappear.
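To make the geometry of Eq. (1) concrete, the following sketch evaluates the quadratic HOAC energy for a polygonal discretization of a closed contour. The interaction function `psi` used here is a generic smooth cutoff of range 2d chosen only for illustration (it is not the Ψ of [1,2,3]), and the discrete double sum stands in for the double integral.

```python
import numpy as np

def hoac_energy(gamma, lam, alpha, beta, d):
    """Numerically evaluate a quadratic HOAC energy of the form of Eq. (1) for a
    closed polygonal contour gamma (N x 2 array of vertices). The interaction
    function is an assumed smooth cutoff, used only for illustration."""
    t = np.roll(gamma, -1, axis=0) - gamma           # unnormalized tangent vectors
    length = np.sum(np.linalg.norm(t, axis=1))       # L(gamma)
    x, y = gamma[:, 0], gamma[:, 1]
    xs, ys = np.roll(x, -1), np.roll(y, -1)
    area = 0.5 * np.abs(np.sum(x * ys - xs * y))     # A(gamma), shoelace formula

    def psi(r):                                      # assumed interaction function
        z = np.clip(r / (2.0 * d), 0.0, 1.0)
        return 0.5 * (1.0 + np.cos(np.pi * z))       # decays smoothly to 0 at r = 2d

    diff = gamma[:, None, :] - gamma[None, :, :]     # pairwise gamma(p) - gamma(p')
    r = np.linalg.norm(diff, axis=2)
    tdot = t @ t.T                                   # t(p) . t(p')
    quad = np.sum(tdot * psi(r))                     # double sum ~ double integral
    return lam * length + alpha * area - 0.5 * beta * quad

# Example: a circle of radius 5 sampled at 200 points
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = 5.0 * np.stack([np.cos(theta), np.sin(theta)], axis=1)
print(hoac_energy(circle, lam=1.0, alpha=2.0, beta=1.69, d=5.0))
```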
3 Aerial Images and Image Models

The previous section described the prior model of regions that we will use. In this section, we examine the data, and study data models that use all three bands of the CIR image data. The wavelengths of the three bands lie between 520 nm and 900 nm approximately, with shifted colour bands used to add false colour to the images. Notice that the blue band in the images corresponds to green in reality, green to red, and red to photographic or very near infrared (700–900 nm). Figure 2(a) shows a typical CIR aerial image falsely coloured for display purposes. It is of a poplar stand. Figure 2(b) shows the infrared band of the image. To see how colour can help, note that the bright pixels in the spaces between the trees are light grey in the colour image, while the trees are red. In the greyscale image, they have roughly
Fig. 2. (a): typical CIR aerial image of a poplar plantation; (b): greyscale version of the image; (c): ground truth used for statistics; (d): another CIR image; (e): corresponding ground truth. Images © French National Forest Inventory (IFN).
the same intensity, making separation of trees and background difficult. Although the prior model helps to disambiguate these situations, it is not always successful, and it makes sense to consider a data model that uses the available information to the full. We want to construct a data model for the observed CIR image, given that region R corresponds to tree crowns. We can divide the image I (a three-component quantity) into two pieces, I_R and I_R̄ corresponding to the tree crowns and the background. Then we have that P(I|R, K) = P(I_R, I_R̄ |R, K). Without further information, I_R and I_R̄ are not independent given R: illumination for example will link them together. However, we may introduce parameters for the two regions, θ_R and θ_R̄, so that the two pieces of the image become independent given these parameters. We note that the size of the tree crowns (∼10 pixels), coupled with the resolution of the image, does not allow the definition of meaningful texture features. Thus we will assume, without real justification, that the image values at different pixels are independent. Refinements to this assumption, for example, tree crown profiles, will be considered in future work. The data model then takes the form

P(I\,|\,R, \theta_R, \theta_{\bar R}, K) = \prod_{x \in R} P(I_R(x)\,|\,\theta_R, K) \prod_{x \in \bar R} P(I_{\bar R}(x)\,|\,\theta_{\bar R}, K).
To help us design the model for individual pixels, we examine the statistics of the pixel values in the different bands. Figure 3 shows histograms of the pixel values in figure 2(a) for all three bands, separated into tree crown and background based on a manual labelling shown in figure 2(c). As expected, the infrared band shows the largest separation. Can adding the other two bands help? To test this idea, we performed four different types of maximum likelihood classification, based on four different estimates of the probability distributions for individual pixels of each class. Two of these estimates use raw histograms with different bin sizes. Of these, one is constructed as a product of the individual histograms for each band (independent bands), called HI for short, while the other uses the colour histogram (HC). The other two estimates use Gaussian models, either with covariances diagonal in colour space (independent bands), called GI, or with full covariances (G3D). The model parameters were learned from figure 2(a) and figure 2(d) based on the manual labelling. The resulting models were then used to classify the image in figure 2(a). The results of maximum likelihood classification on the same image are shown in figure 4. The images have four different colours: black and dark grey correspond to
Fig. 3. Histograms of pixel values in the three bands of figure 2(a), based on the manual labelling shown in figure 2(c). Green is background; blue is tree crown.
Fig. 4. Maximum likelihood classifications of figure 2(a) using the different models trained on the same image (panels, left to right: HI n = 64, HI n = 128, HC n = 64, HC n = 128, GI, G3D)
correct classification of background and tree crowns respectively, while light grey and white correspond to incorrect classifications in which tree crowns were classified as background and vice-versa respectively. Table 1 left shows the resulting classification error rates. Naturally, the results using HC are almost perfect. The number of bins is very large, and this means that there are unlikely to be more than one or two pixels in each bin. Consequently, any given pixel is very likely to have zero probability to be in the incorrect class. Equally clearly, the results using HI are poor: the different bands are not independent. This is confirmed by the result for GI. G3D, however, produces a reasonable performance, second only to the HC results. Bearing in mind that G3D has 3 + 6 = 9 parameters, while HC has the same number of parameters as bins, this is encouraging. These conclusions are confirmed by the label images, which clearly show the inferior classifications produced by the models with independent bands. To test the generalization ability of the models, we used a different image to learn the model parameters, and used them to classify figure 2(a). The new training image is figure 2(d), along with a manual labelling. Figure 5 shows the results, while table 1 right shows the error rates.
Fig. 5. The same classification trained on figure 2(d) (panels, left to right: HI n = 64, HI n = 128, HC n = 64, HC n = 128, GI, G3D)

Table 1. Error rates for the maximum likelihood classification of figure 2(a), using models trained on the same image (left) and on figure 2(d) (right)

Trained on figure 2(a):
  Method            B→F    F→B    error (%)
  HI (64 bins)      446    404     9.64
  HI (128 bins)     446    399     9.58
  HC (64^3 bins)    121    214     3.8
  HC (128^3 bins)    19     97     1.32
  GI                470    426    10.16
  G3D               256    328     6.62

Trained on figure 2(d):
  Method            B→F    F→B    error (%)
  HI (64 bins)      748    242    11.22
  HI (128 bins)     752    253    11.39
  HC (64^3 bins)   1028    490    17.21
  HC (128^3 bins)  1747   1277    34.52
  GI               1106    123    13.93
  G3D               841     85    10.5
It is not a surprise that the error rates are larger. The histogram-based methods do not generalize well, and produce more errors than both Gaussian models. The Gaussian results are naturally not as good as in the previous test, but are adequate in the absence of a prior energy. The model with dependent bands performs considerably better than the independent band model in both cases. In particular, the independent band models, whether histogram-based or Gaussian, consistently confuse certain types of inter-tree background with the tree crown foreground.
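The Gaussian comparison above (GI versus G3D) amounts to fitting class-conditional Gaussians with diagonal or full covariance and labelling each pixel by the larger likelihood. A minimal sketch, assuming labelled training pixels are supplied as (N, 3) arrays; the function names are illustrative, not the authors' code.

```python
import numpy as np

def fit_gaussian(pixels, full_cov=True):
    """Fit a (possibly diagonal) Gaussian to an (N, 3) array of CIR pixel values."""
    mean = pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)
    if not full_cov:                      # GI: independent bands -> keep only the diagonal
        cov = np.diag(np.diag(cov))
    return mean, cov

def log_likelihood(pixels, mean, cov):
    """Per-pixel Gaussian log-likelihood."""
    d = pixels - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(2 * np.pi * cov)
    return -0.5 * (np.einsum('ni,ij,nj->n', d, inv, d) + logdet)

def ml_classify(image, crown_pixels, background_pixels, full_cov=True):
    """Maximum likelihood tree-crown / background labelling of a (H, W, 3) CIR image."""
    m_in, c_in = fit_gaussian(crown_pixels, full_cov)
    m_out, c_out = fit_gaussian(background_pixels, full_cov)
    flat = image.reshape(-1, 3).astype(float)
    crown = log_likelihood(flat, m_in, c_in) > log_likelihood(flat, m_out, c_out)
    return crown.reshape(image.shape[:2])
```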
4 Data Model and Energy Minimization

Our full energy functional for tree crown extraction is a combination of the energy associated to the likelihood, E_i(γ, I) = − ln P(I|R, θ_R, θ_R̄, K), and the HOAC ‘gas of circles’ prior geometric term E_g given in equation (1): E(γ, I) = E_g(γ) + E_i(γ, I). In the last section, we established that the Gaussian model with full covariance provides the best compromise between precision and generalization. In this section, we describe this data term and how we minimize the full energy E.
The parameters of E_i are learnt from samples of each class using maximum likelihood, and then fixed. We denote the mean vectors inside and outside as M_in and M_out and the covariance matrices Σ_in and Σ_out. We define the energy as we wrote above:

E(\gamma) = E_g(\gamma) - \int_R dp\, \ln\Big[ \det^{-1/2}(2\pi\Sigma_{\mathrm{in}})\, e^{-\frac{1}{2}(I(p)-M_{\mathrm{in}})^T \Sigma_{\mathrm{in}}^{-1} (I(p)-M_{\mathrm{in}})} \Big] - \int_{\bar R} dp\, \ln\Big[ \det^{-1/2}(2\pi\Sigma_{\mathrm{out}})\, e^{-\frac{1}{2}(I(p)-M_{\mathrm{out}})^T \Sigma_{\mathrm{out}}^{-1} (I(p)-M_{\mathrm{out}})} \Big].
The energy is minimized using gradient descent. The descent equation is

\frac{\partial\gamma}{\partial t}(p) = \Big\{ -\lambda\kappa(p) - \alpha + \frac{1}{2}\ln\frac{\det(\Sigma_{\mathrm{in}})}{\det(\Sigma_{\mathrm{out}})} + \beta \int dp'\; \hat r(p,p') \cdot n(p')\, \dot\Psi(r(p,p')) - \frac{1}{2}\Big[ (I(p)-M_{\mathrm{in}})^T \Sigma_{\mathrm{in}}^{-1} (I(p)-M_{\mathrm{in}}) - (I(p)-M_{\mathrm{out}})^T \Sigma_{\mathrm{out}}^{-1} (I(p)-M_{\mathrm{out}}) \Big] \Big\}\, \hat n(p),

where κ is the curvature of the contour, a dot indicates derivative, r(p, p′) = γ(p) − γ(p′), and r̂ = r/r. To evolve the contour we use the level set framework [16] extended to cope with the nonlocal forces arising from higher-order active contours [3].
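A minimal sketch of the pointwise, data-driven part of the descent speed as reconstructed above; the curvature, area and nonlocal HOAC terms, and all level-set machinery, are omitted, and sign conventions may differ from the authors' implementation.

```python
import numpy as np

def data_speed(I, m_in, cov_in, m_out, cov_out):
    """Pointwise data-driven contribution to the descent speed for a (H, W, 3) image I:
    0.5*ln(det(cov_in)/det(cov_out))
      - 0.5*[(I-m_in)^T inv(cov_in) (I-m_in) - (I-m_out)^T inv(cov_out) (I-m_out)]."""
    d_in = I - m_in
    d_out = I - m_out
    q_in = np.einsum('hwi,ij,hwj->hw', d_in, np.linalg.inv(cov_in), d_in)
    q_out = np.einsum('hwi,ij,hwj->hw', d_out, np.linalg.inv(cov_out), d_out)
    log_det_ratio = np.linalg.slogdet(cov_in)[1] - np.linalg.slogdet(cov_out)[1]
    return 0.5 * log_det_ratio - 0.5 * (q_in - q_out)
```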
5 Experimental Results

We tested the new approach on CIR aerial images of poplar stands located in the Saône-et-Loire region in France, provided by the French National Forest Inventory (IFN). We compare three models: the new model, which uses the multispectral data term with the ‘gas of circles’ prior; the model in [2], which uses only the infrared band of the CIR image with the ‘gas of circles’ prior; and a classical active contour model, which uses the multispectral data model, but only the length and area terms of E_g, i.e. β = 0. There is thus no prior shape information in this third model. In all experiments, the contour was initialized to a rounded rectangle slightly larger than the image domain. Figure 6(a), (b), and (c) show the results obtained on the image shown in figure 2(a), using the new model, the model in [2], and the classical active contour model respectively. The new model is the most successful, separating trees that are not separated by the other models. Figure 6(d), (e) and (f) show the results obtained on the image shown in figure 2(d). None of the results is perfect, all the models failing to separate some trees, but the new model detects several trees that are not detected by the model in [2]. The classical active contour model was not able to separate all the crowns, and found a large connected area at the bottom right, due to the missing prior shape information. Figure 7(a) shows a difficult image with a field at the top, and strong shadowing. The result with the new model, shown in figure 7(b), is good, detecting all the trees and ignoring the field and shadows. The result with the model of [2], shown in figure 7(c), is not as good. Some trees are missed, but more importantly, the fact that the field has a similar IR response to the tree crowns means that a large incorrect region is produced. The result with the classical active contour model, shown in figure 7(d), avoids this error
Fig. 6. Top row: results obtained on the image shown in figure 2(a), using the new model (a), the single band model of [2] (b), and the classical active contour model combined with the multispectral data term (c). Bottom row: results obtained on the image shown in figure 2(d), using the new model (d), the single band model of [2] (e), and the classical active contour model (f). In (a), (b), (d), and (e), the stable radius was set to r0 = 2.5. Images © French National Forest Inventory (IFN).
Fig. 7. Top row: (a), a CIR image; (b), result with the new model; (c), result with the model of [2] (stable radius r0 = 2.5); (d), result with classical active contour model combined with the multispectral data term. Bottom row: (e), a CIR image; (f), result with new model; (g), result with model of [2] (stable radius r0 = 7.0); (h), result with classical active contour model (β = 0) combined with the multispectral data term. Images © French National Forest Inventory (IFN).
Fig. 8. (a), a CIR image; (b), result with the new model; (c), result with the model of [2] (stable radius r0 = 4.0); (d), result with classical active contour model (β = 0) combined with the multispectral data term. Images © French National Forest Inventory (IFN).
thanks to the multispectral information, but the lack of prior shape information means that some trees are merged. Figure 7(e) shows a different type of image, of isolated trees in fields. The result with the new model, shown in figure 7(f), is correct, ignoring the field, for example. The result with the model of [2] is not as good, with one large false positive, and smaller errors on each of the detected trees, due to confusion between the field and parts of the road and the tree crowns (figure 7(g)). Figure 7(h) shows the result obtained using the multispectral data term combined with a classical active contour model. The result is almost as good as the new model, except that the contours are slightly less smooth, and there is a small false positive area in the upper right corner, which was not detected by the new model, presumably because it is not circular. Figure 8(a) shows another CIR image with fields and some sparse trees. It is a difficult image, because some of the fields have a similar colour to the trees. The result with the new model, shown in figure 8(b), is good, detecting all the trees, and only merging two of them. The result with the model of [2], shown in figure 8(c), is not as good. The greyscale level between some of the trees is too similar to the tree crowns to be separated, despite the prior shape information, meaning that several trees are merged. In addition, some non-tree objects were detected as tree crowns, again due to similarity of grey scale. The result obtained with the classical active contour and multispectral data model is slightly better, but due to the missing prior shape information several tree crowns are merged and a small non-tree area was detected.
6 Conclusion

We have presented a new higher-order active contour (HOAC) model for tree crown extraction from colour infrared (CIR) aerial images. The new data term takes into account the multispectral nature of the data, in contrast to almost all previous work. The
interband correlations are modelled using a full covariance Gaussian distribution. The prior term is a HOAC model of a ‘gas of circles’, modelling regions consisting of a number of circles of approximately a given radius. Experimental results show that the new model outperforms both a model with the same prior shape information, but which uses only the IR band of the data (the model of [1,2]), and models with the same multispectral data term, but including less prior shape information, to wit, a classical active contour model and maximum likelihood.
References 1. Horv´ath, P., Jermyn, I.H., Kato, Z., Zerubia, J.: A higher-order active contour model for tree detection. In: Proc. International Conference on Pattern Recognition (ICPR), Hong Kong, China (2006) 2. Horv´ath, P., Jermyn, I.H., Kato, Z., Zerubia, J.: An improved ‘gas of circles’ higher-order active contour model and its application to tree crown extraction. In: Proc. Indian Conference on Vision, Graphics and Image Processing (ICVGIP), Madurai, India (2006) 3. Rochery, M., Jermyn, I.H., Zerubia, J.: Higher-order active contours. International Journal of Computer Vision 69, 27–42 (2006) 4. Larsen, M.: Finding an optimal match window for Spruce top detection based on an optical tree model. In: Hill, D., Leckie, D. (eds.) Proc. International Forum on Automated Interpretation of High Spatial Resolution Digital Imagery for Forestry, Victoria, British Columbia, Canada, pp. 55–66 (1998) 5. Gougeon, F.: A crown-following approach to the automatic delineation of individual tree crowns in high spatial resolution aerial images. Canadian Journal of Remote Sensing 21(3), 274–284 (1995) 6. Brandtberg, T., Walter, F.: Automated delineation of individual tree crowns in high spatial resolution aerial images by multiple-scale analysis. Machine Vision and Applications, 64–73 (1998) 7. Gougeon, F.A.: Automatic individual tree crown delineation using a valley-following algorithm and rule-based system. In: Hill, D., Leckie, D. (eds.) Proc. International Forum on Automated Interpretation of High Spatial Resolution Digital Imagery for Forestry, Victoria, British Columbia, Canada, pp. 11–23 (1998) 8. Andersen, H., Reutebuch, S., Schreuder, G.: Automated individual tree measurement through morphological analysis of a LIDAR-based canopy surface model. In: Proc. International Precision Forestry Symposium, Seattle, WA, USA, pp. 11–21 (2001) 9. Perrin, G., Descombes, X., Zerubia, J.: Tree crown extraction using marked point processes. In: Proc. European Signal Processing Conference (EUSIPCO), Vienna, Austria (2004) 10. Gougeon, F.: Comparison of possible multispectral classification schemes for tree crown individually delineated on high spatial resolution MEIS images. Canadian Journal of Remote Sensing 21(1), 1–9 (1995) 11. Cremers, D., Kohlberger, T., Schn¨orr, C.: Shape statistics in kernel space for variational image segmentation. Pattern Recognition 36, 1929–1943 (2003) 12. Leventon, M., Grimson, W., Faugeras, O.: Statistical shape influence in geodesic active contours. In: Proc. IEEE Computer Vision and Pattern Recognition (CVPR), Hilton Head Island, SC, USA, pp. 316–322. IEEE Computer Society Press, Los Alamitos (2000) 13. Paragios, N., Rousson, M.: Shape priors for level set representations. In: Proc. European Conference on Computer Vision (ECCV), Copenhagen, Denmark, pp. 78–92 (2002)
14. Riklin-Raviv, T., Kiryati, N., Sochen, N.: Prior-based segmentation by projective registration and level sets. In: Proc. IEEE International Conference on Computer Vision (ICCV), pp. 204–211. IEEE Computer Society Press, Los Alamitos (2005) 15. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1, 321–331 (1988) 16. Osher, S., Sethian, J.A.: Fronts propagating with curvature dependent speed: Algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics 79, 12–49 (1988)
A Crossing Detector Based on the Structure Tensor
Frank G.A. Faas and Lucas J. van Vliet
Quantitative Imaging Group, Delft University of Technology, The Netherlands
[email protected]
Abstract. A new crossing detector is presented which also permits orientation estimation of the underlying structures. The method relies on well established tools such as the structure tensor, the double angle mapping and descriptors for second order variations. The performance of our joint crossing detector and multi-orientation estimator is relatively independent of the angular separation of the underlying unimodal structures.
1 Introduction
The structure tensor [1] and its nonlinear variations [2] yield a reliable estimate of orientation on unimodal structures. They fail where unimodal structures overlap (or cross). In this paper we present a method based on the structure tensor to divide the image around crossings into unimodal regions. Using the 4-fold symmetry of the orientation map at line crossings (or saddle points in checkerboard patterns) we are able to achieve a high response independent of the angular separations of the underlying lines. This is in contrast to, e.g., the Harris–Stephens detector [7] and variations thereof [9,11,5,4,10,8], for which the response drops significantly with decreasing angular separation. Our new method is reasonably fast, has a good angular selectivity and yields good localization. This is particularly important for camera calibration, in which the crossings of checkerboard patterns (or other fiducials) need to be located with sub-pixel accuracy in many different poses. Another key application in molecular biology requires the detection and characterization of overlapping bio-polymers such as DNA strands deposited on a surface for AFM or TEM imaging.
2 Method
The key observation behind our method is the following. Applying the gradient structure tensor [1,7] (GST) to a crossing of linear structures results in an orientation pattern with a saddle point structure, i.e. regions of uniform orientation bounded by the bisectors of the underlying crossing (Fig. 1). Hence detection of these saddle points will yield a crossing detector. After the orientation of these saddle points is determined, one can divide the local neighborhood of a crossing into four regions, i.e. wedges with an opening angle of π/2 radians. The antipodal wedges
Fig. 1. (a) Sketch of crossing lines A and B with respectively orientations θA and θB. The dashed lines denote the major axes of the detected saddle point at the crossing with respective orientations θ1 and θ2. The dashed regions denote the areas in which the measured orientation by the GST in the circular region is approximately constant. In sketch (b) the same crossing is denoted. The circular arrow denotes the track along which the orientation response of the GST is sketched in subfigure (c) (phase wrapping is assumed absent).
can form two bow ties, see the regions in Fig. 1(a) labelled with respectively θA and θB for lines A and B. Applying the gradient structure tensor to these regions, either to each bow tie or to each wedge separately, yields a reliable local orientation estimate for each arm or line of the crossing. As the location of the crossing is already detected, the orientation estimates of the four wedges can be converted to direction estimates pointing away from the crossing's center. In this section we will briefly describe the four steps of our method.

Transform crossings into saddle shaped structures: In the first step of our algorithm we determine the local orientation by means of the GST. The GST, T(I), is the averaged dyadic product of the gradient field ∇I of image I, in which the overhead bar denotes local averaging:

T(I) = \overline{\nabla I\, \nabla I^{T}} \quad \text{with} \quad \nabla I = [I_x, I_y]^T.   (1)
In this tensor representation two antipodal vectors are mapped on top of each other. Where the antipodal vectors cancel out in an averaging step, the corresponding tensor representations reinforce each other. Now the directional gradient power is maximized for angle θ:

\tan 2\theta = 2\,\overline{I_x I_y}\, /\, (\overline{I_x^2} - \overline{I_y^2})   (2)
where θ denotes the orientation of the gradients of the unimodal structure. This corresponds to the orientation of the eigenvector belonging to the largest eigenvalue. In Fig. 1(a) the saddle point structure of the orientation field of the GST is sketched. This saddle point structure is caused by the averaging nature of the GST which treats the local neighborhood as a single structure, i.e. when one of the arms is dominant in the analysis window it will dominate the orientation
Fig. 2. (a) Synthetic image of a cross. (b) Orientation determined by the GST, clearly showing phase wrapping (BW transitions). (c–d) The double angle mapping of (b): the cosine (c) and sine (d) of the double angle of (b).
result as well. Only when both lines are visible an averaged orientation will be obtained, see Fig. 1(b-c). Due to this averaging property of the GST the orientation field will look like a Voronoi tessellation of the underlying structures, i.e. in case of crossing lines a cross is formed by the internal bisectors of the lines, see Fig. 1(a). Although the GST gives an excellent characterization of the local orientation, the angle representation of Eq. 2 suffers from phase wrapping, i.e. the resulting orientation is modulo π radians. This causes large jumps in the orientation image where the angle jumps from 0 to π radians while in reality these orientations are identical. In Fig. 2(a-b) we show respectively a synthetic crossing and the orientation estimate by means of the GST. Where the latter clearly shows phase wrapping events. To solve the phase jumps caused by phase wrapping we apply a double angle mapping to the measured orientation. θ → (cos 2θ, sin 2θ)
(3)
Note that the double angle is closely related to the GST. As shown in [6] this mapping preserves the angular metric, gives a continuous mapping and preserves the local structure. In Fig. 2(c-d) the double angle representation is shown for the image in Fig. 2(b). It clearly shows that the phase wrapping events, BW transitions, in Fig. 2(b), are absent after the double angle mapping, Fig. 2(c-d).

Generate candidate crossings from second order shape descriptors: The phase unwrapped orientation gives rise to a saddle structure. This structure is more pronounced for large angles of separation, i.e. for lines crossing at an angle of π/2 radians it is maximized. Therefore a saddle point candidate generator is needed which separates the magnitude of the saddle point from the shape descriptor which characterizes the structure type. To this end we explore the second order structure [3]. A structure vector f is presented with three components to describe the second order structure based on the Hessian matrix. These three components are respectively the angle β which denotes the orientation of the structure, κ which denotes the structure type and f which denotes
the structure strength. These descriptors are based on the spherical harmonics, which constitute an orthonormal basis, in contrast to the second order derivatives, which are not independent. However, the second order spherical harmonics J_{2j} can be expressed in terms of the second order derivatives J_{ab} as follows:

\begin{pmatrix} J_{20} \\ J_{21} \\ J_{22} \end{pmatrix} = \frac{1}{\sqrt{3}} \begin{pmatrix} J_{xx} + J_{yy} \\ \sqrt{2}\,(J_{xx} - J_{yy}) \\ \sqrt{8}\, J_{xy} \end{pmatrix}   (4)

Now we can express the structure vector f in terms of the spherical harmonics for image J as

f(J) = \begin{pmatrix} f \\ \beta \\ \kappa \end{pmatrix} = \begin{pmatrix} |(J_{20}, J_{21}, J_{22})| \\ \arg(J_{21}, J_{22}) \\ \arctan\big(J_{20} / \sqrt{J_{21}^2 + J_{22}^2}\big) \end{pmatrix}   (5)

For |κ| = π/2 the structure can be described as a blob structure, for |κ| = π/6 as ridges/valleys and for |κ| = π/3 we have the pure second order derivatives. The pure saddle structure is located at |κ| = 0. The double angle representation results in two κ images, i.e. one for the sine and one for the cosine term. These images are combined into one structure descriptor κ′ based on the corresponding structure strength, i.e.

\kappa'(\theta) = \begin{cases} \kappa(\cos 2\theta) & \text{if } f(\cos 2\theta) > f(\sin 2\theta) \\ \kappa(\sin 2\theta) & \text{elsewhere} \end{cases}   (6)

To detect candidate saddle points we apply a threshold to the |κ′| image,

\mathrm{Saddle}(\theta) = \begin{cases} 1 & \text{if } |\kappa'| \le \kappa_{th} \\ 0 & \text{else} \end{cases} \quad \text{with} \quad \kappa_{th} = \frac{\pi}{12}   (7)

where κ_th is chosen as the middle value between the pure saddle point at |κ′| = 0 and the line structures at |κ′| = π/6.

Detect crossings using a second order magnitude measure: After we have generated the candidate saddle points we want to assign a magnitude measure to each candidate based on the structure strength, to confine the candidates to regions where structure is present, i.e. noise can also give rise to saddle points on a small scale. As the f(θ)-based terms are dependent on the angular separation of the crossing, another energy measure is needed to reduce the angular dependency of the detector. The measure of our choice is the curvature-signed second order strength in I,

E(I) = \mathrm{sign}(\kappa(I))\, |f(I)|   (8)

where the sign term is introduced to be able to distinguish between the crest of a line and its edges, i.e. on a ridge the curvature is positive but on the flanks the curvature is negative. Thresholding the energy image yields the candidate regions based on the structure strength in I,
\mathrm{Energy} = \begin{cases} 1 & \text{if } E \gtrless E_{th} \\ 0 & \text{else} \end{cases} \quad \text{with} \quad E_{th} = \mathrm{threshold}(\{E(I)\,|\,E \gtrless 0\})   (9)

where the comparison direction depends on the structure of interest, i.e. black lines on a white background or vice versa. E_th is determined by an isodata threshold on respectively the positive or negative data in E(I); the threshold type can of course be adapted to a particular problem. Now we combine the Energy and Saddle images by an AND operation. Furthermore, to remove spurious detections, we require the detected regions to be larger than S_A pixels:

\mathrm{Detector} = \{\, x \mid x \in \mathrm{Saddle} \wedge x \in \mathrm{Energy} \wedge \mathrm{Area}(x) > S_A \,\}   (10)
where the area S_A is defined as the minimum cross section of two lines of width w intersecting under an angle φ, i.e. S_A = w². Of course, the observed line width is a combination of the true width of the line, the PSF of the imaging device and the size of the derivative kernels, and as such has to be set to a suitable value for the problem at hand.

Analyze orientation of lines composing the crossing: The algorithm continues with the analysis of the orientation of the lines from which the crossing is composed. Therefore, first the center of gravity is determined for each connected region in the detector image, which serves as the location of the detected crossing. Further analysis is performed with these points as point of origin. The value of β at these points now gives the orientation of the saddle points (the β responses on the double angle representation are combined in a similar fashion as those for κ in Eq. 6). At these points the gradient information is analyzed by means of the GST in the bow tie shaped region. The bow tie is constructed from the major axes of the saddle point, i.e. given by lines through the local point of origin with orientation β and β + π/2. The eigenvector belonging to the largest eigenvalue of the GST for each bow tie gives the orientation of the underlying structure. The size of the gradients (Gaussian derivatives of scale σg) must be small to avoid unnecessary signal suppression. The size of the tensor smoothing (Gaussian filter of scale σt) is usually three to five times larger than the gradient size. The size of the second derivatives is set identical to the size of the tensor smoothing.
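A compact sketch of the candidate-generation stage described above, using scipy.ndimage Gaussian derivatives: GST orientation (Eqs. 1–2), double angle mapping (Eq. 3), second-order descriptors (Eqs. 4–5), the combination rule (Eq. 6) and the saddle mask (Eq. 7). The energy gating of Eqs. (8)–(10) and the bow-tie orientation analysis are omitted, and the constants follow the reconstruction above, so details may differ from the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def crossing_candidates(img, sigma_g=1.0, sigma_t=4.0, kappa_th=np.pi / 12):
    """Return the saddle mask of Eq. (7) for a 2-D image (candidate crossings only)."""
    img = np.asarray(img, float)
    # Gradients and gradient structure tensor (Eq. 1), with tensor smoothing
    Ix = gaussian_filter(img, sigma_g, order=(0, 1))
    Iy = gaussian_filter(img, sigma_g, order=(1, 0))
    Jxx = gaussian_filter(Ix * Ix, sigma_t)
    Jyy = gaussian_filter(Iy * Iy, sigma_t)
    Jxy = gaussian_filter(Ix * Iy, sigma_t)
    theta = 0.5 * np.arctan2(2 * Jxy, Jxx - Jyy)          # orientation, Eq. (2)

    # Double angle mapping (Eq. 3) removes the phase wraps at 0 / pi
    c2, s2 = np.cos(2 * theta), np.sin(2 * theta)

    def kappa_and_strength(u):
        # Second-order descriptors of Eqs. (4)-(5), second-derivative scale = sigma_t
        uxx = gaussian_filter(u, sigma_t, order=(0, 2))
        uyy = gaussian_filter(u, sigma_t, order=(2, 0))
        uxy = gaussian_filter(u, sigma_t, order=(1, 1))
        J20 = (uxx + uyy) / np.sqrt(3)
        J21 = np.sqrt(2) * (uxx - uyy) / np.sqrt(3)
        J22 = np.sqrt(8) * uxy / np.sqrt(3)
        strength = np.sqrt(J20**2 + J21**2 + J22**2)
        kappa = np.arctan2(J20, np.sqrt(J21**2 + J22**2))
        return kappa, strength

    k_c, f_c = kappa_and_strength(c2)
    k_s, f_s = kappa_and_strength(s2)
    kappa = np.where(f_c > f_s, k_c, k_s)                 # combination rule, Eq. (6)
    return np.abs(kappa) <= kappa_th                      # saddle mask, Eq. (7)
```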
3 Results
First we test the algorithm on synthetic data, i.e. crossing lines with an angular separation between 0 and π/2 radians, see e.g. Fig. 2(a). In Fig. 3 we show the results for lines with a Gaussian line profile of σ_line = 1 and an SNR of 10 dB and 25 dB respectively after addition of Gaussian noise. The signal to noise ratio is defined as SNR = 20 log (contrast/σ_noise). For each angular separation 100 realizations are obtained with randomly selected sub-pixel position and orientation of the structures. All derivatives and averages are computed with Gaussian
Fig. 3. Top row figures show the distance between the true crossing center and the measured center as a function of the angular separation of the lines for true positives. The middle row shows the measured angular separation of the true crossings (in blue) and of the false positives (in red). The plots on the bottom row show the angular deviation of the measured orientations from the true orientations of the lines for true positives. The left and right column show the results for 10 dB and 25 dB respectively. For each separation angle 100 realizations were made.
kernels (σg = 1, σt = σs = 4). Note that for both noise levels the same settings were used. The region in which the orientations of the lines were measured corresponds to the size of the tensor smoothing. Further, the analysis window is set to a region within 100 pixels from the crossing center. Keep this in mind, as the number of false positives is expected to scale linearly with this value. On the top row of Fig. 3 the distance from the true center is depicted, where the numbers denote the number of false negatives. True positives are detections closest to the true center and at a maximum distance of 2 pixels. All other detections are marked as false positives. The high number of false negatives for small angles is attributed to the fact that a crossing resembles a line more and more with decreasing angular separation, resulting in poor localization. The increase in the number of false negatives for large separations in the high noise
Fig. 4. DNA molecules labeled with uranyl acetate and visualized by transmission electron microscopy. The images are kindly provided by Dr. D. Cherny, PhD, Dr.Sc. The examples show (self) crossing DNA strands, the white dashed lines show the orientation estimates of the detected crossings while the black dashed lines show the major axes of the saddle point regions.
realizations is not fully understood, but can be lowered in exchange for more false positives with small separation angles. On the middle row of Fig. 3 the measured angular separation is plotted as a function of the true angular separation, where the numbers denote the number of false positives. The plots clearly show that the false positives can be easily separated from the true positives for separation angles larger than ≈ π/10, and even smaller for the low noise case. The figures on the right denote the error in the orientation estimation of the crossing lines. In Fig. 4 and 5 we show some examples on real data. The images represent respectively DNA strands, a deformed clay dike model and a checker board. For these images S_A = (1 + σt²), i.e. the width of the detected lines is put to 1. The settings for the first order derivatives, tensor smoothing/second derivatives and the cutoff radius of the wedges are respectively (σg, σt) = (2, 10), (1, 4) and (1, 6) for the clay dike, checker board and DNA images. The clay dike image is produced by line scanning and suffers from striping. To overcome this problem the tensor smoothing is set to a relatively high value. In all three images the black dashed lines denote the major axes of the saddle points whereas the white dashed lines denote the measured orientations of the underlying lines.
Fig. 5. (a) Image of a deformed miniaturized clay dike model with a superimposed grid. Courtesy of GeoDelft, The Netherlands. (b) Checker board image. In both images the white dashed lines denote the orientation estimates of the detected crossings while the black dashed lines show the major axes of the saddle point regions.
4 Conclusions
The presented crossing detector is relatively insensitive to the angular separation of the constituent lines/edges. False positives can easily be removed by setting a simple threshold on the angular separation. The detector also allows for an accurate orientation estimation of the underlying structures and performs well on noisy data. We believe this can be a good tool for camera calibration on checkerboard images due to its independence of the angular separation between the linear structures (pose independence). Further, it can be used for the analysis of (self-)overlapping line-like objects. The small number of parameters can be adjusted easily to the problem at hand, and their values are intuitive to determine. For the first order derivatives we like to keep the footprints as small as possible. The tensor and second order footprints can be kept at the same value, where the value is dependent on the spatial separation of crossings as well as the noise properties of the image at hand. The same is true for the final orientation measurements of the underlying structures. The size of the bow ties is √2 times the size of the tensor smoothing.
References 1. Big¨ un, J., Granlund, G.H.: Optimal orientation detection of linear symmetry. In: Proc. 1th IEEE Int. Conf. Comput. Vis. June 8-11, 1987, pp. 433–438. IEEE Computer Society Press, Los Alamitos (1987) 2. Brox, T., Weickert, J., Burgeth, B., Mrazek, P.: Nonlinear structure tensor. Im. Vis. Comp. 24, 41–55 (2006) 3. Danielsson, P.-E., Lin, Q., Ye, Q.-Z.: Efficient detection of second-degree variations in 2D and 3D images. J. Vis. Comm. Im. Repr. 12, 255–305 (2001) 4. F¨ orstner, W.: A feature based correspondence algorithm for image matching. Int. Arch. Phot. Rem. Sens. 26(3/3), 150–166 (1986) 5. Garding, J., Lindeberg, T.: Direct computation of shape cues using scale-adapted spatial derivative operators. Int. J. Comput. Vis. 17, 163–191 (1996) 6. Granlund, G.H.: In search of a general picture processing operator. Comp. Vis. Graph. Im. Proc. 8, 155–173 (1978) 7. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. 4th Alvey Vis. Conf. pp. 147–151 (1988) 8. Kenney, C.S., Zuliani, M., Manjunath, B.S.: An axiomatic approach to corner detection. In: Proc. IEEE Conf. Comput. Vis. Patt. Recogn. pp. 191–197. IEEE Computer Society Press, Los Alamitos (2005) 9. Rohr, K.: On 3d differential operators for detecting point landmarks. Im. Vis. Comp. 15, 219–233 (1997) 10. Shi, J., Tomasi, C.: Good features to track. In: Proc. IEEE Conf. Comput. Vis. Patt. Recogn. (CVPR’94), pp. 593–600. IEEE Computer Society Press, Los Alamitos (1994) 11. Triggs, B.: Detecting keypoints with stable position, orientation, and scale under illumination changes. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 100–113. Springer, Heidelberg (2004)
Polyphase Filter and Polynomial Reproduction Conditions for the Construction of Smooth Bidimensional Multiwavelets
Ana Ruedin
Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Ciudad Universitaria, Pab. I, CP 1428, Ciudad de Buenos Aires
[email protected]
Abstract. To construct a very smooth nonseparable multiscaling function, we impose polynomial approximation order 2 and add new conditions on the polyphase highpass filters. We work with a dilation matrix generating quincunx lattices, and fix the index set. Other imposed conditions are orthogonal filter bank and balancing. We construct a smooth, compactly supported multiscaling function and multiwavelet, and test the system on a noisy image with good results. Keywords: orthogonal filterbank, multiwavelets, nonseparable, polynomial reproduction.
1 Introduction
The wavelet transform has proved to be an efficient tool for many image processing applications. By means of lowpass and highpass filters, at each step, the image (or an approximation of the image), belonging to a subspace Vj, is decomposed into its projection onto 2 subspaces: an approximation subspace Vj−1 and a detail subspace Wj−1, both having less resolution. When the process is completed, the image is represented as the sum of its details at different resolutions and positions, plus a coarse approximation of the same image [1, 2]. The approximation subspaces Vj, which are nested, are the linear span of the scaling function Φ (or a scaled version of Φ) and its integer translates. The detail subspaces Wj are the linear span of the wavelet Ψ (or a scaled version of Ψ) and its integer translates. We call (Φ, Ψ) a wavelet system. In one dimension, the different scales are powers of a dilation factor, most commonly equal to 2. To process an image, the tensor product of one-dimensional filters is used; the details lie mainly in the vertical and horizontal directions, which does not agree with our visual system. Nonseparable bidimensional wavelets give a more isotropic treatment of the image [3, 4, 5, 6, 7]. The dilation factor is a 2 × 2 matrix, called the dilation matrix. Its elements must be integers, and it must be an expansion of the plane. Multiwavelets, related to time-varying filterbanks, are a generalization of the wavelet theory, in which the approximation subspaces Vj are the linear span of
more than one scaling function [8, 9]. They offer a greater degree of freedom in the design of filters. There has been a growing interest in unifying both research lines, and constructing nonseparable bidimensional multiwavelets. The first reported example was in 1998 [10] (although coefficients are not given). McClellan transforms have been used to construct nonseparable multidimensional wavelets and multiwavelets [6] and 3D nonseparable wavelets on a 3D lattice [11], for a dilation matrix equal to 2I. Nonseparable multiwavelets have been constructed and applied for compression [12, 13, 14], edge detection [15] and interpolation [16] with dilation matrices generating quincunx lattices. In this paper we impose polynomial approximation order 2 and add new conditions on the polyphase highpass filters to construct very smooth nonseparable multiscaling functions and multiwavelets, having compact support. We work with a dilation matrix generating quincunx lattices, which expands the plane evenly in all directions. We seek an orthogonal filter bank, because it provides stability when modifications such as rounding or thresholding are introduced in the transformed image, and has the advantage that the inverse transform is immediately obtained. The 2 branches of the lowpass filter are required to be balanced. We obtain smooth multiscaling and multiwavelet functions. These are plotted and tested for image denoising with good results. In section 2 the general setting for this bidimensional multiwavelet is given; we explain the choice of the dilation matrix and the index set. The condition of polynomial reproduction by the integer translates of the multiscaling function is written out as nonlinear equations on the set of parameters that are to be calculated. In section 3 we introduce the formulae for transforming and antitransforming an image with this particular nonseparable multiwavelet. Further desired properties are stated in section 4, such as balancing and lowpass or highpass conditions on the filters. In section 5 we briefly refer to the numerical construction, plot the 2 scaling functions and apply the multiwavelet obtained to denoise an image. Concluding remarks are given in section 6.
2 A Nonseparable Bidimensional Multiwavelet System and Desired Properties
In this setting, both the refinable function and the wavelet are function vectors of 2 components, that is, they are determined by 2 functions.

2.1 A Nonseparable Bidimensional Multiscaling Function
The multiscaling function Φ = [Φ1 Φ2]^T spans the approximation spaces Vj. For denoising or interpolation, among other applications, it is desirable to have smooth bases for these spaces, so that the reconstructed image is agreeable to the eye. We start with the choice of the dilation matrix

D = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},   (1)
a reflection (on an axis outside the image) followed by an expansion of √2. Its eigenvalues are ±√2, and its singular values are both equal to √2; the matrix provides an expansion of the whole plane in Euclidean norm, and does not introduce visual distortions in an image [17]. Given an image x, we define the downsampling operation with D as: y = x ↓ D ⇔ y_n = x_{Dn} (n ∈ Z²). This operation reflects and contracts the image. D induces a decomposition of the set Z² into |D| = |det(D)| = 2 cosets Γ0 and Γ1: Z² = Γ0 ∪ Γ1; Γ0 = {DZ²}; Γ1 =
1 DZ 2 + . 0
(2)
Φ1 and Φ2 are 2 functions, defined over R2 , that verify the following dilation or refinable equation (written in vector form in Eq. (3); written in detail in Eq. (4)): Φ(x) = H (k) Φ( D x − k ), (3) k∈Λ⊂Z 2
Φ1 (x) Φ1 ( D x − k ) = H (k) , Φ2 (x) Φ2 ( D x − k )
(4)
k
where H (k) are 2 × 2 matrices (matrix filters), with indices in Λ as indicated ⎡
0 0
0
⎢ H (−1,1) ⎢ (−2,0) M0 = ⎢ H (−1,0) ⎢H ⎣ 0 H (−1,−1) 0 0
H (0,2) H (0,1) H (0,0) H (0,−1) H (0,−2)
⎤ H (1,2) 0 0 H (1,1) H (2,1) 0 ⎥ ⎥ H (1,0) H (2,0) H (3,0) ⎥ ⎥. H (1,−1) H (2,−1) 0 ⎦ H (1,−2) 0 0
(5)
Notice that there are 9 matrix filters H (k) with indices in Γ0 (drawn with circles in Fig. (1) and 9 in Γ1 (drawn with crosses). Both cosets conform a quincunx lattice. The entries of matrices H (k) are the unknowns or parameters of our multiscaling function. The aim is to find a set of parameters that will give smooth functions. O× O×O× O×O×O× O×O× O× Fig. 1. Index configuration for the matrix filters
224
2.2
A. Ruedin
A Nonseparable Bidimensional Multiwavelet
The number of wavelets is 2. The equation for the multiwavelet is: Ψ (x) =
G ( k ) Φ( D x − k ).
(6)
k
where G(k) are 2 × 2 matrices (matrix filters). We choose matrices G( k) having the same indices k ∈ Λ, as indicated in Eq. (5).The entries of matrices G( k) are the unknowns or parameters of our multiwavelet. The aim is to find a set of parameters that define matrices H ( k) and G( k) , by imposing conditions on the multiscaling function and on the multiwavelet, so that we obtain smooth functions. Matrices H ( k) and G( k) conform a matrix filterbank which we want to be orthogonal. 2.3
Compact Support
With the choices made for D and Λ, we automatically have compactly supported multiscaling functions. The supports of Φ1 and Φ2 are contained in a set S ⊂ R2 , that depends on D and Λ, and verifies S=
D−1 {S + k} = D−1 {S + Λ} .
k∈Λ⊂Z 2
It may be shown that the supports of Ψ1 and Ψ2 are also compact. 2.4
Polynomials in the Linear Span of the Integer Translates of Φ
The smoothness of a scaling function, and the degree of the polynomials it can reproduce, are related [18]. Accordingly, to obtain smooth multiscaling functions we look for functions that can reproduce polynomials. We say that Φ(x) has polynomial approximation order s if any polynomial p(x) of degree ≤ s can be written as a linear combination of the integer translates of Φ1 and Φ2 , i.e. p(x) =
αTk Φ(x + k)
(7)
k∈Z 2
where αk is a column vector of 2 elements. The polynomial approximation order is related to the number or vanishing moments of the multiwavelet. (j,) To abridge notation, we call Si the sum of all matrices H (k) , whose indices k = (k1 , k2 ) belong to coset Γi , multiplied by k1j k2 ; and we call S (j,) the same sum over both cosets: j (j ) (j ) (j ) Si = k1 k2 H (k) , S (j ) = k1j k2 H (k) = S0 + S1 . (8) k∈Γi
k∈Γ0 ∪Γ1
Polyphase Filter and Polynomial Reproduction Conditions
225
We list the conditions for polynomial approximation, given in [19] for compactly supported functions and written out for the dilation matrix chosen: ( 00)
w T = w T Si u = −w T
T
( 10) Si
T
y =
, (9)
( 00) ( 01) ( 00) + (u + v) Si , v T = −wT Si + (u − v)T Si , (20) (10) (00) xT = wT Si − 2 (u + v)T Si + (x + 2 y + z)T Si , ( 11) (10) ( 01) (00) w T Si − (u − v)T Si − (u + v)T Si + (x − z) Si , (02) (01) (00) z T = wT Si − 2 (u − v)T Si + (x − 2 y + z)T Si . T
(10) (11) (12) (13)
If there exist vectors w, u, v, x, y, z in R2 ,w = [0 0]T , verifying Eqs. (9–13) for i = 0, 1, then Φ has polynomial approximation order 2. 2.5
Orthogonal Filterbank
As mentioned, it is convenient to have an orthogonal filterbank. Such a filterbank determines the formulae for the multiwavelet transform and antitransform. The orthogonality conditions are the following: T |D| I if j = (j1 , j2 ) = (0, 0) (k) (k+Dj) H H = (14) 0 if j = (j1 , j2 ) = (0, 0) 2 k∈Λ⊂Z
(k)
G
k∈Λ⊂Z 2
T |D| I if (j1 , j2 ) = (0, 0) (k+Dj) G = 0 if (j1 , j2 ) = (0, 0)
T G(k) H (k+Dj) = 0 ∀ j ∈ Z 2
(15)
(16)
k∈Λ⊂Z 2
3 3.1
Image Processing with Nonseparable Multiwavelets Separating an Image in Its Phases
In this section we give formulae for processing an image with nonseparable multiwavelets, and illustrate them with an example. Since we have 2 scaling functions, we need 2 images to feed into the filterbank (see Fig. (2)). In a similar way as 1d signals are separated into phases (even and odd entries), we separate one image in 2 images (phases) according to the coset of the pixels’ coordinates. The (0) original image Xk , k ∈ Z 2 , in our example Lena image of 128 × 128 pixels, (0) (0) is separated into c1,k and c2,k , its (downsampled) entries in Γ0 and Γ1 , respectively – see Fig. (3)(a). Specifically, (at the right of each formula is given its Z-transform): (0) (0) (0) c1,· = X (0) ↓ D, c1,· (z1 , z2 ) = XDk z1−k1 z2−k2 ; (17) k=(k1 ,k2 )∈Z 2
226
A. Ruedin (−1)
c1,k Analysis Lowpass
(0) c1,k
(−1)
c2,k
(0)
c1,k
(0)
c2,k
Synthesis
(−1)
d1,k Analysis Highpass
(0)
c2,k
(−1)
d2,k
Fig. 2. Analysis-synthesis scheme
(0) c2,· = X (0) ∗ ∂−10 ↓ D,
(0)
c2,· (z1 , z2 ) =
XDk+(10) z1−k1 z2−k2 ; (0)
(18)
k∈Z 2 (0)
(0)
X (0) (z) = c1,· (z D ) + z1−1 c2,· ( z D ). (19) z1−j z2−k and z D = (z1 z2 , z1 /z2 ).) (0)
X (0) = [c1,· ↑ D] + [∂(10) ∗ (c2,· ↑ D)];
(Note: F (z1 , z2 ) =
Fjk
( j,k )∈Z 2
3.2
( 0)
Analysis– In Terms of Matrix Filters
We have the approximation spaces Vj , and the detail spaces Wj : Vj = span Φ1 (Dj · −k1 ), Φ2 (Dj · −k2 ) (k1 ,k2 )∈Z 2 , Wj = span Ψ1 (Dj · −k1 ), Ψ2 (Dj · −k2 ) (k1 ,k2 )∈Z 2 .
(20) (21)
Let f (x) be the function in V0 (the approximation space having the fine resolution of the image) that verifies: (0) T (0) (0) (0) f (x) = c·,k Φ(x − k), where c·,k = [ c1,k c2,k ]T . k∈Z 2 (−1)
The analysis scheme (see Fig. 2) has 4 outputs: 2 approximation images: c1,k (−1)
(−1)
(−1)
and c2,k , and 2 detail images: d1,k and d2,k (k ∈ Z 2 ). Writing f (x) as the sum of its projections onto V−1 and W−1 : (−1)T (−1)T 1 f (x) = √ c Φ( D−1 x − k) + d·,k Ψ ( D−1 x − k ) , 2 k∈Z 2 ·,k 2 k∈Z it can be shown that the analysis formulae are 1 (j−Dk) (0) 1 (j−Dk) (0) (−1) (−1) c·,k = √ H c·,j , d·,k = √ G c·,j 2 j∈Z 2 2 j∈Z 2 (−1)
(22)
(23) (−2)
In Fig. (3)(b) are the coefficients of 2 steps of the transform: d1,· , d1,· , (−2)
c1,·
(−1)
(−2)
(top) and d2,· , d2,·
(−2)
, c2,·
(bottom).
Polyphase Filter and Polynomial Reproduction Conditions
(a) Image separated in 2 phases
227
(b) 2 steps of the transform Fig. 3.
3.3
Analysis Formulae in Terms of Convolutions with 2d Filters
We now write other formulae that contain bidimensional convolutions and are equivalent to Eq. (23). This will enable us to obtain bidimensional filters (associated to the analysis step) to which we will impose good lowpass or highpass properties. We may write Eq. (23) as (−1)
ci,·
= yi,· ↓ D,
(−1)
di,·
= ui,· ↓ D,
i = 1, 2,
(24)
where, for i = 1, 2, 1 (·) (0) (·) (0) yi,· = √ Hi1 ∗ c1,· + Hi2 ∗ c2,· , 2
(25)
1 (·) (0) (·) (0) ui,· = √ Gi1 ∗ c1,· + Gi2 ∗ c2,· , 2
(26)
and H (k1 , k2 ) = H (−k1 ,−k2 ) . The 4 bidimensional filters Hij are obtained by (·)
(·)
putting in Eq. (5) the entry (i,j) of matrices H (k) . The same goes for Gij .
228
3.4
A. Ruedin
Analysis in Terms of Convolutions with Polyphase 2d Filters
Now the Z-transforms of equations (25) and (26) are (0) (0) c1,· (z1 , z2 ) c1,· (z1 , z2 ) y1,· (z1 , z2 ) u1,· (z1 , z2 ) = PF1 ,F2 (0) , = PI1 ,I2 (0) , y2,· (z1 , z2 ) u2,· (z1 , z2 ) c2,· (z1 , z2 ) c2,· (z1 , z2 ) (27) where (i) PF1 ,F2 is the polyphase matrix of 2 bidimensional filters F1 and F2 , (ii) PI1 ,I2 is the polyphase matrix of 2 bidimensional filters I1 and I2 , and (iii) the mentioned filters are Fi (z1 , z2 ) = Hi1 (z D ) + z1−1 Hi2 (z D ),
i = 1, 2,
(28)
Ii (z1 , z2 ) = Gi1 (z D ) + z1−1 Gi2 (z D ),
i = 1, 2.
(29)
(·)
(·)
(·)
(·)
We are mainly interested in filters I1 and I2 , on which we will impose condi(·) tions. We call them polyphase highpass filters. Filter Ii has coefficients Gi1 on (·) coset Γ0 , and coefficients Gi2 on coset Γ1 . The polyphase matrix for both these filters is (·) (·) 1 G11 ( z11 , z12 ) G12 ( z11 , z12 ) PI1 ,I2 = √ . (·) (·) 2 G21 ( z11 , z12 ) G22 ( z11 , z12 ) By replacing Eqs. (17) (left) and (18) (left) into Eqs. (25) and (26), we get yi,· = (X (0) ∗ Fi ) ↓ D,
ui,· = (X (0) ∗ Ii ) ↓ D,
(30)
and replacing into Eq.(24), we have another expression for Eq. (23): (−1)
ci,· 3.5
= (X (0) ∗ Fi ) ↓ D2 ,
(−1)
di,·
= (X (0) ∗ Ii ) ↓ D2 ,
i = 1, 2.
Synthesis
Similarly it can be shown that the synthesis formula is: ⎡ ⎤ T T 1 (0) (−1) (−1) ck = √ ⎣ H (k−Dj) c·,j + G(k−Dj) d·,j ⎦ 2 j∈Z 2 j∈Z 2
4 4.1
(31)
(32)
Further Desired Properties Balancing Φ1 and Φ2
When processing a constant signal one may get unbalanced values at the 2 outputs of the lowpass branch, if no additional conditions are set on the multiwavelet system. This annoying fact suggested the idea of balanced multiwavelets [20, 21]. Multiwavelets are balanced if the the lowpass branch preserves 2 equal constant images and the highpass branch annihilates them.
Polyphase Filter and Polynomial Reproduction Conditions
229
We list the 3 conditions for a balanced multiwavelet system, given in [13]: (00) 1 (00) 1 1 1 Si S =2 , = 1 1 for i = 0, 1. (33) 1 1 4.2
Lowpass and Highpass Conditions for the 2d Filters (·)
We want the frequency response of the lowpass filters Hij to be 0 at (π, π), and to be flat at that point. We also want the frequency response of the highpass (·) filters Gij to be 0 at (0, 0), and to be flat at that point. We accordingly impose the following conditions on their Z-transform, their gradient and Hessian: (·)
(·)
Hij (−1, −1) = 0, (·)
∇Hij (−1, −1) = [ 0 0 ]T , ∇
(·) Hij (−1, −1)
i = 1, 2,
(·)
∇Gij (1, 1) = [ 0 0 ]T ,
2
0 0 = , 0 0
Gij (1, 1) = 0,
(34)
i = 1, 2, j = 1, 2, (35)
∇
2
(·) Gij (1, 1)
0 0 = , 0 0
j = 1, 2,
i = 1, 2, j = 1, 2. (36)
We require further conditions on the 2d polyphase highpass filters I1 and I2 , in order to improve the frequency localization of the filters. These are Ii (1, 1) = 0
5
∇Ii (1, 1) = [ 0 0 ]T ,
i = 1, 2.
(37)
Construction
The set of nonlinear equations arising from Eqs. (9–13, 14–16, 33, 34–37) was solved with a numerical Levenberg–Marquardt optimization routine. No solution was found for fewer than 9 matrices H^(k) on each coset. The coefficients obtained are listed in the appendix. In Fig. (4)(a) and (b) are plotted Φ1 and Φ2. The multiwavelets are equally smooth. The joint spectral radius was estimated to be 0.7071; following [3, 16], the continuity of all 4 functions was proved.
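As a rough, hedged illustration of such a construction (not the author's routine), constraint violations can be stacked into a residual vector and handed to a nonlinear least-squares solver. The sketch below includes only the lowpass orthogonality conditions of Eq. (14), with an 18-element index set Λ as read from Eq. (5); the full system additionally contains Eqs. (9–13), (15–16), (33) and (34–37), and was solved in the paper with a Levenberg–Marquardt routine.

```python
import numpy as np
from scipy.optimize import least_squares

# Index set Lambda as read from Eq. (5): 18 positions, 9 per coset (assumption of this sketch)
LAMBDA = [(-1, 1), (0, 2), (1, 2), (-2, 0), (-1, 0), (0, 0), (1, 0), (2, 0), (3, 0),
          (0, 1), (1, 1), (2, 1), (-1, -1), (0, -1), (1, -1), (2, -1), (0, -2), (1, -2)]

def add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def unpack(params):
    """One 2x2 matrix H^(k) per index in LAMBDA, read from the flat parameter vector."""
    return {k: m for k, m in zip(LAMBDA, params.reshape(-1, 2, 2))}

def orthogonality_residuals(params, shifts=((0, 0), (1, 0), (0, 1), (1, 1))):
    """Violation of the lowpass orthogonality conditions of Eq. (14):
    sum_k H^(k) (H^(k+Dj))^T equals |det D| I for j = (0,0), and 0 for other shifts j."""
    H = unpack(params)
    res = []
    for j in shifts:
        Dj = (j[0] + j[1], j[0] - j[1])                  # D j for D = [[1, 1], [1, -1]]
        target = 2.0 * np.eye(2) if j == (0, 0) else np.zeros((2, 2))
        acc = np.zeros((2, 2))
        for k, Hk in H.items():
            if add(k, Dj) in H:
                acc = acc + Hk @ H[add(k, Dj)].T
        res.append((acc - target).ravel())
    return np.concatenate(res)

x0 = 0.1 * np.random.randn(len(LAMBDA) * 4)
sol = least_squares(orthogonality_residuals, x0)  # subset only, so the default solver is used
print(sol.cost)
```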
Fig. 4. Multiscaling functions Φ1 (a) and Φ2 (b)
Fig. 5. (a) Lena with added Gaussian noise; (b) Lena denoised.
To illustrate one of the prospective applications of this constructed multiwavelet, we performed a simple test for denoising an image. Gaussian noise N (0, 10) was added to image Lena of (128 × 128) pixels. The noisy image was transformed with the multiwavelet (7 steps), a hard threshold was applied to the wavelet coefficients, leaving unchanged 18% of them, and the result was antitransformed. The recovered image in Fig. (5)(b) has a PSNR of 28. Most of the noise has disappeared; at the same time the image maintains its most salient features.
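A schematic version of this experiment (not the author's code): hard thresholding keeps the largest detail coefficients and zeroes the rest before synthesis. The functions `forward_transform` and `inverse_transform` are placeholders for the multiwavelet analysis and synthesis of section 3.

```python
import numpy as np

def hard_threshold(coeffs, keep_fraction=0.18):
    """Zero all coefficients except the largest (in magnitude) keep_fraction,
    mirroring the experiment in which 18% of the coefficients were retained."""
    flat = np.abs(coeffs).ravel()
    k = max(1, int(keep_fraction * flat.size))
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    return np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)

def psnr(reference, estimate, peak=255.0):
    mse = np.mean((reference.astype(float) - estimate.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# Usage sketch (forward_transform / inverse_transform are hypothetical placeholders
# for the 7-step multiwavelet analysis and synthesis described in section 3):
# coeffs = forward_transform(noisy_image, steps=7)
# denoised = inverse_transform(hard_threshold(coeffs))
# print(psnr(clean_image, denoised))
```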
6 Conclusions
In the search for smooth nonseparable multiwavelets, we have imposed several conditions such as polynomial reproduction up to degree 2, balancing of the 2 scaling functions and an orthogonal filterbank. We have required good lowpass properties on the 4 bidimensional filters that operate to calculate the approximation coefficients, and good highpass properties on the 4 bidimensional filters, as well as on the 2 bidimensional polyphase filters, that operate to calculate the detail coefficients. All these conditions were included in the design of the multiwavelet. We have shown how image processing is achieved with these wavelets: how the original image is decomposed into 2 input images; we have given the analysis-synthesis formulae and illustrated the first steps of these transforms. To find the associated filters, we have given the analysis step in 3 equivalent formulations. A graph of the two scaling functions associated to one multiwavelet has been obtained by means of a cascade algorithm, and the coefficients are given. Finally, a short experiment in denoising an image with the transform has given good results.
References [1] Daubechies, I.: Ten lectures on wavelets. Society for Industrial and Applied Mathematics (1992) [2] Mallat, S.: A Wavelet Tour of Signal Processing. Academic Press, London (1999) [3] Cohen, A., Daubechies, I.: Non-separable bidimensional wavelet bases. Revista Matematica Iberoamericana 9, 51–137 (1993) [4] Kovacevic, J., Vetterli, M.: Nonseparable multidimensional perfect reconstruction filter banks and wavelet bases for Rn . IEEE Trans. Inf. Theor. 38, 533–555 (1992) [5] Lawton, W., Lee, S., Shen, Z.: Stability and orthonormality of multivariate refinable functions. SIAM J. Math. Anal. 28, 999–1014 (1997) [6] Karoui, A., Vaillancourt, R.: Nonseparable biorthogonal wavelet bases of L2 (n ). CRM Proceedings and Lecture Notes American Math. Society 18, 135–151 (1999) [7] Ji, H., Riemenschneider, S., Shen, Z.: Multivariate compactly supported fundamental refinable functions, duals and biorthogonal wavelets. Studies in Applied Mathematics (to appear) [8] Strela, V., Heller, P., Strang, G., Topiwala, P., Heil, C.: The application of multiwavelet filterbanks to image processing. 8, 548–563 (1999) [9] Plonka, G., Strela, V.: Construction of multiscaling functions with approximation and symmetry. SIAM Journal of Mathematical Analysis 29, 481–510 (1998) [10] Wajcer, D., Stanhill, D., Zeevi, Y.: Two-dimensional nonseparable multiwavelet transform and its application. In: Proc. IEEE-SP Intern. Symp. Time-Frequency and Time-Scale Analysis, pp. 61–64. IEEE Computer Society Press, Los Alamitos (1998) [11] Tay, D., Kingsbury, N.: Design of nonseparable 3-d filter banks wavelet bases using transformations of variables. IEE VISP 143, 51–61 (1996) [12] Ruedin, A.: Nonseparable orthogonal multiwavelets with 2 and 3 vanishing moments on the quincunx grid. Proc. SPIE Wavelet Appl. Signal Image Proc. VII 3813, 455–466 (1999) [13] Ruedin, A.M.C.: Balanced nonseparable orthogonal multiwavelets with two and three vanishing moments on the quincunx grid. Wavelet Appl. Signal Image Proc. VIII, Proc. SPIE 4119, 519–527 (2000) [14] Ruedin, A.M.C.: Construction of nonseparable multiwavelets for nonlinear image compression. Eurasip J. of Applied Signal Proc. 2002(1), 73–79 (2002) [15] Ruedin, A.: A nonseparable multiwavelet for edge detection. Wavelet Appl. Signal Image Proc. X, Proc. SPIE 5207, 700–709 (2003) [16] Ruedin, A.: Estimating the joint spectral radius of a nonseparable multiwavelet. In: IEEE Proc. XXIII Int. Conf. SCCC, pp. 109–115. IEEE Computer Society Press, Los Alamitos (2003) [17] Ruedin, A.M.C.: Dilation matrices for nonseparable bidimensional wavelets. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2006. LNCS, vol. 4179, pp. 91–102. Springer, Heidelberg (2006) [18] Ron, A.: Smooth refinable functions provide good approximation orders. SIAM J. Math. Anal. 28, 731–748 (1997) [19] Cabrelli, C., Heil, C., Molter, U.: Accuracy of lattice translates of several multidimensional refinable functions. J. of Approximation Theory 95, 5–52 (1998) [20] Lebrun, J., Vetterli, M.: Balanced multiwavelets: Theory and design. IEEE Transactions on Signal Processing 46, 1119–1125 (1998) [21] Selesnick, I.: Balanced multiwavelet bases based on symmetric fir filters. Proceedings SPIE, Wavelet Applications in Signal Processing VII 3813, 122–131 (1999)
Appendix Here we list the coefficients for the multiwavelet system. In matrix Λ are the indices Λ in column form. In matrix A are given the coefficients of matrices H (k) , each one in a row. In matrix B are given the coefficients of matrices G(k) , each one in a row. 0 2 1 0 −1 −1 0 1 2 2 1 0 −1 1 −2 1 0 3 Λ= 2 1 1 1 1 0 0 0 0 −1 −1 −1 −1 −2 0 2 −2 0 Aj,1 Aj,2 Bj,1 Bj,2 Λ(:,j) Λ(:,j) H = G = j = 1, 18 Aj,3 Aj,4 Bj,3 Bj,4 −7.676681737331555e − 3 −1.800628466756745e 5.596475347677797e 8.098466931480185e 4.074252174849357e 1.298915261680141e 9.061671169687957e −1.265355512890541e 1.895490932371631e 2.491938391014932e 1.591364018110787e −1.266320637413509e −8.014545642902055e 4.366333191230456e −3.262591322547495e 1.388620251346148e −8.011769972362422e 2.469215102776983e −7.559021569346361e − 2 −1.149977810403876e
− − − − − − − − − − − − − − − − − −
2 1 1 1 1 2 1 2 1 1 1 2 2 2 2 3 2 1
−3.262586850656414e 4.366339763966613e −8.014547650802760e −1.266319620203958e 1.591363305682723e 2.491937172117110e 1.895493068974861e −1.265356622857294e 9.061669733937130e 1.298915631027565e 4.074253002192344e 8.098467462914564e 5.596475008358056e −1.800626519183009e −1.149977653519566e 2.469210034195785e −8.011771151254036e 1.388618816576228e
− − − − − − − − − − − − − − − − − −
2 2 2 1 1 1 2 1 2 1 1 1 1 2 1 2 3 2
−6.862965319594380e −1.029099268509471e −2.174263458637924e 1.940648664509475e 3.211054197473414e 5.797596701083157e 2.560171417770199e −3.211739878927785e −3.604086902716809e 4.264788049967948e −1.541700237096979e −2.320480468126808e 5.777024146398765e 7.676681979242073e 7.559027761407097e −1.006126331657131e −3.800648455827771e 1.670266704222660e
− − − − − − − − − − − − − − − − − −
2 1 1 2 1 1 2 1 2 2 1 1 2 3 2 2 3 2
−2.904772647107222e − 2 −6.092372995000143e −1.984745425697163e 3.508172366980110e −2.078617176106751e 2.662472529709452e −4.567694788051868e 1.401416795965695e 2.186481424669168e −1.710925031322607e 1.372777223848028e 8.463027385831111e −4.189116607471705e 4.063807126959025e −1.641402198476032e −3.145094294168439e −7.372010643382483e 1.056891535124376e 2.585601740682022e − 2 7.810851434252261e
− − − − − − − − − − − − − − − − − −
2 1 1 1 1 1 1 1 1 2 2 1 1 2 2 3 2 2
1.641403678779469e −4.063806548823623e 4.189117092851097e −8.463028926583166e −1.372779009888785e 1.710924883739774e −2.186481763858121e −1.401417510903238e 4.567694445851030e −2.662471485826307e 2.078617539708238e −3.508172315999681e 1.984745622942293e 6.092368900936538e −7.810851735724883e −1.056892231841977e 7.371995136947689e 3.145090668352720e
− − − − − − − − − − − − − − − − − −
2 1 1 2 2 1 1 1 1 1 1 1 1 2 2 2 3 2
3.651338604134514e −2.427336123691436e 6.321843717640014e −3.274245683418866e −1.966235821047690e 3.373556733929335e 9.595223609484102e −2.086129104744682e −2.366519423722117e 1.933608395607457e −2.204336731612760e 2.553246535796781e −2.561043882659167e −2.904769409913837e 2.585605524114773e 1.047802371338519e 2.266674076287575e 1.079402775038917e
− − − − − − − − − − − − − − − − − −
2 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 2 1
−5.777029000654312e − 2 2.320482074377012e − 1 1.541699926636972e − 1 −4.264785660948585e − 2 −2 3.604091724353745e 3.211738860657933e − 1 −2.560148512183114e − 2 −5.797596616642996e − 1 A = −3.211055298768434e − 1 −1.940634290483643e − 2 −1 2.174262940716100e 1.029099764096597e − 1 6.862975689337350e − 2 −1.670267210359687e − 2 −3 3.800711927501486e 1.006126718542787e − 2
−2.561043416621321e − 1 2.553247635276005e − 1 −2.204336825937889e − 1 1.933608991720845e − 1 −2.366519231481610e − 1 −2.086131234865212e − 1 9.595211396234715e − 2 3.373555792117917e − 1 B = −1.966234916844790e − 1 −3.274244005512183e − 1 6.321844504902909e − 1 −2.427335263643445e − 1 3.651332740071130e − 2 1.079402486053018e − 1 2.266674322523625e − 2 1.047796458114367e − 2
w^T = [ 1  1 ]
u^T = [ 6.026297688062700e−2  −1.814358597181103e+0 ]
v^T = [ −1.868081007004267e+0  9.206123907268971e−1 ]
x^T = [ 2.648279773438635e+0  6.472565969993439e−1 ]
y^T = [ −2.679391169766528e+0  8.965035652896843e−1 ]
z^T = [ −5.205839782403163e−1  4.857833932468711e+0 ]
Multidimensional Noise Removal Method Based on Best Flattening Directions Damien Letexier1 , Salah Bourennane1 , and Jacques Blanc-Talon2 1
Institut Fresnel (CNRS UMR 6133),Univ. Paul C´ezanne, Ecole Centrale Marseille, Dom. Univ. de Saint J´erˆ ome, 13397 Marseille Cedex, France
[email protected] 2 DGA/MRIS, Arcueil, France
Abstract. This paper presents a new multi-way filtering method for multi-way images impaired by additive white noise. Instead of matrices or vectors, multidimensional images are considered as multi-way arrays also called tensors. Some noise removal techniques consist in vectorizing or matricizing multi-way data. That could lead to the loss of inter-bands relations. The presented filtering method consider multidimensional data as whole entities. Such a method is based on multilinear algebra. We adapt multi-way Wiener filtering to multidimensional images. Therefore, we introduce specific directions for tensor flattening. To this end, we extend the SLIDE algorithm to retrieve main directions of tensors, which are modeled as straight lines. To keep the local characteristics of images, we propose to adapt quadtree decomposition to tensors. Experiments on color images and on HYDICE hyperspectral images are presented to show the importance of flattening directions for noise removal in color images and hyperspectral images.
1 Introduction
In physics, the acquisition of data is an important step to validate theory. However, because of acquisition or transmission processes, data sets are often impaired by noise. Therefore, the first pre-processing step to analyze data relies on an efficient denoising. Although image processing has been of major interest for years, most studies concern monochrome images [1]. For multidimensional images, some denoising methods consider each band separately. This kind of method is poorly adapted to multidimensional image processing because it cuts the link between the dimensions of the image. In this paper, multidimensional data are considered as whole entities. This model has been used in several fields such as psychology [2], chemometrics [3], face recognition [4], etc. Recently, a tensor based filtering which extends bidimensional Wiener filtering to multi-way arrays has been proposed [5]. The goal of this paper is to improve this multidimensional Wiener filtering (MWF) by taking into account the characteristics of the processed data. We propose to process three dimensional images, which means there are two dimensions (or n-modes) for the localization of a pixel (row and column) and a dimension for the
spectral channel. To improve M W F efficiency, a specific flattening of tensors is used, based on the estimation of main directions in the image. These flattening directions are obtained by the extension of the SLIDE algorithm [6, 7]. A block decomposition is used to keep local characteristics of images. The paper is organized as follows. Section 2 overviews some useful tools of multilinear algebra. Section 3 recalls the multi-way Wiener filtering (M W F ), without any choice on the flattening directions of tensors. The drawbacks of M W F are depicted and explained in section 4. In section 5, we propose to retrieve the main directions in tensors in order to choose adaptive flattening directions in the filtering process. Section 6 presents a way to restore local details using a quadtree based block partitioning of HSI. Experimental results on realworld data are provided in section 7. The last section concludes the paper. In the whole paper, scalar is denoted by x, vector by x, matrix by X and tensor by X . ×n denotes the n-mode product between a tensor and a matrix.
2 Tensor Flattening
A tensor can be turned into a n-mode matrix (Fig. 1). The n-mode flattening matrix An of a tensor A ∈ RI1 ×...×IN is defined as a matrix [8] from RIn ×Mn where : Mn = In+1 · . . . · IN · I1 · . . . · In−1 .
Fig. 1. 2-mode flattening matrix of color image baboon represented as a third order tensor A ∈ RI1 ×I2 ×I3
In the following, the n-mode flattening matrix ranks are denoted by $K_n$ and called n-mode ranks [8]: $K_n = \mathrm{rank}(A_n)$, $n = \{1, \dots, 3\}$.
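A minimal NumPy sketch of the n-mode flattening and the n-mode rank; the exact column ordering of the flattening matrix defined in [8] is not reproduced here (any fixed ordering of the remaining modes gives the same rank).

```python
import numpy as np

def unfold(A, mode):
    """n-mode flattening A_n: put the chosen mode first, flatten the rest."""
    return np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)

# A color image seen as a third-order tensor (rows x columns x channels)
A = np.random.rand(120, 160, 3)
A2 = unfold(A, 1)                      # 2-mode flattening, shape (160, 360)
K2 = np.linalg.matrix_rank(A2)         # 2-mode rank K2
```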
3 Multi-way Wiener Filtering
Multi-way data are considered to be impaired by additive white noise $\mathcal{N}$. It has been shown that MWF [5] is far more efficient than bidimensional Wiener filtering, which consists in processing the bands separately. This method is based
on Tucker3 decomposition [2, 9], which considers that a tensor can be seen as a multi-mode product:

$$\mathcal{X} = \mathcal{G} \times_1 C^{(1)} \times_2 \dots \times_N C^{(N)}, \tag{1}$$

where $C^{(n)}$ is an $I_n \times J_n$ matrix and $\mathcal{G} \in \mathbb{R}^{J_1 \times \dots \times J_N}$. $\mathcal{G}$ is called the core tensor and $\times_n$ is the n-mode product. The entries of the n-mode product $\mathcal{P} = \mathcal{G} \times_n C^{(n)}$ are given by [8]:

$$p_{i_1 \dots i_{n-1}\, j\, i_{n+1} \dots i_N} = \sum_{i_n = 1}^{I_n} g_{i_1 \dots i_{n-1} i_n i_{n+1} \dots i_N}\; c^{(n)}_{j i_n}. \tag{2}$$

Let us define the noisy data tensor

$$\mathcal{R} = \mathcal{X} + \mathcal{N}, \tag{3}$$

where $\mathcal{X}$ is the signal tensor. The multi-way filtering principle consists in the estimation of the tensor $\mathcal{X}$, denoted by $\hat{\mathcal{X}}$:

$$\hat{\mathcal{X}} = \mathcal{R} \times_1 H^{(1)} \times_2 H^{(2)} \times_3 \dots \times_N H^{(N)}. \tag{4}$$
Each matrix $H^{(n)}$ of equation (4) is called an n-mode filter. In the case of MWF, the n-mode filters are obtained through the minimization of the mean squared error. The n-mode filters $H^{(n)}$ are computed with an Alternating Least Squares algorithm. It is an iterative algorithm: the n-mode filters are initialized to identity and, when $H^{(n)}$ is computed, the m-mode filters $H^{(m)}$, $m \neq n$, are kept fixed. The final expression of the n-mode filter $H^{(n)}$ is given by [5]:

$$H^{(n)} = V_s^{(n)} \Lambda^{(n)} V_s^{(n)T}, \tag{5}$$

where $V_s^{(n)}$ are the eigenvectors corresponding to the n-mode signal subspace and $\Lambda^{(n)}$ is a weight matrix involving the eigenvalues of the covariance matrices corresponding to the signal and data n-mode flattening matrices $X_n$ and $R_n$. MWF needs the n-mode rank values $K_1, K_2, \dots, K_N$ in the weight matrices $\Lambda^{(n)}$, $n = \{1, \dots, N\}$. They can be estimated using the Akaike Information Criterion [10, 11, 12].
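As a rough illustration of Eqs. (2), (4) and (5), the sketch below builds a simplified n-mode filter from the eigendecomposition of the covariance of the flattened data and applies it with an n-mode product. The weight matrix Λ(n) is replaced by the identity on the estimated signal subspace, so this is only a crude stand-in for the MWF of [5], with the ranks (K1, K2, K3) assumed known.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def n_mode_product(T, M, mode):
    # T x_n M (Eq. (2)): multiply the mode flattening by M and fold back
    rest = [s for i, s in enumerate(T.shape) if i != mode]
    P = (M @ unfold(T, mode)).reshape([M.shape[0]] + rest)
    return np.moveaxis(P, 0, mode)

def simple_mode_filter(R, mode, K):
    # Crude stand-in for Eq. (5): projector onto the K leading eigenvectors
    # of the covariance of the mode flattening (Lambda^(n) taken as identity)
    Rn = unfold(R, mode)
    _, V = np.linalg.eigh(Rn @ Rn.T / Rn.shape[1])
    Vs = V[:, ::-1][:, :K]
    return Vs @ Vs.T

# One (non-iterated) filtering pass in the spirit of Eq. (4)
R = np.random.rand(32, 32, 8)              # toy noisy data tensor
X_hat = R
for n, K in enumerate((20, 20, 4)):        # assumed n-mode ranks (K1, K2, K3)
    X_hat = n_mode_product(X_hat, simple_mode_filter(R, n, K), n)
```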
4 Drawbacks of MWF
To quantify the restoration of images, the remainder of the paper uses the following criteria:
– The signal-to-noise ratio (SNR), to measure the noise magnitude in the data tensor:

$$SNR = 10 \cdot \log \left( \frac{\|\mathcal{X}\|^2}{\|\mathcal{B}\|^2} \right) \tag{6}$$
– A quality criterion (QC) to evaluate quantitatively the estimation compared to the signal tensor:

$$QC(\hat{\mathcal{X}}) = 10 \cdot \log \left( \frac{\|\mathcal{X}\|^2}{\|\hat{\mathcal{X}} - \mathcal{X}\|^2} \right) \tag{7}$$
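Both criteria translate directly into code; a small sketch follows (a base-10 logarithm is assumed, since the values are reported in dB).

```python
import numpy as np

def snr_db(X, B):
    """Eq. (6): ratio between the signal tensor and the noise tensor, in dB."""
    return 10.0 * np.log10(np.sum(X ** 2) / np.sum(B ** 2))

def qc_db(X_hat, X):
    """Eq. (7): quality of the estimate X_hat with respect to the signal X."""
    return 10.0 * np.log10(np.sum(X ** 2) / np.sum((X_hat - X) ** 2))
```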
Even if MWF has been shown to improve channel-by-channel filtering of color images impaired by additive white noise [5], in some cases the improvement is not visually rendered. As an illustration, Fig. 2 shows that artifacts can appear.

Fig. 2. (a) Signal tensor, (b) noisy data tensor: SNR = 9.03 dB, (c) MWF: QC = 15.21 dB
Two kinds of artifacts are present in the MWF restored tensor. The first is an overall blur, because local characteristics of images are not taken into account during the filtering. The second is an undesirable effect of vertical and horizontal lines, which comes from the orthogonal projections in the filtering process (see equation (4)). In the remainder of the paper, we aim at avoiding these drawbacks. For that purpose, we propose to retrieve the directions adapted to the image for the projections involved by the n-mode products. That is, we aim at rearranging the data in the flattening matrices.
5 Estimation of Main Direction of a HSI by SLIDE Algorithm
To rearrange the data in the flattening matrices, we propose to find the main directions of tensors. The main directions are modeled as straight lines. They represent the principal directions used for the projections involved by the multi-way filtering of equation (4). In this paper, the SLIDE algorithm [6, 7, 13] provides the orientation of the main directions in HSIs instead of the Hough Transform [14], which exhibits a higher computational cost [6, 13]. The number of main directions is given by the Minimum Description Length [11]. The main idea of this method is that it is possible to generate some virtual
Fig. 3. (a) The image matrix provided with the coordinate system and rectilinear array of N equidistant sensors. (b) A straight line characterized by its angle θ and offset x0 .
signals out of the image data. That permits us to establish the analogy between the localization of sources in array processing and the recognition of straight lines in image processing. The modeling is depicted in Fig. 3. In the case of a noisy image containing d straight lines, the signal measured at the l-th row reads [6]:

$$z_l = \sum_{k=1}^{d} e^{j\mu(l-1)\tan\theta_k} \cdot e^{-j\mu x_{0k}} + n_l, \qquad l = 1, \dots, N, \tag{8}$$
where μ is a parameter of speed propagation [6] and $n_l$ is the noise resulting from outlier pixels at the l-th row. Starting from this signal, the SLIDE method (Straight LIne DEtection) [6, 7] can be used to estimate the orientations $\theta_k$ of the d straight lines. Defining

$$a_l(\theta_k) = e^{j\mu(l-1)\tan\theta_k}, \qquad s_k = e^{-j\mu x_{0k}}, \tag{9}$$

we obtain

$$z_l = \sum_{k=1}^{d} a_l(\theta_k)\, s_k + n_l, \qquad \forall\, l = 1, \dots, N. \tag{10}$$
Thus, the $N \times 1$ vector $z$ is defined by

$$z = As + n, \tag{11}$$
where $z$ and $n$ are $N \times 1$ vectors corresponding respectively to the received signal and the noise, $A$ is an $N \times d$ matrix, and $s$ is the $d \times 1$ source signal vector. This relation corresponds to the usual signal model of an array processing problem. The SLIDE algorithm [6, 7] provides the estimation of the angles $\theta_k$:

$$\theta_k = \tan^{-1}\!\left( \frac{1}{\mu\Delta}\, \mathrm{Im}\!\left( \ln \frac{\lambda_k}{|\lambda_k|} \right) \right), \qquad k = 1, \dots, d, \tag{12}$$
where Δ is the displacement between the two sub-arrays as defined in the TLSESPRIT algorithm [15]. {λk , k = 1, . . . , d} are the eigenvalues of a diagonal unitary matrix that relates the measurements from the first sub-array to the measurements resulting from the second sub-array and ”Im” stands for ”imaginary part”.
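A much simplified, hypothetical sketch of the idea follows: each image row is turned into one virtual sensor measurement as in Eq. (8), and for a single dominant line (d = 1) the direction is recovered from the average phase increment between consecutive rows, a crude stand-in for the TLS-ESPRIT eigenvalue step of Eq. (12); the sign convention and the value of μ are assumptions.

```python
import numpy as np

def dominant_direction(img, mu=0.1):
    """Estimate the angle of a single dominant straight line in a
    (near-)binary image via SLIDE-style virtual signals."""
    cols = np.arange(img.shape[1])
    z = (img * np.exp(1j * mu * cols)).sum(axis=1)   # one measurement per row
    step = np.sum(z[1:] * np.conj(z[:-1]))           # averaged phase increment
    tan_theta = np.angle(step) / mu                  # ~ mu * tan(theta) per row
    return np.degrees(np.arctan(tan_theta))

# Toy check: a line x = 10 + l * tan(30 deg) on a 64x64 grid
img = np.zeros((64, 64))
rows = np.arange(64)
xs = np.clip(np.round(10 + rows * np.tan(np.radians(30))).astype(int), 0, 63)
img[rows, xs] = 1.0
print(dominant_direction(img))   # close to 30
```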
6 Block Partitioning
The second processing step proposed to improve MWF is a block approach that takes local characteristics into account. For that purpose, a quadtree decomposition is used to provide homogeneous sub-tensors. Such a block processing approach has been used for the segmentation of hyperspectral images [16]. In this paper, the quadtree decomposition is adapted to improve the restoration of local details by MWF. The approach consists in filtering homogeneous regions separately to preserve local characteristics.
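A minimal sketch of such a quadtree split into homogeneous blocks; the homogeneity test (a standard-deviation threshold applied to, e.g., the spectral mean image) and the minimum block size are assumptions, not the criterion of [16]. Each returned block can then be filtered separately.

```python
import numpy as np

def quadtree_blocks(image, min_size=16, std_thresh=10.0):
    """Recursively split the spatial support into four quadrants until a block
    is homogeneous or small; returns (row0, row1, col0, col1) tuples."""
    blocks = []

    def split(r0, r1, c0, c1):
        block = image[r0:r1, c0:c1]
        if min(r1 - r0, c1 - c0) <= min_size or block.std() <= std_thresh:
            blocks.append((r0, r1, c0, c1))
            return
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        for a, b in ((r0, rm), (rm, r1)):
            for c, d in ((c0, cm), (cm, c1)):
                split(a, b, c, d)

    split(0, image.shape[0], 0, image.shape[1])
    return blocks
```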
7 Experiments
The criteria SNR and QC used to quantify the restoration have been defined in equations (6) and (7). We denote by MWF the Multi-way Wiener Filtering and by MWFR the Multi-way Wiener Filtering applied on Rearranged flattening matrices and subtensors.

7.1 Color Images
A color image can be seen as a third order tensor. Two modes correspond to pixel localization and the third mode represents the color channel (red, green or blue). Fig. 4 shows the improvement brought by the rearrangement of data (Fig. 4(d)), compared to classical multi-way Wiener filtering (Fig. 4(c)) of the noisy data
Fig. 4. (a) SNR = 9.03 dB, (b) QC = 15.21 dB, (c) QC = 16.87 dB. (c) MWF, (d) MWFR with SLIDE estimated angles θR: [0°, 20°, 25°, 60°, 78°, 90°]
tensor of Fig. 2(b). Here, the analysis of the image has provided six main directions: 0°, 20°, 25°, 60°, 78°, 90°.

7.2 Hyperspectral Images
This subsection gives some results concerning real-world data HSIs, obtained with HYDICE [17]. HYDICE is an airborne sensor. It collects post-processed data for 210 wavelengths from the range 0.4 - 2.5 μm. The spatial resolution is 1.5 m and the spectral resolution is 10 nm. As color images, hyperspectral images can be written as third order tensors, the third mode being the spectral signature. Fig. 5 gives a visual interpretation of the improvement brought by M W F R in terms of quality criterion. Actually, the oblique road of the image is poorly restored by M W F compared to M W F R. This visual interpretation is closely linked with the values of the quality criterion of both images: 18.19 dB and 19.59 dB. The analysis of the image has given three main directions corresponding to orientations of roads. Fig. 6 studies the evolution of the restoration quality with respect to the SN R, varying from 4 dB to 16 dB. M W F R clearly improves the results obtained
Fig. 5. (a) Signal tensor, (b) data tensor: SNR = 10.13 dB, (c) recovered tensor by MWF: QC = 18.19 dB, and (d) recovered tensor by MWFR with SLIDE estimated angles θR: [0°, 34°, 90°]: QC = 19.59 dB
Fig. 6. QC with respect to the SNR for each filtering method (CCWF, MWF, MWFR) for the previous hyperspectral image (K1 = K2 = 31, K3 = 97)
with MWF. We have also included the comparison with the channel-by-channel optimized Wiener filtering provided by Matlab (CCWF). This result shows that MWF effectively improves noise reduction [5]. Moreover, using the HSI's local characteristics removes the artifacts and the blur in the restored tensor. As a result, MWFR improves on MWF by 1 dB for a wide range of initial SNR.
8 Conclusion
In this paper, we have proposed an improved multi-way Wiener filtering for multi-way images impaired by additive white noise. This multi-way filtering considers a multidimensional image as a whole entity, which is not the case in usual noise removal methods. Multilinear algebra provides tools such as the n-mode product and the flattening matrices of a tensor, which permit the development of n-mode filters, that is, the joint filtering of the data in each mode of the image. The main problems of this approach are that the local characteristics of images are not considered and that the flattening process is not specific to the data. Thus, we have proposed to rearrange the data in the flattening matrices thanks to the retrieval of main directions and a quadtree based decomposition. The main directions are obtained using the adapted SLIDE algorithm. We have shown that the consideration of local characteristics of images leads to an improved restoration on real-world color images and HYDICE hyperspectral images. We have compared the results obtained with multi-way Wiener filtering without choosing the flattening directions, with the optimized channel-by-channel Wiener filter of MATLAB, and with our new method. Our algorithm could be applied as a pre-processing method for further applications such as classification or target detection.
References 1. Huang, K., Wu, Z., Fung, G., Chan, F.: Color image denoising with wavelet thresholding based on human visual system model. Signal Processing: Image Communication 20, 115–127 (2005) 2. Kroonenberg, P.: Three-mode principal component analysis. DSWO press (1983) 3. Kiers, H.: Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics 14, 105–122 (2000) 4. Alex, M., Vasilescu, O., Terzopoulos, D.: Multilinear analysis of image ensembles: Tensorfaces. In: Tistarelli, M., Bigun, J., Jain, A.K. (eds.) ECCV 2002. LNCS, vol. 2359, Springer, Heidelberg (2002) 5. Muti, D., Bourennane, S.: Survey on tensor signal algebraic filtering. Signal Processing, 237–249 (2007) 6. Aghajan, H., Kailath, T.: Sensor array processing techniques for super resolution multi-line-fitting and straight edge detection. IEEE Trans. on Image Processing 2, 454–465 (1993) 7. Sheinvald, J., Kiriati, N.: On the magic of SLIDE. Machine vision and Applic. 9(97), 251–261 8. Lathauwer, L.D., Moor, B.D., Vandewalle, J.: A multilinear singular value decomposition. SIAM Jour. on Matrix An. and Applic. 21, 1253–1278 (2000) 9. Tucker, L.: Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279–311 (1966) 10. Akaike, H.: A new look at the statistical model identification. IEEE transactions on automatic control AC-19 (1974) 11. Wax, M., Kailath, T.: Detection of signals by information theoretic criteria. IEEE transactions on acoustics, speech and signal processing, ASSP-33 (1985) 12. Renard, N., Bourennane, S., Blanc-Talon, J.: Multiway filtering applied on hyperspectral images. Lecture Notes on Computer Science. Springer, Heidelberg (2006) 13. Bourennane, S., Marot, J.: Contour estimation by array processing methods. Applied signal processing (2006) 14. Duda, R., Hart, P.: Use of the hough transform to detect lines and curves in pictures. Comm. ACM 15(72), 11–15 15. Roy, R., Kailath, T.: Esprit-estimation of signal parameters via rotational invariance techniques. IEEE Trans. on ASSP 37(89), 984–995 16. Kwon, H., Der, S., Nasrabadi, N.: An adaptive hierarchical segmentation algorithm based on quadtree decomposition for hyperspectral imagery. In: ICIP (2000) 17. Rickard, L.J., Basedow, R.W., Zalewski, E.F., Silverglate, P.R., Landers, M.: HYDICE: an airborne system for hyperspectral imaging. In: Vane, G. (ed.) Proc. SPIE, Imaging Spect. of the Terrestrial Environment, vol. 1937, pp. 173–179 (1993)
Low-Rank Approximation for Fast Image Acquisition Dan C. Popescu, Greg Hislop, and Andrew Hellicar Wireless Technologies Lab CSIRO ICT Centre, Marsfield NSW 2122, Australia {Dan.Popescu,Greg.Hislop,Andrew.Hellicar}@csiro.au Abstract. We propose a scanning procedure for fast image acquisition, based on low-rank image representations. An initial image is predicted from a low resolution scan and a smooth interpolation of the singular triplets. This is followed by an adaptive cross correlation scan, following the maximum error in the difference image. Our approach aims at reducing the scanning time for image acquisition devices that are in the single-pixel camera category. We exemplify with results from our experimental microwave, mm-wave and terahertz imaging systems.
1 Introduction
Imaging systems have advanced considerably over the last century, from systems capable of imaging just visible light, to imaging technologies which cover the accessible electromagnetic spectrum. The motivation for imaging in different frequency regimes is that each region of the electromagnetic spectrum has its own unique characteristics in which electromagnetic waves interact with matter. For example, x-rays penetrate through a range of materials and allow imaging of the interior of structures, thermally generated infrared waves allow night vision, and microwave radar systems can detect structures at long ranges. A sample of the characteristics for various frequency domains, including image resolution, energy available for imaging due to black body radiation from objects at room temperature (300K), penetrating ability of radiation, and what part of the material the waves interact with, is summarised in Table 1. It should also be noted that frequencies above optical, such as x-rays and gamma-rays are ionising (due to their atomic interaction) and potentially hazardous for people. Examining Table 1 it is apparent that imaging in the mm-wave and terahertz region of the spectrum offers unique possibilities due to the confluence of a number of characteristics including penetration through a range of materials, suitable resolution, molecular interactions, and safety for humans. Early applications in the security and medical areas have been identified such as imaging non-metallic weapons concealed under clothing on people [1], identifying explosives through their unique molecular response [2], or in-vivo imaging of the extent of skin cancers [3]. Technology has become available in the last decade allowing imaging at these frequencies which are historically high for electronics-based imaging systems and low for laser based systems. This ‘final frontier of the electromagnetic spectrum’ is now being explored by a growing number of research groups. J. Blanc-Talon et al. (Eds.): ACIVS 2007, LNCS 4678, pp. 242–253, 2007. c Springer-Verlag Berlin Heidelberg 2007
Table 1. Characteristics of various regions in the electromagnetic spectrum

frequency  | resolution | 300K radiation | penetration                                          | interaction
microwave  | cm         | negligible     | penetrates walls, blocked by metal                   | bulk
mm-wave    | mm         | low            | penetrates clothing and packaging, blocked by metal  | bulk
terahertz  | sub-mm     | medium         | penetrates clothing and packaging, blocked by metal  | inter-molec.
infrared   | high       | high           | blocked                                              | intra-molec.
optical    | high       | medium         | blocked                                              | atomic
x-ray      | high       | negligible     | penetrates most materials including metal            | atomic
gamma-ray  | high       | negligible     | penetrates most materials including metal            | nucleus
However there are a number of problems when imaging at frequencies where detector technology is not mature and where there is not much power generated by the scene. In these situations it is difficult to capture an image with a suitable signal to noise level. To achieve desired signal levels imaging systems often employ a large array of detectors that simultaneously sample the entire image with an integration time long enough to generate a large enough signal. Unfortunately in the case of immature detector technology, the cost of the detector is a significant proportion of the imaging system’s cost, and arrays are not cost effective. Systems are restricted to one or a small number of detectors that are steered across the scene to build an image. The lower achievable detector sensitivities mean that a large integration time needs to be spent achieving the required signal from each pixel. Unfortunately these large integration times, and serial approach to image acquisition are prohibitive for applications where frame rates need to be fast, such as imaging walking people. Integration time is the limiting factor and another approach is required. The CSIRO ICT Centre has a microwave antenna range [4] which can be used to generate images by holography at 20 GHz. The Centre also has two electronic based imaging systems: a millimetre-wave imaging system [5] and a terahertz (THz) imaging system [6]. Both these systems generate images by a raster scan over the scene, and are limited in acquisition speed due to integration time requirements. The millimetre-wave system employs two antennas that generate orthogonal fan beams, or strips, on the scene. One antenna transmits energy in a strip across the scene, the other antenna receives energy from a strip orthogonal to the illuminated strip. Any energy collected by the receiving antenna is assumed to originate from reflection off an object at the intersection point of the two beams. In this way the system generates a single image pixel at a time, and by scanning the two orthogonal beams across the scene all pixels may be imaged. The system is shown in Fig. (1). The terahertz system is located on an optical bench and uses a quasi-optical system to focus energy from a THz source onto a sample in a spot approximately 1mm in diameter. The THz signal transmitted through the sample is then collected and focused onto a detector. An image is built up by physically
Fig. 1. Left: Schematic of millimetre-wave imaging system. Right: photo of mm-wave imaging system.
Fig. 2. Left: Schematic of THz imaging system including mirrors M1- M4, and translating sample. Right: Photo of THz imaging system.
translating the sample through the beam and acquiring the image one pixel at time. The THz system is shown in Figure 2. In this paper, we propose a method of reducing the acquisition time, based on low-rank representations of images. Our solution is based on the observation that not all image pixels contribute the same amount of information to the image. Integration time should be spent on regions of the image that are important at the expense of less integration time on less important regions of the image. However the question remains how to decide what region of the image is important, without a-priori knowledge of the image. Our approach employs an initial coarse sampling of the image and interpolation using a singular value decomposition. The image is then sampled with full resolution linear scans in regions where the errors are judged higher based on an adaptive cross approximation technique. The singular value prediction method is presented in section 2. In section 3 we present the details of the adaptive cross correlation technique. The two techniques are then combined into the adaptive low-rank scanning technique we present in section 4. We exemplify each technique with simulated results on an optical image, because the typical artifacts of each type of low rank approximation are better illustrated on such an image. In section 4 we present results obtained with our imaging systems described in this section.
2 Low Rank Approximation Using SVD
Images are, in general, represented by full-rank matrices. However, they do have good approximations given by lower rank matrices. Mirsky [7] has shown that the best k-rank approximation of a matrix under a unitary norm is obtained by retaining the highest k singular triplets from its singular value decomposition (SVD). Figure 3 shows an original 128 × 128 image and its best approximations of rank 64, 40 and 20 respectively. Even though the image is full-rank, only the last approximation image displays noticeable artifacts, and it is clear that by dropping a fair number of the low-order singular triplets one still retains a good approximation of the original. Unfortunately, to get such good quality low-rank reconstructions, one needs to have the whole of the image, in order to perform the singular value decomposition. What if we had only a scaled-down version of image I, say $I_d$, subsampled by a factor of d in both directions? We could then get its singular value decomposition:

$$I_d = U_d S_d V_d^T = \sum_{i=1}^{n} \sigma_i u_i v_i^T \tag{1}$$

and use it to write:

$$I \approx U S V^T = \sum_{i=1}^{n} d\,\sigma_i \left( \tfrac{1}{\sqrt{d}}\, u_i^d \right) \left( \tfrac{1}{\sqrt{d}}\, v_i^{dT} \right) \tag{2}$$
where the vectors udi and vid are scaled-up versions of ui and vi obtained by dtimes pixel replication. Eq. (2) constitutes a rigorous SVD for this approximation I,
Fig. 3. Left to right: original full-rank 128 × 128 image and best approximations of ranks 64, 40 and 20
which is a d-pixel replication of $I_d$. We can do significantly better than that in terms of image prediction at the higher scale, if we slightly relax the condition of orthonormality of the vectors in the singular triplets. We achieve this by generating the vectors $u_i^d$ and $v_i^d$ from $u_i$, $v_i$ using a 4-point interpolatory scheme [8]. Essentially, this is a dyadic smooth interpolation scheme for 1-dimensional signals, which generates a new value at the next scale from its nearest 4 neighbours, using the formula

$$f_{i+\frac{1}{2}} = -\tfrac{1}{16}\,(f_{i-1} + f_{i+2}) + \tfrac{9}{16}\,(f_i + f_{i+1}).$$
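The prediction of Eq. (2) with smooth interpolation of the singular triplets can be sketched as follows (NumPy, with a subsampling factor d that is a power of 2 so that the 4-point scheme can be applied dyadically); boundary handling by edge padding is an assumption.

```python
import numpy as np

def refine_4pt(f):
    """One dyadic step of the 4-point interpolatory scheme."""
    g = np.pad(f, 2, mode="edge")
    mid = -1.0 / 16 * (g[1:-3] + g[4:]) + 9.0 / 16 * (g[2:-2] + g[3:-1])
    out = np.empty(2 * len(f))
    out[0::2] = f          # keep existing samples
    out[1::2] = mid        # insert interpolated midpoints
    return out

def upscale(v, steps):
    for _ in range(steps):          # factor d = 2**steps
        v = refine_4pt(v)
    return v

def svd_predict(I_sub, d=4, rank=None):
    """Predict the full-resolution image from a d x d subsampled version
    by smoothly interpolating the singular vectors (cf. Eq. (2))."""
    U, s, Vt = np.linalg.svd(I_sub, full_matrices=False)
    r = len(s) if rank is None else rank
    steps = int(np.log2(d))
    return sum(d * s[i]
               * np.outer(upscale(U[:, i], steps) / np.sqrt(d),
                          upscale(Vt[i], steps) / np.sqrt(d))
               for i in range(r))
```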
Fig. 4. Left to right: original image and predictions from subsampled image: pixel replication, bilinear and bicubic interpolation, and smooth interpolation of singular triplets. The rms error values for the four prediction images are 19.67, 15.82, 16.69 and 9.21 respectively.
Fig. (4) shows an original image and its prediction from a scaled down version, by a factor of 4 in each direction, using pixel replication, the traditional techniques of bilinear and bicubic interpolation, and our proposed technique of smooth interpolation of the singular triplets. A simple visual inspection may suggest that the last 3 prediction images are of fairly similar quality, but the rms error values and the difference images shown in Fig. (5) reveal the better performance of the method using smooth interpolation of the singular triplets. Of course, scanning just a subsampled version of an image is unlikely to lead to an acquisition of acceptable quality. The question is how to choose, for additional scanning, the image regions where important detail is likely to be found. For example, in a security application, it is possible that a small-size concealed weapon would be lost by an undersampled version of the scene. Because such an object will leave a small-sized high intensity spot only in a higher resolution
Fig. 5. Difference images corresponding to the 4 prediction images of Fig. (4)
scanning, it is likely that attempts to predict the location of such important areas for the undersampled image would fail. This problem could be overcome by an additional scanning technique operating on whole row and columns, like the one we describe in the next section. This scheme is designed to adaptively chase and cancel the maximum error, and can quickly compensate highly localised error.
3 Low Rank Approximation Using the Adaptive Cross Approximation
The adaptive cross approximation (ACA) algorithm was originally proposed by Bebendorf [9] for low rank approximation of dense matrix kernels of integral operators. The motivation was to reduce the computational load of certain numerical techniques used to solve integral equations. In the following we give a concise description of the algorithm and indicate how it performs on image data representations [10]. Let an image to be collected be represented by an m × n matrix I. The ACA algorithm aims to approximate this image using:

$$I \approx \tilde{I} = \sum_{j=1}^{k} u_j v_j^T, \tag{3}$$

where $u_j$ and $v_j$ are m-vectors and n-vectors respectively, associated with selectively scanned columns and rows of I, and k is the number of ACA iterations performed (equal to the rank of $\tilde{I}$). The image approximation is iteratively refined, until either its rank reaches a given limit $k_{max}$, or it satisfies

$$\|I - \tilde{I}\| < \varepsilon \|I\|, \tag{4}$$

where $\|\cdot\|$ is the matrix Frobenius norm, $\|A\| = \sqrt{\sum_{i,j} |A_{ij}|^2}$, and ε is a required tolerance. If the algorithm needs to run under the assumption that I is unknown (which is the case for our scanning scenario), the condition of Eq. (4) can be approximated by the condition $\|u_k\| \|v_k\| < \varepsilon \|\tilde{I}\|$. The algorithm operates by scanning a row, followed by a column of the image at each iteration, and progressively builds up a low rank estimate of the image based on the rows and columns that have been scanned. This is a concise description of the algorithm:
1. Initialise the image approximation as $\tilde{I} = 0$ and the iteration count as k = 1; arbitrarily choose a row as the first row.
2. Scan the k-th row, and find the error of the previous estimate at the k-th row, $r = I(k\text{-th row}) - \tilde{I}(k\text{-th row})$. Choose the k-th column to be the one containing the maximum element of |r|. Let a be the value of the element in the k-th row and k-th column of I.
3. Assign $v_k = \frac{1}{a}\, r$.
4. Scan the k-th column and find the error of the previous estimate at this column, $c = I(k\text{-th column}) - \tilde{I}(k\text{-th column})$. Choose the (k + 1)-th row to be the one containing the maximum element of |c|.
5. Assign $u_k = c$.
6. Update the image estimate $\tilde{I} = \tilde{I} + u_k v_k^T$.
7. If $\|u_k\| \|v_k\| < \varepsilon \|\tilde{I}\|$ or $k = k_{max}$, stop scanning, else increment k and repeat steps 2 to 7.

In this way, a lower rank estimate of the scene to be imaged is obtained by collecting only (at most) $k_{max}(m + n)$ pixels rather than $m \cdot n$ pixels, as would have been collected if the entire scene were raster scanned. The adaptive cross
Fig. 6. Left to right: original and ACA reconstructions of ranks 64, 40 and 20
correlation algorithm is interpolatory, in the sense that the approximation reproduces exactly the image at all points of the lines and columns that have been scanned. The overall approximation of the image I after the scanning of k lines with indexes i1 , ..., ik and k columns with indexes j1 , ..., jk is given by [9]: I˜k (i, j) = I(i, [j]k )Mk−1 I([i]k , j)
(5)
where I(i, [j]k ) = [I(i, j1 ), ..., I(i, jk )]T , I([i]k , j) = [I(i1 , j), ..., I(ik , j)]T and Mk is the k × k matrix (I(is , jr )), 1 ≤ s, r ≤ k. The effect of the ACA algorithm on the same 128 × 128 test image, with reconstructions of ranks 64, 40 and 20, is shown in Fig. (6).
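A compact sketch of the scanning loop of steps 1–7 is given below; scan_row and scan_col are hypothetical callbacks standing in for physical line scans, and the pivot normalization follows the usual ACA convention (the residual value at the pivot), which coincides with the element of I named in step 2 at the first iteration.

```python
import numpy as np

def aca_scan(scan_row, scan_col, shape, k_max, eps=0.02):
    """Adaptive cross approximation driven by selective row/column scans."""
    m, n = shape
    I_hat = np.zeros((m, n))
    i = 0                                    # arbitrary first row
    for _ in range(k_max):
        r = scan_row(i) - I_hat[i, :]        # step 2: row residual
        j = int(np.argmax(np.abs(r)))
        a = r[j]
        if a == 0:
            break
        v = r / a                            # step 3
        c = scan_col(j) - I_hat[:, j]        # step 4: column residual
        i = int(np.argmax(np.abs(c)))
        u = c                                # step 5
        I_hat += np.outer(u, v)              # step 6: rank-one update
        if np.linalg.norm(u) * np.linalg.norm(v) < eps * np.linalg.norm(I_hat):
            break                            # step 7: stopping test
    return I_hat

# Toy usage on a known image (the "scans" simply read from it)
img = np.outer(np.linspace(0, 1, 64), np.linspace(1, 2, 64)) + 0.1
approx = aca_scan(lambda i: img[i, :].copy(), lambda j: img[:, j].copy(),
                  img.shape, k_max=5)
```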
4 Adaptive Low-Rank Scanning
We can combine the effects of SVD and ACA based scanning methods by firstly acquiring a prediction from a subsampled version of the image, as described in
section 2, and then letting the ACA algorithm, described in section 3 run a few more additional iterations on the difference image between the scene to be imaged and its prediction. This is based on the fact that the ACA algorithm only needs the currently scanned line or column, and that from it and the already acquired prediction image, the line or column of the difference image can be computed. The results of our proposed technique, applied to the previous test
Fig. 7. Top row: original image, its SVD prediction, reconstruction over 37% of the image pixels, using SVD prediction followed by additional ACA iterations over difference to prediction, and reconstruction of equivalent scanning fidelity using only ACA-type scanning. The rms error values for the last two reconstructions in the top row are 7.69 and 9.21 respectively. Bottom row: difference image to prediction and the approximation of this difference, obtained with ACA.
image, with a scan over slightly more than one third of the image, are presented in Fig. (7). For comparison, a reconstruction of equivalent scanning fidelity using only the ACA technique is also presented. Both the rms error values and a visual inspection indicate the better performance of the combined SVD and ACA technique. The difference image to the SVD prediction, and its partial reconstruction with ACA, displayed in the bottom row the figure, are indicative for both the strength and the possible shortcomings of the ACA technique. The localised strong error around the eyes and strong edge around the hat, is picked up and cancelled, but the error along the weaker edges around the nose and lips is not accounted for. Also, the typical interpolation artifacts of ACA are still noticeable in the third image of the top row of Fig. (7), even though less obvious than in the last image of the same row. The microwave scanning system was used to obtain the 96 × 96 images of a teflon pyramid. In the top row of Fig. (8) are displayed the results of the full scan, followed by the SVD + ACA based, and pure ACA based scanning techniques, on 16% of the image pixels. The rms error values for the last two
Fig. 8. Top row: microwave original teflon pyramid image, its reconstruction from 16% image scanning, with SVD prediction followed by additional ACA iterations over difference image , and corresponding reconstructed image using ACA only, for equivalent scanning. Bottom row: difference image to prediction and the approximation of this difference, obtained from 5 ACA iterations.
approximations shown in the first row are 2.91 and 5.28 respectively. Both scanning techniques give remarkably good results, and again the combined SVD + ACA technique outperforms the pure ACA technique. Because the edges in this image have mostly horizontal and vertical orientations, the ACA technique reconstructs the images accurately with few iterations. The difference image to the SVD prediction is well reconstructed with only 5 additional ACA iterations, as shown by the images in the bottom row of this figure. In Fig. (9) is pictured an image scanned with out mm-wave system described in Fig. (1). Both the SVD prediction followed by ACA and the pure ACA scanning perform well in identifying the concealed weapon, with the first technique being marginally better in reconstructing the human body contour and a specular return on the right side of the knife. The leaf image from Fig. (10) was captured using the terahertz imaging system shown in Fig. (2). The image was obtained through two pieces of cloth and a cardboard screen. The edges of the cloth are clearly visible in the image. This is a very difficult image to approximate by a low rank representation, because it contains a lot of detail information, both in high intensity and low intensity regions, and has edges at various orientations. We notice that for a scan over 50% of the image pixels, again the proposed method of SVD prediction followed by ACA outperforms a pure ACA method. The bottom row reveals a good reconstruction of most strong edges by the ACA technique on the difference image, and even a good reconstruction of the weak middle horizontal edge; however, the poor reconstruction of the weak diagonal edges, as well as a few interpolation artifacts can also be noted.
Fig. 9. Top row: mm-wave original image of a person with concealed weapon, its reconstruction from 35% image scanning, using SVD prediction followed by additional ACA iterations over difference image , and corresponding reconstructed image using ACA only, with equivalent scanning. The rms error values for the two approximations are 6.95 and 7.62. Bottom row: difference image to prediction and the approximation of this difference, obtained with 7 ACA iterations.
Fig. 10. Top row: original leaf image, its SVD prediction, reconstruction using only ACA-type scanning over 50% of the image, and reconstruction of equivalent scanning using SVD prediction followed by additional ACA iterations over difference to prediction. The rms error values for the two reconstructions in the top row are 7.90 and 8.94 respectively. Bottom row: difference image to prediction and the approximation of this difference, obtained with ACA.
5 Conclusion
We have presented a procedure for real-time image acquisition, based on the single-pixel scanning device paradigm. Our procedure combines a low-resolution scanning step, followed by an image prediction based on smooth interpolation of singular triplets, with an adaptive cross approximation procedure on the difference image. The paradigm we have presented shares similarities with both the concept of image compression and compressive sensing [11], in the sense that only partial information from an image is used to obtain an image representation. It also differs substantially from both those concepts, because due to time constraints, the whole of the image is never available, and has to be predicted from partially scanned data. Optimising with partial information is substantially more difficult than optimising with complete information, and therefore it is not surprising that the scanning speedup factors are not very high. We have exemplified our proposed procedure with results obtained from our experimental imaging systems, operating in the microwave, mm-wave and terahertz regions of the electromagnetic spectrum. Our experiments indicate that image reconstructions of acceptable quality can be acquired by reducing the scanning time by factors between 2 and 10.
Acknowledgement Several people have assisted in taking measurements and supplying data used in this paper. In particular the authors would like to acknowledge Ken Smart who helped with microwave measurements, Michael Brothers and Greg Timms who acquired the mm-wave images, and Li Li for supplying the THz system data.
References 1. Dickinson, J.C. et al.: Terahertz Imaging of Subjects with Concealed Weapons. In: Proceedings of SPIE, vol. 6212 (2006) 2. Kemp, M.C., et al.: Security Applications of Terahertz Technology. In: Proceedings of SPIE, vol. 5070 (2003) 3. Woodward, R.M., et al.: Terahertz Pulse Imaging of ex vivo Basal Cell Carcinoma. Journal Invest Dermatol 120, 72–78 (2003) 4. Barker, S., et al.: The development of an inexpensive high-precision mm-wave compact antenna test range. In: Proceedings of AMTA, Newport, Rhode Island, USA October, pp. 337–340 (2005) 5. Brothers, M., et al.: A 190 GHz active millimetre-wave imager. SPIE Passive Millimeter Wave Imaging X, April 9-13, 2007 Orlando, Florida (2007) 6. Hellicar, A.D., et al.: Development of a terahertz imaging system. In: IEEE Antennas and Propagation Society International Symposium, Honolulu, USA (June 2007) 7. Mirsky, L.: Symmetric gauge functions and unitarily invariant norms. Quart. J. Math. Oxford Ser. 11(2), 50–59 (1960)
8. Dyn, N., Gregory, J.A., Levin, D.: A four-point interpolatory subdivision scheme for curve design. Computer Aided Design 4, 257–268 (1987) 9. Bebendorf, M.: Approximation of boundary element matrices. Numer. Math. 86, 565–589 (2000) 10. Hislop, G., Hay, S.: Adaptive Electromagnetic Imaging Australian Symposium on Antennas, Sydney, Australia (February 2007) 11. Pitsianis, et al.: Compressive Imaging Sensors. Proc. of SPIE 6263, 1–9 (2006)
A Soft-Switching Approach to Improve Visual Quality of Colour Image Smoothing Filters Samuel Morillas1, , Stefan Schulte2 , Tom M´elange2 , Etienne E. Kerre2, , and Valent´ın Gregori1, 1
Technical University of Valencia, Department of Applied Mathematics, E.P.S. de Gandia, Carretera Nazaret-Oliva s/n, 46730 Grao de Gandia, Spain 2 Ghent University, Department of Applied Mathematics and Computer Science, Krijgslaan 281 - S9, 9000 Gent, Belgium
Abstract. Many filtering methods for Gaussian noise smoothing in colour images have been proposed. The common objective of these methods is to smooth out the noise while preserving the edges and details of the image. However, it can be observed that these methods, in their effort to preserve the image structures, also generate artefacts in homogeneous regions that are actually due to noise. So, these methods can perform well in image edges and details but sometimes they do not achieve the desired smoothing in homogeneous regions. In this paper we propose a method to overcome this problem. We use fuzzy concepts to build a soft-switching technique between two Gaussian noise filters: (i) a filter able to smooth out the noise near edges and fine features while properly preserving those details and (ii) a filter able to achieve the desired smoothing in homogeneous regions. Experimental results are provided to show the performance achieved by the proposed solution.
1 Introduction Any image is systematically affected by the introduction of noise during its acquisition and transmission process. A fundamental problem in image processing is to effectively suppress noise while keeping intact the features of the image. Two noise models can adequately represent most noise corrupting images: additive Gaussian noise and impulsive noise [7]. Additive Gaussian noise, which is usually introduced during the acquisition process, is characterized by adding a random value from a zero-mean Gaussian distribution to each image pixel channel where the variance of this distribution determines the intensity of the corrupting noise. An advantage of such a noise type is that its zero-mean property allows to remove it by locally averaging pixel channel values. Ideally, removing Gaussian noise would involve to smooth the different areas of an image without degrading neither the sharpness of their edges nor their details. Classical
The author acknowledges the support of Spanish Ministry of Education and Science under program “Becas de Formaci´on de Profesorado Universitario FPU”. S. Schulte, T. M´elange and E.E. Kerre acknowledge the support of Ghent University under the GOA-project 12.0515.03. Valent´ın Gregori acknowledges the support of Spanish Ministry of Education and Science under grant MTM 2006-14925-C02-01.
Fig. 1. Artefacts generated by Gaussian smoothing filters: (a) Lena image with σ = 20 Gaussian noise, (b) FNRM output, (c) Baboon image with σ = 30 Gaussian noise, (d) FNRM output
linear filters, such as the Arithmetic Mean Filter (AMF) or the Gaussian Filter [7], smooth noise but blur edges significantly. Recently, many nonlinear methods have been proposed to approach this problem, for instance: the bilateral filter [1,2,10], the peer group filter [3], the anisotropic diffusion [6], the chromatic filter [4], the fuzzy vector smoothing filter [9], the GOA filter [11], the fuzzy bilateral filter [5] or the fuzzy noise reduction method (FNRM) [8]. The aim of these methods is to detect edges and details by means of local statistics and smooth them less than the rest of the image to better preserve their sharpness. However, in homogeneous regions and because of the effort done by these methods to preserve the image structures, they use to generate artefacts that are actually due to noise. Figure 1 shows some images filtered using the recent FNRM [8]. The commented effect can be seen for instance in Lena’s face (Figure 1 (b)) or Baboon’s nose (Figure 1 (d)). The same effect can also be observed for other state-of-the-art methods [1]-[6] and [9]-[11]. So, these methods can perform quite well in image edges and details but sometimes they do not achieve the desired smoothing in homogeneous regions. In this paper, we propose a soft-switching approach intended to solve this problem. The proposed filter will softly switch from the FNRM to the AMF, that provides the maximum smoothing capability, when the pixel under processing is estimated to be in a flat area of the image. Otherwise, in areas near edges or details the filter will perform the FNRM operation. We have chosen the FNRM filter because it is a recent filter that provides excellent results nevertheless, note that an analogous filter design could be built using any other of the above listed Gaussian noise smoothing methods. The paper is arranged as follows. First in Section 2 we present a simple fuzzy method to distinguish between edges and flat areas of an image and then the proposed filtering method is described. Experimental results and discussions are presented in Section 3 and conclusions are given in Section 4.
2 Proposed Soft-Switching Method

In order to build the desired soft-switching method, first we aim at distinguishing for each image pixel whether it is close to an edge or not. For this, we compute the maximum observed Euclidean distance between the colour vectors of two adjacent pixels
Fig. 2. Illustration of S-membership function behaviour with α = 0.1 and γ = 0.25
in the image F and we denote it by M. Then, for each image pixel $F_j$ we compute the index $I_j$ as

$$I_j = \frac{m_j}{M}, \tag{1}$$

where $m_j$ denotes the maximum observed Euclidean distance between the colour vector of the pixel $F_j$ and the colour vectors of the pixels in its 3×3 neighbourhood. It is clearly seen that the value of the index $I_j$ is higher when the pixel is close to an edge (see for instance pixels around Baboon's eye in Figure 3 (b) or pixels in the edges of Lena's hat or eyes in Figure 3 (e)). Then, a degree $\mu_j$ that represents the certainty of the pixel j being in an edge is computed using the S-membership function as $\mu_j = S(I_j)$. The S-membership function, whose graph is shown in Figure 2, is given by

$$S(x) = \begin{cases} 0 & \text{if } x \leq \alpha \\ 2\left(\dfrac{x-\alpha}{\gamma-\alpha}\right)^2 & \text{if } \alpha < x \leq \dfrac{\alpha+\gamma}{2} \\ 1 - 2\left(\dfrac{x-\gamma}{\gamma-\alpha}\right)^2 & \text{if } \dfrac{\alpha+\gamma}{2} < x < \gamma \\ 1 & \text{if } x \geq \gamma \end{cases} \tag{2}$$

where we have experimentally found that α = 0.1 and γ = 0.25 give satisfying results. Figure 3 shows two samples of the computation of $I_j$ and $\mu_j$. It can be seen that this procedure is not a very accurate edge detector; however, it draws a distinction between homogeneous and edge regions, which is sufficient for our purpose. If we denote by $FNRM_{out}$ and by $AMF_{out}$ the outputs of the FNRM, computed as described in [8], and of the AMF filter, calculated as the average of the colour vectors in a filtering window, respectively, the output of the proposed soft-switching filter ($SSF_{out}$) is defined as

$$SSF_{out} = \mu_j\, FNRM_{out} + (1 - \mu_j)\, AMF_{out}. \tag{3}$$

It can be seen that the certainty degree $\mu_j$ controls the soft-switching between FNRM and AMF. Indeed, when $\mu_j \to 1$, SSF approaches the FNRM and when $\mu_j \to 0$, SSF behaves as the AMF. Any value in between implies that the output is computed by appropriately weighting the outputs of FNRM and AMF.
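A minimal NumPy sketch of Eqs. (1)–(3); the FNRM output is taken as a given input (its computation follows [8] and is not reproduced here), M is taken as the maximum of the $m_j$ values, and the 3×3 window is handled with edge padding.

```python
import numpy as np

def s_membership(x, alpha=0.1, gamma=0.25):
    """Standard S-membership function of Eq. (2)."""
    mid = (alpha + gamma) / 2.0
    y = np.zeros_like(x, dtype=float)
    up = (x > alpha) & (x <= mid)
    down = (x > mid) & (x < gamma)
    y[up] = 2.0 * ((x[up] - alpha) / (gamma - alpha)) ** 2
    y[down] = 1.0 - 2.0 * ((x[down] - gamma) / (gamma - alpha)) ** 2
    y[x >= gamma] = 1.0
    return y

def soft_switch(noisy, fnrm_out, alpha=0.1, gamma=0.25):
    """I_j (Eq. (1)), mu_j = S(I_j), and the blend of Eq. (3) with a 3x3 AMF."""
    H, W, _ = noisy.shape
    pad = np.pad(noisy.astype(float), ((1, 1), (1, 1), (0, 0)), mode="edge")
    m = np.zeros((H, W))
    amf = np.zeros_like(noisy, dtype=float)
    for dy in range(3):
        for dx in range(3):
            shifted = pad[dy:dy + H, dx:dx + W]
            amf += shifted
            m = np.maximum(m, np.linalg.norm(shifted - noisy, axis=2))
    amf /= 9.0
    mu = s_membership(m / (m.max() + 1e-12), alpha, gamma)   # I_j = m_j / M
    return mu[..., None] * fnrm_out + (1 - mu[..., None]) * amf
```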
Fig. 3. Classification in homogeneous/edge regions: (a) Detail of Baboon image with σ = 15 Gaussian noise, (b) Ij indexes computed for (a), (c) μj degrees for (b), (d) Detail of Lena image with σ = 20 Gaussian noise, (e) Ij indexes computed for (d), (f) μj degrees for (e)
3 Experimental Results In order to experimentally assess the proposed filter, we have taken the well-known images Parrots, Peppers, Baboon and Lena and we have used the classical white Gaussian noise model [7] to corrupt them with different densities of noise. Then, we apply iteratively both the FNRM and SSF filters using a 3 × 3 filter window. We compare the obtained results both visually and in terms of the PSNR objective quality measure, which is defined as follows [7]:

PSNR = 20 \log_{10}\!\left( \frac{255}{\sqrt{\frac{1}{NMQ}\sum_{i=1}^{N \cdot M}\sum_{q=1}^{Q}\left(F_i^q - \hat{F}_i^q\right)^2}} \right),   (4)
where M , N are the image dimensions, Q is the number of channels of the image (Q = 3 for RGB colour images), and Fiq and Fˆiq denote the q th component of the original image vector and the filtered image, at pixel position i, respectively. Figure 4 shows the PSNR performance of FNRM and SSF against varying the standard deviation of the noise σ for different test images. Figures 5-8 show some images
Fig. 4. Performance of FNRM (solid) and SSF (dashed) in terms of PSNR as a function of the standard deviation σ of Gaussian noise using the images Lena (black), Peppers (dark gray), Parrots (gray), and Baboon (light gray)
Fig. 5. Performance evaluation: (a) Detail of Lena image with σ = 20 Gaussian noise (PSNR = 22.54), and outputs of (b) AMF (PSNR = 24.75), (c) FNRM (PSNR = 28.34), and (d) SSF (PSNR = 28.53)
filtered using AMF, FNRM and SSF, together with their respective PSNR values. The results in Figure 4 indicate that SSF can only marginally outperform FNRM for high noise densities. However, by analyzing the outputs in Figures 5-8 it is clearly seen that SSF outputs are more visually pleasing than FNRM outputs. Indeed, the following observations can be pointed out: 1. SSF can preserve sharp edges in the image (see for instance Lena's hat and face in Figure 5 (d), the Parrot's beak in Figure 6 (d), or Baboon's eye in Figure 7 (d)). 2. Also, SSF achieves an appropriate smoothing in flat regions, which constitutes a clear improvement over FNRM (see for instance Lena's face in Figure 5 (c)-(d), white and black regions in the Parrot's beak in Figure 6 (c)-(d), yellow region in
Fig. 6. Performance evaluation: (a) Detail of Parrots image with σ = 25 Gaussian noise (PSNR = 20.68), and outputs of (b) AMF (PSNR = 21.63), (c) FNRM (PSNR = 25.73), and (d) SSF (PSNR = 25.15)
Fig. 7. Performance evaluation: (a) Detail of Baboon image with σ = 30 Gaussian noise (PSNR = 19.13), and outputs of (b) AMF (PSNR = 21.75), (c) FNRM (PSNR = 24.33), and (d) SSF (PSNR = 24.26)
Fig. 8. Performance evaluation: (a) Detail of Peppers image with σ = 40 Gaussian noise (PSNR = 16.78), and outputs of (b) AMF (PSNR = 22.55), (c) FNRM (PSNR = 24.63), and (d) SSF (PSNR = 24.79)
Baboon's eye or blue and red regions in Baboon's nose in Figure 7 (c)-(d), or red and green regions in the Peppers image in Figure 8 (c)-(d)). 3. However, SSF also introduces some blurring of less sharp image edges and of some image textures (see for instance the blurred texture over Baboon's eye in Figure 7 (d) or the hairs of the hat in the Lena image in Figure 5 (d)). As a consequence, we can state that SSF presents some improvements over FNRM, above all from the visual point of view. Nevertheless, some research issues could be addressed in the future in order to improve SSF performance. First, the edge/flat
region fuzzy classification could be improved. Second, some mechanism to detect texture could be introduced in order to avoid the drawback mentioned in the third point above. Also, it should be noted that there is a clear disagreement between the PSNR performance evaluation and the visual performance evaluation. This points to the need for research on new objective quality measures that better match human visual perception.
4 Conclusions In this paper, a new Gaussian noise smoothing method for colour images has been introduced. Firstly, the method applies a fuzzy procedure to classify edge/flat regions of the image. Then, on the basis of this fuzzy classification, a soft switching between a filter appropriate for smoothing image edges (FNRM) and a method appropriate for smoothing flat regions (AMF) is performed. The proposed method is able to smooth Gaussian noise and preserve sharp image edges. In addition, the smoothing capability in flat regions is increased with respect to state-of-the-art methods and the generated images are more visually pleasing, which constitutes the improvement achieved by the proposed method. However, the method may also introduce some blurring in textured image regions or on soft edges. Therefore, further research is still to be done in order to improve the proposed filter. Also, it has been observed that there is a lack of agreement between PSNR evaluation and visual observation, which should encourage research on objective similarity measures for colour images.
References 1. Elad, M.: On the origin of bilateral filter and ways to improve it. IEEE Transactions on Image Processing 11(10), 1141–1151 (2002) 2. Garnett, R., Huegerich, T., Chui, C., He, W.: A universal noise removal algorithm with an impulse detector. IEEE Transactions on Image Processing 14(11), 1747–1754 (2005) 3. Kenney, C., Deng, Y., Manjunath, B.S., Hewer, G.: Peer group image enhancement. IEEE Transactions on Image Processing 10(2), 326–334 (2001) 4. Lucchese, L., Mitra, S.K.: A new class of chromatic filters for color image processing: theory and applications. IEEE Transactions on Image Processing 13(4), 534–548 (2004) 5. Morillas, S., Gregori, V., Sapena, A.: Fuzzy Bilateral Filtering for color images. In: Campilho, A., Kamel, M. (eds.) ICIAR 2006. LNCS, vol. 4141, pp. 138–145. Springer, Heidelberg (2006) 6. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(5), 629–639 (1990) 7. Plataniotis, K.N., Venetsanopoulos, A.N.: Color Image processing and applications. Springer, Berlin (2000) 8. Schulte, S., De Witte, V., Kerre, E.E.: A fuzzy noise reduction method for colour images. IEEE Transactions on Image Processing 16(5), 1425–1436 (2007)
9. Shen, Y., Barner, K.: Fuzzy vector median-based surface smoothing. IEEE Transactions on Visualization and Computer Graphics 10(3), 252–265 (2004) 10. Tomasi, C., Manduchi, R.: Bilateral filter for gray and color images. In: Proc. IEEE International Conference Computer Vision, pp. 839–846 (1998) 11. Van de Ville, D., Nachtegael, M., Van der Weken, D., Philips, W., Lemahieu, I., Kerre, E.E.: Noise reduction by fuzzy image filtering. IEEE Transactions on Image Processing 11(4), 429–436 (2001)
Comparison of Image Conversions Between Square Structure and Hexagonal Structure Xiangjian He1, Jianmin Li2, and Tom Hintz1 1
Computer Vision Research Group, University of Technology, Sydney, Australia {sean,hintz}@it.uts.edu.au 2 School of Computer and Mathematics, Fuzhou University, Fujian, 320002, China
[email protected]
Abstract. Hexagonal image structure is a relatively new and powerful approach to intelligent vision systems. The geometrical arrangement of pixels in this structure can be described as a collection of hexagonal pixels. However, all existing hardware for capturing and displaying images is based on a rectangular architecture. Therefore, it becomes important to find a proper software approach to mimic the hexagonal structure so that images represented on the traditional square structure can be smoothly converted to or from images on the hexagonal structure. For accurate image processing, it is critical to maintain the image resolution as well as possible after image conversion. In this paper, we present various algorithms for image conversion between the two image structures. The performance of these algorithms is compared through experimental results.
1 Introduction The advantages of using a hexagonal grid to represent digital images have been investigated for more than thirty years [1-5]. The importance of the hexagonal representation is that it possesses special computational features that are pertinent to the vision process [4]. Its computational power for intelligent vision has pushed forward the research in areas of image processing and computer vision. The hexagonal image structure has the features of a higher degree of circular symmetry, uniform connectivity, greater angular resolution, and a reduced need for storage and computation in image processing operations [6-7]. In spite of its numerous advantages, a problem that limits the use of the hexagonal image structure is the lack of hardware for capturing and displaying hexagonal-based images. In the past years, there have been various attempts to simulate a hexagonal grid on a regular rectangular grid device. The simulation schemes include approaches using rectangular pixels [1-2], pseudo hexagonal pixels [3], mimic hexagonal pixels [4] and virtual hexagonal pixels [5,8]. The use of these techniques provides a practical tool for image processing on a hexagonal structure and makes it possible to carry out research based on a hexagonal structure using existing computer vision and graphics systems. The new simulation scheme presented in [8] was developed to virtually mimic a special hexagonal structure, called Spiral Architecture (SA) [4]. In this scheme, each of the original square pixels and simulated hexagonal pixels is regarded as a
collection of smaller components, called sub-pixels. The light intensities of all sub-pixels constituting a square (or hexagonal) pixel are assigned the same value as that of the square (or hexagonal) pixel in the square (or hexagonal) structure. This simple assignment method does not give a sufficiently accurate intensity interpolation of the sub-pixels, and hence results in some resolution loss when images are converted between the square structure and the hexagonal structure. Therefore, in order to take advantage of the hexagonal structure for image processing and to reduce the effect of conversion between the two image structures to a minimum, it is critical to find the best conversion method so that the image resolution is preserved as well as possible during the conversion process. In this paper, we present various schemes using different interpolation algorithms for image conversion between the square structure and the hexagonal structure. We will use experimental results to compare and analyze the performance of these methods for image conversion. The rest of this paper is organized as follows. In Section 2, we briefly review a software simulation of the hexagonal structure as shown in [8]. In Section 3, various conversion schemes are presented. The experimental results are demonstrated in Section 4. We conclude in Section 5.
2 Hexagonal Structure and Its Simulation A collection of 49 hexagonal pixels together with one-dimensional addressing scheme as shown in [10] is displayed in Figure 1.
Fig. 1. A hexagonal structure with its addressing scheme [8]
To construct hexagonal pixels, in [8], each square pixel was first separated into 7×7 small pixels, called sub-pixels. We assume that the centre of each square pixel is located at the middle sub-pixel of its total 7×7 sub-pixels. Each virtual hexagonal pixel was formed by 56 sub-pixels as shown in Figure 2. Figure 2 shows a collection of seven hexagonal pixels constructed with spiral addresses from 0 to 6. The collection of virtual pixels covering an image constitutes a virtual hexagonal structure.
Fig. 2. A cluster of seven hexagonal pixels
It is not difficult to locate each virtual hexagonal pixel when the size of an image is known. Let us assume that original images are represented on a square structure arranged as 2M rows and 2N columns, where M and N are two positive integers. Then the centre of the virtual hexagonal structure can be located at the middle of rows M and M+1, and at column N. Note that there are 14M rows and 14N columns in the (virtual square) structure consisting of sub-pixels. Thus, the first (or central) hexagonal pixel, with address 0 and consisting of 56 sub-pixels, has its centre located in the middle of rows 7M and 7M+1 and at column 7N of the virtual square structure. After the 56 sub-pixels for the first hexagonal pixel are allocated, all sub-pixels for all other hexagonal pixels can be assigned easily, as shown in [8].
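As a small worked example of this book-keeping (a hypothetical helper, not taken from [8]), the following sketch returns the sub-pixel grid size and the centre of the hexagonal pixel with address 0 for a 2M × 2N image:

```python
def central_hexagon_centre(M, N):
    """Sub-pixel grid size and centre of the hexagonal pixel with address 0
    for a 2M x 2N image split into 7x7 sub-pixels per square pixel."""
    grid = (14 * M, 14 * N)          # rows, columns of the sub-pixel grid
    centre = (7 * M + 0.5, 7 * N)    # midway between rows 7M and 7M+1, column 7N
    return grid, centre

# Example: a 512 x 512 image corresponds to M = N = 256
print(central_hexagon_centre(256, 256))   # ((3584, 3584), (1792.5, 1792))
```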
3 Conversion Between Square and Hexagonal Structures In this section, we present various interpolation algorithms used for converting images between square structure and hexagonal structure. 3.1 Conversion from Square Structure to Virtual Hexagonal Structure We present two schemes for converting images from square structure to the virtual hexagonal structure derived in the previous section.
A. Simple averaging approach
For simplicity, the light intensity of each sub-pixel separated from a square pixel as shown in Section 2 was set to the intensity value of the square pixel from which it was separated. The light intensity of each virtual hexagonal pixel was then approximated as the average of the light intensities of the 56 sub-pixels that constitute the hexagonal pixel. A hexagonal pixel at the image boundary may not have all of its 56 sub-pixels; in this case, the light intensity of this incomplete hexagonal pixel can be computed as the average of the intensities of all its available sub-pixels. In this way, the pixel (or intensity) values of all hexagonal pixels are computed and an image represented on the square structure is hence converted to an image on the virtual hexagonal structure.
B. Bilinear interpolation approach
We adopt the bilinear interpolation method that was originally proposed for image interpolation on the square structure. The detailed approach is as follows. For every sub-pixel, we can easily compute its coordinates in the two-dimensional coordinate system. Let us denote the location of an arbitrarily given sub-pixel by X. Then, there exist four square pixels (with their centres) located at A, B, C and D, as shown in Figure 3, lying on two consecutive rows and columns in the original square structure, such that point X falls inside the rectangle with vertices A, B, C and D.
Fig. 3. A sub-pixel X located on a rectangle formed from square pixels A, B, C and D
Let us denote the coordinates of A, B, C and X by (A_x, A_y), (B_x, B_y), (C_x, C_y) and (X_x, X_y), respectively. Let

\alpha = \frac{|A_x - X_x|}{|A_x - B_x|}, \qquad \beta = \frac{|A_y - X_y|}{|A_y - C_y|}.   (1)

Then, it is easy to derive that

X = (1-\alpha)(1-\beta)A + \alpha(1-\beta)B + (1-\alpha)\beta C + \alpha\beta D.   (2)
Let f be the image brightness function that maps a pixel (either square pixel or subpixel) to its light intensity value. Then the intensity value assigned to X using the bilinear interpolation method as shown in [9] is computed as
f(X) = (1-\alpha)(1-\beta) \cdot f(A) + \alpha(1-\beta) \cdot f(B) + (1-\alpha)\beta \cdot f(C) + \alpha\beta \cdot f(D).   (3)
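A minimal sketch of Eqs. (1)-(3) is given below (illustrative code, not the authors' implementation). It assumes that the sub-pixel location X and the four enclosing square-pixel centres A, B, C and D are expressed in a common coordinate system, and that their intensities are available in a lookup table f.

```python
import numpy as np

def bilinear_subpixel(X, A, B, C, D, f):
    """Bilinear interpolation of the intensity at sub-pixel location X.

    X, A, B, C, D: (x, y) coordinates; A-B span one row and A-C one column
    of the square grid, D is the fourth corner (cf. Fig. 3).
    f: dict mapping a corner to its intensity value (scalar or colour array).
    """
    alpha = abs(A[0] - X[0]) / abs(A[0] - B[0])   # Eq. (1)
    beta = abs(A[1] - X[1]) / abs(A[1] - C[1])
    return ((1 - alpha) * (1 - beta) * np.asarray(f[A]) +
            alpha * (1 - beta) * np.asarray(f[B]) +
            (1 - alpha) * beta * np.asarray(f[C]) +
            alpha * beta * np.asarray(f[D]))       # Eq. (3)

# Example on a unit square with intensities 10, 20, 30, 40 at A, B, C, D
A, B, C, D = (0, 0), (1, 0), (0, 1), (1, 1)
f = {A: 10.0, B: 20.0, C: 30.0, D: 40.0}
print(bilinear_subpixel((0.25, 0.5), A, B, C, D, f))  # 22.5
```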
After the intensity values of all sub-pixels have been computed, we use the following two methods to compute the intensity values of all virtual hexagonal pixels. Method 1 is to approximate the light intensity of any given virtual hexagonal pixel as the average of the light intensities of the 56 sub-pixels that constitute the hexagonal pixel. For a hexagonal pixel at the image boundary, the light intensity of the hexagonal pixel can be computed as the average of the intensities of all its available sub-pixels. Method 2 is to approximate the light intensity of any given virtual hexagonal pixel by the light intensity of the one of its 56 sub-pixels that is located at the fourth row and the middle column of these 56 sub-pixels. This sub-pixel, which lies close to the centre of the hexagonal pixel, is called the reference sub-pixel of the corresponding virtual hexagonal pixel, like the sub-pixels Pi (i = 0, 1, 2, ..., 6) shown in Figure 4.
Fig. 4. Reference sub-pixels of virtual hexagonal pixels
3.2 Conversion from Virtual Hexagonal Structure to Square Structure
Similarly to the previous sub-section, we use two different approaches to convert images from the hexagonal structure to the square structure.
A. Simple averaging approach
Converting images from the virtual structure to the square structure can be simply performed as follows. All of the 56 sub-pixels constituting each individual hexagonal pixel are assigned the same intensity value as that of the hexagonal pixel. After this step, all sub-pixels in the
virtual square structure consisting of all sub-pixels as obtained in Section 2 have been re-assigned intensity values, which may be different from the original ones. The intensity value of each square pixel can then be computed using the following two methods. Method 1 is to approximate the light intensity of any given square pixel as the average of the 7×7 sub-pixels that form the square pixel, as shown in [11]. Method 2 is to approximate the light intensity of any given square pixel by the light intensity of the sub-pixel that is located at the centre of the 7×7 sub-pixels that were separated from the square pixel.
B. Tri-linear interpolation approach
As shown in Figure 4, each given sub-pixel is located in a triangle formed by the three reference sub-pixels of three mutually connected virtual hexagonal pixels. Let (x, y) be the coordinates of the given sub-pixel, and (x1, y1), (x2, y2) and (x3, y3) be the coordinates of the three reference sub-pixels, respectively. Let
A = \begin{bmatrix} 1 & 1 & 1 \\ x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \end{bmatrix}, \quad B = \begin{bmatrix} y_2 - y_3 & x_3 - x_2 \\ y_3 - y_1 & x_1 - x_3 \end{bmatrix}, \quad C = \begin{bmatrix} x - x_3 \\ y - y_3 \end{bmatrix}; \qquad \begin{bmatrix} k_1 \\ k_2 \end{bmatrix} = \frac{1}{|A|} B\,C; \qquad k_1 + k_2 + k_3 = 1.   (4)

Let φ represent the intensity value of the given sub-pixel, and φ_1, φ_2 and φ_3 be the intensities of the three reference sub-pixels, respectively. Then φ is computed from φ_1, φ_2 and φ_3 as

φ = k_1 φ_1 + k_2 φ_2 + k_3 φ_3.   (5)
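The weights of Eq. (4) can be evaluated directly, as in the following sketch (illustrative only; function and variable names are our own):

```python
import numpy as np

def trilinear_subpixel(p, p1, p2, p3, phi1, phi2, phi3):
    """Interpolate the intensity at point p inside the triangle (p1, p2, p3)
    of reference sub-pixels with intensities phi1, phi2, phi3 (Eqs. (4)-(5))."""
    x, y = p
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    A = np.array([[1.0, 1.0, 1.0],
                  [x1, x2, x3],
                  [y1, y2, y3]])
    B = np.array([[y2 - y3, x3 - x2],
                  [y3 - y1, x1 - x3]])
    C = np.array([x - x3, y - y3])
    k1, k2 = (B @ C) / np.linalg.det(A)   # Eq. (4): [k1, k2] = B C / |A|
    k3 = 1.0 - k1 - k2
    return k1 * phi1 + k2 * phi2 + k3 * phi3   # Eq. (5)

# Example: the centroid of a triangle receives the average of the three values
print(trilinear_subpixel((1.0, 1.0), (0, 0), (3, 0), (0, 3), 30.0, 60.0, 90.0))
# 60.0
```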
Thereafter, all sub-pixels in the virtual square structure have been re-assigned intensity values, which may be different from the original ones. The intensity value of each square pixel can then be computed using the following two methods. Method 1 is to approximate the light intensity of any given square pixel as the average of the light intensities of the 7×7 sub-pixels that were separated from the square pixel as shown in Section 2. Method 2 is to approximate the light intensity of any given square pixel by the light intensity of the sub-pixel that is located at the centre of these 7×7 sub-pixels that were separated from the square pixel.
4 Experimental Results To assess the various methods described in Section 3, we use two images commonly used in image processing and three figures of merit, namely PSNR (Peak Signal-to-Noise
Fig. 5. Original Lena (left) and Mary (right) images
Ratio), RMSE (Root Mean Square Error) and MAXE (Maximum Error). The two images used are the Lena and Mary images shown in Figure 5. The formulae used for the computation of PSNR, RMSE and MAXE are given by

PSNR = 10 \log_{10} \frac{255^2 \times M \times N}{\sum_{i=0}^{M-1}\sum_{j=0}^{N-1}[f(i,j)-g(i,j)]^2}, \qquad RMSE = \sqrt{\frac{\sum_{i=0}^{M-1}\sum_{j=0}^{N-1}[f(i,j)-g(i,j)]^2}{M \times N}}, \qquad MAXE = \max_{i}\max_{j} |f(i,j)-g(i,j)|,   (6)

where M×N is the image size, f(i, j) represents the original intensity value of the pixel at location (i, j), and g(i, j) represents the re-assigned intensity value of the pixel at location (i, j) after an interpolation algorithm. The bigger the PSNR is, the closer the match between the original and the modified images. Similarly, the smaller the RMSE or MAXE is, the better the match between the two images. Six different approaches are applied, and the re-produced images of Lena after image conversion from the square structure (SQ) to the hexagonal structure (HS) and then back to SQ are shown in Figure 6. The top two images in Figure 6 use the simple averaging approach for conversion from SQ to HS. The left image uses the simple approach with Method 1 for conversion from HS to SQ, while the right image uses the simple approach with Method 2 from HS to SQ.
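The three figures of merit of Eq. (6) can be computed straightforwardly, as in the sketch below (our own illustrative code for grayscale images; it assumes f and g differ in at least one pixel so that the PSNR is finite):

```python
import numpy as np

def quality_metrics(f, g):
    """PSNR, RMSE and MAXE of Eq. (6) for two M x N grayscale images
    with values in [0, 255]."""
    diff = f.astype(np.float64) - g.astype(np.float64)
    sse = np.sum(diff ** 2)                      # sum of squared errors
    psnr = 10.0 * np.log10(255.0 ** 2 * diff.size / sse)
    rmse = np.sqrt(sse / diff.size)
    maxe = np.max(np.abs(diff))
    return psnr, rmse, maxe
```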
Fig. 6. Re-produced Lena images. Top left: simple method from SQ to HS and simple Method 1 from HS to SQ; Top right: simple method from SQ to HS and simple Method 2 from HS to SQ; Middle left: bilinear Method 1 from SQ to HS and simple Method 1 from HS to SQ; Middle right: bilinear Method 1 from SQ to HS and simple Method 2 from HS to SQ; Bottom left: bilinear Method 2 from SQ to HS and tri-interpolation Method 1 from HS to SQ; Bottom right: bilinear Method 2 from SQ to HS and tri-interpolation Method 2 from HS to SQ.
The middle two images in Figure 6 use the bilinear interpolation approach with Method 1 for conversion from SQ to HS. The left image uses the simple approach with Method 1 for conversion from HS to SQ, while the right image uses the simple approach with Method 2 from HS to SQ. The bottom two images in Figure 6 use the bilinear interpolation approach with Method 2 for conversion from SQ to HS. The left image uses the tri-linear interpolation approach with Method 1 for conversion from HS to SQ, while the right image uses the tri-linear interpolation approach with Method 2 from HS to SQ. The index values of all six experiments shown in Figure 6 corresponding to PSNR, RMSE and MAXE are shown in Table 1. Table 1. Comparison of six approaches for image conversion on Lena image
From the first four rows of Table 1, one will find that the PSNRs provided by the simple averaging approach with Method 1 for conversion from HS to SQ are about 10% higher than those of the simple averaging approach with Method 2, regardless of whether the simple or the bilinear approach is used for conversion from SQ to HS. Meanwhile, the RMSE and
MAXE values are lower. This indicates that the simple approach with Method 1 gives a more accurate result with less resolution loss in this case. This can also be seen visually in Figure 6, where some jig-saw shapes are clearly visible in the top-right and middle-right images. One will also find that when the simple approach is used for conversion from HS to SQ, the bilinear approach for conversion from SQ to HS degrades the quality slightly. Therefore, the simple averaging approach for conversion from SQ to HS best matches the simple averaging approach (Method 1) for conversion from HS to SQ. However, when a bilinear interpolation approach is applied for conversion from SQ to HS, tri-linear interpolation (Method 2) performs the best for conversion from HS back to SQ. This can be seen in the last four rows of Table 1. Overall, the bilinear interpolation approach (Method 2) for conversion from SQ to HS together with the tri-linear interpolation approach for conversion from HS to SQ outperforms all other approaches. This result is expected because this approach is Table 2. Comparison of six approaches for image conversion on Mary image
closer to real linear interpolation from SQ to HS and then back from HS to SQ, where the intensities of hexagonal pixels are computed from the known intensities of square pixels through a bilinear interpolation, and the intensities of square pixels can then be recomputed through a tri-linear interpolation from the intensities of the hexagonal pixels. The above results are further confirmed by the similar results obtained using the Mary image shown in Figure 5, as displayed in Table 2.
5 Conclusions In this paper, we have presented various interpolation approaches used to obtain the light intensities of pixels on a virtual hexagonal structure and to re-assign the light intensities of the original square pixels. The experimental results show that the bilinear approach together with the tri-linear approach outperforms all other approaches, including the method shown in [8]. When converting images between the square structure and the hexagonal structure, this approach gives the best match and results in less loss of image resolution. It is worth noting that, in the bilinear approach (Method 2), the computation of the intensity of each hexagonal pixel does not require taking into account the intensities of all its 56 sub-pixels. Similarly, in the tri-linear approach (Method 2), the computation of the intensity of each square pixel does not require taking into account the intensities of all its 49 sub-pixels. This greatly reduces the processing time for interpolation compared with the simple averaging approaches. Although we have not yet computed and compared the time required for each conversion method, it can be predicted that the method using bilinear interpolation from SQ to HS and tri-linear interpolation from HS to SQ is much faster than the other methods because it computes the pixel value for each hexagonal pixel only once, and vice versa. The bi-cubic interpolation approach shown in [9] together with a tri-cubic approach may also be used for image conversion and is expected to give more accurate image matching between the two image structures. However, bi-cubic/tri-cubic interpolation may also increase the conversion time. A hybrid method combining a bilinear/tri-linear method with a bi-cubic/tri-cubic interpolation method, which can convert images quickly and also provide accurate image matching, will be our future goal. A faster and more accurate interpolation for converting images between SQ and HS will benefit existing research on intelligent computer vision and other closely related areas.
References 1. Horn, B.K.P.: Robot Vision. MIT Press, Cambridge, MA & McGraw-Hill, New York (1986) 2. Staunton, R.: The Design of Hexagonal Sampling Structures for Image Digitization and Their Use with Local Operators. Image and Vision Computing 7(3), 162–166 (1989) 3. Wuthrich, C.A., Stucki, P.: An Algorithmic Comparison between Square- and Hexagonalbased Grids. CVGIP: Graphical Models and Image Processing 53(4), 324–339 (1991)
4. He, X.: 2-D Object Recognition with Spiral Architecture. PhD Thesis. University of Technology, Sydney (1999) 5. Wu, Q., He, X., Hintz, T.: Virtual Spiral Architecture. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, vol. 1, pp. 399–405 (2004) 6. Wang, H., Wang, M., Hintz, T., He, X., Wu, Q.: Fractal Image Compression on a Pseudo Spiral Architecture. Australian Computer Science Communications 27, 201–207 (2005) 7. He, X., Jia, W.: Hexagonal Structure for Intelligent Vision. In: Proceedings of International Conference on Information and Communication Technologies (ICICT05), pp. 52–64 (2005) 8. He, X., Hintz, T., Wu, Q., Wang, H., Jia, W.: A New Simulation of Spiral Architecture. In: International Conference on Image Processing, Computer Vision and Pattern Recognition (IPCV06), pp. 570–575 (2006) 9. Tian, Y., Liu, B., Li, T.: A Local Image Interpolation Method Based on Gradient Analysis. In: International Conference on Neural Networks and Brain (ICNN&B05), vol. 2, pp. 1202–1205 (2005) 10. Sheridan, P., Hintz, T., Alexander, D.: Pseudo-invariant Image Transformations on a Hexagonal Lattice. Image and Vision Computing 18, 907–917 (2000) 11. He, X., Wang, H., Hur, N., Jia, W., Wu, Q., Kim, J., Hintz, T.: Uniformly Partitioning Images on Virtual Hexagonal Structure. In: 9th International Conference on Control, Automation, Robotics and Vision (IEEE ICARCV06), pp. 891–896. IEEE Computer Society Press, Los Alamitos (2006)
Action Recognition with Semi-global Characteristics and Hidden Markov Models Catherine Achard, Xingtai Qu, Arash Mokhber, and Maurice Milgram Institut des Systèmes Intelligents et Robotique, Université Pierre et Marie Curie, 4 place Jussieu, Boite courrier 252, 75252 Paris Cedex 05 {achard,maum}@ccr.jussieu.fr,
[email protected],
[email protected]
Abstract. In this article, a new approach is presented for action recognition with only one non-calibrated camera. Invariance to viewpoint is obtained with several acquisitions of the same action. The originality of the presented approach consists of characterizing sequences by a temporal succession of semi-global features, which are extracted from "space-time micro-volumes". The advantages of the proposed approach are the use of robust features (estimated over several frames), combined with the ability to manage actions of variable duration and to easily segment the sequences with algorithms that are specific to time-varying data. For the recognition, each view of each action is modeled by a Hidden Markov Model. Results presented on 1614 sequences of everyday life actions like "walking", "sitting down" and "bending down", performed by several persons, validate the proposed approach.
1 Introduction Human activity recognition has received much attention from the computer vision community ([6], [8], [18]) since it leads to several important applications such as video surveillance for security, human-computer interaction, entertainment systems, monitoring of patients in hospitals, and monitoring of elderly people in their homes. The different approaches can be divided into four groups: (i) 3D approaches without a shape model; (ii) 3D approaches with volumetric models such as elliptical cylinders and spherical models; (iii) 2D approaches with an explicit shape model such as stick figures and 2D ribbons; and (iv) 2D approaches without an explicit shape model. Since the human body is not a rigid object and may present a multitude of postures for the same person, a robust modeling is difficult to obtain. Therefore, appearance models are utilized rather than geometric models. Action recognition can then be considered as the classification of time-varying feature data, i.e., matching an unknown sequence with a group of labeled sequences representing typical actions. For this step, the characterization of actions can be done either globally or as a temporal set of local features.
1.1 Global Representation of Sequences The advantage of this representation is that sequences are not characterized as temporal objects. An action is then represented by only one vector. This feature is robust because it is computed globally for all the sequence. Simple measurements, such as, the Mahalanobis distance can be used to determine the similarity between two actions. This method has been employed by Bobick and Davis [2] who characterize an action with: (i) a binary motion-energy image (MEI), which represents where motion has occurred in an image sequence; and (ii) a motion-history image (MHI) which is a scalar-valued image where intensity is a function of recency of motion. Given a set of MEIs and MHIs for each view/action combination, a statistical model of the 7 Hu moments has been generated for both the MEI and MHI. To recognize an input action, the Mahalanobis distance is estimated between the moment description of the input and each of the known actions. Several researchers have considered actions as a space-time volume. Shechtman and Irani [15] have proposed to extend the notion of 2D image correlation into the 3D space-time volume; thus allowing correlating dynamic behaviours and actions. Another approach [5] consists of detecting informative feature points in the 3D volume (x,y,t) and characterizing the spatio-temporally windowed data surrounding these feature points, similar to approaches in object recognition [1]. The global study of action can be managed with empirical distribution of some features. Chomat and Crowley [3] have performed probabilistic recognition of activities from local spatio-temporal appearance. Joint statistics of space-time filters have been employed to define histograms that characterize the activities to be recognized. These histograms provide the joint probability density functions required for recognition by using the Bayes rule. Dynamic events have been regarded as long-term temporal objects, which are characterized by spatio-temporal features at multiple temporal scales [20]. They have designed a statistical distance measure between video sequences. Finally, motivated by the recent success of the boosting process, Ke et al. [9] have constructed a real-time event detector for each action of interest by learning a cascade of filters based on volumetric features that scans video sequences in space and time. 1.2 Sequence Modelling as Temporal Object In the previous approaches, actions were considered globally and not as a sequence of images. As mentioned before, robust features were thus obtained and used with simple distances since actions are represented by only one vector. The disadvantage of the global method is that the segmentation of sequences in several actions is difficult to obtain and can be very time consuming. In the second approach, sequences are considered as a temporal set of local features. Martin and Crowley [10] have proposed a system for hand gesture recognition composed of three modules including tracking, posture classification, and gesture recognition by a set of finite state machines. Cupillard et al. [4] have used a finite state automaton to recognize sequential scenarios in a context of metro surveillance. For composed scenarios they have employed Bayesian networks (several layers of naive Bayesian classifiers) as proposed by Hongeng et al. [7]. Another approach to deal with temporal data consists of employing Dynamic Time Warping (DTW) to match sequences. Pierobon et al. [12], for example, extract
features directly from 3D data (x,y,z), making the system insensitive to viewpoint. Frame-by-frame descriptions, generated from gesture sequences, are collected and compared with DTW. Other researchers have preferred the use of Hidden Markov Models (HMMs) [14], which constitute an important tool to recognize temporal objects of variable duration. Hidden Markov Models were initially used for speech recognition. Now, they are widely employed in image processing. Yamato et al. [19] developed one of the first HMM-based gesture recognition systems to distinguish between 6 tennis strokes. Starner et al. [16] have proposed a real-time HMM-based system for recognizing sentence-level American Sign Language (ASL) without explicitly modelling the fingers. In the present work an innovative solution is proposed, where the extracted features are semi-global (estimated on "space-time micro-volumes" generated from several images of the sequence). The proposed approach, similar to methods used in speech recognition, allows us to work with robust features and to use algorithms dedicated to temporal data for sequence recognition or segmentation. These features, which characterize "micro-movements", are extracted from 3D spatio-temporal volumes comprising all moving points (x,y,t) detected in a temporal window. These "space-time micro-volumes" contain various kinds of information, such as the silhouette of the person in each image or the action dynamics (the latter is lost when sequences are considered as a succession of local features extracted independently from each image). A study on the dimension of the temporal window to be used is presented and shows the interest of the proposed approach. The temporal chains obtained are then introduced into a Hidden Markov Model (HMM) system for recognition. The advantages are (i) the ability to manage actions of variable duration, (ii) speed, and (iii) the ease of segmenting the sequences with the Viterbi algorithm [14]. In the following, we first detail the approach used to detect moving pixels in each image. We then present the features selected to characterize the sequences, which constitute the input of the recognition system described in the subsequent section. Recognition results on real sequences with several actions of everyday life, like walking, sitting on a chair, jumping, bending or crouching, are finally presented in Sections 6 and 7.
2 Motion Detection The first stage of the activity recognition process consists of detecting moving pixels. Therefore, the current image is compared at any given time to a reference image that is continuously updated. A second stage is also necessary to remove shadows that may be present in the scene. To allow multi-modal backgrounds, the history of each pixel of the reference image is modeled by a mixture of K Gaussian distributions [17]. The probability of observing the value of the current pixel X_t is then given by:

P(X_t) = \sum_{i=1}^{K} w_{i,t} \, N(X_t, \mu_{i,t}, \Sigma_{i,t}),   (1)
where, for the i-th Gaussian at time t, w_{i,t} is the weight of the Gaussian, μ_{i,t} is its mean value and Σ_{i,t} its covariance matrix. N(·) is the Gaussian probability density function, defined as follows:

N(X, \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{(X-\mu)^{T} \Sigma^{-1} (X-\mu)}{2} \right),   (2)
where n is the dimension of the vector. In this study n is equal to 3 because 3 channels are used for color images. Initialization of the Gaussian mixture is carried out by the K-means algorithm on the first 40 images of the sequence, where it is assumed that no movement occurs. Each pixel of the background is modeled by K=2 Gaussians. This appears to be a reasonable compromise between computing time and the quality of the results. For each new pixel Xt, the most likely Gaussian is searched for. If the probability given by this Gaussian for the current pixel is less than a threshold value, the latter is assigned to the background. Otherwise, it is classified as a pixel belonging to a moving object. To account for lighting changes during the acquisition process, pixels labeled as background are used to update the reference image and thus the Gaussian they belong to:
\mu_t = (1-\alpha)\,\mu_{t-1} + \alpha X_t, \qquad \Sigma_t = (1-\alpha)\,\Sigma_{t-1} + \alpha (X_t - \mu_t)(X_t - \mu_t)^{T},   (3)
where α was empirically fixed at 0.1. This method leads to reasonably good detection results. However, shadows are often detected as a moving object. As a result, the shapes of the detected silhouettes are significantly deteriorated, which disturbs the action recognition algorithm. A second stage is employed to address this issue. In this work it is assumed that shadows decrease the brightness of pixels but do not affect their color, as proposed in [13]. Thus, the angle Φ between the color vector of the current pixel Xt and that of the corresponding background pixel Bt (the mean of the Gaussian associated with the pixel) is an effective parameter for detecting shadows. Note that if Φ is below a threshold value, and the brightness of the current pixel is smaller than the brightness of the background, it is assumed that the pixel corresponds to a shadow. Therefore, a shadow is defined as a cone around the color vector corresponding to the background, as shown in Figure 1.
Fig. 1. Shadow is defined as a cone
At the end of the process, only pixels that are detected as moving by the mixture of Gaussians and that do not correspond to shadows are preserved. Several morphological operations end this stage and lead to a binary map of moving pixels for each image. As can be seen in Figures 2a and 2b, fairly good quality detection results are obtained. However, as presented in Figure 2c, on some images of these sequences the detection is not as clean. This is due to close similarities between the background colors and those of the moving person, to a slight change in the position of the camera, or to noise. Nonetheless, the space-time characterization of these binary images, presented in Section 3, is robust enough to lead to quite acceptable action recognition results.
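The per-pixel operations of Eqs. (1)-(3) and the shadow cone test can be sketched as follows (a single-pixel illustration with our own function names; the angle threshold is an assumed value, not taken from the paper):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate normal density N(x, mu, sigma) of Eq. (2)."""
    n = x.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.inv(sigma) @ diff
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

def mixture_probability(x, weights, mus, sigmas):
    """P(x) under the K-component mixture of Eq. (1)."""
    return sum(w * gaussian_density(x, m, s)
               for w, m, s in zip(weights, mus, sigmas))

def update_background(x, mu, sigma, alpha=0.1):
    """Eq. (3): update the matched Gaussian of a pixel labelled as background."""
    mu_new = (1 - alpha) * mu + alpha * x
    d = (x - mu_new).reshape(-1, 1)
    sigma_new = (1 - alpha) * sigma + alpha * (d @ d.T)
    return mu_new, sigma_new

def is_shadow(x, background, angle_thresh=0.1):
    """Shadow cone test: small angle to the background colour vector and
    lower brightness. angle_thresh (in radians) is an assumed value."""
    cos_phi = np.dot(x, background) / (
        np.linalg.norm(x) * np.linalg.norm(background) + 1e-12)
    phi = np.arccos(np.clip(cos_phi, -1.0, 1.0))
    return phi < angle_thresh and np.linalg.norm(x) < np.linalg.norm(background)
```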
Fig. 2. Typical results for good and poor motion detection: (a) image difference, (b) with shadow modeling, (c) poor detection
3 Features Extracted from Sequences Features representative of the sequence are extracted from all the binary images. If several persons are present in the scene, a supplementary task is required by the system: the tracking of persons, as proposed by Mostafaoui et al. [11]. These features constitute the input of the Hidden Markov Model system for action recognition. To obtain robust features, we have chosen to work with local "space-time volumes", representative of "micro-movements" and composed of the binary silhouettes extracted over a temporal window of the sequence. They are characterized by their three-dimensional geometrical moments. This characterization of "micro-volumes" makes it possible, as in speech recognition, to exploit the dynamics of the actions ("micro-movements") and to keep local characteristics that can be introduced into an HMM-based system to manage time-varying feature data. Let {x,y,t} be the set of points belonging to the binary "space-time micro-volume", where x and y represent the space coordinates and t the temporal coordinate. The moment of order (p+q+r) of this volume is determined by:

A_{pqr} = E\{ x^p y^q t^r \},   (4)
where E{x} represents the expectation of x. In order to work with features invariant to translation, the central moments are considered, as follows:

AC_{pqr} = E\{ (x - A_{100})^p (y - A_{010})^q (t - A_{001})^r \}.   (5)
These moments must be invariant to scale in order to preserve invariance to the distance of the action or to the size of the person. A direct normalization of the different axes, by dividing each component by the corresponding standard deviation, is not desirable
because it leads to a significant loss of information, that is, the shape of the binary silhouettes appears rounder. Therefore, an identical normalization is carried out on the first two axes, while the third (time) is normalized separately. The normalization, performed by preserving the width-to-height ratio of the binary silhouettes, is thus obtained by the following relation:

M_{pqr} = E\left\{ \left( \frac{x - A_{100}}{AC_{200}^{1/4} AC_{020}^{1/4}} \right)^{p} \left( \frac{y - A_{010}}{AC_{200}^{1/4} AC_{020}^{1/4}} \right)^{q} \left( \frac{t - A_{001}}{AC_{002}^{1/2}} \right)^{r} \right\}.   (6)
Each space-time "micro-volume" is thus characterized by a vector of features o composed of the 14 moments of 2nd and 3rd order:

o = {M_{200}, M_{011}, M_{101}, M_{110}, M_{300}, M_{030}, M_{003}, M_{210}, M_{201}, M_{120}, M_{021}, M_{102}, M_{012}, M_{111}}.   (7)
Note that the moment M_{020} is not calculated. This is due to the normalization, which makes M_{020} inversely proportional to M_{200}. In addition, the moment M_{002} is always equal to 1. This vector is extracted over a sliding temporal window. Therefore, a sequence is represented by a temporal succession of 14-dimensional vectors:

O = {o_1, o_2, ..., o_T}.   (8)
Action recognition is then obtained with HMMs.
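A compact sketch of the feature extraction of Eqs. (4)-(7) is given below (our own illustration; the expectation E{·} is interpreted as the mean over the set of moving points of the binary micro-volume, which is assumed to be non-degenerate):

```python
import numpy as np

def spacetime_features(volume):
    """14-dimensional moment vector of Eqs. (4)-(7) for one binary
    space-time micro-volume (boolean array of shape (T, H, W))."""
    t, y, x = np.nonzero(volume)              # coordinates of the moving points
    x, y, t = x.astype(float), y.astype(float), t.astype(float)
    # central coordinates, Eq. (5)
    xc, yc, tc = x - x.mean(), y - y.mean(), t - t.mean()
    # normalization of Eq. (6): identical scale for x and y, separate for t
    ac200, ac020, ac002 = np.mean(xc**2), np.mean(yc**2), np.mean(tc**2)
    s_xy = (ac200 ** 0.25) * (ac020 ** 0.25)
    xn, yn, tn = xc / s_xy, yc / s_xy, tc / np.sqrt(ac002)
    def m(p, q, r):
        return np.mean(xn**p * yn**q * tn**r)
    orders = [(2,0,0), (0,1,1), (1,0,1), (1,1,0),            # 2nd order
              (3,0,0), (0,3,0), (0,0,3), (2,1,0), (2,0,1),   # 3rd order
              (1,2,0), (0,2,1), (1,0,2), (0,1,2), (1,1,1)]
    return np.array([m(p, q, r) for p, q, r in orders])      # Eq. (7)
```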
4 HMM for Action Recognition An action can be considered as a sequence of configurations belonging to the set of states {q_1, q_2, ..., q_N} of an HMM (N is the number of states in the model). The Markov chain with N states is fully specified by the triplet λ = (A, B, π), where:
- A is the state transition probability matrix: A = {a_{ij} | a_{ij} = P(S_{t+1} = q_j | S_t = q_i)}, where S_t represents the state at time t.
- B = {b_1(o), b_2(o), ..., b_N(o)} corresponds to the observation probability for each state. As we are working with continuous data, observations are modelled with a Gaussian distribution: b_j(o) = N(o, μ_j, Σ_j), where o is the 14-dimensional feature vector previously presented (equation (8)), and μ_j and Σ_j are the mean and covariance matrix of the Gaussian for the j-th state of the chain.
- Π = (π_1, π_2, ..., π_N) represents the initial state distribution.
An HMM is created for each action and each view (37 HMMs). The set of parameters λ_k is learned on a training database with the Baum-Welch algorithm [14]. This is an Expectation-Maximization (EM) algorithm, which maximizes the likelihood that the HMMs generate all the given training sequences. To recognize a given action O, we evaluate P(O|λ_k) with the "forward-backward" algorithm for each of the k classes and we choose the class with the maximum probability to identify the sequence:

cl = arg max_k P(O | λ_k).   (9)
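The likelihood P(O|λ_k) of Eq. (9) can be evaluated with a scaled forward pass, as in the following sketch (our own illustration with Gaussian emissions; parameter names are assumptions and Baum-Welch training is not shown):

```python
import numpy as np

def log_gaussian(o, mu, sigma):
    """Log of the Gaussian emission density b_j(o) with full covariance."""
    d = o - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (len(o) * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(sigma, d))

def log_likelihood(O, pi, A, mus, sigmas):
    """log P(O | lambda) by the scaled forward algorithm.
    O: (T, 14) observation sequence; pi: (N,); A: (N, N);
    mus: (N, 14); sigmas: (N, 14, 14)."""
    T, N = len(O), len(pi)
    logB = np.array([[log_gaussian(O[t], mus[j], sigmas[j])
                      for j in range(N)] for t in range(T)])
    loglik = 0.0
    alpha = pi * np.exp(logB[0] - logB[0].max())
    loglik += logB[0].max() + np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, T):
        alpha = (alpha @ A) * np.exp(logB[t] - logB[t].max())
        loglik += logB[t].max() + np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

def classify(O, models):
    """Eq. (9): pick the model lambda_k maximising P(O | lambda_k).
    models: list of (pi, A, mus, sigmas) tuples, one per class/view."""
    return int(np.argmax([log_likelihood(O, *m) for m in models]))
```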
In the following, we will present the training database.
5 Presentation of the Sequence Database A sequence database comprising 8 actions is considered: (1) "to crouch down", (2) "to stand up", (3) "to sit down", (4) "to sit up", (5) "to walk", (6) "to bend down", (7) "to get up from bending", and (8) "to jump". Various viewpoints were acquired for each action. The front, 45° and 90° views were captured, while others were synthesized from the sequences already recorded (at -45° and -90°). Each action was executed by 7 people and repeated 230 times on average. The database comprises 1614 sequences. Presented in Figure 3 are some examples of images of the database representing various actions and silhouettes of the actors.
6 Recognition Rate Based on the Length of the Temporal Window and the Number of States Features characterizing the sequences are extracted from a sliding temporal window in order to obtain a judicious normalization with respect to the size of people or the scale of the actions (cf. Section 3). Moreover, this leads to robust features containing information on the dynamics of actions. The joint influence of the length of the temporal window and the
Fig. 3. Typical images of the database: (a) to sit down, -45°; (b) to crouch down, face; (c) to crouch down, -90°; (d) to bend down, -90°; (e) to walk, 135°; (f) to walk, face
number of states of the Markov chain is studied. For these tests, recognition rates are obtained by placing each of the 7 persons in the test database, one by one (the training is achieved with the six other persons). The recognition rates presented below are the average rates over the seven persons. A Markov chain is trained for each of the 37 classes (each action being observed from several views). For classification, it is considered that the various viewing angles of the same action belong to the same class. Therefore, recognition results with 8 classes are presented. Tests are carried out with the length of the temporal window varying from 2 to 17 images and the number of states varying from 2 to 6 in the HMM process. Figure 4 presents the evolution of the recognition rates according to the length of the temporal window. It can be seen that the best results are obtained for a window length of around eight images and two or three states for the HMM. This confirms the interest of working with semi-global features estimated from "space-time micro-volumes", rather than considering the sequence as a succession of features extracted independently on each image. As can be seen in Figure 4, the best results (89% of correct recognition) are obtained for 3 states in the Markov chain and a window length equal to 7. In addition, it is observed that a large number of states strongly deteriorates the results. In the next step of this work, a temporal window of length 7 associated with 3 states for the Markov chains is used. Table 1 presents the confusion matrix obtained with these parameters. Actions are generally well recognized, with a minimum rate of correct recognition of 81.5%, corresponding to
- action (3) "to sit down", sometimes confused with action (8) "to jump", or
- action (4) "to sit up", sometimes confused with action (7) "to get up from bending".
While actions "to sit up" and "to get up from bending" seem similar, the confusion between "to sit down" and "to jump" is more surprising. The study of the binary
Fig. 4. Recognition rate according to the length of the temporal window and the number of states in HMMs
Table 1. Confusion matrix between actions

     1      2      3      4      5      6      7      8
1   88.6   0      0.3    0      0      2.6    0      8.5
2    0    93.8    0      0.6    0      0      0.3    5.2
3    0.83  0     81.5    0.28   1.7    3.9    0.3   11.6
4    0     3.6    0     81.5    0      0     10.2    4.7
5    0     0      0.1    0.6   95.4    2.4    0.1    1.3
6    6.2   0      5.1    0      0.2   84.6    0      3.9
7    0     2.9    0.4    2.1    0.4    0.4   91.5    2.3
8    0     1.29   3.3    0      0.4    0.4    2.5   92.1
Fig. 5. Some images (one out of six) of a sequence belonging to the action “to jump”
silhouettes reveals, however, a passage through similar states, mainly produced by the run-up before the jump, as illustrated in Figure 5.
7 Recognition Rate Based on the Person Carrying Out the Action Presented in Table 2 are the seven recognition rates obtained by placing each of the 7 persons in the test database, one by one (training is achieved with the six other persons). The average recognition rates, over the 8 actions, vary from 70.3% to 95.3% depending on the person. A poor recognition rate (70.3%) appears for the seventh person. This is not surprising because this person presents a particular binary silhouette due to her clothing, as shown in Figure 3. This person wears a long skirt (and she is the only person with a skirt in the database). Table 2. Recognition rate based on the person carrying out the action
Person     1      2      3      4      5      6      7
Rate (%)  95.3   93.6   80.2   90.8   88.2   93.3   70.3
The conclusion of these tests is that the method copes with different morphologies of the people: the first six actors have different morphologies (heights varying from 1.57 to 1.85 meters) and their actions are well recognized. It should be noted that these people wore trousers. However, for the seventh person, the clothing (a long skirt) changes the shape of the binary silhouettes. This is an issue that cannot be addressed by normalization, but it may be solved by an extension of the training database.
8 Summary and Conclusions In this work, a method to recognize human actions of everyday life is proposed. We have chosen to work with semi-global characteristics, which are computed on "space-time micro-volumes" generated from several images of the sequence. As a result, the robustness of global approaches is preserved, while algorithms dedicated to time-varying feature data, like HMMs, can be used to facilitate the recognition and segmentation of sequences. In this innovative solution, similar to methods used in speech recognition, features are extracted from 3D "space-time micro-volumes" containing a lot of information, such as the silhouette of the person in each image or the action dynamics (the latter is lost when sequences are considered as a succession of local features extracted independently from each image). A study on the size of the temporal window to be used is presented and validates the interest of the presented approach. A recognition rate of 89% on average was obtained from a database of 1614 sequences divided into 8 actions and carried out by 7 people.
References 1. Bigorgne, E., Achard, C., Devars, J.: Local Zernike Moments Vector for Content based Queries in Image Database. In: Machine Vision and Applications, Tokyo, Japan, pp. 327– 330 (2000) 2. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 257–267 (2001) 3. Chomat, O., Crowley, J.L.: Probabilistic recognition of activity using local appearance. In: IEEE International Conference on Computer Vision and Pattern Recognition, Colorado, USA (1999) 4. Cupillard, F., Avanzi, A., Brémond, F., Thonnat, M.: Video Understanding for Metro Surveillance. In: IEEE International Conference on Networking, Sensing and Control, Taipei, Taiwan (2004) 5. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior Recognition via Sparse SpatioTemporal Features. In: IEEE International workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), Beijing, China (2005) 6. Gavrila, D.M.: The visual analysis of human movement: a survey. Computer Vision and Image Understanding 73, 82–98 (1999) 7. Hongeng, S., Bremond, F., Nevatia, R.: Bayesian framework for video surveillance application. In: International Conference on Computer Vision, Barcelona, Spain (2000) 8. Hu, W., Tan, T., Wang, L., Maybank, S.: A Survey on Visual Surveillance of Object Motion and Behaviors. IEEE Transaction on System, Man and Cybernetics 34, 334–352 (2004) 9. Ke, Y., Sukthankar, R., Hebert, M.: Efficient Visual Event Detection using Volumetric Features. In: IEEE International Conference on Computer Vision, Beijing, China (2005) 10. Martin, J., Crowley, J.L.: An appearance based approach to gesture recognition. In: International Conference on Image Analysis and Processing, Florence, Italy (1997) 11. Mostafaoui, G., Achard, C., Milgram, M.: Real time tracking of multiple persons on color image sequences. In: Blanc-Talon, J., Philips, W., Popescu, D.C., Scheunders, P. (eds.) ACIVS 2005. LNCS, vol. 3708, Springer, Heidelberg (2005)
12. Pierobon, M., Marcon, M., Sarti, A., Tubaro, S.: Clustering of human actions using invariant body shape descriptor and dynamic time warping. In: IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), Como, Italy, IEEE, Los Alamitos (2005) 13. Porikli, F., Tuzel, O.: Human body tracking by adaptive background models and meanshift analysis. In: IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Nice, France (2003) 14. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition, Readings in speech recognition, pp. 267–296. Morgan Kaufmann Publishers Inc. San Francisco (1990) 15. Shechtman, E., Irani, M.: Space-Time Behavior Based Correlation. In: IEEE International Conference on Computer Vision and Pattern Recognition 2005, San Diego, CA, USA, pp. 405–412. IEEE, Los Alamitos (2005) 16. Starner, T., Weaver, J., Pentland, A.: Real time American sign language recognition from video using HMMs. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 1371–1375 (1998) 17. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. In: IEEE International Conference on Computer Vision and Pattern Recognition, Ft. Collins, USA, pp. 246–252. IEEE, Los Alamitos (1999) 18. Wang, J.J., Singh, S.: Video Analysis of Human Dynamics - a survey. Real-time Imaging Journal 9, 320–345 (2003) 19. Yamato, J., Ohya, J., Ishii, K.: Recognizing Human Action in Time-Sequential Images using Hidden Markov Models. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 379–385. IEEE I, Los Alamitos (1992) 20. Zelnik-Manor, L., Irani, M.: Event based analysis of video. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 123–130. IEEE, Los Alamitos (2001)
Patch-Based Experiments with Object Classification in Video Surveillance Rob Wijnhoven1,2 and Peter H.N. de With2,3 1
Bosch Security Systems B.V., Glaslaan 2, Eindhoven, The Netherlands 2 Technische Universiteit Eindhoven, Eindhoven, The Netherlands 3 LogicaCMG, Tech. Softw. Eng., Eindhoven, The Netherlands
Abstract. We present a patch-based algorithm for the purpose of object classification in video surveillance. Within detected regions-of-interest (ROIs) of moving objects in the scene, a feature vector is calculated based on template matching of a large set of image patches. Instead of matching direct image pixels, we use Gabor-filtered versions of the input image at several scales. This approach has been adopted from recent experiments in generic object-recognition tasks. We present results for a new typical video surveillance dataset containing over 9,000 object images. Furthermore, we compare our system performance with another existing smaller surveillance dataset. We have found that with 50 training samples or higher, our detection rate is on the average above 95%. Because of the inherent scalability of the algorithm, an embedded system implementation is well within reach.
1 Introduction
Traditional video surveillance systems comprise video cameras generating content-agnostic video streams, which are recorded by digital video recorders. Recently, there has been a shift towards smart cameras that generate a notion of the activity in the monitored scene by means of Video Content Analysis (VCA). State-of-the-art VCA systems comprise object detection and tracking, thereby generating location data of key objects in the video imagery of each camera. For video surveillance, this technology can be used to effectively assist security personnel. While the detection and tracking algorithms are becoming mature, the classification of the detected objects is still in an early stage. Classification of the detected objects is commonly done using the size of the object, where simple camera calibration is applied to compensate for the perspective. However, effects such as shadows and occlusion negatively influence the segmentation process and thus the object classification (e.g. shadows increase the object size, and occlusion decreases the size). Furthermore, when objects cross each other, they may be combined into one object. For improved scene understanding, more advanced object models are required, taking specific object features from the video into account. The aim of our object modeling is to classify various objects in a reliable way, thereby supporting the decision-making process for a security operator of a CCTV surveillance system.
In the presented work, we assume that the camera image has been segmented into a static background and moving foreground objects using the algorithm proposed in [1]. Initially, a texture and intensity analysis is applied between the input image and the background reference frame at low resolution. The resulting initial foreground image blocks are further analyzed at high resolution to obtain a pixel-true segmentation mask. The extracted objects are represented by a shape and bounding box description and will be referred to as Regions-Of-Interest (ROIs) in the remainder of the paper. In previous work [2], [3], wire-frame models were matched onto the detected ROIs that represent the detected objects. The disadvantage of this approach is that for each object, such a wire-frame model has to be designed and when the number of objects grows, the classification distance between the models decreases. Furthermore, the computational requirement grows linearly with the number of object models. As an alternative, in this paper we study a patch-based algorithm as proposed by Serre et al. [4]. In this technique, the computationally expensive stage of template and pattern matching is independent of the number of object classes and the classification is performed afterwards, on a subset of the data, using feature vectors. Classification results for this algorithm show that a classification rate above 95% is possible. The two approaches are compared under the conditions of a possible implementation in an embedded environment, where the computation power available is strictly limited and scalability of the algorithm is important. The remainder of the paper is organized as follows. In Section 2 related work is presented. Section 3 discusses the model that we use for object classification. The dataset used is introduced in Section 4. The results of the algorithm are presented in Section 5, including a discussion on the comparison of the presented algorithm and the previously considered wire-frame approach. The paper ends with conclusions and future work.
2 Related Work
Model-based object classification/detection approaches are based on two different classes of models: rigid (non-deformable) and non-rigid (deformable) models. Rigid models are commonly used for the detection of objects like vehicles, whereas non-rigid models are typically used for person detection. In the following, we consider three types of algorithms. In various surveillance systems, classification methods are commonly based on the pixel-size of the object’s ROI. More advanced algorithms for traffic surveillance match 3D wire-frame models onto the input image for the purpose of object tracking or classification. Within the domain of generic object recognition in large multimedia databases, various proposed algorithms are based on low-level local descriptors that model the object’s appearance. Each of the three methods will now be addressed in more detail. Region-of-interest methods are the simplest object models and are computationally inexpensive. Systems that segment the camera input images into a static
background image and moving foreground images (e.g. [1]), generate the object’s ROI, which already provides some information about the detected objects, e.g. pixel-size and -speed. Bose and Grimson [5] use the area of the bounding box and the percentage of foreground pixels within the box as features. Furthermore, the y-coordinate is used to compensate for the perspective in the scene. A different method for obtaining perspective invariance is applied by Haritaoglu et al. [6], who use projection histograms in the x- and y-directions for tracked objects to make a distinction between various object types. Wire-frame models have been proposed for the purpose of model-based object detection and tracking [2], [3]. For a more complete overview, we refer to previous work of the authors [7], where rigid object models have been considered for the purpose of vehicle classification. The algorithm is briefly summarized here as it will be discussed later in the paper. Within the already available ROI, the algorithm tries to find the best matching image position for all models in the database. After applying a 3 × 3 Sobel filter to the image in the x- and y-directions, a histogram of gradient orientations is generated, from which the object orientation is extracted. Next, the 3D wire-frame model is projected onto the 2D camera image, using the calculated orientation and the center of the ROI as the object location. The projected 2D line set is shifted over the image region and a matching error is calculated for each pixel position. The position giving the smallest error defines the best matching pixel position. This is performed for all models in the database, and the model with the lowest matching error is chosen as the classified object model. Low-level image features describing the object appearance are used by several object recognition systems. Haar wavelets are commonly used because of their low computational complexity [8], [9], [10]. Mikolajczyk and Schmid [11] compare the performance of various local interest descriptors. They show that Scale Invariant Feature Transform (SIFT) descriptors and the proposed extension of SIFT, Gradient Location and Orientation Histogram (GLOH), outperform other methods. Dalai and Triggs [12] compare the performance of Haar wavelets, PCA-SIFT [13] and Histogram Of Gradient methods (HoG). They show that the HoG method outperforms the others. Mikolajczyk et al. [14] generate HoG features for the purpose of person detection, extended with Laplacian-filtered versions of the input images as blob detectors. Ma and Grimson [15] propose a method based on SIFT for the purpose of vehicle classification in traffic video using a constant camera viewpoint. Serre et al. [4] model findings from biology and neuroscience using a hierarchical feed-forward architecture. The model is shown to have performance in line with human subjects, considering the first 150 ms of the human visual system in a simple binary classification task [16]. Serre et al. have shown that the algorithm outperforms SIFT in the generic object-recognition task. As mentioned, the advantage of this approach is that the image analysis part is independent of the
number of object classes. For this reason, the algorithm is suited for embedded implementation and was therefore adopted for further exploration.
3 Algorithm Model
Since humans are good at object classification, it is reasonable to look into biological and neurological findings. Based on findings from Hubel and Wiesel [17], Riesenhuber and Poggio have developed the "HMAX" model [18] that has been extended recently by Serre [19], [4] and optimized by Mutch and Lowe [20]. We have implemented the model proposed by Serre up to the second processing layer. In his thesis, Serre [16] proposes to extend the model with additional third and fourth layers. For completeness, we will address the working of the algorithm in the following. A simplified graphical representation of the model for classification of objects detected in a video camera is shown in Figure 1, where the first step of object detection is described in [1].
Fig. 1. Architecture for classification of objects in camera image
The algorithm is based on the concept of a feed-forward architecture, alternating between simple and complex layers, in line with the findings of Hubel and Wiesel [17]. The first layer implements line detectors by filtering the gray-level input image with Gabor filters of several sizes to obtain scale invariance. The filters are normalized to have zero mean and a unity sum of squares. The smallest filter (at scale zero) has a size of 7 × 7 elements, increasing for every scale up to 37 × 37 elements (at scale 15). The Gabor response is defined by:

G(x, y) = \exp\left(-\frac{X^2 + \gamma^2 Y^2}{2\sigma^2}\right) \cos\left(\frac{2\pi}{\lambda} X\right),    (1)

where

X = x \cos\theta - y \sin\theta,    (2)
Y = x \sin\theta + y \cos\theta.    (3)
We use the parameters as proposed by Serre et al. [4]. After applying the Gabor filters onto the input image, the results are normalized. This compensates for the image energy in each area of the input image that is used to generate the filter-response. Hence, the final filter response for each filter is defined as:

R(I, F) = \frac{\sum_i I_i F_i}{\sqrt{\sum_i I_i^2}},    (4)
where I_i denotes the pixels of the input image, and F_i denotes the actual pixels within the filter aperture. This filter response is called the S1 feature map. An example of such a response for a car image is shown in Figure 2.
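As an illustration of the S1 stage, the short NumPy/SciPy sketch below builds a zero-mean, unit-norm Gabor filter following Eqs. (1)–(3) and evaluates the normalized response of Eq. (4); the parameter values (wavelength, σ, γ) are illustrative placeholders rather than the exact values of Serre et al. [4].

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size, wavelength, sigma, theta, gamma=0.3):
    """Gabor filter of Eqs. (1)-(3), normalized to zero mean and unit sum of squares."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    X = x * np.cos(theta) - y * np.sin(theta)
    Y = x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(X**2 + (gamma * Y)**2) / (2.0 * sigma**2)) * np.cos(2.0 * np.pi * X / wavelength)
    g -= g.mean()
    g /= np.sqrt((g**2).sum())
    return g

def s1_response(image, kernel, eps=1e-6):
    """Normalized response of Eq. (4): filter output divided by the local image energy."""
    num = convolve2d(image, kernel, mode='same', boundary='symm')
    energy = convolve2d(image**2, np.ones_like(kernel), mode='same', boundary='symm')
    return num / (np.sqrt(energy) + eps)   # some HMAX implementations also take the absolute value

# Example: four orientations at the smallest 7x7 scale on a gray-level ROI
img = np.random.rand(140, 100)             # stand-in for an object image, 140 px high
s1_maps = [s1_response(img, gabor_kernel(7, wavelength=4.0, sigma=2.8, theta=t))
           for t in np.arange(4) * np.pi / 4]
```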
Fig. 2. Gabor filter response (filter size 7 × 7 elements) on input image of a car (scaled to 140 pixels in height)
3.1 Complex Layer 1 (C1)
The C1 layer from Figure 1 is added to obtain invariance in local neighborhoods. This invariance will be created in both the spatial dimensions and in the dimension of scale. Considering the dimension of scale, two S1 feature maps in consecutive scales (132 elements in height for scale zero) are element-wise maximized. This generates one feature map for every two scales. The combination of several scales results in a band. Next, in order to obtain spatial invariance, the maximum is taken over a local spatial neighborhood around each pixel and the resulting image is sub-sampled. Because of the down-sampling, the number of C1 features is much lower than the number of S1 features. The resulting C1 feature maps for the input image (33 elements in height at band zero and 12 at band 7) of the car image in Figure 2 are shown in Figure 3.
Fig. 3. C1 feature maps for S1 responses from Figure 2 (at band 0). Note that the C1 maps are re-scaled for visualization.
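A minimal sketch of the C1 pooling just described: S1 maps from two consecutive scales are maximized element-wise (after cropping to a common size) and the result is max-pooled over a local spatial neighborhood and sub-sampled. The pooling size and sub-sampling step below are illustrative choices, not the exact grid of the original implementation.

```python
import numpy as np

def local_max_pool(fmap, pool=8, step=4):
    """Maximum over pool x pool neighborhoods, sampled every `step` elements."""
    rows = range(0, fmap.shape[0] - pool + 1, step)
    cols = range(0, fmap.shape[1] - pool + 1, step)
    return np.array([[fmap[r:r + pool, c:c + pool].max() for c in cols] for r in rows])

def c1_band(s1_scale_a, s1_scale_b, pool=8, step=4):
    """One C1 band: element-wise max over two consecutive S1 scales, then spatial pooling."""
    h = min(s1_scale_a.shape[0], s1_scale_b.shape[0])
    w = min(s1_scale_a.shape[1], s1_scale_b.shape[1])
    merged = np.maximum(s1_scale_a[:h, :w], s1_scale_b[:h, :w])
    return local_max_pool(merged, pool, step)

# Example: band 0 built from the two smallest S1 scales (one orientation)
c1 = c1_band(np.random.rand(132, 92), np.random.rand(130, 90))
```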
3.2 Simple Layer 2 (S2)
The next layer in the processing chain of the model applies template matching of image patches onto the C1 feature maps. This can be compared to the simple layer S1, where the filter response is generated for several Gabor filters. This template matching is done for several image patches (prototypes). These patch prototypes are extracted from natural images at a random band and spatial location, at the C1 level. Each prototype contains all four orientations and prototypes are extracted at four different sizes: 4 × 4, 8 × 8, 12 × 12 and 16 × 16 elements. Hence, a 4 × 4 patch contains 64 C1 elements. Serre [16] has shown that for a large number of prototypes, the patches can be extracted from random natural images, and do not specifically have to be extracted from the training set.
290
R. Wijnhoven and P.H.N. de With
Fig. 4. Patch response for two example patches. The eight images of decreasing size represent the S2 feature maps at each band. Note that the top prototype clearly results in higher responses in the medium bands, whereas the lower prototype gives a higher response in the lower bands. For simplicity, only patches of size 4 × 4 C1 elements are considered.
The response of a prototype patch P over the C1 feature map C of the input image I is defined by a radial basis function that normalizes the response to the patch-size considered, as proposed by Mutch and Lowe [20]. Examples of image patches (prototypes) are shown in Figure 4 for the car image from Figures 2 and 3. Note that we only show two patch prototypes, each of size 4 × 4 C1 elements.
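The S2 matching can be sketched as a sliding-window radial basis function over a C1 band; the distance is normalized by the number of prototype elements, in the spirit of Mutch and Lowe [20], and the sharpness parameter σ is an assumption.

```python
import numpy as np

def s2_response(c1_band, prototype, sigma=1.0):
    """Radial-basis response of one C1 patch prototype at every position of a C1 band.
    c1_band: (H, W, 4) array with four orientations; prototype: (n, n, 4) array."""
    n = prototype.shape[0]
    n_elem = prototype.size                          # e.g. 4*4*4 = 64 C1 elements
    out = np.zeros((c1_band.shape[0] - n + 1, c1_band.shape[1] - n + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = c1_band[r:r + n, c:c + n, :]
            d2 = np.sum((window - prototype) ** 2) / n_elem   # size-normalized distance
            out[r, c] = np.exp(-d2 / (2.0 * sigma ** 2))
    return out

# Example: a single 4x4 prototype matched against one C1 band
s2_map = s2_response(np.random.rand(33, 23, 4), np.random.rand(4, 4, 4))
```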
3.3 Complex Features Layer 2 (C2) and Feature Vector Classification
In this layer, for each prototype patch, the most relevant response is extracted and stored in the final feature vector. This is done by taking the maximum patch response over all bands and all spatial locations. Therefore, the final feature vector has a dimensionality equal to the number of prototype patches used. In our implementation, we used 1,000 prototype patches. Note that by considering a higher or lower number of C1 patch prototypes, the required computation power can be linearly scaled. In order to classify the resulting C2 feature vector, we use a one-vs-all SVM classifier with a linear kernel. The SVM with the highest output score defines the output class of the feature vector. The Torch3 library [21] was used for the implementation of the SVM. Note that a neural network could also have been used instead of the SVM for the feature vector classification.
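A compact sketch of the C2 stage and the classification step: each entry of the feature vector is the global maximum of one prototype's S2 responses over all bands and positions, and a linear one-vs-all SVM selects the class with the highest score. Scikit-learn's LinearSVC is used here as a stand-in for the Torch3 implementation mentioned above.

```python
import numpy as np
from sklearn.svm import LinearSVC

def c2_vector(s2_maps_per_prototype):
    """C2 feature vector: per prototype, the global max over all bands and positions.
    s2_maps_per_prototype[p] is a list with one S2 response map per band."""
    return np.array([max(band_map.max() for band_map in band_maps)
                     for band_maps in s2_maps_per_prototype])

# Tiny random stand-in: 200 training images, 1,000-prototype C2 vectors, 4 classes
rng = np.random.default_rng(0)
features = rng.random((200, 1000))
labels = rng.integers(0, 4, size=200)          # e.g. car, bus, person, bike

clf = LinearSVC(C=1.0)                         # linear kernel, one-vs-rest by default
clf.fit(features, labels)
print(clf.predict(rng.random((5, 1000))))      # predicted class indices
```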
4 Dataset and Experimental Setup
The algorithm model of the previous section was implemented as follows. The S1 layer filters the input image with Gabor filters at several scales, followed by the C1 layer to obtain invariance in both scale and space. In the S2 layer, the C1 feature maps are template matched with a high number of prototype
patches. The final C2 layer obtains invariance by taking the global maximum over both scale and space for each prototype patch. For each prototype patch, this maximum value is stored in the final feature vector, which is classified using the support vector machine. The use of a relevant dataset is very important for objective comparison of the proposed algorithms. Ponce et al. [22] discuss the datasets commonly used for generic object detection/recognition. However, these generic datasets are not specific for the typical surveillance case. Most available surveillance datasets have been created for the purpose of object tracking, and therefore contain a strictly limited number of different objects. For the purpose of object classification, a high number of different objects is required. Ma and Grimson [15] presented a limited dataset for separating various car types. Since future smart cameras should be able to make a distinction between more object classes, we have created a new dataset. A one-hour video recording was made from a single, static camera monitoring a traffic crossing. The camera image was captured at CIF resolution (352x288 pixels), resulting in object ROIs of 10-100 pixels in height for a person in the distance and a nearby bus, respectively. After applying the tracking algorithm proposed by the authors of [1], the resulting object images were manually adjusted where required, to ensure clean ROI extraction and to avoid any possible negative interference with the new algorithm. For this reason, redundant images, images of occluded objects and images containing false detections have been removed. Because of the limited time-span of the recording, the scene conditions do not change significantly. The final dataset contains 9,233 images of objects. The total object set has been split into the following 13 classes: trailers, cars, city buses, Phileas buses (name of a specific type of bus), small buses, trucks, small trucks, persons, cleaning cars, bicycles, jeeps, combos and scooters. Some examples of each object class are shown in Figure 5. The experiments were conducted on a Pentium-IV PC running at 2 GHz. The average processing time of an object image is about 4 to 5 seconds.
Fig. 5. Surveillance dataset Wijnhoven 2006
5 Results
This section shows the results for the object classification on the surveillance dataset presented in Section 4. Each image is first converted to grayscale and
scaled to 140 pixels in height while maintaining the aspect ratio. The total set of images for each class is divided into a training and a test set at random. For the training set, the number is specified (e.g. 30 samples) and the remainder of the images is used for the test set. Next, the feature vectors for all images are calculated using the methods discussed in Section 3. The SVM classifier is trained with the feature vectors of the images in the training set and tested with the test set. We present the detection rate, i.e., the percentage of images correctly classified. The final detection rate is calculated by averaging the results over ten iterations. The average correct detection rate in the case of 30 training samples per class is 87.7%. The main misdetections are bicycles and scooters (13%), and combos and small buses (13%). For some simple applications, the classification between four object classes is already significant. A camera that can make a distinction between cars, buses, persons and bikes with high accuracy adds functionality to the camera that only comprises object detection and tracking. Therefore, the total dataset of 9,233 object images has been redivided into a new dataset, containing only the mentioned four object classes. Applying the same tests as mentioned before results in an increase in detection rate. Furthermore, because there are fewer classes with a low number of object images, the number of learning samples can be increased. Table 1 shows that the detection rate of such a four-class system increases to 94.6% for 30 samples and up to 97.6% when 100 samples are learned. Furthermore, we have compared our system with the system of Ma and Grimson [15]. As can be seen in Table 2, our system outperforms the proposed SIFT-based system for the car-van problem, in contrast to the sedan-taxi problem. Whereas our proposed algorithm has been designed to limit the influence of small changes within an object class, the SIFT-based algorithm focuses on describing more specific details of the test objects. This explains the differences in performance.
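The evaluation protocol described above — a random split with a fixed number of training samples per class, the remainder used for testing, averaged over ten iterations — can be sketched as follows; the C2 feature matrix and labels are assumed to be precomputed.

```python
import numpy as np
from sklearn.svm import LinearSVC

def average_detection_rate(features, labels, n_train=30, n_iter=10, seed=0):
    """Mean fraction of correctly classified test images over random splits."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_iter):
        train_idx, test_idx = [], []
        for cls in np.unique(labels):
            idx = rng.permutation(np.flatnonzero(labels == cls))
            train_idx.extend(idx[:n_train])        # n_train samples per class
            test_idx.extend(idx[n_train:])         # remainder is the test set
        clf = LinearSVC().fit(features[train_idx], labels[train_idx])
        rates.append(np.mean(clf.predict(features[test_idx]) == labels[test_idx]))
    return float(np.mean(rates))
```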
Table 1. Detection rates for the four-class classification problem

Training samples      1        5        10       20       50       100
Car                 62.7%    86.8%    87.3%    93.1%    96.7%    97.3%
Bus                 38.8%    73.1%    91.9%    94.6%    96.4%    99.4%
Person              64.9%    91.8%    93.5%    95.2%    97.1%    98.2%
Bike                66.6%    84.0%    89.4%    92.3%    93.4%    95.6%
Average             58.3%    83.9%    90.5%    93.8%    95.9%    97.6%
Table 2. Detection rates for the traffic dataset from Ma and Grimson [15]

              Ma and Grimson    Our method    Difference
Car-van           98.5%           99.25%        +0.75%
Sedan-taxi        95.76%          95.25%        -0.49%
5.1 Wire-Frame Models vs. Feature-Based Object Modeling
In discussing the differences between the wire-frame approach and the patch-based techniques, we focus specifically on the trade-off between computational requirements and performance, which is very important for implementation in an embedded system. Scale invariance is reached in the wire-frame approach by calibration of the camera. This results in correct projections of the 3D models onto the 2D camera image. With this a-priori knowledge, we scale the models to the correct size, so they are relevant for the image pixel-position they are considered at. The requirement of the calibration makes the wire-frame approach inherently sensitive to the object size. In contrast to this, the patch-based algorithm implements scale-invariance by filtering with a set of Gabor filters of different sizes. By taking a global maximum in both scale and space in the C2 feature generation step, the algorithm is not influenced by the actual object size. It should be noted that the variation factor of object sizes in typical camera settings is quite limited. If the variations are large, scale-invariance can be reached by up- or down-sampling of the original image pixels. Scalability in required computation power in the patch-based approach is reached by changing the number of C1 patch prototypes used in the template matching process, which is the most expensive part of the system. Furthermore, the parameters for the Gabor filters in S1 can be changed (e.g. number of orientations and scales considered). This filtering can be implemented in a fully parallel way. The generation of the feature vector is independent of the number of object classes considered, whereas in the case of wire-frame models, each model of the total set of 3D models needs to be matched. A second aspect is that the template matching cost grows quadratically with the image resolution. Changing the input resolution of the object images directly results in a change of the required computation power. In the case of wire-frame models, the complexity of the calculation of the orientation using the gradient orientation histogram has a quadratic dependence on the image resolution, just as the calculation of the matching error. The level of camera calibration required for VCA systems is important for the installer of a security system. Requesting a large number of parameters is impractical and therefore, a semi-automatic approach is preferred. In the case of wire-frame models, the installer only needs to calibrate the extrinsic camera parameters, since the intrinsic parameters are defined by the camera. The database of 3D models does not depend on the camera calibration. In the patch-based approach, however, for optimal performance, the classification system needs to be trained with training examples coming from the actual setting of the camera. There is some robustness to small changes in the camera setting.
6 Conclusions and Future Work
We have presented a scalable patch-based algorithm, suited for parallel implementation in an embedded environment. The algorithm has been tested on a new dataset extracted from a typical traffic crossing. When the total set of object images is divided into 13 classes and 30 samples per class are used for training, a correct classification rate of 87.7% has been obtained. This performance increases to 94.6% when the set is split into only four classes and reaches 97.6% with 100 training samples. Furthermore, we have shown comparable performance with the SIFT-based algorithm by Ma and Grimson [15] using their dataset. The previously mentioned performance can be further improved by exploiting application-specific information. Object-tracking algorithms provide useful information that can be taken into account in the classification step. Viola and Jones [23] show a performance gain by using the information from two consecutive frames. Another potential improvement can be made as follows. Extracting a subset of relevant features (C1 patch prototypes in our case) that are specific to our application can give a performance gain as shown by Wu and Nevatia [24]. For future research, it is interesting to know how much sensor information is required to obtain a decent classification system. One of the first experiments would be to measure the influence of the input image resolution on the classification performance.
Fig. 6. Generic object modeling architecture, containing multiple detectors
A generic object modeling architecture can consist of several detectors that include pixel-processing elements and classification systems. We propose a generic architecture as visualized in Figure 6, where detectors can exchange both features extracted at the pixel level and classification results. For the purpose of person detection, Mohan et al. [9] propose multiple independent component
detectors. The classifier output of each component is used in a final classification stage. In contrast to this fully parallel implementation, Zuo [25] proposes a cascaded structure with three different detectors to limit the computational cost in a face-detection system. Recently, the authors have considered a 3D wire-frame modeling approach [7] that is completely application-specific. This means that for each typical new application, 3D models have to be manually generated. Furthermore, the addition of a new object class requires a new model that differs from the other models and implies the design of a new detector. In contrast, the patch-based approach is more general: it generates one feature vector for every object image and the SVM classifier is trained to make a distinction between the application-specific object classes. In our view, when aiming at a generic object modeling architecture, we envision a convergence between application-specific techniques and application-independent algorithms, thereby leading to a mixture of both types of approaches. The architecture as shown in Figure 6 should be interpreted in this way. For example, in one detector the pixel processing may be generic whereas in the neighboring detector the pixel processing could be application-specific. The more generic detectors may be re-used for different purposes in several applications.
References 1. Muller-Schneiders, S., Jager, T., Loos, H., Niem, W.: Performance evaluation of a real time video surveillance system. In: Proc. of 2nd Joint IEEE Int. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), pp. 137–144. IEEE Computer Society Press, Los Alamitos (2005) 2. Kollnig, H., Nagel, H.: 3d pose estimation by directly matching polyhedral models to gray value gradients. Int. Journal of Computer Vision (IJCV) 23(3), 283–302 (1997) 3. Lou, J., Tan, T., Hu, W., Yang, H., Maybank, S.: 3-d model-based vehicle tracking. IEEE Transactions on Image Processing 14(10), 1561–1569 (2005) 4. Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T.: Robust object recognition with cortex-like mechanisms. Transactions on Pattern Analysis and Machine Intelligence (PAMI) 29(3), 411–426 (2007) 5. Bose, B., Grimson, W.E.L.: Improving object classification in far-field video. In: Proc. of IEEE Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, vol. 2, pp. 181–188. IEEE Computer Society Press, Los Alamitos (2004) 6. Haritaoglu, I., Harwood, D., Davis, L.: W4: real-time surveillance of people and their activities. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 22, pp. 809–830. IEEE Computer Society Press, Los Alamitos (2000) 7. Wijnhoven, R., de With, P.: 3d wire-frame object-modeling experiments for video surveillance. In: Proc. of 27th Symposium on Information Theory in the Benelux, pp. 101–108 (2006) 8. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proc. of the 2001 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR)., vol. 1, pp. 511–518. IEEE, Los Alamitos (2001) 9. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 23(4), 349–361 (2001)
10. Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., Poggio, T.: Pedestrian detection using wavelet templates. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), San Juan, Puerto Rico, pp. 193–199. IEEE Computer Society Press, Los Alamitos (1997) 11. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 27(10), 1615– 1630 (2005) 12. Dalai, N., Triggs, B.: Histogram of oriented gradients for human detection. In: Proc. of the IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 886–893. IEEE Computer Society Press, Los Alamitos (2005) 13. Ke, Y., Sukthankar, R.: Pca-sift: A more distinctive representation for local image descriptors. In: Proc. of IEEE Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 506–513. IEEE, Los Alamitos (2004) 14. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 69–81. Springer, Heidelberg (2004) 15. Ma, X., Grimson, W.: Edge-based rich representation for vehicle classification. In: Proc. of IEEE Int. Conf. on Computer Vision (ICCV), vol. 2, pp. 1185–1192. IEEE Computer Society Press, Los Alamitos (2005) 16. Serre, T.: Learning a Dictionary of Shape-Components in Visual Cortex: Comparison with Neurons, Humans and Machines. PhD thesis, Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory (April 2006) 17. Ullman, S., Vidal-Naquet, M., Sali, E.: Visual features of intermediate complexity and their use in classification. Nature Neuroscience 5, 682–687 (2002) 18. Riesenhuber, M., Poggio, T.: Models of object recognition. Nature Neuroscience 3, 1199–1204 (2000) 19. Serre, T., Wolf, L., Poggio, T.: Object recognition with features inspired by visual cortex. In: Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 994– 1000 (2005) 20. Mutch, J., Lowe, D.: Multiclass object recognition with sparse, localized features. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 11–18. IEEE Computer Society Press, Los Alamitos (2006) 21. Collobert, R., Bengio, S., Mariethoz, J.: Torch: a modular machine learning software library. Technical report, Dalle Molle Institute for Perceptual Artificial Intelligence, PO Box 592, Martigny, Valais, Switzerland (October 2002) 22. Ponce, J., Berg, T., Everingham, M., Forsyth, D., Hebert, M., Lazebnik, S., Marszalek, M., Schmid, C., Russell, B., Torralba, A., Williams, C., Zhang, J., Zisserman, A.: Dataset issues in object recognition. In: Ponce, J., Hebert, M., Schmid, C., Zisserman, A. (eds.) Toward Category-Level Object Recognition. LNCS, vol. 4170, Springer, Heidelberg (2006) 23. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appearance. In: Proc. of the Ninth IEEE Int. Conf. on Computer Vision (ICCV), vol. 2, pp. 734–741. IEEE Computer Society Press, Los Alamitos (2003) 24. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In: Proc. of the 10th IEEE Int. Conf. on Computer Vision (ICCV), vol. 1, pp. 90–97. IEEE Computer Society, Washington, DC, USA (2005) 25. Zuo, F.: Embedded face recognition using cascaded structures. PhD thesis, Technische Universiteit Eindhoven, The Netherlands (October 2006)
Neural Network Based Face Detection from Pre-scanned and Row-Column Decomposed Average Face Image Ziya Telatar, Murat H. Sazlı, and Irfan Muhammad Ankara University, Faculty of Engineering, Electronics Engineering Department 06100 Tandogan, Ankara, Turkey {telatar,sazli}@eng.ankara.edu.tr
Abstract. This paper introduces a methodology for detecting human faces with minimum constraints on the properties of the photograph and the appearance of faces. The proposed method uses an average face model to save the computation time required for the training process. The average face is decomposed into row and column sub-matrices and then presented to the neural network. To reduce the time required for scanning the images at places where the probability of a face is very low, a pre-scan algorithm is applied. The algorithm searches the image at different scales in order to detect faces of different sizes. Arbitration between multiple scales and heuristics improves the accuracy of the algorithm. Experimental results are presented in this paper to illustrate the performance of the algorithm, including the accuracy and speed of face detection.
1 Introduction

Face detection is a considerably difficult task because it involves locating the face with no prior knowledge about its location, scale, orientation (upright or rotated around three axes), or pose (e.g. frontal, profile) [1]. Facial expressions and lighting conditions also change the overall appearance of faces. Furthermore, the appearance of human faces in an image depends on the poses of the humans and the viewpoints of the acquisition devices. In the literature, researchers have proposed different techniques for face detection. Knowledge-based methods use rules derived from knowledge of the human face, e.g. that a face always contains eyes, a nose and a mouth, and that the face is symmetric around its centre. In this approach, a coarse-to-fine set of rules is applied for eliminating false detections [1], [2]. In feature-based approaches, facial features are searched for and classified in a given image. Here, it is assumed that every face has some features which are invariant, and if these features exist as a group, then it can be inferred that this group of points is a face in the image [3], [4], [5], [6]. Methods combining multiple features use skin color, size, shape and global features to model a face. The general approach is to find skin patches and then apply size and shape constraints to these patches for a fine search [7], [8]. Most of the methods given in the literature require a face model which is used in designing face detectors with a priori information about the pictures. The success of the face detector depends on how accurately the face model matches a real face. For this
purpose, learning-model-based face detection algorithms are preferred in this type of application. A neural network is trained to recognize spatial face patterns and then used to find faces in other pictures. The success of these systems depends on the structure of the network and the training process. Frontal and rotated face detection [9], Multi-Layer Perceptrons (MLP) and the Fast Fourier Transform (FFT) [10], Time-Delay Neural Networks (TDNN) [11], Principal Component Analysis (PCA) [12] with eigenvalues for only one frontal face, averaging feature maps [13], a combination of Eigenface and Support Vector Machine (SVM) based multiple-view face detection [8], [14], gradient feature extraction from a polynomial neural network for classification-based detection [15], automatic scalable face model design [16] from adaptive face segmentation and motion of head and facial features to detect the faces in an image [17], [18], and Radial Basis Function (RBF) based hybrid learning algorithms [19] are some prominent examples of neural network based face detection algorithms. In this paper, a combined face detection algorithm is presented. There are three major research aspects in our work. The first one addresses the issue of how the main algorithm detects faces from an input image. In this part, an average face is obtained from the database. The second distinct feature of the algorithm is to divide the average face into row and column sub-images and then apply the algorithm to train the neural network for each specific face region [9], [12]. The third distinct research aspect is the implementation of the pre-scan algorithm, which is applied to images before the detection of faces. The pre-scan algorithm not only reduces the scanning time but also distinguishes non-face areas from the face areas. The rest of the paper is organized as follows: the face model and the subdivision of the algorithm are introduced in Section 2. Section 3 describes the general procedure of the face detection algorithm. Some results from the experimental studies are given in Section 4. Finally, some concluding remarks are presented in Section 5.
2 Methodology

2.1 Neural Network Used in Face Detection Algorithm

In this study, a fully connected multilayer feed-forward Neural Network (NN), which contains a single hidden layer, was trained using the back-propagation algorithm and used as a part of the face detection algorithm. The sigmoid function is used as the activation function of the neurons in the hidden layer and the output layer. Due to space constraints, a detailed description of the NN, including drawings, is not presented here. The interested reader is referred to the literature [20] for a comprehensive treatment of the subject.

2.2 Face Model

Human facial features vary from person to person. Some important facial features are: the perimeter of the face, skin color, and the dimensions and shapes of the nose, mouth and lips. These facial organs also differ in width and height, except for the eye distance. Therefore, a huge number of face samples is required to define the facial features and vectors. Pictures examined for the face model were collected from the databases found on the web sites of universities [21], [22] doing similar research and from our local image
processing group database. To establish a face model, we examined all the images in the databases and found, from measurements and computations, that the eye distance is nearly constant, especially for fully grown humans. The other facial feature measurements were examined relative to the eye distance, as given in Fig. 1. The eye distance was measured from pictures of 400 persons, after scaling the pictures to the same size. For each picture in the local database, the center points of the left and the right eye were marked manually. The eye distance was normalized to 60 pixels, and then, using this normalized value of the eye distance, a resized image of 80x100 pixels was extracted from the original picture. From these measurements, the face model was established as shown in Fig. 1.
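A sketch of this normalization step, assuming the two eye centers have already been marked: the picture is scaled so that the eye distance becomes 60 pixels and an 80x100-pixel face region is cropped around the eyes. The in-plane rotation that makes the eyes horizontal is omitted here, and the exact placement of the eyes inside the 80x100 window is an assumption.

```python
import cv2
import numpy as np

def normalize_face(gray, left_eye, right_eye, eye_dist=60, out_w=80, out_h=100):
    """Scale the picture so the eye distance equals `eye_dist`, then crop out_w x out_h."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    scale = eye_dist / float(np.hypot(rx - lx, ry - ly))
    resized = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
    cx = int(0.5 * (lx + rx) * scale)      # eye midpoint in the resized picture
    cy = int(0.5 * (ly + ry) * scale)
    x0 = max(cx - out_w // 2, 0)           # assumed crop: eyes horizontally centered,
    y0 = max(cy - out_h // 3, 0)           # roughly one third from the top of the crop
    return resized[y0:y0 + out_h, x0:x0 + out_w]
```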
Fig. 1. Face model
2.3 Average Face and Smallest Face Dimension

In the traditional procedure for training the NN, a large number of input examples is required. Initially, random values are assigned to the weights of the NN. By presenting all the face images from the database, the weights of the NN are updated using the back-propagation algorithm. At the end of the training process, an optimum weight matrix is obtained. Here, it takes quite a long time to train the NN due to the huge training data set. Utilization of the average face, rather than using all the faces from the database, has some advantages: 1) training time, complexity and computational cost are significantly decreased; 2) results obtained by using the average face and by the traditional procedure are almost the same. The underlying idea is that the average face contains the average properties of all the faces. Our experiments have shown that the average face achieves performance comparable or even superior to traditional neural network training [17], [18], [23]. The average face used in this work is computed from the face images of the databases. For that purpose, pictures in which the eyes are not horizontally aligned are rotated in such a way that the eyes become horizontally aligned, and then the procedure for constant eye distance is repeated. After obtaining pictures having constant eye distance, these pictures are rotated at angles of ±5, ±10 and ±15 degrees to obtain new pictures. Also, these pictures are resized to the smallest dimension of 20x20 pixels. The rotated and resized faces are added and normalized to obtain 7 average faces, one at each angle, as shown in Fig. 2a. These average faces were then again added and normalized to
obtain the average face of the average faces, as shown in Fig. 2b. The final average face is obtained as follows:
f_{av}(i, j) = \frac{1}{M} \sum_{k=1}^{M} f_k(i, j)    (1)
where M is the number of average faces for each direction. The average faces, f_k(i, j), are computed as in Eq. (1) for each direction.
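Eq. (1) amounts to a pixel-wise mean. A minimal NumPy sketch, assuming the aligned and resized 20x20 faces for one rotation angle are stacked in an array (the rescaling to the [0, 1] range is an assumption about what "normalized" means here):

```python
import numpy as np

def average_face(faces):
    """Pixel-wise average of aligned faces (Eq. 1); `faces` has shape (M, 20, 20)."""
    f_av = faces.mean(axis=0)
    return f_av / f_av.max()

# Seven per-angle averages (0, ±5, ±10, ±15 degrees), then the average of the averages
per_angle = [average_face(np.random.rand(50, 20, 20)) for _ in range(7)]
final_average_face = average_face(np.stack(per_angle))
```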
Fig. 2. 20x20 pixels (a) average faces at different angles, (b) average face of average faces
In addition to the average faces, a matrix with random element values and 20x20-pixel sub-images of pictures containing non-face or partial-face content are also used in training the NN, in order to differentiate face and non-face patterns in a given picture. For detecting a face in an image, it is necessary to put a limit on the smallest face that can be detected. A lower limit of 20x20 pixels has been examined in this study. It was observed from our experiments that faces whose dimensions are less than 20x20 pixels cannot be distinguished by the human visual system.

2.4 Row-Column Decomposition of the Image
In traditional face detection, changing or uneven illumination, or local dynamic intensity variations over an image, can cause some degradation. In such cases, localizing a specific region of interest in an image enables the NN to extract some details in that region. Considering this fact, the average face is first normalized and histogram equalized, and then divided into 4 row matrices of 5x20 pixels and 4 column matrices of 20x5 pixels (Fig. 3). The purpose of dividing the image into sub-matrices is to define each specific region of the face image to the NN, in order to improve its recognition capability. The row and column sub-matrices are applied to their respective NNs and each NN produces an output between 0 and +1. The outputs of the individual NNs are added together to get the final output. For an ideal face pattern the ideal output is +8; for real faces, this output has to be close to +8. The same steps are repeated for randomly generated images to recognize non-face areas more accurately; the NN produces a negative output for non-face areas. The mathematical representation of the separation of face and non-face areas is as follows:

RN = \sum_j f\!\left( w_{r2j}\, f\!\left( \sum_i w_{r1i}\, r_i - b_{r1} \right) - b_{r2} \right) + \sum_j f\!\left( w_{c2j}\, f\!\left( \sum_i w_{c1i}\, c_i - b_{c1} \right) - b_{c2} \right)    (2)
where r_i, c_i, the w's, f, and the b's are the row matrix, the column matrix, the weights of the first and second layers, a non-linear function, and the bias values, respectively. The weight
matrices used in Eq. (2) are obtained after training the NN. The NN is trained in order to minimize the error between the desired output and the output produced for the given input. The mean squared error is calculated as follows:

e_1 = \sum_i (d_i - o_i)^2 \le \tau_e    (3)

Here, d_i is the desired output, o_i is the output produced for the given input, and \tau_e is the threshold for acceptable error; better results are obtained for smaller error. The NN has also been trained with ±45 degree sideways rotated faces, using the same training algorithm.

2.5 Pre-scan Algorithm
Face detection algorithms generally detect faces by scanning the input image with a window, dividing it into sub-blocks. The scanning time depends on the window block size and is directly proportional to the image and window dimensions. While scanning the image, most of the time is spent scanning non-face areas, especially the background, which is generally simple and relatively smooth (e.g. walls, curtains, the sky in a landscape, etc.). The pixel values in each block of a background area have small variance, with little deviation from the mean of the block. In contrast, the blocks in face areas have large variance. Using the mean and variance information, the input image can be classified into two classes: one class contains the face or face-like areas and the other class contains the non-face areas. Mean and standard deviation values for face and non-face areas are calculated and threshold values are determined. These values are then used to find face areas as
Fig. 3. Face detector's details
Fig. 4. Results obtained by using pre-scan algorithm
( x \le UTh_m ) \;\&\; ( x \ge LTh_m ) \;\&\; ( \delta \le UTh_{sd} ) \;\&\; ( \delta \ge LTh_{sd} )    (4)
where UTh_m, LTh_m, UTh_sd and LTh_sd are the upper and lower threshold values of the two-dimensional mean and standard deviation, respectively. Fig. 4 shows some examples of applying the pre-scanning algorithm, in which white areas represent the face areas and black areas represent the non-face areas.
To pre-scan an image, a scanning window size is selected. The maximum window size in both the horizontal and vertical directions has been selected as 20 pixels. It is observed that further reduction in window size does not produce better results but increases the scanning time. The pre-scan algorithm has been tested together with the main algorithm, and a reduction in false face detections has been observed owing to the elimination of non-face areas. The other advantage of the pre-scan algorithm is that it significantly reduces the overall scanning time by extracting only face-related areas.
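A minimal sketch of the pre-scan step: the image is tiled into windows of at most 20x20 pixels, and a window is marked as a possible face area only if its mean and standard deviation lie between the lower and upper thresholds of Eq. (4). The threshold values below are placeholders; in the paper they are derived from the statistics of face and non-face blocks.

```python
import numpy as np

def prescan_mask(gray, win=20, mean_lo=40, mean_hi=220, std_lo=10, std_hi=80):
    """Binary map of candidate face areas from block mean/std thresholds (Eq. 4)."""
    h, w = gray.shape
    mask = np.zeros((h, w), dtype=bool)
    for r in range(0, h, win):
        for c in range(0, w, win):
            block = gray[r:r + win, c:c + win].astype(float)
            if mean_lo <= block.mean() <= mean_hi and std_lo <= block.std() <= std_hi:
                mask[r:r + win, c:c + win] = True    # white = possible face area
    return mask
```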
3 Face Detection Algorithm

The general flow diagram of the algorithm is given in Fig. 5. In the first stage, the algorithm pre-processes the given photograph and then applies a pattern recognizer to the whole photograph to detect faces. The photographs presented to the algorithm can be obtained in a variety of environments, such as a studio, open air, or non-uniform lighting conditions. Therefore, photographs may have some form of degradation. In addition, photographs can include people of different races having different skin colors. In situations like these, the pixel values in the face areas vary from person to person. Also, if a face is non-uniformly illuminated (e.g. some part of the face is in shade), the pixel values can differ from the pixels of similar face regions. To overcome these problems, the sub-images from the scanning window are normalized and histogram equalized in the pre-processing block shown in Fig. 5. For normalization, the sub-images are divided by the maximum value in the sub-image, thus giving values between 0 and 1, calculated as follows:

\hat{x} = \frac{x}{\max(x)}    (5)
The pre-processing step applied to the photograph not only locally counteracts the intensity-level variations affecting the image, but also yields a standard face template or pattern for all possible inputs. Scanning is done with a constant window size of 20x20 pixels. As the given photograph may not be of standard size, a lower limit on the smallest dimension has to be set. This lower limit is needed for calculating the number of sub-images in the photograph pyramid shown in Fig. 6. This limit was set to be 3 times the dimension of the scanning window:

E_p = 3 \times \min(R_s, C_s)    (6)
where E_p is the smallest dimension, and R_s and C_s are the number of rows and columns of the face pattern, respectively. The calculation of the smallest face is done by the "calculations of smallest dimension" block in Fig. 5. The pre-scanning block estimates the possible face areas in the image and eliminates the non-face areas. After pre-scanning, the image is applied to the NN and possible faces are determined; the image size is then decreased by 10% and the same steps, starting from the pre-scanning block, are repeated for the new dimension.
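The per-window pre-processing (histogram equalization and the normalization of Eq. (5)) can be sketched with OpenCV and NumPy; the ordering of the two operations here is a convenient assumption, since equalization requires an 8-bit image.

```python
import cv2
import numpy as np

def preprocess_window(window_u8):
    """Histogram-equalize a gray 20x20 window (8-bit) and normalize it to [0, 1], Eq. (5)."""
    eq = cv2.equalizeHist(window_u8)        # single-channel uint8 input
    x = eq.astype(np.float32)
    return x / max(float(x.max()), 1.0)     # x_hat = x / max(x)

# Example: one 20x20 scanning window
window = (np.random.rand(20, 20) * 255).astype(np.uint8)
nn_input = preprocess_window(window)
```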
Fig. 5. Flow diagram of the algorithm
Fig. 6. Graphical representation of the algorithm
The picture pyramid is formed by resizing the given image to a smaller dimension by a 10% reduction at each iteration step, as shown in Fig. 5 and Fig. 6. The number of sub-images in the pyramid is calculated as

t = (1.1 - E_p / M_s) \times 10,    (7)

M_s = \min(R, C),    (8)

where t is the number of sub-images, and R and C are the number of rows and columns of the given photograph. The dimension of each sub-image is calculated as

b_1 = 1.1 - i \times 0.1,\quad i = 1, 2, 3, \dots, t,    (9)
where b_1 is the normalized value of the dimension and lies between 0 and 1. Despite the above steps, the algorithm may identify non-face areas as face areas. To reduce this error, the symmetry of the detected face locations is checked. Face locations which are not symmetric are dropped and identified as non-face areas. The symmetry information is obtained from the defined face features (half of the eye distance, etc.). Face locations obtained at different dimensions are arbitrated to get the exact face location in the given image. For that purpose, the neighborhood of each face at different dimensions is compared and those which are below the threshold value are identified as faces. This neighborhood is calculated as
\Delta d_n = \left[\, |X_b - X_{b-1}| + |Y_b - Y_{b-1}| \,\right]_n \le \tau_d    (10)
where \Delta d_n is the neighborhood value, X and Y are the face areas, \tau_d is the threshold, b is the index of sub-images and n is the index of possible face areas. For the thresholded faces, a cost value is calculated as

C_n = \sum_i \Delta d_{ni} \times RN_{ni}    (11)
where C_n is the cost value, RN is the thresholded image from the face detector, and i is the number of sub-images for the nth face found. The highest cost value is selected to be the face location. The values in the RN images change from one image to another, which requires a different threshold value for each image. This is critical in detecting the possible face locations. This threshold value is calculated using the mean and variance of the RN image and is given by Eq. (12):
\tau_l = (\mu + \sigma)\,\upsilon    (12)
where \tau_l is the threshold value, \mu and \sigma are the mean and variance of the RN sub-images and \upsilon is a scaling constant. The success ratio of the face detection algorithm is calculated for all the photographs in the test set of the database as a percentage of the detected correct faces to the total faces in a photograph:

\text{overall success ratio} = \sum_i \frac{r_i}{t_i}    (13)
Here, r_i is the number of correct faces detected, and t_i is the total number of faces in a photograph. Besides detecting correct faces, the algorithm occasionally detects non-face areas as face areas. This error was calculated by taking the percentage of the number of falsely detected faces to the total number of faces in the photograph:

\text{error ratio}_i = \frac{h_i}{t_i}    (14)
Here, h_i is the number of falsely detected faces.
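The multi-scale scan of Section 3 can be summarized in the sketch below: the number of pyramid levels and their scale factors follow Eqs. (6)–(9), each level is pre-scanned and then scanned with the 20x20 window, and surviving windows are passed to a classifier. The classify_window argument stands in for the row-column NN of Section 2.4, and the scan step and acceptance threshold are placeholders (the paper uses the per-image threshold of Eq. (12)).

```python
import cv2
import numpy as np

WIN = 20                                         # scanning-window size in pixels

def pyramid_scales(rows, cols):
    """Scale factors of the picture pyramid, following Eqs. (6)-(9)."""
    ep = 3 * WIN                                 # smallest dimension, Eq. (6)
    t = int((1.1 - ep / min(rows, cols)) * 10)   # number of sub-images, Eqs. (7)-(8)
    return [1.1 - (i + 1) * 0.1 for i in range(max(t, 0))]   # b values of Eq. (9)

def scan_image(gray, classify_window, prescan_mask, step=4, accept=0.0):
    """Scan every pyramid level; keep windows that pass the pre-scan and the classifier."""
    detections = []
    for b in pyramid_scales(*gray.shape):
        level = cv2.resize(gray, None, fx=b, fy=b, interpolation=cv2.INTER_AREA)
        mask = prescan_mask(level)
        for r in range(0, level.shape[0] - WIN + 1, step):
            for c in range(0, level.shape[1] - WIN + 1, step):
                if not mask[r:r + WIN, c:c + WIN].any():
                    continue                     # skip non-face areas found by the pre-scan
                score = classify_window(level[r:r + WIN, c:c + WIN])
                if score > accept:               # placeholder acceptance threshold
                    detections.append((b, r, c, score))
    return detections
```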
4 Experimental Results

The proposed algorithm was tested using face images from three databases. Two of them were collected from universities doing similar research [21], [22]. These two databases contain 400 facial images of 40 individuals in different positions, with 10 face images per person. The total number of face images in these databases is 1400. The other one is our local database, which includes 524 face images in total. We gathered 64 frontal face images from 43 individuals and 76 images rotated sideways by ±45 and ±90 degrees from 19 individuals. The 64 frontal face images were rotated by ±15, ±10 and ±5 degrees. We set up two different sets from these databases. The first set includes only the face images used in the training process, and the second set combines the face images and the photographs including multiple faces, which are different from the training set. These sets cover several racial groups and various illumination conditions. Several non-face images were also used in the training process. The NN was trained with the average face, which was obtained by averaging all the faces in the database (Fig. 2a, 2b). For that purpose, the row-column decomposed average face was applied to the algorithm one sub-matrix at a time to obtain an output for each corresponding face region. A total output of 7.725 was obtained (the ideal output is +8). The same procedure was repeated for the rotated faces between 0 and ±15 degrees. The performance of the algorithm was first tested with the face images from which the average face was obtained. The pre-scan algorithm was applied to each image, before the NN, in order to obtain the possible face regions. The image with possible face regions was decomposed into row-column sub-matrices in order to apply each to its respective NN sub-matrix, as explained before. Then, the group images including multiple faces were applied to the same algorithm. For this aim, group images were applied to the pre-scan algorithm in order to obtain possible face regions, and the possible face regions were divided into 20x20-pixel sub-images. Each sub-image was decomposed into row-column sub-matrices before applying it to the respective NN. A threshold value was obtained after passing all the faces through the NN. Image regions producing values below this threshold were classified as non-face, while values above this threshold were classified as face. Table 1 presents some results of applying the algorithm to single frontal and sideways rotated face images. The algorithm detects 202 faces out of 210. The correct detection rate for frontal faces is 96.2% and the false detection rate is 7.1%. The performance of the algorithm drops to 60.4% for sideways rotated faces. The algorithm also identified only one non-face image as a face out of 100 non-face images
when applied to the non-face images. Faces rotated sideways by ±45 degrees were also applied to the algorithm after it was trained with ±45 degree rotated images, and success ratios of 73.6–78.9% were computed. These results are given in Table 2. The system was also tested with photographs not present in the training set. This set contains images having at least one human face with a plain or complex background. For complex-background images, the error ratio was greater than for images having a plain background. Some of the results are shown in Fig. 7 and Table 2. In Fig. 7, the algorithm has found all 10 faces, and two non-face areas have been identified as face areas, which indicates the error of the system. As seen, the success ratio is 93.75% for group images with complex background, and 96.2% for photographs including only one face with a simple background. Success ratios in similar works have been reported between 79.9% and 95.8%, as given in Table 3. As mentioned in Section 2.5, a pre-scan algorithm was also developed to reduce the scanning time and to eliminate non-face areas. Table 4 shows the scanning times of the algorithm with and without pre-scanning for the photographs shown in Fig. 9. Depending on the photograph content, the background and the number of faces, a reduction of 40–85% in scanning time was observed. This comparison is also shown graphically in Fig. 8. In addition to the decrease in scanning time, pre-scanning also contributes to the performance of the main algorithm by eliminating the non-face areas and reducing the false detections in images. This is shown in Fig. 9. The image in Fig. 9a shows the result obtained without the pre-scan, Fig. 9b shows the areas to be scanned by the pre-scan, and Fig. 9c shows the result with the pre-scan. The false detections in Fig. 9a are eliminated, as shown in Fig. 9c, after using the pre-scan algorithm.
Fig. 7. Results obtained with the developed algorithm
Fig. 8. Comparison of scanning time
Fig. 9. a) Result without pre-scan, b) Pre-scanned image, c) reduction of error with pre-scan
Table 1. Performance of the neural network for frontal and near-frontal faces
(total number of input face images: 210; non-face images: 100)

Rotation angle                                   -15°        -10°       -05°        0°         +05°        +10°        +15°      Non-face
Detected faces (success ratio %)              127 (60.4)   179 (85)   198 (94.2)  202 (96.2)  200 (95.2)  183 (87.1)  131 (62.4)     1
Not detected or falsely detected (ratio %)    49 (23.3)    36 (17)    20 (9.5)    15 (7.1)    19 (9)      28 (13.3)   44 (20.9)     99
Table 2. Performance of the algorithm for sideways rotated faces and group photographs

                              # of faces   Correct   False   Success ratio (%)
Single face, +45 degrees          19          14        5          73.6
Single face, -45 degrees          19          15        4          78.9
Group images                      32          30        6          93.7
Table 3. Comparison with other algorithms in the literature

Methods          Detected faces   Success ratio %   Min. face dim.
Proposed              210              96.2             20x20
In Ref. [5]          1930              89.3             50x50
In Ref. [8]           507              92.5               –
In Ref. [9]           149              79.9               –
In Ref. [10]           21              95.8               –
In Ref. [13]          130              94                 –
Table 4. Comparison of scanning time

Image   Dimension   Time without pre-scan (sec)   Time pre-scanned (sec)   Reduction
1       98x157               871                          121               86.1%
2       126x242             1530                          448               70.7%
3       125x83               449                          332               26.1%
4       138x173             1736                          954               44.8%
5       144x202             2080                         1186               42.9%
5 Conclusion

In this work, a view-based face detector was developed. The results obtained show that the performance of the detector is comparable, and even superior, to other methods. One of the tasks examined in this work is the utilization of average faces in the training process of the NN. In contrast to traditional methods, instead of using all the images in the database to train the NN, only the average face was used for that purpose. This significantly shortens the long training process. Secondly, decomposition of the input image into row-column sub-matrices facilitates the recognition of each facial region separately by its respective NN. Another important feature of
the algorithm is the use of a pre-scan algorithm. The pre-scan algorithm not only reduces the scanning time significantly, but also discards the relatively simple background so that the main algorithm does not scan these areas. Thus, a significant reduction in the error of the detection algorithm has been observed through the avoidance of false detections.
Model-Based Image Segmentation for Multi-view Human Gesture Analysis Chen Wu and Hamid Aghajan Wireless Sensor Networks Lab Department of Electrical Engineering Stanford University, Stanford CA, 94305 {chenwu,aghajan}@stanford.edu
Abstract. Multi-camera networks bring in potentials for a variety of vision-based applications through provisioning of rich visual information. In this paper a method of image segmentation for human gesture analysis in multi-camera networks is presented. Aiming to employ manifold sources of visual information provided by the network, an opportunistic fusion framework is described and incorporated in the proposed method for gesture analysis. A 3D human body model is employed as the converging point of spatiotemporal and feature fusion. It maintains both geometric parameters of the human posture and the adaptively learned appearance attributes, all of which are updated from the three dimensions of space, time and features of the opportunistic fusion. In sufficient confidence levels parameters of the 3D human body model are again used as feedback to aid subsequent vision analysis. The 3D human body model also serves as an intermediate level for gesture interpretation in different applications. The image segmentation method described in this paper is part of the gesture analysis problem. It aims to reduce raw visual data in a single camera to concise descriptions for more efficient communication between cameras. Color distribution registered in the model is used to initialize segmentation. Perceptually Organized Expectation Maximization (POEM) is then applied to refine color segments with observations from a single camera. Finally ellipse fitting is used to parameterize segments. Experimental results for segmentation are illustrated. Some examples for skeleton fitting based on the elliptical segments will also be shown to demonstrate motivation and capability of the model-based segmentation approach for multi-view human gesture analysis.
1 Introduction
The increasing interest in understanding human behaviors and events in a camera context has heightened the need for human gesture analysis of image sequences. In a multi-camera network, access to multiple sources of visual data often allows for making more comprehensive interpretations of events and gestures. It also creates a pervasive sensing environment for applications where it is impractical for the users to wear sensors. Example applications include surveillance, smart
Fig. 1. The layered and collaborative architecture of the gesture analysis system. Ii stands for images acquired by camera i; Fi is the feature set for Ii ; Ei is the gesture element set in camera i; and G is the set of possible gestures.
home care, gaming, etc. In this paper we propose to use an opportunistic fusion framework to employ manifold sources of information obtained from the camera network in a principled way, which spans three dimensions of space (different camera views), time (each camera collecting data over time), and feature levels (selecting and fusing different feature subsets). With the goal of understanding the scene, inference from each of the three dimensions and correlation between them provide tremendous insights for intelligent interpretations. At the same time, such information fusion methodology poses challenges in developing an efficient and generic strategy. For human gesture analysis in a multi-camera network, there are three main motivations for the opportunistic fusion approach. First, in-node processing needs to reduce local data, such that the resulting local description should be pithy enough to enable efficient collaboration through communication with other cameras. Even if some details are lost in local processing, adequate reasoning is still achievable through spatiotemporal fusion. Second, spatial collaboration between multi-view cameras naturally facilitates solving occlusions. It is especially advantageous for gesture analysis since human body is self-occlusive. And finally, temporal and feature fusion help to gain subject-specific knowledge, such as the current gesture and subject appearance. This knowledge is in turn used for a more actively directed vision analysis. Therefore, we develop a 3D human body model to achieve spatiotemporal and feature fusion. The 3D human body model embodies up-to-date information from both current and historical observations of all cameras in a concise way as we define it. Concise as it is, the model is capable enough to derive gestures we are interested in. It maintains both geometric parameters of the human posture and also adaptively learned appearance attributes, all of which are updated from the three dimensions of space, time, and features of the opportunistic fusion. As such, the 3D human model takes up two roles. One is as an intermediate step for
high-level application-pertinent gesture interpretations, the other is as source of feedback from spatiotemporal and feature fusions for low-level vision processing. The 3D model maps to the gesture element layer in the layered architecture for gesture analysis (Fig. 1) we proposed in [1]. However, here it not only assumes spatial collaboration between cameras, but also it connects decisions from history observations with current observations. Fitting human models to images or videos has been an interesting topic for which a variety of methods have been developed. One aspect of the problem relates to the choice of human model. One category falls to 3D representations of human models fit to a single camera’s view [2,3]. Due to the self-occlusive nature of human body, causing ambiguity from a single view, most of these methods rely on a restricted dynamic model of behaviors. But tracking can easily fail in case of sudden motions or other movements that differ much from the dynamic model. Usually assuming a dynamic model (such as walking) will greatly help us to predict and validate the posture estimates. However, we always need to be aware of the balance between the limited dynamics and the capability to discover more diversified postures. Yet a different approach has been explored, in which a 3D model is reconstructed from multi-view cameras [4,5]. Most methods start from silhouettes in different cameras, then points occupied by the subject can be estimated, and finally a 3D model with principle body parts is fit in the 3D space [6]. Some construct very detailed human body models [7]. The latter approach is relatively “clean” since the only image components it is based on are the silhouettes. But at the same time, the 3D voxel reconstruction is sensitive to the quality of the silhouettes and accuracy of camera calibrations. It is not difficult to find situations where background subtraction for silhouettes suffers for quality or is almost impossible (clustered or complex backgrounds, or when the subject is wearing clothes with similar colors to the background) . Another aspect of the human model fitting problem is the choice of image features. All human model fitting methods are based on some image features as targets to fit the model. Most of them are based on generic features such as silhouettes or edges [8,5]. Some use skin color but such methods are prone to failure in some situations since lighting usually has big influence in colors and skin color varies from person to person. In our work, we try to incorporate appearance attributes adaptively learned from the network for initialization of segmentation, because usually color or texture regions are easier to find than generic features such as edges. Another emphasis of our work is that images from a single camera are first reduced to short descriptions and then reconstruction of the 3D human model is based on descriptions collected from multiple cameras. Therefore, concise descriptions are the expected outputs from image segmentation. In this paper we first introduce the opportunistic fusion framework as well as an implementation of its concepts through human gesture analysis in Section 2. In Section 3, image segmentation in a single camera is described in detail. Color distribution maintained in the model is used to initialize segmentation. Perceptually Organized Expectation Maximization (POEM) is then applied to
refine color segments with observations from a single camera, followed by a watershed algorithm to assign segment labels to all pixels based on their spatial relationships. Finally, ellipse fitting is used to parameterize segments in order to create concise segment descriptions for communication. In Section 4, a method for 3D model fitting is briefly described and examples are shown to demonstrate the capability of the elliptical segments.
2 Opportunistic Fusion for Human Gesture Analysis
We introduce a generic opportunistic fusion approach in multi-camera networks in order to both employ the rich visual information provided by cameras and incorporate the learned knowledge of the subject into active vision analysis. The opportunistic fusion is composed of three dimensions of space, time, and feature levels. In the rest of the paper, the problem of human gesture analysis is elaborated on to show how those concepts can be implemented. 2.1
The 3D Human Body Model
We employ a 3D human skeleton model for the purpose of gesture analysis. A question that may be raised is whether we need to construct a human model for gesture analysis. Is it possible to infer gestures without implicitly reconstructing a model? There is existing work for hand gesture recognition [9,10], where only part of the body is analyzed. Some gestures can also be detected through spatiotemporal motion patterns of some body parts [11,12]. It is true that for a number of gestures we do not need a human body model to interpret the gestures. But as the set of gestures we would like to differentiate expands, it becomes increasingly difficult to devise methods for gesture recognition based on only a few cues. Therefore, the employment of a 3D human body model provides a unified interface based on which gesture interpretations can be made to specific applications. A graphic display of the 3D human skeleton model is shown as part of Fig. 2. It has the following components: – Geometric configuration: body part lengths and angles. – Color or texture of body parts. – Motion of body parts. Apart from providing flexibility in gesture interpretations, the 3D human model described in the previous paragraph also plays a few significant roles in the vision analysis process. First, the total size of parameters to reconstruct the model is very small compared to the raw images, thus facilitating affordable communication. For each camera, only segment descriptions are needed for collaboratively reconstructing the 3D model. Second, the model serves as a convergence point of spatiotemporal and feature fusion. All the parameters it maintains are updated from spatiotemporal fusion. In sufficient confidence levels parameters of the 3D human body model are used as feedback to aid subsequent vision analysis. Therefore, instead of being a passive output to represent decisions from spatiotemporal and feature fusion, the 3D model implicitly enables
Fig. 2. Opportunistic fusion for human gesture analysis
more interactions between the three dimensions by being actively involved in the current update of decisions. Third, although predefined appearance attributes are generally not reliable, adaptively learned appearance attributes collected in the model can be used to identify the person or body parts. More details of the 3D human body model are presented in Section 2.2. 2.2
The Opportunistic Fusion Framework Overview
The opportunistic fusion framework for gesture analysis is shown in Fig. 2. On the top of the figure are spatial fusion modules with progression in time. In parallel is the progression of the 3D human body model. Suppose at time t0 we have the model with the collection of parameters as M0 . At the next instance t1 , the current model M0 is input to the spatial fusion module for t1 , and the output decisions are used to update M0 from which we get the new 3D model M1 . Now we look into a specific spatial fusion module for the detailed process. In the bottom layer of the layered gesture analysis (bottom left of Fig. 2, an expanded view in Fig. 1), image features are extracted from local processing. No explicit collaboration between cameras is done in this stage since communication is not expected until images are reduced to short descriptions. If we take this spatial fusion module alone, only some generic image features (e.g. edges) are reliable. However, if we consider the current model M0 , some distinct features (e.g. colors) specific for the subject may be used for analysis, which may be much
Fig. 3. Algorithm flowchart for 3D human skeleton model reconstruction
easier than always looking for patterns of the generic features (arrow 1 in Fig. 2). The intuition here is that we adaptively learn which attributes distinguish the subject, save them as “marks” in the 3D model, and then use those “marks” to look for the subject. After local processing, data is shared between cameras to derive a new estimate of the model. Parameters in M0 specify a smaller space of possible M1’s. Then decisions from spatial fusion of cameras are used to update M0 to get the new model M1 (arrow 2 in Fig. 2). Therefore, for every update of the model M, it combines space (spatial collaboration between cameras in Fig. 1), time (the previous model M0), and feature levels (choice of image features in local processing from both new observations and subject-specific attributes in M0). Finally, the new model M1 is used for high-level gesture deductions in a certain scenario (arrow 2 in Fig. 2).
2.3 Algorithm Overview for 3D Human Body Model Reconstruction
An implementation for the 3D human body model reconstruction is presented in this paper, in which the process of image segmentation in a single camera will be described in detail. Elements in the opportunistic fusion framework described above are incorporated in this algorithm as illustrated in Fig. 3. Local processing in a single camera includes segmentation and ellipse fitting for a concise parameterization of segments. We assume the 3D model is initialized with a distinct color distribution for the subject. For each camera, the color distribution is first refined using the EM algorithm and then used for segmentation. Undetermined pixels from EM are assigned labels through watershed segmentation. For spatial collaboration, ellipses from all cameras are merged to find the geometric configuration of the 3D skeleton model. That is, if the optimal 3D skeleton model is projected onto image planes of the cameras, the projections should best match ellipses from all the cameras. Details and experiment results of the algorithm are presented in Section 3 and Section 4.
Some parts of the algorithm still need plenty of work to be part of a practical system. For example, two main difficulties are the initialization of the model and how to predict the span of the test space for the new model M1 based on M0 . These problems are within our current investigation.
3 Image Segmentation in a Single Camera
The goal of local processing in a single camera is to reduce raw images/videos to simple descriptions which can be efficiently transmitted between the cameras. In the proposed algorithm the outputs are ellipses fitted from segments and the mean color of the segments. As shown in the upper part of Fig. 3, local processing includes image segmentation for the subject and ellipse fitting to the segments. We assume a simple case in which the subject is characterized by a distinct color distribution. That is, segmentation is mostly based on color after a statistical background subtraction and thresholding the foreground is performed. Pixels with high or low illumination are also removed since for those pixels chrominance is not reliable. Then a rough segmentation for the foreground is done either based on K-means on chrominance of the foreground pixels or color distributions from the known model from previous time instances. In the initialization stage when the model has not been well established, or when we don’t have a high confidence in the model, we need to start from the image itself and use for example K-means to find color distribution of the subject. However, when a model with a reliable color distribution is available, we can directly assign pixels to different segments based on the existing color distribution. In practice, the color distribution maintained by the model may not be uniformly accurate for all cameras due to effects such as color map changes or illumination differences. Also the subject’s appearance may change in a single camera due to the movement or lighting conditions. Therefore, the color distribution of the model is only used for a rough segmentation in initialization of segmentation. Then an EM (expectation maximization) algorithm is used to refine the color distribution for the current image. Even if EM is used for refinement, the initial estimated color distribution provided by the model from prior time instances can play a very important because it can prevent EM from being trapped in local minima. Suppose the color distribution is a mixture of Gaussians with N modes, parameters Θ = {θ1 , θ2 , . . . , θN }, where θl = {μl , Σl } are the mean and covariance matrix of Gaussian modes. Mixing weights of different modes are A = {α1 , α2 , . . . , αN }. We need to find the probability of each pixel xi belonging to a certain mode θl : P r(yi = l|xi ). In a Gaussian distribution, the conditional probability of a pixel xi given a mode θl is: Pθl (xi ) = P r(xi |θl ) =
$$\frac{1}{(2\pi)^{d/2}\,|\Sigma_l|^{1/2}}\; \exp\!\left(-\tfrac{1}{2}(x_i-\mu_l)^T \Sigma_l^{-1} (x_i-\mu_l)\right) \qquad (1)$$
From standard EM for Gaussian Mixture Models (GMM) we have the E step as:

$$Pr^{(k+1)}(y_i = l \mid x_i) \propto \alpha_l^{(k)} P_{\theta_l^{(k)}}(x_i), \quad l = 1,\ldots,N, \qquad \text{normalized so that } \sum_{l=1}^{N} Pr^{(k+1)}(y_i = l \mid x_i) = 1 \qquad (2)$$

and the M step as:

$$\mu_l^{(k+1)} = \frac{\sum_{i=1}^{M} x_i \, Pr(y_i = l \mid x_i, \theta^{(k)})}{\sum_{i=1}^{M} Pr(y_i = l \mid x_i, \theta^{(k)})} \qquad (3)$$

$$\Sigma_l^{(k+1)} = \frac{\sum_{i=1}^{M} (x_i - \mu_l^{(k)})(x_i - \mu_l^{(k)})^T \, Pr(y_i = l \mid x_i, \theta^{(k)})}{\sum_{i=1}^{M} Pr(y_i = l \mid x_i, \theta^{(k)})} \qquad (4)$$

$$\alpha_l^{(k+1)} = \frac{1}{M} \sum_{x_i} Pr^{(k+1)}(y_i = l \mid x_i) \qquad (5)$$
where k is the number of iterations, and the M step is obtained by maximizing the log-likelihood $L(x;\Theta) = \sum_{i=1}^{M}\sum_{l=1}^{N} Pr(y_i = l \mid x_i)\,\log Pr(x_i \mid \theta_l)$. However, this basic EM algorithm takes each pixel independently, without considering the fact that pixels belonging to the same mode are usually spatially close to each other. In [13] Perceptually Organized EM (POEM) is introduced. In POEM, the influence of neighbors is incorporated by a weighting measure

$$w(x_i, x_j) = e^{-\frac{\|x_i - x_j\|}{\sigma_1^2} - \frac{\|s(x_i) - s(x_j)\|}{\sigma_2^2}} \qquad (6)$$
where $s(x_i)$ is the spatial coordinate of $x_i$. Then “votes” for $x_i$ from the neighborhood are given by

$$V_l(x_i) = \sum_{x_j} \alpha_l(x_j)\, w(x_i, x_j), \quad \text{where } \alpha_l(x_j) = Pr(y_j = l \mid x_j) \qquad (7)$$
Based on this voting scheme, the following modifications are made to the EM steps. In the E step, $\alpha_l^{(k)}$ is changed to $\alpha_l^{(k)}(x_i)$, which means that for every pixel $x_i$, the mixing weights for the different modes are different. This is partially due to the influence of neighbors. In the M step, mixing weights are updated by

$$\alpha_l^{(k)}(x_i) = \frac{e^{\eta V_l(x_i)}}{\sum_{k=1}^{N} e^{\eta V_k(x_i)}} \qquad (8)$$
in which η controls the “softness” of neighbors’ votes. If η is as small as 0, then mixing weights are always uniform. If η approaches infinity, the mixing weight for the mode with the largest vote will be 1. After refinement of the color distribution with POEM, we set pixels with high probability (e.g., bigger than 99.9%) to belong to a certain mode as markers for that mode. Then a watershed segmentation algorithm is implemented to assign labels for undecided pixels.
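To make the EM/POEM refinement above concrete, the following Python sketch iterates Eqs. (2)–(5) with the neighbourhood-vote mixing weights of Eqs. (6)–(8). It is only an illustrative implementation under simplifying assumptions (dense pairwise weights instead of a local neighbourhood, a small covariance regularizer); the function name and default parameters are not from the paper.

```python
import numpy as np

def poem_segment(colors, coords, means, covs, n_iter=10, eta=1.0, s1=10.0, s2=5.0):
    """Sketch of POEM-style refinement: a GMM over pixel colors whose per-pixel
    mixing weights are driven by neighbourhood votes (Eqs. 6-8).
    colors: (M, d) pixel feature vectors, coords: (M, 2) pixel positions,
    means/covs: initial per-mode parameters, e.g. taken from the 3D model."""
    M, d = colors.shape
    N = len(means)
    means = [np.asarray(m, float) for m in means]
    covs = [np.asarray(c, float) for c in covs]
    # Pairwise weights w(x_i, x_j) of Eq. (6): feature and spatial proximity
    feat = np.linalg.norm(colors[:, None, :] - colors[None, :, :], axis=2)
    spat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    w = np.exp(-feat / s1**2 - spat / s2**2)
    alpha = np.full((M, N), 1.0 / N)            # per-pixel mixing weights
    for _ in range(n_iter):
        # E step (Eq. 2): responsibilities proportional to alpha * Gaussian density
        resp = np.zeros((M, N))
        for l in range(N):
            diff = colors - means[l]
            inv = np.linalg.inv(covs[l])
            expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[l]))
            resp[:, l] = alpha[:, l] * np.exp(expo) / norm
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M step (Eqs. 3-4): update mode means and covariances
        for l in range(N):
            r = resp[:, l]
            means[l] = (r[:, None] * colors).sum(0) / (r.sum() + 1e-12)
            diff = colors - means[l]
            covs[l] = (r[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(0) \
                      / (r.sum() + 1e-12) + 1e-6 * np.eye(d)
        # Neighbour votes (Eq. 7) and softmax mixing weights (Eq. 8)
        votes = w @ resp
        alpha = np.exp(eta * votes)
        alpha /= alpha.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1)
```

Pixels with very confident responsibilities would then serve as watershed markers, as described above.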
Fig. 4. Ellipse fitting. (a) original image; (b) segments; (c) simple ellipse fitting to connected regions; (d) improved ellipse fitting.
Finally, in order to obtain a concise parameterization for each segment, an ellipse is fitted to it. Note that a segment refers to a spatially connected region of the same mode. Therefore, a single mode can have several segments. When the segment is generally convex and has a shape similar to an ellipse, the fitted ellipse well represents the segment. However, when the segment’s shape differs considerably from an ellipse, a direct fitting step may not be sufficient. To address such cases, we first test the similarity between the segment and an ellipse by fitting an ellipse to the segment and comparing their overlap. If similarity is low, the segment is split into two segments and this process is carried out recursively on every segment until they all meet the similarity criterion. In Fig. 4, if we use a direct ellipse fitting to every segment, we obtain Fig. 4(c). If we adopt the test-and-split procedure, correct ellipses are obtained as shown in Fig. 4(d). Some experimental results are shown in Fig. 5. The idea of elliptical descriptions is to find a simple parameterization of the subject. So it is not necessary to have the ellipses corresponding to body parts, although sometimes they do.
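The test-and-split procedure can be sketched as a small recursion. Since the paper does not specify the ellipse-fitting routine or the overlap measure, the sketch below uses a moments-based ellipse and the fraction of segment pixels covered by the fitted ellipse as a similarity proxy; all names and thresholds are illustrative.

```python
import numpy as np

def moments_ellipse(points):
    """Ellipse from first/second moments: centre, axes ~ 2*sqrt(eigenvalues),
    orientation given by the eigenvectors of the covariance."""
    c = points.mean(axis=0)
    cov = np.cov((points - c).T)
    vals, vecs = np.linalg.eigh(cov)
    return c, 2.0 * np.sqrt(np.maximum(vals, 1e-9)), vecs

def inside_ellipse(points, centre, axes, vecs):
    local = (points - centre) @ vecs
    return (local[:, 0] / axes[0]) ** 2 + (local[:, 1] / axes[1]) ** 2 <= 1.0

def fit_segment(points, min_overlap=0.8, min_size=20):
    """Test-and-split: if the moment ellipse covers the segment poorly, split the
    segment across its major axis and recurse.
    points: (M, 2) pixel coordinates of one connected segment (M >= min_size assumed)."""
    points = np.asarray(points, float)
    centre, axes, vecs = moments_ellipse(points)
    overlap = inside_ellipse(points, centre, axes, vecs).mean()
    if overlap >= min_overlap or len(points) < 2 * min_size:
        return [(centre, axes, vecs)]
    # split on the sign of the projection onto the major principal axis
    proj = (points - centre) @ vecs[:, np.argmax(axes)]
    left, right = points[proj < 0], points[proj >= 0]
    if len(left) < min_size or len(right) < min_size:
        return [(centre, axes, vecs)]
    return fit_segment(left, min_overlap, min_size) + fit_segment(right, min_overlap, min_size)
```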
4 3D Model Fitting
The lower part of Fig. 3 shows the 3D skeleton fitting process. Ellipses from local processing in single cameras are merged together to reconstruct the skeleton. Here we consider a simplified problem in which only arms change in position while other body parts are kept in the default location. Elevation angles (θi ) and azimuth angles (φi ) of the left/right, upper/lower parts of the arms are specified as parameters (Fig. 6(a)). The assumption is that projection matrices from the 3D skeleton to 2D image planes are known. This can be achieved either from locations of the cameras and the subject, or it can be calculated from some known projective correspondences between the 3D subject and points in the images, without knowing exact locations of cameras or the subject. There can be several different ways to find the 3D skeleton model based on observations from multi-view images. One method is to directly solve for the unknown parameters through geometric calculation. In this method we need to first establish correspondence between points/segments in different cameras, which is itself a hard problem. Common observations of points are rare for human problems, and body parts may take on very different appearance from different view. Therefore, it is difficult to resolve ambiguity in the 3D space based on
Fig. 5. Experiment results for local processing in single cameras. (a) original images; (b) segments; (c) fitted ellipses.
2D observations. A second method would be to cast this as an optimization problem, in which we find optimal θi ’s and φi ’s to minimize an objective function (e.g., difference between projections due to a certain 3D model and the actual segments). However, if the problem is highly nonlinear or non-convex, it may be very difficult or time consuming to solve. But it is possible to render the problem solvable by appropriately formulating it. This is a topic of interest for our future work. A third method would be to sample the parameter space, measure the distance between the sample and the images, and then assign the best sample to the 3D model. This is similar to the second approach in spirit. They both look for a sample point in the parameter space which optimizes the objective function. The difference lies in their searching strategies. Some optimization problems are well formulated and studied, so their solutions are guaranteed to converge. But when problems cannot be formulated in such ways, other optimization techniques need to be adopted. In this paper we implement a simple method for 3D skeleton fitting. First, a sample parameter space is generated. For every sample point in the parameter space, a 3D skeleton is constructed. Then, the skeleton is projected to image planes of all the cameras. In every image plane, a score is generated which measures the similarity between the projection and ellipses. A final score for the sample point is obtained by adding up scores from all cameras. The most critical part of the whole process is how to generate the sample parameter space.
Fig. 6. (a) The 3D skeleton. (b) Experiment results for 3D skeleton reconstruction. Original images from the 3 camera views and the skeletons are shown.
For t1 the parameter space is centered around the optimal solution of t0 with a small variance. This is effective in reducing the search space but is based on the assumption that the 3D skeleton will not go through a big change in that interval. Examples for 3D skeleton model fitting are shown in Fig. 6(b). Our current work includes using more sophisticated and efficient methods to search for the optimal parameter sample point. Other clues such as motion flows and accelerated searching techniques are potential candidates.
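A minimal version of this sampling strategy is sketched below, assuming known 3×4 projection matrices and per-camera ellipse centres. The forward arm model, the scoring function (distance of projected joints to the nearest ellipse centre) and all parameter values are simplifications introduced here for illustration, not the paper's actual similarity score.

```python
import numpy as np

def arm_points(thetas, phis, lengths, shoulders):
    """Hypothetical forward model: upper/lower arm endpoints from elevation
    (theta) and azimuth (phi) angles; lower arms continue from upper-arm ends."""
    base = [np.asarray(s, float) for s in shoulders]
    pts = []
    for i, (t, p, L) in enumerate(zip(thetas, phis, lengths)):
        d = L * np.array([np.cos(t) * np.cos(p), np.cos(t) * np.sin(p), np.sin(t)])
        start = base[i // 2] if i % 2 == 0 else pts[-1]
        pts.append(start + d)
    return np.array(pts)

def project(P, X):
    Xh = np.hstack([X, np.ones((len(X), 1))]) @ P.T
    return Xh[:, :2] / Xh[:, 2:3]

def score(P_list, ellipse_centres, X):
    """Similarity proxy: negative distance from projected joints to the nearest ellipse centre."""
    s = 0.0
    for P, centres in zip(P_list, ellipse_centres):
        x = project(P, X)
        d = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
        s -= d.min(axis=1).sum()
    return s

def fit_arms(prev_thetas, prev_phis, lengths, shoulders, P_list, ellipse_centres,
             n_samples=500, spread=0.2, rng=np.random.default_rng(0)):
    """Sample the angle space around the previous solution and keep the best score."""
    prev_thetas = np.asarray(prev_thetas, float)
    prev_phis = np.asarray(prev_phis, float)
    best, best_s = (prev_thetas, prev_phis), -np.inf
    for _ in range(n_samples):
        th = prev_thetas + rng.normal(0, spread, size=4)
        ph = prev_phis + rng.normal(0, spread, size=4)
        X = arm_points(th, ph, lengths, shoulders)
        s = score(P_list, ellipse_centres, X)
        if s > best_s:
            best, best_s = (th, ph), s
    return best
```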
5 Conclusion
In a multi-camera network huge potentials exist for efficient vision-based applications if the rich visual information is appropriately employed. An opportunistic fusion framework is introduced which encompasses the three dimensions of data fusion, i.e., space, time, and feature levels. As an implementation of the opportunistic fusion concept in gesture analysis, a 3D human body model is employed as the converging point of spatiotemporal and feature fusion. It maintains both geometric parameters of the human posture and the adaptively learned appearance attributes, all of which are updated from the three dimensions of space, time and features of the opportunistic fusion. Parameters of the 3D human body model are in turn used as feedback to aid subsequent vision analysis in the cameras. Details of the algorithm were described in the paper and experiment results were provided. Future work includes a more robust and generalized initialization of the human model. The network is expected to discover distinct attributes of the subject so that more efficient segmentation can follow. This may include dominant colors, texture, or motions. The problem of fitting the 3D skeleton model based on local segments also has the potential to be greatly improved. Motion vectors and
geometric relations can be used to “direct” posture refinement. More efficient searching techniques will also be employed aiming for a real-time gesture analysis system.
References 1. Wu, C., Aghajan, H.: Layered and collaborative gesture analysis in multi-camera networks. In: ICASSP (2007) 2. Sidenbladh, H., Black, M.J., Sigal, L.: Implicit probabilistic models of human motion for synthesis and tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 784–800. Springer, Heidelberg (2002) 3. Deutscher, J., Blake, A., Reid, I.: Articulated body motion capture by annealed particle filtering. In: CVPR00, vol. II, pp. 126–133 (2000) 4. Cheung, K.M., Baker, S., Kanade, T.: Shape-from-silhouette across time: Part ii: Applications to human modeling and markerless motion tracking. International Journal of Computer Vision 63(3), 225–245 (2005) 5. M´enier, C., Boyer, E., Raffin, B.: 3d skeleton-based body pose recovery. In: Proceedings of the 3rd International Symposium on 3D Data Processing, Visualization and Transmission, Chapel Hill (USA) (June 2006) 6. Mikic, I., Trivedi, M., Hunter, E., Cosman, P.: Human body model acquisition and tracking using voxel data. Int. J. Comput. Vision 53(3), 199–223 (2003) 7. Plaenkers, R., Fua, P.: Model-based silhouette extraction for accurate people tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 325–339. Springer, Heidelberg (2002) 8. Sidenbladh, H., Black, M.: Learning the statistics of people in images and video. IJCV 54(1-3), 183–209 (2003) 9. Wilson, A.D., Bobick, A.F.: Parametric hidden markov models for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(9), 884–900 (1999) 10. Starner, T., Pentland, A.: Visual recognition of american sign language using hidden markov models. In: AFGR95 (1995) 11. Liu, Y., Collins, R., Tsin, Y.: Gait sequence analysis using frieze patterns. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, Springer, Heidelberg (2002) 12. Rui, Y., Anandan, P.: Segmenting visual actions based on spatio-temporal motion patterns. In: CVPR00, vol. I, pp. 111–118 (2000) 13. Weiss, Y., Adelson, E.: Perceptually organized em: A framework for motion segmentaiton that combines information about form and motion. Technical Report 315, M.I.T Media Lab (1995)
A New Partially Occluded Face Pose Recognition Myung-Ho Ju and Hang-Bong Kang Dept. of Computer Eng. Catholic University of Korea #43-1 Yokkok 2-dong Wonmi-Gu, Puchon, Kyonggi-Do Korea
[email protected]
Abstract. A video-based face pose recognition framework for partially occluded faces is presented. Each pose of a person's face is approximated by connected low-dimensional appearance manifolds, and the face pose is estimated by computing the minimal probabilistic distance from the partially occluded face to a sub-pose manifold using a weighted mask. To deal with partially occluded faces, we detect the occluded pixels in the current frame and then put lower weights on these occluded pixels when computing the minimal probabilistic distance between the given occluded face pose and the face appearance manifold. The proposed method was evaluated under several situations and promising results were obtained.
1 Introduction Continuous face pose recognition plays an important role in human–computer interaction, video-based face recognition and facial expression recognition. Since human head movement induces non-linear transformations in the projected face images and facial features often become occluded, robust face pose estimation is not an easy task. There has been some research on face pose estimation, which can be mainly categorized into two classes: 3D model-based approaches and 2D appearance-based approaches. The former usually requires building 3D face models or performing 3D reconstruction [1]. This method is accurate, but a hard task under arbitrary conditions. The latter is based on 2D face appearance representation. Pentland et al. [2] proposed a view-based eigenspace approach to deal with various face appearances. Moghaddam et al. [3,4] suggested various probabilistic visual learning methods for face recognition. Lee et al. [5,6] presented video-based face recognition using probabilistic appearance manifolds. They showed good performance in face recognition, but have some limitations in recognizing partially occluded faces. In this paper, we propose a new video-based partially occluded face pose recognition method based on an appearance manifold. The pose appearance manifold consists of 11 sub-pose manifolds. The paper is organized as follows. Section 2 discusses the face pose appearance manifold. Section 3 presents our face pose recognition scheme for partially occluded faces. Section 4 shows experimental results of our proposed method.
2 Pose Appearance Manifold
Let Ω denote the pose appearance manifold. A complex and nonlinear pose appearance manifold can be represented by a set of simple linear pose manifolds using PCA planes. Fig. 1 shows a pose appearance manifold which consists of 5 sub-pose manifolds. Each sub-pose manifold is approximated by a principal component analysis (PCA) plane. The pose recognition task is to find the sub-pose n* by computing the minimal distance from image I to a sub-pose manifold:

$$n^* = \arg\min_n d^2(I, P^n) \qquad (1)$$

Fig. 1. Face Pose Appearance Manifold

As in [6], we can define the distance as the conditional probability $p(P^n \mid I)$. So, Eq. (1) becomes

$$n^* = \arg\max_n p(P^n \mid I) \qquad (2)$$

where $p(P^n \mid I) = \frac{1}{\Lambda} \exp\!\left(-\frac{1}{\sigma^2}\, d^2(I, P^n)\right)$, and Λ is the normalization term.
2.1 Pose Estimation
In the video-based face recognition framework, face pose recognition amounts to estimating the current sub-pose manifold $P_t^n$ given the current face image $I_t$ and the previous sub-pose $P_{t-1}^m$:

$$P_t^{n*} = \arg\max_n p(P_t^n \mid I_t, P_{t-1}^m) \qquad (3)$$

From this equation,

$$p(P_t^n \mid I_t, P_{t-1}^m) = \frac{1}{\Lambda}\, p(I_t \mid P_t^n, P_{t-1}^m)\, p(P_t^n \mid P_{t-1}^m) = \frac{1}{\Lambda}\, p(I_t \mid P_t^n)\, p(P_t^n \mid P_{t-1}^m) \qquad (4)$$

where Λ is the normalization term, and the image $I_t$ and $P_{t-1}^m$ are independent.

Following Moghaddam et al. [3], the likelihood probability $p(I_t \mid P_t^n)$ can be estimated using eigenspace decomposition. In PCA, the principal component feature vector is obtained as follows:

$$y = (y_1, \ldots, y_M) = \Phi_M^T \tilde{I} \qquad (5)$$

where $\Phi_M^T$ is a submatrix of Φ containing the principal eigenvectors and $\tilde{I} = I - \bar{I}$ is the mean-normalized image vector. If we assume a Gaussian distribution, the likelihood probability can be represented by the product of two Gaussian densities [3,4]. In other words,

$$p(I_t \mid P_t^n) = \left[\frac{\exp\!\left(-\frac{1}{2}\sum_{i=1}^{M} \frac{y_i^2}{\lambda_i}\right)}{(2\pi)^{M/2}\prod_{i=1}^{M}\lambda_i^{1/2}}\right] \left[\frac{\exp\!\left(-\frac{d^2(I_t, P_t^n)}{2\rho}\right)}{(2\pi\rho)^{(N-M)/2}}\right] \qquad (6)$$

where N denotes the dimension of the image space, M denotes the dimension of the sub-pose space, $\lambda_i$ denotes an eigenvalue, $d^2(I_t, P_t^n)$ denotes the L2 distance between an image $I_t$ and sub-pose $P_t^n$, which is computed from the residual reconstruction error $\varepsilon^2(I_t)$, and $\rho = \frac{1}{N-M}\sum_{i=M+1}^{N}\lambda_i$.

From Eq. (5), the residual reconstruction error is

$$\varepsilon^2(I_t) = \sum_{i=M+1}^{N} y_i^2 = \sum_{i=1}^{N} \tilde{I}_{ti}^2 - \sum_{i=1}^{M} y_i^2 \qquad (7)$$

In Eq. (4), the transition probability between sub-poses $p(P_t^n \mid P_{t-1}^m)$ represents the temporal dynamics of the face movement in the training sequence. When two sub-poses of the face are not connected, the transition probability is 0. The transition probability is defined as follows:

$$p(P_t^n \mid P_{t-1}^m) = \exp\!\left(-\frac{d^2(I_t, P^n)}{2\sigma^2}\right) \qquad (8)$$

$d^2(I_t, P^n)$ can be estimated from the distance $d^2(I_{t-1}, P^n)$ and $\Delta_{t-1}(n)$, which is the difference in distance from the target face to each sub-manifold between t−1 and t−2. $\Delta_{t-1}(n)$ is computed as

$$\Delta_{t-1}(n) = d^2(I_{t-1}, P^n) - d^2(I_{t-2}, P^n) \qquad (9)$$

So, the transition probability is computed as

$$p(P_t^n \mid P_{t-1}^m) \cong \exp\!\left(-\frac{d^2(I_{t-1}, P^n) + \Delta_{t-1}(n)}{2\sigma^2}\right) \qquad (10)$$
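A compact sketch of the resulting decision rule, combining the residual reconstruction error of Eq. (7) with the transition prior of Eqs. (8)–(10), is given below. The likelihood keeps only the distance-from-subspace factor of Eq. (6) (normalizing constants and the in-subspace term are dropped for brevity), and all function names are illustrative rather than from the paper.

```python
import numpy as np

def residual_error(I_t, mean_face, Phi_M):
    """Eqs. (5) and (7): project onto the M principal eigenvectors and return
    the residual reconstruction error eps^2 = ||I~||^2 - sum_i y_i^2."""
    I_tilde = I_t.ravel() - mean_face.ravel()
    y = Phi_M.T @ I_tilde
    return float(I_tilde @ I_tilde - y @ y)

def transition_prior(d_prev_n, delta_prev_n, sigma):
    """Eqs. (9)-(10): predict d^2(I_t, P^n) from the previous frames' distances."""
    return np.exp(-(d_prev_n + delta_prev_n) / (2.0 * sigma ** 2))

def select_subpose(I_t, manifolds, connected, prev_m, d_prev, delta_prev, sigma=1.0):
    """Eqs. (3)-(4): pick the sub-pose maximizing likelihood x transition prior.
    manifolds: list of (mean_face, Phi_M); connected: adjacency of sub-poses."""
    best_n, best_score = None, -1.0
    for n, (mean_face, Phi_M) in enumerate(manifolds):
        if not connected[prev_m][n]:
            continue                               # unconnected sub-poses have zero prior
        lik = np.exp(-residual_error(I_t, mean_face, Phi_M) / (2.0 * sigma ** 2))
        s = lik * transition_prior(d_prev[n], delta_prev[n], sigma)
        if s > best_score:
            best_n, best_score = n, s
    return best_n
```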
3 Face Pose Recognition for Partially Occluded Faces
To deal with partially occluded faces, we detect the occluded pixels and then assign lower weights to those pixels when computing a distance such as $d^2(I, P^n)$ in Eq. (1). The intensity of an occluded pixel is different from that of the corresponding pixel of the training pose data. Therefore, it is necessary to determine the intensity difference for each pixel. The intensity difference $ID_i$ at pixel $i$ is associated with the distance between the pixel's intensity value and its expected value:

$$ID_i = I_i - E_i \qquad (11)$$

To normalize or balance the intensity difference at each pixel, Eq. (11) becomes

$$ID_i = \left(\frac{I_i - \mu_i}{\sigma_i}\right)^2 \qquad (12)$$

where $I_i$ is the intensity value at pixel $i$, and $\mu_i$ and $\sigma_i$ are pixel $i$'s mean value and variance in the training data, respectively. If the pixel's intensity difference is larger than the threshold value, it is determined to be an occluded pixel. If we assume that the distribution of $ID_i$ is Gaussian, the weight of the $i$th pixel is computed as

$$\varpi(i) = \begin{cases} \exp\!\left[-\dfrac{ID_i - th}{2\sigma_{th}^2}\right] & \widehat{ID}_i \geq th \\ 1 & \text{otherwise} \end{cases} \qquad (13)$$

where $\sigma_{th}$ is the variance of the pixel differences smaller than the threshold value. To determine the threshold value $th$ in Eq. (13), we compute the histogram of $ID_i$ from sample data of the sub-pose manifold, and the point of 95% in the accumulated histogram is selected as the threshold value. Based on the pixel's weight information, an occlusion mask is constructed. In Fig. 2, the occlusion mask is constructed from the previous input image and the corresponding pose training data. Then, the query image Q is made by projecting the masked input image into the eigenspace. The pose recognition for the partially occluded face image is accomplished by computing the distance in Eq. (1) as $d^2(Q, P^n)$.
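The per-pixel weighting of Eqs. (11)–(13) can be sketched as follows; the function names and the small epsilon guard are mine, and the 95% threshold rule follows the description above.

```python
import numpy as np

def occlusion_weight_mask(img, mu, sigma, th, sigma_th):
    """Per-pixel weights following Eqs. (11)-(13): pixels whose normalized
    intensity difference exceeds the threshold are down-weighted."""
    # Eq. (12): normalized squared intensity difference against the training statistics
    ID = ((img - mu) / (sigma + 1e-8)) ** 2
    w = np.ones_like(ID, dtype=float)
    occluded = ID >= th
    # Eq. (13): exponentially decaying weight for occluded pixels
    w[occluded] = np.exp(-(ID[occluded] - th) / (2.0 * sigma_th ** 2))
    return w

def threshold_from_histogram(ID_samples, quantile=0.95):
    """th is taken at the 95% point of the accumulated histogram of ID values."""
    return np.quantile(ID_samples, quantile)
```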
Fig. 2. Face pose recognition using occlusion mask
Sometimes, if the input face pose is located at the boundary between two sub-pose manifolds, the generated occlusion mask cannot correctly represent the occluded pixels. This is shown in Fig. 3: the pose of the input image lies in the middle of the two sub-poses P2 and P1, and some weights in the occlusion mask are incorrect. In order to solve this problem, we include some face pose data around the sub-pose manifold boundary when computing the normalized intensity difference $ID_i$. After that, we change the threshold value th and the variance $\sigma_{th}$ in Eq. (13). With these changes, we can increase the accuracy of face pose recognition.
Fig. 3. Errors occurred in the mask because the input pose is at the boundary between two face poses
4 Experimental Results
We implemented our proposed pose appearance manifold learning algorithm on a Pentium 4 3.2 GHz system. Since there is no standard video database, we made 60 sequences out of data from 20 different persons. Each video sequence was recorded at our lab. The image resolution is 320 x 240, the frame rate is 15 fps, and the duration of each sequence is about 40 seconds. For the pose appearance manifold, we picked 2 sequences from each person as training sequences and cropped face images to 19 x 19 pixels. Then, we constructed 11 sub-pose manifolds from these cropped images using PCA. To construct the desirable face pose manifold, cropped face regions are adjusted around the positions of the two eyes and the nose. For each sub-pose manifold, the dimension of the sub-pose space M in Eq. (6) is related to reconstruction errors and computing time. If the value of M increases, the reconstruction error is reduced, but the computing time increases. So, we set the value of M to 20. The pose estimation test is performed on the test sequences, in which the face is occluded by a glove or sunglasses. Fig. 4 shows the tracking result of partially occluded faces using our proposed method. From the accurate face tracking results, it is possible to obtain accurate face pose recognition. Table 1 shows the pose recognition results from our method and a conventional face pose recognition method without occlusion handling. Using our proposed weighted mask for occlusion handling, the face pose recognition results improved by about 10% in comparison with the conventional method without occlusion handling. We also compared pose recognition results with other occlusion handling methods: reconstruction [5], center distance [7], and LOPHOSCOPIC PCA [8]. Table 2 shows the various pose recognition results on partially occluded data. Our proposed method shows the best pose estimation results. However, if the size of the occluded region is large, our method fails to estimate the correct face pose. To measure the tolerable size of the occluded region in the face for
Fig. 4. Partially Occluded Face Tracking Result

Table 1. Partially occluded face pose recognition result (recognition rate, %)

Pose   Frame Number   Proposed   Conventional Method (Without Mask)
1      312            100        79.55
2      265            96.98      83.02
3      1,267          97.32      86.42
4      248            97.98      82.66
5      192            99.48      85.42
6      96             97.92      90.63
7      105            73.33      70.48
8      29             100        93.10
9      32             100        81.25
10     46             100        93.48
11     43             100        95.35
Total  2,635          96.82      85.01
correct pose estimation, we experimented with three types of occlusion using a hand, a white board, and a black board. Table 3 shows the size of partial occlusion, i.e., the ratio of the occluded region to the whole face, up to which face pose recognition remains correct. In the case of hand occlusion, face pose recognition fails when the occlusion ratio increases to 56.5%.
Table 2. Comparison of partially occluded face pose recognition result
Pose   Frame Num   Proposed Method   Center Distance   Reconstruction   LOPHOSCOPIC PCA
0      83          59.04             51.81             50.60            54.22
1      190         100.00            73.16             96.32            93.68
2      936         94.02             67.41             86.65            82.37
3      281         97.15             36.65             80.07            82.21
4      174         89.08             76.44             81.03            82.18
5      97          96.91             94.85             93.81            94.85
6      91          98.90             91.21             95.60            94.51
7      57          100.00            77.19             68.42            31.58
8      28          92.86             7.14              78.57            85.71
9      13          100.00            84.62             69.23            92.31
10     16          100.00            100.00            87.50            81.25
Total  1966        93.45             69.13             80.71            79.53
Table 3. Size of partial occlusion for correct face pose recognition

Occluding object   Ratio of occluded region to the whole face (%)
Hand               56.50
White board        35.50
Black board        40.95
5 Conclusions In this paper, we have presented a novel face pose recognition method for partially occluded faces. To deal with partially occluded faces, we detect the occluded pixels in the current frame and then put lower weights on these pixels when computing the minimal probabilistic distance between the given occluded face pose and the face appearance manifold. We have experimented on realistic scenarios to show the validity of the proposed approach. It is worth noting that our proposed method provides accurate pose estimation results, which will be helpful in video-based face recognition.
Acknowledgements This work was supported by the Culture Research Center Project, the Ministry of Culture & Tourism and the KOCCA R&D program in Korea.
References 1. Murase, H., Nayar, S.: Visual Learning and recognition of 3-D objects from appearance. Int. J. Computer Vision, 5–24 (1995) 2. Pentland, A., Moghaddam, B., Starner, T.: View-based and modular eigenspaces for face recognition. In: Proc. IEEE Conf. CVPR, IEEE Computer Society Press, Los Alamitos (1994) 3. Moghaddam, B., Pentland, A.: Probabilistic visual learning for object recognition. IEEE Trans. PAMI, 696–710 (1997) 4. Moghaddam, B.: Principal Manifold and Probabilistic for Visual Recogntion. IEEE Trans. PAMI, 780–788 (2002) 5. Lee, K.C., Ho, J., Yang, M., Kriegman, D.: Video-Based Face Recognition Using Probabilistic Appearance Manifolds. In: Proc. IEEE Conf. CVPR, IEEE Computer Society Press, Los Alamitos (2003) 6. Lee, K.C., Kriegman, D.: Online Learning of Probabilistic Appearance Manifold for Videobased Recognition and Tracking. In: CVPR (2005) 7. Adam, A., Rivlin, E., Shimshoni, I.: Robust Fragments-based Tracking using the Integral Histogram. In: Proc. IEEE Conf. CVPR, IEEE Computer Society Press, Los Alamitos (2006) 8. Tarres, F., Rama, A.: A Novel Method for Face Recognition under partial occlusion or facial expression Variations. In: ELMAR (2005)
Large Head Movement Tracking Using Scale Invariant View-Based Appearance Model Gangqiang Zhao1, Ling Chen1,2, and Gencai Chen1 1
College of Computer Science, Zhejiang University, Hangzhou 310027, P.R. China
[email protected] 2 School of Computer Science and IT, The University of Nottingham, Nottingham, NG 8 1BB, UK
[email protected]
Abstract. In this paper we propose a novel method for head tracking over a large range using a scale-invariant view-based appearance model. The proposed model is populated online, and it selects key frames while the head undergoes different motions in the camera-near field. We propose a robust head detection algorithm to obtain an accurate head region, which is used as the view of the head, in each intensity image. When the head moves far from the camera, the view of the head is obtained through the proposed algorithm first, and then the key frame whose view of the head is most similar to that of the current frame is selected to recover the head pose of the current frame by coordinate adjustment. In order to improve the efficiency of the tracking method, a searching algorithm is also proposed to select the key frame. The proposed method was evaluated with a stereo camera and achieved robust pose recovery under large head motion, even when the movement along the Z axis was about 150 cm.
1 Introduction A robust estimation of head pose in 3D is important for many applications, and knowledge about head-gaze direction can be used in many fields, such as human–computer interaction, video compression, and face recognition systems. Many vision-based 3D head tracking methods have been developed in recent years, but none of them has considered the problem of large motion, especially movement along the Z axis, which makes the tracking results unstable and inaccurate. Several different approaches have been used for model-based head tracking. Cootes and Taylor [1] employ a linear subspace of shape and texture as a model. However, the manifold underlying the appearance of an object under varying pose is highly nonlinear, so the method works well only when the pose change is relatively small. Birchfield [2] uses an aggregate statistics appearance model for head tracking. The head is located using the distribution of skin-color pixels, and this distribution can be adapted to fit the subject as tracking goes on. Since the characteristics of the statistics distribution are influenced by many factors and only one of them is pose, the tracking does not lock on to the target tightly. DeCarlo and Metaxas [3] proposed a deformable 3D model approach. This approach maintains the 3D structure of the subject in a state vector which is updated recursively as images are observed. The update requires that
correspondences between features in the model and features in the image be known. However, computing these correspondences is difficult and the update is also expensive. Krahnstoever and Sharma [4] present an online approach to acquire and maintain appearance information for model-based tracking. However, mapping background data to the model can destabilize the tracker. Ohayon and Rivlin [5] proposed a method which acquires several 3D feature points from the head prior to tracking, and these points are used as a head model. However, these separated features will be lost when large motion occurs. The tracking method proposed in this paper is based on the work of Morency et al. [6], which uses an appearance model to represent the subject with a subset of the frames in the input sequence. These key frames are annotated with their estimated poses, and collectively represent the appearances of the subject as viewed from these estimated poses. However, the appearance model is used only for bounding drift when the pose trajectory of the head crosses itself. This paper uses a similar appearance model to represent the head; the difference is that besides annotating a key frame with the estimated pose, the head region is precisely selected and the corresponding head view is also obtained from the intensity image to annotate the key frame. The proposed appearance model is populated online, and it selects key frames while the head undergoes different motions in the camera-near field. When the head moves far from the camera, the view of the head is obtained first, and then a key frame whose view of the head is most similar to that of the current frame is selected to recover the head pose of the current frame by coordinate adjustment. Since view-based models can capture non-Lambertian reflectance, the corresponding tracking methods suit head tracking very well. Performance evaluation shows that the proposed tracker based on the scale-invariant appearance model achieves robust pose recovery when the head undergoes large motion, and it works well even when the movement along the Z axis is about 150 cm. On a Pentium 4 2.6 GHz PC, the measurement rate of the implemented 3D head tracker was 12 Hz.
2 Scale Invariant Appearance Model Our view-based model consists of a collection of key frames acquired using a stereo camera during tracking and each key frame is annotated with pose and head region both. For each key frame, the view-based model maintains the following information: Ms = {I s, Zs, Hs, xs} where I s and Z s are the intensity and depth images associated with the key frame s. Hs is the head region in intensity image Is. xs = [Tx Ty Tz Ωx Ωy Ωz ] is a 6 dimensional vector consisting of the translation and the three rotation angles around the X, Y and Z axes. The view-based model is defined by the set {M1 … Mk}, where k is the number of key frames. 2.1 Accurate Head Detection For each frame the precise head region H is selected using background subtraction and contour information together. First, edge image, shown as Fig. 1.(b), is obtained from intensity image, shown as Fig. 1.(a), using the Canny edge detector. With
intensity image, the body movement in the indoor environment might cause the intensity value of a background pixel to change between two consecutive frames, which might decrease the accuracy of the background subtraction and would be worse if the body makes a large move. The edge of an image is the collection of pixels where significant intensity changes occur. Since the edge will remain unchanged if all pixels of the image have the same intensity change, the edge image is used for background subtraction. Let $i_1, i_2, \ldots, i_N$ be a recent sample of intensity values for a pixel in the corresponding latest N frames. Using this sample, the probability density function that this pixel will have intensity value $i_t$ in the current frame can be computed employing kernel density estimators [7]. We use a Gaussian kernel to estimate it:

$$Pr(i_t) = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\frac{(i_t - i_j)^2}{\sigma^2}} \qquad (1)$$

where N is the number of samples and σ is the standard deviation of the Gaussian kernel. The pixel is classified as background if the following criterion is met:

$$Pr(i_t) > Th \qquad (2)$$

where Th is a threshold defined according to the real images. The foreground image after background subtraction is shown in Fig. 1(c), and the profile of the body is clear. Then, we subtract contours from the foreground image to locate the head. The final result is shown in Fig. 1(d), with the red curve showing the head contour and the green rectangle showing the head region H.
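A direct implementation sketch of Eqs. (1)–(2) is shown below; the default values σ = 4.65 and Th = 0.065 are the ones reported in the experiments (Section 4), and the example call is hypothetical.

```python
import numpy as np

def background_probability(samples, i_t, sigma=4.65):
    """Eq. (1): Gaussian kernel density estimate of the current intensity i_t
    from the last N intensity samples of the same pixel."""
    samples = np.asarray(samples, dtype=float)
    k = np.exp(-0.5 * ((i_t - samples) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)
    return k.mean()

def is_background(samples, i_t, th=0.065, sigma=4.65):
    """Eq. (2): classify the pixel as background if its density exceeds Th."""
    return background_probability(samples, i_t, sigma) > th

# Example: a pixel whose recent edge-image intensities hovered around 120
print(is_background([118, 121, 119, 122, 120, 118, 121, 119, 120, 122], 150))  # likely False
```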
Fig. 1. Head detection: (a) Intensity image, (b) edge image, (c) foreground image, (d) result image
2.2 Pose Estimate Given frame s and frame t, the registration algorithm estimates the pose change between two frames. Let P and Q be two 3D point sets. Each point in P is chosen from s and its corresponding point in Q is found from frame t. Let the three rotation angles around the X, Y and Z axes be R = [Ωx Ωy Ωz] and the translation be T = [Tx Ty Tz]. P and Q are connected via the following equation: Q = RP + T (3) Finding R and T is known as the registration problem. The least squares formulation, which can be used to minimize this alignment error, is shown as follows:
$$E = \sum \| Q - (RP + T) \|^2 \qquad (4)$$
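The paper solves Eq. (4) with the ICP/NFC registration of [8], described next; for illustration only, the sketch below gives the closed-form least-squares rotation and translation (Kabsch/Horn) for the simpler case where the correspondences between P and Q are already known.

```python
import numpy as np

def rigid_fit(P, Q):
    """Closed-form least-squares solution of Eq. (4) for known correspondences
    (Kabsch/Horn): find R, T minimizing sum ||Q - (R P + T)||^2."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                        # proper rotation (det = +1)
    T = cQ - R @ cP
    return R, T

# Small check: recover a known rotation about Z and a translation
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
P = np.random.default_rng(1).normal(size=(50, 3))
Q = P @ R_true.T + np.array([0.1, -0.2, 0.5])
R, T = rigid_fit(P, Q)
print(np.allclose(R, R_true, atol=1e-6), np.allclose(T, [0.1, -0.2, 0.5], atol=1e-6))
```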
The registration algorithm [8] based on the Iterative Closest Point (ICP) and the Normal Flow Constraint (NFC) is employed to solve this problem.
2.3 Appearance Model Building
When the head moves in the camera-near field (its distance to the camera is smaller than 0.8 m), the view-based model is populated online with intensity and depth images, head regions and poses. Firstly, the head region Hs is detected for frame s using the proposed head detection algorithm. Secondly, the pose xs is estimated for frame s using
Fig. 2. The flowchart of key frame selection
the two-frame rigid body registration algorithm mentioned in Section 2.2. Then the frame will be inserted into the appearance model if its pose differs from that of the other frames already in the model [5]. Fig. 2 shows the flowchart of key frame selection. At this stage, the model tries to eliminate drift when the head's pose trajectory crosses itself. All the selected head regions in intensity images are resized to the same size (e.g. 50 x 60 pixels) before being used as views of the head, and this eases the appearance distance computation. Note that this resize mechanism makes the appearance model scale invariant, as the appearance distance can be calculated from two views of the head even when they have different original sizes.
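A minimal sketch of this key-frame admission test is given below. The pose-distance measure, its weights and the minimum pose gap are illustrative assumptions; only the idea (insert a frame only when its pose differs enough from every stored key frame) and the cap of 80 key frames, mentioned in Section 4, come from the paper.

```python
import numpy as np

def pose_distance(x_a, x_b, w_rot=1.0, w_trans=0.01):
    """A simple weighted distance between 6-DOF poses [Tx Ty Tz Wx Wy Wz];
    the weights are illustrative, not values from the paper."""
    x_a, x_b = np.asarray(x_a, float), np.asarray(x_b, float)
    return (w_trans * np.linalg.norm(x_a[:3] - x_b[:3])
            + w_rot * np.linalg.norm(x_a[3:] - x_b[3:]))

def maybe_add_key_frame(model, intensity, depth, head_region, pose,
                        min_pose_gap=0.3, max_key_frames=80):
    """Insert the frame as a key frame only if its pose is sufficiently
    different from every key frame already in the view-based model."""
    if any(pose_distance(pose, m["x"]) < min_pose_gap for m in model):
        return False
    if len(model) >= max_key_frames:
        return False
    model.append({"I": intensity, "Z": depth,
                  "H": head_region,            # resized to a common size, e.g. 50x60
                  "x": np.asarray(pose, float)})
    return True
```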
3 Tracking for Large Motion
When the head moves far from the camera quickly, the two-frame registration algorithm cannot estimate the head pose accurately, as it cannot find corresponding points between two frames when a large movement along the Z axis occurs. In this paper, a tracking method based on the aforementioned appearance model is proposed; it uses the appearance model to recover the head pose by finding the key frame whose view of the head is most similar to that of the current frame.
3.1 Base Frame Selection
After the head region is obtained from the intensity image of the current frame using the head detection algorithm described in Section 2.1 and resized to the same size as the key frames in the appearance model, the L2 distances between the head regions of the key frames and the current frame are calculated using the following equation:

$$d_{L2}(H_s, H_c) = \left[\sum_{k=1}^{MN} (H_s^k - H_c^k)^2\right]^{\frac{1}{2}} \qquad (5)$$
where Hs is the head region of key frame s, Hc is the head region of the current frame, and MN is the size of the head region (e.g. 50*60 pixels). The key frame having the smallest appearance distance (i.e. L2 distance) to the current frame is chosen as the base frame of the current frame. However, selecting this base frame takes a long time if the appearance model contains many key frames. In order to decrease the load of distance calculation, an efficient searching algorithm based on a rotation angle index of the key frames is proposed. If the pose difference between two key frames is small, their appearances will be very similar and their appearance distances to the current frame should be close. Based on this, one key frame can be selected to represent dozens of key frames which have close poses. Let x = [Tx Ty Tz Ωx Ωy Ωz] describe the pose of a key frame, where Ωx, Ωy, and Ωz are the rotation angles around the X, Y, and Z axes. Rotation around the X axis is divided into two sub-classes: positive Ωx, changing from 0 to the positive maximum (Class-X-Positive); and negative Ωx, changing from 0 to the negative maximum (Class-X-Negative). Applying this classification to the Y and Z axes, the other four sub-classes are obtained. In the rotation angle index, each key frame is classified into one of the six sub-classes according to its major rotation axis (i.e. the axis that has the largest rotation angle). Two key frames are shown for each sub-class in the third layer of Fig. 3, from left to right: negative Ωy, negative Ωz, negative Ωx, positive Ωx, positive Ωz, and positive Ωy. For each sub-class, one key frame is selected to represent it. For instance, the representative key frame for Class-X-Positive is the one whose Ωx is almost half of the positive maximum. The six nodes in the second layer of Fig. 3 represent the six representative key frames. Based on the index and the six representative key frames, the searching algorithm calculates the appearance distances between the current frame and the representative key frames, and the sub-class whose representative key frame is most similar to the current frame is selected as the first potential sub-class.
Fig. 3. Rotation angle index, 2 key frames for each subclass are shown in the bottom layer
In most cases the base frame can be found by searching only the first potential sub-class; in order to obtain a more stable result, both the first and the second potential sub-classes are searched in our implementation. Let the six sub-classes be denoted SUBCLASS i, where 1 ≤ i ≤ 6; SUBCLASS i contains the key frames {M_1^i, …, M_{k_i}^i}, where k_i is the number of key frames in SUBCLASS i. The whole searching algorithm is shown in Fig. 4.
Given: the model {M_1, …, M_k} and the current frame c. The index has been built and the six representative key frames have been selected.
• Calculate the appearance distance between the current frame c and each representative key frame.
• Select the first potential sub-class (SUBCLASS m) and the second potential sub-class (SUBCLASS n), where 1 ≤ m, n ≤ 6.
• (Select the most similar key frame) Calculate the appearance distance between frame c and each key frame of SUBCLASS m (k_m in all); calculate the appearance distance between frame c and each key frame of SUBCLASS n (k_n in all); find the minimum appearance distance among all (k_m + k_n) distances. The corresponding key frame is M_b.
Output: M_b is the base frame.
Fig. 4. Base frame selection algorithm
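The two-stage search of Fig. 4 can be sketched as follows. This is an illustrative reading of the algorithm, not the authors' implementation; the names KeyFrame, subclass_of and select_base_frame are assumptions introduced here.

```python
# Sketch of the base frame selection of Fig. 4: key frames carry a pose
# x = [Tx, Ty, Tz, Ox, Oy, Oz] and are indexed into six sub-classes by
# their major rotation axis and its sign.
from dataclasses import dataclass
import numpy as np


@dataclass
class KeyFrame:
    view: np.ndarray   # resized intensity view of the head
    pose: np.ndarray   # [Tx, Ty, Tz, Ox, Oy, Oz]


def subclass_of(kf):
    """Sub-class index 0..5 from the major rotation axis (0=X, 1=Y, 2=Z) and its sign."""
    rot = kf.pose[3:6]
    axis = int(np.argmax(np.abs(rot)))
    return 2 * axis + (0 if rot[axis] >= 0 else 1)


def select_base_frame(model, representatives, current_view, distance):
    """Two-stage search: pick the two closest representative key frames,
    then search only their sub-classes exhaustively."""
    # Stage 1: distances to the six representative key frames.
    rep_d = sorted((distance(rep.view, current_view), c)
                   for c, rep in representatives.items())
    candidate_classes = {c for _, c in rep_d[:2]}   # first and second potential sub-class
    # Stage 2: exhaustive search inside the two candidate sub-classes.
    best, best_d = None, np.inf
    for kf in model:
        if subclass_of(kf) in candidate_classes:
            d = distance(kf.view, current_view)
            if d < best_d:
                best, best_d = kf, d
    return best                                      # the base frame M_b
```

Here `representatives` is assumed to be a mapping from sub-class index to its representative key frame, and `distance` is the appearance distance of equation (5).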
3.2 Coordinate Adjustment
The pose of the current frame is recovered from that of the base frame. Since the pose x_b of the base frame is expressed in the coordinate frame of the initialization frame, it must be adjusted appropriately to obtain the final result. Assume that the center point of the head region in the base frame is C_b = {X_b, Y_b, Z_b} and the center point in the current frame is C_c = {X_c, Y_c, Z_c}. The real pose x_c of the current frame can then be adjusted as follows:

x_c = (C_c − C_b) x_b    (6)
4 Performance Evaluation
This section presents the experiments carried out to evaluate the tracking method based on the proposed appearance model. In the experiments, the subject moved in the near field (~0.8 m) for several minutes and then moved away from the camera along the Z axis into the far field (~2.3 m). In the first stage, the subject underwent rotations (the three rotation angles ranged from −45° to 45°) and translations (within 40 cm, including small translations along the Z axis) in the near field. Then the subject moved quickly away from the camera along the Z axis into the far field. At this stage, the head underwent rotations (again in the range of −45° to 45°) and large translations along the Z axis (about 150 cm).
A sequence obtained from a Digiclops stereo camera [9], recorded at 6 Hz for 2 minutes, is used to test the tracking method, and the number of key frames is set to 80. For background subtraction, the number of samples N in equation (1) is set to 10 and the standard deviation of the Gaussian kernel σ is 4.65. The background pixel selection threshold Th in equation (2) is set to 0.065. On a Pentium 4 2.6 GHz PC, the implemented 3D head tracker runs at 12 Hz. Fig. 5 shows the tracking results. The scale invariant appearance model approach is compared with Morency's original appearance model approach. The top row of Fig. 5 shows the intensity images while the head moves away from the camera. When tracking with Morency's appearance model, as shown in the center row of Fig. 5, the pose estimate drifts when the large movement along the Z axis occurs. The scale invariant appearance model approach, as shown in the bottom row of Fig. 5, tracks the head robustly during the entire sequence. To analyze our algorithm quantitatively, we compared our results with the measurements from a pciBIRD motion sensor [10]. pciBIRD is a 6-DOF (degree of freedom) position and orientation tracking system; Ascension reports a pose accuracy of 0.15° RMS when the sensor is moving. We recorded 3 sequences with ground-truth poses using the pciBIRD sensor. The sequences were recorded at 6 Hz and their average length is 381 frames (~65 s). Fig. 6 compares our results with the pciBIRD sensor on sequence 1; only about 100 frames around the large movement along the Z axis are shown. The RMS errors for all 3 sequences are given in Table 1.
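Equations (1) and (2), which the parameters N, σ and Th refer to, are not reproduced in this excerpt; the sketch below assumes they correspond to the standard non-parametric kernel density background model of [7]. It is our hedged reading for illustration only, not code from the paper.

```python
# Hedged sketch of the background test whose parameters are quoted above
# (N = 10 samples, sigma = 4.65, Th = 0.065), assuming a Gaussian kernel
# density estimate over the last N samples of each pixel.
import numpy as np

N, SIGMA, TH = 10, 4.65, 0.065


def is_background(pixel_value, samples, sigma=SIGMA, th=TH):
    """samples: the last N intensity values observed at this pixel."""
    samples = np.asarray(samples, dtype=np.float64)
    k = np.exp(-(pixel_value - samples) ** 2 / (2.0 * sigma ** 2))
    k /= np.sqrt(2.0 * np.pi) * sigma
    density = k.mean()            # (1/N) * sum of Gaussian kernels
    return density > th           # above threshold -> background pixel
```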
Fig. 5. Comparison of tracking results when the head has a large movement along the Z axis. The box around the head shows the pose of the head in the OpenGL window. The top row shows the intensity images; the bottom row shows the results using our appearance model in frame 485 (0.88 m from the camera), frame 487 (1.10 m), frame 495 (1.67 m), frame 510 (1.68 m), frame 542 (2.06 m) and frame 580 (2.31 m); the center row shows the tracking result using Morency's appearance model approach.
We further compared the proposed tracking approach when the appearance model includes different numbers of key frames. Fig. 6(d) shows the results of this comparison. It can be seen that the recovered pose is closest to the ground truth when the appearance model has 80 key frames.
Table 1. RMS error for each sequence. Pitch, yaw and roll represent rotation around the X, Y and Z axes, respectively

              Pitch     Yaw       Roll      Total
Sequence 1    3.42°     2.85°     3.12°     3.21°
Sequence 2    2.95°     4.11°     3.53°     3.83°
Sequence 3    3.56°     2.78°     2.68°     3.14°
Fig. 6. Comparison of the head pose estimation from our scale invariant view-based approach with the measurements from the pciBIRD sensor. (a) Pitch, (b) yaw, (c) roll, (d) yaw when the model consists of 80, 40 or 20 key frames.
5 Conclusions
In this paper we presented a 3D head tracking method using a scale invariant view-based appearance model. The proposed appearance model is generated online from views of the head as it undergoes different motions in the near field. When the head moves away from the camera, its pose can be recovered using this model. Experimental results show that the proposed method achieves robust pose recovery when the head undergoes large motion, even when the movement along the Z axis is about 150 cm. The proposed tracking method can therefore be used in many applications in which the subject moves around from time to time.
References
1. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001)
2. Birchfield, S.: Elliptical head tracking using intensity gradients and color histograms. In: Proceedings of IEEE International Conference on Computer Vision, Bombay, pp. 232–237. IEEE Computer Society Press, Los Alamitos (1998)
3. DeCarlo, D., Metaxas, D.: Adjusting shape parameters using model-based optical flow residuals. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002)
4. Krahnstoever, N., Sharma, R.: Appearance management and cue fusion for 3D model-based tracking. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Madison, pp. 249–256. IEEE Computer Society Press, Los Alamitos (2003)
5. Shay, O., Rivlin, E.: Robust 3D head tracking using camera pose estimation. In: Proceedings of IEEE International Conference on Pattern Recognition, Hong Kong, pp. 1063–1066. IEEE Computer Society Press, Los Alamitos (2006)
6. Morency, L., Rahimi, A., Darrell, T.: Adaptive view-based appearance models. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Madison, pp. 803–810. IEEE Computer Society Press, Los Alamitos (2003)
7. Elgammal, A., Harwood, D., Davis, L.: Non-parametric model for background subtraction. In: Proceedings of European Conference on Computer Vision, Dublin, pp. 751–767 (2000)
8. Morency, L., Darrell, T.: Stereo tracking using ICP and normal flow. In: Proceedings of IEEE International Conference on Pattern Recognition, Quebec, pp. 367–372. IEEE Computer Society Press, Los Alamitos (2002)
9. Point Grey Research Inc. http://www.ptgrey.com/
10. Ascension Technology Inc. http://www.ascensiontech.com/
Robust Shape-Based Head Tracking Yunshu Hou1,2 , Hichem Sahli1 , Ravyse Ilse1 , Yanning Zhang2 , and Rongchun Zhao2
1 Vrije Universiteit Brussel, Department ETRO, Joint Research Group on Audio Visual Signal Processing (AVSP), Pleinlaan 2, 1050 Brussel
{icravyse,hsahli}@etro.vub.ac.be
2 Northwestern Polytechnical University, School of Computer Science, 127 Youyi Xilu, Xi'an 710072, P.R. China
[email protected], {ynzhang,rczhao}@nwpu.edu.cn
Abstract. This work presents a new method to automatically locate frontal facial feature points under large scene variations (illumination, pose and facial expressions). First, we use a kernel-based tracker to detect and track the facial region in an image sequence. Then the results of the face tracking, i.e. face region and face pose, are used to constrain prominent facial feature detection and tracking. In our case, eyes and mouth corners are considered as prominent facial features. In a final step, we propose an improvement to the Bayesian Tangent Shape Model for the detection and tracking of the full shape model. A constrained regularization algorithm is proposed using the head pose and the accurately aligned prominent features to constrain the deformation parameters of the shape model. Extensive experiments demonstrate the accuracy and effectiveness of our proposed method.
1 Introduction
Automatic analysis of facial images has received great attention in the last few years. This is due to the increasing interest in applications such as human-computer interaction, video conferencing, 3D face modeling, expression analysis, and face recognition. All these applications require accurate detection and tracking of facial features (visible facial elements such as mouth corners, eyebrows, eyelids, wrinkles, etc.). Several methods for facial feature extraction have been described in the literature [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Current results show that the Active Shape Model (ASM) [7, 8] gives promising results. ASM methods introduce a prior statistical model as a constraint, hence making the estimation more robust. They relate the variation of the model parameters directly to those of the measurements of the video face, using e.g. optical flow and gradient/edge measurements. ASM enables accurate tracking of facial features, but does not handle occlusions and self-occlusions. Recently, a novel application of the Bayesian Shape Model for facial feature extraction has been proposed, the Bayesian Tangent Shape Model (BTSM) [9, 10, 12]. First, a face shape model with 83 feature points is designed
(see Figure 1), and PCA is used to estimate the shape variance of the face model from a learning set of faces. Then, based on the prior shape distribution and the likelihood model in the image shape space, BTSM is applied to match and extract the face shape from the input images. The MAP estimate of the parameters is obtained using the EM algorithm [9].
Fig. 1. Shape Model with N = 83 feature points
Even though current techniques have yielded significant results, their success is limited by the conditions imposed by real applications. The major difficulty lies in tracking person-adapted features while taking into account scene variations (illumination changes and facial expressions). To address such problems, in this paper we propose a new facial feature extraction and tracking method which relies on the combination of several methods and a cascaded parameter prediction and optimization, including (i) kernel-based face detection and tracking, resulting in the detection of the facial region and face pose [13], (ii) a constrained Lucas and Kanade (LK) tracker [14] for detecting and tracking prominent facial features, namely the eyes and mouth corners, and (iii) an improvement to the Bayesian Tangent Shape Model (BTSM) [9, 10] for the detection and tracking of the shape model. A constrained regularization algorithm is proposed, using the head pose and the accurately aligned prominent features to constrain the deformation parameters of the shape model. The remainder of the paper is organized as follows. Section 2 summarizes the kernel-tracking method for face detection, tracking and pose estimation. Section 3 describes the constrained prominent facial feature tracking. In Section 4, we describe the 2D shape parameter estimation algorithm. Finally, in Section 5, extensive results are discussed and some conclusions are drawn.
2 Face Region Detection and Tracking
For the face detection and tracking from color images, we use a previously proposed algorithm [13] that allows tracking in the presence of varying lighting conditions as well as complex background. This method first detects the skin region over the entire first image of the sequence, and generates face candidates based on
the spatial arrangement of the skin patches as well as the elliptical shape of the face. In a second stage, a novel kernel-based method, in which a joint spatial-color probability density characterizes the elliptical head region, is used for tracking the face region over the entire image sequence. The parameterized motion and the illumination changes affecting the target are estimated by minimizing a distance measuring the adherence of the samples of the head candidate to the density of the head model. This kernel-based approach has proved to be robust to the 3-dimensional motion of the face, and keeps the tracked region tightly around the face. Moreover, incorporating an illumination model into the tracking equations enables us to cope with potentially distracting illumination changes. The proposed algorithm [13] achieves reliable tracking results compared to the best spatially-weighted color histogram trackers [15]. The output of this phase is the face region (ellipse) and the estimated head pose γ̂ = [s, θ, t_x, t_y]^T, whose parameters correspond to scaling, rotation and translation, respectively.
3 Constrained Features Tracking
Within the detected face region, the second step of our approach is to detect prominent facial features, i.e. the eyes and mouth corners. To this end we apply a constrained Lucas-Kanade (LK) tracker [14]. The LK tracker aims at estimating a robust match between feature points of two images I_1 and I_2 by minimizing the sum of squared differences between two small windows centered at the feature locations:

\min_{u,v} \sum_{x,y} \left[ I_2(x+u, y+v) - I_1(x,y) \right]^2    (1)

where u = [u, v]^T is the motion vector to be computed. Equation (1) has a closed-form solution [14]:

\begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} -\sum I_x I_t \\ -\sum I_y I_t \end{bmatrix}    (2)

In matrix form, and for M feature points (in our case M = 6, the eyes and mouth corners), the well-known LK tracker is given by:

A u = h    (3)

In the case of facial features, the motion of the tracked eyes and mouth corners lies in a 4-dimensional manifold, which can be modeled using the traditional 2D rigid motion model m = [m_1, m_2, m_3, m_4]^T = [s\cos\theta, s\sin\theta, t_x, t_y]^T, parameterized by γ = [s, θ, t_x, t_y]^T. The motion vector u can then be expressed as:

u = B m + c, \quad \text{with } B = \begin{bmatrix} x & -y & 1 & 0 \\ y & x & 0 & 1 \end{bmatrix} \text{ and } c = \begin{bmatrix} -x \\ -y \end{bmatrix}    (4)

Combining (4) and (3) we get:

A B m = h - A c    (5)
which has the least-squares solution

\tilde{m} = \left( [AB]^T [AB] \right)^{-1} [AB]^T \left( h - A c \right)    (6)

this being the optimal solution of (5) when the tracking errors are isotropic. Note that when the errors are anisotropic, weighted least-squares techniques are more appropriate for solving equation (5). Suppose we have some prior knowledge γ̂ about the rigid motion of the tracked head; then (5) becomes a constrained LK model:

A B m = h - A c, \qquad m = \hat{m}    (7)

which can be solved as a minimization problem:

\min_m \left( \| A B m - (h - A c) \|^2 + \lambda \| m - \hat{m} \|^2 \right)    (8)

or, equivalently,

\left( [AB]^T [AB] + \mathrm{diag}(\lambda, \ldots, \lambda) \right) m = [AB]^T (h - A c) + \lambda \hat{m}    (9)
where m is the motion parameter vector to be estimated, m̂ is the motion estimated by the kernel-based method of Section 2, and λ expresses the confidence in the prior knowledge. From (9) one can notice that if λ is zero we recover equation (5), and if λ is large enough the solution converges tightly to m̂. The output of the constrained LK model is a refined head motion m̃ (equivalently γ̃) between the previous frame and the current one, and the location s_I^C of the tracked prominent feature points in the current frame (according to (4)). Generally speaking, the tracked features will not be located exactly at the corner positions of the eyes and mouth. To obtain more precise corner tracking results we apply a post-processing step: we use corner detection techniques and perform a local search for the optimal candidate eye and mouth corners in a window centered around the position given by the constrained LK tracker. Another issue is the initialization of the constrained LK tracker, in other words the detection of the 6 prominent facial features in the first frame of the image sequence. In this work, we consider that the face in the first frame corresponds to the neutral face state, with open eyes and a closed mouth. We first apply the face detector of Section 2, obtaining a candidate face region expressed by an ellipse. Then we use the average face shape structure and iris and mouth dark region detection results to locate regions of interest around the eye and mouth corners, and finally apply corner detection techniques to detect the 6 prominent facial features.
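The regularized system of equation (9) can be solved directly. The sketch below is illustrative only; the variable names and the stacked form of A, h and c are assumptions consistent with the derivation above.

```python
# Sketch of the solve of equation (9): given the stacked LK system A u = h
# (A of size 2M x 2M, h of size 2M for M tracked corners), the rigid-motion
# parameterization u = B m + c, and a prior motion m_hat from the kernel
# tracker, estimate the 4-parameter motion m.
import numpy as np


def constrained_lk_motion(A, B, h, c, m_hat, lam):
    """Solve ([AB]^T [AB] + lam*I) m = [AB]^T (h - A c) + lam * m_hat."""
    AB = A @ B                          # 2M x 4
    lhs = AB.T @ AB + lam * np.eye(4)
    rhs = AB.T @ (h - A @ c) + lam * m_hat
    return np.linalg.solve(lhs, rhs)    # m = [s*cos(theta), s*sin(theta), tx, ty]
```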
4 Constrained Shape Tracking
4.1 Overview
The face shape in the 2D image space is expressed by the coordinates of N = 83 feature points:

s_I = [x_1, y_1, \ldots, x_N, y_N]^T    (10)
The task of head feature tracking is to detect the face shape s_{I_t} in image I_t given the previous face shape s_{I_{t-1}} in image I_{t−1}. In our approach the problem is formulated in the same way as for BTSM [9, 10], with some differences. First, our objective is tracking and not alignment; as such, the observations are the previously tracked feature points. Second, we make use of the motion parameters γ estimated in Section 2 as well as the well-positioned prominent features of Section 3. In summary, we propose a new optimization method comprising a PCA-based tangent shape model in the aligned tangent shape space and a motion model from the tangent shape space to the image shape space, together with two prior constraints: the motion parameters and the six well-positioned feature points. More precisely, given the observed/tracked face shape s_{I_t} in the image space, we aim at estimating both the shape parameters b(t) of the tangent shape model and the motion parameters γ(t) of the motion model under the constraints of: (i) the trained shape distribution s_{T_t} in tangent shape space, (ii) the estimated motion parameters γ̃(t), and (iii) the well-positioned prominent facial features (eye and mouth corners) s^C(t). The optimization is carried out via EM to obtain a MAP estimate.
4.2 Constrained Shape Model Formulation
Face shape variations in the two-dimensional space are due to the rigid motion of the face, the nonrigid motion of the face, and the shape differences between people. In the proposed constrained shape model, denoted CSM, the latter two variations are simultaneously modeled in the tangent shape space via probabilistic principal component analysis (PPCA), and the first variation is modeled as a four degree-of-freedom motion model from the tangent shape space to the image space. The face shape s_T in the tangent space is expressed in the same way as s_I, but in the tangent reference frame, which is aligned to the mean shape μ = (x̄_1, ȳ_1, …, x̄_N, ȳ_N)^T of the training sample set after Generalized Procrustes Analysis warping [9, 10]:

s_T = [x_1, y_1, \ldots, x_N, y_N]^T    (11)
First, a dependency between the coordinates of the shape in the image space is introduced by the warping into the tangent space using the Generalized Procrustes Analysis: the rigid motion of the shape model is contained in a motion model (for the warping), while the residual 2N − 4 degrees of freedom define the tangent reference space. The motion model γ = [s, θ, t_x, t_y]^T, which takes into account 2D translation, rotation and scaling, is expressed as:

s_I = s \left( I_N \otimes \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \right) s_T + I_N \otimes \begin{bmatrix} t_x \\ t_y \end{bmatrix} + \varphi = T_\gamma(s_T) + \varphi    (12)

where ⊗ denotes the Kronecker matrix product, I_N is the N × N identity matrix, and φ is the isotropic observation/tracking noise of the current measurement s_I in the image space, φ ∼ N(0, ρ² I_{2N}), ρ being the mean displacement of the tracked shape between two successive frames, ρ² = ||s_I(t) − s_I(t−1)||² / (2N).
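For illustration, the warp T_γ of equation (12) amounts to applying the same similarity transform to every (x, y) point of the stacked shape vector. This is a minimal sketch under that reading; the function name is an assumption.

```python
# Sketch of T_gamma of equation (12) applied to a stacked shape vector
# s = [x1, y1, ..., xN, yN]^T, exploiting the I_N (x) R(theta) structure.
import numpy as np


def warp_shape(s_t, gamma):
    """gamma = (scale, theta, tx, ty); returns T_gamma(s_T) as a 2N vector."""
    scale, theta, tx, ty = gamma
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    pts = s_t.reshape(-1, 2)                      # N x 2 array of (x, y)
    warped = scale * pts @ R.T + np.array([tx, ty])
    return warped.ravel()
```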
The tangent shape model is formulated in a probabilistic learning framework: aligned training shapes are used to create the PPCA tangent shape model as sT = μ + Φr b + Φε
(13)
where Φ is obtained from the eigenvectors of the PPCA of the aligned training shapes, Φ_r is a 2N × r matrix consisting of the first r columns of Φ, determined by maintaining 95% of the energy of the shape variation, b is the shape variation in the tangent space, b ∼ N(0, Λ = diag(λ_1, …, λ_r)), where λ_1, …, λ_r are the first r eigenvalues of the PPCA, and ε is used to model the isotropic shape noise in the tangent space,

p(\varepsilon) \propto \exp\left( -\frac{\|\varepsilon\|^2}{2\sigma^2} \right), \qquad \sigma^2 = \frac{1}{2N-4} \sum_{i=r+1}^{2N-4} \lambda_i

with λ_{r+1}, …, λ_{2N−4} being the remaining eigenvalues of the PPCA. Detailed information can be found in [9].
Given an initially observed/tracked face shape s_{I_{t-1}} in the image space, we aim at estimating both the shape parameters b(t) of the tangent shape model and the motion parameters γ(t) of the motion model, using the following constraints: (i) the trained tangent shape s_T distribution in tangent shape space, (ii) the estimated motion parameters γ̃(t) of Section 3, and (iii) the well-positioned prominent facial features (eye and mouth corners) s_I^C of Section 3. The constraints are incorporated in a probabilistic framework by applying Bayesian inference (used for the optimization of the shape in Section 4.3) as follows.
The first constraint, the motion parameter γ̃(t), is introduced as a boundary constraint in the shape optimization. In the Bayesian inference framework the true motion parameters are modeled as a multivariate Gaussian distribution:

\gamma(t) \sim N(\tilde{\gamma}(t), \mathrm{diag}(\eta))    (14)
where η = [0.2, 0.087, 10, 10]^T has been set empirically. η can be seen as the weighting coefficient of the prior-knowledge constraint.
The second constraint makes use of the detected prominent facial features s_I^C obtained using the constrained LK tracker. Therefore, we divide the tangent shape space s_T into two orthogonal spaces: a corner shape space to which s_T^C belongs, and an orthogonal complement space in which s_T^{C*} resides. This can be formulated as:

s_T = s_T^C + s_T^{C*}    (15)

For simplicity, suppose that the corner points are the first six feature points of the shape model s_I; then s_I^C = (x_1, y_1, \ldots, x_6, y_6, 0, \ldots, 0)^T. The sets s_T^C and s_T^{C*} can be obtained from the tangent shape s_T as:

s_T^C = U s_T, \quad U = \begin{bmatrix} I_{12} & 0_{12\times(2N-12)} \\ 0_{(2N-12)\times 12} & 0_{(2N-12)\times(2N-12)} \end{bmatrix}    (16)

s_T^{C*} = V s_T, \quad V = \begin{bmatrix} 0_{12\times 12} & 0_{12\times(2N-12)} \\ 0_{(2N-12)\times 12} & I_{2N-12} \end{bmatrix}    (17)

where I_d is the d × d identity matrix and 0_{d×k} is the d × k zero matrix. Given the tangent shape s_T, the tracked six corners s_I^C can be expressed as:

s_I^C = U T_\gamma(s_T) + \zeta    (18)
where ζ is the isotropic observation/tracking noise of the prominent feature estimation, ζ ∼ N(0, δ² U), with δ currently set to the average local pattern distance of the six corners between two subsequent frames in the LK tracking. The six reliable corners are expected to impose a constraint on the other points of the shape, making the full shape tracker more reliable.
4.3 Constrained Shape Model Optimization
Given the tangent shape model s_T, the image shape model s_I at frame I_{t−1}, and the two prior constraints, namely the motion parameters γ̃ and the prominent facial features s_I^C at frame I_t, the posterior of the proposed constrained shape model (CSM) parameters (b, γ) can be formulated as:

p(b, \gamma \mid s_T, s_I, s_I^C, \tilde{\gamma}) = p(b \mid s_T)\, p(\gamma \mid s_T, s_I)\, p(\gamma \mid s_T, s_I^C)\, p(\gamma \mid \tilde{\gamma})    (19)

where the posterior can be separated into the product of these four distributions thanks to the introduction of the tangent shape as a hidden variable:

p(b \mid s_T) \propto \exp\left( -\tfrac{1}{2} \left[ b^T \Lambda^{-1} b + \sigma^{-2} \| s_T - \mu - \Phi_r b \|^2 \right] \right)    (20)

p(\gamma \mid s_T, s_I) \propto \exp\left( -\tfrac{1}{2} \rho^{-2} \| s_I - T_\gamma(s_T) \|^2 \right)    (21)

p(\gamma \mid s_T, s_I^C) \propto \exp\left( -\tfrac{1}{2} \delta^{-2} \| s_I^C - U T_\gamma(s_T) \|^2 \right)    (22)

p(\gamma \mid \tilde{\gamma}) \propto \exp\left( -\tfrac{1}{2} (\gamma - \tilde{\gamma})^T \mathrm{diag}(\eta)^{-1} (\gamma - \tilde{\gamma}) \right)    (23)

where (20), (21), (22) and (23) model the distributions of the tangent shape, the motion model mapping the tangent space to the image space, the prominent features and the prior motion, respectively. In the following we only derive the most complicated equation, (20); the other distributions can be obtained in a similar way:

p(b \mid s_T) = p(b)\, p(s_T \mid b)
= \frac{1}{(2\pi)^{r/2} |\Lambda|^{1/2}} \exp\left( -\tfrac{1}{2} b^T \Lambda^{-1} b \right) \cdot \frac{1}{(2\pi)^{N} \sigma^{2N}} \exp\left( -\tfrac{1}{2} [s_T - \mu - \Phi_r b]^T [\sigma^2 I_{2N}]^{-1} [s_T - \mu - \Phi_r b] \right)
\propto \exp\left( -\tfrac{1}{2} \left[ b^T \Lambda^{-1} b + \sigma^{-2} \| s_T - \mu - \Phi_r b \|^2 \right] \right)    (24)

The general EM algorithm is applied to compute the MAP estimate of (b, γ) using s_T as the hidden variable:

(\hat{b}, \hat{\gamma}) = \arg\max_{(b,\gamma)} \; p(b, \gamma \mid s_T, s_I, s_I^C, \tilde{\gamma})    (25)
In the expectation step we can derive the conditional expectation of the logarithm of the posterior as:

Q(b, \gamma \mid b_{old}, \gamma_{old}) = E\left[ \log p(b, \gamma \mid s_T, s_I, s_I^C, \tilde{\gamma}) \right]
= -\tfrac{1}{2} b^T \Lambda^{-1} b - \tfrac{1}{2} \sigma^{-2} E\left[ \| s_T - \mu - \Phi_r b \|^2 \right] - \tfrac{1}{2} \rho^{-2} E\left[ \| s_I - T_\gamma(s_T) \|^2 \right] - \tfrac{1}{2} \delta^{-2} E\left[ \| s_I^C - U T_\gamma(s_T) \|^2 \right] - \tfrac{1}{2} (\gamma - \tilde{\gamma})^T \mathrm{diag}(\eta)^{-1} (\gamma - \tilde{\gamma})    (26)

By setting the gradient of Q w.r.t. (b, γ) to zero, we obtain the final constrained shape model parameters from the update formulas of the maximization step of the EM:

b = \sigma^{-2} \left( \sigma^{-2} I + \Lambda^{-1} \right)^{-1} \Phi_r^T E(s_T)

\gamma = \left[ \rho^{-2} E\left( X^T X \right) + \delta^{-2} E\left( X_C^T X_C \right) + \mathrm{diag}(\eta) \right]^{-1} \left[ \rho^{-2} E\left( X^T s_I \right) + \delta^{-2} E\left( X_C^T s_I^C \right) + \mathrm{diag}(\eta)\, \tilde{\gamma} \right]

where X = (s_T, \breve{s}_T, e, \breve{e}), X_C = (s_T^C, \breve{s}_T^C, e^C, \breve{e}^C), \breve{x} stands for rotating the coordinates of the shape by 90 degrees, and e = (1, 0, \ldots, 1, 0)^T.
5
Experimental Results and Conclusions
The proposed tracking method has been evaluated using five sequences. Two standard test sequences ’missa’ and ’claire’, and three recorded sequences referred to as ’hou’, ’yl’, and ’pcm’ sequences, respectively. The ’yl’ and ’pcm’ sequences have been recorded for speech recognition, as such mouth motion is prominent. The ’hou’ and ’claire’ sequences contain out of plane head rotations and fast head movements. The ’missa’ sequence is with low quality. In the following the ’pcm’ sequence is used to illustrate the different steps of the prosed method. The face detection and tracking are illustrated in Fig. 2.
Fig. 2. Face Detection Results
The 6 prominent facial feature points can be reliably tracked through all the frames of all the considered sequences. Fig. 3 shows the tracking result for the ’pcm’ sequence.
348
Y. Hou et al.
Fig. 3. Prominent Facial Features Tracking
For the ’claire’ sequence the prominent facial feature points can still be tracked precisely even with the relatively large inter-frame motion between frame 67 and frame 75. Fig. 4 shows, for the the ’claire’ sequence, the improvement of the motion parameters γ = (s, θ, tx, ty) estimation using the proposed constrained LK tracker. As it can be seen smooth scaling and rotation parameters are obtained. In order to compare the tracking performance of the proposed approach and the original BTSM tracking we manually labeled twelve typical points on the eyes and mouth of a sequence. Fig. 6 shows the mean pixel error of each frame. One can notice that the mean pixel error of the proposed shape tracking system (line with blue circles) is below 3 pixels and the average error is about two pixels. However the original BTSM (blue crosses) has a mean error of five pixels. Fig. 5 depicts two tracking results showing the performances between the proposed and the original BTSM tracking method. In summary, we proposed a new method to automatically locate frontal facial feature points under large scene variations (illumination, pose and facial expressions). First, a previously developed kernel-based tracker is used for the detection and tracking of the facial region in an image sequence. Then, the results of the face tracking, i.e. face region and motion parameters, are used to constrain prominent facial feature detection and tracking. In this work, eyes and
Robust Shape-Based Head Tracking
1.15
349
0.06 NonConstrained Constrained
1.1
NonConstrained Constrained
0.04
rotation
scale
0.02 1.05 1
0 −0.02 −0.04
0.95 0.9
−0.06 0
20
40 frame
60
80
−0.08
8
0
20
40 frame
60
80
10 NonConstrained Constrained
6
NonConstrained Constrained 5
4 ty
tx
2 0
0 −2
−5
−4 −6
0
20
40 frame
60
80
−10
0
20
40 frame
60
80
Fig. 4. Constrained v.s. Non-constrained Feature Points Tracking - Motion Parameters
Fig. 5. Original BTSM v.s. Constrained Shape Model - Tracking Results
mouth corners have been considered as prominent facial features. In a final step, we proposed an improvement to the Bayesian Tangent Shape Model for the detection and tracking of a shape model defined by N = 83 facial feature points, together with a constrained regularization algorithm. Extensive experiments demonstrated the accuracy and effectiveness of the proposed method. The proposed tracking system reliably and precisely tracks the face shape through long sequences; moreover, it handles large nonrigid facial expression variations, small out-of-plane rotations and even blurred image quality. Future work will focus on three-dimensional tracking.
45
Fig. 6. Original BTSM v.s. Constrained Shape Model - Tracking Errors
Acknowledgment This research has been conducted within (i) the ”Audio Visual Speech Recognition and Synthesis: Bimodal Approach” project funded in the framework of the Bilateral Scientific and Technological Collaboration between Flanders, Belgium(BILO4/CN/02A) and the Ministry of Science and Technology (MOST), China([2004]487), the fund of ’The Developing Program for Outstanding Persons’ in NPU-DPOP: NO. 04XD0102, and (ii) the IBBT-Virtual Individual Networks (VIN) project, co-funded by the Institute for Broad Band Technology (IBBT).
References
1. Hou, Y., Zhang, Y., Zhao, R.: Robust object tracking based on uncertainty factorization subspace constraints optical flow. In: Hao, Y., Liu, J., Wang, Y.-P., Cheung, Y.-m., Yin, H., Jiao, L., Ma, J., Jiao, Y.-C. (eds.) CIS 2005. LNCS (LNAI), vol. 3802, pp. 875–880. Springer, Heidelberg (2005)
2. Hou, Y., Zhonghua Fu, Y.Z., Zhao, R.: Face feature points extraction based on refined ASM. Chinese Journal of Application Research of Computers 23, 255–257 (2006)
3. Yang, J., Stiefelhagen, R., Meier, U., Waibel, A.: Real-time face and facial feature tracking and applications. In: Proceedings of Auditory-Visual Speech Processing, Terrigal, Australia, pp. 79–84 (1998)
4. Strom, J., Jebara, T., Basu, S., Pentland, A.: Real time tracking and modeling of faces: An EKF-based analysis by synthesis approach. In: Proceedings of the Modelling People Workshop at International Conference on Computer Vision, pp. 55–61 (1999)
5. Bourel, F., Chibelushi, C., Low, A.: Robust facial feature tracking. In: Proceedings of British Machine Vision Conference, Bristol, England, vol. 1, pp. 232–241 (2000)
6. Zhang, Y., Ji, Q.: Active and dynamic information fusion for facial expression understanding from image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 699–714 (2005)
7. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In: Proceedings of European Conference on Computer Vision, vol. 2, pp. 484–498 (1998)
8. Cootes, T.F., Taylor, C.J.: Constrained active appearance models. In: Proceedings of IEEE International Conference on Computer Vision, vol. 1, pp. 748–754 (2001)
9. Zhou, Y., Gu, L., Zhang, H.: Bayesian tangent shape model: estimating shape and pose parameters via Bayesian inference. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 109–116 (2003)
10. Zhou, Y., Zhang, W., Tang, X., Shum, H.: A Bayesian mixture model for multiview face alignment. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 741–746. IEEE, Los Alamitos (2005)
11. Zhang, W., Zhou, Y., Tang, X., Deng, J.: A probabilistic model for robust face alignment in videos. In: Proceedings of IEEE International Conference on Image Processing, vol. 3, pp. 11–14. IEEE, Los Alamitos (2005)
12. Liang, L., Wen, F., Xu, Y., Tang, X., Shum, H.Y.: Accurate face alignment using shape constrained Markov network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 1313–1319. IEEE, Los Alamitos (2006)
13. Ravyse, I., Enescu, V., Sahli, H.: Kernel-based head tracker for videophony. In: The IEEE International Conference on Image Processing 2005 (ICIP2005), Genoa, Italy, 11-14/09/2005, vol. 3, pp. 1068–1071. IEEE, Los Alamitos (2005)
14. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the International Joint Conference on Artificial Intelligence, Vancouver, pp. 674–679 (1981)
15. Zivkovic, Z., Kröse, B.: An EM-like algorithm for color-histogram-based object tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04), Washington, D.C., USA, June 27 - July 02, 2004, vol. 1, pp. 798–803. IEEE, Los Alamitos (2004)
Evaluating Descriptors Performances for Object Tracking on Natural Video Data
Mounia Mikram, Rémi Mégret, and Yannick Berthoumieu
Laboratoire IMS, Département LAPS, UMR 5218 CNRS, Université Bordeaux 1-ENSEIRB-ENSCPB, Talence, France
Abstract. In this paper, a new framework is presented for the quantitative evaluation of the performance of appearance models composed of an object descriptor and a similarity measure in the context of object tracking. The evaluation is based on natural videos, and takes advantage of existing ground-truths from object tracking benchmarks. The proposed metrics evaluate the ability of an appearance model to discriminate an object from the clutter. This allows comparing models which may use diverse kinds of descriptors or similarity measures in a principled manner. The performances measures can be global, but time-oriented performance evaluation is also presented. The insights that the proposed framework can bring on appearance models properties with respect to tracking are illustrated on natural video data.
1 Introduction
A large number of algorithms for visual object tracking have been proposed in the literature. Their true performance can be difficult to quantify and compare, for two reasons: data complexity and system complexity. First, benchmark videos need to be available for the targeted application, with ground-truth information [1][2][3], which represents a large amount of work in order to get sufficient and representative data. Second, a video object tracking system is a complex system, which can be conceptually decomposed into at least three elements:
– an appearance model, which expresses how an object should look in one image,
– an optimization algorithm, which tries to estimate the object position that optimizes the match between the actual appearance and the appearance model,
– spatio-temporal constraints, which give an a priori on the position of the object depending on past tracking.
In the context of visual object tracking, many different methods for measuring the performance of a system have been proposed and have led to automatic benchmark evaluation of tracking algorithms [4][5][6][7]. Such benchmarks tackle the system complexity issue thanks to the black-box approach. They indeed ignore the internal composition of the tracking system, and consider only the output of the system when provided with raw video data.
This approach produces global performances for a given application context. It also allows comparing trackers whose source code is not available. Nevertheless, only the external behavior can be known: the best hypothesis is returned, but the possibility that another result may or may not have been returned is not considered. It is therefore more difficult to characterize precisely the reasons for a good or a bad performance, which is needed when trying to improve the algorithms. In this paper, we propose to complement the standard black-box evaluation paradigm with additional tools to evaluate an important internal component of the tracking system, in an effort to get better insight into the tracking performances. The presented approach is more focused, as it does not evaluate a system as a whole, but suggests new measures to examine the properties of the appearance model, and more precisely its validity over time. The performances should as much as possible avoid the influence of the kind of optimization used, or the help of spatio-temporal constraints. More specifically, we deal with an appearance model which is defined as a similarity measure between a current descriptor and a reference one. In such a case the object descriptor is a key component of the tracking process: it represents the appearance of the object numerically, which makes it the main source of information for the rest of the algorithm. Its performance cannot be separated from the similarity measure used to compare it to the object model, as different similarity measures may yield very different performances for the same descriptor. Performance evaluation of image descriptors has been studied [8][9][10] in the context of content-based image retrieval (CBIR), where image retrieval in a database is based on the appearance of the images globally represented by a feature descriptor. The problem, although similar, is not totally identical to the problem tackled in this paper, as retrieving an image among other images is not identical to estimating the location of an object inside the clutter of a video image. We propose to adapt the image retrieval paradigm for performance evaluation in a tracking context. Our approach uses the ground-truth data of standard video tracking benchmarks to generate a database specifically tuned for the evaluation of object descriptor performances with respect to tracking. Information retrieval metrics such as the precision-recall metric are modified to take into account the time information, which is specific to the tracking context. The remainder of this paper is outlined as follows. In Section 2, the state-of-the-art in descriptor performance evaluation is presented and discussed with respect to the tracking context. In Section 3, the framework for performance evaluation for tracking using natural video is detailed. Finally, in Section 4, experimental results show how the proposed framework can provide objective measures for comparing appearance models.
2 Related Work
Before presenting the proposed descriptor evaluation framework, let us first recall the content-based image retrieval (CBIR) context in which most of descriptor
evaluation works have taken place. A review on CBIR is given in [8]. CBIR can be formalized as the task of finding the most relevant images from an image database with respect to a query and the content of the image. A special case of retrieval is exemplar-based, where the query is itself an image and the result images are the most similar images of the database with respect to a given descriptor and similarity measure. The notion of relevance is generally defined as a set of predefined classes, which form the ground-truth data: an image is relevant only if it belongs to the same class as the query. The retrieval performance is most commonly measured by precision and recall based metrics. Both are computed on a sorted list of descriptor similarities. The precision P and recall R are defined for N_R retrieved images as in equation (1):

P = \frac{N_{RR}}{N_R}, \qquad R = \frac{N_{RR}}{\overline{N}_R}    (1)
where N_{RR} is the number of relevant images among the retrieved images, and \overline{N}_R is the total number of relevant images in the database. The recall R measures the capacity of retrieving all relevant descriptors in the database; the precision P measures the retrieval accuracy. Deselaers et al. compared quantitatively different well-known features for CBIR [9]. Muller et al. discussed performance evaluation of retrieval systems and proposed a set of quantitative performance measures for comparing CBIR systems [10]. In retrieval systems, databases are defined as a set of classes that can each be composed of images of a single object, of a type of object, or of a type of scene. Such a data set can be found, for instance, in the Amsterdam library [11], which is a color image collection of object images recorded under various imaging conditions (viewing angle, illumination angle, and illumination color). Two main issues arise when considering this framework from a tracking point of view. First, the time dimension is lost during performance evaluation. Second, the image database corresponds to objects, and not necessarily to the clutter and the distracters that are situated around it, and that depend on each specific video. This arises in particular from the use of image data captured without a time dimension. In contrast, we will deal with natural video, where the objects of interest are surrounded by distracters and the time dimension is taken into account. The notion of ranking used in the precision-recall framework is nevertheless a very powerful paradigm, which allows us to compare the performances of appearance models of different kinds that could not be compared directly. In the following section, we present how to adapt this framework to take into account the specificities of the tracking context.
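The following is a minimal sketch of equation (1) for one query; the function name and the boolean-array input convention are illustrative assumptions, not from the paper.

```python
# Precision and recall after retrieving the N_R most similar items from a
# ranked list of database items.
import numpy as np


def precision_recall(relevant_sorted, n_retrieved):
    """relevant_sorted: relevance flags sorted by decreasing similarity to the query."""
    relevant_sorted = np.asarray(relevant_sorted, dtype=bool)
    n_rr = int(relevant_sorted[:n_retrieved].sum())   # relevant among the retrieved
    n_rel_total = int(relevant_sorted.sum())          # relevant in the whole database
    precision = n_rr / n_retrieved
    recall = n_rr / n_rel_total
    return precision, recall
```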
3 Performance Evaluation
3.1 Modeling of the Tracking System
The task of tracking an object n in a frame t will be modeled as the task of finding an estimate bn,t for the correct bounding-box b∗n,t of the object in
image I_t. Standard black-box evaluation methods [1][5][4] first apply the tracking system, which outputs b_{n,t}, and compare it to the ground-truth b^*_{n,t} for frame t using some error measure e depending on the bounding boxes or on some features computed on them:

e_{n,t} = e(b_{n,t}, b^*_{n,t})    (2)

For instance, the Euclidean distance between the centers of the boxes is used in [6]. This error can then be thresholded in order to decide whether the object was correctly detected or not. The reader interested in black-box evaluation may also look at multi-object tracking evaluation [12]. In order to evaluate an appearance model M more specifically, the tracking algorithm is additionally modeled as follows. With respect to the appearance model M, any bounding-box can be associated to a descriptor v^M_{n,t} on image I_t:

v^M_{n,t} = v_M(I_t, b_{n,t})    (3)
The object appearance model is defined by a reference descriptor v^{*M}_{n,t_{ref}} associated to the ground-truth bounding-box b^*_{n,t_{ref}} and computed on the reference image I_{t_{ref}}:

v^{*M}_{n,t_{ref}} = v_M(I_{t_{ref}}, b^*_{n,t_{ref}})    (4)
The likelihood that a given bounding box is the correct one and may be chosen by the tracking algorithm is captured by a similarity measure s_M between the reference descriptor and the current descriptor. The higher the similarity s^M, the higher the likelihood that the bounding-box will be chosen by the tracking algorithm:

s^M = s_M(v^M_{n,t}, v^{*M}_{n,t_{ref}})    (5)

Some refinements consider instead a similarity s^M = s_M(v^M_{n,t}, v^{*M}_{n,t_{ref}}, \omega^M_{n,t}), which also takes into account contextual information \omega^M_{n,t}. For example, [13] proposes a color based tracking method where the color distribution in the background is used to decrease the influence of pixels located in the bounding-box but belonging to the background. Although it will not be detailed here, the ground-truth could be used as an oracle for providing such contextual information. In the proposed framework, the evaluation quantifies the performance of a given appearance model to discriminate between correct positions and incorrect positions. The design of this evaluation is detailed in the next paragraph.
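Concretely, an appearance model in this framework is a pair (descriptor function, similarity measure), as in equations (3)-(5). The sketch below is an illustration only; the grey-level histogram descriptor and Bhattacharyya-type similarity are example choices introduced here, not prescribed by the paper.

```python
# Example of the appearance-model abstraction: v_M(image, box) and
# s_M(v, v_ref). Any other descriptor/similarity pair could be plugged in.
import numpy as np


def descriptor_gray_hist(image, box, bins=256):
    """v_M(I_t, b_{n,t}): grey-level histogram of the box content (example descriptor)."""
    x, y, w, h = box
    patch = image[y:y + h, x:x + w]
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256), density=True)
    return hist


def similarity(v, v_ref):
    """s_M(v, v*): Bhattacharyya coefficient between normalized histograms (example)."""
    return float(np.sum(np.sqrt(v * v_ref)))
```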
3.2 Conception of a Descriptor Database
For each object n, a database is built, composed of items (t, bn,t , vn,t ), which hold the descriptor value together with the frame and the position at which it was computed. The set of bounding-boxes corresponds to a sampling of the bounding-boxes state space that covers the bounding-boxes the tracking system may be considering. In our experiments, this is done by translating the true bounding-box by a random vector. Given this sampling, the corresponding items are assigned to one of the following classes, as illustrated in figure 1:
– A target class of inlier items, which holds items from all images where the object appears and that have an acceptable position b_{n,t} ∈ B^{in}_{n,t}.
– A clutter class of outlier items, which have an incorrect position b_{n,t} ∈ B^{out}_{n,t}.
– A class of discarded items, which are not close enough to be considered inliers and not far enough to be considered outliers.
Fig. 1. Bounding-boxes database design for object n = 1. The inlier boxes are translated by a small distance from the ground-truth box. The outlier boxes do not overlap the ground-truth box.
The decision to include an item in the target or the clutter class depends on a threshold on the location error:

b_{n,t} \in B^{in}_{n,t} \quad \text{if } e(b_{n,t}, b^*_{n,t}) < e_{in}    (6)

b_{n,t} \in B^{out}_{n,t} \quad \text{if } e(b_{n,t}, b^*_{n,t}) > e_{out}    (7)
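The class assignment of equations (6)-(7) amounts to a simple three-way test per perturbed box, as in the following minimal sketch (names are illustrative assumptions).

```python
# Assign a perturbed bounding box to the inlier, outlier or discarded class
# from its location error e(b, b*) and the two thresholds.
def assign_class(error, e_in, e_out):
    if error < e_in:
        return 'inlier'       # close enough to the ground-truth box
    if error > e_out:
        return 'outlier'      # clearly in the clutter
    return 'discarded'        # ambiguous: neither inlier nor outlier
```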
The values of the thresholds e_in and e_out are free parameters that need to be fixed depending on the application. They are chosen with the following guidelines: e_in should be of the same order as the imprecision of the ground-truth (typically a couple of pixels), so that all inliers can be considered approximation errors not far from a perfect estimate; e_out should be of the same order as the threshold usually used for deciding that an object is incorrectly detected (see figure 1). In our experiments, a bounding box is an outlier when it does not overlap the true bounding-box. Such a database can be built from the ground-truth data used in manually annotated benchmarks such as PETS [2][5][14], ViPER [1], CAVIAR [3], or semi-synthetic benchmarks [14]. The video sequences and associated ground-truth used in our evaluation come from the CAVIAR project [3] (see figure 2).
3.3 Framewise Performance Measures
Given one query object n with a model v^*_{n,t_{ref}} computed on a reference image I_{t_{ref}}, the objective is to evaluate whether the inlier descriptors v_{n,t,i} computed on b_{n,t,i} ∈ B^{in}_{n,t} are more similar to v^*_n than the outlier descriptors v_{n,t,j} computed on b_{n,t,j} ∈ B^{out}_{n,t}.
Fig. 2. Some frames of two of the CAVIAR sequences (first row: seq 1, second row: seq 4) used to illustrate the proposed approach, with the ground-truth positions for each object n
After sorting all descriptors in frame t in descending order of similarity with respect to the reference, let us denote by r^{in}_{n,t_{ref},t} the rank of the most similar inlier, r^{out}_{n,t_{ref},t} the rank of the most similar outlier and r^{in2}_{n,t_{ref},t} the rank of the least similar inlier. Analogous notations are used for the corresponding similarities s_{n,t_{ref},t}. Using a distance d_{n,t_{ref},t} instead of a similarity is possible, as it simply involves sorting by ascending distance order for the rank estimation. For a given (t_{ref}, t) frame pair, the discriminatory power c^M_{n,t_{ref},t} of the appearance model M for object n is quantified into several categories:
– Non discriminating (c^M_{n,t_{ref},t} = 0) when the most similar descriptor is an outlier (r^{out}_{n,t_{ref},t} = 1).
– Discriminating, or partially discriminating (c^M_{n,t_{ref},t} ≥ 1) when the best outlier is less similar than at least one inlier (from a rank point of view, r^{out}_{n,t_{ref},t} > r^{in}_{n,t_{ref},t}).
– Fully discriminating (c^M_{n,t_{ref},t} = 2) when the best outlier is less similar than every inlier (r^{out}_{n,t_{ref},t} > r^{in2}_{n,t_{ref},t}).
These results can be conveniently represented in matrix form, where each row represents a reference frame t_{ref} and each column the tested frame t. This is illustrated in figure 3, where the distances d^{in}_{n,t_{ref},t} and d^{out}_{n,t_{ref},t} as well as the discriminatory power c^M_{n,t_{ref},t} are shown for the following appearance models, which will be used in this paper.
The first model, M GH (for Gray-level Histogram), corresponds to a 256 bins gray-level histogram computed on the content of the bounding-box, with Matusita distance (which is equivalent to the Bhattacharyya distance). The second model, M GT (for Gray-level Template), corresponds to a gray level template obtained by warping the bounding-box content to a 20×20 pixels image. It is compared using Euclidean distance.
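For one (t_ref, t) pair, the three categories above reduce to comparing the best and worst inlier similarities with the best outlier similarity. The sketch below is an illustrative reading of those definitions, not code from the paper.

```python
# Framewise discriminatory power c in {0, 1, 2} for one (t_ref, t) pair,
# from the similarities of the inlier and outlier descriptors to the
# reference descriptor.
def discriminatory_power(inlier_sims, outlier_sims):
    best_inlier = max(inlier_sims)
    worst_inlier = min(inlier_sims)
    best_outlier = max(outlier_sims)
    if best_outlier >= best_inlier:
        return 0        # non discriminating: an outlier ranks first
    if best_outlier < worst_inlier:
        return 2        # fully discriminating: every inlier beats the best outlier
    return 1            # partially discriminating
```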
Fig. 3. Best inlier distance d^{in}_{n,t_{ref},t} (left), best outlier distance d^{out}_{n,t_{ref},t} (center) and discriminatory power c^M_{n,t_{ref},t} (right) using model M GH (top row) or M GT (bottom row), for object n = 7 of sequence 1 (see figure 2)
This first representation calls for a couple of comments. First, it is clear that the dynamics of the two descriptor distances are different. For this reason, direct comparison of similarities (resp. distances in the example) must be avoided. Since only the rank order computed within the same similarity measure and the same descriptor is used, the proposed approach does not make any assumption on the dynamics of the similarity measures. As a consequence, different types of descriptors or similarities can be compared based on the discriminatory power. The temporal invariance of the appearance model is therefore implicitly evaluated through the discriminative performance between the object and the clutter when the reference frame t_{ref} is different from the test frame t. Second, the diagonal of the matrix d^{in}_{n,t_{ref},t} corresponds to looking for an object in the same frame the reference is computed on, and should therefore always be at least partially discriminating (c^M_{n,t_{ref},t} ≥ 1). When moving away from the diagonal, the time distance between the reference frame and the tested
frame increases. This is associated with an increase of the inlier distance, which indicates a change of appearance of the object with time. Third, although the same frames are represented for rows t_{ref} and columns t, the matrices are not strictly symmetrical. Indeed, one row t_{ref} corresponds to measures associated to the true bounding-box in frame t_{ref}, whereas one column t corresponds to measures computed on the set of perturbed bounding-boxes in frame t. For the best inlier distance measure, the matrix is usually close to symmetrical, as the descriptor computed on the true bounding-box is very close to the best inlier descriptor. This can be observed in particular in figure 3 (bottom-left), as the object appearance is modified between frames 60 and 70, resulting in a visibly higher distance on the corresponding rows and columns. Outlier distance matrices are instead organised in columns with consistent distances. Indeed, a distracter may be present in a frame t but not in other frames. As an outlier bounding-box b^{out}_{n,t} overlapping the distracter is taken into account for the computation of a whole column t, this results in a column with a consistently low distance. This can be observed in figure 3 (top-center), where the outliers seem to be more dissimilar to the target object at the end of the sequence (columns t > 150) than at the beginning (columns t < 60). This will have an influence on the integrated performance measures presented next. Finally, the sensitivity of the appearance model to typical appearance changes can be revealed by the best inlier distance matrix. In particular, the image template model M GT is shown in figure 3 (bottom-left) to be sensitive to the deformation of the tracked person, as the 30-frame periodicity of the leg motion appears as darker lines parallel to the diagonal. This property was used in [15] to detect periodic motion. The best inlier distance representation shows that this behavior is not shared by the M GH model.
3.4 Integrated Performance Measures
Several quantitative properties can be extracted from the framewise measures. For an appearance model M, a measure of overall performance can be associated to each object n by calculating the proportion D_n of pairs (t_{ref}, t) for which the model is discriminating:

D_n^M = \frac{\#\left\{ (t_{ref}, t) \mid c^M_{t_{ref},t} \geq 1 \right\}}{\#\{t\}\,\#\{t_{ref}\}}    (8)

where #{t_{ref}} = #{t} is the number of frames in which object n appears. The temporal aspect is thus taken into account inside the performance matrices, and then integrated into a global measure. On the example used before, the M GH model is discriminant D_7^{MGH} = 89% of the time, and the M GT model is discriminant D_7^{MGT} = 90% of the time. These global results are very close to one another, whereas it seems the two approaches do not have the same properties. In order to get more specific numerical measures of the ability of an appearance model to remain discriminating after some time, the discrimination rate with respect to the time-distance Δt is defined by integration:
D_n^M(\Delta t) = \frac{\#\left\{ (t_{ref}, t) \mid c^M_{t_{ref},t} \geq 1 \text{ and } t - t_{ref} = \Delta t \right\}}{\#\left\{ (t_{ref}, t) \mid t - t_{ref} = \Delta t \right\}}    (9)
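Given the matrix of framewise discriminatory powers, equations (8) and (9) are simple averages, as sketched below (an illustrative reading; names are assumptions).

```python
# Global discrimination rate D (eq. 8) and its time-distance profile D(dt)
# (eq. 9), computed from the matrix c[t_ref, t] of framewise powers.
import numpy as np


def discrimination_rate(c):
    """D = proportion of (t_ref, t) pairs with c >= 1."""
    return float((c >= 1).mean())


def discrimination_rate_vs_dt(c, dt):
    """D(dt): same proportion restricted to pairs with t - t_ref = dt."""
    t_ref, t = np.indices(c.shape)
    mask = (t - t_ref) == dt
    return float((c[mask] >= 1).mean())
```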
This is illustrated in figure 4, which compares the performances of the two descriptors of figure 3 with respect to the time-distance. The periodic appearance change due to leg motion is reflected in the performance of the image template descriptor, which shows that the performance of the model M GT is decreased by this phenomenon. By contrast, the model M GH is not perturbed by the leg motion, but it has a lower performance for negative Δt in this case, which is explained by the presence of a distracter between frames 1 and 60, as was observed for t < 60 in the discriminatory matrix in figure 3.
Fig. 4. Discrimination rate D_n^M(Δt) with respect to time-distance Δt for model M GH (left) and M GT (right) on object n = 7 in sequence 1 (see figure 2, top row)
When considering the results in figure 3, one can observe a loss of the discriminatory power around (t_{ref}, t) = (150, 40) for M GH and around t = 55 or t_{ref} = 55 for M GT. It is interesting to note that the two models do not have the same failure modes, as the values of (t_{ref}, t) that correspond to a non-discriminating situation are different in the two cases. This is also visible in the D_n^M(Δt) measure, where the best model is not the same in all situations. For that reason, it is also interesting to determine, for a couple of appearance models M1 and M2, whether they fail in the same situations or exhibit complementary behaviors. This is done by identifying the proportion of situations for which each one is discriminating while the other is not. The situationwise comparative performance for model M1 to be superior to model M2 is defined as:

D_n^{M1>M2} = \frac{\#\left\{ (t_{ref}, t) \mid c^{M1}_{t_{ref},t} \geq 1 \text{ and } c^{M2}_{t_{ref},t} = 0 \right\}}{\#\{t\}\,\#\{t_{ref}\}}    (10)

The inverse situation is quantified by D_n^{M2>M1}. These results can be summarized for several objects by associating to each object n a point (D_n^{M1}, D_n^{M2}) that represents the global performances of the two appearance models, and a point (D_n^{M1>M2}, D_n^{M2>M1}) that represents the situationwise comparative performances.
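Equation (10) compares two models through their framewise matrices; a minimal sketch follows (names are illustrative assumptions).

```python
# Situationwise comparative performance D^{M1>M2} of equation (10), from the
# framewise discriminatory power matrices c1 and c2 of the two models.
import numpy as np


def situationwise_superiority(c1, c2):
    """Proportion of (t_ref, t) pairs where M1 is discriminating and M2 is not."""
    return float(np.mean((c1 >= 1) & (c2 == 0)))
```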
Fig. 5. Comparison of the M 1 = M GH and M 2 = M GT appearance models, according to the global discriminative performance DM (left) and to the situationwise comparative performance (DM 1>M 2 , DM 2>M 1 ) (right). Each point is associated to an object and is labeled by ‘sequence id/object id’. The two objects on which the performance measures have been detailed in the text are highlighted.
Fig. 6. Discriminatory power $c^M_{n,t_{ref},t}$ for models MGH (left) and MGT (center) and discrimination rate $D_n^M(\Delta t)$ for both models (right) on object n = 0 in sequence 4 (see figure 2, bottom row)
When $(D_n^{M1>M2}, D_n^{M2>M1}) \approx (0, 0)$, the two appearance models have the same behavior and fail in the same situations. When $D_n^{M1>M2} \approx 0$ and $D_n^{M2>M1}$ is not close to 0, model M2 is systematically better than model M1. When both values are not close to 0, the two models are complementary and fail in different situations. This representation is shown in figure 5, where the MGT appearance model is shown to be more discriminative than the MGH model on sequence 1, but the opposite holds in the other four sequences analyzed. Such a representation is useful to give an overview of the different types of failure modes. The results for object 7 in sequence 1 were already discussed in section 3.3. A different situation is shown for object 0 in sequence 4, where the success rate is lower. More detailed results for this case are shown in figure 6: the MGT
model is temporally valid only for a short time, whereas the MGH model stays valid longer, until a more abrupt change appears around frame 320.
4
Conclusion
The present paper addressed the evaluation of the performance of appearance models composed of a feature descriptor and a similarity measure for tracking. The proposed framework builds on previous descriptor evaluation frameworks with the following contributions. First, the time aspect is taken into account at all levels, from the design of a specific database structure to the proposal of new performance measures that allow comparing descriptors in a tracking context. Second, the discrimination is here considered between an object and its nearby clutter, which is more relevant to the tracking problem than discriminating between object classes. Finally, existing tracking benchmark datasets can be leveraged by the new framework, even though they were designed and used with other kinds of performance measures in mind. The proposed measures have been applied to natural video data to illustrate the kind of qualitative and quantitative insight they can bring to the study of the properties of feature descriptors and similarities. The focus of this article was to present and explain the framework and the proposed measures. Future work will apply this framework to a broader range of appearance models, such as color distribution based models [13] and models with some spatial information [16], where the influence of parameters such as the number of bins in the histograms or the use of the background color distribution in the similarity measure [13] can be analyzed. The extension of the video corpus to different benchmark sources will also help cover more types of failure modes. The proposed approach should not be considered as a replacement, but as a complement to existing black-box performance evaluation benchmarks for tracking: it produces objective results specifically on the appearance model aspect. Correlating these results with those obtained with the black-box approach may be interesting to get more insight into the interaction of the appearance model with the other parts of the tracking system.
References 1. Doermann, D., Mihalcik, D.: Tools and techniques for video performance evaluation. In: International Conference on Pattern Recognition, Barcelona, vol. 4, pp. 4167–4170 (2000) 2. Jaynes, C., Webb, S., Steele, R.M., Xiong, Q.: An open development environment for evaluation of video surveillance systems. In: International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), pp. 32–39 (2002) 3. CAVIAR: EU funded project, IST 2001 37540 (2004), http://homepages.inf.ed.ac.uk/rbf/CAVIAR/ 4. Schneiders, S., Loos, T.J.H., Niem, W.: Performance evaluation of a real time video surveillance systems. In: International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), pp. 15–16 (2005)
5. Brown, L., Senior, A., Tian, Y., Connell, J., Hampapur, A., Shu, C., Merhl, H., Lu, M.: Performance evaluation of surveillance systems under varying conditions. In: International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), Colorado (2005) 6. Bashir, F., Porikli, F.: Performance evaluation of object detection and tracking systems. In: International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), pp. 7–14 (2006) 7. Schlogl, T., Beleznai, C., Winter, M., Bischof, H.: Performance evaluation metrics for motion detection and tracking. In: International Conference on Pattern Recognition, vol. 4, pp. 519–522 (2004) 8. Smeulders, A., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval: The end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12), 1349–1380 (2000) 9. Deselaers, T., Keysers, D., Ney, H.: Features for image retrieval: A quantitative comparison. In: Rasmussen, C.E., B¨ ulthoff, H.H., Sch¨ olkopf, B., Giese, M.A. (eds.) Pattern Recognition. LNCS, vol. 3175, pp. 228–236. Springer, Heidelberg (2004) 10. Muller, H., Muller, W., Squire, D.M., Marchand-Maillet, S., Pun, T.: Performance evaluation in content-based image retrieval: Overview and proposals. Pattern Recognition Letters 22(5), 593–601 (2001) 11. Geusebroek, J., Burghouts, G., Smeulders, A.: The Amsterdam library of object images. International Journal of Computer Vision 61(1), 103–112 (2005) 12. Smith, K., Ba, S., Odobez, J., Gatica-Perez, D.: Evaluating multi-object tracking. In: CVPR Workshop on Empirical Evaluation Methods in Computer Vision (EEMCV), San Diego, CA (2005) 13. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 564–575 (2003) 14. Black, J., Elis, T., Rosin, P.: A novel method for video tracking performance evaluation. In: International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), pp. 125–132 (2003) 15. Cutler, R., Davis, L.S.: Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 781–796 (2000) 16. Birchfield, S.T., Rangarajan, S.: Spatiograms versus histograms for region-based tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1158–1163. IEEE Computer Society Press, Los Alamitos (2005)
A Simple and Efficient Eigenfaces Method Carlos Gómez and Béatrice Pesquet-Popescu Signal and Image Processing Dept., Ecole Nationale Supérieure de Télécommunications, Paris, F-75634 CEDEX 13 France
[email protected],
[email protected] http://www.tsi.enst.fr/~pesquet/
Abstract. This paper first presents a review of eigenface methods for face recognition and then introduces a new algorithm in this class. The main difference with previous approaches lies in the way the database is represented. Classically, an image is exploited as a single vector obtained by concatenating its rows, while here we simply use all the rows as separate vectors during the training and recognition stages. The new algorithm reduces the computational complexity of the classical eigenface method and also achieves a higher recognition rate. It is compared with other algorithms based on wavelets, which also aim at reducing the computational burden. The most efficient wavelet families and other relevant parameters are discussed. Index Terms: Face recognition, eigenfaces, wavelets, PCA, complexity reduction.
1 Introduction Among the most widely used and well documented techniques for biometric recognition is face recognition. A large literature on face recognition exists, and some of the most promising techniques are based on eigenfaces, elastic matching, neural networks [5, 6] or kernel Principal Component Analysis (PCA) [7]. The eigenface method originally proposed by Turk and Pentland [2] is based on projecting all faces, represented as vectors, onto a basis of the face space. As a continuation, Fisherfaces [8], based on eigenfaces but introducing the concept of inter-face and intra-face relationships, reaches better results with a more complex algorithm. Nevertheless, both methods suffer from the same problem, namely that they cannot be used for large databases due to their computational complexity. A solution to this problem was proposed by applying a PCA algorithm on different subbands of wavelet coefficients, but the results were slightly worse than with the original eigenfaces method [3]. Different approaches, such as the discrete wavelet transform based on fiducial points and jet functions [4], have been proposed to cope with the dimensionality increase. Their main problem is the need for a manual training stage for the fiducial points, which can change the performance of the method. This also implies that, with the same set of images, the method will not converge to the same results in two separate trainings, even when the person performing the training stage is the same. This fact makes the performance of the method unstable. For more references see [1].
The method presented in this paper reduces the computational complexity of the classical eigenface technique for large databases without complicating the algorithm, while at the same time making it more efficient. It also offers some advantages in terms of training time. The paper is organized as follows: in Section 2 we present the state of the art on eigenface algorithms. The new method is described in Section 3. In Section 4 the experimental results are provided, and in Section 5 an overview and final conclusions are drawn.
2 State of the Art In this section we review some well-known methods with which we shall compare the proposed method. 2.1 Eigenfaces The original eigenface method was proposed by Turk and Pentland [2] in 1991. It is a simple method based on PCA. Consider a set of training faces $\Gamma_1, \Gamma_2, \Gamma_3, \ldots, \Gamma_M$, each one a vector of size $N^2$ describing an image of size $N \times N$, where N is the number of rows and columns. The average face is
$$\Psi = \frac{1}{M} \sum_{n=1}^{M} \Gamma_n \qquad (1)$$
where M is the number of images in the set and each face differs from the average face by $\Phi_i = \Gamma_i - \Psi$. We want to find the set of most representative orthonormal vectors $u_n$ associated with the largest eigenvalues $\lambda_k$. The eigenvectors and eigenvalues are those of the covariance matrix C, computed as follows:
$$C = \frac{1}{M} \sum_{n=1}^{M} \Phi_n \Phi_n^T = A A^T \qquad (2)$$
where $A = [\Phi_1\, \Phi_2 \ldots \Phi_M]$. Obtaining the eigenvectors of the matrix C of size $N^2 \times N^2$ is an intractable task for typical image sizes; we need a computationally feasible method to find these eigenvectors. Fortunately, we can solve a (generally) much smaller $M \times M$ problem and take linear combinations of the resulting vectors. The main difficulty with this approach arises for large databases, because M then becomes comparable to $N^2$ and the task remains computationally intractable. Nevertheless, with a relatively small set of faces we can find the eigenvectors of $L = A^T A$, which are equivalent to those obtained from the covariance matrix, but L is of size $M \times M$, where M is the number of faces in the database (assumed, for the moment, smaller than $N^2$).
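As a minimal sketch of this reduction (our own illustration under assumed array layouts, not the authors' implementation), the eigenvectors of $C = AA^T$ are recovered from those of the small matrix $L = A^T A$, since if $Lv = \lambda v$ then $Av$ is an eigenvector of $AA^T$ with the same eigenvalue:

```python
import numpy as np

def train_eigenfaces(images, num_eig):
    """images: M x N^2 array, one vectorized face per row (hypothetical layout)."""
    M = images.shape[0]
    mean_face = images.mean(axis=0)
    A = (images - mean_face).T                # N^2 x M matrix of centered faces
    L = A.T @ A / M                           # small M x M matrix instead of N^2 x N^2
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_eig]
    U = A @ eigvecs[:, order]                 # map back: columns are the eigenfaces
    U /= np.linalg.norm(U, axis=0)            # normalize each eigenface
    weights = U.T @ A                         # num_eig x M training weights
    return mean_face, U, weights
```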
We obtain the weights of each training face as $w_i = u_k^T \Phi_i$, where $u_k$ are the eigenvectors of the obtained space, and we save them all in $\Omega_i^T = [w_1\, w_2 \ldots w_M]$ with $i = 1, 2, \ldots, M$. For the classification the process is quite similar. We subtract the average face from the input image, $\Phi = \Gamma - \Psi$, where Γ is the input image and Ψ the average image of the set of faces. The next step is to project Φ onto the face space to obtain the weights $w = u_k^T \Phi$, where $u_k$ are the eigenvectors associated with the largest eigenvalues from the training stage. Finally, the face is classified by finding the image i that minimizes the Euclidean distance $e_i^2 = \|\Omega - \Omega_i\|^2$, where $\Omega_i$ is the vector representing the i-th class. A face is classified as belonging to class i when the minimum $e_i$ is below a chosen threshold θ; otherwise the face is classified as "unknown". 2.2 PCA on the Wavelet Coefficients [3] The algorithm works exactly as the previous one, but in a preliminary stage the size of the vectors involved in the algorithm is reduced. In order to perform this reduction a wavelet transform is used. In this way we can work not only with the reduced version of the image, but also with different frequency subbands. This will be discussed in detail in Section 4. The objective of reducing the size of the image is to be able to run the algorithm even for large databases. However, as can be remarked from Fig. 5, the results are not as good as with the original eigenface method, even though they are close to it.
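As an illustration of this preprocessing idea (a sketch of our own, not the code of [3]), the level-ℓ Haar approximation subband is, up to a constant factor, a $2^\ell \times 2^\ell$ block average, so the image can be shrunk before PCA as follows:

```python
import numpy as np

def haar_approximation(image, levels):
    """Keep only the low-pass (approximation) subband of a Haar decomposition.
    Each level halves both dimensions by averaging 2x2 blocks, which is
    proportional to the true Haar approximation coefficients and therefore
    sufficient for a subsequent PCA."""
    out = image.astype(float)
    for _ in range(levels):
        h, w = out.shape
        out = out[: h - h % 2, : w - w % 2]          # drop an odd border row/column if any
        out = 0.25 * (out[0::2, 0::2] + out[0::2, 1::2] +
                      out[1::2, 0::2] + out[1::2, 1::2])
    return out

# e.g. a 128x128 face reduced at level 3 becomes 16x16 before being vectorized for PCA
```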
Fig. 1. Comparison of the performance between the new and the known method of eigenfaces for the Yale faces database (recognition rate in % versus number of eigenvectors):

Number of eigenvectors:  70     50     40     30     23     10     6
Proposed:                94.55  95.15  96.36  95.76  95.76  95.15  92.12
Eigenfaces:              84.24  86.61  84.85  86.06  80.61  76.36  69.70
Fig. 2. Comparison of the performance between the new and the known method of eigenfaces for the ATT faces database (recognition rate in % versus the number of coefficients, from 1 to 50)
3 Proposed Method The proposed method starts from the same idea as the regular eigenface technique, namely extracting image features by a PCA. In our experiments, we noticed that it is better to compute the eigenvectors related to the highest eigenvalues of each row, instead of performing this on the entire image. The choice of taking the lines and not the columns comes from the fact that, for a regular face in a square picture, there are more representative lines than columns, in the sense that image features can be found more easily in rows than in columns. For example, the eyes or the mouth are better retrieved on lines than on columns. With this change, the equations of the algorithm remain almost the same, even though its conception has changed. Now $A = [\Phi_1\, \Phi_2 \ldots \Phi_M]$ is an $N \times NM$ matrix and C is $N \times N$. We have reduced the size of the covariance matrix from $N^2 \times N^2$ to $N \times N$, which is a huge reduction of the computational effort. What is more, the matrix from which we obtain the eigenvectors no longer depends on the size of the database. We get the eigenvectors associated with the largest eigenvalues of the covariance matrix as before and project each subtracted face onto the face space, obtaining the weights as $w_i = u_k^T \Phi_i$ and saving them in $\Omega_i^T = [w_1\, w_2 \ldots w_M]$. Before, for each image we had R different weights associated with the R most representative eigenvectors. Now, as we perform this process for each line of a face, we obtain R*N different weights. The classification process is completely equivalent to the eigenfaces one, but it is more efficient, since we have more significant weights representing each face than before.
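A compact sketch of the training stage as we read it from the description above (hypothetical array layouts and names, not the authors' code) is given below; note that the covariance is N x N regardless of the number of training faces:

```python
import numpy as np

def train_row_eigenfaces(images, num_eig):
    """images: M x N x N array of faces; the rows of every face are used as samples."""
    M, N, _ = images.shape
    mean_face = images.mean(axis=0)                      # N x N average face
    rows = (images - mean_face).reshape(M * N, N)        # every centered row is a sample
    C = rows.T @ rows / (M * N)                          # N x N covariance, independent of M
    eigvals, eigvecs = np.linalg.eigh(C)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:num_eig]]  # N x num_eig leading eigenvectors
    # each face is now described by num_eig weights per row, i.e. R*N weights in total
    weights = np.einsum('nr,mkn->mkr', U, images - mean_face)   # M x N x num_eig
    return mean_face, U, weights.reshape(M, -1)
```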
Fig. 3. Comparison of the time dedicated to the training stage between the proposed and the classical eigenface methods
Fig. 4. Comparison of the time dedicated to the recognition stage between the proposed and the classical eigenface methods
Fig. 5. Comparison of the performance of the different wavelet subbands (low-pass, diagonal, vertical and horizontal coefficients) as a function of the number of eigenvectors
Fig. 6. Comparison of different wavelet families (Daub1, Daub3, Daub4) for recognition, as a function of the number of decomposition levels and eigenvectors
Fig. 7. Comparison of different families of wavelets (Daub1, Daub2, Daub3, Sym2, Sym3, Bior3.5) for the approximation coefficients at the third level
Fig. 8. Comparison of the performance between the different numbers of eigenvectors considered, using approximation coefficients of the Haar wavelet decomposition. (Yale database)
4 Experimental Results 4.1 Comparison with Classical Eigenface Method We present in Fig.1 a comparison of the reference and the new Eigenface method applied to the Yale faces database. This is a database containing the faces of 15 different persons in 11 different situations, such as light changes, glasses or facial
expressions. The evaluation protocol used on this database is "leave one out" in pattern recognition terminology, or "cross validation" in the statistical literature; both terms denote the same method. In the case of the Yale database this amounts to 11*15 recognition stages. From Fig. 1 one can remark that the results are much better with the new algorithm. We can notice that even for a very small number of coefficients the new algorithm gives a better performance than the classical method with a large set of coefficients. The explanation can be found in the construction of the new method: in the classical method, the number of weights used in the recognition stage equals the number of coefficients taken, whereas with the proposed method, if we take R coefficients into account, we obtain R*N weights (N being the number of lines). This explains why even for only one coefficient we already get good results, as can be seen in Fig. 2.
Fig. 2 uses the ATT database. It is larger than the Yale database (400 faces of 40 different people), but the images have fewer illumination changes, which makes the eigenfaces perform quite well. Even so, it is clear that the proposed method performs better in this case as well. The following results for the different algorithms are all obtained on the Yale database which, although smaller than the ATT database, seems more challenging because of the grimaces and illumination changes. We should also notice that taking a larger number of coefficients for either algorithm does not lead to better performance. In the case of Turk's algorithm, above 30 coefficients the performance remains quite stable. For the new algorithm we can clearly find a peak at 40 coefficients, but with 10 coefficients the performance is quite similar.
The training time is also reduced with the new method, which is about 3 times faster than the old one in the training stage (see Fig. 3). The recognition time is longer with the new algorithm, as can be seen in Fig. 4. Nevertheless, we should notice that for a number of eigenvectors between 30 and 40 the performance of the algorithm does not grow any further. In this case, the recognition time is about 50 s, but these 50 seconds are measured for the whole set of 165 faces. It takes 0.3 s for each individual recognition with a usual Intel Centrino notebook and the Matlab program used for the experiment. Note that 0.3 s can already be considered real-time recognition.
4.2 PCA on Wavelet Coefficients The algorithm involving PCA on the wavelet coefficients has also been programmed with the new method of eigenfaces. In the original article of Yuen et al. [3] it is proposed to use the diagonal coefficients of the wavelet decomposition. In contrast, we can see from our experiment in Fig. 5 that the approximation coefficients give the best results independently of the number of considered eigenvectors. This is a result for the family of orthogonal wavelets Daubechies3, but similar results have been obtained for other families. From Fig. 7 one can see that the family of wavelets that performs best for face recognition using PCA is the Haar one. It has also been checked that the longer the wavelet, the worse it performs. Nevertheless, there is not a very big difference in recognition rate between the tested families.
In Fig. 8 we can observe that, as with the algorithm using PCA without wavelets, the best results are obtained when we use about 30 coefficients. There is no improvement in using more coefficients; moreover, after the peak at 30 coefficients the performance declines. We also tested the algorithm by jointly using, for the training and recognition stages, the three highest-frequency subbands of detail coefficients of each wavelet decomposition. The coefficients of the three subbands are concatenated into a single vector and used together in the training and in the recognition stage. This leads to better recognition results than each high-frequency subband separately, but it does not reach the recognition rate of the low frequencies. Once more, we have checked that the indication given by Yuen [3], which suggests using the diagonal frequencies, is wrong: the best results are always reached by taking the low frequencies. Once again, the Haar wavelets are the ones that yield the best results.
5 Conclusions In this paper, we have proposed a simple yet efficient eigenvector method based on the PCA of the rows of the images. It was found to be better than the reference technique, both in terms of recognition rate and of training time. This algorithm strongly reduces the computational cost and memory usage, thus allowing large-dimensional problems to be addressed. The wavelet decomposition of large images was also used as a means to reduce the dimensionality, by applying the eigenvector method on different subbands of coefficients. The Haar wavelets and the approximation coefficients were found to perform best in this context. Moreover, for regular-size images the proposed low-complexity algorithm performs better.
References 1. Ngo, D.C.L., Teoh, A.B.J., Goh, A.: Biometric Hash: High-Confidence Face Recognition. IEEE Trans. Circuits and Systems for Video Techn. 16(6), 771–775 (2006) 2. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991) 3. Feng, G.C., Yuen, P.C., Dai, D.Q.: Human face recognition using PCA on wavelet subband. Journal Electron. Imaging 9(2), 226–233 (2000) 4. Ma, K., Tang, X.: Discrete wavelet face graph matching. Int. Conf. Image Proc. 2, 217–220 (2001) 5. Zhang, J., Yan, Y., Lades, M.: Face Recognition: Eigenface, Elastic Matching, and Neural Nets. Proc. of the IEEE 85(9), 1423–1435 (1997) 6. Chellappa, R., Wilson, C.L., Sirohey, S.: Human and machine recognition of faces: A survey. Proc. of the IEEE 83, 705–741 (1995) 7. Yang, M.H., Ahuja, N., Kriegman, D.: Face Recognition Using Kernel Eigenfaces. In: IEEE ICIP 2000, pp. 37–40. IEEE, Los Alamitos (2000) 8. Belhumeur, P., et al.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. PAMI 19, 711–720 (1997)
A New Approach to Face Localization in the HSV Space Using the Gaussian Model Mohamed Deriche and Imran Naseem Electrical Engineering Department, King Fahd University of Petroleum and Minerals, Dhahran 31261, KSA
[email protected]
Abstract. We propose a model-based approach for the problem of face localization. Traditionally, images are represented in the RGB color space, a 3-dimensional space that includes the illumination factor. However, the apparent skin color of different ethnic groups has been shown to vary mostly because of brightness. We therefore propose to transform the RGB images into the HSV color space. We then exclude the V component and use the HS-domain to represent skin pixels with a Gaussian probability model. The model is used to obtain a skin likelihood image, which is further transformed into a binary image using the fuzzy C-means clustering (FCM) technique. The candidate skin regions are checked for some facial properties, and finally a template face matching approach is used to localize the face. The developed algorithm is found to be robust and reliable under various imaging conditions, even in the presence of structural objects like hair, spectacles, etc.
1
Introduction
With the emergence of new techniques in multimedia signal processing, we need more sophisticated, precise and user-friendly means of interaction with computers. The traditional ways of communicating with machines, like keyboards, mice, etc., are now considered burdensome. Furthermore, the utilization of facial features for the purpose of person identification has encouraged researchers all around the world to propose robust and efficient techniques for face processing prior to the recognition task. In fact, many researchers believe that face detection is perhaps the first and most important step towards solving the problem of face recognition. The need for face detection is even greater for face recognition applications in crowded places like airports, banks, buildings, etc. In this paper, the general problem of face detection is defined as follows: given a still or a video image, detect and localize the human face(s), if any. The main issues related to face detection can be summarized as follows:
– Posture. The images of a face, in a real-time environment, vary largely because of the positioning of the face (frontal, profile, etc.), which may result in occlusion of facial features like the eyes, mustache, etc.
– Structural components. Objects like mustaches, beards, glasses, etc. may or may not be present in a given face image. Furthermore, there is a great variability in the shape, size and color of these objects.
– Expressions. Facial images are highly affected by facial expressions.
– Occlusion. In a real-world environment, a human face may be partially or fully occluded by moving objects.
– Ambient conditions. An image is highly dependent upon the ambient imaging conditions, like light intensity, etc.
The above issues make the problem of face detection a challenging one. There are many other issues which are closely related to the problem of face detection. Under the assumption that the image contains only one face, the problem of face detection boils down to face localization [1], [2]. Facial feature detection aims at detecting facial features like eyes, eyebrows, lips, nose, etc., with the assumption that a single face is present in the image [3], [4]. Face recognition performs a match between an input image and a database [5], [6]. Face tracking methods continuously look for the face location in a sequence of images. Broadly speaking, face detection techniques can be divided into two categories: feature-based techniques and image-based techniques. The techniques falling under the first category make explicit use of the facial features; the apparent properties of the face, such as skin color and face geometry, are utilized. Typically, in these techniques, the face detection task is accomplished by manipulating distances, angles, and other geometrical aspects derived from the scene. Considered to be the most primitive feature in computer vision applications, edge representation was also used in the earliest face detection work by Sakai [7]. The work was based on analyzing line drawings of faces from photographs, aiming to locate facial features. Craw [8] later proposed a hierarchical framework based on Sakai's work to trace a human head outline. The work includes a line-follower implemented with a curvature constraint to prevent it from being distracted by noisy edges. Edge features within the head outline are then subjected to feature analysis using shape and position information of the face. More recent examples of edge-based techniques can be found in [9], for facial feature extraction, and in [10], [11], [12], for face detection. Edge-based techniques have also been applied to detecting glasses in facial images [13], [14]. The gray-level information contained in an image can also be utilized to extract features. Since parts of faces like eyes, eyebrows, pupils, and lips are darker (lower gray value) than skin regions, they can be designated as facial features within a segmented face region using various algorithms. The second category of image-based techniques includes the linear subspace methods, neural networks, and statistical approaches [15], [16], [17], to mention a few.
2
A Statistical Model for Skin Pixels in HSV Domain
It has been observed that the skin colors of different people occupy almost the same region of the color space, and that the difference in apparent skin colors is mostly
due to the intensity or luminance. The luminance itself is not a reliable criterion for differentiating between a skin region and a non-skin region because of the varying ambient lighting conditions. Thus we always prefer to deal with images in which the effect of luminance has been reduced or canceled. Unfortunately, most still and moving images are in the RGB color space, which not only represents the three primitive colors (red, green, blue) but also the existing luminance. Thus, it is always desirable to transfer the RGB image into a color space in which the luminance effect is reduced or canceled. A good method to accomplish this is to transfer the image into the HSV color space. A three-dimensional representation of the HSV color space is a hexacone (see figure 1(a)), where the central vertical axis represents the Intensity (Value). Hue is defined as an angle in the range $(0, 2\pi)$ relative to the red axis, with red at angle 0, green at $2\pi/3$, blue at $4\pi/3$ and red again at $2\pi$. Saturation is the depth or purity of the color and is measured as a radial distance from the central axis, with values between 0 at the center and 100 at the outer surface. For S = 0, as one moves higher along the Intensity axis, one goes from black to white through various shades of gray. On the other hand, for a given Intensity and Hue, if the Saturation is changed from 0 to 100, the perceived color changes from a shade of gray to the most pure form of the color represented by its Hue. The above implies that any color in the HSV space can be transformed into a shade of gray by sufficiently lowering the Saturation; the value of Intensity determines the particular gray shade to which this transformation converges. The transformation from RGB to HSV is written as:
$$H = \begin{cases} H_1 & \text{if } B \le G \\ 2\pi - H_1 & \text{if } B > G \end{cases}, \qquad S = \frac{\max(R,G,B) - \min(R,G,B)}{\max(R,G,B)}, \qquad V = \frac{\max(R,G,B)}{255} \qquad (1)$$
where
$$H_1 = \arccos\left( \frac{0.5\,[(R-G)+(R-B)]}{\sqrt{(R-G)^2 + (R-B)(G-B)}} \right) \qquad (2)$$
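For illustration, a direct per-pixel transcription of equations (1) and (2) might look as follows (our own sketch; the square root in the denominator of H1 is the standard form of this formula and is assumed here):

```python
import numpy as np

def rgb_to_hsv_pixel(R, G, B):
    """Hue in radians, saturation and value in [0, 1], following Eqs. (1)-(2)."""
    R, G, B = float(R), float(G), float(B)
    num = 0.5 * ((R - G) + (R - B))
    den = np.sqrt((R - G) ** 2 + (R - B) * (G - B)) + 1e-12   # avoid division by zero
    H1 = np.arccos(np.clip(num / den, -1.0, 1.0))
    H = H1 if B <= G else 2.0 * np.pi - H1
    mx, mn = max(R, G, B), min(R, G, B)
    S = (mx - mn) / mx if mx > 0 else 0.0
    V = mx / 255.0
    return H, S, V
```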
Once an image is transformed into the HSV domain, the V component (value, intensity, brightness) can easily be removed to get rid of the luminance. Figure 1(b) shows the HS-space distribution of 537272 skin pixels of 40 people belonging to different ethnic groups. The reader will note that the skin pixels cluster around a specific area rather than being distributed over the whole space. This interesting observation has prompted researchers to develop a statistical model for skin using the Gaussian distribution [18].
Fig. 1. (a): HSV colorspace, (b): Skin pixels distribution 3000
Fig. 2. Hue and saturation distributions
Each pixel in the HS-space is seen as a bi-variate observation vector given as
$$x = [r\ b]^T \qquad (3)$$
Now, let x be a two-dimensional random vector of all such observations for a given image; the first- and second-order moments are given as
$$E(x) = m \qquad (4)$$
$$C = E[(x - m)(x - m)^T] \qquad (5)$$
With this Gaussian distribution, the skin likelihood image can be obtained using the expression (up to a constant)
$$P(r, b) = \exp\left(-0.5\,(x - m)^T C^{-1} (x - m)\right) \qquad (6)$$
Figure 3 shows the model resulting from 537272 skin pixels using the bivariate Gaussian distribution.
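A minimal sketch of the model of equations (3)-(6) (illustrative names and array layouts, not the authors' code) is:

```python
import numpy as np

def fit_skin_model(hs_pixels):
    """hs_pixels: K x 2 array of (hue, saturation) values of labelled skin pixels."""
    m = hs_pixels.mean(axis=0)
    C = np.cov(hs_pixels, rowvar=False)
    return m, np.linalg.inv(C)

def skin_likelihood(hs_image, m, C_inv):
    """hs_image: H x W x 2 array; returns the likelihood map of Eq. (6)."""
    d = hs_image - m
    mahal = np.einsum('ijk,kl,ijl->ij', d, C_inv, d)   # per-pixel quadratic form
    return np.exp(-0.5 * mahal)
```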
Fig. 3. Gaussian model for skin regions
3
Transformation of the Skin Likelihood Image into a Binary Image
We propose to use the Fuzzy C-means Clustering (FCM) approach for the purpose of skin segmentation. FCM is a clustering method which allows each observation in a data set to belong to two or more clusters. This method was developed by Dunn in [19] and improved by Bezdek in [20]. It is frequently used in pattern recognition and is based on minimizing the following cost function:
$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \,\|x_i - c_j\|^2, \qquad 1 < m < \infty \qquad (7)$$
where m is any real number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in cluster j, $x_i$ is the i-th d-dimensional measured data point, $c_j$ is the d-dimensional center of the cluster, and $\|\cdot\|$ is any norm expressing the similarity between a measured data point and the center. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown in equation (7) above, with the membership $u_{ij}$ and the cluster centers $c_j$ updated by:
$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \dfrac{\|x_i - c_j\|}{\|x_i - c_k\|} \right)^{\frac{2}{m-1}}} \qquad (8)$$
$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^m\, x_i}{\sum_{i=1}^{N} u_{ij}^m}$$
The iterations stop when $\max_{ij} |u_{ij}^{k+1} - u_{ij}^{k}| < \delta$, where δ is a termination criterion between 0 and 1 and k denotes the iteration step. This procedure converges to a local minimum or a saddle point of $J_m$.
Fig. 4. The original image, the image in HSV color-space, and the skin likelihood image
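For illustration, a bare-bones FCM iteration following equation (7), equation (8) and the centre update above could be sketched as follows (assumed parameter names; real applications would typically rely on an existing library implementation):

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=2, m=2.0, delta=1e-4, max_iter=100, seed=0):
    """X: N x d data (e.g. skin-likelihood values); returns memberships U and centres."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)                        # memberships sum to one
    for _ in range(max_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]       # centre update
        dist = np.linalg.norm(X[:, None, :] - centres[None], axis=2) + 1e-12
        U_new = 1.0 / (dist ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)            # Eq. (8)
        if np.max(np.abs(U_new - U)) < delta:                # termination criterion
            return U_new, centres
        U = U_new
    return U, centres
```

To binarize the skin-likelihood image, the pixel values would be clustered into two groups and the cluster whose centre corresponds to the higher likelihood would be labelled as skin.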
Fig. 5. Binary image representing the prospective skin regions
The binary image so obtained should ideally have all skin regions with pixel value 1 and non-skin regions with 0; however, as shown in figure 5, there are a few non-skin regions which have been erroneously classified as skin. Furthermore, we have to select, among all skin regions in figure 5, a suitable region which could be a potential human face. Thus, to narrow down our search for human faces, we need to define a number of new criteria: 1. Holes in a skin region: We start by using our knowledge that human faces contain objects like eyes, eyebrows, mustaches, etc. These objects correspond to non-skin regions (holes), suggesting that a human face
is perhaps a skin region with a few holes within its boundaries. Therefore, in our search for a human face, we can safely discard all skin regions which do not have any holes. The number of holes in a region is computed using the Euler number [21] of the region, E = C − H, where E is the Euler number, a scalar whose value is the total number of objects in the image minus the total number of holes in those objects, C is the number of connected components, and H is the number of holes in the region. Since we are considering one region at a time, C = 1, and thus the number of holes can be computed as H = 1 − E. 2. Geometrical properties: We now compute a number of geometrical properties, like the centroid, orientation, and height-to-width ratio of the candidate skin regions. There are various methods to calculate the center of mass (or centroid) of a region [21]; given an image, the center of mass can be calculated as
$$\bar{x} = \frac{1}{A} \sum_{i=1}^{n} \sum_{j=1}^{m} j\, B[i,j] \qquad (9)$$
$$\bar{y} = \frac{1}{A} \sum_{i=1}^{n} \sum_{j=1}^{m} i\, B[i,j] \qquad (10)$$
where B is a matrix of order n × m representing the region under consideration and A is the area of the region in pixels. Although most of the faces considered are vertically oriented, to cope with inclined faces we must calculate the angle of inclination θ. There are various ways to do so, but we have adopted the method based on the elongation of the object presented in [21]. The angle of inclination can be calculated as
$$\theta = \frac{1}{2} \tan^{-1}\left( \frac{b}{a - c} \right) \qquad (11)$$
where
$$a = \sum_{i=1}^{n} \sum_{j=1}^{m} x'^{\,2}_{ij}\, B[i,j], \quad b = 2 \sum_{i=1}^{n} \sum_{j=1}^{m} x'_{ij}\, y'_{ij}\, B[i,j], \quad c = \sum_{i=1}^{n} \sum_{j=1}^{m} y'^{\,2}_{ij}\, B[i,j], \quad x' = x - \bar{x}, \quad y' = y - \bar{y} \qquad (12)$$
We now calculate the height-to-width ratio of the region, which serves two purposes. Firstly, the dimensions of the region are needed because we will have to resize the template face according to the skin region in order to perform template face matching. Secondly, we can use the height-to-width ratio to improve our decision. Human faces are vertically oriented, and ideally the height-to-width ratio is a bit larger than 1; thus we can use this observation to decide that regions having a height-to-width ratio below 0.8 do not correspond to a human face. Similarly, we can put an upper limit on the ratio; however, there are cases in which the images contain an uncovered skin area below the face (e.g., the neck), so to account for this we set the upper limit to 1.6. Thus, in our search for a human face, we discard all regions whose ratio is below 0.8 or above 1.6 (see the sketch after this paragraph).
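Putting the two criteria together, a hypothetical filtering routine could be sketched as follows (our own illustration; the hole count uses the equivalent fill-and-subtract formulation rather than the Euler number, and the orientation uses the arctan2 form of equation (11)):

```python
import numpy as np
from scipy import ndimage

def face_candidates(binary, min_ratio=0.8, max_ratio=1.6):
    """Filter the skin regions of a binary image using the criteria described above.
    Returns, for each kept region, its mask, centroid, orientation and size."""
    labels, n = ndimage.label(binary)
    kept = []
    for k in range(1, n + 1):
        B = (labels == k)
        # criterion 1: at least one hole (eyes, mouth, ...) inside the region
        holes = ndimage.binary_fill_holes(B) & ~B
        if not holes.any():
            continue
        # criterion 2: geometrical properties (Eqs. 9-12)
        i, j = np.nonzero(B)
        y_c, x_c = i.mean(), j.mean()                      # centroid
        a = np.sum((j - x_c) ** 2)
        b = 2.0 * np.sum((j - x_c) * (i - y_c))
        c = np.sum((i - y_c) ** 2)
        theta = 0.5 * np.arctan2(b, a - c)                 # angle of inclination
        height = i.max() - i.min() + 1
        width = j.max() - j.min() + 1
        if not (min_ratio <= height / width <= max_ratio):
            continue
        kept.append({'mask': B, 'centroid': (x_c, y_c), 'theta': theta,
                     'height': height, 'width': width})
    return kept
```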
4
Template Face Matching
The most important step in the method is to match a template face to the obtained skin regions. The template face shown in figure 6 was calculated by averaging 16 faces of males and females without spectacles or facial hair (www.ise.stanford.edu). Notice that the left and right borders of the template are located at the centers of the left and right ears of the averaged faces. The template is also
Fig. 6. The average face
Fig. 7. An example of template face matching
vertically centered at the tip of the nose of the model. The template face is adapted using the geometric characteristics obtained for each region. It is first resized using the height and width of the region. The resized template face is then oriented using the calculated angle θ, so that the template face has the same inclination as the region. Next, the center of the inclined template face is computed and placed at the previously calculated center of the region. We then calculate the cross-correlation between the adjusted template face and the skin region under consideration. Empirically, we have determined that a correlation value of 0.6 is good enough to decide that the region under consideration corresponds to a human frontal face (refer to figure 7).
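A rough sketch of this matching step (illustrative only; `template` is assumed to be the average face of figure 6 as a grayscale array, and the geometric adaptation uses standard scipy.ndimage operations):

```python
import numpy as np
from scipy import ndimage

def normalized_correlation(a, b):
    """Normalized cross-correlation between two arrays of identical shape."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def is_face(gray_region, template, theta, threshold=0.6):
    """gray_region: grayscale crop of the candidate region (height x width).
    The template is resized to the region, rotated by the region's inclination,
    and the region is accepted as a face when the correlation exceeds the
    empirical threshold of 0.6."""
    h, w = gray_region.shape
    t = ndimage.zoom(template, (h / template.shape[0], w / template.shape[1]))
    t = ndimage.rotate(t, np.degrees(theta), reshape=False)
    t = t[:h, :w]                      # guard against off-by-one sizes after zoom
    return normalized_correlation(gray_region[:t.shape[0], :t.shape[1]], t) >= threshold
```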
5
Experimental Results
We performed extensive experiments to verify the validity of our algorithm under various conditions; some results are shown in figure 8. The algorithm is found to be robust under various lighting conditions. For instance, in figure 8(b) the subject is exposed to lateral lighting; note also that the right side of the subject is in complete darkness, and that the face is a bit tilted and not in a frontal pose. These issues make it a difficult face localization problem, which is nonetheless well handled by the algorithm. The presence of structural objects like spectacles, facial hair, etc. tends to hide the skin information and can lead to erroneous face detection. We performed extensive experiments on subjects with structural objects to verify the validity of the algorithm. For instance, in some of the examples of figure 8 the subject wears a pair of spectacles; note that the picture is taken under natural light in an outdoor environment. The presence of facial hair, like a mustache or a beard, is always a source of erroneous face localization. The developed algorithm adequately handles these types of problems, as well as difficult cases of profile face images. Some examples are shown in figure 8.
6
Conclusion
In this paper, we have proposed a novel approach for human face localization. A probabilistic model of skin pixels was developed in the HSV color space using a Gaussian distribution. The skin likelihood image obtained is transformed into a binary image using the FCM (fuzzy C-means) clustering algorithm. The potential face candidates are then tested for some facial properties before a template face matching approach is used. The extensive experiments carried out showed that the algorithm is robust under difficult imaging conditions. The issue of occlusion due to structural objects was also addressed, and the algorithm was found to be reliable in such environments. The proposed algorithm performs well even with isometric views, even though it was not developed for such images. In future work, we plan to extend the algorithm to profile images as well. An enhanced Gaussian mixture model (GMM) is also being investigated, as little improvement was achieved in our initial experiments with the basic GMM model.
Fig. 8. Some experimental results
Acknowledgments The authors thank King Fahd University of Petroleum and Minerals, and King Abdulaziz City for Science and Technology, Saudi Arabia, for supporting this research.
References 1. Lam, K., Yan, H.: Fast algorithm for locating head boundaries. J.Electronic Imaging 3(4), 351–359 (1994) 2. Moghaddam, B., Pentland, A.: Probabilistic visual learning for object recognition. IEEE Trans. Pattern Analysis and Machine Intelligence 19(7), 696–710 (1997) 3. Craw, T.I., Bennett, D.A.: Finding face features. In: Second European Conf. Computer Vision, pp. 92–96 (1992) 4. Petajan, E., Graf, H.P., Chen, T., Cosatto, E.: Locating faces and facial parts. In: 1st Int’l Workshop, Automatic Face and Gesture Recognition, pp. 41–46 (1995) 5. Turk, M., Pentland, A.: Eigen faces for recognition. J. Congnitive Neuroscience 3(1), 71–86 (1991) 6. Samal, A., Iyengar, P.A.: Automatic recognition and analysis of human faces and facial expressions. Pattern Recognition 25(1), 65–77 (1992) 7. Sakai, T., Nagao, M., Kanade, T.: Computer analysis and classification of photographs of human faces. In: First USA-Japan Computer Conference (1972) 8. Craw, I., Ellis, H., Lishman, J.R.: Automatic extraction of face-feature. Pattern Recog. Lett. 183–187 (1987) 9. Herpers, R., Michaelis, M., Lichtenauer, K.-H., Sommer, G.: Edge and keypoint detection in facial regions. In: IEEE Proc. of 2nd Int. Conf. on Automatic Face and Gesture Recognition, pp. 212–217 (1996) 10. De Silva, L.C., Aizawa, K., Hatori, M.: Detection and tracking of facial features by using a facial featuremodel and deformable circular template. IEICE Trans. Inform. Systems, 1195–1207 (1995) 11. Govindaraju, V.: Locating human faces in photographs. Int. J. Comput. Vision, 19 (1996) 12. Yuille, A.L., Hallinan, P.W., Cohen, D.S.: Feature extraction from faces using deformable templates. Int. J. Comput. Vision 8, 99–111 (1992) 13. Jiang, X., Binkert, M., Achermann, B., Bunke, H.: Towards detection of glasses in facial images, Pattern Anal. Appl. 3, 9–18 (2000) 14. Jing, Mariani, R.: Glasses detection and extraction by deformable contour. In: 15th International Conference on Pattern Recognition, vol. 2 (2000) 15. Mikami, M., Wada, T.: Example-based face detection using independent component analysis and rbf network. In: SICE Annual Conference (2003) 16. Rong Jin, A.G., Hauptmann: Learning to identify video shots with people based on face detection. In: Multimedia and Expo, ICME ’03 (2003) 17. Zhenqiu Zhang Li, S.Z.: Floatboost learning and statistical face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2004) 18. Naseem, I., Deriche, M.: Robust face detection in complex color images. In: 12th IEEE International Conference on Image Processing, ICIP’05, IEEE, Los Alamitos (2005) 19. Dunn, J.C.: A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. Journal of Cybernetics 3, 32–57 (1973) 20. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algoritms. Plenum Press, New York (1981) 21. Ramesh, R., Kasturi, R., Schunck, B.: Machine Vision. McGraw Hill, New York (1995)
Gait Recognition Using Active Shape Models Woon Cho, Taekyung Kim, and Joonki Paik Image Processing and Intelligent Systems Laboratory, Graduate School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul 156-756, South Korea {woony_love,kimktk}@wm.cau.ac.kr,
[email protected]
Abstract. A gait recognition method is presented for human identification from a sequence of noisy silhouettes segmented from video. The proposed gait recognition algorithm gives better performance than the baseline algorithm because the object is segmented using the active shape model (ASM) algorithm. For the experiments, we used the HumanID Gait Challenge data set, which is the largest gait benchmarking data set, with 122 subjects. For realistic simulation we use various values for the following parameters: i) viewpoint, ii) shoe, iii) surface, iv) carrying condition, and v) time.
1 Introduction Human gait is a spatio-temporal phenomenon that specifies the motion characteristics of an individual [8]. The study of human gait, as well as its deployment as a biometric for identification purposes, is currently an active research area. Despite the imperative need for efficient security architectures in airports, border crossings, and other public access areas, most currently deployed identification methods were developed and established several years ago. It is now clear that these methods cannot cover contemporary security needs. For this reason, the development and deployment of biometric authentication methods, including fingerprint, hand geometry, iris, face, voice, signature, and gait identification, has recently attracted more attention from government agencies and other institutions. Gait analysis and recognition can form the basis of unobtrusive technologies for the detection of individuals who represent a security threat or behave suspiciously [8]. In the specific area of gait recognition, most works have focused on discriminating between different human motion types, such as running, walking, jogging, or climbing stairs. Recently, human identification (HumanID) from gait has received attention and become an active area of computer vision. A review of the current studies shows that three common assumptions for constraining the scene are i) indoors, ii) static background, and iii) uniform background color. These assumptions cannot cover every possible situation in real-life outdoor scenes [3]. In this paper, for objective evaluation, we compare recognition rates obtained with a common experimental protocol on the HumanID Gait Challenge data set [3]. The HumanID data set, which has 122 subjects acquired in outdoor scenes, has five covariate parameters: i) change in viewing angle, ii) change in shoe type, iii) change
in walking surface, iv) carrying or not carrying a briefcase, and v) elapsed time between the sequences being compared. The proposed approach falls into the class of shape-based approaches where dynamic shapes are used. Unlike a shape representation method using an average over all stances, we discard the dynamics between the stances while keeping the temporal ordering of the individual gait stances, by using stance-specific representations. To highlight the shape part of gait, we normalize the gait dynamics based on active shape models (ASM) [4]. The ASM, which is mainly intended for non-rigid shapes, trains the object's shape information a priori and defines the object in a new image by allowing deformation from the mean shape [4]. The ASM has been successfully applied in many tracking and recognition areas, including medical imaging [9], because of its ability to segment non-rigid human organs. The ASM algorithm is divided into five functional modules: i) landmark assignment, ii) training set alignment, iii) shape variation modeling, iv) mode decision, and v) model fitting. The ASM algorithm detects silhouettes robustly with the aid of background segmentation. After segmenting the object using the ASM, we apply the baseline algorithm for gait recognition [3]. The HumanID data set and the source code for the gait baseline algorithm were obtained from the corresponding research community (http://figment.csee.usf.edu/GaitBaseline/). The rest of this paper is organized as follows. In section 2, we present the basic theory and implementation of the ASM algorithm. Sections 3 and 4 describe the HumanID data set and the gait baseline algorithm using the proposed segmentation method, respectively. The experimental results and conclusion are provided in sections 5 and 6, respectively.
2 Active Shape Models 2.1 Landmark Point Assignment A shape represents geometrical information that is preserved after location, scale, and rotational effects are filtered out from an object. Such shapes can be described by using a set of landmark points. A landmark is a point of correspondence on each object that matches between and within populations. Landmarks can be classified into three subgroups: i) anatomical landmarks, ii) mathematical landmarks, and iii) pseudo-landmarks. The proposed method assigns 32 landmarks on the contour of each object in the HumanID data set [3], which was acquired from 1,870 test sequences of 122 individuals. Each individual walks multiple times counterclockwise along two similar elliptical paths, as shown in Fig. 1. In Fig. 1, the three different types of landmark points are as follows: i) 14 anatomical landmarks: 4, 5, 6, 10, 11, 15, 16, 18, 19, 23, 24, 28, 29, and 30; ii) 16 mathematical landmarks: 1, 2, 7, 8, 9, 12, 13, 14, 17, 20, 21, 22, 25, 26, 27, and 32; iii) 2 pseudo-landmarks: 3 and 31. The two pseudo-landmarks are used to minimize the distance between the linear spline defined by the 32 landmarks and the actual object contour.
Fig. 1. (a) 32 landmarks assigned on a sample object and (b) the typical walking trajectory to make the HumanID data set
2.2 Training Set Alignment
A. Generalized Procrustes analysis
To align a set of planar shapes, the following iterative approach, based on generalized Procrustes analysis, is used:
1. Choose an initial estimate of the mean shape (e.g. the first shape in the set).
2. Align all the remaining shapes to the mean shape.
3. Re-calculate the estimate of the mean from the aligned shapes.
4. If the estimated mean has changed, return to step 2.
The iteration is considered to converge if the mean shape does not change significantly. The Procrustes mean shape is commonly used to obtain an estimate of the mean shape. Let N denote the number of shapes; then the Procrustes mean is defined as
$$\bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j. \qquad (1)$$
In order to avoid any shrinking or drifting of the mean shape, its size and orientation should be properly fixed at each iteration by normalization.
B. Projection to the tangent space
Projection of a shape vector moves the shape onto the corresponding hyperplane, where the Euclidean distance can be employed as a shape metric instead of the true geodesic distance on the hypersphere surface [5].
2.3 Modeling Shape Variation
Suppose we have s sets of points $x_i$ which are aligned into a common co-ordinate frame. These vectors live in an nd-dimensional space. By modeling their distribution, we can generate new samples similar to those in the original training set, and we can decide whether a given shape is plausible or not.
In particular, we seek a parameterized model of the form x = M(b), where b represents a vector of model parameters. Such a model can be used to generate new vectors x. To simplify the problem, we first reduce the dimension of the data from nd to something more manageable. An effective approach is to apply principal component analysis (PCA), which proceeds as follows:
1. Compute the mean of the data,
$$\bar{x} = \frac{1}{s} \sum_{j=1}^{s} x_j. \qquad (2)$$
2. Compute the covariance of the data,
$$S = \frac{1}{s-1} \sum_{j=1}^{s} (x_j - \bar{x})(x_j - \bar{x})^T. \qquad (3)$$
3. Compute the eigenvectors $\phi_j$ and corresponding eigenvalues $\lambda_j$ of S (with $\lambda_j \ge \lambda_{j+1}$).
If Φ contains the t eigenvectors corresponding to the t largest eigenvalues, then we can approximate any shape x of the training set as
$$x \approx \bar{x} + \Phi b, \qquad (4)$$
where $\Phi = (\phi_1\,|\,\phi_2\,|\,\ldots\,|\,\phi_t)$ and b is a t-dimensional vector defined as
$$b = \Phi^T (x - \bar{x}). \qquad (5)$$
The vector b represents a set of parameters of a deformable model.
2.4 Mode Selection The number of eigenvectors t used to form Φ can be chosen in several ways. A straightforward approach is to choose t so that the corresponding eigenvectors account for 98% of the total energy [4]. Let $\lambda_i$ be the i-th eigenvalue of the covariance matrix of the training data. Each eigenvalue gives the variance of the data about the mean in the direction of the corresponding eigenvector. The total variance in the training data is the sum of all the eigenvalues,
$$V_T = \sum_i \lambda_i. \qquad (6)$$
We can then choose the t largest eigenvalues such that
$$\sum_{i=1}^{t} \lambda_i \ge f_v V_T, \qquad (7)$$
where $f_v$ defines the proportion of the total variation.
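The point distribution model of Sections 2.3 and 2.4 can be summarized in a few lines (a sketch under the notation above; shapes are assumed to be already aligned and stacked as rows of a matrix):

```python
import numpy as np

def build_shape_model(shapes, f_v=0.98):
    """shapes: s x 2n array of aligned landmark coordinates (x1, y1, ..., xn, yn)."""
    x_bar = shapes.mean(axis=0)                              # Eq. (2)
    S = np.cov(shapes, rowvar=False)                         # Eq. (3)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Eq. (7): keep the smallest t explaining a fraction f_v of the total variance
    t = int(np.searchsorted(np.cumsum(eigvals), f_v * eigvals.sum()) + 1)
    Phi = eigvecs[:, :t]
    return x_bar, Phi, eigvals[:t]

def shape_parameters(x, x_bar, Phi):
    return Phi.T @ (x - x_bar)                               # Eq. (5)

def reconstruct(b, x_bar, Phi):
    return x_bar + Phi @ b                                   # Eq. (4)
```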
Fig. 2 shows the effect of varying the first three shape parameters in turn between
±1.6 standard deviations from the mean values, leaving all other parameters at zero. 2.5 Fitting a Model to New Points
Given a rough initial approximation, an instance of a model can be fitted to an image. By choosing a set of shape parameters b for the model, we define the shape of the object in an object-centered co-ordinate frame. We can create an instance X of the model in the image frame by defining the position, orientation, and scale:
$$X = T_{X_t, Y_t, s, \theta}(\bar{x} + \Phi b), \qquad (8)$$
where the function $T_{X_t, Y_t, s, \theta}$ performs a rotation by θ, a scaling by s, and a translation by $(X_t, Y_t)$.
Fig. 2. Effect of varying the three largest eigenvalues in the range of ±1.6 standard deviation
An iterative approach to improving the fit is summarized as follows:
1. Examine a region of the image around each point $X_i$ to find the best nearby match $X_i'$ for the point.
2. Update the parameters $(X_t, Y_t, s, \theta, b)$ to best fit the newly found points X.
3. Repeat steps 1 and 2 until convergence.
In practice, we look along profiles normal to the model boundary through each model point, as shown in Fig. 3. If we want the model boundary to correspond to an edge, we can simply locate the strongest edge along the profile.
Fig. 3. Model fitting along the profile normal to boundary edge
Since model points are not always located on the strongest edge in the locality, the best approach is to consider the training set. The proposed approach uses the Mahalanobis distance to search for the optimal match along the profile, given as
$$f(g_s) = (g_s - \bar{g})^T S_g^{-1} (g_s - \bar{g}), \qquad (9)$$
where $\bar{g}$ and $S_g$ are the mean and covariance of the profile samples for the given model point, respectively. Equation (9) is the Mahalanobis distance of the sample from the model mean, and it is linearly related to the log of the probability that $g_s$ is drawn from the distribution. Minimizing $f(g_s)$ is therefore equivalent to maximizing the probability that $g_s$ comes from the distribution.
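For a single model point, one search step along the profile could be sketched as follows (our own illustration; extracting the grey-level profile samples at the candidate positions is assumed to be done elsewhere):

```python
import numpy as np

def best_profile_position(candidates, g_bar, S_g_inv):
    """candidates: list of profile samples g_s (1-D arrays) taken at successive
    positions along the normal; g_bar and S_g_inv are learned from the training
    profiles of this landmark. Returns the index minimizing Eq. (9)."""
    costs = []
    for g_s in candidates:
        d = g_s - g_bar
        costs.append(float(d @ S_g_inv @ d))       # Mahalanobis distance f(g_s)
    return int(np.argmin(costs))
```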
3 Data Set The HumanID gait challenge problem data set was designed to advance the state-of-the-art in automatic gait recognition and to characterize the effects of five conditions on performance. These two goals were achieved by collecting data on a large set of subjects (122, large compared to current standards in gait research), spanning up to 32 different conditions, which result from all combinations of five covariates with two values each [3].
Fig. 4. Camera setup for the gait data acquisition
The gait video data were collected at the University of South Florida on May 20-21 and November 15-16, 2001. The collection protocol had each subject walking multiple times counterclockwise along the elliptical path. The basic setup is illustrated in Fig. 4. The reasons for using the elliptical path are i) to develop an algorithm that is robust with respect
to variations in the fronto-parallel assumption, and ii) to provide a data sequence including all the views of a person for the potential development of 3D model-based approaches. In this paper, the following covariate parameters are used [3]: i) surface type, G for grass and C for concrete; ii) camera, R for right and L for left; iii) shoe type, A or B; iv) NB for not carrying a briefcase and BF for carrying a briefcase; and v) the acquisition time, May or November, denoted simply by M and N.
4 Baseline Algorithm
The baseline algorithm utilizes spatial-temporal correlation between silhouettes. Comparisons are made on silhouettes to reduce the effects of clothing texture artifacts. The baseline algorithm is intended to be a combination of "standard" vision modules that accomplish the task. The algorithm is divided into three modules: i) extracting silhouettes by segmenting the object using ASM, ii) computing the gait period from the silhouettes and using the estimated period to partition the sequences for spatial-temporal correlation, and iii) evaluating the spatial-temporal correlation to compute the similarity between two gait sequences.
4.1 Silhouette Extraction
The first step in the baseline algorithm is to extract the object's silhouette by using ASM. Following common practice in gait recognition, we define the silhouette to be the region of pixels inside the ASM boundary.
Fig. 5. Various input frames ((a)-(e)) and the extracted region of objects ((f)-(j))
4.2 Gait Period Detection
After successfully extracting the silhouette of the object, the gait period, N_gait, is estimated using a simple strategy. We count the number of foreground pixels in the silhouette in each frame over time, N_f(t). This number reaches its maximum at the full-stride stance and drops to its minimum when the legs fully overlap. To increase the sensitivity, we consider foreground pixels mostly from the legs, selected simply by taking only the bottom half of the silhouette.
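A minimal sketch of this counting strategy (our illustration, not the authors' implementation) follows; estimating the period from the spacing of the minima of N_f(t) is only one plausible choice, and depending on the convention the full gait period may be twice this spacing.

```python
import numpy as np

def gait_period(silhouettes):
    """Estimate the gait period from binary silhouettes of shape (T, H, W).

    Counts bottom-half foreground pixels per frame, N_f(t), and returns the
    median spacing between local minima (legs fully overlapped).
    """
    T, H, _ = silhouettes.shape
    nf = silhouettes[:, H // 2:, :].reshape(T, -1).sum(axis=1)   # N_f(t), bottom half
    minima = [t for t in range(1, T - 1)
              if nf[t] < nf[t - 1] and nf[t] <= nf[t + 1]]
    if len(minima) < 2:
        return None
    return int(np.median(np.diff(minima)))
```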
Fig. 6. Gait period comparison between the proposed method and the original baseline algorithm. The number of foreground pixels is counted over the bottom half of the silhouettes of sequence 02463C1AL, an ID in the HumanID data set.
Fig. 6 shows an instance of the regular variation of N_f(t) compared with the original baseline algorithm. Note that this strategy works for the elliptical paths.
4.3 Similarity Computation
The output of the gait recognition algorithm is a complete set of similarity scores between all gallery and probe gait sequences. Similarity scores are computed from spatial-temporal correlation. Let a probe sequence of N frames be denoted by Sequence_P = {S_P(1), S_P(2), ..., S_P(N)}, and a gallery sequence of K frames be denoted by Sequence_G = {S_G(1), S_G(2), ..., S_G(K)}. The final similarity score is constructed out of matches of disjoint portions of the probe with the gallery sequence. More specifically, we partition the probe sequence into disjoint subsequences of K_gait contiguous frames, where K_gait is the estimated period of the probe sequence from the previous step. Note that we do not constrain the starting frame of each partition to be from a particular stance. Let the m-th probe subsequence be denoted by Sequence_Pm = {S_P(mK_gait), ..., S_P((m+1)K_gait)}. The gallery gait sequence, Sequence_G = {S_G(1), S_G(2), ..., S_G(K)}, consists of all silhouettes extracted in the gallery sequence from the elliptical paths. There are three ingredients to the correlation computations: frame correlation, correlation between Sequence_Pm and Sequence_G, and similarity between a probe sequence and a gallery sequence. The most basic quantity is the similarity between two silhouette frames, FrameSimilarity(S_P(i), S_G(j)), which can be obtained from the ratio of the number of pixels in their intersection to the number in their union. This measure is also called the Tanimoto similarity measure, defined on two binary feature vectors. Thus, if we denote the number of foreground pixels in silhouette S by Num(S), then we have
FrameSimilarity(S_P(i), S_G(j)) = Num(S_P(i) ∩ S_G(j)) / Num(S_P(i) ∪ S_G(j)) .   (10)
Since the silhouettes have been prescaled and centered, we do not have to consider all possible translations and scales when computing the frame-to-frame similarity. The next step is to use frame similarities to compute the correlation between Sequence_Pm and Sequence_G as

Correlation(S_Pm, S_G)(l) = \sum_{j=0}^{K_gait - 1} FrameSimilarity(S_P(m + j), S_G(l + j)) .   (11)
For robustness, the similarity measure is chosen to be the median value of the maximum correlation of the gallery sequence with each of the probe subsequences. Other choices such as the average, minimum, or maximum did not result in better performance. The strategy of breaking up the probe sequence into subsequences allows us to address cases where the ASM produces segmentation errors.

Similarity(S_P, S_G) = Median_m ( max_l Correlation(S_Pm, S_G)(l) ) .   (12)
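The sketch below (our illustration, not the authors' code) strings Eqs. (10)-(12) together for binary silhouette arrays; it assumes prescaled, centered silhouettes and a known probe period k_gait.

```python
import numpy as np

def frame_similarity(a, b):
    """Tanimoto similarity of two binary silhouettes (Eq. 10)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def sequence_similarity(probe, gallery, k_gait):
    """Median of maximum subsequence correlations (Eqs. 11-12).

    probe:   (N, H, W) boolean array of probe silhouettes
    gallery: (K, H, W) boolean array of gallery silhouettes
    k_gait:  estimated gait period of the probe sequence
    """
    N, K = len(probe), len(gallery)
    maxima = []
    for start in range(0, N - k_gait + 1, k_gait):        # disjoint subsequences
        sub = probe[start:start + k_gait]
        corrs = [sum(frame_similarity(sub[j], gallery[l + j]) for j in range(k_gait))
                 for l in range(K - k_gait + 1)]           # Eq. (11) for each shift l
        maxima.append(max(corrs))
    return float(np.median(maxima))                        # Eq. (12)
```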
5 Experimental Results
The HumanID data set, in several views, was used for the experiment. The sequences were captured at 30 frames per second. The proposed gait recognition method used 10 model parameters to recognize multiform gait poses. Fig. 7 shows the successfully recognized results obtained by using ASM on the elliptical paths. Table 1 lists the identification rates reported by the baseline algorithm upon release of the gait challenge data set. For comparison, we also list the performance of the proposed method on the reduced data set. We see that i) the order of performance on the different experiments is the same for the baseline and the proposed algorithms, and ii) the performance of the proposed method is always higher than that of the baseline algorithm, and the gap increases for the more severe problems.

Table 1. Performance of recognition for the gait challenge data set using both the baseline and the proposed algorithms

Experiment                 | Baseline [3] | Proposed method
A (view)                   | 87 %         | 92 %
B (shoe)                   | 81 %         | 89 %
C (view + shoe)            | 54 %         | 85 %
D (surface)                | 39 %         | 81 %
E (shoe + surface)         | 33 %         | 80 %
F (view + surface)         | 29 %         | 82 %
G (view + shoe + surface)  | 26 %         | 72 %
# subjects in gallery      | 71           | 71
Fig. 7. Results of gait recognition by using ASM on the elliptical paths; (a) to (f) respectively show the 3rd, 129th, 282nd, 330th, 450th, and 503rd frames
6 Conclusion
We presented and evaluated an ASM-based gait recognition algorithm that refines silhouettes on the elliptical paths. The proposed model consists of an eigen-shape that captures the shape variation of each stance. We observed that the quality of the reconstructed silhouettes was better, particularly with respect to shadows, when ASM was used. However, the performance of ASM relies on the initial shape and the center point of the shape. In future research, we will investigate whether multi-gait recognition and multi-object tracking can be accomplished simultaneously in ASM-based visual surveillance.
Acknowledgment
This research was supported by the Korea Ministry of Information and Communication under the HNRC-ITRC program supervised by the IITA, by the Seoul Future Contents Convergence (SFCC) Cluster established by the Seoul R&BD Program, and by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MOST) (ROA-2003-000-10311-0). The authors would like to thank P. Jonathon Phillips (NIST), Patrick Grother (NIST), and Sudeep Sarkar (USF) for their help in providing them with the HumanID data set used in this paper.
References
1. Phillips, P., Sarkar, S., Robledo, I., Grother, P., Bowyer, K.: The gait identification challenge problem: data sets and baseline algorithm. In: Proc. 2002 Int. Conf. Pattern Recognition, pp. 385–388 (2002)
2. Phillips, P., Sarkar, S., Robledo, I., Grother, P., Bowyer, K.: Baseline results for the challenge problem of human id using gait analysis. In: Proc. 2002 Int. Conf. Automatic Face, Gesture Recognition, pp. 137–142 (2002) 3. Sarkar, S., Phillips, P., Liu, Z., Robledo, I., Grother, P., Bowyer, K.: The HumanID gait challenge problem: data sets, performance, and analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 167–177 (2005) 4. Cootes, T., Taylor, C., Cooper, D., Graham, J.: Training models of shape from sets of examples. In: Proc. 1992 Int. Conf. British Machine Vision, pp. 9–18. Springer, Heidelberg (1992) 5. Stegmann, M., Gomez, D.: A brief introduction to statistical shape analysis. Informatics and Mathematical Modelling, 1–15 (2002) 6. Shin, J., Kim, S., Kang, S., Lee, S., Paik, J., Abidi, B., Abidi, M.: Optical flow-based realtime object tracking using non-prior training active feature model. Real-Time Imaging, 204–218 (2005) 7. Boulgouris, N., Hatzinakos, D., Plataniotis, K.: Gait recognition: a challenging signal processing technology for biometric identification. IEEE Signal Processing Magazine, 78–90 (2005) 8. Nixon, M., Tan, T., Chellappa, R.: Human Identification Based on Gait. Springer, Heidelberg (2006) 9. Lee, S., Kang, J., Shin, J., Paik, J.: Hierarchical active shape model with motion prediction for real-time tracking of non-rigid objects. IET Computer Vision, 17–24 (2007)
Statistical Classification of Skin Color Pixels from MPEG Videos Jinchang Ren and Jianmin Jiang School of Informatics, University of Bradford, BD7 1DP, UK {j.ren,j.jiang1}@bradford.ac.uk http://dmsri.inf.brad.ac.uk/
Abstract. Detection and classification of skin regions play an important role in many image processing and vision applications. In this paper, we present a statistical approach for fast skin detection in MPEG-compressed videos. Firstly, conditional probabilities of skin and non-skin pixels are extracted from manually marked training images. Then, candidate skin pixels are identified using the Bayesian maximum a posteriori decision rule. An optimal threshold is then obtained by analyzing the probability of classification error on the basis of the likelihood-ratio histograms of skin and non-skin pixels. Experiments on sequences with varying illumination have demonstrated the effectiveness of our approach.
1 Introduction
Fast and accurate segmentation of skin pixels in images and videos is essential for many image processing and computer vision applications, such as face detection and tracking, facial expression recognition, gesture recognition, and naked-people detection, as well as content-based retrieval and efficient human-computer interaction. Since human skin has a consistent appearance that is significantly different from many other objects, pixel-based classification has been widely employed for its detection. In general, at least three issues need to be considered in skin classification, i.e. color representation and quantization, skin color modeling, and the classification approach. In real applications, some post-processing is also required for the detection and recognition of more semantic events including faces, hands, or even special skin patches such as naked images. Although many different color spaces have been introduced for skin detection, such as RGB or normalized RGB [3], HSV (or HSI, HSL, TSL) [2, 7, 11, 15], YCbCr (or YIQ, YUV, YES) [4], and CIELAB (or CIELUV) [8], they can simply be classified into two categories by examining whether the luminance component is considered. Due to differences between training and test data, various results have been reported: some authors argue that ignoring the luminance component helps to achieve more robust detection [6, 9, 10], while others insist that luminance information is essential for accurate modeling of skin colors [1]. Results of skin detection with and without the luminance component are compared in Section 3 of this paper. Moreover, it is now widely acknowledged that training in different color spaces produces comparable results as long as the Y component is included [1], i.e. an invertible conversion between color spaces can be achieved [16]. Consequently,
choosing a suitable color space merely depends on intrinsic requirements of efficiency rather than effectiveness, i.e. the chosen color space should be one whose components can be extracted from images or videos as simply as possible. For instance, YCbCr and RGB are naturally used for compressed and uncompressed images and videos, respectively. As for color quantization, various quantization levels have been suggested, such as 32, 64, 128 and 256 [1, 3]. A higher level means more storage space is required and hence lower efficiency in the detection process; however, there is no well-accepted scheme in this context. Therefore, we need to compare the performance under different levels, especially on test data under varying illumination. To model skin (and non-skin) colors, two main families of approaches are generally utilized, i.e. parametric and non-parametric ones. The former usually models skin colors as a Gaussian or a mixture of Gaussians, with the number of components in the mixture varying from 2 to 16 [8]; other parametric models include elliptic boundary models [4]. The parameters of these models are usually obtained by the EM (Expectation Maximization) approach [7]. Non-parametric approaches include histogram-based models [1, 3] and neural networks [1]. In addition, there are also some less precise models using fixed threshold ranges, such as the work in [6] and [11], although the latter contains a further step to adapt to the image content. It has been found that histogram-based and neural-network-based approaches generate almost the best results and outperform parametric approaches [1]. Given the obtained color models of skin and non-skin, skin pixels are usually determined using Bayesian decision rules such as maximum a posteriori, minimum cost, or maximum likelihood strategies [3]. The last uses only a skin color model, similar to approaches that use a look-up table for decisions, whereas the first two also maintain a model of non-skin colors, so that the likelihood ratio of a pixel's color under the skin and non-skin models is used for the decision. Other classification approaches include those using linear or elliptic decision boundaries [6, 8, 15]. Nevertheless, one or more thresholds are then required for such a decision, and unsuitable thresholds may lead to quite poor performance. Furthermore, existing approaches work mainly on uncompressed images and videos, which makes them less efficient, since most such media are available in compressed format and an expensive decompression is required before detection. Instead, our work is based on MPEG videos, in which skin pixels are detected directly in the compressed domain, avoiding the time-consuming inverse DCT; the intended applications are fast detection and indexing of human objects in videos. Consequently, it provides an efficient and fast implementation. Compared with previous work reported in [9] and [12], an optimal threshold on the likelihood ratio between skin and non-skin pixels is derived, which skips the iterative processing in [12]. Furthermore, even without a dynamic model as introduced in [2], results from sequences under varying illumination still seem very promising.
2 Statistical Modeling and Classification
Firstly, a histogram-based approach is utilized to model the colors of skin and non-skin pixels, for which manually created ground truth masks of skin and non-skin are extracted. The main difference between our work and others is training
in the compressed domain; thus, we need to map probabilities from the pixel level to the block level to cope with the requirements of MPEG. With the obtained skin and non-skin models, the Bayesian maximum a posteriori decision rule is employed for skin color classification. To determine an optimal threshold, a likelihood-ratio map of skin and non-skin colors is extracted, and the threshold is decided using a minimum-probability-of-error strategy. Further details of our model and approach are described below.
2.1 Modeling Skin and Non-skin Colors in Compressed Domain
We adopt the YCbCr color space in our approach as it is easily extracted from MPEG compressed videos. Then, for each color entry e_c = (y, c_b, c_r), its associated probabilities as skin and non-skin, p(e_c/skin) and p(e_c/nonskin), are extracted as follows.
p(e_c/skin) = sum(e_c/skin) / V_s ,   (1)

p(e_c/nonskin) = sum(e_c/nonskin) / V_{\bar{s}} ,   (2)
where sum(e_c/skin) and sum(e_c/nonskin) denote the number of occurrences in the training data in which the color entry e_c appears as skin and non-skin, respectively, and V_s and V_{\bar{s}} indicate the volumes of skin and non-skin data, i.e., the total number of occurrences in each model. In the uncompressed pixel domain, sum(.) can easily be attained by counting pixels with the same color entry. However, counting becomes complex in the compressed domain, as we can only access blocks, rather than pixels, in order to avoid the expensive inverse DCT. In fact, our training in the compressed domain is defined on the basis of DCT coefficients obtained after simple entropy decoding. These DCT coefficients are extracted from each macroblock of 16*16 pixels. In 4:2:0 chrominance format, one macroblock contains four luminance sub-blocks and two chrominance sub-blocks, and each sub-block has 8*8 pixels, as shown in Fig. 1.
Fig. 1. One macroblock in 4:2:0 chrominance format contains four luminance subblocks and two chrominance subblocks (a) and each subblock has 8*8 pixels (b)
For simplicity, only the DC component of each sub-block is extracted. Therefore, we have a total of 6 DC components: four from the Y sub-blocks, one from the Cb sub-block, and one from the Cr sub-block. A combined color entry of the macroblock, e_b, is then formed by using the average of the four Y DC components as its luminance and the Cb and Cr components as its chrominance. With the extracted block-based color entry, its skin and non-skin probabilities can be computed in a similar way to (1) and (2). However, a new definition of the sum(.) function is given in (3) and (4), where N_s(b) and N_{\bar{s}}(b) indicate the number of skin and non-skin pixels in the macroblock b, and N = 256 is the total number of pixels in b. Please note that N_s(b) + N_{\bar{s}}(b) ≠ N when the masks of skin and non-skin are defined separately, especially when a third class of pixels is introduced, although only two-class training is utilized [2].
sum(e_b/skin) = N_s(b) / N ,   (3)

sum(e_b/nonskin) = N_{\bar{s}}(b) / N .   (4)
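As an illustration only (the exact quantization level and DC scaling are not specified here and are assumptions), the sketch below forms the block-level color entry from the six DC coefficients and accumulates the fractional skin/non-skin counts of Eqs. (3)-(4) into histograms.

```python
import numpy as np

Q = 32  # bins per channel; one of the quantization levels discussed above (an assumption)

def block_entry(y_dc, cb_dc, cr_dc):
    """Quantized (y, cb, cr) color entry of a macroblock from its DC values.

    y_dc: the four luminance DC values; cb_dc, cr_dc: chrominance DC values.
    Assumes the DC values have already been rescaled to the 0-255 range.
    """
    y = sum(y_dc) / 4.0
    to_bin = lambda v: min(int(v * Q / 256.0), Q - 1)
    return to_bin(y), to_bin(cb_dc), to_bin(cr_dc)

def accumulate(hist_skin, hist_nonskin, entry, n_skin_px, n_nonskin_px, n=256):
    """Add the fractional counts of Eqs. (3)-(4) for one macroblock."""
    hist_skin[entry] += n_skin_px / n
    hist_nonskin[entry] += n_nonskin_px / n
```

The conditional probabilities of Eqs. (1)-(2) would then follow by dividing each histogram by its total mass.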
2.2 Bayesian Classification
Please note that the probabilities extracted above are the conditional probabilities of skin and non-skin, respectively. Given a color entry e_b, the posterior probabilities of skin and non-skin are determined as follows, using the well-known Bayes theorem in the inference process:

p(skin/e_b) = p(e_b/skin) p(skin) / [ p(e_b/skin) p(skin) + p(e_b/nonskin) p(nonskin) ] ,   (5)

p(nonskin/e_b) = p(e_b/nonskin) p(nonskin) / [ p(e_b/skin) p(skin) + p(e_b/nonskin) p(nonskin) ] ,   (6)
where p(skin) and p(nonskin) are the prior probabilities. According to the maximum a posteriori decision rule, e_b is more likely to be a skin color if its posterior probability of skin exceeds that of non-skin, i.e. p(skin/e_b) > p(nonskin/e_b). In other words, the posterior probabilities of skin and non-skin satisfy (7), where θ ≥ 1 is a constant:

p(skin/e_b) / p(nonskin/e_b) = [ p(e_b/skin) p(skin) ] / [ p(e_b/nonskin) p(nonskin) ] > θ .   (7)
Since the prior probabilities of skin and non-skin are strongly dependent on the training data and seem neither reliable nor objective, they are omitted from our classification by introducing a new term λ, defined as λ = p(skin)/p(nonskin). The decision rule in (7) then becomes (8), which amounts to thresholding the likelihood ratio of skin and non-skin for classification, with η = θ/λ the chosen threshold:

p(e_b/skin) / p(e_b/nonskin) > η → skin .   (8)
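A possible realization of the decision rule (8) for a frame of macroblock entries is sketched below; hist_skin and hist_nonskin are the histograms built during training (hypothetical names), and eta is the threshold derived in Section 2.3.

```python
import numpy as np

def classify_blocks(entries, hist_skin, hist_nonskin, eta, eps=1e-12):
    """Label each macroblock as skin using the likelihood ratio of Eq. (8).

    entries: list of quantized (y, cb, cr) entries, one per macroblock
    Returns a boolean list: True where p(e|skin)/p(e|nonskin) > eta.
    """
    p_skin = hist_skin / hist_skin.sum()          # p(e_b/skin)
    p_non = hist_nonskin / hist_nonskin.sum()     # p(e_b/nonskin)
    return [p_skin[e] / (p_non[e] + eps) > eta for e in entries]
```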
2.3 Optimal Thresholding
Obviously, the performance of detection and classification depends on a suitable value of the parameter η. There are several ways to choose this threshold, including global optimization based on ROC analysis [3], minimum probability of error [10], or even empirical choice [9]. In this paper, we adopt a probability-of-error analysis similar to that in [10], but the threshold is obtained by analyzing the effectiveness of the extracted skin and non-skin models as described below. Firstly, a logarithmic likelihood map (LLM), g(e_b), is derived as

g(e_b) = ρ ln( 1 + p(e_b/skin) / p(e_b/nonskin) ) ,   (9)
where ρ > 0 is a constant used to scale the LLM values into a given range, say [0, 255]. Consequently, the classification process becomes thresholding on this LLM. There are two reasons for applying the logarithmic operator to the likelihood ratio of skin and non-skin here: one is to enhance the details when the likelihood ratio is small, and the other is to help constrain the large range of the likelihood ratio into a relatively small one. Then, according to the skin and non-skin pixels, two histograms of this LLM, H_s and H_{\bar{s}}, are extracted separately from the skin and non-skin masks in the training data. In Fig. 2 below, H_s and H_{\bar{s}} show the distributions of this logarithm likelihood map over sample colors of skin and non-skin, respectively. Then, the accumulated probabilities of H_s and H_{\bar{s}} are extracted as A_s and A_{\bar{s}}, respectively. Curves of A_s and A_{\bar{s}} against the logarithm likelihood ratio are plotted in Fig. 3. If we take g as a threshold for classification, A_s(g) denotes the percentage of training data of skin color having a logarithm likelihood ratio no more than g, i.e. the missing detection rate; and A_{\bar{s}}(g) denotes the percentage of training data of
Fig. 2. Histograms of logarithm likelihood ratio of skin and non-skin colors
Fig. 3. Curves of A_s and A_{\bar{s}} against the logarithm likelihood ratio, indicating the potential missing detection rate and false alarm rate
non-skin color having a logarithm likelihood ratio greater than g, i.e. the false alarm rate. The overall probability of classification error is then derived as

P_error(g) = A_s(g) p(skin) + A_{\bar{s}}(g) p(nonskin) .   (10)

One solution for obtaining a suitable threshold g is to minimize P_error(g), taking p(skin) and p(nonskin) from the training data as the two weights in (10). An alternative solution is to choose the threshold that yields the same false alarm rate and missing detection rate, i.e. A_s(g) = A_{\bar{s}}(g); the corresponding probability of classification error then also equals A_s(g). As normally we have p(skin) < p(nonskin), the threshold obtained with the second solution is smaller than the one from the first solution; as a result, a higher detection rate but also more false alarms are to be expected. According to the training results shown in Fig. 3, the threshold from the first solution is found to be 49.25 with P_error = 1.38%, while the threshold from the second solution is 12.22 with P_error = 2.82%. Since we have ρ = 30, the corresponding thresholds in (8) are η = 4.164 and η = 0.5028, respectively.
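One way to realize the first (minimum-error) solution is sketched below; the binning of the LLM into 256 levels is our assumption, not a detail from the paper.

```python
import numpy as np

def optimal_threshold(llm_skin, llm_nonskin, p_skin, p_nonskin, n_bins=256):
    """Pick the LLM threshold g minimizing P_error(g) of Eq. (10).

    llm_skin, llm_nonskin: LLM values g(e_b) sampled over the skin / non-skin
    training masks; p_skin, p_nonskin: prior weights from the training data.
    """
    edges = np.linspace(0, 255, n_bins + 1)
    hs, _ = np.histogram(llm_skin, bins=edges)
    hn, _ = np.histogram(llm_nonskin, bins=edges)
    As = np.cumsum(hs) / hs.sum()            # missing detection rate per candidate g
    An = 1.0 - np.cumsum(hn) / hn.sum()      # false alarm rate per candidate g
    p_err = As * p_skin + An * p_nonskin     # Eq. (10)
    k = int(np.argmin(p_err))
    return edges[k + 1], p_err[k]
```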
Please note that the probability errors above are results on the training data only.
2.4 Post-processing
To fill small holes and remove spurs in the detected mask, morphological filtering is applied. Let M_0 and M_s denote the detected (binary) skin masks before and after this filtering; then

M_s = M_0 ⊕ B − B ,   (11)

where B is a 3 × 3 structuring element and ⊕ and − denote the morphological dilation and erosion operators, respectively. In addition, small areas whose sizes are less than a given threshold, s_0, are also removed from M_s. Since each pixel in the detected mask image represents one macroblock, i.e. 16*16 pixels in the original frame image, a relatively small s_0 of no more than 3 should be chosen in our system.
Fig. 4. Examples of four test frames (a) and their associated masks of skin (b), non-skin (c) and don’t care pixels (d)
Fig. 5. Four results of detected skin from the images in Fig. 4. (a) and (b) are our results using thresholds 12.2 and 49.25, respectively; (c) and (d) are results from Sigal et al. [2] using their static and dynamic models, respectively.
Firstly, we compare the skin masks detected by our approach with those from Sigal et al. [2]; in total, four groups of results are compared. Two of them are ours, using thresholds of 12.2 and 49.25, respectively; the other two groups are results from the static and dynamic models proposed in [2]. For the source images in Fig. 4, the detected skin masks are shown in Fig. 5, from which several facts can be observed:
• Results from the threshold of 12.2 have more false alarms than those from the threshold of 49.25, which indicates that the threshold derived from the minimum probability of classification error is more suitable in this context;
• Although the dynamic model may help to fill holes in the detection by adapting to the varying illumination, it also has the potential to cause more false alarms;
• The pixel-based model in [2] can successfully exclude small non-skin areas such as eyes and mouth and can accurately locate non-skin boundaries, owing to its finer resolution than our approach, which has a minimum resolution of one macroblock, i.e. 16*16 pixels. However, in comparison with Sigal's approach, our results with threshold 49.25 still yield better results in the first two test images (after removing small areas of noise) and a comparable result in the third test image.

Table 1. Performance comparisons of our approach and those from Sigal et al. in [2]
(Correct detection rates in %, for skin and for the non-skin background (bk).)

Seq. | #frames | Ours, thr. 12.2 (skin / bk) | Ours, thr. 49.25 (skin / bk) | Sigal static (skin / bk) | Sigal dynamic (skin / bk)
1    | 100 | 94.64 / 97.39  | 75.79 / 99.20 | 49.08 / 99.96 | 65.74 / 99.35
2    | 72  | 99.10 / 90.60  | 97.66 / 97.35 | 96.46 / 99.99 | 98.19 / 99.73
3    | 72  | 98.48 / 93.53  | 94.76 / 99.82 | 77.21 / 91.62 | 88.92 / 86.43
4    | 110 | 98.09 / 95.22  | 90.30 / 98.47 | 92.67 / 99.73 | 97.63 / 99.13
5    | 75  | 98.34 / 98.94  | 84.55 / 99.84 | 96.87 / 99.86 | 98.30 / 99.66
6    | 72  | 99.55 / 98.73  | 97.51 / 99.68 | 88.32 / 99.23 | 94.27 / 99.14
7    | 76  | 91.00 / 99.648 | 81.27 / 99.60 | 77.67 / 100.0 | 91.30 / 100.0
8    | 73  | 99.50 / 92.84  | 99.02 / 95.87 | 99.99 / 98.72 | 99.98 / 97.17
9    | 72  | 85.06 / 99.68  | 79.89 / 99.91 | 81.30 / 99.62 | 92.81 / 100.0
10   | 73  | 100.0 / 45.50  | 99.74 / 52.61 | 96.00 / 36.56 | 99.96 / 15.72
11   | 233 | 60.23 / 99.14  | 55.13 / 99.46 | 87.47 / 99.93 | 93.99 / 99.59
12   | 72  | 92.69 / 97.44  | 81.91 / 99.29 | 70.51 / 97.49 | 62.36 / 95.94
13   | 350 | 91.40 / 99.01  | 79.21 / 99.85 | 67.73 / 99.96 | 82.21 / 99.71
14   | 72  | 99.51 / 98.90  | 97.34 / 99.78 | 91.79 / 99.98 | 98.90 / 97.73
15   | 75  | 99.09 / 89.74  | 96.19 / 97.31 | 91.03 / 95.37 | 94.10 / 90.18
16   | 50  | 76.02 / 98.91  | 40.00 / 99.97 | 52.05 / 100.0 | 88.95 / 99.82
17   | 75  | 95.88 / 99.90  | 89.63 / 99.97 | 97.89 / 99.98 | 99.43 / 99.66
18   | 91  | 92.91 / 99.82  | 82.59 / 99.97 | 92.07 / 99.99 | 98.60 / 99.94
19   | 73  | 43.68 / 99.67  | 32.17 / 99.89 | 11.02 / 99.91 | 24.29 / 99.48
20   | 120 | 75.67 / 99.70  | 41.58 / 99.96 | 18.75 / 100.0 | 55.79 / 90.95
21   | 53  | 98.55 / 92.20  | 93.59 / 98.52 | 92.96 / 98.42 | 97.94 / 95.15
By calculating the correct detection rates of both skin and non-skin background, quantitative comparisons of our results with those of Sigal are given in Table 1, in which Sigal's results are duplicated directly from [2]. Please note that, since the minimum resolution of our detection is one macroblock, our approach cannot yield accurate boundaries between skin and non-skin areas, which certainly introduces some inaccuracy into such a quantitative performance analysis. Nevertheless, we can still find that our results with a threshold of 49.25 yield better
or comparable results to Sigal’s models in 11 sequences (#1, #3, #6, #10, #12, #13, #14, #15, #19, #20 and #21). Considering its efficiency in compressed domain and inaccuracy in such a measurement, our results are proved very promising in spite of varying luminance in those test sequences.
4 Conclusions
We presented an approach for skin detection in compressed MPEG videos. We discussed in detail how statistical models of skin and non-skin can be trained at the macroblock level. Through analysis of the likelihood ratio of skin and non-skin colors, we found the threshold obtained by minimizing the probability of classification error to be more suitable for global thresholding. Further investigation will address face detection and recognition from the detected skin candidates for semantic video indexing and retrieval.
Acknowledgement
Finally, the authors wish to acknowledge the financial support under the EU IST FP-6 Research Programme with the integrated project LIVE (Contract No. IST-4-027312).
References 1. Phung, S.L., Bouzerdoum, A., Chai, D.: Skin Segmentation Using Color Pixel Classification: Analysis and Comparison. IEEE T-PAMI. 27(1), 148–154 (2005) 2. Sigal, L., Sclaroff, S., Athitsos, V.: Skin Color-Based Video Segmentation under TimeVarying Illumination. IEEE T-PAMI. 26(7), 862–877 (2004) 3. Jones, M.J., Rehg, J.M.: Statistical Color Models with Application to Skin Detection. Int. J. Computer Vision. 46(1), 81–96 (2002) 4. Hsu, R.-L., Abdel-Mottaleb, M., Jain, A.K.: Face Detection in Color Images. IEEE TPAMI. 24(5), 696–706 (2002) 5. Wu, H., Chen, Q., Yachida, M.: Face Detection from Color Images Using a Fuzzy pattern Matching Model. IEEE T-PAMI. 21(6), 557–563 (1999) 6. Chai, D., Ngan, K.N.: Face Segmentation Using Skin-Color Map in Videophone Applications. IEEE T-CSVT. 9(4), 551–564 (1999) 7. Tan, R., Davis, J.W.: Differential Video Coding of Face and gesture Events in Presentation Videos. Int. J. CVIU. 96, 200–215 (2004) 8. Kakumanu, P., Makrogiannis, S., Bourbakis, N.: A Survey of Skin-Color Modeling and Detection Methods. Pattern Recognition. 40, 1106–1122 (2007) 9. Wang, H., Chang, S.-F.: A Highly Efficient System for Automatic face Region Detection in MPEG Video. IEEE T-CSVT. 7(4), 615–628 (1997) 10. Habili, N., Lim, C.C., Moini, A.: Segmentation of the Face and Hands in Sign language Video Sequences Using Color and Motion Cues. IEEE-TCSVT 14(8), 1086–1097 (2004) 11. Cho, K.-M., Jang, J.-H., Hong, K.-S.: Adaptive Skin-Color Filter. Pattern Recognition. 34, 1067–1073 (2001) 12. Zheng, Q.-F., Gao, W.: Fast Adaptive Skin Detection in JPEG Images. In: Ho, Y.-S., Kim, H.J. (eds.) PCM 2005. LNCS, vol. 3768, pp. 595–605. Springer, Heidelberg (2005)
13. Zhu, Q., Cheng, K.-T., Wu, C.-T., Wu, Y.-L.: Adaptive Learning of an Accurate Skin-Color Model, 37–42 (2004) 14. Zhang, M.-J., Gao, W.: An Adaptive Skin Color Detection Algorithm with Confusing Background Elimination. Proc. ICIP II, 390–393 (2005) 15. Garcia, C., Tziritas, G.: Face Detection Using Quantized Skin Color Regions Merging and Wavelet Packet Analysis. IEEE T-Multimedia. 1(3), 264–277 (1999) 16. Albiol, A., Torres, L., Delp, E.J.: Optimum Color Spaces for Skin Detection. In: Proc. ICIP. I, pp. 122–124 (2001)
A Double Layer Background Model to Detect Unusual Events Joaquin Salas, Hugo Jimenez-Hernandez, Jose-Joel Gonzalez-Barbosa, Juan B. Hurtado-Ramos, and Sandra Canchola CICATA-IPN Unidad Querétaro, Cerro Blanco 141, Col. Cimatario, CP 76090, Querétaro, México
Abstract. A double layer background representation to detect novelty in image sequences is shown. The model is capable of handling nonstationary scenarios, such as vehicle intersections. In the first layer, an adaptive pixel appearance background model is computed. Its subtraction with respect to the current image results in a blob description of moving objects. In the second layer, motion direction analysis is performed by a Mixture of Gaussians on the blobs. We have used both layers for representing the usual space of activities and for detecting unusual activity. Our experiments clearly showed that the proposed scheme is able to detect activities such as vehicles running on red light or making forbidden turns.
1 Introduction
In this paper, we define unusual events as motion events that cannot be interpreted in terms of an existing probabilistic model. An abnormal or unusual event results when the observations do not fit the current pattern of activity, which corresponds to regular or background motion. For computational efficiency, motion analysis is carried out on salient objects, which result from subtracting the current image from the intensity background model. Owing to its particular constraints, we show that this double layer background model is especially suitable for crossroad scenarios. Vehicular intersections offer a unique set of constraints, such as the regularity of trajectories and the predictability of vehicular flow. Observation of long-term sequences can be used to learn the typical trajectories [7,5], which can be represented with a multidimensional Gaussian distribution [11]. In our case, we have introduced a strategy that does not require maintaining a history of all prior data points, making it suitable for streaming video applications. Detection of unusual events can be defined as a problem in which the issue is to classify what is normal or common and what is not. In this sense, normal events can be interpreted as whatever remains in the scene background. Toyama et al. [17] developed an extensive review of the functional parts of an ideal background maintenance system, while Piccardi [10] reviewed some of the main methods. Detecting unusual activity may be difficult because, during training, unusual events rarely occur [19]. Most frequently, unusual events
are modeled using Hidden Markov Models (HMMs) [3]. HMMs are perhaps the most successful framework in perceptual computing for modeling and classifying dynamic behaviors, as they offer dynamic time warping, a training algorithm, and clear Bayesian semantics. Nonetheless, other possibilities have been explored. One of them is the representation of the tracked trajectory in a binary tree structure that is used for classification [14]; another is the characterization of the video input as temporal templates [20]. In this work we present an algorithm to model and to detect unusual events at crossroads, such as vehicles making forbidden turns or running on red light, from the perspective that almost everything we have observed belongs to a background space of activities. For this purpose, the paper is organized in several sections covering the background model framework (Section 2), the properties of the first layer (Section 3), and the motion model which defines the second layer (Section 4). Then, in Section 5, we describe the unusual activity detection method based on a probabilistic approach. Finally, in Section 6, we present results of the algorithm implementation at a real crossroad and conclude the paper.
Fig. 1. Overview of the proposed framework
2 Background Model Framework
We define background modeling as the problem of segmenting the different elements in the scene depending on how fast they change, as perceived in an image sequence. Besides modeling the statistical distribution of the data, another important factor in background modeling is to choose an adequate transformation model, which has to be applied to the original data in order to obtain useful features. Several features can be considered, including color, spatial gradients, texture, and optical flow. The driving force behind the specific feature selection is the desire to obtain invariance to certain types of changes in the visual
space while maintaining a good detection of foreground objects. For instance, in outdoor scenes, the classification must be robust with respect to changes in illumination conditions that might occur due to sunlight changes, clouds, or light from nearby light sources. Similarly, in dynamic scenes such as ocean waves or waving trees, invariance to such periodic motion is critical. Each feature has its strengths and weaknesses and is particularly applicable for handling a certain type of variation. For instance, normalized color, spatial gradients, or texture features may be considered for obtaining invariance to illumination, while optical flow might be useful in handling dynamic scenes. Conversely, such features may not be very suitable in other situations: spatial gradients or texture are not very suitable for a region that has low spatial gradients, and optical flow cannot be computed accurately in regions that have low texture. We propose the double layer background model depicted in Fig. 1. In the first layer, we use a function that characterizes each pixel by the variations of its appearance through time. This layer is invariant to a change of illumination. If a pixel's appearance cannot be inferred from previous images, the pixel is analyzed by the second layer. In the second layer, the background model represents the motion that may be present at each pixel with a multimodal probabilistic function. This layer introduces the invariance to periodic motion. In our scheme to model the background, we have used simple values over long times, making parametric assumptions feasible. Mixed backgrounds can be modeled by mixtures of Gaussians (MOG), where each of these Gaussians explains the intrinsic dynamics present in the background, either at the intensity or motion level. The key features of the algorithm are:
– The use of a static intensity background model having a permissible range of intensity variations at each pixel.
– The capability of coping with illumination changes.
– The use of an adaptive and compact background model that can characterize structural background motion over a long period of time. This allows us to encode dynamic backgrounds or multiple backgrounds.
– A layered detection scheme that allows us to model dynamic backgrounds.
3 First Layer
The behavior of vehicles is different inside and outside the crossroads area. Inside, the vehicles are always moving, while behind the crosswalks, the vehicles may be waiting for the appropriate green light. The four-sided area delimited by the crossroads defines our region of interest (ROI) and can be represented as follows. Let P = {p_1, p_2, . . . , p_n, p_{n+1}} be the set of vertex points of a polygonal ROI, numbered in counter-clockwise order, where p_1 = p_{n+1} and p_k = (x_k, y_k). Using these corner points, n regions can be defined such that

C_k(x) = (y − y_{k−1})(x_k − x_{k−1}) − (x − x_{k−1})(y_k − y_{k−1}) > 0,
(1)
for k = 2, . . . , n + 1, is a logical predicate that divides the plane in two regions. This way, the ROI can be defined as the intersection between these regions
(a) ROI. (b) Example of some dynamic objects.
Fig. 2. The moving objects can be detected by subtracting the current image from the background appearance model. The result is segmented into groups of connected pixels. This procedure is useful to detect moving objects in a region of interest (ROI).
R(x) = \bigwedge_{k=2}^{n+1} C_k(x) .   (2)
R(x) is a boolean variable that is true whenever x = (x, y) is inside the ROI and false otherwise.
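As an illustration (not taken from the paper), the predicate-based ROI test of Eqs. (1)-(2) can be written as follows; the counter-clockwise ordering of the vertices is assumed as stated above.

```python
def inside_roi(x, y, vertices):
    """Return True if (x, y) lies inside the polygonal ROI (Eqs. 1-2).

    vertices: list of (x_k, y_k) corner points in counter-clockwise order,
    with the first point repeated at the end (p_1 = p_{n+1}).
    """
    for (x0, y0), (x1, y1) in zip(vertices[:-1], vertices[1:]):
        # C_k(x): the point must lie to the left of every directed edge
        if (y - y0) * (x1 - x0) - (x - x0) * (y1 - y0) <= 0:
            return False
    return True
```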
3.1 Appearance Model
In the first layer of the background, we use a pixel appearance model. An important processing stage is how to obtain the initial background model [6]. The strategy that we have used is to compute the appearance model using the median of a certain number of images [15]. Let I(x, t) be an image description, where x is a spatial position and t is a time stamp. In general, what is perceived as an image is J(x, t), a noisy version of I(x, t) given by J(x, t) = I(x, t) + δ(x, t), where δ(x, t) is assumed to be a Gaussian random variable with zero mean. We assume that the change in illumination conditions comes from smooth variations due to daylight changes. This assumption rules out scenarios where light emission changes drastically from one moment to the next. In the present application, the background is supposed to be free from objects. Thus a single Gaussian can model the perceived changes in intensity. Let a Gaussian process be modeled as

g(x; μ_k, Σ_k) = \frac{1}{2π |Σ_k|^{1/2}} \exp( −\frac{1}{2} (I(x) − μ_k)^T Σ_k^{−1} (I(x) − μ_k) ) ,   (3)
where μ_k and Σ_k are respectively the mean and the covariance matrix. For the case of gray images and the temporal gradient, μ_k and Σ_k reduce to scalars; for color images, their dimensions are 3 × 1 and 3 × 3, respectively. When a new observation I(x, t) is available, it is compared against the parameters of the Gaussian model. If

|| I(x) − μ_k ||_2 ≤ α || Σ_k || ,   (4)
where || · || and || · ||_2 denote some type of norm operators and α is a constant that may depend on x, then it is assumed that the observation is likely to be produced by a perturbation of the true value, which otherwise should be similar to the one expressed by the model. The parameters of the Gaussian are adapted as time passes by following on-line Expectation Maximization (EM) [13]. That is,

μ_k ← ρ μ_k + (1 − ρ) I(x, t),
Σ_k^2 ← ρ Σ_k^2 + (1 − ρ)(I(x, t) − μ_k)(I(x, t) − μ_k)^T ,   (5)

where ρ ∈ [0, 1] is the learning rate.
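A rough per-pixel sketch of this test-and-update step (our illustration; α and ρ are assumed scalars, and a squared-distance variant of Eq. (4) is used) could look like this:

```python
def update_pixel_model(mu, var, obs, alpha=2.5, rho=0.95):
    """Check Eq. (4) and, if the observation fits, adapt the model via Eq. (5).

    mu, var: current per-pixel mean and variance (grayscale case)
    obs:     new intensity I(x, t)
    Returns (mu, var, is_background).
    """
    if (obs - mu) ** 2 <= alpha * var:            # squared-norm form of Eq. (4)
        mu = rho * mu + (1 - rho) * obs           # Eq. (5)
        var = rho * var + (1 - rho) * (obs - mu) ** 2
        return mu, var, True
    return mu, var, False                         # pixel handed to the second layer
```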
Fig. 3. Usual Activity Space. In (a), (b), and (c), we illustrate the movement frequency which defined the number of Gaussian at each pixel location for the three different states in the studied scenario.
4 Second Layer
If a pixel's appearance cannot be inferred from previous images using Eq. (4), the pixel is analyzed by the background motion layer. In this layer, the background modeling process is built from the regular trajectories that moving objects describe in the scene, or from the regular motion magnitude. The problem of detecting where a feature moves from one image frame to the next has many interesting facets, including objects undergoing partial or total occlusion or being subject to complex appearance transformations. In our case, the objects are assumed to be rigid and hence, although there are some effects due to perspective and scene location, the observed transformations involve primarily rotations and translations. Furthermore, we assume that we can achieve a sufficiently high frame processing rate so that vehicles' appearance is effectively quite similar from frame to frame. Lucas and Kanade proposed, in a milestone paper, a strategy for additive image alignment based on a Newton-Raphson type of iterative formulation [8]. The translation of a feature between frames was computed with a steepest-descent minimization strategy. In principle, a more general transformation including affine warping and translation could be sought. However, in practice, Shi and Tomasi showed that this procedure could
be numerically unstable [16]. The procedure uses the optical flow invariance constraint, which assumes that the light intensity reflected by a feature remains equal from frame to frame. That is, let I'(x) and I(x) be two consecutive images. It has been shown that the displacement d of a feature F can be computed using the recursive equation [8,16]

d_{k+1} = d_k + Z^{−1} e ,   (6)

where Z = \sum_{x ∈ F} \begin{pmatrix} g_x^2 & g_x g_y \\ g_x g_y & g_y^2 \end{pmatrix} is the structural tensor, and e = \sum_{x ∈ F} (I'(x) − I(x)) g is a scaled version of g = (g_x, g_y)^T = ∇I'(x), the gradient. The value of Z is a good reference on how easy it is to track a feature: when its eigenvalues are small, the displacement is large and convergence may be difficult. Occlusion seems to be the prime problem for robust tracking. Strategies to deal with it include the use of sub-features [2], high-level reasoning modules [18], bounding-box models [1], temporal templates produced with inter-frame differences [9], active models [7], or multiple hypotheses [14]. In the case of this study, we do not deal explicitly with occlusion because experimentally we have made two observations: first, as shown in Section 6, it accounts for a small portion of the vehicles; and second, unusual maneuvers are commonly performed by isolated vehicles, and when that is not the case, the event is likely to be detected as an unusual activity for all the vehicles in the group.
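A rough sketch of the resulting Lucas-Kanade translation step is shown below; it is our illustration, the window selection and sign conventions are assumptions, and bilinear warping is delegated to scipy.

```python
import numpy as np
from scipy import ndimage

def track_feature(I0, I1, window, n_iter=10):
    """Iteratively estimate the translation d of a feature window via Eq. (6).

    I0, I1: consecutive grayscale frames (float arrays)
    window: (slice_y, slice_x) selecting the feature region F
    """
    gy, gx = np.gradient(I1.astype(float))
    d = np.zeros(2)                                    # (dy, dx)
    for _ in range(n_iter):
        I1w = ndimage.shift(I1, -d, order=1)           # warp I1 back by current d
        gyw = ndimage.shift(gy, -d, order=1)[window]
        gxw = ndimage.shift(gx, -d, order=1)[window]
        err = I0[window] - I1w[window]
        Z = np.array([[np.sum(gyw * gyw), np.sum(gyw * gxw)],
                      [np.sum(gyw * gxw), np.sum(gxw * gxw)]])   # structural tensor
        e = np.array([np.sum(err * gyw), np.sum(err * gxw)])
        d = d + np.linalg.solve(Z, e)                  # d_{k+1} = d_k + Z^{-1} e
    return d
```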
4.1 Blob Computing
Let B = {b_1, . . . , b_n} be the set of pixels that belong to the dynamic object. The objective now is to cluster them into segments S_1, . . . , S_m such that the intersection between S_i and S_j for i ≠ j is empty and each S_i holds a set of connected pixels {b_{i(1)}, . . . , b_{i(a)}}. Two pixels b_i = (x_i, y_i) and b_j = (x_j, y_j) are connected when there is either an immediate or an intermediate connection between them [12]. The pixels have an immediate connection when max(|x_i − x_j|, |y_i − y_j|) ≤ k. On the other hand, they have an intermediate connection when there is a pixel b_k with which either b_i or b_j has an immediate connection; the pixel that does not have the immediate connection with b_k then has an intermediate connection instead. Segments that are too small are assumed to come from noise and are discarded. The displacement vector of the segment S_i is the mean of the displacement vectors of its pixels {b_{i(1)}, . . . , b_{i(a)}}, computed by solving Eq. (6) (see Fig. 4). The displacement vectors of these pixels are then updated with their mean.
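A compact way to realize this grouping is sketched below (our illustration; 8-connectivity and the min_size noise threshold are assumed values).

```python
import numpy as np
from scipy import ndimage

def compute_blobs(fg_mask, disp, min_size=20):
    """Group foreground pixels into connected blobs and average their motion.

    fg_mask: 2-D boolean foreground mask (first-layer output)
    disp:    (H, W, 2) per-pixel displacement vectors from Eq. (6)
    Returns a list of (pixel_count, mean_displacement) per retained blob.
    """
    labels, n = ndimage.label(fg_mask, structure=np.ones((3, 3)))  # 8-connectivity
    blobs = []
    for lab in range(1, n + 1):
        idx = labels == lab
        if idx.sum() < min_size:
            continue                                   # discard small segments (noise)
        blobs.append((int(idx.sum()), disp[idx].mean(axis=0)))
    return blobs
```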
4.2 Motion Model
Fig. 1 shows the motion model characterization in the second layer. The features used in this layer are the motion magnitude or motion direction. The motion magnitude does not depend on the traffic light sequence and can be modeled, as the appearance background, by a mixture of Gaussians. However, the motion
Fig. 4. Displacement vector for the pixels of a blob
Fig. 5. Some unusual events detected with our method. (a) Running on red light. (b) Forbidden turn (too wide). (c) Forbidden turn (too wide).
direction characterization depends on the traffic light sequence. The double layer is present in the background modeling process: the activity observed at each pixel location is modeled with a mixture of Gaussians (MOG) whose modes describe the main motion directions. During operation, a particular observation can be assigned a probabilistic measure that describes how likely it is; unlikely observations are called unusual events. This is in contrast to other approaches [7,3] where, once the trajectories of many vehicles have been accounted for, it is possible to arrive at a higher level of representation suitable for the description of activity. Our approach relies on a MOG to describe the activity taking place at a particular pixel location as perceived from a fixed camera. Given a set of n angular directions, θ_1, . . . , θ_n ∈ [0, 2π], and a family F of probability density functions, the problem is to find the probability density f(θ) ∈ F that is most likely to have generated the given directions. In this method, each member of the family F has the same general Gaussian form; each member is distinguished by different values of a set of parameters Γ [4]. In this way,
f(θ; Γ) = \sum_{k=1}^{K} p_k g(θ; μ_k, σ_k) ,   (7)
where g(θ; μ_k, σ_k) is a 1-dimensional Gaussian function, as in Eq. (3), and Γ = (γ_1, . . . , γ_K) = [(p_1, μ_1, σ_1), . . . , (p_K, μ_K, σ_K)] is a 3K-dimensional vector containing the mixing probabilities p_k as well as the means μ_k and standard deviations σ_k of the K Gaussian functions in the mixture. When a new observation θ_t is available, it is compared against the parameters of the Gaussian models. Classification and learning can be done as indicated in Eqs. (4) and (5), respectively. After a considerable number of frames have been processed, the MOG consists of a set of Gaussians along with the number of samples that were used to define each of them. The MOG is then pruned to eliminate Gaussians that have small support.
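For illustration (not the paper's code), the direction mixture of Eq. (7) can be evaluated as follows; the circular wrapping of the angular difference is our choice and is not stated in the paper.

```python
import numpy as np

def mog_density(theta, mixture):
    """Evaluate the direction MOG of Eq. (7) at angle theta.

    mixture: list of (p_k, mu_k, sigma_k) triples, i.e. the vector Gamma.
    """
    total = 0.0
    for p, mu, sigma in mixture:
        d = (theta - mu + np.pi) % (2 * np.pi) - np.pi     # circular difference
        total += p * np.exp(-0.5 * (d / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return total
```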
5 Tracking and Classification of Activity
In the second layer, the motion direction may be seen as a deterministic machine controlled by the traffic light that cycles through a number of states Θ_1 → Θ_2 → . . . → Θ_n → Θ_1. In each specific state Θ_i, certain routes are present and others may be considered abnormal. Thus, passing on red light or making a forbidden turn may be considered abnormal either because it is happening at the wrong moment or because there were no training samples for it. Each state defines a usual activity space, which is represented by a specific MOG at each pixel location. When a new state arrives (a change in the traffic lights), the usual activity space changes accordingly. It is assumed that there is a way to let the vision system know that a new state has started; for instance, this can be a direct connection to the traffic light controller box. Fig. 3 shows a description of the usual activity space for the three states composing the studied cycle. Once we have the model of the normal behavior of vehicles in the crossroads, it is possible to start identifying unusual events. At each pixel position, we have a MOG describing the usual directions of motion present in the training sequence. During operation, the centroid x corresponding to a particular moving object is computed (Fig. 2); the centroid x and its displacement vector (Fig. 4) are used for tracking the object. Let X = {x_1, . . . , x_n} be the ordered set of pixel points in a vehicle's trajectory. The probability of observing this particular trajectory is

p(x_1, . . . , x_n) = p(x_n | x_{n−1}, . . . , x_1) p(x_{n−1} | x_{n−2}, . . . , x_1) · · · p(x_2 | x_1) p(x_1).
(8)
Assuming a Markovian condition, where each observation depends solely on the last one, the expression can be rewritten as

p(x_1, . . . , x_n) = p(x_n | x_{n−1}) p(x_{n−1} | x_{n−2}) · · · p(x_2 | x_1) p(x_1).   (9)

Here, x_i and x_{i−1} are dependent because the new position is the previous position plus a displacement; that is, x_i = x_{i−1} + a_{i−1} u_{i−1}, where a is a constant
related to the vehicle's speed and u_{i−1} is a unit vector. Then p(x_i | x_{i−1}) can be written as p(x_i | x_{i−1}) = p(a_{i−1} u_{i−1} | x_{i−1}). In this way, a possible measure for the likelihood of the trajectory X could be

L(x_1, . . . , x_n) = p(u_{n−1} | x_{n−1}) p(u_{n−2} | x_{n−2}) · · · p(u_1 | x_1) = \prod_{i=1}^{n−1} p(u_i | x_i).   (10)
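A sketch of Eq. (10) in log form, reusing the mog_density function from the earlier sketch, could look like this (the data structures are hypothetical):

```python
import numpy as np

def trajectory_log_likelihood(points, directions, mogs):
    """Log of the trajectory likelihood L(x_1, ..., x_n) of Eq. (10).

    points:     list of pixel positions x_i along the tracked trajectory
    directions: list of motion angles u_i (one per step)
    mogs:       mapping from pixel position to its direction MOG (Eq. 7)
    A low value flags the trajectory as an unusual event.
    """
    log_l = 0.0
    for x, u in zip(points[:-1], directions):
        log_l += np.log(mog_density(u, mogs[x]) + 1e-12)   # p(u_i | x_i)
    return log_l
```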
The previous condition expresses the temporal and spatial coherence of motion and can be part of the information carried by the blob being tracked.

Table 1. Statistics for the experiment performed. a) Success of vehicle tracking; the result has been computed from a hand count matched with the system count. b) Unusual events detected using only the tracked vehicles, divided into two possible types: red-light running and forbidden turns; the percentage of the total is shown in parentheses.

a) Vehicle tracking
State | #Vehicles | Untracked | % Error
1     | 262       | 40        | 15.3
2     | 286       | 33        | 11.5
3     | 176       | 18        | 10.2
Total | 724       | 91        | 12.6

b) Unusual events
State | #Red light (%) | #Forbidden (%) | Total (%)
1     | 2 (3.4)        | 9 (0.8)        | 11 (4.2)
2     | 5 (1.8)        | 2 (0.7)        | 7 (2.5)
3     | 16 (0)         | 0 (0)          | 16 (9.1)
Total | 30 (4.1)       | 4 (0.6)        | 34 (4.7)
6 Results
We have programmed the algorithms to execute the previously described method using Matlab (TM). For our experiments, we used a sequence of 20,000 images with a 320 × 240 resolution; the camera is located on a 28 m high tower at one of the corners of a vehicular crossroads. The traffic light control has three states. In one of the states (let us call it the first state), vehicles running from west to east (left to right in the images shown along the paper) and also turning left when driving in the same direction have the green light. In the second state, the green light is for vehicles running from east to west and turning left when driving in the same direction. Finally, the third state is when vehicles running north to south and south to north simultaneously (up-down and down-up in the images) have the green light; no left turns are permitted in this state. After the third state, the cycle begins again. The experimental sequence has 12 complete cycles through these states. We have used the first 6 cycles for training and the rest for testing. Each training cycle sequence was divided into subsequences corresponding to the three different states; then, the subsequences corresponding to the same state were processed to obtain the normal event space for each particular state. As a result of the training phase we have (a) a region of interest, (b) an initial model of the background, and (c) a description of the normal event space for each of the individual states which are part of the cycle. The first cycle, in both the training and testing sequences, was used for background initialization. We have computed the most frequent gray level for each
pixel in the image. Then, a Gaussian model was used to interpret the variations observed along the sequence. When the variations could be interpreted by the Gaussian model, the sample was used for learning; otherwise, it was assumed that a foreground object was occluding the background. During operation, the usual event space is loaded simultaneously with the image that contains a traffic light change (change of state). The appropriate event space is then accessible and the execution continues. Next, the observed events are compared to what is considered normal for that particular state. The probabilities along the trajectory are evaluated, and those with a low probability value are considered unusual events. Results are summarized in Table 1. During testing, we manually counted 724 vehicles. About 87.4% were successfully tracked as individual vehicles. In most cases, untracked vehicles were so close together that one of them occluded the other or the moving-object extraction module returned them as a single connected blob. For unusual event detection that number is significant because, in such a situation, as we previously noticed, vehicles tend to be isolated, and they were successfully tracked in all cases. The percentage of vehicle maneuvers that were classified as unusual was considerably high, about 4.7%; most of the unusual events detected are running on red light, which at 4.1% accounts for almost half the observed unusual events.
7 Conclusion
A strategy where usual motion activity is modeled with a dual background layer has been successfully tested at a vehicular intersection. The method reliably detects unusual events such as red-light infringements and forbidden turns. While the first layer tells us what is moving, the second layer tells us the position. These tightly coupled layers complement each other and help to reduce the computational burden. The first layer deals with appearance aspects, such as intensity or color; the second one uses the objects' motion directions or magnitudes. The model adapts to different illumination conditions and to the modes caused by the traffic-light controller. The method does not require high-level modeling of vehicles' trajectories since the decisions are taken at the pixel level. For this particular problem, occlusion does not represent a big issue because most of the vehicles taking part in unusual events tend to be isolated. When they are not, the statistics may slightly affect the results, but the kind of activity that the group of vehicles is taking part in is still going to be detected. We have exploited some constraints that surround the scene, including the simplicity of the background in the region of interest, the rigidity of the objects being observed, and the regularity of the trajectories.
References 1. Atev, S., Arumugam, H., Masoud, O., Janardan, R., Papanikolopoulos, N.P.: A Vision-Based Approach to Collision Prediction at Traffic Intersections. IEEE Transactions on ITS 6(4), 416–423 (2005) 2. Beymer, D.J., McLauchlan, P., Coifman, B., Malik, J.: A Real Time Computer Vision System for Measuring Traffic Parameters. In: CVPR, pp. 495–501 (1997)
3. Chan, M.T., Hoogs, A., Schmiederer, J., Petersen, M.: Detecting Rare Events in Video using Semantic Primitives with HMM. In: ICPR, vol. IV, pp. 150–154 (2004) 4. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, Chichester (2001) 5. Snidaro, L., Foresti, G.L.: Vehicle detection and tracking for traffic monitoring. In: 13th International Conference Image Analysis and Processing (2005) 6. Gutchess, D., Trajkovics, M., Cohen-Solal, E., Lyons, D., Jain, A.K.: A Background Model Initialization Algorithm for Video Surveillance. In: ICCV, vol. 1, pp. 733– 740 (2001) 7. Johnson, N., Hogg, D.C.: Learning the Distribution of Object Trajectories for Event Recognition. Image and Vision Computing 14(8), 609–615 (1996) 8. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Image Understanding Workshop, pp. 121–130 (1981) 9. Medioni, G., Cohen, I., Bremond, F., Hongeng, S., Nevatia, R.: Event detection and analysis from video streams. In: USC Computer Vision (2001) 10. Piccardi, M.: Background Subtraction Techniques: A Review. In: IEEE International Conference on Systems, Man, and Cybernetics, vol. 4, pp. 3099–3104 (2004) 11. Pless, R.: Spatio-Temporal Background Models for Outdoor Surveillance. EURASIP Journal on Applied Signal Processing 2005(14), 2281–2291 (2005) 12. Rosenfeld, A.: Connectivity in digital pictures. Journal of the ACM 17(1), 146–160 (1970) 13. Stauffer, C., Grimson, W.E.L.: Adaptive Background Mixture Models for RealTime Tracking. CVPR 2, 246–252 (1999) 14. Stauffer, C., Grimson, W.E.L.: Learning Patterns of Activity Using Real-Time Tracking. IEEE Transactions on PAMI 22(8), 747–757 (2000) 15. Tai, J.-C., Song, K.-T.: Background Segmentation and its Application to Traffic Monitoring using Modified Histogram. In: Int. Conf. on Networking, Sensing and Control, vol. 1, pp. 13–18 (2004) 16. Tomasi, C., Shi, J.: Good features to track. In: CVPR, pp. 593–600 (1994) 17. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and Practice of Background Maintenance. In: ICCV, vol. 1, pp. 255–261 (1999) 18. Veeraraghavan, H., Masoud, O., Papanikolopoulos, N.P.: Computer Vision Algorithms for Intersection Monitoring. IEEE Transactions on ITS 4(2), 78–89 (2003) 19. Zhang, D., Gatica-Perez, D., Bengio, S., McCowan, I.: Semi-Supervised Adapted HMMs for Unusual Event Detection. In: CVPR, vol. I, pp. 611–618 (2005) 20. Zhong, H., Shi, J., Visontai, M.: Detecting Unusual Activity in Video. In: CVPR vol. II, pp. 819–826 (2004)
Realistic Facial Modeling and Animation Based on High Resolution Capture Hae Won Byun School of Media & Information, Sung Shin Woman University, 169-1 Dongsun-dong 2, Sungbuk-gu, Seoul, Republic of Korea
[email protected]
Abstract. Real-time facial expression capture is an essential part for on-line performance animation. For efficiency and robustness, special devices such as head-mounted cameras and face-attached markers have been used. However, these devices can possibly cause some discomfort that may hinder a face puppeteer from performing natural facial expressions. In this paper, we propose a comprehensive solution for real-time facial expression capture without any of such devices. Our basic idea is first to capture the 2D facial features and 3D head motion exploiting anthropometric knowledge and then to capture their time-varying 3D positions only due to facial expression. We adopt a Kalman filter to track the 3D features guided by their captured 2D positions while correcting their drift due to 3D head motion as well as removing noises. Keywords: Performance-based animation, character animation, facial expression capture, real-time facial feature tracking.
1
Introduction
On-line performance-driven facial animation is a key technique for virtual character animation in broadcasting and computer games. For these applications, it is required to capture facial expressions in real time. For a live performer to feel comfortable in making expressions, it is also desirable, if not required, to avoid any devices such as a head-mounted camera and face-attached markers. These constraints on facial expression capture impose additional difficulties: The performer naturally moves his/her head to express emotions while making facial expressions according to an animation script. Without a head-mounted camera, one needs to track the position and orientation of the performer's head for more accurate facial expression capture. Moreover, without any markers attached to the face, extra effort is needed to track the features of the face that characterize facial expressions. The final difficulty comes from the real-time constraint, that is, to capture facial expressions in real time while addressing the former two difficulties. In this paper, we propose a comprehensive solution for real-time facial expression capture from a stream of images that is given one-by-one in an on-line manner from a single camera. We make a mild assumption that a facial expression performer can move the head as long as all facial features are observable
from the camera. The 3D positions of facial features are not only affected by facial expression change but also by 3D head motion. Our objective is to extract the time-varying 3D feature positions only due to the expression change. As depicted in Figure 1, our solution consists of three major steps: 2D feature
Fig. 1. Overall structure of our expression capture scheme: the input image passes through 2D feature tracking (color space transformation, blob construction, feature curve extraction), 3D head motion estimation, and 3D feature tracking and noise filtering to yield the facial features
tracking, 3D head motion estimation, and 3D feature tracking and noise filtering. These steps are executed in sequence for each input image. In the first step, we extract the 2D feature curves that best fit the contours of facial features, exploiting anthropometric knowledge. Those curves characterize the 2D facial features on images. We also extract six expression-invariant points such as four corner points of eyes and a pair of nostril centers. In the next step, we first compute the 3D positions of those six points. Then, assuming a camera of known parameters with its position and orientation fixed, we obtain 3D head motion efficiently guided by those 3D expression-invariant points while exploiting their redundancy for robustness. The remainder of the paper is organized as follows: We provide related work in Section 2. In Section 3, we describe the first step in detail, that is, how to extract 2D facial features. Sections 4 cover the second and third steps, that is, how to estimate 3D head motion. Section 5 demonstrates our experimental results. Finally, we conclude the paper and discuss future work in Section 6.
2
Related Work
There are rich results on facial expression capture. We specifically refer to those that are directly related to our work. Williams [14] proposed an approach to capture the facial features with markers attached to the feature points on the face of a live performer. Terzopoulos and Waters [11] adopted an active contour model called “snakes”, presented by Kass et al. [7], to track the outlines of facial features highlighted with special makeup. Cao et al. [15] extracted facial features directly from an input image without any markers. Instead of finding the outlines of the facial features, Huang et al. [5], Wang et al. [13], and DeCarlo et al. [3] proposed 3D model-based approaches for tracking facial features. They used the optical flow field to displace the vertices of 3D models. Head tracking consists in estimating the 3D head orientation relative to the camera plane. Cascia et al. [2] adopted iterative schemes for adjusting the posture of
a pre-defined 3D face model until it has the same orientation as the face in the input image. Jebara and Pentland [6] presented a real-time face tracking system. The head pose is acquired from an extended Kalman filter together with a parametrized model of facial structure. To estimate the head position and orientation, Yang et al. [16] utilized the invariance condition among some fixed points, including four eye corners and the tip of the nose, together with anthropometric statistics. Oliver et al. [8] adopted a Kalman filter for 2D face feature tracking. However, there has been little work on tracking 3D facial features from images using a Kalman filter. To reconstruct the 3D motion of an object from a sequence of images, Kalman filters have often been employed. Ström et al. [10] introduced an extended Kalman filter to estimate both the structure of a moving object and its kinematic parameters such as position and velocity.
3
2D Feature Tracking
In this section, we describe how to track the facial features in real time without any devices. We assume a stream of images is captured from a single camera of known parameters located at a given position with a fixed orientation. As shown in Figure 1, feature tracking consists of three major tasks: color space transformation, blob construction, and feature curve extraction. 3.1
Color Space Transformation
For robust feature extraction, we transform the color space of the input image from the RGB model to a model in which the facial features such as eyelashes, eyebrows and lips are significantly distinguishable from their background, that is, the skin. To enhance the facial features, we design a new color transformation function from RGB values to gray-scale values (see Figure 2). We conceive that
Fig. 2. Proposed color transformation: (a)An original image (b)color transformation
the skin has low values of the magenta (M) and black (K) channels in the CMYK color model. A low intensity (V) value of the HSV color model is observed for the pixels in dark features such as eyebrows, eyelashes, and nostrils. Moreover, the portion of the hue (H) band occupied by the color of lips is fairly different from that of the skin. Therefore, we use those four components to emphasize the features in an image. With our transformation function, the intensity I(u, v) of a pixel (u, v) is defined as follows: I(u, v) = w1 M (u, v) + w2 K(u, v) + w3 V (u, v) + w4 G(H(u, v)).
(1)
Fig. 3. Two candidates and offset curves: (a)Rectangles containing blobs (b)Two candidate contours of the upper lip (c)Offset curves
Here, G is a function which has high values over a range of hue values similar to those of lips, and very low values otherwise. The weights wi , 1 ≤ i ≤ 4 are empirically tuned for both lighting condition change and skin color variation. Here, w3 is negative while the others are positive, since pixels in features have lower V values compared to those in the skin. We may further emphasize the features by a contrast enhancement function C.
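A small sketch of how a transformation of the form of Eq. (1) could be implemented with OpenCV and NumPy is given below. The weights, the hue window used for G(H), and the final normalization are illustrative assumptions of this sketch, not the values used by the authors.

```python
import numpy as np
import cv2

def feature_emphasis(image_bgr, w=(0.4, 0.2, -0.3, 0.5), lip_hue=(160, 180)):
    """Gray-scale transformation of Eq. (1): combines the M and K channels (CMYK),
    the V channel (HSV), and a hue-band response G(H) that is high for lip-like hues.
    Note that the V weight is negative, as in the paper."""
    bgr = image_bgr.astype(np.float32) / 255.0
    b, g, r = cv2.split(bgr)
    k = 1.0 - np.maximum.reduce([r, g, b])                           # K channel of CMYK
    m = np.where(k < 1.0, (1.0 - g - k) / (1.0 - k + 1e-6), 0.0)     # M channel of CMYK
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    h, _, v = cv2.split(hsv.astype(np.float32))
    v = v / 255.0
    g_h = ((h >= lip_hue[0]) & (h <= lip_hue[1])).astype(np.float32) # crude G(H) indicator
    out = w[0] * m + w[1] * k + w[2] * v + w[3] * g_h
    return cv2.normalize(out, None, 0.0, 1.0, cv2.NORM_MINMAX)
```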
3.2
Blob Construction
A blob is said to be a set of connected pixels in an image that share similar visual properties such as color and intensity. Facial features are normally projected onto the image as distinct blobs. By constructing those blobs properly, we can estimate the facial features from the image at each frame. In order to accelerate blob construction, we confine a blob in a rectangle using anthropometric knowledge such as the relative positions of facial features and their size as shown in Figure 3(a). A similar idea is used in Thalmann et al.[12]. Given the rectangle containing a feature, we employ a blob growing algorithm in [1, 8] to construct the blob. 3.3
Feature Curve Extraction
In order to extract the outlines of features, we employ snakes as proposed by Kass et al.[7]. Snakes are energy-minimizing spline curves under the influence of three (possibly conflicting) forces: internal force, image force, and constraint force. Due to its high degrees of freedom, the snake may snap to unwanted boundaries. To avoid this problem, we remove the internal force from our formulation. Instead, we employ cubic Bezier curves with a small number of control points to represent snakes. The outlines of facial features are so simple that they can be well represented by such curves. Moreover, the strain energy minimization property of the splines guarantees their smoothness. This simplification increases time efficiency and robustness while sacrificing some flexibility that is not necessarily required for our purpose. The energy function of our contour model consists of two terms: E(v) = ∫_0^1 [Eimage (v(s)) + Econ (v(s))] ds. Here, Eimage and Econ are respectively the energies due to the image force and the constraint force, and v(s) is a 2D cubic Bezier curve representing the contour of the feature. The energy Eimage is an edge detecting function[7], Eimage (v(s)) = −w1 |∇I(u, v)|2 . Here, w1 is a constant weight value, and ∇I(u, v) is the gradient at a point (u, v) on v(s), that is,
∇I(u, v) = (∂I(u, v)/∂u, ∂I(u, v)/∂v), and I(u, v) is obtained from Equation (1). This energy function makes the curve v be attracted to the contour of a blob with large image gradients, or the outline of a feature. However, using only image gradients may cause an unwanted result. For example, as shown in Figure 3(b), we cannot discriminate the upper curve (A) and lower curve (B) with image gradients alone. We resolve this problem by employing the constraint energy together with simple upper and lower offset curves as illustrated in Figure 3(c). Suppose that we want to extract the upper curve (A). An offset curve of a feature curve v(s) is said to be its inner curve vin (s) if it is supposed to lie in the corresponding feature. Otherwise, it is said to be its outer curve vout (s) (see v1 in Figure 3(c)). Let I(vout (s)) and I(vin (s)) be the intensity of vout (s) and that of vin (s), respectively. Because of the color transformation in Section 3.1, a point in a feature region has a high intensity value, and that in the skin has a low value. Given I(vout (s)) and I(vin (s)), the constraint energy of the feature curve v(s) is defined as:
Econ (v(s)) = wout I(vout (s)) − win I(vin (s)),
(2)
where wout and win are positive constants. As illustrated in Figure 3(c), with win sufficiently greater than wout , Econ is positive for a curve (v2 in the figure) that is not properly located, but negative for a properly located one (v1 ).
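The following sketch illustrates how the energy of one candidate cubic Bezier contour could be evaluated from the image term and the constraint energy of Eq. (2). The sampling density, the constant vertical offset used to build vin and vout, and the weights are assumptions of this sketch, not the authors' settings.

```python
import numpy as np

def cubic_bezier(ctrl, ts):
    """Evaluate a 2D cubic Bezier curve; ctrl is (4, 2), ts is (N,). Returns (N, 2)."""
    c = np.asarray(ctrl, dtype=float)
    t = ts[:, None]
    return ((1 - t) ** 3 * c[0] + 3 * (1 - t) ** 2 * t * c[1]
            + 3 * (1 - t) * t ** 2 * c[2] + t ** 3 * c[3])

def curve_energy(ctrl, intensity, grad_mag2, offset=3.0,
                 w1=1.0, w_out=0.3, w_in=1.0, n_samples=50):
    """Image energy (negative squared gradient magnitude along the curve) plus the
    constraint energy of Eq. (2) evaluated on simple vertical offset curves.
    `intensity` is the transformed image of Eq. (1), `grad_mag2` its |grad I|^2."""
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = cubic_bezier(ctrl, ts)
    up = pts + np.array([0.0, -offset])    # outer offset curve (row index decreases upward)
    down = pts + np.array([0.0, offset])   # inner offset curve (inside the dark feature)

    def sample(img, p):
        x = np.clip(np.round(p[:, 0]).astype(int), 0, img.shape[1] - 1)
        y = np.clip(np.round(p[:, 1]).astype(int), 0, img.shape[0] - 1)
        return img[y, x]

    e_image = -w1 * sample(grad_mag2, pts)
    e_con = w_out * sample(intensity, up) - w_in * sample(intensity, down)
    return float(np.mean(e_image + e_con))
```

Among a set of candidate control polygons, the one with the lowest energy would then be retained as the feature curve.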
4
3D Head Motion Estimation
In this section, we present a real-time algorithm for estimating 3D head motion. In particular, we exploit six expression-invariant feature points including four corner points of eyes and a pair of nostril centers to estimate the head position and orientation at each frame. These expression-invariant points are almost coplanar. Moreover, four eye corner points are almost collinear, and the line containing them is almost parallel with that containing a pair of nostril centers. In order to form non-collinear triples, we choose either two points from the former and one from the latter or one from the former and two from the latter to make a total of sixteen triples. 4.1
Tracking Expression-Invariant Points
To obtain the 3D position p of an expression-invariant point, let us observe the relation between the position p of an expression-invariant point and its projection pI onto the image plane, pI = MP · MH (t)p. Here, the matrices MP and MH (t) are, respectively, the projection matrix and the transformation matrix representing head motion. Let pi = (xi yi zi ), i = 1, 2, 3 be any non-collinear triple of the expression-invariant points and (ui vi ), i = 1, 2, 3 be their projections on the image plane. Since those points are expression-invariant, the distances between three pairs of them are preserved to yield three quadratic equations:
(x1 − x2)² + (y1 − y2)² + (z1 − z2)² = l1²,
(x2 − x3)² + (y2 − y3)² + (z2 − z3)² = l2², and
(x3 − x1)² + (y3 − y1)² + (z3 − z1)² = l3²,
(3)
l32 ,
where li , 1 ≤ i ≤ 3 is the initial distance between each pair of 3D points. Using the relation between a 3D point (x y z) and its 2D projection (u v 0), we can compute unknowns z1 , z2 and z3 . In worst cases, there exist four solutions which satisfy the expression-invariance condition[4]. We exploit temporal coherency supplemented by the characteristics of head motion to choose a solution. Let (p1 , p2 , p3 ) be the triangle of which corner points are the points, p1 , p2 , and p3 . Since those points are invariant of facial expressions, they experience a rigid motion. Assuming such a motion, we establish a measure to evaluate the closeness of a solution to the previous solution. For the j-th solution at the i-th frame, the measure Eij consists of two terms: P v Eij = Eij + Eij ,
(4)
P v where Eij and Eij reflect the changes of position and orientation of (p1 , p2 , p3 ) and those of its linear and angular velocities, respectively. The motion of (p1 , p2 , p3 ) consists of two components: translation and rotation. The translational motion is represented by a vector from the center of (p1 , p2 , p3 ) at the initial frame to that at the current frame. The rotational motion again consists of two components: One is the rotation about the axis perpendicular to both normal vectors of the triangles at the initial and current frames and the other is that about the normal vector itself at the current frame. The former rotation q1 is represented by a unit quaternion:
q1 = e
θ1 nij ×n0 2 ||nij ×n0 ||
,
(5)
where nij and n0 are the normal vector of the j-th solution triangle at the i-th frame and that at the initial frame, respectively. θ1 is the angle between them, that is, θ1 = sin−1 ||nij × n0 ||. Similarly, the latter rotation q2 is q2 = e
lij ×l0 θ2 2 ||lij ×l0 ||
,
(6)
where lij and l0 are respectively the unit direction vector of an edge of the j-th solution triangle at the i-th frame and that at the initial frame rotated by q1 , and θ2 is the angle between them, that is θ2 = sin−1 ||lij × l0 ||. The rotation q of the triangle is the quaternion product of q1 and q2 , that is, q = q2 q1 . p Eij is defined as a weighted sum of the position and orientation changes: p Eij = ||pij − pi−1 || + α|| ln(q−1 i−1 qij )||,
(7)
where pij and qij are the position of the j-th solution at the i-th frame and its orientation, respectively. pi−1 and qi−1 are those at the previous frame. When the head moves quickly, this measure becomes large even for a proper solution.
Therefore, we supplement the measure with another due to linear and angular velocity changes: v Eij = β||vij − vi−1 || + γ||ωij − ωi−1 ||,
4.2
(8)
Combining Solutions
From Section 4.1, we have obtained sixteen configurations of triangles, each of which is formed with a non-colinear triple of six expression-invariance points on the face. Since those points are almost coplanar, we assume that each triangle configuration give the posture of the head, that is the head position and orientation. Treating those sixteen head postures as sampled data, we estimate the true head posture. We use an M-estimator[9] to combine the translation components of the head ˜ minimizes the residual error of translation data, posture. Our estimator p Lp =
16
˜ ||) , ρσi (||pi − p
(9)
i=1
where pi , 1 ≤ i ≤ 16 are sample values. Due to its insensitivity to outliers, we take Lorentzian error distribution function as the objective function ρσi . ˜ for rotation data so that it minimizes Similarly, we define the estimator q residual error, that is defined as Lq =
n
˜ ). ρσi (qi ⊗ q
(10)
i=1
Here, ⊗ : S 3 × S 3 → R is an operator which yields the distance between two rotations, that is, q1 ⊗ q2 = 1 − (q1 · q2 )2 . θ is the angle between two rotations q1 and q2 . The function, 1 − cos2 x is a similar local shape as Lorentzian error distribution function, log 1 − 12 σxi . Thus, we employ an identity function as the objective function, that is, ρσi (x) = x.
5
Experimental Results
To evaluate effectiveness and performance of the proposed method, we performed experiments on a PC with Pentium III 800 Mhz CPU and 512 MB memory. Face images were captured with a single digital camera and sent to the PC through a video capture board at 30 frames per second. To illuminate puppeteer’s face, we used two desktop lamps each of which has a single 13W bulb. As shown in Figure 4, neither any markers were attached to performer’s face nor any headmounted camera was employed. The head was allowed to move and rotate during facial expression capture. Figures 4(m) and 4(p) show the captured face images of three puppeteers. Face images after color space transformation are given in Figures 4(n) and 4(q).
Fig. 4. Original images, color-transformed images, and extracted curves
Fig. 5. Head Tracking
From most face images, we can observe that the intensity values of pixels in the region of the skin are so different from those of the facial features. Indeed, we were able to extract the facial features robustly from the transformed images. Figures 4(o) and 4(r) exhibits the feature curves extracted from the images. The second row of Figure 5 depicts head motion estimation result. In order to visualize head motion estimation data, we drew the plane on the face of which normal vector is the same as that of the frontal face. The plane also represents the 3D translation and rotation of the head. As drawn in the last row of Figure 5, we compensated the captured feature curves for the error due to the head motion to finally obtain the correct facial expression features. Our method for facial expression capture can process more than 100 frames per second to exhibit a sufficient efficiency for real-time on-line performance-driven animation.
6
Conclusions
In this paper, we propose a comprehensive solution for real-time facial expression capture. We assume a stream of images is captured, in an on-line manner, from a single camera of known parameters located at a given position with a fixed orientation. Our solution consists of three major components: 2D feature tracking, head motion estimation, and 3D feature tracking and noise filtering. The first component is for extracting the feature curves representing the outlines of facial features. The second component is for estimating the 3D motion of the head, that is, the translation and rotation of the head. Finally, in the last component, we adopt a Kalman filter to correct the error due to the head motion as well as to remove noise, and sample the facial feature points. Experimental results demonstrate that our solution extracts the facial features efficient enough for real-time applications such as on-line performance-driven animation. In future, we plan to capture, in real-time, more detail facial features such as winkles on the forehead caused by local deformation.
Acknowledgement This work was supported by the Korea Research Foundation Grant funded by Korea Government(MOEHRD, KRF-2005-204-D00033)
References 1. Basu, S., Oliver, N., Pentland, A.: 3D modeling and tracking of human lip motions. In: Proceedings of ICCV 98 (1998) 2. Cascia, M.L., Sclaroff, S.: Fast, reliable head tracking under varying illumination: An approach based on registration of texture mapped 3d models. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(4), 322–336 (2000) 3. DeCarlo, D., Metaxas, D.: Optical flow constraints on deformable models with applications to face tracking. International Journal of Computer Vision 38(2), 99– 127 (2000) 4. Huang, T.S., Netravali, A.N.: Motion and structure from feature correspondences: A review. Proceedings of the IEEE 82(2), 252–268 (1994) 5. Huang, X., Zhang, S., Wang, Y., Metaxas, D., Samaras, D.: A hierarchical framework for high resolution facial expression tracking. In: The Third IEEE Workshop on Articulated and Nonrigid Motion, CVPR’04, IEEE Computer Society Press, Los Alamitos (2004) 6. Jebara, T., Azarbayejani, A., Pentland, A.: 3d structure from 2d motion. IEEE Signal Processing Magazine, 3DandStereoscopicVisualCommunication 16(3) (1999) 7. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1987) 8. Oliver, N., Pentland, A., Berard, F.: Lafter: Lips and face tracking. In: Computer Vision and Pattern Recognition ’97 (1997) 9. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C, 2nd edn. Cambridge University Press, Cambridge (1996)
10. Str¨ om, J., Jebara, T., Basu, S., Pentland, A.: Real time tracking and modeling of faces: An ekf-based analysis by synthesis approach. In: Modeling People Workshop at ICCV’99 (1999) 11. Terzopoulos, D., Waters, K.: Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions of Pattern Analysis and Machine Intelligence 15(6), 569–579 (1993) 12. Thalmann, N.M., Pandzic, I., Kalra, P.: Interactive facial animation and communication. In: Tutorial of Computer Graphics International ’96, pp. 117–130 (1996) 13. Wang, Q., Ai, H., Xu, G.: 3D model based expression tracking in intrinsic expression space. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 487–497. IEEE, Los Alamitos (2004) 14. Williams, L.: Performance-driven facial animation. In: Proceedings of ACM SIGGRAPH Conference, pp. 235–242. ACM Press, New York (1990) 15. Xiang, C., Baining, G.: Real-time tracking and imitation of facial expression. In: Proc. SPIE, Second International Conference on Image and Graphics, vol. 4875, pp. 910–918 (2002) 16. Yang, T.-J., Wu, F.-C., Ouhyoung, M.: Real-time 3-d head motion estimation in facial imagecoding. In: Proceedings of Multimedia Modeling 98, pp. 50–51 (1998)
Descriptor-Free Smooth Feature-Point Matching for Images Separated by Small/Mid Baselines Ping Li1 , Dirk Farin1 , Rene Klein Gunnewiek2 , and Peter H.N. de With3 1
Eindhoven University of Technology {p.li,d.s.farin}@tue.nl 2 Philips Research Eindhoven
[email protected] 3 LogicaCMG Netherlands B.V.
[email protected]
Abstract. Most existing feature-point matching algorithms rely on photometric region descriptors to distinct and match feature points in two images. In this paper, we propose an efficient feature-point matching algorithm for finding point correspondences between two uncalibrated images separated by small or mid camera baselines. The proposed algorithm does not rely on photometric descriptors for matching. Instead, only the motion smoothness constraint is used, which states that the correspondence vectors within a small neighborhood usually have similar directions and magnitudes. The correspondences of feature points in a neighborhood are collectively determined in such a way that the smoothness of the local correspondence field is maximized. The smoothness constraint is self-contained in the correspondence field and is robust to the camera motion, scene structure, illumination, etc. This makes the entire point-matching process texture-independent, descriptor-free and robust. The experimental results show that the proposed method performs much better than the intensity-based block-matching technique, even when the image contrast varies clearly across images.
1
Introduction
Tracking feature points along frames of a video sequence is useful in many applications such as image segmentation, structure reconstruction, depth creation for 3D-TV, object recognition, etc. The key step of feature-point tracking is to establish feature-point correspondences between two successive frames, which can be further divided into two sub-steps. First, detecting feature points/regions in two individual images. Second, establishing correspondences between the detected feature points. This paper focuses on the second step, with the assumption that the feature points are already detected in two images using the well-known Harris corner detector [1]. 1.1
Related Work
Many feature-point matching algorithms have been proposed, and many of them are based on photometric descriptors to characterize and distinguish the local
image regions. Local image regions can be described by the histogram of the pixel intensity, distribution of the intensity gradients [2], composition of the spatial frequencies, image derivatives [3,4], generalized moments [5], or other image properties. Two feature points are matched if their corresponding descriptors show high similarity. An evaluation of the state-of-the-art interest point detectors and region descriptors can be found in [6] and [7]. In the following, we summarizes some of the well-known schemes that fall into this category. Lowe [2] proposed a Scale-Invariant Feature Transform (SIFT) algorithm for feature-point matching or object recognition, which combines a scale-invariant region detector and a gradient-distribution-based descriptor. The descriptor is represented by a 128-dimensional vector that captures the distribution of the gradient directions (sub-sampled into 8 orientations and weighted by gradient magnitudes) in 16 location grids. The Gradient Location and Orientation Histogram (GLOH) algorithm proposed by K. Mikolajczyk and C. Schmid [7] extends the SIFT to consider more regions for computing the histogram, and was shown to outperform the SIFT. Recently, Herbert Bay et al. proposed a new rotation- and scale-invariant interest point detector and descriptor, called SURF (Speeded Up Robust Features) [8]. It is based on sums of 2D Haar wavelet responses and makes an efficient use of integral images. The algorithm was shown to have comparable or better performance, while obtaining a much faster execution than previously proposed schemes. Another category of feature-point matching algorithms do not use region descriptors. In [9], a feature-point matching algorithm is proposed using the combination of the intensity-similarity constraint and geometric-similarity constraint. Feature correspondences are first detected using the correlation-based matching technique. The outliers are thereafter rejected by a few subsequent heuristic tests involving geometry, rigidity, and disparity. In [10], a point-matching method is proposed to globally match the feature points. The algorithm relaxes the huge combinatorial search domain into its convex-hull, which can be efficiently solved by concave programming. Any assumption can be used by the proposed method as a matching criterion, provided that the assumption can be translated into cost functions with continuous second derivatives. Intensity correlation has been demonstrated as a good criterion. For feature-point tracking in a video sequence, the variation of the camera parameters (rotation, zoom, viewpoint) is relatively small. The correlation-based block matching technique is often used because of its computational efficiency. In this method, the similarity between two image patches in windows around two feature points is measured by aggregating measurements such as intensity, color, phase, etc., over the window. Two feature points are matched if the measurements show high correlation. The descriptor-based algorithms are more suitable for matching feature points between two widely separated views or object recognition. The high computational complexity of the high-dimension1 descriptors makes these algorithms less efficient in this context. On the other hand, the 1
To describe the local regions properly, the descriptors normally require dozens or even hundreds of dimensions [7].
block-matching algorithm is less robust due to the fact that only the local intensity similarity is used for point matching. Geometric similarity2 and intensity similarity are the two underlying principles of most feature-matching algorithms. Though both are widely used, it appears that the geometric similarity is more fundamental and stable than intensity similarity since intensities are more liable to change [9]. It is favorable to establish the feature correspondences using the geometric similarity alone. 1.2
Our Approach
Our approach concentrates on both the computational efficiency and robustness of feature-point matching algorithm, as well as the fundamental nature of the geometric similarity. Therefore, this paper proposes an efficient and robust point-matching algorithm that uses only the smoothness constraint, targeting at feature-point tracking along successive frames of uncalibrated video sequences. In the proposed algorithm, the collected correspondences of feature points within a neighborhood are efficiently determined such that the smoothness of the correspondence field is maximized. Intensity information is not required for the matching. It is pursued that the proposed algorithm works well even when there is significant change of image contrast. Besides, due to the robustness and wide applicability of the smoothness constraint, the proposed algorithm works well even when the camera is subject to a moderate change of its parameters. Further, the proposed algorithm is also computationally efficient. As will be discussed in Section 3.1, the smoothness of the correspondence field is efficiently computed using a very simple metric. Our experimental results on both synthetic and real images show that the proposed algorithm is able to detect a much higher number of feature-point correspondences with a higher quality than the correlation-based block-matching technique. Because correspondences of feature points within a neighborhood are collectively determined, the chance is lower for the erroneous two-frame correspondences to propagate among several frames. This increases the robustness of the feature-point tracking in video sequences.
2
Notations
Let I = {I1 , I2 , · · · , IM } and J = {J1 , J2 , · · · , JN } be two sets of feature points in two related images, containing M and N feature points, respectively. For any point Ii , we want to find its corresponding feature point Jj from its candidate set CIi , which, as shown in Fig. 1(b), is defined as all the points within a co-located rectangle in the second image. The dimension of the rectangle and density of the feature points determine the number of the points in the set. 2
We consider the smoothness assumption related to the geometric constraint, because it is the rigidity of the scene geometry that gives the motion smoothness in the image. For example, a group of points on the surface of a rigid object usually move in similar direction and speed. This leads to smooth image motion.
Fig. 1. The set of feature points in neighborhood NIi in the first image and the set of candidate corresponding feature points CIi in the second image for feature point Ii
As illustrated by Fig. 1, the neighborhood NIi of feature point Ii is defined as a circular area around the point. The number of points within NIi depends on the radius of the circle and the density of the feature points. The displacement between Ii and Jj is represented by its Correspondence Vector (CV) v Ii . The candidate set CIi for Ii gives rise to a corresponding set of candidate correspondence vectors VIi . Determining the correspondence for Ii is equivalent to finding the corresponding point from CIi or finding the corresponding CV from VIi .
3
Matching Algorithm
We assume that correspondence vectors within a small neighborhood have similar directions and magnitudes, which is referred to as local-translational-motion (LTM) assumption in the remainder of the paper. CVs that satisfy this constraint are called coherent. In this section, the LTM assumption is translated into a coherence criterion for feature matching. 3.1
Coherence Metric
Given two coherent CVs v i and v j , we require that both the difference dij between their magnitudes, and the angle deviation θij between their directions, should be small, as illustrated in Fig. 2. Combining these two requirements, we obtain the following coherence metric: dij < ||v i || × sin(ϕ) = R,
(1)
where ϕ is the maximum allowed angle deviation between two CVs within a neighborhood, and R is a threshold based on the magnitude of the reference CV and ϕ, as illustrated in Fig. 2. The allowed degree of deviation ϕ specifies how similar two CVs should be in order to satisfy the coherence criterion. Difference dij is computed as: dij = |v i − v j | = |xvi − xv j | + |yvi − yvj |.
(2)
Descriptor-Free Smooth Feature-Point Matching
431
Fig. 2. Two coherent CVs v i and v j within a neighborhood; vector v i is the reference CV
Note that the smoothness assumption is more general than the LTM assumption, because it includes not only the translational motion but also other smooth motions caused by rotation, scaling, slanting depth, etc. The reason why our algorithm, which is based on the LTM assumption, works well for a wide range of scenarios (including images with evident rotation and scaling) is that the local correspondence field within a small neighborhood in most cases follows the translational-motion model well, regardless of the actual camera motion and scene structure. 3.2
Smoothness Computation
Given a reference CV v Ii ∈ VIi , the smoothness of the correspondence field with respect to the reference vector within neighborhood NIi is measured as the ratio between the number of coherent CVs found in NIi and the number of the feature points in NIi . This ratio is denoted by S(NIi , v Ii ) and can be computed by: Ik ∈NIi fIk (v Ii ) S(NIi , v Ii ) = , (3) n where n is the number of feature points in NIi ; fIk (v Ii ) is a binary variable, indicating whether the most similar CV (smallest distance by Eq. (2)) of feature point Ik is coherent with the reference vector, which can be computed by: 1 dik < R fIk (v Ii ) = (4) 0 else As stated by the smoothness assumption, the correspondence field within a neighborhood is smooth. This implies that S(NIi , v Ii ) should be as high as possible to have a smooth field. We compute S(NIi , v Ii ) for every v Ii ∈ VIi . The maximum is considered as the smoothness of the field, and is computed by: Sm (NIi ) = max S(NIi , v Ii ). v Ii ∈VIi
(5)
With the above equation, the problem to determine the correspondences for feature points within NIi is converted into selecting a CV v Ik ∈ VIk for every Ik ∈ NIi to have a maximum smoothness Sm (NIi ) of the correspondence field.
432
P. Li et al.
True correspondences are found once we find that Sm (NIi ) is larger than a given threshold. Note that once the vector v Ii for IIi is selected, vector v Ik for Ik ∈ VIk is determined as well. 3.3
Steps to Compute Correspondences for Feature Points Within a Neighborhood
We summarize the steps to compute the correspondences for feature points within neighborhood NIi as follows: S1 Given a reference CV v Ii ∈ VIi , for every Ik ∈ NIi (k = 1, · · · , n), find its most similar CV from VIk so that the distance dik by Eq. (2) is minimum. S2 Set the indicator variable fIk (v Ii ) according to Eq. (4); compute the smoothness S(NIi , v Ii ) of the correspondence field using Eq. (3). S3 Compute the maximum smoothness Sm (NIi ) using Eq. (5); true correspondences are found if Sm (NIi ) is higher than a given threshold. 3.4
Rationale of the Algorithm
The algorithm tries to find the CV that gives the maximum number of coherent CVs in a neighborhood. In this subsection, we explain why this maximum smoothness gives the correct correspondences with a high probability. As explained in Section 3.1, the correspondence field within a neighborhood in most cases follows the LTM model well. Thus, we can expect that the smoothness with respect to the true CV is approximate to the repetition ratio of the feature points within the neighborhood3 . That means, in the direction of the true CV, the smoothness is close to the repetition ratio. Due to the random pattern of the texture, along other candidate CVs from VIi , feature points appear randomly. The probability to find another set of coherent CVs that gives higher smoothness is thus low. Summarizing, the highest smoothness can be found, in most cases, only along the true CV. Once the highest smoothness (higher than a certain threshold) is detected, the true correspondences are found.
4
Experimental Results
The proposed algorithm is applied to both synthetic and real images for performance evaluation. To evaluate the quality of the detected correspondences, either the homography or the fundamental matrix is computed using RANSAC [11]. All correspondences that are inline to the homography or fundamental matrix are considered correct. We consider that a correspondence conforms to the homography or the fundamental matrix if the residual error dr is smaller than one pixel, which is computed by: 3
The repetition ratio of feature points within a neighborhood is defined as ratio between the number of true point-to-point correspondences and the number of feature points in the neighborhood.
Descriptor-Free Smooth Feature-Point Matching
dr =
[d(x , F x) + d(x, F T x )]/2, [d(x , Hx) + d(x, H −1 x )]/2,
given F given H.
433
(6)
Where, F is the fundamental matrix; H is the homography; (x, x ) is a pair of matched points; d(., .) is the geometric distance between the point and the epipolar line given the F , or the euclidian distance between the two points given the H. The number and percentage of the correct matches are thereafter computed. 4.1
Experiments on Synthetic Images
First, we generate an 800×600 image with 1, 000 randomly-distributed feature points. Second, the 1, 000 feature points are rotated and translated with controlled rotation or translation parameters to generate the second image. Third, an equal number of randomly-distributed outliers are injected into both images to generate two corrupted images. The proposed algorithm is then applied to those two corrupted images to detect feature correspondences. The homography is computed using the RANSAC to evaluate the detected correspondences.
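A short sketch of how such a synthetic pair could be generated is given below. The rotation about the image centre, the way the injected outliers are appended, and the random seed are assumptions of this sketch based on our reading of the protocol.

```python
import numpy as np

def make_synthetic_pair(n_points=1000, image_size=(800, 600), rot_deg=4.0,
                        t=(5.0, 10.0), outlier_ratio=0.5, seed=0):
    """Generate two corrupted synthetic point sets: random feature points, a rotated and
    translated copy, plus an equal percentage of random outliers injected in each image."""
    rng = np.random.default_rng(seed)
    pts1 = rng.uniform([0, 0], image_size, size=(n_points, 2))
    c = np.array(image_size) / 2.0
    a = np.radians(rot_deg)
    rot = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    pts2 = (pts1 - c) @ rot.T + c + np.asarray(t)   # rotate about the image centre, then translate
    n_out = int(outlier_ratio * n_points)
    out1 = rng.uniform([0, 0], image_size, size=(n_out, 2))
    out2 = rng.uniform([0, 0], image_size, size=(n_out, 2))
    return np.vstack([pts1, out1]), np.vstack([pts2, out2])
```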
Fig. 3. #Correct matches obtained by the proposed algorithm
Fig. 3 shows the number and Fig. 4 shows the percentage of the correct correspondences obtained under different settings of Degree of Deviation (DoD), i.e., ϕ in Eq. (1), Degree of Rotation (DoR) and Percentage of Injected Outliers (PIO). In the figures, the #Correct Matches is the number of correct matches detected; the Degree of Rotation is the angle that the image rotates around its image center, which measures how strong the image motion deviates from translation; the %Injected Outliers is the percentage of outliers injected into both images, which can be considered as either the repetition ratio of the feature points or the noise level of the image; the %Inliers is the percentage of inliers to the homography. As we see from Figs. 3 and 4, the DoR changes from 0 to 10 degrees, i.e., from pure translation to significant rotation (large deviation between two CVs). The
Fig. 4. Percentage of inliers to the homography
PIO changes from 0% to 75%, i.e., from repetition ratio of 100% (noise-free) to repetition ratio of 25% (seriously noisy). The DoD (ϕ) changes from 1o to 4o , i.e., from a small threshold to a large threshold by Eq. (2). In all experiments, the translation vector is kept constant as (Tx , Ty ) = (5, 10). Our experiments show that the magnitude of the translation has little effect on the performance. Discussion. This section investigates the effect of the rotation, noise, DoD on the performance of the proposed algorithm. From Figs. 3 and 4, we obtain the following observations: (1) The proposed algorithm is able to reliably detect the correspondences even when the image contains a large portion of injected outliers or when the image contains evident rotation. For example, when P IO = 50%, DoR = 4o , and DoD = 2o , we found 989 correct matches out of 1000 ground-truthes. Furthermore, 94.8% of the 1, 043 detected correspondences are inline to the homography. The obtained CVs are shown in Fig. 5(b), where an evident rotation is observed. (2) The performance drops when the rotation increases. As we discussed in Section 3.1, the proposed algorithm requires that the local correspondence field is more-or-less translational. With a high rotation, the deviation between two CVs is high. This may lead to a violation of the LTM assumption. Consequently, the performance of the proposed algorithm deteriorates, as can be observed from Figs. 3 and 4 when DoR increases above 5o . (3) The noise has little effect on the performance when the rotation is small, but has an evident influence on the performance when the rotation is high. The reason is that a high deviation between two CVs, caused by a high rotation, makes it easier to find a false correspondence vector that gives a smaller difference by Eq. (2), especially when there are many outliers present. (4) A large DoD is helpful when the rotation is high and the noise level is low. A high rotation means a high deviation between CVs. Increasing the DoD and thus the threshold R in Eq. (1) increases the chance for two true CVs to satisfy
(a) Corrupted first image.
(b) CVs superimposing on the uncorrupted first image.
Fig. 5. Results obtained by the proposed algorithm when DoD = 2o , DoR = 4o , and P IO = 50%
the coherence criterion. On the other hand, if the the noise level is high, a large threshold will make it easier for a false vector to satisfy the coherence criterion. This degrades the performance of the proposed algorithm. 4.2
Experiments on Real Images
We have applied the proposed algorithm to many image pairs from the medusa and castle sequences, which are used by [12] for structure reconstruction. We have also applied the algorithm to many self-recorded images. Since all experiments show similar results, only the results for two image pairs are presented in this section. The first pair (IP1) shows a small contrast change and the second pair (IP2) contains a large contrast change. The fundamental matrix is computed using detected correspondences to evaluate the performance. The homography is not applicable in this case. The results are then compared with those computed by the Block-Matching (BM) method. The proposed algorithm is referred to as Texture-Independent Featuring Matching (TIFM) in the following discussion. The first row of Fig. 6 shows the correspondences obtained using the BM on IP1. By comparing Fig. 6(a) with Fig. 6(b), we see many spurious correspondences are detected by the BM. Table 1 shows the results obtained by the BM and the TIFM on IP1 and IP2. In the table, OutOfDetcd means the percentage of the feature correspondences that conform to the epipolar geometry; OutOfTotal means the percentage of the feature points for which the correct correspondences are found. As we see from Table 1, for the BM-IP1, among the 1,332 correspondences detected out of 3,292 feature points, only 53% are found conforming to the epipolar geometry. Thus, we detect nearly4 21% (1, 332/3, 292 × 53%) correct correspondences out of a total of 3,292 feature points. Fig. 6(c) and Fig. 6(d) portray the correspondences obtained by the TIFM on IP1 before and after outlier removal. From the figures, only few spurious correspondences are observed. As we see from Table 1, for the TIFM-IP1, among the 1,609 correspondences detected out of 3,292 feature points, 97% conform to 4
Obviously, not all correspondences that comply to the epipolar geometry are correct.
(a) BM-IP1 before outlier removal. (b) BM-IP1 after outlier removal. (c) TIFM-IP1 before outlier removal. (d) TIFM-IP1 after outlier removal. (e) BM-IP2 before outlier removal. (f) BM-IP2 after outlier removal. (g) TIFM-IP2 before outlier removal. (h) TIFM-IP2 after outlier removal.
Fig. 6. Correspondences obtained by the BM and the TIFM on IP1 and IP2; the correspondences are illustrated by the CVs superimposed on the first image of an image pair; outliers are removed using the epipolar constraint
Table 1. Results by the BM and the TIFM on IP1 and IP2

              BM-IP1   TIFM-IP1   BM-IP2   TIFM-IP2
Total fps      3,292      3,292      693        693
Detected fps   1,332      1,609      153        371
OutOfDetcd       53%        97%      54%        97%
OutOfTotal       21%        47%      12%        52%
the epipolar geometry. Thus, we detected nearly 47% correct correspondences out of 3,292 feature points. Our second experiment is on IP2. The two images were taken at the same time. However, the contrast of the two images differs significantly because the images contain different portions of the bright sky, causing different internal camera parameters. Rows three and four of Fig. 6 show the results obtained by the BM and the TIFM on IP2, respectively. From Table 1 and Fig. 6, we see that the TIFM obtains much better results than the BM. As seen from Table 1, the TIFM is robust to the change of image contrast. For IP1 showing a small contrast difference, correct correspondences are found for 47% of the total feature points. For IP2 with evident contrast change, the percentage of the correct correspondences is 52%. The percentage keeps at a constant level irrespective of the change of the contrast. In comparison, the percentage for the BM decreases from 21% for IP1 to 12% for IP2. Both are significantly lower than the percentages by the TIFM. The reasons of the contrast invariance of the TIFM are two-fold. First, the Harris corner detector is known to be robust to contrast change. Second, the TIFM does not rely on image texture for feature matching. The proposed algorithm works under the following two conditions: (1) the local correspondence field within a small neighborhood follows the LTM model (certain degree of deviation allowed), and (2) the repetition ratio of the feature points is not too low. For images separated by wide camera baselines (with significant rotation, scaling, viewpoint change), the proposed algorithm may not work, because in those cases either the repetition ratio is too low or the LTM assumption is not valid. For future work, we will look at incorporating more constraints and extending the LTM assumption to a more general smoothness assumption.
5
Conclusion
In this paper, we have proposed a novel feature-point matching algorithm that uses only a self-contained smoothness constraint. The feature-point correspondences within a neighborhood are collectively determined such that the smoothness of the correspondence field is maximized. The proposed algorithm is descriptor-free and texture-independent. The performance of the algorithm is evaluated by experiments on both synthetic and real images. The experimental
results show that the proposed method performs much better than the intensitybased block-matching technique, in terms of both the number and the percentage of the correct matches. The algorithm is able to reliably detect the feature-point correspondences for images separated by small or moderate baselines, even when the image contrast varies substantially across two images.
References 1. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. 4th Alvey Vision Conf. pp. 147–151 (1988) 2. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. of Computer Vision 60(2), 91–110 (2004) 3. Baumberg, A.: Reliable feature matching across widely separated videws. In: Proc. IEEE Comp. Vision and Pattern Recognition, vol. 1, pp. 774–781. IEEE, Los Alamitos (2000) 4. Schaffalitzky, F., Zisserman, A.: Multi-view matching for unordered image sets. In: Proc. 7th European Conf. Computer Vision, pp. 414–431 (2002) 5. Gool, L.V., Moons, T., Ungureanu, D.: Affine/photometric invariants for planar intensity patterns. In: Proc. 4th European Conf. Computer Vision, vol. I, pp. 642– 651 (1996) 6. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point dectors. Int. J. of Computer Vision 60(1), 63–86 (2004) 7. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Analysis and Machine Intelligence 27(10), 1615–1629 (2005) 8. Bay, H., Tuytelaars, T., Gool, L.V.: Surf: Speeded up robust features. In: Proc. 9th European Conf. Computer Vision (2006) 9. Hu, X., Ahuja, N.: Matching point feature with ordered geometric rigidity, and disparity constraints. IEEE Trans. Pattern Analysis and Machine Intelligence 16(10), 1041–1049 (1994) 10. Maciel, J., Costeira, J.P.: A global solution to sparse correspondence problems. IEEE Trans. Pattern Analysis and Machine Intelligence 25(2), 187–199 (2003) 11. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–393 (1981) 12. Pollefeys, M., Gool, L.V., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual modeling with a hand-held camera. Int. Journal of Computer Vision 59(3), 207–232 (2004)
A New Supervised Evaluation Criterion for Region Based Segmentation Methods Adel Hafiane, Sébastien Chabrier, Christophe Rosenberger, and Hélène Laurent Laboratoire Vision et Robotique - UPRES EA 2078 ENSI de Bourges - Université d'Orléans 88 boulevard Lahitolle, 18020 Bourges Cedex, France
[email protected]
Abstract. We present in this article a new supervised evaluation criterion that enables the quantification of the quality of region segmentation algorithms. This criterion is compared with seven well-known criteria available in this context. To that end, we test the different methods on natural images by using a subjective evaluation involving different experts from the French community in image processing. Experimental results show the benefit of this new criterion.
1
Introduction
Image segmentation is an essential step in the image processing chain because it conditions its further interpretation. However, this step remains a difficult and unsolved problem in image processing. The region-based approach [1,2,3] is particularly interesting when an image contains some textures (remote sensing applications, outdoor image processing...). This subject still remains a prolific domain if we consider the quantity of recent publications [4,5,6,7]. No one has yet completely mastered such a step in image processing. Each of the proposed methods lays the emphasis on different properties and therefore reveals itself more or less suited to a considered application. This variety often makes it difficult to evaluate the efficiency of a proposed method and places the user in a tricky position, because no method reveals itself as being optimal in all cases. That is the reason why many recent works have been devoted to the crucial problem of the evaluation of image segmentation results [8,9]. A possible solution consists in using supervised evaluation criteria, which are computed from a dissimilarity measure between a segmentation result and a ground truth of the same image. This reference can either be obtained according to an expert judgement or set during the generation of a test database in the case of synthetic images. Even if these methods inherently depend on the confidence in the ground truth, they are widely used for real applications and particularly for medical ones [10,11,12]. The work presented in this article deals with this research axis and concerns the proposal of a new supervised evaluation criterion for region based
segmentation methods. After presenting the criterion, its performance is compared to criteria from the literature. Finally, some conclusions are given.
2
A New Supervised Evaluation Criterion
To evaluate the performance of a given segmentation result, we propose to compute a new quality index which first of all consists in measuring the overlap between the segmentation result to assess and the reference, but also penalizes over- and under-segmentation. The proposed criterion has been developed in order to take into account at the same time the following principles: – localisation: the detected regions should be spatially coherent (e.g. position, shape, size...) with those present in the reference, – over-segmentation: this situation is considered as disturbing and has to be penalized in the quality index, – under-segmentation: this situation is considered as a segmentation error and has also to be penalized. Let RiRef and RjSeg be two classes belonging respectively to the reference I Ref and to the segmentation result I Seg (i = 1..N RRef , j = 1..N RSeg, where N RRef is the number of regions of the reference and N RSeg the number of regions of the segmentation result). The matching index MI is given by:
MI = Σj [ Card(Ri*Ref ∩ RjSeg) / Card(Ri*Ref ∪ RjSeg) ] ρj , with i* = arg maxi Card(RiRef ∩ RjSeg),
(1)
ρj
(1)
where Card(X) is the number of pixels of X. The value ρ_j expresses the importance of region j in the image and gives small regions less influence in the quality measure:

\rho_j = \frac{Card(R_j^{Seg})}{Card(I^{Seg})}    (2)
Equation (1) expresses a morphological relation between two regions. Each region of a segmentation result is compared with the corresponding one in the reference by taking into account the largest overlapping surface. For instance, if two regions of I^{Seg} intersect a region of I^{Ref}, the measure considers the maximum intersection. Imperfect matching is nevertheless penalized by the normalization term Card(R_i^{Ref} \cup R_j^{Seg}). In the case of a perfect matching, the index MI is equal to 1. In order to consolidate the judgment, we incorporate over- and under-segmentation errors:

\eta = \begin{cases} NR^{Ref}/NR^{Seg} & \text{if } NR^{Seg} \geq NR^{Ref} \\ \log(1 + NR^{Seg}/NR^{Ref}) & \text{otherwise} \end{cases}    (3)
The final evaluation criterion HAF is then given by the following equation:

HAF = \frac{MI + m \times \eta}{1 + m}    (4)
where m is a weighting coefficient. As over- or under-segmentation increases, η decreases, which penalizes the value of the criterion HAF. In line with the experts' perception, the log term penalizes a slight under-segmentation more heavily than a slight over-segmentation: experts are indeed immediately sensitive to under-segmentation, which affects large regions. The parameter m, which controls the weight of the over-/under-segmentation errors in the judgment, was set to 0.2 in this experiment.
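For concreteness, Eqs. (1)-(4) can be read directly off two integer label maps. The following sketch is our own illustration (the function name and the NumPy label-map representation are not part of the paper); it assumes seg and ref are arrays of the same shape whose values are region labels.

```python
import numpy as np

def haf_criterion(seg, ref, m=0.2):
    """Sketch of the HAF criterion (Eqs. 1-4) for two integer label maps."""
    seg_labels = np.unique(seg)
    ref_labels = np.unique(ref)
    n_seg, n_ref = len(seg_labels), len(ref_labels)
    mi = 0.0
    for j in seg_labels:
        seg_j = (seg == j)
        rho_j = seg_j.sum() / seg.size                      # Eq. (2)
        # reference region with the largest overlap with R_j^Seg
        overlaps = [np.logical_and(seg_j, ref == i).sum() for i in ref_labels]
        i_best = ref_labels[int(np.argmax(overlaps))]
        inter = max(overlaps)
        union = np.logical_or(seg_j, ref == i_best).sum()
        mi += rho_j * inter / union                         # one term of Eq. (1)
    eta = n_ref / n_seg if n_seg >= n_ref else np.log(1.0 + n_seg / n_ref)  # Eq. (3)
    return (mi + m * eta) / (1.0 + m)                       # Eq. (4)
```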
3 Comparative Study
In order to study the performance of this new criterion, we followed a rigorous protocol described in the next paragraphs. We first used a psychovisual study for the comparison of segmentation results [13]. We then selected several criteria dedicated to the supervised evaluation of segmentation results in a region context, and finally studied their performance relative to the proposed criterion.

3.1 Psychovisual Study
The goal of this experiment is to determine whether the comparison of multiple segmentation results of a single image can be made easily and provides a similar judgement for different experts. In order to involve a high number of experts in image processing in this psychovisual study, we developed a Web interface 1. The test is composed of 5 pages containing an original image and 5 segmentation results of this image. We use a low number of images because we prefer to have as many experts as possible in order to obtain reliable results. The original images are presented in Fig. 1. The segmentation methods we used are: EDISON [14], JSEG [15], K-means LBG [21], Thresholding [16] and Fuzzy K-Means [22]. The 5 segmentation results are colored by using a color matching method we developed to facilitate the visual comparison of two segmentation results having the same level of precision [13]. The five colored segmentation results of the first page are presented in Fig. 2. Each expert is asked to sort these five segmentation results. The score 1 for a segmentation result means that it is considered as the best one. 160 individuals participated in this psychovisual study in January 2005: 97 non-experts and 63 experts. As the score given by experts is a value between 1 and 5, the standard deviation is a value between 0 and 2. As we can see in Table 1, the average standard deviation of the scores given by experts is equal to 0.604. This value is quite low and shows that these judgements are reliable. We obtained similar results with non-experts. The average scores given by experts and non-experts for each segmentation result are very similar. These results make it clear
1 http://www.ensi-bourges.fr/LVR/SIV/interpretation/evaluation/Roc/
Fig. 1. Original images for the psychovisual study
Fig. 2. Comparison of segmentation results
that a segmentation result can be evaluated without any a priori knowledge on the interpretation goal even if these images are quite simple. We thereafter designate by experts the 160 individuals who participated for the psychovisual study.
Table 1. Reliability of the psychovisual study

Average standard deviation with experts       0.604
Average standard deviation with non-experts   0.824
Global average standard deviation             0.764
Table 2. Ranking of the five segmentations for page 1 and corresponding standard deviation

Ranking                         1       2       3       4       5
Segmentation                    Seg. 2  Seg. 4  Seg. 1  Seg. 5  Seg. 3
Standard deviation of ranking   0.657   0.651   1.208   0.680   1.004
Table 2 presents, for the first page, the ranking given by the experts and the standard deviation of this ranking for the five available segmentation results. We can observe, for example, that an uncertainty is present concerning the ranking of the third and the fifth segmentation results of the page (for these two segmentation results, the standard deviation of the experts' rankings is much higher). This information will be taken into account in the comparison of the criteria performances.

3.2 Supervised Evaluation Criteria
Supervised evaluation criteria allow one to quantify the quality of a segmentation result given a reference such as a ground truth. We selected 7 criteria from the literature:
– Vinet's measure (VIN) [17]: It computes the correct classification rate by comparing the result with a ground truth. Let I_R be a segmentation result and I_R^{ref} its ground truth. We first compute the following superposition table:

T(I_R, I_R^{ref}) = card\{R_i \cap R_j^{ref}\}, \quad i = 1..NR,\ j = 1..NR^{ref}    (5)

where card\{R_i \cap R_j^{ref}\} is the number of pixels of region i in R corresponding to pixels of region j in R^{ref}, and NR is the number of regions of R. With this table, we look recursively for the matched regions:
1. We select in T the regions maximizing card(R_i \cap R_j^{ref}),
2. All the items of T belonging to the line or column of the selected cells are unselected,
3. While items are left in T, loop to the first step.
Let C be the set of selected cells; Vinet's criterion is then defined as follows:

VIN(I_R, I_R^{ref}) = card(I) - \sum_{C} card(R_i \cap R_j^{ref})    (6)
– Hamming’s criterion (HAM ) [18]: 1 2 2 2 Let R1 = {R11 , ..., RN R1 } and R = {R1 , ..., RN R2 } be two segmentation 1 results of an image R. The classes of R and R2 which have a maximal overlapping are matched. A first measure is then computed: DH (IR1 , IR2 ) =
n2 n1
card(Ri2 ∩ Rk1 )
(7)
i=1 k=1,k=i
Let X be the common support between the two segmentation results R1 et R2 . Then, the normalized distance of Hamming is defined as follows: HAM (R1 , R2 ) = 1 −
DH (R1 , R2 ) + DH (R2 , R1 ) 2 × Card(X)
(8)
– Yasnoff’s criteria (Y AS1, Y AS2, Y AS3)[19]: These criteria are computed upon the basis of a confusion matrix CFij with i = 1..n, j = 1..n, where n is the number of classes in the reference segmentation result. CFii represents the pixels well classified while CFij (i = j) represents the number of pixels classified in class i while they belong to the class j. n 1 × n k=1
CFik − CFkk
n
Y AS1(IR , IRref ) =
i=1
n
(9) CFik
i=1
where
n
CFik represents the number of pixels of the class k and CFkk is
i=1
the number of pixels well classified k. n
1 × n k=1 n
Y AS2(IR , IRref ) =
i=1
n n
CFki − CFkk
CFij −
j=1 i=1
where −
n
n
n
(10) CFik
i=1
Cki represents the number of pixels classified k and
i=1
n n
CFij
i=1 j=1
CFik the number of pixels of the image that does not belong to the
i=1
class k. A third measure of dissimilarity between a segmentation result IR and it’s ground truth IRref is defined as follows: Y AS3(IR , IRref ) =
1 × card(IR )
min d(a, b)
a∈IR , a∈Ra
b∈Ra
(11)
where R_a \in I_R^{ref} corresponds to the region to which the pixel a \in I_R should belong, and where d(a, b) corresponds to the distance between a pixel a not belonging to R_a and the nearest pixel b of R_a \in I_R^{ref}.
– Martin's criteria (MAR1, MAR2) [20]: Let R(R^i, x) be the region containing the pixel x in the segmentation result R^i and X the common support of two segmentation results R^1 and R^2. The two criteria MAR1 and MAR2 are then defined using the following local refinement error (a small computational sketch of these two criteria is given after this list):

E(R^1, R^2, x) = \frac{card(R(R^1, x)) - card(R(R^1, x) \cap R(R^2, x))}{card(R(R^1, x))}    (12)

MAR1(R^1, R^2) = \frac{1}{card(X)} \min\left\{\sum_{x \in X} E(R^1, R^2, x), \sum_{x \in X} E(R^2, R^1, x)\right\}    (13)

MAR2(R^1, R^2) = \frac{1}{card(X)} \sum_{x \in X} \min\{E(R^1, R^2, x), E(R^2, R^1, x)\}    (14)
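As announced above, here is a small, unoptimized sketch of Martin's two measures for integer label maps; the function name and the brute-force per-pixel loop are our own illustrative choices, not part of the cited papers.

```python
import numpy as np

def martin_errors(seg1, seg2):
    """Sketch of Martin's MAR1/MAR2 (Eqs. 12-14) for two integer label maps."""
    assert seg1.shape == seg2.shape
    s1, s2 = seg1.ravel(), seg2.ravel()
    n = s1.size
    e12 = np.empty(n)
    e21 = np.empty(n)
    for idx in range(n):
        r1 = (s1 == s1[idx])                    # region containing pixel idx in seg1
        r2 = (s2 == s2[idx])                    # region containing pixel idx in seg2
        inter = np.logical_and(r1, r2).sum()
        e12[idx] = (r1.sum() - inter) / r1.sum()   # E(R1, R2, x), Eq. (12)
        e21[idx] = (r2.sum() - inter) / r2.sum()   # E(R2, R1, x)
    mar1 = min(e12.sum(), e21.sum()) / n           # Eq. (13)
    mar2 = np.minimum(e12, e21).sum() / n          # Eq. (14)
    return mar1, mar2
```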
3.3 Performances Comparison
The previous psychovisual study can be used to determine the segmentation that will subsequently be considered as the reference for the comparison of the different supervised evaluation criteria. We reorganise the segmentation results following the ranking given by the experts. For each page, we first select as reference the segmentation result designated by the experts as being the best. We compute the criteria values for each segmentation result and compare them by pairs. A comparison result is a value in {−1, 1}: if a segmentation result is better than another one, the comparison value is set to 1, otherwise it equals −1. We then put the best segmentation aside and consider the second one as the reference. Criteria values and the corresponding comparisons are once again computed. This procedure is repeated for all possible situations. Fig. 3 presents the procedure on the example of the first page and for one criterion: HAF. We then define, for each criterion, the cumulative similarity of correct comparison (CSCC):

CSCC = \sum_{p=1}^{5} \sum_{r=1}^{4} \sum_{i=r}^{4} \sum_{j=1}^{5-i} \left(1 - \frac{\sigma_i}{\sigma_{max}}\right)\left(1 - \frac{\sigma_{i+j}}{\sigma_{max}}\right) \left|CC^{p,r}_{i,i+j} - CE^{p}_{i,i+j}\right|    (15)

where CC^{p,r}_{i,i+j} corresponds to the criterion comparison of the segmentation results ranked in the i-th and (i+j)-th positions by the experts for page p, using as reference the segmentation result ranked in the r-th position, and where CE^{p}_{i,i+j} corresponds to the experts' comparison of the segmentation results ranked in the i-th and (i+j)-th positions for page p. As the results have, in that case, been reorganised to follow the expert ranking, CE^{p}_{i,i+j} always equals 1. The value σ_i corresponds to the expert ranking standard deviation for the segmentation result i, and σ_max = 2.
Fig. 3. Comparison of segmentation results evaluations for the first page and the proposed criterion. The reference used for the computation of the supervised evaluation criterion is obtained by considering the experts choice.
The presence of this term takes the experts' uncertainty into account. If the experts are uncertain how to rank two segmentation results, we can accept that the criterion fails; if nearly all experts agree on the ranking of a segmentation result, the criterion should reproduce this decision. In order to compare this error measure more easily, we also define the similarity rate of correct comparison (SRCC), which represents the absolute similarity of comparison referenced to the maximal value:

SRCC = \left(1 - \frac{CSCC}{CSCC_{max}}\right) \times 100    (16)
where CSCC_max corresponds to the biggest difference which can be obtained considering all the possible comparison results. If we now consider the efficiency of the supervised evaluation criteria in Table 3, HAF gives the best results compared to human judgment: in 97.96% of the cases, this measure gives the same assessment. This means that this measure can be useful to quantify the quality of a segmentation result in the supervised case. As the efficiency of a segmentation method is usually illustrated on synthetic images, HAF should be employed in that setting.

Table 3. Efficiency of supervised evaluation criteria compared to the reference given by the psychovisual study

HAF     VIN     HAM     YAS1    YAS2    YAS3    MAR1    MAR2
97.96%  97.32%  97.83%  94.03%  97.14%  97.76%  97.30%  96.40%
4 Conclusion and Perspectives
We have presented in this paper a new criterion for evaluating region-based segmentation methods. A subjective evaluation methodology has been followed in order to compare different segmentation results of original images. We defined a measure that quantifies the similarity between the judgment given by an expert and that of an evaluation criterion; this measure takes into account the difficulty experts have in making the judgment. We compared the proposed criterion with seven criteria from the literature, and the experimental results demonstrate its efficiency. Future work concerns the quantitative comparison of region-based segmentation methods from the literature with this new criterion.
Acknowledgments. The authors would like to thank the Conseil Régional du Centre and the European Union (FSE) for their financial support.
References 1. Haralick, R.H., Shapiro, L.G.: Image Segmentation Techniques. Image Segmentation Techniques, Computer Vision, Graphics and Image Processing (CVGIP) 29, 100–132 (1985) 2. Heath, M., Sarkar, S., Sanocki, T., Bowyer, K.: Comparison of Edge Detectors: A Methodology and Initial Study. Computer Vision and Image Understanding (CVIU) 69, 38–54 (1996) 3. Freixenet, J., Mu˜ noz, X., Raba, D., Marti, J., Cufi, X.: Yet Another Survey on Image Segmentation: Region and Boundary Information Integration. In: Tistarelli, M., Bigun, J., Jain, A.K. (eds.) ECCV 2002. LNCS, vol. 2359, pp. 408–422. Springer, Heidelberg (2002) 4. Andrey, P.: Selectionist Relaxation: Genetic Algorithms Applied to Image Segmentation. Image and Vision Computing 17, 175–187 (1999)
5. Bhanu, B., Peng, J.: Adaptative Integrated Image Segmentation and Object Recognition. IEEE transactions on systems, man, and cybernetics 30, 427–441 (2000) 6. Cavallaro, A., Gelasca, E.D., Ebrahimi, T.: Objective evaluation of segmentation quality using spatio-temporal context. In: IEEE International Conference on Image Processing (ICIP), pp. 301–304. IEEE, Los Alamitos (2002) 7. Jiang, X., Marti, C., Irniger, C., Bunke, H.: Distance Measures for Image Segmentation Evaluation. EURASIP Journal on Applied Signal Processing 2006, Article ID 35909 (2006) 8. Zhang, Y.J.: A survey on evaluation methods for image segmentation. Pattern Recognition 29, 1335–1346 (1996) 9. Chabrier, S., Rosenberger, C., Laurent, H., Emile, B., March´e, P.: Evaluating the segmentation result of a gray-level image. In: European Signal Processing Conference (EUSIPCO), pp. 953–956 (2004) 10. Montresor, S., Lado, M.J., Tahoces, P.G., Souto, M., Vidal, J.J.: Analytic wavelets applied for the detection of microcalcifications. A tool for digital mammography. In: European Signal Processing Conference (EUSIPCO), pp. 2215–2218 (2004) 11. Marques, F., Cuberas, G., Gasull, A., Seron, D., Moreso, F., Joshi, N.: Mathematic morphology approach for renal biopsy analysis. In: European Signal Processing Conference (EUSIPCO), pp. 2195–2198 (2004) 12. Lee, W.W., Richardson, I., Gow, K., Zhao, Y., Staff, R.: Hybrid segmentation of the hippocampus in MR images. In: European Signal Processing Conference (EUSIPCO) (2005) 13. Chabrier, S., Rosenberger, C., Emile, B.: Evaluation methodologies of image processing: an overview. In: 8th International IEEE Conference on Signal Processing (ICSP), IEEE Computer Society Press, Los Alamitos (2006) 14. Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern analysis and Machine Intelligence 24, 603– 619 (2002) 15. Deng, Y., Manjunath, B.S.: Unsupervised segmentation of color-texture regions in images and video. IEEE Transactions on Pattern Analysis and Machine Intelligence (2001) 16. Kermad, C., Vozel, B., Chehdi, K.: Hyperspectral image analysis and dimensionality: a scalar scheme through multi-thresholding technique. In: Proceedings of the Eos/Spie Symposium on Remote sensing, vol. 31(4170) (2000) 17. Vinet, L.: Segmentation et mise en correspondance de r´egions de paires d’images st´er´eoscopiques, Th´ese de Doctorat de l’universit´e de Paris IX Dauphine (1991) 18. Huang, Q., Dom, B.: Quantitative Methods of Evaluating Image Segmentation. In: Proceedings of the International Conference on Image Processing (ICIP’95), vol. 3, pp. 53–56 (1995) 19. Yasnoff, W.A., Mui, J.K., Bacus, J.W.: Error measures for scene segmentation. Pattern Recognition 9, 217–231 (1977) 20. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In: Proceedings of the 8th International Conference Computer Vision, pp. 416–423 (2001) 21. MacQueen, J.B.: Some Methods for classification and Analysis of Multivariate Observations. In: Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press, Berkeley (1967) 22. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York (1981)
A Multi-agent Approach for Range Image Segmentation with Bayesian Edge Regularization

Smaine Mazouzi1, Zahia Guessoum2, Fabien Michel1, and Mohamed Batouche3

1 MODECO-CReSTIC, Université de Reims, B.P. 1035, 51687, Reims, France
{mazouzi,fmichel}@leri.univ-reims.fr
2 LIP6, Université de Paris 6, 104, av. du Président Kennedy, 75016, Paris, France
[email protected]
3 Département d'informatique, Université de Constantine, 25000, Algérie
[email protected]
Abstract. We present in this paper a multi-agent approach for range image segmentation. The approach consists in using autonomous agents for the segmentation of a range image in its different planar regions. Agents move on the image and perform local actions on the pixels, allowing robust region extraction and accurate edge detection. In order to improve the segmentation quality, a Bayesian edge regularization is applied to the resulting edges. A new Markov Random Field (MRF) model is introduced to model the edge smoothness, used as a prior in the edge regularization. The experimental results obtained with real images from the ABW database show a good potential of the proposed approach for range image analysis, regarding both segmentation efficiency, and detection accuracy. Keywords: Image segmentation, Multi-agent systems, Range image, Bayesian-MRF estimation.
1 Introduction
Image segmentation consists in assigning the pixels of an image to homogeneous and disjoint sets called image regions. The segmentation of an image is often necessary to provide a compact and convenient description of its content, suitable for high-level analysis and understanding. In range images, segmentation methods can be divided into two distinct categories: edge-based segmentation methods and region-based segmentation methods. In the first category, pixels which correspond to discontinuities in depth (jump edges) or in surface normals (roof edges) are selected and chained in order to delimit the regions in the image [6,11]. Edge-based methods are well known for their low computational cost; however, they are very sensitive to noise. Region-based methods use geometrical surface properties to gather pixels with the same properties in disjoint regions [5,1]. Compared to edge-based methods, they are more stable and less sensitive to noise. However, they are computationally costly and their efficiency depends strongly on the selection of the region seeds. In both approaches, image denoising is often necessary. However, in the case of highly noisy images such as range images [8], a strong noise
smoothing can erase roof edges and smooth edges. However, if the noise is under-smoothed, the distortions which remain in the image lead to inaccurate or erroneous results. In range images, several recent segmentation methods fail because they do not correctly address and resolve this problem [10,1]. To deal with this problem, we introduce in this paper a multi-agent approach for range image segmentation. It consists in using a dense population of reactive agents. Agents move over the image and act on its pixels. While moving over the image, an agent adapts to the current planar region on which it is situated and memorizes its properties. At the boundaries between regions, the agents compete to align the pixels of the boundaries to their respective regions. The resulting alternating alignment of the boundary pixels preserves the region boundaries against erasing. A pixel is therefore processed according to both its neighborhood and the agents that visit this pixel. An agent acts on the pixels with more certainty, acquired from its moves over large areas of the regions of the image. The combination of the global information memorized within the agent and the local information of the image provides more reliable decisions. Unfortunately, the competitive alignment of the region boundaries results in distorted and badly localized edges. These are therefore corrected using a Bayesian regularization, based on a new Markov Random Field (MRF) model. The introduced MRF model is used to model the smoothness of image edges, considered as a prior in edge regularization. Extensive experimentations have been performed using real images from the ABW database [8]. The obtained results show a good potential of the proposed approach for an efficient and accurate segmentation of range images.
2 2.1
Related Work Agent-Based Systems for Image Segmentation
Several agent-based systems have been proposed for image analysis and object recognition. In this review we consider only works which have addressed a solution in image segmentation. Liu et al. [15] introduce a reactive agent-based system for brain MRI segmentation. Agents are used to label the pixels of the image according to their membership grade to the different regions. When finding pixels of a specific homogenous region, agents create offspring agents into their neighboring regions. An agent is created so that it becomes more likely to meet more homogenous
pixels. For the same type of images, Richard et al. [16] propose a hierarchical architecture of situated and cooperative agents. Several control agents are distributed in the volume. The role of each one consists in creating tissue-dedicated agents, which perform a local region growing. The statistical parameters of the data distribution, needed to perform region growing, are updated according to the interaction between neighboring agents. Based on a cognitive architecture, Bovenkamp et al. [4] have developed a multi-agent system for IntraVascular UltraSound (IVUS) image segmentation. They aim to elaborate a high-level, knowledge-based control over the algorithms of low-level image processing. In this system, an agent is assigned to every expected object in the image. Most of the proposed agent-based systems for image segmentation are specific to the image contents, and deal exclusively with jump edge detection. Following a supervised approach, these systems segment images into known and previously expected regions. The multi-agent approach proposed in this paper claims to be general and unsupervised. It aims to segment an image into its different regions by using geometrical surface properties. The adaptive and competitive behavior of the agents allows a collective and distributed image segmentation. We show in this work that simple interactions between agents can provide an alternative way for image segmentation.

2.2 Bayesian Inference in Range Image Segmentation
Few authors have integrated Bayesian inference in range image segmentation. Lavalle and Hutchinson [13] have used a Bayesian test to merge regions in both range and textured images. The merging of two regions depends on the probability that the resulting region is homogenous. Jain and Nadabar [9] have proposed a Bayesian method for edge detection in range images. Authors use the Line Process (LP) Markov random field (MRF) model [7] to label image pixels as EDGE or NON-EDGE pixels. Wang and Wang [17] have presented a hybrid scheme for range image segmentation. First, they proposed a joint Bayesian estimation of both pixel labels, and surface patches. Next, the solution is improved by combining the Scan Line algorithm [11], and the Multi-Level Logistic (MLL) MRF model [14]. In spite of various contributions of the works previously cited, some aspects inherent to range image segmentation were omitted. Indeed, most of the works use Markovian models that are based exclusively on the surface smoothness prior. In our work, a refinement of the initial segmentation is performed by Bayesian regularization of the resulting region boundaries using a new Markov random field model. The latter models the edge smoothness, which is considered as a prior in the edge regularization.
3 Multi-agent Range Image Segmentation

3.1 Surface Modeling
A range image is a discretized two-dimensional array where at each pixel (x, y) is recorded the distance d(x, y) between the range finder and the corresponding
point of the scene. Let d*(x, y) be the equation parameters of the tangent plane at (x, y). The best tangent plane is obtained by the multiple regression method using neighboring pixels situated within a 3 × 3 window centred at (x, y) and having close depths, according to a given threshold (Tr_h). The plane equation in a 3-D coordinate system may be expressed as z = ax + by + c, where (a, b, −1)^T is a normal vector to the plane and |c|/\sqrt{a^2 + b^2 + 1} is the orthogonal distance between the plane and the coordinate origin. Two planes are considered equal if they have, according to some thresholds, the same orientation and the same distance to the coordinate origin. Let θ be the angle between the two normal vectors, and h the distance between the two planes; the two planes are considered equal if sin(θ) ≤ Tr_θ and h ≤ Tr_h, where Tr_θ and Tr_h are respectively the angle and the distance thresholds. Plane comparison is first used to test if a given pixel belongs to a planar region, given its plane equation. It is also used to test whether the pixel is a pixel of interest (edge or noise pixel): the pixel in question is considered as a pixel of interest if at least one of its neighbors has a different plane equation, according to the previous thresholds.
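A minimal sketch of this surface-modeling step is given below, assuming a NumPy depth map. The function names are ours; the reading of the angle test as sin(θ) ≤ sin(Tr_θ) (since Tr_θ is later given in degrees) and of h as the difference of the two orthogonal distances to the origin are our assumptions.

```python
import numpy as np

def fit_tangent_plane(depth, x, y, tr_h):
    """Least-squares fit of z = a*x + b*y + c over the 3x3 window around (x, y),
    using only neighbors whose depth is within tr_h of the central pixel."""
    z0 = depth[y, x]
    pts = [(x + dx, y + dy, depth[y + dy, x + dx])
           for dy in (-1, 0, 1) for dx in (-1, 0, 1)
           if abs(depth[y + dy, x + dx] - z0) <= tr_h]
    pts = np.asarray(pts, dtype=float)
    A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    (a, b, c), *_ = np.linalg.lstsq(A, pts[:, 2], rcond=None)
    return a, b, c

def planes_equal(p1, p2, tr_theta_deg, tr_h):
    """Compare two planes by normal orientation and distance to the origin."""
    n1 = np.array([p1[0], p1[1], -1.0])
    n2 = np.array([p2[0], p2[1], -1.0])
    cos_t = abs(np.dot(n1, n2)) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    sin_t = np.sqrt(max(0.0, 1.0 - cos_t ** 2))
    d1 = abs(p1[2]) / np.linalg.norm(n1)
    d2 = abs(p2[2]) / np.linalg.norm(n2)
    return sin_t <= np.sin(np.radians(tr_theta_deg)) and abs(d1 - d2) <= tr_h
```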
Agent Behavior
The image is considered as the environment in which the agents are initialized at random positions. An agent checks if it is situated within a planar region, and adapts to this region if it is planar, by memorizing its plane equation. Next, the agent performs actions, which depend on both its state and the state of the pixel on which it is located. At each time t, an agent is characterized by its position (xt , yt ) over the image, and by its ability At to act on the encountered pixels. At the beginning of the process, all the agents are unable to alter any pixel of the image. After having been adapted to a planar region, an agent becomes able to modify the first encountered pixel that not belongs to the current region (At =true). When an agent alters a pixel, it becomes unable to alter other pixels (At =false) and starts again searching for a new planar region. An agent having modified a pixel records in an appropriate two-dimensional array I, at (xt , yt ) the last state of the visited pixel: I(xt , yt ) ∈ {smoothed, aligned, unchanged}. We show next, that this simple behavior of the agents allows both the detection of the edges, and the removal of the noise regions. Following are the tasks performed by an agent, according to its state and its position. Searching for a Planar Region. After its creation, an agent randomly moves within the image and searches for a planar region around its current position. The agent uses a region seed formed by the last P visited pixels. P is called the adaptation path-length. It represents the confidence degree that the agent is situated within a planar region. So, the agent considers that it is within a planar region if the pixels of the seed form a planar surface. The agent memorizes the proprieties of the new region and considers it as its current planar region. Henceforth it becomes able to alter the first encountered pixel that does not belong to its new region (At =true).
Moving on a Planar Region. While moving inside a planar region, an agent smoothes the image at the pixel on which it is located by updating the equations of both the memorized plane and the plane at the current position (d*(x_t, y_t)). This is done by replacing the two equations by their weighted average. Let (a, b, c) and (a', b', c') be the parameters of the plane at the current pixel and of the memorized plane, respectively. The resulting parameters of the average plane are obtained as follows:

(a'', b'', c'') = \frac{1}{1 + p}(a + p a', b + p b', c + p c')    (1)

where p is the length of the path crossed by the agent on the current region.

Pixel Alignment. When an agent meets a pixel of interest (i.e. not belonging to its current planar region), the pixel is partially aligned to the planar region on which the agent moves. The parameters (a'', b'', c'') of the new plane equation at the pixel position are obtained by linear combination of the current parameters (a, b, c) and the parameters of the memorized plane equation (a', b', c'):

(a'', b'', c'') = \frac{1}{1 + \xi}(a + \xi a', b + \xi b', c + \xi c')    (2)

where ξ is the alteration strength. The agent then becomes unable to alter pixels (A_t = false) and starts searching again for a new planar region. The alteration strength ξ is a critical parameter which affects the quality of the results and the computation time. Indeed, high values of ξ lead to a fast detection of regions; however, the resulting region boundaries are strongly distorted and badly localized (Fig. 1b). Low values of ξ result in a slow detection; nevertheless, region boundaries in this case are well detected and localized (Fig. 1c). To speed up the segmentation process and avoid edge distortions, an agent chooses the alteration strength between ξ_min and ξ_max according to the information recorded by other agents in the array I. An agent assumes that the current planar region is adjacent to a noise region, and thus uses ξ_max as alteration strength, if the number of "unchanged" pixels (situated in a noisy region) around the agent is greater than a certain threshold (fixed to 3 in our experiments). Indeed, pixels labeled "unchanged" in the adjacent region mean that the latter is a noise region to which agents have not adapted and whose pixels they consequently have not smoothed. Otherwise, the agent assumes that the current planar region is adjacent to another one, where other agents have labeled the pixels as "smoothed" or "aligned"; in this case the agent uses the alteration strength ξ_min.
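The two update rules (Eqs. 1 and 2) read directly as code; the tuple representation of a plane (a, b, c) and the function names below are our own illustrative choices.

```python
def average_with_memorized(current, memorized, p):
    """Eq. (1): weighted average of the current plane and the memorized plane,
    p being the path length crossed on the current region."""
    a, b, c = current
    am, bm, cm = memorized
    return ((a + p * am) / (1 + p),
            (b + p * bm) / (1 + p),
            (c + p * cm) / (1 + p))

def align_pixel(current, memorized, xi):
    """Eq. (2): partial alignment of a pixel of interest with strength xi."""
    a, b, c = current
    am, bm, cm = memorized
    return ((a + xi * am) / (1 + xi),
            (b + xi * bm) / (1 + xi),
            (c + xi * cm) / (1 + xi))
```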
Edge Emergence and Noise Removal
While moving over the image, an agent smoothes the pixels that approximately belong to its planar region, and it considers all other pixels as noise pixels. Among these latter, the agent systematically aligns the first encountered one to its current region. However, pixels on the boundaries of planar regions are true-edge pixels, and thus should not be aligned. Nevertheless, the competition
Fig. 1. The impact of the alteration strength on the segmentation results: (a) Range image (abw.test.3); (b) segmentation results with ξmin = ξmax = 4 at t=2500; (c) segmentation results with ξmin = 0.3 and ξmax = 5 at t=13000
between agents preserves these pixels against an inappropriate smoothing. Indeed, around an edge between two adjacent planar regions, two groups of agents are formed on the two sides of the edge. Each group is formed of agents passing from one region to the other. Agents of each group align the pixels of the edge to their respective region. So, the pixels of the edge are continuously swapped between the two adjacent regions. The resulting alternative alignment of edge pixels allows these pixels to remain emergent in the image. This pattern of competitive actions between agents allows the emergence of the edges in the image, whose detection is not coded in any agent, but results from the collective action of all the agents. An agent, having aligned a pixel which belongs to the border of a noise region and having moved inside this region, will not be able to adapt. Consequently, it cannot align any pixel when leaving the noise region. This occurs in two distinct situations: 1) when the region is planar but insufficiently large to allow agents to cross the minimal path-length P , necessary to be able to adapt; 2) when the region is sufficiently large but not planar, or made up of random depths (noise). In both situations, the agent leaves the noise region and will adapt inside the surrounding planar region. Boundaries of noise regions are continuously aligned from outside by including their pixels in the true surrounding regions. So, these regions continuously contract, and they finally disappear after several steps. After several iterations (fixed to 13000), all image regions are well delimited by the detected boundaries. A simple region growing, steered by the detected boundaries, allows to provide the regions of the image.
4 Bayesian Edge Regularization

4.1 Segmentation Modeling as Bayesian Estimation
We have used piecewise smoothness of image edges as priors to model the distributions of boundary pixels in range images. Let S denote the image lattice, and M be the number of regions. So, each pixel in the image can take a label from the
set of labels L = {l_1, .., l_M}. The labeling set F = {f_(x,y), (x, y) ∈ S, f_(x,y) ∈ L} represents an image segmentation. If we assume that F is Markovian, segmenting S according to the Bayesian-MRF framework [14] can be done by computing the maximum a posteriori (MAP) estimate of the distribution of the set F, by considering F as a Markov random field (MRF). According to Bayes' rule, the a posteriori probability P(F/d) is expressed as follows:

P(F/d) = \frac{p(d/F)\,P(F)}{p(d)}    (3)
P(F) = Z^{-1} e^{-U(F)} is the a priori probability of F, with Z = \sum_{F} e^{-U(F)} a normalization constant called the partition function. The a priori energy U(F) is the sum of the clique potentials V_c(F) over the set of all possible cliques C: U(F) = \sum_{c \in C} V_c(F). In order to model the edge smoothness, we use cliques formed by 9 sites located in a 3×3 window. Let c_{3×3} be a clique of 3×3 sites centred at an edge pixel (x, y), and ζ (ζ < 0) a potential parameter. Considering all possible configurations in Fig. 2, the potential V_c of cliques in C can be expressed as follows:

V_c(c_{3×3}(x, y)) = \begin{cases} \zeta & \text{if } \exists (x', y'), (x'', y'') \in c_{3×3} \mid f_{(x,y)} = f_{(x',y')} = f_{(x'',y'')} \text{ and the angle } ((x', y'), (x, y), (x'', y'')) = \pi \\ 0 & \text{if } \exists (x', y'), (x'', y'') \in c_{3×3} \mid f_{(x,y)} = f_{(x',y')} = f_{(x'',y'')} \text{ and the angle } ((x', y'), (x, y), (x'', y'')) = 2\pi/3 \\ -\zeta & \text{otherwise} \end{cases}    (4)

The configurations used to define V_c depend on the surface type. For the images containing polyhedral objects considered in this work, V_c is defined on the basis that the boundary between two adjacent regions is formed by pixels belonging to the same straight line (Fig. 2). Configurations which correspond to locally unsmooth edges are therefore penalized by using a positive clique potential (−ζ). The likelihood distribution p(d/F) is obtained by assuming that the observations d are degraded by independent Gaussian noise: d(x, y) = a_{f(x,y)} x + b_{f(x,y)} y + c_{f(x,y)} + e(x, y), where (a_{f(x,y)}, b_{f(x,y)}, c_{f(x,y)}) are the parameters of the plane equation at the pixel (x, y), assuming that it is labeled f_{(x,y)}, and e(x, y) ∼ N(0, σ_l^2) with σ_l^2 = \sum_{\{(x,y) \mid f_{(x,y)} = l\}} (a_l x + b_l y + c_l − d(x, y))^2. So the likelihood distribution is expressed as follows:

p(d/F) = \prod_{(x,y) \in S} \frac{1}{\sqrt{2\pi \sigma^2_{f(x,y)}}}\, e^{-U(d/F)}    (5)

with the likelihood energy U(d/F) defined by:

U(d/F) = \sum_{(x,y) \in S} (a_{f(x,y)} x + b_{f(x,y)} y + c_{f(x,y)} − d(x, y))^2 / 2\sigma^2_{f(x,y)}    (6)
Since p(d) is constant for a fixed d, the solution F ∗ is obtained by maximizing the a posteriori probability P (F/d) ∝ p(d/F )P (F ), which is equivalent to minimizing the a posteriori energy U (F/d) = U (d/F ) + U (F ): F ∗ = argmin{U (d/F ) + U (F )}
Fig. 2. Clique potential Vc (c3×3 ) defined according to the edge smoothness prior. (a) Full smooth edge: Vc (c3×3 ) = ζ; (b) partial smooth edge: Vc (c3×3 ) = 0; otherwise, the edge is not locally smooth: Vc (c3×3 ) = −ζ.
4.2 Optimal Solution Computation
By assuming that F is Markovian and that the observations {d(x, y)} are conditionally independent, we have used the Iterated Conditional Modes (ICM) algorithm [3] to minimize the a posteriori energy U(F/d). By considering U(F/d) as the sum of energies over all image sites, U(F/d) = \sum_{(x,y) \in S} U(f_{(x,y)}/d(x, y)), we can separate it into two terms:

U(F/d) = \sum_{(x,y) \in S'} U(f_{(x,y)}/d(x, y)) + \sum_{(x,y) \in S - S'} U(f_{(x,y)}/d(x, y))    (7)

where S' is the set of sites belonging to region boundaries:

S' = \{(x, y) \in S \mid \exists (x', y'), (x' − x, y' − y) \in \{−1, 0, 1\}^2 \wedge f_{(x,y)} \neq f_{(x',y')}\}

Assuming the correctness of the labeling of the set S − S' (performed by the multi-agent segmentation), the term \sum_{(x,y) \in S - S'} U(f_{(x,y)}/d(x, y)) is constant. So, minimizing the energy U(F/d) is equivalent to minimizing the energy U'(F/d) which corresponds to the sites in S': U'(F/d) = \sum_{(x,y) \in S'} U(f_{(x,y)}/d(x, y)). The assumption of the correctness of the labeling of S − S' also allows us to define a constraint on the set of values that a site in S' can take during the execution of the ICM algorithm. Indeed, the label f^k_{(x,y)} at iteration k of a site (x, y) is chosen among the set L'(x, y) ⊂ L containing the labels of the sites in a 3 × 3 window centred at (x, y). Formally, L'(x, y) is defined as follows:

L'(x, y) = \{l \mid \exists (x', y') \in S − S', (x' − x, y' − y) \in \{−1, 0, 1\}^2 \wedge f_{(x',y')} = l\}    (8)

The two previous heuristics speed up the calculation of the minimum of the a posteriori energy U(F/d). They also allow the region continuity constraint to be satisfied. For the latter, if we assume that the distance between two coplanar regions R and R' is greater than 3 (the size of the window), the labels l_R and l_{R'} corresponding respectively to R and R' cannot belong to the same set
L'(x, y). For example, if the site (x, y) is closer to R, it cannot be labeled l_{R'}, even if the energies U(l_R/d(x, y)) and U(l_{R'}/d(x, y)) are equal.
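To make the procedure concrete, the sketch below shows a constrained ICM pass restricted to boundary sites, under our own simplifying assumptions: labels, per-label planes and variances are stored in NumPy/Python containers, the candidate set L'(x, y) is approximated by the labels of the full 3 × 3 neighborhood, and only the π (collinear) case of Eq. (4) is kept in the smoothness term. None of these names come from the paper.

```python
import numpy as np

def clique_potential(labels, y, x, l, zeta):
    """Rough reading of Eq. (4): reward same-label neighbors collinear through (x, y);
    the intermediate 2*pi/3 case is omitted for brevity."""
    offs = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0) and labels[y + dy, x + dx] == l]
    for (dy1, dx1) in offs:
        if (-dy1, -dx1) in offs:        # two neighbors aligned through the centre
            return zeta                 # zeta < 0: a smooth edge lowers the energy
    return -zeta                        # otherwise penalize

def icm_edge_regularization(labels, depth, planes, sigma2, zeta, n_iter=5):
    """Constrained ICM over boundary sites only (Eqs. 7-8)."""
    h, w = labels.shape
    for _ in range(n_iter):
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                win = labels[y-1:y+2, x-1:x+2]
                if np.all(win == labels[y, x]):
                    continue                        # not a boundary site
                best_l, best_u = labels[y, x], np.inf
                for l in set(win.ravel()):          # candidate labels L'(x, y)
                    a, b, c = planes[l]
                    u = (a * x + b * y + c - depth[y, x]) ** 2 / (2 * sigma2[l])
                    u += clique_potential(labels, y, x, l, zeta)
                    if u < best_u:
                        best_u, best_l = u, l
                labels[y, x] = best_l
    return labels
```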
5 Experimentation and Analysis
Hoover et al. have proposed a dedicated framework for the evaluation of range image segmentation algorithms [8], which has been used in several related works [11,10,5,1]. The framework consists of a set of real range images and a set of objective performance metrics. It allows the comparison of a machine-generated segmentation (MS) with a manually-generated segmentation, assumed ideal and representing the ground truth (GT). Region classification is performed according to a compare tool tolerance T, 50% < T ≤ 100%, which reflects the strictness of the classification. The 40 real images of the ABW set are divided into two subsets: 10 training images and 30 test images. In our case, four methods cited in [8], namely USF, WSU, UB and UE, are involved in the result comparison.

5.1 Parameter Selection
Since the evaluation framework provides a set of training images with ground truth segmentation (GT), we have opted for a supervised approach to parameter selection. For our approach, named 2ARIS (Agent-based Approach for Range Image Segmentation), seven parameters should be fixed: ξ_min, ξ_max, Tr_θ, Tr_h, N, P, and ζ. The performance criterion used in parameter selection is the average number of correctly detected regions with the compare tool tolerance T set to 80%. The set of parameters is divided into three subsets. 1) ξ_min, ξ_max, Tr_θ, and Tr_h represent respectively the two alignment strengths, the angle threshold, and the depth threshold; these parameters are used for testing and aligning the pixels of the image. 2) N and P represent respectively the number of agents and the adaptation path-length; these two parameters control the dynamics of the multi-agent system. 3) ζ represents the clique potential parameter. For the first parameter subset, 256 combinations, namely (ξ_min, ξ_max, Tr_θ, Tr_h) ∈ {0.5, 0.3, 0.1, 0.05} × {1.0, 3.0, 5.0, 7.0} × {15°, 18°, 21°, 24°} × {12, 16, 20, 24}, were run on the training images. These parameters are set as follows: ξ_min = 0.3, ξ_max = 5.0, Tr_θ = 21° and Tr_h = 16. In order to set the parameters N and P, 25 combinations, namely (N, P) ∈ {1500, 2000, 2500, 3000, 3500} × {3, 5, 7, 9, 11}, were run on the training set. The optimal values of N and P are respectively 2500 and 7. The Coding method [2] was used to estimate the parameter ζ: a value of ζ is computed for each image in the training set and the average is used as the final value of the parameter. The optimum for each training image is calculated by the simulated annealing algorithm [12], using a Gibbs sampler [7]. The average value of ζ obtained with the used training set is −0.27 × 10^{-4}.

5.2 Experimental Results
Fig. 3 shows an instance of segmentation progression within time of a typical range image (abw.test.8) [8,5]. The time t represents the number of steps
performed by each agent since the beginning of the process. Figures 3b, 3c, 3d and 3e show the set of pixels of interest (edge or noise pixels) respectively at t=1000, 5000, 9000 and 13000. Regions are progressively smoothed by aligning noise pixels to the surrounding planar regions. Edges between adjacent regions are also progressively thinned. At the end of the process, region borders consist of thin lines of one pixel wide (Fig. 3e). Fig. 3f shows the segmentation result after edge regularization. We can note that the positions of some edge pixels have been corrected. The regularization was performed typically for roof edges, situated between adjacent regions.
Fig. 3. Segmentation progression. (a) Range image (abw.test.8) ; (b) at t=1000 ; (c) at t=5000 ; (d) at t=9000 ; (e) at t=13000 ; (f) after edge regularization
Table 1 contains the average results obtained with all test images, and for all performance metrics. The compare tool tolerance was set to the typical value 80%. By considering both correct detection and incorrect detection metrics, obtained results show the good efficiency of our method. Fig. 4 shows the average numbers of correctly detected regions for all test images, according to the compare tool tolerance T . Results show that the number of correctly detected regions by our system is in average better than those of USF, UB and WSU. For instance, our system scored higher than WSU for all the values of the compare tool tolerance T . It scored higher than USF for T ≥ 80%, and better than UB for T ≤ 80%. For all incorrect detection metrics (instances of Over-segmentation,
Table 1. Average results of the different involved methods with T = 80%

Method  GT    Correct det.  Over-seg.  Under-seg.  Missed  Noise
USF     15.2  12.7          0.2        0.1         2.1     1.2
WSU     15.2  9.7           0.5        0.2         4.5     2.2
UB      15.2  12.8          0.5        0.1         1.7     2.1
UE      15.2  13.4          0.4        0.2         1.1     0.8
2ARIS   15.2  13.0          0.5        0.1         1.4     0.9
Fig. 4. Average results of correctly detected regions of all methods, according to the compare tool tolerance T ; 0.5 < T ≤ 1.0
Under-segmentation, Missed Region, Noise Region), our system has equivalent scores to those of UE and USF. The two latter scored higher than UB and WSU, regarding incorrect detection metrics.
6 Conclusion
In this paper we have presented a multi-agent approach for range image segmentation. Edge detection and noise removal have resulted from indirect interaction between autonomous agents moving over the image. Image edges, for which no explicit detection was coded in any agent, result from the collective action of all the agents. The proposed approach aims to improve efficiency and to deal with the problem of result accuracy. Indeed, obtained results are better than those provided by the traditional region growing algorithm. Bayesian edge regularization using an appropriate MRF model, introduced in this paper, has allowed improving the segmentation results. The experimental results obtained with real images from the ABW database were compared to those provided by four typical algorithms for range image segmentation. Comparison results show the good efficiency of the proposed approach for accurate segmentation of range images.
References 1. Bab Hadiashar, A., Gheissari, N.: Range image segmentation using surface selection criterion. IEEE Transactions on Image Processing 15(7), 2006–2018 (2006) 2. Besag, J.E.: Spatial interaction and statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B 36, 192–236 (1974) 3. Besag, J.E.: On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B 48, 259–302 (1986) 4. Bovenkamp, E.G.P., Dijkstra, J., Bosch, J.G., Reiber, J.H.C.: Multi-agent segmentation of IVUS images. Pattern Recognition 37(4), 647–663 (2004) 5. Ding, Y., Ping, X., Hu, M., Wang, D.: Range image segmentation based on randomized hough transform. Pattern Recognition Letters 26(13), 2033–2041 (2005) 6. Fan, T.J., Medioni, G.G., Nevatia, R.: Segmented description of 3-D surfaces. IEEE J. Robotics Automat. 3(6), 527–538 (1987) 7. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6), 721–741 (1984) 8. Hoover, A., Jean-Baptiste, G., Jiang, X., Flynn, P.J., Bunke, H., Goldgof, D.B., Bowyer, K.W., Eggert, D.W., Fitzgibbon, A.W., Fisher, R.B.: An experimental comparison of range image segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(7), 673–689 (1996) 9. Jain, A.K., Nadabar, S.G.: MRF model-based segmentation of range images. In: International Conference on Computer Vision, pp. 667–671 (1990) 10. Jiang, X., Bowyer, K.W., Morioka, Y., Hiura, S., Sato, K., Inokuchi, S., Bock, M., Guerra, C., Loke, R.E., Hans du Buf, J.M.: Some further results of experimental comparison of range image segmentation algorithms. In: International Conference on Pattern Recognition, vol. 4, pp. 4877–4882 (2000) 11. Jiang, X., Bunke, H.: Edge detection in range images based on Scan Line approximation. Computer Vision and Image Understanding 73(2), 183–199 (1999) 12. Kirkpatrick, J.S., Gelatt, Jr. C.D., Vecchi, M.P.: Optimization by simulated annealing. Readings in computer vision: issues, problems, principles, and paradigms, pp. 606–615 (1987) 13. LaValle, S.M., Hutchinson, S.A.: Bayesian region merging probability for parametric image models. In: Proc. 1993 IEEE Conference on Computer Vision and Pattern Recognition, pp. 778–779. IEEE Computer Society Press, Los Alamitos (1993) 14. Li, S.Z.: Markov random field modeling in image analysis. Springer, New York, Inc. Secaucus, NJ, USA (2001) 15. Liu, J., Tang, Y.Y.: Adaptive image segmentation with distributed behavior-based agents. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(6), 544–551 (1999) 16. Richard, N., Dojat, M., Garbay, C.: Automated segmentation of human brain MR images using a multi-agent approach. Artificial Intelligence in Medicine 30(2), 153– 176 (2004) 17. Wang, X., Wang, H.: Markov random field modeled range image segmentation. Pattern Recognition Letters 25(3), 367–375 (2004)
Adaptive Image Restoration Based on Local Robust Blur Estimation

Hao Hu1 and Gerard de Haan1,2

1 Eindhoven University of Technology, Den Dolech 2, 5600 MB Eindhoven, The Netherlands
2 Philips Research Laboratories, High Tech Campus 36, 5656 AE Eindhoven, The Netherlands
Abstract. This paper presents a novel non-iterative method to restore the out-of-focus part of an image. The proposed method first applies a robust local blur estimation to obtain a blur map of the image. The estimation uses the maximum of difference ratio between the original image and its two digitally re-blurred versions to estimate the local blur radius. Then adaptive least mean square filters based on the local blur radius and the image structure are applied to restore the image and to eliminate the sensor noise. Experimental results have shown that despite its low complexity the proposed method has a good performance at reducing spatially varying blur.
1 Introduction
Focal blur, or out-of-focus blur, in images and videos occurs when objects in the scene are placed outside the focal plane of the camera. Due to the limited focal range of optical lenses or sub-optimal settings of the camera, the resulting image may suffer from blur degradation. As objects at varying distances are often blurred differently in the image, accurate blur estimation is essential for image restoration. The technique of estimating the blur and restoring all-in-focus images is called multi-focusing. The demand for such a technique is emerging in many applications, such as digital cameras and video surveillance. The technique potentially enables the use of algorithms running on relatively cheap DSP chips instead of expensive optical parts. Many techniques have been proposed to restore the original image from the blurred image. Most of them, like [1], are designed to estimate spatially invariant blur. For local blur estimation, methods are typically based on an analysis of an ideal edge signal. In Elder's method [2] the blurred edge signal is convolved with the second derivative of a Gaussian function and the response has a positive and a negative peak. The distance between these peak positions can be used to determine the blur radius. Another approach, from Kim [3], is based on an isotropic discrete point spread function (PSF) model. The one-dimensional step response along the direction orthogonal to the edge is estimated and the PSF can be obtained by solving a set of linear equations related to
the step response. Both Elder’s and Kim’s method require detection of the edge direction, which adds complexity to the algorithm. In this paper, we propose a new multi-focusing method that features low complexity aiming at real-time implementation. The proposed method adopts a simple non-iterative blur estimator, as proposed in our earlier work [4]. The blur estimator uses a Gaussian isotropic PSF model and the difference between digitally re-blurred versions of an image is used to estimate the blur radius without edge detection. As a de-blurring filter typically has a high pass characteristic, the sensor noise may be amplified during the restoration process. To avoid this and even suppress the noise, adaptive filters based on the local blur radius and image structure information [5] are applied in the image restoration part. The rest of the paper is organized as follows. In Section 2 we present the proposed blur estimation algorithm and its analysis based on an ideal edge model. Section 3 shows the proposed adaptive image restoration using local image structure and blur radius. Some experimental results on natural images are provided in Section 4 and, finally, Section 5 concludes the paper.
2 Local Blur Estimation
We analyse the blur estimation with a one-dimensional (1D) signal. We assume an ideal edge signal and a discrete Gaussian blur kernel. The edge is modeled as a step function with amplitude A and offset B. For a discrete signal, the edge f(x) shown in Fig. 1 is

f(x) = \begin{cases} A + B, & x \geq 0 \\ B, & x < 0 \end{cases}, \quad x \in \mathbb{Z}    (1)

where x is the position. The focal blur kernel is modeled by a discrete Gaussian function:

g(n, \sigma) = C(\sigma) \exp\left(-\frac{n^2}{2\sigma^2}\right), \quad n \in \mathbb{Z}    (2)

where σ is the unknown blur radius to be estimated and C(σ) is the normalization factor. The normalization implies:

\sum_{n \in \mathbb{Z}} g(n, \sigma) = C(\sigma) \sum_{n \in \mathbb{Z}} \exp\left(-\frac{n^2}{2\sigma^2}\right) = 1    (3)

C(σ) admits no closed-form expression, but the approximation \frac{1}{\sqrt{2\pi}\,\sigma} can be considered acceptable when σ > 0.5. The blurred edge b(x) is then:

b(x) = \sum_{n \in \mathbb{Z}} f(x - n)\, g(n, \sigma) = \begin{cases} \frac{A}{2}\left(1 + \sum_{n=-x}^{x} g(n, \sigma)\right) + B, & x \geq 0 \\ \frac{A}{2}\left(1 - \sum_{n=x+1}^{-x-1} g(n, \sigma)\right) + B, & x < 0 \end{cases}, \quad x \in \mathbb{Z}    (4)
Fig. 1. The step edge f (x), the blurred edge b(x) and its two re-blurred versions ba (x), bb (x)
As the convolution of two Gaussian functions with blur radii σ_1, σ_2 is

g(n, \sigma_1) * g(n, \sigma_2) = g(n, \sqrt{\sigma_1^2 + \sigma_2^2})    (5)
re-blurring the blurred edge using Gaussian blur kernels with blur radii σ_a and σ_b (σ_b > σ_a) results in two re-blurred versions b_a(x) and b_b(x):

b_a(x) = \begin{cases} \frac{A}{2}\left(1 + \sum_{n=-x}^{x} g(n, \sqrt{\sigma^2 + \sigma_a^2})\right) + B, & x \geq 0 \\ \frac{A}{2}\left(1 - \sum_{n=x+1}^{-x-1} g(n, \sqrt{\sigma^2 + \sigma_a^2})\right) + B, & x < 0 \end{cases}, \quad x \in \mathbb{Z}    (6)

b_b(x) = \begin{cases} \frac{A}{2}\left(1 + \sum_{n=-x}^{x} g(n, \sqrt{\sigma^2 + \sigma_b^2})\right) + B, & x \geq 0 \\ \frac{A}{2}\left(1 - \sum_{n=x+1}^{-x-1} g(n, \sqrt{\sigma^2 + \sigma_b^2})\right) + B, & x < 0 \end{cases}, \quad x \in \mathbb{Z}    (7)
To make the blur estimation independent of the amplitude and offset of the edge, we calculate, for every position x, the ratio r(x) of the differences between the original blurred edge and the two re-blurred versions:

r(x) = \frac{b(x) - b_a(x)}{b_a(x) - b_b(x)}
Fig. 2. Difference ratio r(x) along the edge, plotted against position x

= \begin{cases} \dfrac{\sum_{n=-x}^{x} \left(g(n, \sqrt{\sigma^2 + \sigma_a^2}) - g(n, \sigma)\right)}{\sum_{n=-x}^{x} \left(g(n, \sqrt{\sigma^2 + \sigma_b^2}) - g(n, \sqrt{\sigma^2 + \sigma_a^2})\right)}, & x \geq 0 \\ \dfrac{\sum_{n=x+1}^{-x-1} \left(g(n, \sqrt{\sigma^2 + \sigma_a^2}) - g(n, \sigma)\right)}{\sum_{n=x+1}^{-x-1} \left(g(n, \sqrt{\sigma^2 + \sigma_b^2}) - g(n, \sqrt{\sigma^2 + \sigma_a^2})\right)}, & x < 0 \end{cases}    (8)
The difference ratio peaks at the edge positions x = −1 and x = 0, as shown in Fig. 2. So we obtain:

r(x)_{max} = r(-1) = r(0) = \frac{\dfrac{1}{\sigma} - \dfrac{1}{\sqrt{\sigma^2 + \sigma_a^2}}}{\dfrac{1}{\sqrt{\sigma^2 + \sigma_a^2}} - \dfrac{1}{\sqrt{\sigma^2 + \sigma_b^2}}}    (9)
When σ_a, σ_b ≫ σ, we can use the approximations

\sqrt{\sigma^2 + \sigma_a^2} \approx \sigma_a, \quad \sqrt{\sigma^2 + \sigma_b^2} \approx \sigma_b

which simplify Equation 9 to:

r(x)_{max} \approx \frac{\dfrac{1}{\sigma} - \dfrac{1}{\sigma_a}}{\dfrac{1}{\sigma_a} - \dfrac{1}{\sigma_b}} = \frac{\left(\dfrac{\sigma_a}{\sigma} - 1\right) \cdot \sigma_b}{\sigma_b - \sigma_a}    (10)
or

\sigma \approx \frac{\sigma_a \cdot \sigma_b}{(\sigma_b - \sigma_a) \cdot r(x)_{max} + \sigma_b}    (11)
Equations 9-11 show that the blur radius σ can be calculated from the difference ratio maximum r(x)_max and the re-blur radii σ_a, σ_b, independently of the edge amplitude A and offset B. The identification of the local maximum of the difference ratio, r(x)_max, not only estimates the blur radius but also locates the edge position, which implies that the blur estimation does not require a separate edge detection. This helps to keep the complexity low. For the blur estimation in images, i.e. two-dimensional (2D) signals, we use a 2D isotropic Gaussian blur kernel for the re-blurring. As any cross-section of an isotropic Gaussian function is a 1D Gaussian function, the proposed blur estimation remains applicable. Using 2D Gaussian kernels for the estimation avoids detecting the angle of the edge or gradient, as required in Elder's and Kim's methods. For simplicity, we implement the algorithm in a block-based manner to obtain a blur map of a natural image. A block size of 8 × 8 pixels has been used and we assign the blur radius to all pixels within the block. As shown in the block diagram in Fig. 3, the difference ratios are calculated pixel-wise using the original image and its two re-blurred versions. Then, in every block, the maximum of the difference ratio is used to determine the blur radius of the block. To make the blur map smoother, we apply a post-processing using a minimum filter and a bicubic upscaling filter to obtain the final blur map on the pixel grid.
++ -+
Difference ratio pixelwise
Difference ratio blockwise
Postprocessing
Blur map
+
Fig. 3. The block diagram of the proposed algorithm
3
Adaptive Image Restoration
Many image restoration techniques [6] use an iterative approach to remove the blur, because they do not need to determine the inverse of a blur operator. However, for real-time applications, iterative approaches are less suitable. Therefore, we use LMS filters as an approximation to the inverse of the blur operation. In order to simultaneously restore the fine structure and eliminate the sensor noise, we adapt the LMS filters to a binary pattern classification of local image structure information and blur radius. 3.1
Binary Pattern Classification
Fine structure and sensor noise have distinguishable luminance patterns in natural images. We propose to use adaptive dynamic range coding (ADRC)[7] to
classify local image structure. Within a local aperture in the image, the binary ADRC code of the pixels is defined as:

$$ \mathrm{ADRC}(x_i) = \begin{cases} 0, & \text{if } x_i < \dfrac{x_{\max} + x_{\min}}{2} \\ 1, & \text{otherwise} \end{cases} \qquad (12) $$

where $x_i$ is the value of a pixel in the filter aperture and $x_{\max}$, $x_{\min}$ are the maximum and minimum pixel values in the filter aperture. One can see that fine structures such as edges have regular patterns while the noise shows chaotic patterns. To combine the blur radius into the classification, we quantize the local blur radius σ obtained from the blur map into a code RB as:

$$ RB = \mathrm{round}\!\left(\frac{\sigma}{Q}\right) \qquad (13) $$
where Q is a predefined quantization step. The concatenation of the ADRC code and RB gives the final binary classification code. The diagram of the proposed adaptive restoration is shown in Fig. 4. The local image structure within a filter aperture centered at the output pixel is first classified by ADRC and by the local blur radius at the central pixel position. The LMS filter is then used to calculate the output pixel with filter coefficients obtained from the look-up table (LUT). The filter aperture slides pixel by pixel over the entire image. To avoid an impractical number of classes, we apply ADRC only to the pixels in the central 3 × 3 aperture.
Fig. 4. The block diagram of the proposed algorithm
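To make the classification step concrete, here is a small Python sketch of the ADRC code combined with the quantized blur radius. The bit packing, the default Q and the helper name adrc_class are our own choices and are not prescribed by the paper.

```python
import numpy as np

def adrc_class(aperture_3x3, sigma, Q=0.5):
    """Combine the 3x3 ADRC bit pattern with the quantized blur radius (illustrative)."""
    pixels = np.asarray(aperture_3x3, dtype=np.float64).ravel()
    threshold = (pixels.max() + pixels.min()) / 2.0
    bits = (pixels >= threshold).astype(np.int64)      # Equation 12: 0 below threshold, 1 otherwise
    adrc_code = int("".join(map(str, bits)), 2)        # pack the 9 bits into one integer
    rb = int(round(sigma / Q))                         # Equation 13
    return adrc_code, rb                               # class key = (ADRC code, RB)
```

The pair returned by this helper could then be used to index a coefficient look-up table, one filter per class.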
3.2 Training Procedure
The training procedure of the proposed method is shown in Fig. 5. To obtain the training set, we use all-in-focus images as the reference output images. Furthermore, we blur the original images with a Gaussian kernel using a range of blur radii and subsequently add Gaussian noise to simulate sensor noise at the expected level. These blurred and corrupted versions of the original images are our simulated input images. Before training, the simulated input and the reference output pairs are collected pixel by pixel from the training material and are classified using ADRC and the blur radius on the input. The pairs that belong to one specific class are used for the corresponding training, resulting in optimal filter coefficients for this class.

Fig. 5. The block diagram of the proposed algorithm

The optimal coefficients for each class are obtained by using the LMS algorithm. Suppose $X_m = [x_{1,m}, x_{2,m}, \ldots, x_{n,m}]^T$ is the input vector containing all the pixels in the filter aperture and, within a class, the total number of input vectors $X_1, X_2, \ldots, X_M$ is $M$. Let $y_m$ be the reference output and $y_m^h$ be the output value of our adaptive filter for the input vector $X_m$. So we have:

$$ y_m^h = w_1 x_{1,m} + w_2 x_{2,m} + \ldots + w_n x_{n,m} \qquad (14) $$
where $W = [w_1, w_2, \ldots, w_n]^T$ are the filter coefficients. The summed square error then is:

$$ e^2 = \sum_{m=1}^{M} \big(y_m - y_m^h\big)^2 \qquad (15) $$

Inserting Equation 14 into Equation 15, the summed square error becomes

$$ e^2 = \sum_{m=1}^{M} \big[y_m - (w_1 x_{1,m} + w_2 x_{2,m} + \ldots + w_n x_{n,m})\big]^2 \qquad (16) $$
To get the minimal value of $e^2$, we let the first derivatives of $e^2$ with respect to $w_1, w_2, \ldots, w_n$ equal zero:

$$ \frac{\partial e^2}{\partial w_1} = \sum_{m=1}^{M} 2x_{1,m}\big[y_m - (w_1 x_{1,m} + \ldots + w_n x_{n,m})\big] = 0 $$
$$ \vdots \qquad (17) $$
$$ \frac{\partial e^2}{\partial w_n} = \sum_{m=1}^{M} 2x_{n,m}\big[y_m - (w_1 x_{1,m} + \ldots + w_n x_{n,m})\big] = 0 $$

Let

$$ X = \begin{bmatrix} \sum_{m=1}^{M} x_{1,m}x_{1,m} & \ldots & \sum_{m=1}^{M} x_{1,m}x_{n,m} \\ \sum_{m=1}^{M} x_{2,m}x_{1,m} & \ldots & \sum_{m=1}^{M} x_{2,m}x_{n,m} \\ \vdots & \ddots & \vdots \\ \sum_{m=1}^{M} x_{n,m}x_{1,m} & \ldots & \sum_{m=1}^{M} x_{n,m}x_{n,m} \end{bmatrix}, \qquad Y = \left[\sum_{m=1}^{M} x_{1,m}y_m, \; \cdots, \; \sum_{m=1}^{M} x_{n,m}y_m\right]^T \qquad (18) $$

Equation 17 can be transformed into:

$$ X \cdot W = Y \qquad (19) $$
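For illustration, the per-class solve of these normal equations can be written in a few lines of Python; the function name and the use of numpy.linalg are our own choices, and a pseudo-inverse is used instead of a plain inverse to guard against ill-conditioned classes.

```python
import numpy as np

def train_class_coefficients(inputs, targets):
    """Solve X.W = Y (Equations 18-19) for one class by least squares (illustrative)."""
    A = np.asarray(inputs, dtype=np.float64)   # shape (M, n): one filter aperture per row
    y = np.asarray(targets, dtype=np.float64)  # shape (M,): reference pixel values
    X = A.T @ A                                # sum of the correlation matrices
    Y = A.T @ y                                # correlation with the reference output
    return np.linalg.pinv(X) @ Y               # W = X^{-1}.Y, robust to a singular X
```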
Please note that X is the sum of the correlation matrices of the vectors $X_1, X_2, \ldots, X_M$. Then the coefficients W can be solved by matrix inversion:

$$ W = X^{-1} \cdot Y \qquad (20) $$

4 Experimental Results
To demonstrate the performance of our proposed method, we used a natural image taken by a consumer digital camera as shown in Fig. 6, which is not included in the training images. The image shows three objects that are differently blurred. The restored image is shown in Fig. 7. The focus has been brought back to those differently blurred objects by the proposed adaptive restoration. Fig. 8 shows the blur map estimated by the proposed method. In the blur map the lighter areas indicate a larger blur radius, while the darker areas indicate a smaller blur radius. One can see that the different blur levels can be clearly discriminated. In the blur estimation, the blur radii of the re-blurring kernels are σ_a = 1 pixel and σ_b = 3 pixels. In order to show the effectiveness of our proposed adaptive image restoration method, we compare it with LMS filters that depend on the local blur radius only and are trained with and without added noise, respectively. Fig. 10 shows image fragments from the test image processed by LMS filters with different settings. The LMS filters trained without added noise can reduce the blur, but they also amplify the noise. The proposed method not only suppresses the noise but also reduces the blur better, due to its adaptivity to the image structure, compared with LMS filters that depend on the blur radius only.
Fig. 6. The test image with differently blurred objects taken by a digital camera
Fig. 7. The restored all-in-focus image
Fig. 8. The blur map obtained by our proposed method: the lighter areas indicate a larger blur radius, while the darker areas indicate a smaller blur radius
Fig. 9. The final blur map used for adaptive image restoration
(A) (B) (C)
Fig. 10. Image fragments from the output images using LMS filters with different settings: (A) LMS filters which depend on blur radius only and trained without added noise (B) LMS filters which depend on blur radius only and trained with added noise (C) the proposed LMS filters which depend on both blur radius and image structure and trained with added noise
5 Conclusion
We have presented in this paper a novel blur identification and restoration algorithm for a multi-focusing system. The proposed algorithm is based on a robust local blur estimation that uses the difference ratio of an image and its two re-blurred versions. Adaptive least mean square filters, which depend on the local image structure and the blur radius, are applied to remove the spatially variant blur and reduce the sensor noise. The proposed method shows promising results for blur identification and restoration given its complexity. Since it does not involve iterations, neither in the blur estimation nor in the image restoration, it is suitable for real-time applications.
References 1. Lagendijk, R.L., Biemond, J., Boekee, D.E.: Identification and restoration of noisy blurred image using the expectation-maximization algorithm. IEEE Trans. Acoustic, Speech and Signal Processing 38, 1180–1191 (1990) 2. Elder, J.H., Zucker, S.W.: Local Scale Control for Edge Detection and Blur Estimation. IEEE Trans. Pattern Analysis and Machine Intelligence 20, 699–716 (1998) 3. Kim, S.K., Park, S.R., Paik, J.K.: Simultaneous out-of-focus blur estimation and restoration for digital auto-focusing system. IEEE Trans. Consumer Electronics 34, 1071–1075 (1998) 4. Hu, H., de Haan, G.: Low cost robust blur estimator. In: Proceedings of IEEE Int. Conf. on Image Processing, Atlanta (GA), October 8-11, 2006, pp. 617–620. IEEE Computer Society Press, Los Alamitos (2006)
5. Hu, H., de Haan, G.: Simultaneous Coding Artifact Reduction and Sharpness Enhancement. In: Proceedings of IEEE Int. Conf. on Consumer Electronics, Las Vegas, pp. 213–214. IEEE Computer Society Press, Los Alamitos (2007) 6. Katsaggelos, A.K.: Iterative image restoration algorithms. Optical Engineering 287, 735–748 (1989) 7. Kondo, T., Fujimori, Y., Ghosal, S., Carrig, J.J.: Method and apparatus for adaptive filter tap selection according to a class, US-Patent: US 6,192,161 B1, February 20, 2001 (2001)
Image Upscaling Using Global Multimodal Priors

Hiêp Luong, Bart Goossens, and Wilfried Philips

Ghent University - TELIN - IPI - IBBT, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
Abstract. This paper introduces a Bayesian restoration method for lowresolution images combined with a geometry-driven smoothness prior and a new global multimodal prior. The multimodal prior is proposed for images that normally just have a few dominant colours. In spite of this, most images contain much more colours due to noise and edge pixels that are part of two or more connected smooth regions. The Maximum A Posteriori estimator is worked out to solve the problem. Experimental results confirm the effectiveness of the proposed global multimodal prior for images with a strong multimodal colour distribution such as cartoons. We also show the visual superiority of our reconstruction scheme to other traditional interpolation and reconstruction methods: noise and compression artifacts are removed very well and our method produces less blur and other annoying artifacts.
1 Introduction
Due to the huge amount of data, images and video sequences are compressed before transmission or storage. Image quality will typically be lost in the form of blocking artifacts (intrinsic to the block structure used in jpeg and mpeg algorithms) and mosquito noise (random noise originating from the quantization of high-frequency information). With the growing popularity of High Definition Television (hdtv), these artifacts become more bothersome, especially in cartoon movies, since jpeg and mpeg compression schemes are not designed for this type of images. That is why digital high-resolution (hr) image reconstruction is becoming very important nowadays. Many image interpolation methods have already been proposed in the literature, but all suffer from one or more artifacts. Linear or non-adaptive interpolation methods deal with jagging, blurring and/or ringing effects. Well-known and popular linear interpolation methods are nearest neighbour, bilinear and interpolation with higher order (piecewise) polynomials, b-splines, truncated or windowed sinc functions, etc. [9,14] Non-linear or adaptive interpolation methods incorporate a priori knowledge about images. Some general methods focus on reconstructing edges [10,16] and other methods tackle unwanted interpolation artifacts such as jagged edges, blurring and ringing using isophote smoothing [15], level curve mapping [11] or
mathematical morphology [8]. Some other adaptive techniques exploit the selfsimilarity property, e.g. iterated function systems [7] or the repetitive behaviour in an image [12]. Another class of adaptive image enlargement methods takes advantage of the training-based priors [2], which for example maps low-resolution (lr) blocks into predefined high-resolution blocks [6]. General interpolation methods need clean input images, which are often not available due to noise and/or compression artifacts. The solution is a combination of denoising and interpolation. Most existing methods essentially perform the enhancement and resolution expansion as two separate steps. Constrained partial differential equations (pde’s) take interpolation information into account while denoising the image [20]. Training-based and non-local interpolation methods can also handle a certain amount of noise, but the result depends heavily on the used training set or image content [6,12]. The proposed method performs the image enhancement and interpolation simultaneously. Bimodal priors have been successfully applied in low-resolution text enhancement [4,18]. Typically the intensities of the text pixels tend to cluster around black and the intensities of the background pixels tend to cluster around white. Taking bimodality into account improves the contrast and thus the readability. We extend this concept for general image enhancement. Another use of multimodality is introduced in image retrieval [13]: local windows with one, two or three colours (respectively unimodal, bimodal and trimodal neighbourhoods) describe the features being matched in the image database. In what follows in this paper, we propose a reconstruction framework for degraded low-resolution images using global multimodal priors. Section 2 formulates the problem of image acquisition. In Section 3, we describe the Bayesian reconstruction framework. In Section 4, we focus on the image priors including the tensor-driven smoothness prior and the proposed multimodal priors. Section 5 gives an iterative scheme to solve the problem. Section 6 presents experimental results using our technique and compares with other interpolation/reconstruction methods. Finally, Section 7 concludes this paper.
2 Problem Formulation
The image acquisition process consists of converting a continuous scene into a (discrete) digital image. However, in practice, the acquired image lacks resolution and is corrupted by noise and blur. These linear degradation operations are summarized in figure 1. The recovery of the unknown high-resolution image x from a known low-resolution image y is related by

$$ y = DHx + n. \qquad (1) $$
In this equation, the matrices D and H represent the decimation operator and the blur operator respectively and n describes the additive noise, which is assumed to be zero-mean Gaussian distributed (with a standard deviation σn ).
Fig. 1. Observation model of the image acquisition
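As a quick illustration of this observation model, the following sketch generates a low-resolution observation from a high-resolution image. The Gaussian psf, the decimation by plain subsampling, the default parameter values and the function name are assumptions made for the example only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(x_hr, r=4, sigma_blur=1.0, sigma_n=2.0, seed=0):
    """y = D H x + n: blur, decimate by factor r, add Gaussian noise (illustrative)."""
    rng = np.random.default_rng(seed)
    blurred = gaussian_filter(x_hr.astype(np.float64), sigma_blur)  # H x
    decimated = blurred[::r, ::r]                                   # D (H x)
    noise = rng.normal(0.0, sigma_n, size=decimated.shape)          # n ~ N(0, sigma_n^2)
    return decimated + noise
```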
We also assume that the blur operator H in the imaging model is denoted by a space-invariant point spread function (psf) (typically Gaussian blur, which is characterized by its standard deviation σ_blur). If we rearrange y and x into vectors with length N and r²N respectively, with r the magnification factor in both horizontal and vertical direction, then the probability of observing y given x in the presence of uncorrelated Gaussian noise becomes

$$ p(y|x) = \frac{1}{(\sqrt{2\pi}\,\sigma_n)^N}\exp\left(-\frac{\|y - DHx\|_2^2}{2\sigma_n^2}\right). \qquad (2) $$

The $L_2$-norm $\|y - DHx\|_2^2$ is defined as $(y-DHx)^T(y-DHx)$. The multiplication with the matrices D and H (which have dimensions of $N \times r^2N$ and $r^2N \times r^2N$ respectively) has the same effect as image downsampling with blurring as anti-alias filter. Because of the high dimensionalities (and the sparse representations), these matrix multiplications are replaced with their actual image operators in our implementation. The maximum likelihood (ml) estimator suggests to choose $\hat{x}_{ML}$ that maximizes the likelihood function or minimizes the negative log-likelihood function [21]:

$$ \hat{x}_{ML} = \arg\max_x p(y|x) = \arg\min_x \|y - DHx\|_2^2. \qquad (3) $$
For a given low-resolution image y, an infinite set of high-resolution images can be generated from the observed data. Due to this instability, the inverse problem is ill-posed. To solve for a high-resolution image x that is optimal in some sense, prior knowledge about the image is introduced in a Bayesian framework.
3 Bayesian Reconstruction Framework
Via the Bayes rule, the probabilities p(y|x) in the likelihood function are replaced with the posterior probabilities p(x|y):

$$ p(x|y) = \frac{p(y|x)\cdot p(x)}{p(y)}. \qquad (4) $$

The Maximum A Posteriori (map) estimator suggests to choose $\hat{x}_{MAP}$ that maximizes the posterior probability [21]:

$$ \hat{x}_{MAP} = \arg\max_x p(x|y) = \arg\min_x \big(-\log p(y|x) - \log p(x)\big). \qquad (5) $$
Note that the denominator, p(y), does not affect the maximization. In the next section we will derive two prior probability density functions (pdf) on the hr image, namely the anisotropic smoothness prior pS (x) and the global multimodal prior pM (x).
4 Image Priors
A general way to describe the prior pdf p(x) is the Gibbs distribution, which has the following exponential form [2]:

$$ p(x) = c_G \cdot \exp\{-\alpha f(x)\}, \qquad (6) $$

where $c_G$ is a normalizing constant, guaranteeing that the integral over all x is 1, and the term f(x) is a non-negative energy function, which is low if the solution is close to the postulate and high otherwise. The contribution of the prior in the minimization then becomes very simple, namely αf(x).

4.1 Geometry-Driven Smoothness Prior
We assume that images are locally smooth except for edges. Therefore, the use of so-called edge-stopping functions is very popular, because they suppress the noise better while retaining important edge information [17]. When dealing with image upscaling problems, we face an additional potential trap: preserving the edge discontinuities usually leads to the amplification of jagging artifacts. We can avoid this trap by taking the local geometry of the image into account, i.e. we perform smoothing along the edge directions and simultaneously avoid smoothing orthogonally to these edges. The local geometry can be represented in the convenient form of a 2 × 2 symmetric and semi-positive matrix, named the diffusion tensor T [20]. The constructed diffusion tensor $T = \lambda_\eta \eta\eta^T + \lambda_\xi \xi\xi^T$ has two orthonormal eigenvectors $\eta = \nabla I_\sigma / \|\nabla I_\sigma\|$ and $\xi = \eta^\perp = \nabla I_\sigma^\perp / \|\nabla I_\sigma\|$ with corresponding eigenvalues $\lambda_\eta$ and $\lambda_\xi$ respectively, where $\nabla I_\sigma$ denotes the smoothed gradient $\nabla I * G_\sigma$ (where $G_\sigma$ is a 2D Gaussian kernel with variance σ). The direction ξ corresponds to the edge direction, when there is one, while η is the vector perpendicular to the edge (also called the normal vector). The proposed positive values $\lambda_\eta = \max(\|\nabla I_\sigma\|^2, 1)$ and $\lambda_\xi = 1$ are related to the local strength of the edge. Given the diffusion tensor T, we can construct the 2D oriented Gaussian kernel:

$$ G_{T,t}(\mathbf{x}) = \frac{1}{4\pi t}\exp\left(-\frac{\mathbf{x}^T T^{-1}\mathbf{x}}{4t}\right), \qquad (7) $$

where t is related to the diffusion strength. Depending on the local diffusion tensor T, the shape and size of the Gaussian kernels vary at different locations depending on the edge information and thus on the eigenvalues $1/\lambda_\eta$ and $1/\lambda_\xi$ of $T^{-1}$. Convolution with these space-varying kernels may be seen as the juxtaposition of two oriented 1D heat flows [20]. When the pixel is located in a smooth region, the edge strength $\|\nabla I_\sigma\|$ will be small and $1/\lambda_\eta \approx 1/\lambda_\xi = 1$, which yields an isotropic Gaussian kernel for smoothing. When the pixel is located at a sharp boundary between two regions, $\|\nabla I_\sigma\|$ will be large and $1/\lambda_\eta \approx 0 \ll 1/\lambda_\xi = 1$, which yields a highly anisotropic Gaussian kernel oriented along the edge direction. Based on these oriented kernels we define the Gibbs geometry-driven smoothness pdf prior as

$$ p_S\big(x(\mathbf{x})\big) = c_{G,S}\cdot\exp\left(-\sum_{\mathbf{x}'\in\aleph(\mathbf{x})} G_{T(\mathbf{x}),t}(\mathbf{x}'-\mathbf{x})\,\frac{\rho\big(x(\mathbf{x}) - x(\mathbf{x}')\big)}{\sigma_s^2}\right), \qquad (8) $$
where $\aleph(\mathbf{x})$ denotes the local neighbourhood of $\mathbf{x}$ (typically a p × p window centered around $\mathbf{x}$) and the function ρ(.) represents the quadratic term $\|\cdot\|_2^2$ in our implementation. Minimizing ρ(x) leads to a very simple closed expression, namely ψ(x) = 2x. Note that equation 8 can be seen as a generalization of the popular bilateral diffusion (having an isotropic spatial weighting kernel) [19] and of edge-preserving diffusion terms (ρ is replaced by an outlier-robust function) [17].

4.2 Global Multimodal Prior
The number of different colours in a hr image neighbourhood is very small in general, if we do not take noise and edge pixels into account. This also holds for images with just a few dominant colours like cartoons, drawings or logos. Depending on the number of modes of the probability distribution of colour values, the images are characterized as unimodal (one dominant colour), bimodal (two modes), or in general multimodal [13]. For multimodal images with m colour modes, we use a Gibbs prior with a non-negative 2m-order polynomial which is based on the bimodal prior presented in [4]:

$$ p_{M;\mu_1,\ldots,\mu_m}\big(x(\mathbf{x})\big) = c_{G,M}\cdot\exp\left(-\frac{1}{2}\prod_{i=1}^{m}\frac{\|x(\mathbf{x}) - \mu_i\|_2^2}{\sigma_m^2}\right), \qquad (9) $$
Fig. 2. One-dimensional plot of the bimodal priors pM2 ;μ1 ,μ2 (x) and pM2 ;μ,μ (x) and the unimodal prior pM1 ;μ (x)
where $\mu_i$ is the mean of the colour distribution around the i-th mode and $\sigma_m$ is the standard deviation of the colour distribution. The one-dimensional unimodal and bimodal pdf are illustrated in figure 2. In our implementation each pixel is represented by its rgb colour vector. The gradient of the 2m-order polynomial of the numerator in equation 9 leads to the following expression:

$$ \phi_{M;\mu_1,\ldots,\mu_m}(x) = \sum_{i=1}^{m} 2(x - \mu_i)\prod_{j=1;\,j\neq i}^{m}\|x - \mu_j\|_2^2. \qquad (10) $$
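A direct transcription of this gradient for rgb pixels might look as follows; vectorising over all pixels at once and the function name are our own choices, not part of the paper.

```python
import numpy as np

def multimodal_gradient(pixels, modes):
    """phi(x) = sum_i 2(x - mu_i) * prod_{j != i} ||x - mu_j||^2 (Equation 10), per pixel."""
    x = np.asarray(pixels, dtype=np.float64)          # shape (P, 3): rgb pixels
    mu = np.asarray(modes, dtype=np.float64)          # shape (m, 3): colour modes
    diff = x[:, None, :] - mu[None, :, :]             # (P, m, 3)
    sq = np.sum(diff ** 2, axis=2)                    # ||x - mu_j||^2, shape (P, m)
    phi = np.zeros_like(x)
    for i in range(mu.shape[0]):
        prod_others = np.prod(np.delete(sq, i, axis=1), axis=1)  # product over j != i
        phi += 2.0 * diff[:, i, :] * prod_others[:, None]
    return phi
```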
We illustrate the importance of the number of modes on the basis of the 1D bimodal prior. When two modes $\mu_i$ and $\mu_{j\neq i}$ are equal (or close) to each other, the bimodal pdf has a much lower peak than the unimodal pdf, as plotted in figure 2. This means that the kurtosis, i.e. the degree of peakedness, is lower than 3 (the Gaussian kurtosis), which is known as platykurtosis:
$$ \int_{-\infty}^{+\infty}\frac{(x-\mu)^4}{\sigma_m^4}\,\frac{\sqrt[4]{2}\,\Gamma(\tfrac{3}{4})}{\pi\sigma_m}\exp\left(-\frac{(x-\mu)^4}{2\sigma_m^4}\right)dx = \frac{1}{2}, $$
$$ \int_{-\infty}^{+\infty}\frac{(x-\mu)^4}{\sigma_m^4}\,\frac{1}{\sqrt{2\pi}\,\sigma_m}\exp\left(-\frac{(x-\mu)^2}{2\sigma_m^2}\right)dx = 3, \qquad (11) $$

where $\Gamma(\alpha) = \int_0^\infty t^{\alpha-1}e^{-t}\,dt$. In case of the bimodal prior, the convergence to the peak is hampered due to the larger dispersion of the pdf. That is why it is important to cluster neighbouring modes correctly. If we a priori know how many colours an image contains, we can find the parameters $\mu_1, \ldots, \mu_m$ easily by locating the peaks of the multimodal distribution.
Via the expectation maximization (em) algorithm, we can obtain the means μi of the mixture of Gaussian distributions [3]. A typical application with a priori knowledge is restorating a scanned document with only two colours, namely the foreground distribution (e.g. black) and the background distribution (e.g. white). Nevertheless, we commonly do not know how many dominant colours there are in an arbitrary image. We will calculate the parameters μi robustly in a threesteps algorithm. In the first stage, we preselect some candidate colour modes μc . This is done by counting the number of neighbouring colours for each pixel μ on the lr grid. Two colours μ and y are neighbours if μ − y < τ (i.e. μ is located inside the hypersphere of radius τ centred at y in rgb colour space). We now select n colours with the most number of neighbours discarding the neighbouring colours lying in the same hypersphere with each selection. In the second stage, we track each selected candidate colour mode μc to its closest peak in the distribution. The preselection of the trackers prevents duplicate computations. To establish the location of the modes of the colour distribution, the mean-shift algorithm is applied in the rgb colour space. Starting from each candidate colour mode μc , the mean-shift procedure iteratively finds a path along the gradient direction away from the valleys and towards the nearest peak which is equivalent to a gradient ascent to the local mode of the distribution [1]. The positions of the modes are iteratively updated as follows:
$$ \mu_c^{(k+1)} = \frac{\displaystyle\sum_{\mathbf{x}\in\Omega(y)} y(\mathbf{x})\, g\!\left(\left\|\frac{y(\mathbf{x}) - \mu_c^{(k)}}{h}\right\|^2\right)}{\displaystyle\sum_{\mathbf{x}\in\Omega(y)} g\!\left(\left\|\frac{y(\mathbf{x}) - \mu_c^{(k)}}{h}\right\|^2\right)}, \qquad (12) $$
where Ω(y) contains all the pixels of the lr image y, h represents the window bandwidth and g is the profile that defines the kernel [1]. Using the multivariate Gaussian kernel, the profile g becomes

$$ g(x) = \exp\left(-\frac{1}{2}x\right) \quad (x \ge 0). \qquad (13) $$

In the last stage, we replace duplicate colour modes $\mu_c$ by their mean (if their mutual distance is lower than τ), which boosts the convergence toward the local peak in the multimodal distribution as opposed to the platykurtic case.
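The mode-tracking stage can be sketched as a plain mean-shift iteration in rgb space; the stopping tolerance, the iteration cap and the function name below are our own assumptions.

```python
import numpy as np

def mean_shift_mode(pixels, mu_c, h=10.0, tol=0.5, max_iter=100):
    """Track one candidate colour mode to its local peak (Equations 12-13, illustrative)."""
    y = np.asarray(pixels, dtype=np.float64).reshape(-1, 3)  # all lr pixels, rgb
    mu = np.asarray(mu_c, dtype=np.float64)
    for _ in range(max_iter):
        d2 = np.sum((y - mu) ** 2, axis=1) / (h * h)   # ||(y - mu)/h||^2
        g = np.exp(-0.5 * d2)                          # Gaussian profile, Equation 13
        mu_new = (y * g[:, None]).sum(axis=0) / g.sum()
        if np.linalg.norm(mu_new - mu) < tol:          # converged to the local peak
            return mu_new
        mu = mu_new
    return mu
```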
5 Optimization
On the one hand, we solve the minimization problem of the map estimator given in equation 5 by substituting the previously defined priors and using steepest descent. On the other hand, we jointly optimize the parameters μi of the multimodal priors on the hr grid. The latter can be solved by computing the steepest ascent in the mean-shift algorithm as discussed in the previous Section. More
precisely, we iteratively perform alternating optimizations (ao) over the image x and the parameters $\mu_i$ using the following closed-form expressions:

$$ \hat{x}_{n+1}(\mathbf{x}) = \hat{x}_n(\mathbf{x}) - \beta\left(\frac{H^T D^T\big([DH\hat{x}_n](\mathbf{x}_{LR}) - y(\mathbf{x}_{LR})\big)}{\sigma_n^2} + \sum_{\mathbf{x}'\in\aleph(\mathbf{x})}\frac{G_{T(\mathbf{x}),t}(\mathbf{x}'-\mathbf{x})\,\psi\big(\hat{x}_n(\mathbf{x})-\hat{x}_n(\mathbf{x}')\big)}{2\sigma_s^2} + \frac{\phi_{M;\mu_1^n,\ldots,\mu_m^n}\big(\hat{x}_n(\mathbf{x})\big)}{2\sigma_m^{2m}}\right), \qquad (14) $$

and

$$ \mu_i^{n+1} = \frac{\displaystyle\sum_{\mathbf{x}\in\Omega(\hat{x}_{n+1})}\hat{x}_{n+1}(\mathbf{x})\,g\!\left(\left\|\frac{\hat{x}_{n+1}(\mathbf{x})-\mu_i^{n}}{h}\right\|^2\right)}{\displaystyle\sum_{\mathbf{x}\in\Omega(\hat{x}_{n+1})}g\!\left(\left\|\frac{\hat{x}_{n+1}(\mathbf{x})-\mu_i^{n}}{h}\right\|^2\right)}, \qquad (15) $$
where β denotes the scalar step size in the direction of the gradient. The prior terms are also called the regularization terms, where $\sigma_n^{-1}$, $\sigma_s^{-1}$ and $\sigma_m^{-1}$ become the weights or regularization parameters. The iterative procedure is initialized with the pixel-replicated version $\hat{x}_0$ and the parameters $\mu_c$ retrieved from the
Fig. 3. Iterative reconstruction example of a jpeg-compressed patch (3 colour modes): (a) $\hat{x}_0$, (b) $\hat{x}_1$, (c) $\hat{x}_3$, (d) $\hat{x}_6$, (e) $\hat{x}_{10}$, (f) $\hat{x}_{50}$
three-steps algorithm described in Section 4.2. An example of this iterative image reconstruction process is shown in figure 3.
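To give a feel for the descent step in Equation 14, the following heavily simplified sketch runs only the steepest-descent skeleton on a grayscale image, with an isotropic quadratic smoothness term standing in for the tensor-driven kernel and with the multimodal term omitted. The back-projection by zero-insertion plus Gaussian filtering, the parameter values and the function name are all our own assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def map_upscale(y, r=4, sigma_blur=1.0, beta=0.125, steps=100, w_data=1.0, w_smooth=1.0):
    """Simplified steepest-descent MAP reconstruction sketch (grayscale, isotropic prior)."""
    y = y.astype(np.float64)
    x = np.kron(y, np.ones((r, r)))                     # pixel-replicated initialisation x_0
    for _ in range(steps):
        sim = gaussian_filter(x, sigma_blur)[::r, ::r]  # D H x on the lr grid
        resid = sim - y                                 # D H x - y
        up = np.zeros_like(x)
        up[::r, ::r] = resid                            # D^T: zero-insertion upsampling
        grad_data = gaussian_filter(up, sigma_blur)     # H^T D^T (D H x - y)
        grad_smooth = -laplace(x)                       # gradient of the quadratic smoothness energy
        x -= beta * (w_data * grad_data + w_smooth * grad_smooth)
    return x
```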
6 Experimental Results
Our method is found to be perfectly suitable for images with a strong colour modality like cartoons, logos, maps, etc. As an experiment we have enlarged a jpeg-compressed image with a lot of mosquito noise, as illustrated in figure 4. The parameters for our method are β = 0.125, τ = 30√3, $\sigma_n^{-2} = 1/\max(H)$, $\sigma_s^{-2} = 1$, $\sigma_m^{-2m} = (1/255^2)^{m-1}$, h = 10 and 100 iterations for the restoration process. The regularization parameters are chosen according to the rough order of magnitude of the regularization terms. We compare our method with the popular cubic b-spline interpolation (with and without mean-shift postfiltering [1]), bilateral total variation (btv) regularization [5] (the iterative scheme is initialized with cubic b-spline interpolation because btv tends to retain the jagged edges) and the curvature-preserving pde's [20], using a linear magnification factor of 4. In figure 5 we have enlarged an image which is corrupted by colour quantization and error diffusion artifacts. We additionally compare our proposed method with the same reconstruction scheme where the anisotropic smoothing kernel is
Fig. 4. Enlargement results: (a) nearest neighbour, (b) cubic b-spline, (c) (b) + mean-shift filtering, (d) btv regularization, (e) curvature preserving pde's, (f) proposed method
Fig. 5. Enlargement results: (a) nearest neighbour, (b) cubic b-spline, (c) isotropic reconstruction with multimodal priors, (d) anisotropic reconstruction without multimodal priors, (e) curvature preserving pde's [20], (f) proposed method
replaced by the isotropic version and where the use of the multimodal priors is switched off ($\sigma_m^{-2m} = 0$), respectively. We can clearly see that our proposed method outperforms the other methods in visual quality. Noise and compression artifacts are heavily reduced compared to traditional interpolation techniques. Jagged edges are removed very well, while isotropic regularization tends to preserve some jaggedness. Staircasing artifacts (i.e. piecewise constant regions) also do not occur in our method, as opposed to, for example, the curvature-preserving pde method. Our method is also visibly much sharper than the other methods. Figure 5 also illustrates the contribution of both image priors to the end result.
7 Conclusion
In this paper we have presented a reconstruction scheme of low-resolution images from a Bayesian point of view, which mainly consists of the combination of the proposed multimodal prior and a geometry-driven smoothness prior. The multimodal prior is introduced for images that normally just have a few dominant colours. Results show the effectiveness and the visual superiority of our reconstruction scheme to other interpolation/reconstruction schemes for images with a strong colour modality such as cartoons: noise and compression artifacts (like mosquito noise) are removed and our method contains less blur and other annoying artifacts (e.g. jagged edges, staircasing effects, etc.). Future work consists in extending this method to real colour images by considering adaptive multimodal priors in local neighbourhoods and to handle gradual transitions like smooth edges.
References 1. Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 603–619 (2002) 2. Datsenko, D., Elad, M.: Example-Based Single Image Super-Resolution: A Global MAP Approach with Outlier Rejection. The Journal of Multidimensional Systems and Signal Processing (to appear) 3. Dempster, A.P., Lairde, N.M., Rubin, D.B.: Maximum Likelihood From Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39, 1–38 (1977) 4. Donaldson, K., Myers, G.: Bayesian Super-Resolution of Text in Video With a Text-Specific Bimodal Prior. International Journal on Document Analysis and Recognition 7, 159–167 (2005) 5. Farsiu, S., Robinson, M.D., Elad, M., Milanfar, P.: Fast and Robust Multiframe Super Resolution. IEEE Trans. on Image Processing 13, 1327–1344 (2004) 6. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-Based Super-Resolution. IEEE Computer Graphics and Applications 22, 56–65 (2002) 7. Honda, H., Haseyama, M., Kitajima, H.: Fractal Interpolation For Natural Images. In: Proc. of IEEE International Conference of Image Processing, vol. 3, pp. 657– 661. IEEE, Los Alamitos (1999)
8. Ledda, A., Luong, H.Q., Philips, W., De Witte, V., Kerre, E.E.: Image Interpolation Using Mathematical Morphology. In: Proc. of 2nd IEEE International Conference On Document Image Analysis For Libraries (to appear) (2006) 9. Lehmann, T., G¨ onner, C., Spitzer, K.: Survey: Interpolations Methods In Medical Image Processing. IEEE Trans. on Medical Imaging 18, 1049–1075 (1999) 10. Li, X., Orchard, M.T.: New Edge-Directed Interpolation. IEEE Trans. on Image Processing 10, 1521–1527 (2001) 11. Luong, H.Q., De Smet, P., Philips, W.: Image Interpolation Using Constrained Adaptive Contrast Enhancement Techniques. In: Proc. of IEEE International Conference of Image Processing, vol. 2, pp. 998–1001. IEEE, Los Alamitos (2005) 12. Luong, H.Q., Ledda, A., Philips, W.: An Image Interpolation Scheme for Repetitive Structures. In: Proc. of International Conference on Image Analysis and Recognition. Lecture Notes on Computer Science, pp. 104–115 (2006) 13. Matas, J., Koubaroulis, D., Kittler, J.: Colour Image Retrieval and Object Recognition Using the Multimodal Neighbourhood Signature. In: Proc. of the 6th European Conference in Computer Vision, vol. 1, pp. 48–64 (2000) 14. Meijering, E.H.W., Niessen, W.J., Viergever, M.A.: Quantitative Evaluation Of Convolution-Based Methods For Medical Image Interpolation. Medical Image Analysis 5, 111–126 (2001) 15. Morse, B.S., Schwartzwald, D.: Isophote-Based Interpolation. In: Proc. of IEEE International Conference on Image Processing, pp. 227–231. IEEE Computer Society Press, Los Alamitos (1998) 16. Muresan, D.: Fast Edge Directed Polynomial Interpolation. In: Proc. of IEEE International Conference of Image Processing, vol. 2, pp. 990–993. IEEE, Los Alamitos (2005) 17. Piˇzurica, A., Vanhamel, I., Sahli, H., Philips, W., Katartzis, A.: A Bayesian Approach To Nonlinear Diffusion Based On A Laplacian Prior For Ideal Image Gradient. In: Proc. of IEEE Workshop On Statistical Signal Processing, IEEE Computer Society Press, Los Alamitos (2005) 18. Thouin, P., Chang, C.: A Method For Restoration of Low-Resolution Document Images. International Journal on Document Analysis and Recognition 2, 200–210 (2000) 19. Tomasi, C., Manduchi, R.: Bilateral Filtering for Gray and Color Images. In: Proc. of IEEE International Conference on Computer Vision, pp. 839–846. IEEE Computer Society Press, Los Alamitos (1998) 20. Tschumperl´e, D.: Fast Anisotropic Smoothing of Multi-Valued Images using Curvature-Preserving PDE’s. International Journal of Computer Vision 1, 65–82 (2006) 21. Van Trees, H.L.: Detection, Estimation, and Modulation Theory: Part I. John Wiley and Sons, New York (1968)
A Type-2 Fuzzy Logic Filter for Detail-Preserving Restoration of Digital Images Corrupted by Impulse Noise

M. Tülin Yildirim¹ and M. Emin Yüksel²

¹ Department of Aircraft Electrical and Electronics, Civil Aviation School, Erciyes University, Kayseri, 38039, Turkey
² Digital Signal and Image Processing Laboratory, Department of Electrical and Electronics Engineering, Erciyes University, Kayseri, 38039, Turkey
Abstract. A novel filtering operator based on type-2 fuzzy logic is proposed for detail preserving restoration of images corrupted by impulse noise. The performance of the proposed operator is evaluated for different test images corrupted at various noise densities and also compared with representative impulse noise removal operators from the literature. Results of the filtering experiments show that the presented operator offers superior performance over the competing operators by efficiently suppressing the noise in the image while at the same time effectively preserving the useful information in the image.
1 Introduction
Digital images are often degraded by impulse noise during image acquisition and/or transmission due to a number of non-idealities encountered in image sensors and communication channels. In most image processing applications, it is of vital importance to remove the impulse noise from the image since it severely degrades the performance of subsequent image processing tasks such as edge detection, image segmentation, object recognition, etc. There are a large number of methods proposed to remove impulse noise from digital images and most of these methods are based on order statistics filters exploiting the rank order information of the pixels contained in a given filtering window. The standard median filter [1] attempts to remove impulse noise by replacing the center pixel of the filtering window with the median of the pixels within the window. This method yields a reasonable noise removal performance at the cost of removing thin lines and blurring image details even at low noise densities. In order to avoid the inherent drawbacks of the standard median filter, the weighted median filter and the center-weighted median filter [2,3] have been proposed. These filters are modified median filters giving more weight to certain pixels in the filtering window. They demonstrate better performance in preserving image details than the median filter at the expense of reduced noise removal performance.
Some methods [4]–[20] are based on a combination of the median filter with an impulse detector. The impulse detector aims to determine whether the center pixel of a given filtering window is noisy or not. If the center pixel is found to be noisy, it is restored by the median filter. Otherwise, it is left unchanged. Although this method significantly improves the performance of the median filter, its performance inherently depends on the impulse detector. As a result, several different impulse detection approaches exploiting median filters [4]–[6], center-weighted median filters [7]–[10], boolean filters [11], edge detection kernels [12], homogeneity level information [13], statistical tests [14], classifier based methods [15], rule based methods [16], pixel counting methods [17] and soft computing methods [18]–[20] have been proposed. Various types of mean filters have been recruited for impulse noise removal from digital images [21]–[23]. These filters usually offer good filtering performance at the cost of increased computational complexity. Filters based on nonlinear methodologies have also been used for impulse noise removal [24]–[34]. These filters are usually more complex than the above mentioned median- and the mean-based filters, but they usually offer much better noise suppression and detail preservation performance. All of the methods discussed so far more or less have the undesirable property of blurring image details and texture during filtering. This is mainly due to the uncertainty introduced by noise. It becomes more and more difficult for the filter to correctly distinguish between the corrupted and the uncorrupted pixels in the noisy input image as the density of the noise corrupting the image increases. Hence, some of the uncorrupted pixels are unnecessarily filtered causing undesirable distortions and blurring effects in the output image while some of the corrupted pixels are left unfiltered leaving a considerable number of noisy pixels in the restored output image. In the past few years, there has been a growing interest in the applications of type-2 fuzzy logic systems. Unlike conventional (type-1) fuzzy logic systems where membership functions are scalar, the membership functions in type-2 FLSs are also fuzzy and this extra degree of fuzziness provides a more efficient way of handling uncertainty, which is inevitably encountered in noisy environments. Hence, type-2 fuzzy logic systems may be utilized to design efficient filtering operators exhibiting much better performance in noisy environments provided that appropriate network structures and processing strategies are employed [35]. Motivated by these observations, this paper proposes a novel filtering operator based on type-2 fuzzy logic techniques for detail preserving restoration of impulse noise corrupted images. The performance of the proposed operator is evaluated for various noise densities and for different test images, and also compared with representative impulse noise removal operators from the literature. Results show that the proposed operator yields superior performance over the competing operators and is capable of efficiently suppressing the noise in the image while at the same time effectively preserving the useful information in the image such as thin lines, edges, fine details, and texture.
2 Method

2.1 The Proposed Neuro-fuzzy Operator
Figure-1 shows the structure of the proposed impulse noise removal operator. The operator is constructed by combining four type-2 NF filters, four defuzzifiers and a postprocessor. The operator inputs the noisy pixels in its filtering window and outputs the restored value of the center pixel. The filtering window is shown in Figure-2. The NF filters used in the structure of the operator function as subfilters processing the horizontal, vertical, diagonal and the reverse diagonal pixel neighborhoods in the filtering window, respectively. Each of the four NF filters inputs the center pixel and two of its appropriate neighboring pixels and then outputs a type-1 interval fuzzy set representing the uncertainty interval for the restored value of the center pixel. The four output fuzzy sets coming from the four NF filters are then applied to the corresponding defuzzifier blocks. Each defuzzifier transforms the input fuzzy set into a single scalar value by performing defuzzification. The four scalar values obtained at the outputs of the four defuzzifiers represent four candidates for the restored value of the center pixel of the filtering window. These four candidate values are finally converted into a single scalar value by the postprocessor. The output of the postprocessor, which is also the output of the proposed filtering operator, represents the restored value of the center pixel of the filtering window.
Fig. 1. The structure of the proposed type-2 NF noise removal operator. Each of the four type-2 NF filters evaluates pixel neighborhoods in the horizontal, vertical, diagonal and the reverse diagonal directions, respectively.
x(r-1, c-1)   x(r-1, c)   x(r-1, c+1)
x(r, c-1)     x(r, c)     x(r, c+1)
x(r+1, c-1)   x(r+1, c)   x(r+1, c+1)
Fig. 2. Filtering window of the proposed type-2 NF noise removal operator
2.2 The Type-2 Neuro-fuzzy Filters
The four type-2 NF filters used in the structure of the proposed impulse noise removal operator are identical to each other. They are first order TSK type-2 interval fuzzy inference systems with 3 inputs and 1 output. The input-output relationship of any of the four NF filters may be formulated as follows: Let X1, X2, X3 denote the inputs of the NF filter and Y denote its output. Each combination of inputs and their associated membership functions is represented by a rule in the rule base of the NF filter. The rulebase contains a desired number of fuzzy rules, which are as follows:

1. if (X1 ∈ M11) and (X2 ∈ M12) and (X3 ∈ M13), then R1 = k11 X1 + k12 X2 + k13 X3 + k14
2. if (X1 ∈ M21) and (X2 ∈ M22) and (X3 ∈ M23), then R2 = k21 X1 + k22 X2 + k23 X3 + k24
3. if (X1 ∈ M31) and (X2 ∈ M32) and (X3 ∈ M33), then R3 = k31 X1 + k32 X2 + k33 X3 + k34
...
i. if (X1 ∈ Mi1) and (X2 ∈ Mi2) and (X3 ∈ Mi3), then Ri = ki1 X1 + ki2 X2 + ki3 X3 + ki4
...
N. if (X1 ∈ MN1) and (X2 ∈ MN2) and (X3 ∈ MN3), then RN = kN1 X1 + kN2 X2 + kN3 X3 + kN4
where N is the number of fuzzy rules in the rulebase, $M_{ij}$ denotes the ith membership function of the jth input and $R_i$ denotes the output of the ith rule. The input membership functions are type-2 interval generalized bell type membership functions with uncertain mean:

$$ M_{ij}(u) = \frac{1}{1 + \left|\dfrac{u - c_{ij}}{a_{ij}}\right|^{2b_{ij}}}, \qquad c_{ij} \in [\underline{c}_{ij}, \overline{c}_{ij}] \qquad (1) $$
with i = 1, 2, · · · , N and j = 1, 2, 3. Here, the parameters aij , bij and cij correspond to the width, slope and center of the type-2 interval generalized bell type membership function Mij , respectively, and the interval [cij , cij ] denote the lower and the upper bounds of the uncertainty in the center. A sample type-2 interval generalized bell type membership function and its associated footprint of uncertainty (FOU) are illustrated in Figure-3. Since the membership functions Mij are interval membership functions, the boundaries of their FOU are characterized by their upper and lower membership functions, which are defined as
Fig. 3. FOU for Generalized Bell type primary membership function with uncertain center
$$ \overline{M}_{ij}(u) = \begin{cases} \dfrac{1}{1 + \left|\frac{u - \underline{c}_{ij}}{a_{ij}}\right|^{2b_{ij}}}, & u < \underline{c}_{ij} \\ 1, & \underline{c}_{ij} \le u \le \overline{c}_{ij} \\ \dfrac{1}{1 + \left|\frac{u - \overline{c}_{ij}}{a_{ij}}\right|^{2b_{ij}}}, & u > \overline{c}_{ij} \end{cases} \qquad (2) $$

and

$$ \underline{M}_{ij}(u) = \begin{cases} \dfrac{1}{1 + \left|\frac{u - \underline{c}_{ij}}{a_{ij}}\right|^{2b_{ij}}}, & u > \dfrac{\underline{c}_{ij} + \overline{c}_{ij}}{2} \\ \dfrac{1}{1 + \left|\frac{u - \overline{c}_{ij}}{a_{ij}}\right|^{2b_{ij}}}, & u \le \dfrac{\underline{c}_{ij} + \overline{c}_{ij}}{2} \end{cases} \qquad (3) $$
where M ij and M ij are the upper and the lower membership functions of the type-2 interval membership function Mij . It should be observed that the parameters ci1 , ci1 , ai1 , bi1 , ci2 , ci2 , ai2 , bi2 , ci3 , ci3 , ai3 , bi3 characterize the membership functions in the antecedent of the ith rule. Similarly, the parameters ki1 , ki2 , ki3 , ki4 determine the consequent of the ith rule. Therefore, there are 16 parameters in total determining the output of the ith rule. Since the total number of rules in the rulebase is N , then the total number of parameters in the rulebase is 16N . The optimal values of these parameters are tuned by training. The output of the NF filter is the weighted average of the individual rule outputs:
$$ Y = \frac{\sum_{i=1}^{N} w_i R_i}{\sum_{i=1}^{N} w_i} \qquad (4) $$
The weighting factor $w_i$ of the ith rule is calculated by evaluating the membership expressions in the antecedent of the rule. This is accomplished by first converting the input values to fuzzy membership values by utilizing the input membership functions $M_{ij}$ and then applying the "and" operator to these membership values. The "and" operator corresponds to the multiplication of the input membership values:

$$ w_i = M_{i1}(X_1)\cdot M_{i2}(X_2)\cdot M_{i3}(X_3) \qquad (5) $$
Since the membership functions $M_{ij}$ in the antecedent of the ith rule are type-2 interval membership functions, the weighting factor $w_i$ is a type-1 interval set, i.e. $w_i = [\underline{w}_i, \overline{w}_i]$, whose lower and upper boundaries are determined by using the lower and the upper membership functions defined before:

$$ \underline{w}_i = \underline{M}_{i1}(X_1)\cdot \underline{M}_{i2}(X_2)\cdot \underline{M}_{i3}(X_3), \qquad \overline{w}_i = \overline{M}_{i1}(X_1)\cdot \overline{M}_{i2}(X_2)\cdot \overline{M}_{i3}(X_3) \qquad (6) $$
where $\underline{w}_i$ and $\overline{w}_i$ (i = 1, 2, ..., N) are the lower and the upper boundaries of the interval weighting factor $w_i$ of the ith rule. Once the weighting factors are obtained, the output Y of the type-2 NF filter can be found by calculating the weighted average of the individual rule outputs using (4). The output Y is also a type-1 interval set, i.e. $Y = [\underline{Y}, \overline{Y}]$, since the $w_i$'s in the above equation are type-1 interval sets and the $R_i$'s are scalars. The upper and the lower boundaries of Y are determined by using the iterative procedure proposed by Karnik and Mendel [36]. The information presented in this subsection relates to the input-output relationship of a first order TSK type-2 interval fuzzy logic system with 3 inputs and 1 output. Readers interested in details of TSK type-2 fuzzy logic systems as well as other type-2 fuzzy logic systems are referred to an excellent book on this subject [35].

2.3 The Defuzzifier
The defuzzifier block inputs the type-1 interval fuzzy set obtained at the output of the corresponding NF filter, performs centroid defuzzification, and outputs the obtained scalar value. Since the input set is a type-1 interval fuzzy set, i.e. $Y = [\underline{Y}, \overline{Y}]$, its centroid is equal to the center of the interval:

$$ D = \frac{\underline{Y} + \overline{Y}}{2} \qquad (7) $$
2.4 The Postprocessor
The postprocessor produces the final output of the proposed NF impulse noise removal operator. It processes the four scalar values obtained at the outputs of the four defuzzifiers and generates a single scalar output. The operation of the postprocessor may be explained as follows: Let D1, D2, D3, D4 denote the outputs of the four defuzzifiers. First, the postprocessor sorts these values such that $D'_1 \le D'_2 \le D'_3 \le D'_4$, where $D'_1, D'_2, D'_3, D'_4$ represent the output values of the defuzzifiers after sorting. Then, the lowest ($D'_1$) and the highest ($D'_4$) of the four values are discarded. Finally, the remaining two are averaged to obtain the postprocessor output, which is also the output of the proposed operator:

$$ y = \frac{D'_2 + D'_3}{2} \qquad (8) $$

2.5 Filtering of the Noisy Input Image
The overall filtering procedure for the restoration of the noisy input image may be summarized as follows:

1. A filtering window with a size of 3-by-3 pixels moves over the image. The window starts from the upper-left corner of the image and moves pixel by pixel sideways and progressively downwards in a raster scanning fashion.
2. At each window position, the selected pixels from the filtering window corresponding to the horizontal, vertical, diagonal and the reverse diagonal neighborhoods of the center pixel are applied to the corresponding NF filters in the structure. Each NF filter individually processes the three pixels fed to its input and then produces an output, which is a type-1 interval fuzzy set representing the uncertainty interval for the restored value of the center pixel of the filtering window.
3. The type-1 interval fuzzy sets at the outputs of the type-2 NF filters are fed to their corresponding defuzzifiers. Each defuzzifier performs centroid defuzzification of the input type-1 interval fuzzy set and outputs a scalar value. The scalar values obtained at the outputs of the four defuzzifiers represent four candidates for the restored value of the center pixel of the filtering window.
4. The outputs of the four defuzzifiers are then fed to the postprocessor. The postprocessor sorts these four candidates, discards the lowest and the highest values, and then outputs the average of the remaining two values. The value obtained at the output of the postprocessor represents the restored value for the center pixel of the filtering window. It is also the output of the proposed operator.
5. This procedure is repeated for all pixels of the noisy input image.
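As a concrete illustration of step 4, the postprocessing rule of Equation 8 can be written as a tiny helper; the function name is our own choice.

```python
def postprocess(d1, d2, d3, d4):
    """Combine the four defuzzified candidates (Equation 8): drop min and max, average the rest."""
    s = sorted((d1, d2, d3, d4))
    return (s[1] + s[2]) / 2.0
```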
3 Results and Conclusion
The proposed impulse noise removal operator discussed in the previous section is implemented. The performance of the operator is evaluated for various noise
Fig. 4. Test images a) Baboon b) Boats c) Bridge d) Pentagon
conditions and test images. The test images are chosen from the literature. These are the Baboon, Boats, Bridge and Pentagon images shown in Figure-4. All test images are 8-bit gray level images. The noisy images used in the experiments are obtained by degrading the original images by impulse noise with an appropriate noise density. The density of the noise is determined depending on the experiment. The corrupted experimental images are restored by using the proposed type-2 NF impulse noise removal operator as well as several conventional and state-of-the-art impulse noise filters including the switching median filter (SMF) [4], signal-dependent rank-ordered mean filter (SDROMF) [21], fuzzy filter (FF) [24], progressive switching median filter (PSMF) [5], multistate median filter (MSMF) [9], edge detecting median filter (EDMF) [12], adaptive fuzzy switching filter (AFSF) [31] and the alpha-trimmed mean-based filter (ATMBF) [23]. The performances of all operators are measured by utilizing the mean squared error (MSE) criterion, which is defined as

$$ \mathrm{MSE} = \frac{1}{RC}\sum_{r=1}^{R}\sum_{c=1}^{C}\big(s[r,c] - y[r,c]\big)^2 \qquad (9) $$
Here, s[r, c] and y[r, c] denote the original and the restored versions of a degraded test image, respectively. MSE values calculated for the output images of all operators for the Baboon, Boats, Bridge and Pentagon images corrupted by 25%, 50% and 75% impulse noise are presented in Table-1. The average MSE values are presented in Table-2. It is seen that the proposed operator exhibits the best performance regarding the MSE criteria.
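The MSE of Equation 9 is straightforward to compute; the helper below is a direct transcription, with only the function name and the use of numpy being our own choices.

```python
import numpy as np

def mse(original, restored):
    """Mean squared error over an R x C image (Equation 9)."""
    diff = original.astype(np.float64) - restored.astype(np.float64)
    return float(np.mean(diff ** 2))
```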
Table 1. MSE Values for Baboon, Boats, Bridge and Pentagon images corrupted by 25%, 50% and 75% impulse noise
Operator     Baboon               Boats                Bridge               Pentagon
             25%   50%    75%     25%   50%    75%     25%   50%    75%     25%   50%    75%
SMF          681   2625   8511    344   2209   8563    333   2347   8816    325   2116   7927
SDROMF       587   1076   4081    249   789    4420    248   766    4488    228   575    3277
FF           464   1012   3172    214   688    3134    215   726    3428    167   511    2510
PSMF         536   880    3003    275   548    2612    202   461    2740    189   416    2206
MSMF         847   3852   10269   525   3566   10482   534   3735   10780   542   3408   9736
EDMF         460   1265   5188    270   967    5010    238   929    5108    223   867    4512
AFSF         476   734    1849    187   466    1825    207   536    1862    212   450    1501
ATMBF        706   2647   8522    343   2206   8559    337   2347   8815    337   2133   7938
PROPOSED     204   473    983     107   355    1210    177   530    1444    107   300    690
Table 2. Average MSE Values for Baboon, Boats, Bridge and Pentagon images corrupted by 25%, 50% and 75% impulse noise
Operator     Average of four images          Total Average
             25%    50%    75%
SMF          421    2324   8454              3733
SDROMF       328    802    4067              1732
FF           265    734    3061              1353
PSMF         301    576    2640              1172
MSMF         612    3640   10317             4856
EDMF         298    1007   4955              2086
AFSF         271    547    1759              859
ATMBF        431    2333   8459              3741
PROPOSED     149    414    1082              548
For a visual evaluation of the noise removal and detail preservation performances of the operators, Figure-5 shows the output images of all operators for the Baboon image corrupted by impulse noise of 25% noise density. It is observed from this figure that the operators efficiently suppressing the noise (such as SDROMF, FF, PSMF, AFSF) fail to preserve image details. The output images of these operators suffer from considerable amount of blurring and distortion. On the other hand, the operators that are more successful at preserving image details (such as SMF, MSMF, EDMF, ATBMF) fail to suppress noise efficiently. It is observed that considerable amount of noisy pixels is still present in the output images of these filters. The proposed type-2 NF noise removal operator, however, offers much better performance than the others. It is clearly observed from the output image of the proposed operator that it is very successful at suppressing the noise and preserving the useful image details. The difference, especially in the detail preservation performance, can easily be seen by carefully comparing the appearance of the eyes and the hair around the mouth of the animal in the output images of all operators.
Fig. 5. Comparison of the output images of the operators for the Baboon image corrupted by impulse noise with 25% noise density: a) SMF, b) SDROMF, c) FF, d) PSMF, e) MSMF, f) EDMF, g) AFSF, h) ATMBF, i) Proposed
Based on these observations, it is concluded that the proposed operator can be used as a powerful image filter for efficient removal of impulse noise from digital images without distorting the useful information within the image.
Acknowledgment This work is supported by Erciyes University Scientific Research Projects Unit (Project No: FBT-07-12).
References 1. Gabbouj, M., Coyle, E.J., Gallager, N.C.: An overview of median and stack filtering. Circuit Syst. and Signal Processing 11(1), 7–45 (1992) 2. Ko, S.J., Lee, Y.H.: Center weighted median filters and their applications to image enhancement. IEEE Trans. on Circuits and Systems 38(9), 984–993 (1991)
3. Yin, L., Yang, R., Gabbouj, M., Neuvo, Y.: Weighted median filters: A tutorial. IEEE Trans. on Circuits and Systems II 43, 157–192 (1996) 4. Sun, T., Neuvo, Y.: Detail-preserving median based filters in image processing. Pattern Recognition Letters 15, 341–347 (1994) 5. Wang, Z., Zhang, D.: Progressive switching median filter for the removal of impulse noise from highly corrupted images. IEEE Trans. on Circuit and Systems 46(1), 78–80 (1999) 6. Crnojevic, V., Senk, V., Trpovski, Z.: Advanced impulse detection based on pixelwise MAD. IEEE Signal Processing Letters 11(7), 589–592 (2004) 7. Chen, T., Ma, K.K., Chen, L.H.: Tri-state median filter for image denoising. IEEE Trans. on Image Processing 8(12), 1834–1838 (1999) 8. Chen, T., Wu, H.R.: Adaptive impulse detection using center-weighted median filters. IEEE Signal Proc. Letters 8(1), 1–3 (2001) 9. Chen, T., Wu, H.R.: Space variant median filters for the restoration of impulse noise corrupted images. IEEE Trans. on Circuits and Systems-II 48(8), 784–789 (2001) 10. Chan, R.H., Hu, C., Nikolova, M.: An iterative procedure for removing randomvalued impulse noise. IEEE Signal Proc. Letters 11(12), 921–924 (2004) 11. Aizenberg, I., Butakoff, C., Paliy, D.: Impulsive noise removal using threshold boolean filtering based on the impulse detecting functions. IEEE Signal Proc. Letters 12(1), 63–66 (2005) 12. Zhang, S., Karim, M.A.: A new impulse detector for switching median filters. IEEE Signal Proc. Letters 9(11), 360–363 (2002) 13. Pok, G., Liu, Y., Nair, A.S.: Selective removal of impulse noise based on homogeneity level information. IEEE Trans. on Image Processing 12(1), 85–92 (2003) 14. Be¸sdok, E., Y¨ uksel, M.E.: Impulsive noise rejection from images with Jarque-Berra test based median filter. Int. J. Electron. Commun. 59(2), 105–109 (2005) 15. Chang, J.Y., Chen, J.L.: Classifier-augmented median filters for image restoration. IEEE Trans. Instrumentation and Measurement 53(2), 351–356 (2004) 16. Yuan, S.Q., Tan, Y.H.: Impulse noise removal by a global–local noise detector and adaptive median filter. Signal Processing 86(8), 2123–2128 (2006) 17. Smolka, B., Chydzinski, A.: Fast detection and impulsive noise removal in color images. Real-Time Imaging 11(4), 389–402 (2005) 18. Eng, H.-L., Ma, K.-K.: Noise adaptive soft-switching median filter. IEEE Trans. on Image Processing 10(2), 242–251 (2001) 19. Y¨ uksel, M.E., Be¸sdok, E.: A simple neuro-fuzzy impulse detector for efficient blur reduction of impulse noise removal operators for digital images. IEEE Trans. on Fuzzy Systems 12(6), 854–865 (2004) 20. Schulte, S., Nachtegael, M., De Witte, V., Van der Weken, D., Kerre, E.E.: A fuzzy impulse noise detection and reduction method. IEEE Trans. on Image Processing 15(5), 1153–1162 (2006) 21. Abreu, E., Lightstone, M., Mitra, S.K., Arakawa, K.: A new efficient approach for the removal of impulse noise from highly corrupted images. IEEE Trans. on Image Processing 5(6), 1012–1025 (1996) 22. Han, W.Y., Lin, J.C.: Minimum-maximum exclusive mean (MMEM) filter to remove impulse noise from highly corrupted images. Electronics Letters 33(2), 124– 125 (1997) 23. Luo, W.: An efficient detail-preserving approach for removing impulse noise in images. IEEE Signal Proc. Letters 13(7), 413–416 (2006) 24. Russo, F., Ramponi, G.: A fuzzy filter for images corrupted by impulse noise. IEEE Signal Proc. Letters 3(6), 168–170 (1996)
496
M.T. Yildirim and M.E. Y¨ uksel
25. Choi, Y.S., Krishnapuram, R.: A robust approach to image enhancement based on fuzzy logic. IEEE Trans. on Image Processing 6(6), 808–825 (1997) 26. Russo, F.: FIRE operators for image processing. Fuzzy Sets and Systems 103(2), 265–275 (1999) 27. Van De Ville, D., Nachtegael, M., Van der Weken, D., Kerre, E.E., Philips, W., Lemahieu, I.: Noise reduction by fuzzy image filtering. IEEE Trans. on Fuzzy Systems 11(4), 429–436 (2003) 28. Y¨ uksel, M.E., Ba¸st¨ urk, A.: Efficient removal of impulse noise from highly corrupted digital images by a simple neuro-fuzzy operator. Int. J. Electron. Commun. 57(3), 214–219 (2003) 29. Windyga, P.S.: Fast impulsive noise removal. IEEE Trans. on Image Proc. 10, 173 (2001) 30. Smolka, B., Plataniotis, K.N., Chydzinski, A., Szczepanski, M., Venetsanopulos, A.N., Wojciechowski, K.: Self-adaptive algorithm of impulsive noise reduction in color images. Pattern Recognition 35, 1771–1784 (2002) 31. Xu, H., Zhu, G., Peng, H., Wang, D.: Adaptive fuzzy switching filter for images corrupted by impulse noise. Pattern Recognition Letters 25, 1657–1663 (2004) 32. Alajlan, N., Kamela, M., Jernigan, E.: Detail preserving impulsive noise removal. Signal Processing: Image Communication 19, 993–1003 (2004) 33. Y¨ uksel, M.E., Ba¸st¨ urk, A., Be¸sdok, E.: Detail-preserving restoration of impulse noise corrupted images by a switching median filter guided by a simple neurofuzzy network. EURASIP Journal of Applied Signal Processing 2004(16), 2451– 2461 (2004) 34. Y¨ uksel, M.E.: A hybrid neuro-fuzzy filter for edge preserving restoration of images corrupted by impulse noise. IEEE Trans. on Image Processing 15(4), 928–936 (2006) 35. Mendel, J.: Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions. Prentice-Hall, NJ (2001) 36. Karnik, N.N., Mendel, J.M.: Centroid of a type-2 fuzzy set. Information Sciences 132, 195–220 (2001)
Contrast Enhancement of Images Using Partitioned Iterated Function Systems

Theodore Economopoulos, Pantelis Asvestas, and George Matsopoulos

Institute of Communication and Computer Systems, School of Electrical and Computer Engineering, National Technical University of Athens
[email protected], [email protected], [email protected]
http://www.ece.ntua.gr/index.html
Abstract. A new algorithm for the contrast enhancement of images, based on the theory of Partitioned Iterated Function System (PIFS), is presented. A PIFS consists of contractive transformations, such that the original image is the fixed point of the union of these transformations. Each transformation involves the contractive affine spatial transform of a square block, as well as the linear transform of the gray levels of its pixels. The PIFS is used in order to create a lowpass version of the original image. The contrast-enhanced image is obtained by adding the difference of the original image with its lowpass version, to the original image itself. Quantitative and qualitative results stress the superior performance of the proposed contrast enhancement algorithm against two other widely used contrast enhancement methods.
1 Introduction

Contrast enhancement is essential in cases where substandard quality images are acquired. In many research fields, such as remote sensing, medical image analysis etc., the acquisition of digital images with sufficient contrast and detail is a strong requirement [1]. The ultimate aim of image enhancement is to improve the interpretability and perception of information in images for human viewers. In general, image enhancement techniques are divided into two broad categories: Spatial domain methods and frequency domain methods [2]. Spatial domain methods operate directly on the pixels of the input image, while frequency domain methods operate on the Fourier transform of the image. The most popular methods for contrast enhancement include adaptive histogram adjustment [3], adaptive unsharp masking [4], nonlinear unsharp masking [5], adaptive nonlinear filters [6] etc. This paper introduces a novel method for contrast enhancement, based on the theory of the Partitioned Iterated Function System (PIFS) [7]. The aim of a PIFS is to find parts of an image that are similar to other, properly transformed (scaled-down, flipped, rotated, sheared, etc.) parts [8]. The PIFS model has been extensively used in image compression, due to the scalability it provides. There are numerous variations of the method in this field [8], [9], which are usually addressed as fractal image
compression. However, there are very few attempts to utilize the theory of PIFS for image enhancement, which are tightly bound to image content [10], unlike the proposed algorithm. The objective of the proposed algorithm is to provide strong contrast enhancement, by increasing the mean contrast measurement of the enhanced image, without affecting the information stored in the original image. The performance of the proposed algorithm has been compared against two other methods for image enhancement: the Linear and Cubic Unsharp Masking techniques. Qualitative and quantitative comparative results show advantageous performance of the proposed, PIFS-based, contrast enhancement algorithm.
2 Contrast Enhancement

2.1 PIFS-Based Modeling

The fundamental idea of a Partitioned Iterated Function System (PIFS) is to represent a gray-level image using a series of contractive transforms. Each transform has a spatial component and an intensity component and affects a region (usually a square block) of the image support. The parameters of each transform are obtained by means of a minimization process. Practically, the procedure described above may be realized as follows. The image support is partitioned into blocks (usually squares), called range blocks. A second, coarser partition with larger (usually of double size) blocks, called domain blocks, is also imposed. The collection of the domain blocks is called the domain pool. Next, a class of contractive block transformations is defined. Each transformation shuffles the positions of the pixels in a domain block and alters the gray-levels of the pixels in the block. An image block transformation is contractive if and only if it brings every pair of pixels in a block both closer spatially and in value. For each range block, a domain block and a transformation are found such that the transformed domain block best matches the range block, under the least squares sense. Several variations of this algorithm have been proposed, which reduce the large search space of the transformed domain blocks using block classification techniques [11] or sophisticated nearest-neighbor techniques [12]. The procedure for decoding an image is iterative and begins with an initial, arbitrarily chosen image. The next image in the sequence is partitioned into range blocks and the previous one into domain blocks. Then, the gray level values of each range block are calculated by properly transforming the gray level values of the pixels in the corresponding domain block.

2.2 Contrast Enhancement Algorithm

Let I(x, y) denote the original gray-level image with size N_x × N_y pixels and support S = [0, N_x) × [0, N_y). Consider a partition of the image support into non-overlapping range blocks R_{i,j} = [x_i, x_i + w_x) × [y_j, y_j + w_y), with size w_x × w_y pixels, where x_i = i·w_x (i = 0, 1, …, N_x/w_x − 1), y_j = j·w_y (j = 0, 1, …, N_y/w_y − 1) and S = ∪_{i,j} R_{i,j}. Let

r_{i,j} = ( I(x_i, y_j), I(x_i + 1, y_j), …, I(x_i, y_j + 1), …, I(x_i + w_x − 1, y_j + w_y − 1) )

be the vector of the pixel values for the range block R_{i,j} in a row-wise ordering. Another partition of the image support into possibly overlapping blocks (domain blocks) D_{k,l} = [u_k, u_k + 2w_x) × [v_l, v_l + 2w_y) with size 2w_x × 2w_y pixels is also imposed, where u_k = k·h_x (k = 0, 1, …), v_l = l·h_y (l = 0, 1, …) and h_x, h_y are the horizontal and vertical distances between neighboring domain blocks, respectively. Then, each domain block is down-sampled by a factor of two by averaging the pixel values of each distinct 2×2 sub-block. The corresponding vector of pixel values for the sub-sampled domain block is

d_{k,l} = ( I_d(u_k, v_l), I_d(u_k + 2, v_l), …, I_d(u_k, v_l + 2), …, I_d(u_k + 2w_x − 1, v_l + 2w_y − 1) )

where I_d(x, y) = (1/4) [ I(x, y) + I(x + 1, y) + I(x, y + 1) + I(x + 1, y + 1) ]. Subsequently, for each range block, the down-sampled domain block that minimizes the squared Euclidean distance

E(k, l; i, j) = ‖ γ_{k,l} ( d_{k,l} − μ_{D_{k,l}} ) − ( r_{i,j} − μ_{R_{i,j}} ) ‖²    (1)

is found, where μ_{R_{i,j}} and μ_{D_{k,l}} are the mean pixel values for the range and the sub-sampled domain block, respectively. Using a predefined constant value for the contrast parameter γ_{k,l} (γ_{k,l} = γ), the function to be minimized may be expressed as:

E(k, l; i, j) = ‖ γ ( d_{k,l} − μ_{D_{k,l}} ) − ( r_{i,j} − μ_{R_{i,j}} ) ‖²    (2)

The minimization of this function involves a quite large search space for selecting the proper domain block out of the domain pool. As mentioned, this search space can be significantly reduced by employing nearest neighbor techniques, such as the k-dimensional tree (kd-tree) nearest neighbor search technique. A kd-tree is a space-partitioning data structure for organizing points in a k-dimensional space. The sub-sampled domain blocks are arranged properly as the tree's nodes (leafs). The best match for a range block is then allocated by searching the tree in a depth-first fashion, using the nearest neighbor algorithm [13]. After obtaining the values of the parameters for each block transformation, a global contractive transformation, W, can be defined by the following equation:

W(I)(x, y) = Σ_{i,j} [ γ ( I( 2(x − x_i) + u_{k(i,j)}, 2(y − y_j) + v_{l(i,j)} ) − μ_{D_{k(i,j),l(i,j)}} ) + μ_{R_{i,j}} ] M_{i,j}(x, y)    (3)

where M_{i,j}(x, y) = 1 if (x, y) ∈ R_{i,j} and 0 if (x, y) ∉ R_{i,j}, and ( k(i,j), l(i,j) ) = arg min_{(k,l)} { E(k, l; i, j) }.
By definition, the absolute value of parameter γ has to be less than 1 [14]. Therefore, in order to achieve the desired contrast gain, the subsequent procedure is followed. Firstly, the image is coded using Eq. (3) with a relatively high value for the parameter γ (for example, γ = 0.8). Next, the decoded image is created by reapplying Eq. (3) with a lower value for γ (for example, γ = 0.1). The resulting image is the lowpass version, I_LP, of the original image. The enhanced image, I_enh, is finally obtained using the following equation:

I_enh(x, y) = I(x, y) + λ I_HP(x, y)    (4)

where the highpass image is given by I_HP(x, y) = I(x, y) − I_LP(x, y) and the parameter λ adjusts the contrast gain. In Fig. 1 there is an example of applying the proposed algorithm to the test image shown in Fig. 1(a). The highpass image is illustrated in Fig. 1(b) and the enhanced image for λ = 1 is shown in Fig. 1(c). For the rest of the paper, the values for γ are assumed to be 0.8 for coding and 0.1 for decoding, respectively. The effects of varying parameters γ and λ are discussed later in Section 4.

Fig. 1. Contrast enhancement using PIFS (a) Test image Lena. (b) Highpass version of Lena using the values γ=0.8 (encoding) and γ=0.1 (decoding) for the parameter γ. (c) Enhanced image with λ=1.
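To make the encode–decode procedure concrete, a minimal NumPy sketch of the pipeline is given below. It is an illustration only, not the authors' implementation: the function names (pifs_code, pifs_decode, pifs_enhance), the fixed grid step of the domain pool, the brute-force block search (the paper accelerates it with a kd-tree), the number of decoding iterations and the clipping to [0, 255] are our own simplifying assumptions; image dimensions are assumed to be multiples of the block size.

    import numpy as np

    def downsample2(block):
        # average each distinct 2x2 sub-block (factor-two down-sampling)
        h, w = block.shape
        return block.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

    def pifs_code(img, w=4, step=8, gamma=0.8):
        """For every w x w range block, store the position of the domain block
        minimizing Eq. (2) together with the range-block mean."""
        H, W = img.shape
        dom_pos = [(v, u) for v in range(0, H - 2 * w + 1, step)
                          for u in range(0, W - 2 * w + 1, step)]
        doms = np.stack([downsample2(img[v:v + 2 * w, u:u + 2 * w])
                         for v, u in dom_pos])
        doms -= doms.mean(axis=(1, 2), keepdims=True)        # d_kl - mu_D
        code = {}
        for y in range(0, H, w):
            for x in range(0, W, w):
                r = img[y:y + w, x:x + w]
                err = ((gamma * doms - (r - r.mean())) ** 2).sum(axis=(1, 2))
                code[(y, x)] = (dom_pos[int(err.argmin())], r.mean())
        return code

    def pifs_decode(code, shape, w=4, gamma=0.1, n_iter=8):
        """Iterate the global transform W of Eq. (3) from an arbitrary image."""
        out = np.full(shape, 128.0)
        for _ in range(n_iter):
            nxt = np.empty_like(out)
            for (y, x), ((v, u), mu_r) in code.items():
                d = downsample2(out[v:v + 2 * w, u:u + 2 * w])
                nxt[y:y + w, x:x + w] = gamma * (d - d.mean()) + mu_r
            out = nxt
        return out

    def pifs_enhance(img, lam=1.0, w=4):
        img = img.astype(float)
        code = pifs_code(img, w=w, gamma=0.8)                 # code with high gamma
        low = pifs_decode(code, img.shape, w=w, gamma=0.1)    # lowpass version
        return np.clip(img + lam * (img - low), 0, 255)       # Eq. (4)

A typical call would be enhanced = pifs_enhance(gray_image, lam=1.0), matching the λ = 1 setting used throughout the paper.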
Due to the block-based nature of the method, blocking artifacts may appear at the boundaries of neighboring blocks, which are more obvious as the size of the range blocks increases (see Fig. 2(a)). A solution to this problem is the use of small range blocks, for example 2×2 pixels. Furthermore, since compression is not the primary objective in contrast enhancement, the use of overlapping range blocks is feasible. In that case, the average of the gray levels of the overlapping regions is used during decoding. This improvement can be seen in Fig. 2(b) with overlapping range blocks of size 4×4 pixels. The lowpass image produced using overlapping range blocks is smoother, thus accounting for a smoother and visually improved final enhanced image.
Fig. 2. Low pass version of the test image Lena using γ=0.8 (encoding) and γ=0.1 (decoding) (a) With 4×4 pixels wide non-overlapping range blocks. (b) With 4×4 pixels wide overlapping range blocks.
3 Results

The proposed method was qualitatively and quantitatively evaluated against two commonly used contrast enhancement methods: Linear Unsharp Masking [15] and Cubic Unsharp Masking [16]. Qualitative evaluation was performed by means of visually inspecting the resulting enhanced images. In order to have a uniform measure of comparison, the three methods were applied using the following parameters: A) PIFS Algorithm: λ=1, wx=wy=4, non-overlapping blocks, B) Linear Unsharp Masking: λ=0.45, and C) Cubic Unsharp Masking: λ=2×10⁻⁴. The parameters employed for qualitative analysis were determined after several trials and ensure the best possible visual outcome from each method, in terms of minimizing unwanted artifacts. Fig. 3 depicts the enhanced images after applying all three methods: the PIFS-based, the Linear and nonlinear Unsharp Masking methods. The three techniques were evaluated on enhancing a typical test image (Fig. 3(a)). As seen in Fig. 3, the proposed method (Fig. 3(b)) is capable of achieving superior levels of contrast enhancement when compared to the other two conventional methods. The reason for this is that it is able to produce a deeper level of contrast fluctuation, thus creating a stronger visual effect when compared to Linear and Cubic Unsharp Masking (Fig. 3(c) and Fig. 3(d), respectively).

Fig. 3. Comparison of PIFS to Linear and Cubic Masking using test image Lena. (a) The original Lena test image. Enhanced Lena image using (b) PIFS with λ=1, (c) Linear Masking with λ=0.45 and (d) Cubic Masking with λ = 2×10⁻⁴.

One of the most common shortcomings in digital radiography is electronic interference. In digital imaging, interference may be emulated by adding noise to the input image. Therefore, in order to test the proposed enhancement algorithm under such conditions, it was further evaluated in the presence of noise. The distortion of an image due to noise can be quantified by means of the Peak Signal-to-Noise Ratio (PSNR), measured in dB. The PSNR is calculated as follows:

PSNR = 20 log₁₀( MAX_I / √MSE )    (5)

where MAX_I corresponds to the theoretical maximum intensity value of the image (255 for any grey-scale image) and MSE is the mean squared error between the noiseless image I(x, y) and its noisy counterpart I_noisy(x, y) of size N_x × N_y. MSE is given by:

MSE = (1 / (N_x N_y)) Σ_{y=0}^{N_y−1} Σ_{x=0}^{N_x−1} [ I(x, y) − I_noisy(x, y) ]²    (6)
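For reference, the two error measures can be computed directly; the short NumPy sketch below follows Eqs. (5)–(6) and assumes 8-bit grey-scale images (MAX_I = 255). The function names are ours.

    import numpy as np

    def mse(clean, noisy):
        # Eq. (6): mean squared error between the noiseless and noisy images
        diff = clean.astype(float) - noisy.astype(float)
        return np.mean(diff ** 2)

    def psnr(clean, noisy, max_i=255.0):
        # Eq. (5): peak signal-to-noise ratio in dB
        return 20.0 * np.log10(max_i / np.sqrt(mse(clean, noisy)))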
In general, the lower the PSNR, the more unwanted noise is amplified during contrast enhancement. In our case, Gaussian noise was added to the test image shown in Fig. 4(a) to produce the noisy image depicted in Fig. 4(b), which has a PSNR of 23.0 dB. This image was enhanced using the PIFS algorithm and the Linear and Cubic Unsharp Masking techniques. The results are shown in Fig. 4(c), Fig. 4(d) and Fig. 4(e), respectively, where it is evident that the proposed approach is capable of stronger contrast enhancement in the presence of noise than the other two methods in comparison. Despite that, excessive noise may hinder the enhancement process altogether by generating noticeable artifacts on the enhanced image. The proposed contrast enhancement approach was quantitatively evaluated by examining the mean contrast difference between the original and the enhanced image. This may also be referred to as the contrast gain of the enhanced over the original image.
Fig. 4. Example of enhancing the test image Lena in the presence of noise. (a) Original test image Lena. (b) Noisy image with Gaussian noise of σ = 0.005 (PSNR of 23.0 dB). The enhanced noisy image Lena using (c) PIFS, (d) Linear Unsharp Masking and (e) Cubic Unsharp Masking.
Let the original image of size N_x × N_y be denoted by I(x, y). Then, the contrast of the pixel in position (x, y) is expressed as [17]:

c(x, y) = lv(x, y) / lm(x, y)    (7)

where lm(x, y) and lv(x, y) are given by the following equations:

lm(x, y) = (1 / (2m + 1)²) Σ_{k=−m}^{m} Σ_{l=−m}^{m} I(x + k, y + l)    (8)

lv(x, y) = (1 / (2m + 1)²) Σ_{k=−m}^{m} Σ_{l=−m}^{m} [ I(x + k, y + l) − lm(x, y) ]²    (9)

In Eq. (8) and Eq. (9) the quantity (2m + 1)² is the size of a square window in pixels. Throughout quantitative evaluation, this quantity was constant with m = 2. The mean contrast over the entire image I(x, y) may be expressed as:

C_I = (1 / (N_x N_y)) Σ_{y=0}^{N_y−1} Σ_{x=0}^{N_x−1} c(x, y)    (10)
After calculating the mean contrast of the original image, the enhancement algorithm is applied to the image and the mean contrast is recalculated for the enhanced image according to Eq. (10). The contrast gain is determined by the difference:

C_GAIN = C_Ienh − C_I    (11)

where C_Ienh denotes the mean contrast of the enhanced image and C_I the mean contrast of the original image. Obviously, a positive C_GAIN accounts for an increase in the contrast of the enhanced image over the original one, while a negative value signifies contrast loss. Moreover, the greater the value of C_GAIN, the stronger the resulting contrast enhancement. This scheme was employed to quantitatively evaluate the proposed algorithm against the Linear Unsharp Masking and the Cubic Unsharp Masking methods. Using the aforementioned parameters for the three methods in comparison, the results obtained are illustrated in Table 1, where the contrast gain over the test image shown in Fig. 1(a) is verified for the proposed approach and the two methods in comparison. By examining Table 1, it is evident that the proposed algorithm accounts for superior contrast enhancement over the other two methods, in terms of contrast gain.

Table 1. Performance (contrast gain) of the contrast enhancement methods on test image Lena

Test Image              PIFS Algorithm   Linear Masking   Cubic Masking
Lena (Fig. 1(a))        5.851            2.347            1.136
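A direct implementation of the contrast measure of Eqs. (7)–(11) is sketched below. It is our own illustrative code: the (2m+1)×(2m+1) window with m = 2 follows the text, while the small constant added to the local mean (to avoid division by zero in flat regions) is an assumption, since the paper does not state how such regions are handled.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def mean_contrast(img, m=2, eps=1e-6):
        """Mean of the local-variance / local-mean contrast, Eqs. (7)-(10)."""
        img = img.astype(float)
        size = 2 * m + 1
        lm = uniform_filter(img, size=size)                 # Eq. (8)
        lv = uniform_filter(img ** 2, size=size) - lm ** 2  # Eq. (9)
        return np.mean(lv / (lm + eps))                     # Eqs. (7), (10)

    def contrast_gain(original, enhanced, m=2):
        # Eq. (11): difference of the two mean contrasts
        return mean_contrast(enhanced, m) - mean_contrast(original, m)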
Furthermore, the three methods were also quantitatively assessed under the presence of Gaussian noise with PSNR of 30, 27, 25, 24 and 23 dB. The results are quoted in Table 2. There, the three methods are compared in terms of the contrast gain of the enhanced noisy image over the original noisy image, as expressed in Eq. (11). Moreover, the PSNR of the enhanced noisy image with respect to its enhanced noiseless counterpart is also recorded in each case. Table 2 indicates that the suggested method accounts for superior contrast enhancement, by providing higher contrast gain than the two other methods in comparison. Furthermore, the PIFS-based algorithm achieves higher PSNR values for all levels of noise, when compared to the Linear and Cubic Masking techniques. This practically means that the proposed approach is more tolerant to Gaussian noise than the other two methods in comparison. Finally, if greater levels of noise are inserted, the Linear and Cubic Masking methods fail completely, while the suggested method is still able to provide contrast enhancement, even with noticeable artifacts originating from the amplified noise.
Table 2. Performance of the contrast enhancement methods on noisy versions of test image Lena, in terms of contrast gain and PSNR (the second line of each row, expressed in dB)

                        PSNR of the noisy input (dB)
Enhancement Method      30.0        27.0        25.0        24.0        23.0
PIFS Algorithm          7.348       8.686       10.238      11.178      12.251
                        24.6 dB     21.7 dB     20.0 dB     18.8 dB     17.9 dB
Linear Masking          6.100       8.059       8.817       9.017       8.816
                        20.7 dB     18.3 dB     17.2 dB     16.6 dB     16.3 dB
Cubic Masking           3.338       4.306       4.977       5.423       5.870
                        17.5 dB     15.9 dB     15.4 dB     15.1 dB     14.7 dB
4 Discussion

A novel approach for image enhancement using Partitioned Iterated Function Systems (PIFS) was introduced in this paper. The algorithm depends upon a number of parameters, which will be discussed in some detail in this section. Each of those parameters impacts the final enhanced image and thus has to be suitably adjusted in order for the proposed algorithm to reach its full potential. Moreover, the original PIFS scheme may be enriched with some additional features that render the algorithm more robust and enhance its interpretability by the observer. A key improvement is also discussed in this section.

The most important parameter is the contrast gain factor λ. The optimal value of λ mainly depends on the characteristics of the input image. After applying the PIFS method on several test images, it was deduced that optimal visual results were obtained when using λ=1. This is clearly shown in Fig. 5. There, the test image shown in Fig. 5(a) was enhanced using the proposed PIFS algorithm with λ=0.45 (Fig. 5(b)), λ=1 (Fig. 5(c)) and λ=2.1 (Fig. 5(d)). As can be seen, Fig. 5(c) provides both strong contrast enhancement and preserves the information of the original image. In general, the greater the value of λ, the higher the resulting contrast gain, as indicated in Fig. 6(a). There, the contrast gain is plotted against several values of λ, after applying the PIFS algorithm to the test image shown in Fig. 1(a). However, in images that contain fine structures (such as medical images) some information loss is recorded for λ > 1. It follows that, as the value of λ further increases, information loss becomes more severe, hindering the produced optical contrast enhancement effect (Fig. 5(d)). The reason for this is that, as λ increases, fewer gray-scale intensity values are actually perceivable by the human observer in the enhanced image. On the other hand, using a relatively low value for λ (i.e. λ=0.45) does not produce the desired strong contrast enhancement (Fig. 5(b)). Hence, λ=1 was preferred throughout this paper in order to balance between contrast enhancement and minimization of information loss.

As mentioned in Section 2.2, the parameter γ also plays an important role in the final product of the proposed algorithm. Several values of γ were evaluated for encoding a test image and then enhancing the image using the PIFS algorithm. In Fig. 6(b), the resulting contrast gain of the enhanced image is plotted against the γ used for encoding. The value of γ used for decoding was kept constant (γ = 0.1) throughout the test.
Fig. 5. Example of enhancing a radiographic test image using the PIFS algorithm. (a) Original test image. PIFS enhanced image with (b) λ=0.45, (c) λ=1 and (d) λ=2.1.
Fig. 6. Resulting contrast gain on test image Lena, using the PIFS algorithm, against (a) values of λ ranging from 0.30 to 1, (b) values of γ (encoding) ranging from 0.1 to 0.8
As can be seen in Fig. 6(b), the greater the value of γ used for encoding, the higher the contrast gain of the enhanced image. Nevertheless, there is an upper bound for this value, beyond which information loss is noticeable, similar to the case of parameter λ. This depends on the characteristics of the input image and, for the particular test image employed in this case, it was estimated at γ = 0.8. Therefore, using γ = 0.8 for encoding proved to be the best possible value for adequate image enhancement without losses in the information of the original image.

Finally, as far as the computational time of the proposed algorithm is concerned, PIFS image enhancement is generally slower than the other two methods in comparison. The reason for this is that both Linear Masking and Cubic Masking are rather simple filters, which do not involve complex mathematical computations. The much more elegant and complex PIFS approach typically requires about 4 seconds for enhancing a 512×512 pixels gray-scale image on a PC (x86 type, 1.8GHz with 1024MB RAM), but it provides superior quality enhancement, when compared to the other two commonly employed methods.

The proposed scheme for contrast enhancement does not take into account the image content, which results in using the same value of the contrast gain, λ, for the entire image. In that case, there is a risk of enhancing image noise, especially in smooth regions of the image, as well as causing ringing effects, i.e. over-enhancing the strong edges of the image. Consequently, the algorithm could be slightly modified in order to cope with those problems. In particular, during the decoding phase, the original image is used as the initial image for the iterations. Then, for each range block, the variance of the gray levels of the pixels is computed.
Eq. (3) is then applied only if the variance lies in the range between v1 and v2. The parameter v1 represents the low threshold and v2 the high threshold of the acceptable variance range. Both are determined so that range blocks which are neither smooth nor contain strong edges are chosen. Therefore, the smooth areas or the regions containing strong edges are not enhanced in the resulting image. In Fig. 7(c), the contrast-enhanced version of the test image shown in Fig. 7(a) is illustrated, for v1 = 400 and v2 = 1000. Fig. 7(b) depicts the highpass version of the test image for v1 = 400, v2 = 1000. When compared to Fig. 1(b), where no variance filtering was used, Fig. 7(b) clearly indicates the regions of the image that fall between v1 and v2, on which the enhancement algorithm is finally applied. As with all other parameters affecting the PIFS algorithm, the values of v1 and v2 depend on the contrast distribution of the original input image.
Fig. 7. Applying PIFS with variance filtering. (a) Test image Lena. (b) Highpass version of Lena using the values γ=0.8 (encoding) and γ=0.1 (decoding). (c) Enhanced image with λ=1 and, v1=400, v2=1000.
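The variance-based gating described above can be sketched in a few lines. The code below is our own reading of the modification: since using the original image to initialize the decoding leaves excluded blocks untouched, the gating is equivalent to adding the highpass detail only on 4×4 range blocks whose gray-level variance falls inside [v1, v2]. Function and variable names are ours.

    import numpy as np

    def variance_gated_enhance(img, highpass, lam=1.0, w=4, v1=400.0, v2=1000.0):
        """Enhance only range blocks whose variance lies in [v1, v2];
        smooth blocks and blocks with strong edges are left untouched."""
        img = img.astype(float)
        out = img.copy()
        H, W = img.shape
        for y in range(0, H - w + 1, w):
            for x in range(0, W - w + 1, w):
                if v1 <= img[y:y + w, x:x + w].var() <= v2:
                    out[y:y + w, x:x + w] += lam * highpass[y:y + w, x:x + w]
        return np.clip(out, 0, 255)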
5 Conclusions

This paper presented a novel approach for contrast enhancement based on the theory of iterated function systems. After quantitative and qualitative analysis of the proposed algorithm, it was shown that it is able to increase the mean contrast of the enhanced image, thus achieving a quite high mean contrast gain over the original image. When compared to two widely used contrast enhancement methods, Linear and Cubic Unsharp Masking, the proposed approach produced superior quality enhanced images under both visual and quantitative assessment. Finally, the introduced PIFS algorithm proved to be quite tolerant to the presence of noise, as it was capable of increasing the contrast gain of the test images without amplifying the noise, in terms of PSNR, as much as the other two methods in comparison.
References 1. Lim, J.S.: Two-dimensional Signal and Image Processing. Prentice Hall, New Jersey (1990) 2. Umbaugh, S.E.: Computer Vision and Image Processing: A Practical Approach Using CVIPTools, 1st edn. Prentice-Hall, Englewood Cliffs (1997)
3. Sund, T., Møystad, A.: Sliding window adaptive histogram equalization of intraoral radiographs. effect on image quality, Dentomaxillofacial Radiology 35, 133–138 (2006) 4. Ramponi, P.G., Mathews, V.J.: Adaptive unsharp masking for contrast enhancement. In: International Conference on Image Processing, vol. 1, p. 267 (1997) 5. Badamchizadeh, M.A., Aghagolzadeh, A.: Comparative study of unsharp masking methods for image enhancement. In: Image and Graphics Proceedings, pp. 27–30 (2004) 6. Arici, T., Altunbasak, Y.: Image local contrast enhancement using adaptive non linear filters, IEEE international conference on Image Processing (to be published, 2006) 7. Barnsley, M.F., Hurd, L.P.: Fractal Image Compression. AK Press, Massachusetts (1993) 8. Jacquin, E.: Fractal image coding: a review. Proceedings of the IEEE 81(10), 1451–1465 (1993) 9. Thomas, L., Deravi, F.: Region-based fractal image compression using heuristic search. IEEE Trans. on Image Processing 4(6), 832–838 (1995) 10. Nikiel, S.: Integration of iterated function systems and vector graphics for aesthetics. Computers & Graphics 30, 277–283 (2006) 11. Fan, K.C., Chang, J.C., Kan, K.S.: Improvement of image-compression quality via block classification and coefficient diffusion. In: Proc. SPIE, vol. 2501, pp. 1727–1736 (1995) 12. Kuan, J.K.P., Lewis, P.H.: Fast k nearest neighbour search for R-tree family. In: Proceedings on First International Conf. on Information, Communications, and Signal Processing. Singapore, pp. 924–928 (1997) 13. Bentley, J.L.: Multidimensional binary search trees used for associative searching, Commun. ACM 18(9), 509–517 (1975) 14. Jacquin: Image coding based on a fractal theory of iterated contractive image transformations. IEEE Trans. Image Proc. 1, 18–30 (1992) 15. Chen, S.K., Hollender, L.: Linear unsharp mask filtering of linear cross-sectional tomograms of the posterior mandible. Swed. Dent. J. 19(4), 139–147 (1995) 16. Ramponi, G.: A cubic unsharp masking technique for contrast enhancement. Signal Processing 67(2), 211–222 (1998) 17. De Vries, F.P.: Automatic adaptive brightness independent contrast enhancement. Signal Process 21, 169–182 (1990)
A Spatiotemporal Algorithm for Detection and Restoration of Defects in Old Color Films

Bekir Dizdaroglu and Ali Gangal

Department of Electrical and Electronics Engineering, Karadeniz Technical University, 61080, Trabzon, Turkey
{bekir,ali.gangal}@ktu.edu.tr
Abstract. A spatiotemporal method is presented for detection and concealment of local defects such as blotches in old color films. Initially, the non-local means (NL-means) method, which does not require motion estimation, is used for noise removal in image sequences. Later, the motion vectors that are incorrectly estimated within defect regions are repaired by taking account of the temporal continuity of the motion trajectory. The defects in films are detected by the spike detection index (SDI) method, which is easily adapted to color image sequences. Finally, the proposed inpainting algorithm fills in the detected defect regions and, unlike other approaches, does not require true motion estimation. The method was tested on synthetic and real image sequences, and efficient concealment results were obtained.
1 Introduction

Old films are subject to degradation in quality due to bad environmental factors and repeated projection. Dust and dirt are major defects. They adhere to the film surface and appear as blotches. The blotches have random shapes and positions in each frame and do not generally occupy the same spatial location in successive frames. Vertical scratches occur in a frame when the film is abraded by dirt particles in the projector. Various other defects occur because of water damage or excessive heat. Digital restoration techniques are generally organized in three steps: motion estimation, detection and concealment of damaged regions. Accurate motion estimation and compensation is especially necessary for the detection and correction of defects. The detection of pixels that are likely to be damaged is required so that only missing pixels are restored. Detection algorithms include the spike detection index [1], rank ordered differences, Markov random fields [1] and the AR model [1]. The final step in the restoration process is to fill in the damaged pixels. The damaged pixels are restored using information from adjacent pixels within successive frames. Inpainting, which has received much attention in recent years, can be used for interpolation of the damaged pixels. There are two categories of image inpainting methods: texture synthesis and inpainting based on partial differential equations (PDE). The first is used to restore large regions of an image and the second is used to fill in small image holes. Bornand et al. [2] improved the study of Efros et al. [3] by reconstructing the
defect locations for image sequences. In this method, the filling priority affects the output image. Criminisi et al. [4] presented an exemplar-based image inpainting algorithm to remove large objects from an image. The method fills in the missing regions with sample patches. However, blocking artifacts sometimes occur in the inpainted regions of the restored image. Gangal et al. [5] proposed a method using a multilevel 3-D extended (ML3Dex) vector median filter for restoration. This approach successfully conceals blotches in an image sequence if true motion estimation can be done. Moreover, Gangal et al. [6] presented a spatiotemporal reconstruction algorithm which occasionally fails to fill in missing areas involving complex texture and structure. In this paper, we further enhance the existing exemplar-based image inpainting method to complete damaged regions. The proposed method retains the advantages of spatiotemporal exemplar-based image inpainting and reconstructs the defect areas by finding the fittest patches, even if motion estimation is done improperly.
2 Proposed Method

The proposed method contains noise reduction, motion estimation and motion vector repairing, and defect detection and restoration, respectively. Gray and vector-valued or color images are defined as follows:

I : B ⊂ R^s → R, p ↦ I(p)  (gray-level),    I : B ⊂ R^s → R^c, p ↦ I(p)  (color)    (1)

where p = (x, y) and p = (x, y, t) for s = 2 and s = 3, respectively. For color images, each pixel is a vector of dimension c ∈ N+ and corresponds to c = 3, with vector values in RGB or YUV color spaces. I_i : B → R indicates the i-th image channel of I (1 ≤ i ≤ c).

2.1 Image Sequence Denoising
Various noise removal methods which require true motion estimation have been proposed for digital image sequences. However, many of these algorithms can degrade or remove the structure and texture of the image. Accurate motion estimation within degraded image regions is crucial, and no method is able to produce fully reliable results there. Thus, the filling process along the calculated motion trajectories can cause artifacts. The non-local means (NL-means) algorithm [7] does not rely on the same assumptions as other denoising methods. Instead, it benefits from the sizable redundancy of an image or image sequence; namely, any small region in an image has numerous closely similar regions in the same image or image sequence. Efros et al. [3] first presented this idea. In this method, all pixels in such a neighborhood can be used for reconstructing the value at point p.
The spatiotemporal NL-means algorithm is defined as:

Ĩ(p) = (1 / J(p)) Σ_{q ∈ Φ_{t−1} ∪ Φ_t ∪ Φ_{t+1}} e^{−d(p,q)/h²} I(q),    J(p) = Σ_q e^{−d(p,q)/h²}    (2)

d(p, q) = Σ_r G_a(r) ‖ I(p + r) − I(q + r) ‖²    (3)

where d(·) is a weighted distance, G_a(·) is a Gaussian kernel of standard deviation a, h acts as a filtering parameter, r denotes a translation vector in the comparison windows Ψ_p or Ψ_q, ‖·‖ indicates the L2 distance, J(p) is a normalizing factor, Φ_{t−1}, Φ_t and Φ_{t+1} are search regions or learning windows, and Ĩ(p), the restored value at point p, is a weighted average of the other pixels whose Gaussian neighbourhood resembles that of point p. This method removes noise from an image sequence without blurring fine details. However, it is not sufficient for the removal of big defects in old films, so we only utilize it for prefiltering.
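A brute-force sketch of the spatiotemporal NL-means average of Eqs. (2)–(3) is given below for a single gray-level pixel. It is written for clarity rather than speed; the window sizes, the Gaussian weighting of the comparison window and the implicit assumption that the pixel lies far enough from the frame borders are our own simplifying choices, and colour handling is omitted.

    import numpy as np

    def gaussian_window(size, a=1.0):
        # separable Gaussian weights G_a(r) over a (size x size) comparison window
        ax = np.arange(size) - size // 2
        g = np.exp(-(ax ** 2) / (2 * a ** 2))
        w = np.outer(g, g)
        return w / w.sum()

    def nl_means_pixel(frames, t, y, x, comp=7, search=11, h=4.0, a=1.0):
        """Restored value at p = (x, y, t) from frames t-1, t, t+1 (Eq. (2)).
        frames is a (T, H, W) gray-level array."""
        cw, sw = comp // 2, search // 2
        G = gaussian_window(comp, a)
        ref = frames[t, y - cw:y + cw + 1, x - cw:x + cw + 1].astype(float)
        num = den = 0.0
        for tt in (t - 1, t, t + 1):
            for yy in range(y - sw, y + sw + 1):
                for xx in range(x - sw, x + sw + 1):
                    patch = frames[tt, yy - cw:yy + cw + 1,
                                   xx - cw:xx + cw + 1].astype(float)
                    d = np.sum(G * (ref - patch) ** 2)     # Eq. (3)
                    wgt = np.exp(-d / h ** 2)
                    num += wgt * frames[tt, yy, xx]
                    den += wgt
        return num / den

With the parameter values used in the experiments (a = 1, h = 4, 7 × 7 comparison windows, 11 × 11 search regions) the call reduces to nl_means_pixel(frames, t, y, x).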
2.2 Motion Estimation and Motion Vector Repairing

Accurate motion estimation is needed particularly for the correct detection and restoration of defects; because of the heavy computation this process requires, we use a predictive diamond search method. The method used for motion estimation in the proposed algorithm is described by Tourapis et al. [8]. In order to prevent wrong motion estimation caused by defects in a degraded image sequence, Boyce's algorithm [9] can be used, but that method results in motion estimation failure when the size of the defects is larger than the block size. Furthermore, motion vectors that could point out an inaccurate motion trajectory within missing areas must be interpolated. Existing motion vector repairing approaches are unable to work for most scenes with complex motion. Motion does not change suddenly in any natural image sequence. This characteristic can be utilized by calculating the correlation of motion vectors at the same spatial location within neighboring frames as follows:

u(x, y, t) = [ u(x, y, t + 1) + u(x, y, t − 1) ] / 2    (4)
The spatial or temporal correlation among the motion vectors of adjacent image blocks in a frame or in successive frames can be used for repairing the motion vectors in a damaged image sequence. The re-estimation of wrong motion vectors may not be done perfectly because of the existence of damage, occlusions or non-translational motions, but it is necessary for accurate defect detection and restoration. As can be seen in Fig. 1, the motion vector of the current image block can be checked to find out whether it is correctly estimated or not, based on the motion vectors of the current image block and its neighbors. It can be explained as follows:
MV(x, y, t − 1) = [ Σ_{i=−1}^{1} Σ_{j=−1}^{1} d(i, j) ‖u_a(x + k·i, y + k·j, t − 1)‖ ] / [ Σ_{i=−1}^{1} Σ_{j=−1}^{1} d(i, j) ]    (5)

where u_a = (u_a, v_a) is the motion vector in the forward temporal direction pointed by arrow 3, as shown in Fig. 1, k is the dimension of the image block, and d(·) is the distance among the image blocks, given by:

d(i, j) = 1                    if i = 0 and j = 0
d(i, j) = (i² + j²)^{1/2}      otherwise    (6)
The length of the motion vector in the forward temporal direction pointed by arrow 2 is defined as follows:

MV(x, y, t) = ‖u(x, y, t)‖ = √( u(x, y, t)² + v(x, y, t)² )    (7)

If 2·MV(x, y, t) > MV(x, y, t − 1), the motion vector in the forward temporal direction pointed by arrow 2 is estimated incorrectly. A similar process is repeated for the motion vector in the backward temporal direction pointed by arrow 1.
Fig. 1. The repairing of motion vector approach: (a) Previous, (b) current and (c) next frames
The motion vectors within the damaged region shown in Fig. 1, i.e. for the current image block in the backward and forward temporal directions pointed by arrows 1 and 2, are incorrectly estimated. However, the motion vector pointed by arrow 3 in the same spatial location, from the previous frame (which has already been restored) to the next frame, is correctly estimated. Therefore, the motion vector of the current block in the forward temporal direction pointed by arrow 2 is computed as follows:

u(x, y, t) = u_a(x, y, t − 1) / 2    (8)

The same operations are iterated for the backward direction pointed by arrow 1. So, the motion vectors of image blocks within defect locations can be computed by taking account of the temporal correlation of the motion trajectory.
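A compact sketch of the repair test of Eqs. (5)–(8) is given below. It is an illustration under our own assumptions, not the authors' code: the motion field is stored block-wise in NumPy arrays (so the k·i offsets of Eq. (5) become neighbor indices), and the weights d(i, j) are taken literally from Eq. (6).

    import numpy as np

    def weights():
        # d(i, j) of Eq. (6)
        d = np.empty((3, 3))
        for i in (-1, 0, 1):
            for j in (-1, 0, 1):
                d[j + 1, i + 1] = 1.0 if i == 0 and j == 0 else np.hypot(i, j)
        return d

    def repair_forward_vectors(mv_cur, mv_two):
        """mv_cur[b_y, b_x] = (u, v): forward vectors of the current frame.
        mv_two[b_y, b_x] = (u_a, v_a): already-restored vectors spanning the
        previous frame to the next frame (arrow 3).  Returns repaired vectors."""
        d = weights()
        out = mv_cur.astype(float)
        By, Bx = mv_cur.shape[:2]
        for by in range(1, By - 1):
            for bx in range(1, Bx - 1):
                norms = np.linalg.norm(mv_two[by - 1:by + 2, bx - 1:bx + 2], axis=2)
                ref = np.sum(d * norms) / d.sum()               # Eq. (5)
                if 2 * np.linalg.norm(mv_cur[by, bx]) > ref:    # length test
                    out[by, bx] = mv_two[by, bx] / 2.0          # Eq. (8)
        return out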
2.3 Defect Detection

In this study, we use the spike detection index (SDI) [1], the simplest detector, in order to perceive temporal discontinuities such as blotches. We extended this method for the detection of defect regions in old color films. It marks a pixel as damaged by using a threshold operation. The method is defined as follows:

A_SDI(p) = 1 if ε(x, y, t ± 1) > T and 0 otherwise,    ε(x, y, t ± 1) = Σ_{i=1}^{c} | I_i(p) − Î_i(x, y, t ± 1) |    (9)

where Î_i(·) is the motion-compensated pixel. The current pixel is marked as damaged when both the forward and backward motion-compensated frame differences are higher than the predefined threshold T, which is chosen experimentally.
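The detector of Eq. (9) amounts to a per-pixel sum of absolute channel differences against both motion-compensated neighbors. A minimal sketch follows, assuming the motion-compensated previous and next frames have already been built; the 3 × 3 dilation applied in the experiments is omitted, and the function name is ours.

    import numpy as np

    def sdi_mask(cur, comp_prev, comp_next, T=60.0):
        """cur, comp_prev, comp_next: (H, W, c) color frames, the latter two
        motion-compensated towards the current frame.  Returns a boolean mask
        of pixels flagged as damaged (Eq. (9))."""
        eps_b = np.abs(cur.astype(float) - comp_prev.astype(float)).sum(axis=2)
        eps_f = np.abs(cur.astype(float) - comp_next.astype(float)).sum(axis=2)
        return (eps_b > T) & (eps_f > T)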
2.4 Defect Restoration

A spatiotemporal exemplar-based inpainting method is proposed for the restoration of damaged regions in old films. The method is based on the existing image inpainting method presented in [4]. However, the existing algorithm is unable to fill in missing regions perfectly due to complex textures and structures in the current frame. For this reason, three successive frames are used to restore the degraded image sequence, increasing the performance of the proposed method. If missing areas cannot be exactly reconstructed from the search area of the current frame, the method attempts to find the best sample patch in the neighboring frames by searching acceptable patches based on the calculated motion trajectory. The method is shown in Fig. 2 and can be explained as follows: I(p) is a pixel value in the given current frame, Ω is the target region to be reconstructed, δΩ is the boundary of Ω, and Φ_{t−1}, Φ_t and Φ_{t+1} are the search regions that consist of the sample patches. Ψ_p is the current patch that will be filled in at point p = (x, y, t) on δΩ. The filling priority of each point on the boundary of the target region is computed as follows:

P(p) = C(p) D(p)    (10)

C(p) = [ Σ_{q ∈ Ψ_p ∩ Φ_t} C(q) ] / Area(Ψ_p)    (11)
D(p) = Σ_{i=−1}^{1} D_i ,    D_i = | ∇⊥I(x, y, t + i) · n(p) | / (2α)  for i = 0,
                             D_i = | ∇⊥Î(x, y, t + i) · n(p) | / (4α)  for i = −1, 1    (12)
where C(p) is the confidence term, which provides the filling priority from the outer layers of the target region to the inner layers, and D(p) is the data term, which boosts the priority of a patch that has high gradient values such as edge information. Area(Ψ_p) is the area of Ψ_p, α is a normalization factor (i.e. 255 for gray-valued images), ∇⊥I(p) and ∇⊥Î(q) are the isophotes at points p and q, respectively, and n(p) is a unit vector orthogonal to the front of the contour at point p.
Fig. 2. The proposed spatiotemporal exemplar-based inpainting method
During initialization, C(p) is set to the following values:

C(p) = 1 for all p ∈ Φ_t,    C(p) = 0 for all p ∈ Ω    (13)
In the proposed method, after the patch Ψ_p̂ with the maximum priority is found, the best exemplar patch is then searched for within the search regions in successive frames, where the distance d(Ψ_p̂, Ψ_q̂) between the two patches Ψ_p̂ and Ψ_q̂ is defined as the sum of squared differences (SSD) of the already filled pixels in the patches. The best sample patch is copied from the search regions to the target region. The last step is to update the confidence values.
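One step of the filling loop can be sketched as follows. This is a simplified, single-frame, gray-level illustration of Eqs. (10)–(13) and of the SSD patch search: the data term of Eq. (12), the multi-frame search regions and the motion-guided candidates are omitted for brevity, and all function and variable names are ours.

    import numpy as np

    def fill_one_patch(img, mask, conf, half=4, search=12):
        """img: (H, W) float array; mask: True inside the target region Omega;
        conf: confidence map, 1 outside Omega and 0 inside (Eq. (13)).
        Fills the patch around the highest-priority boundary pixel; returns
        False when nothing is left to fill.  Call repeatedly until False."""
        H, W = img.shape
        known = ~mask
        # delta-Omega: damaged pixels having at least one known 4-neighbour
        nb = np.zeros_like(mask)
        nb[1:, :] |= known[:-1, :]; nb[:-1, :] |= known[1:, :]
        nb[:, 1:] |= known[:, :-1]; nb[:, :-1] |= known[:, 1:]
        front = [(y, x) for y, x in np.argwhere(mask & nb)
                 if half <= y < H - half and half <= x < W - half]
        if not front:
            return False

        def patch(a, y, x):
            return a[y - half:y + half + 1, x - half:x + half + 1]

        # priority: mean confidence over the patch (Eq. (11); data term omitted)
        pri = [patch(conf, y, x).mean() for y, x in front]
        py, px = front[int(np.argmax(pri))]
        tgt_mask = patch(mask, py, px).copy()
        c_val = patch(conf, py, px).mean()

        best, best_err = None, np.inf
        for qy in range(max(half, py - search), min(H - half, py + search + 1)):
            for qx in range(max(half, px - search), min(W - half, px + search + 1)):
                if patch(mask, qy, qx).any():
                    continue                        # candidate must be fully known
                diff = (patch(img, py, px) - patch(img, qy, qx))[~tgt_mask]
                err = np.sum(diff ** 2)             # SSD over already-filled pixels
                if err < best_err:
                    best, best_err = (qy, qx), err
        if best is None:
            return False
        src = patch(img, *best)
        patch(img, py, px)[tgt_mask] = src[tgt_mask]   # copy the missing pixels
        patch(conf, py, px)[tgt_mask] = c_val          # confidence update
        patch(mask, py, px)[tgt_mask] = False
        return True

A caller would simply iterate while fill_one_patch(img, mask, conf): pass until the target region is exhausted.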
3 Experiments

The proposed method is applied to synthetically damaged and real image sequences. The size of the image sequences is 352 × 288 pixels in the YUV color space. The standard deviation a, the filtering parameter h, the comparison windows and the search regions for the spatiotemporal NL-means method are taken as 1, 4, 7 × 7 and 11 × 11 pixels, respectively. The block and patch sizes are chosen as 4 × 4 pixels for block-matching motion estimation at ½ sub-pixel accuracy and 9 × 9 pixels for exemplar-based inpainting, respectively. The threshold T used in the SDI detector is chosen as 60 for "Foreman" and the real sequence, and 100 for "Coastguard". The search region for the exemplar-based inpainting approaches is chosen as 25 × 25 pixels. We demonstrate the algorithm performance in quantitative and qualitative terms and compare it to the exemplar-based inpainting [4], the ML3Dex vector median [5] and the spatiotemporal search [6]. Gaussian noise of 0.1% and blotches of random size and shape were artificially generated for the synthetic sequences, and frames 223 and 3 of the "Foreman" and "Coastguard" sequences are shown in Figs. 3a-b, respectively. A real image sequence was also grabbed from a TV broadcast, and frame 58 of this sequence is shown in Fig. 3c.
Fig. 3. Degraded images: Frames (a) 223 of “Foreman”, (b) 3 of “Coastguard” and (c) 58 of real film sequences
3.1 Synthetic Damaged Image Sequences

Fig. 4 shows the estimated motion vectors, which are calculated by using the predictive diamond search method. Since there is translational motion in frame 3 of the "Coastguard" sequence, the repairing approach correctly estimates the motion vectors in the degraded regions (Fig. 4f). However, there is slightly complex motion in frame 223 of the "Foreman" sequence; therefore, the proposed method is unable to correctly calculate some of the motion vectors (Fig. 4c). Fig. 5 shows the defect detection results using the SDI. Here, a 3 × 3 dilation operator was applied to the detection result in order to fill in efficiently. The red-marked regions are the missed defects (Fig. 5b). In these locations, the background information of frame 3 of "Coastguard" and the artificially added blotches are approximately the same. For this reason, the SDI is unable to detect these regions.
Fig. 4. Repairing of motion vectors: (a) and (d) original, (b) and (e) wrong, and (c) and (f) repaired motion vectors of frame 223 and frame 3 of the "Foreman" and "Coastguard" sequences
Fig. 5. Detected damaged regions, shown in white pixels, and undetected regions, marked by red, using SDI. Frames (a) 223 and (b) 3 of "Foreman" and "Coastguard" sequences
The normalized mean squared error (NMSE) is the most widely used quantitative measure for evaluation purposes. It is defined as follows:

NMSE(t) = [ Σ_{p∈B} ‖ I(p) − Ĩ(p) ‖² ] / [ Σ_{p∈B} ‖ I(p) ‖² ] = [ Σ_{p∈B} Σ_{i=1}^{c} ( I_i(p) − Ĩ_i(p) )² ] / [ Σ_{p∈B} Σ_{i=1}^{c} I_i(p)² ]    (14)
where Ĩ(p) is the restored pixel value at point p. The NMSE line charts between the restored and original frames, for frames 223–240 of the "Foreman" sequence and frames 3–20 of the "Coastguard" sequence, are shown in Figs. 6 and 7, respectively.
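For completeness, Eq. (14) reduces to a one-line computation on colour frames stored as arrays; the short sketch below, with names of our own choosing, assumes (H, W, c) inputs.

    import numpy as np

    def nmse(original, restored):
        # Eq. (14): ratio of summed squared channel errors to summed squared values
        o = original.astype(float)
        r = restored.astype(float)
        return np.sum((o - r) ** 2) / np.sum(o ** 2)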
Fig. 6. NMSE for “Foreman” sequence between frames 223 and 240
Fig. 7. NMSE for “Coastguard” sequence between frames 3 and 20
The graphs demonstrate that the NMSE of the proposed method is almost always lower than that of the other methods. In addition to these findings, a qualitative measure is necessary for human visual perception. Figs. 8 and 9 are obtained by applying the above methods to reconstruct frames 223 and 3 of the "Foreman" and "Coastguard" sequences, respectively. The completed regions of interest are marked by green and the missed regions are marked by red in the figures. The performance of the proposed method is almost always better than that of the other methods in damaged locations consisting of motion areas (Figs. 8d and 9d). The poorest performance among the methods belongs to the spatial restoration (Figs. 8a and 9a). The ML3Dex approach is occasionally unable to interpolate the missing regions perfectly due to wrong motion compensation (Fig. 8b).
3.2 Real Image Sequence

The performance of the above methods was tested on the real image sequence, and some example results are shown in Figs. 10 and 11.
Fig. 8. Completed frame 223 of "Foreman" sequence using (a) spatial restoration, (b) ML3Dex, (c) spatiotemporal search and (d) spatiotemporal restoration
Fig. 9. Completed frame 3 of “Coastguard” sequence using (a) spatial restoration, (b) ML3Dex, (c) spatiotemporal search and (d) spatiotemporal restoration
Some of the motion vectors marked in red in the missing regions cannot be repaired, as shown in Fig. 10b, and therefore the SDI detector is unable to perceive these areas. The exemplar-based spatial restoration is unable to complete the missing region because it uses only the current frame for the reconstruction process. The ML3Dex vector median performs better than the spatial restoration in the interpolation of damaged areas because there are no complex motions. However, it fails to restore the man's shoulder on account of the incorrect motion compensation.
Fig. 10. (a) Wrong and (b) repaired motion vectors, and (c) detected defects of frame 58 of real film
Fig. 11. Completed frame 58 of real film using spatial restoration (top left), ML3Dex (top right), spatiotemporal search (bottom left) and spatiotemporal restoration (bottom right)
The spatiotemporal search method fills in the missing areas perfectly. However, the proposed spatiotemporal restoration method has the best visual quality in comparison to the other methods, particularly in the connected edge information of the felt hat, as shown in Fig. 11. The methods were implemented in Visual C++ .NET 2003 and run on a Pentium 2.4 GHz with 512 MB RAM. The proposed method took 22 seconds for the restoration of the real film frame shown in Fig. 3c.
4 Conclusions and Future Work

In this paper, we proposed a spatiotemporal method for the restoration of damaged old color films. Experimental simulation results showed that the proposed method removes blotches from the degraded frames by reconstructing visually plausible and coherent patches.
It is clear that the detection and correction of defect regions could be done better if the performance of the motion vector repairing is further improved in complex motion areas.
References

1. Kokaram, A.C., Morris, R.D., Fitzgerald, W.J., Rayner, P.J.W.: Detection of Missing Data in Image Sequences. IEEE Transactions on Image Processing 4(11), 1496–1508 (1995)
2. Bornand, R., Lecan, E., Laborelli, L., Chenot, J.: Missing Data Correction in Still Images and Image Sequences. In: Proceedings of ACM Multimedia, ACM, New York (2002)
3. Efros, A., Freeman, W.: Image Quilting for Texture Synthesis and Transfer. In: Proceedings of the ACM Conference on Computer Graphics, Eugene Fiume, pp. 341–346. ACM, New York (2001)
4. Criminisi, A., Perez, P., Toyama, K.: Region Filling and Object Removal by Exemplar-Based Inpainting. IEEE Trans. Image Proc. 13(9), 1200–1212 (2004)
5. Gangal, A., Kayikcioglu, T., Dizdaroglu, B.: An improved motion-compensated restoration method for damaged color motion picture films. Signal Proc. Image Comm. 19, 353–368 (2004)
6. Gangal, A., Dizdaroglu, B.: Automatic Restoration of Old Motion Picture Films Using Spatio-Temporal Exemplar-Based Inpainting. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2006. LNCS, vol. 4179, pp. 55–66. Springer, Heidelberg (2006)
7. Buades, A., Coll, B., Morel, J.M.: Denoising Image Sequences Does Not Require Motion Estimation. In: CMLA Preprint 2005-18 (2005)
8. Tourapis, A.M., Shen, G., Liou, M.L., Au, O.C., Ahmad, I.: A New Predictive Diamond Search Algorithm for Block Based Motion Estimation. In: Proc. of Visual Comm. and Image Proc. (2000)
9. Boyce, J.: Noise Reduction of Image Sequences Using Adaptive Motion Compensated Frame Averaging. In: Proceedings of the IEEE ICASSP 3, 461–464 (1992)
Categorizing Laryngeal Images for Decision Support

A. Gelzinis, A. Verikas, and M. Bacauskiene

Department of Applied Electronics, Kaunas University of Technology, Studentu 50, LT-51368, Kaunas, Lithuania
[email protected], [email protected], [email protected]
Abstract. This paper is concerned with an approach to automated analysis of vocal fold images aiming to categorize laryngeal diseases. Colour, texture, and geometrical features are used to extract relevant information. A committee of support vector machines is then employed for performing the categorization of vocal fold images into healthy, diffuse, and nodular classes. The discrimination power of both the original space and the space obtained based on the kernel principal component analysis is investigated. A correct classification rate of over 92% was obtained when testing the system on 785 vocal fold images. Bearing in mind the high similarity of the decision classes, the correct classification rate obtained is rather encouraging.
1 Introduction

The diagnostic procedure of laryngeal diseases is based on visualization of the larynx, by performing indirect or direct laryngoscopy. A physician then identifies and evaluates colour, shape, geometry, contrast, irregularity and roughness of the visual appearance of vocal folds. This type of examination is rather subjective and to a great extent depends on physician’s experience. Availability of objective measures of these features would be very helpful for assuring objective analysis of the images of laryngeal diseases and creating systematic databases for education, comparison and research purposes. In addition to the data obtained from one particular patient, information from many previous patients—experience—plays also a very important role in the decision making process. Moreover, the physician interpreting the available data from a particular patient may have a limited knowledge and experience in analysis of the data. In such a situation, a decision support system for automated analysis and interpretation of medical data is of great value. Recent developments in this area have shown that physicians benefit from the advice of decision support systems in terms of increased reliability of the analysis, decreased intra- and inter-observer variability [1]. This paper is concerned with an approach to automated analysis of vocal fold—laryngeal—images aiming to categorize diseases of vocal folds. A very few
We gratefully acknowledge the support we have received from the agency for international science and technology development programmes in Lithuania (EUREKA Project E!3681).
attempts have been made to develop computer-aided systems for analyzing vocal fold images. In our previous study [2], a committee of multilayer perceptrons employed for categorizing vocal fold images into three decision classes correctly classified over 87% of test set images. In this paper, we investigate the effectiveness of the kernel-based approach to feature extraction and classification of laryngeal images. To obtain an informative representation of a vocal fold image that is further categorized by a committee of support vector machines, texture, colour, and geometrical features are used. Each member of the committee is devoted to the analysis of features of a single type.
2 Data
This study uses a set of 785 laryngeal images recorded at the Department of Otolaryngology, Kaunas University of Medicine. The internet based archive— database—of laryngeal images is continuously updated. The laryngeal images were acquired during routine direct microlaryngoscopy employing the MollerWedel Universa 300 surgical microscope. The 3-CCD Elmo colour video camera of 768 × 576 pixels was used to record the images. We used the gold standard taken from the clinical routine evaluation of patients. A rather common, clinically discriminative group of laryngeal diseases was chosen for the analysis i.e. mass lesions of vocal folds.
Fig. 1. Images from the nodular (left), diffuse (middle), and healthy (right) classes
Mass lesions of vocal folds could be categorized into six classes namely, polypus, papillomata, carcinoma, cysts, keratosis, and nodules. This categorization is based on clinical signs and a histological structure of the mass lesions of vocal folds. In this initial study, the first task was to differentiate between the healthy (normal ) class and pathological classes and then, differentiate among the classes of vocal fold mass lesions. We distinguish two groups of mass lesions of vocal folds i.e. nodular—nodules, polyps, and cysts—and diffuse—papillomata, keratosis, and carcinoma—lesions. Thus, including the healthy class, we have to distinguish between three classes of images. Amongst the 785 images available, there are 49 images from the healthy class, 406 from the nodular class, and 330 from the diffuse class. It is worth noting that due to the large variety of appearance of vocal folds, the classification task is sometimes difficult even for a trained physician. Fig. 1 presents characteristic examples from the three decision classes considered, namely, nodular, diffuse, and healthy.
3 Methods
To obtain an informative representation of a vocal fold image, colour, texture, and geometrical features are used. The measurement values related to image colour (C), texture (T), and geometry (G) are collected into three separate vectors $\psi_C$, $\psi_T$, and $\psi_G$. Having the measurement vectors, features of the aforementioned three types are then obtained by applying the kernel principal component analysis separately for each of the spaces, as explained below. Having a vector of measurements $\psi$, the feature vector $\xi$ is computed in the following way. Assume that $\Phi$ is a mapping of $\psi$ onto a feature space $F$. Let $\tilde{\Phi}(\psi_i)$ denote the centered data point in the feature space $F$. The features $\xi$ are then given by the kernel principal components computed as projections of the centered $\Phi$-pattern $\tilde{\Phi}(\psi)$ onto the eigenvectors

$v = \sum_{i=1}^{M} \alpha_i \tilde{\Phi}(\psi_i)$   (1)

of the covariance matrix $\tilde{K}_{ij} = \langle \tilde{\Phi}(\psi_i), \tilde{\Phi}(\psi_j) \rangle$ of the centered data points, where $M$ is the number of data points and $\alpha_i$ are the expansion coefficients. Thus, the feature $\xi$ is given by

$\xi = \langle v, \tilde{\Phi}(\psi) \rangle = \sum_{i=1}^{M} \alpha_i \langle \tilde{\Phi}(\psi_i), \tilde{\Phi}(\psi) \rangle$   (2)
The dimensionality of the feature vectors $\xi$ is high and is equal to the number of data samples used. The curse of dimensionality is circumvented by using support vector machines (SVM), which can classify data in very high-dimensional feature spaces [3]. Each of the feature vectors $\xi_C$, $\xi_T$, and $\xi_G$ is processed by a separate SVM. The final image categorization is then obtained from a committee aggregating outputs of the separate SVMs.

3.1 Colour Features
The approximately uniform L∗a∗b∗ colour space was employed for representing colours. We characterize the colour content of an image by the probability distribution of the colour represented by a 3-D colour histogram of N = 4096 (16 × 16 × 16) bins and consider the histogram as an N-vector. Most of the bins of the histograms were empty or almost empty. Therefore, to reduce the number of components of the N-vector, the histograms built for a set of training images were summed up and the N-vector components corresponding to the bins containing less than $N_\alpha$ hits in the summed histogram were left aside. Hereby, when using $N_\alpha = 10$ we were left with 733 bins—a $\psi_C$ vector of 733 components. The colour features $\xi_C$ are then given by the kernel principal components computed as projections of the centered $\Phi$-pattern $\tilde{\Phi}(\psi_C)$ onto the eigenvectors of the covariance matrix $\tilde{K}_{Cij} = \langle \tilde{\Phi}(\psi_{Ci}), \tilde{\Phi}(\psi_{Cj}) \rangle$.
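As an illustration of how such a pruned colour-histogram measurement vector could be assembled, the following Python sketch uses the bin count (16 × 16 × 16) and pruning threshold $N_\alpha = 10$ quoted above; the function names and the fixed L∗a∗b∗ value ranges are our own assumptions, not part of the original system.

```python
import numpy as np

# Fixed L*a*b* value ranges so that the histogram bins line up across images (an assumption).
LAB_RANGES = [(0.0, 100.0), (-128.0, 127.0), (-128.0, 127.0)]

def colour_histogram(lab_image, bins=16):
    """3-D colour histogram of an L*a*b* image, flattened to an N-vector (N = bins**3)."""
    pixels = lab_image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=LAB_RANGES)
    return hist.ravel()

def select_bins(training_images, bins=16, n_alpha=10):
    """Sum the training histograms and keep the bins with at least n_alpha hits."""
    summed = sum(colour_histogram(img, bins) for img in training_images)
    return np.flatnonzero(summed >= n_alpha)

def colour_measurements(lab_image, kept_bins, bins=16):
    """Measurement vector psi_C for one image: the retained histogram bins."""
    return colour_histogram(lab_image, bins)[kept_bins]
```

In the paper these measurement vectors are subsequently projected onto the kernel principal components; that step is omitted here.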
3.2 Extracting Texture Features
Gabor- and wavelet-based filtering [4,5], Markov random fields based modelling [6], and the co-occurrence matrices [7] are the most prominent approaches used to extract texture features. Regarding the characterization of texture of vocal fold images, the multi-channel 2-D Gabor filtering, co-occurrence matrices, run-length matrices, and the singular value decomposition based approaches have been applied in previous studies [2,8]. Amongst those, the Gabor filtering and the co-occurrence matrices based approaches proved to provide the most discriminative features. Therefore, we resorted to these two types of textural information in this work. To perform Gabor filtering of a colour image $L^*(x, y), a^*(x, y), b^*(x, y)$, we apply a complex colour representation given by:

$z(x, y) = L^*(x, y) \exp\{j H_{ab}(x, y)\}$   (3)

where $H_{ab}(x, y) = \arctan[b^*(x, y)/a^*(x, y)]$ is the CIE hue-angle. An image $z(x, y)$ filtered by a Gabor filter $g_{f,\theta}(x, y)$ of frequency $f$ and orientation $\theta$ is given by

$zg_{f,\theta}(x, y) = \mathrm{FFT}^{-1}[Z(u, v) \cdot G_{f,\theta}(u, v)]$   (4)

where $\mathrm{FFT}^{-1}$ is the fast inverse Fourier transform, $Z(u, v)$ is the Fourier transform of the image $z(x, y)$, and $G_{f,\theta}(u, v)$ stands for the Fourier transform of the Gabor filter $g_{f,\theta}(x, y)$. Having the filtered image $zg_{f,\theta}(x, y)$, a 40-bin histogram of the image $zg_{f,\theta}$ is then calculated. Thus, using $N_f$ frequencies and $N_\theta$ orientations, $N_f \times N_\theta$ of such histograms are obtained from one vocal fold image. The first two bins and the bins corresponding to those containing less than $N_\beta$ hits in the histogram accumulating all the training images are left aside. We used $N_\beta = 10$ in this study. The remaining bins are concatenated into one long vector $\psi_{T1}$, which was found to be of 552 components. The Gabor-type texture features $\xi_{T1}$ are then given by the kernel principal components. In the co-occurrence matrix based approach, we utilized the 14 well known Haralick's coefficients [7] as a feature set. The coefficients were calculated from the average co-occurrence matrix obtained by averaging the matrices calculated for the 0°, 45°, 90°, and 135° directions. The matrices were computed for one, experimentally selected, distance parameter. Since red colour dominates in the vocal fold images, the $a^*(x, y)$ (red-green) image component has been employed for extracting the co-occurrence matrix based features. The 14 coefficients were collected into a vector $\psi_{T2}$ and the kernel principal components computed as projections of the centered pattern $\tilde{\Phi}(\psi_{T2})$ onto the eigenvectors of the covariance matrix $\tilde{K}_{T2ij}$ were used as the texture features $\xi_{T2}$.
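A minimal Python sketch of the filtering step in Eqs. (3)-(4) is given below; the frequency-domain Gabor filter is assumed to be supplied as an array, arctan2 is used for the hue-angle, and the histogram is taken over the magnitude of the complex filtered image, a detail the paper does not spell out.

```python
import numpy as np

def gabor_texture_histogram(L, a, b, gabor_freq_response, n_bins=40):
    """Filter the complex colour image z = L* exp(j*H_ab) with one Gabor filter
    (given as its frequency response G_{f,theta}) and histogram the result."""
    hue_angle = np.arctan2(b, a)                       # CIE hue-angle H_ab, Eq. (3)
    z = L * np.exp(1j * hue_angle)                     # complex colour representation
    Z = np.fft.fft2(z)
    filtered = np.fft.ifft2(Z * gabor_freq_response)   # Eq. (4)
    hist, _ = np.histogram(np.abs(filtered), bins=n_bins)
    return hist
```

Repeating this for the $N_f \times N_\theta$ filters and concatenating the retained bins would yield the measurement vector $\psi_{T1}$.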
3.3 Geometrical Features
Two geometrical features we use are mainly targeted for discriminating the healthy class from the other two. To extract one of the features, a vocal fold
image is first segmented into a set of homogenous regions. We segment vocal fold images in the concatenated 5-dimensional spatial-range space. There are two dimensions—x, y—in the spatial and three—L∗ a∗ b∗ —in the range space. Two lines, ascending in the left-hand part and descending in the right-hand part of the image are then drawn in such a way as to maximize the number of segmentation boundary points intersecting the lines. Fig. 2 presents two examples of the segmentation boundaries found and the two lines drawn according to the determined directions.
Fig. 2. Vocal fold images coming from the nodular (left) and the healthy (right) classes along with two lines used to calculate the geometrical feature ψG1
The first geometrical feature $\psi_{G1}$ is then given by the squared number of the boundary points intersecting the two lines. The second geometrical feature $\psi_{G2}$ is obtained in the same way, except that colour edge points are utilized instead of the segmentation boundary points. To extract colour edges, we use the following difference vector operators. Let $h(x_0, y_0)$ be a 3-D vector representing the pixel $(x_0, y_0)$ in the L∗a∗b∗ colour space. A gradient in each of the four following directions (0°, 45°, 90°, and 135°) is then obtained as [9]:

$|\nabla g|_{0^\circ} = \| h(x_1, y_0) - h(x_{-1}, y_0) \|$   (5)

$|\nabla g|_{45^\circ} = \| h(x_1, y_{-1}) - h(x_{-1}, y_1) \|$   (6)

$|\nabla g|_{90^\circ} = \| h(x_0, y_1) - h(x_0, y_{-1}) \|$   (7)

$|\nabla g|_{135^\circ} = \| h(x_{-1}, y_{-1}) - h(x_1, y_1) \|$   (8)

where $\|\cdot\|$ stands for the L2 norm. The pixel value $g(x_0, y_0)$ in the gradient image $g(x, y)$ is then set to:

$g(x_0, y_0) = \max(|\nabla g|_{0^\circ}, |\nabla g|_{45^\circ}, |\nabla g|_{90^\circ}, |\nabla g|_{135^\circ})$   (9)
The pixel $(x_0, y_0)$ in the gradient image $g(x, y)$ is assumed to be an edge pixel if $g(x_0, y_0) > g_\alpha$, where $g_\alpha$ is a threshold. As in the case of colour and texture features, the kernel principal component analysis is utilized to transform the two-component vector $\psi_G$ into the vector of principal components $\xi_G$ that is further analyzed by a support vector machine classifier.
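The colour edge detector of Eqs. (5)-(9) could be implemented roughly as in the following Python sketch; border handling via wrap-around and the (row, column) axis convention are our own simplifications.

```python
import numpy as np

def colour_gradient_edges(lab_image, g_alpha):
    """Edge map from the difference vector operators of Eqs. (5)-(9).
    lab_image: (rows, cols, 3) L*a*b* image; g_alpha: gradient threshold."""
    h = lab_image.astype(float)

    def grad(shift_a, shift_b):
        # L2 norm of the difference between two shifted copies of the image
        return np.linalg.norm(np.roll(h, shift_a, axis=(0, 1)) -
                              np.roll(h, shift_b, axis=(0, 1)), axis=2)

    g0   = grad((0, -1), (0, 1))     # Eq. (5): right minus left neighbour
    g45  = grad((1, -1), (-1, 1))    # Eq. (6)
    g90  = grad((-1, 0), (1, 0))     # Eq. (7): lower minus upper neighbour
    g135 = grad((1, 1), (-1, -1))    # Eq. (8)
    g = np.maximum.reduce([g0, g45, g90, g135])   # Eq. (9)
    return g > g_alpha
```

Counting the edge points intersecting the two fitted lines, and squaring that count, would then give the feature $\psi_{G2}$.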
3.4 Pattern Classifier
In this work, we resorted to committee based classification. The support vector machine is used as a committee member. The discriminant function of a two-class—binary—SVM is given by

$f(\xi) = \mathrm{sgn}\left( \sum_{j=1}^{M} \alpha_j^* y_j \kappa(\xi_j, \xi) + b \right)$   (10)
where the threshold $b$ and the parameter values $\alpha_j^*$ are found as a solution to the optimization problem, $\kappa(\xi_j, \xi)$ is a kernel, sgn stands for the sign function, and $y_j$ is a target value ($y_j = \pm 1$). To distinguish between three classes of images, we utilized the one vs one pairwise classification scheme. The following rule has been used to calculate the output value for the $i$th class $y_i(\xi)$—the estimate of the probability of a sample $\xi$ to belong to the class $i$—based on the output values obtained from the binary SVMs:

$y_i(\xi) = \dfrac{\mathrm{card}(S_i(\xi)) \sum_{k \in S_i(\xi)} |y_k(\xi)|}{\sum_{m=1}^{Q} \mathrm{card}(S_m(\xi)) \sum_{k \in S_m(\xi)} |y_k(\xi)|}$   (11)

where $Q$ is the number of classes, $y_k(\xi)$ is the output value of the $k$th binary SVM, $S_i(\xi)$ is the set of binary SVMs that have assigned $\xi$ to the $i$th class, and card stands for the cardinality of the set. A variety of schemes have been proposed for combining multiple classifiers into a committee [10,11,12]. In this work, we explored three ways to aggregate the SVMs into a committee:
1. Aggregation by a linear SVM.
2. Aggregation by a non-linear SVM with a second degree polynomial kernel.
3. Weighted averaging.
Given an image $\xi$, the winning class $k$ is found according to the following rule:

$k = \arg\max_{i=1,\ldots,Q} \sum_{j=1}^{L} w_j y_{ij}(\xi_j)$   (12)
where L stands for the number of classifiers aggregated into a committee, wj is the jth classifier weight, and yij (ξ j ) is given by Eq. (11), where the index j was added to address a feature type. The aggregation weights used in the weighted averaging approach have been found using the Simplex algorithm. When using a meta-classifier—an SVM—to aggregate the outputs of SVMs of the different feature types, the output values yi (ξ) were utilized as input features for the meta-classifier.
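A rough Python sketch of the one-vs-one combination rule (Eq. (11)) and the weighted-averaging committee (Eq. (12)) follows; the sign convention deciding which class a binary SVM votes for, and the function names, are assumptions made for the illustration.

```python
import numpy as np

def class_scores_one_vs_one(pairwise_outputs, pairs, n_classes):
    """Per-class scores from one-vs-one binary SVM outputs, following Eq. (11).
    pairwise_outputs: signed outputs y_k(xi); pairs: (class_a, class_b) per SVM."""
    votes = np.zeros(n_classes)        # card(S_i)
    magnitude = np.zeros(n_classes)    # sum of |y_k| over S_i
    for y_k, (a, b) in zip(pairwise_outputs, pairs):
        winner = a if y_k > 0 else b   # assumed sign convention
        votes[winner] += 1
        magnitude[winner] += abs(y_k)
    scores = votes * magnitude
    return scores / scores.sum()       # normalised as in Eq. (11)

def committee_decision(per_feature_scores, weights):
    """Weighted-averaging committee, Eq. (12): per_feature_scores is an (L, Q)
    array of class scores from the L feature-type SVMs."""
    combined = np.asarray(weights) @ np.asarray(per_feature_scores)
    return int(np.argmax(combined))
```

The weights $w_j$ would be the values found with the Simplex algorithm, one per feature-type SVM.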
4 Experimental Investigations
Based on experimental tests we have chosen to use $N_f = 7$ frequencies and $N_\theta = 6$ orientations for extracting Gabor features. The distance parameter $d$
used to calculate the co-occurrence matrices was found to be $d = 5$. In all the tests, we have used 200 different random ways to partition the data set into training ($D_l$) and test ($D_t$) sets. The mean values and standard deviations of the test set correct classification rate presented in this paper were calculated based on those 200 trials. Out of the 785 images available, 650 images were assigned to the set $D_l$ and 135 to the test set $D_t$.
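A short Python sketch of this repeated random-split evaluation protocol is given below; the train_and_test callback standing in for the actual SVM training and scoring is a placeholder of ours.

```python
import numpy as np

def repeated_holdout_accuracy(features, labels, train_and_test,
                              n_trials=200, n_train=650, seed=0):
    """Mean and standard deviation of the test-set correct classification rate
    over repeated random splits (650 training / 135 test images, 200 trials).
    train_and_test(Xtr, ytr, Xte, yte) must return the correct classification rate."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_trials):
        order = rng.permutation(len(labels))
        tr, te = order[:n_train], order[n_train:]
        rates.append(train_and_test(features[tr], labels[tr], features[te], labels[te]))
    return float(np.mean(rates)), float(np.std(rates))
```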
4.1 Classification Results
We have carried out the classification tests in the original space and in the space obtained based on the kernel principal component analysis. The second order polynomial kernel has been used to extract the principal components. Regarding the classification tests, SVMs with the polynomial kernel of degree one to three have been investigated. By applying cross-validation, the number of the principal components providing the best performance has been determined. Fig. 3 and Fig. 4 illustrate the dependence of the test set correct classification rate upon the number of components utilized for the different kernel degrees, q = 1, 2, 3. The graphs in Fig. 5 plot the test set correct classification rate as a function of the percentage of the data variance accounted for by the number of the kernel principal directions used. As can be seen from the figures, the number of principal components providing the best performance is far below the maximum number of the components available—the number of the training samples (650). However, the percentage of the data variance accounted for by the optimal number of the components is quite close to 100. Similar dependencies have also been obtained for the geometrical and co-occurrence matrix based features.
Fig. 3. The dependence of the test set correct classification rate upon the number of the kernel principal components utilized: (left) colour features, (right) Gabor features
Fig. 4. The dependence of the test set correct classification rate upon the number of the kernel principal components utilized: (left) co-occurrence matrix based features, (right) geometrical features

Fig. 5. The test set correct classification rate as a function of the percentage of the data variance accounted for by the number of the kernel principal directions used: (left) colour features, (right) Gabor features

Table 1 summarizes the test data set correct classification rate obtained from the separate SVMs. In the parentheses, the standard deviation of the correct classification rate is provided. Numbers in the parentheses next to the denotations of the feature types stand for: the size of the original feature sets—the
upper part of Table 1—and the optimal number of the principal components found for the SVM utilizing the first, second, and third degree polynomial kernel, respectively—the lower part of the table. The upper part of Table 1 presents the results obtained in the original feature spaces, while the lower part of Table 1 presents the classification results obtained using the optimal number of the kernel principal components. As can be seen from Table 1, when used alone, the colour features clearly outperformed all the types of features tested, for both the original and transformed spaces. For all the feature types, except the Gabor ones, the classifiers constructed in the transformed spaces provided a higher performance than in the original ones. Table 2 presents the results obtained from the committees, where SVMl stands for SVM with the first degree polynomial (linear) kernel, SVMn means SVM with the second degree (nonlinear) kernel, and WA stands for weighted
Table 1. The average test data set correct classification rate obtained for the different kernel degrees when using a separate SVM for each type of features

N#  Features\Kernel degree        q=1            q=2            q=3
1.  Colour (733)                  82.34 (2.99)   88.40 (2.78)   90.47 (2.44)
2.  Co-occurrence (14)            71.89 (3.49)   74.60 (3.52)   76.63 (3.64)
3.  Gabor (552)                   75.62 (3.53)   80.42 (3.71)   81.12 (3.48)
4.  Geometrical (2)               56.95 (4.05)   59.46 (3.93)   58.02 (4.06)
5.  Colour (200, 208, 125)        86.51 (3.05)   91.73 (2.50)   92.03 (2.42)
6.  Co-occurrence (95, 70, 90)    76.60 (3.68)   78.69 (3.97)   78.44 (3.58)
7.  Gabor (300, 100, 78)          76.24 (3.34)   77.78 (3.41)   79.08 (3.19)
8.  Geometrical (14, 12, 14)      67.05 (3.70)   68.77 (3.87)   68.70 (3.61)
Table 2. The average test data set correct classification rate obtained for the different combination schemes and the different kernel degrees of the binary SVMs

N#  Committee\Degree   q=1            q=2            q=3
1.  SVMl               88.76 (1.99)   91.92 (1.69)   92.03 (1.79)
2.  SVMn               88.74 (1.92)   91.98 (1.67)   92.01 (1.78)
3.  WA                 88.78 (1.99)   92.32 (1.65)   92.45 (1.68)
averaging. From Table 1 and Table 2 it can be seen that the committees considerably reduce the variance of the correct classification rate. When using SVMs with the first degree polynomial kernel, the improvement in the average correct classification rate obtained from the committees is also obvious and statistically significant. For the second and third degree kernels, the average increase in the correct classification rate is not statistically significant. However, the reduction in variance of the rate is considerable. Bearing in mind the high similarity of the decision classes, the obtained over 92% correct classification rate is rather encouraging. The classification results obtained point out that colour is the most significant information source for performing the discrimination. Regarding the aggregation approaches tested, the weighted averaging proved to provide the best performance.
5 Conclusions
This paper is concerned with the kernel-based automated analysis of vocal fold images aiming to categorize the images into the healthy, nodular, and diffuse classes. To obtain a comprehensive representation of the images, features of various types concerning image colour, texture, and pattern geometry are extracted. Amongst the two alternatives tested for extracting texture features, namely the co-occurrence matrices and Gabor filtering, the texture features obtained from
the Gabor filtering proved to be more discriminative when performing the classification in the original space. In the transformed space, however, no significant difference has been found between these two types of representation. When used alone, the colour features provided the highest correct classification rate amongst all the types of features tested. Regarding the aggregation techniques investigated, the weighted averaging proved to provide a slightly higher correct classification rate than that obtained from the SVM based aggregation. A correct classification rate of over 92% was obtained when classifying a set of unseen images into the aforementioned three classes.
References
1. Ohlsson, M.: WeAidU - a decision support system for myocardial perfusion images using artificial neural networks. Artificial Intelligence in Medicine 30, 49–60 (2004)
2. Verikas, A., Gelzinis, A., Bacauskiene, M., Uloza, V.: Towards a computer-aided diagnosis system for vocal cord diseases. Artificial Intelligence in Medicine 36, 71–84 (2006)
3. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
4. Bovik, A.C., Clark, M., Geisler, W.S.: Multichannel texture analysis using localized spatial filters. IEEE Trans. Pattern Analysis and Machine Intelligence 12, 55–73 (1990)
5. Unser, M.: Texture classification and segmentation using wavelet frames. IEEE Trans. Image Processing 4, 1549–1560 (1995)
6. Panjwani, D.K., Healy, G.: Markov random field models for unsupervised segmentation of textured color images. IEEE Trans. Pattern Analysis and Machine Intelligence 17, 939–954 (1995)
7. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Trans. Systems, Man and Cybernetics 3, 610–621 (1973)
8. Verikas, A., Gelzinis, A., Bacauskiene, M., Uloza, V.: Intelligent vocal cord image analysis for categorizing laryngeal diseases. In: Ali, M., Esposito, F. (eds.) IEA/AIE 2005. LNCS (LNAI), vol. 3533, pp. 69–78. Springer, Heidelberg (2005)
9. Zhu, S.Y., Plataniotis, K.N., Venetsanopoulos, A.N.: Comprehensive analysis of edge detection in color image processing. Optical Engineering 38, 612–625 (1999)
10. Verikas, A., Lipnickas, A., Malmqvist, K., Bacauskiene, M., Gelzinis, A.: Soft combination of neural classifiers: A comparative study. Pattern Recognition Letters 20, 429–444 (1999)
11. Verikas, A., Lipnickas, A.: Fusing neural networks through space partitioning and fuzzy integration. Neural Processing Letters 16, 53–65 (2002)
12. Liu, C.L.: Classifier combination based on confidence transformation. Pattern Recognition 38, 11–28 (2005)
Segmentation of the Human Trachea Using Deformable Statistical Models of Tubular Shapes Romulo Pinho, Jan Sijbers, and Toon Huysmans University of Antwerp, Physics Department, VisionLab, Belgium {romulo.pinho,jan.sijbers,toon.huysmans}@ua.ac.be
Abstract. In this work, we present two active shape models for the segmentation of tubular objects. The first model is built using cylindrical parameterization and minimum description length to achieve correct correspondences. The other model is a multidimensional point distribution model built from the centre line and related information of the training shapes. The models are used to segment the human trachea in low-dose CT scans of the thorax and are compared in terms of compactness of representation and segmentation effectiveness and efficiency. Leave-one-out tests were carried out on real CT data.
1 Introduction
Segmentation of the human trachea is useful in the analysis of signs and symptoms of tracheal stenosis and in the calculation and visualization of computational fluid dynamics models of breathing activity. In this work, we propose a method to segment the trachea in CT images of the thorax using active shape models (ASM) [1]. Given the cylindrical nature of the shape of the trachea, a special cylindrical point distribution model (PDM) is built from a set of training images and used later in the search for the trachea in unseen images. We actually propose two methods to build the model. The first one is based on cylindrical parameterization of the surfaces of a training set, herein also called the cylindrical model. The second model is a multidimensional representation of objects which approximates the shapes of the trachea with its centre line and associated information and which we refer to as the skeleton based model. We carry out leave-one-out tests on 11 CT data sets to show that both models can be used for the segmentation of the trachea. A comparison between the two models is also done in order to understand their behaviour and to estimate which of them gives better results. In the following subsection, we present related work from the literature. In Section 2, ASMs are briefly reviewed and the construction and application of the cylindrical model follows straightforwardly. Afterwards, we introduce the skeleton based model and show how it can be used. Section 3 shows the segmentation results obtained from both methods and a comparison between them is made. Section 4 finally discusses the conclusions and points out future applications.
1.1 Previous Work
ASMs have been used in several application fields over the years. Most of the applications, however, concentrate on the segmentation of simple surfaces, with genus-0 topology. Kelemen et al.[4] represented a surface by a set of parametric spherical descriptors. The statistical model and the modes of variation were entirely based on these descriptors. Landmarking was addressed, in the sense that surface correspondence was achieved by similar positioning of the spherical descriptors within the training set. This surface parameterization was converted into a plain point distribution model by a linear piecewise representation of the object. The triangulated model was then used for the matching and surface relaxation constraints were applied during shape deformations. M-Reps [6] have been introduced as a new surface representation. Pizer et al.[7] used them for image segmentation and Hacker et al.[8] built a statistical model of the kidney and adapted it to the VOXEL-MAN project. Despite being represented by a skeleton based version of the objects, the mentioned applications using M-Reps can only handle objects of simple topology. Lorenz et al.[3] built a triangulated template and used this template to solve the correspondence problem by coating all other surfaces with it. User defined landmarks on the template and target surface guided the coating procedure. After coating, the statistical model was built using the vertices of the triangulated shapes as the PDM. Matching was normally performed by optimizing the values of the weights of the modes of variation. However, it is unclear how objects of complex topologies, like the vertebra presented in the results, are handled. Because ASMs are inherently specific to the model domain, their generalization to other domains is not always possible. For instance, a model that captures the statistics of spherical objects will hardly extend to segment tubular shapes. Thus, a specific model for such purpose becomes necessary. There are not many examples of such models in the literature. De Bruijne et al.[5] built a model for cylindrically shaped objects by combining a PDM of the two dimensional slices of MR scans of abdominal aortic aneurysms (AAA) with a PDM of the main axis of the object’s shape. This combination introduced dimensionality incompatibilities to the model, which were solved by the insertion of artificial, redundant landmarks. Correspondences were set manually and artificial modes of variation were also added to the model to cope with the lack of corresponding shapes while building the training set.
2 Methods
In this section, we describe the cylindrical and skeleton based models. Their theoretical basis was proposed in [1,2], which we are briefly going to review in the following subsection.

2.1 Active Shape Models
We are going to explain ASMs in the two dimensional Euclidean space for the sake of clarity, but they can be easily extended to more dimensions, as is done in
this work. A PDM is constructed by outlining the boundary (or other structures of interest) of the shape under consideration with a set of $n$ points. This set of points, called landmarks, must correspond across the set of $N$ training images. After the shapes in the training set are aligned with respect to a coordinate space of reference, the statistics of their variation can be captured. Let $x_i$ be a $2n$ vector describing the $n$ landmarks of the $i$th shape in the training set, $x_i = (x_{i1}, y_{i1}, x_{i2}, y_{i2}, \ldots, x_{in}, y_{in})^T$. The mean shape, $\bar{x}$, and the covariance matrix $S_{2n \times 2n}$ of the training set are computed. Principal Component Analysis is used to extract the main modes of variation from $S$, described by its $p_k$ ($k = 1..N-1$) eigenvectors, grouped as column vectors in a matrix $P_{2n \times (N-1)}$, and corresponding non-negative eigenvalues. The eigenvectors corresponding to the highest eigenvalues represent the most significant modes of variation. Commonly, most of the variation can be explained by a small number, $t$ ($< N$), of eigenvectors. The value $t$ can be chosen to represent a significant proportion of the total variance, $\lambda_T$, where

$\lambda_T = \sum_{k=1}^{N-1} \lambda_k.$   (1)

Combinations of landmark displacements can approximate any shape in the training set by linearly combining the mean shape with a weighted scaling of the matrix of the $t$ eigenvectors:

$x = \bar{x} + Pb,$   (2)

where $b = (b_1, b_2, \ldots, b_t)^T$ represents the weight of each eigenvector. Each $b_i$ usually varies in the range $[-3\sqrt{\lambda_i}, +3\sqrt{\lambda_i}]$. We refer the reader to [2] for a complete description.

Fig. 1. View of the mapping of a cylinder on the surface of a trachea

Matching. In the context of ASMs for image segmentation, matching is the process of finding an object in an image using the statistical modes of variation described above. When the model of landmarks represents the boundary of an object, an iterative algorithm deforms an initial, given shape (e.g. the mean shape) towards the edges of the object in the image. This algorithm shifts each landmark along its corresponding normal and the new landmarks suggest which deformations will be applied to the shape of the current iteration [2]. The set of displacements (one for each landmark) is defined as $dx$, where $dx = (dx_1, dy_1, dx_2, dy_2, \ldots, dx_n, dy_n)^T$. At each iteration, the result from the displacements is compared to the image and a new set of adjustments may be necessary until the algorithm converges. Given that the position of the deformed shape at each step is $x$, the adjusted position is defined as $(x + dx)$. To achieve the new position, a sequence of rotation, scaling and translation is applied to the current state, in order to best approximate $x$ to $(x + dx)$. However, residual displacements may still be required, forcing the landmarks to be moved independently. Thus, the overall adjustment is given
by a combination of rigid and non-rigid deformations, the latter being achieved by updating the model parameter vector $db$. We can represent them by the following approximation:

$dx \approx P\,db,$   (3)

which means that each adjustment can be approximated by a variation of the landmarks along the modes of variation, according to the statistical model. Respecting the limits of each $b_i$, the algorithm stops when no significant change has been made to the current shape.
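A compact Python sketch of the point distribution model of Eqs. (1)-(2) is shown below; it assumes the shapes have already been aligned, and the function names and the variance-fraction argument are illustrative rather than taken from the paper.

```python
import numpy as np

def build_point_distribution_model(shapes, variance_fraction=0.95):
    """Point distribution model from aligned training shapes.
    shapes: (N, 2n) array, each row the concatenated landmark coordinates.
    Returns the mean shape, the matrix P of the t main eigenvectors and
    the corresponding eigenvalues (Eqs. (1)-(2))."""
    x_mean = shapes.mean(axis=0)
    S = np.cov(shapes, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]                 # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum_fraction = np.cumsum(eigvals) / eigvals.sum()
    t = int(np.searchsorted(cum_fraction, variance_fraction)) + 1
    return x_mean, eigvecs[:, :t], eigvals[:t]

def generate_shape(x_mean, P, eigvals, b):
    """x = x_mean + P b, with each b_i clipped to [-3 sqrt(lambda_i), +3 sqrt(lambda_i)]."""
    limit = 3.0 * np.sqrt(eigvals)
    b = np.clip(b, -limit, limit)
    return x_mean + P @ b
```

Constraining each $b_i$ to $\pm 3\sqrt{\lambda_i}$, as in the last function, keeps the generated shapes within the range of plausible variation learned from the training set.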
2.2 Cylindrical Model
The training set of images is segmented using a region growing algorithm [14], but any segmentation algorithm can be used in this step. The reason to choose this algorithm will become clearer in the next sections. The binary images resulting from the segmentation step are then converted to a three dimensional polygonal representation and supplied to the construction of the statistical model, as described in [9]. The segmented tracheas of the training set are then mapped on the surface of the unit cylinder [10]. This parameterization is shown as isoparametric lines on the surface of the trachea, in Figure 1. The parameterized surfaces are aligned using the iterative closest point algorithm (ICP) [11]. After alignment, the choice of landmarks and the establishment of correspondences between the shapes of the training set is done automatically, using minimum description length (MDL) [13]. In practice, the landmarks are set on the surface of the cylinder and are then mapped on the shape of the trachea, using the inverse of the parameterization function. The main modes of variation are extracted from the covariance matrix as described above. The segmentation process starts by manually placing the average shape near the edges we wish to segment (in this case, the edges of the trachea). If necessary, adjustments in scale and orientation can be done. From this point, the method iteratively searches for the highest gradient value along the surface normals, computed at its landmarks. The new landmarks indicate how the current shape should be deformed in order to achieve the best match to the desired edges. This process is repeated until convergence is achieved.
Fig. 2. Approximating shapes of the trachea (on the left of each picture) with piecewise cylindrical objects (on the right)
2.3 Skeleton Based Model
In this section, we propose a new method to segment images with tubular objects using a multidimensional PDM. The idea is to represent the shape solely through information associated with its centre line, or skeleton. From the same region growing algorithm used in the previous section, an approximation of the skeleton of the trachea can be easily obtained. This approximation is represented by a number of points connected by straight lines. Each of these points holds information about the local orientation and diameter of the trachea (actually two diameters, representing an ellipse). The starting point of the centre line coincides with the beginning of the trachea in the segmented image set. The last point is the first bifurcation point that branches the trachea into its two primary divisions, the bronchi. The piecewise linear representation of the centre line is resampled using arclength parameterization and is subdivided into n ˆ new points. The resampling function also computes the values of the associated information (orientation and diameters) for every new point. A rendering of this representation produces piecewise cylindrical objects, as can be seen in Figure 2, whose shapes approximate the shapes used in the method described in Section 2.2. The PDM is thus built with the points obtained from the parameterization of the centre line and their related information. The correspondence within the training set is automatically achieved, because the points represent precise anatomical information of the trachea. Since this model is clearly an approximation of the previous one, it is intuitive to think that not all the variability of the trachea can be captured. This means that fewer eigenmodes are responsible for the same fraction of the total variance of the statistical model. Therefore, it might be necessary to increase the total
variation of the skeleton based model, $\hat{\lambda}$, in order to achieve variations similar to those of the previous model. During matching, the skeleton based representation of the shape needs to be converted to a point representation in three dimensions, because these points will actually be used to search for high gradients along the normals of the shape. The displacements need to be converted back to the multidimensional representation, meaning updates in position, diameters and orientation of the local cylinders. These updates will, as before, suggest changes in the statistical model parameters and a new shape will be created at each step, until the method converges.
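One plausible way to implement the arc-length resampling of the centre line and its associated information is sketched below in Python, with n̂ = 40 as used later in the paper; linear interpolation of the attributes (diameters and orientation components) is an assumption, and interpolated orientation vectors would normally be renormalised afterwards.

```python
import numpy as np

def resample_centre_line(points, attributes, n_hat=40):
    """Arc-length resampling of a piecewise-linear centre line into n_hat points.
    points: (m, 3) centre-line coordinates; attributes: (m, k) per-point values
    (e.g. the two local diameters and the orientation vector), interpolated along
    with the positions."""
    seg_lengths = np.linalg.norm(np.diff(points, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg_lengths)])       # cumulative arc length
    s_new = np.linspace(0.0, s[-1], n_hat)
    new_points = np.column_stack([np.interp(s_new, s, points[:, d]) for d in range(3)])
    new_attributes = np.column_stack([np.interp(s_new, s, attributes[:, d])
                                      for d in range(attributes.shape[1])])
    return new_points, new_attributes
```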
3 Results
We compared the two methods with respect to different criteria, namely compactness of representation, convergence time and segmentation accuracy. Leave-one-out experiments were carried out on 11 low-dose CT image sets of the trachea. The segmented images of the training set are O(512 × 512 × 150) in size while the test image set is O(512 × 512 × 500), corresponding to a CT scan of the whole human thorax. Later in this section, we will derive conclusions from these comparisons and will discuss the pros and cons of each model. Compactness. The dimensionality of the cylindrical model is $D = O(3n)$. In the skeleton based model, much less information is needed to achieve similar results. Each landmark is represented by a tuple consisting of its three dimensional coordinates, two local trachea diameters, and a three dimensional vector describing the orientation of the local cylinder relative to the unit vector coinciding with the positive z direction (which is the axial orientation of the CT image set). This gives a dimensionality $\hat{D} = O(8\hat{n})$, which seems to be much higher than the previous one, but in fact $\hat{n} \ll n$, to represent approximately the same amount of information. In our experiments $\hat{n} = 40$, while $n = 1000$. This makes $\hat{D} = 8 \times 40 = 320$ and $D = 3 \times 1000 = 3000$, meaning in general that at least the computation of the covariance matrix uses much less memory space in the skeleton based model and runs much faster. On the other hand, there is more variability captured by the cylindrical model, meaning that the variance spreads over more eigenmodes. More eigenmodes enable more deformations and potentially increase the accuracy of the segmentation algorithm. In order to have the same main modes of variation, the total variance $\hat{\lambda}$ of the skeleton based model must be increased. This increase, however, just makes the two models comparable in terms of computational complexity. In our experiments, $\lambda = 95\%$ and $\hat{\lambda} = 99\%$ of the total variation of their respective models, corresponding to the 4 main modes of variation in each model, which are shown in Figures 3 and 4. Convergence. This is a difficult measure because convergence is very much dependent on the initial estimation of the segmented shape and the methods usually behave differently for the same initial estimation. Yet, we computed the number of iterations necessary for each method to converge and used this measure as a hint for convergence performance. We also computed how the shape
Fig. 3. The 4 main modes of variation in each model. Left: cylindrical model. Right: skeleton based model. A coarser display of the skeleton based model reveals the variations of the centre line and of the diameters separately.
Fig. 4. The 4 main modes of variation and their average contributions to the total variance in each model
Fig. 5. Convergence graphs. The horizontal axis represents the test instances in our leave-one-out experiments. Left: number of iterations. Right: deviations from mean after convergence.
obtained after the convergence differs from the mean shape, as a square error between them. Despite being a somewhat subjective measure, high errors generally correlate to higher convergence times. Graphs of convergence comparison can be seen in Figure 5. Accuracy. In order to measure the accuracy of the segmentations, each of the test image sets was also segmented using the same region growing algorithm we used before. The final deformed shape was converted to a binary image representation and an XOR operation between the two was performed. The fewer white voxels that remain after the XOR operation, the better the segmentation. Comparisons between the results are presented in Figure 6 and a three dimensional view of the matching process is shown in Figure 7.
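As a small illustration of this error measure (the ratio used in Fig. 6), a Python sketch follows; the exact normalisation used by the authors is the one described in the figure caption, and the function name is ours.

```python
import numpy as np

def segmentation_error(asm_mask, reference_mask):
    """Fraction of wrongly labelled white voxels: XOR between the ASM result and
    the region-growing reference, divided by the white voxels of the reference."""
    asm = asm_mask.astype(bool)
    ref = reference_mask.astype(bool)
    return np.logical_xor(asm, ref).sum() / ref.sum()
```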
Fig. 6. Accuracy graph. The vertical axis is the segmentation error, represented by the ratio between the number of wrong white voxels and the total number of white voxels in the original image. The horizontal axis represents the test instance in our leave-one-out experiments.
Fig. 7. 3D visualization of the matching process, from initial estimation to convergence. Top: cylindrical model. Bottom: skeleton based model.
Conclusions. The fact that the skeleton based model converges faster in most cases, as shown in Figure 5, can be expected because the piecewise cylindrical shape itself does not have much intra-variability. Consequently, the error from the current iteration to the previous tends to be minimized. The skeleton based model, however, seemed to be more sensitive to noise, which caused it to take much longer to converge in one test instance, but further investigation is still necessary. Nevertheless, deviations from the mean remained comparable among the test instances. The accuracy graph in Figure 6 shows that the average error was around 30%. Given the inherent inaccuracies of ASMs, they are usually combined with other free-form deformation algorithms, such as snakes [12], to refine the segmentation. This makes the computed errors acceptable. In test 3, however, both methods
Fig. 8. Problems when representing a complex shape with its skeleton based version. The picture on the left shows the surface of the trachea, viewed from different angles. The picture on the right shows the skeleton based representation, viewed from the same positions. Note that the concave region of the trachea (picture on the left) can not be represented with the skeleton based approximation.
showed very poor performance. The test shape in this case represents an image set of an expiration CT scan. During expiration, the trachea shrinks while the air is breathed out, forming a concave shape. All other shapes in the training set are roughly convex. Therefore, no combination of eigenmodes could make the average shape deform to a concave shape. The problem was worse with the skeleton model, as can also be seen in the graph, because shapes are represented as piecewise cylindrical objects. In the other tests, the concave shape was part of the training set. Thus the concave variation could be acquired, but the skeleton based model still had problems, for the same reasons described above. Figure 8 depicts the situation, where the concave shape and its skeleton based equivalent are viewed from different angles. It is clear that the skeleton based model can not represent the concavities correctly. We conclude that the choice of which model to use is very much dependent on the application. Therefore, depending on the complexity of the shape being segmented, the restrictions on the initial shape estimation can be relaxed. In the case of the trachea, the results of the skeleton based model seemed to be good enough to let it be the method of choice.
4 Discussion and Future Directions
It is difficult to compare the methods presented in this work with other statistical models of tubular shapes, because not much work has been reported
on this specific topic. However, given the characteristics of other methods described in Section 1.1, we believe that the models presented here provide a more straightforward approach to the statistical representation and segmentation of cylindrically shaped objects. To increase the quality of the segmentation, more robust methods of landmark displacement during matching can be used. For instance, a gray-level profile around each landmark can be defined and added to the model. The search then looks for voxels that have the closest match to that profile, and the suggested landmark can be moved to that new position. This can diminish the problem of local minima usually found in gradient search strategies. Other methods to obtain the centre line can also be used. The use of the image skeleton, however, will only give one distance value per pixel, locally approximating the surface by a circle, instead of the ellipse used in our experiments. An improvement could be the use of the extreme points that define the diameters of the ellipse as control points of a parametric curve. This would increase the dimensionality of the skeleton based model to $\hat{D} = O(10\hat{n})$, but would still be more compact than the cylindrical model. The local directions at the landmarks could also be represented with polar coordinates, instead of the used three dimensional vector, reducing one dimension of the model. Despite being robust to the initial estimation to a certain level, the methods can also be combined with techniques to automatically search for the best initialization. This potentially makes the segmentation process fully automatic. Extensions of these models to more complex shapes based on cylinders are also under study at this moment. We believe, for instance, that they can be used in the segmentation of shapes consisting of hierarchical combinations of cylinders, such as airway trees.
Acknowledgments We would like to thank the Radiology Department of the University Hospital of the University of Antwerp for providing the CT images used in our experiments. This work was financially supported by the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen) and the Fund for Scientific Research (F.W.O)-Flanders, Belgium.
References
1. Cootes, T.F., Taylor, C.J.: Active Shape Models - 'Smart Snakes'. In: Proceedings of British Machine Vision Conference, pp. 266–275. Springer-Verlag, Heidelberg (1992)
2. Cootes, T., Taylor, C., Cooper, D., Graham, J.: Active shape models: their training and application. Comput. Vis. Image Understand 61, 18–23 (1995)
3. Lorenz, C., Krahnstover, N.: 3D Statistical Shape Models for Medical Image Segmentation. In: Proceedings of the Second International Conference on 3-D Imaging and Modeling (3DIM '99), pp. 414–423. IEEE Computer Society, Los Alamitos (1999)
4. Kelemen, A., Székely, G., Gerig, G.: Three-dimensional Model-based Segmentation of Brain MRI. In: Proc. IEEE Workshop on Biomedical Image Analysis, pp. 4–13. IEEE Computer Society Press, Los Alamitos (1998)
5. de Bruijne, M., Ginneken, B., Viergever, M., Niessen, W.: Interactive Segmentation of Abdominal Aortic Aneurysms in CTA Images. Medical Image Analysis 8(2), 127–138 (2004)
6. Pizer, S.M., Thall, A.L., Chen, D.T.: M-Reps: A New Object Representation for Graphics. Technical Report, University of North Carolina, Chapel Hill (1999)
7. Pizer, S.M., Fletcher, P.T., Joshi, S., Thall, A., Chen, J.Z., Fridman, Y., Fritsch, D.S., Gash, A.G., Glotzer, J.M., Jiroutek, M.R., Lu, C., Muller, K.E., Tracton, G., Yushkevich, P., Chaney, E.L.: Deformable M-Reps for 3D Medical Image Segmentation. Int. J. Comput. Vision 55(2-3), 85–106 (2003)
8. Hacker, S., Handels, H.: Representation and Visualization of Variability in a 3D Anatomical Atlas Using the Kidney as an Example. In: Cleary, K.R., Galloway, R.L. (eds.) Proc. SPIE Medical Imaging 6141: Visualization, Image-Guided Procedures, and Display, pp. B1–B7 (2006)
9. Huysmans, T., Sijbers, J., Vanpoucke, F., Verdonk, B.: Improved Shape Modeling of Tubular Objects Using Cylindrical Parameterization. In: Yang, G.-Z., Jiang, T., Shen, D., Gu, L., Yang, J. (eds.) Medical Imaging and Augmented Reality. LNCS, vol. 4091, pp. 84–91. Springer, Heidelberg (2006)
10. Huysmans, T., Sijbers, J., Verdonk, B.: Parameterization of tubular surfaces on the cylinder. Journal of the Winter School of Computer Graphics 13(3), 97–104 (2005)
11. Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14(2), 239–256 (1992)
12. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988)
13. Davies, R.H., Cootes, T.F., Taylor, C.J.: A Minimum Description Length Approach to Statistical Shape Modelling. In: Insana, M.F., Leahy, R.M. (eds.) IPMI 2001. LNCS, vol. 2082, pp. 50–63. Springer, Heidelberg (2001)
14. Pinho, R., Sijbers, J., Vos, W.: Efficient approaches to intrathoracic airway tree segmentations. In: Proceedings of the Biomedical Engineering IEEE/EMBS Benelux Symposium, Belgium, vol. 2, pp. 151–154. IEEE, Los Alamitos (2006)
Adaptive Image Content-Based Exposure Control for Scanning Applications in Radiography Helene Schulerud1, Jens Thielemann1, Trine Kirkhus1, Kristin Kaspersen1, Joar M. Østby1, Marinos G. Metaxas2, Gary J. Royle2, Jennifer Griffiths2, Emily Cook2, Colin Esbrand2, Silvia Pani2,8, Cristian Venanzi2, Paul F. van der Stelt3, Gang Li3, Renato Turchetta4, Andrea Fant4, Sergios Theodoridis5, Harris Georgiou5, Geoff Hall6, Matthew Noy6, John Jones6, James Leaver6, Frixos Triantis7, Asimakis Asimidis7, Nikos Manthos7, Renata Longo8, Anna Bergamaschi8, and Robert D. Speller2 1
SINTEF, PB 124 Blindern, 0314 Oslo, Norway
[email protected] 2 University College London, Department of Medical Physics & Bioengineering, U.K. 3 University of Amsterdam, Academic Centre for Dentistry, The Netherlands 4 Rutherford Appleton Laboratory, CCLRC, Oxfordshire, U.K. 5 University of Athens, Department of Informatics & Telecommunications, Greece 6 Imperial College, Department of Physics, U.K. 7 University of Ioannina, Department of Physics, Greece 8 University of Trieste, Department of Physics, Italy
Abstract. I-ImaS (Intelligent Imaging Sensors) is a European project which has designed and developed a new adaptive X-ray imaging system using on-line exposure control, to create locally optimized images. The I-ImaS system allows for real-time image analysis during acquisition, thus enabling real-time exposure adjustment. This adaptive imaging system has the potential of creating images with optimal information within a given dose constraint and of acquiring optimally exposed images of objects with variable density during one scan. In this paper we present the control system and results from initial tests on mammographic and encephalographic images. Furthermore, algorithms for visualization of the resulting images, consisting of unevenly exposed image regions, are developed and tested. The preliminary results show that the same image quality can be achieved at 30-70% lower dose using the I-ImaS system compared to conventional mammography systems. Keywords: Adaptive X-ray imaging, mammography, encephalographic imaging, image correction.
1 Introduction

In medical X-ray imaging there is a strong focus on acquiring images of sufficiently high quality at lowest possible dose. Higher dose increases the contrast, in general, and hence the image quality. Automatic exposure control (AEC) is often used to estimate the optimal exposure settings and is an important feature in mammography
and other radiographic systems. It enables consistent image exposure despite variations in e.g. tissue density and thickness, and user skill level. Existing AEC systems [1-3] are based on either a low-exposure prescan or on applying characteristic parameters of the object (e.g. size) together with predefined sets of standard exposure profiles. In both cases, one set of exposure parameters is selected and the object is exposed evenly. Digital images may be acquired using either an area detector or a linear detector. In the latter case, a scan is implemented by moving the detector under the sample. Using a linear detector enables adaptive exposure settings over the image. In a recent paper by Åslund et al. [4] an AEC scan system was proposed, where the information from the leading detector line adjusts the scan velocity during the scan. Hence, an adaptive exposure in the scan direction is achieved. The I-ImaS (Intelligent Imaging Sensors) system optimizes the exposure in each image region, in both the scan direction and across the scan direction, during the scan. The objective of the I-ImaS project is to design and develop a new adaptive X-ray imaging system to acquire locally optimized images. The system is based on two linear arrays of detectors. The system uses Active Pixel Sensors (APS) and FPGAs, which allow real-time analysis of data during image acquisition, thus enabling real-time exposure adjustment by a suitable control system. In this study we have designed a simple steering algorithm and performed initial tests on mammographic and encephalographic images. Furthermore, algorithms for visualization of the resulting images, consisting of unevenly exposed image regions, are developed and tested.
2 Description of the System

The I-ImaS system is based on a step-and-shoot line scanning technology where scattered radiation is effectively reduced, thus lowering dose and improving image quality. The system consists of a dual-in-line scanning system using an X-ray source and X-ray sensitive APS sensors. The dual-in-line detector arrays will scan across the object. The object is first imaged by the front line detector, the so-called “scout scan”, using a constant low dose. The data from the scout scan are analyzed automatically in real-time in order to determine the correct exposure for the second scan, the so-called “I-ImaS scan”. The line detector consists of 10 aligned sensors. Each sensor contains 512 by 32 pixels with a spatial resolution of 32 × 32 μm. The two line sensors will be approximately 1 cm apart. Fig. 1 gives a schematic view of the dual sensor lines. The X-ray beam for the second row of sensors will be modulated according to a pre-determined model, in order to expose regions depending on the locally measured features. This modulation will be achieved through a series of wedge filters that can be inserted into the beam of the I-ImaS scan line. The size of the wedge filters corresponds to the sensor size. Hence, the I-ImaS sensor will have the capability of responding in real-time to changing conditions during image acquisition. Finally, the scout scan and the I-ImaS scan are combined into a full field image (the I-ImaS image). In addition, an image of the selected exposure parameters for each region is given. Fig. 2 gives a schematic view of the imaging system. Characterization of the developed I-ImaS sensors is described in [5].
Fig. 1. Schematic view of the dual sensor lines
Fig. 2. Schematic view of the I-ImaS imaging system
3 Material

The steering algorithms and the image correction algorithms are developed and tested on mammographic and encephalographic images.

Mammography images. The image material consists of a series of mammographic images made with a Siemens Mammomat B and a dPix Flashscan 30 flat panel array. The system was set up to simulate clinical conditions. The 300 micron focal spot was used with a 60 micron molybdenum filter. The X-ray source target was molybdenum.
The tissue consisted of excised full breast slices approximately 1 cm thick. The slices were known to contain both normal and pathological tissue. The tissue was fixed and contained in polythene bags. To simulate a full compressed breast each tissue section was imaged with additional Perspex sheets. Two tissue slices were made up to an equivalent thickness of 20 mm and another two to 40 mm. Each tissue section was then imaged over a range of exposure values. An ionization chamber was placed next to the tissue to ensure that the Mammomat was correctly exposing the tissue. The exposure parameters used were 28, 30, 35 kV and 8, 16, 25, 40, 64, 80, 100 and 125 mAs. The total number of training images obtained was 98. The possibility of significant artifacts from the plastic bagging was investigated and considered to be negligible.

Cephalometric dental imaging. A series of cephalographic images were recorded employing the imaging system Orthopantomograph OC-100D, which has a nominal spot size of 0.5 mm, an aluminum filter of 2.5 mm and a CCD sensor with a pixel size of 90 × 90 μm². The test object used was a dry human skull covered with soft tissue equivalent plastic. Due to the limitation of the X-ray machine allowing only certain kV-mA-sec combinations, 18 exposure alternatives were selected. The images were acquired at kV between 60 and 85 and mAs values between 25 and 240. During the exposing procedure, the phantom was fixed at the same position. To reduce measurement errors (noise, drift and variation of the X-ray equipment etc.) 10 images were exposed at each of the 18 exposure conditions. Thus, a total of 180 cephalographic images were obtained. All the images were randomly exposed to reduce the influence of possible systematic time related effects of the X-ray equipment.
4 Steering Algorithm

The main objective of the first prototype of the linear steering algorithm is to control the exposure parameters adaptively over the image in order to locally optimize the image quality with respect to the dose. Consequently, the scout scan is used to predict the exposure parameters necessary for the I-ImaS scan. For digital systems, the densest region has the lowest signal-to-noise ratio and hence fundamentally the lowest image quality. To be able to minimize the dose and avoid underexposures, we want the steering algorithm to control the exposure based on the transmission of the densest part in the region. To attain this, we use the maximum value within a region to control the exposure in the I-ImaS scan line. The region size is defined by the sensor size. In order to get a robust estimate of the maximum value, we have chosen the 95 percentile of the grayscale value within each region. The steering algorithm needs a model of how the sensor behaves for varying mAs values in order to predict the exposure parameters for the I-ImaS scan line. A sensor intensity response model for different mAs settings has therefore been developed. Given that we have an observed gray level G with mAs setting D, the model M allows us to predict what the gray level G' would be if the same object had been observed
with mAs setting D’. Training data, consisting of images taken of the same object with different exposure parameters, were used to establish the model M. In this system, the scout scan gives us D and Gp95, Gp95 being the 95 percentile of the region. The model can then be used to choose a D' such that G'p95 will be above a preset threshold T. This threshold is chosen such that a sufficient SNR is attained. The number of possible exposure settings will depend on the system; in this case six settings are used. Fig. 3 shows examples of the resulting mAs maps which indicate the extra radiation needed for each region in addition to the resulting I-ImaS images.
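The exposure decision described above might look roughly like the following Python sketch; the response model, the discrete set of exposure levels (six in the prototype) and the fallback behaviour are placeholders for components the paper does not specify in detail.

```python
import numpy as np

def choose_exposure(scout_region, scout_mAs, response_model, exposure_levels, threshold):
    """Pick the lowest available mAs setting D' whose predicted 95th percentile
    G'_p95 exceeds the preset threshold T, using a sensor response model
    M(G, D, D') -> G'. Both the model and the level set are placeholders."""
    g_p95 = np.percentile(scout_region, 95)
    for d_prime in sorted(exposure_levels):
        if response_model(g_p95, scout_mAs, d_prime) >= threshold:
            return d_prime
    return max(exposure_levels)   # fall back to the highest available setting
```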
Fig. 3. Examples of mAs value images (top) and resulting regulated images before visual correction (bottom) for mammographic (left) and cephalographic images (right)
5 Image Correction

The resulting image has varying intensity due to different exposure settings in different image regions. In order to facilitate visual inspection of the image, we want to normalize the gray levels so that we can visualize the acquired image (with different mAs values) as an evenly exposed image, with constant mAs. To achieve this we use the regulated image's mAs map and look up the corresponding pixel intensity values from a predefined model. This gives a global correction of the intensity values. Since there is some deviation from the global model, an additional local correction is performed.
5.1 Global Adjustment
In order to find the relationship between the gray values and the exposure parameter mAs, we analyzed the intensity levels of the same object acquired with different mAs values (see Fig. 4). Fig. 4 shows that there is a linear relation between the gray levels measured with two different mAs values. Therefore we made a correction model fitting a 1st order polynomial (y = Ax + B). Training data were used for estimating the correction parameters for all possible mAs combinations. For the resulting I-ImaS images all pixels within all regions are transformed to the same reference mAs value. The reference mAs value is set to the highest applied mAs value. The estimated correction model parameters vary to some degree depending on the training data. Fig. 5 shows the model parameters (A, B) as a function of different mAs values and training data. Results of the global correction algorithm applied to dependent and independent training data are shown in Fig. 6.
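A hedged Python sketch of this global correction follows: the (A, B) pair for each mAs value is fitted by least squares from training image pairs and then applied region-wise using the mAs map; the function names and the dictionary layout are our own.

```python
import numpy as np

def fit_global_correction(image_low, image_ref):
    """Least-squares fit of the linear model G_ref = A * G_low + B from a pair of
    training images of the same object taken at two different mAs settings."""
    A, B = np.polyfit(image_low.ravel(), image_ref.ravel(), deg=1)
    return A, B

def apply_global_correction(image, mAs_map, models):
    """Transform every region to the reference mAs using the fitted (A, B) pairs.
    models: dict mapping an mAs value to its (A, B) correction towards the reference."""
    corrected = image.astype(float).copy()
    for mAs_value, (A, B) in models.items():
        mask = (mAs_map == mAs_value)
        corrected[mask] = A * corrected[mask] + B
    return corrected
```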
Fig. 4. Left: Gray level intensities for the same object measured with mAs equal to 6.4 and 40. Right: Gray level intensities for the same object measured with mAs equal to 40 and 50.
Fig. 5. Parameter A (left) and parameter B (right) as function of mAs and different training data
Fig. 6. Results of global image correction. Upper left: Resulting I-ImaS image without any correction. Upper right: Globally corrected image. Lower left: Globally corrected image with correction parameters estimated from independent training data. Lower right: Original image from the data base with the same reference mAs value.
5.2 Local Intensity Adjustments

The main variation of the gray level intensities at different mAs values can be modeled as a linear relationship, but there is some deviation from this model due to the nonlinear response of the sensor and to different absorption in different objects, as shown in Fig. 5 and Fig. 6. Therefore an additional local correction was performed. We assume that neighboring pixels have the same intensity values and that the correction model is linear. Using the mAs image map given by the I-ImaS system, we find all transitions from one mAs value to another. We then select the area with the highest mAs as the reference and adjust all the other mAs areas toward this reference value. All transition rows from one mAs value to another are used to find the local linear transformation parameters between the two regions. Fig. 7 shows the result of global correction and of both global and local correction. In addition to the local correction, the user can choose to run an mAs edge smoothing filter to further improve the visual quality of the image. This means that the pixels at, and adjacent to, the mAs transitions are smoothed with a mean filter to blur out any remaining intensity shifts at the transitions. We have chosen default filter
sizes of 3×3 pixels for horizontal edges and 5×5 pixels for vertical edges. Smoothing kernels of 3×3 and 5×5 pixels can result in loss of spatial information smaller than 100×100 μm, which is about 1/3 of the size of the smallest microcalcification considered clinically significant today.
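The local step can be sketched as follows (Python/NumPy/SciPy assumed; for brevity the local linear transform is reduced to a gain estimated from the pixels bordering a transition, rather than the full fit over all transition rows described above):

    import numpy as np
    from scipy.ndimage import uniform_filter

    def local_adjust(image, low_side, high_side):
        # Scale the lower-mAs side so that the pixels bordering the transition
        # match the higher-mAs (reference) side on average.
        gain = image[high_side].mean() / max(image[low_side].mean(), 1e-6)
        out = image.astype(float).copy()
        out[low_side] *= gain
        return out

    def smooth_transitions(image, edge_mask, size=3):
        # Mean-filter only the pixels at, and adjacent to, mAs transitions
        # (3x3 for horizontal edges, 5x5 for vertical edges in our setup).
        blurred = uniform_filter(image.astype(float), size=size)
        out = image.astype(float).copy()
        out[edge_mask] = blurred[edge_mask]
        return out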
Fig. 7. Left: Globally corrected image. Right: After global and local correction.
6 Results

The performance of the I-ImaS system was simulated on mammographic images. The resulting images (I-ImaS images) were visually adjusted using both global and local correction. The resulting I-ImaS images were evaluated together with reference images by two experienced radiologists in a blind test at Ullevål University Hospital. The reference image was taken with uniform exposure parameters at the highest dose selected in the I-ImaS image. The images were evaluated under normal working conditions for the radiologists. The evaluation scores used were: -1 for not adequate, 0 for adequate, +1 for good and +2 for excellent. One of the radiologists evaluated the overall quality of all the I-ImaS and reference images to be equal, while the other radiologist evaluated the I-ImaS images to be better than or equal to the reference images. All the images were evaluated as having adequate or good overall quality.

Table 1. Surface dose of the regulated I-ImaS image compared to the reference image, and the dose savings of the regulated images
Sample   Regulated image (mGy)   Reference image (mGy)   Dose savings (%)
U01      2.3                     7.7                     70
U02      1.8                     5.0                     64
U03      38.8                    65.3                    41
U04      53.3                    79.6                    33
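The dose savings column follows directly from the two dose columns; for instance (a minimal check in Python, using the values of Table 1):

    # saving (%) = (reference - regulated) / reference * 100
    doses = {"U01": (2.3, 7.7), "U02": (1.8, 5.0), "U03": (38.8, 65.3), "U04": (53.3, 79.6)}
    for sample, (regulated, reference) in doses.items():
        print(sample, round(100.0 * (reference - regulated) / reference))
    # U01 70, U02 64, U03 41, U04 33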
The surface dose (mGy) was measured for all images acquired. The surface doses of the reference images and the I-ImaS images are shown in Table 1. The dose savings of the I-ImaS images relative to the reference images were estimated to be between 30% and 70%.

6.1 Preliminary I-ImaS Image Results

A pre-prototype I-ImaS system, consisting of one sensor (32×512 pixels) and without online exposure control, has been developed. Fig. 8 shows three images of a piece of jaw and three teeth, where the middle tooth has small ball-bearing positioning markers attached. The images were formed by scanning the object in the horizontal direction. 56 image frames have been stitched together to form the full image, using off-line image content-based exposure control. Fig. 8 also shows the selected exposure parameters for each region, where black means no additional exposure and white means 100% additional exposure. The results from the pre-prototype system show that the I-ImaS system is capable of producing images of good diagnostic quality at a lower dose.
Fig. 8. Image of a piece of jaw and three teeth (containing ball-bearing positioning markers). The image has been formed by stitching together 56 individual frames from a step-and-shoot single sensor acquisition. Left: A conventional evenly exposed image at 100% dose. Middle: An I-ImaS regulated image at 71% dose. Right: An I-ImaS regulated image at 52% dose. The gray stripes on the left show the selected exposure parameters for each region.
7 Summary and Future Work

In this paper we have described a new adaptive X-ray imaging system using online exposure control to create locally optimized images. The online beam optimization
enables improved image quality within an acceptable delivered dose. We have implemented a steering algorithm using the 95th percentile to ensure a sufficient signal-to-noise ratio over the whole image. Hence, image acquisition parameters are unique to specific regions of the object, which results in optimal contrast resolution for a given dose. An algorithm for visualization of the resulting I-ImaS image, with regionally uneven intensities, has been developed and tested. Simulated I-ImaS mammography images were evaluated in a blind test by two experienced radiologists at Ullevål University Hospital. The I-ImaS images were evaluated as having the same or better overall quality than the reference images, despite being taken at a lower dose. The dose savings of the I-ImaS images relative to the reference images were estimated to be between 30% and 70%. A prototype of the complete I-ImaS imaging system, with online image content-based exposure control, has now been constructed and is currently undergoing acceptance testing and commissioning. The focus for future work on the steering algorithm will be to detect interesting regions within objects in real time. For mammography, this includes detecting diagnostic features such as micro-calcifications to improve the image quality in diagnostically important areas. Adaptive imaging systems have initially been constructed for mammography and dental cephalography; however, the aim is that future development will lead to systems tailored to other types of radiography, and other applications such as industrial quality assessment and security.
Acknowledgement

The authors would like to thank Per Skaane and Kari Young, at the Ullevål University Hospital of Norway, for evaluation of the mammography images. The I-ImaS (Intelligent Imaging Sensors) project has been funded by the European Commission under the Sixth Framework Program, Priority 3 NMP “New generation of sensors, actuators and systems for health, safety and security of people and environment”, with contract no. NMP2-CT-2003-505593.
References

[1] Plewes, D.B., Vogelstein, E.: A scanning system for chest radiography with regional exposure control: Practical implementation. Medical Physics 10(5), 655–663 (1983)
[2] Elbakri, I.A., Lakshminarayanan, A.V., Tesic, M.M.: Automatic exposure control for a slot scanning full field digital mammography system. Medical Physics 32(9), 2763–2770 (2005)
[3] Eraso, F.E., Ludlow, J.B., Platin, E., Tyndall, D., Phillips, C.: Clinical and in vitro film quality comparison of manual and automatic exposure control in panoramic radiography. Oral Surgery, Oral Medicine, Oral Pathology, Oral Radiology and Endodontics 87(4), 518–523 (1999)
[4] Åslund, M., Cederström, B., Lundqvist, M., Danielsson, M.: AEC for scanning digital mammography based on variation of scan velocity. Medical Physics 32(11), 3367–3374 (2005)
[5] Griffiths, J.A., et al.: A Multi-Element Detector System for Intelligent Imaging: I-ImaS. IEEE Nuclear Science Symposium & Medical Imaging Conference 4, 2554–2558 (2006)
Shape Extraction Via Heat Flow Analogy

Cem Direkoğlu and Mark S. Nixon

Department of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK
{cd05r,msn}@ecs.soton.ac.uk
Abstract. In this paper, we introduce a novel evolution-based segmentation algorithm by using the heat flow analogy, to gain practical advantage. The proposed algorithm consists of two parts. In the first part, we represent a particular heat conduction problem in the image domain to roughly segment the region of interest. Then we use geometric heat flow to complete the segmentation, by smoothing extracted boundaries and removing possible noise inside the prior segmented region. The proposed algorithm is compared with active contour models and is tested on synthetic and medical images. Experimental results indicate that our approach works well in noisy conditions without pre-processing. It can detect multiple objects simultaneously. It is also computationally more efficient and easier to control and implement in comparison to active contour models.
1 Introduction

There are two main types of image segmentation methods that evolve to the target solution: active contours and region growing techniques. We first review these techniques with special consideration of their advantages and practical limitations. We then describe techniques which are based on the use of the heat flow analogy, including the proposed model and its advantages as a segmentation technique.

1.1 Related Works

Active contours (snakes) are curves that evolve to recover object shapes. Active contours can be classified as Parametric Active Contours (PAC) and Geometric Active Contours (GAC). The first PAC model was introduced by Kass et al. [1]. In this, segmentation is achieved by using the gradient vectors of an edge map. Problems associated with this model are initialization and poor convergence to concave regions. These problems were largely solved with the development of a new external force model, which is called Gradient Vector Flow (GVF) [2]. GVF is computed as a diffusion of the gradient vectors of an edge map. However, PAC models can have difficulty with simultaneous detection of multiple objects, because of the explicit representation of the curve. To solve this problem, GAC models have been introduced, where the curve is represented implicitly by a level set function. Caselles et al. [3] and Malladi et al. [4] proposed the first GAC model, which uses gradient based information for
segmentation. The gradient based GAC can detect multiple objects simultaneously but it has other important problems: boundary leakage, noise sensitivity, computational inefficiency and complexity of implementation. In [5], the gradient based information has been improved to address the boundary leakage and noise sensitivity problems. However, this can only increase the tolerance, since gradient based information is always limited by noise. Several numerical schemes have also been proposed to improve the computational efficiency of the level set method, including narrow band [6], fast marching [7] and additive operator splitting [8]. Despite substantial improvements in efficiency, they are still not effective enough and can be difficult to implement. Chan and Vese [9] introduced a new GAC model based on the Mumford-Shah functional [10]. Their model uses regional statistics for segmentation. This model is good at initialization, handling noise and boundary leakage, but still suffers from computational complexity and from difficulty of implementation, because of the level set method. Region growing is a procedure that groups pixels or sub-regions into larger regions based on predefined similarity criteria for region growth. The basic approach starts with a seed point and merges neighboring pixels that have pre-defined properties similar to the seed, such as intensity [11] or texture [12]. Although region growing techniques can detect multiple objects simultaneously and can be more efficient than active contour models, the main problem is the selection of the similarity criteria. They also have to use connectivity information to define the neighboring pixels in each step of growth. In addition, they can produce region segmentations with irregular boundaries and holes in the presence of high noise, since they omit smoothing.

1.2 Heat Flow in Image Processing and Computer Vision

The heat flow analogy has been used for image smoothing and enhancement [13] [14]. Anisotropic diffusion, which was introduced to vision by Perona and Malik [13], is the state-of-the-art image enhancement technique. A significant application of heat flow in motion analysis is given in [15], where the algorithm combines anisotropic and isotropic heat flow to obtain moving edges. In [16], an anisotropic diffusion pyramid was introduced for region based segmentation. The pyramid is constructed using the scale space representation of the anisotropic diffusion. In [17], the anti-geometric heat flow model was introduced for the segmentation of regions. Here, anti-geometric heat flow is represented as diffusion through the normal direction of edges. In this paper, we introduce a novel segmentation algorithm based on the heat flow analogy. The proposed algorithm consists of two parts. In the first part, we represent a particular heat conduction problem in the image domain to roughly segment objects of interest. In this problem, we consider a conductive solid body with initial and boundary conditions respectively given by T(x, t = 0) = 0 and T(x, t) = 0, where T represents the temperature at position x = (x, y) and time t. The given conditions mean that the temperature is initially zero inside the body and the boundary condition is “Dirichlet”, i.e. the temperature is kept at zero at the boundary layer at all times.
If we initialize a continuous heat source, which is a positive constant, at any point inside the body, there will be heat diffusion to the other points from the source position as time passes and this will cause temperature increase in the body except at the boundary layer. This concept is represented in the image domain by using a control function
in the heat conduction equation. The control function is obtained from the statistics of the region containing the source, since we propose to segment the region where the source is located. However, in noisy conditions, we can observe irregular boundaries and holes inside the segmented region. These problems are solved in the second part of the algorithm, which is geometric heat flow. In this part, the segmented image is first converted to binary form and then geometric heat flow is applied to reduce the curvature of the boundary, as well as to remove holes inside the segmented region. After a specified number of iterations, the resultant image is thresholded and the final segmentation is obtained. Experimental results indicate that the proposed algorithm works well in noisy conditions without pre-processing. It can detect multiple objects simultaneously. It is also computationally more efficient and easier to control and implement in comparison to active contour models. As such, by using physics based analogies, we can control the segmentation process so as to achieve a result which offers improved segmentation, by a better fit to the image data. The rest of the paper is organized as follows: Section 2 explains the basic concepts of heat flow. Section 3 presents the proposed heat conduction problem in the image domain. Section 4 discusses the geometric heat flow. Section 5 concerns evaluation and experimental results and finally Section 6 gives conclusions. A list of acronyms is given in Table 1.

Table 1. List of Acronyms

ACWE   Active Contours Without Edges
CF     Control Function
GAC    Geometric Active Contours
GHF    Geometric Heat Flow
GVF    Gradient Vector Flow
GVFS   Gradient Vector Flow Snake
PAC    Parametric Active Contours
TF     Temperature Front
2 Basic Concepts of Heat Flow

Conduction, convection and radiation are three different modes of heat flow. Here, we chose to investigate the use of a conduction model, which we found to operate well. Conduction is the flow of heat energy from high- to low-temperature regions due to the presence of a thermal gradient in a body [18]. The change of temperature over time at each point of the material is described by the general heat conduction or diffusion equation,

dT/dt = α(d²T/dx² + d²T/dy²) + Q = α∇²T + Q    (1)

where ∇ denotes the gradient operator and α is the thermal diffusivity of the material; larger values of α indicate faster heat diffusion through the material. Q is the source term that applies internal heating. It can be uniformly or non-uniformly distributed over the material body. The solution of this equation provides the temperature distribution
over the material body and it depends on time, distance, heat source, properties of the material, as well as specified initial and boundary conditions. Initial conditions specify the temperature distribution in a body, as a function of space coordinates, at the origin of the time coordinate (t = 0). Initial conditions are represented as follows,

T(x, t = 0) = Φ(x)    (2)
where x = (x, y) is the space vector for the two-dimensional case and Φ(x) is the function that specifies the initial temperature inside the body. Boundary conditions specify the temperature or the heat flow at the boundaries of the body. There are three general types of boundary conditions: Dirichlet, Neumann and Robin. Here, we explain the Dirichlet condition, which is used in our algorithm. In the Dirichlet condition, the temperature is specified along the boundary layer. It can be a function of space and time, or constant. The Dirichlet condition is represented as follows,

T(x, t) = f(x)    (3)
where f(x) is the function that specifies the temperature at the boundary layer. Many heat conduction problems do not have analytical solutions. These problems usually involve geometrical shapes that are mathematically unsuited to representing the initial and boundary conditions. However, numerical techniques exist, such as finite differences and finite elements, which are able to handle almost all problems with arbitrarily complex shapes. The numerical methods yield numerical values for the temperatures at selected discrete points within the body and only at discrete time intervals. The numerical heat conduction problem can be investigated in the image domain, since the image is formed by a set of points and is well suited to the finite difference technique. Each object in the image can represent a body and each pixel within an object can represent a point within that body.
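As an illustration, one explicit finite-difference update of Eq. (1) on an image-sized grid, with unit spatial and temporal steps and a Dirichlet (zero) border, can be written as follows (Python/NumPy; a sketch, not the implementation used in this paper):

    import numpy as np

    def heat_step(T, Q, alpha=0.25):
        # Five-point Laplacian of the temperature image T.
        lap = (np.roll(T, 1, 0) + np.roll(T, -1, 0) +
               np.roll(T, 1, 1) + np.roll(T, -1, 1) - 4.0 * T)
        T_new = T + alpha * lap + Q          # dT/dt = alpha*Laplacian(T) + Q
        # Dirichlet condition: the outer border is kept at zero temperature.
        T_new[0, :] = T_new[-1, :] = T_new[:, 0] = T_new[:, -1] = 0.0
        return T_new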
3 Proposed Heat Conduction Problem and Representation in Image Domain

Consider a two-dimensional conductive solid body with initial and boundary conditions respectively given by T(x, t = 0) = 0 and T(x, t) = 0, which mean that the temperature is initially zero inside the body and the boundary condition is Dirichlet with a specified temperature (zero) at the boundaries. If we initialize a continuous heat source, which is a positive constant, at a point inside the body, there will be heat diffusion to the other points from the source position. As a result of this, all the points inside the body will have temperature values exceeding zero, except the boundary points. This is then an ideal approach for object segmentation in computer images. Let us investigate the proposed problem on a square object that is inside the grey-level image (G), as shown in Fig. 1(a). Assume that all the temperature values of the objects and the background are kept in another image, which is represented by I, and that the initial condition of the whole image is zero, I(x, t = 0) = 0. This assumption means that all objects have a temperature of zero initially, both inside and at the boundaries. When we initialize
a heat source at any pixel inside the square object, as shown in Fig. 1(a), there will be heat diffusion to the other pixels from the source position, which will cause temperature to increase. However the temperature at the boundary layer must be kept at zero all the time to obtain the Dirichlet condition, where the boundary layer is defined at the external side of an object as shown in Fig. 1(b). To achieve this, we use a control function in the heat conduction equation as given below,
dI(x, t)/dt = CF(x, t)[α∇²I(x, t) + Q(x)]    (4)
Fig. 1. Heat conduction modeling in the image domain of size 150 × 150. (a) Source position at t = 0. (b) Boundary layer illustration. (c) TF at t = 30 (iterations). (d) Final TF at t = 72.
where I(x, t) represents an image pixel value in terms of temperature at each point and time, α is the thermal diffusivity, with 0 ≤ α ≤ 0.25 for the numerical scheme to be stable in a two-dimensional system [18], Q(x) is the source term and CF(x, t) is the control function. The control function is obtained from the region statistics of the source location on a given grey-level image. The proposed region statistics model is similar to the one used in [9]. In that model, the image is divided into two regions, interior and exterior, separated by a contour, and the model minimizes the variance inside and outside of the surface of the desired object. In our model, the contour is represented by a Temperature Front (TF), where the TF is the boundary of the region that has temperature values exceeding zero. The control function, CF(x, t), is formulated as follows,
σ1(x, t) = λ1 |G(x) − μin|²    (5)

σ2(x, t) = λ2 |G(x) − μout|²    (6)
where λ1 > 0 and λ2 > 0 are fixed parameters for the regional statistics, G(x) is the given grey-level image, σ1(x, t) is the variance, at each point and time, with respect to the mean, μin, inside the TF, and σ2(x, t) is the variance, at each point and time, with
respect to the mean, μout, outside the TF. Then, the following logical decision is applied at each position and time increment:

CF(x, t) = 1 if σ1(x, t) ≤ σ2(x, t), and CF(x, t) = 0 otherwise    (7)

Therefore, the control function allows heat diffusion inside the object of interest and achieves the proposed Dirichlet condition on the boundary layer by keeping the temperature value at zero. However, it is better to start this process after a short diffusion time during which CF(x, t) = 1 is assumed at all points, because this increases the number
of samples inside the TF, which means a better decision at the first step, especially for noisy cases. In addition, the heat source must be initialized on a smooth surface of the object, since localizing the source at an edge pixel would give the wrong region statistics for our purpose. Fig. 1(c) and (d) respectively show the evolution and the final position of the TF. However, there is no need to continue diffusion after the TF reaches its final position. For this reason, the position of the TF is checked at each specified time interval and, when there is no movement, diffusion is terminated automatically. The main difference between [9] and our model in using region statistics is that we attempt to segment the region where the source is located instead of the whole image. One difficulty arises when the source located region intersects with the image boundary. This problem can be solved by assuming that the image is surrounded by a boundary layer, at the external side, which has a temperature value of zero at all times (Dirichlet). Fig. 2 shows the evolution and the final position of the TF when the source is located in the background. The result in Fig. 2 also shows that multiple object detection can be achieved and that the heat can diffuse through the narrow regions within the spiral object.
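A compact sketch of one iteration of Eqs. (4)-(7) is given below (Python/NumPy; Q is an image that is a positive constant at the seed pixel and zero elsewhere, and the loop that starts the regional statistics after a short initial diffusion and monitors the TF for termination is omitted):

    import numpy as np

    def tf_iteration(I, G, Q, alpha=0.25, lam1=1.0, lam2=1.0):
        inside = I > 0                                   # region enclosed by the TF
        mu_in = G[inside].mean() if inside.any() else G.mean()
        mu_out = G[~inside].mean() if (~inside).any() else G.mean()
        sigma1 = lam1 * (G - mu_in) ** 2                 # Eq. (5)
        sigma2 = lam2 * (G - mu_out) ** 2                # Eq. (6)
        CF = (sigma1 <= sigma2).astype(float)            # Eq. (7)
        lap = (np.roll(I, 1, 0) + np.roll(I, -1, 0) +
               np.roll(I, 1, 1) + np.roll(I, -1, 1) - 4.0 * I)
        return I + CF * (alpha * lap + Q)                # Eq. (4)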
Fig. 2. TF moving on the background in an image of size 150 × 150. (a) Source position. (b) t = 110. (c) t = 221 (final).
It is also worth considering the control function when the given image is bimodal. In this case, the control function attempts to segment the whole image while the TF segments the source located region. The reason for this is that the control function assigns unity to the pixels that are similar to the inside of the TF, and assigns zero to the pixels that are not similar. All the results so far have been on synthetic images without added noise. If we simulate this algorithm on noisy medical images, such as the human heart image shown in Fig. 3(a) together with the heat source location, we observe some drawbacks in the segmentation. The drawbacks are irregular boundaries and holes inside the segmented region, as shown in Fig. 3(b). These problems are solved by using the heat flow analogy again, as described in the next section.
4 Geometric Heat Flow

Geometric Heat Flow (GHF) is a kind of anisotropic diffusion and is widely used for image denoising and enhancement [14]. It diffuses along the boundaries of image features, but not across them. It derives its name from the fact that, under this flow, the feature boundaries of the image evolve in the normal direction in proportion to
their curvature. Thus, GHF decreases the curvature of shapes while removing noise in the images. The GHF equation is obtained with the following consideration. Edge directions are related to the tangents of the feature boundaries of an image B. Let η denote the direction normal to the feature boundary through a given point (the gradient direction), and let τ denote the tangent direction. Since η and τ constitute orthogonal directions, the rotationally invariant Laplacian operator can be expressed as the sum of the second order spatial derivatives, Bηη and Bττ, in these directions, and the heat conduction equation can be written without the source term,
dB/dt = α∇²B = α(Bηη + Bττ)    (8)

Omitting the normal diffusion, while keeping the tangential diffusion, yields the GHF equation as

dB/dt = αBττ = α (Bxx By² − 2 Bxy Bx By + Byy Bx²) / (Bx² + By²)    (9)

Fig. 3. Illustration of GHF for the purpose of obtaining smooth boundaries and removing holes inside the prior segmented regions. GHF is applied both to the binary form of the TF segmentation, B(x), and to the control function CF(x). Panels: (a) source position; (b) final TF at t = 59; (c) B(x); (d) S(x); (e) final shape; (f) CF(x); (g) CF(x) after GHF; (h) final shape. The size of the human heart image is 177 × 178.
In our model, GHF is used to decrease curvature for the purpose of obtaining smooth boundaries and removing holes that appear because of noise. This is achieved as follows. Firstly, a segmented region is converted to a binary form as given below and also shown in Fig. 3(c),
B(x) = 1 if I(x) > 0, and B(x) = 0 if I(x) = 0    (10)
where I (x ) is the temperature distribution after terminating diffusion and B(x ) is the binary form of the segmented image that assigns unity to the region of interest. Then,
GHF is applied to the B(x ) until the specified time (number of iterations) and finally the resulting image is thresholded to obtain the final segmentation. The process is formulated below,
S(x) = 1 if GHF(B(x), ts) ≥ 0.5, and S(x) = 0 if GHF(B(x), ts) < 0.5    (11)
where ts is the number of iterations and S(x) is the binary form of the final segmentation, which assigns unity to the region of interest. The final segmentation is shown in Fig. 3(d) and (e), where ts = 50 for this illustration. The selection of ts is left to the user and is determined by the noise condition of the image. However, as ts increases, the extracted shape evolves to a circle, then to a point, and then it is lost. For this reason, we should avoid using large values for ts. Since the illustrated human heart image appears bimodal, we can also consider the final form of the control function, as shown in Fig. 3(f). To smooth boundaries and remove holes, we simply continue with Eq. (11) and observe the result in Fig. 3(g) and (h).
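The second part of the algorithm can be sketched as follows (Python/NumPy; central differences with a small eps to avoid division by zero, which is one of several possible discretisations of Eq. (9)):

    import numpy as np

    def ghf_step(B, alpha=0.25, eps=1e-8):
        Bx = (np.roll(B, -1, 1) - np.roll(B, 1, 1)) / 2.0
        By = (np.roll(B, -1, 0) - np.roll(B, 1, 0)) / 2.0
        Bxx = np.roll(B, -1, 1) - 2.0 * B + np.roll(B, 1, 1)
        Byy = np.roll(B, -1, 0) - 2.0 * B + np.roll(B, 1, 0)
        Bxy = (np.roll(np.roll(B, -1, 1), -1, 0) - np.roll(np.roll(B, -1, 1), 1, 0) -
               np.roll(np.roll(B, 1, 1), -1, 0) + np.roll(np.roll(B, 1, 1), 1, 0)) / 4.0
        Btt = (Bxx * By ** 2 - 2.0 * Bxy * Bx * By + Byy * Bx ** 2) / (Bx ** 2 + By ** 2 + eps)
        return B + alpha * Btt                          # tangential diffusion only, Eq. (9)

    def smooth_shape(I, t_s=10):
        B = (I > 0).astype(float)                       # Eq. (10): binarise the TF result
        for _ in range(t_s):
            B = ghf_step(B)
        return (B >= 0.5).astype(np.uint8)              # Eq. (11): threshold at 0.5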
5 Evaluation and Experimental Results

In this section, we first present the evaluation of segmentation by TF and then give some illustrative examples on medical images. Segmentation by TF is compared with the Active Contour Without Edges (ACWE) [9] and the Gradient Vector Flow Snake (GVFS) [2]. The evaluation is done on a harmonic object with varying normally distributed noise N(μ, σ²), as shown in the top row of Fig. 5. The sum of squared errors (SSE) is employed to quantify the performance of each algorithm.
SSE = ∑i=1..M ∑j=1..N (Si,j − Ai,j)²    (12)
where S is the binary segmented image and A is the actual binary segmented image of size M × N. The quantity of noise is considered in terms of the standard deviation σ, with zero mean. ACWE is a region based GAC model that is implemented with a level set function. It applies global minimization and is particularly suited to segmenting bimodal images as a whole. However, in this evaluation, we choose the biggest segmented region, since we are concerned with the harmonic object segmentation. Otherwise, it would cause very high errors in noisy conditions because of the segmented noise outside the harmonic object. In this evaluation, the selected parameter values for ACWE are: λ1 = λ2 = 1 (parameters for regional statistics), v = 0 (the area parameter), h = 1 (the step space), Δt = 0.1 (the time space), ε = 1 (the parameter for the Heaviside and Dirac delta functions) and μ = 0.1 × 255² (the length parameter). GVFS is a gradient based PAC model that uses GVF as an external force. In this evaluation, the selected parameter values for GVFS are: α = 0.25 (smoothness of the
contour), β = 0 (rigidity of the contour), μ = 0.2 (in calculating GVF) and Δt = 1 (the time interval). In addition, we use 80 iterations to diffuse the gradient vectors. In our algorithm, we use an explicit finite difference scheme in both the first and the second part. In this evaluation, the selected parameter values for TF are: α = 0.25 (thermal diffusivity), λ1 = λ2 = 1 (parameters for regional statistics), Q = 5 (the energy generated at the source position per unit time interval), Δt = 1 (the time interval), Δx = Δy = 1 (the spatial intervals) and ts = 10 (specified time for GHF). In addition, we start to use the regional statistics after t = 10 to increase the number of samples inside the TF, and every 10 iterations we check the movement of the TF to determine the termination of the first part.
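For reference, the error measure of Eq. (12) and the noise model used in the evaluation amount to only a few lines (Python/NumPy; 'image' and 'sigma' are placeholders):

    import numpy as np

    def sse(S, A):
        # Sum of squared errors between the binary segmentation S and the
        # actual binary segmentation A (Eq. 12).
        return int(((S.astype(int) - A.astype(int)) ** 2).sum())

    # Zero-mean Gaussian noise of standard deviation sigma added to a test image:
    # noisy = np.clip(image + np.random.normal(0.0, sigma, image.shape), 0, 255)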
Fig. 4. Performance of TF, ACWE and GVFS
Fig. 5. Results for TF (second row), ACWE (third row) and GVFS (fourth row) with respect to increasing Gaussian noise in the image of size 100 × 100. Columns: (a) σ = 0, (b) σ = 40, (c) σ = 60, (d) σ = 80, (e) σ = 100.
In this evaluation, the contours and the heat source are initialized inside the harmonic object. Fig. 4 shows the performance of TF, ACWE and GVFS. It is observed that TF and ACWE perform much better than GVFS. The reason for this is that TF and ACWE use region based algorithms, whereas GVFS uses a gradient based algorithm, which is very sensitive to noisy conditions. When we compare TF and ACWE, ACWE performs better than TF until σ ≅ 40. This appears to be due to the smoothing operation in TF: GHF attempts to smooth the original shape and causes errors in TF when there is no noise or low noise in the image, since ts is fixed in the evaluation. However, from σ ≅ 40 to σ ≅ 80, TF segments better than ACWE. The main reason is again the smoothing operation. TF applies smoothing after rough segmentation without any relation to the regional statistic constraints, while ACWE uses a smoothness constraint together with the regional statistic constraints during the segmentation. After σ ≅ 80, it is seen that ACWE shows better performance than TF. This is because ACWE segments many regions outside the harmonic region in the presence of high noise, and some of the segmented noise then remains connected to the original region when we select the biggest region. Fig. 5 shows some of the results for TF (second row), ACWE (third row) and GVFS (fourth row). The simulation results also show the effectiveness and the computational efficiency of our algorithm in comparison to GVFS and ACWE. All the evaluations and the simulation results are obtained using MATLAB 7.0 on a Pentium IV computer, which runs the Windows XP operating system with a 3.2 GHz CPU and 1 GB RAM.
Fig. 6. Segmentation of pulmonary arterial branches in the chest image of size 259 × 250 by TF and GVFS. (a) Initial contour and the source position. (b) Segmentation by TF, shown by the black contour on the image. All the parameters are the same as in the evaluation except ts = 5 (CPU = 7.85 seconds). (c) Segmentation by TF in binary form. (d) Segmentation by GVFS, shown by the black contour on the image. All the parameters are the same as in the evaluation except that the number of iterations to diffuse the gradient vectors is 70 (CPU = 9.23 seconds). (e) Segmentation by GVFS in binary form.
Fig. 7. Segmentation of the human lung image of size 123 × 118 by TF, CF and ACWE. (a) Initial contour and the source position. (b) Segmentation by TF, shown with a white contour on the image. All the parameters are the same as in the evaluation except ts = 15 (CPU = 1.96 seconds). (c) Segmentation by CF, ts = 15 (CPU = 1.96 seconds). (d) Segmentation by ACWE. All the parameters are the same as in the evaluation except the length parameter μ = 0.08 × 255² (CPU = 15.92 minutes).
Fig. 6 shows the segmentation of pulmonary arterial branches in the chest image by TF and GVFS. The initial contour for GVFS and the source position for TF are shown in Fig. 6(a). Fig. 6(b) shows the segmentation by TF as a black contour on the given image; however, the segmented arterial branches are not clearly visible in this illustration, so the segmentation is also shown in binary form in Fig. 6(c). On the other hand, Fig. 6(d) and (e) show the segmentation by GVFS, respectively as a black contour on the image and in binary form. It is observed that TF segments the desired arterial branches better than GVFS. This result shows that TF can easily handle topological changes and flow into the arterial branches, with CPU = 7.85 s. However, GVFS cannot handle topological changes and cannot flow into the arterial branches. Although GVFS segments a smaller region than TF, its CPU time is 9.23 s, which is more than for TF. Fig. 7 shows the segmentation of the bimodal human lung image by TF, CF and ACWE, where the initial contour for ACWE and the source position for TF are shown in Fig. 7(a). Fig. 7(b) and (c) respectively show the segmentation by TF and by CF with a white contour on the image. Fig. 7(d) shows the segmentation by ACWE. It is observed that TF and CF achieve segmentation with CPU = 1.96 seconds, whereas ACWE requires CPU = 15.92 minutes. This big difference in CPU time is due to the computational complexity of ACWE, which is implemented with level sets. It is also observed that CF can extract the feature boundaries better than ACWE, especially at the middle and at the bottom of the lung image.
6 Conclusions

We have presented a novel segmentation algorithm based on the heat flow analogy. In the first part of the algorithm, we roughly extract the desired feature boundaries by representing a particular heat conduction problem in the image domain. The representation in the image domain is achieved by using a control function (CF) in the heat conduction equation. This formulation also provides an advantage when the given image is bimodal, since CF attempts to segment the whole image in this case. In the second part, we use geometric heat flow (GHF) to tune the curvature of the extracted feature boundaries
and remove possible noise that arises from the first part of the segmentation. Evaluation results indicate that temperature front (TF) has better performance than gradient vector flow snake (GVFS) and active contour without edges (ACWE) with respect to increasing Gaussian noise. For the bimodal images, TF and CF are again more efficient and effective than both GVFS and ACWE based on the simulation results. As such, the heat analogy can be deployed with success for shape extraction in images.
References

1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour models. In: IJCV, pp. 321–331 (1987)
2. Xu, C., Prince, J.L.: Snakes, Shapes and Gradient Vector Flow. IEEE Transaction on Image Processing 7(3), 359–369 (1998)
3. Caselles, V., Catte, F., Coll, T., Dibos, F.: A Geometric Model for Active Contours. Numerische Mathematic 66, 1–31 (1993)
4. Malladi, R., Sethian, J.A., Vemuri, B.C.: Shape Modeling with Front Propagation: A Level Set Approach. IEEE Transaction on PAMI 17(2), 158–175 (1995)
5. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic Active Contours. IJCV 22(1), 61–79 (1997)
6. Adalsteinsson, D., Sethian, J.: A Fast Level Set Method for Propagating Interfaces. J. Computational Physics 118(2), 269–277 (1995)
7. Sethian, J.: Level Set Methods and Fast Marching Methods. Cambridge Univ. press, New York (1999)
8. Weickert, J., Bart, M., Romeny, T.H., Viergever, M.A.: Efficient and Reliable Schemes for Nonlinear Diffusion Filtering. IEEE Transaction on Image Processing 7(3), 398–410 (1998)
9. Chan, T., Vese, L.: Active Contours without Edges. IEEE Transaction on Image Processing 10(2), 266–277 (2001)
10. Mumford, D., Shah, J.: Optimal Approximation by Piecewise Smooth Functions and Associated Variational Problems. Comm. Pure and Applied Math. 42, 577–685 (1989)
11. Adams, R., Bischof, L.: Seeded region growing. IEEE Trans. PAMI 16(6), 641–647 (1994)
12. Fung, P.W., Grebbin, G., Attikiouzel, Y.: Model-based region growing segmentation of textured images. In: ICASSP-90, vol. 4, pp. 2313–2316 (1990)
13. Perona, P., Malik, J.: Scale-Space and Edge Detection using Anisotropic Diffusion. IEEE Trans. PAMI 22(8), 629–639 (1990)
14. Kimia, B.B., Siddiqi, K.: Geometric Heat Equation and Nonlinear Diffusion of Shapes and Images. In: CVPR, pp. 113–120 (1994)
15. Direkoğlu, C., Nixon, M.S.: Low Level Moving-Feature Extraction via Heat Flow Analogy. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A., Meenakshisundaram, G., Pascucci, V., Zara, J., Molineros, J., Theisel, H., Malzbender, T. (eds.) ISVC 2006. LNCS, vol. 4291, pp. 243–252. Springer, Heidelberg (2006)
16. Acton, S.T., Bovik, A.C., Crawford, M.M.: Anisotropic diffusion pyramids for image segmentation. In: ICIP (1994)
17. Manay, S., Yezzi, A.: Anti-Geometric Diffusion for Adaptive Thresholding and Fast Segmentation. IEEE Transaction on Image Processing 12(11) (2003)
18. Holman, J.P.: Heat Transfer, 9th edn. McGraw-Hill, New York (2002)
Adaptive Vision System for Segmentation of Echographic Medical Images Based on a Modified Mumford-Shah Functional

Dimitris K. Iakovidis, Michalis A. Savelonas, and Dimitris Maroulis

Dept. of Informatics and Telecommunications, University of Athens, Panepistimioupolis, 15784, Athens, Greece
[email protected]
Abstract. This paper presents a novel adaptive vision system for accurate segmentation of tissue structures in echographic medical images. The proposed vision system incorporates a level-set deformable model based on a modified Mumford-Shah functional, which is estimated over sparse foreground and background regions in the image. This functional is designed so that it copes with the intensity inhomogeneity that characterizes echographic medical images. Moreover, a parameter tuning mechanism has been considered for the adaptation of the deformable model parameters. Experiments were conducted over a range of echographic images displaying abnormal structures of the breast and of the thyroid gland. The results show that the proposed adaptive vision system stands as an efficient, effective and nearly objective tool for segmentation of echographic images.
1 Introduction

Echographic medical images provide a means for non-invasive in-vivo diagnostics. However, they are inherently characterized by noise, speckle, spatial aliasing and sampling artifacts, causing the boundaries of tissue structures to appear indistinct and disconnected. The shape of these boundaries can be a substantial clue in differential diagnosis, as it is often correlated with malignancy risk [1-2]. A vision system for automatic segmentation of echographic images would be an aid in medical diagnosis, even to experienced radiologists, by providing a nearly objective second opinion based on explicit image features. A variety of vision systems incorporating different image processing and pattern recognition methods have been proposed for the segmentation of echographic medical images. These include minimum cross entropy thresholding [3], region growing methods [4-5], classification methods [6], clustering methods [7], wavelet analysis [8], mathematical morphology [9], and genetic and fuzzy algorithms [10-11]. State-of-the-art vision systems based on deformable models [12] exhibit advantageous performance in echographic medical image segmentation [13-15]. They are capable of accommodating the complexity and variability of such images by an inherent self-adapting mechanism that leads to continuous, closed or open, curves without requiring edge-linking operations.
Two-dimensional deformable models involve a contour deformation process which is realized by the minimization of an energy functional designed so that its local minimum is reached at the boundaries of a target object. The energy functional in its basic form comprises a term that controls the smoothness of the contour and an image dependent term that forces the contour towards the boundaries of the objects. Mumford and Shah [16] formulated an energy functional that contributes to noise resistance by incorporating integrals over image regions. Based on that functional, Chan and Vese [17] developed a level set deformable model that allows the detection of objects whose boundaries are either smooth or not necessarily defined by gradient. The level set approach was introduced to allow for topological changes of the contour during its evolution and it is therefore capable of detecting multiple objects in an image. However, the Chan-Vese model assumes that image intensity is piecewise constant, which is hardly true for echographic medical images. This assumption is violated because of single or multiple intensity spikes in such images, attributed to the characteristics of the tissue being examined, to the presence of artifacts such as calcifications, or to external causes such as speckle, usually related to the echographic imaging devices used. A drawback in the application framework of deformable models to echographic medical image segmentation is that it is device dependent, meaning that for the segmentation of images acquired from different echographic imaging devices, or from the same echographic imaging device using different settings (e.g. dynamic range), a set of different parameter values is required. In most cases parameter tuning requires technical skills and time-consuming manual interaction, which could hardly be performed by radiologists. In this paper we present a novel vision system for accurate segmentation of echographic images. It incorporates a level-set deformable model based on a modified Mumford-Shah functional estimated over sparse foreground and background regions in the image in order to cope with the presence of inhomogeneity. Moreover, the proposed system utilizes a genetic algorithm to adapt its parameters to the settings of the echographic imaging device used. The performance of the proposed system is evaluated for the segmentation of abnormal structures in breast and thyroid echographic images. The rest of this paper is organized in three sections. Section 2 describes the proposed system, whereas the results from its application on echographic medical images are presented in Section 3. Finally, Section 4 summarizes the conclusions of this study and suggests future research perspectives.
2 The Proposed System

The proposed echographic image segmentation system involves two phases: adaptation and testing. During the adaptation phase the parameters of the deformable model are tuned so that the system adapts to the settings of the echographic imaging device, based on ground truth information provided by expert radiologists. The testing phase refers to the segmentation of echographic medical images by a tuned deformable model. In what follows we describe the deformable model and the genetic algorithm used.
2.1 Deformable Model Based on a Modified Mumford-Shah Functional

The original Mumford-Shah functional is defined as follows [16]:

F^MS(u, C) = μ · Length(C) + λ ∫Ω |u0(x, y) − u(x, y)|² dxdy + ∫Ω\C |∇u(x, y)|² dxdy    (1)
where C is an evolving curve in Ω, where Ω is a bounded open subset of R², and μ, λ are positive parameters. The segmentation of an echographic image u0 : Ω → R can be formulated as a minimization problem: we seek the infimum of the functional F^MS(u, C). The solution image u(x, y) obtained by minimizing this functional is formed by smooth regions with sharp boundaries. In the level set method [18], C ⊂ Ω is represented by the zero level set of a Lipschitz function φ : Ω → R, such that:

C = {(x, y) ∈ Ω : φ(x, y) = 0},
inside(C) = {(x, y) ∈ Ω : φ(x, y) > 0},    (2)
outside(C) = {(x, y) ∈ Ω : φ(x, y) < 0}
We consider that u(x, y) is defined as:

u(x, y) = c+ for (x, y) inside C, and u(x, y) = c− for (x, y) outside C    (3)
Eq. (1) becomes:

F(c+, c−, C) = μ · Length(C) + λ+ ∫inside C |u0(x, y) − c+|² dxdy + λ− ∫outside C |u0(x, y) − c−|² dxdy    (4)
where c+ and c− are the average intensities of only a subset of pixels in the foreground (inside C) and in the background (outside C), respectively. This subset is selected so that the pixels contributing most to local inhomogeneity are excluded. It is worth noting that Eq. (3) appears in the Chan-Vese model; however, in that model, c+ and c− refer to the average intensities of all the pixels in the respective regions [17] and not to the intensities of subsets of pixels in the image. We propose that the values of c+ and c− are estimated by the following equations:

c+(φ) = ∫Ω u0(x, y) H(φ(x, y)) H(φ0(x, y)) Δ1(x, y) dxdy / ∫Ω H(φ(x, y)) H(φ0(x, y)) Δ1(x, y) dxdy    (5)
c−(φ) = ∫Ω u0(x, y) (1 − H(φ(x, y))) H(φ0(x, y)) Δ2(x, y) dxdy / ∫Ω (1 − H(φ(x, y))) H(φ0(x, y)) Δ2(x, y) dxdy    (6)
where H is the Heaviside function. The differences Δ1(x, y) and Δ2(x, y) are introduced for the cases of foreground and background respectively, as:

Δi(x, y) = H(φ(x, y) + ai) − H(φ(x, y))    (7)
where i = 1, 2 and a1, a2 are constants, negative in the case of the foreground and positive in the case of the background. Their values are determined so that [0, a1] and [−a2, 0] define the acceptable ranges of φ(x, y) for a point (x, y) to be included in the calculations for the sparse foreground and background regions, respectively. Equation (7) implies that the points (x, y) for which φ(x, y) does not belong to the acceptable range result in Δi(x, y) ≈ 0. These points correspond to intensity inhomogeneity and cause abrupt changes of φ, resulting in H(φ(x, y) + ai) = H(φ(x, y)). Moreover, we assume that the initial contour as traced by φ0 corresponds to the region of interest and we employ H(φ0) to restrict the calculation of the average foreground and background intensities c+ and c− to this region. Keeping c+ and c− fixed, and minimizing F with respect to φ, the associated Euler-Lagrange equation for φ is deduced. Finally, φ is determined by parameterizing the descent direction by an artificial time t ≥ 0, and solving the following equation

∂φ/∂t = δ(φ)[μ · div(∇φ/|∇φ|) − λ+(u0 − c+)² + λ−(u0 − c−)²] = 0    (8)
where t ∈ (0, ∞), (x, y) ∈ Ω and δ is the one-dimensional Dirac function.

2.2 Genetic Algorithm

The genetic algorithm used in the adaptation phase aims at parameter tuning of the deformable model. Genetic algorithms are stochastic non-linear optimization algorithms based on the theory of natural selection and evolution [19-20]. They have been the optimizers of choice in various artificial intelligence applications, exhibiting better performance than other non-linear optimization approaches to parameter tuning [21-24]. Motivated by these studies, we transcribed the parameter tuning optimization problem of the level-set deformable model into a genetic optimization problem. Considering that μ, λ+, λ− are weight terms of the energy functional that regulate the relative influence of the terms comprising Eq. (1), and that μ > 0, Eq. (8) can be rewritten as follows:

δ(φ)[div(∇φ/|∇φ|) − (λ+/μ)(u0 − c+)² + (λ−/μ)(u0 − c−)²] = 0    (9)
and by setting k+ = λ+/μ and k− = λ−/μ, Eq. (9) can be rewritten as follows:

δ(φ)[div(∇φ/|∇φ|) − k+(u0 − c+)² + k−(u0 − c−)²] = 0    (10)
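A sketch of how Eqs. (5)-(7) and (10) translate into code is given below (Python/NumPy; regularised Heaviside and Dirac functions and a central-difference curvature are assumed, and the names are ours rather than those of the actual implementation):

    import numpy as np

    def heaviside(z, eps=1.0):
        return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(z / eps))

    def sparse_mean(u0, phi, phi0, a, inside=True):
        # Average intensity over the sparse region of Eqs. (5)-(7): only points
        # whose phi lies in the thin band selected by Delta_i contribute.
        H = heaviside(phi)
        region = H if inside else (1.0 - H)
        delta = np.abs(heaviside(phi + a) - H)           # |Delta_i| of Eq. (7)
        w = region * heaviside(phi0) * delta             # restricted to the initial ROI
        return float((u0 * w).sum() / (w.sum() + 1e-12))

    def curvature(phi, eps=1e-8):
        px = (np.roll(phi, -1, 1) - np.roll(phi, 1, 1)) / 2.0
        py = (np.roll(phi, -1, 0) - np.roll(phi, 1, 0)) / 2.0
        norm = np.sqrt(px ** 2 + py ** 2) + eps
        nx, ny = px / norm, py / norm
        return ((np.roll(nx, -1, 1) - np.roll(nx, 1, 1)) / 2.0 +
                (np.roll(ny, -1, 0) - np.roll(ny, 1, 0)) / 2.0)

    def phi_step(phi, u0, c_in, c_out, k_plus, k_minus, dt=0.1):
        dirac = (1.0 / np.pi) / (1.0 + phi ** 2)         # regularised delta function
        rhs = curvature(phi) - k_plus * (u0 - c_in) ** 2 + k_minus * (u0 - c_out) ** 2
        return phi + dt * dirac * rhs                    # descent step of Eq. (10)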
The parameters k+, k−, a1 and a2 are encoded into a single bit-string, called a chromosome. Their values are constrained within discrete, worst-case ranges determined experimentally. Two 6-bit variables with integer values ranging from 0 to 64 are used to hold k+ and k−, and two 4-bit variables are used to hold the exponents of a1 and a2, enumerating the values 10^-15, 10^-14, ..., 10^0. The length l of the resulting chromosome is therefore 20 bits. In the adaptation phase the genetic algorithm searches for the chromosome associated with the optimal parameters (k+, k−, a1 and a2) which maximize the overlap value f between a contour A and a given ground truth segmentation T of the target tissue structure. The ground truth segmentation comprises all pixels falling within at least N/2+1 segmentations out of N segmentations drawn manually by N radiologists [25]. The bias introduced in the ground truth segmentation is reduced as N increases. The overlap value f between two delineated areas A and T is defined as in [5]:

f = |A ∩ T| / |A ∪ T|    (11)
570
D.K. Iakovidis, M.A. Savelonas, and D. Maroulis
Step 3. G ← G + 1 Step 4. Begin Reproduction Select Fittest Chromosomes Maintain Fittest Chromosomes in the Population End Reproduction Step 5. Crossover Fittest Chromosomes to Generate new Chromosomes Step 6. Mutate Fittest Chromosomes to Generate new Chromosomes Step 7. Repeat Steps 2 to 6 Until G = Gmax The parameter tuning procedure, described above, will result in a registered optimal set of parameters ( k , α1 and α 2 ). This set of parameters can be used for the segmentation of similar tissue structures in other medical images acquired from the same imaging device with the same settings.
3 Results Experiments were performed aiming at the assessment of the proposed vision system for the segmentation of echographic medical images. The dataset used in the experiments comprised of 38 breast and thyroid echographic images (Table 1), containing abnormal tissue structures. The images were digitized at 256×256-pixel dimensions and at 8-bit grey level depth. The proposed vision system was implemented in Microsoft Visual C++ and executed on a 3.2 GHz Intel Pentium IV workstation. The contours were initialized with regions of interest defined by the boundaries of the thyroid gland, which were manually determined by expert radiologists. The parameters of the genetic algorithm were kept constant during the experimentation. A typical population of R = 30 chromosomes was considered in agreement with [29]. The crossover probability was set at 0.6 [30] and the mutation probability was set at 1/l = 0.05, where the length of the chromosome was l = 20 [31]. A number of Gmax = 50 generations was considered, as it allows for convergence to the highest attainable fitness value. The adaptation phase accepts a single echographic image for parameter tuning. In order to avoid the sample selection bias that would be introduced if the performance evaluation process used a single image for parameter tuning, arbitrarily selected from the available set of images, a cross-validation scheme was employed [32]. This scheme involved multiple experiments that use independent images for parameter tuning and testing. In each experiment, a different image was drawn from the dataset and used for parameter tuning, whereas the rest of the dataset was used for testing. The average overlaps obtained by the proposed vision system and the individual radiologists are summarized in Table 1. These results provide an estimate of the generalization ability of the system. The obtained segmentation accuracies are comparable to or even higher than the segmentation accuracies obtained by individual radiologists. The latter case can be attributed to the subjectivity induced in the segmentations obtained by individual radiologists, which is associated with interobserver variability.
Adaptive Vision System for Segmentation of Echographic Medical Images
571
Table 1. Average segmentation accuracy with respect to the ground truth, for the individual radiologists and the proposed system Subject
Breast findings Thyroid findings
Images
Radiologists v (%)
20 18
89.1±1.7 90.7±2.3
Proposed System v (%) 92.7±1.1 94.4±1.7
The interobserver variability as quantified by the coefficient of variation [33] ranges between 2.1% and 11.8%. The coefficient of variation of the overlap values obtained with the proposed vision system ranges between 0.9% and 3.0%, and in all the cases, it was lower than the coefficient of variation of the radiologists. Figure 1 illustrates two indicative echographic medical images used in the experiments. The first image (Fig. 1a) illustrates an echographic image of a breast nodule. The overlap obtained with the proposed vision system is 94.5% (Fig. 1c), whereas the
(a)
(b)
(c)
(d)
(e)
(f)
Fig. 1. Echographic medical images and segmentation results, (a) echographic image of a breast nodule, (b) echographic image of a thyroid nodule, (c-d) segmentations obtained by individual expert radiologists, (e-f) segmentations obtained by the proposed segmentation approach
572
D.K. Iakovidis, M.A. Savelonas, and D. Maroulis
overlap obtained by an individual radiologist is 92.1% (Fig. 1e) respectively. The second image (Fig. 1b) illustrates an echographic image of a thyroid nodule. The overlap obtained with the proposed vision system is 98.9% (Fig. 1d) whereas the overlap achieved by an individual radiologist is slightly lower reaching 97.0% (Fig. 1f). The average time required for the execution of the segmentation algorithm is of the order of a minute. The maximum time required in the adaptation phase of the proposed vision system reaches approximately the 18h, but it needs to run only once for a particular imaging device. It should be noted that if one had to follow the naive approach of exhaustive search in the parameter space, the execution time required would be up to three orders of magnitude higher. The resulting set of optimal parameters ( k + , k + , α1 and α 2 ) may be applied for the segmentation of abnormal tissue structures in other similar echographic images acquired from the same echographic imaging device with the same settings. This means that for each new image, only the execution time of the deformable model is required.
4 Conclusion

We have introduced a novel vision system, which embodies a level-set deformable model tuned by a genetic algorithm. The deformable model is based on a modified Mumford-Shah functional, which is estimated over sparse foreground and background regions in the image, so as to cope with the intensity inhomogeneity characterizing echographic medical images. The genetic algorithm has been employed for efficient tuning of the parameters of the deformable model to an optimal set of values for the particular settings of the imaging device used. This adaptation of the deformable model allows accurate segmentations of tissue structures in echographic medical images. The segmentation accuracy provided is comparable to or even higher than the segmentation accuracies obtained by individual radiologists. The results show that the interobserver variability of the individual radiologists is higher than the variability of the overlap values obtained with the proposed vision system. Therefore, this vision system offers a tool for nearly objective clinical assessment of tissue structures. Moreover, it provides the radiologists with a second opinion, without requiring technical skills or time-consuming manual interaction for parameter tuning. Future research perspectives include speeding up the proposed system and embedding it into an integrated system that will combine heterogeneous information to support diagnosis.
Acknowledgement

We would like to thank Dr. N. Dimitropoulos, M.D., Radiologist, and EUROMEDICA S.A., Greece, for providing the echographic images and for their contribution to the evaluation of the results. This work was supported by the Greek General Secretariat of Research and Technology and the European Social Fund, through the PENED 2003 program (grant no. 03-ED-662).
References 1. Ching, H.K., et al.: Stepwise Logistic Regression Analysis of Tumor Contour Features for Breast Ultrasound Diagnosis. In: Proc. IEEE Ultr Symp. Atlanta, GA, USA, vol. 2, pp. 1303–1306. IEEE, Los Alamitos (2001) 2. Papini, E., et al.: Risk of Malignancy in Nonpalpable Thyroid Nodules: Predictive Value of Ultrasound and Color-Doppler Features. J. Clin Endocrin & Metabol 87(5), 1941–1946 (2002) 3. Zimmer, Y., Tepper, R., Akselrod, S.: A two-dimensional extension of minimum cross entropy thresholding for the segmentation of ultrasound images. Ultr. Med. and Biol. 22, 1183–1190 (1996) 4. Adams, R., Bischof, L.: Seeded region growing. IEEE Trans Pat Anal Mach Intel 16(6), 641–647 (1994) 5. Hao, X., Bruce, C., Pislaru, C., Greenleaf, J.F.: A Novel Region Growing Method for Segmenting Ultrasound Images. Proc. IEEE Int. Ultr. Symp. 2, 1717–1720 (2000) 6. Kotropoulos, C., Pittas, I.: Segmentations of Ultrasonic Images Using Support Vector Machines. Pat. Rec. Let. 24, 715–727 (2003) 7. Boukerroui, D., Basset, O., Guerin, N., Baskurt, A.: Multiresolution Texture Based Adaptive Clustering Algorithm for Breast Lesion Segmentation. Eur. J. Ultr. 8, 135–144 (1998) 8. Fan, L., Braden, G.A., Herrington, D.M.: Nonlinear Wavelet Filter for Intracoronary Ultrasound Images. In: Proc. An Meet. Comp. Card, pp. 41–44 (1996) 9. Thomas, J.G., Peters, R.A., Jeanty, P.: Automatic Segmentation of Ultrasound Images Using Morphological Operators. IEEE Trans. Med. Im. 10, 180–186 (1991) 10. Heckman, T.: Searching for Contours. Proc. SPIE 2666, 223–232 (1996) 11. Solaiman, B., Roux, C., Rangayyan, R.M., Pipelier, F., Hillion, A.: Fuzzy Edge Evaluation in Ultrasound Endosonographic Images. In: Proc. Can. Conf. Elec. Comp. Eng. pp. 335– 338 (1996) 12. McInerney, T., Terzopoulos, D.: Deformable Models in Medical Image Analysis: A Survey. Med. Im. Anal. 1(2), 91–108 (1996) 13. Honggang, Y., Pattichis, M.S., Goens, M.B.: Robust Segmentation of Freehand Ultrasound Image Slices Using Gradient Vector Flow Fast Geometric Active Contours. In: Proc. IEEE South Symp. Im Anal. Interpr. pp. 115–119. IEEE, Los Alamitos (2006) 14. Liu, W., Zagzebski, J.A., Varghese, T., Dyer, C.R., Techavipoo, U., Hall, T.J.: Segmentation of Elastographic Images Using a Coarse-to-Fine Active Contour Model. Ultr. Med. Biol. 32(3), 397–408 (2006) 15. Cardinal, M.-H.R., Meunier, J., Soulez, G., Maurice, R.L., Therasse, E., Cloutier, G.: Intravascular Ultrasound Image Segmentation: a Three-Dimensional Fast-Marching Method Based on Gray Level Distributions. IEEE Trans. Med. Im. 25(5), 590–601 (2006) 16. Mumford, D., Shah, J.: Optimal Approximation by Piecewise Smooth Functions and Associated Variational Problems. Commun. Pure Appl. Math. 42, 577–685 (1989) 17. Chan, T.F., Vese, L.A.: Active Contours Without Edges, Vol. IEEE Trans. Im. Proc. 7, 266–277 (2001) 18. Osher, S., Sethian, J.: Fronts Propagating with Curvature-Dependent Speed: Algorithms Based on the Hamilton-Jacobi Formulations. J. Comp. Phys. 79, 12–49 (1988) 19. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA (1989) 20. Grefenstette, J.J.: Optimization of Control Parameters for Genetic Algorithms. IEEE Trans. Syst. Man. Cyber. 16(1), 122–128 (1986)
21. Min, S.H., Lee, J., Han, I.: Hybrid Genetic Algorithms and Support Vector Machines for Bankruptcy Prediction. Expert Systems with Applications 31(3), 652–660 (2006) 22. Zhao, X.M., Cheung, Y.M., Huang, D.S.: A Novel Approach to Extracting Features from Motif Content and Protein Composition for Protein Sequence Classification. Neural Networks 18, 1019–1028 (2005) 23. Plagianakos, V.P, Magoulas, G.D., Vrahatis, M.N: Tumor Detection in Colonoscopic Images Using Hybrid Methods for On-Line Neural Network Training. In: Proc Int Conf Neur Net Exp Syst Med Health, pp. 59–64 (2001) 24. Pignalberi, G., Cucchiara, R., Cinque, L., Levialdi, S.: Tuning Range Segmentation by Genetic Algorithm. EURASIP J. Appl. Sig. Proc. 8, 780–790 (2003) 25. Kaus, M.R., Warfield, S.K., Jolesz, F.A., Kikinis, R.: Segmentation of Meningiomas and Low Grade Gliomas in MRI. In: Proc Int Conf Med Im Comp Comp-Ass Interv, pp. 1–10 (1999) 26. Syswerda, G.: A Study of Reproduction in Generational and Steady State Genetic Algorithms.: Foundations of Genetic Algorithms, Rawlings G.J.E., pp. 94–101. Morgan Kaufmann, San Mateo (1999) 27. Eiben, A.E.: Multiparent Recombination in Evolutionary Computing, Advances in Evolutionary Computing. Natural Computing Series, pp. 175–192. Springer, Heidelberg (2002) 28. Bäck, T.: Optimal Mutation Rates in Genetic Search. In: Proc Int Conf Gen Alg, pp. 2–8 (1993) 29. Goldberg, D.E.: Sizing Population for Serial and Parallel Genetic Algorithms. In: Proc Int Conf Gen Alg, pp. 70–79 (1989) 30. Bäck, T., Hammel, U., Schwefel, H.P.: Evolutionary Computation: Comments on the History and Current State. IEEE Trans. Evol. Comp. 1(1), 3–17 (1997) 31. Kemenade K.M., van Eiben, A.E.: Multi-Parent Recombination to Overcome Premature Convergence in Genetic Algorithms. In: Proc Dutch Conf Art Intell pp. 137–146 (1995) 32. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Academic Press, London (1998) 33. Woltjer, H.H.: The Intra- and Interobserver Variability of Impedance Cardiography in Patients at Rest During Exercise. Physiol. Meas. 17, 171–178 (1996)
Detection of Individual Specimens in Populations Using Contour Energies

Daniel Ochoa¹,², Sidharta Gautama¹, and Boris Vintimilla²

¹ Department of Telecommunication and Information Processing, Ghent University, St-Pieters Nieuwstraat 41, B-9000, Ghent, Belgium
² Centro de Vision y Robotica, Facultad de Ingenieria en Electricidad y Computación, ESPOL University, Km 30.5 via perimetral, 09015863, Guayaquil, Ecuador
{dochoa,sid}@telin.ugent.be, [email protected]
Abstract. In this paper we study how the shape information encoded in the values of contour energy components can be used to detect microscopic organisms in population images. We propose features based on shape and geometric statistics obtained from samples of optimized contour lines, integrated in a Bayesian inference framework for the recognition of individual specimens. Compared with common geometric features, the results show that the patterns present in the image allow better detection of a considerable number of individuals, even in cluttered regions, when sufficient shape information is retained. This provides an alternative to building a specific shape model or imposing specific constraints on the interaction of overlapping objects. Keywords: recognition, feature extraction, statistical shape analysis.
1 Introduction

An important tool for biotechnology research and development is the study of populations at the molecular, biochemical and microbiological levels. However, to track their development and evolution, non-destructive protocols are required to keep individuals in a suitable environment. The right conditions allow continuous examination and data collection which, from a statistically meaningful number of specimens, provide support for a wide variety of experiments. The length, width and location of microscopic specimens in a sample are strongly related to population parameters such as feeding behavior, rate of growth, biomass, maturity index and other time-related metrics. Population images, characterized by sample variation, structural noise and clutter, pose a challenging problem for recognition algorithms [1]. These issues negatively affect the estimated measurements, for instance when parts of the detected object are out of focus, when two or more individuals are mistakenly counted as one, or when artifacts in the sample resemble the shape of the specimens of interest. A similar situation occurs in tracking applications, when continuous identification of a given individual is required while it interacts with others of the same or a different phylum. Nevertheless, the increasing amount of digital image data in micro-biological studies prompts the need for reliable image analysis systems that produce precise and reproducible quantitative results.
Nematodes are one of the most common families of animals; they are ubiquitous in fresh water, marine and terrestrial eco-systems. As a result, nematode populations have become useful bio-indicators for environmental evaluation, disease expression in crops, pesticide treatments, etc. A member of this family, the C. elegans nematode, is widely used in research in genetics, agriculture and marine biology. This microorganism has complete digestive and nervous systems, a known genome sequence, and is sensitive to variable environmental conditions.

Intensity thresholding and binary skeletonization followed by contour curvature pattern matching were used in images containing a single nematode to identify the head and tail of the specimen [2]. To classify C. elegans behavioral phenotypes, motion patterns are identified in [3] by means of a one-nematode tracking system, morphological operators and geometry-related features. The advantages of scale space principles were demonstrated on nematode populations in [4], where anisotropic diffusion is proposed to improve the response of a line detection algorithm; however, recognition of single specimens was not performed. In [8], nematode population analysis relies on well-known image processing techniques, namely intensity thresholding followed by filling, drawing and measuring operations in a semi-automatic fashion; however, sample preparation was carefully done to place specimens apart from each other to prevent overlapping. Combining several image processing techniques when dealing with biological population specimens increases the complexity of finding a set of good parameters and consequently reduces the scope of possible applications.

Daily lab work is mostly manual: after the sample image is captured, a biologist defines points along the specimen, then line segments are drawn and measurements are taken. User-friendly approaches like live-wire [5] can ease the process, as a line segment is pulled towards the nematode centerline while pointing over the nematode surface. However, in cluttered regions line evidence vanishes and manual corrections are eventually required. Considering that a data set usually consists of massive amounts of image data with easily hundreds of specimens, such a repetitive task entails a high probability of inter-observer variations and consequently unreliable data.

Given the characteristics of these images, extracting reliable shape information for object identification with a restricted amount of image data, overlapping, and structural noise is a difficult task. Certainly, the need for high-throughput screening of bio-images to fully describe biological processes on a quantitative level is still very much in demand [6]. Unless effective recognition takes place before any post-processing procedure, artificial vision software for estimating statistical data from population samples [7] will not be able to provide scientists with accurate measurements.

As an alternative to past efforts focused on deriving shape models from a set of single-object images using evenly distributed feature points [14], we propose to recover shape information by examining the energies of sample optimized active contours from a population image. In order to assess the efficiency of this approach, we compare it with geometrical measurements. Our aim is to prove that patterns extracted from sample contours can lead to the recognition of individual specimens in still images, even in the presence of the aforementioned problems.
This paper is organized as follows. In Section 2 the active contour approach is discussed. Shape features of detected nematodes are proposed and used for classification in Section 3. Comparative results are shown in Section 4; finally, conclusions and future work are presented in Section 5.
2 Segmentation Using Active Contours

Nematodes are elongated structures of slightly varying thickness along their length, wide in the center and narrow near both ends. Contrary to what one might think, their simple shape makes the segmentation process a complex task in population images, because nematodes interact with the culture medium and with other specimens in the sample. Nematodes lie freely on the agar substrate and explore their surroundings by bending their body. While foraging, nematodes run over different parts of the image, crawl on top of each other and occasionally dive into the substrate. This behaviour leads to potential issues in segmentation, because substantial variations in shape and appearance are observed in population images. Nematodes exhibit different intensity level distributions, either between individuals or groups, when the image background is non-homogeneous. Darker areas appear every time internal organs become visible or at junctions where two or more specimens overlap. Some parts get blurred as they go temporarily out of focus when diving into the substrate. Regarding shape, the lack of contour features and the complex motion patterns prevent the use of simple shape descriptors or the construction of models able to account for the whole range of shape configurations. These two characteristics also make it difficult to find a set of geometrical constraints that can describe all the junction types found in overlapping situations (Fig. 1).

Under these conditions, thresholding techniques commonly used in images of isolated specimens fail to provide a reliable segmentation. Approaches based on differential geometry [11] can handle the intensity variation better, but a trade-off between image-content coverage and conciseness [12] is needed to set appropriate parameter values. Statistical tests on hypothetical center-line and background regions at every pixel location, as proposed in [23], rely on having enough local line evidence, which disappears precisely at junctions where saddle regions form. In practice, the inherent disadvantages of the aforementioned techniques allow one to obtain only a set of unconnected points, hopefully with the majority located on the transversal axis of some of the nematodes present in the image. Line grouping based on graph search and optimisation techniques enforcing line continuity and smoothness has been applied to integrate line evidence [13,23], but segmentation of objects based on linear segments requires relevant local segment configurations that capture object shape characteristics [22]. Shape modelling assuming evenly distributed landmark points along the nematode body proved a complex issue; although non-linear systems have been devised [10], the complete range of nematode body configurations is still far from being modelled. The spatial arrangement of feature points at different scales was exploited in [15] to search for regions with a high probability of containing a rigid wiry object in different cluttered environments, yet in populations clutter is mostly caused by the nematodes themselves.
Fig. 1. Left: Nematodes in a population image. Center: Structural noise produced by internal organs, and overlapping. Right: Non-homogeneous background causes differences in appearance.
In this paper we propose the use of active contour energies to capture relevant statistical shape information for recognition, applied to nematode detection in population images. Active contours, introduced by Kass et al. with a model called the snake [16], have drawn attention due to their performance in various problems. Segmentation and shape modeling in single images have proved effective by integrating region-based information, stochastic approaches and appropriate shape constraints [17, 18]. Active contours combine image data and shape modeling through the definition of a linear energy function consisting of two terms: a data-driven component (external energy), which depends on the image data, and a smoothness-driven component (internal energy), which enforces smoothness along the contour.
E_contour = λ1 · E_int + λ2 · E_ext   (1)
The internal energy can be further decomposed into tension and bending energies, which report higher values as the contour stretches or bends during the optimization process. The goal is to minimize the total energy iteratively using gradient descent techniques, as the energy components balance each other.

E_int = ∫_0^S [e_t(s) + e_b(s)] ds,   E_ext = ∫_0^S e_ext(s) ds   (2)
The proposed approach is based on the idea that, given convergence of mostly data-driven active contours, appearance and geometrical data can be recovered from the resulting distribution of energy component values. Contrary to other works that tried to embed partial shape information to guide the evolution of the contour [21], we consider the analysis of energy-derived features a natural way to explore the range of possible nematode shape configurations in a set of population images, without having to build a specific model or make explicit constraints about object interaction [19]. We leave to the active contour optimization process the task of locating salient linear structures and focus on exploiting the distribution of energy values for the recognition of those contours corresponding to nematodes. For segmentation we used the ziplock snake [20], an active contour model designed to deal with open contours. Given a pair of fixed end points, optimization is
carried out from them towards the center of the contour, using an increasing number of control points in every step. This procedure is intended to raise the probability of accurate segmentation by progressively locating control points on the object surface. Ziplock snakes can encode shape information explicitly [21] and provide faster convergence than geodesic snakes. It is important to point out that, as in any deterministic active contour formulation, there are situations in which convergence tends to fail, for instance in the presence of sharp turns, self-occlusion or very low contrast regions. Nevertheless, as long as the number of correctly classified contours represents a valid sample of the population, we can obtain meaningful data for bio-researchers. In the context of living specimens we should expect that eventually every individual will have the possibility of matching a nicely converged contour. For our experiments, the tension energy e_t was defined as the point distance distribution, the bending energy e_b was calculated by means of a discrete approximation of the local curvature, and a normalized version of the intensity image was employed as the energy field e_ext.

e_ext ∝ I(x, y),   e_t = √(x'² + y'²),   e_b = |x'·y'' − x''·y'| / (x'² + y'²)^(3/2)   (3)
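As a concrete illustration of these discrete energy terms, the sketch below computes tension, bending and external energies for a polygonal contour; the finite-difference scheme and the nearest-pixel sampling of the image are illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def contour_energies(points, image):
    """Discrete energy terms in the spirit of eq. (3).

    points: (n, 2) array of (x, y) control points; image: normalized 2-D array.
    """
    x, y = points[:, 0], points[:, 1]
    dx, dy = np.gradient(x), np.gradient(y)        # first derivatives along the contour
    ddx, ddy = np.gradient(dx), np.gradient(dy)    # second derivatives
    e_t = np.sqrt(dx ** 2 + dy ** 2)               # tension: point distance distribution
    e_b = np.abs(dx * ddy - ddx * dy) / (dx ** 2 + dy ** 2 + 1e-12) ** 1.5  # curvature
    xi = np.clip(np.round(x).astype(int), 0, image.shape[1] - 1)
    yi = np.clip(np.round(y).astype(int), 0, image.shape[0] - 1)
    e_ext = image[yi, xi]                          # external term: intensity at the points
    return e_t, e_b, e_ext
```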
The main bottleneck in the automated use of ziplock snakes is the need to specify matching end points for a contour. The absence of salient shape features in the head and tail sections of nematodes prevents building a reliable matching table. The only option is to examine all possible combinations of points, but this can lead to a combinatorial explosion of the search space. In this context we devised two criteria to constrain the number of contours to analyze (a small pairing sketch in Python is given below):

• Matching end points within a neighborhood of size proportional to the expected nematode length,
• Matching end points connected by a path showing consistent line evidence.

Fig. 2 depicts initial contours generated after applying both criteria. In the first case the nematode length was derived from a sample nematode; in the second case the raw response of a line detector [24] was used to look for line evidence between end points. Any path between a pair of end points consisting of non-zero values was considered valid and allows the initialization of a contour. Once the contours had converged, we observed different situations regarding their structure:

• The contour is located entirely on a single nematode.
• The contour sections correspond to different nematodes.
• Part of the contour lies on the image background.

The first case requires both end points to be located on the same object; it occurs when the specimen is isolated or when the energy optimization is able to overcome overlapping regions. The second type of contour appears when a contour spreads among overlapping nematodes while fitting a smooth curve between its end points. If
the smoothness constraint cannot be enforced, some contour sections might rest on the image background. In the following we will refer to contours located on a single nematode as nematode contours and to the remaining cases as non-nematode contours. Our interest is to extract nematode contours reliably but, as can be seen in Fig. 2, there is no simple way to distinguish them without additional processing steps and without running into the problems mentioned previously. Hence, the suggested solution is presented in the following section.
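A minimal sketch of the two end-point matching criteria is shown here; checking line evidence along the straight segment between the two points is a simplification of the path-based criterion, and all names are illustrative assumptions.

```python
import numpy as np

def candidate_pairs(end_points, expected_length, line_map=None):
    """Candidate end-point pairs under the two criteria described above.

    end_points: (m, 2) array of (x, y) positions; expected_length: rough nematode
    length in pixels; line_map: optional raw line-detector response (non-zero
    where line evidence exists).
    """
    pairs = []
    for i in range(len(end_points)):
        for j in range(i + 1, len(end_points)):
            p, q = end_points[i], end_points[j]
            if np.linalg.norm(p - q) > expected_length:      # criterion 1: neighborhood size
                continue
            if line_map is not None:                          # criterion 2: line evidence
                n = int(np.linalg.norm(p - q)) + 1
                xs = np.linspace(p[0], q[0], n).round().astype(int)
                ys = np.linspace(p[1], q[1], n).round().astype(int)
                if np.any(line_map[ys, xs] == 0):
                    continue
            pairs.append((i, j))
    return pairs
```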
Fig. 2. Contours (white) from end points (blue) matching criteria. Left column: expected length. Right column: line evidence. First row: before convergence. Second row: after convergence. Right bottom: Examples of nematode (green) and non-nematode (orange) contour classes.
3 Detection of Specimens Using Energy Features

The goal of our experiments is to explore the feasibility of classifying a given contour into the corresponding nematode w_n or non-nematode w_t class. Let C be the set of contours {c_1,...,c_m} generated after the convergence process, and define a contour c as a sequence of n control points (x_1,...,x_n). Two types of shape measurements, based on the three relations (length, curvature and line evidence) encapsulated in the energy terms, are defined. The expected point energy M_e captures the average value of a given energy term e along the contour:
M_{c,e} = ē_c = (1/n) Σ_{i=1}^{n} e(x_i),   e ∈ {e_t, e_b, e_ext}   (4)
and the point sequence energy S_e gathers the control points' energies in a vector, providing evidence about the effect that different shape and appearance configurations have on the individual contour components:
S_{c,e} = (e(x_1), ..., e(x_n)),   e ∈ {e_t, e_b, e_ext}   (5)
The distributions of these energy-based feature values allow us to study the similarity between contours belonging to objects of interest and their properties. It seems reasonable to expect that the energy configuration space should display clusters in regions linked to objects of consistent shape and appearance. The relevance of using active contours and their associated energies becomes manifest when comparing contours after convergence. In background regions, control points are collinear and equidistant; therefore M_e features should report rather fixed values. For nematode contours, the spatial distribution of control points is not homogeneous, because their location is determined by the foreground image data and the body's geometrical configuration. Since to some degree nematodes look alike and share similar movement behavior, a suitable set of S_e feature values could capture such a limited configuration space. Other patterns can be deduced, but it is unlikely that features derived from any individual energy term will by themselves provide a reliable recognition outcome. The combination of energy-based features in a statistical framework is proposed to measure their discriminative power. To that aim, the Bayes rule was applied to classify contours as nematode or non-nematode. The ratio of the a posteriori probabilities of the nematode and non-nematode classes, given the values of an energy-based feature set, was defined as the discriminant function. The prior probabilities were regarded as homogeneous to test the effectiveness of the proposed features; however, they can be modeled, for instance, by the distribution of control point distances to the nearest end point or by the distribution of line evidence. This reduces the discriminant function to the ratio of the probabilities of the feature values given that a contour is assigned to a particular class. Assuming independence between energy terms and control point locations, these distributions can be readily defined as the product of the probabilities of the feature set elements given a class w ∈ {w_n, w_t}:
P(M_{c,e} | w) = ∏_e P(ē_c | w),   e ⊆ {e_t, e_b, e_ext}   (6)

P(S_{c,e} | w) = ∏_e ∏_x P(e(x) | w),   e ⊆ {e_t, e_b, e_ext}   (7)
Finally, the computational cost of contour classification in a population image depends on the size of C, the feature type selected and the number of energy terms included. In the case of S_e there is no extra cost, because its components are the terms of E_contour; the M_e calculation requires an additional step to compute the associated average.
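To make the classification concrete, the sketch below assembles the two feature types from the energy terms of a contour and evaluates a naive-Bayes style log-likelihood ratio under equal priors; the frozen SciPy gamma densities for the point-sequence likelihoods mirror the fitting mentioned in the next section, but the exact protocol and all function names are assumptions, not the authors' code.

```python
import numpy as np
from scipy import stats

def energy_features(e_t, e_b, e_ext):
    # Expected point energy M (per-term averages) and point sequence energy S (raw values)
    M = {"et": e_t.mean(), "eb": e_b.mean(), "eext": e_ext.mean()}
    S = {"et": e_t, "eb": e_b, "eext": e_ext}
    return M, S

def fit_term_density(training_values):
    # Gamma density fitted to one energy term over a training set of contours
    a, loc, scale = stats.gamma.fit(training_values, floc=0)
    return stats.gamma(a, loc=loc, scale=scale)

def log_discriminant(S, densities_nem, densities_non):
    # Sum over energy terms and control points of the log likelihood ratio
    # (eqs. (6)-(7) style); a positive value favours the nematode class
    ratio = 0.0
    for term, values in S.items():
        ratio += np.sum(densities_nem[term].logpdf(values)
                        - densities_non[term].logpdf(values))
    return ratio
```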
4 Experimental Evaluation

The proposed methodology was evaluated on a set of high resolution time-lapse images depicting populations of adult nematodes with approximately 200 specimens. The end point set was extracted from ground truth images, and straight initial contours were placed between pairs of matching points according to the criteria presented in Section 2. Both contour sets, with 903 and 1684 elements, each element having 16 control points, were optimized until convergence. To estimate the conditional probability distributions we built a training set of 50 randomly selected nematode and non-nematode contours. Given the non-Gaussian nature of the P(Me|w) and P(Se|w) data, we fitted them using Weibull and gamma probability density functions, respectively, to extract the distribution parameters. The features derived from the expected point energy and the point sequence energy definitions comprised all the possible combinations of energy terms. Every feature type was evaluated separately and in combination, totaling 21 energy-based features. For completeness we also included the total contour energy E_contour. We additionally performed energy-based feature classification considering different numbers of control points; to do so, an increasing number of control points at both ends of every contour was gradually discarded.

To assess the performance of the proposed energy-based features we compared them to geometrical features used in previous work on nematode classification [3]. They include: the contour length Len; the summation of signed distances from the end points to the contour's centroid, which provides a measure of symmetry Sym; a compactness metric Cmp, calculated as the ratio between the contour length and its eccentricity; and the angle change rate Acr, computed from the summation of the differences in angles between contour segments, normalized by the length and number of control points. We tested them separately and combined, using the same probabilistic framework described in Section 3.

Table 1 summarizes the classification results; it shows the true positive rate Tp, the false positive rate Fp, and the distance D to perfect detection corresponding to the best performance for every feature type. In the case of energy-based features, the first column also specifies the energy terms included and the number of control points.

Table 1. Best classification results for energy and non-energy based feature combinations
                        |       Line evidence        |      Expected length
                        |   D      Tp      Fp        |   D      Tp      Fp
S^16(e_t, e_b, e_ext)   | 0.263   0.884   0.236      | 0.137   0.911   0.104
M^10(e_t, e_ext)        | 0.406   0.614   0.125      | 0.227   0.800   0.108
M + S^12(e_t, e_ext)    | 0.543   0.467   0.106      | 0.398   0.604   0.044
Len + Sym + Acr         | 0.479   0.924   0.473      | 0.352   0.901   0.338
E_contour               | 0.747   0.924   0.743      | 0.736   0.923   0.732
The proposed energy-based features consistently show a better trade-off between true and false detection rates than the other features. Though in combination the true positive rate drops, it is still comparable with that of the non-energy-based features, which, despite detecting most nematode contours, have a high rate of false detections. The total contour energy E_contour performed poorly. The discriminative power of point sequence features increases as more control points are added, while for expected point energy features the results improve when this number decreases. This indicates that the nematode and non-nematode contour classes have similar average energy value distributions, and only when the contour's central part is analyzed is the difference large enough to allow reliable classification. A possible explanation relies on the fact that the nematode's central area is the least flexible part of its body, so contour variations become prominent if we use only the central control points. Regarding the two search spaces, we noticed that results improve as we include more initial contours, since we have more possibilities of segmenting all the nematodes contained in the sample.
Fig. 3. Classification results for nematode (green) and non-nematode (red) contours; some non-nematode contours were removed to improve visibility
The results showed that the single most discriminating energy term for the M_e, S_e and M_e + S_e features is the tension energy term e_t; the spatial distribution of control points appears to capture nematode evidence accurately. This observation is explained in terms of the relations between energy terms during optimization. Since in our image set nematodes show lower external energy e_ext values near the center, control points tend to gather in that area; however, as they move, e_t increases in the vicinity of the contour ends and pulls them in the opposite direction. Therefore, the distance between control points varies depending on the regions where they are located, and in our specimens these regions correspond to nematode appearance features. It must be noted that only by combining several energy terms can the false positive rate be consistently reduced. As expected, the bending energy e_b allows us to filter out contours with sharp turns, and the
external energy e_ext those with a spatial intensity distribution too different from those found in the population (Fig. 3). Nematode contour misclassification occurs when appearance information is lost or in the presence of an unusual shape configuration. The first case includes nematodes close to the petri dish border, where lighting conditions reduce the contrast between foreground and background. The other case is frequently the result of optical distortion produced by the microscope lens. Non-nematode contours can be mistakenly classified when most of their control points converge towards a real nematode, for instance in the presence of parallel nematodes very close to each other, or when, in heavily overlapping regions, a contour manages to run over parts of several objects and still resemble a real nematode (Fig. 4).
Fig. 4. Misclassification examples (yellow). Right: nematode contour affected by blur. Left: non-nematode contour partially running over different nematodes in overlapping region.
The change of relative optical density at junctions constitutes the main source of structural noise. The resulting darker areas negatively affect the spatial distribution of control points during the optimization process and hence the recovered energy values. The more occluded a nematode is, the lower its discriminant function value; nevertheless, correct detection of a number of nematodes in overlapping regions is feasible when enough shape information is retained. We also noticed that nematode contours sharing an end point with wrongly detected contours have a consistently higher discriminant function value; this relation could be used to further improve detection results but has not been explored yet in these experiments.
5 Conclusions

A set of features for the detection of individual nematodes in populations has been proposed. The resulting patterns from a set of optimized contours proved a valid source of shape evidence for the recognition of specimens in difficult scenarios. The detection rates allowed us to reject most non-nematode contours while keeping a significant number of correctly detected nematodes.
The proposed approach differs from existing shape modeling approaches, where feature points are manually located on salient regions of individual objects to build linear and non-linear shape models. We use the evolution of active contour models to capture object statistics, thereby constraining the range of possible appearance and geometrical configurations to those present in the current sample set. Features based on average and local contour energy component distributions were tested on manually segmented images in the framework of Bayesian inference. Experimental results with two different contour initialization strategies show that energy-based features provide better detection rates than the geometry-based features commonly applied in image processing of biological samples. In particular, the combination of energy terms displayed a consistent performance for true nematode detection. When nematode and non-nematode contours have similar average feature values, the results can be improved if only the central region of the contour is evaluated, which is consistent with the morphological characteristics of these specimens captured during the optimization process. Despite the limitations of active contours in converging correctly in low contrast regions or in the vicinity of sharp corners, we found that recognition is still feasible if a sufficient amount of shape information is retained, even in overlapping regions. Further improvement in detection rates could be achieved if interactions between classified contours and prior knowledge about line evidence were included; however, this is out of the scope of this paper. We leave for future work the extension of our findings to video sequences for tracking moving nematodes in occlusion situations.

Acknowledgments. This work was supported by the VLIR-ESPOL program under component 8; the images were kindly provided by Devgen Corporation.
References 1. Bengtsson, E., Bigun, J., Gustavsson, T.: Computerized Cell Image Analysis: Past, Present and Future. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 395– 407. Springer, Heidelberg (2003) 2. Fdez-Valdivia, J., De la Blanca, P.N., Castillo, P., Gomez-Barcina, A.: Detecting Nematode Features from Digital Images. Journal of Nematology 24, 289–298 (1992) 3. Wei, G., Cosman, P., Berry, C., Zhaoyang, F., Schafer, W.R.: Automatic tracking, feature extraction and classification of C. elegans phenotypes. IEEE transactions in Biomedical Engineering 51, 1811–1820 (2004) 4. Van Osta, P., Geusebroek, J., Ver Donck, K., Bols, L., Geysen, J., ter Haar Romeny, B.M.: The Principles of Scale Space Applied to Structure and Color in Light Microscopy. Proceedings Royal Microscopical Society. 37, 161–166 (2002) 5. Meijering, E., Jacob, M., Sarria, J.-C.F., Unser, M.: A Novel Approach to Neurite Tracing in Fluorescence Microscopy Images. Signal and Image Processing. 399, 96–148 (2003) 6. Meijering, E., Smal, I., Danuser, G.: Tracking in Molecular Bioimaging. IEEE Signal Processing Mag. 3, 46–53 (2006) 7. Moller, S., Kristensen, C., Poulsen, L., Cartersen, J., Molin, M.: Bacterial Growth on Surfaces: Automated Image Analysis for Quantification of Rate-Related Parameters. Applied and Environmental Microbiology 6(1), 741–748 (1995)
8. Baguley, J., Hyde, L., Montagna, P.: A Semi-automated Digital Microphotographic Approach to Measure Meiofaunal Biomass. Limnology and Oceanography Methods. 2, 181–190 (2004) 9. Tomankova, K., Jerabkova, P., Zmeskal, O., Vesela, M., Haderka, J.: Use of Image Analysis to Study Growth and Division of Yeast Cells. Journal of Imaging Science and Technology 6, 583–589 (2006) 10. Twining, C., Taylor, C.: Kernel Principal Component Analysis and the Construction of Non-Linear Active Shape Models. In: British Machine Vision Conference, pp. 26–32 (2001) 11. Kirbas, C., Quek, F.K.H.: Vessel Extraction Techniques and Algorithms: A Survey. In: Proceedings 3th IEEE Symposium on BioInformatics and BioEngineering, pp. 238–246. IEEE Computer Society Press, Los Alamitos (2003) 12. Aylward, S., Bullitt, E.: Initialization, noise, singularities, and scale in height ridge traversal for tubular object centerline extraction. IEEE Transactions in Medical Imaging. 21, 61–75 (2002) 13. Geusebroek, J., Smeulders, A., Geerts, H.: A minimum cost approach for segmenting networks of lines. International Journal of Computer Vision. 43, 99–111 (2001) 14. Hicks, Y., Marshall, D., Martin, R., Rosin, P., Bayer, M., Mann, D.: Automatic landmarking for biological shape model. In: Proceedings IEEE International Conference on Image Processing, vol. 2, pp. 801–804. IEEE, Los Alamitos (2002) 15. Carmichael, O., Hebert, M.: Shape-based recognition of wiry objects. Pattern Analysis and Machine Intelligence. 26, 1537–1552 (2004) 16. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision. 4, 191–200 (1997) 17. Foulonneau, A., Charbonnier, P., Heitz, F.: Geometric shape priors for region-based active contours. In: Proceedings IEEE International Conference on Image Processing, vol. 3, pp. 413–416. IEEE Computer Society Press, Los Alamitos (2003) 18. Tsechpenakis, G., Rapantzikos, K., Tsapatsoulis, N., Kollias, S.: A snake model for object tracking in natural sequences. Signal Processing Image Communications. 19, 219–238 (2004) 19. Zimmer, C., Olivo-Marin, J.: Coupled parametric active contours. IEEE Transactions on Pattern Analysis and Machine Intelligence. 27, 1838–1842 (2005) 20. Neuenschwander, W., Fua, P., Iverson, L., Székely, G., Kubler, O.: Ziplock snakes. International Journal of Computer Vision. 23, 191–200 (1997) 21. Jiankang, W., Xiaobo, L.: Guiding ziplock snakes with a priori information. IEEE Transactions on Image Processing. 12, 176–185 (2003) 22. Dong Joong, K., JongEun, H., In So, K.: Fast object recognition using dinamic programing from combination of salient line groups. Pattern Recognition. 36, 79–90 (2003) 23. Lacoste, C., Descombes, X., Zerubia, J.: Point Processes for Unsupervised Line Network Extraction in Remote Sensing, IEEE Trans. Pattern Analysis and Machine Intelligence 27, 1568–1579 (2005) 24. Steger, C.: An unbiased detector of curvilinear structures. IEEE Trans. Pattern Anal Machine Intell. 20, 113–125 (1998)
Logarithmic Model-Based Dynamic Range Enhancement of Hip X-Ray Images

Corneliu Florea, Constantin Vertan, and Laura Florea

Image Processing and Analysis Laboratory, University "Politehnica" of Bucharest, Romania
Abstract. Digital capture of radiographic film with a consumer digital still camera significantly decreases the dynamic range and, hence, the visibility of details. We propose a method that boosts the dynamic range of the processed X-ray image based on the fusion of a set of digital images acquired under different exposure values. The fusion is controlled by fuzzy-like confidence information, and the luminance range is oversampled by using logarithmic image processing operators.
1 Introduction
X-ray imaging is a widely used technique for medical inspection. Although modern technology provides means and apparatus for digital acquisition, such an option may not be feasible: it is unfortunate, but modern technology does not always come at an accessible cost. Furthermore, the radiographies acquired with analog means (i.e. film) in the past store valuable information for present medical investigations. Considering this reasoning, we assumed a low-cost alternative acquisition scheme, which implies photographing the radiographic film with a digital still camera. However, such an approach has a major drawback: the quantity of information available in a radiography is seriously reduced by the low dynamic range of a digital still camera output. The typical radiography produces images that span a dynamic range of some 75 dB, while consumer digital cameras output values in a dynamic range of some 48 dB. The trivial solution for overcoming the obvious loss of information is to combine frames acquired with different exposures and to posteriorly process the results (involving registration, camera response function (CRF) estimation and frame fusion under various processing models). The resulting quantization oversamples the output space, such that the dynamic range and the visibility of details are increased. For illustration, we will present examples of high dynamic range images obtained from multiple exposures of a hip prosthesis X-ray. Conclusions and perspectives end the current material.
This work was supported by the CEEX VIASAN grant 69/2006.
2 Bracketing: Retrieving High Dynamic Range Images from Multiple Exposures
The straightforward solution to the problems generated by the reduced dynamic range of the digital still camera is to combine multiple images of the same scene, taken under various settings (exposure time, aperture). The camera response function (CRF) determines the weights of the mixture parts. This approach is a particular case of super-resolution and is generally known as bracketing. The underlying idea is that each of the images to be combined captures with high quality only a certain part of the scene gamut. The bracketing algorithm selects (under the assumption that the multiple images are perfectly aligned), for each pixel of the scene image spatial support, the combination of frames that provides the best value. Thus, an implementation of the dynamic range increase consists of several steps: a first step of image registration (that aligns the multiple images captured from the scene), a step of CRF estimation, and the actual image combination (or fusion, or pixel value selection) that computes the enhanced image.

2.1 Image Registration
Image registration means the geometrical alignment of multiple images of a scene based on the matching of the content. Image registration is a widely dealt issue in the field of image processing and several solutions (block matching methods, edges matching methods, object matching methods or global matching methods) are at hand [1]. We used here the robust global matching method of spectrum phase correlation [2], [3]. The underlying idea is based on the translation property of the Fourier transform F: a translation in the spatial (or time) domain t of a signal x yields a phase shift in the transformed domain:

F[x(t + t_0)](ω) = F[x(t)](ω) · e^{−jωt_0}   (1)
Therefore, for a pair of non-aligned images, one will find the corresponding shift as the maximum difference in the phase spectrum of the images. However, the method performs well only if the images exhibit a similar content and if there is no rotational misalignment. The roll component (that produces rotational misalignment) is the least significant motion component for hand-held pictures. If a tripod is considered for capture, imperfections of its mechanical extensions induce only image translations.
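A minimal NumPy sketch of the phase correlation estimate is shown below; it assumes equally sized, single-channel images, recovers only integer translations, and the sign convention of the returned displacement should be validated on a known shift.

```python
import numpy as np

def phase_correlation_shift(img_a, img_b):
    # Cross-power spectrum keeps only the phase difference between the two images
    fa = np.fft.fft2(img_a)
    fb = np.fft.fft2(img_b)
    cross = fa * np.conj(fb)
    cross /= np.abs(cross) + 1e-12
    corr = np.real(np.fft.ifft2(cross))            # correlation surface
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks beyond half the image size correspond to negative displacements
    return tuple(p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape))
```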
2.2 Rough Estimation of Camera Response Function
The CRF (denoted in the current material by g) is the mapping of the device recorded brightness to the scene radiance. The scene radiance is given by the APEX [4] equations as a function of several exposure and device parameters. The
APEX equation that relates the exposure time, the aperture and the incident light is:

EV = − log_2(t) + 2 log_2 N = (S/K) ∫_0^t φ(t) dt   (2)

where EV is the exposure value, the log of t represents the APEX time value (TV), N is the relative diaphragm opening (the log of N represents the APEX aperture value, AV), φ(t) is the incident light, S is the sensor sensitivity (or the amplification, for digital cameras) and K is a known constant. The observation made by Debevec and Malik [5] is of paramount importance for practical bracketing solutions: a set of differently exposed images usually contains enough information to recover the CRF using the images themselves. If the scenario keeps the scene, the aperture number and the amplification constant then, by taking into account the right term of equation (2), the measured intensity is linearly dependent on the exposure time. To be more precise, let us assume that images A and B of the same scene were photographed with different exposure times t_A and t_B, respectively. Given a photo-detector, its charge from the two images must preserve the same ratio as the exposure times. Now, if we come to the reported pixel values u_A and u_B, we get the basic CRF equation:

g(u_B) = (t_B / t_A) g(u_A)   (3)

Recovering g from equation (3) is a difficult task [6]. Certain restrictions have to be imposed on g. The minimum irradiance, 0, will produce no response of the imaging system, hence g(0) = 0. The maximum irradiance is an unrecoverable parameter, but the sensor output is limited by a saturation level in the photo-detectors, u_max; therefore there is an upper bound: g(u_max) = D. The monotonic behavior of g is also a typical assumption. Mann and Picard [7] proposed a gamma-like function for g, while Mitsunaga and Nayar [8] used a low degree polynomial regression. Debevec and Malik [5] used a smoothness constraint and recovered the response using a nonparametric model of g, sampled at certain values and represented by a vector. For our purposes these approaches are too complicated. Furthermore, it is not feasible to assume that, independently of the frame exposure value, the camera outputs the scene brightness correctly. For over-exposed pictures, it is less likely that pixels having values near the saturation level are accurately recorded. For under-exposed pictures, values from the lower part of the range suffer from noise and their reported values are corrupted by quantization error. Instead of a precise determination of the g function, as in the other mentioned approaches, we simply compute the confidence that we have in a value recorded at a given exposure bias. There are different pairs {t, N} (exposure time - aperture) that satisfy equation (2). Most of the digital still cameras available on the market are capable of estimating the deviation of the exposure value from the set that balances equation (2). Thus, multiple scenes with the same EV may be obtained; averaging the results will decrease the error of estimation.
Given an exposure value, an image of the usual Macbeth Color Checker chart should exhibit a known set of values. In reality, the camera outputs different brightness intensities. The sum of the squared differences between the output values and the expected values, normalized by the expected value, is used as an error measure, ε. A low order polynomial regression is employed to extend the domain of the error function from the 24 original values (the number of patches in the chart) to the required range [0, 255]. The error function is represented as a matrix where the rows are bound to the exposure value parameter, while the columns span the possible gray-levels: ε → ε(EV, u). The confidence function is computed similarly to a fuzzy negation from the globally normalized error functions:

μ(EV, u) = 1 − ε(EV, u)   (4)

where, again, EV denotes the exposure value and u denotes the gray level. Examples of non-normalized, interpolated error functions and their corresponding confidence functions, computed on images acquired with an SLR-like (Kodak DX6490) digital camera, are shown in figure 1.
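The construction of the confidence table can be sketched as follows; the array names, the polynomial degree and the clipping are illustrative choices under the assumption that one reference value per Macbeth patch is available.

```python
import numpy as np

def confidence_table(measured, expected, exposure_values, n_levels=256, deg=4):
    """Build mu(EV, u) = 1 - eps(EV, u) from Macbeth chart measurements.

    measured: (n_EV, 24) camera outputs per patch and exposure value
    expected: (24,) reference gray levels of the 24 patches
    """
    levels = np.arange(n_levels)
    eps = np.empty((len(exposure_values), n_levels))
    for i, _ in enumerate(exposure_values):
        # squared differences normalized by the expected value, per patch
        err = (measured[i] - expected) ** 2 / expected
        # low-order polynomial regression extends the 24 samples to [0, 255]
        coeffs = np.polyfit(measured[i], err, deg)
        eps[i] = np.clip(np.polyval(coeffs, levels), 0.0, None)
    eps /= eps.max()            # global normalization
    return 1.0 - eps            # fuzzy-negation-like confidence
```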
Fig. 1. The top row shows the measured errors with respect to the 0 − 255 gray level range for three exposure values (EV=-1, EV=0, EV=1). The bottom row presents the corresponding confidence functions μ.
2.3 Image Fusion
The image fusion step is the actual dynamic range increasing procedure. A simple approach for fusing a set of N frames taken by a digital camera under several exposures is to discard the pixels with saturated values and to average the remaining values [8]. The frames, denoted by f1 , ..., fN , are corrected by the
exposure factor EV(i), such that the pixel located at coordinates (l, m) in the resulting high dynamic range image f_HDR is obtained as:

f_HDR(l, m) = (1/N_0) Σ_{i=1}^{N_0} 2^{EV(i)} · f_i(l, m)   (5)

where N_0 is the number of frames having non-saturated values at the specified location. Taking into account the confidence value computed in the previous subsection, a more informative approach is to consider the weighted average (or the convex combination of the pixel values). The weights encode the confidence that a value is outputted correctly. By this approach, the high dynamic range image is computed as:

f_HDR(l, m) = [ Σ_{i=1}^{N} μ(EV(i), f_i(l, m)) · 2^{EV(i)} · f_i(l, m) ] / [ Σ_{i=1}^{N} μ(EV(i), f_i(l, m)) ]   (6)
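A small sketch of the confidence-weighted fusion of equation (6) follows; the frames are assumed to be registered 8-bit images, the μ table comes from the previous subsection, and all names are illustrative.

```python
import numpy as np

def fuse_hdr(frames, evs, mu):
    """Confidence-weighted fusion of differently exposed frames (eq. (6) style).

    frames: list of 2-D uint8 arrays; evs: exposure values; mu: (n_EV, 256) table.
    """
    frames = [f.astype(np.float64) for f in frames]
    num = np.zeros_like(frames[0])
    den = np.zeros_like(frames[0])
    for f, ev, m in zip(frames, evs, mu):
        w = m[f.astype(np.intp)]          # confidence of each recorded value
        num += w * (2.0 ** ev) * f
        den += w
    return num / np.maximum(den, 1e-12)
```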
3 The Logarithmic Model for Image Fusion
The image values represent, in the case of an X-ray image, the transparency (or the opacity) of the real objects imaged by any given pixel. The underlying physical properties of the imaging system are naturally multiplicative. The key to the logarithmic image processing (LIP) approaches is a homomorphism which transforms the product into a sum (by a logarithm), allowing the use of classical linear filtering in the presence of additive components. Also, it should be clear that the functions used are bounded (taking values in a bounded interval [0, D)). During image processing the following problem may appear: the mathematical operations on real valued functions implicitly use the algebra of the real numbers (i.e. the whole real axis), and we are faced with results that may fall outside the interval [0, D) – the physically meaningful values.

3.1 The Classical LIP Model
In the classical LIP model [9], [10], the intensity of an image is completely modelled by its gray tone function v, with v ∈ [0, D). In this model, the addition of two gray tone functions v_1 and v_2 and the multiplication of v by a real number λ are defined in terms of the usual IR operations as:

v_1 ⊕ v_2 = v_1 + v_2 − (v_1 · v_2) / D   (7)

and respectively:

λ ⊗ v = D − D (1 − v/D)^λ   (8)

The use of the operations defined in (7) and (8) leads to an increased visibility of objects in dark areas, as well as to the prevention of saturation in high-brightness areas [11].
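For reference, the two classical LIP operations translate directly into code; D is the gray-level bound used in the text, and the functions work on scalars or elementwise on arrays.

```python
def lip_add(v1, v2, D=256.0):
    # Classical LIP addition (eq. (7)): v1 (+) v2 = v1 + v2 - v1*v2/D
    return v1 + v2 - (v1 * v2) / D

def lip_scalar_mul(lam, v, D=256.0):
    # Classical LIP scalar multiplication (eq. (8)): lam (x) v = D - D*(1 - v/D)**lam
    return D - D * (1.0 - v / D) ** lam
```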
3.2 The Homomorphic LIP Model

The logarithmic model introduced in [12] works with bounded real sets: the gray-tone values of the involved images, defined in [0, D), are linearly mapped onto the standard set (−1, 1):

z = (2/D) (u − D/2)   (9)

where u ∈ [0, D) and z ∈ (−1, 1). The (−1, 1) interval plays the central role in the model: it is endowed with the structure of a linear (moreover, Euclidean) space over the scalar field of real numbers IR. In this space, the addition between two gray-levels z_1 and z_2 is defined as:

z_1 ⊕ z_2 = (z_1 + z_2) / (1 + z_1 z_2)   (10)

while the multiplication of a gray level z with a real scalar λ ∈ IR is:

λ ⊗ z = [ (1 + z)^λ − (1 − z)^λ ] / [ (1 + z)^λ + (1 − z)^λ ]   (11)
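The homomorphic model's mapping and operations can be sketched the same way; the functions below follow equations (9)-(11) and operate on scalars or elementwise on arrays.

```python
def to_log_domain(u, D=256.0):
    # Map gray levels [0, D) onto (-1, 1) (eq. (9))
    return (2.0 / D) * (u - D / 2.0)

def homomorphic_add(z1, z2):
    # Addition in the (-1, 1) space (eq. (10))
    return (z1 + z2) / (1.0 + z1 * z2)

def homomorphic_scalar_mul(lam, z):
    # Scalar multiplication in the (-1, 1) space (eq. (11))
    a = (1.0 + z) ** lam
    b = (1.0 - z) ** lam
    return (a - b) / (a + b)
```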
3.3 Over-Sampled Fused Images
The advantage of the use of LIP models lies in the dynamic range reported by the resulting images. If one examines equation (5) with inputs being all possible combinations of pairs of values between 0 and D, there will be 2D − 1 possible resulting levels. If the operation is performed using equation (7), the number of different output levels is on the order of D²/4, while equation (10) leads to an order of D²/2. The logarithmic addition thus produces an over-sampling of the output value space. The corresponding dynamic range value for D = 256 is, roughly,

DR = 20 log(D²/2) ≈ 90 dB.

Thus, by implementing the image fusion in a logarithmic space (or, shortly, by applying log-bracketing), the resulting image will exhibit a largely increased number of different brightness levels (which can give the user the possibility of detecting objects in areas displayed uniformly in the original images).
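A quick numerical check of this figure, assuming the base-10 logarithm:

```python
import math

D = 256
print(20 * math.log10(D ** 2 / 2))   # ≈ 90.3 dB
```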
4 Results
The proposed methods were used to enhance hip prosthesis X-ray images taken with a consumer digital camera from an original radiographic film placed on an opaque illuminator (negatoscope). For each film, a set of images with various exposures (as shown in figure 2) was acquired.

Fig. 2. Originally acquired images: a) under-exposed image (EV=-1); b) correctly exposed image (EV=0); c) over-exposed image (EV=1)

High dynamic range images were produced by the four described approaches: simple averaging (as defined by equation (5)) and CRF weighted averaging (as defined by equation (6)), implemented with classical IR addition/multiplication and with LIP model (both classical and homomorphic) addition/multiplication. The intensity values were quantized with 12 bits per pixel (bpp) precision. Figure 3 presents an example of such high dynamic range X-ray images. The 12 bpp gray level images were displayed on usual RGB color displays using an extension of the classical gray level map via highly unsaturated colors that match the needed luminance levels and uniformly sample the luminance range. Indeed, the human visual system is unable to distinguish colors for which the difference between the maximal and minimal RGB components is small (less than 5 units on the 256-unit scale). As such, the 4096 gray levels needed for the 12 bpp representation are obtained from the 256 classical (and exact) gray levels and 3840 highly unsaturated colors. The criteria used for choosing the best picture are the number of visible details of the prosthesis and the distinction between its parts, the visibility of the bone channel surrounding the prosthesis tail and the visibility of the bone fibre structure. Under such criteria, the high dynamic range images computed using the convex combination are the best; the direct implementation, in this case, leads to several drawbacks, such as a smearing effect on the background (which is expected to be completely dark) and less contrast in the prosthesis tail area. Among them, the images computed using the convex combination implemented according to the LIP model are the best. Figure 4 shows some of the relevant prosthesis details.
Fig. 3. High dynamic range images obtained from the set presented in figure 2 by averaging (as defined by equation (5)) using a) IR addition and multiplication, b) classical LIP addition and multiplication c) homomorphic LIP addition and multiplication and by CRF weighted averaging (as defined by equation (6)) using d) IR addition and multiplication, e) classical LIP addition and multiplication f) homomorphic LIP addition and multiplication
5 Conclusions
We presented a new method that takes as input a set of X-ray frame-images of the same subject acquired with different exposure values and combines them into a high dynamic range image. The proposed fusion scheme requires confidence
Fig. 4. Details from X-ray prosthesis images: top two rows – prosthesis head and cup, bottom row – prosthesis tail. The images are: a) well exposed original images (EV=0) and high dynamic range images obtained by CRF weighted averaging (as defined by equation (6)) using b) IR addition and multiplication, c) classical LIP addition and multiplication, d) homomorphic LIP addition and multiplication. The classical LIP model seems to yield the greatest detail visibility.
information derived from the non-linearity of the camera response function. Performing the operations required by the fusion scheme according to a logarithmic image processing model greatly increases the number of resulting gray levels. Therefore, objects placed in uniform areas become easier to examine. The proposed method was successfully applied to enhance the dynamic range of hip prosthesis X-ray film images acquired with a consumer digital camera. Even though the classical LIP model was designed for special categories of images, there is evidence that the homomorphic LIP model is suitable for most images. For this reason, we intend to test the described method on natural images as well.
References 1. Schechner, Y.Y., Nayar, S.K.: Generalized mosaicing: High dynamic range in a wide field of view. International Journal on Computer Vision 53, 245–267 (2003) 2. Kuglin, C.D., Hines, D.C.: The phase correlation image alignment method. In: Proc. of IEEE Conference on Cybernetics and Society, Bucharest, Romania, pp. 163–165. IEEE Computer Society Press, Los Alamitos (1975) 3. Averbuch, A., Keller, Y.: Fft based image registration. In: Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP ‘02, Orlando FL, USA, vol. 4, pp. 3608–3611. IEEE, Los Alamitos (2002) 4. PH2.5-1960, A.: American standard method for determining speed of photographic negative materials (monochrome, continuous tone) United States of America Standards Institute (1960) 5. Debevec, P., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: Proc. of ACM SIGGRAPH 24th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles CA, USA, vol. 1, pp. 369–378. ACM, New York (1997) 6. Grossberg, M.D., Nayar, S.K.: High dynamic range from multiple images: Which exposures to combine? In: Proc. of IEEE Workshop on Color and Photometric Methods in Computer Vision at ICCV 2003, Nice, France, IEEE, Los Alamitos (2003) 7. Mann, S., Picard, R.: Being ’undigital’ with digital cameras: Extending dynamic range by combining differently exposed pictures. In: Proc. of ST’s 48th Annual Conference, Washington D.C. USA, vol. 1, pp. 422–428 (1995) 8. Mitsunaga, T., Nayar, S.K.: High dynamic range imaging: Spatially varying pixel exposures. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition CVPR, Hilton Head SC, USA, vol. 1, pp. 472–479. IEEE, Los Alamitos (2000) 9. Jourlin, M., Pinoli, J.C.: A model for logarithmic image processing. Journal of Microscopy 149, 21–35 (1998) 10. Jourlin, M., Pinoli, J.C.: Logarithmic image processing. Advances in Imaging and Electron Physics 115, 129–196 (2001) 11. Deng, G., Cahill, L.W., Tobin, G.R.: The study of logarithmic image processing model and its application to image enhancement. IEEE Trans. on Image Processing 4, 506–512 (1995) 12. Patra¸scu, V., Buzuloiu, V., Vertan, C.: Fuzzy image enhancement in the framework of logarithmic model. In: Nachtegael, M., Kerre, E. (eds.) Algorithms in Modern Mathematics and Computer Science. Studies in Fuzziness and Soft Computing, vol. 122, pp. 219–237. Springer Verlag, Heidelberg (2003)
A New Color Representation for Intensity Independent Pixel Classification in Confocal Microscopy Images
Boris Lenseigne, Thierry Dorval, Arnaud Ogier, and Auguste Genovesio
Image Mining Group, Institut Pasteur Korea, Seoul, Korea
[email protected]
Abstract. We address the problem of pixel classification in fluorescence microscopy images by only using wavelength information. To achieve this, we use Support Vector Machines as supervised classifiers and pixel components as feature vectors. We propose a representation derived from the HSV color space that allows separation between color and intensity information. An extension of this transformation is also presented that allows performing an a priori object/background segmentation. We show that these transformations not only allow intensity independent classification but also make the classification problem simpler. As an illustration, we perform intensity independent pixel classification first on a synthetic image and then on real biological images.
1 Introduction
In confocal microscopy image analysis there are, depending on the application, two main ways to qualify a biological phenomenon: studying the relative localization of the different marked objects or monitoring their fluorescence variation. Our work addresses the second case and concerns the study of siRNA transfection. SiRNA molecules, marked with a red dye, are used to inhibit specific proteins, here the Green Fluorescent Protein (GFP) produced by human macrophage mutant cells. When the cells are transfected (i.e., siRNA molecules enter the cell), the GFP production is inhibited and the cell's green fluorescence decreases. Quantifying the variation of cell fluorescence is a way to assess the amount of transfection. Thus, the aim of the analysis is to discriminate the cells containing siRNA from the others prior to quantifying their respective fluorescence. The input images are two-band color images. In this paper, we address the problem of pixel classification to define whether a pixel belongs to the background, to a transfected or to a non-transfected cell. As we are interested in monitoring the fluorescence intensity of transfected cells, we consider that signal intensity should not be taken into account for object identification, so that the different kinds of objects should only be distinguished based
Images concerning the biological application have been provided by J-P. Carralot from the BIP-TB group at IPK.
on the wavelength they emit. Nevertheless, the assumption that significant objects have a higher intensity than the background can be used to perform a prior object/background separation. To perform this task, we use Support Vector Machine (SVM) classifiers. Classification is performed at the pixel level and a specific color representation is proposed that makes it possible to separate intensity and wavelength information. This approach allows the biologist to characterize examples corresponding to each case and to include in the model the artifacts that may occur in the images [1,2]. This paper is organised as follows: in section 2 we give a brief overview of SVM classifiers. In section 3, we propose a new color representation that splits wavelength and intensity information. An extension of this transformation is also presented to solve the problem of object/background segmentation. In section 4, we present some results on a synthetic image and then on a real transfection image. Finally, the conclusion describes further extensions of this work.
2 SVM Classifiers Overview
In their initial formulation, Support Vector Machines provide a two-class linear classifier that finds an optimal decision hyperplane by maximizing the distance between the individuals $x_i$ in the learning dataset and the decision hyperplane [3,4]. The SVM learning algorithm finds a decision function that defines on which side of that hyperplane a point $x$ lies, so that the function that assigns a label to each vector to classify is:
$$f(x) = \operatorname{sign}\Big(\sum_{i=1}^{l} \alpha_i y_i \, x_i \cdot x + b\Big).$$
This decision function is learned from a set of labelled data $\{(x_i, y_i),\ i = 1, \ldots, l,\ y_i \in \{-1, 1\}\}$. One of the main interests of SVMs is their ability to deal with cases where the examples in the learning dataset are not linearly separable. Those cases are handled by projecting the data into a higher-dimensional space: if $\Phi(x) : \mathbb{R}^n \to \mathbb{R}^{n^+}$ is the mapping function that realizes this projection, we can define a kernel function such that $K(x_i, x_j) = \Phi(x_i)\cdot\Phi(x_j)$. Using such a function allows defining the dot product in the augmented space without having to make the form of $\Phi(x)$ explicit, so that the decision function in the augmented space can be written:
$$f(x) = \operatorname{sign}\Big(\sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b\Big). \qquad (1)$$
Finally, it appears that most of the $\alpha_i$ will be 0. The vectors $x_i$ for which $\alpha_i \neq 0$ are used to define the separating hyperplane and are called "support vectors". Many kernel functions have been proposed in the literature [5,6]. Table 1 summarizes some commonly used kernel functions. In most cases, the kernel function depends on some additional parameters (e.g., $\gamma$ for the RBF kernel) and, in order to deal with noisy or ill-labelled data in the training set, one
introduces an additional parameter C in the SVM formulation. C is a penalization parameter that represents the trade-off between margin maximization and class separation (C = ∞ for a classifier with maximal classification rate). Classifiers using the C parameter are called C-SVM.

Table 1. Some commonly used kernel functions [7]
  linear:      $K(x_i, x_j) = x_i^T x_j$
  polynomial:  $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d,\ \gamma > 0$
  RBF:         $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2),\ \gamma > 0$
  sigmoid:     $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$
For our study, we use C-SVM with the RBF kernel. This kernel is known to be the most generic one [7]. The classifier's hyperparameters (C, γ) are estimated using a model selection algorithm that optimizes both classification rate and complexity [2]. The feature vectors to classify are pixel color components. In order to perform an intensity independent classification, an appropriate representation of these components is described in the next section.
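For illustration, a minimal C-SVM/RBF training loop with a cross-validated grid search over (C, γ) can be sketched as follows. The paper does not name an implementation; scikit-learn is assumed here, and the feature and label arrays are random placeholders standing in for the pixel components and the biologist-provided labels.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# placeholder data: one row per pixel (e.g. cHsHVm components), labels in {-1, +1}
rng = np.random.default_rng(0)
features = rng.uniform(-1.0, 1.0, size=(500, 2))
labels = np.where(features[:, 0] + features[:, 1] > 0, 1, -1)

# C-SVM with an RBF kernel; (C, gamma) chosen by cross-validated grid search,
# a simple stand-in for the model-selection algorithm of [2].
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100, 1000], "gamma": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(features, labels)
print("selected hyperparameters:", search.best_params_)
```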
3 An Intensity Independent Color Representation
We intend to define a pixel value representation that splits the color and intensity information and allows these components to be used as a feature vector for the classifier. This implies a representation where all components have similar amplitude and where the dot product between feature vectors is defined.
3.1 The HSV Color Space
In optical imagery, the HSV color space [8] provides an interesting representation for a large number of applications ranging from skin detection [9] to optical microscopy [10]. It provides a color decomposition close to human perception where wavelength, brightness and intensity information are separated:
– The H (Hue) channel determines which basic color it is. A hue is referenced as an angle on a color wheel (H ∈ [0, 360]).
– The S (Saturation) determines the grey level of the colour (or the amount of white light in the colour) (S ∈ [0, 1]).
– Finally, the V (Value) represents the global intensity of the light (V ∈ [0, 1]).
The standard RGB to HSV transformation is described below, with r, g, b, S, V ∈ [0, 1], H ∈ [0, 360], max = max(r, g, b), min = min(r, g, b):
$$H = \begin{cases} \text{undefined}, & \text{if } \max = \min \\ 60 \times \dfrac{g-b}{\max-\min}, & \text{if } \max = r \text{ and } g \ge b \\ 60 \times \dfrac{g-b}{\max-\min} + 360, & \text{if } \max = r \text{ and } g < b \\ 60 \times \dfrac{b-r}{\max-\min} + 120, & \text{if } \max = g \\ 60 \times \dfrac{r-g}{\max-\min} + 240, & \text{if } \max = b \end{cases} \qquad S = 1 - \frac{\min}{\max}, \qquad V = \max \qquad (2)$$
In fluorescent microscopy images, the different color bands are decorrelated and thus cannot be interpreted exactly in the same way. Moreover, each color band corresponds to a specific wavelength, therefore there is no white light component and the Saturation channel has a constant value (fig. 1).
3.2 The cHsHV Pixel Representation
The main drawback of the original HSV color space is that the hue value is an angle. Thus the dot product between two color vectors cannot be written directly, which makes this representation inappropriate for pixel classification. The solution we propose is to re-project the pixel components into a Cartesian coordinate system by using the trigonometric lines of the Hue channel instead of the angular value (fig. 1). As we consider that the Saturation channel brings no information, we can represent a pixel in this space by:
$$p_{cHsHV} = \begin{pmatrix} \cos(H) \\ \sin(H) \\ V \end{pmatrix}$$
Finally, the Value channel is rescaled from [0, 1] to [−1, 1], leading to similar dynamics for each channel. Using this representation, pixel components can directly be used for classification.
3.3 The cHsHVm (Value Masked) Pixel Representation
As the V channel corresponds to the global intensity of the signal, pixels belonging to significant objects have a Value slightly higher than the background. This makes it possible to perform an a priori object/background segmentation of the images by using the V channel as a mask. Pixels whose Value is too low are set to zero and we only consider the cos(H) and sin(H) components of the remaining pixels. This allows usage of a 2D representation of the pixels:
$$p = \begin{pmatrix} \cos(H)\,\Psi(V - V_0) \\ \sin(H)\,\Psi(V - V_0) \end{pmatrix}$$
where Ψ(x) is a step (Heaviside) function (Ψ(x) = 1 if x > 0, 0 otherwise) and V₀ is the minimal value for a significant signal.
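A compact numpy sketch of both representations is given below. The hue is computed per Eq. (2), the Value channel is the maximum of the RGB components, and the threshold value for V₀ is an illustrative placeholder rather than a value taken from the paper.

```python
import numpy as np

def chshv_features(rgb, v0=0.1, masked=True):
    """rgb: float array (..., 3) scaled to [0, 1]. Returns cHsHV or cHsHVm features."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx, mn = np.max(rgb, axis=-1), np.min(rgb, axis=-1)
    delta = np.where(mx > mn, mx - mn, 1.0)            # avoid division by zero
    h = np.where(mx == r, 60.0 * (g - b) / delta,
        np.where(mx == g, 60.0 * (b - r) / delta + 120.0,
                          60.0 * (r - g) / delta + 240.0))
    h = np.where(h < 0, h + 360.0, h)                  # the "g < b" branch of Eq. (2)
    h = np.where(mx == mn, 0.0, h)                     # hue undefined -> arbitrary 0
    rad = np.deg2rad(h)
    v = mx                                             # Value channel
    if masked:
        keep = (v > v0).astype(rgb.dtype)              # Heaviside step on V (cHsHVm)
        return np.stack([np.cos(rad) * keep, np.sin(rad) * keep], axis=-1)
    # cHsHV: V rescaled from [0, 1] to [-1, 1]
    return np.stack([np.cos(rad), np.sin(rad), 2.0 * v - 1.0], axis=-1)
```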
Fig. 1. Decomposition of a synthetic image in the different pixel representation : RGB, HSV, cHsHV and cHsHVm. The original image contains pure red and green (upper and lower band) and a mix in variable proportions of green and red (center part). Signal intensity gradually decreases on left and right border of the image. In the HSV representation, the Saturation channel brings no information and the Value channel carries the information about signal intensity. In the cHsHV representation, sin(H) and cos(H) bands do not depend on signal intensity. This information is accessible via the Value channel. In the cHsHVm transformation, the Value channel has been used to find significant pixels and the representation does not take into account the intensity anymore. Thanks to this representation, it becomes possible to address pixel classification by only using Hue information.
3.4 Using cHsHV/cHsHVm Representation for Classification
Besides the wavelength/intensity separation, the representations we propose have some major advantages for pixel classification:
– First, whereas RGB data need to be rescaled to an amplitude in [−1, 1], the cos(H) and sin(H) channels do not need such a transformation, so that cHsHVm images can be used directly for classification. In the case of cHsHV images, only the Value channel has to be rescaled from [0, 1] to [−1, 1].
– Second, by using the trigonometric lines of the Hue (cos(H) and sin(H)), the color information is projected on the trigonometric circle. This circle is extended to the surface of a cylinder when the Value channel is taken into account (cHsHV representation). Such configurations make the separation between the classes easier, as a given mix of wavelengths can always be separated from the others by a hyperplane (fig. 2).
– Finally, from the application point of view, since cHsHVm already provides an object/background separation, only one classifier is required to identify two different kinds of biological objects.
Fig. 2. Projection of pixels values of a confocal microscopy image in color space coordinates. The image presents siRNA transfected (red and green) and non tranfected GFP (green only) cells. RG plane projection emphasizes the fact that this color space is not well adapted for pixel classification. On the other hand, the cHsHV transformation leads to a projection of pixel values on the surface of a cylinder (for displaying, pixels components have been rescaled to [0, 255]). In this representation, any class of pixel presenting a specific combination of wavelength can be separated from the others by an hyperplane.
Next section shows some examples of pixel classification with both cHsHV and cHsHVm representations on a synthetic and on real confocal microscopy images.
4 Preliminary Results
The experiments demonstrate the efficiency of the cHsHV and cHsHVm representations for finding classes corresponding to a specified combination of wavelengths with a variable intensity. These experiments were performed using the SVM framework described in Section 2. SVMs are binary classifiers, so the classification has to be performed by defining one class against all the others. We will start by presenting some results on a synthetic image and thereafter results of pixel classification on a real microscopy image of siRNA transfection.
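The one-class-against-the-rest protocol used in the experiments can be sketched as follows; the data are random placeholders for the cHsHVm pixel features, and the hyperparameters are illustrative, not the values selected in the paper.

```python
import numpy as np
from sklearn.svm import SVC

def one_vs_rest_svms(features, labels, C=100.0, gamma=1.0):
    """Train one binary C-SVM (RBF) per class against all remaining pixels.
    features: (n_pixels, n_features); labels: integer class ids."""
    models = {}
    for cls in np.unique(labels):
        y = np.where(labels == cls, 1, -1)
        models[cls] = SVC(kernel="rbf", C=C, gamma=gamma).fit(features, y)
    return models

# toy stand-in for cHsHVm features of the three synthetic classes
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = rng.integers(0, 3, size=300)
classifiers = one_vs_rest_svms(X, y)
```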
4.1 Pixel Classification on a Synthetic Image
The synthetic image presents the different cases which occur in the real siRNA transfection images:
– non-transfected cells only contain green color, but with varying intensity;
Fig. 3. Classification of pixels corresponding to the pure green, pure red and mixed red and green classes. The classification is not possible in the original color space: all the colored pixels are affected to the same class. The cHsHV representation leads to a partition between the desired wavelength (label green) combination on one side, background and remaining colored pixels on the other side (label red). Finally the cHsHVm representation builds to classes by only considering the colored pixels that where previously selected while masking using the Value channel as mask.
– transfected cells present a mix of green and red color in different proportions depending on the transfection amount and GFP knock-down;
– extracellular siRNA appears in red (this class is not taken into account in the biological application).
These cases appear on a checkerboard mixed with black pixels simulating the background. The goal of the classification is to find the colored squares of the checkerboard that correspond to a given wavelength combination. The learning image contains all cases but with a constant intensity. The test consists of learning successively each class against the two others. Results are displayed in figure 3. As a validation of our approach, we first performed the tests on the original RGB image. In the RGB color space, pixel components inside each class cover a very large range and the distance between two pixels belonging to different classes can be very low (e.g., dark green, red and yellow). Thus, as we can see in figure 3, pixel classification in this color space is not possible.
Fig. 4. Pixel classification of an example image of the siRNA transfection study. The input image presents both cases: transfected cells on the left part, non transfected cells on the right. The figure presents the results using successively each kind of cells as label and small portion of the input image for learning. In the output images, the desired class is marked in green and pixels belonging to the other class in red. Black pixels are pixels that where not classified. Note that the classification was able to find cells that were not visible in the original image (red circles in the lower right image).
This problem does not occur in our representations based on the trigonometric lines of the Hue (cf. Sect. 3.2). As previously explained (cf. Sect. 3.4), pixel values are projected on either a cylinder (cHsHV) or a circle (cHsHVm). Those projections lead to configurations where any combination of wavelengths can easily be separated from the others. For the cHsHV representation, the classifier splits the pixels into two classes: the desired combination of wavelengths on one side, and the background together with the other wavelengths on the other side. When using the cHsHVm representation, the masking on the V channel sets pixels with a low intensity to zero. In this case, only the colored pixels are classified and the two classes correspond to the desired wavelength on one side and the colored pixels with a different wavelength on the other side (fig. 3).
4.2 Real Image Segmentation
The test image is a montage presenting both cases occurring in our application: the transfected cells contain a mix of green and red wavelengths in different
proportions. The non-transfected cells only emit light in the green channel. However, both transfected and non-transfected cells may have very variable intensities, within a range of 1 to 10 times in our application (fig. 4). The learning image is a small portion of the input image and also presents the two kinds of cells. Tests have been performed using either transfected or non-transfected cells as the labelled class. As for the synthetic image, the cHsHV representation finds the desired class and assigns all the remaining pixels to the other class, without taking into account whether they belong to a cell of the other class or to the background. The cHsHVm transformation implies a prior object/background separation, and also makes it possible to directly identify the two classes of cells.
5 Conclusion and Further Works
We addressed the problem of monitoring the fluorescence intensity of biological images acquired with a confocal microscope. More precisely, we focus on classifying pixels by only considering the wavelength emitted by the objects. To perform this task, we use Support Vector Machines for pixel classification and pixel components as feature vectors. To avoid the problems due to objects with a variable intensity, we propose to represent the color components of each pixel in a specific space where color and intensity information are separated (cHsHV). An extension of this transformation is also proposed to perform an a priori object/background separation (cHsHVm). We show that these representations not only provide the desired separation between wavelength and intensity but also change the topology of the feature space, thus leading to simpler classifiers than in the original color space. To validate our approach, we used these color representations to perform pixel classification first on a synthetic image and then on biological images from siRNA transfection monitoring. The results show the efficiency of our approach in performing intensity independent classification, while this task cannot be performed in the original RGB color space. As this approach is very promising, the next step is to use it to process large image databases for the siRNA transfection study. These representations associated with supervised classifiers also provide a generic tool for model-based recognition of biological objects. Finally, as classification was performed using a very generic framework, significant performance improvements can be expected from more specific tools that exploit the particular topology of the cHsHV or cHsHVm spaces.
References 1. Dorval, T., Genovesio, A.: Automated confocal microscope bias correction. In: of Physics, I., A. (ed.) 5th International Workshop on Information Optics; WIO’06, pp. 463–470 (2006)
2. Lenseigne, B., Brodin, P., Jeon, H., Christophe, T., Genovesio, A.: Support vector machines for automatic detection of tuberculosis bacteria in confocal microscopy images. In: IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Washington DC, USA, Springer, to be published (2007) 3. Scholkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002) 4. Burges, C.J.C.: A tutorial on Support Vector Machine for pattern recognition. Usama fayyad edn. (1998) 5. Boughorbel, S., Tarel, J.P., Fleuret, F., Boujemaa, N.: Gcs kernel for svmbased image recognition. In: Duch, W., Kacprzyk, J., Oja, E., Zadro˙zny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 595–600. Springer, Heidelberg (2005), http://www-rocq.inria.fr/tarel/icann05b.html 6. Ayat, N., Cheriet, M., Suen, C.: Optimization of the svm kernels using an empirical error minimization scheme. In: Lee, S.-W., Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, pp. 354–369. Springer, Heidelberg (2002) 7. Hsu, C., C.C., C., C.J., L.: A practical guide to support vector classification. National Taiwan University, Taipwi Taiwan (2003) 8. Munsell, A.: A Grammar of Color. Van-Nostrand-Reinhold, New York (1969) 9. Hung, S., Bouzerdoum, A.S., Chai, D.S.: Skin segmentation using color pixel classification: analysis and comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 148–154 (2005) 10. Ihlow, A., Seiffert, U.: Microscope color image segmentation for resistance analysis of barley cells against powdery mildew. In: 9. Workshop ”Farbbildverarbeitung”. Number Report Nr. 3/, Ostfildern-Nellingen, Germany, ZBS Zentrum f¨ ur Bild- und Signalverarbeitung e.V. Ilmenau (2003) pp. 59–66 (2003)
Colon Visualization Using Cylindrical Parameterization
Z. Mai, T. Huysmans, and J. Sijbers
University of Antwerp, IBBT-Vision Lab, Universiteitsplein 1, Building N, B-2610 Wilrijk
Abstract. Using cylindrical parameterization, the 3D mesh surface extracted from colon CT scan images is parameterized onto a cylinder, and afterwards visualized with a modified Chamfer distance transformation of the original CT images with regards to the colon centerline/boundary distance. The cylinder with information from distance transformation is then unfolded with numerical integration along its circumferential direction and mapped to a plane, which approximates the view of a colon cut open along its length.
1 Introduction
Computed tomographic (CT) colonography is a new-generation technique which can be used for detecting colorectal neoplasms by using volumetric CT data combined with specialized imaging software [1]. When CT was first introduced into practice more than 20 years ago, few radiologists expected to detect colon polyps by using this technology. Major advances in CT technology have shortened the acquisition of thin sections and volumes of body tissue to within seconds. Currently, state-of-the-art multichannel CT scanners can be used to acquire all of the data for an abdominopelvic examination in a single breath hold [1]. The fast acquisition combined with computer-aided 3D visualization proves to be fruitful for virtual colonography in various ways, the most important of which is non-invasive detection of the presence of pathologies, e.g., polyps. There are different methods in virtual colonography: one is 3D rendering and another is 2D image display. The former requires the user to navigate through the 3D colon data obtained by preprocessing, either surface rendering or volume rendering of the original colon data. Although this method is intuitive and user-friendly, due to the tortuous nature of the colon as well as of the polyps, it often results in uninspected areas, in addition to the fact that it is computationally costly to construct such a navigable image. Therefore, most investigators resort to it for confirmation instead of for primary evaluation [1]. On the other hand, 2D visualization of the colon either uses the original CT scan image as the primary investigation source, or alternatively utilizes various algorithms to flatten the 3D (surface) rendering result to obtain a 2D image for inspection [2]. In this paper, we propose an alternative approach. We first introduce a method to parameterize the 3D colon surface onto a cylinder with the same topology.
To this end, we construct a harmonic scalar field on the mesh surface. Certain mathematical techniques from Riemann surface theory [3, 4] enable us to map an arbitrary tubular surface with open ends onto a cylinder and further onto a plane. Our approach differs from the one in [3] in that we use a different procedure to cut open the tubular surface, and in that after flattening we make use of the Chamfer distance transformation of the original colon image to visualize the 2D flattened image with protruding features highlighted in color. The color table is easily tunable so as to emphasize the presence of the pathology. To preserve the colon geometry, especially along the circumferential direction, which is underrepresented in the rectangular flattened image, we again exploit information from the distance transformation when resampling the rectangular image, so that the length of each sampled line matches the circumference of the corresponding circle in the colon, thus minimizing the distortion resulting from the mapping from 3D to 2D. We now summarize the contents of this paper. In Section 2, we give an overview of our parameterization method, in Section 3, we describe the calculation of our Chamfer distance transformation, in Section 4, we provide the actual processing pipeline for the original colon CT scan images, and in Section 5, we draw some conclusions about our approach.
2 Cylindrical Parameterization
We first consider the mathematical model of the parameterization. First, let Λ ⊂ ℝ³ represent a continuous surface which is topologically equivalent to an open-ended cylinder. For the discrete representation, we assume that we are given as input a triangulated manifold mesh M = (P, T), composed of a set of vertices P and a set of triangles T. Each vertex i ∈ P is assigned a position x_i ∈ ℝ³ in 3-D Euclidean space. The boundary of Λ consists of two topological circles, which we will denote as χ0 and χ1. Since a cylinder cut open along its length is topologically equivalent to a rectangle, we want to find a mapping F : Λ → S that maps Λ to a rectangle, where F = u + iv, and u and v are both mappings from ℝ³ to ℝ. Let us first consider the construction of u. This function u can be found as the solution to the Laplace equation Δu = 0 with Dirichlet boundary conditions u = 0 on χ0 and u = 1 on χ1. Specifically, u is the harmonic function that minimizes the Dirichlet functional
$$D(u) = \frac{1}{2} \int_{\Lambda} |\nabla u|^2 \, dS, \qquad (1)$$
with boundary conditions
$$u|_{\chi_0} = 0 \quad \text{and} \quad u|_{\chi_1} = 1. \qquad (2)$$
In the discrete case, let PL(M ) denote the space of piecewise linear functions on M . For each vertex V ∈ M , let φV be the continuous function such that
$$\phi_V(V) = 1, \qquad (3)$$
$$\phi_V(W) = 0, \quad W \neq V. \qquad (4)$$
To find the minimizer of D(u), we introduce the matrix DV W = ∇φV · ∇φW dS
(6)
for any arbitrary pair of vertices V and W . If V = W , then as shown in Fig.1, for the two triangles that share the edge V W , V W X and V W Y , we have 1 DV W = − {cot ∠X + cot ∠Y } 2
(7)
where ∠X is the angle on vertex X in V W X, and ∠Y the one on vertex Y in V W Y . If V = W , then we have DV V = − DV W (8) W =V
In order to make u the minimizer, the following condition must be satisfied: DV W u W = − DV W (9) W ∈M\(χ0 ∪χ1 )
W ∈χ1
Therefore, we can solve the linear equation (9) to obtain the harmonic function u. We then make a cut on the surface from χ0 to χ1 . In order to do so, instead of simply jumping from one vertex to the next one as in [3], we start by traversing V Y
X
W
Fig. 1. Two triangles VWX and VWY that share a common edge VW
610
Z. Mai, T. Huysmans, and J. Sijbers
through one triangle with an edge on χ0 , and follow the gradient g of u in that triangle until it hits another edge. This introduces a split of the triangle into halves. We then move on to the next triangle, still following closely the gradient of u. Since u is continuously increasing from χ0 to χ1 , we are guaranteed to find a path C linking χ0 and χ1 that follows the gradient of u. Now to calculate its conjugate harmonic function v, we start from calculating the value of v on C by integrating the normal derivative of u [5], and then solve the linear system equations to get v similar to the case of u.
3
Region Growing Chamfer Distance Transformation
Typical distance maps are images in which the value of each pixel of the foreground represents its distance to the nearest pixel of the background. A Euclidean distance map [6] can be computationally expensive as a direct application following this definition usually requires an excessive computation time. Alternative fast algorithms were developed to generate Distance Transforms (DT), as approximations of the Euclidean distance maps, which have found applications in various fields such as chamfer matching, registration of medical images [7], generation of morphological skeletons [8] or active contour models. Numerous DT algorithms have been proposed, with various trade-offs between computation time and approximation quality. The DT in the literature belong to two categories: the Chamfer DT originally proposed by Borgefors [9] and the Vector DT proposed by Danielsson [10]. In this paper, we present a regiongrowing Chamfer DT where pixels are scanned by increasing value of the distance with respect to specific seed point(s), which in our current work are the centerline points in colon surface. The underlying assumption for Chamfer DT (CDT) is that the distance value for one certain pixel can be calculated from its neighbours’s distance value plus a mask constant. CDT are usually produced in 2 raster scans over the image, using half of the neighbour pixels as a mask for each scan. In our region growing approach, instead of using raster scans, pixels are actually considered by increasing distance values, i.e., growing the region of considered pixels outward from given seed points until every non-background pixel is included. It is implemented with a data structure called Queued List (QL), where the position of
Fig. 2. The mask used in Chamfer distance transformation, 0 indicates origin point, a and b refer to 2D and 3D case respectively
Colon Visualization Using Cylindrical Parameterization
611
to-be-processed pixels are stored. The QL is initialized with filling pixels for the given seed points, and all other to-be-processed pixels are filled with a Maximum Distance value. The algorithm processes one pixel from the beginning of QL at a time, and adds the pixel’s unprocessed neighbors to the back of QL. In each processing round, a pixel’s value is compared to its neighour’s plus corresponding mask (see Fig.2) constant, and if a smaller value is resulted, the pixel’s value will be updated and its unprocessed neighbors will be added to QL. For 2D images, the neighbor search can be either 4- or 8-, while in our specific case of 3D colon image, we will use 26-neighbors. The actual algorithm can be written as: Initialization:
Main:
4
QL is filled with all pixels for seed points All seed points are filled with 0 distance value All non-seed-point to-be-processed pixels are filled with Maximum Distance value while QL is not empty { get P from QL for each 26-neighbor n of P { d=dist(n)+mask(n,P) if d
Colon Visualization Pipeline
The data sets provided to us for testing come from AGFA NV. The data set shown in left of Fig.3 consists of 365 slices of 512 × 512 colon CT scan images. Using the segmentation and the colon centerline data from AGFA, we first extracted a triangulated mesh surface out of the segmented image. The surface extraction is done using the marching cube algorithm in the Visualization Toolkit environment. Afterwards, the triangles on both ends of the mesh surface were removed to ensure its open-ended cylindrical topology. We then used the method described in Section 2 to make a cut from one point of the boundary curve to the other boundary curve of the surface, and parameterized the open surface to a cylinder. Next, we computed the distance transformation as described in Section 3 for the whole data set as a volume image, using the centerline information. Note that all pixel information other than the segmented boundary and the centerline
Fig. 3. Left: the original colon image; Right: the distance-transformed colon image. Note that all information except the segmentation and centerline is filtered out.
Fig. 4. The colon mesh surface visualized with the distance transformation information, color scales from blue to red, indicating lowest and highest distance value
Fig. 5. The parameterized colon surface with the same color information as in previous figure. Left: The cylindrical parameterization; Right: The unfolded cylinder, i.e., the rectangle.
Fig. 6. The cut-open colon surface, with circumferential geometry preserved; color scale the same as in Fig. 4 and Fig. 5. Triangles at both ends were removed to ensure the open-ended cylindrical topology.
Fig. 7. The flattened surface rendered using normal mapping
is filtered out in order to ensure error-free distance transformation. The sample image from the result is shown in the right of Fig.3. We then regarded the distance-transformed volume image as a volume, and for each vertex on the mesh surface, we sampled the distance value on the corresponding voxel in the new volume and assigned it to the vertex as the color scalar information. Next, we visualized the mesh surface with the additional color information, which is useful in enhancing the visibility of certain feature with prominent size or height
Fig. 8. Visualization of the polyp: a. The part included in red square indicates the polyp region; b. The polyp as shown in the original colon surface; c. The zoom-in view of the polyp region in the flattened surface
different from its surroundings, such as polyps. The result of this coloring is shown in Fig.4. Since the original mesh surface can be 1-to-1 mapped to the cylindrical parameterization, it is then straightforward to reassign each color scalar to each vertex in the parameterization. Subsequently, we mapped the cylinder onto a plane, the mapping procedure is as follows: we sampled each circumferential line on the cylinder and made the sum of all the scalar values for each sampled vertex, i.e., line integral of the distance value, as the length of the mapped line on the plane. Adjacent circumferential lines are close enough so as to ensure the number of total sampled vertices are representative of that in the original mesh. This is done to preserve the circumferential geometry in the original colon surface. To further preserve the shape information, we use normal mapping to render the flattened surface. The result of the visualization is shown in Fig.5, Fig.6 and Fig.7. In Fig.8, we show the actual visualization for one polyp.
5 Conclusion
In this paper, we presented a visualization pipeline based on cylindrical parameterization of the colon surface extracted from a series of colon CT
images. The pipeline involves the cylindrical parameterization of the original mesh surface, the distance transformation of the colon volumetric image as well as the procedure we used to map the cylinder onto a plane along with color information, with circumferential geometry preserved.
Acknowledgement
This research was co-funded by the IBBT (Interdisciplinary Institute for BroadBand Technology), a research institute founded by the Flemish Government in 2004, and the involved companies (Agfa, Barco, Medicim, Namahn). Furthermore, the research was partially funded by the I.W.T. (Institute for Science and Technology - Flanders) and the F.W.O. (Fund for Scientific Research - Flanders, Belgium).
References 1. Daniel Johnson, C.A.H.D.: CT colonography: The next colon screening examination. Radiology 216, 331–341 (2000) 2. Paik, D., Beaulieu, C., Jeffrey, R., Karadi, C., Napel, S.: Visualization modes for CT colonography using cylindrical and planar map projections. Technical report, Department of Radiology, Standford University School of Mechicine, Standford, CA (1999) 3. Haker, S., Angenent, S., Tannenbaum, A., Kikinis, R.: Nondistorting flattening maps and the 3D visualization of colon CT images. IEEE Transactions on Biomedical Engineering 19, 665–671 (2000) 4. Dong, S., Kircher, S., Garland, M.: Harmonic functions for quadrilateral remeshing of arbitrary manifolds. Computer Aided Geometric Design 22, 392–423 (2005) 5. Raunch, J.: Partial Differential Equations. Springer, New York (1991) 6. Cuisenaire, O.: Distance transformations: Fast algorithms and applications to medical image processing. PhD thesis, Catholic University of Leuven (1999) 7. Cuisenaire, O., Thiran, J., Macq, B., Michel, C., De Volder, A., Marques, F.: Automatic registration of 3D MR images with a computerised brain atlas. SPIE Medical Imaging 1710, 438–449 (1996) 8. Qing, K., Means, R.: Novel approach for image skeleton and distance transformation parallel algorithms. SPIE Medical Imaging 2617, 737–742 (1994) 9. Borgefors, G.: Distance transformations in arbitrary dimensions. CVGIP 27, 321– 345 (1984) 10. Danielsson, P.: Euclidean distance mapping. CGIP 14, 227–248 (1980)
Particle Filter Based Automatic Reconstruction of a Patient-Specific Surface Model of a Proximal Femur from Calibrated X-Ray Images for Surgical Navigation
Guoyan Zheng and Xiao Dong
MEM Research Center - ISTB, University of Bern, Stauffacherstrasse 78, Switzerland
[email protected]
Abstract. In this paper, we present a particle filter based 2D/3D reconstruction scheme combining a parameterized multiple-component geometrical model and a point distribution model, and show its application to automatically reconstruct a surface model of a proximal femur from a limited number of calibrated X-ray images with no user intervention at all. The parameterized multiple-component geometrical model is regarded as a simplified description capturing the geometrical features of a proximal femur. Its parameters are optimally and automatically estimated from the input images using a particle filter based algorithm. The estimated geometrical parameters are then used to initialize a point distribution model based 2D/3D reconstruction scheme for an accurate reconstruction of a surface model of the proximal femur. We designed and conducted in vitro and in vivo experiments to compare the present automatic reconstruction scheme to a manually initialized one. An average mean reconstruction error of 1.2 mm was found when the manually initialized reconstruction scheme was used; it increased to 1.3 mm when the automatic one was used. However, the automatic reconstruction scheme has the advantage of eliminating user intervention, which holds the potential to facilitate the application of 2D/3D reconstruction in surgical navigation. Keywords: proximal femur, fluoroscopy, surface reconstruction, particle filter, multiple-component geometrical model, point distribution model.
1 Introduction
A patient-specific surface model of a proximal femur plays an important role in planning and supporting various computer-assisted surgical procedures including total hip replacement, hip resurfacing, and proximal femur osteotomy. Accordingly, various reconstruction methods have been developed. One of these methods is to extract a three-dimensional (3D) surface model from volume data pre-operatively acquired by Computed Tomography (CT) or Magnetic Resonance Imaging (MRI) and then intra-operatively to register
the extracted surface model to the patient anatomy. However, the high logistic effort and cost, the extra radiation involved with the CT imaging, and the large quantity of data to be acquired and processed make them less functional. The alternative is to reconstruct a patient-specific surface model from a limited number of intra-operatively acquired two-dimensional (2D) fluoroscopic images using a statistical model. Several research groups have explored the methods for reconstructing a patientspecific model from a statistical model and a limited number of calibrated X-ray images [1][2][3][4] [5]. Except the method presented in Yao and Taylor [1], which depends on a deformable 2D/3D registration between an appearance based statistical model and a limited number of X-ray images, all other methods have their reliance on a point distribution model (PDM) in common. The common disadvantage of all these PDM based reconstruction methods lies in the fact that they require either knowledge about anatomical landmarks [3][5], which are normally obtained by interactive reconstruction from the input images, or an interactive alignment of the model with the input images [2][4]. Such a supervised initialization is not appreciated in a surgical navigation application, largely due to the strict sterilization requirement. To eliminate the user intervention constraint, we propose in this paper a particle filter based 2D/3D reconstruction scheme combining a parameterized multiple-component geometrical model [6] and a point distribution model [7], and show its application to automatically reconstruct a surface model of the proximal femur with no user intervention at all. The parameterized multiplecomponent geometrical model is regarded as a simplified description capturing the geometrical features of a proximal femur. The constraints between different components are described by a causal Bayesian network. A particle filter based algorithm is applied to automatically estimate their parameters from the input X-ray images. The estimated geometrical parameters of the proximal femur are then used to initialize a point distribution model based 2D/3D reconstruction scheme for an accurate reconstruction of a surface model of the proximal femur. This paper is organized as follows. Section 2 briefly recalls the 2D/3D reconstruction scheme introduced in [5]. Section 3 describes the approach for automatic initialization. Section 4 reports the experimental results, followed by conclusions in Section 5.
2 2D/3D Reconstruction Scheme
2.1 Image Acquisition
In this work, we assume that the X-ray images are calibrated for their intrinsic parameters and that they are corrected for distortion. If multiple X-ray images are used, they are all registered to a common reference frame. Due to the limited imaging volume of a fluoroscope, we ask for four images of the proximal femur from different view directions, of which two images focus on the proximal femoral head and the other two focus on the femoral shaft. The calibrated fluoroscopic image set is represented by I. Although all four images are
used to estimate the parameters of the multiple-component geometrical model, only those two images that focus on the proximal femur are used for surface reconstruction.
2.2 Statistical Model of the Proximal Femur
The PDM used in this paper was constructed from a training database consisting of 30 proximal femoral surfaces from above the lesser trochanter. Let x_i, i = 0, 1, ..., m − 1, be the m members of the aligned training surfaces. Each member is described by a vector x_i with N vertices:
$$x_i = \{x_0, y_0, z_0, x_1, y_1, z_1, \ldots, x_{N-1}, y_{N-1}, z_{N-1}\} \qquad (1)$$
The PDM is obtained by applying principal component analysis:
$$D = \frac{1}{m-1} \cdot \sum_{i=0}^{m-1} (x_i - \bar{x}) \cdot (x_i - \bar{x})^T$$
$$\sigma_0 \ge \sigma_1 \ge \cdots \ge \sigma_{m_1-1} > 0; \quad m_1 \le m - 1$$
$$D \cdot p_i = \sigma_i^2 \cdot p_i; \quad i = 0, \cdots, m_1 - 1 \qquad (2)$$
where x̄ and D are the mean vector and the covariance matrix, respectively, {σᵢ²} are the non-zero eigenvalues of the covariance matrix D, and {pᵢ} are the corresponding eigenvectors. The sorted eigenvalues σᵢ² and the corresponding eigenvectors pᵢ are the principal directions spanning a shape space with x̄ representing its origin. Then, an instance M generated from the statistical model with parameter set Q = {s, α₀, α₁, ···, α_{m₁−1}} can be described as:
$$M: \; x(Q) = s \cdot \Big(\bar{x} + \sum_{i=0}^{m_1-1} \alpha_i \cdot p_i\Big) \qquad (3)$$
where s is the scaling factor and {αᵢ} are the weights calculated by projecting the vector (x/s − x̄) into the shape space. The mean surface model x̄ is shown in Fig. 1, left.
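A compact numpy sketch of Eqs. (1)-(3) is given below; it is an illustration only (for real meshes the 3N × 3N covariance is usually too large to form explicitly, and a dual formulation would be preferred).

```python
import numpy as np

def build_pdm(training_shapes):
    """training_shapes: (m, 3N) array of aligned surfaces (Eq. 1).
    Returns the mean shape, the non-zero eigenvalues and the corresponding modes."""
    m = training_shapes.shape[0]
    x_bar = training_shapes.mean(axis=0)
    D = (training_shapes - x_bar).T @ (training_shapes - x_bar) / (m - 1)   # Eq. (2)
    eigval, eigvec = np.linalg.eigh(D)
    order = np.argsort(eigval)[::-1]
    keep = eigval[order] > 1e-12                      # the m1 non-zero eigenvalues
    return x_bar, eigval[order][keep], eigvec[:, order][:, keep]

def instantiate(x_bar, modes, alphas, s=1.0):
    """Eq. (3): x(Q) = s * (x_bar + sum_i alpha_i * p_i)."""
    alphas = np.asarray(alphas)
    return s * (x_bar + modes[:, :len(alphas)] @ alphas)
```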
2.3 2D/3D Reconstruction Scheme
Our 2D/3D reconstruction scheme is a further improvement of the approach we introduced in [5], which combines statistical instantiation and regularized shape deformation with an iterative image-to-model correspondence establishing algorithm. The image-to-model correspondence is established using a non-rigid 2D point matching process, which iteratively uses a symmetric injective nearestneighbor mapping operator and 2D thin-plate splines based deformation to find a fraction of best matched 2D point pairs between features detected from the fluoroscopic images and those extracted from the 3D model using an approach described in [8]. The obtained 2D point pairs are then used to set up a set of 3D point pairs such that we turn a 2D-3D reconstruction problem to a 3D-3D one.
Fig. 1. The mean surface model of our point distribution model (left) and a schematic view of landmark reconstruction (right)
The 3D/3D reconstruction problem is then solved optimally in three sequential stages. The first stage, affine registration, iteratively estimates a scale and a rigid transformation between the mean surface model of the PDM and the input 3D points using a variant of the iterative closest point (ICP) algorithm [9]. The estimation results of the first stage are used to establish point correspondences for the second stage, statistical instantiation, which analytically instantiates a surface model from the PDM using a Mahalanobis prior based statistical approach [10]. This surface model is then fed to the third stage, kernel-based deformation. In this stage, we further refine the statistically instantiated surface model using an alternative derivation of the familiar interpolating thin-plate spline (TPS) [11] that enables weighting between the PDM instantiated surface model and the TPS interpolation. For details, we refer to our previous work [5].
2.4 Manual Initialization
The convergence of the 2D/3D reconstruction scheme introduced in [5] relies on a proper initialization of scale and pose of the mean surface model of the PDM. In our previous work [5], three anatomical landmarks, i.e., the center of the femoral head, a point on the axis of the femoral neck, and the apex of the greater trochanter were reconstructed interactively from the input fluoroscopic images, as shown in Fig. 1, right, and were used to compute the initial scale s0 and the initial rigid transformation T0 of the mean surface model of the PDM in relative to the input images.
3 The Proposed Approach
3.1 Proximal Femur Model
The proximal femur is approximated by a simplified geometrical model consisting of 3 components: head, neck and shaft, which are described by a sphere, a truncated cone and a cylinder with parameter set X_Femur = {X_Head, X_Neck, X_Shaft}, respectively, as shown in Fig. 2, left. These three components are constrained by
the anatomical structure of the proximal femur. The advantage of using such a model is apparent. On the one hand, this simplified 3D model has the capability to capture the global structure of the anatomy from the fluoroscopic images and does not depend on the view directions of the input images. On the other hand, using such a model to estimate the geometrical parameters of the proximal femur is much less computationally expensive than using a point distribution model, largely due to the simple, parameterized geometrical shape of its components.
– Head: the femoral head is modeled as a 3D sphere X_Head, parameterized by its centroid C_Head = [x_Head, y_Head, z_Head] and its radius R_Head.
– Neck: the femoral neck is modeled as a truncated cone X_Neck described by its centroid C_Neck = [x_Neck, y_Neck, z_Neck], mean radius R_Neck, aspect ratio of the cross section AP_Neck, length of its axis L_Neck, and direction of its axis A_Neck.
– Shaft: the femoral shaft is modeled as a 3D cylinder X_Shaft described by its centroid C_Shaft = [x_Shaft, y_Shaft, z_Shaft], radius R_Shaft, length of its axis L_Shaft, and direction of its axis A_Shaft.
The constraints among components are represented by a causal Bayesian network, as shown in Fig. 2, right, where all π(·)'s are prior distributions and all p(·)'s are conditional distributions. The prior distributions are designed according to the information estimated from the calibrated images and the prior information about the geometrical features of each component; e.g., the centroids of the three components are assumed uniformly distributed in the common view volume of the two fluoroscopic images around the proximal femur, which can be obtained by calculating the intersection of their projection frustums, while the radii, the lengths (for neck and shaft) and the neck aspect ratio are assumed to be uniformly distributed within their associated anatomical ranges. The structural constraints among components are set so that a component configuration that fulfills these constraints has a higher probability of being assembled into a proper proximal femur. These constraints are regarded as the conditional distributions of the components given the configuration of their parent components. For example, femoral head and neck are closely connected, which means that, given X_Head, the centroid of the femoral neck can be solely determined once L_Neck and A_Neck are instantiated. The reason why the network starts from the shaft component is that the shaft is much easier to detect from the images than the other two components, which accelerates the convergence of the model fitting algorithm described below.
3.2 Bayesian Formulation and Objective Function
A Bayesian inference scheme is employed to integrate the prior information about the proximal femur and the observation in the input images. The prior information about the proximal femur is the combination of the prior distributions of different components and the conditional distributions between them. The observation model is based on a similarity measure described in [12] for fitting
Fig. 2. The parameterized multiple-component geometrical model (left) and a causal Bayesian network for encoding the conditional distribution among components (right)
active shape models to the images. The resultant likelihood is then combined with the prior using Bayes' rule to obtain the a posterior probability density of the parameterized multiple-component geometrical model given the input images. A particle filter based algorithm is then implemented to estimate the parameters of the multiple-component geometrical model by maximizing this a posterior probability density.

Prior Distribution. The prior distribution of the parameterized multiple-component geometrical model of the proximal femur is the combination of the prior distributions of the different components and the conditional distributions between them, and has the following form:
$$p(X_{Femur}) = p(X_{Shaft}, X_{Neck}, X_{Head}) = \big((\pi(X_{Shaft}) \cdot p(X_{Neck}|X_{Shaft})) \cdot \pi(X_{Neck})\big) \cdot p(X_{Head}|X_{Neck}) \cdot \pi(X_{Head}) \qquad (4)$$

Likelihood. In this work, we use a combination of a likelihood derived from edge matching and a likelihood derived from intensity distribution matching.

Likelihood derived from edge matching: We use an energy function derived from edges to measure discrepancies between the projected extremal contours of the model, obtained by simulating the X-ray projection, and the image edges extracted from the fluoroscopic images by applying a Canny edge detector. Let Ξ(I, X_Femur) denote the extremal contours of the proximal femur model on one of the input images (I ∈ I) (see Fig. 3). The energy function is given by:
$$d_E^2(\Xi(I, X_{Femur}), E(I)) = \sum_{u \in \Xi(I, X_{Femur})} \Big[ \min_{v \in E(I)} \big(g^2(u, v)\big) \Big] \qquad (5)$$
Fig. 3. Edge likelihood computation based on the projected extremal contours of the proximal femur model (a combination of the projected extremal contours of the sub-components) and the edge distance map; the dots on the contours show the positions used to calculate the likelihood
where g²(u, v) = ‖u − v‖² is the metric by which errors in edge matches are measured at the sampling positions u, shown as dots along the projected extremal contours in Fig. 3. The likelihood associated with this discrepancy is defined as:
$$p_E(I|X_{Femur}) \propto \exp\big(-\lambda_E\, d_E^2(\Xi(I, X_{Femur}), E(I))\big) \qquad (6)$$
where \lambda_E is a control parameter.
Likelihood derived from intensity distribution matching: The matching between the intensity distribution of the projected proximal femur model and the fluoroscopic images can be treated as the local structure measurement defined in [12]. Denoting by \Theta(I, X_{Femur}) the projected silhouette of the model, the energy term derived from the local intensity distribution is given by:

d_G(\Theta(I, X_{Femur})) = \sum_{u \in \Theta(I, X_{Femur})} h(u, I)   (7)
where h(u, I) is the local structure measurement defined as:

h(u, I) = \| N(u, \Theta(I, X_{Femur})) - T(u, I) \|^2   (8)
where N(u, \Theta(I, X_{Femur}(k+1))) is the local intensity distribution of the projected model at position u (drawn as dots along the normal directions of the projected model contour in Fig. 4) and T(u, I) is the corresponding intensity distribution in the X-ray image. Fig. 5 shows the normalized distributions of N(u, \Theta(I, X_{Femur}(k+1))) and T(u, I) along the profile highlighted with a red ellipse in Fig. 4. The likelihood associated with this distance is defined as:

p_G(I|X_{Femur}) \propto \exp\big(-\lambda_G \, d_G(\Theta(I, X_{Femur}))\big)   (9)

where \lambda_G is a control parameter.
Fig. 4. Intensity likelihood computation based on the silhouette of projected proximal femur model where the black blocks show the silhouettes of the model and green lines show the positions used to calculate the likelihood
Finally, the overall likelihood is defined as:

p(\mathcal{I}|X_{Femur}) = \prod_{I \in \mathcal{I}} p_E(I|X_{Femur}) \, p_G(I|X_{Femur})   (10)
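For intuition only, the following is a rough sketch (not the authors' implementation) of how the two likelihood terms of Eqs. (5)–(10) could be evaluated: the Canny edge map is turned into a distance transform so that the inner minimum of Eq. (5) is a simple lookup. It assumes the projected contour points, the sampled intensity profiles and the control parameters are already available.

import numpy as np
from scipy.ndimage import distance_transform_edt

def edge_log_likelihood(edge_map, contour_pts, lam_e=1.0):
    # Distance from every pixel to the nearest Canny edge pixel:
    # min_v ||u - v|| can then be read off the distance map at each contour sample u.
    dist_to_edge = distance_transform_edt(~edge_map.astype(bool))
    d2 = np.sum(dist_to_edge[contour_pts[:, 1], contour_pts[:, 0]] ** 2)  # Eq. (5), pts given as (x, y)
    return -lam_e * d2                                                    # log of Eq. (6)

def intensity_log_likelihood(model_profiles, image_profiles, lam_g=1.0):
    # Eqs. (7)-(8): sum of squared differences between the projected model's local
    # intensity profiles N(u, .) and the observed profiles T(u, I).
    d_g = np.sum((model_profiles - image_profiles) ** 2)
    return -lam_g * d_g                                                   # log of Eq. (9)

def total_log_likelihood(per_image_terms):
    # Eq. (10): product over the calibrated images = sum of the log-terms.
    return sum(le + lg for (le, lg) in per_image_terms)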
Posterior Probability Density. Using Bayes' rule, the prior distribution of the parameterized multiple-component geometrical model and the likelihood of the input images are combined to obtain the posterior probability density of the morphed model given the input images:

p(X_{Femur}|\mathcal{I}) = p(\mathcal{I}|X_{Femur}) \cdot p(X_{Femur}) / p(\mathcal{I}) = \lambda \cdot \Big( \prod_{I \in \mathcal{I}} p_E(I|X_{Femur}) \cdot p_G(I|X_{Femur}) \Big) \cdot \pi(X_{Shaft}) \cdot \pi(X_{Neck}) \cdot \pi(X_{Head}) \cdot p(X_{Head}|X_{Neck}) \cdot p(X_{Neck}|X_{Shaft})   (11)

where \lambda is a normalization constant. Our objective is to maximize the posterior probability density in Eq. 11 with respect to the shape and pose parameters of the multiple-component geometrical model. In this work, we propose to solve this with a particle filter based algorithm.
3.3
Geometrical Model Fitting by Particle Filter
The particle filter, also known as the Condensation algorithm [13], is a robust filtering technique based on the Bayesian framework. It provides a suitable basic framework for estimating the parameters of a multiple-component geometrical model from images: the particle filter estimates the states by recursively updating sample approximations of the posterior distribution. In this work, we implement a particle filter based inference algorithm as follows.
1. Initialization: Generate the first generation of the particle set with M particles \{P_i^0 = X_{Femur,i}^0\}_{i=0,\dots,M-1} from the proposal distributions
Fig. 5. Normalized intensity distribution of N (u, Θ(I, XF emur (k + 1))) (left) and T (u, I) along the profile highlighted with a red circle in Fig. 4
q^0(X_{Shaft}) = \pi(X_{Shaft})
q^0(X_{Neck}) = \pi(X_{Neck}) \, q^0(X_{Shaft}) \, p(X_{Neck}|X_{Shaft})
q^0(X_{Head}) = \pi(X_{Head}) \, q^0(X_{Neck}) \, p(X_{Head}|X_{Neck})
2. Observation: Given the current generation of the particle set, calculate the weight of each particle as w_i^n \propto p(\mathcal{I}|X_{Femur,i}^n), where p(\mathcal{I}|X_{Femur,i}^n) is defined by Eq. 10.
3. Update: Update the proposal distributions as
q^{n+1}(X_{Shaft}) = \mathrm{NPDE}(w_i^n, X_{Shaft,i}^n)
q^{n+1}(X_{Neck}) = \pi(X_{Neck}) \, q^{n+1}(X_{Shaft}) \, p(X_{Neck}|X_{Shaft})
q^{n+1}(X_{Head}) = \pi(X_{Head}) \, q^{n+1}(X_{Neck}) \, p(X_{Head}|X_{Neck})
where \mathrm{NPDE}(w_i^n, X_{Shaft,i}^n) is a nonparametric density estimate [14]. Generate the next generation of the particle set from the updated proposal distributions.
4. Go to step 2 until the particle set converges (a code sketch of this loop is given below).
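The following is a minimal, generic sketch of such a sampling, weighting and update loop, not the authors' code: particles are drawn from the current proposals, weighted by the image likelihood of Eq. 10, and the shaft proposal is refreshed from the weighted samples by resampling with jitter, a simple stand-in for the nonparametric density estimate. The callables sample_prior, sample_conditional and log_likelihood are assumed to be supplied.

import numpy as np

def particle_filter_fit(sample_prior, sample_conditional, log_likelihood,
                        n_particles=200, n_iters=30, rng=None):
    rng = rng or np.random.default_rng(0)
    # Initial proposal: prior for the shaft; neck and head follow the causal network.
    shaft = np.array([sample_prior('shaft') for _ in range(n_particles)])
    best = None
    for _ in range(n_iters):
        neck = np.array([sample_conditional('neck', s) for s in shaft])
        head = np.array([sample_conditional('head', n) for n in neck])
        # Observation step: weight each particle by the image likelihood of Eq. (10).
        logw = np.array([log_likelihood(s, n, h) for s, n, h in zip(shaft, neck, head)])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        i_best = int(np.argmax(w))
        best = (shaft[i_best].copy(), neck[i_best].copy(), head[i_best].copy())
        if w.max() > 0.95:                 # crude convergence test
            break
        # Update step: resample the shaft parameters according to the weights and jitter them.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        shaft = shaft[idx] + rng.normal(scale=0.5, size=shaft.shape)
    return best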
3.4 Unsupervised Initialization of the PDM
From the mean surface model \bar{x} of the PDM, the model vertices can be classified into three regions: femoral head, neck, and shaft. The femoral head center and radius and the axes of the femoral neck and shaft can then be determined in the mean surface model coordinate space by fitting a 3D sphere to the femoral head region and cylinders to the femoral neck and shaft regions, as sketched below. The initial rigid transformation and scale can then be computed to fit the PDM (the scaled mean surface model) to the estimated geometrical model of the proximal femur.
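For illustration only, the head center and radius can be recovered from the head-region vertices with a standard algebraic least-squares sphere fit, and a cylinder axis can be approximated by the dominant principal direction of the region's vertices; this is a generic sketch assuming pts is an (N, 3) array of vertex coordinates, not the authors' implementation.

import numpy as np

def fit_sphere(pts):
    # Algebraic least-squares fit: ||p||^2 = 2 c.p + (r^2 - ||c||^2), solved for c and r.
    A = np.hstack([2.0 * pts, np.ones((pts.shape[0], 1))])
    b = np.sum(pts ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:3]
    radius = np.sqrt(sol[3] + center @ center)
    return center, radius

def fit_axis(pts):
    # Cylinder axis direction estimated as the first principal direction of the vertices.
    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return pts.mean(axis=0), vt[0]     # (a point on the axis, unit direction)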
4
Experimental Results
We designed and conducted two experiments to validate the present approach. The first experiment was conducted on 3 clinical datasets. Due to the lack of ground truth, we used the clinical datasets to verify the robustness of the particle filter based inference algorithm. We ran the algorithm for 10 trials on each dataset with particle number M = 200. In each trial the proximal femur was correctly identified; the statistical results are shown in Table 1. An example of unsupervised initialization and proximal femur contour extraction using the inference results is shown in Fig. 6.
Fig. 6. An example of unsupervised initialization (left) and proximal femur contour extraction (right)

Table 1. Statistical results of the particle filter based inference algorithm; all results are relative to the mean values of the 10 trials

Parameter                 | Data Set 1 | Data Set 2 | Data Set 3
Head Center (mm)          | 1.4±1.1    | 0.1±0.1    | 0.1±0.2
Head Radius (mm)          | 0.3±0.4    | 0.6±0.2    | 1.0±0.8
Neck Length (mm)          | 1.0±1.4    | 1.3±1.8    | 1.2±1.7
Neck Axis (degree)        | 0.8±0.7    | 2.3±1.0    | 1.8±1.1
Shaft Radius (mm)         | 0.2±0.3    | 0.1±0.2    | 0.2±0.2
Neck/Shaft Angle (degree) | 0.8±1.0    | 2.0±2.5    | 1.8±2.6
Table 2. The reconstruction errors when different initialization methods were used

Bone Index    | No. 1 | No. 2 | No. 3 | No. 4 | No. 5 | No. 6 | No. 7 | No. 8 | No. 9 | No. 10
Errors of manually initialized reconstruction
  Median (mm) | 1.7   | 1.3   | 0.8   | 0.9   | 1.3   | 1.0   | 0.9   | 0.8   | 0.8   | 1.1
  Mean (mm)   | 1.4   | 0.9   | 1.3   | 1.4   | 1.1   | 1.1   | 1.0   | 1.0   | 1.2   | 1.7
Errors of automatic reconstruction
  Median (mm) | 1.8   | 1.4   | 0.9   | 1.6   | 1.3   | 1.2   | 1.0   | 1.2   | 1.5   | 0.8
  Mean (mm)   | 1.6   | 0.9   | 1.5   | 1.2   | 1.2   | 1.2   | 1.1   | 1.5   | 1.1   | 1.9
The second experiment was performed on 10 dry cadaveric femurs with different shapes. The purpose was to evaluate the accuracy of the unsupervised 2D/3D reconstruction. For each bone, two studies were performed. In the first study, the 2D/3D reconstruction scheme was initialized using the interactively reconstructed landmarks described in Section 2, whereas in the second study the present algorithm was used to initialize the 2D/3D reconstruction scheme. To evaluate the reconstruction accuracy, 200 points were digitized from each bone surface. The distances from these points to the reconstructed surface of the associated bone were calculated and used to evaluate the reconstruction accuracy. The median and mean reconstruction errors for each study were recorded; the results are presented in Table 2. The automatic reconstruction was found to be slightly less accurate than the manually initialized one. An average mean reconstruction error
of 1.3 mm was found for the automatic reconstruction. It decreased to 1.2 mm when the manually initialized one was used.
5
Conclusions
In this paper, an automatic 2D/3D reconstruction scheme combining a parameterized multiple-component geometrical model with a point distribution model was presented. We solved the supervised initialization problem by using a particle filter based inference algorithm to automatically determine the geometrical parameters of the proximal femur from the calibrated fluoroscopic images; no user intervention is required any more. The qualitative and quantitative evaluation results on 3 clinical datasets and on a dataset of 10 dry cadaveric bones indicate the validity of the present approach. Although the automatic reconstruction is slightly less accurate than the manually initialized one, it eliminates user intervention, which holds the potential to facilitate the application of 2D/3D reconstruction in surgical navigation.
References
1. Yao, J., Taylor, R.H.: Assessing accuracy factors in deformable 2D/3D medical image registration using a statistical pelvis model. In: ICCV’03, vol. 2, pp. 1329–1334 (2003)
2. Fleute, M., Lavallée, S.: Nonrigid 3D/2D registration of images using a statistical model. In: Taylor, C., Colchester, A. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI’99. LNCS, vol. 1679, pp. 138–147. Springer, Heidelberg (1999)
3. Benameur, S., Mignotte, M., Parent, S., et al.: 3D/2D registration and segmentation of scoliotic vertebrae using statistical models. Comput. Med. Imag. Grap. 27, 321–337 (2003)
4. Benameur, S., Mignotte, M., Parent, S., et al.: A hierarchical statistical modeling approach for the unsupervised 3D biplanar reconstruction of the scoliotic spine. IEEE Trans. Biomed. Eng. 52, 2041–2057 (2005)
5. Zheng, G., Nolte, L.-P.: Surface reconstruction of bone from X-ray images and point distribution model incorporating a novel method for 2D-3D correspondence. In: CVPR’06, vol. 2, pp. 2237–2244 (2006)
6. Dong, X., Zheng, G.: A computational framework for automatic determination of morphological parameters of proximal femur from intraoperative fluoroscopic images. In: ICPR’06, vol. 1, pp. 1008–1013 (2006)
7. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models – their training and application. Comput. Vis. Image Underst. 61, 38–59 (1995)
8. Hertzmann, A., Zorin, D.: Illustrating smooth surfaces. In: SIGGRAPH’00, pp. 517–526 (2000)
9. Besl, P., McKay, N.D.: A method for registration of 3D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14, 239–256 (1992)
10. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: SIGGRAPH’99, pp. 187–194 (1999)
11. Bookstein, F.: Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell. 11, 567–585 (1989)
12. Cootes, T., Taylor, C.: Statistical models of appearance for computer vision. Technical report, University of Manchester, United Kingdom (2004)
13. Isard, M., Blake, A.: Contour tracking by stochastic propagation of conditional density. In: Buxton, B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1064, pp. 343–356. Springer, Heidelberg (1996)
14. Scott, D.W.: Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, Chichester (1992)
Joint Tracking and Segmentation of Objects Using Graph Cuts Aurélie Bugeau and Patrick Pérez IRISA / INRIA, Campus de Beaulieu, 35 042 Rennes Cedex, France {aurelie.bugeau,perez}@irisa.fr
Abstract. This paper presents a new method to both track and segment objects in videos. It includes predictions and observations inside an energy function that is minimized with graph cuts. The min-cut/max-flow algorithm provides a segmentation as the global minimum of the energy function, at a modest computational cost. Simultaneously, our algorithm associates the tracked objects to the observations during the tracking. It thus combines “detect-before-track” tracking algorithms and segmentation methods based on color/motion distributions and/or temporal consistency. Results on real sequences are presented in which the robustness to partial occlusions and to missing observations is shown.
1 Introduction
In a recent and thorough review of tracking techniques [20], tracking methods are divided into three categories: point tracking, silhouette tracking and kernel tracking. These three categories can be recast as "detect-before-track" tracking, dynamic segmentation and tracking based on distributions (color in particular). The principle of "detect-before-track" methods is to match the tracked objects with observations provided by an independent detection module. This tracking can be done using deterministic or probabilistic methods. Deterministic methods correspond to matching by minimizing a distance based on certain descriptors of the object. Probabilistic methods allow taking measurement uncertainties into account. They are often based on a state space model of the object properties. Dynamic segmentation corresponds to a succession of segmentations. These silhouette tracking methods usually evolve an initial contour to its new position in the current frame. This can be done using a state space model defined in terms of shape and motion parameters of the contour [9], [16] or by the minimization of a contour-based energy function. In the latter case, the energy function includes temporal information in the form of either the temporal gradient (optical flow) [1], [7], [13] or appearance statistics originating from the object and the background regions in previous images [15], [19]. In [18] the authors use graph cuts to minimize such an energy function. The advantages of min-cut/max-flow optimization are its low computational cost, the fact that it converges to a global minimum (as opposed to local methods that get stuck in local minima), and the fact that no a priori global shape model is needed.
The last group of methods is based on kernel tracking. The best location for a tracked object in the current frame is the one for which some feature distribution (e.g., color) is the closest to the reference one. The most used method in this class is the “mean shift” tracker [5], [6]. Graph cuts have also been used for illumination invariant kernel tracking in [8]. These three types of tracking techniques have different advantages and limitations, and can serve different purposes. "Detect-before-track" methods can deal with the entries of new objects and the exit of existing ones. They use external observations that, if they are of good quality, might allow robust tracking and possibly accurate segmentations. Silhouette tracking has the advantage of directly providing the segmentation of the tracked object. With the use of recent graph cuts techniques, convergence to the global minimum is obtained for modest computational cost. Finally kernel tracking methods, by capturing global color distribution of a tracked object, allow robust tracking at low cost in a wide range of color videos. In this paper, we address the problem of multiple objects tracking and segmentation by combining the advantages of the three classes of approaches. We suppose that, at each instant, the objects of interest are approximately known as the output of a preprocessing algorithm. Here, we use a simple background subtraction but more complex alternative techniques could be applied. These objects are the “observations” as in Bayesian filtering. At each time the extracted objects are propagated using their associated optical flow, which gives the predictions. Intensity and motion distributions are computed on the objects of previous frame. For each tracked object, an energy function is defined using the observations and these distributions, and minimized using graph cuts. The use of graph cuts directly gives the segmentation of the tracked object in the new frame. Our algorithm also deals with the introduction of new objects and their associated trackers. In section 2, an overview of the method and the notations is given. The graph and associated energy function are then defined in section 3. Experimental results are shown in section 4, where we demonstrate in particular the robustness of our technique in case of partial occlusions and missing observations. We conclude in section 5.
2 Principle and Notations
Before explaining the scheme of the algorithm, the notations and definitions must be introduced for the objects and the observations.
2.1 Notations
Throughout this paper, P denotes the set of N pixels of a frame from an input sequence of images. To each pixel s of the image at time t is associated a feature vector z_{s,t} = (z_{s,t}^{(C)}, z_{s,t}^{(M)}), where z_{s,t}^{(C)} is a 3-dimensional vector in RGB color space and z_{s,t}^{(M)} is a 2-dimensional optical flow vector. The optical flow is computed using the Lucas–Kanade algorithm [12] with an incremental multiscale implementation. We assume that, at time t, k_t objects are tracked. The i-th object at time t is denoted O_t^{(i)} and is defined as a set of pixels, O_t^{(i)} ⊂ P. The pixels of a frame not belonging to the object O_t^{(i)} belong to the "background" of this object.
The goal of this paper is to perform both segmentation and tracking to get the object O_t^{(i)} corresponding to the object O_{t-1}^{(i)} of the previous frame. Contrary to sequential segmentation techniques [10], [11], [14], we bring in object-level "observations". They may be of various kinds (e.g., boxes or masks obtained by a class-specific object detector, or static motion/color detectors). Here we consider that these observations come from a preprocessing step of background subtraction. Each observation amounts to a connected component of the foreground map after background subtraction (figure 1). The connected components are obtained using the "gap/mountain" method described in [17], ignoring small objects. For the first frame, the tracked objects are initialized as the observations themselves. We assume that, at each time t, there are m_t observations. The j-th observation at time t is denoted M_t^{(j)} and is defined as a set of pixels, M_t^{(j)} ⊂ P. Each observation is characterized by its mean feature:

\bar{z}_t^{(j)} = \frac{\sum_{s \in M_t^{(j)}} z_{s,t}}{|M_t^{(j)}|}   (1)
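As a purely illustrative sketch (not the authors' code), the observations and their mean features of Eq. (1) could be obtained from a binary foreground map as follows, assuming features is an (H, W, 5) array stacking the RGB values and the two flow components per pixel.

import numpy as np
from scipy import ndimage

def extract_observations(foreground, features, min_size=50):
    # Connected components of the foreground map -> list of (mask, mean feature).
    labels, n = ndimage.label(foreground)
    observations = []
    for j in range(1, n + 1):
        mask = labels == j
        if mask.sum() < min_size:              # ignore small objects
            continue
        z_mean = features[mask].mean(axis=0)   # Eq. (1): mean color + motion feature
        observations.append((mask, z_mean))
    return observations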
Fig. 1. Observations obtained with background subtraction and object isolation. (a) Reference frame. (b) Current frame (c) Result of background subtraction and derived object detection (two objects with red bounding boxes).
2.2 Principle of the Algorithm
The principle of our algorithm is as follows. A prediction O_{t|t-1}^{(i)} is made for each object i of time t−1. Once again, the prediction is a set of pixels, O_{t|t-1}^{(i)} ⊂ P. We denote by d_{t-1}^{(i)} the mean, over all pixels of the object at time t−1, of the optical flow vectors:

d_{t-1}^{(i)} = \frac{\sum_{s \in O_{t-1}^{(i)}} z_{s,t-1}^{(M)}}{|O_{t-1}^{(i)}|}   (2)

The prediction is obtained by translating each pixel belonging to O_{t-1}^{(i)} by this average optical flow:

O_{t|t-1}^{(i)} = \{ s + d_{t-1}^{(i)}, \ s \in O_{t-1}^{(i)} \}   (3)
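A minimal sketch of Eqs. (2)–(3), for illustration only: the object mask is shifted by the rounded mean flow of its pixels.

import numpy as np

def predict_mask(prev_mask, flow):
    # prev_mask: (H, W) bool; flow: (H, W, 2) optical flow (dx, dy) per pixel.
    d = flow[prev_mask].mean(axis=0)               # Eq. (2): mean flow over the object
    dx, dy = int(round(d[0])), int(round(d[1]))
    pred = np.zeros_like(prev_mask)
    ys, xs = np.nonzero(prev_mask)
    ys2, xs2 = ys + dy, xs + dx                    # Eq. (3): translate every pixel
    keep = (ys2 >= 0) & (ys2 < prev_mask.shape[0]) & (xs2 >= 0) & (xs2 < prev_mask.shape[1])
    pred[ys2[keep], xs2[keep]] = True
    return pred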
Using this prediction, the new observations, as well as the color and motion distributions of O_{t-1}^{(i)}, a graph and an associated energy function are built. The energy is minimized using the min-cut/max-flow algorithm [4], which gives the new segmented object at time t, O_t^{(i)}. The minimization also provides the correspondences of the object O_{t-1}^{(i)} with all the available observations. The sketch of our algorithm is presented in figure 2.
Fig. 2. Principle of the algorithm
3 Energy Function We define one tracker for each object. To each tracker corresponds, for each frame, one graph (figure 3) and one energy function that is minimized using the min-cut/max-flow algorithm [4]. Details of the approach are given in the following subsections.
Fig. 3. Description of the graph. The left figure is the result of the energy minimization at time t− 1. White nodes are labeled as object and black nodes as background. The optical flow vectors for the object are the dashed line arrows. The right figure shows the graph at time t. Two observations are available. Thick nodes correspond to the observations. See text for explanations and details on the edges.
3.1 Graph
The undirected graph G_t = (V_t, E_t) is defined as a set of nodes V_t and a set of edges E_t. The set of nodes is divided into two subsets. The first subset is the set of the N pixels of the image grid P. The second subset corresponds to the observations: to each observation M_t^{(j)} is associated a node n_t^{(j)}. The set of nodes thus reads V_t = P ∪ \bigcup_{j=1}^{m_t} \{n_t^{(j)}\}. The set of edges is divided into two subsets: E_t = E_P ∪ \bigcup_{j=1}^{m_t} E_{M_t^{(j)}}. The set E_P represents all unordered pairs {s, r} of neighboring elements of P (thin black edges on the right part of
figure 3), and E_{M_t^{(j)}} is the set of unordered pairs {s, n_t^{(j)}} with s ∈ M_t^{(j)} (thick black edges on the right part of figure 3).
Segmenting the object O_t^{(i)} amounts to assigning a label l_{s,t}^{(i)}, either background, "bg", or object, "fg", to each pixel node s of the graph. Associating observations to tracked objects amounts to assigning a binary label ("bg" or "fg") to each observation node. The set of all the node labels is L_t^{(i)}.
3.2 Energy
An energy function is defined for each object at each time. It is composed of unary data terms R_{s,t}^{(i)} and smoothness binary terms B_{s,r,t}^{(i)}:

E_t^{(i)}(L_t^{(i)}) = \sum_{s \in V_t} R_{s,t}^{(i)}(l_{s,t}^{(i)}) + \lambda \sum_{\{s,r\} \in E_t} B_{s,r,t}^{(i)} \, \big(1 - \delta(l_{s,t}^{(i)}, l_{r,t}^{(i)})\big)   (4)
Following [2], the parameter λ is set to 20.
Data term. The data term can be decomposed into two parts. While the first one corresponds to the prediction, the second corresponds to the observations. For all the other nodes, we do not want to give any a priori on whether the node is part of the object or the background (the labeling of these nodes will then be controlled by the influence of neighbors via the binary terms). The first part of the energy in (4) reads:

\sum_{s \in V_t} R_{s,t}^{(i)}(l_{s,t}^{(i)}) = \sum_{s \in O_{t|t-1}^{(i)}} -\ln\big(p_1^{(i)}(s, l_{s,t}^{(i)})\big) + \sum_{j=1}^{m_t} -\ln\big(p_2^{(i)}(n_t^{(j)}, l_{n_t^{(j)},t})\big)   (5)
The new object should be close in terms of motion and color to the object at the previous time. The color and motion distributions of the object and of the background are therefore defined at the previous time. The distribution p_{t-1}^{(i,C)} for color, respectively p_{t-1}^{(i,M)} for motion, is a Gaussian mixture model fitted to the set of values \{z_{s,t-1}^{(C)}\}_{s \in O_{t-1}^{(i)}}, respectively \{z_{s,t-1}^{(M)}\}_{s \in O_{t-1}^{(i)}}. Under an independence assumption for color and motion, the final distribution for the object is:

p_{t-1}^{(i)}(z_{s,t}) = p_{t-1}^{(i,C)}(z_{s,t}^{(C)}) \, p_{t-1}^{(i,M)}(z_{s,t}^{(M)})   (6)
The two distributions for the background are q_{t-1}^{(i,M)} and q_{t-1}^{(i,C)}. The first one is a Gaussian mixture model built on the set of values \{z_{s,t-1}^{(M)}\}_{s \in P \setminus O_{t-1}^{(i)}}. The second one is a uniform model over all color bins. The final distribution for the background is:

q_{t-1}^{(i)}(z_{s,t}) = q_{t-1}^{(i,C)}(z_{s,t}^{(C)}) \, q_{t-1}^{(i,M)}(z_{s,t}^{(M)})   (7)
The likelihood p_1, which is applied to the prediction nodes in the energy function, can now be defined as:

p_1^{(i)}(s, l) = \begin{cases} p_{t-1}^{(i)}(z_{s,t}) & \text{if } l = \text{"fg"} \\ q_{t-1}^{(i)}(z_{s,t}) & \text{if } l = \text{"bg"} \end{cases}   (8)
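As an illustration of Eqs. (6)–(8), and only as a sketch, the per-pixel likelihood p1 could be implemented with scikit-learn's GaussianMixture (an assumed, convenient substitute for whatever mixture fitting the authors used); obj_color, obj_motion and bg_motion are assumed to be (N, 3) and (N, 2) arrays of samples, and the uniform color model uses 256^3 bins.

import numpy as np
from sklearn.mixture import GaussianMixture

class ObjectModel:
    def __init__(self, obj_color, obj_motion, bg_motion, n_components=5):
        # Eq. (6): GMMs fitted on the object's color and motion samples at t-1.
        self.p_color = GaussianMixture(n_components).fit(obj_color)
        self.p_motion = GaussianMixture(n_components).fit(obj_motion)
        # Eq. (7): background = GMM on motion, uniform over color bins.
        self.q_motion = GaussianMixture(n_components).fit(bg_motion)
        self.q_color_log = -np.log(256.0 ** 3)

    def log_p1(self, z_color, z_motion, label):
        # Eq. (8): log-likelihood of one pixel feature under label "fg" or "bg".
        if label == "fg":
            return (self.p_color.score_samples(z_color[None])[0]
                    + self.p_motion.score_samples(z_motion[None])[0])
        return self.q_color_log + self.q_motion.score_samples(z_motion[None])[0]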
An observation should be used only if it corresponds to the tracked object. Therefore, we use the same distribution for p_2 as for p_1. However, we do not evaluate the likelihood of each pixel of the observation mask but only that of its mean feature \bar{z}_t^{(j)}. The likelihood p_2 for the observation node n_t^{(j)} is defined as:

p_2^{(i)}(n_t^{(j)}, l) = \begin{cases} p_{t-1}^{(i)}(\bar{z}_t^{(j)}) & \text{if } l = \text{"fg"} \\ q_{t-1}^{(i)}(\bar{z}_t^{(j)}) & \text{if } l = \text{"bg"} \end{cases}   (9)

Binary term. Following [3], the binary term between neighboring pairs of pixels {s, r} of P is based on color gradients and has the form:

B_{s,r,t}^{(i)} = \frac{1}{\mathrm{dist}(s, r)} \, e^{-\frac{\| z_{s,t}^{(C)} - z_{r,t}^{(C)} \|^2}{\sigma_T^2}}   (10)
As in [2], the parameter σ_T is set to

\sigma_T = 4 \cdot \big\langle (z_{s,t}^{(i,C)} - z_{r,t}^{(i,C)})^2 \big\rangle   (11)
where ⟨·⟩ denotes the expectation over a box surrounding the object. For the edges between the grid P and the observation nodes, the binary term is similar:

B_{s,n_t^{(j)},t}^{(i)} = e^{-\frac{\| z_{s,t}^{(C)} - \bar{z}_t^{(j,C)} \|^2}{\sigma_T^2}}   (12)
Energy minimization. The final labeling of the pixels is obtained by minimizing the energy defined above:

\hat{L}_t^{(i)} = \arg\min E_t^{(i)}(L_t^{(i)})   (13)

Finally, this labeling gives the segmentation of the object O_t^{(i)}, defined as:

O_t^{(i)} = \{ s \in P : \hat{l}_{s,t}^{(i)} = \text{"fg"} \}   (14)
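For concreteness, the following is a rough sketch of how such an energy could be minimized with a generic max-flow/min-cut library. It assumes the PyMaxflow package (an assumption, not a dependency of the paper), omits the observation nodes and their edges for brevity, and takes the unary and binary weights of Eqs. (5)–(12) as precomputed inputs; the source/sink-to-label convention is also an assumption of this sketch.

import numpy as np
import maxflow                                   # PyMaxflow (assumed available)

def segment_object(cost_fg, cost_bg, binary_w, lam=20.0):
    # cost_fg/cost_bg: (H, W) unary costs -ln p1(s,'fg') / -ln p1(s,'bg').
    # binary_w(s, r): contrast weight B of Eq. (10) for neighboring pixels s, r.
    H, W = cost_fg.shape
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((H, W))
    for y in range(H):
        for x in range(W):
            # Unary terms of Eq. (5): source side ~ "fg", sink side ~ "bg" in this sketch.
            g.add_tedge(nodes[y, x], cost_bg[y, x], cost_fg[y, x])
            # Pairwise terms of Eq. (4) for the right and bottom neighbors.
            for dy, dx in ((0, 1), (1, 0)):
                y2, x2 = y + dy, x + dx
                if y2 < H and x2 < W:
                    w = lam * binary_w((y, x), (y2, x2))
                    g.add_edge(nodes[y, x], nodes[y2, x2], w, w)
    g.maxflow()
    return ~g.get_grid_segments(nodes)           # True where the pixel is labeled "fg" under this convention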
3.3 Creation of New Objects
One advantage of our method comes from the nodes corresponding to the observations. It allows the use of observations to track and segment the objects at time t, as well as to establish the correspondence between an object currently tracked and all the candidate objects imperfectly detected in the current frame. If, after the energy minimization for an object i, a node n_t^{(j)} is labeled as "fg", it means that there is a correspondence between the object and the observation. If, for all the objects, an observation node is labeled as "bg" after minimizing the energies, then the corresponding observation does not match any object. In this case, a new object is created and initialized as this observation.
4 Experimental Results In this section results that validate the algorithm are presented. The sequences used are from the PETS 2001 data corpus (data set 1 camera 1 and dataset 3 camera 2), and the PETS 2006 data corpus (sequence 1 camera 4). The first tests are on relatively simple sequences. They are run on a subset of the PETS 2006 and on the PETS 2001, data set 3 sequence. Then the robustness to partial occlusions is shown on a subset of the PETS 2001, data set 1 sequence. Finally we present the handling of missing observations on a subset of the PETS 2006 sequence. For all the results except the first one, the frames have been cropped to show in more details the segmentation.
Fig. 4. Reference frames. (a) Reference frame for the PETS 2006 sequence. (b) Reference frame for the PETS 2001 sequence, dataset 1. (c) Reference frame for the PETS 2001 sequence, dataset 3.
4.1 Results with Observations at Each Time First results (figure 5) are on part of the PETS 2006 sequence with no particular changes. Observations are obtained by subtracting current frame with the reference frame (frame 10) shown on figure 4(a). In the first frame of test sequence, frame number 801, two objects are initialized using the observations. The chair on the left of the image is detected and always present in the tracking because a person was sited on it in the reference frame. Tracking this object is not a drawback as it could be an abandoned object. The person walking since the beginning is well tracked until it gets out of the image. A new object is then detected and a new tracker is initialized on it from frame 878. As one can see, even if the background subtraction and associated observations are not perfect, for example if part of the object is missing, our segmentation algorithm recovers the entire object. Second results are shown in figure 6. Observations are obtained by subtracting current frame with the reference frame (frame 2200) shown on figure 4(c). Two persons are tracked in this sequence in which the light is slowly changing. In addition to this gradual change, the left person moves from light to shade. Still, our algorithm tracks correctly both persons. 4.2 Results with Partial Occlusion Results showing the robustness to partial occlusions are shown in figure 7. Observations are obtained by subtracting current frame with the reference frame (frame 2700) shown
Fig. 5. Results on the PETS 2006 sequence for frames 801, 820, 860, 900 (a) Result of simple background subtraction and extracted observations (bounding boxes) (b) Masks of tracked and segmented objects (c) Tracked objects on current frame
on figure 4(b). Three objects are tracked in this sequence. The third one, with green overlay, corresponds to the car shadow and is visible in the last frame shown. Our method allows the tracking of the car as a whole even when it is partially occluded by a lamp post.
4.3 Results with Missing Observations
The last result (figure 8) illustrates the capacity of the method to handle missing observations thanks to the prediction mechanism. The same part of the PETS 2006 sequence as in figure 5 is used. In this test we performed the background subtraction only on one out of every three frames. In figure 8, we compare the obtained segmentation with that of figure 5, which is based on observations at each frame. Thanks to the prediction, the result is only partially altered by this drastic temporal subsampling of the observations. As one can
Fig. 6. Results with partial occlusions on the PETS 2001 sequence for frames 2260, 2328, 2358 and 2398 (a) Result of background subtraction and extracted observations (bounding boxes) (b) Masks of tracked and segmented objects (c) Tracked objects on current frame
see, even if one leg is missing in frames 805 and 806, it is recovered as soon as a new observation is available. Conversely, this result also shows that incorporating observations from a detection module yields better segmentations than using predictions alone.
5 Conclusion In this paper we have presented a new method to simultaneously segment and track objects. Predictions and observations composed of detected objects are introduced in an energy function which is minimized using graph cuts. The use of graph cuts permits the segmentation of the objects at a modest computational cost. A novelty is the use of observation nodes in the graph which gives better segmentations but also enables the association of the tracked objects to the observations. The algorithm is robust to partial occlusion, progressive illumination changes and to missing observations. The observations used in this paper are obtained by a very simple background subtraction
Fig. 7. Results with partial occlusions on the PETS 2001 sequence for frames 2481, 2496, 2511 and 2526 (a) Result of background subtraction and extracted observations (bounding boxes) (b) Masks of tracked and segmented objects (c) Tracked objects on current frame
based on a single reference frame. More complex background subtraction or object detection could be used as well with no change to the approach. As we use distributions of objects at previous time to minimize the energy, our method would fail in case of very abrupt illumination changes. However by adding an external detector of abrupt illumination changes, we could circumvent this problem by keeping only the prediction and update the reference frame when an abrupt change occurs. We are currently
Fig. 8. Results with observations only every 3 frames on the PETS 2006 sequence for frames 801 to 807 (a) Result of background subtraction and observations (b) Masks of tracked and segmented objects (c) Comparison with the masks obtained when there is no missing observations
investigating a way to handle complete occlusions. Another research direction lies in handling the fusion and split of several detection masks in more cluttered scenes.
References 1. Bertalmio, M., Sapiro, G., Randall, G.: Morphing active contours. IEEE Trans. Pattern Anal. Machine Intell. 22(7), 733–737 (2000) 2. Blake, A., Rother, C., Brown, M., Perez, P., Torr, P.H.S.: Interactive image segmentation using an adaptive gmmrf model. In: Proc. Europ. Conf. Computer Vision (2004) 3. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images. In: Proc. Int. Conf. Computer Vision (2001) 4. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Machine Intell. 23(11), 1222–1239 (2001) 5. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using meanshift. In: Proc. Conf. Comp. Vision Pattern Rec. (2000) 6. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based optical tracking. IEEE Trans. Pattern Anal. Machine Intell. 25(5), 564–577 (2003) 7. Cremers, D., Schnörr, C.: Statistical shape knowledge in variational motion segmentation. Image and Vision Computing 21(1), 77–86 (2003) 8. Freedman, D., Turek, M.W.: Illumination-invariant tracking via graph cuts. In: Proc. Conf. Comp. Vision Pattern Rec. (2005) 9. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. Int. J. Computer Vision 29(1), 5–28 (1998) 10. Juan, O., Boykov, Y.: Active graph cuts. In: Proc. Conf. Comp. Vision Pattern Rec. (2006) 11. Kohli, P., Torr, P.H.S.: Effciently solving dynamic markov random fields using graph cuts. In: Proc. Int. Conf. Computer Vision (2005) 12. Lucas, B.D., Kanade, T.: An iterative technique of image registration and its application to stereo. In: Proc. Int. Joint Conf. on Artificial Intelligence (1981) 13. Mansouri, A.: Region tracking via level set pdes without motion computation. IEEE Trans. Pattern Anal. Machine Intell. 24(7), 947–961 (2002) 14. Paragios, N., Deriche, R.: Geodesic active regions for motion estimation and tracking. In: Proc. Int. Conf. Computer Vision (1999) 15. Ronfard, R.: Region-based strategies for active contour models. Int. J. Computer Vision 13(2), 229–251 (1994) 16. Terzopoulos, D., Szeliski, R.: Tracking with Kalman snakes. In: Active Vision, pp. 3–20. MIT Press, Cambridge (1992) 17. Wang, Y., Doherty, J.F., Van Dyck, R.E.: Moving object tracking in video. In: Applied Imagery Pattern Recognition Annual Workshop (2000) 18. Xu, N., Ahuja, N.: Object contour tracking using graph cuts based active contours. In: Proc. Int. Conf. Image Processing (2002) 19. Yilmaz, A.: Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Trans. Pattern Anal. Machine Intell. 26(11), 1531–1536 (2004) 20. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Comput. Surv. 38(4), 13 (2006)
A New Fuzzy Motion and Detail Adaptive Video Filter
Tom Mélange1, Vladimir Zlokolica2, Stefan Schulte1, Valérie De Witte1, Mike Nachtegael1, Aleksandra Pizurica3, Etienne E. Kerre1, and Wilfried Philips3
1 Ghent University, Department of Applied Mathematics and Computer Science, Fuzziness and Uncertainty Modelling Research Unit, Krijgslaan 281 (Building S9), 9000 Gent, Belgium
2 MicronasNIT Institute, Fruskogorska 11, 21000 Novi Sad, Serbia & Montenegro
3 Ghent University, Dept. of Telecommunications and Information Processing (TELIN), IPI, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium
[email protected] http://www.fuzzy.ugent.be
Abstract. In this paper a new low-complexity algorithm for the denoising of video sequences is presented. The proposed fuzzy-rule based algorithm is first explained in the pixel domain and later extended to the wavelet domain. The method can be seen as a fuzzy variant of a recent multiple class video denoising method that automatically adapts to detail and motion. Experimental results show that the proposed algorithm efficiently removes Gaussian noise from digital greyscale image sequences. These results also show that our method outperforms other state-of-the-art filters of comparable complexity for different video sequences.
1
Introduction
These days, image sequences are widely used in several applications, such as broadcasting, tele-conferencing systems, surveillance systems, object tracking and so on. However, during the acquisition or transmission of these sequences, noise is often introduced, and noise reduction is therefore required. Apart from the visual improvement, this noise reduction is often also required as a preprocessing step to achieve better results in the further analysis of the video sequences or in video compression. In many video applications the noise is well approximated by the additive white Gaussian noise model, which we consider in this paper. Most current video denoising schemes use combined temporal and spatial filtering. Such spatio-temporal filters can be classified into separable [1,2,3,4] and non-separable [5,6,7] filters, based on whether the spatial and temporal filtering are performed in distinct steps or not. Another classification is the distinction between single-resolution [8,9] (pixel domain) and multiresolution [1,2,3,4] (e.g., wavelet domain) methods. Further, spatio-temporal filters can also be classified into motion-compensated [2,4,10] and non-motion-compensated [1,5]
filters, depending on whether they filter along an estimated motion trajectory or not. In this paper we present a new non-motion compensated spatio-temporal video denoising algorithm, making use of fuzzy set theory. Fuzzy set theory and fuzzy logic offer us a powerful tool for representing and processing human knowledge in the form of fuzzy if-then rules. Hard thresholds are replaced by a gradual transition, which is more appropriate for modelling complex systems. The method proposed in this paper can be seen as a fuzzy variant of the multiple class averaging filter from [5,6]. The main differences between the proposed method and the filter from [5,6] are: (i) pixels are not divided into discrete classes, but are treated individually, which leads to an increased performance; (ii) the use of linguistic variables in our method makes the filter more natural to work with and to understand in comparison to the artificial construction of exponential functions from the method of [5,6] and (iii) the fuzzy rules used in our method are easy to extend and to include new information in future work. Experimental results show that our method outperforms other state-of-the-art filters of a comparable complexity. The paper is structured as follows: In Section 2 we describe the algorithm in the pixel domain. In Section 3 the method is extended to the wavelet domain. Experimental results and conclusions are finally presented in Section 4 and Section 5 respectively.
2
Fuzzy Motion and Detail Adaptive Averaging in the Pixel Domain
In this section, we first explain the multiple class averaging filter from [5,6] in Subsection 2.1. Additionally this method is translated into a fuzzy logic framework in Subsection 2.2. 2.1
Multiple Class Averaging in the Pixel Domain [5,6]
We denote a noisy input image pixel as In (x, y, t) and the corresponding filtered pixel value as If (x, y, t). In this notation (x, y) indicates the spatial location and t stands for the temporal location. In [6] both a recursive and a non-recursive scheme are introduced. Because of the analogy between those two, a new notation Iv (x, y, t) is introduced to permit us to explain both at the same time. The index v can stand for both n (noisy) and f (filtered). In the recursive scheme, wherever it is possible, the already present filtered outputs of previous steps are used. This means that for pixels of the previous frame and for the already processed pixels of the current frame (pixels are filtered from top-left to bottom-right) the filtered outputs are used (v = f in these cases). For the remaining pixels, the noisy pixel values are used (i.e., v = n). In the non-recursive scheme, the noisy input pixels are used everywhere (and in every used formula the v must be replaced by n). In the method of [5,6] a 3 × 3 × 2 sliding window is used. This window consists of 3 × 3 pixels in the current frame and 3 × 3 pixels in the previous frame. We
will adopt the terms current window and previous window from [5,6] for the pixel values contained in, respectively, the current and the previous frame of the window. In the following we further denote the central pixel position of the filtering window (i.e., the pixel for which an output is computed in the current step) by (r, t), where r = (x, y) stands for the spatial position and t for the temporal position. The position of an arbitrary pixel (this may also be the central pixel position) in the 3 × 3 × 2 window is denoted by (r', t'), where r' = (x', y') and t' = t or t' = t − 1. The output of the multiple class averaging filter for the central pixel position (r, t) in the window is a weighted mean of the pixel values in the 3 × 3 × 2 window:

I_f(r, t) = \frac{\sum_{t'=t-1}^{t} \sum_{r'} W(r', t', r, t) \, I_v(r', t')}{\sum_{t'=t-1}^{t} \sum_{r'} W(r', t', r, t)}   (1)
where the weight W(r', t', r, t) for a particular pixel (r', t') in the window depends on the class index i(r', t', r, t), the amount of detail d(r, t) in the window, the amount of motion m(r, t) between the current and the previous window, and on whether (r', t') lies in the current (t' = t) or the previous (t' = t − 1) frame.
The class index i(r', t', r, t) depends on the absolute greyscale difference between the two pixel positions (r, t) and (r', t'), given by

\Delta(r', t', r, t) = |I_v(r', t') - I_n(r, t)|   (2)

and is defined as:

i(r', t', r, t) = \begin{cases} 0, & \Delta(r', t', r, t) \le k\sigma_n \\ 1, & k\sigma_n < \Delta(r', t', r, t) \le 2k\sigma_n \\ 2, & 2k\sigma_n < \Delta(r', t', r, t) \le 3k\sigma_n \\ 3, & 3k\sigma_n < \Delta(r', t', r, t) \end{cases}   (3)
where σ_n represents the estimated standard deviation of the Gaussian noise. For the optimized value of the parameter k we refer to [5]. The function d(r, t) equals the local standard deviation in the current window:

I_{av}(r, t) = \frac{1}{9} \sum_{r'} I_n(r', t)   (4)

d(r, t) = \Big[ \frac{1}{9} \sum_{r'} \big( I_n(r', t) - I_{av}(r, t) \big)^2 \Big]^{1/2}   (5)
Finally, m(r, t) is defined as the absolute difference between the average grey value in the current window and the average grey value in the previous window:

m(r, t) = \Big| \frac{1}{9} \sum_{r'} I_n(r', t) - \frac{1}{9} \sum_{r'} I_v(r', t-1) \Big|   (6)
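A minimal numpy sketch of the window statistics of Eqs. (4)–(6) and of the weighted average of Eq. (1) for one central pixel; weight_fn stands for whichever weight definition is plugged in (Eq. (7) or the fuzzy rules introduced below) and is an assumed callable.

import numpy as np

def filter_pixel(cur, prev, y, x, weight_fn):
    # cur, prev: greyscale frames (float); (y, x): central position, 1 <= y,x <= H-2/W-2.
    win_cur = cur[y-1:y+2, x-1:x+2]
    win_prev = prev[y-1:y+2, x-1:x+2]
    d = win_cur.std()                              # Eqs. (4)-(5): local detail measure
    m = abs(win_cur.mean() - win_prev.mean())      # Eq. (6): motion measure
    num = den = 0.0
    for t_off, win in ((0, win_cur), (-1, win_prev)):
        for val in win.ravel():
            delta = abs(val - cur[y, x])           # Eq. (2)
            w = weight_fn(delta, d, m, t_off)      # weight of this neighbor
            num += w * val
            den += w
    return num / den if den > 0 else cur[y, x]     # Eq. (1)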
The weights for the pixels in the window are defined in [5,6] as:
W(r', t', r, t) = \begin{cases} \exp\!\big(- i(r', t', r, t) / (\eta(d(r, t)) \, \sigma_n)\big) \, \beta(m(r, t), t'), & i = 0, 1, 2 \\ 0, & i = 3 \end{cases}   (7)

where the function

\eta(d) = K_1 \exp(-K_2 d) + K_3 \exp(-K_4 d)   (8)
is used to determine the slope of the exponential function in (7). For the optimized values of the parameters we again refer to [5]. The function β(m(r, t), t ) in (7) is chosen to limit the contribution (decreasing the weight) of the pixels from the previous window in case of motion:
\beta(m(r, t), t') = \begin{cases} 1, & t' = t \\ \exp(-\gamma \, m(r, t)), & t' = t - 1 \end{cases}   (9)

In this equation, the parameter γ is used to control the sensitivity of the motion detector. For the optimal value of γ we refer to [5].
The ideas behind this multiple class averaging filter [5] are the following: (i) when motion is detected between two successive frames, only pixels from the current frame should be taken into account to avoid temporal blur; (ii) when large spatial activity (many fine details) is detected in the current filtering window, one should filter less to preserve the details. As a consequence more noise is left, but large spatial activity corresponds to high spatial frequencies, and for these frequencies the eye is less sensitive [11]. In the opposite case, i.e., in homogeneous areas, strong filtering can be performed to remove as much noise as possible.
2.2
Fuzzy Motion and Detail Adaptive Averaging in the Pixel Domain
The general filtering scheme for the proposed fuzzy motion and detail adaptive method is given in Fig. 1. We adopt the above described idea from [5] as well as the filtering scheme with the 3×3×2 sliding window and the values Δ(r’, t , r, t), m(r, t) and d(r, t). Opposite to the multiple class averaging method, we no longer use the four discrete classes to express whether a given pixel value is similar to that of the central position in the filtering window. Instead we use one fuzzy set [12] ‘large difference’ for the values Δ(r’, t , r, t). A pixel Iv (r’, t ) has a greyscale value similar to that of the central pixel Iv (r, t) if the corresponding difference Δ(r’, t , r, t) is not large. Furthermore, we also used fuzzy sets to represent ‘large motion’ m(r, t) and ‘large detail’ d(r, t). A fuzzy set C in a universe U is characterized by a U → [0, 1] mapping μC , which associates with every element u in U a degree of membership μC (u) of u in the fuzzy set C. If a difference Δ(r’, t , r, t) for example has a membership one (zero) in the fuzzy set ‘large difference’, then this means that this difference
Fig. 1. The general filtering scheme
Δ(r', t', r, t) is large (not large) for sure. Membership degrees between zero and one indicate that we do not know for sure whether the difference is large or not. We also change the crucial step in the algorithm, namely the determination of the weights in (1). We replace the artificial construction of the exponential functions in [5,6] by a more appropriate fuzzy logic framework containing natural linguistic variables. The weight W(r', t', r, t) for the pixel position (r', t') is now defined as the membership degree in the fuzzy set ‘large weight’, which corresponds to the activation degree of the following fuzzy rules:
Fuzzy Rule 1. Defining the membership degree in the fuzzy set ‘large weight’ of the pixel value at position r' in the current frame (t' = t) of the window with central pixel position r:
IF (the variance d(r, t) is large AND the difference Δ(r', t', r, t) is not large) OR
( the variance d(r, t) is not large)
THEN the pixel value at position r’ has a large weight W (r’, t , r, t) in (1). Fuzzy Rule 2. Defining the membership degree in the fuzzy set ‘large weight’ of the pixel value at position r’ in the previous frame (t = t − 1) of the window with central pixel position r: IF (the variance d(r, t) is large AND the difference Δ(r’, t , r, t) is not large)
OR (the variance d(r, t) is not large) AND the motion value m(r, t) is not large THEN the pixel value at position r’ has a large weight W (r’, t , r, t) in (1). Fuzzy rules are linguistic IF-THEN constructions that have the general form “IF A THEN B”, where A and B are (collections of) propositions containing linguistic variables. A is called the premise or antecedent and B is the consequence of the rule. The linguistic variables in the above fuzzy rules are (i) large for the detail value d(r, t), (ii) large for the difference Δ(r’, t , r, t), (iii) large for the motion value m(r, t) and (iv) large for the weight W (r’, t , r, t). The membership
functions that are used to represent the three fuzzy sets of (i) large difference, (ii) large detail and (iii) large motion are denoted as μΔ, μd and μm respectively. For these membership functions, we use simple trapezoidal functions as shown in Fig. 2.
Fig. 2. (a) The membership function μd for the fuzzy set ‘large detail’, (b) The membership function μΔ for the fuzzy set ‘large difference’ and (c) The membership function μm for the fuzzy set ‘large motion’
In these figures, one observes five parameters that determine the form of the membership functions. To adapt the method to the noise level, the parameters have been related to the standard deviation of the noise σn . If the standard deviation is not known, as in most practical cases, it can be estimated for example by a noise estimation method for still images (like the median estimator proposed by Donoho and Johnstone [13]) applied to each frame separately or by a noise estimation method that also takes into account the temporal information contained in video sequences (like the method of Zlokolica [14]). Suitable values for the parameters were obtained experimentally by optimising their performance on several test sequences with several noise levels. We found thr1 = 1.36σn +1.2, T1 = 0.79σn +0.25, T2 = 5.24σn −15.35, t1 = 0.465σn −0.625 and t2 = 1.795σn + 3.275. Fuzzy Rules 1 and 2 contain AND and OR operators that are roughly equivalent to respectively intersections and unions of two fuzzy sets. Generally the intersection of two fuzzy sets A and B in a universe Y is specified by a mapping D leading to: μ(A∩B) (y) = D(μA (y), μB (y)), ∀y ∈ Y . Analogously, the union of A and B is specified by a mapping S leading to: μ(A∪B) (y) = S(μA (y), μB (y)),∀y ∈ Y . In fuzzy logic triangular norms and triangular conorms [15] are used for those mappings D and S, respectively. Two well-known triangular norms (together with their dual conorms) are the algebraic product (probabilistic sum) and the minimum (maximum). In this paper, we have chosen for the product and the probabilistic sum.
To represent the complement of a fuzzy set A in fuzzy logic, involutive negators [15] (roughly the equivalent of NOT operators) are used. We have used the well-known standard negator N(x) = 1 − x, ∀x ∈ [0, 1]. For the complement of a fuzzy set A in Y this gives: μ_{co(A)}(y) = N(μ_A(y)) = 1 − μ_A(y), ∀y ∈ Y. So, for example, Fuzzy Rule 1 has an activation degree (which corresponds to the membership degree in the fuzzy set ‘large weight’) α · (1 − β) + (1 − α) − α · (1 − β) · (1 − α), with α = μd(d(r)) and β = μΔ(Δ(r, r')).
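To make the rule evaluation concrete, here is an illustrative sketch (not the authors' code) of trapezoidal memberships and the activation degrees of Fuzzy Rules 1 and 2 with the product t-norm, the probabilistic sum and the standard negator; the trapezoid breakpoints below are placeholders and would in practice follow the σn-dependent parameters and the shapes in Fig. 2.

def trapezoid_large(x, lo, hi):
    # Membership in a 'large ...' fuzzy set: 0 below lo, 1 above hi, linear in between.
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def prob_sum(a, b):                 # OR: probabilistic sum
    return a + b - a * b

def fuzzy_weight(delta, d, m, t_off, params):
    # Activation degree of Fuzzy Rule 1 (t_off == 0) or Fuzzy Rule 2 (t_off == -1).
    mu_delta = trapezoid_large(delta, params['T1'], params['T2'])
    mu_d = trapezoid_large(d, 0.0, params['thr1'])
    mu_m = trapezoid_large(m, params['t1'], params['t2'])
    alpha, beta = mu_d, mu_delta
    # (large detail AND difference not large) OR (detail not large)
    w = prob_sum(alpha * (1.0 - beta), 1.0 - alpha)
    if t_off == -1:                 # Rule 2: additionally require 'motion not large'
        w *= (1.0 - mu_m)
    return w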
3
Fuzzy Motion and Detail Adaptive Averaging in the Wavelet Domain
In this section we extend our method to the wavelet domain. Analogously to [5,6] each processed frame is first decomposed using the 2D wavelet transform [16]. Then the wavelet coefficients are filtered adaptively to a spatio-temporal neighbourhood in the wavelet bands of the current and the previous decomposed frame. Finally, the inverse wavelet transform is applied. 3.1
Basic Notions
The wavelet transform provides us with a representation that is very useful for image denoising. Image details (like edges and texture) are compacted into large coefficients, while homogeneous regions result in small coefficients. A noisy input frame I(r, t) is decomposed into wavelet bands y_{s,d}(r, t) representing its bandpass content at resolution scale 2^s, direction d and spatial position r. We have used three orientation subbands, leading to three detail images at each scale, characterized by horizontal (d = LH), vertical (d = HL) and diagonal (d = HH) directions. Whenever there can be no confusion, we omit the indices s and d. We assume that the input sequence is contaminated with additive white Gaussian noise of zero mean with variance σ_n^2. Due to the linearity of the wavelet transform, the wavelet transformation of the noisy input yields an equivalent additive white noise model in each wavelet subband, y(r, t) = β(r, t) + ε(r, t), where β(r, t) are noise-free wavelet coefficients and ε(r, t) are independent identically distributed normal random variables, ε ∼ N(0, σ_n^2).
3.2
A Fuzzy Motion and Detail Adaptive Method in the Wavelet Domain
Analogously to [5,6], each of the wavelet bands and the low-frequency band are processed individually:
Filtering of the Low-frequency Band. For the filtering of the low-frequency band, we adapt the algorithm analogously to [5,6]. We still use the fuzzy set ‘large difference’, with the parameters T1 and T2 appropriately adapted to the low-frequency band: T1 = 2.8333σn − 6.433 and T2 = 2.8733σn + 4.9667. The motion value is still computed as the absolute difference between the average coefficient value in the current frame of the window and the average coefficient value in the previous frame of the window. The parameters for the membership function μm of the fuzzy set ‘large motion value’ for the low-frequency band are now experimentally determined as t1 = 3.22σn + 1.5667 and t2 = 36.7667σn + 16.5. For the low-frequency band no detail value d(r, t) is computed. The weights W(r', t', r, t) in (1) are now defined as the membership degrees in the fuzzy set ‘large weight’ based on the following fuzzy rules:
Fuzzy Rule 3. Defining the membership degree in the fuzzy set ‘large weight’ of the coefficient at position r' in the current low-frequency band (t' = t) of the window with central position r:
IF the difference Δ(r', t', r, t) is not large THEN the coefficient at position r' has a large weight W(r', t', r, t) in (1).
Fuzzy Rule 4. Defining the membership degree in the fuzzy set ‘large weight’ of the coefficient at position r' in the previous low-frequency band (t' = t − 1) of the window with central position r:
IF the difference Δ(r', t', r, t) is not large AND the motion value m(r, t) is not large THEN the coefficient at position r' has a large weight W(r', t', r, t) in (1).
Filtering of the Wavelet Bands. The changes for the wavelet bands compared to the pixel domain method are analogous to those in [5,6] (a small sketch of the band-wise processing is given below):
– d(r, t) is now defined as d(r, t) = \sum_{r'} y_{s,d}^2(r', t).
– We use only one motion value for all detail bands, namely the same motion value m(r, t) as computed for the low-frequency band.
– The parameters that define the membership functions μΔ, μd and μm in Fig. 2 are adapted to the specific detail band. The experimentally optimized parameters thr1 for the different detail bands are given in Table 1. The optimized values for the parameters T1 and T2 for the membership function μΔ differ between detail bands from the first and the second scale; they are also given in Table 1. The parameters for the membership function μm of the fuzzy set ‘large motion’ for the detail bands are the same as those for the low-frequency band, i.e., t1 = 3.22σn + 1.5667 and t2 = 36.7667σn + 16.5.
Fuzzy Rules 1 and 2 can still be used to determine the weights in (1) for the detail bands. The only difference is that we are now working with wavelet coefficients instead of pixel values.
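For illustration only (this is not the authors' implementation, and the non-decimated wavelet decomposition itself is assumed to be available from elsewhere), the band-wise detail measure and a band filtering step could look as follows; weight_fn is again an assumed callable implementing the fuzzy rules.

import numpy as np

def band_detail(band, y, x):
    # d(r, t) for a detail band: sum of squared coefficients in the 3x3 window.
    win = band[y-1:y+2, x-1:x+2]
    return float(np.sum(win ** 2))

def filter_band(cur_band, prev_band, motion, weight_fn):
    # Fuzzy weighted average of Eq. (1) applied to one detail band.
    # motion: (H, W) motion values taken from the low-frequency band.
    out = cur_band.copy()
    H, W = cur_band.shape
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            d = band_detail(cur_band, y, x)
            num = den = 0.0
            for t_off, band in ((0, cur_band), (-1, prev_band)):
                win = band[y-1:y+2, x-1:x+2]
                for val in win.ravel():
                    delta = abs(val - cur_band[y, x])
                    w = weight_fn(delta, d, motion[y, x], t_off)
                    num += w * val
                    den += w
            out[y, x] = num / den if den > 0 else cur_band[y, x]
    return out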
Table 1. Optimized parameters for the different detail bands

Band   | thr1
y1,LH  | 5.5733σn − 14.2667
y1,HL  | 5.5733σn − 14.2667
y1,HH  | 46.6267σn − 243.0667
y2,LH  | 2.7533σn − 1.3
y2,HL  | 2.7533σn − 1.3
y2,HH  | 8.8267σn − 26.9333

Scale  | T1                | T2
s = 1  | 0.8867σn − 1.9667 | 2.94σn + 2.9
s = 2  | 2.7067σn − 8.2667 | 2.8867σn + 0.8333

4
Experimental Results
Our algorithm has been implemented with a non-decimated wavelet transform (which is known to give better denoising results than the decimated one) with the Haar-wavelet. We have used only two levels in the decomposition, which is sufficient for relatively low noise levels that are realistic in the assumed video applications. In our experiments, we have processed 6 different sequences (“Salesman”, “Tennis”, “Deadline”, “Trevor”, “Flower garden” and “Miss America”) with added Gaussian noise (σn = 5, 10, 15, 20). We first compare our pixel domain methods to other state-of-the-art pixel domain methods in Subsection 4.1 and then do the comparison for the wavelet domain methods in Subsection 4.2. 4.1
Pixel Domain
We have compared the non-recursive (FMDAF) and the recursive (RFMDAF) scheme of our fuzzy motion and detail adaptive filter in the pixel domain to the following well-known pixel domain filters: the Rational filter (Rational) from [8], the 3D-KNN filter (KNN) from [17] as an extension of the 2D-KNN filter from [18,19], the motion and detail adaptive KNN filter (MDA-KNN) from [9], the threshold averaging filter (THR) from [20] and the recursive multiple class averaging filter (RMCA) from [5,6]. In Fig. 3 the PSNR results for the “Salesman” sequence and for two noise levels (σn = 10 and σn = 20) is given for the above mentioned pixel domain methods. From this figure we see that in terms of PSNR the FMDAF and RFMDAF filters perform better than the other pixel domain methods. For the “Salesman” and “Deadline” sequences, we further see that the MDA-KNN filter gives comparable results for low noise values (σn = 10). Finally, for the “Flower garden” sequence we find comparable results for the RMCA and the THR filters. 4.2
Wavelet Domain
Our recursive (WRFMDAF) and non-recursive (WFDMAF) wavelet domain methods have been compared to the following methods: the adaptive spatiotemporal filter (ASTF) from [4], the 3DWF filter from [7,21], the SEQWT filter
Fig. 3. Performance comparison for the pixel domain methods applied to the “Salesman” sequence with added Gaussian noise, (left) σn = 10, (right) σn = 20
Fig. 4. Performance comparison for the wavelet domain methods applied to the “Salesman” sequence with added Gaussian noise: (left) σn = 10, (right) σn = 20
from [1] and the recursive multiple class averaging filter in the wavelet domain (WRMCA) from [5,6]. In terms of PSNR the proposed wavelet based recursive WRFMDAF performs clearly better than the ASTF method. The WRFMDAF also performs slightly better than the WRMCA filter. Furthermore, the results for the proposed filter are similar to those of the more complex SEQWT filter, both visually as in terms of PSNR. Nevertheless, the filter is outperformed in terms of PSNR by the sophisticated motion-compensated filter WRSTF and the complex 3D wavelet transform method 3DWF. However, in the case of fast motion or high noise, the 3DWF filter tends to introduce some spatio-temporal blur. The PSNR results
for the processed “Salesman” sequence are given in Fig. 4. We can conclude that the proposed method outperforms the other multiresolution filters of a similar complexity.
5 Conclusion
In this paper we have presented a new low-complexity fuzzy motion and detail adaptive filter for the reduction of additive white Gaussian noise in digital video sequences. The proposed algorithm has first been explained in the pixel domain and has later been extended to the wavelet domain. Experimental results show that the pixel domain method outperforms other state-of-the-art pixel domain filters, and that the wavelet domain method likewise outperforms other state-of-the-art wavelet domain filters of a comparable complexity. As future work we will try to extend our approach towards the denoising of colour video sequences and also try to find a framework for the denoising of video sequences contaminated with impulsive noise.
Acknowledgement. This research was financially supported by the FWO project G.0667.06 of Ghent University. The authors would like to thank Prof. Selesnick from the Polytechnic University, New York, for providing them with the processed video sequences for the 3DWF algorithm through TELIN of Ghent University, which have been used for the comparison. They would also like to give special thanks to Dr. A.M. Tourapis for providing them with the sequences processed by the ASTF algorithm, also through TELIN of Ghent University. A. Pizurica is a postdoctoral research fellow of FWO, Flanders.
References
1. Pizurica, A., Zlokolica, V., Philips, W.: Noise reduction in video sequences using wavelet-domain and temporal filtering. In: Proc. SPIE Conf. Wavelet Applicat. Industrial Process., Providence, RI, pp. 48–59 (2003)
2. Zlokolica, V., Pizurica, A., Philips, W.: Wavelet-domain video denoising based on reliability measures. IEEE Transactions on Circuits and Systems for Video Technology 16(8), 993–1007 (2006)
3. Balster, E.J., Zheng, Y.F., Ewing, R.L.: Combined spatial and temporal domain wavelet shrinkage algorithm for video denoising. IEEE Trans. on Circuits and Systems for Video Technology 16(2), 220–230 (2006)
4. Cheong, H., Tourapis, A., Llach, J., Boyce, J.: Adaptive spatio-temporal filtering for video de-noising. In: IEEE International Conference on Image Processing, pp. 965–968. IEEE Computer Society Press, Singapore (2004)
5. Zlokolica, V., Pizurica, A., Philips, W.: Video denoising using multiple class averaging with multiresolution. In: García, N., Salgado, L., Martínez, J.M. (eds.) VLBV 2003. LNCS, vol. 2849, pp. 172–179. Springer, Heidelberg (2003)
6. Zlokolica, V.: Advanced nonlinear methods for video denoising. PhD thesis, ch. 5, Ghent University, Ghent, Belgium (2006)
7. Sendur, L., Selesnick, I.W.: Bivariate shrinkage functions for wavelet based denoising exploiting interscale dependency. IEEE Trans. Signal Process. 50(11), 2744–2756 (2002)
8. Cocchia, F., Carrato, S., Ramponi, G.: Design and real-time implementation of a 3-D rational filter for edge preserving smoothing. IEEE Trans. on Consumer Electronics 43(4), 1291–1300 (1997)
9. Zlokolica, V., Philips, W.: Motion-detail adaptive k-NN filter video denoising. Report (2002), http://telin.ugent.be/~vzlokoli/Report2002vz.pdf
10. Jovanov, L., Pizurica, A., Zlokolica, V., Schulte, S., Kerre, E.E., Philips, W.: Combined wavelet domain and temporal filtering compliant with video codec. In: IEEE Internat. Conf. on Acoust. Speech and Signal Process. ICASSP'07, Honolulu, Hawaii, USA. IEEE Computer Society Press, Los Alamitos (2007, accepted)
11. Bellers, E.B., De Haan, G.: De-interlacing: A Key Technology for Scan Rate Conversion. Elsevier Science B.V., Sara Burgerhartstraat, Amsterdam (2000)
12. Zadeh, L.A.: Fuzzy Sets. Information and Control 8(5), 338–353 (1965)
13. Donoho, D., Johnstone, I.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–455 (1994)
14. Zlokolica, V., Pizurica, A., Philips, W.: Wavelet domain noise-robust motion estimation and noise estimation for video denoising. In: First International Workshop on Video Processing and Quality Metrics for Consumer Electronics, Scottsdale, AZ, USA (2005)
15. Weber, S.: A general concept of fuzzy connectives, negations and implications based on t-norms and t-conorms. Fuzzy Sets and Systems 11(2), 115–134 (1983)
16. Mallat, S.: A Wavelet Tour of Signal Processing, 2nd edn. Academic Press, Oval Road, London (1999)
17. Zlokolica, V., Philips, W., Van De Ville, D.: A new non-linear filter for video processing. In: IEEE Benelux Signal Processing Symposium, pp. 221–224 (March 2002)
18. Davis, L., Rosenfeld, A.: Noise cleaning by iterated local averaging. IEEE Trans. on Syst. Man Cybernet. 8, 705–710 (1978)
19. Mitchell, H., Mashkit, N.: Noise smoothing by a fast k-nearest neighbor algorithm. Signal Processing: Image Communication 4, 227–232 (1992)
20. Lee, K., Lee, Y.: Threshold Boolean filters. IEEE Trans. on Signal Processing 42(8), 2022–2036 (1994)
21. Selesnick, I.W., Li, K.Y.: Video denoising using 2D and 3D dual-tree complex wavelet transforms. In: Proc. SPIE Wavelet Applicat. Signal Image Process., San Diego, CA, pp. 607–618 (August 2003)
Bridging the Gap: Transcoding from Single-Layer H.264/AVC to Scalable SVC Video Streams Jan De Cock, Stijn Notebaert, Peter Lambert, and Rik Van de Walle Ghent University - IBBT Department of Electronics and Information Systems - Multimedia Lab Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium
Abstract. Video scalability plays an increasingly important role in the disclosure of digital video content. Currently, the scalable extension of the H.264/AVC video coding standard (SVC) is being finalized, which provides scalability layers for state-of-the-art H.264/AVC video streams. Existing video content that is coded using single-layer H.264/AVC, however, cannot benefit from the newly developed scalability features. Here, we discuss our architecture for H.264/AVC-to-SVC transcoding, which is able to derive SNR scalability layers from existing H.264/AVC bitstreams. Results show that the rate-distortion performance of our architecture approaches the optimal decoder-encoder cascade within 1 to 2 dB. Timing results indicate that intelligent conversion techniques are required, and that transcoding can significantly reduce the required computation time.
1 Introduction
Considering the proliferation of different devices with varying capabilities and the heterogeneous nature of the networks that are used to deliver video content, scalability is an important feature for compressed video sequences. Currently, the Joint Video Team of the MPEG and VCEG groups is working towards the standardization of the Scalable Extension of the H.264/AVC video coding standard (SVC). SVC makes it possible to encode scalable video bitstreams containing several dependency, spatial, and temporal layers. By parsing and extracting, lower layers can easily be obtained, providing different types of scalability. A disadvantage of the paradigm used for scalability in SVC is that scalability has to be provided at the encoder side by introducing different layers during encoding. This also implies that already encoded H.264/AVC-coded bitstreams cannot benefit from the scalability tools in SVC due to the lack of intrinsic scalability provided in the bitstream. Over the last few years, a considerable technical and financial effort has been spent on the migration from MPEG-2 Video to H.264/AVC. Already, many initiatives are supporting single-layer H.264/AVC video coding, as was standardized in 2003. A new migration from non-scalable to scalable video coding would imply, yet again, the acquisition of new video encoding equipment, or the decoding
of the existing bitstreams, followed by a reencoding effort in the new scalable video coding format. Taking into account the high cost of the equipment, the investment required by the former might not be justifiable. The latter solution is more cost-effective, if the conversion can be performed in an efficient way. Since the encoding of SVC bitstreams is a highly computationally intensive process, a full decoding and reencoding operation is not practically feasible. Transcoding is a popular technique for fast adaptation of video content, allowing scalability without fully decoding and reencoding. In the past, different transcoding solutions have been presented, with architectures that provided such features as SNR, spatial, and temporal scalability [1]. It was shown that transcoding can be used to convert streams between different video coding standards, e.g., from MPEG-2 Video to H.264/AVC [2], or from H.263 to H.264/AVC [3]. A technique of transcoding from SVC with quality layers to single-layer H.264/ AVC is provided in the SVC specification, and is called bitstream rewriting. This technique allows the use of existing decoding equipment for playback of SVC streams that were coded using multiple quality layers. In this paper, we discuss an architecture that is able to transcode single-layer H.264/AVC bitstreams to SVC streams with different quality layers, i.e., the exact opposite of bitstream rewriting. We will show that transcoding from H.264/AVC to SVC imposes a number of challenges that are not present in the SVC-to-H.264/AVC direction, and provide an architecture that overcomes drift. Our transcoding solution allows the reuse of existing H.264/AVC streams and encoders, and provides a fast and efficient way of creating SNR scalable SVC streams. After transcoding, each of the available enhancement layers provides a refinement of the residual data in the base layer, by using a decreasing quantization step size. In order to be able to construct the enhancement layers from single-layer H.264/AVC streams, we base our architecture on techniques we developed for requantization transcoding. In [4,5,6], we have shown that the increased number of dependencies in H.264/AVC bitstreams imposes a number of non-negligible issues for requantization transcoding. When compared to previous video coding standards, such as MPEG-2 Video, the increased coding efficiency of new coding tools introduces the need for H.264/AVC-tailored transcoding solutions. In particular, attention has to be paid to the requantization transcoding and compensation of intra-coded macroblocks in P and B pictures in order to obtain acceptable video quality at the decoder [6]. Here, we extend these techniques to provide an architecture for H.264/AVCto-SVC transcoding. Starting from H.264/AVC bitstreams with hierarchically B-coded pictures, we discuss an architecture that is able to transcode these streams to multi-layer SVC bitstreams for combined temporal and SNR scalability. In this way, existing, previously encoded H.264/AVC bitstreams can be efficiently converted into bitstreams with inherent scalability layers that can be easily extracted at a later moment or in a further stage in the distribution chain. The remainder of this paper is organized as follows. In Sect. 2, we describe the SNR scalability techniques in SVC. In Sect. 3, we briefly discuss SVC-to-H.264/
AVC bitstream rewriting. In Sect. 4, we lay out our architecture for H.264/AVC-to-SVC transcoding. In Sect. 5, we show implementation results.
2 SNR Scalability in SVC
Different techniques exist in the Joint Scalable Video Model (JSVM) [7] for providing SNR scalability.
2.1 Coarse-Grain Scalability
Coarse-Grain Scalability (CGS), similarly to spatial scalability, uses different dependency layers with refinements. In the case of CGS, the difference is that no upsampling is required between successive enhancement layers. In every layer, quality refinements of the transform coefficients are stored by using a decreasing quantization step size. SVC supports up to eight CGS layers, corresponding to eight quality extraction points. Between successive refinement layers, inter-layer prediction is possible for both the motion information and the residual data. Also, an inter-layer intra prediction tool was provided to further improve coding efficiency of intra-coded macroblocks.
2.2 Fine-Grain Scalability
Fine-Grain Scalability (FGS) uses an advanced form of bitplane coding for encoding successive refinements of transform coefficients. The FGS slices have the property that they can be truncated at any byte-aligned position for SNR scalability [8]. FGS SNR scalability has the advantage that it provides a larger degree of flexibility, allowing a quasi-continuous spectrum of achievable bitrates, while CGS is limited to a number of pre-determined bitrates, i.e., one extraction point per layer. Due to its high computational complexity, however, the FGS concept was not included in any of the recently defined SVC profiles. As a consequence, it was removed from the Joint Draft. After further study and complexity reduction, FGS might be included in a future amendment to the current SVC specification.
2.3 Medium-Grain Scalability
As an alternative to FGS, Medium-Grain Scalability (MGS) was introduced. MGS tackles a number of problems that are encountered for CGS, such as the limited number of rate points and the lack of flexibility for bitstream adaptation. MGS increases the number of achievable rate points by allowing different quality levels within one dependency layer. The flexibility is improved by allowing the removal of these quality levels at any point in the bitstream. Switching between dependency layers (as is required for CGS) is only allowed at certain pre-defined points. In the current Joint Draft [9], 16 quality refinement levels are allowed for every dependency layer. In conjunction with CGS, this means that 128 quality extraction points are now achievable for SVC bitstreams.
2.4 Drift Control
Both for FGS and MGS, attention has to be paid to drift control. Since residual information can be dropped at any point in the bitstream, reconstructed reference frames can differ between the encoder and decoder. Different mechanisms have been used in the past, each leading to a different trade-off between coding efficiency and drift. In MPEG-2 Video, the enhancement layer with the highest available quality was used as a reference. Here, loss of information in the enhancement layer resulted in drift. In MPEG-4 Visual, on the other hand, only the base layer was used as a reference for further prediction. This led to the complete elimination of drift. Compression efficiency, however, was significantly reduced when compared to single-layer coding. For FGS and MGS scalability in SVC, a different technique is used, by introducing the concept of key pictures. In the SVC bitstreams, key pictures function as synchronization points. By only using the base layer of a key picture as a reference for prediction of the next key picture, no drift will be allowed in these pictures. For prediction of pictures in between successive key pictures, the highest available enhancement layer is used for prediction. In this way, drift is contained within the GOP boundaries (see Fig. 1).
Fig. 1. Drift control in MPEG-2 Video, MPEG-4 Visual, and SVC, respectively
3 SVC-to-H.264/AVC Bitstream Rewriting
In order to make a clear distinction between the challenges in the normative SVC-to-H.264/AVC bitstream rewriting process, which is part of the current Joint Draft, and our H.264/AVC-to-SVC transcoding solution, we briefly discuss SVC-to-H.264/AVC bitstream rewriting here. Although the base layer of an SVC bitstream is required to be decodable by a standard H.264/AVC decoder, any other SVC layer will not be recognized, and will be discarded by the decoder as an unknown Network Abstraction Layer (NAL) unit type. SVC-to-H.264/AVC bitstream rewriting was proposed in [10]. The concept was introduced to allow the conversion of a stream with multiple CGS layers into one H.264/AVC-compliant stream by combining the residual data. The operation can be carried out at a network node, hereby eliminating unnecessary overhead of the SVC bitstream in the remainder of the network. Bitstream rewriting also
allows the display of scaled and rewritten SVC bitstreams on H.264/AVC base layer-only devices. In order to allow this functionality without quality loss, a number of changes were made to the SVC specification. Among others, lossless rewriting required that scaling and combining residual data be possible in the transform domain, i.e., without requiring an inverse transform. Another change involved imposing constraints on the transform sizes that are used in the base and enhancement layers. The transform sizes of the co-located macroblocks in the base and enhancement layers need to be identical. Since the bitstream rewriting syntax and functionality were not yet finalized at the time of writing, we further discuss results for transcoding from H.264/AVC to SVC with the bitstream rewriting functionality disabled. This also allows us to create SVC bitstreams that are not bound to the restrictions that are required for bitstream rewriting.
4 H.264/AVC-to-SVC Transcoding
4.1 Open-Loop Transcoding Architecture
The most straightforward way of transcoding from H.264/AVC to SVC is by splitting the residual data into several layers. This can be achieved as follows, as is demonstrated in Fig. 2 for two dependency layers. Firstly, the incoming residual coefficients (with quantization parameter Q1) are dequantized, resulting in values oi, with i ranging from 1 to n, and n being the number of coefficients in one transform block. Next, these values are requantized using a coarser quantization parameter Q2 to obtain the coefficients for the base layer of the outgoing SVC bitstream. These coefficients are again dequantized and subtracted from the values oi. The result is again quantized, using the quantization parameter Q1 of the original bitstream, to obtain the coefficients for the highest quality enhancement layer.
Fig. 2. Open-loop transcoding architecture
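As an illustration only (not taken from the paper), the coefficient splitting described above can be sketched with plain uniform quantizers standing in for the exact H.264/AVC quantizer design; q1 and q2 denote step sizes corresponding to Q1 and Q2:

    import numpy as np

    def split_open_loop(o, q1, q2):
        # o: dequantized residual coefficients o_i of one transform block
        base_levels = np.round(o / q2)          # requantize with the coarser step q2
        residual = o - base_levels * q2         # part that the base layer misses
        enh_levels = np.round(residual / q1)    # refine with the original step q1
        return base_levels.astype(int), enh_levels.astype(int)

For example, split_open_loop(np.array([13.0, -5.0, 2.0]), q1=2.0, q2=8.0) yields base levels (2, -1, 0) and enhancement levels (-2, 2, 1).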
The major issue in creating multiple layers is ensuring that every layer is self-contained, and that decoding a stream at any extraction point results in drift-free video sequences. Since requantization results in errors, measures have to be provided to avoid error propagation at the different layers.
4.2 Drift-Compensating Transcoding Architecture
To avoid requantization error propagation, we use compensation techniques we developed for H.264/AVC in [6]. Requantization errors can propagate, both spatially and temporally. Hence, depending on the slice and macroblock type currently being processed, a different compensation technique is applied. For motion-compensated macroblocks, temporal compensation is used, while for intra-predicted macroblocks, spatial compensation techniques are applied. As we have determined in previous research, compensation of intra-predicted macroblocks is an indispensable condition for obtaining video sequences with good visual quality. This is intuitively clear since in the case of I pictures, requantization errors can propagate between neighboring 4×4 or 16×16 blocks. In this way, the errors can easily accumulate throughout the image, and cause serious drift effects and distorted frames. The same effect will be noticed for intra-predicted macroblocks in P and B pictures, where drift results in artefacts in the intra-coded regions of the images. For drift compensation, we use the low-complexity compensation techniques as discussed in [6], and extend the architecture in order to support multiple CGS scalability layers. In Fig. 3, the resulting architecture is shown for transcoding to SVC bitstreams with two dependency layers. As mentioned, a distinction is made between spatial (intra-prediction based, IP) and temporal (motion-compensation based, MC) transform-domain compensation. In the architecture, two buffers are provided. The first buffer contains the requantization error values from the current frame, and is used to compensate surrounding macroblocks according to the sparse compensation matrices we derived in [4,5]. When a complete reference frame is transcoded, the content of the current frame buffer is copied to the reference frame buffer. The latter is used for temporal compensation of inter-predicted macroblocks. It is clear that the number of compensation frames used as reference determines to a large extent the complexity and memory usage of the overall architecture. In order to retain a low-complexity transcoding architecture, compensation is only applied at the borders of the Group of Pictures (GOP). For a GOP length of 8, this implies that frames with a frame number equal to a multiple of 8 will be compensated. Intermediate hierarchically B-coded pictures are not compensated. This only has a minor impact on quality due to the low transform coefficient energy retained in the B-coded pictures. In this way, drift may arise within the GOP structure, but it will not propagate across the GOP borders. The method used here is similar to the above-mentioned key picture concept used for MGS SNR scalability. After coding the base layer, the second and subsequent layers are obtained by subtracting the accumulated transform coefficient values of lower layers. In the JSVM, an inverse transform is applied between successive layers to decode the coefficients and perform calculations in the pixel domain. Here, for reduced complexity, we eliminate the inverse transform and perform calculations in the transform domain. This has only a minor impact on the rate-distortion performance of the transcoder.
Fig. 3. Drift-compensating transcoding architecture
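The per-frame control flow of this architecture can be sketched as follows (our illustration, not the authors' implementation; the actual compensation uses the sparse compensation matrices of [4,5], whereas this sketch only shows the buffer handling and the GOP-boundary gating):

    import numpy as np

    def split_with_error(o, q1=2.0, q2=8.0):
        # open-loop split (see previous sketch) that also returns the
        # requantization error needed for compensating other blocks
        base = np.round(o / q2)
        enh = np.round((o - base * q2) / q1)
        error = o - base * q2 - enh * q1
        return base, enh, error

    def transcode_frame(blocks, frame_number, cur_buf, ref_buf, gop_length=8):
        # Compensation is only applied at GOP borders; intermediate
        # hierarchically B-coded pictures are left untouched, so drift
        # stays confined within one GOP.
        compensate = (frame_number % gop_length == 0)
        out = []
        for idx, (coeffs, is_intra) in enumerate(blocks):
            if compensate:
                # spatial (intra) errors come from the current-frame buffer,
                # temporal (inter) errors from the reference-frame buffer
                coeffs = coeffs + (cur_buf[idx] if is_intra else ref_buf[idx])
            base, enh, err = split_with_error(coeffs)
            cur_buf[idx] = err              # used to compensate neighbouring blocks
            out.append((base, enh))
        if compensate:
            ref_buf[:] = cur_buf            # the "copy at end of frame" step of Fig. 3
        return out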
5 Implementation Results
We implemented both architectures described in the previous section. The resulting transcoder is able to transcode H.264/AVC-compliant bitstreams to SVC-decodable streams with multiple CGS layers. For testing, we used sequences with varying motion characteristics, namely Foreman, Stefan, and Paris, in CIF resolution. The sequences were encoded using the H.264/AVC Joint Model reference software, version 11.0. We used hierarchical GOP structures with varying lengths (8, 16, and 32). Between the base layer and the enhancement layer of the SVC sequences, we used a ΔQ = Q2 − Q1 = 6. The H.264/AVC sequences were transcoded using our software implementation. As a reference, we reencoded the H.264/AVC sequences to SVC using the JSVM software, version 7.6. In Figs. 4-6, we give the rate-distortion results for the three sequences, after reencoding and transcoding with and without compensation. A large gap can be seen between the open-loop and compensating transcoder architectures for both the Stefan and Foreman sequences. Due to the lower amount of motion data in the Paris sequence, this gap remains smaller. For all three sequences, the drift-compensating transcoder architecture approaches the slow reencoding solution within 1 to 2 dB. For the important case of base layer-only decoding, the results are similar. The quality after transcoding with compensation approximates the rate-distortion optimal decoder-encoder cascade within 1 to 2 dB, as can be seen in Figs. 7-9. In Fig. 10, the first 64 frames of the Stefan sequence are shown with their corresponding PSNR values after reencoding, and transcoding with and without compensation. It is clear (particularly in the first intra period of 32 frames) that compensation is required in order to obtain reliable images and to restrain drift, and that open-loop transcoding is not applicable. Average timing results, as shown in Table 1, indicate the importance of transcoding in H.264/AVC-to-SVC conversion. The reencoding results were obtained by using the JSVM reference software, version 7.6, and compared to our transcoder implementation. The tests were run on a desktop PC with a Pentium 4 CPU at 3 GHz with 1 GB memory. As can be seen, a huge speed-up of more than 90% is
Fig. 4. Rate-distortion performance (Stefan sequence)
Fig. 5. Rate-distortion performance (Foreman sequence)
Fig. 6. Rate-distortion performance (Paris sequence)
Fig. 7. Rate-distortion performance for base layer (Stefan sequence)
Fig. 8. Rate-distortion performance for base layer (Foreman sequence)
Fig. 9. Rate-distortion performance for base layer (Paris sequence)
(Figs. 4-9 plot PSNR-Y [dB] versus bitrate [Mbps] for the reencoded, compensated, and open-loop configurations.)
Fig. 10. Stefan sequence (64 frames)
Table 1. Timing results [s]
(Q1, Q2)        Reencoding                            Transcoding
             (14,20)  (20,26)  (26,32)  (32,38)    (14,20)  (20,26)  (26,32)  (32,38)
Stefan        531.4    523.3    519.7    492.9       31.7     28.7     26.4     24.5
Foreman       520.4    517.7    516.8    495.3       31.7     26.7     24.8     24.1
Paris         480.0    477.9    477.3    462.7       31.2     30.0     27.0     25.1
obtained by using transcoding instead of a decoder-encoder cascade. This can to a large extent be explained by the fact that in the decoder-encoder cascade, no information is reused from the incoming bitstream. In this way, all coding decisions have to be repeated. In particular, the time-consuming motion estimation is repeated, along with all mode decisions.
6 Conclusions
In this paper, we discussed our architecture for H.264/AVC-to-SVC transcoding, which derives SNR scalability layers from single-layer H.264/AVC video streams. Implementation results were provided that show that the rate-distortion optimal reencoder is approached within 1 to 2 dB. Timing results indicate the necessity of intelligent techniques for H.264/AVC-to-SVC conversion. Transcoding was shown to result in a reduction of execution time of more than 90% when compared to the reencoder due to intelligent reuse of information in the incoming bitstream.
Acknowledgements The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology
(IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.
References
1. Vetro, A., Christopoulos, C., Sun, H.: Video Transcoding Architectures and Techniques: an Overview. IEEE Signal Processing Magazine, 18–29 (2003)
2. Qian, T., Jun, S., Dian, L., Yang, X., Jia, W.: Transform domain transcoding from MPEG-2 to H.264 with interpolation drift-error compensation. IEEE Transactions on Circuits and Systems for Video Technology 16, 523–534 (2006)
3. Bialkowski, J., Barkowsky, M., Kaup, A.: Overview of low-complexity video transcoding from H.263 to H.264. In: Proceedings of ICME 2006 (IEEE International Conference on Multimedia and Expo), pp. 49–52. IEEE Computer Society Press, Los Alamitos (2006)
4. Notebaert, S., De Cock, J., De Wolf, K., Van de Walle, R.: Requantization Transcoding of H.264/AVC Bitstreams for Intra 4x4 Prediction Modes. In: Proceedings of PCM (Pacific-Rim Conference on Multimedia) (2006)
5. De Cock, J., Notebaert, S., Lambert, P., De Schrijver, D., Van de Walle, R.: Requantization Transcoding in Pixel and Frequency Domain for Intra 16x16 in H.264/AVC. In: Proceedings of ACIVS (Advanced Concepts for Intelligent Vision Systems) (2006)
6. De Cock, J., Notebaert, S., Van de Walle, R.: A Novel Hybrid Requantization Transcoding Scheme for H.264/AVC. In: Proceedings of ISSPA (International Symposium on Signal Processing and its Applications) (2007, accepted for publication)
7. Reichel, J., Schwarz, H., Wien, M.: Joint Scalable Video Model (JSVM) 10. Joint Video Team, Doc. JVT-W202, San Jose, CA, USA (2007)
8. Schwarz, H., Marpe, D., Wiegand, T.: Overview of the Scalable Extension of the H.264/MPEG-4 AVC Video Coding Standard. Joint Video Team, Doc. JVT-W132, San Jose, CA, USA (2007)
9. Wiegand, T., Sullivan, G., Reichel, J., Schwarz, H., Wien, M.: Joint Draft 10. Joint Video Team, Doc. JVT-W201, San Jose, CA, USA (2007)
10. Segall, A.: SVC-to-AVC Bit-stream Rewriting for Coarse Grain Scalability. Joint Video Team, Doc. JVT-T061, Klagenfurt, Austria (2006)
Improved Pixel-Based Rate Allocation for Pixel-Domain Distributed Video Coders Without Feedback Channel
Marleen Morbée1, Josep Prades-Nebot2, Antoni Roca2, Aleksandra Pižurica1, and Wilfried Philips1
1
TELIN-IPI-IBBT Ghent University Ghent, Belgium
[email protected] 2 GTS-ITEAM Universidad Politécnica de Valencia Valencia, Spain
[email protected]
Abstract. In some video coding applications, it is desirable to reduce the complexity of the video encoder at the expense of a more complex decoder. Distributed Video (DV) Coding is a new paradigm that aims at achieving this. To allocate a proper number of bits to each frame, most DV coding algorithms use a feedback channel (FBC). However, in some cases, a FBC does not exist. In this paper, we therefore propose a rate allocation (RA) algorithm for pixel-domain distributed video (PDDV) coders without FBC. Our algorithm estimates at the encoder the number of bits for every frame without significantly increasing the encoder complexity. For this calculation we consider each pixel of the frame individually, in contrast to our earlier work where the whole frame is treated jointly. Experimental results show that this pixel-based approach delivers better estimates of the adequate encoding rate than the frame-based approach. Compared to the PDDV coder with FBC, the PDDV coder without FBC has only a small loss in RD performance, especially at low rates.
1 Introduction
Some video applications, e.g., wireless low-power surveillance, disposable cameras, multimedia sensor networks, and mobile camera phones require low-complexity coders. Distributed video (DV) coding is a new paradigm that fulfills this requirement by performing intra-frame encoding and inter-frame decoding [1]. Since DV decoders and not encoders perform motion estimation and motion compensated interpolation, most of the computational load is moved from the encoder to the decoder.
This work has been partially supported by the Spanish Ministry of Education and Science and the European Commission (FEDER) under grant TEC2005-07751-C0201. A. Pižurica is a postdoctoral research fellow of FWO, Flanders.
One of the most difficult tasks in DV coding is allocating a proper number of bits to encode each video frame. This is mainly because the encoder does not have access to the motion estimation information of the decoder and because small variations in the allocated number of bits can cause large changes in distortion. Most DV coders solve this problem by using a feedback channel (FBC), which allows the decoder to request additional bits from the encoder when needed. Although this way an optimal rate is allocated, it is not a valid solution in unidirectional and offline applications, and increases the decoder complexity and latency [2]. In this paper, we propose a rate allocation (RA) algorithm for pixel-domain distributed video (PDDV) coders that do not use a FBC. Our algorithm computes the number of bits to encode each video frame without significantly increasing the encoder complexity. The proposed method is related to our previous work [3] on PDDV coders without FBC. However, in this paper, the algorithm is improved by estimating the error probabilities for each pixel separately instead of for the whole frame jointly. We also adapted the algorithm for the case of lossy (instead of lossless) coding of the key frames. The experimental results show that the RA algorithm delivers good estimates of the rate and the frame qualities provided by our algorithm are quite close to the ones provided by a FBC-based algorithm. Furthermore, we observe that the rate estimates and frame quality are significantly improved compared to our previous work [3]. The paper is organized as follows. In Section 2, we study the basics of PDDV coding. In Section 3, we study the RA problem and the advantages and inconveniences of using a FBC. Then, in Section 4, we describe the RA algorithm. Subsequently, in Section 5, we compare the performance of a DV coder using a FBC and the performance of the same DV coder using our RA algorithm. Finally, the conclusions are presented in Section 6.
2 Pixel-Domain DV Coding
In DV coders, the frames are organized into key frames (K-frames) and Wyner-Ziv frames (WZ-frames). The K-frames are coded using a conventional intra-frame coder. The WZ-frames are coded using the Wyner-Ziv paradigm, i.e., they are intra-frame encoded, but they are conditionally decoded using side information (Figure 1). In most DV coders, the odd frames are encoded as K-frames, and the even frames are encoded as WZ-frames [4, 5, 3]. Coding and decoding is done unsequentially in such a way that, before decoding the WZ-frame X, the preceding and succeeding K-frames (XB and XF) have already been transmitted and decoded. Thus, the receiver can obtain a good approximation S of X by interpolating its two closest decoded frames (X̂B and X̂F). S is used as part of the side information to conditionally decode X, as will be explained below. The DV coders can be divided into two classes: the scalable coders [5, 2, 3], and the non-scalable coders [4]. The scalable coders have the advantages that the rate can be flexibly adapted and that the rate control is easier than in the
Fig. 1. General block diagram of a scalable PDDV coder
non-scalable case. In this paper, we focus on the practical scalable PDDV coder depicted in Figure 1 [5, 2, 3]. In this scheme, we first extract the M Bit Planes (BPs) Xk (1 ≤ k ≤ M) from the WZ-frame X. M is determined by the number of bits by which the pixel values of X are represented. Subsequently, the m most significant BPs Xk (1 ≤ k ≤ m, 1 ≤ m ≤ M) are encoded independently of each other by a Slepian-Wolf (SW) coder [6]. The transmission and decoding of BPs is done in order of significance (the most significant BPs are transmitted and decoded first). The SW coding is implemented with efficient channel codes that yield parity bits of Xk, which are transmitted over the channel. At the receiver side, the SW decoder obtains the original BP Xk from the transmitted parity bits, the corresponding BP Sk extracted from the interpolated frame S, and the previously decoded BPs {X1, . . . , Xk−1}. Note that Sk can be considered the result of transmitting Xk through a noisy virtual channel. The SW decoder is a channel decoder that recovers Xk from its noisy version Sk. Finally, the decoder obtains the reconstruction x̂ of each pixel x ∈ X by using the decoded bits xk ∈ Xk (k = 1, . . . , m) and the corresponding pixel s of the interpolated frame S through

x̂ = xL if s < xL;  x̂ = s if xL ≤ s ≤ xR;  x̂ = xR if s > xR,    (1)
m i=1
xi 28−i and xR = xL + 28−m − 1.
(2)
666
3
M. Morb´ee et al.
The Rate Allocation Problem
In PDDV coders, the optimum rate R∗ is the minimum rate necessary to losslessly1 decode the BPs Xk (k = 1, . . . , m). The use of a rate higher than R∗ does not lead to a reduction in distortion, but only to an unnecessary bit expense. On the other hand, encoding with a rate lower than R∗ can cause the introduction of a large number of errors in the decoding of Xk , which can greatly increase the distortion. This is because of the threshold effect of the channel codes used in DV coders. A common RA solution adopted in DV coders is the use of a FBC and a ratecompatible punctured turbo code (RCPTC) [7]. In this configuration, the turbo encoder generates all the parity bits for the BPs to be encoded, saves these bits in a buffer (see Figure 1), and divides them into parity bit sets. The size of a parity bit set is N/Tpunc, where Tpunc is the puncturing period of the RCPTC and N is the number of pixels in each frame. To determine the adequate number of parity bit sets to send for a certain BP Xk , the encoder first transmits one parity bit set from the buffer. Then, if the decoder detects that the residual error probability Qk (for the calculation see Section 4.4) is above a threshold t, it requests an additional parity bit set from the buffer through the FBC. This transmission-request process is repeated until Qk < t. If we denote by Kk the number of transmitted parity bit sets, then the encoding rate Rk for BP Xk is Rk = r Kk
N , Tpunc
(3)
with r being the frame rate of the video. However, although the FBC allows the system to allocate an optimal rate, this FBC cannot be implemented in offline applications or in those applications where communication from the decoder to the encoder is not possible. In those applications, an appropriate RA algorithm at the encoder can take over its role. In the following section, we will describe this RA algorithm to suppress the FBC in more detail.
4
The Rate Allocation Algorithm
The main idea of the proposed method is to estimate at the encoder side, for each BP of the WZ-frames, the optimal (i.e. the minimal required) number of parity bits for a given residual error probability. An important aspect of the proposed approach is also avoiding underestimation of the optimal number of parity bits. Indeed, if the rate is underestimated, the decoding of the BPs of the frames will not be error-free and this will lead to a large increase in distortion. Let us denote by U the difference between the original frame and the side information frame: U = X − S. As in [4,5,3], we assume that a pixel value u ∈ U follows a Laplacian distribution with a probability density function (pdf) 1
In practical PDDV coding, SW decoders are allowed to introduce a certain small amount of errors.
Improved Pixel-Based Rate Allocation for PDDV Coders without FBC
p(u) = where α =
α (−α|u|) e 2
667
(4)
√ 2/σ and σ is the standard deviation of the difference frame U .
ˆB, X ˆF X, X
Estimation of σ 2
σ ˆ2
Estimation of {Pk }
{Pk } Estimation of {Rk }
{Rk }
Fig. 2. Rate allocation module at the encoder
As every BP of a WZ-frame X is separately encoded, a different encoding rate Rk must be allocated to each BP Xk . As the virtual channel is assumed to be a binary symmetric channel, to obtain Rk , we need to know the bit error probability Pk of each BP Xk . To calculate this probability, we first make an estimate σ ˆ 2 of the parameter σ 2 (Section 4.1). Then, for each BP Xk , we use σ ˆ to estimate Pk (Section 4.2). Once Pk is estimated, we can determine the encoding rate Rk for BP Xk by taking into account the error correcting capacity of the turbo code (Section 4.3). In Figure 2, a block diagram of the RA module is depicted. Although we aim at an overestimation of the rate, this is not always achieved. Therefore, once the parity bits have been decoded, the residual error probability ˆ k ) (Section 4.4). If Q ˆ k is above a threshold t, Qk is estimated at the decoder (Q the parity bits of the considered BP are discarded and the frame is reconstructed with the available previously decoded BPs. This way, we prevent an increase in the distortion caused by an excessive number of errors in a decoded BP. In the following, we explain each step of our RA algorithm in more detail. 4.1
Estimation of σ 2
We estimate σ 2 at the encoder so the estimate should be very simple in order to avoid significantly increasing the encoder complexity. We adopt the approach of [3], but we take the coding of the K-frames into account. σ ˆ 2 is then the mean squared error (MSE) between the current WZ-frame and the average of the two closest decoded K-frames: ˆ B (v, w) + X ˆ F (v, w) 2 1 X σ ˆ2 = X(v, w) − (5) N 2 (v,w)∈X
with N denoting the number of pixels in each frame. The decoded frames are obtained by the intra-frame decoding unit at the encoder site (see Figure 1). In general, the resulting σ ˆ 2 is an overestimate of the real σ 2 since it is expected that the motion compensated interpolation performed at the decoder to obtain the side information will be more accurate than the simple averaging of the two closest decoded K-frames. This is exactly what is required for our purpose, since we prefer an overestimation of the encoding rate to an underestimation, as explained above.
668
4.2
M. Morb´ee et al.
Estimation of the Error Probabilities {Pk }
Let us assume that the most significant k − 1 bits of the pixel value x ∈ X have already been decoded without errors. Hence, both the encoder and the decoder know from {x1 , . . . , xk−1 } that x is in the interval [xL , xR ] where xL and xR are as in (2) with m = k − 1. At the encoder, the bit value xk shrinks this interval in such way that x ∈ [xL , xC ] if xk = 0, and x ∈ [xC + 1, xR ] if xk = 1 with
xL + xR xC = . (6) 2 An error in xk occurs if x ∈ [xL , xC ] and s ∈ [xC + 1, xR ] or if x ∈ [xC + 1, xR ] and s ∈ [xL , xC ]. By assuming a Laplacian pdf for the difference between the original frame and the side information, the conditional pdf of s given x and xL ≤ s ≤ xR is ⎧ α −α|x−s| ⎪ 2e ⎪ ⎨ if xL ≤ s ≤ xR P(x ≤ s ≤ xR |x) L p(s|x, xL ≤ s ≤ xR ) = . (7) ⎪ ⎪ ⎩ 0 otherwise From (7), the error probability of bit value xk of pixel value x is estimated through ⎧ x R ⎪ ⎪ p(s|x, xL ≤ s ≤ xR ) ds if xk = 0 ⎪ ⎨ xc +0.5 Pe (xk ) = (8) xc +0.5 ⎪ ⎪ ⎪ ⎩ p(s|x, x ≤ s ≤ x ) ds if x = 1 L
R
k
xL
Note that the integration intervals are extended by 0.5 in order to cover the whole interval [xL , xR ]. For the first BP X1 , no previous BPs have been transmitted and decoded and, consequently, xL = 0, xR = 255, and xC = 127 for all the pixels. Finally, we estimate the average error probability Pk for the entire BP Xk . Therefore, we take into account the histogram of the frame H(x), which provides the relative frequency of occurrence for each pixel value x. Pk is then estimated through 255 Pk = H(x)Pe (xk ). (9) x=0
4.3
Estimation of the Encoding Rates {Rk }
Once Pk is estimated, we choose the corresponding encoding rate Rk that enables us to decode the estimated number of errors with a residual error probability Qk below a threshold t (Qk < t). The calculation of Qk is explained in Section 4.4. To estimate Rk , we need to express the residual error probability Qk as a function of
Improved Pixel-Based Rate Allocation for PDDV Coders without FBC
669
input error probability Pk and the number of parity bit sets Kk [3]. We estimate these functions experimentally by averaging simulation results over a large set of video sequences with a wide variety of properties. Using these experimental functions and knowing Pk and the threshold t, we estimate the adequate number of parity bit sets Kk . Finally, we obtain Rk from Kk through (3), with r the frame rate, Tpunc the puncturing period and N the number of pixels in each frame. 4.4
Estimation of the Residual Error Probabilities {Qk }
If the rate allocated to encode a BP is too low, the decoded BP can contain such a large number of errors that the quality of the reconstructed frame is worse than the quality of the side information. To prevent this situation, we need to know the residual error probability Qk of each BP at the decoder. We estimate Qk as [8] N 1 ˆk = 1 Q (10) N n=1 1 + e|Ln | where N is the number of pixels in each frame and Ln the log-likelihood ratio ˆ k is above a certain threshold of the nth bit in the considered BP Xk [8]. If Q ˆ (Qk > t), the decoded BPs are discarded and the frame is reconstructed with the available previously error-free decoded BPs.
5
Experimental Results
In this section, we experimentally study the accuracy of our RA algorithm when it is used in a PDDV coder without FBC (RA-PDDV coder) and compare it with the rate allocations provided by the same coder using a FBC (FBC-PDDV coder). We will also discuss the improvement compared to our previous work [3]. The PDDV coder used in the experiments first decomposes each WZ-frame into its 8 BPs. Then, the m most significant BPs are separately encoded by using a RCPTC; the other BPs are discarded. In our experiments, m is chosen to be 3. The turbo coder is composed of two identical constituent convolutional encoders of rate 1/2 with generator polynomials (1, 33/31) in octal form. The puncturing period was set to 32 which allowed our RA algorithm to allocate parity bit multiples of N/32 bits to each BP, where N is the number of pixels in each frame. The K-frames were either losslessly transmitted or intra-coded using H.263 with quantization parameter QP . The interpolated frame was generated at the decoder with the interpolation tools described in [5]. We encoded several test QCIF sequences (176×144 pixels/frame, 30 frames/s) with two RA strategies: our RA algorithm and the allocations provided by the ˆ k (our RA approach) FBC-PDDV coder. The threshold t for Qk (FBC) and for Q 1 is set to N , where N is the number of pixels in each frame. Tables 1 and 2 show the difference between the RA (in kb/s) provided by our algorithm and the RA using the FBC when encoding the first BP of each frame. More specifically, the percentage of frames with a difference in rate of ΔR kb/s
670
M. Morb´ee et al.
Table 1. Percentage of frames that differ by ΔR from the rate of the FBC (for the first BP). The K-frames are losslessly transmitted. The previous method is the method described in [3]. Video sequence Akiyo Carphone Foreman Salesman Mobile
Method
≤-24 kb/s
% of frames with ΔR -12 0 +12 kb/s kb/s kb/s
current previous current previous current previous current previous current
0 12.1 0 7.4 1.0 7.5 0 8.0 0
0 14.7 5.4 10.1 2.5 17.6 0 10.1 2.0
100 59.7 42.6 23.5 25.8 23.1 93.9 45.0 38.5
≥+24 kb/s
0 10.1 40.5 34.9 30.8 13.6 6.1 26.8 58.1
0 3.4 11.5 24.2 39.9 38.2 0 10.1 1.4
Table 2. Percentage of frames that differ by ΔR from the rate of the FBC (for the first BP). The K-frames are intra-coded with H.263 (QP = 10). Video sequence
≤-24 kb/s
Akiyo Carphone Foreman Salesman Mobile
0 0 2.5 0 0
% of frames with ΔR -12 0 +12 kb/s kb/s kb/s 0 8.1 4.0 0 0
47.3 39.9 41.4 66.2 60.8
52.7 43.9 21.7 33.8 36.5
≥+24 kb/s 0 8.1 30.3 0 2.7
is shown. In Table 1 the K-frames are losslessly coded while in Table 2 the Kframes are intra-coded with H.263 and QP = 10. Note that for the lossless case the ideal rate is allocated in between 25% and 100% of the frames (depending on the sequence), whereas in our previous work [3], the ideal rate was allocated in between 23% and 60% of the frames. For Akiyo, Carphone, Foreman and Salesman, we observe an increase in the percentage of respectively 40.3%, 19.1%, 2.7% and 48.9%. Moreover, we notice that with the current approach for only very few frames the rate is underestimated, which is desirable for our purpose (as explained in Section 4). The results for the case of lossy coding of the K-frames are a little worse but similar. Also here, non-optimal rate allocations are nearly always overestimations. Tables 3, 4, 5 and 6 also show the difference between the RA (in kb/s) provided by our algorithm and the RA using the FBC but now for the second and third BP of each frame. In Tables 3 and 5 the K-frames are losslessly coded while
Improved Pixel-Based Rate Allocation for PDDV Coders without FBC
671
Table 3. Percentage of frames that differ by ΔR from the rate of the FBC (for the second BP). The K-frames are losslessly transmitted. The previous method is the method described in [3]. Video sequence Akiyo Carphone Foreman Salesman Mobile
Method
≤-24 kb/s
current previous current previous current previous current previous current
0 0.7 0 0 0 0 0 0 0
% of frames with ΔR -12 0 +12 kb/s kb/s kb/s 0.7 8.0 0.7 2.0 0 1.5 0.7 10.1 1.4
87.2 31.5 14.9 7.4 19.2 12.6 43.9 31.5 27.0
≥+24 kb/s
10.1 28.9 37.8 16.1 24.8 10.5 44.6 18.1 37.8
2.0 30.9 46.6 74.5 56.1 75.4 10.8 40.3 33.8
Table 4. Percentage of frames that differ by ΔR from the rate of the FBC (for the second BP). The K-frames are intra-coded with H.263 (QP = 10). Video sequence
≤-24 kb/s
Akiyo Carphone Foreman Salesman Mobile
0 0 0 0 0
% of frames with ΔR -12 0 +12 kb/s kb/s kb/s 0.7 0 1.5 25.7 0.7
0 20.3 26.3 74.3 31.8
93.9 49.3 21.2 0 37.8
≥+24 kb/s 5.4 30.4 51.0 0 29.7
Table 5. Percentage of frames that differ by ΔR from the rate of the FBC (for the third BP). The K-frames are losslessly transmitted. Video sequence
≤-24 kb/s
Akiyo Carphone Foreman Salesman Mobile
0.7 0 0 0.7 0
% of frames with ΔR -12 0 +12 kb/s kb/s kb/s 74.3 4.1 0.5 25.0 0.7
17.6 31.1 17.7 39.9 16.2
5.4 14.9 8.6 24.3 27.7
≥+24 kb/s 2.0 50.0 73.2 10.1 55.4
672
M. Morb´ee et al.
Table 6. Percentage of frames that differ by ΔR from the rate of the FBC (for the third BP). The K-frames are intra-coded with H.263 (QP = 10). Video sequence
≤-24 kb/s
Akiyo Carphone Foreman Salesman Mobile
0 0 0 0 0
% of frames with ΔR -12 0 +12 kb/s kb/s kb/s 0 2.0 0 0 0
100 21.0 4.0 0 2.7
36.5
0 27.7 19.7 37.2 18.9
≥+24 kb/s 0 49.3 76.3 62.8 78.4
38.5
36 38
PSNR(dB)
PSNR(dB)
35.5 35 34.5
37.5
37
34 36.5 33.5 33
FBC (optimal RA) RA algorithm 0
50
100
150 200 Rate(kb/s)
250
300
FBC (optimal RA) RA algorithm 36
350
0
50
100
45.2
36.5
45
36
44.8
35.5
44.6 44.4
34 33.5
FBC (optimal RA) RA algorithm
43.8 40
60 Rate(kb/s)
(c) Salesman
350
35
44
20
300
34.5
44.2
0
250
(b) Foreman
PSNR(dB)
PSNR(dB)
(a) Carphone
150 200 Rate(kb/s)
80
100
120
33
FBC (optimal RA) RA algorithm 0
50
100 150 Rate(kb/s)
200
250
(d) Mobile
Fig. 3. RD performance of our RA algorithm for the sequences (a) Carphone, (b) Foreman, (c) Salesman and (d) Mobile. Compared is the RD performance for the case of optimal rate allocation. The K-frames are losslessly transmitted.
in Tables 4 and 6 the K-frames are intra-coded with H.263 and QP = 10. Similar improvements of the pixel-based approach compared to the frame-based approach [3] as for the first BP can be noticed. We observe that the inaccuracy of the RA increases when the BPs are less significant, like in [3].
33.5
33.5 33
PSNR(dB)
PSNR(dB)
33 32.5 32
32.5
32
31.5 31.5 31 30.5
FBC (optimal RA) RA algorithm 0
50
100
150 200 Rate(kb/s)
250
300
FBC (optimal RA) RA algorithm 31
350
0
50
(a) Carphone
100
150
200 250 Rate(kb/s)
300
350
400
(b) Foreman
34
31.5
33.8 31
33.6
PSNR(dB)
PSNR(dB)
33.4 33.2 33
30.5
30
32.8 32.6
29.5 FBC (optimal RA) RA algorithm
32.4 32.2
0
50
100
150 Rate(kb/s)
(c) Salesman
200
250
FBC (optimal RA) RA algorithm 300
29
0
50
100
150
200 250 Rate(kb/s)
300
350
400
(d) Mobile
Fig. 4. RD performance of our RA algorithm for the sequences (a) Carphone, (b) Foreman, (c) Salesman and (d) Mobile. Compared is the RD performance for the case of optimal rate allocation. The K-frames are intra-coded with H.263 (QP = 10).
In Figures 3 and 4, we show the RD curves of Carphone, Foreman, Salesman, and Mobile for the RA-PDDV coder, and we compare them with the corresponding RD curves when, for the given puncturing period, an optimal rate is allocated (FBC-PDDV coder). In Figure 3 the K-frames are losslessly coded while in Figure 4 the K-frames are intra-coded with H.263 and QP = 10. The value of the PSNR at rate 0 shows the average quality of the interpolated frame S. For both lossless and lossy coding of the K-frames, we observe that the loss in RD performance of the RA-PDDV coder when compared to the FBC-PDDV coder is very small for low rates. The difference in RD performance increases with higher rates to an extent that varies from sequence to sequence. The acceptability of this performance loss is application-dependent.
6 Conclusion
In this paper, we presented an RA algorithm for rate-compatible, turbo-code-based PDDV coders. Without complicating the encoder, the algorithm estimates the appropriate number of bits for each frame. In this calculation the error
probabilities are estimated for each pixel individually and not for the whole frame jointly, as was the case in our previous work. The proposed pixel-based RA algorithm delivers more accurate estimates of the encoding rate than the frame-based approach. This pixel-based RA algorithm makes it possible to remove the FBC from the traditional scheme, with only a small loss in RD performance, especially for low rates.
References
1. Puri, R., Ramchandran, K.: PRISM: A new robust video coding architecture based on distributed compression principles. In: Proc. Allerton Conference on Communication, Control, and Computing, Allerton, IL, USA (October 2002)
2. Brites, C., Ascenso, J., Pereira, F.: Feedback channel in pixel domain Wyner-Ziv video coding: myths and realities. In: 14th EUSIPCO'06, Florence, Italy (September 2006)
3. Morbée, M., Prades-Nebot, J., Pižurica, A., Philips, W.: Rate allocation algorithm for pixel-domain distributed video coding without feedback channel. In: ICASSP, Hawaii, USA (April 2007)
4. Aaron, A., Zhang, R., Girod, B.: Wyner-Ziv coding of motion video. In: Proc. Asilomar Conference on Signals and Systems, Pacific Grove, California, USA (November 2002)
5. Ascenso, J., Brites, C., Pereira, F.: Improving frame interpolation with spatial motion smoothing for pixel domain distributed video coding. In: 5th EURASIP Conference, Slovak Republic (June 2005)
6. Slepian, J., Wolf, J.: Noiseless coding of correlated information sources. IEEE Trans. Inf. Theory 19(4) (1973)
7. Rowitch, D., Milstein, L.: On the performance of hybrid FEC/ARQ systems using rate compatible punctured turbo codes. IEEE Trans. Comm. 48(6), 948–959 (2000)
8. Hoeher, P., Land, I., Sorger, U.: Log-likelihood values and Monte Carlo simulation – some fundamental results. In: Int. Symp. on Turbo Codes and Rel. Topics, pp. 43–46 (September 2000)
Multiview Depth-Image Compression Using an Extended H.264 Encoder Yannick Morvan1 , Dirk Farin1 , and Peter H. N. de With1,2 1
Eindhoven University of Technology, PO Box 513, 5600 MB, The Netherlands 2 LogicaCMG, TSE, PO Box 7089, 5600 JB Eindhoven, The Netherlands
Abstract. This paper presents a predictive-coding algorithm for the compression of multiple depth-sequences obtained from a multi-camera acquisition setup. The proposed depth-prediction algorithm works by synthesizing a virtual depth-image that matches the depth-image (of the predicted camera). To generate this virtual depth-image, we use an image-rendering algorithm known as 3D image-warping. This newly proposed prediction technique is employed in a 3D coding system in order to compress multiview depth-sequences. For this purpose, we introduce an extended H.264 encoder that employs two prediction techniques: a blockbased motion prediction and the previously mentioned 3D image-warping prediction. This extended H.264 encoder adaptively selects the most efficient prediction scheme for each image-block using a rate-distortion criterion. We present experimental results for several multiview depthsequences, which show a quality improvement of about 2.5 dB as compared to H.264 inter-coded depth-images.
1 Introduction
The emerging 3D video technology enables novel applications such as 3D-TV or free-viewpoint video. A free-viewpoint video application provides the ability for users to interactively select a position (viewpoint) for viewing the scene. To render user-selected views of the video scene, various image-synthesis techniques have been developed [1]. The two major techniques use either a geometric model of the scene, or an interpolative model based on the neighboring cameras to generate a new user-selected view. Recently, it has been shown that using a mixture of both techniques enables real-time free-viewpoint video rendering. One example of this [2] allows the synthesis of intermediate views along a chain of cameras. The algorithm estimates the epipolar geometry between each pair of successive cameras and rectifies the images pairwise. Disparity images are estimated for each pair of cameras and synthetic views are interpolated using an algorithm similar to the View Morphing [3] technique. A second example [4] employs a similar video capturing system composed of a set of multiple cameras. As opposed to the previous approach, the cameras are fully calibrated prior to the capture session (see Figure 1). Since the cameras are calibrated, the depth can be subsequently estimated for each view. Using the estimated depth-information, 3D warping techniques can be employed
to perform view synthesis at the user-selected viewpoint. This selected virtual camera position is used to warp the two nearest-neighboring views by employing their corresponding depth images. Both warped views are finally blended to generate the final rendered image.
Fig. 1. Multiview capturing system in which the position and orientation of each camera is known. Because camera parameters are known, depth images can be estimated for each view and an image-warping algorithm can be used to synthesize virtual views.
Considering the transmission of 3D data, for both approaches, one depth-image for each view should be coded and transmitted. The major problem of this approach is that for each camera-view, an additional depth signal has to be transmitted. This leads to a considerable increase of the bitrate for transmitting 3D information. For example, an independent transmission of 8 depth-views of the "Breakdancers" sequence requires about 1.7 Mbit/s with a PSNR of 40 dB. This bitrate comes on top of the 10 Mbit/s for the multiview texture data. Therefore, a more efficient compression algorithm for transmitting depth-data is highly desirable, which is the key aspect of this paper. Previous work on multiview depth-image compression has explored the idea that the estimated depth-images are highly correlated. As a result, a coding gain can be obtained by exploiting the inter-view dependency between the depth-sequences. To this end, two different approaches for predictive coding of depth-images have been investigated. A first depth-image prediction technique uses a block-based motion prediction [5]. The idea followed is to multiplex the depth-views such that a single video depth-stream is generated. The resulting video is then compressed using an H.264 encoder. A second, alternative depth-image prediction scheme [5] is based on an image-warping algorithm that synthesizes a depth-image as seen by the predicted camera. The advantage of a warping-based depth-image prediction is that the views can be accurately predicted, even when the baseline distance between the reference and predicted cameras is large, thereby yielding a high compression ratio. In this paper, we propose a technique for coding multiple depth-sequences that employs predictive coding of depth-images. The depth-image prediction employs the two above-described algorithms, i.e. the block-based motion prediction and the image-warping prediction. The most efficient prediction method
is then selected for each image-block using a rate-distortion criterion. Because the prediction accuracy has a significant impact on the coding efficiency, we have implemented three different image-rendering algorithms for warping the depth-images:
1. simple 3D image warping,
2. a triangular-mesh-based rendering technique, and
3. Relief Texture [6] image warping.
Each of them has a different rendering accuracy and computational complexity. First, the 3D image-warping technique performs image rendering at limited computing power by employing a simplified warping equation combined with several heuristic techniques. However, the quality of the rendered image is degraded, which thus results in a less accurate prediction. Second, the triangular-mesh-based technique aims at a high-quality rendered image by performing a sub-pixel warping algorithm. However, such a precise algorithm comes at the cost of a high computational load. Third, an intermediate approach, i.e. relief texture, decomposes the image-warping equation into a succession of simpler operations to obtain a computationally efficient algorithm. For each image-rendering algorithm, we have conducted compression experiments and we present their coding gain. Experimental results show that the proposed depth-prediction algorithm yields up to 2.5 dB improvement when compared to H.264 inter-coded depth-images. The remainder of this paper is organized as follows. Section 2 provides details about the warping-based depth-image prediction algorithms, while Section 3 shows how the prediction algorithms can be integrated into an H.264 encoder. Experimental results are provided in Section 4 and the paper concludes with Section 5.
2 Warping-Based Depth-Image Prediction
In this section, we describe three alternative techniques for image warping that will be employed for depth-image prediction. First, we introduce the 3D image-warping [7] technique initially proposed by McMillan et al. and second, we describe a mesh-based image-rendering technique. Finally, a variant of the relief-texture mapping algorithm is proposed, that integrates the optics underlying real cameras.
2.1 Prediction Using 3D Image Warping
A single texture image and a corresponding depth-image are sufficient to synthesize novel views from arbitrary positions. Let us consider a 3D point at homogeneous world coordinates Pw = (Xw , Yw , Zw , 1)T captured by two cameras and projected onto the reference and predicted image planes at pixel positions p1 = (x1 , y1 , 1)T and p2 = (x2 , y2 , 1)T , respectively (see Figure 2). We assume that the reference camera is located at the coordinate-system origin and looks
Fig. 2. Two projection points p1 and p2 of a 3D point Pw
along the Z-direction. The predicted camera location and orientation are described by its camera center C2 and the rotation matrix R2. This allows us to define the pixel positions p1 and p2 in both image planes by

$$\lambda_1 p_1 = [K_1 \,|\, 0_3]\, P_w, \qquad (1)$$

$$\lambda_2 p_2 = [K_2 \,|\, 0_3] \begin{pmatrix} R_2 & -R_2 C_2 \\ 0_3^T & 1 \end{pmatrix} P_w = K_2 R_2 \begin{pmatrix} X_w \\ Y_w \\ Z_w \end{pmatrix} - K_2 R_2 C_2, \qquad (2)$$

where K1, K2 represent the 3×3 intrinsic parameter matrices of the corresponding cameras and λ1, λ2 some positive scaling factors [8]. Because the matrix K1 is upper-triangular and K1(3,3) = 1, the scaling factor λ1 can be specified in this particular case by λ1 = Zw. From Equation (1), the 3D position of the original point Pw in the Euclidean domain can be written as

$$(X_w, Y_w, Z_w)^T = K_1^{-1} \lambda_1 p_1 = K_1^{-1} Z_w p_1. \qquad (3)$$

Finally, we obtain the predicted pixel position p2 by substituting Equation (3) into Equation (2), so that

$$\lambda_2 p_2 = K_2 R_2 K_1^{-1} Z_w p_1 - K_2 R_2 C_2. \qquad (4)$$
Equation (4) constitutes the image-warping [7] equation that enables the synthesis of the predicted view from the original reference view and its corresponding depth-image. In the case that the world and reference-camera coordinate systems do not correspond, a coordinate-system conversion of the external camera parameters is performed. Similarly, the world depth-values Zw are converted into the new reference coordinate system as well. One issue of the previously described method is that input pixels p1 of the reference view are usually not mapped to a pixel p2 at integer pixel position. In our implementation, to obtain an integer pixel position, we simply map the sub-pixel coordinate p2 to the nearest integer pixel position pˆ2 with
$\hat{p}_2 = (\hat{x}_2, \hat{y}_2, 1)^T = (\lfloor x_2 + 0.5 \rfloor, \lfloor y_2 + 0.5 \rfloor, 1)^T$. A second complication is that multiple original pixels can be projected onto the same pixel position in the predicted view. For example, a foreground pixel can occlude a background pixel in the interpolated view, which results in overlapping pixels. Additionally, some regions in the interpolated view are not visible from the original viewpoint, which results in holes in the predicted image. While the problem of overlapping pixels can be addressed using a technique called occlusion-compatible scanning order [7], undefined pixels in the predicted image cannot be analytically derived. Therefore, in our implementation, undefined pixels are padded using a simple pixel-copy of the nearest neighboring pixel. For simplicity, we defined a neighboring pixel as the nearest pixel in the image line. Although multiple heuristic techniques have been employed, experiments (see Section 4) have revealed that such a 3D image-warping generates depth-images with sufficient quality to perform predictive coding of depth-images.
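As an illustration of this prediction step, the following Python/NumPy sketch forward-warps a depth-image with Equation (4), rounds the result to the nearest integer pixel position and pads undefined pixels from the same image line. It is only a simplified sketch, not the authors' implementation: the far-to-near writing order stands in for the occlusion-compatible scanning order of [7], and the choice to store the depth seen by the predicted camera is an assumption.

```python
import numpy as np

def warp_depth_image(Z, K1, K2, R2, C2):
    """Forward-warp a depth-image Z (H x W of Z_w values) from the reference
    view to the predicted view using Equation (4)."""
    H, W = Z.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    p1 = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(float)
    Zf = Z.reshape(-1).astype(float)

    A = K2 @ R2 @ np.linalg.inv(K1)                      # K2 R2 K1^-1
    b = (K2 @ R2 @ np.asarray(C2, float)).reshape(3, 1)  # K2 R2 C2
    q = A @ (p1 * Zf) - b                                # q = lambda_2 * p2 per pixel

    # Round the sub-pixel result to the nearest integer position (p^_2).
    z = np.maximum(q[2], 1e-9)
    x2 = np.floor(q[0] / z + 0.5).astype(int)
    y2 = np.floor(q[1] / z + 0.5).astype(int)

    warped = np.full((H, W), np.nan)
    # Write far-to-near so that nearer points overwrite occluded ones
    # (a simple substitute for the occlusion-compatible scanning order).
    for i in np.argsort(-q[2]):
        if q[2, i] > 0 and 0 <= x2[i] < W and 0 <= y2[i] < H:
            warped[y2[i], x2[i]] = q[2, i]               # depth seen by camera 2

    # Pad undefined pixels with the nearest defined pixel on the same line.
    for r in range(H):
        row, cols = warped[r], np.flatnonzero(~np.isnan(warped[r]))
        if cols.size:
            warped[r] = row[cols[np.abs(np.arange(W)[:, None] - cols).argmin(1)]]
    return warped
```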
2.2 Prediction Using Triangular Mesh
To avoid rendering artifacts such as occluded or undefined pixels, a natural approach to render 3D images is to employ a micro-triangular mesh. The idea is to triangulate the reference depth-image so that each triangle locally approximates the object surface. In our implementation, the depth-image triangulation is performed such that two micro-triangles per pixel are employed. For each triangle-vertex in the reference image, the corresponding position of the warped-vertex is calculated using Equation (4). Finally, a rasterization procedure is performed that converts the triangle-based geometric description of the warped image into a bitmap or raster image (see Figure 3). For efficient implementation, it can be noticed that each adjacent triangle shares two common vertices. Therefore, only one warped-vertex position per pixel needs to be computed to obtain the third warped-vertex position.
Fig. 3. Micro-triangular mesh rendering processing stages: first, each triangle vertex in the reference image is warped and, second, each triangle is rasterized to produce the output image
While such a technique leads to high-quality image-rendering, one disadvantage is the very large number of micro-triangles that involves a high computational complexity. As an alternative technique, relief-texture mapping has been introduced to reduce the polygonal count required in the warping procedure.
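A minimal sketch of the mesh-based prediction of Section 2.2 is given below: it warps the shared vertex grid once with Equation (4) and lists the two micro-triangles per pixel; the rasterization itself is left to a GPU or any standard scanline rasterizer. Taking the vertex depth from the nearest pixel (by edge replication) is an assumption made for brevity, not a detail taken from the paper.

```python
import numpy as np

def warp_vertex_grid(Z, K1, K2, R2, C2):
    """Warp the (H+1) x (W+1) vertex grid of a depth-image with Equation (4)
    and return the two warped micro-triangles per pixel."""
    H, W = Z.shape
    Zv = np.pad(Z, ((0, 1), (0, 1)), mode='edge').astype(float)   # depth per vertex
    xs, ys = np.meshgrid(np.arange(W + 1), np.arange(H + 1))
    pv = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(float)

    q = (K2 @ R2 @ np.linalg.inv(K1)) @ (pv * Zv.reshape(-1)) \
        - (K2 @ R2 @ np.asarray(C2, float)).reshape(3, 1)
    z = np.maximum(q[2], 1e-9)
    vx = (q[0] / z).reshape(H + 1, W + 1)   # warped vertex x-coordinates
    vy = (q[1] / z).reshape(H + 1, W + 1)   # warped vertex y-coordinates

    # Each vertex is warped only once and shared by its adjacent triangles.
    triangles = []
    for r in range(H):
        for c in range(W):
            v00 = (vx[r, c], vy[r, c])
            v01 = (vx[r, c + 1], vy[r, c + 1])
            v10 = (vx[r + 1, c], vy[r + 1, c])
            v11 = (vx[r + 1, c + 1], vy[r + 1, c + 1])
            triangles.append((v00, v01, v10))
            triangles.append((v01, v11, v10))
    return triangles   # to be rasterized by a GPU or a scanline rasterizer
```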
2.3 Prediction Using Relief Texture Mapping
The guiding principle of the relief-texture algorithm is to factorize the 3D image-warping equation into a combination of 2D texture-mapping operations. One well-known 2D texture-mapping operation corresponds to a perspective projection of planar texture onto a plane defined in a 3D world. Mathematically, this projection can be defined using homogeneous coordinates by a 3 × 3 matrix multiplication, and corresponds to a homography transform between two images. The advantage of using such a transformation is that a hardware implementation of this function is available in most Graphic Processor Units (GPU), so that processing time is dramatically reduced. Let us now factorize the warping function to obtain a homography transform in the factorization. From Equation (4), it can be derived that

$$\frac{\lambda_2}{Z_w}\, p_2 = K_2 R_2 K_1^{-1} \left( p_1 - \frac{K_1 C_2}{Z_w} \right). \qquad (5)$$

Analyzing this equation, it can be seen that the first factor K2 R2 K1^{-1} is equivalent to a 3 × 3 matrix and represents the desired homography transform. Let us now analyze the second factor of the factorized equation, i.e. (p1 − K1 C2 / Zw). This second factor projects the input pixel p1 onto an intermediate point pi = (xi, yi, 1)^T that is defined by

$$\lambda_i p_i = p_1 - \frac{K_1 C_2}{Z_w}, \qquad (6)$$

where λi defines a homogeneous scaling factor. It can be seen that this last operation performs the translation of the reference pixel p1 to the intermediate pixel pi. The translation vector can be expressed in homogeneous coordinates by

$$\lambda_i \begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix} = \begin{pmatrix} x_1 - t_1 \\ y_1 - t_2 \\ 1 - t_3 \end{pmatrix} \quad \text{with} \quad (t_1, t_2, t_3)^T = \frac{K_1 C_2}{Z_w}. \qquad (7)$$

Written in Euclidean coordinates, the intermediate pixel position is defined by

$$x_i = \frac{x_1 - t_1}{1 - t_3}, \qquad y_i = \frac{y_1 - t_2}{1 - t_3}. \qquad (8)$$
It can be noticed that this result basically involves a 2D texture-mapping operation, which can be further decomposed into a sequence of two 1D transformations. In practice, these two 1D transformations are performed first, along rows, and second, along columns. This class of warping methods is known as scanline algorithms [9]. An advantage of this additional decomposition is that a simpler 1D texture-mapping algorithm can be employed (as opposed to 2D texture-mapping algorithms).
The synthesis of the view using relief-texture mapping is summarized as follows; a code sketch of this pipeline is given below:
– Step 1: Perform warping of the reference depth-image along horizontal scanlines,
– Step 2: Perform warping of the (already horizontally-warped) depth-image along vertical scanlines,
– Step 3: Compute the planar projection of the intermediate depth-image using the homography transform defined by K2 R2 K1^{-1} (for fast computing, exploit the GPU).
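The following Python/NumPy sketch illustrates the idea behind Steps 1–3: the per-pixel pre-warp of Equations (6)–(8) followed by the homography of Equation (5). For brevity the pre-warp is written as a single 2D nearest-neighbor mapping instead of two true 1D scanline passes, and the final planar projection is only returned as a matrix; both simplifications are assumptions of this sketch, not part of the original algorithm.

```python
import numpy as np

def relief_texture_prewarp(Z, K1, C2):
    """Pre-warp: map each reference pixel p1 to the intermediate pixel p_i of
    Equation (8). Nearest-neighbor splatting replaces the real 1-D scanline
    passes in this sketch."""
    H, W = Z.shape
    t = (K1 @ np.asarray(C2, float)).reshape(3, 1) / Z.reshape(1, -1)  # (t1,t2,t3) per pixel
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    denom = np.maximum(1.0 - t[2], 1e-9)
    xi = (xs.reshape(-1) - t[0]) / denom
    yi = (ys.reshape(-1) - t[1]) / denom

    out = np.zeros_like(Z, dtype=float)
    xi = np.clip(np.round(xi).astype(int), 0, W - 1)
    yi = np.clip(np.round(yi).astype(int), 0, H - 1)
    out[yi, xi] = Z.reshape(-1)
    return out

def relief_texture_warp(Z, K1, K2, R2, C2):
    """Full relief-texture prediction: pre-warp followed by the planar
    projection (homography) of Equation (5)."""
    intermediate = relief_texture_prewarp(Z, K1, C2)
    H_matrix = K2 @ R2 @ np.linalg.inv(K1)   # homography to apply in Step 3
    return intermediate, H_matrix            # apply H_matrix with any 2-D texture mapper / GPU
```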
3 Incorporating Image Warping into an H.264 Encoder
We now propose a novel H.264 architecture dedicated to multiview coding that employs a block-based motion-prediction scheme and the previously explained image-warping prediction technique. To integrate both warping-based prediction and block-based motion prediction, we have first added to the H.264 block-based motion-prediction algorithm a warping-based image prediction procedure, with the aim to select one of the two according to some criterion. A disadvantage of such a multiview encoder is that the prediction error for the warping algorithm is not minimized, because high-quality warping does not necessarily lead to minimum prediction error. As a result, the compression efficiency is decreased. As an alternative to selecting between two predictors, we employ a combination of the two predictors: (a) the warping-based predictor followed by (b) the block-based motion predictor (see Figure 4). The system concept now becomes as follows. First, we provide an approximation of the predicted view using image warping and, second, we refine the warping-based prediction using block-based motion prediction. In the refinement stage, the search for matching blocks is performed in a region of limited size, e.g. 16×16 pixels. For comparison, the motion disparity between two neighboring views in the "Ballet" sequence can be as high as 64 × 64 pixels. Figure 4 shows an overview of the described coding architecture. Besides the compatibility with H.264 coding, the advantage of this approach is that the coding-mode selection can be performed for each image-block. More specifically, we employ three different coding modes in our multiview encoder. First, if the previously encoded depth-image Dt−1 provides an accurate prediction of an image-block, Dt−1 is selected as a reference. Alternatively, in the case the warped depth-image W(Dt−1) is sufficiently accurate, W(Dt−1) is selected as a reference. Third, in the case the image-block cannot be accurately predicted using either of the two previous prediction algorithms, the image-block is H.264 intra-coded as a fallback. This last case mostly occurs for occluded pixels that cannot be predicted with sufficient accuracy. To select the most appropriate coding mode, the same rate-distortion criterion that is employed in a standard H.264 encoder is used. Thus, the H.264 standard offers suitable coding modes and an appropriate predictor-selection criterion to handle the various prediction accuracies of our algorithm.
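Conceptually, the per-block predictor selection can be written as a standard Lagrangian rate-distortion decision, as in the sketch below. The mode names and the evaluation callbacks are illustrative placeholders for the corresponding H.264 mode-decision routines; they are not part of the H.264 specification itself.

```python
def select_coding_mode(block, candidates, lagrange_multiplier):
    """Choose the coding mode with minimum Lagrangian cost J = D + lambda * R.

    `candidates` maps a mode name ('intra', 'inter_motion', 'inter_warping',
    'skip') to a function returning (distortion, rate_in_bits) for `block`."""
    best_mode, best_cost = None, float('inf')
    for mode, evaluate in candidates.items():
        distortion, rate = evaluate(block)
        cost = distortion + lagrange_multiplier * rate
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost
```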
Fig. 4. Architecture of the extended H.264 encoder that adaptively employs the previously encoded depth image Dt−1 or the corresponding warped image W (Dt−1 ) as reference frames
To enable the H.264 encoder to use two different predictors, we employ two reference frames in the Decoded Picture Buffer (DPB) in the reconstruction loop: one reference for the warping-based prediction and a second for the block-based motion prediction (see Figure 4). However, the selection of the frame index in which each reference frame should be loaded in the DPB is important because of the following reason. In a standard H.264 encoder, the previously encoded frame (most correlated) is loaded in the DPB at index 0 and the "older" is available at index 1. This enables a "SKIP" coding mode that can be selected in the case the reference frame at index 0 in the DPB provides an accurate prediction. In this case, no quantized residual data or motion vectors are transmitted, thereby leading to a high coding efficiency. When using depth-images, our approach is to also load the most correlated depth-image in the reference frame buffers at index 0. Because the warping-based algorithm typically provides an accurate prediction, the warped depth-image should be loaded at index 0 while the previously encoded depth-image should be loaded at index 1 in the DPB. Consequently, a large number of image-blocks can be encoded using the "SKIP" coding mode (see Table 2). Table 1 shows a summary of the possible coding modes employed in the extended H.264 encoder.
4 Experimental Results
For evaluating the performance of the coding algorithm, experiments were carried out using the “Ballet” and “Breakdancers” depth-sequences. The presented experiments investigate the impact of depth-prediction across multiple views. To measure the efficiency of the block-based motion-prediction algorithm, the
Table 1. Summary of possible coding modes and their corresponding description

Coding Mode                Description
Intra                      Standard H.264 intra-coding
Inter-Block-Based-Motion   The previously encoded depth-image Dt−1 is selected as a reference. The image-block is H.264 inter-coded.
Inter-Warping              The warped depth-image W(Dt−1) is selected as a reference. The image-block is H.264 inter-coded.
Inter-Warping (SKIP mode)  The warped image provides a sufficiently accurate prediction such that the image-block is inter-coded, using the H.264 "SKIP" coding mode.
multiview depth-images were multiplexed and compressed by a standard H.264 encoder. To ensure that the temporal motion prediction does not interfere with the evaluation of the inter-view prediction algorithms, an intra-coded frame is inserted within each frame period. Figure 5 illustrates how the multiview depth-images are predicted using (1) block-based motion prediction only to obtain P-frames, or (2) an additional warping-based prediction to obtain Pw-frames.
Fig. 5. (a) The multiple depth-images are predicted using a block-based motion prediction to obtain H.264 P -frames. (b) Depth-images are predicted using a block-based motion prediction and a warping-based image prediction to obtain Pw -frames.
Let us now discuss the obtained coding results using the extended H.264 coder and the above-given prediction structures. We perform the compression of depth-images under four different conditions. Depth-images are predicted using block-based motion estimation and subsequently one of the four options:
1. no additional warping-based prediction, i.e. the original H.264 encoder ("Block-based prediction"), or
2. the 3D image-warping algorithm ("3D warping and block-based prediction"), or
3. the mesh-based rendering technique ("Triangular mesh and block-based prediction"), or
4. the relief-texture rendering algorithm ("Relief texture and block-based prediction").
To measure the efficiency of the warping-based predictive-coding algorithms, we have implemented and inserted the three warping-based prediction algorithms
in the H.264 encoder. As described in Section 3, the warping-based prediction is followed by a prediction-error minimization. In our implementation, this refinement-minimization step is carried out by the H.264 block-based motion-compensation over a region of 16 × 16 pixels. For coding experiments, we have employed the open-source H.264 encoder x264 [10]. The arithmetic coding algorithm CABAC was enabled for all experiments. For each sequence, the frame rate is 15 frames per second. Thus, the transmission of 8 views corresponds to a frame rate of 120 frames per second. Such a high frame rate explains the magnitude of the presented bitrates in Figure 6, ranging from approximately 500 kbit/s to 5.5 Mbit/s.
Fig. 6. Rate-distortion curves for encoding (a) the “Breakdancers” and (b) the “Ballet” depth-sequences
The rate-distortion curves of Figure 6(a) and Figure 6(b) were obtained under the above parameter settings. First, it can be observed that all proposed warping-based prediction algorithms consistently outperform the standard block-based motion-prediction scheme. For example, considering Figure 6(a), it can be seen that the triangular-mesh rendering algorithm described in Section 2.2 yields a quality improvement of up to 2.5 dB over the block-based motion-prediction algorithm at 1 Mbit/s for the "Breakdancers" sequence. Additionally, although the "Ballet" multiview depth-sequence shows large occluded regions, a depth-image warping-based prediction yields a quality improvement of up to 1.5 dB at a bitrate of 3 Mbit/s. Let us now consider the two rate-distortion curves denoted "3D warping and block-based prediction" in Figure 6. Although multiple heuristic techniques have been employed to perform the 3D image-warping, a limited loss of quality of about 0.4 dB was observed at a bitrate of 1 Mbit/s and 3 Mbit/s for the sequences "Breakdancers" and "Ballet", respectively. For a low-complexity encoder, it is therefore appropriate to employ the image-warping technique from Section 2.1. Finally, while it has been discussed [6] that the relief-texture image-warping algorithm may produce rendering artifacts along depth-discontinuities, coding experiments show no significant coding
difference between a prediction performed using a triangular mesh or relief texture mapping. Therefore, relief texture can be effectively employed in a hardware implementation. Observing Figure 7, it can be seen that occluded image-blocks at the right side of the two persons are intra-coded and sharp edges are encoded using a block-based motion prediction. Moreover, as can be noticed, the warping-based prediction provides a sufficiently accurate prediction in smooth areas. Because depth-images mainly consist of smooth regions, this coding mode is frequently selected. This observation is confirmed by the coding-mode selection statistics provided by Table 2.

Table 2. Coding-mode selection statistics using the triangular-mesh depth-image prediction

Coding Mode                 Breakdancers   Ballet
Intra                       8.1%           17.3%
Inter-Block-Based-Motion    4.3%           5.4%
Inter-Warping               3.3%           3.3%
Inter-Warping (SKIP mode)   84.3%          74.0%
Fig. 7. Magnified area of one encoded depth-image from the "Ballet" sequence indicating the coding-mode selection. Coding modes "Intra" and "Inter-Block-Based-Motion" are indicated by a vertical line and a backward diagonal line, respectively. The coding mode "Inter-Warping" occupies the remaining space.
5 Conclusions
We have presented a new algorithm for the compression of multiview depth-images. The algorithm is based on extending the H.264 prediction by adding a rather accurate image-warping predictor. This approach leads to an extended H.264 encoder where the image warping precedes the reference frame buffer in the reconstruction loop. Consequently, the depth-image is predicted using either (1) a block-based motion prediction or (2) an image-warping predictor followed
by a block-based motion-prediction refinement. The selection of the prediction algorithm is optimized for each image-block using a rate-distortion criterion. Three image-warping techniques with different computational complexity have been integrated into an H.264 encoder and evaluated. Experimental results show that the most accurate image-warping algorithm leads to a quality improvement of up to 2.5 dB over the block-based motion-prediction algorithm. Additionally, it was found that the simplified 3D image-warping technique could synthesize a sufficiently accurate prediction of depth-images such that a quality improvement of 2.1 dB was obtained. Therefore, the presented technique demonstrates that an adaptive selection of different predictors can be beneficially employed to improve the compression of multiview depth-sequences, with a minor extension of the H.264 encoder.
References
1. Shum, H.Y., Kang, S.B.: Review of image-based rendering techniques. In: Proceedings of SPIE, Visual Communications and Image Processing, vol. 4067, pp. 2–13 (2000)
2. Farin, D., Morvan, Y., de With, P.H.N.: View interpolation along a chain of weakly calibrated cameras. In: IEEE Workshop on Content Generation and Coding for 3D-Television. IEEE Computer Society Press, Los Alamitos (2006)
3. Seitz, S.M., Dyer, C.R.: View morphing. In: SIGGRAPH '96: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 21–30. ACM Press, New York (1996)
4. Zitnick, C.L., Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R.: High-quality video view interpolation using a layered representation. ACM Transactions on Graphics 23(3), 600–608 (2004)
5. Morvan, Y., Farin, D., de With, P.H.N.: Prediction of depth images across multiple views. In: Proceedings of SPIE, Stereoscopic Displays and Applications (2007)
6. Oliveira, M.M.: Relief Texture Mapping. Ph.D. Dissertation, UNC Computer Science (March 2000)
7. McMillan, L.: An Image-Based Approach to Three-Dimensional Computer Graphics. University of North Carolina (April 1997)
8. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2004)
9. Wolberg, G.: Digital Image Warping. IEEE Computer Society Press, Los Alamitos (1990)
10. x264: a free H.264/AVC encoder. http://developers.videolan.org/x264.html (last visited: March 2007)
Grass Detection for Picture Quality Enhancement of TV Video
Bahman Zafarifar and Peter H. N. de With
Eindhoven University of Technology, PO Box 513, 5600 MB, The Netherlands
{B.Zafarifar,P.H.N.de.With}@tue.nl
LogicaCMG, PO Box 7089, 5600 JB Eindhoven, The Netherlands
Philips Innovative Applications (CE), Pathoekeweg 11, 8000 Bruges, Belgium
Abstract. Current image enhancement in televisions can be improved if the image is analyzed, objects of interest are segmented, and each segment is processed with specifically optimized algorithms. In this paper we present an algorithm and feature model for segmenting grass areas in video sequences. The system employs adaptive color and position models for creating a coherent grass segmentation map. Compared with previously reported algorithms, our system shows significant improvements in spatial and temporal consistency of the results. This property makes the proposed system suitable for TV video applications.
1 Introduction
Image enhancements in current flat display TVs are performed globally (on the entire image), as in the conventional contrast and brightness adjustments, or locally (on a selected part of the image), as in sharpness enhancement, considering the local statistical properties of the image. For example, some enhancement filters operate along the edge axis, or select a partial set of pixels that are likely to be part of a single object [1]. The local adaptation is typically based on simple pictorial features of the direct neighborhood, rather than considering the true semantic meaning of the object at hand. It is therefore understandable that the obtained picture quality is sub-optimal as compared to a system that locally adapts the processing to the true nature of the objects. Object-based adaptation can be realized if the image is analyzed by a number of object detectors, after which objects are segmented and processed with optimized algorithms [2]. Having object detectors in a TV system also enables semantic-level applications such as indoor/outdoor classification, sports detection, semantic-based selection of the received or stored video, or aiding the emerging 3D-TV systems. Grass fields are frequently seen in TV video, especially in sports programs and outdoor scenes. At the pixel level, grass detection can be used for color shifting and sharpness enhancement, and preventing spurious side effects of other algorithms such as the unintended smoothing effect of noise reduction algorithms in grass areas, by dynamically adapting the settings of the noise filter. TV applications require that the detection results are pixel-accurate and spatially and temporally consistent, and that the algorithm allows for real-time
implementation in an embedded environment. Spatial consistency means that the segmentation results should not contain abrupt spatial changes when this is not imposed by the values of the actual image pixels. Video applications also demand that the segmentation results do not exhibit abrupt changes from frame to frame when the actual image does not contain such abrupt changes. We refer to the latter as temporal consistency. Our algorithm takes these requirements into account and produces a probabilistic grass segmentation map based on modeling the position and the color of grass areas. The remainder of the paper is organized as follows. In Section 2 we review the previously reported work on real-time grass segmentation for TV applications. Section 3 discusses the properties of grass fields and the requirements of TV applications, Section 4 describes the proposed algorithm, Section 5 presents the results and Section 6 concludes the paper.
2 Related Work
Previously reported work on grass detection for real-time video enhancement includes a method [3] that is based on pixel-level color and texture features. The color feature is in the form of a 3D Gaussian function in the YUV color space, and the texture feature uses the root-mean-square of the luminance component. These two features are combined to form a pixel-based continuous grass-probability function. Due to the pixel-based approach of this method, the resulting segmentation contains significant noise-like local variations, caused by the changing texture characteristics in grass fields. As a result, a post-processed image using this method can contain artifacts due to the mentioned local variations in the segmentation map. As a solution to this problem, [4] proposes to average the results of a pixel-based color-only grass-detection system using blocks of 8×8 pixels. The obtained average values are then classified to grass/no-grass classes using a noise-dependent binary threshold level. Although the applied averaging alleviates the previously mentioned problem of pixel-level local variations in the segmentation map, the proposed hard segmentation causes a different type of variations in the segmentation result, namely in the form of the nervousness of the resulting 8×8 pixel areas. Such hard segmentation is obviously inadequate for applications like color shifting. Even for less demanding applications like noise reduction, we
Fig. 1. Overview of the proposed system: starting with image analysis, followed by modeling the color and position of grass areas, and finally segmenting the grass pixels
found that the hard segmentation leads to visible artifacts in the post-processed moving sequences. We propose a system that builds upon the above-mentioned methods, thereby benefiting from their suitability for real-time implementation, while considerably improving the spatial and temporal consistency of the segmentation results. The proposed system (Fig. 1) performs a multi-scale analysis of the image using color and texture features, and creates models for the color and the position of the grass areas. These models are then used for computing a refined pixel-accurate segmentation map when such accuracy is required by the application.
3 Design Considerations
3.1 Observation of Grass Properties
Grass fields can take a variety of colors, between different frames or even within a frame. The color depends on the type of vegetation, illumination and shadows, patterns left by lawn mowers, camera color settings, and so on. Consequently, attempting to detect grass areas of all appearances is likely to result in a system that erroneously classifies many non-grass objects as grass (false positives). For this reason, we have limited ourselves to green-colored grass (commonly seen in sport videos). Despite having chosen a certain type of grass, the color can still vary due to shadows. We address this by accounting for color variations within the image, with a spatially-adaptive color model that adapts to the color of an initial estimate of grass areas. The typical grass texture is given by significant changes in pixel values. The variations are most prominent in the luminance (Y component in YUV color space), and exist far less in the chrominance (U and V) components (see Fig. 3). This high-frequency information in chrominance components is further suppressed by the limited chrominance bandwidth in recording and signal transmission systems [5]. To make matters worse, the chrominance bandwidth limitation in digitally coded sources often leads to blocking artifacts in the chrominance values of the reconstructed image, resulting in spurious texture when the chrominance components are used for texture analysis. Therefore, we use only the luminance component for texture analysis. The characteristics of grass texture vary within a frame, based on the distance of the grass field to the camera, camera focus and camera motion. To capture a large variety of grass texture, we employ a multi-scale analysis approach. Grass texture can vary locally due to shadows caused by other grass leaves, or due to a local decrease in the quality of the received signal (blocking artifacts or lack of high frequency components). Therefore, we perform a smoothing operation on the created models to prevent the mentioned local texture variations from abruptly influencing the segmentation result.
3.2 Application Requirements and Implementation Considerations
Our primary target is to use our grass detector for high-end TV applications, such as content-based picture quality improvement. This means that the algorithm should allow for real-time operation, that it should be suitable for implementation on a resource-constrained embedded platform, and that the detection results should be spatially and temporally consistent to avoid artifacts in the post-processed image. We have considered the above-mentioned issues in the design of our algorithm.
– Firstly, we have chosen filters that produce spatially consistent results and yield smooth transitions in the color and position models.
– Secondly, we have avoided using image-processing techniques that require random access to image data. This allows for implementation of the algorithm in a pixel-synchronous system. The reason behind this choice is that video-processing systems are often constructed as a chain of processing blocks, each block providing the following one with a constant stream of data, rather than having random memory access.
– Thirdly, we have avoided processing techniques that need large frame memories for (temporary) storage of the results. For example, the results of the multi-scale analysis are directly downscaled to a low resolution (16 times lower than input resolution), without having to store intermediate information.
– Lastly, we perform the computationally demanding operations, such as calculations involved in model creation, in the mentioned lower resolution. This significantly decreases the amount of required computations.
4 Algorithm Description
In this section, we describe the proposed system in detail. The system is comprised of three main stages, as shown in Fig. 1. The Image Analysis stage computes a first estimate of the grass areas. We call this the initial probability of grass. Using this initial probability, we create two smooth models in the Modeling stage for the color and the position of the grass areas. While the position model can be directly used for certain applications like adaptive noise reduction or sharpness enhancement, other applications, such as color shifting, require a pixel-accurate soft segmentation map. The Segmentation stage calculates this pixel-accurate final segmentation map, using the created color and position models and the image pixel values. The following sections elaborate on the mentioned three stages.
4.1 Image Analysis
In Section 3, we observed that grass areas can take a variety of colors due to illumination differences (shadows, and direct or indirect sunlight). RGB and YUV are the two common color formats in TV systems. In an RGB color system,
Fig. 2. Schematic overview of image analysis stage. The initial grass probability is calculated for the image in three scales. The results are downscaled and combined to produce the multi-scale initial grass probability.
each component is a function of both chrominance and luminance, while the luminance and chrominance information in a YUV color system are orthogonal to each other. This means that the UV components are less subject to illumination, and therefore we chose the YUV color system for image analysis. Color: Despite the inherent separation of luminance and chrominance information in the YUV color format, we observed a slight correlation between the luminance and chrominance components for grass areas. Figure 3 depicts the histograms of grass-pixel values in the YUV domain, where the correlation between luminance and chrominance can be seen in the left-most (YU) graph. Our purpose is to approximate this cloud of pixels, using a 3D Gaussian function. This is done by estimating the parameters of this 3D Gaussian using Principal Component Analysis in the training phase. The parameters consist of the center (mean grass color), the orientation of the main axes and the variance along these axes. During the analysis phase, the pixel values (Y, U, V) are translated by the mentioned mean grass color, and rotated by the axes angles to create the transformed values Yr, Ur, Vr. The color probability (Pcolor) is then computed by

$$P_{color} = e^{-\left[ \left( \frac{Y_r}{\sigma_{y1}} \right)^2 + \left( \frac{U_r}{\sigma_{u1}} \right)^2 + \left( \frac{V_r}{\sigma_{v1}} \right)^2 \right]}, \qquad (1)$$
where σy1 , σu1 and σv1 are the standard deviations of the corresponding axes. Texture: Texture is a frequently-used feature in image-segmentation applications [6]. In case of grass detection, the texture feature helps in distinguishing
Fig. 3. Histogram of grass-pixel values in the YUV domain, taken over grass areas of a training set, including cloudy, sunny and shadow conditions. Left: U vs. Y, Middle: V vs. Y, Right: : U vs. V.
grass areas from other green objects. In Section 3.1 we motivated the choice of the luminance component for texture analysis. We found that grass has a random, noise-like texture and does not show any unique spatial regularity. In fact, we did not find a way for general distinction between the grass texture and the image noise. Therefore, we subtract the texture measured from image noise from the total measured texture in our texture calculation. As a result, the grass texture can be masked by image noise when the amount of noise exceeds the measured grass texture. For this reason, the texture feature is only useful for images containing a moderate amount of noise. Additionally, the texture feature will provide little information when grass images are taken from a very far distance, or when the quality of the video material is low. Despite these limitations, texture was found to be a useful feature for separating grass from smooth grass-colored surfaces. As texture measure, we use the Sum of Absolute Differences (SAD) between adjacent pixels in a 5×5 pixels analysis window. The texture metric PSAD is calculated as

$$\mathrm{SAD}_{hor}(r,c) = \sum_{i=-w}^{w} \sum_{j=-w}^{w-1} \left| Y(r+i,\,c+j) - Y(r+i,\,c+j+1) \right|,$$

$$\mathrm{SAD}_{ver}(r,c) = \sum_{i=-w}^{w-1} \sum_{j=-w}^{w} \left| Y(r+i,\,c+j) - Y(r+i+1,\,c+j) \right|,$$

$$P_{SAD} = \frac{\mathrm{SAD}_{hor} + \mathrm{SAD}_{ver} - T_{SAD}}{N_{SAD}}, \qquad (2)$$
where SADhor and SADver are the horizontal and vertical SADs respectively, and TSAD is a noise-dependent threshold level. Further, r and c are the coordinates of the pixel under process, w defines the size of the analysis window, and factor 1/NSAD normalizes the SAD to the window size. PSAD is further clipped and normalized to a maximum value so that it has the nature of a probability (Ptexture ). In the remainder of this paper, we will refer to Ptexture as a probability.
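A straightforward (and deliberately slow) reference implementation of Equation (2) is sketched below in Python/NumPy. The normalization constant, the clipping maximum and the untouched border are implementation assumptions; the paper does not specify them.

```python
import numpy as np

def texture_probability(Y, w=2, t_sad=0.0, p_max=255.0):
    """Per-pixel texture probability following Equation (2) for a
    (2w+1) x (2w+1) analysis window (w=2 gives the 5x5 window of the paper)."""
    Y = Y.astype(float)
    H, W = Y.shape
    n_sad = 2 * (2 * w + 1) * (2 * w)        # number of absolute differences
    p = np.zeros((H, W))
    for r in range(w, H - w):
        for c in range(w, W - w):
            win = Y[r - w:r + w + 1, c - w:c + w + 1]
            sad_h = np.abs(np.diff(win, axis=1)).sum()   # horizontal SAD
            sad_v = np.abs(np.diff(win, axis=0)).sum()   # vertical SAD
            p[r, c] = (sad_h + sad_v - t_sad) / n_sad
    return np.clip(p / p_max, 0.0, 1.0)      # clip and normalize to [0, 1]
```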
Fig. 4. Modeling and Segmentation stages of the algorithm. Left - Modeling: creating the color and the position models using the initial grass probability. Right - Segmentation: pixel-accurate soft segmentation of grass areas.
Multi-scale Analysis: In Section 3 we observed that the grass texture contains local variations caused by the camera focus, shadows and local image-quality differences (in digitally coded material). In order to capture the grass texture under these different conditions, we have adopted a multi-scale (multi-resolution) image-analysis approach. Using multi-scale analysis, the texture that is not captured in one analysis scale may still be captured in another scale. Figure 2 depicts the mentioned multi-scale image analysis. Here, the initial grass probability is calculated for three different scales of the image, the image in each scale being half the size of the image in the previous scale. The resulting grass probabilities (Initial prob. S01, S02, S04 in Fig. 2) are then downscaled to a common resolution (Initial prob. S01@S16, S02@S16, S04@S16 at the right-hand side in Fig. 2) and combined together using the Maximum operation (MAX block in Fig. 2) to produce the multi-scale initial grass probability (Initial prob. MS@S16 in Fig. 2). The reason for downscaling is to limit the computation and memory requirements in the modeling stage. The downscale factor (16) was chosen as a tradeoff between lower computation and memory requirements, and spatial resolution of the models, when the input image has Standard Definition resolution. Three scales of analysis proved to be sufficient for capturing the grass texture. Using lower resolutions for image analysis will lead to a reduced spatial resolution of the initial grass probability, causing spatial inaccuracy of the position- and color models and the eventual segmentation map. We have considered several measures to reduce the computational complexity and the required memory. Firstly, the calculated initial probabilities of all scales are directly downscaled to a low common resolution (S16 in Fig. 2). Secondly, by avoiding the need to store the intermediate (higher resolution) results in the memory, we achieve a high memory efficiency. Thirdly, the modeling stage operates on lower resolution images, which considerably decreases the amount of required computations. For improving the performance of the aforementioned downscaling of the initial probabilities, we use a linear-filtering operation that works as follows. A pixel in the higher-resolution image (the input of the downscaled block) will affect the
values of nine pixels of the low-resolution image according to a linear weighting function. The weight is proportional to the distance between the position of the high-resolution pixel and the centers of the low-resolution pixels. The downscaled image obtained by this filtering method proved to be much more suitable for moving video material, as compared to block averaging.
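A sketch of such a linear downscaling filter is given below, assuming that every high-resolution pixel contributes to its 3 × 3 nearest low-resolution cells with a tent-shaped weight based on the distance to each cell center; the exact weighting function of the paper is not specified, so this is only one plausible realization.

```python
import numpy as np

def linear_downscale(prob, factor=16):
    """Downscale a probability map by `factor`, letting each input pixel
    contribute to the 3 x 3 nearest low-resolution cells with a linear (tent)
    weight derived from the distance to each cell center."""
    H, W = prob.shape
    h, w = H // factor, W // factor
    acc = np.zeros((h, w))
    wgt = np.zeros((h, w))
    ys, xs = np.mgrid[0:H, 0:W]
    cy, cx = ys // factor, xs // factor            # nearest low-resolution cell
    fy, fx = (ys + 0.5) / factor, (xs + 0.5) / factor
    for oy in (-1, 0, 1):
        for ox in (-1, 0, 1):
            yy = np.clip(cy + oy, 0, h - 1)
            xx = np.clip(cx + ox, 0, w - 1)
            wy = np.maximum(0.0, 1.0 - np.abs(fy - (cy + oy + 0.5)))
            wx = np.maximum(0.0, 1.0 - np.abs(fx - (cx + ox + 0.5)))
            np.add.at(acc, (yy, xx), prob * wy * wx)
            np.add.at(wgt, (yy, xx), wy * wx)
    return acc / np.maximum(wgt, 1e-9)
```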
4.2 Modeling Grass
Color Model: In Section 3 we noticed that the grass is subject to different illumination conditions. Using fixed color-centers for the final color feature (Fig. 4-right) will lead to partial rejection of grass areas of which the color significantly deviates from the color centers. We found that a better result can be achieved by accounting for the color variation within an image using a spatially-adaptive color model. The model in fact prescribes the expected color of the grass for each image position. To this end, each color component (Y, U, and V) of the image is modeled by a matrix of values of which the dimensions are 16 times smaller than the input image resolution. Each matrix is fitted to the corresponding color component of the image using an adaptively weighted Gaussian filter that takes the initial grass probability as a weight. The calculation steps are as follows. First, the image is downscaled to the size of the model, using color-adaptive filtering (denoted as YUV Pcolor-adaptive @S16 in Fig. 4-left). The color-adaptive filter reduces the influence of outliers, such as extremely bright pixels caused by glare of the sun, on the values of the downscaled image. The downscaled luminance component $\bar{Y}(r,c)$ is given by

$$\bar{Y}(r,c) = \frac{\sum_{i=0}^{15} \sum_{j=0}^{15} Y_{S01}(16r+i,\,16c+j) \times P_{colorS01}(16r+i,\,16c+j)}{\sum_{i=0}^{15} \sum_{j=0}^{15} P_{colorS01}(16r+i,\,16c+j)}, \qquad (3)$$

where Y_{S01} is the luminance component at the input resolution, P_{colorS01} is the color probability at the input resolution, and r and c are the position-indices of the downscaled image. Next, the color model is computed, using the downscaled representations, by (we present only the Y model, M_Y)

$$M_Y(r,c) = \frac{\sum_{i=-h}^{h} \sum_{j=-w}^{w} \bar{Y}(r+i,\,c+j) \times P_{grassInit}(r+i,\,c+j) \times G(i,j)}{\sum_{i=-h}^{h} \sum_{j=-w}^{w} P_{grassInit}(r+i,\,c+j) \times G(i,j)}, \qquad (4)$$

where $\bar{Y}$ is the downscaled luminance component, P_{grassInit} is the initial grass probability, G is a 2D Gaussian kernel, h and w are the model dimensions, and r and c are the model position-indices.
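The sketch below shows how the luminance part of the color model (Equation (4)) can be fitted; Y_ds is assumed to be the color-adaptive downscaled luminance of Equation (3). The kernel size, the sigma and the small epsilon that avoids division by zero in grass-free regions are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def gaussian_kernel(half, sigma):
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def fit_color_model(Y_ds, p_grass_init, half=4, sigma=2.0, eps=1e-9):
    """Fit M_Y of Equation (4): a Gaussian-weighted average of the downscaled
    luminance, additionally weighted by the initial grass probability."""
    G = gaussian_kernel(half, sigma)
    H, W = Y_ds.shape
    Yp = np.pad(Y_ds, half, mode='edge')
    Pp = np.pad(p_grass_init, half, mode='edge')
    M = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            win_y = Yp[r:r + 2 * half + 1, c:c + 2 * half + 1]
            win_p = Pp[r:r + 2 * half + 1, c:c + 2 * half + 1]
            weights = win_p * G
            M[r, c] = (win_y * weights).sum() / (weights.sum() + eps)
    return M   # the U and V models are fitted in exactly the same way
```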
Position Model: We noted in Section 3 that the texture of grass fields contains micro-level variations. Achieving a spatially-consistent detection result requires filtering of these local texture variations. Therefore, we model the positional probability of the grass areas using a smooth position model. The position model M_{position} is obtained by filtering the initial grass probability P_{grassInit} using a Gaussian kernel G as

$$M_{position}(r,c) = \frac{\sum_{i=-l}^{l} \sum_{j=-l}^{l} P_{grassInit}(r+i,\,c+j) \times G(i,j)}{\sum_{i=-l}^{l} \sum_{j=-l}^{l} G(i,j)}, \qquad (5)$$

where l is the size of the Gaussian kernel, and r and c are the model position-indices. The above-mentioned filtering procedures (Eqns. (3), (4) and (5)) use the computationally demanding division operation. However, the total amount of computations is significantly reduced thanks to the small dimensions of the models (16 times smaller than the input resolution, in both horizontal and vertical dimensions). Furthermore, to achieve a better temporal stability for moving images, we employ recursive temporal filtering while computing the models.
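The position model of Equation (5) and the recursive temporal filtering can be sketched as follows; the kernel parameters and the recursion coefficient alpha are assumptions, since the paper does not give their values.

```python
import numpy as np

def update_position_model(p_grass_init, prev_model=None, l=4, sigma=2.0, alpha=0.25):
    """Position model of Equation (5): normalized Gaussian filtering of the
    initial grass probability, followed by a simple recursive temporal filter."""
    ys, xs = np.mgrid[-l:l + 1, -l:l + 1]
    G = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))
    G /= G.sum()
    H, W = p_grass_init.shape
    Pp = np.pad(p_grass_init, l, mode='edge')
    model = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            model[r, c] = (Pp[r:r + 2 * l + 1, c:c + 2 * l + 1] * G).sum()
    if prev_model is not None:
        model = alpha * model + (1.0 - alpha) * prev_model   # temporal recursion
    return model
```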
4.3 Segmentation
When the position model is upscaled to the input image resolution, it produces a map indicating the positional probability of grass for all image positions. This probability map can be directly used for applications like adaptive noise reduction or sharpness enhancement. Other applications, such as color enhancement, may require a pixel-accurate segmentation map, which can be computed as (Fig. 4-right)

$$P_{grassFinal} = P_{colorFinal} \times P_{position}. \qquad (6)$$
Here, Pposition denotes the upscaled version of the position model. PcolorF inal is the pixel-accurate final color probability, computed by a 3D Gaussian probability function that uses the YUV values of the image at the input resolution. In contrast to the color feature used in the image analysis-stage (Eqn. (1)), the center of the 3D Gaussian is not fixed here, but defined by the upscaled version of the spatially varying color model. The standard deviations of the 3D Gaussian are smaller than those applied in the image-analysis stage, which helps in reducing false acceptance of non-grass objects. Further, the texture measure has been excluded in the final grass probability to improve the spatial consistency of the detection. As can be seen in Fig. 4-right, the color and the position models are upscaled (interpolated) by a bi-linear filter prior to being used for determining the color probability. This interpolation is performed on-the-fly, without storing the upscaled images in a memory.
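A compact sketch of the final segmentation of Equation (6) is given below. The inputs MY, MU, MV and p_position are assumed to be the color and position models already upscaled (bilinearly) to the input resolution, the standard deviations are illustrative, and the axis-aligned form of the 3D Gaussian is a simplification of this sketch.

```python
import numpy as np

def final_segmentation(Y, U, V, MY, MU, MV, p_position, sigmas=(24.0, 8.0, 8.0)):
    """Pixel-accurate grass probability of Equation (6): a per-pixel 3-D
    Gaussian around the upscaled color model, multiplied by the upscaled
    position model."""
    sy, su, sv = sigmas
    p_color_final = np.exp(-(((Y - MY) / sy) ** 2 +
                             ((U - MU) / su) ** 2 +
                             ((V - MV) / sv) ** 2))
    return p_color_final * p_position
```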
Fig. 5. Results comparison. Left: input, Middle: proposed in [4], Right: our proposal.
5 Experimental Results and Performance Discussion
The proposed algorithm can be trained for detecting grass of a certain color range by choosing appropriate parameters for the color feature. For obtaining these parameters for green-colored grass, we manually annotated the grass areas in 36 images, which were captured under different illumination conditions such as under cloudy and sunny sky, or with and without shadows. Using Principal Component Analysis, we obtained the center, the orientation and the standard deviations of the three axes of the 3D Gaussian envelope around the annotated grass pixels (see Fig. 3). We applied the trained algorithm to a test set containing 50 still images and 5 moving sequences, visually inspected the results and made a side-by-side comparison with the algorithm proposed in [4]. The reason for this subjective comparison is that we aim at an algorithm having a high spatial and temporal consistency in the detection result, and at present, there is no metric for such a performance requirement. Compared with the existing algorithms, we observed a significant improvement in the spatial and temporal consistency of the segmentation results, and improved detection results in images containing grass with different illuminations. We also found the proposed smooth probabilistic segmentation map to be more adequate for image post-processing applications. In the following, we discuss a few examples of the results. Figure 5 compares the results of our proposal with that of [4]. We can see in the middle column that the existing algorithm detects some tree areas as grass (false positives). Similarly, false positives are found in the ground areas in the middle of the grass field. Our proposal shows a clear improvement in these areas. The improvement is due to a more compact modeling of the grass color values, using the PCA analysis.
Fig. 6. Results comparison. Left: input, Middle: proposed in [4], Right: our proposal.
Fig. 7. Results of the spatially-adaptive color model and the smooth position model. Top-Left: input image, Top-middle: the position model, Top-right: the color model, Bottom-left: segmentation result using fixed color model, Bottom-middle: segmentation result using spatially adaptive color model, Bottom-right: result of the existing algorithm.
Figure 6 portrays a more complex scene, which is difficult for both algorithms. First, we notice the false positives of the existing algorithm in the flower garden, whereas these small green objects are filtered out in our proposal owing to the smooth position model. Second, we notice that both algorithms have problems with the tree areas at the top of the picture. Such false positives occur in our algorithm on large, green textured areas (tree leaves). Lastly, we notice that our algorithm produces lower probabilities in the smooth grass area at the top-right side of the image, resulting in missing grass detection in that area. This is due to the absence of texture in these areas. This false negative is not in the form of abrupt changes, making the consequences less severe.
698
B. Zafarifar and P.H.N. de With
deteriorated detection in the shadow, our algorithm (Bottom-middle) preserves a positive detection of grass, albeit at a lower probability.
6
Conclusion
We have presented an algorithm for consistent detection of grass areas for TV applications, with the aim to improve the quality in the grass areas in the image. For such applications, it is of utmost importance that the image segmentation results are both spatially and temporally coherent. Not complying with this requirement would lead to artifacts in the post-processed video. To achieve this, we have modeled the grass areas using a spatially adaptive color model and a smooth position model. The color model accounts for the large color range of the grass areas within the image, which occurs particularly when the image contains both sunny and shadowed parts. The position model ensures that local variations of the grass texture do not abruptly influence the segmentation result. Furthermore, a multi-scale image analysis approach helps in capturing different appearances of grass. When compared to an existing algorithm, our system shows significant improvements in spatial and temporal consistency of the segmentation result. During the algorithm design, we kept the limitations of an embedded TV platform into account. As such, we avoid the need for storing intermediate results by directly downscaling the analysis results to a low resolution, and by performing the more complex computations at this low resolution. This approach decreases the memory and computation requirements. Furthermore, the algorithm is suitable for implementation in a pixel-synchronous video platform. This is due to our choice for analysis and modeling techniques which have a regular memory access and deterministic computation requirement, as compared to techniques that require random access to image data, or exhibit a variable computation demand.
Acknowledgement The authors gratefully acknowledge Dr. Erwin Bellers and Stephen Herman for their specific input on the existing algorithms for real-time grass detection.
References 1. de Haan, G.: Video Processing for Multimedia Systems. University Press, Eindhoven (2000) 2. Herman, S., Janssen, J.: System and method for performing segmentation-based enhancements of a video image, European Patent EP 1 374 563, date of publication (January 2004) 3. Herman, S., Janssen, J.: Automatic segmentation-based grass detection for real-time video, European Patent EP 1 374 170, date of publication (January 2004) 4. Herman, S., Bellers, E.: Image segmentation based on block averaging, United States Patent US 2006/0072842 A1, date of publication (April 2006) 5. Netravali, A., Haskell, B., Puri, A.: Digital Video: an Introduction to MPEG-2. International Thompson Publishing (1997) 6. Alan, C.: Handbook of Image and Video Processing. Academic Press, London (2000)
Exploitation of Combined Scalability in Scalable H.264/AVC Bitstreams by Using an MPEG-21 XML-Driven Framework Davy De Schrijver, Wesley De Neve, Koen De Wolf, Davy Van Deursen, and Rik Van de Walle Department of Electronics and Information Systems – Multimedia Lab Ghent University – IBBT Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium
[email protected]
Abstract. The heterogeneity in contemporary multimedia environments requires a format-agnostic adaptation framework for the consumption of digital video content. Preferably, scalable bitstreams are used in order to satisfy as many circumstances as possible. In this paper, the scalable extension of the H.264/AVC specification is used to obtain the parent bitstreams. The adaptation along the combined scalability axis of the bitstreams must occur in a format-independent manner; therefore, an abstraction layer of the bitstream is needed. In this paper, XML descriptions are used to represent the high-level structure of the bitstreams by relying on the MPEG-21 Bitstream Syntax Description Language standard. The adaptation process is executed in the XML domain by transforming the XML descriptions considering the usage environment. Such an adaptation engine is discussed in this paper, in which all communication is based on XML descriptions without knowledge of the underlying coding format. From the performance measurements, one can conclude that the transformations in the XML domain and the generation of the corresponding adapted bitstream can be realized in real time.
1 Introduction
Nowadays, digital video content can be accessed by different users in heterogeneous environments. Two components, in particular scalable bitstreams and a format-agnostic adaptation framework, are needed in order to control the huge diversity in content and resource constraints such as terminal capabilities, bandwidth, and user preferences. In this paper, both technologies are brought together to adapt the scalable bitstreams by making use of a format-agnostic engine. The aim of Scalable Video Coding (SVC) is to encode a video sequence once, after which the generated bitstream can be adapted by using simple truncation operations. These operations make it possible to extract bitstreams containing a lower frame rate, spatial resolution, and/or visual quality from the parent bitstream. To realize this goal, an SVC bitstream will contain three embedded scalability axes (temporal, spatial, and SNR) along which adaptations can be
executed. Every scalability axis is independently accessible but it is also possible to adapt the bitstream by truncating along multiple axes at the same time. This results in combined scalability and this type of scalability will be exploited in this paper. Hereby, we will make use of bitstreams compliant with the Joint Scalable Video Model (JSVM) version 4 specification. The scalable bitstreams will be adapted by a format-independent engine. Therefore, we will describe the high-level structure of the bitstreams in the Extensible Markup Language (XML). The XML descriptions will be generated by relying on the MPEG-21 Bitstream Syntax Description Language (MPEG-21 BSDL, [1]) framework. In this paper, we will describe the generation of the XML descriptions for our JSVM-encoded bitstreams. This gives us the possibility to shift the focus of the content customization process to the XML domain. The adaptation process in the XML domain can be realized by a transformation engine without knowledge of the underlying coding format. Such an engine typically takes a stylesheet representing the transformation actions as input. Here, we will make use of Streaming Transformations for XML (STX, [2]). We will pay special attention to the implementation of a stylesheet that exploits the combined scalability characteristic of JSVM-encoded bitstreams. An adaptation engine will be proposed in which all communication is based on XML descriptions and in which the adaptation is executed without knowledge of the underlying coding format. The outline of this paper is as follows. In Sect. 2, MPEG-21 BSDL is explained in order to generate XML descriptions of the scalable bitstreams used. The creation of the scalable bitstreams is discussed in Sect. 3. Section 4 describes the adaptation process in the XML domain. More precisely, the STX stylesheet implementing the combined scalability is discussed. A complete XML-driven framework, in which the adaptation engine is format-agnostic, is sketched in Sect. 5. The performance results of such an XML-driven framework for video content adaptation are provided in Sect. 6. Finally, a conclusion is given in Sect. 7.
2 MPEG-21 Bitstream Syntax Description Language
The MPEG-21 Digital Item Adaptation (DIA) specification enables the adaptation of multimedia content in heterogeneous environments. One of the building blocks of DIA is MPEG-21 BSDL. This language makes it possible to build an interoperable description-driven framework in which multimedia content can be adapted in a format-agnostic manner [3]. In Fig. 1, an overview of a BSDL-driven framework for video content adaptation is given. Such a framework is based on automatically generated XML descriptions containing information about the high-level structure of bitstreams. The high-level structure of a coding format is established in a Bitstream Syntax Schema (BS Schema), which is constructed by using MPEG-21 BSDL. As a result, a generic software module, the BintoBSD Parser, can be used to generate the XML descriptions. A (scalable) bitstream is given to the BintoBSD Parser after which it
[Fig. 1 depicts the BSDL chain: the encoder produces a scalable bitstream, which is fed together with the BS Schema to the BintoBSD Parser; the parser produces an XML description (BSD) whose payload references consist of a start byte and a length; the BSD is transformed (e.g., removing odd frames) according to a usage environment description, and the BSDtoBin Parser combines the adapted XML description (BSD') with the original bitstream to generate the adapted bitstream for the decoder.]
Fig. 1. Overview of the MPEG-21 BSDL framework for video content adaptation
generates an XML description by interpreting the corresponding BS Schema containing the structure of the coding format. The generated XML description is called a Bitstream Syntax Description (BSD) in MPEG-21. In Fig. 1, one can observe that the generated BSD contains syntax values, as well as references to data blocks in the original bitstream (by using the start byte and the length of the block). These references lead to the high-level nature of the BSDs. Once a BSD is available, it can be transformed considering the usage environment characteristics, such as the available bandwidth, screen resolution, or CPU power. How the transformation of the BSD should be performed is not standardized. For example, one can make use of Extensible Stylesheet Language Transformations (XSLT) or STX in order to execute the transformation. In this paper, we have chosen STX because of its streaming capabilities, low memory footprint, and relatively fast execution times [4]. In the example depicted in Fig. 1, the odd frames are removed by simple removal operations in the XML domain and the affected syntax element (i.e., num_frame) is adapted by a replace operation. The last step in the framework is the generation of the adapted bitstream. This process is executed by the BSDtoBin Parser. The functioning of this parser is also described in the DIA specification, again resulting in a generic software module. The BSDtoBin Parser takes as input the adapted BSD, the corresponding BS Schema, and (mostly) the original bitstream. After its generation, the adapted bitstream can be decoded and rendered on the desired device. The BSDL framework can be used in an XML-driven format-independent content adaptation engine in which all decisions and communications are based on XML documents. Such a complete framework will be sketched in Sect. 5.
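To make the byte-range mechanism concrete, a deliberately simplified sketch of the bitstream-regeneration step is given below (Python; this is not the generic BSDtoBin module of the reference software, it only copies the byte ranges referenced by slice_payload elements that carry "start length" values as in Fig. 1, and it omits the re-encoding of modified syntax element values according to the BS Schema).

import xml.etree.ElementTree as ET

def bsd_to_bin(adapted_bsd_path, original_bitstream_path, output_path):
    # Walk the (adapted) BSD in document order and copy every referenced
    # byte range from the original bitstream into the adapted bitstream.
    with open(original_bitstream_path, "rb") as f:
        data = f.read()
    with open(output_path, "wb") as out:
        for elem in ET.parse(adapted_bsd_path).getroot().iter():
            if elem.tag == "slice_payload" and elem.text:
                start, length = (int(v) for v in elem.text.split())
                out.write(data[start:start + length])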
3 Scalable Extension of H.264/MPEG-4 AVC
3.1 Generation of an Embedded Scalable Bitstream
The video coding specification used, in particular JSVM, is an extension of the non-scalable single-layered H.264/MPEG-4 Advanced Video Coding scheme (H.264/AVC, [5]). Consequently, a JSVM decoder can decode H.264/AVC
bitstreams and the base layer of a scalable bitstream should be compliant with H.264/AVC. Note that the fundamental building blocks of JSVM bitstreams are Network Abstraction Layer Units (NALUs), as in H.264/AVC bitstreams. Figure 2 shows the high-level structure of a JSVM encoder providing three spatial levels. The original high-resolution video sequence has to be downscaled in order to obtain spatial scalability and the different spatial layers. Every spatial layer contains a core H.264/AVC encoder extended with inter-layer prediction and SNR scalability capabilities. Each core encoder introduces the temporal and SNR scalability axes and minimizes the redundancy in the video input sources.
[Fig. 2 shows three core encoders, one per spatial layer; the input is successively reduced by 2D spatial reduction, inter-layer prediction connects the layers, and the outputs are multiplexed into the scalable bitstream.]
Fig. 2. Structure of a JSVM encoder providing three spatial levels
3.2 Temporal, Spatial, and SNR Scalability
In each spatial layer, a temporal decomposition is performed resulting in the achievement of temporal scalability. In the JSVM, hierarchical B pictures are employed to obtain a pyramidal decomposition and to remove temporal redundancy at the same time. Hierarchical B pictures are a special case of the general concept of sub-sequences and sub-sequence layers in H.264/AVC [6]. In JSVM, a Group of Pictures (GOP) is built by taking a key picture and all pictures that are temporally located between the key picture and the previous key picture. A key picture can be intra-coded or inter-coded using previous key pictures as reference for motion compensated prediction. Figure 3 illustrates a dyadic hierarchical coding scheme based on B pictures. Dyadic means that every temporal enhancement layer contains as many pictures as the summation of all pictures of the lower layers (resulting in a reduction of a frame rate ratio of 2 when an enhancement layer is removed). The temporal decomposition has to be executed for every spatial layer resulting in a motion field for every layer. When these motion fields are highly correlated, scalable coding of these fields is greatly recommendable. Because of the similarities between the motion fields of different spatial layers, one can expect that the corresponding residual pictures also show a high resemblance. In JSVM, a (bilinear) interpolation filter can be used for upsampling a residual frame to predict the corresponding residual frame of the higher resolution layer. After the temporal decomposition, every spatial layer contains residual frames resulting from intra-frame, inter-frame, or inter-layer prediction. These 2D signals still contain a lot of spatial redundancy, which can be further reduced
[Fig. 3 shows a GOP of 8 pictures: the key pictures (I0/P0) form the 1st temporal layer, and the hierarchical B pictures B1, B2 and B3 form the 2nd, 3rd and 4th temporal layers, each picture using pictures of lower temporal layers as references.]
Fig. 3. Dyadic hierarchical B picture coding scheme for a GOP size of 8 pictures
using the Hadamard and DCT-based transforms as defined in the H.264/AVC standard. At this stage, the original video sequence can still be reconstructed without errors (lossless coding). In order to obtain higher compression ratios, the encoder will introduce errors by quantizing the transformed blocks. During this process, SNR scalability can be introduced. More precisely, Fine Grain Scalability (FGS, [7]) allows the generation of a quality-scalable bitstream. For each spatial layer, a quality base layer provides a minimum reconstruction quality and, by using the FGS-encoded enhancement layers, higher quality bitstreams can be obtained. Each FGS-encoded enhancement layer can be truncated at any arbitrary point to obtain a high variety of possible bit rates.
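To relate the dyadic hierarchy of Fig. 3 to frame rates, a small sketch is given below (Python, illustrative only; level 0 here corresponds to the key pictures of the 1st temporal layer of Fig. 3).

def temporal_level(pic_index, gop_size=8):
    # Dyadic hierarchy of Fig. 3: key pictures are level 0, B1 pictures level 1,
    # B2 pictures level 2, B3 pictures level 3 (for a GOP of 8 pictures).
    level, step = 0, gop_size
    while pic_index % step != 0:
        step //= 2
        level += 1
    return level

def frame_rate(full_fps, kept_levels, total_levels=4):
    # Removing the highest dyadic enhancement layer halves the frame rate.
    return full_fps / 2 ** (total_levels - kept_levels)

print([temporal_level(i) for i in range(9)])   # [0, 3, 2, 3, 1, 3, 2, 3, 0]
print(frame_rate(24.0, 4), frame_rate(24.0, 3), frame_rate(24.0, 1))  # 24.0 12.0 3.0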
3.3 Efficient Bitstream Extraction Along the Scalability Axes
Once one has gained insight into the construction of an embedded scalable bitstream, an extractor can be built. This extractor is capable of deriving partial bitstreams from the parent stream containing lower temporal or spatial resolutions at given target bit rates. One of the requirements for the discussed SVC is that the specification needs to define a mechanism supporting an efficient extraction process. To obtain this goal, JSVM uses Supplemental Enhancement Information (SEI) messages. SEI messages contain meta information that is not required for constructing the picture samples. In H.264/AVC, these messages assist in the processes related to decoding, displaying, or other purposes; JSVM extends this functionality by using a few of these messages in the extraction process. Every SEI message type has a number, payloadType, indicating the kind of information that the message represents. The numbers 0 to 21 inclusive are already specified by H.264/AVC, while the numbers 22 to 25 inclusive are added by JSVM. The latter four SEI messages are introduced to simplify the extraction process. The most important SEI message for the extractor is unarguably the scalability info message (type number 22). This message is transmitted at the beginning of the bitstream and provides basic information about the embedded scalability features such as the number of layers, the cumulated bit rates of the different layers, and the resolution of the layers.
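To make the role of this SEI-driven extraction concrete, a minimal sketch of the core keep/drop test of such an extractor is given below (Python, illustrative only; it assumes the three scalable-extension header fields have already been parsed into dictionaries and ignores SEI messages, parameter sets, and FGS truncation).

def keep_nalu(nalu, target):
    # A NALU is kept only if it does not exceed the selected operation point
    # along any of the three embedded scalability axes.
    return (nalu["dependency_id"] <= target["dependency_id"] and
            nalu["temporal_level"] <= target["temporal_level"] and
            nalu["quality_level"] <= target["quality_level"])

def extract(nalus, target):
    # nalus: list of (header_fields, payload) pairs already parsed from the
    # parent bitstream; non-scalable NALUs are assumed to be handled separately.
    return [(hdr, payload) for hdr, payload in nalus if keep_nalu(hdr, target)]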
4 XML-Driven Exploitation of the Combined Scalability
The combined scalability of a JSVM bitstream can be exploited when a desired frame rate, spatial resolution, and bit rate are given to the extractor. In our XML-driven framework, the extractor is implemented by a STX stylesheet taking adaptation parameters as input, in particular width, height, framerate, and bitrate. As illustrated in Fig. 1, the stylesheet will transform the XML description of the bitstream, reflecting the adaptation in the XML domain. In order to obtain such a BSD, a BS Schema for the JSVM standard has to be developed. In [8], we have explained the creation of a possible BS Schema for JSVM bitstreams that are adaptable along the combined scalability axis. The stylesheet does not have knowledge about the properties of the embedded scalability axes. As explained in Sect. 3, this information is available in the scalability info SEI message, conveyed by the first NALU of the bitstream. The information encapsulated in this message can also be obtained by an analysis of the complete bitstream; however, this would conflict with the requirement of an efficient bitstream extraction process. Information about every layer that can be extracted from the bitstream is available in the SEI message. A fragment of the BSD containing information about an encapsulated layer is given in Fig. 4. The STX stylesheet will interpret this information for every layer. Every layer has a unique identifier (line 2). The layer ID can be used by other SEI messages further down the stream to update the layer information, e.g., to signal an increase in the bit rate because of a scene containing a lot of motion. Next, the fgs_layer_flag is present, indicating that this layer is an FGS enhancement layer such that it can be truncated at any arbitrary point. Further in the fragment, one can observe the decoding dependency information (lines 13–17). This information reports to which temporal level, spatial layer, and
 1   <layer_info>
       <layer_id>53</layer_id>
       <fgs_layer_flag>1</fgs_layer_flag>
       <sub_pic_layer_flag>0</sub_pic_layer_flag>
 5     <sub_region_layer_flag>0</sub_region_layer_flag>
       <profile_level_info_present_flag>0</profile_level_info_present_flag>
       <decoding_dependency_info_present_flag>1</decoding_dependency_info_present_flag>
       <bitrate_info_present_flag>1</bitrate_info_present_flag>
       <frm_rate_info_present_flag>1</frm_rate_info_present_flag>
10     <frm_size_info_present_flag>1</frm_size_info_present_flag>
       <layer_dependency_info_present_flag>0</layer_dependency_info_present_flag>
       <init_parameter_sets_info_present_flag>0</init_parameter_sets_info_present_flag>
       <decoding_dependency_info_present_flag_is_1>
         <temporal_level>4</temporal_level>
15       <dependency_id>3</dependency_id>
         <quality_level>1</quality_level>
       </decoding_dependency_info_present_flag_is_1>
       <bitrate_info_present_flag_is_1>
         <avg_bitrate>1191</avg_bitrate>
20       <max_bitrate>0</max_bitrate>
       </bitrate_info_present_flag_is_1>
       <frm_rate_info_present_flag_is_1>
         <constant_frm_rate_idc>0</constant_frm_rate_idc>
         <avg_frm_rate>6144</avg_frm_rate>
25     </frm_rate_info_present_flag_is_1>
       <frm_size_info_present_flag_is_1>
         <frm_width_in_mbs_minus1>79</frm_width_in_mbs_minus1>
         <frm_height_in_mbs_minus1>31</frm_height_in_mbs_minus1>
       </frm_size_info_present_flag_is_1>
30   </layer_info>

Fig. 4. Fragment of the scalability information SEI message as available in the BSD
 1   <nal_unit>
       <forbidden_zero_bit>0</forbidden_zero_bit>
       <nal_ref_idc>1</nal_ref_idc>
       <nal_unit_type>21</nal_unit_type>
 5     <nal_unit_information_for_scalable_extension>
         <simple_priority_id>0</simple_priority_id>
         <discardable_flag>0</discardable_flag>
         <extension_flag>1</extension_flag>
         <if_extension_flag_is_equal_1>
10         <temporal_level>4</temporal_level>
           <dependency_id>3</dependency_id>
           <quality_level>1</quality_level>
         </if_extension_flag_is_equal_1>
       </nal_unit_information_for_scalable_extension>
15     <raw_byte_sequence_payload>
         <coded_slice_of_an_IDR_picture_in_scalable_extension>
           <slice_layer_in_scalable_extension_rbsp>
             <slice_payload>45190 3862</slice_payload>
           </slice_layer_in_scalable_extension_rbsp>
20       </coded_slice_of_an_IDR_picture_in_scalable_extension>
       </raw_byte_sequence_payload>
     </nal_unit>

Fig. 5. Fragment of a NALU description as available in the BSD
quality level the layer in question belongs. Based on these numbers, the stylesheet will determine whether a certain NALU has to be removed or not. The numbers reflect the layered structure in Fig. 2 and Fig. 3. So far, we have only discussed structural information about the layer. The other data in the fragment contain information about the properties of the layer. Lines 18 to 21 specify information about the bit rate. Note that only the average bit rate is calculated by the encoder. The avg_bitrate syntax element contains the average bit rate that is needed to extract this layer and all underlying layers necessary to decode it. The syntax element is expressed in units of 1000 bits per second. Thereupon, the average frame rate is given, followed by the resolution of the frames embedded in this layer. The average frame rate is expressed in frames per 256 seconds, resulting in a frame rate of 24 Hz for layer 53 in Fig. 4. The resolution is expressed in macroblocks, and the width and height in pixels can be calculated as follows:

width_pixels  = (frm_width_in_mbs_minus1 + 1) × 16
height_pixels = (frm_height_in_mbs_minus1 + 1) × 16

As such, frames belonging to layer 53 will have a resolution of 1280 × 512. Once the layer to be extracted has been determined based on the adaptation parameters, the stylesheet uses the values of the temporal_level, dependency_id, quality_level, and fgs_layer_flag syntax elements to decide whether a certain NALU has to be removed. Figure 5 shows a fragment of the BSD representing the high-level structure of a NALU belonging to layer 53. From lines 9 to 13, the NALU header contains the necessary information to determine to which layer the NALU belongs. If these values indicate that the NALU in question is part of a frame that does not have to be decoded on the desired device, the NALU will be removed from the BSD by the stylesheet. For example, all NALUs being part of a frame with a higher resolution than the desired sequence will be removed immediately (using the dependency_id syntax element). One can observe that the high-level structure of our BSDs is obtained by using references to the original bitstream during the description of the
[Fig. 6 shows the format-agnostic adaptation engine: a BSDLink description and a steering description, together with a usage environment description, feed the adaptation decision-taking engine; the resulting parameters and the STX stylesheet drive the STX engine, which transforms the bitstream syntax description (BSD); the BSDtoBin Parser then combines the transformed BSD, the BS Schema and the scalable bitstream into the adapted (scalable) bitstream.]
Fig. 6. Format-agnostic XML-driven framework for video content adaptation
payload (line 18). The data in the payload contain coded picture samples, which are unimportant for an efficient adaptation engine. Nevertheless, the stylesheet has to change the value of the slice_payload tag in order to realize FGS scalability. This editing operation is only allowed when the NALU belongs to an FGS enhancement layer. The stylesheet determines the existence of the FGS layer based on the value of the quality_level element of the NALU (which should be bigger than 0) and the fgs_layer_flag element of the corresponding layer as reported in the SEI message (which should be equal to 1). In this case, the stylesheet can replace the length of the payload (e.g., changing line 18 in Fig. 5 to <slice_payload>45190 2000</slice_payload>). Besides analyzing the SEI messages and customizing NALUs containing coded picture data, the stylesheet also has to remove from the BSD the parameter sets that are no longer necessary for correctly decoding the adapted bitstream. Our BS Schema together with a STX implementation of the combined scalability can be found at http://multimedialab.elis.ugent.be/BSDL.
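The following sketch (Python, illustrative only; the dictionary keys simply mirror the syntax element names of Fig. 4 and Fig. 5) summarizes the two computations described above: converting the SEI layer properties to familiar units, and rewriting the byte-range reference of an FGS payload.

def layer_properties(layer):
    # 'layer' mirrors the elements of the scalability information SEI message.
    width  = (layer["frm_width_in_mbs_minus1"] + 1) * 16
    height = (layer["frm_height_in_mbs_minus1"] + 1) * 16
    fps    = layer["avg_frm_rate"] / 256.0   # expressed in frames per 256 seconds
    kbps   = layer["avg_bitrate"]            # expressed in units of 1000 bit/s
    return width, height, fps, kbps

def truncate_fgs_reference(slice_payload_text, new_length):
    # Rewrite the "start length" value of a <slice_payload> element belonging
    # to an FGS enhancement layer, e.g. "45190 3862" -> "45190 2000".
    start, length = (int(v) for v in slice_payload_text.split())
    return "%d %d" % (start, min(new_length, length))

For layer 53 of Fig. 4, layer_properties returns (1280, 512, 24.0, 1191), in line with the values derived above.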
5 A Framework for an XML-Based Adaptation Engine
So far, we have explained the creation of JSVM scalable bitstreams and an adaptation framework based on the use of XML descriptions and STX stylesheets. In order to obtain a complete format-agnostic adaptation engine, the parts as discussed so far have to be brought together in such a way that the engine does not have to know what the underlying coding format is to create a tailored bitstream for the desired usage environment. Figure 6 shows a framework for a format-agnostic adaptation engine. This engine consists of three main parts, in particular the Adaptation Decision-Taking Engine (ADTE), the STX engine, and the BSDtoBin Parser. The ADTE provides adequate decisions to adapt the scalable bitstream according to the usage environment. Therefore, the ADTE
 1   <DIA>
       <Description xsi:type="BSDLinkType">
         <SteeringDescriptionRef uri="scalabilityInformation.xml"/>
         <BitstreamRef uri="scalableBitstream.h264"/>
 5       <BSDRef uri="bitstreamDescription.xml"/>
         <BSDTransformationRef uri="combinedScalability.stx" type="http://stx.sourceforge.net/2002/ns"/>
         <Parameter xsi:type="IOPinRefType" name="width">
           <value>WIDTH_SEQUENCE</value>
         </Parameter>
10       <Parameter xsi:type="IOPinRefType" name="height">
           <value>HEIGHT_SEQUENCE</value>
         </Parameter>
         <Parameter xsi:type="IOPinRefType" name="framerate">
           <value>FRAME_RATE</value>
15       </Parameter>
         <Parameter xsi:type="IOPinRefType" name="bitrate">
           <value>BIT_RATE</value>
         </Parameter>
       </Description>
20   </DIA>

Fig. 7. BSDLink description to steer an adaptation engine
takes as input a Usage Environment Description (UED) describing the terminal capabilities, network characteristics, and user preferences. Because of the format-agnostic character of the adaptation engine, the ADTE has to know which bitstreams can be extracted from the parent stream. This information will be transmitted to the ADTE by using a steering description containing the same information as in the scalability information SEI message. A more detailed explanation of the functioning of an ADTE can be found in [9]. It is important to mention that the outputs of an ADTE is a set of transformation parameters. The STX engine expects these parameters in order to execute the transmitted STX stylesheet and to transform the BSD. From that point, the framework of Fig. 1 is followed. The last part needed in a format-agnostic adaptation engine is a tool that can be used to link the different inputs of the engine. Therefore, the standardized Bitstream Syntax Description Link (BSDLink) tool will be used. The used BSDLink description to steer a format-agnostic adaptation engine receiving our scalable bitstreams, is given in Fig. 7. The different inputs of Fig. 6 can be found in this description. On line 3, the steering description used by the ADTE is given, while the reference to the original scalable bitstream is given on line 4. The reference to the high-level XML description, in particular the BSD, is given on line 5. This BSD will contain a reference to the corresponding BS Schema (used by the BSDtoBin Parser). Finally, the STX stylesheet implementing our combined scalability is given on line 6. Our stylesheet needs as input four parameters, in particular the width, height, frame rate, and bit rate of the tailored bitstream. These parameters needed are given in Fig. 7 from lines 7 to 18 and the values of the parameters are determined by the ADTE using the IDs WIDTH SEQUENCE, HEIGHT SEQUENCE, FRAME RATE, and BIT RATE. From Fig. 6 and the description in Fig. 7, it is clear that the adaptation engine is format-agnostic and that all communication is based on using XML descriptions. A public demonstration of a similar MPEG-21 based adaptation framework can be found on the DANAE website: http://danae.rd.francetelecom.com.
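As a toy illustration of the decision-taking step (a sketch only; a real ADTE, such as the one in [9], solves an optimization problem over the steering description, and the dictionary keys used here for the usage environment are hypothetical):

def adaptation_decision(ued):
    # Derive the four IOPin values referenced by the BSDLink description in
    # Fig. 7 from a (hypothetical) usage environment description.
    return {
        "WIDTH_SEQUENCE":  min(1280, ued["display_width"]),
        "HEIGHT_SEQUENCE": min(512, ued["display_height"]),
        "FRAME_RATE":      min(24, ued["max_frame_rate"]),
        "BIT_RATE":        int(0.9 * ued["network_bandwidth_kbps"]),  # keep some headroom
    }

params = adaptation_decision({"display_width": 640, "display_height": 480,
                              "max_frame_rate": 30, "network_bandwidth_kbps": 1500})
# -> {'WIDTH_SEQUENCE': 640, 'HEIGHT_SEQUENCE': 480, 'FRAME_RATE': 24, 'BIT_RATE': 1350}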
6 Performance Results
6.1 Methodology
To evaluate the performance of the discussed format-agnostic XML-driven adaptation framework, we have generated four encoded scalable bitstreams compliant with the JSVM version 4 specification. Each bitstream contains a part of the new world trailer¹ with a resolution of 1280 × 512 at a frame rate of 24 Hz. The encoder generates bitstreams with 5 temporal, 4 spatial, and 3 quality levels. The other characteristics of each bitstream are given in Table 1, in particular the number of frames, the number of NALUs, and the size of the generated bitstreams. For each bitstream, the corresponding BSD is generated by using an optimized BintoBSD Parser as explained in [10]. The generated BSDs are subject to the transformation reflecting the adaptation in the XML domain. From each bitstream, three partial streams are extracted, with a resolution of 320 × 128 at 12 Hz and 400 KBits/s, a resolution of 640 × 256 at 24 Hz and 1200 KBits/s, and a resolution of 1280 × 512 at 24 Hz and 5000 KBits/s. The combined scalability is implemented in a STX stylesheet and Joost (version 2005-05-21)² is used as STX engine. Finally, a modified BSDtoBin Parser of the MPEG reference software version 1.2.1 is used to generate the adapted scalable bitstreams. The performance measurements were done on a PC with an Intel Pentium IV CPU clocked at 2.8 GHz with Hyper-Threading and 1 GB of RAM.

Table 1. Characteristics of the scalable bitstreams and corresponding BSDs

                 Original Bitstreams                BSD Characteristics
  Name    #Frames   #NAL Units   Size (MB)   ET (s)   sizep (KB)   sizec (KB)   Ratio (%)
  Seq 1       250         3263        8.67     15.1         3352           45       98.66
  Seq 2       500         6513       39.59     39.5         6597           90       98.63
  Seq 3      1000        13013       83.55     76.2        13086          178       98.64
  Seq 4      2000        26013      175.51    156.6        26073          352       98.65
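The Ratio column is consistent with the relative size reduction achieved by compressing the plain-text BSD (an interpretation that matches all four rows; the formula is not stated explicitly). For Seq 1:

\mathrm{Ratio} = \left(1 - \frac{\mathrm{size}_c}{\mathrm{size}_p}\right) \times 100 = \left(1 - \frac{45}{3352}\right) \times 100 \approx 98.66\,\%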
6.2 Discussion of the Results
The results of the BSD generation process are given in Table 1, in particular the Execution Times (ETs) of the BintoBSD Parser, the sizes of the resulting XML descriptions in plain text (sizep), the sizes of the compressed BSDs (sizec, obtained by using EasyZip v3.5), and the compression ratios. From these results, we can conclude that the ET is linear as a function of the length of the sequence. The sizes of the plain-text generated BSDs are substantial compared to the original bitstream, approximately 15% of the size of the bitstream. By compressing the BSDs, the overhead originating from the XML descriptions becomes negligible, roughly 0.1%. Table 2 shows the performance of the STX transformations and the BSDtoBin Parser. The ET of the transformations is linear as a function of the length of the sequence when the desired bitstreams contain the same
¹ This trailer can be downloaded from http://www.apple.com/trailers.
² This engine can be found on http://joost.sourceforge.net.
Table 2. Performance results of the adaptation engine

                                      STX Transformation                         BSDtoBin Parser
  Name   Desired Bitstream       ET (s)   #NALUs   sizep (KB)   sizec (KB)   ET (s)   GS (NALUs/s)   OB (Kbits/s)
  Seq 1  320x128@12:400            3.69      882          911           15     1.14          775.2         209.98
         640x256@24:1200           4.38     2260         2370           32     1.55         1457.9         948.44
         1280x512@24:5000          4.89     3263         3434           45     1.95        1673.50        5001.71
  Seq 2  320x128@12:400            5.97     1757         1719           27     1.53        1147.08         406.98
         640x256@24:1200           7.05     4010         4105           56     2.00        2001.79        2380.53
         1280x512@24:5000          7.98     6013         6229           83     2.86        2100.54        5001.71
  Seq 3  320x128@12:400           10.30     3507         3336           83     2.04        1718.61         407.18
         640x256@24:1200          12.39     8010         8104           50     3.00        2667.51        1201.08
         1280x512@24:5000         14.25    12013        12050          108     4.78        2515.81        5001.31
  Seq 4  320x128@12:400           18.90     7007         6413           96     3.06        2288.37         408.75
         640x256@24:1200          23.09    16010        15719          211     5.04        3176.34        1200.69
         1280x512@24:5000         26.75    24013        24001          322     8.51        2820.88        5000.82
characteristics. The ETs are smaller for bitstreams containing a lower resolution, frame rate, or bit rate, because of fewer I/O operations (which can be derived from the number of NALUs available in the transformed BSDs). The sizes (in plain text and compressed) also reflect the influence of the adaptation parameters on the number of remaining NALUs. In case only a reduction in bit rate is desired, the sizes of the transformed BSDs are almost the same as those of the original BSDs: the transformation executes almost no removal operations and only the payload sizes of the FGS layers are adapted (as explained in Sect. 4). The ET of the BSDtoBin Parser is also linear as a function of the length of the sequence. The Generation Speed (GS) of the parser increases but converges for longer sequences; this increase is explained by the start-up time of the parser (loading of the parser and interpreting the BS Schema). The Obtained Bit rates (OBs) of the adapted bitstreams approach the desired rates very well. If the OB is lower than the desired rate, the selected layer does not contain enough bits to reach that bit rate (resulting in no truncation operations). This means that our adaptation engine can generate bitstreams with a desired bit rate without knowledge of the underlying coding format. From the performance results, we can conclude that the transformation together with the generation of the adapted bitstream can be done in real time.
7 Conclusion
In this paper, a format-agnostic framework for video content adaptation was proposed in which all communication is based on XML descriptions. Not only is the usage environment described in XML but also the high-level structure of the scalable bitstreams. This gives us the opportunity to shift the adaptation process to the XML domain. In order to obtain scalable bitstreams, the scalable extension of the H.264/AVC specification was used. These bitstreams can be adapted along the three scalability axes at the same time (better known as combined scalability). The corresponding XML descriptions of the scalable bitstreams are obtained by using MPEG-21 BSDL. The transformation exploiting
the combined scalability of the XML descriptions has been implemented in STX. From the performance results, we can conclude that the execution time of the transformations and the generation of the adapted bitstreams is linear as a function of the length of the sequences. Finally, we have demonstrated that our XML-driven format-agnostic framework can execute the adaptations in real time.
Acknowledgements The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.
References 1. Panis, G., Hutter, A., Heuer, J., Hellwagner, H., Kosch, H., Timmerer, C., Devillers, S., Amielh, M.: Bitstream syntax description: a tool for multimedia resource adaptation within MPEG-21. Signal Processing: Image Communication 18(8), 721–747 (2003) 2. Becker, O.: Transforming XML on the fly. In: Proceedings of XML Europe (2003) 3. Devillers, S., Timmerer, C., Heuer, J., Hellwagner, H.: Bitstream syntax description-based adaptation in streaming and constrained environments. IEEE Transactions on Multimedia 7(3), 463–470 (2005) 4. De Schrijver, D., De Neve, W., Van Deursen, D., De Cock, J., Van de Walle, R.: On an evaluation of transformation languages in a fully XML-driven framework for video content adaptation. In: Proceedings of 2006 IEEE International Conference on Innovative Computing, Information and Control, Beijing, China, vol. 3, pp. 213–216. IEEE, Los Alamitos (2006) 5. ITU-T and ISO/IEC JTC 1: ISO/IEC 14496-10:2004 Information technology – Coding of audio-visual objects – Part 10: Advanced Video Coding (2004) 6. Tian, D., Hannuksela, M.M., Gabbouj, M.: Sub-sequence video coding for improved temporal scalability. In: IEEE International Symposium on Circuits and Systems, Kobe, Japan, vol. 6, pp. 6074–6077 (2005) 7. Weiping, L.: Overview of fine granularity scalability in MPEG-4 video standard. IEEE Trans. on Circuits and Systems for Video Technology 11(3), 301–317 (2001) 8. De Schrijver, D., De Neve, W., De Wolf, K., Notebaert, S., Van de Walle, R.: XML-based customization along the scalability axes of H.264/AVC scalable video coding. In: Proceedings of 2006 IEEE International Symposium on Circuits and Systems (ISCAS), Island of Kos, Greece, pp. 465–468. IEEE, Los Alamitos (2006) 9. Mukherjee, D., Delfosse, E., Kim, J.G., Wang, Y.: Optimal adaptation decisiontaking for terminal and network quality-of-service. IEEE Trans. Multimedia 7(3), 454–462 (2005) 10. De Schrijver, D., De Neve, W., De Wolf, K., Van de Walle, R.: Generating MPEG21 BSDL descriptions using context-related attributes. In: Proceedings of the 7th IEEE International Symposium on Multimedia, Irvine, USA, pp. 79–86. IEEE Computer Society Press, Los Alamitos (2005)
Moving Object Extraction by Watershed Algorithm Considering Energy Minimization Kousuke Imamura, Masaki Hiraoka, and Hideo Hashimoto Kanazawa University, Graduate School of Natural Science and Technology, Kakuma-machi, Kanazawa, Ishikawa 920-1192 Japan {imamura,hasimoto}@ec.t.kanazawa-u.ac.jp,
[email protected]
Abstract. MPEG-4, which is a video coding standard, supports object-based functionalities for high efficiency coding. MPEG-7, a multimedia content description interface, handles the object data in, for example, retrieval and/or editing systems. Therefore, extraction of semantic video objects is an indispensable tool that benefits these newly developed schemes. In the present paper, we propose a technique that extracts the shape of moving objects by combining snakes and watershed algorithm. The proposed method comprises two steps. In the first step, snakes extract contours of moving objects as a result of the minimization of an energy function. In the second step, the conditional watershed algorithm extracts contours from a topographical surface including a new function term. This function term is introduced to improve the estimated contours considering boundaries of moving objects obtained by snakes. The efficiency of the proposed approach in moving object extraction is demonstrated through computer simulations.
1 Introduction
MPEG-4, which is a video coding standard, supports object-based functionalities for high efficiency coding. MPEG-7, a multimedia content description interface, handles the object data in retrieval and/or editing systems. Therefore, extraction of semantic video objects is an indispensable tool that benefits these newly developed schemes. Since these standards do not prescribe the technique for object extraction, a number of object extraction techniques, such as chromakey, texture analysis, contour extraction, and contour tracking, have been proposed. Snakes (active contour models), which are a type of contour extraction algorithm based on minimizing an energy function, were proposed by Kass et al. [1]. Snakes stably extract smooth closed contours from an image. Hence, this scheme has been used for region extraction and image recognition. A number of attempts have been made to improve the models with respect to the reduction of computational complexity and adaptability to more than one object, for example [2,3]. In snakes, it may be difficult to set the initial contour and a suitable energy functional for object extraction. In addition, the closed contour is often defined as a set of discrete points for the reduction of noise influence and computational complexity, but such a closed contour is not able to accurately represent the true curve. Vieren et al. [4] applied snakes to
interframe difference images for the contour extraction of moving objects. The problem with this approach is that although it can provide a rough contour, it may not include accurate boundaries of the moving objects. On the other hand, watershed algorithm has been proposed as a technique for region segmentation [5]. Watershed algorithm is a type of region-growing algorithm and treats the input image as a topographic surface. The boundaries of the segments obtained by watershed algorithm are in accordance with the edges of the object, so accurate shape information can be obtained. However, the influence of noise and the lighting conditions leads to over-segmentation. Therefore, a number of preprocessing tasks are required to eliminate the unnecessary edges. Moreover, in the case of moving object extraction, it is difficult to judge whether each region belongs to an object. New, efficient approaches that combine snakes and watershed algorithm have been proposed for image segmentation. In [6], the watershed is represented as the energy minimum point. In [7], over-segmentation in watershed algorithm is restrained by using the energy criterion of snakes. In the present paper, we propose an alternative technique that extracts the shape of moving objects by combining snakes and watershed algorithm. First, snakes extract contours of the moving objects from the interframe difference image as the result of minimization of an energy function. Second, the conditional watershed algorithm extracts edge information from a topographic surface including a new function term. We introduce a new function that incorporates the result of energy minimization by snakes into watershed algorithm. The conditional watershed algorithm extracts one closed contour from each local region.
2 Snakes and Watershed Algorithm
2.1 Snakes
A snake is represented parametrically by a vector v(s) = (x(s), y(s)) (0 ≤ s ≤ 1), and the shape of the object is extracted by changing the contour through iterative minimization of the energy. The energy functional of the contour is defined as

E_{snakes} = \int_{0}^{1} \{ E_{int}(v(s)) + E_{image}(v(s)) + E_{con}(v(s)) \} \, ds,   (1)

where E_{int}(v(s)) represents the internal energy of the contour due to bending, E_{image}(v(s)) is the image force, and E_{con}(v(s)) is the external constraint. Snakes as proposed by Kass et al. are sensitive to noise, and minimization of the functional requires a high computational cost. In order to prevent this problem, Williams et al. proposed snakes based on a discrete model, improving the noise tolerance and computational complexity. The discrete contour of snakes is represented by control points v_i = (x_i, y_i) (i = 1, 2, ..., n), which are defined in a clockwise manner (v_{n+1} = v_1). The contour energy in this approach is minimized by a greedy algorithm. In the greedy algorithm, the energy is calculated in the neighborhood of each control point v_i, and the control point v_i is moved to the minimum energy position. This process is iterated until convergence is attained, and we obtain the final contour.
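A minimal sketch of one such greedy pass is given below (Python; the neighbourhood size, weights, and energy terms are illustrative placeholders, not the exact definitions of the discrete model or of Sect. 3.2).

import numpy as np

def greedy_step(points, image_force, w_internal=1.0, w_image=1.0, win=1):
    # points: list of (x, y) control points of a closed contour (clockwise).
    # image_force: 2-D array, e.g. an absolute frame-difference image.
    h, w = image_force.shape
    moved, n = 0, len(points)
    for i in range(n):
        (px, py), (nx_, ny_) = points[i - 1], points[(i + 1) % n]
        best_pos, best_e = points[i], float("inf")
        for dy in range(-win, win + 1):
            for dx in range(-win, win + 1):
                cx = int(np.clip(points[i][0] + dx, 0, w - 1))
                cy = int(np.clip(points[i][1] + dy, 0, h - 1))
                e_cont = (cx - px) ** 2 + (cy - py) ** 2                       # continuity
                e_bend = (nx_ - 2 * cx + px) ** 2 + (ny_ - 2 * cy + py) ** 2   # curvature
                e_img = -float(image_force[cy, cx]) ** 2                       # attract to strong response
                e = w_internal * (e_cont + e_bend) + w_image * e_img
                if e < best_e:
                    best_pos, best_e = (cx, cy), e
        if best_pos != points[i]:
            points[i] = best_pos
            moved += 1
    return moved  # the caller iterates until only a few points still move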
2.2 Watershed Algorithm
Watershed algorithm is a region-growing algorithm and treats the input image as a topographic surface. The luminance gradient is assumed to be the altitude of the topographic surface. The surface is slowly immersed from the minima at the lowest altitude. Dams are erected at locations where the waters coming from two different minima regions merge. The dam corresponds to the border of each region.
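For illustration, a compact marker-based priority-flood variant is sketched below (Python). It is a simplification, not the immersion algorithm of Vincent and Soille [5], and it assumes that the catchment-basin seeds (e.g., the regional minima of the filtered gradient) have already been labelled.

import heapq
import numpy as np

def watershed(gradient, markers):
    # gradient: 2-D topographic surface; markers: int array with positive labels
    # at the basin seeds and 0 elsewhere.  Pixels where two different labels
    # meet are marked -1 (watershed lines / dams).
    labels = markers.copy()
    h, w = gradient.shape
    heap = [(gradient[y, x], int(y), int(x)) for y, x in zip(*np.nonzero(markers))]
    heapq.heapify(heap)
    while heap:
        _, y, x = heapq.heappop(heap)
        if labels[y, x] == -1:          # turned into a dam after being queued
            continue
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                if labels[ny, nx] == 0:
                    labels[ny, nx] = labels[y, x]            # grow the basin
                    heapq.heappush(heap, (gradient[ny, nx], ny, nx))
                elif labels[ny, nx] not in (labels[y, x], -1):
                    labels[ny, nx] = -1                       # dam between basins
    return labels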
3 Moving Object Extraction Algorithm
We describe the proposed moving object extraction algorithm using snakes and watershed algorithm.
3.1 Setting of Initial Contours
In the case of applying the splitting snakes proposed by Araki et al. [3], it is not necessary to prepare initial contours corresponding to the number of objects in advance, and one initial contour is set on the outer frame of the image. However, setting the initial contour on the outer frame involves the problems of the computational cost of convergence and sensitivity to local minima. In the present paper, we set the initial contours around regions that include moving objects. The initial contour setting is performed as follows:
1. The frame difference image is partitioned into 16 × 16 pixel blocks, and the mean value mi of the absolute frame difference for each block is calculated. The histogram of mi is constructed.
2. The threshold THm for detecting a block as part of a moving object is set to a value around the upper tail of the histogram. This value varies with image content and noise conditions, but is about 5–10 according to our experimental results.
3. Each block detected as a moving object part (mi ≥ THm) is tested for its connectivity in a 7 × 7 block window. If fewer than three blocks are connected, the block is deleted as an erroneous detection.
4. A dilation operation with a 3 × 3 block window is applied to the region of the object blocks.
5. The initial control points are set at every eight pixels in the clockwise direction on the outer circumference of the extended region.
3.2 Moving Object Extraction by Snakes
The initial contour converges to the neighborhood of the moving object boundary by minimization of the energy functional. In the present paper, the energy functions of the snakes for a frame difference image are defined as
E_{spline}(v_i) = \frac{1}{2} \sum_{i=1}^{n} \left( w_{sp1} |v_i - v_{i-1}|^2 + w_{sp2} |v_{i+1} - 2v_i + v_{i-1}|^2 \right),   (2)
E_{area}(v_i) = \frac{1}{2} \sum_{i=1}^{n} w_{area} \left[ x_i (y_{i+1} - y_i) - (x_{i+1} - x_i) y_i \right],   (3)
E_{diff}(v_i) = - \sum_{i=1}^{n} w_{diff} |D(v_i)|^2,   (4)
where w_{sp1}, w_{sp2}, w_{area}, and w_{diff} ≥ 0 are used to balance the relative influence of the terms. The first term of E_{spline} represents the elasticity of the contour, and the second term represents the stiffness. E_{area} denotes the area energy of the region enclosed by the contour. These two energies depend on the shape of the contour. In addition, we use the difference energy E_{diff}, which is obtained from the frame difference image D. The difference energy causes the contour to converge to areas with high frame-difference values because its contribution is negative. The contour model is renewed in order to minimize the energy using a greedy algorithm. In the renewal, if the distance between adjacent control points is more than 10 pixels, a new control point is inserted midway between these points. In addition, if the distance is less than two pixels, one of the points is deleted. If the total number of control points of a contour model becomes less than 20, the contour model is deleted as an insignificant object. The renewal process is iterated until the number of moving control points decreases to less than 5% of the initial number.
3.3 Topographic Map for Watershed Algorithm
Unnecessary information, such as that caused by noise and/or local texture, should be removed for region segmentation by watershed algorithm. Thus, we carry out preprocessing in order to obtain the luminance gradient image. This preprocessing is not performed on the entire image, but rather on limited regions, because of the computational costs involved. In the proposed method, the preprocessing is performed on the inside of the initial contour of snakes because this area includes the target region of watershed algorithm and may have a variable size depending on the object. We describe the procedure for making the local luminance gradient image on which watershed algorithm is performed. First, a morphological filter [8] smoothes the image while maintaining the edge features. Next, the filtered image is transformed into the luminance gradient image by the multiscale morphological gradient [9]. Morphological reconstruction [10] is applied to the luminance gradient image to prevent over-segmentation. Watershed algorithm of the proposed method employs a new function term that is added to the luminance gradient to construct a topographic map. This term corresponds to a distance evaluation between the energy-minimum line obtained by snakes and the estimation point. The distance evaluation function d(x) is defined as
d(x) = e^{-\frac{x^2}{2\delta^2}},   (5)
where x is the distance from the contour obtained by snakes, and δ is a positive constant.
As a result, the topographic map T at a point (i, j) is represented as:
T(i, j) = \alpha \cdot g(i, j) + (1 - \alpha) \cdot g_{max} \cdot d(i, j),   (6)
where g denotes the luminance gradient, and g_{max} is the highest gradient value in the image. α denotes the weighting between the luminance gradient and the distance evaluation, and is a positive constant in [0, 1].
3.4 Object Shape Detection by the Conditional Watershed Algorithm
We assume that the circumference of the obtained energy-minimized contour includes the boundary of the moving object, so watershed algorithm extracts this boundary only from the topographic map of the contour circumference. For this purpose, we define watershed areas of width L around the contours obtained by the snakes, and the value of the topographic map outside the watershed area is set to zero. However, multiple edges may be extracted from this area by the ordinary watershed algorithm. Therefore, for the case in which the watershed area has multiple local maxima, the additional condition that the maximum among them is regarded as the contour of the moving object is added.
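A sketch of the construction of this restricted topographic map is given below (Python, using SciPy's Euclidean distance transform; the value of δ and the interpretation of the band width L as a maximum distance from the contour are assumptions of this sketch).

import numpy as np
from scipy import ndimage

def conditional_topographic_map(gradient, snake_mask, alpha=0.8, delta=3.0, band=9):
    # gradient: local luminance gradient image g; snake_mask: boolean image that
    # is True on the contour found by the snakes.
    dist = ndimage.distance_transform_edt(~snake_mask)   # x in Eq. (5)
    d = np.exp(-dist ** 2 / (2.0 * delta ** 2))          # Eq. (5)
    t = alpha * gradient + (1.0 - alpha) * gradient.max() * d   # Eq. (6)
    t[dist > band] = 0.0                                 # outside the watershed area
    return t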
4 Simulation and Results
The proposed moving object extraction was examined by computer simulation. ``Hall Monitor'', ``Bream'' and ``Japanese Room'' (CIF, grayscale) were used as test sequences.
(a) Setting of initial contour.
(b) Convergent result.
Fig. 1. Contour Extraction by snakes (Hall Monitor)
4.1 Setting of the Initial Contours
We first verify the initial contour setting in snakes. The threshold THm for moving object detection is used to judge whether the block is included in the moving object. For the case in which the image includes a high degree of noise, we may need to revise the threshold THm.
Figure 1 (a) shows the initial contour of Hall Monitor at THm = 5. From Figure 1 (a), the initial contour is appropriately placed around the moving objects.
4.2 Energy Minimization by Snakes
We verified the contour extraction by snakes on the frame difference image. Figure 1 (b) shows the convergent result from the initial contour in Figure 1 (a). The number of iterations until convergence was 59. The number of initial control points was 72, and the number of final control points was the same. The weights wsp1, wsp2, wdiff and warea were set to 20.0, 5.0, 1.0 and 24.0, respectively. From Figure 1 (b), the contour of the walker was extracted. However, part of the walker's leg was not extracted properly because its movement was not large.
4.3 Topographic Map in Watershed Area
Next, we made a topographic map for the conditional watershed algorithm. Figure 2 shows the image obtained by morphological reconstruction after multiscale morphological gradient estimation and morphological filtering. From Figure 2, a luminance gradient image enhancing the contour with little influence of noise was obtained.
Fig. 2. Local luminance gradient image
Fig. 3. Watershed area (L = 9)
Watershed algorithm extracts the contour from the watershed area around the contour obtained by snakes. Figure 3 shows the watershed area with an expanding width of L = 9.
4.4 Contour Decision by Watershed Algorithm
Finally, we verified the contour obtained by the proposed method. Figure 4 shows the effectiveness of the new topographic function T with weight α in Eq. 6. From these figures, as α increases, the extracted contour gradually approaches the conditional watershed contours (α = 1.0). The extracted contours with α less than 0.8 have good smoothness, and the lost walker's leg can be partly recovered. In addition, the contour of the walker's head is extracted without wrong notches. Figures 5 and 6 show the results of contour extraction for the other test sequences, Bream and Japanese Room, respectively. Comparing these results, the proposed method extracts the contour more accurately than snakes and more smoothly than watershed. In particular, the right hand of the lady in Japanese Room is improved.
(a) α = 0.2
(c) α = 0.8
(b) α = 0.4
(d) α = 1.0
Fig. 4. Results of contour extraction (Hall Monitor)
(a) Snakes (α = 0.0)
(b) Watershed (α = 1.0)
(c) Proposed method (α = 0.8) Fig. 5. Results of contour extraction (Bream)
(a) Snakes (α = 0.0)
(b) Watershed (α = 1.0)
Fig. 6. Results of contour extraction (Japanese Room)
(c) Proposed method (α = 0.8) Fig. 6. (continued)
5 Conclusion
In the present paper, we proposed a technique for moving object extraction combining snakes and watershed algorithm. The simulation results show that the proposed method provides accurate moving object extraction. As a result, we have confirmed the potential of the novel moving object extraction method combining snakes and watershed algorithm. We will examine the possibility of adapting the proposed method to the extraction of moving objects from a moving background.
References
1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988)
2. Williams, D.J., Shah, M.M.: A fast algorithm for active contours. In: Proc. of 3rd ICCV, pp. 592–595 (1990)
3. Araki, S., Yokoya, N., Iwasa, H., Takemura, H.: Splitting active contour models based on crossing detection for extracting multiple objects. IEICE Trans. on Information and Systems J79-D-II(10), 1704–1711 (1996)
4. Vieren, C., Cabestaing, F., Postaire, J.-G.: Catching moving objects with snakes for motion tracking. Pattern Recognition Letters 16, 679–685 (1995)
5. Vincent, L., Soille, P.: Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Trans. Pattern Analysis and Machine Intelligence 13(6), 583–598 (1991)
6. Park, J., Keller, J.M.: Snakes on the watershed. IEEE Trans. Pattern Analysis and Machine Intelligence 23(10), 1201–1205 (2001)
7. Nguyen, H.T., Worring, M., van den Boomgaard, R.: Watersnakes: energy-driven watershed segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence 25(3), 330–342 (2003)
8. Cortez, D., et al.: Image segmentation towards new image representation methods. Signal Processing: Image Communication 6, 485–498 (1995)
9. Wang, D.: A multiscale gradient algorithm for image segmentation using watersheds. Pattern Recognition 30(12), 2043–2052 (1997)
10. Vincent, L.: Morphological grayscale reconstruction in image analysis: Applications and efficient algorithms. IEEE Trans. Image Processing 2(2), 177–201 (1993)
Constrained Inter Prediction: Removing Dependencies Between Different Data Partitions Yves Dhondt, Stefaan Mys, Kenneth Vermeirsch, and Rik Van de Walle Department of Electronics and Information Systems – Multimedia Lab Ghent University – IBBT Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium {yves.dhondt,stefaan.mys,kenneth.vermeirsch, rik.vandewalle}@ugent.be
Abstract. With the growing demand for low-delay video streaming in error-prone environments, error resilience tools, such as the data partitioning tool in the H.264/AVC specification, are becoming more and more important. In this paper, the introduction of constrained inter prediction into the H.264/AVC specification is proposed. Constrained inter prediction can help the data partitioning tool by removing the dependencies between partitions B and C, thereby making it possible to process partition C if partition B is lost. From the experimental results it is observed that the cost of introducing this technique can be neglected. Furthermore, when constrained inter prediction is used in combination with constrained intra prediction, the resulting bitstreams have an increased peak signal-to-noise ratio of up to 1.8 dB in error-prone environments compared to when only constrained intra prediction is used.
1 Introduction
Recently developed video coding specifications, such as H.264/AVC [1,2], achieve a high compression ratio thanks to their ability to exploit the temporal redundancy between successive frames. The downside of this technique is that loss of even the smallest packet can introduce an error which propagates through a number of successive frames, thereby severely damaging a large part of the decoded video. Currently, the streaming of multimedia content is done over packet-based networks like the Internet. Most of those networks implement the Internet Protocol (IP). The downside of this protocol is that it only provides a best-effort algorithm to transport data, meaning that there is no guarantee that sent data actually reach their destination. In some environments (e.g., wireless networks), the high packet-loss ratios make the streaming of video and other multimedia rather difficult. Unlike most applications, video streaming applications often have limited or no time to request a retransmission of the lost data. As a result, either the providers of coded video data have to make their data very robust against transmission errors or the players have to provide good reconstruction techniques to conceal the errors. The H.264/AVC specification defines several new tools to make bitstreams more robust. The most important ones are Flexible Macroblock Ordering [3,4,5], Redundant Slices [6,7] and Data Partitioning [8,9]. Flexible macroblock ordering allows
coding the different macroblocks within a frame in a non-trivial order, thereby breaking the prediction functionality but at the same time actively helping error concealment algorithms in their attempt to reconstruct missing macroblocks. The use of redundant slices can be compared to the retransmission of lost data with the big difference being that the redundant slice is transmitted, independent of the original coded slice being lost or not. Hence, the use of redundant slices introduces a significant overhead into the coded bitstream. Data partitioning divides the data of a coded slice into three partitions according to the importance of the data thereby allowing differentiated networks [10] to better protect the more important data. Since data partitioning comes down to a reordering and splitting of the syntax elements within a slice, its overhead can be ignored compared to the overhead of the other two tools. Although data partitioning looks the most promising error resilience tool, it has the drawback that there still are several dependencies between the different partitions. As a result, the loss of one partition can make other, correctly received partitions useless. Constrained intra prediction can help in removing the dependency of the partition containing data about intra-coded macroblocks on the partition containing data about inter-coded macroblocks. However, the inverse dependency remains. To remove that dependency as well, a new technique, called constrained inter prediction, is proposed in this paper. The remainder of this paper is organized as follows. In the next Section an introduction to data partitioning and the dependencies between the different partitions is given. In Section 3, constrained inter prediction is introduced and discussed. Then, in Section 4, the cost of constrained inter prediction is measured in terms of loss of coding efficiency. In Section 5, a decoder which can handle corrupted coded bitstreams using data partitioning is described after which experiments analyzing the benefits of constrained inter prediction in an error-prone environment are set up and discussed. This paper ends with some conclusions in Section 6.
2 Data Partitioning
In the H.264/AVC specification, the network layer is represented as an abstract concept. The advantage of this is that the video coding layer works completely independently of the network layer. Communication between the two layers is done by means of Network Abstraction Layer Units (NALUs). An H.264/AVC NALU consists of a one-byte header followed by an arbitrary-length payload. Normally, each coded slice is encapsulated into exactly one NALU. However, in the case of data partitioning, each coded slice is split into three parts, called data partitions, which are each encapsulated in a NALU of their own. Each of these NALUs can then be sent to the decoder in a specific way (e.g., different network, different priority). The H.264/AVC specification defines the three data partitions, labeled A, B, and C, as follows: partition A contains the slice header, macroblock types, quantization parameters, prediction modes, and motion vectors; partition B contains residual information of intra-coded macroblocks; partition C contains residual information of inter-coded macroblocks.
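Because each partition travels in its own NALU, a receiver can identify and route a partition from the NALU header alone. The sketch below is added here purely for illustration (it is not part of the original paper): it parses the standard one-byte NALU header and maps the NAL unit type values 2, 3 and 4, which H.264/AVC assigns to data partitions A, B and C, to a partition label.

```python
# Illustrative sketch: classify an H.264/AVC NAL unit by its one-byte header.
# nal_unit_type values 2, 3 and 4 denote data partitions A, B and C.
PARTITION_TYPES = {2: "A", 3: "B", 4: "C"}

def parse_nalu_header(first_byte: int):
    forbidden_zero_bit = (first_byte >> 7) & 0x01  # must be 0 in a valid stream
    nal_ref_idc        = (first_byte >> 5) & 0x03  # relative importance of the payload
    nal_unit_type      = first_byte & 0x1F         # payload type
    return forbidden_zero_bit, nal_ref_idc, nal_unit_type

def partition_label(first_byte: int):
    """Return 'A', 'B' or 'C' for a data-partition NALU, otherwise None."""
    _, _, nal_unit_type = parse_nalu_header(first_byte)
    return PARTITION_TYPES.get(nal_unit_type)
```

A differentiated network could, for instance, apply its strongest protection to every packet for which partition_label() returns 'A'.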
Fig. 1. Overview of the dependencies between the different data partitions and how some can be removed
Since it is possible to have multiple slices within a coded picture, there can be multiple partitions A, B, and C for a single coded picture. To identify which partitions belong to which slice, the syntax element slice_id is used. When arbitrary slice order is not allowed, the first slice of a coded picture shall have slice_id equal to zero and the value of slice_id shall be incremented by one for each subsequent slice of the coded picture in decoding order. Partition B (or C) can be empty if there are no intra-coded (or inter-coded) macroblocks in the coded slice. An encoder does not have to send, or signal, empty partitions to the decoder. Hence, a basic decoder will assume that missing partitions are empty partitions and handle the bitstream accordingly. The purpose of data partitioning is to divide the coded data into several partitions depending on the importance of the data. A network which can give different priorities to different packets can then protect the important data in a better way. However, by itself data partitioning does not remove any dependencies which might exist between the different partitions. As a result, the loss of one partition might make another partition useless. In the following paragraphs, the different dependencies will be discussed, as well as a standardized technique to remove some of them. A visual overview of the different dependencies can be seen in Fig. 1. To correctly parse partitions B and C, an H.264/AVC decoder has to know how each of the macroblocks within the slice was predicted. Hence, the information stored in partition A is needed by the parser. Therefore, if partition A gets lost, partitions B and C become useless. Partition A, on the other hand, does not need any information from the other partitions to be correctly parsed. Furthermore, if only partition A is received correctly, error concealment algorithms can still use data from it, like the motion vectors, to repair the damaged areas. So, while partitions B and C are dependent on partition A, the inverse is not true. When considering the dependencies between partitions B and C, things are slightly more complicated. Firstly, intra-coded macroblocks can be predicted by means of their neighbours without any restrictions on the coding type of those neighbouring macroblocks. So, it is possible for intra-coded macroblocks to use inter-coded macroblocks for their prediction. In such a case, partition B will be dependent on partition C. Secondly, there is the use of Context-based Adaptive Variable Length Coding (CAVLC) by the H.264/AVC specification. To achieve optimal compression efficiency, CAVLC uses the number of non-zero transform coefficients in neighbouring macroblocks to parse the number of non-zero transform coefficients in the current
macroblock. Since CAVLC does not take coding types into account, intra- and inter-coded macroblocks can use information from each other. Hence, partitions B and C are dependent on each other when CAVLC is used. Note that, due to the way the different profiles are defined in the H.264/AVC specification, data partitioning cannot be used in combination with Context-Based Adaptive Binary Arithmetic Coding (CABAC). Therefore, in the remainder of this paper, only the influence of CAVLC will be studied. In an attempt to partially remove those dependencies, constrained intra prediction was defined in the H.264/AVC specification. Using constrained intra prediction, intra-coded macroblocks can only be compressed using information from other intra-coded macroblocks within the same coded slice. This eliminates the first dependency. If constrained intra prediction is used in combination with data partitioning, then the total number of non-zero transform coefficients of a neighbouring macroblock is considered zero if the current macroblock is coded using an intra prediction mode while the other macroblock is coded using inter prediction. Hence, using constrained intra prediction, partition B can be decoded independently of partition C. Constrained intra prediction does not, however, make partition C independent of partition B, since the inter-coded data in partition C can still be predicted using the intra-coded data in partition B. Therefore, when a partition B is lost, the accompanying partition C is still not useful. This is a drawback since, most of the time, inter-coded pictures will contain far more inter-coded macroblocks than intra-coded ones. Hence, a small loss (partition B) will automatically result in a large loss (partition B and partition C). In the following section, constrained inter prediction, indicated in bold in Fig. 1, is proposed as a new technique to solve this problem.
3 Constrained Inter Prediction
In this section, constrained inter prediction is defined. Since constrained inter prediction is proposed as an extension to the H.264/AVC specification, its impact and a way to signal its presence in a bitstream are also discussed here.
3.1 Definition
We define constrained inter prediction as the constraint that inter-coded macroblocks can only be coded using information from previously coded pictures or other inter-coded macroblocks within the same slice. As one can see, this definition is very similar to the one for constrained intra prediction but targets inter-coded macroblocks rather than intra-coded ones. Just like constrained intra prediction, constrained inter prediction is only truly useful in combination with data partitioning. The major advantage of constrained inter prediction is that, within a coded slice, data from inter-coded macroblocks no longer depends on data from intra-coded macroblocks. As a result, when data partitioning is applied to a coded slice, partition C will no longer be dependent on partition B. This means that in an error-prone environment, partition C can still be processed if partition B gets lost or corrupted.
3.2 Impact of Constrained Inter Prediction
In the previous section, it was already mentioned that CAVLC does not normally take the coding type of neighbouring macroblocks into account. However, the use of constrained inter prediction in combination with data partitioning does force CAVLC to do so, since an inter-coded macroblock can no longer use data from intra-coded neighbours. Constrained inter prediction sets the total number of non-zero coefficients of a neighbour to zero if the current macroblock is coded using inter prediction while the other macroblock is coded using an intra prediction mode. Hence, constrained inter prediction requires a small but important change in CAVLC (sketched at the end of this section). Due to the change in CAVLC, bitstreams encoded using constrained inter prediction are no longer compliant with the H.264/AVC specification. Hence, the current generation of decoders will not be able to handle such bitstreams. Therefore, a possible way to signal constrained inter prediction for future decoders is presented in the next paragraph.
3.3 Signaling Constrained Inter Prediction in an H.264/AVC Bitstream
Constrained inter prediction is similar to constrained intra prediction. Therefore, its use could be signaled in the same way. This means that an extra one-bit syntax element, called constrained_inter_pred_flag, should be added to each picture parameter set, signaling whether constrained inter prediction is used in the bitstream or not. Unfortunately, no spare bits are provided in the picture parameter set for future use. A first solution is to extend the picture parameter set in a similar way as was done in the past for the sequence parameter set to add an alpha channel to H.264/AVC FRext [11]. This solution involves defining a new type of NALU. However, most decoders that come across NALUs they do not know how to process tend to skip those NALUs. Hence, these decoders would skip the extended parameter set and would not realize that something is different about a bitstream using constrained inter prediction. Therefore, they would probably crash during the parsing of CAVLC-coded data in a partition B and/or C. A second solution is to indicate the use of constrained inter prediction in the sequence parameter set on which the picture parameter set depends, using one of the four bits (reserved_zero_4bits) currently reserved for future use. The value of that bit should be 0 (false) if constrained inter prediction is not used, and 1 (true) if it is used. That way, if constrained inter prediction is not used, decoders implementing the current version of the H.264/AVC specification can still decode the bitstream successfully. Furthermore, if constrained inter prediction is used, by parsing that bit in the sequence parameter set, decoders will notice that the resulting bitstream is not compliant with the current H.264/AVC specification and can gracefully halt the decoding process. Although the second solution signals constrained inter prediction at sequence level rather than at picture level (sequence parameter set versus picture parameter set), it is less complex and will cause fewer problems for the current generation of decoders than the first solution. Therefore, we used the second solution in our experiments in the following sections.
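To make the required CAVLC change concrete, the following sketch (illustrative pseudocode only; the parameter names are ours and not identifiers from the reference software) shows how the non-zero coefficient count of a neighbouring block could be masked under the two constrained prediction tools before it enters the usual CAVLC context derivation.

```python
def effective_nonzero_count(nbr_count, nbr_is_intra, cur_is_intra,
                            constrained_intra, constrained_inter):
    """Force a neighbour's non-zero coefficient count to zero when the coding
    types differ and the corresponding constraint is active."""
    if constrained_intra and cur_is_intra and not nbr_is_intra:
        return 0  # constrained intra prediction: ignore inter-coded neighbours
    if constrained_inter and not cur_is_intra and nbr_is_intra:
        return 0  # proposed constrained inter prediction: ignore intra-coded neighbours
    return nbr_count

def cavlc_nc(nA, nB, a_available, b_available):
    """Standard CAVLC context: average of the available neighbours' counts."""
    if a_available and b_available:
        return (nA + nB + 1) >> 1
    if a_available:
        return nA
    if b_available:
        return nB
    return 0
```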
Table 1. Relative overhead of data partitioning (a) if no constrained prediction is used, (b) if only constrained intra prediction is used, and (c) if both constrained intra and inter prediction are used in combination with a GOP length of 15
QP    news – 1 slice/pic        football – 2 slices/pic    foreman – 4 slices/pic
      (a)    (b)     (c)        (a)    (b)     (c)         (a)    (b)     (c)
20    0.21   7.23    7.23       0.13   2.76    3.04        0.51   4.29    4.35
24    0.33   9.70    9.80       0.20   3.57    3.77        0.90   6.71    6.80
28    0.51   12.76   12.76      0.31   4.52    4.63        1.50   8.98    9.20
32    0.76   15.86   15.96      0.50   5.96    6.09        2.39   12.14   12.22
36    1.12   20.18   20.20      0.82   7.54    7.58        3.59   15.75   15.72
40    1.59   24.59   24.61      1.36   9.35    9.56        5.03   20.05   20.01
Table 2. Relative overhead of data partitioning (a) if no constrained prediction is used, (b) if only constrained intra prediction is used, and (c) if both constrained intra and inter prediction are used in combination with a GOP length of 30
QP    hall_monitor – 1 slice/pic   football – 2 slices/pic    mobile – 4 slices/pic
      (a)    (b)     (c)           (a)    (b)     (c)         (a)    (b)     (c)
20    0.12   1.72    1.85          0.14   2.39    2.78        0.18   0.70    0.71
24    0.30   5.31    5.32          0.22   3.20    3.40        0.27   0.99    1.00
28    0.67   9.93    10.16         0.34   4.02    4.26        0.44   1.44    1.55
32    1.23   16.62   16.70         0.54   4.89    5.03        0.85   2.52    2.67
36    1.98   20.65   20.84         0.89   6.06    5.90        1.67   4.37    4.63
40    2.97   26.31   26.35         1.50   7.16    7.51        2.89   6.68    6.62
4 Cost of Constrained Inter Prediction
In this section, the cost of constrained inter prediction, in terms of lost coding efficiency, is evaluated. Since constrained inter prediction is intended to be used in combination with data partitioning, it will only be evaluated for these cases. For this experiment, six different test sequences with various motion characteristics (e.g., object movement, camera movement) are used: news, hall monitor, mobile, foreman, canoa, and football. Each sequence has CIF resolution and is 210 frames long. The sequences are encoded using a modified version of JM 12.0 [12] with six different quantization parameters: 20, 24, 28, 32, 36, and 40. Furthermore, each sequence is coded using two different GOP sizes: 15 and 30. The first picture of a GOP is encoded as an instantaneous decoding refresh (IDR) picture. To minimize the impact of error propagation, intra macroblock refresh (IMBR) is used in such a way that each macroblock is intra-coded at least once within all the inter-coded pictures of a GOP. To do so, the IMBR rate is set to the number of macroblocks within a frame divided by the GOP size minus one, rounded up. For CIF resolution and a GOP of size 15 (resp. 30), this results in an IMBR of 29 (resp. 14); a short check of this rule is given at the end of this section. Finally, four versions of every configuration were generated: (1) one without data partitioning, (2) one with data partitioning but without constrained prediction, (3) one
with data partitioning and constrained intra prediction, and (4) one with data partitioning and constrained intra and inter prediction. Tables 1 and 2 contain the relative overhead of data partitioning using different constrained prediction modes, compared to the case where no data partitioning is used, for some of the coded bitstreams. Most of the overhead in the columns (a), where data partitioning without constrained prediction is used, can be explained by the way bitstreams are stored. The H.264/AVC Annex B syntax provides a synchronization marker of three (or four) bytes between different NALUs. Since data partitioning splits every NALU into three, an extra six bytes are needed to store a coded slice. The remaining overhead, caused by signaling the slice_id, can be neglected. Columns (b) and (c) indicate the relative overhead in case data partitioning is used in combination with constrained intra prediction, and with constrained intra and inter prediction, respectively. As one can see in columns (b), using constrained intra prediction can cause a rather large overhead, especially in low-motion sequences (e.g., hall_monitor and news) which are coded at a low quality. The overhead can be explained by the use of IMBR. In most cases, intra-coded macroblocks within inter-coded slices have no intra-coded neighbours. Hence, when using constrained intra prediction, they will be poorly predicted. The extra cost of adding constrained inter prediction on top of constrained intra prediction is, when columns (b) and (c) are compared, almost non-existent. In some rare cases, the overhead is actually negative. This is caused by statistical noise in CAVLC. When comparing the results for the football sequences in Tables 1 and 2, one notices that the GOP length does not have much influence on the overall cost. The results for a GOP of 30 are slightly better than those for a GOP of 15. This can be attributed to the fact that, when a GOP of 15 is used, the number of forced intra-coded macroblocks within inter-coded pictures is twice as large as when a GOP of 30 is used. In the above experiment, the cost of constrained inter prediction was not evaluated separately from the cost of constrained intra prediction, since both tools should be used in combination with each other. However, a small experiment, which will not be discussed here, was conducted to investigate this. The results showed an average overhead of less than 0.5%, i.e., similar to the difference in overhead between columns (b) and (c). Although constrained inter prediction is introduced to be used in combination with constrained intra prediction, these results show that when only constrained inter prediction is used, a bitstream using data partitioning can be made more robust without much change to the bitrate.
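As a quick sanity check of the intra-refresh setting described at the start of this section, the rule can be reproduced in a few lines (a CIF frame contains 22 x 18 = 396 macroblocks); this snippet is only an illustration of the arithmetic, not part of the encoder.

```python
import math

def imbr(num_macroblocks: int, gop_size: int) -> int:
    """Intra macroblocks refreshed per inter-coded picture."""
    return math.ceil(num_macroblocks / (gop_size - 1))

assert imbr(396, 15) == 29   # CIF, GOP of 15
assert imbr(396, 30) == 14   # CIF, GOP of 30
```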
5 Constrained Inter Prediction in an Error-Prone Environment In this section, an extension to the reference decoder is described, which can handle the loss of one or more data partitions. Using this extended decoder, constrained inter prediction is evaluated in a simulated error-prone environment. 5.1 A Decoder with Error Concealment Capabilities for Data Partitioning The H.264/AVC specification only describes the decoding process for compliant bitstreams. There is no standard way to handle missing slices or data partitions. In the
current version of the reference software, JM 12.0, several error concealment schemes are implemented to handle the loss of slices [13]. Unfortunately, none of these methods is able to handle the loss of one or more data partitions. In the following paragraphs, a short overview is presented of points to pay attention to when developing an algorithm to handle the loss of certain data partitions. Firstly, it may seem that an H.264/AVC decoder can use the slice_id to identify which partitions belong to the same slice and which do not. However, consider the scenario where coded pictures consist of only one coded slice. If, for the first coded picture, partition A is received correctly while, for the second coded picture, only partitions B and C are received correctly, then all three partitions will have the same slice_id while still belonging to two different coded pictures. A decoder handling these partitions will most likely hang during the parsing process, since the data in partitions B and C are not related to the data in partition A. Hence, detecting which partitions belong together should be done before the data reach the decoder (e.g., by the network receiver). Secondly, if a partition B or C is empty, according to the H.264/AVC specification it does not have to be sent to the decoder. As a consequence, a decoder is not directly able to spot the difference between empty and missing partitions. However, by parsing partition A, a decoder knows which types of macroblocks were used to code the slice, and therefore knows whether partitions B and C are actually needed to decode the coded slice correctly. Thirdly, when no constrained prediction is used, the loss of the partition C (or B) on which a partition B (or C) relies does not automatically imply that the received partition cannot be used at all. As long as only one partition is used, the parsing process will work correctly. It is only when an attempt is made to parse a part of the partition by means of information from the lost partition that the decoder will not be able to continue correctly. Hence, it is still possible to partially process a partition. Fourthly, if the inter-coded macroblocks of a coded slice can no longer be decoded due to the loss of partition C, the motion vectors of those macroblocks, which are stored in partition A, can still be used to conceal them. The authors are not aware of any techniques which are able to do something similar with the data stored for intra-coded macroblocks in partition A to conceal the loss of partition B. Keeping the above points in mind, we extended the H.264/AVC reference software to handle data partitioning with loss. A flow chart of the algorithm is shown in Fig. 2. Lost intra-coded macroblocks are marked as lost during the decoding phase and are afterwards repaired by the error concealment schemes already available in the reference software. For the concealment of lost inter-coded macroblocks, only the motion vectors from partition A are used. The residual data for those macroblocks are assumed to be zero. The bold parts of the flowchart show how the algorithm handles the loss of partition B in case constrained inter prediction is used. When this part is left out, the algorithm can also be used for the current generation of H.264/AVC bitstreams which use data partitioning.
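The decision logic of the flowchart in Fig. 2 can be condensed as follows. This is a hedged sketch written for this text, not the extension to the JM software itself; the function and action names are illustrative.

```python
def slice_actions(mb_modes, have_A, have_B, have_C, constrained_inter):
    """mb_modes: list of 'intra'/'inter' labels parsed from partition A."""
    if not have_A:
        return ["conceal_whole_slice"]           # nothing can be parsed without A
    actions = []
    for mode in mb_modes:
        if mode == "intra":
            # Residuals of intra-coded macroblocks live in partition B.
            actions.append("decode" if have_B else "mark_lost_for_concealment")
        else:
            # Residuals of inter-coded macroblocks live in partition C; without
            # constrained inter prediction, parsing C also requires partition B.
            usable = have_C and (constrained_inter or have_B)
            actions.append("decode" if usable
                           else "motion_compensate_with_zero_residual")
    return actions
```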
Fig. 2. Flowchart on how the extended decoder handles data partitioning (with loss)
Fig. 3. A two-state Gilbert model, with x indicating the chance that a packet will be lost if the previous packet was received correctly and y indicating the chance that a packet will be received correctly if the previous packet was lost
5.2 Experiments
Since the performance of constrained intra prediction has already been thoroughly studied in [8], the experiment in this section only focuses on the added value of constrained inter prediction on top of bitstreams already using constrained intra prediction. In the previous section, it was shown that the overhead of constrained inter prediction is extremely low. As a result, bitstreams using constrained inter prediction have approximately the same bitrate as bitstreams not using it. Therefore, the bitstreams with constrained intra and/or inter prediction generated for the cost analysis can be reused in this experiment.

Table 3. PSNR values of the luminance component of decoded sequences after concealment (a) if only constrained intra prediction is used and (b) if both constrained intra and inter prediction are used, in combination with a GOP length of 15 and 2 slices per coded picture
QP    canoa            news             foreman          hall_monitor
      (a)     (b)      (a)     (b)      (a)     (b)      (a)     (b)
20    34.67   36.01    40.33   40.65    38.04   39.44    38.38   39.91
24    32.22   33.70    38.93   38.98    36.34   36.90    37.88   38.25
28    30.47   31.53    36.76   37.14    35.05   35.69    36.38   36.60
32    28.95   29.81    34.60   35.14    33.72   33.87    34.57   34.93
36    27.84   27.86    32.32   32.60    31.22   31.60    32.64   32.78
40    25.87   25.96    29.83   29.98    29.66   29.94    29.43   30.30
Table 4. PSNR values of the luminance component of decoded sequences after concealment (a) if only constrained intra prediction is used and (b) if both constrained intra and inter prediction are used in combination with a GOP length of 30 and 4 slices per coded picture
QP    news             mobile           foreman          football
      (a)     (b)      (a)     (b)      (a)     (b)      (a)     (b)
20    38.41   39.34    30.93   32.76    36.54   37.41    32.23   33.06
24    37.59   38.62    29.80   30.92    35.45   35.66    30.98   31.93
28    36.20   36.89    28.38   28.84    34.48   34.76    30.56   31.73
32    34.47   34.53    27.52   28.32    32.35   32.91    29.15   30.17
36    32.19   32.45    26.03   26.06    30.22   31.52    28.88   29.23
40    29.79   29.91    24.24   24.33    29.41   29.78    27.05   27.76
As an error resilience tool, data partitioning is mainly intended to be used in differentiated networks where unequal error protection can be applied to the different partitions. To simulate this type of network, a separate channel with specific characteristics for each partition is used. Since the focus of this experiment is the impact of constrained prediction, only the impact of loss on partitions B and C will be investigated. Hence, the channel carrying partition A is assumed to be lossless. The channels use a two-state Gilbert model [14], as shown in Fig. 3, with x being 4.44% (7.50%) and y being 40.00% (30.00%) for the channel carrying partition B (C). This means an overall error rate of 10.00% (20.00%) for partition B (C). For the channels carrying partitions B and C, four error patterns are generated, which are then combined into 16 error patterns. Those patterns are applied to the bitstreams with and without constrained inter prediction. The resulting bitstreams are decoded using the adapted decoder. Finally, for the 16 versions of each bitstream, the Peak Signal-to-Noise Ratio (PSNR) of the different decoded sequences is calculated and averaged to measure the effect of constrained inter prediction. Tables 3 and 4 contain the results of the experiments. As one can see, the versions with both constrained intra and inter prediction (columns (b)) always outperform the versions with only constrained intra prediction (columns (a)). Constrained inter prediction seems to be most useful for bitstreams encoded with a quantization parameter close to zero (i.e., high quality).
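For reference, the loss patterns described above are easy to reproduce. The sketch below is an illustration written for this text, not the authors' simulation code; it draws loss flags from the two-state Gilbert model of Fig. 3, whose long-run loss rate is x/(x+y).

```python
import random

def gilbert_loss_pattern(n, x, y, seed=0):
    """Generate n loss flags: x = P(lost | previous received), y = P(received | previous lost)."""
    rng = random.Random(seed)
    lost, pattern = False, []
    for _ in range(n):
        pattern.append(lost)
        lost = (rng.random() < x) if not lost else (rng.random() >= y)
    return pattern

pattern_B = gilbert_loss_pattern(10000, x=0.0444, y=0.40)  # ~10% loss for partition B
pattern_C = gilbert_loss_pattern(10000, x=0.0750, y=0.30)  # ~20% loss for partition C
```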
6 Conclusions
In this paper, constrained inter prediction was presented. This technique can, when combined with data partitioning, make bitstreams more robust by removing the dependency of partition C on partition B. Experimental results showed that, unlike the cost of using constrained intra prediction, the cost of using constrained inter prediction is low. Furthermore, the H.264/AVC reference software was extended such that the loss of certain data partitions can be handled gracefully by the error concealment techniques available in the software. Experimental results, using the adapted decoder to handle data-partitioned bitstreams which were sent over a differentiated error-prone network, showed that the use of constrained inter prediction results in video streams whose peak signal-to-noise ratio is up to 1.8 dB higher than when only constrained intra prediction is used.
Acknowledgment The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.
References
1. Advanced video coding for generic audiovisual services, ITU-T Recommendation H.264 (2005)
2. Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol. 13, 560–576 (2003)
3. Wenger, S., Horowitz, M.: Flexible macroblock ordering (FMO) 101 (2002), available from http://ftp3.itu.ch/av-arch/jvt-site/, _07_Klagenfurt/JVT-D063.doc
4. Lambert, P., De Neve, W., Dhondt, Y., Van de Walle, R.: Flexible macroblock ordering in H.264/AVC. Journal of Visual Communication and Image Representation 17, 358–375 (2006)
5. Dhondt, Y., Mys, S., Lambert, P., Van de Walle, R.: An evaluation of flexible macroblock ordering in error-prone environments. In: Proceedings of the SPIE/Optics East conference, Boston (2006)
6. Xu, J., Wu, Z.: A perceptual sensitivity based redundant slices coding scheme for error-resilient transmission of H.264/AVC video. In: Proceedings of the IEEE International Conference on Communications, Circuits and Systems, vol. 1, pp. 139–142. IEEE, Los Alamitos (2006)
7. Rane, S., Girod, B.: Systematic lossy error protection of video based on H.264/AVC redundant slices. In: Proceedings of the Visual Communication and Image Processing VCIP2006 conference, vol. 1 (2006)
8. Stockhammer, T., Bystrom, M.: H.264/AVC data partitioning for mobile video communication. In: Proceedings of the IEEE International Conference on Image Processing, vol. 1, pp. 545–548. IEEE, Los Alamitos (2004)
9. Mys, S., Dhondt, Y., Van de Walle, D., De Schrijver, D., Van de Walle, R.: A performance evaluation of the data partitioning tool in H.264/AVC. In: Proceedings of the SPIE/Optics East conference, Boston (2006)
10. Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., Weiss, W.: An architecture for differentiated services. RFC 2475, IETF (1998)
11. Haskell, B., Singer, D.: Addition of alpha channel to AVC/H.264 FRext (2004), available from http://ftp3.itu.ch/av-arch/jvt-site/, _07_Redmond/JVT-L013r3.doc
12. JVT H.264/AVC reference software, available from http://iphome.hhi.de/suehring/tml/download/
13. Wang, Y.-K., Hannuksela, M., Varsa, V., Hourunranta, A., Gabbouj, M.: The error concealment feature in the H.26L test model. In: Proceedings of the IEEE International Conference on Image Processing, vol. 2, pp. 729–732. IEEE, Los Alamitos (2002)
14. Gilbert, E.: Capacity of a burst-noise channel. Bell System Technical Journal 39, 1253–1265 (1960)
Performance Improvement of H.264/AVC Deblocking Filter by Using Variable Block Sizes

Seung-Ho Shin1, Duk-Won Oh2, Young-Joon Chai3, and Tae-Yong Kim3

1,3 GSAIM, Chung-Ang University, Seoul, Korea
1,2 TU Media Corp, Seoul, Korea
[email protected], [email protected], {chai1014,kimty}@cau.ac.kr
Abstract. Compared with existing compression technologies, H.264/AVC supports variable block motion compensation, multiple reference images, 1/4-pixel motion vector accuracy, and an in-loop deblocking filter. While these coding technologies are the main sources of the improvement in compression rate, they also lead to high complexity. For the H.264 video coding technology to be applied on low-end/low-bit-rate terminals more extensively, it is essential to improve the coding speed. Currently the deblocking filter, which can considerably improve the subjective quality of moving pictures, is used on low-end terminals only to a limited extent due to its computational complexity. In this paper, a performance improvement method is suggested for the deblocking filter that efficiently reduces the blocking artifacts occurring during the compression of low-bit-rate digital motion pictures. Blocking artifacts are grid-like patterns that appear at block boundaries due to DCT and quantization. In the proposed method, the image's spatial correlation characteristics are extracted by using the variable block information of motion compensation; the filtering is divided into 4 modes according to these characteristics, and adaptive filtering is executed in the divided regions. The proposed deblocking method reduces the blocking artifacts, prevents excessive blurring effects, and improves the performance by about 40% compared with the existing method.
Keywords: H.264, AVC, deblocking filter, loop filter, variable blocks.
1 Introduction
H.264, using new video coding technologies, increases the compression rate at the same image quality compared with the existing H.263v2 (H.263+) [2] or MPEG-4 Visual (Part 2) [3]. The remarkable characteristics of H.264 include variable block motion compensation, multiple reference images, 1/4-pixel motion vector accuracy, and an in-loop deblocking filter [1]. Although such coding technologies are the main means of improving compression efficiency, their complexity leads to an unavoidable increase in the coding effort. Performance improvements that decrease the complexity while preventing quality deterioration are necessary to adapt the newly defined techniques to low-end terminals.
In this paper, methods to improve the performance of the deblocking filter, in order to enhance the subjective image quality on low-end/low-bit-rate terminals, are presented. In the H.264 standard, the deblocking filter, also called the loop filter, is used to decrease blocking artifacts. Blocking artifacts are a distortion that appears in compressed video material as abnormally large pixel blocks. They are especially visible in fast motion sequences or quick scene changes. Therefore, deblocking filtering is necessary to decrease such distortion on the boundaries between macroblocks. Since H.264 can segment 16x16 macroblocks down to 4x4 blocks, it is possible to decrease artifacts on the boundaries between 4x4 blocks. In the process of decreasing the blocking artifacts, however, the actual edges of the image may erroneously be blurred. Moreover, the filter may not be usable on low-end terminals due to its complex computation and large memory requirements. Despite such shortcomings, the deblocking filter can be said to be the most essential technology for enhancing the subjective image quality. Subjective image quality tests show that there is a clearly noticeable difference between the image qualities with and without the deblocking filter [8], [9]. In this paper, we suggest a method to enhance the filtering performance by executing deblocking filtering using the variable block information of the motion compensation. By using the variable block information and considering human visual characteristics and moving picture characteristics, the filtering is classified into 4 modes and the filter structure is adapted accordingly. In Section 2, the in-loop deblocking filter, the characteristic coding technology of H.264, and variable block-size motion compensation are introduced. In Section 3, the variable block-based deblocking filter is proposed. The proposed method is verified through experiments in Section 4. The conclusions are stated in Section 5.
2 Deblocking Filter in H.264/AVC
2.1 In-Loop Deblocking Filter
In H.264/AVC, the block distortion is reduced by using the adaptive in-loop deblocking filter. The H.264/AVC deblocking filter can be applied to the edges of all 4x4 blocks in a macroblock, except for edges on slice boundaries. In order to apply the filter to each macroblock, the filtered pixels at the top and on the left of the current macroblock are used, and the luma and chroma components are processed separately. Filtering is applied to the vertical and horizontal edges of the 4x4 blocks in a macroblock: the vertical edges are filtered first, from left to right, and then the horizontal edges, from top to bottom. For the 16x16 luma component, the filter is applied to four 16-pixel edges; for the 8x8 chroma components, it is applied to two 8-pixel edges. Fig. 1 shows four samples on either side of a vertical or horizontal boundary in adjacent blocks p and q (p0, p1, p2, p3 and q0, q1, q2, q3). The strength of the filter depends on the current quantization, the coding modes of neighbouring blocks and the gradient of image samples across the boundary.
Fig. 1. Edge filtering order in a macroblock
The filtering process consists of three steps: boundary strength selection, filter decision, and filter implementation, as shown in Fig. 2 [10].
Fig. 2. Deblocking filter process
• Boundary Strength: In this step, it is decided whether filtering is needed and how much strength is applied. The choice of filtering outcome depends on the boundary strength and on the gradient of the image samples across the boundary. The boundary strength parameter (bS) is chosen according to the rules shown in Table 1.

Table 1. Selection of Boundary Strength (bS)

Condition                                                                          bS
p and/or q is intra coded and the boundary is a macroblock boundary                4
p and q are intra coded and the boundary is not a macroblock boundary              3
Neither p nor q is intra coded; p and q contain coded coefficients                 2
Neither p nor q is intra coded; neither p nor q contains coded coefficients;
  p and q use different reference pictures or a different number of reference
  pictures, or have motion vector values that differ by one luma sample or more    1
otherwise                                                                          0
The result of applying these rules is that the filter is stronger at places where there is likely to be significant blocking distortion, such as the boundary of an intra coded macroblock or a boundary between blocks that contain coded coefficients.
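A compact paraphrase of the bS selection in Table 1 is given below. This is an illustration added to this text, not code from the JM software; the boolean parameter names are our own.

```python
def boundary_strength(p_intra, q_intra, mb_boundary,
                      coded_coeffs, same_refs, mv_diff_small):
    """coded_coeffs: p and q contain coded coefficients;
    same_refs: same reference picture(s) and the same number of them;
    mv_diff_small: motion vectors differ by less than one luma sample."""
    any_intra, both_intra = (p_intra or q_intra), (p_intra and q_intra)
    if any_intra and mb_boundary:
        return 4
    if both_intra and not mb_boundary:
        return 3
    if not any_intra and coded_coeffs:
        return 2
    if not any_intra and not coded_coeffs and (not same_refs or not mv_diff_small):
        return 1
    return 0   # otherwise: no filtering
```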
• Filter Decision: When bS has been chosen for the block, the filtering of boundary samples is determined by analyzing each pixel on the block boundary. A group of samples from the set (p2, p1, p0, q0, q1, q2) is filtered only if:

bS > 0 and |p0-q0| < α and |p1-p0| < β and |q1-q0| < β .                    (1)

α and β are thresholds defined in the standard [1]. They increase with the average quantizer parameter (QP) of the two blocks p and q. The effect of the filter decision is to 'switch off' the filter when there is a significant change across the block boundary in the original image. When QP is small, anything other than a very small gradient across the boundary is likely to be due to image features, which should be preserved, rather than block effects, and so the thresholds α and β are low. When QP is larger, blocking distortion is likely to be more significant and α, β are higher so that more boundary samples are filtered.

• Filter Implementation: After the boundary strength and filter decision, filtering is applied according to the following rules:
(a) In the case of bS < 4: A 4-tap Finite Impulse Response (FIR) filter is applied with inputs p1, p0, q0 and q1, producing filtered outputs p'0 and q'0. If |p2-p0| is less than threshold β, another 4-tap filter is applied with inputs p2, p1, p0 and q0, producing filtered output p'1. If |q2-q0| is less than threshold β, a 4-tap filter is applied with inputs q2, q1, q0 and p0, producing filtered output q'1.
(b) In the case of bS = 4: Filtering is applied according to the rules in Table 2.

Table 2. Filter implementation in the case of bS = 4

block  rule                                          input                FIR filter  output
p      If |p2-p0| < β and |p0-q0| < round(α/4)       p2, p1, p0, q0, q1   5-tap       p'0
       and this is a luma block                      p2, p1, p0, q0       4-tap       p'1
                                                     p3, p2, p1, p0, q0   5-tap       p'2
       else                                          p1, p0, q1           3-tap       p'0
q      If |q2-q0| < β and |p0-q0| < round(α/4)       q2, q1, q0, p0, p1   5-tap       q'0
       and this is a luma block                      q2, q1, q0, p0       4-tap       q'1
                                                     q3, q2, q1, q0, p0   5-tap       q'2
       else                                          q1, q0, p1           3-tap       q'0
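The filter decision of Eq. (1) amounts to a simple predicate on the four samples closest to the boundary; the sketch below (illustrative only, not code from the standard or the JM software) states it directly.

```python
def filter_this_boundary(bS, p1, p0, q0, q1, alpha, beta):
    """Eq. (1): True if the sample group across this boundary is filtered."""
    return (bS > 0 and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta and abs(q1 - q0) < beta)
```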
2.2 Variable Block-Size Motion Compensation
The variable block-size motion compensation (VBSMC) technology of H.264 adapts well to the image and motion characteristics by dividing the motion compensation blocks more finely than the existing H.263 or MPEG-2/4. In MPEG-2, a fixed 16x16-pixel motion compensation block is used; in MPEG-4 Visual (Part 2), two motion compensation block sizes, 16x16 and 8x8, are used [3]. In contrast, H.264 uses seven motion compensation block sizes, from 16x16 down to 4x4 pixels, to compensate motion (Fig. 3).
Fig. 3. Variable blocks used for motion compensation in H.264
For flat regions, or when objects are large, motion compensation is executed with large 16x16 blocks; for complex regions, or when objects are small, motion compensation is performed with small blocks such as 4x4. In general, the smaller the blocks used for motion compensation, the better the motion compensation results that can be obtained. If the block size gets smaller, however, more searches have to be carried out, which increases the complexity and the number of motion vectors to transmit. To address this problem, H.264 carries out adaptive motion compensation, which selects the block size according to the image characteristics [4]. The luma component of each macroblock may be split in MB mode and motion compensated either as one 16x16 partition, two 16x8 partitions, two 8x16 partitions or four 8x8 partitions. If the 8x8 mode is chosen, each of the four 8x8 sub-macroblocks within the macroblock may be split in sub-MB mode, either as one 8x8 sub-MB partition, two 8x4 sub-MB partitions, two 4x8 sub-MB partitions or four 4x4 sub-MB partitions [10].
3 Variable Block-Based Deblocking Filter
As the screen size of images gets larger, the computational cost of the deblocking filter increases proportionally. The filtering method currently adopted in the H.264 standard chooses its filter coefficients in many different ways, according to the characteristics of the adjacent blocks, the reference pictures, and I/P/B coding. Since "if" statements are used profusely in the selection of the filter coefficients, fast computation through pipelining cannot be expected in the implementation of an actual deblocking filter [6]. As a result, many commercial H.264 codecs tend not to use the deblocking filter for real-time coding, which results in severe image deterioration as time passes. Generally, in motion compensation, variable blocks are divided into blocks of 16x16, 16x8, and 8x16 (MB mode) in flat regions or for large objects, and into blocks of 8x8, 8x4, 4x8, and 4x4 (sub-MB mode) in complex and fine regions with lots of motion [11]. Moreover, the human visual system (HVS) is more sensitive to discontinuities in flat and simple regions than in complex regions. In flat and simple regions, strong filtering is applied to decrease the block distortion phenomenon; in complex and fine regions, weak filtering is applied in order to prevent the edges of actual objects from being blurred [7]. The remarkable features of the variable block-based deblocking filter proposed in this paper are as follows:
− Filtering is executed using the characteristics of the moving picture between adjacent blocks, in accordance with human perception.
− The variable block-size segmentation information embedded in the motion compensation is reused.
− An adaptive filter is applied in four separate filter modes.
In brief, the deblocking filtering performance can be improved by analyzing the image characteristics using the variable block information of the motion compensation. Thus, we can reduce the blocking artifacts without much quality degradation compared with the existing method.
Fig. 4. Modified H.264 codec and deblocking filter structures
Fig. 4 shows the H.264 codec and deblocking filter structures when the proposed method is applied. Since the variable block information of motion compensation needs to be used, the deblocking filter module is located after the motion compensation processor.

Table 3. Decision of filter mode (p: adjacent block, q: current block)

Region     Condition                                                       Filter mode    Filtering
flat       p and q are MB mode and boundary is a 16-pixel boundary         4              strongest
simple     p and q are MB mode and boundary is an 8-pixel boundary         3              strong
normal     p and/or q is sub-MB mode and boundary is an 8-pixel boundary   2              normal
complex    p and/or q is sub-MB mode and boundary is a 4-pixel boundary    1              weak
otherwise                                                                  0              no filtering
Table 3 shows the descriptions of image spatial regions and filter modes for the filter implementation. The filtering is divided into 4 filter modes according to the defined rules in Table 3, and adaptive filtering is implemented in the divided regions. Filtering starts in the vertical direction of the whole macroblock excluding the edges of slice boundaries and then proceeds in the horizontal direction. The filtering
is executed based not on the 4x4 blocks within a 16x16 macroblock, as in the existing method, but on the variable blocks of motion compensation. The pixel values changed in the vertical filtering are reflected in the horizontal filtering. Figs. 5, 6, and 7 show examples of determining the filter mode according to the characteristics of the adjacent blocks in the horizontal filtering of 16x16 blocks.
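The mode decision of Table 3 can be summarized in a few lines; the following sketch is an illustration written for this text (not the authors' implementation) and assumes the boundary length in pixels is known for the edge being filtered.

```python
MB_MODES = {"16x16", "16x8", "8x16"}      # macroblock-level partitions

def filter_mode(p_block, q_block, boundary_len):
    """Map the partition modes of adjacent block p and current block q plus
    the boundary length onto the filter modes of Table 3."""
    both_mb = p_block in MB_MODES and q_block in MB_MODES
    if both_mb and boundary_len == 16:
        return 4      # flat region: strongest filtering
    if both_mb and boundary_len == 8:
        return 3      # simple region: strong filtering
    if not both_mb and boundary_len == 8:
        return 2      # normal region
    if not both_mb and boundary_len == 4:
        return 1      # complex region: weak filtering
    return 0          # otherwise: no filtering
```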
Fig. 5. Filter mode (4) decision of 16x16 variable blocks
Fig. 5 shows the case in which the adjacent blocks of the current 16x16 block were selected as 16x16 or 8x16 in the motion compensation process. In such a case, assuming that the image surrounding the boundaries is a flat region, the strongest filtering is applied by assigning filter mode (4). In order to reduce the blocking artifacts inside the block as well as on the block boundary, filtering is applied to all edges of the 4x4 blocks in the 16x16 block: the 4 boundaries of the luma component and 2 boundaries of each chroma component. The filtering process is the same as in the existing method. Fig. 6 shows the case in which the adjacent blocks of the current 16x16 block are 16x8 or 8x8. In this case, assuming that the image that surrounds the boundary is the simple
Fig. 6. Filter mode (3) decision of 16x16 variable blocks
region, filter mode (3) is assigned. Filtering is applied to the 8 pixels on the concerned block boundaries, p0 ~ p3 and q0 ~ q3. The 8 pixels centered on the block boundary, p3, p2, p1, p0, q0, q1, q2 and q3, are designated as filtering pixels and a 9-tap FIR filter with weights (1/16, 1/16, 1/8, 1/8, 1/4, 1/8, 1/8, 1/16, 1/16) is applied to them.
Fig. 7. Filter mode (1) decision of 16x16 variable blocks
Fig. 7 shows the case in which the adjacent blocks of the current 16x16 block are 8x4 or 4x4. Assuming that this is the case where actual feature edges exist in a complex region of the image, the weakest filtering is applied by assigning filter mode (1). The filtering is applied to 2 pixels, p0 and q0, centered on the concerned block boundary. Since the filtering must be done most finely and carefully, only the two boundary pixels p0 and q0 are designated as filtering pixels, and the pixels p0' and q0' are produced by applying the following filtering formulas:

d = (3p1 – 8p0 + 8q0 – 3q1) / 16.                                  (2)
d' = sign(d) · Max[0, |d| – Max(0, 2(|d| – QP))].                  (3)
p0' = p0 + d'.    q0' = q0 + d'.                                   (4)
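For reference, the weak filtering of mode (1) can be written directly from Eqs. (2)-(4); the sketch below is only a transcription of the formulas exactly as printed above (including the sign convention used for q0'), not code from the authors' implementation.

```python
def filter_mode_1(p1, p0, q0, q1, qp):
    """Weak filtering of the two boundary pixels, Eqs. (2)-(4) as printed."""
    d = (3 * p1 - 8 * p0 + 8 * q0 - 3 * q1) / 16.0                       # Eq. (2)
    sign = 1.0 if d >= 0 else -1.0
    d_clipped = sign * max(0.0, abs(d) - max(0.0, 2 * (abs(d) - qp)))    # Eq. (3)
    return p0 + d_clipped, q0 + d_clipped                                # Eq. (4): p0', q0'
```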
When the current filtering blocks are 16x16 or 8x16, which are MB modes, and the adjacent blocks are 8x8 or 4x8, which are sub-MB modes, the region is assumed to be a normal region that is more detailed than the case of Fig. 6, and filter mode (2) is assigned. In this case, assuming that blocking artifacts and actual image feature edges coexist, the filtering is applied to a range of 4 pixels, p0, p1, q0, and q1, centered on the concerned block boundary. By using the following formulas, p1', p0', q0' and q1' are produced by filtering p1, p0, q0 and q1:

d = (p0 – q0) / 5.                                                 (5)
p1' = p1 + sign(d) * |d|.                                          (6)
p0' = p0 + 2 * sign(d) * |d|.                                      (7)
q0' = q0 – 2 * sign(d) * |d|.                                      (8)
q1' = q1 – sign(d) * |d|.                                          (9)
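Similarly, the mode (2) filtering follows directly from Eqs. (5)-(9); note that sign(d)·|d| is simply d. Again, this is a transcription of the printed formulas for illustration, not the authors' code.

```python
def filter_mode_2(p1, p0, q0, q1):
    """Normal filtering of four pixels around the boundary, Eqs. (5)-(9) as printed."""
    d = (p0 - q0) / 5.0          # Eq. (5)
    return (p1 + d,              # Eq. (6): p1'
            p0 + 2 * d,          # Eq. (7): p0'
            q0 - 2 * d,          # Eq. (8): q0'
            q1 - d)              # Eq. (9): q1'
```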
Fig. 8 shows further examples of filter mode (2) or (1) for a current block adjacent to its upper and left blocks.
Fig. 8. Example of Filter Mode (2) or (1) in variable blocks
In this way, the filter mode is determined adaptively by examining the adjacent blocks among the 7 types of variable blocks, and the filter implementation corresponding to the filter mode is applied to the pixels on the concerned block boundary. The filtering in the vertical direction proceeds following the same process as the horizontal case. The filter modes at the horizontal and vertical boundaries are listed in Table 4.

Table 4. Filter modes adjacent to edge boundaries in variable blocks (H: horizontal filtering, V: vertical filtering)

block    16x16    16x8     8x16     8x8      8x4      4x8      4x4
         H   V    H   V    H   V    H   V    H   V    H   V    H   V
16x16    4   4    3   3    3   4    2   2    2   1    1   2    1   1
16x8     4   3    4   3    3   3    2   2    2   1    1   2    1   1
8x16     3   4    3   3    3   4    2   2    2   1    1   2    1   1
8x8      3   2    3   2    2   2    2   2    2   1    1   2    1   1
8x4      2   2    2   2    2   2    2   1    2   1    1   1    1   1
4x8      1   2    1   2    1   2    1   2    1   1    1   2    1   1
4x4      1   2    1   2    1   2    1   1    1   1    1   1    1   1

Fig. 9. Example of filter mode selection using variable blocks
Fig. 9 shows the example of filter mode decision when the proposed method was applied.
4 Experimental Results
In order to measure the performance of the deblocking filter proposed in this paper, the encoder JM (Joint Model) version 10 [12], recommended by the H.264 standardization group, was used for the experiments. Since the H.264 standardization group recommends comparing the bit rate difference in percent (%) and the PSNR difference in order to evaluate the effect on the image quality, ∆PSNR was used to evaluate the result. The performance improvement through reduced computational cost is measured using Eq. (11):

∆PSNR (dB) = (JM's PSNR – proposed method's PSNR).                                                           (10)
Computation reduction (%) = (JM's computational cost – proposed method's computational cost) / (JM's computational cost) × 100.    (11)
The sequences used in the experiments contain various characteristics. The format was QCIF, and the luma and chroma components were sampled 4:2:0. The experiments were conducted by changing the quantization parameter (QP) in the I and P frames.
Fig. 10. Experimental results of the Foreman sequence
Fig. 10 shows the detailed result values of Foreman, one of the experimental sequences. Table 5 shows the averages of the PSNR differences and the computation reduction of the experimental sequences.

Table 5. Average results of the experimental sequences (R1: ∆PSNR (dB), R2: computation reduction (%))

QP      Container         Stefan            Mobile            News
        R1      R2        R1      R2        R1      R2        R1      R2
22     –0.21   38.82     –0.23   49.17     –0.17   35.83     –0.13   47.63
26     –0.19   32.99     –0.21   45.88     –0.15   37.91     –0.14   42.20
30     –0.26   27.86     –0.19   42.33     –0.13   36.74     –0.15   39.47
34     –0.29   25.16     –0.24   40.10     –0.17   40.57     –0.14   36.17
avg.   –0.24   31.21     –0.22   44.37     –0.16   37.76     –0.14   41.36
According to the results in Table 5, the average PSNR reduction (∆PSNR) was 0.19 dB, i.e., almost no change in objective picture quality, while the total average processing time decreased by 38.67% compared with the existing method. It is judged that the 23.89% reduction in the number of executed "if" and "for" statements contributed to the reduction in the total processing time.
5 Conclusions
Due to its complex computation, it is difficult to implement or apply the H.264 deblocking filter on low-end terminals such as wireless communication devices or mobile phones. Moreover, the more the H.264 coding method is optimized, the more the deblocking filter's computational complexity increases. The method proposed in this paper enhances the performance by executing deblocking filtering selectively, using the variable block information of motion compensation, compared with the existing method. In particular, it is not necessary to analyze the image characteristics (e.g., flat regions or complex regions) separately for the filter implementation. According to the image characteristics, strong filtering is executed in flat regions to minimize the blocking artifacts, and weak filtering is applied in complex regions in order to preserve the image features as much as possible. The filtering was executed on a variable-block basis to decrease the computational cost. As a result, the filtering speed improved without much deterioration of quality. According to the results, it was verified that the computational cost can be decreased by about 40% without much quality degradation. Therefore, it is expected that the implementation of the deblocking filter on low-end/low-bit-rate terminals is possible by decreasing the complexity of the deblocking filter with the proposed method.
Acknowledgments. This research was supported by the ITRC (Information Technology Research Center, MIC) program and the Seoul R&BD program, Korea.
References
1. Draft ITU-T Recommendation and Final Draft International Standard of Joint Specification (ITU-T Rec. H.264/ISO/IEC 14496-10 AVC) (March 2003)
2. Draft ITU-T Recommendation H.263, Video coding for low bitrate communication. Telecommunication Standardization Sector of International Telecommunication Union (October 1995)
3. ISO/IEC 14496-2, Information technology - coding of audio-visual objects. Part 2: Visual (December 2001)
4. Ahmad, A., Khan, N., Masud, S., Maud, M.A.: Selection of variable block sizes in H.264. In: IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP '04), vol. 3, pp. 173–176. IEEE, Los Alamitos (2004)
5. Cheng, C.C., Chang, T.S.: An efficient deblocking filter for H.264/AVC. In: IEEE Int'l Conf. on Consumer Electronics, IEEE Computer Society Press, Los Alamitos (2005)
6. Huang, Y., Chen, T.: Architecture Design for Deblocking Filter in H.264/AVC. In: Proceedings of ICME, Baltimore, Maryland, USA, July 6-9, 2003, pp. 693–696 (2003)
7. Kim, S.D., Yi, J., Kim, H.M., Ra, J.B.: A deblocking filter with two separate modes in block-based video coding. IEEE Trans. Circuits Syst. Video Technol. 9, 156–160 (1999)
8. Lee, Y.L., Park, H.W.: Loop filtering and post-filtering for low-bit-rates moving picture coding. Signal Processing: Image Commun. 16, 871–890 (2001)
9. List, P., Joch, A., Lainema, J., Bjontegaard, G., Karczewicz, M.: Adaptive Deblocking Filter. IEEE Trans. Circuits Syst. Video Technol. 13(7) (2003)
10. Richardson, I.E.G.: H.264 and MPEG-4 Video Compression, pp. 170–187. John Wiley & Sons, Chichester (2003)
11. Zhou, Z., Sun, M.T., Hsu, Y.F.: Fast variable block-size motion estimation algorithm based on merge and split procedures for H.264/MPEG-4 AVC. In: International Symposium on Circuits and Systems, vol. 3, pp. 725–728 (2004)
12. JVT software JM10.2 (May 2006)
Real-Time Detection of the Triangular and Rectangular Shape Road Signs

Bogusław Cyganek

AGH - University of Science and Technology
Al. Mickiewicza 30, 30-059 Kraków, Poland
[email protected]
Abstract. Road sign recognition systems are developed to assist drivers and to help increase traffic safety. Shape detectors constitute the front end of the majority of such systems. In this paper we propose a method for robust detection of triangular, rectangular and rhombus-shaped road signs in real traffic scenes. It starts with segmentation of colour images. For this purpose, histograms were created from hundreds of real warning and information signs. Then the characteristic points are detected by means of the developed symmetrical detector of local binary features. The points are further clustered and used to select shapes from the input images. Finally, the shapes are verified to fulfil the geometrical properties defined for the road signs. The proposed detector shows high accuracy and a very fast operation time, which was verified experimentally.
1 Introduction
The purpose of driver assistance systems is to facilitate car driving by providing an additional level of security. Recognition of road signs (RSs) constitutes a part of such systems. Information on passing signs can alert a driver to prevent dangerous situations. For instance, recognition of a sign warning about a road or railway intersection can be checked against the current speed of the vehicle and, if that speed is excessive, a warning message can be passed to the driver. Much research has been done towards the development of robust RS recognition systems. For a review one can refer to [1-3][5-10]. Shape detection constitutes the first stage in the majority of these systems. Its reliable operation in a noisy environment is a prerequisite for successful classification. In this paper a novel method is presented for detection of triangular, rectangular and diamond shapes for the purpose of detecting warning and information signs in real traffic scenes. These are signs from the groups "A" and "D" in the Polish legislation [11], respectively. However, the method can be used with other signs and even for detection of different objects. It starts with colour segmentation, which is based on simple thresholds acquired from empirically created colour histograms of real signs. Another segmentation method, based on support vector classifiers, is presented in [3] (that paper also presents a detector for circular RSs). The segmented images are then processed by the detector of local binary features, from which the salient points and, finally, the shapes of interest are inferred.
2 Architecture of the Road Signs Detector An overview of a complete RS recognition system is presented in Fig. 1. In this paper we focus on the first two stages (gray), whereas classification is dealt with in [1-3].
Fig. 1. General architecture of the complete road signs recognition system
Fig. 2 depicts the modules pertinent to the image acquisition and filtering stage in Fig. 1. The processing starts with image acquisition with the Marlin F-033C camera, which is also programmed to perform low-pass filtering. Then colour segmentation is performed to obtain a binary image with selected regions of potential signs. The segmentation is done in the HSI space with a simple threshold method, thanks to the colour histograms acquired from many hundreds of real examples. The threshold values for the different colours encountered in the Polish road signs are presented in Table 1. The values refer to the normalized [0-255] HSI space.
Fig. 2. Block diagram of the image acquisition and filtering stages

Table 1. Empirical threshold values for different colours encountered in the Polish road signs. The values refer to the normalized [0-255] HSI space.

Colour    Hue          Saturation
Blue      [120-165]    [80-175]
Yellow    [15-43]      [95-255]
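A minimal sketch of this threshold-based segmentation is given below; it is written here for illustration (the function names are assumptions, not the author's code) and assumes hue and saturation values already normalized to [0-255].

```python
# Threshold ranges from Table 1, in the normalized [0-255] HSI space.
THRESHOLDS = {
    "blue":   {"hue": (120, 165), "sat": (80, 175)},
    "yellow": {"hue": (15, 43),   "sat": (95, 255)},
}

def in_range(value, bounds):
    lo, hi = bounds
    return lo <= value <= hi

def is_sign_colour(h, s, colour="blue"):
    """True if a pixel with hue h and saturation s falls inside the
    empirical range for the given sign colour."""
    t = THRESHOLDS[colour]
    return in_range(h, t["hue"]) and in_range(s, t["sat"])
```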
Fig. 3 depicts the modules of the shape detector which are described in this paper. The process starts with detection of salient points which, prior to being used, have to be clustered. The salient points are corners of the detected shapes. Then the figure
detection and verification stages follow. The presented detector is able to recognize triangles and rectangles at different positions, scales, and rotations. The figure verification stage has to ensure that only shapes that comply with the formal RS specifications are passed to the classifiers.
Fig. 3. Key modules of the shape detector
Further stages of image processing for acquisition of the important features and sign classification are presented in Fig. 4. They start with extraction of the image areas at the positions of the detected figures. These are taken from the monochrome version of the input image, since colour information is not used by the classifiers [1-3]. The monochrome signal is simply taken from the red channel instead of linear colour averaging, which has a positive effect on the subsequent extraction of the pictograms [3].
Fig. 4. Final stages of image processing for acquisition of binary feature vectors and sign classification
The purpose of the subsequent shape registration stage in Fig. 4 is normalization of the shape to the size and orientation required by the classifier. This is done by solving a simple linear system of equations to obtain the parameters of an affine transformation, which are then used in the image warping module. An affine transformation is assumed to be sufficient since RSs are rigid planar objects; this has been verified experimentally to operate well in real situations [3]. Finally, the potential sign area is binarized and sampled. Next, the
binary feature vector is fed to the classification module, which in our system was constructed as an assembly of cooperating neural networks [2].
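As a hedged illustration of the registration step, the following sketch recovers a 2x3 affine transform from three corner correspondences by solving the associated 6x6 linear system; the function name and point layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Estimate a 2x3 affine transform mapping src_pts to dst_pts.

    src_pts, dst_pts: arrays of shape (3, 2) with corresponding corner
    coordinates (three correspondences determine the six affine parameters).
    """
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    # Build the 6x6 system A @ params = b for [a11, a12, tx, a21, a22, ty].
    A = np.zeros((6, 6))
    b = np.zeros(6)
    for i, ((x, y), (u, v)) in enumerate(zip(src, dst)):
        A[2 * i]     = [x, y, 1, 0, 0, 0]
        A[2 * i + 1] = [0, 0, 0, x, y, 1]
        b[2 * i], b[2 * i + 1] = u, v
    params = np.linalg.solve(A, b)
    return params.reshape(2, 3)
```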
3 Detection of the Characteristic Points
The characteristic points for the road sign shapes are their corners. Knowledge of the positions of three such corners is usually sufficient for unique localization of the whole shape. However, this is sometimes troublesome due to occlusions or imperfections in the segmentation stage. The technique can be used to detect any shapes that can be characterized by their corner points. For other shapes, such as circles, other techniques can be used [3]. The salient points are detected in the binary images obtained from the segmentation module, and the technique is very fast. To check whether a point is one of the characteristic points, its neighbourhood has to be analyzed. This is done with a detector of local binary features (DLBF). In the general case, it is composed of four rectangular panes, centred at a point of interest P'C, as presented in Fig. 5. For a discrete pixel grid, the central point lies at a virtual position which does not coincide with the image grid. Therefore, the DLBF is anchored at a real point PC which lies on the discrete grid. A DLBF operates on four panes R0, R1, R2, and R3 (Fig. 5), each of size hi×vi. Detection with a DLBF is done by counting the number of set pixels in each pane. Thus, for each point we obtain a set of four counters c0, c1, c2, and c3. These counters are then compared with predefined templates for salient points. If a match is found, the point PC is classified as a salient point.
Fig. 5. A detector of local binary features
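The counting step can be sketched as follows; since Fig. 5 is not reproduced here, the placement of the panes R0-R3 around the anchor point is an assumption, and the helper is illustrative rather than the author's C++ code (border handling is omitted).

```python
import numpy as np

def dlbf_counts(binary_img, pc, h, v):
    """Count set pixels in four rectangular panes around the anchor point
    pc = (row, col) of a binary image. Assumed layout: R0 upper-left,
    R1 upper-right, R2 lower-left, R3 lower-right, each of size v x h.
    Returns the counters c0..c3."""
    r, c = pc
    panes = [
        binary_img[r - v:r, c - h:c],   # R0
        binary_img[r - v:r, c:c + h],   # R1
        binary_img[r:r + v, c - h:c],   # R2
        binary_img[r:r + v, c:c + h],   # R3
    ]
    return [int(np.count_nonzero(p)) for p in panes]
```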
For the road signs, the DLBF is simplified to the symmetrical DLBF (SDLBF), in which all the panes are squares of the same size, as depicted in Fig. 6. Each pane Ri is additionally divided along its diagonal into two parts. Thus, an SDLBF contains eight segments. An analysis of their counters allows classification of the central point PC into one of the groups of salient points. Thus, by providing allowable values for the counters, a type of salient point is defined.
Fig. 6. A symmetrical detector of local binary features
Fig. 7 depicts the detailed partitioning of a single pane into regions T0 and T1. It is not symmetrical, since one part contains the diagonal D. For instance, for a 9×9 pane we have 81 detection elements, of which 36 belong to T0 and 36+9=45 to T1. Fig. 8 shows the SDLBF used for detection of salient points which are characteristic of the triangular, rectangular and diamond shaped road signs. If, for instance, panes 5
Fig. 7. Partitioning of a single pane in the SDLBF
and 6 are almost entirely filled, while all others are empty, then the point can be a top corner of a triangle. Similarly, if panes 0 and 1 are filled, whereas the others are fairly empty, then the anchor point can correspond to the bottom-right corner of a rectangle (see Fig. 8). The SDLBF is very accurate and fast to compute. It works well once fill ratios are defined for the different salient points. This can be further simplified by defining only two states, "empty" and "full", for the panes of the SDLBF detector.
Fig. 8. Detection of salient points with the SDLBF detector. A central point is classified based on the counted fill ratios of each of the eight panes.
In our experiments, good results were obtained by defining the "empty" state as a fill ratio of at most 5% of the total segment capacity (which is 36 or 45 elements for 9×9 panes). The "full" state was set to a fill ratio of at least 95%. The other control parameters are the size and number of panes of the SDLBF window, which have to be tailored to the expected size of the detected shapes. This naturally depends on the resolution of the input images.
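A minimal sketch of this empty/full rule is given below. The segment numbering and the two template rules shown are taken from the examples in the text; the full system defines templates for every corner type of triangles, rectangles and diamonds, which are not reproduced here.

```python
def classify_corner(fill_ratios, empty=0.05, full=0.95):
    """Classify an anchor point from the fill ratios (0..1) of the eight
    SDLBF segments. Only two of the paper's templates are shown."""
    state = ["full" if f >= full else "empty" if f <= empty else "mixed"
             for f in fill_ratios]

    def only_full(indices):
        return (all(state[i] == "full" for i in indices) and
                all(s == "empty" for i, s in enumerate(state) if i not in indices))

    if only_full((5, 6)):   # segments 5 and 6 filled -> top corner of a triangle
        return "triangle_top_corner"
    if only_full((0, 1)):   # segments 0 and 1 filled -> bottom-right rectangle corner
        return "rectangle_bottom_right_corner"
    return None
```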
4 Clusterization of the Salient Points
The SDLBF produces a set of salient points. It appears, however, that the points tend to create local concentrations: instead of a single point for a corner of a sign we get a local cloud of points, each a few pixels away from the others. Thus, the next step consists of finding each local cluster and replacing it with a single point located at the centre of gravity of this cluster. The set SP of all points detected with the SDLBF is described as follows:
S_P = \{P_0, P_1, \ldots, P_n\} = \{(x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n)\}    (1)
In SP the clusters (sub-sets) Ki are determined based on the distances among the points. The set of all clusters C(Sp) is denoted as follows:
C(S_P) = \{K_1, K_2, \ldots, K_m\} = \{\{\ldots, x_{i_1}, \ldots\}, \{\ldots, x_{i_2}, \ldots\}, \ldots, \{\ldots, x_{i_m}, \ldots\}\}    (2)
Then, for each cluster its centre of gravity is found, which finally represents the whole cluster. This process results in the set M(C(S_P)), as follows:

M(C(S_P)) = \{\bar{K}_1, \bar{K}_2, \ldots, \bar{K}_m\} = \{(\bar{x}_1, \bar{y}_1), (\bar{x}_2, \bar{y}_2), \ldots, (\bar{x}_m, \bar{y}_m)\}    (3)
where

\bar{x}_p = \frac{1}{\#K_p} \sum_{x_{p_i} \in K_p} x_{p_i}, \qquad \bar{y}_p = \frac{1}{\#K_p} \sum_{y_{p_i} \in K_p} y_{p_i}    (4)
Clusterization is governed by only one parameter, the maximal distance d_\tau between two points above which the points are classified as belonging to different clusters. This means that if for two points P_i and P_j it holds that

d(P_i, P_j) \le d_\tau,    (5)
where d(.,.) denotes a metric (e.g. Euclidean), then these points belong to one cluster.
Fig. 9. A distance matrix D
For a set SP containing n points, the clusterization process starts with building the distance matrix D, which contains the distances for each pair drawn from the set SP. There are n(n-1)/2 such pairs. Thus, D is a triangular matrix with a zero diagonal. Fig. 9 depicts an example for five elements; in this case we have 5*4/2=10 different point distances.
The clusterization algorithm, outlined in Fig. 10, finds the longest distinctive chains of points in SP. For each point in a chain, there is at least one other point which is no further away than d_\tau.

  j = 0;                                  // initial number of clusters
  "build the distance matrix D";
  do {
    "take the first not clusterized point Pi from the set SP";
    "create a cluster Kj which contains Pi";
    "mark Pi as already clusterized";     // writing a special value in D
    for( "all not clusterized points Pi from SP" ) {
      if( "in Kj there is a close neighbour to Pi" ) {   // read D(i,j)
        "add Pi to Kj";
        "set Pi as clusterized";
      }
    }
    j = j + 1;
  } while( "there are not clusterized points in SP" );
Fig. 10. Point clusterization algorithm
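For illustration, the following sketch implements the clustering of Fig. 10 together with the centroid computation of (3)-(4) in plain NumPy. The repeated-growth loop makes the chaining of close neighbours explicit (the pseudocode in Fig. 10 performs a single pass per cluster); it is an assumption-laden sketch, not the author's C++ code.

```python
import numpy as np

def cluster_salient_points(points, d_tau):
    """Group points (N x 2 array) into clusters such that every point in a
    cluster has at least one other cluster member within distance d_tau,
    then return one centre of gravity per cluster, as in Eqs. (3)-(4)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Distance matrix D (Fig. 9); only n(n-1)/2 distinct pairs are needed,
    # but a full symmetric matrix keeps the sketch simple.
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    unassigned = set(range(n))
    centroids = []
    while unassigned:
        seed = min(unassigned)
        cluster = {seed}
        unassigned.remove(seed)
        grown = True
        while grown:                      # keep chaining close neighbours
            grown = False
            for i in list(unassigned):
                if any(D[i, j] <= d_tau for j in cluster):
                    cluster.add(i)
                    unassigned.remove(i)
                    grown = True
        centroids.append(pts[list(cluster)].mean(axis=0))
    return np.array(centroids)
```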
5 Experimental Results
The test platform is an IBM PC with a Pentium IV 3.4 GHz processor and 2 GB RAM. The system was implemented in C++ in the Microsoft® Visual 6.0 IDE. Experimental results of detection of two warning signs, A-12b and A-14, are presented in Fig. 11. The original scene is visible in Fig. 11a. The colour segmented map and the same map after morphological erosion are depicted in Fig. 11b-c. The detected salient points are presented in Fig. 11d; each type of point (i.e. upper corner, lower corner, etc.) is drawn with a different colour. From these points the figures are created and verified, see Fig. 11e. The cropped and registered shapes from the red channel are visible in Fig. 11f,g for the two signs, respectively. Finally, the binary features prepared for classification are visualized in Fig. 11h,i.
The quality of the segmentation process directly influences the detector of salient points. In Polish warning signs the red border is usually very thin, and therefore the segmentation is done for yellow areas. In other countries, where the red rim is much thicker, the segmentation should search for red areas instead. Segmentation in yellow, however, allows very easy separation of doublets of signs, as depicted in Fig. 11a.
In real cases many salient points are detected. These, after clusterization, are used to generate all possible shapes. However, only the ones that fulfil the predefined conditions are left for further processing. In our system, the first verification parameter is the relative size of a detected shape. If it is below 10% of the minimal resolution, the shape is rejected, since it is too small for the registration and feature detection stages. In the case of triangles, equilateral conditions are checked next. For rectangles we assume that the vertical sides can be longer than the horizontal ones, but only up to 25%. These parameters are taken from the formal specification of the Polish road signs; for other groups of signs the rules can be different.
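The geometric pre-classification can be sketched as below for the rectangle case; the corner ordering, the way side lengths are paired, and the exact tolerances are illustrative assumptions based on the rules quoted above.

```python
import numpy as np

def verify_rectangle(corners, image_min_dim, min_rel_size=0.10, max_vertical_excess=0.25):
    """Check a candidate rectangle given its four corner points (ordered
    around the contour). Rejects shapes smaller than 10% of the minimal
    image dimension and rectangles whose vertical sides exceed the
    horizontal ones by more than 25%."""
    c = np.asarray(corners, dtype=float)
    sides = np.linalg.norm(np.roll(c, -1, axis=0) - c, axis=1)
    horizontal = 0.5 * (sides[0] + sides[2])
    vertical = 0.5 * (sides[1] + sides[3])
    if max(horizontal, vertical) < min_rel_size * image_min_dim:
        return False                       # too small for registration
    return vertical <= (1.0 + max_vertical_excess) * horizontal
```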
Fig. 11. Experimental results of detection of the warning signs (group “A”). The original scene (a), the colour segmented map (b), after erosion (c), salient points (d), detected figures (e,f).
The measured detection accuracy is very high, above 97% for all groups of signs on our database of real road scenes. Some problems are encountered if a sign is partially occluded, especially if the occluded region contains one of the salient points. In practice, however, we are processing a video stream, so if a sign is not detected in one frame there is a high chance it will be detected in one of the next frames, with a changed camera viewing position. Table 2 presents the average execution times for the different signs. The system processes the input video stream of resolution 320×240 in real time. The most time-consuming stages are the morphological erosion and the segmentation, respectively.

Table 2. Average execution times (ms) for detection of the road signs with different shapes

Triangular "A"          38
Inverted triangle "A"   30
Rectangular "D"         37
Diamond "D"             29
Fig. 12a presents an example of another traffic scene used in the experiments for detection of different road signs. The yellow segmented map is depicted in Fig. 12b, and the same map after filtering is visible in Fig. 12c. The salient points are depicted in Fig. 12d with different colours depending on their category. A detected figure is shown in Fig. 12e and the same figure superimposed on the original image in Fig. 12f.
Fig. 12. Detection of the inverted triangle (the A-7 sign). The scene (a). The yellow segmented map (b), the map after filtering (c). The salient points in different colours depending on their category (d). A detected figure (e), the same figure superimposed on the original image (f).
Fig. 13. Detection of an information sign (D-6) in the scene from Fig. 12a. The blue segmented map (a), its filtered version (b). The salient points for rectangles (c). A detected and verified rectangle (d). The found rectangle drawn in the original image (e). The registered sign and its feature vector (f).
Fig. 13 presents the stages of detection of an information sign (D-6, in this case) in the image depicted in Fig. 12a. The blue segmented map and its filtered version are depicted in Fig. 13a-b. The salient points for rectangles are visualized in Fig. 13c. Different
points are drawn with different intensities. From the many potential rectangles one has been verified; it is depicted in Fig. 13d and, superimposed on the original image, in Fig. 13e. Finally, the registered sign and its feature vector are presented in Fig. 13f.
6 Conclusions
This paper describes a real-time detector of triangular and rectangular road signs. These are the warning and information signs in the Polish legislation. However, the presented method can easily be adapted to other conditions, since the presented techniques are quite universal. The main assumption on the detected objects is that they are planar rigid bodies and can be easily spotted by their colour properties.
The method relies heavily on the colour segmentation stage which, in this version of the system, is a simple thresholding method performed in the HSI space. The proper thresholds for the colours characteristic of each group of signs have been found experimentally from hundreds of real examples. The method was verified to work well with scenes obtained in daily sunny conditions. For other cases, more robust segmentation seems to be necessary; this is a field of our further research [3].
The segmented maps are processed by the symmetrical detector of local binary features, which in our case detects corners of the sought figures. It operates simply by counting the number of pixels falling into each of its symmetrical panes. Based on these counts a test point is assigned to one of the categories. The main virtue of this approach is its simplicity and its very fast, as well as accurate, operation. Since the detected points tend to create clusters, for each cluster its mean representative is selected. This is obtained with the simple clusterization algorithm, also presented in this paper.
Based on the detected salient points, all possible configurations are checked for selection of shapes that fulfil the predefined geometrical conditions of the road signs. This is a kind of pre-classification stage which allows very fast rejection of false positives. However, the method is not free from problems, which can occur if parts of a sign are occluded. A figure can be missed if one of its salient points is not detected. Occlusion of other areas of a sign does not influence the method, although the resulting feature vector can be partially faulty. The occlusion problem is not severe, since at least a dozen frames per second are processed, so there is a high probability that only part of the stream will contain occluded signs. False detections are also possible. They can be resolved by the already mentioned tracking of consecutive frames in the input video stream. The second verification stage is the classification module, which classifies a sign based on its pictogram encoded into a feature vector [2].
The method was verified experimentally on our database of real traffic scenes. The obtained results confirmed the high accuracy of the method and its real-time operation. The presented detector constitutes the front end of the road sign recognition system presented in [2].
Acknowledgements
This work was supported by the Polish funds for scientific research in the year 2007.
References 1. Cyganek, B.: Rotation Invariant Recognition of Road Signs with Ensemble of 1-NN Neural Classifiers. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4132, pp. 558–567. Springer, Heidelberg (2006) 2. Cyganek, B.: Recognition of Road Signs with Mixture of Neural Networks and Arbitration Modules. In: Wang, J., Yi, Z., Zurada, J.M., Lu, B.-L., Yin, H. (eds.) ISNN 2006. LNCS, vol. 3973, pp. 52–57. Springer, Heidelberg (2006) 3. Cyganek, B.: Circular Road Signs Recognition with Soft Classifiers. Accepted to the Integrated Computer-Aided Engineering. IOS Press, Amsterdam (2007) 4. Chrysler, D.: The Thinking Vehicle (2002), http://www.daimlerchrysler.com 5. Escalera, A., Armingol, J.A.: Visual Sign Information Extraction and Identification by Deformable Models. IEEE Tr. On Int. Transportation Systems 5(2), 57–68 (2004) 6. Fleyeh, H., Gilani, S.O., Dougherty, C.: Road Sign Detection And Recognition Using Fuzzy Artmap. In: IASTED Int. Conf. on Art. Intell. and Soft Computing, pp. 242–249 (2006) 7. Gao, X.W., Podladchikova, L., Shaposhnikov, D., Hong, K., Shevtsova, N.: Recognition of traffic signs based on their colour and shape features extracted using human vision models. Journal of Visual Communication & Image Representation, 675–685 (2005) 8. Gavrila, D.M.: Multi-feature Hierarchical Template Matching Using Distance Transforms. In: Proc. of the Int. Conf. on Pattern Recognition, Brisbane, pp. 439–444 (1998) 9. Paclik, P., Novovicova, J., Pudil, P., Somol, P.: Road sign classification using Laplace kernel classifier. Pattern Recognition Letters 21, 1165–1173 (2000) 10. Piccioli, G., Micheli, E.D., Parodi, P., Campani, M.: Robust method for road sign detection and recognition. Image and Vision Computing 14, 209–223 (1996) 11. Road Signs and Signalization. Directive of the Polish Ministry of Infrastructure, Internal Affairs and Administration (Dz. U. Nr 170, poz. 1393) (2002)
High-Resolution Multi-sprite Generation for Background Sprite Coding
Getian Ye
Multimedia and Video Communications Group, National ICT Australia
223 Anzac Parade, Kensington, NSW 2052, Australia
Phone: 61-2-83060428, Fax: 61-2-83060404
[email protected]
Abstract. In this paper, we consider high-resolution multi-sprite generation and its application to background sprite coding. Firstly, we propose an approach to partitioning a video sequence into multiple background sprites and selecting an optimal reference frame for each sprite range. This approach groups images that cover a similar scene into the same sprite range. We then propose an iterative regularized technique for constructing a high-resolution sprite in each sprite range. This technique determines the regularization parameter automatically and produces sprite images with high visual quality. Due to the advantages of high-resolution multi-sprites, a high-resolution sprite coding method is also presented and it achieves high coding efficiency.
1 Introduction
Background sprite coding is a well-known and efficient object-based video compression technique and has been adopted in the MPEG-4 standard. A sprite, which is also referred to as a mosaic, is a large image composed of pixels belonging to a background object visible throughout a video segment. As the sprite is transmitted only once, sprite coding can achieve high coding efficiency at a low bit-rate. Usually, the background sprite is not directly available at the encoder; it must be generated prior to coding. The most important task in sprite generation is global motion estimation (GME), which finds a set of warping parameters describing the motion of a background object according to an appropriate motion model. Many sprite-based video coding techniques have been studied in recent years. A layered video object coding system, using the sprite technique and an affine motion model, was first proposed in [1]. Smolic et al. [2] proposed a long-term GME for on-line sprite generation. The motion estimation in this technique combined the advantages of feature matching and optical flow methods and was based on the biquadratic model. In [3][4], efficient and robust GME techniques were proposed for sprite coding. Another highly efficient sprite coding approach based
National ICT Australia is funded through the Australian Government’s Backing Australia’s Ability initiative, in part through the Australian Research Council.
on static sprite generation and spatial prediction techniques was presented in [5]. This approach employed a hybrid technique that estimates the background motion relative to the generated sprite and then used a reliability-based method for blending. The above-mentioned techniques can be considered single-sprite techniques, as they are developed for generating a single sprite. The GME in these techniques mainly estimates the relative motion between all pairs of consecutive frames in a sequential manner; it is called differential GME. Using the concatenation property of projective mappings, sprite generation in single-sprite techniques is usually performed by initializing the sprite with the first frame of a video sequence and then warping and blending the following frames to this reference frame. In practice, however, the relative motion or projective mapping between consecutive frames is estimated only approximately. The concatenation of a number of estimated motions usually introduces a cumulative error, especially when the camera reverses direction or loops back, revisiting certain parts of the scene more than once. If the camera motion is large, the cumulative error may cause misalignment between different frames and consequently degrade the subjective quality of the sprite. In addition, single-sprite techniques may be inefficient when the camera motion is large and complex. In some cases, it is difficult or even impossible to generate a single sprite. To improve the quality of the sprite and the efficiency of sprite coding, multiple sprites can be generated instead of a single sprite. That is, a background sprite can be partitioned into several independent parts. In [6], a multi-sprite generation method was proposed. This method chooses the reference frame for each sprite by thresholding the scaling and rotation parameters of the projective model. It still uses differential GME for the pairs of consecutive frames, which may result in cumulative errors in each sprite. Farin et al. [7] proposed a method that provides an optimal partitioning of a video sequence into independent sprites. It minimizes the total sprite coding cost by choosing the optimal reference frame for each of the sprite ranges independently. However, neither method groups the images that cover a similar scene into the same sprite range. Most single-sprite and multi-sprite techniques only consider generating sprites with the same resolution as the original images; this kind of sprite is called a low-resolution (LR) sprite. High-resolution (HR) image reconstruction has been an active research area [8]. HR image reconstruction algorithms investigate the relative subpixel motion information between multiple LR images and increase the spatial resolution by fusing them into a single frame. HR reconstruction techniques have also been combined with image mosaicing to generate HR sprites with improved resolution. Smolic et al. [9] proposed to generate an HR sprite using image warping. In this method, each pixel of each frame is mapped into the HR sprite and its gray-level value is assigned to the corresponding pixel in the HR sprite if it falls close to an integer-pixel position in the HR sprite. This method did not take into account the reconstruction from the sprite. In this paper, we firstly propose an approach to multi-sprite partitioning and selecting an optimal reference frame for each sprite range. Considering both short-term and long-term motion influences, the proposed approach divides a
video sequence into independent sprites and groups the images that cover a similar scene into the same sprite range. We then propose an iterative regularized technique for constructing the HR sprite for each sprite range. This technique determines the regularization parameter automatically and produces sprite images with high visual quality. Due to the advantages of HR multi-sprites, an HR sprite coding method is also presented.
2 HR Multi-sprite Generation
2.1 Problem Formulation
We use homogeneous coordinates to express points, i.e., 2-D points in the image plane are represented as (x, y, 1) with (x, y) being the corresponding Cartesian coordinates. Let F_i and F_j be two frames from a video sequence. The transformation between F_i and F_j is represented as a 3 × 3 matrix M_{i,j} so that

p_j = M_{i,j} p_i = \begin{bmatrix} m_1 & m_2 & m_3 \\ m_4 & m_5 & m_6 \\ m_7 & m_8 & m_9 \end{bmatrix} p_i    (1)

where p_i and p_j are the corresponding points in F_i and F_j, respectively. The parameter m_9 in (1) is usually normalized to 1. If the transformation is expressed using Euclidean coordinates, we obtain the projective model

x_j = \frac{m_1 x_i + m_2 y_i + m_3}{m_7 x_i + m_8 y_i + 1}, \qquad y_j = \frac{m_4 x_i + m_5 y_i + m_6}{m_7 x_i + m_8 y_i + 1},    (2)
where (x_i, y_i) and (x_j, y_j) are the corresponding locations under the transformation in F_i and F_j, respectively. The single-sprite techniques usually choose the first frame of a video sequence as the reference frame and find the relative transformation between pairs of consecutive frames, i.e., M_{i,i+1}. The transformation M_{1,j} between F_1 and F_j can then be determined by using the concatenation property of projective mappings, i.e., M_{1,j} = M_{1,2} M_{2,3} \cdots M_{j-1,j}. In addition, the transformation M_{j,1} can be obtained by computing M_{j,1} = M_{j-1,j}^{-1} \cdots M_{2,3}^{-1} M_{1,2}^{-1}. This relationship facilitates warping the images of a sequence into the coordinate system of the reference frame. In practice, however, the relative motion or projective mapping between consecutive frames is estimated only approximately. The concatenation of a number of estimated motions usually introduces cumulative errors, especially when the camera reverses direction or loops back, revisiting certain parts of the scene more than once. If the camera motion is large, the cumulative error may cause misalignment between different frames and consequently degrade the subjective quality of the sprite. In addition, the perspective deformation increases rapidly when the camera rotates away from its frontal view position. In some cases, it is difficult or even impossible to generate a single sprite. To avoid the problems discussed above, we consider multi-sprite generation that aims to choose different reference frames to partition a video sequence into
different sprite ranges independently. The images in each sprite range are warped into the coordinate system of the corresponding reference frame, and an independent sprite can then be obtained. The multi-sprite technique can handle large and complex camera motion and provides multiple LR sprites with good visual quality. Previous multi-sprite generation methods do not group the images that cover a similar scene into the same sprite range. In addition, they only produce LR multi-sprites that have the same resolution as the original images. In this paper, we consider generating HR sprites with improved resolution. HR algorithms usually investigate the relative subpixel motion information between a reference image and the other images and then increase the spatial resolution by fusing the other images into the reference image [8]. HR reconstruction often requires cumulative GME that directly finds the relative motion between the reference image and the other images.
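As a small illustration of the concatenation property used above, the following sketch chains pairwise transforms into M_{1,j} and its inverse, and applies a projective mapping to a point as in (1)-(2). The helper names are assumptions, and the product order follows the formula quoted in the text.

```python
import numpy as np

def concatenate_homographies(pairwise):
    """Given pairwise transforms [M_{1,2}, M_{2,3}, ..., M_{j-1,j}] as 3x3
    arrays, return M_{1,j} and M_{j,1} by concatenation."""
    M_1j = np.eye(3)
    for M in pairwise:
        M_1j = M_1j @ M            # M_{1,j} = M_{1,2} M_{2,3} ... M_{j-1,j}
    M_j1 = np.linalg.inv(M_1j)     # M_{j,1} = M_{j-1,j}^{-1} ... M_{1,2}^{-1}
    return M_1j, M_j1

def warp_point(M, x, y):
    """Apply a projective transform (Eqs. (1)-(2)) to a Cartesian point."""
    p = M @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]
```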
2.2 Multi-sprite Partitioning and Reference Frame Selection
According to the discussion above, multi-sprite generation is required to group all the images that cover a similar scene into the same sprite range, even though they may be captured at very different instants of time. Reference frame selection is important for the cumulative GME and for HR multi-sprite generation. In this section, we propose a new method for multi-sprite partitioning. It uses the overlap between any two frames to measure their similarity. The degree of overlap between two frames may indicate whether the motion between them can be estimated correctly, and it is helpful for determining the non-overlapping area that needs to be encoded. The first step of the proposed method is to find the relative motion between all pairs of consecutive frames using the robust GME presented in [3]. The transformation between any two frames can then be obtained by using the concatenation property of projective mappings. Hence, the degree of overlap between two frames can be estimated approximately by simply warping the frame coordinates.
We now consider multi-sprite partitioning and choosing optimal reference frames based on the degree of overlap. Given a video sequence containing N frames, i.e., F = {F_1, F_2, ..., F_N}, we partition F into K sprite ranges represented by S_k (k = 1, 2, ..., K). The reference frame and the total number of input frames in each sprite range S_k are represented by R_k and L_k, respectively. If a frame F_n belongs to a sprite range S_k, the degree of overlap between F_n and R_k is represented by \Delta_n^k, and the averaged overlap in this range is denoted by \bar{\Delta}^k. When partitioning a sequence, a threshold for the overlap is pre-defined and is represented by \Delta_{TH}. The proposed approach to multi-sprite partitioning and selecting reference frames is described as follows:
1. Initialize: K = 1, L_1 = 1, R_1 = F_1, and add F_1 into S_1.
2. Repeat (n = 2, 3, ..., N)
   (2.1) Repeat (k = 1, 2, ..., K)
         (a) Calculate the overlap \Delta_n^k between F_n and the existing reference frame R_k.
   (2.2) Determine which reference frame has the largest overlap with F_n, i.e., k_{max} = \arg\max_k \Delta_n^k.
   (2.3) If \Delta_n^{k_{max}} \le \Delta_{TH}, set K = K + 1 and R_K = F_n.
   (2.4) If \Delta_n^{k_{max}} > \Delta_{TH}, update the sprite range S_{k_{max}} and the corresponding reference frame R_{k_{max}}:
         (a) If L_{k_{max}} < 2, add F_n into S_{k_{max}}, set L_{k_{max}} = L_{k_{max}} + 1, and set \bar{\Delta}^k = \Delta_n^{k_{max}}.
         (b) If L_{k_{max}} \ge 2, repeat (l_k = 1, 2, ..., L_{k_{max}}):
             (i) Calculate the overlaps between F_n and all the frames in S_{k_{max}} except R_{k_{max}}, and then compute the averaged overlap \bar{\Delta}_n^k.
             (ii) If \bar{\Delta}_n^k \ge \bar{\Delta}^k, set R_{k_{max}} = F_n and \bar{\Delta}^k = \bar{\Delta}_n^k.
             (iii) If \bar{\Delta}_n^k < \bar{\Delta}^k, add F_n into S_{k_{max}}.
In Step (2.1), we compute the overlaps between an input frame F_n and the reference frames of the existing sprite ranges S_k. Step (2.2) determines which reference frame has the largest overlap (i.e., \Delta_n^{k_{max}}) with F_n. If \Delta_n^{k_{max}} is less than the threshold \Delta_{TH}, i.e., F_n does not have enough overlap with any existing reference frame, a new sprite range is generated using F_n as its reference frame. If \Delta_n^{k_{max}} is larger than \Delta_{TH}, the corresponding sprite range S_{k_{max}} and reference frame R_{k_{max}} are updated as shown in Step (2.4). That is, if F_n is more similar to all the frames in S_{k_{max}} than R_{k_{max}} is, F_n is taken as the new reference frame of S_{k_{max}}. It is noted that the proposed approach does not require a priori knowledge of the number of sprites K. In addition, the selection of the threshold \Delta_{TH} is very important: if it is too small, the motion between the reference frame and the other frames cannot easily be obtained; if it is too large, redundant sprites may be generated.
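The procedure can be sketched as follows. The overlap function, the data structures, and the way the running average \bar{\Delta}^k is maintained are simplifying assumptions, so this is only an illustrative outline of Steps 1-2, not the authors' implementation.

```python
def partition_into_sprites(frames, overlap, threshold):
    """frames: list of frame identifiers; overlap(a, b): degree of overlap
    in [0, 1] between two frames, estimated by warping frame corners with
    the concatenated transforms; threshold: the pre-defined Delta_TH."""
    sprites = [{"ref": frames[0], "members": [frames[0]]}]
    for f in frames[1:]:
        overlaps = [overlap(f, s["ref"]) for s in sprites]
        k = max(range(len(sprites)), key=lambda i: overlaps[i])
        if overlaps[k] <= threshold:
            # not enough overlap with any existing reference: new sprite range
            sprites.append({"ref": f, "members": [f]})
            continue
        s = sprites[k]
        others = [m for m in s["members"] if m != s["ref"]]
        if others:
            avg_new = sum(overlap(f, m) for m in others) / len(others)
            avg_ref = sum(overlap(s["ref"], m) for m in others) / len(others)
            if avg_new >= avg_ref:
                s["ref"] = f          # f is more central to this range
        s["members"].append(f)
    return sprites
```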
2.3 HR Sprite Generation
HR multi-sprite generation is implemented with the following major steps: (1) wavelet-based image interpolation by a factor of 2, (2) cumulative GME for the interpolated images, and (3) HR sprite construction. Image interpolation refers to the process of upsampling followed by appropriate low-pass filtering, while image decimation refers to downsampling after appropriate anti-alias filtering. In this paper, the low-pass synthesis and analysis filters of the biorthogonal Daubechies 7/9 wavelet transform are used as the low-pass and anti-alias filters for image interpolation and decimation, respectively. The HR sprite construction involves image warping and blending. Assuming that there are K frames of LR images available in a sprite range, the observation model can be expressed as

y_k = D B_k W_k R[x]_k + n_k,    (3)
where y_k (k = 1, 2, ..., K) and x denote the kth LR image and the HR sprite image, respectively, which are rearranged in lexicographic order. The reconstruction from the HR sprite, which corresponds to the kth image, is denoted by R[\cdot]_k.
The geometric warp operator and the blur matrix between the HR sprite image x and the kth LR image are represented by W_k and B_k, respectively. The decimation operator is denoted by D, and n_k represents a lexicographically ordered noise vector. In practice, the noise in (3) is modeled as additive white Gaussian noise. Determining the HR reconstruction is often an ill-posed problem [8] because of an insufficient number of LR images and an ill-conditioned blur operator. Procedures adopted to stabilize the inversion of an ill-posed problem are called regularization; they help to find a stable solution and improve the rate of convergence. Using a deterministic regularization, the constrained least squares formulation can be written as

\hat{x} = \arg\min_x \left\{ \sum_{k=1}^{K} \left\| y_k - D B_k W_k R[x]_k \right\|_2^2 + \lambda \left\| L x \right\|_2^2 \right\},    (4)
where L is chosen to be the 2-D Laplacian operator and λ is the regularization parameter that controls the tradeoff between fidelity to the original data and smoothness of the solution. Based on the gradient descent algorithm for minimizing (4), the robust iterative update for HR sprite can be expressed as
\hat{x}^{(n+1)} = \hat{x}^{(n)} + \alpha^{(n)} \left( R^T \left[ W_k^T B_k^T D^T \left( y_k - D B_k W_k R[\hat{x}^{(n)}]_k \right) \right]_{k=1}^{K} - \lambda^{(n)} L^T L \hat{x}^{(n)} \right)    (5)

where \alpha^{(n)} is a scalar defining the step size in the direction of the gradient, D^T denotes the interpolation operator, and R^T[\cdot]_{k=1}^{K} represents the sprite construction using K images. It is seen from (5) that an error sprite is built using all the errors, or differences, between the original and reconstructed LR images. The error sprite is subsequently used for updating the HR sprite \hat{x}^{(n)}. This process is repeated iteratively to minimize the energy of the error in (4). The critical issue in the application of (5) is the determination of the regularization parameter \lambda^{(n)}, which balances the constraint \|Lx\|_2^2 and the error energy. We propose to define the regularization parameter \lambda^{(n)} as
\lambda^{(n)} = \frac{\sum_{k=1}^{K} \left\| y_k - D B_k W_k R[x^{(n)}]_k \right\|_2^2}{\left\| L x^{(n)} \right\|_2^2}.    (6)

The numerator of the right-hand term in (6) is the error energy, which decreases with the iteration. That is, the differences between the reconstructed LR images and the observed LR images become smaller as the iteration proceeds, and the rate of change of the regularization parameter becomes smaller as the error energy decreases. The denominator of the right-hand term in (6) is the energy of the high-pass filtered HR sprite image, i.e., \|L x^{(n)}\|_2^2. As the iterative process progresses, the value of \|L x^{(n)}\|_2^2 increases because high-frequency components in x^{(n)} are restored. Thus, the value of the regularization parameter decreases with the iteration.
Foreground objects usually result in outliers when building the background sprite. GME may also introduce outliers due to motion errors. Temporal median filtering is often used to reject these outliers when blending a sprite. However, it requires a sorting operation that is computationally very expensive, especially when a large number of overlapping images are involved in sprite blending at a pixel location. Inspired by the work presented in [10], we apply temporal mode filtering for blending, which can be performed sequentially.

Table 1. Details of multi-sprites generated for the Coastguard sequence

Sprite index   Averaged overlap (%)   Reference frame   Sprite range   Number of frames   Sprite area
1              70.5                   110               1 → 211        211                78k
2              87.5                   257               212 → 300      89                 63k

Table 2. Details of multi-sprites generated for the Stefan sequence

Sprite index   Averaged overlap (%)   Reference frame   Sprite range          Number of frames   Sprite area
1              73.4                   229               1 → 94, 176 → 236     155                236k
2              91.7                   241               95 → 175, 237 → 249   94                 197k
3              67.4                   259               250 → 272             23                 264k
4              79.9                   287               273 → 300             28                 153k
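A schematic sketch of one iteration of (5) with the automatic \lambda of (6) is given below. The warping, blurring, decimation and sprite-construction operators are passed in as callables because they depend on the estimated motion; all names are placeholders and the sketch is not the paper's implementation.

```python
import numpy as np

def hr_sprite_iteration(x, lr_images, forward_ops, backward_ops, laplacian, step):
    """One gradient-descent update of the HR sprite x (Eq. (5)).

    forward_ops[k](x)  ~ D B_k W_k R[x]_k : simulate the k-th LR image from x
    backward_ops[k](e) ~ R^T W_k^T B_k^T D^T e : back-project an LR error
    laplacian(x)       ~ L x : 2-D Laplacian high-pass filter
    """
    residuals = [y - forward_ops[k](x) for k, y in enumerate(lr_images)]
    err_energy = sum(float(np.sum(r ** 2)) for r in residuals)
    Lx = laplacian(x)
    lam = err_energy / (float(np.sum(Lx ** 2)) + 1e-12)        # Eq. (6)
    gradient = sum(backward_ops[k](r) for k, r in enumerate(residuals))
    # L is symmetric, so L^T L x is approximated by applying the Laplacian twice
    return x + step * (gradient - lam * laplacian(Lx)), lam
```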
3 HR Sprite Coding
Since the HR sprite image is usually an arbitrarily shaped image, some regions in it are transparent. In order to improve the coding efficiency, these transparent regions do not need to be compressed. In this paper, the region-of-interest (ROI) coding scheme in the JPEG 2000 standard is applied for coding the HR sprite image because it allows the ROI to be coded at higher quality than the transparent regions. Based on the MAXSHIFT method [11], it does not require the mask of the sprite image at the decoder. Before coding the arbitrarily shaped sprite image, the repetitive image padding scheme adopted in the MPEG-4 standard is performed to fill the transparent regions. In order to reconstruct each background image from the sprite at the decoder, the motion parameters are also required to be coded. The projective transformation in (2) can be defined as either a set of motion parameters or the displacements of some reference points. Instead of directly encoding the motion parameters of the projective model in (2), the displacements of reference points are encoded. In our application, we simply use a 20-bit floating point number to encode each displacement.
4 Experimental Results
In this section, we present experimental results to demonstrate the performance of the proposed techniques. The threshold used for multi-sprite partitioning, i.e., \Delta_{TH}, is chosen to be 30%. When using the temporal mode filtering, the
(a) The 1st HR sprite
(b) The 2nd HR sprite
Fig. 1. HR multi-sprites of the Coastguard sequence generated by our proposed techniques. These sprite images have been scaled down to fit this page.
Fig. 2. LR single-sprite of the Coastguard sequence generated by the method in [5]. The sprite area is 100k. This sprite image has been scaled down to fit this page.
(a) The 1st HR sprite
(b) The 2nd HR sprite
(c) The 3rd HR sprite
(d) The 4th HR sprite
Fig. 3. HR multi-sprites of the Stefan sequence generated by our proposed techniques. These sprite images have been scaled down to fit this page.
Fig. 4. LR single-sprite of the Stefan sequence generated by the method in [5]. The sprite area is 1498k. This sprite image has been scaled down to fit this page.
number of bins is fixed to 16. We use the Coastguard and Stefan sequences. There are 300 frames in each of these sequences, and the image size is 352 by 288 pixels. These two sequences involve large camera motion; the Stefan sequence contains more complex camera motion than the Coastguard sequence. When building the background sprites, we do not use segmentation masks for the foreground objects. However, segmentation masks are used to reject the pixels belonging to foreground objects when evaluating the rate-distortion performance (i.e., bit-rate versus PSNR) of the sprite coding schemes. That is, the calculation of PSNR considers the background pixels only.
The proposed multi-sprite partitioning approach divides the backgrounds of the Coastguard and Stefan sequences into two and four independent background sprites, respectively. Table 1 and Table 2 show the corresponding details of the sprites. We can see that the proposed approach groups the images that cover a similar scene into the same sprite and ensures that the reference frame has large overlaps with the other frames in the same sprite range. Moreover, we found that the total sprite area (or the total number of pixels to be coded) of the Stefan sequence (850k pixels) is very similar to that reported in [7], i.e., 841k pixels. Fig. 1 and Fig. 3 depict the HR multi-sprites generated by our proposed techniques for the Coastguard and Stefan sequences, respectively. We can see that the left part of the sprite image shown in Fig. 3(b) is slightly blurred in comparison with the right part. That is because both the number of frames and the overlaps between them are quite small, resulting from fast camera panning. For comparison purposes, we also generate LR single-sprites for the Coastguard and Stefan sequences. We found that the majority of single-sprite techniques cannot produce a single sprite for the Stefan sequence, except [5]. Fig. 2 and Fig. 4 show the LR single-sprite images generated by the method presented in [5] for both sequences. We can see that there exist several distortions in the LR single-sprite images.
To explore the performance of the proposed HR sprite coding, we generate LR multi-sprites by using a similar framework. However, in the LR sprite construction, the wavelet-based interpolation and the iterative update procedure in (5) are not used.
[Figure: two rate-distortion plots, PSNR (dB) versus bit-rate (kbps), comparing LR single-sprite, LR multi-sprites and HR multi-sprites for (a) Coastguard and (b) Stefan]
Fig. 5. Rate-distortion comparison for the Coastguard and Stefan sequences
When encoding the single-sprite and multi-sprite images, JPEG 2000 image compression is adopted. Image tiling in JPEG 2000 affects the image quality both subjectively and objectively. Because larger tiles perform visually better than smaller tiles, the whole sprite image is treated as a single tile in our experiments. Fig. 5(a) and (b) show the rate-distortion performance of LR and HR sprite coding for the Coastguard and Stefan sequences, respectively. We can easily see that the proposed HR multi-sprite coding outperforms the LR single-sprite and LR multi-sprite coding schemes.
5 Conclusions
We have proposed an approach to multi-sprite partitioning and selecting an optimal reference frame for each sprite according to the degree of the overlap
between this reference frame and all the other frames in the corresponding sprite range. This approach can group the images that cover a similar scene into the same sprite range. In addition, we have proposed an iterative regularized technique for HR sprite construction. This technique determines the regularization parameter automatically by considering the balance between the constraint and the error energy. Moreover, it produces sprite images with high visual quality. An HR sprite coding method has also been presented. Experimental results show that the proposed methods for HR multi-sprite generation can produce sprites with good quality and can greatly improve the performance of background sprite coding.
References 1. Lee, M., Chen, W., Lin, C., Gu, C., Markoc, T., Zabinsky, S., Szeliski, R.: A layered video object coding system using sprite and affine motion model. IEEE Trans. Circuits Syst. Video Technol. 7, 130–145 (1997) 2. Smolic, A., Sikora, T., Ohm, J.: Long-term global motion estimation and its application for sprite coding, content description, and segmentation. IEEE Trans. Circuits Syst. Video Technol. 9, 1227–1242 (1999) 3. Dufaux, F., Konrad, J.: Efficient, robust, and fast global motion estimation for video coding. IEEE Trans. Image Process. 9, 497–501 (2000) 4. Keller, Y., Averbuch, A.: Fast gradient methods based on global motion estimation for video compression. IEEE Trans. Circuits Syst. Video Technol. 13, 300–309 (2003) 5. Lu, Y., Gao, W., Wu, F.: Efficient background video coding with static sprite generation and arbitrary-shape spatial prediction techniques. IEEE Trans. Circuits Syst. Video Technol. 13, 394–405 (2003) 6. Chien, S., Chen, C., Chao, W., Hsu, C., Huang, Y., Chen, L.: A fast and high subjective quality sprite generation algorithm with frame skipping and multiple sprits techniques. In: Proc. of IEEE International Conference on Image Processing, IEEE, Los Alamitos (2003) 7. Farin, D., de With, P.H.: Enabling arbitrary rotational camera motion using multisprites with minimum coding cost. IEEE Trans. Circuits Syst. Video Technol. 16, 492–506 (2006) 8. Park, S., Park, M., Kang, M.: Super resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine , 21–36 (2003) 9. Smolic, A., Wiegand, T.: High-resolution video mosaicing. In: Proc. of IEEE International Conference on Image Processing, IEEE, Los Alamitos (2001) 10. Capel, D.: Super resolution and image mosaicing, Ph.D. thesis, Department of Engineering Science, Oxford University (2001) 11. Taubman, D., Marcellin, M.: JPEG 2000 - Image Compression Fundamentals, Standards, and Practice. Kluwer, MA (2002)
Motion Information Exploitation in H.264 Frame Skipping Transcoding
Qiang Li, Xiaodong Liu, and Qionghai Dai
Broadband Networks & Digital Media Laboratory, Graduate School at Shenzhen, Tsinghua University, China
[email protected]
Abstract. This paper proposes an adaptive motion mode selection method in H.264 frame skipping transcoding. In order to reduce the high complexity arising from variable block sizes in H.264, the proposed method exploits original motion information from incoming bitstreams. In addition, the paper also adopts Forward Dominant Vector Selection approach in MV composition of H.264 transcoding, in comparison with Bilinear Interpolation method. The simulation results show that the proposed method achieves good trade-off between computational complexity and video quality.
1 Introduction
Video transcoding techniques have become more and more indispensable today, mainly due to the universal access to all kinds of video data through diverse processing terminals and various network links. Transcoding operations perform conversions of video data, transforming one compressed video stream into another with different parameters or formats [1], [2]. It is obviously too expensive, in terms of computation and processing delay, to conduct a cascaded decoding followed by a full re-encoding operation. By exploiting the information in the original bitstream, video transcoding techniques can improve real-time performance and enhance the overall efficiency effectively. In various scenarios, transcoders deal with different problems, such as bit-rate adaptation [3], [4], spatial/temporal resolution reduction [5], [6], [7], and format conversions [8], [9]. Temporal resolution reduction, namely frame skipping transcoding, is very useful when the terminal's processing capability is quite limited. Additionally, a reduction in frame rate can maintain higher video quality by saving more bits for the remaining frames. Since many original motion vectors (MVs) point to the skipped frames, the incoming MVs are no longer valid, and new MVs which point to the remaining frames must be derived. A bilinear interpolation method was developed in [10] to obtain new MVs based on the incoming MVs. In [11], a method called Forward Dominant Vector Selection (FDVS) was presented, which achieves better performance. Other researchers have also proposed approaches in [6], [12], [13] to address the issue. The existence of several video compression standards makes video transcoding technology much more necessary as well as more challenging. As the newest
international video coding standard, H.264/AVC [14] improves both coding efficiency and flexibility for a broad variety of applications. Compared to prior video coding standards, H.264 supports many new features, such as motion vectors over picture boundaries and variable block-size motion compensation [15]. These features in turn cause some new problems which conventional transcoding techniques cannot solve properly. In H.264 frame skipping transcoding, MV derivation involves much more complexity because of the adoption of up to seven inter block sizes. A Block-Adaptive Motion Vector Resampling (BAMVR) method [16] has been proposed to estimate MVs in H.264 transcoding. In addition, one optimal motion mode should be determined from the various inter predictive modes for each macroblock. This is a new problem for frame-skipping transcoding, requiring new techniques. In [16], a rate-distortion optimization algorithm is also combined with BAMVR to obtain the optimal mode. The method reduces the computational complexity compared to full motion estimation. However, this method derives new MVs using interpolation rather than the superior FDVS. Moreover, the incoming motion mode information has not been exploited appropriately. This paper proposes an adaptive motion mode selection method in H.264 frame skipping transcoding, to efficiently choose the optimal mode based on the original motion information. The Forward Dominant Vector Selection approach is also adopted in the MV composition process of H.264 transcoding in this paper, considering the better performance of the FDVS method in conventional transcoding prior to H.264. The rest of the paper is organized as follows. Section 2 introduces the architecture of the proposed transcoder. Section 3 describes the proposed MV composition techniques in H.264 transcoding. Section 4 discusses the proposed adaptive motion mode selection method in detail. Simulation results of the proposed methods are presented in Section 5, while Section 6 concludes the paper.
2 Architecture of Proposed Transcoder
Transcoding architectures are among several hot topics in video transcoding research. There are three different kinds of architectures for video transcoding. The simplest type is the open-loop transcoder [3], [17], which directly re-quantizes the residual errors without any changes of motion vectors, or discards high-frequency coefficients [4]. Despite its simple structure and easy implementation, the open-loop transcoder suffers from the drift problem resulting from the mismatch between the reference frames in the encoder and the end decoder. The cascaded pixel-domain transcoder [4], [18] belongs to the second type of transcoding architecture. Transcoders of this kind first decode the bitstream into the pixel domain, and then re-encode the data by reusing some of the incoming information, such as motion vectors. The pixel-domain transcoder avoids the drift problem due to the compensation in its closed-loop structure. The last type is the frequency-domain transcoding architecture [19]. This kind of transcoder decodes the bitstream and conducts encoding only in the frequency domain rather than the pixel domain, simplifying the overall coding process further.
The proposed transcoder has a cascaded pixel-domain architecture. Although a frequency-domain transcoder could reduce the amount of computation, the linearity of the frequency transform is not always perfect enough to avoid drift. Besides that, H.264 adopts a 4×4 transform, while all major prior standards used a transform block size of 8×8. Frequency-domain transcoding in H.264 would have to consider this new situation, which is beyond the scope of this paper.
3 MV Composition in H.264 Frame Skipping Transcoding
In frame skipping transcoding, it is necessary to obtain new MVs from the current frame to a previous remaining frame. New MVs can be derived through tracing back instead of redoing a motion search. As in Fig. 1, the MV of block B should be the sum of MV1 and MV2. Since the predicted area, like BP in Fig. 1, is usually not aligned with the boundaries of blocks, MV2 should be obtained through a composition of the overlapping blocks' MVs.
Fig. 1. MV tracing in frame skipping transcoding
As described in Section 1, bilinear interpolation and Forward Dominant Vector Selection are two major methods of MV composition proposed for frame skipping transcoding. Different from bilinear interpolation, FDVS [11] chooses the MV of the reference block with the largest overlapping area. Previous experiments show that coding efficiency is higher using FDVS than bilinear interpolation. Moreover, FDVS apparently involves less computation. In spite of the increased number of block types in motion estimation, the fundamentals of MV composition in H.264 transcoding remain the same. Bilinear interpolation is used for MV composition in [16]. But until now, FDVS has not been applied in H.264 transcoding, to the best of our knowledge. This paper adopts this efficient method for H.264 frame skipping transcoding. For instance, in Fig. 2, where the predicted area BP overlaps five blocks in the previous reference frame, from B1 to B5, the MV of BP is directly obtained by choosing the MV of block B5 according to FDVS, because B5 has the largest overlapping portion with BP.
Fig. 2. Overlapping blocks in H.264 frame skipping transcoding
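To make the FDVS composition concrete, the following sketch picks the dominant MV and composes it across a skipped frame; the block records and the overlap helper are hypothetical placeholders, not part of the JM reference software.

```python
def fdvs_compose(predicted_area, reference_blocks, overlap_area):
    """Forward Dominant Vector Selection: among the reference blocks covered
    by the motion-compensated area BP, return the motion vector of the block
    with the largest overlapping area (B5 in Fig. 2). `overlap_area(a, b)`
    and the block dictionaries are assumed helpers."""
    dominant = max(reference_blocks, key=lambda b: overlap_area(predicted_area, b))
    return dominant["mv"]

def trace_mv(mv_current_to_skipped, mv_skipped_to_previous):
    """Compose the new MV across a skipped frame (Fig. 1): MV = MV1 + MV2."""
    return (mv_current_to_skipped[0] + mv_skipped_to_previous[0],
            mv_current_to_skipped[1] + mv_skipped_to_previous[1])
```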
4 Adaptive Motion Mode Selection
H.264 utilizes seven different block sizes [15] in motion estimation, as shown in Fig. 3, rather than a uniform 16×16 block type. In H.264 inter prediction, partitions with luma block sizes of 16×16, 16×8, 8×16, and 8×8 are supported. If 8×8 partitions are chosen, each 8×8 block can be further partitioned into blocks of 8×4, 4×8, and 4×4 sizes. The optimal block mode can be determined after comparing the costs of all the possible modes. Given that motion estimation is the most time-consuming operation in video encoding, the motion mode selection appears to be the biggest factor constraining the coding speed. For real-time transcoding applications in particular, the resulting delay might be unacceptable. The motion vector resampling method in [16] divides a block of any size into several 4×4 subblocks to trace motion vectors. After composing the MVs of each 4×4
Fig. 3. Seven block types in H.264 motion estimation
subblock, the MVs of all the block modes can also be obtained by averaging the MVs of the comprised subblocks. The optimal mode can then be selected from all the candidate modes. In contrast, our proposed adaptive motion mode selection method exploits the original mode information to select the optimal mode more efficiently. This method is designed based on the following observations:
1) Within an average video sequence, only parts of the frame content, usually not many, experience detailed motion with small block sizes. Thus it is unnecessary to divide all blocks into 4×4 subblocks, which may turn out to be quite a disadvantage for speed improvement.
2) The macroblocks comprising small blocks are inclined to keep the small-size partitions, because the detailed motion situation can hardly change during the short interval of a few skipped frames.
With these considerations in mind, the proposed motion mode selection procedure for each inter-typed macroblock is as follows.
Step 1: Divide the macroblock into four 8×8 blocks as the element blocks.
Step 2: If there exist original partitions smaller than 8×8, divide the corresponding 8×8 area into 4×4 blocks as the element blocks instead.
Step 3: Trace back the MV of each element block to an unskipped frame, using FDVS or bilinear interpolation.
Step 4: If there are no 4×4 element blocks, go to Step 5; else go to Step 6.
Step 5: Obtain the MVs of the 16×16, 16×8, and 8×16 partitions by averaging the traced MVs of the comprised element blocks, and then select the optimal mode from 16×16, 16×8, 8×16, and 8×8 (without further division).
Step 6: Set the macroblock to mode P8×8. For each 8×8 area with 4×4 element blocks, obtain the MVs of the 8×8, 8×4, and 4×8 partitions by averaging the MVs of the comprised element blocks, and select the optimal submode from 8×8, 8×4, 4×8, and 4×4. For the other 8×8 areas, set the submode to 8×8 and keep the respective traced MVs.
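The mode decisions of Steps 5 and 6 can be sketched as below. The block index order (0 1 / 2 3), the `average` helper and the `cost` function (e.g. an RD-style cost) are assumptions made for illustration; the decision among candidates is otherwise the averaging-and-compare logic described above.

```python
def pick_mode_16(mvs8, average, cost):
    """Step 5: choose among 16x16, 16x8, 8x16 and 8x8 given the traced MVs
    of the four 8x8 element blocks (assumed row-major order: 0 1 / 2 3)."""
    candidates = {
        "16x16": [average(mvs8)],
        "16x8":  [average(mvs8[0:2]), average(mvs8[2:4])],
        "8x16":  [average(mvs8[0::2]), average(mvs8[1::2])],
        "8x8":   list(mvs8),
    }
    return min(candidates.items(), key=lambda kv: cost(kv[0], kv[1]))

def pick_submode_8(mvs4, average, cost):
    """Step 6: for an 8x8 area whose incoming partitions were smaller than
    8x8, choose among 8x8, 8x4, 4x8 and 4x4 from its four traced 4x4 MVs."""
    candidates = {
        "8x8": [average(mvs4)],
        "8x4": [average(mvs4[0:2]), average(mvs4[2:4])],
        "4x8": [average(mvs4[0::2]), average(mvs4[1::2])],
        "4x4": list(mvs4),
    }
    return min(candidates.items(), key=lambda kv: cost(kv[0], kv[1]))
```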
5 Simulation Results Reference software JM 10.1 is used as the H.264 codec in the simulation. The transcoder is implemented by cascading a JM decoder and a simplified JM encoder. Since the proposed methods are mainly aimed at improving performance of some real time systems, B mode prediction are not considered. Test sequences Suzie, Carphone with QCIF picture size and Silent, Tempete with CIF are all compressed into the format of H.264 in advance. As incoming bitstreams, the coded sequences are fed into the transcoder, which skips every other frame and generates a new H.264 bitstream. For simplicity, the coded bitstream uses only one reference frame. In the simulation, the proposed Adaptive Motion Mode Selection (AMMS) method is applied in the transcoder. For comparison, the BAMVR method in [16] is also conducted in the experiments. The Bilinear Interpolation (BI) and FDVS methods are both applied in MV composition as alternatives. MV refinements with a search range of 1 pixel
Motion Information Exploitation in H.264 Frame Skipping Transcoding
773
Table 1. The comparison in processing time (ms/frame)
Test Sequence   Output Bitrate   Re-encoding   AMMS+FDVS   AMMS+BI   BAMVR+FDVS   BAMVR+BI
Suzie           70 kb/s          853           526         534       547          553
Carphone        300 kb/s         1100          623         628       637          640
Silent          70 kb/s          1807          964         972       1078         1146
Tempete         750 kb/s         3400          1182        1185      1229         1233
Fig. 4. PSNR comparison between AMMS and BAMVR for sequence a) Silent at 70 kb/s; b) Tempete at 750 kb/s
are used in both transcoders with the proposed method and BAMVR. As the benchmark, the test values of cascaded re-encoding are also presented. The re-encoding is conducted with a search range of 16 pixels, and without RD optimization as in many real time applications.
Table 1 presents the processing time of re-encoding or transcoding with the different schemes. Each value is an average over five independent test runs. From this table, it can be seen that the proposed Adaptive Motion Mode Selection method greatly improves the coding efficiency, reducing the processing time by 40%-65%. Compared to the BAMVR method [16], the proposed method further improves efficiency to some degree. On the other hand, FDVS also performs well, speeding up the transcoding process in comparison with Bilinear Interpolation.
Fig. 5. PSNR comparison between FDVS and BI
Fig. 6. Visual quality of transcoded pictures (left) and fully re-encoded counterparts (right)
In Fig. 4, PSNR values of pictures transcoded with both AMMS and BAMVR are presented, as well as the values after re-encoding. Two sequences with different output bitrates are used. In both transcoders, PSNR is reduced by 2-4 dB. It can be seen that the proposed AMMS method achieves better performance than BAMVR. In Fig. 5, the Suzie sequence is used with an output bitrate of 70 kb/s. This figure shows that, combined with the AMMS method, both FDVS and BI experience a video quality degradation of about 2-4 dB. FDVS performs slightly better than bilinear interpolation. Fig. 6 presents two frames of the Suzie sequence after being transcoded with the AMMS+FDVS methods. The frames obtained through full re-encoding are placed on the right for comparison. Despite slight content blurring, the transcoded frames still maintain a satisfactory quality. It is worth mentioning that as the search range of the MV refinements increases, the video quality can be improved further.
6 Conclusion
This paper investigates motion estimation and motion mode selection in H.264 frame skipping transcoding, in order to identify techniques that efficiently exploit the original motion information in the incoming bitstream. Specifically, an adaptive motion mode selection method is proposed, with an effort to make use of the original motion modes as fully as possible. In addition, Forward Dominant Vector Selection [11] is adopted in this paper for H.264 transcoding and compared with the bilinear interpolation method. Simulation results show remarkable improvements in the real-time performance of the proposed approaches, while a satisfactory video quality is still maintained. Admittedly, these approaches have some limitations, e.g. bidirectional inter prediction is not considered, and multi-frame reference and some other new features of H.264 are neglected in this paper. These issues will be addressed in future work.
Acknowledgements This work is supported by the key project (No.60432030) and the Distinguished Young Scholars (No.60525111) of National Natural Science Foundation of China.
References [1] Ahmad, I., Wei, X., Sun, Y., Zhang, Y.-Q.: Video transcoding: An overview of various techniques and research issues. IEEE Trans. Multimedia 7(5), 793–804 (2005) [2] Vetro, A., Christopoulos, C., Sun, H.: Video transcoding architectures and techniques: An overview. IEEE Signal Process. Mag. 20(2), 18–29 (2003) [3] Nakajima, Y., Hori, H., Kanoh, T.: Rate conversion of MPEG coded video by re-quantization process. In: Proc. IEEE Int. Conf. Image Processing, Washington, DC, vol. 3, pp. 408–411. IEEE, Los Alamitos (1995) [4] Sun, H., Kwok, W., Zdepski, J.: Architectures for MPEG compressed bitstream scaling. IEEE Trans. Circuits Syst. Video Technol. 6, 191–199 (1996)
[5] Bjork, N., Christopoulos, C.: Transcoder architectures for video coding. IEEE Trans. Consumer Electron. 44, 88–98 (1998) [6] Shanableh, T., Ghanbari, M.: Heterogeneous video transcoding to lower spatio-temporal resolutions and different encoding formats. IEEE Trans. Multimedia 2, 101–110 (2000) [7] Yin, P., Wu, M., Lui, B.: Video transcoding by reducing spatial resolution. In: Proc. IEEE Int. Conf. Image Processing, Vancouver, BC, Canada, pp. 972–975. IEEE, Los Alamitos (2000) [8] Shanableh, T., Ghanbari, M.: Heterogeneous video transcoding MPEG:1,2 to H.263. In: Proc. of the Packet Video’99Workshop, NYC, USA (1999) [9] Dogan, S., Sadka, A.H., Kondoz, A.M.: Efficient MPEG-4/H.263 video transcoder for interoperability between heterogeneous multimedia networks. IEE Electronics Letters 35(11), 863–864 (1999) [10] Hwang, J.-N., Wu, T.-D.: Motion vector re-estimation and dynamic frame-skipping for video transcoding. In: Conf. Rec. 32nd Asilomar Conf. Signals, System & Computer, vol. 2, pp. 1606–1610 (1998) [11] Youn, J., Sun, M.-T., Lin, C.-W.: Motion vector refinement for high performance transcoding. Multimedia 1(1), 30–40 (1999) [12] Chen, M.-J., Chu, M.-C., Pan, C.-W.: Efficient motion-estimation algorithm for reduced frame-rate video transcoder. IEEE Trans. Circuits Syst. Video Technol. 12(4), 269–275 (2002) [13] Yusuf, A.A., Murshed, M., Dooley, L.S.: An adaptive motion vector composition algorithm for frame skipping video transcodine. In: IEEE MELECON 2004, Dubrovnik, Croatia, May 12-15, 2004 (2004) [14] Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264/ISO/IEC 14496-10 AVC). In: Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVTG050 (2003) [15] Wiegand, T., Sullivan, G.J., Bjøntegaard, G., Luthra, A.: Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol. 13(7) (2003) [16] Shin, I.-H., Lee, Y.-L., Park, H.W.: Motion estimation for frame-rate reduction in H.264 transcoding. In: Proc. Second IEEE Workshop on Software Technologies for Future Embedded and Ubiquitous Systems, pp. 63–67. IEEE Computer Society Press, Los Alamitos (2004) [17] Eleftheriadis, A., Anastassiou, D.: Constrained and general dynamic rate shaping of compressed digital video. In: Proc. IEEE Int. Conf. Image Processing, Washington, DC, IEEE, Los Alamitos (1995) [18] Youn, J., Sun, M.T., Xin, J.: Video transcoder architectures for bit rate scaling of H.263 bit streams. In: ACM Multimedia 1999, Orlando, ACM, New York (1999) [19] Assuncao, P.A.A., Ghanbari, M.: A frequency-domain video transcoder for dynamic bitrate reduction of MPEG-2 bit streams. IEEE Trans. Circuits Syst. Video Technol. 8, 953–967 (1998)
Joint Domain-Range Modeling of Dynamic Scenes with Adaptive Kernel Bandwidth
Borislav Antić and Vladimir Crnojević
Department of Electrical Engineering, University of Novi Sad, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia
{tk boris,crnojevic}@uns.ns.ac.yu
Abstract. The first step in various computer vision applications is the detection of moving objects. The prevalent pixel-wise models regard image pixels as independent random processes and do not take into account the existing correlation between neighboring pixels. By using a nonparametric density estimation method over a joint domain-range representation of image pixels, this correlation can be exploited to achieve high levels of detection accuracy in the presence of dynamic backgrounds. This work improves the recently proposed joint domain-range model for background subtraction, which assumes a constant kernel bandwidth. The improvement is obtained by adapting the kernel bandwidth according to the local image structure. This approach suppresses the structural artifacts that appear in detection results when kernel density estimation with constant bandwidth is used. Consequently, a more accurate detection of moving objects can be achieved.
1 Introduction
The detection of moving objects is a very important task in modern computer vision. In a typical application of automated visual surveillance, an area of interest is usually monitored by static cameras, thus allowing the employment of background modeling techniques for the detection of moving objects [1], [2] and [3]. In other computer vision applications, like object tracking or recognition, the segmentation of moving objects is often a necessary preprocessing step [4], [5], [6]. Background subtraction is a widely adopted approach for the detection of moving objects in videos from static cameras. The fact that the imaging sensor is not moving does not necessarily mean that the background is stationary - swaying trees, waves at the water surface and various "unimportant" movements are just a few examples of non-stationary background. Additionally, in most real-world situations a sensor will not satisfy the requirement of being absolutely static due to wind, ground vibrations, etc. Consequently, there will be some amount of background motion which the background model should take into account. All these examples indicate a need for a reliable background modeling algorithm, which should be robust enough to deal with them. As a first attempt to detect moving objects, a difference between adjacent frames has been proposed [7]. This simple technique proved to be inefficient
in real-world situations. A different approach, based on statistical modeling of the background, emerged as a more effective solution. Various algorithms for modeling the uncertainties of the background have been proposed, and they can be divided into two groups: pixel-wise models and regional models. Pixel-wise models are predominantly used, while models based on regional properties have begun to appear recently. The assumption that a single Gaussian distribution N(μ, σ²) can be used for the statistical modeling of a single pixel in a video sequence was used in [1], where the color of each pixel I(x, y) was modeled with a single three-dimensional Gaussian, I(x, y) ∼ N(μ(x, y), Σ(x, y)). Various artifacts appearing in most outdoor situations, like shadows, glitter and periodic object motion, proved to be cumbersome for a background model based on a single Gaussian pdf. A mixture of Gaussians was proposed as a solution for the multimodality of the underlying background probability density function in [8] and [2]. A decision whether a pixel belongs to the background is made by comparing it with every Gaussian density. The pixel is either associated with its closest density, or declared a foreground pixel. Based on this decision, the model is updated either by recalculating the mean and variance or by introducing a new distribution into the mixture. Although this approach has become something of a standard in background subtraction, it has several drawbacks: it is not flexible enough, it does not take into account the spatial relations of proximal pixels, and the number of Gaussians has to be specified in advance. Nonparametric data-driven kernel density estimation (KDE) was used in [3] to enable more sensitive detection of moving targets with very low false alarm rates. Background subtraction in a non-stationary scene based on the concept of a spatial distribution of Gaussians (SDG) has been addressed in [9], where a single Gaussian was used, which is insufficient to model the multimodal spatial probabilities related to the occurrence of a background object at different locations. The pixel-wise approaches assume that adjacent pixels are uncorrelated, which is far from realistic. In real scenes, neighboring pixels exhibit strong correlation. The second group of methods uses region models of the background in order to account for this correlation. The eigenspace decomposition of whole images proposed in [10] is a global method where the foreground objects are detected by projecting the current image into the eigenspace and finding the difference between the reconstructed and actual image. In the region-based approaches proposed in [11] and [12], image regions are modeled as an autoregressive moving average (ARMA) process, which is used to incrementally learn (using PCA) and then predict motion patterns in the scene. The most comprehensive region-based background subtraction model published recently was proposed by Sheikh and Shah in [13], where three innovations over existing approaches were introduced. First, it has been shown that the region-based approach is superior to the pixel-wise approach, due to its ability to exploit useful correlation between spatially proximal pixels. By using a nonparametric kernel density estimation (KDE) method over a joint domain-range representation of image pixels, a single probability density background model is assumed. Secondly, a more elaborate
foreground model is introduced, which uses the temporal persistence of moving objects, i.e. objects detected in the preceding frame provide substantial evidence for detection in the current frame. The third innovation is a MAP-MRF decision framework in which the background and foreground models are combined in a single Bayesian framework. It has been shown that joint domain-range background modeling based on nonparametric kernel density estimation is more adequate than the previously proposed methods. Prior to [13], spatial correlation was analyzed in [14], where it was stated that neighboring blocks of pixels belonging to the background should experience similar variations over time. For regions belonging to the same background object this assumption can be true, but for regions at the border of distinct background objects it will not hold. This produces several false detections that can be observed in [14] and [13], appearing at the borders of different background objects. In this paper a new joint domain-range approach to background modeling is proposed, which significantly improves the nonparametric kernel density estimation introduced in [13]. While the method of Sheikh and Shah is successful in modeling static and dynamic background regions, a problem arises at region borders where abrupt changes in illumination intensity occur. Instead of using a constant kernel bandwidth as in [13], in this work the image gradient is used to adaptively change the orientation and dimensions of the kernel at the borders of a region. This approach provides a more accurate modeling of non-stationary background containing regions of different texture and illumination. This paper is organized as follows. In Section 2 an overview of background modeling using kernel density estimation with constant bandwidth is given. The background modeling approach with gradient-driven variable kernel bandwidth is presented in Section 3. Results are given in Section 4 and briefly summarized in the Conclusion.
2 Background Modeling with Constant Bandwidth KDE
It has already been shown in [3] that KDE can produce a more flexible scene model than the traditionally used Gaussian mixture models. Nonparametric estimation methods operate on the idea that dense regions in a given feature space, populated by feature points from a class, correspond to higher underlying probability density values. However, increased complexity is the price for this improvement. By adopting a joint domain-range approach, a single KDE model is used for the whole image instead of one model per pixel [13]. It has been shown that maintaining a single joint domain-range non-parametric model is more effective than the prevailing pixel-wise models. Pixel-wise models ignore the spatial correlation between neighboring pixels, while the joint representation provides a direct means to model and exploit this dependency. In both approaches, the decision whether the current pixel x belongs to the foreground is usually based on the log-likelihood ratio test
\delta = \begin{cases} 1, & -\ln\dfrac{P(x|\psi_b)}{P(x|\psi_f)} > T \\ 0, & \text{otherwise} \end{cases}   (1)
where P(x|·) denotes the probability that a pixel is from the background ψ_b or the foreground ψ_f. In this work, the main interest is to build an accurate model of the background. Therefore, the simplest foreground model is assumed, based on the uniform distribution
P(x|\psi_f) = \gamma,   (2)
where γ = 1/(MNL) (in this work N and M are the frame dimensions and L is the number of possible intensity values). This means that a foreground pixel can have any intensity at any location in the image with probability γ. This model is quite reasonable if the knowledge gained from the foreground detection in previous frames is not used. However, it is straightforward to apply the results presented in this paper to more elaborate foreground models as in [13]. The analysis can be performed on a feature space where the K pixels are represented by x_i ∈ ℝ³, i = 1, 2, ..., K. The feature vector x is a joint domain-range representation, where the space of the image lattice is the domain (x, y) and the range is the image intensity i. By doing so, a single model of the entire background f_{X,Y,I}(x, y, i) can be made, instead of a collection of pixel-wise models. The background model is built from all the samples x_i that appeared in the last N_F frames. The kernel density estimator is built by assigning an appropriate kernel to each of these n samples [15], [16]. The probability that the estimation point x belongs to the background ψ_b is given as
P(x|\psi_b) = n^{-1} \sum_{i=1}^{n} \varphi_H(x - x_i).   (3)
Here, ϕ_H is a d-variate kernel function ϕ_H = |H|^{-1/2} ϕ(H^{-1/2} x) and H is a symmetric positive definite bandwidth matrix [17]. In order to reduce the complexity it is usually assumed that the matrix H is diagonal. In this paper it is assumed that d = 3 and the kernel ϕ is Gaussian with zero mean and unit variance, but other kernel functions and space dimensions can be used in the same manner. If the matrix H is diagonal, H = diag(σ_D², σ_D², σ_R²), ϕ_H can be separated as
\varphi_H(x, y, i) = \varphi_D(x, y)\,\varphi_R(i),   (4)
where ϕ_D and ϕ_R are the domain and range marginals of the kernel ϕ_H, respectively,
\varphi_D(x, y) = \frac{1}{2\pi\sigma_x\sigma_y}\, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)} = \frac{1}{2\pi\sigma_D^2}\, e^{-\frac{x^2+y^2}{2\sigma_D^2}},   (5)
\varphi_R(i) = \frac{1}{\sqrt{2\pi}\,\sigma_R}\, e^{-\frac{i^2}{2\sigma_R^2}}.   (6)
In the sequel it is shown that if some non-diagonal elements of the matrix H are allowed to be nonzero, the model accuracy can be improved significantly with
Fig. 1. Kernel adaptation to image parts with abrupt intensity changes
negligible increase in complexity. A non-diagonal matrix H was introduced in [18], but only for a pixel-wise model. In this work a non-diagonal matrix H is developed for the joint domain-range background model. As can be seen, this leads to an adaptive kernel that can change its shape in order to better fit the local image structure.
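Before introducing the adaptive bandwidth, the constant-bandwidth model of Eqs. (3)-(6) can be summarized by the following NumPy sketch; the sample layout and the evaluation over all samples (rather than a local neighborhood) are simplifications for illustration only, not the authors' implementation:

```python
import numpy as np

# Constant-bandwidth joint domain-range KDE, Eqs. (3)-(6).
# `samples` is an (n, 3) array of background samples (x, y, i) collected from the
# last N_F frames; sigma_d and sigma_r are the domain and range std. deviations.

def background_probability(x, y, i, samples, sigma_d=1.0, sigma_r=6.0):
    dx = samples[:, 0] - x
    dy = samples[:, 1] - y
    di = samples[:, 2] - i
    phi_d = np.exp(-(dx**2 + dy**2) / (2 * sigma_d**2)) / (2 * np.pi * sigma_d**2)
    phi_r = np.exp(-di**2 / (2 * sigma_r**2)) / (np.sqrt(2 * np.pi) * sigma_r)
    return np.mean(phi_d * phi_r)        # Eq. (3): average over the n kernels

# Illustrative sample construction from a small frame history:
# history = np.stack([img0, img1, img2]).astype(float)    # shape (N_F, H, W)
# ys, xs = np.mgrid[0:history.shape[1], 0:history.shape[2]]
# samples = np.stack([np.broadcast_to(xs, history.shape).ravel(),
#                     np.broadcast_to(ys, history.shape).ravel(),
#                     history.ravel()], axis=1)
# p_b = background_probability(0, 0, 128.0, samples)
```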
3 Gradient-Driven Adaptive Bandwidth KDE Model
The joint domain-range model of dynamic scenes with constant kernel bandwidth is well suited to situations where only smooth transitions of image intensity are present. However, in the case of abrupt changes in the background, this approach fails to produce a reliable model. Pixels positioned on region borders exhibit more variation in the range. This is more noticeable in the case of a large intensity difference between neighboring regions. Neighboring pixels belonging to adjacent regions of different intensities are far from each other in the joint domain-range space. The part of the space between them is not densely populated with data samples. Consequently, the background probability density associated with this part of the space will be decreased due to the small number of kernels that contribute to the probability calculation. In order to amend this deficiency, the joint domain-range KDE model has to increase the range component of the bandwidth of the kernels located in the transition zone, as shown in Fig. 1. Also, the shrinkage of the kernel along the direction perpendicular to an edge located at the border produces better localization in the domain. The model generated with these modifications is much better adjusted to the region borders without sacrificing accurate modeling of smooth image areas. The main idea behind the proposed concept is to modify the kernel bandwidth and orientation in accordance with the gradient associated with the given pixel. The gradient is calculated by using vertical and horizontal Sobel operators. The first step in the proposed concept is to define the kernel rotation in order to align it with the direction of an edge, i.e. the direction perpendicular to the gradient vector, as shown in Fig. 2. In order to accomplish this, a new domain D′ with coordinates (x′, y′) is introduced, which corresponds to the domain D rotated by an angle θ.
Fig. 2. Kernel orientation and deformation based on image gradient direction and intensity
Accordingly, a rotated Gaussian kernel can be defined as
\varphi_{D'}(x', y') = \frac{1}{2\pi\sigma_{x'}\sigma_{y'}}\, e^{-\frac{1}{2}\left(\frac{x'^2}{\sigma_{x'}^2} + \frac{y'^2}{\sigma_{y'}^2}\right)},   (7)
where σ_{x′}² and σ_{y′}² are the variances along the axes x′ and y′, which are defined as
x' = x\cos\theta + y\sin\theta, \qquad y' = -x\sin\theta + y\cos\theta.   (8)
Since the rotation of the coordinate system is an isometric transformation (Jacobian |J| = 1), the equality ϕ_D(x, y) = ϕ_{D′}(x′, y′) holds. This can be further expressed as
\varphi_D(x, y) = \frac{1}{2\pi\sigma_{x'}\sigma_{y'}}\, e^{-\frac{1}{2}\left[x^2\left(\frac{\cos^2\theta}{\sigma_{x'}^2} + \frac{\sin^2\theta}{\sigma_{y'}^2}\right) + y^2\left(\frac{\sin^2\theta}{\sigma_{x'}^2} + \frac{\cos^2\theta}{\sigma_{y'}^2}\right) + 2xy\cos\theta\sin\theta\left(\frac{1}{\sigma_{x'}^2} - \frac{1}{\sigma_{y'}^2}\right)\right]}.   (9)
The gradient vector of the intensity, grad(f), projected on the coordinates x and y gives f_x and f_y, respectively. Appropriate estimates of the partial derivatives, such as Sobel operators, can be used for the calculation of f_x and f_y. The argument of the gradient vector is related to the rotation angle θ as follows:
\psi = \arg\{\mathrm{grad}(f)\} = \frac{\pi}{2} + \theta.   (10)
By introducing ρ as the ratio between f_y and f_x, the following relations can be derived:
\rho = \frac{f_y}{f_x} = \tan\psi = \frac{\sin(\frac{\pi}{2}+\theta)}{\cos(\frac{\pi}{2}+\theta)},   (11)
\tan\theta = -\frac{1}{\rho}.   (12)
The kernel variances σ_{x′}² and σ_{y′}² should be modified along the coordinates of the rotated coordinate system (x′, y′). Therefore, a new parameter k is introduced as follows:
\sigma_{x'} = \sigma_D,   (13)
\sigma_{y'} = \frac{1}{k}\,\sigma_D.   (14)
By combining the modified variances σ_{x′}² and σ_{y′}² with the terms from Eq. (9), the following equations are obtained:
\frac{\cos^2\theta}{\sigma_{x'}^2} + \frac{\sin^2\theta}{\sigma_{y'}^2} = \frac{1}{\sigma_D^2}\left[1 + (k^2-1)\frac{1}{1+\rho^2}\right],   (15)
\frac{\sin^2\theta}{\sigma_{x'}^2} + \frac{\cos^2\theta}{\sigma_{y'}^2} = \frac{1}{\sigma_D^2}\left[1 + (k^2-1)\frac{\rho^2}{1+\rho^2}\right],   (16)
2\cos\theta\sin\theta\left(\frac{1}{\sigma_{x'}^2} - \frac{1}{\sigma_{y'}^2}\right) = \frac{1}{\sigma_D^2}(k^2-1)\frac{2\rho}{1+\rho^2}.   (17)
Now, it is necessary to find an appropriate relation between k and |grad(f)|. If a data sample is located in a flat image area, it is desirable that the associated kernel is isotropic. In that case k should satisfy
f_x = f_y = 0 \;\Rightarrow\; k = 1.   (18)
Conversely, if a sample is located at an abrupt change in image intensity, the kernel should be contracted in such a way that the value of k is asymptotically proportional to the image gradient,
k \sim |\mathrm{grad}(f)| = \sqrt{f_x^2 + f_y^2}, \qquad f_x, f_y \gg 1.   (19)
In accordance with equations (18) and (19), the following relation for k has been chosen:
k = \sqrt{1 + f_x^2 + f_y^2}.   (20)
Consequently, the following will hold:
\frac{\cos^2\theta}{\sigma_{x'}^2} + \frac{\sin^2\theta}{\sigma_{y'}^2} = \frac{1 + f_x^2}{\sigma_D^2},   (21)
\frac{\sin^2\theta}{\sigma_{x'}^2} + \frac{\cos^2\theta}{\sigma_{y'}^2} = \frac{1 + f_y^2}{\sigma_D^2},   (22)
2\cos\theta\sin\theta\left(\frac{1}{\sigma_{x'}^2} - \frac{1}{\sigma_{y'}^2}\right) = \frac{2 f_x f_y}{\sigma_D^2}.   (23)
By combining equations (21), (22) and (23) with Eq. (9), it is straightforward to obtain the following expression for the domain component of the kernel ϕ_H:
\varphi_D(x, y) = \frac{k}{2\pi\sigma_D^2}\, e^{-\frac{1}{2\sigma_D^2}\left[x^2(1+f_x^2) + y^2(1+f_y^2) + 2xy f_x f_y\right]} = \frac{k}{2\pi\sigma_D^2}\, e^{-\frac{1}{2\sigma_D^2}\left[x^2 + y^2 + (x f_x + y f_y)^2\right]}.   (24)
If the kernel ϕ_H is contracted in the domain k times, then in the range it should be extended by the same factor,
\sigma_I = k\,\sigma_R,   (25)
thus preserving the same value of the probability density at the center of the kernel. Consequently, the range component ϕ_R of the kernel ϕ_H will be
\varphi_R(i) = \frac{1}{\sqrt{2\pi}\,\sigma_I}\, e^{-\frac{i^2}{2\sigma_I^2}} = \frac{1}{\sqrt{2\pi}\,k\,\sigma_R}\, e^{-\frac{i^2}{2k^2\sigma_R^2}}.   (26)
The joint domain-range kernel with adaptive bandwidth can then be given as
\varphi_H(x, y, i) = \varphi_D(x, y)\,\varphi_R(i) = \frac{1}{(\sqrt{2\pi})^3\,\sigma_D^2\,\sigma_R}\, e^{-\left(\frac{x^2 + y^2 + (x f_x + y f_y)^2}{2\sigma_D^2} + \frac{i^2}{2\sigma_R^2(1 + f_x^2 + f_y^2)}\right)}.   (27)
The obtained expression (27) comprises all the necessary pixel neighborhood information that directs the orientation of the kernel. By using this expression, a significant improvement in the detection of moving objects is achieved with a small increase in complexity. The results of foreground detection obtained using the uniform foreground model given by Eq. (2) and the adaptive bandwidth KDE background model defined by Eq. (27) are presented in the following section.
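As an illustration, the following NumPy sketch evaluates the adaptive kernel of Eq. (27) and the resulting background probability; it is a simplified rendering of the equations above, not the authors' code, and the gradient values are assumed to be precomputed Sobel responses at the sample positions:

```python
import numpy as np

# Gradient-driven adaptive kernel, Eq. (27).  (dx, dy, di) are the domain/range
# offsets between the estimation point and a sample; (fx, fy) are the Sobel
# responses associated with that sample.

def adaptive_kernel(dx, dy, di, fx, fy, sigma_d=1.0, sigma_r=6.0):
    k2 = 1.0 + fx**2 + fy**2                                       # k^2, Eq. (20)
    dom = (dx**2 + dy**2 + (dx * fx + dy * fy)**2) / (2 * sigma_d**2)   # Eq. (24)
    rng = di**2 / (2 * sigma_r**2 * k2)                                 # Eq. (26)
    norm = (np.sqrt(2 * np.pi))**3 * sigma_d**2 * sigma_r
    return np.exp(-(dom + rng)) / norm

def background_probability(point, samples, grads, sigma_d=1.0, sigma_r=6.0):
    """point = (x, y, i); samples = (n, 3) array; grads = (n, 2) Sobel (fx, fy)."""
    x, y, i = point
    vals = adaptive_kernel(samples[:, 0] - x, samples[:, 1] - y, samples[:, 2] - i,
                           grads[:, 0], grads[:, 1], sigma_d, sigma_r)
    return np.mean(vals)
```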
4 Results
In this work the range is set to be the grayscale intensity space with L = 256 levels, but any other color space can be used instead. Apart from the Sobel operators used as the estimate of the image gradient, other estimators can be applied in the same manner. A uniform probability is assumed as the foreground model - a foreground pixel can have any intensity value at any location in the image with
Fig. 3. (a) Frame #145 of the video sequence CAMPUS, (b) Gradient intensity of the same frame
Fig. 4. Log-likelihood ratio: (a) Sheikh-Shah background model, (b) proposed background model, both using the same uniform foreground model
probability γ as defined in Eq. (2), where M = 640 and N = 480. Also, instead of the assumed Gaussian kernel for the background model, alternative functions like the Epanechnikov kernel or the triangular kernel can be used [17]. The values of the domain and range standard deviations of the Gaussian kernel used in the experiments, σ_D and σ_R, were 1 and 6, respectively. The background model is built using the last N_F = 50 frames. Frame #145 of the video sequence CAMPUS used in the experiments is given in Fig. 3(a). The gradient magnitude of the image from Fig. 3(a) is given in Fig. 3(b). The kernels are oriented based on the argument of the gradient and deformed according to its magnitude. Log-likelihood ratios for the Sheikh-Shah background model and for the proposed background model are shown in Figs. 4(a) and (b), respectively. Both ratios are derived under the assumption of a uniform foreground model. Background static objects with significant gradient values, such as cars, buildings, trees, etc., are much more visible in Fig. 4(a). These parts of the background will be susceptible to false detection. As can be seen in Fig. 4(b), the
Fig. 5. Foreground detection results: (a) Sheikh-Shah, (b) proposed, (c) Sheikh-Shah with median filter postprocessing, (d) proposed with median filter postprocessing
proposed algorithm suppresses the parts of the background with high gradient values more efficiently. Consequently, it is less prone to false detections. The detection results for both algorithms are presented in Fig. 5. In both cases, the detection is realized as a binary classification based on comparing the log-likelihood ratio with the same threshold value T = -1. In Fig. 5(a), structural artifacts located near the edges in the background are detected as foreground objects. The detection result of the proposed background model shown in Fig. 5(b) contains fewer false positives. Moreover, those are less structured than in Fig. 5(a) and more similar to random noise. Therefore, a simple median filter can remove them efficiently. The results of postprocessing with a median filter of 3x3 window size for the images in Figs. 5(a) and (b) are shown in Figs. 5(c) and (d). Apart from the foreground object, there are still some background artifacts in Fig. 5(c), while they are completely suppressed in Fig. 5(d). However, the true positive results are almost equal for both approaches. Receiver operating characteristics are given in Fig. 6. It can be observed that the proposed method outperforms the Sheikh-Shah background model from [13] for all detection rates.
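A minimal sketch of this decision step, assuming a precomputed per-pixel background probability map and using SciPy's median filter for the 3x3 postprocessing, is given below:

```python
import numpy as np
from scipy.ndimage import median_filter

# Threshold the log-likelihood ratio of Eq. (1) at T = -1 and clean the binary
# mask with a 3x3 median filter.  p_b is the per-pixel background probability,
# gamma = 1/(M*N*L) the uniform foreground probability of Eq. (2).

def detect_foreground(p_b, M=640, N=480, L=256, T=-1.0, eps=1e-12):
    gamma = 1.0 / (M * N * L)
    llr = -np.log((p_b + eps) / gamma)        # -ln P(x|psi_b)/P(x|psi_f)
    mask = (llr > T).astype(np.uint8)         # Eq. (1)
    return median_filter(mask, size=3)        # 3x3 median post-processing
```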
Fig. 6. ROC curves comparison (true positive rate versus false positive rate) for the Sheikh-Shah background model and the proposed background model
5 Conclusion
Joint domain-range modeling of dynamic scenes, where one model is built for the entire background, allows the efficient use of nonparametric kernel density estimation. Also, this approach takes into account local dependencies of proximal pixels, thus providing high levels of detection accuracy in the presence of dynamic backgrounds. However, the use of a kernel with constant bandwidth poses some constraints - different parts of the image having diverse properties cannot be successfully modeled with a single kernel shape. The improvement of the joint domain-range model proposed in this work is based on an adaptive kernel bandwidth. According to the local image structure, the bandwidth is adaptively changed to attain better modeling of the background. The kernel is oriented and deformed in accordance with the gradient associated with the given pixel. The range component of the kernel bandwidth is increased in intensity transition zones, while the kernel is contracted along the direction perpendicular to an edge. The model generated with these modifications is much better adjusted to the region borders without sacrificing accurate modeling of smooth image areas. This approach suppresses the structural artifacts present in the constant bandwidth kernel density model. Accordingly, the result is a more accurate detection of moving objects.
References 1. Wren, C., Azarbayejani, A., Darrel, T., Pentland, A.: Pfinder: Real Time Tracking of the Human Body. IEEE Trans. Pattern Analysis and Machine Intelligence (1997) 2. Stauffer, C., Grimson, W.: Learning Patterns of Activity Using Real-Time Tracking. IEEE Trans. Pattern Analysis and Machine Intelligence (2000)
3. Elgammal, A., Harwood, D., Davis, L.: Background and Foreground Modeling Using Non-Parametric Kernel Density Estimation for Visual Surveillance. In: Proc. IEEE, IEEE, Los Alamitos (2002) 4. Isard, M., Blake, A.: Condensation—Conditional Density Propagation for Visual Tracking. Proc. Int’l J. Computer Vision 29(1), 5–28 (1998) 5. Comaniciu, D., Ramesh, V., Meer, P.: Real-Time Tracking of Non-Rigid Objects Using Mean Shift. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, IEEE, Los Alamitos (2000) 6. Haritaoglu, I., Harwood, D., Davis, L.: W4: Real-Time of People and Their Activities. IEEE Trans. Pattern Analysis and Machine Intelligence (2000) 7. Jain, R., Nagel, H.: On the Analysis of Accumulative Difference Pictures from Image Sequences of Real World Scenes. IEEE Trans. Pattern Analysis and Machine Intelligence (1979) 8. Friedman, N., Russell, S.: Image Segmentation in Video Sequences: A Probabilistic Approach. In: Proc. 13th Conf. Uncertainity in Artificial Intelligence (1997) 9. Ren, Y., Chua, C.-S., Ho, Y.-K.: Motion Detection with Nonstationary Background. Machine Vision and Application (2003) 10. Oliver, N., Rosario, B., Pentland, A.: A Bayesian Computer Vision System for Modeling Human Interactions. IEEE Trans. Pattern Analysis and Machine Intelligence (2000) 11. Monnet, A., Mittal, A., Paragios, N., Ramesh, V.: Background Modeling and Subtraction of Dynamic Scenes. In: IEEE Proc. Int’l Conf. Computer Vision, IEEE, Los Alamitos (2003) 12. Zhong, J., Sclaroff, S.: Segmenting Foreground Objects from a Dynamic Textured Background Via a Robust Kalman Filter. In: IEEE Proc. Int’l Conf. Computer Vision, IEEE, Los Alamitos (2003) 13. Sheikh, Y., Shah, M.: Bayesian Modeling of Dynamic Scenes for Object Detection. IEEE Trans. Pattern Analysis And Machine Intelligence 27(11) (2005) 14. Seki, M., Wada, T., Fujiwara, H., Sumi, K.: Background detection based on the cooccurrence of image variations. In: Proc. of CVPR 2003, vol. 2, pp. 65–72 (2003) 15. Parzen, E.: On Estimation of a Probability Density and Mode. Annals of Math. Statistics (1962) 16. Rosenblatt, M.: Remarks on Some Nonparametric Estimates of a Density Functions. Annals of Math. Statistics (1956) 17. Wand, M., Jones, M.: Kernel Smoothing. Monographs on Statistics and Applied Probability (1995) 18. Mittal, A., Paragios, N.: Motion-based Background Subtraction Using Adaptive Kernel Density Estimation. In: EEE Conference in Computer Vision and Pattern Recognition (CVPR), IEEE, Los Alamitos (2004)
Competition Based Prediction for Skip Mode Motion Vector Using Macroblock Classification for the H.264 JM KTA Software Guillaume Laroche1,2, Joel Jung1, and Beatrice Pesquet-Popescu2 1
Orange-France Telecom R&D, 38-40 rue du G. Leclerc, 92794 Issy Les Moulineaux, France {guillaume.laroche,joelb.jung}@orange-ftgroup.com 2 ENST Paris, 46 rue Barrault, 75014 Paris, France {beatrice.pesquet}@enst.fr
Abstract. H.264/MPEG4-AVC achieves higher compression gain in comparison to its predecessors H.263 and MPEG4 part 2. This gain partly results from the improvement of motion compensation tools especially the variable block size, the 1/4-pel motion accuracy and the access to multiple reference frames. A particular mode among all Inter modes is the Skip mode. For this mode, no information is transmitted except the signaling of the mode itself. In our previous work we have proposed a competing framework for better motion vector prediction and coding, also including the Skip mode. This proposal has recently been adopted by the Video Coding Expert Group (VCEG) in the Key Technical Area-software (KTA) of H.264, which is the starting point for future ITU standardization activities. In this paper we propose an extension of this method based on the adaptation of two families of predictors for the Skip mode according to the video content and to statistical criteria. A systematic gain upon the previous method, with an average of 8.2% of bits saved compared to H.264 standard, is reported.
1 Introduction
The ITU-T SG16-Q.6 H.264 standard, also known as ISO/IEC JTC 1/SC 29/WG 11 MPEG-4 AVC [1], finalized in March 2003, achieves efficient compression by the improvement of existing tools and the inclusion of new ones such as 1/4-pel motion accuracy, multiple reference frames, variable macroblock partitions for Inter modes, new Intra predictors, arithmetic coding (CABAC), hierarchical B frames and competing 4x4 and 8x8 size transforms. Moreover, to select the best coding mode among all these possibilities, efficient non-normative tools based on rate-distortion optimization [2] have been proposed and integrated in the reference software [3].
Video Coding Experts Group (VCEG). Moving Picture Experts Group (MPEG).
Today VCEG and MPEG have formed the Joint Video Team (JVT) and focus on both the scalable video codec (H.264-SVC) and the multiview video codec (H.264-MVC). However, the classical activity on video coding has not stopped. At the 26th VCEG meeting it was decided to establish the KTA software [4] (Key Technical Area), which gathers all efficient tools proposed since the finalization of the H.264 standardization. The aim of this software is to gather coding efficiency tools, keep progressing and encourage people to contribute. The current version 1.2 of the KTA software, which is based on JM11.0 [3], contains five new tools: 1/8-pel motion accuracy [5] for motion estimation, an Adaptive Interpolation Filter [6] to improve the sub-pel motion, Adaptive Prediction Error Coding [7] to select between standard transform domain and spatial domain coding, and Adaptive Quantization Matrix Selection [8] to adaptively select the quantization matrix for the transformed residual coding. The fifth one is the competition based motion vector prediction scheme (MVComp) which we proposed in [9] and [10]. A first evolution of this MVComp method has recently been proposed in [11]. In this paper we propose to improve the latter tool with an automatic adaptation of the set of predictors for the Skip mode based on the video content. We consider two background types: background with little or no motion, and background with medium or high motion. The idea is to use few predictors where the motion is low and more predictors otherwise, without sending any side information for the background classification. The remainder of this paper is organized as follows: a summary of the Skip mode is given in Section 2. The image classification into two background types and the predictor selection in each family of predictors are described in Section 3. Section 4 presents experimental results and reports an average gain of 8.2% compared to the standard version of H.264.
2 State of the Art
2.1 Skip Mode Selection and Coding in H.264
The Skip mode is a particular way of Inter coding. For a skipped macroblock, no block residue, motion vector residue or reference frame information is transmitted. Only the Skip mode itself is signaled. The decoded macroblock corresponds to the block predictor from the first reference frame, motion compensated by the motion vector predictor for the Skip mode [1]. The motion vector predictor for the Skip mode in the H.264 standard is a spatial median of the neighboring blocks' motion vectors mv_a, mv_b and mv_c, as depicted in Fig. 1. If one or more neighboring motion vectors are not available or do not have the same reference frame, the value of the predictor switches to mv_a, mv_b, mv_c or even mv_d, depending on the availability of each of them. Moreover, if mv_a or mv_b is equal to 0, the motion vector predictor of the Skip mode is equal to 0.
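A simplified sketch of this predictor rule is given below; the full H.264 availability and reference-frame handling is richer than shown, so the helper should be read as an illustration of the text rather than a normative implementation:

```python
# Simplified Skip-mode motion vector predictor: the component-wise median of
# mv_a, mv_b, mv_c, forced to zero when mv_a or mv_b is unavailable or (0, 0).
# Motion vectors are (dx, dy) tuples; None marks an unavailable neighbor.

def median_mv(mv_a, mv_b, mv_c):
    med = lambda a, b, c: sorted((a, b, c))[1]
    return (med(mv_a[0], mv_b[0], mv_c[0]), med(mv_a[1], mv_b[1], mv_c[1]))

def skip_mv_predictor(mv_a, mv_b, mv_c):
    if mv_a is None or mv_b is None or mv_a == (0, 0) or mv_b == (0, 0):
        return (0, 0)
    if mv_c is None:
        mv_c = mv_b
    return median_mv(mv_a, mv_b, mv_c)

# skip_mv_predictor((2, 1), (3, 1), (2, 2))  ->  (2, 1)
```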
Fig. 1. Designation and location of the spatial vectors used for the H.264 median prediction
For the selection of the best coding mode, the reference software uses the minimization of the rate-distortion criterion:
J = D + λR
(1)
where D is the distortion computed in the spatial domain, λ is the Lagrange multiplier depending on the quantization parameter (QP) and R is the rate of all components to be encoded. The rates of all coding modes are computed in exact bitrates. In particular, the rate distortion criterion for the Skip mode is given by:
J_{SKIP} = D_{SKIP},   (2)
where D_{SKIP} is the distortion introduced by the Skip mode.
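As a toy illustration of this rate-distortion selection, the snippet below scores a set of hypothetical candidate modes with J = D + λR; the numbers are made up and only show why the almost rate-free Skip mode is often preferred:

```python
# Rate-distortion mode decision, Eq. (1): J = D + lambda * R.
# The Skip mode carries (almost) no rate, so it wins whenever its distortion is
# not much larger than that of the explicitly coded modes.

def best_mode(candidates, lam):
    """candidates: dict mode -> (distortion, rate_in_bits)."""
    costs = {m: d + lam * r for m, (d, r) in candidates.items()}
    return min(costs, key=costs.get)

# best_mode({"Skip": (1200.0, 0.0), "Inter16x16": (900.0, 96.0)}, lam=5.0)
# -> "Skip"   (1200 < 900 + 5 * 96)
```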
2.2 Competition Based Scheme for the Skip Mode
In this subsection we describe our previous work on the Skip mode [9], integrated in the JM KTA software. Instead of using one single median predictor, a set of N predictors is defined. The predictors of the set compete, and the best predictor is selected based on a rate-distortion criterion. Thereby Eq. 2 reads:
J_{skip} = \min_{k=1..N}\{J^k_{skip}\},   (3)
with
J^i_{skip} = D_i + \lambda_m\,\varsigma(i),   (4)
where D_i is the distortion for predictor number i, ς(i) is the cost (bitrate) of the predictor index i and λ_m is the Lagrange multiplier of the predictor index. If all predictors are equal, the index is not transmitted. Moreover, predictors that provide the same prediction values are merged behind the same index. The inverse process is applied at the decoder. This brings additional bit savings, at least for the CAVLC based entropy coder.
2.3 Analysis of the Skip Mode Selection
The Skip mode is powerful: for this mode the cost in number of bits for one macroblock is less than one bit. Its selection means that it is more interesting in an RD sense
to send nothing instead of a block residual and a motion vector residual. The Skip mode was initially created for background areas with a high probability of zero-valued motion vector and texture residues. It is consequently widely selected in areas which exhibit a static or constantly moving background. In order to confirm this hypothesis, we have studied the spatial distribution of the skipped macroblocks in several video sequences. Indeed, the Skip mode is more often selected on static or still background than on moving background.
Fig. 2. Proportion of skipped macroblocks obtained with the H.264 reference software for the Baseline profile on several sequences for four quantization parameters (27, 32, 37 and 42)
Fig. 2 shows the proportion of skipped macroblocks obtained with the H.264 reference software for the Baseline profile [1]. The test conditions used to encode these sequences are the same as those given in Section 4. It shows that the proportion of skipped macroblocks is higher for sequences with static background, such as Modo, Silent and Ice, than for sequences with moving background, especially when the sequence contains non-uniform motion, as Mobile does. In this paper, our goal is to increase the number of skipped macroblocks, especially for moving background areas.
3 Automatic Classification of the Families
The method described in [10] was shown to provide significant improvement upon H.264. In [11], we have additionally shown that an adaptive selection of the predictors, based on basic sequence characteristics, can improve the gain. We go further in this section:
• Two families of predictors are used instead of one in previous schemes.
• The first family is adapted to still backgrounds, the second to moving backgrounds.
• Each family evolves independently, at the picture level.
This section describes in detail the proposed method.
3.1 Description of the Classification Algorithm
Each macroblock is classified into one of two classes: the still background class, where the Skip mode has a high probability of being selected, and the moving background class, where the Skip mode has a low probability of being selected. The class information must be computable at the decoder side, otherwise the transmission of this information would be needed. To this end, we use the two previous frames, which are already known at the decoder side, to compute our criterion. For each macroblock, the Sum of Absolute Differences (SAD) between the collocated macroblock in the previous frame (which we shall denote as reference frame number 0) and the collocated macroblock in the second previous frame (which we shall denote as reference frame number 1) is computed (the collocated block is the block of the previous frame located at the same spatial position). If the computed SAD is lower than a fixed threshold, the current macroblock is considered as still background, otherwise as moving background. Note that currently the threshold is empirically fixed.
3.2 Description of the Evolution of the Families of Predictors
The evolution of each family is made frame by frame. The two sets are transmitted to the decoder picture by picture; consequently this information has a low impact on the bitrate (only 22 bits per picture). The current family sets are computed from the statistics of the previously encoded frame. So, for each macroblock,
the cost J^i_{skip} of each predictor is computed. Let us denote by RdCount^i_S and RdCount^i_M the number of times where predictor p_i leads to
J^i_{skip} \le J_{min},   (5)
where J_{min} is the best RD cost among all other coding modes (Inter, Intra) of the current macroblock. This criterion is relevant because it represents the number of times where the predictor is equivalent to or better than the selected macroblock mode. To determine the predictors in the still background family, let us define
MaxRdCount_S = \max_i\{RdCount^i_S\}, \quad \forall i < N,   (6)
where N is the number of all predictors. At the beginning of the selection process, there is no predictor in the family. A predictor p_i is added to the family if
MaxRdCount_S < \delta_S \times RdCount^i_S.   (7)
This means that the predictor p_i is added to the family if the number of times it minimizes the RD cost is close to the best count among all predictors, according to the threshold δ_S. For the moving background family, the same process is applied and Eq. 6 and Eq. 7 are changed into the following:
MaxRdCount_M = \max_i\{RdCount^i_M\}, \quad \forall i < N,   (8)
MaxRdCount_M < \delta_M \times RdCount^i_M.   (9)
In Eq. 7 and Eq. 9, δ_S, δ_M ∈ [1; +∞), so if these thresholds are equal to 1 only one predictor is selected for each family, and if they tend to +∞, this corresponds to the use of all predictors. In this scheme δ_S and δ_M are empirical thresholds: δ_S allows selecting few predictors for the still background, where the Skip mode has a high probability of being selected, and δ_M allows having more predictors for the moving background.
3.3 List of Predictors
For the experiments we have defined 11 predictors, which are described below:
• mv_H.264, 'H.264 median', the standard H.264 median [1] as described in Section 2.1.
• mv_a, mv_b, mv_c, the motion vectors of the neighboring blocks.
• mv_0, the zero value.
• mv_extspa, 'Extended Spatial', a slightly different spatial predictor that returns the median of mv_a, mv_b, mv_c if these three vectors are available, otherwise returns mv_a if available, otherwise mv_b, otherwise mv_c, otherwise '0'.
• mv_col, 'Collocated', a temporal predictor that returns the motion vector of the collocated block, if it is available.
• mv_Sa, mv_Sb, mv_Sc, the motion vectors of the neighboring blocks scaled according to the temporal distance between the frame pointed to by the current block predictor and the reference frame pointed to by the predictor.
• mv_tf, the motion vector at the position given by mv_H.264 in the previous frame.
The number of predictors may seem high, yet some groups of predictors usually provide the same values and thus the same residuals. mv_H.264 and mv_extspa have the same value if all neighboring vectors have the same reference frame. The original spatial motion vectors mv_a, mv_b, mv_c and their respective scaled predictors mv_Sa, mv_Sb, mv_Sc have the same value if the original motion vector points to the first reference frame (the first reference frame is the most frequently selected reference). mv_col and mv_tf also have the same value if the mv_H.264 value is near zero, which often occurs for still background. Consequently, the joint use of mv_H.264 and
795
mvextspa or mvcol and mvtf etc does not imply a high increase of the cost index for the Skip mode. We observed however that it was very useful to keep them together in a family, because whenever the prediction value is different, which is related to differences in motion vector field, the used of multiple predictor values is significant.
4 Experimental Results Simulations were performed using the KTA software version 1.1 [4], based on the H.264 reference software JM11.0 [3]. We have selected the Baseline profile and used the VCEG’s common conditions [12] for coding efficiency experiments except recommended quantization parameters, given that we target low bitrate applications where the number of skipped macroblocks is higher than for high bitrate. So we have changed QP 22 (high bitrate) by QP 42 (low bitrate). Therefore, for the experiments we have selected following tools and conditions: • • • • •
CAVLC entropy coding method Only the first frame is intra coded 32x32 search range 4 reference frames QP 27, 32, 37, 42 are selected.
The percentages of bitrate saving presented in this section are computed with the Bjontegaard metric [13], which computes average difference between RD-curves. Note that this metric has been largely adopted for testing by VCEG, due to the easier comparison of RD points corresponding to different bitrates, as usually results from closed-loop codecs. 4.1 Analysis of the Predictor Selection Fig 3 shows the average of the amount of predictors used in each family according to each sequence and for all quantization parameters. These averages of the amount of predictors are obtained with the thresholds δ S and δ M which we have respectively empirically fixed to 1.05 and 1.6. For Modo and Silent sequences the number of predictors for the still background family is high (about five). Indeed, these sequences have a large part of static background. Some predictors have the same value, which is generally equal to zero, and consequently the cost of the index predictor is low. This is verified by the results in Table 1. and Table 2., which give the percentage of selection of each predictor for respectively the still background family and the moving background family on each sequence. In fact, in Table 2. we can see that for still background the most frequent predictors are mv H .264 , mv extspa , mv col , mv Scol and mv 0 . This selection means that these five vectors are often equal to the zero value. For other sequences, the motion is higher and consequently all predictors have different values. The amount of predictors for the still background is therefore about two predictors. So, for the sequences with systematic moving background or with non
796
G. Laroche, J. Jung, and B. Pesquet-Popescu
12
Amount of predictor
10
8
Still Background
6
Moving Background
4
2
0 Foreman CIF
Mobile CIF
Modo CIF Silent CIF City SD
Crew SD
Ice SD
Fig. 3. Average of the amount of the predictor for the still and moving background sequence by sequence for all quantization parameters Table 1. Percentage of selection of each predictor for the still background family
mvH .264
mv extspa
mv a
mvb
mvc
mv0
mvcol
mv tf
mvSa
mvSb
mvSc
Foreman CIF Mobile CIF Modo CIF
30%
34%
21%
18%
12%
11%
13%
10%
22%
18%
12%
36%
21%
20%
17%
13%
23%
20%
20%
24%
21%
17%
82%
82%
24%
21%
7%
64%
71%
69%
20%
23%
7%
Silent CIF City SD Crew SD Ice SD
93% 38% 55% 36%
91% 38% 52% 28%
13% 30% 13% 9%
14% 22% 26% 1%
4% 15% 8% 2%
85% 13% 7% 27%
85% 11% 14% 91%
86% 11% 6% 23%
11% 31% 9% 7%
17% 22% 14% 3%
5% 14% 5% 1%
Table 2. Percentage of selection of each predictor for the moving background family
mvH .264
mv extspa
mv a
mvb
mvc
mv0
mvcol
mv tf
mvSa
mvSb
mvSc
Foreman CIF Mobile CIF
98%
98%
98%
97%
91%
27%
43%
37%
98%
97%
91%
94%
86%
89%
78%
62%
8%
90%
90%
96%
95%
92%
Modo CIF
94%
93%
95%
93%
82%
86%
88%
87%
94%
92%
86%
Silent CIF City SD Crew SD Ice SD
94% 99% 98% 91%
95% 99% 99% 89%
94% 99% 95% 89%
93% 99% 98% 88%
84% 99% 90% 77%
90% 0% 36% 78%
86% 4% 44% 95%
82% 1% 36% 75%
94% 99% 95% 91%
92% 99% 99% 89%
80% 99% 85% 76%
uniform motion such as Foreman, City and Crew, the temporal predictors are less often selected in the moving background family. Ice sequence has a static background with a lot of moving objects, and consequently the motion vector collocated is the most selected predictor for the two families.
Competition Based Prediction for Skip Mode Motion Vector
797
For the still background family, the selection of predictors is related to the sequence type, sequences with non uniform motion as Foreman, Mobile and City any predictors or couple predictors seems to be more often selected. The selection has almost the same probability. In fact for these sequences, macroblocks classified in still background have more neighboring macroblocks (spatial or temporal neighboring macroblocks) which are classified in moving background than sequences with fixed point of view such as Modo, Silent and Ice. It would be interesting for future work to use a classification criterion based on the variance of all predictors to determine the macroblock class. 4.2 Increase of the Skip Mode Occurrence Fig 4 shows the percentage increase of the amount of macroblocks encoded with the Skip mode for the proposed scheme and for the competition of motion vector prediction, as presented in [9], which used two fixed predictors mv extspa , and mv a (ie, the best predictor configuration obtained [11]). This percentage increase is correlated with the sequence type. In fact, the sequences with static background already have a high proportion of skipped macroblocks. For all sequences, the increased number of skipped macroblocks is higher than with MVComp method.
Ice SD
Fig. 4. Increase of the number of skipped macroblocks for MVComp and for the proposed scheme
The number of skipped macroblocks is related to the coding efficiency, because the Skip mode leads to a low bitrate. So the increase in the number of skipped macroblocks is generally related to the bitrate savings. Moreover, the decoder complexity decreases as the number of skipped macroblocks increases, because the Skip mode decoding process is less complex than a decoding process which involves inverse quantization, transform and prediction.
4.3 Global Bitrate Reduction
The global bitrate reduction presented in this subsection is related to the Skip modification (fixed set for MVComp and family adaptation for the proposed scheme) and to the competition based scheme for the motion vector prediction for all inter modes. Fig. 5 shows the global bitrate reduction for MVComp and the proposed scheme for each sequence and for all QPs. The average bitrate saving for MVComp is 6.2% and for the proposed scheme 8.2%, as depicted in Fig. 6. The proposed scheme gives a systematic gain compared to the MVComp scheme over a large test set, and the average bitrate saving is 1.9%. Note that the worst result we have obtained is a 0.9% decrease compared with MVComp on the Soccer SD sequence. The bitrate reduction seems related to the type of sequence. Sequences with a static viewpoint, such as Modo, Silent and Ice, have a higher bitrate reduction. Note that these sequences already have a high proportion of skipped macroblocks for the reference method, as depicted in Fig. 2.
Ice SD
Fig. 5. Global bitrate reduction for MVComp and proposed scheme on all sequences and for all quantization parameters
150
Fig. 6. RD curves for Modo CIF sequence at 30Hz. Baseline H.264 reference algorithm vs. MVComp and the proposed scheme.
5 Conclusion
In this paper, a competition based motion vector prediction is proposed to increase the efficiency of the Skip mode. It is driven by the classification of each macroblock into a still background class or a moving background class. For both classes, a family of predictors is adapted independently, according to a statistical rate-distortion criterion. The adaptation takes into account the hypothesis that still background macroblocks need fewer predictors than moving background macroblocks. The two families of predictors are transmitted for each frame. This scheme was tested with different sequence types. It gives a systematic bitrate reduction compared to our previous work based on a static set of predictors, already adopted by the Video Coding Experts Group in the JM KTA software. The average bitrate saving compared to the H.264 reference is 8.2%. In the near future it is planned to implement this scheme for B frames and hierarchical B frames in order to further increase the bitrate savings.
References 1. ITU-T. Recommendation H.264 and ISO/IEC 14496-10 AVC, Advanced video coding for generic audiovisual services version 3 (2005) 2. Lim, K., Sullivan, G., Wiegand, T.: Text Description of JM Reference Encoding Methods and Decoding Concealment Methods, JVT-N046 contribution, Hong-Kong (January 2005) 3. Suehring, K.: H.264 software coordination, http://iphome.hhi.de/suehring/tml/ 4. Vatis, Y.: KTA software coordination http://www.tnt.uni-hannover.de/ vatis/kta/ 5. Wedi, T.: 1/8 -pel motion vector resolution for H.26L., ITU-T VCEG, Portland, USA, Proposal Q15-K-21 (August 2000) 6. Vatis, Y., Edler, B., Thanh Nguyen, D., Ostermann, J.: Two-dimensional non-separable Adaptive Wiener Interpolation Filter for H.264/AVC, ITU-T SGI 6/Q.6 Doc. VCEG-Z17, Busan (April 2005) 7. Narroschke, M., Musmann, H.G.: Adaptive prediction error coding in spatial and frequency domain with a fixed scan in the spatial domain. ITU-T SG16/Q.6 Doc. VCEGAD07, Hangzhou (October 2006) 8. Tanizawa, A., Chujoh, T.: Adaptive Quantization Matrix Selection on KTA Software. ITU-T SG16/Q.6 Doc. VCEG-AD06, Hangzhou (October 2006) 9. Jung, J., Laroche, G.: Competition-Based Scheme for Motion Vector Selection and Coding. VCEG Contribution VCEG-AC06, Klagenfurt (July 2006) 10. Laroche, G., Jung, J., Pesquet-Popescu, B.: A spatio-temporal competing scheme for the rate-distortion optimized selection and coding of motion vectors. In: Proc. European Signal Processing Conf. Florence, Italy (2006) 11. Jung, J., Laroche, G., Pesquet-Popescu, B.: RD optimized competition scheme for efficient motion prediction. Invited Paper, VCIP, SPIE Electronic Imaging, January 28-Febuary 1st, 2007, San Jose, CA, USA (2007) 12. Tan, T.K., Sullivan, G.J., Wedi, T.: Recommended simulation common conditions for coding efficiency experiments. ITU-T VCEG, Nice, Input / Discussion VCEG-AA10 (October 2005) 13. Bjontegaard, G.: Calculation of average PSNR differences between RD-curves. ITU-T VCEG, Texas, USA, Proposal VCEG-M33 (April 2001)
Efficiency of Closed and Open-Loop Scalable Wavelet Based Video Coding M.F. López, V.G. Ruiz, and I. García Dept. Computer Architecture and Electronics, University of Almería, Almería, Spain Abstract. Video compression techniques can be classified into scalable and non-scalable. Scalable coding is more suitable in variable-bandwidth scenarios because it improves the quality of the reconstructed video. On the other hand, scalability has a cost in terms of coding efficiency and complexity. This paper describes a JPEG2000- and MCTF-based fully scalable video codec (FSVC) and analyzes a set of experiments that measure the cost of scalability, comparing two different FSVC encoders: open-loop FSVC and closed-loop FSVC. In the open-loop version of FSVC, the encoder uses the original images to make the predictions. The closed-loop scheme generates the predictions with reference images identical to those obtained by the decoder at a given bitrate. Numerical and visual results demonstrate a small loss of coding efficiency for the open-loop scheme. Moreover, the inclusion of the closed loop increases the complexity of the encoder and produces poor performance at high bitrates.
1
Introduction
Scalable video coding is a technique which allows a compressed video stream to be decoded in several different ways. Users can recover a specific version of a video according to their own requirements: spatial resolution, image quality, frame rate and data rate. Spatial scalability provides a set of reduced-resolution reconstructions for each image or region of interest. The progressive minimization of the distortion of the reconstructed video at the decoder is achieved using quality scalability. A variation of the frame rate is obtained by means of temporal scalability. Finally, these types of scalability can be combined to generalize the idea of scalability with the concept of data rate scalability. Scalable video coding is a major feature for video storage and video transmission systems. For example, in video-on-demand (VoD) applications, a server sends a video stream to a set of clients through a number of transmission links. In most cases, the quality, resolution, and frame rate of the visualizations must be adapted to the requirements of the decoder and the available bandwidth. In this context, the computational requirements of the servers are proportional to the number of different kinds of clients, and non-scalable video coding has two alternatives to minimize them: (i) the creation of a specific copy of the video sequence for each type of client or (ii) the use of CPU-intensive real-time
transcoding processes to re-encode the video on-the-fly. Scalable video coding addresses this problem by storing only one copy of each video sequence at the server and simplifying the transcoding task. This simple transcoding consists of a reordering that can be carried out by the clients retrieving the adequate portions of the compressed video data. This work describes and studies a fully scalable video coding system, called FSVC, specially designed for VoD applications over data networks with unpredictable bandwidth (like the Internet). FSVC is based on open-loop motion compensated temporal filtering (MCTF) and its output is a sequence of JPEG2000 packets that are placed in the compressed stream using some ordering (or progression). The decoding order of these packets determines the way the video will be displayed when only a part of the compressed stream is decoded. FSVC supports the following kinds of scalability: (i) fine-grain progressive by quality, (ii) dyadic progressive by resolution and (iii) dyadic progressive by frame rate. The behavior of the coding efficiency of FSVC is examined by adding and testing a closed-loop scheme. The rest of this paper is organized as follows. In Section 2 the open-loop FSVC encoding system is described. Section 3 focuses on the design of closed-loop FSVC. Experimental results are shown and analyzed in Section 4. Concluding remarks are given in Section 5.
Fig. 1. The block diagram of the FSVC codec. MC = Motion Compensation, ME = Motion Estimation, 2D-DWT = 2-Dimensional Discrete Wavelet Transform and EBCOT = Embedded Block Coding with Optimized Truncation.
2
The FSVC Codec
The discrete wavelet transform (DWT) has proved to be an excellent decorrelation tool for images, even better than the discrete cosine transform (DCT) [1]. Another advantage of the DWT is the smooth reconstructions obtained when only a portion of the wavelet information is used. The research community is very interested in the application of the DWT to the field of video compression.
Fig. 2. An example of the MCTF-based temporal decorrelation scheme of FSVC for a GOF with 8 frames (only one DWT subband is shown)
One of the first works in this direction was based on the idea of processing digital video as a 3D signal [2]. The sequence of images is divided into groups of consecutive frames (GOFs) and each of them is transformed using the 3D-DWT and compressed by an embedded entropy coder. Obviously, the main advantage of this technique is its simplicity. Nevertheless, the compression ratios and the quality of the video reconstructions are not very good. The main reason is that the filters designed for the DWT are not suitable for decorrelating digital video in the temporal domain. When a small amount of information is used to decompress a video sequence, unpleasant ghosting artifacts are generated by the movement of the objects [3]. A way to improve the overall performance of this technique consists of aligning the GOF images before transforming them into the wavelet domain. This alignment increases the temporal redundancy and helps to improve the compression performance [4]. A straightforward way to improve that technique is the application of a block-based motion-compensated differential encoder followed by a 3D-DWT and an entropy codec [5]. The main disadvantage of this kind of codec is the low performance of the wavelet filters when they are applied to the prediction error. These residual images usually show blocking artifacts, on which most wavelet filters do not work very efficiently. To minimize this problem (clearly visible in the reconstructions), mesh-based motion estimation (ME) algorithms and other more complex algorithms have been proposed [3]. A better way to take advantage of the excellent decorrelation that wavelet transforms provide consists of applying the motion compensation after the wavelet decomposition. This technique, usually named in-band motion compensation (IBMC) [6], computes the residual images in the wavelet domain instead of the image domain. The main advantage of the IBMC video codec is its high visual quality for a partially decoded signal. Although IBMC video coding uses blocks
to build the predictions, the blocking visual effect does not appear in the image domain. The video codec described in this work is actually an IBMC system. FSVC is a fully scalable video compression system [7]. As can be seen in Fig. 1, the encoder is a differential coding scheme based on open-loop MCTF applied in the wavelet domain, with embedded block coding with optimized truncation (EBCOT) [8,9,10] applied to the residues. The compressor uses the motion information computed by the ME module and the original images to generate a sequence of prediction frames P that are subtracted from the original video sequence I. The prediction errors E are progressively encoded using the EBCOT module. As shown in Fig. 2, the input video sequence I is segmented into GOFs of size G (G = 8 in the example of the figure). Each GOF is divided into 1 + log_2 G temporal resolution levels to obtain dyadic temporal scalability within each GOF. The lowest temporal resolution level T^3 is composed of the I[G · i] frames (denoted T^3 = {I[G · i]}), where i = 0, 1, · · · indexes the frames of the video sequence. The next temporal resolution level is T^2 = {I[2^2 · i]}. In general, T^t = {I[2^t · i]}, where t = 0, 1, · · · , log_2 G. T^j depends on T^{j+1}, except, obviously, the lowest temporal resolution level T^{log_2 G}, where all the frames are intra-coded (all of them can be independently decoded). This allows the decoder: (i) to access any GOF of the compressed video without decoding the rest and (ii) to avoid error propagation when real-time transmissions are carried out over error-prone transmission links. The MCTF design of FSVC is a motion compensated block-based system which differs from other common schemes found in the literature [11,12]. Fig. 2 shows how the frames at each temporal resolution level are predicted. Inspecting Fig. 1, it can be seen that the motion estimation is done in the image domain while the motion compensation is performed in the wavelet domain, choosing the correct phase and using the same motion field M for the same location at each spatial resolution. Every transformed frame is decomposed into a set of non-overlapping blocks which are predicted from the previous and the next frame in the lower temporal resolution level. For instance, frame 4, which belongs to T^2, is predicted from frames 0 and 8, which belong to T^3. Therefore, the blocks can be backward or forward predicted. The choice between a forward (F-block) and a backward (B-block) prediction is decided according to the MSE (mean square error), taking into account the minimization of drift errors. Drift errors propagate over the dependencies between predicted frames. Thus, for predicted frame 1, forward predictions have higher priority than backward predictions, because at the decoder frame 0 (where all the blocks are intra-coded) will be reconstructed without drift error. After subtracting the prediction frames P from the predicted ones I, a sequence of residue frames E is generated for each temporal resolution level. Note that all the blocks of T^{log_2 G} are intra-coded. Intra-coded blocks can also be used at other temporal resolution levels when the MSE of the residue is not low enough.
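The dyadic GOF structure described above can be made concrete with a small sketch. The following Python fragment is illustrative only (it is not part of FSVC); it assumes frame indices within a single GOF and simply enumerates, for G = 8, the temporal resolution level of every frame and the two reference frames of the next coarser level from which it is predicted, as drawn in Fig. 2.

```python
# Hypothetical sketch (not the authors' code): enumerate the dyadic temporal
# decomposition of one GOF.  Frame n belongs to the highest level t such that
# n is a multiple of 2**t; frames of the top level are intra-coded and every
# other frame is predicted from the two enclosing frames of the next coarser
# level, e.g. frame 4 from frames 0 and 8 when G = 8.
import math

def gof_structure(G=8):
    """Return {frame: (level, backward_ref, forward_ref)} for frames 0..G."""
    top = int(math.log2(G))
    info = {}
    for n in range(G + 1):
        t = top if n % G == 0 else max(k for k in range(top) if n % (2 ** k) == 0)
        if t == top:
            info[n] = (t, None, None)          # intra-coded reference frame
        else:
            step = 2 ** (t + 1)
            back = (n // step) * step          # enclosing frame of level t+1
            info[n] = (t, back, back + step)
    return info

if __name__ == "__main__":
    for frame, (level, b, f) in gof_structure(8).items():
        print(f"frame {frame}: level T^{level}, refs ({b}, {f})")
```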
The temporal decorrelation is performed in the wavelet domain with the aim of: (i) avoiding artifacts in the reconstructions when spatial scalability is used and (ii) minimizing the unpleasant blocking artifacts that are visible at low bitrates. The motion compensated wavelet blocks are constructed by selecting the correct phase (overcomplete DWT) to avoid the shift variability of the DWT [6,13]. The frame residues are compressed with EBCOT, and the motion fields with a static zero-order probabilistic model and a Huffman coder. EBCOT produces a sequence of JPEG2000 packets that are placed in the stream using some ordering. The receiving order is important because it determines the way the video will be displayed when only a partial decoding is carried out. In a progressive-by-quality scenario, the FSVC decoder must choose the LTRCP progression (derived from the LRCP progression of JPEG2000), where L stands for quality layer, T for temporal resolution level, R for spatial resolution level, C for color component and P for precinct. Other useful progressions are RLTCP and TLRCP, which allow progressive-by-resolution and progressive-by-frame-rate reconstructions, respectively.
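As an illustration of how a progression string maps onto a packet ordering, the following sketch sorts invented packet descriptors by the keys named above. The Packet fields and the order_packets helper are assumptions made for the example; they are not FSVC's actual data structures.

```python
# Illustrative only: order packet descriptors according to a progression
# string such as "LTRCP".  Each letter selects one key: L = quality layer,
# T = temporal level, R = resolution, C = colour component, P = precinct.
from collections import namedtuple

Packet = namedtuple("Packet", "L T R C P payload")

KEY = {"L": lambda p: p.L, "T": lambda p: p.T, "R": lambda p: p.R,
       "C": lambda p: p.C, "P": lambda p: p.P}

def order_packets(packets, progression="LTRCP"):
    """Sort packets so that the first letter of the progression varies slowest."""
    return sorted(packets, key=lambda p: tuple(KEY[k](p) for k in progression))

# Truncating an LTRCP-ordered stream keeps the lowest quality layers of every
# temporal level first (progressive by quality); "TLRCP" would instead give a
# progressive-by-frame-rate reconstruction.
stream = [Packet(L, T, R, 0, 0, b"") for L in range(2) for T in range(3) for R in range(2)]
for p in order_packets(stream, "LTRCP")[:6]:
    print(p.L, p.T, p.R)
```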
3
Closed-Loop FSVC
In practical cases, the FSVC decoder decompresses only a part of the stream generated by the encoder, depending on the available bandwidth. Consequently, the residues Ê and the frames Î at the decoder are only an approximation of the original residues and frames at the encoder (see Fig. 1). As the predictions P depend on the reconstructions, a drift error appears at the decoder. Owing to the dyadic MCTF scheme of FSVC explained in Section 2, drift does not accumulate over time. This has two advantages: (i) the number of temporal resolution levels is smaller than the size of the GOF and, therefore, the drift is small, and (ii) the drift is spread along the GOF. To know how much coding efficiency is lost due to drift, a closed loop has been included in the encoder to ensure that both encoder and decoder use the same predictions, completely removing the drift error at a selected bitrate. FSVC was designed without an update step while preserving the dyadic temporal decomposition of MCTF. This allows the FSVC encoder to use either an open-loop (OL) or a closed-loop (CL) prediction step in the lifting scheme. From a block-diagram point of view, CL-FSVC is quite similar to OL-FSVC. The MC module of CL-FSVC uses the frames reconstructed by the decoder at a given bitrate k (see the dashed lines in Fig. 1) instead of the original frames used by OL-FSVC (see the dotted lines in Fig. 1). Therefore, the drift error disappears when reconstructing the video sequence at the bitrate k (where Ê and Î are identical at the encoder and the decoder). The FSVC decoder is the same for OL-FSVC and CL-FSVC.
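The difference between the two encoders can be reduced to a single choice: which reference feeds the motion-compensated prediction. The toy below is only a conceptual illustration, not FSVC code; "decoding at bitrate k" is mimicked by coarse quantization, and all constants are invented.

```python
# Toy comparison of open- vs. closed-loop prediction references.
import numpy as np

def fake_decode_at_rate(frame, step):
    """Stand-in for the decoder output at a limited bitrate: quantise the frame."""
    return np.round(frame / step) * step

def residual(frame, reference, closed_loop, step=8.0):
    # OL: predict from the pristine reference; CL: predict from what the decoder
    # would reconstruct at rate k, which removes drift at that rate.
    ref = fake_decode_at_rate(reference, step) if closed_loop else reference
    return frame - ref

rng = np.random.default_rng(0)
ref = rng.uniform(0, 255, (4, 4))
cur = ref + rng.normal(0, 2, (4, 4))     # small motion-compensated difference

print(np.abs(residual(cur, ref, closed_loop=False)).mean())
print(np.abs(residual(cur, ref, closed_loop=True)).mean())
```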
4
Experimental Results
A set of experiments has been carried out to analyze the effects of the open- and closed-loop schemes on the coding efficiency of FSVC.
Fig. 3. Average PSNR of the luminance component for coastguard and bus video sequences. OL-FSVC and CL-FSVC are compared. The reference images in CL-FSVC have been decoded at k = 896, 1024 and 1536 Kbps (vertical green lines).
The “progressive by quality” decoding scenario has been chosen because it is the most interesting for VoD applications. The coding parameters used to run OL-FSVC and CL-FSVC are:
– Spatial Filter: Biorthogonal 9/7. Spatial Resolution Levels: 4.
– Temporal Filter: Bidirectional 1/1 (open-loop for OL-FSVC and closed-loop for CL-FSVC). Temporal Resolution Levels: 5.
– Motion Compensation: Fixed block size with 1/1 pixel accuracy.
Each GOF is composed of 16 frames (4 temporal resolution levels). Each color component is encoded using 16 quality layers and 4 spatial resolution levels. The video codestream has been decompressed using the LTRCP progression at several bitrates, i.e., the OL-FSVC and CL-FSVC compressed data are progressively decompressed at different bitrates with the FSVC decoder. The results presented in Fig. 3 and 4
Fig. 4. Average PSNR of the luminance component for container and akiyo video sequences. OL-FSVC and CL-FSVC are compared. The reference images in CL-FSVC encoder were decoded at k = 896, 1024 and 1536 Kbps (vertical green lines).
are for the well-known video test sequences coastguard, bus, container and akiyo. The figures show the rate-distortion evaluation used to compare OL-FSVC and CL-FSVC. The Y-axis represents the average PSNR of the luminance component over the complete video sequence, and the X-axis represents the decoding bitrate. The closed-loop prediction of the CL-FSVC encoder has been set to k = 896, 1024 and 1536 Kbps (kilobits per second). The results demonstrate that CL-FSVC loses efficiency when the decoding bitrate is higher than the closed-loop bitrate k; moreover, the PSNR loss grows as the decoding bitrate increases. CL-FSVC obtains slightly better video reconstructions from low bitrates up to the a priori known bitrate of k Kbps. The highest coding gain is obtained at k Kbps, and the improvement is smaller than 0.5 dB.
Fig. 5. Visual results for the third image of the akiyo, bus and coastguard video sequences decoded at 896 Kbps. On the left OL-FSVC and on the right CL-FSVC
At higher bitrates, CL-FSVC performs worse than OL-FSVC because the decoded frames are similar to the original video and the prediction frames have higher quality at the decoder than at the CL-FSVC encoder. Finally, Fig. 5 shows some reconstructed frames at k Kbps. A subjective comparison indicates that there is no visual difference between the frames decoded with CL-FSVC and OL-FSVC. Note that k Kbps is the decoding bitrate at which CL-FSVC obtains the highest coding gain.
5
Conclusions
This paper describes a fully scalable video codec (FSVC) based on MCTF and JPEG2000. FSVC provides fine granularity in temporal, quality and spatial scalability. Two different FSVC encoder schemes, open-loop and closed-loop, have been designed and tested to investigate their coding efficiency and behavior. Experimental results with standard video sequences demonstrate that CL-FSVC only outperforms OL-FSVC around the bitrate selected for the closed loop. The coding and visual gain is not significant, and CL-FSVC performs worse at high bitrates. It can be concluded that if the maximal decoding bitrate is known a priori, the performance of FSVC can be improved using a closed-loop scheme. Otherwise, open-loop FSVC offers similar or higher coding efficiency.
References 1. Taubman, D., Marcellin, M.: JPEG 2000 Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, Dordrecht (2002) 2. Kim, B.J., Pearlman, W.A.: An embedded wavelet video coder using threedimensional set partitioning in hierarchical trees. In: Proceedings of the IEEE Data Compression Conference, pp. 251–260. IEEE Computer Society Press, Los Alamitos (1997) 3. Secker, A., Taubman, D.: Lifting-based invertible motion adaptive transform (LIMAT) framework for highly scalable video compression. IEEE Transactions on Image Processing 12, 1530–1542 (2003) 4. Taubman, D., Zakhor, A.: Multirate 3-D subband coding of video. IEEE Transactions on Image Processing 3, 572–588 (1994) 5. Wang, Y., Cui, S., Fowler, J.E.: 3D video coding using redundant-wavelet multihypothesis and motion-compensated temporal filtering. In: Proceedings of the IEEE International Conference in Image Processing (ICIP), pp. 775–778. IEEE Computer Society Press, Los Alamitos (2003) 6. Andreopoulos, Y., van der Schaar, M., Munteanu, A., Barbarien, J., Schelkens, P., Cornelis, J.: Fully-scalable wavelet video coding using in-band motion compensated temporal filtering. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 417–420. IEEE, Los Alamitos (2003) 7. L´ opez, M.F., Rodr´ıguez, S.G., Ortiz, J.P., Dana, J.M., Ruiz, V.G., Garc´ıa, I.: FSVC: a new fully scalable video codec. In: Gagalowicz, A., Philips, W. (eds.) CAIP 2005. LNCS, vol. 3691, pp. 171–178. Springer, Heidelberg (2005) 8. Ohm, J.R.: Three-dimensional subband coding with motion compensation. IEEE Transactions on Image Processing 3, 559–571 (1994) 9. Choi, S.J., Woods, J.: Motion compensated 3-D subband coding of video. IEEE Transactions of Image Processing 8, 155–167 (1999) 10. Taubman, D.: High performance scalable image compression with EBCOT. IEEE Transactions on Image Processing 9, 1158–1170 (2000)
11. Luo, L., Wu, F., Li, S., Xiong, Z., Zhuang, Z.: Advanced motion threading for 3D wavelet video coding. Signal Processing: Image Communication, Special Issue on Subband/Wavelet Video Coding 19, 601–616 (2004) 12. Chen, P., Woods, J.W.: Bidirectional MC-EZBC with lifting implementation. IEEE Transactions on Circuits and Systems for Video Technology 14, 1183–1194 (2004) 13. Andreopoulos, Y., Munteanu, A., der Auwera, G.V., Cornelis, J., Schelkens, P.: Complete-to-overcomplete discrete wavelet transforms: theory and applications. IEEE Transactions on Signal Processing 53, 1398–1412 (2005)
Spatio-temporal Information-Based Simple Deinterlacing Algorithm Gwanggil Jeon, Fang Yong, Joohyun Lee, Rokkyu Lee, and Jechang Jeong Department of Electronics and Computer Engineering, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul, Korea
[email protected]
Abstract. In this paper, we propose a new computationally efficient fuzzy rule-based line doubling algorithm which provides effective visual performance. In the proposed scheme, a spatio-temporal mode selector and a fuzzy rule-based correlation-dependent interpolation technique are applied to the 2-D input signal. The basic idea is to classify the field dynamically into background and foreground areas. The proposed method interpolates missing pixels using temporal information in the background area, and then interpolates the remaining pixels in the foreground area using spatial information and fuzzy rules.
1 Introduction Deinterlacing technologies provide a progressively scanned video signal from an interlaced version with a frame rate equal to the original field rate. Because the sampling process of interlaced TV signals in the vertical direction does not satisfy the Nyquist sampling theorem, linear sampling-rate conversion theory cannot be utilized for effective interpolation. This causes several visual artifacts which decrease the picture quality of the interlaced video sequence. For example, twitter artifacts will occur with fine vertical details, where pixels appear to twitter up and down. Flicker artifacts occur in regions of high vertical frequency detail, causing annoying flicker. An unwanted staircase effect will occur when diagonal edges move slowly in the vertical direction. Deinterlaced video is expected to have improved image quality through the reduction of the aforementioned artifacts. However, several simple intra-field methods like line replication, line averaging or directional spatial interpolation are not capable of removing flicker artifacts. Recently, many different approaches that adopt fuzzy reasoning have been proposed in the engineering domain. Fuzzy reasoning methods have proved effective in image processing (e.g., filtering, interpolation, edge detection, and morphology), and have numerous practical applications. In [1], a line interpolation method using an intra-field edge-direction detector was proposed to obtain the correct edge information. This detector works by identifying small pixel variations in five orientations and by using rules to infer the interpolation filter. Fuzzy logic has successful applications in process control, where binary decisions do not yield good results. Other examples of applications of fuzzy controllers in low-level image processing are a fuzzy edge detector by Michaud [2], fuzzy rate control for MPEG video [3] and fuzzy operators
for filtering and edge detection [4]. Fuzzy rule-based motion adaptive and motion compensated deinterlacing algorithms were proposed in [5, 6]. In this paper, we propose a motion adaptive deinterlacing scheme using a motion detector and fuzzy rule-based spatial domain interpolation. The proposed algorithm is based on the spatio-temporal edge-based line average (STELA) algorithm, which performs interpolation in the direction of the highest sample correlation [7]. That technique exhibits good performance while requiring a small computational burden. However, it has the drawback that picture quality deteriorates in motion areas. Also, interpolation errors frequently occur when the signal has high horizontal frequency components. The rest of the paper is organized as follows. In Section 2, the details of the motion and edge direction detector, the fuzzy rule-based edge-sensitive line average algorithm, and the interpolation strategy are described. Experimental results and conclusions are finally presented in Section 3 and Section 4.
2 Proposed Fuzzy Rule-Based Line Doubling Algorithm 2.1 Fuzzy Image Processing and STELA Algorithm Let x(i,j,k) denote the intensity of a pixel to be interpolated in this work. The variable i refers to the column number, j to the line number, and k to the field number. Fuzzy techniques offer a suitable framework for the development of new methods because they are nonlinear and knowledge-based. Pure fuzzy filters are mainly based on fuzzy if-then rules, where the desired filtering effect can be achieved using a suitable set of linguistic rules [8]. Fig. 1 shows the general structure of fuzzy image processing, which consists of three stages: fuzzification (Θ), a suitable operation (Ξ) on the membership values, and defuzzification (Ψ). The output of the fuzzy system xFLD(i,j,k) for an input x(i,j,k) is given by the following equation, while xLI(i,j,k) represents the output of linear interpolation: xFLD(i,j,k) = Ψ(Ξ(Θ(x(i,j,k))))   (1)
Fig. 1. The general structure of fuzzy image processing
Fig. 2. The block diagram of the STELA algorithm
The line doubling method used to fill the missing scan lines processes the residual high-frequency components of the signal. In the final stage of the STELA algorithm, the results of the line doubler and of the direction-dependent interpolation are added to fill the missing lines. Fig. 2 shows the block diagram of the STELA algorithm. First, a 2D input signal is decomposed into low-pass and high-pass filtered signals. The high-pass filtered signal is obtained by subtracting the low-pass filtered signal from the input signal. Then, each signal is processed separately to estimate the missing scan lines of the interlaced sequence. The interpolation method uses a spatio-temporal window with four scan lines and determines the minimum directional change; it then chooses the median among the average value along the direction of minimum change, the pixel values of the previous and next frames, and the pixel values of the top and bottom fields of the current frame. 2.2 Motion Detector and Temporal Interpolation We introduce a new interpolator, called the fuzzy rule-based line doubling (FLD) algorithm. This new interpolator has two separate steps: the spatio-temporal mode selector and the fuzzy rule-based interpolator. In the literature, conventional deinterlacing methods have been reported that interpolate missing pixels indiscriminately in the same way. In this paper, we utilize different methods adaptively under different conditions. In order to alleviate the interpolation error caused by high horizontal frequency components, we apply a direction-based interpolation method to the low-pass filtered signal. Let x(i,j−1,k) and x(i,j+1,k) denote the upper reference line and the lower reference line, respectively. The variable i refers to the column number, j to the line number, and k to the field number. Consider the pixel xFLD(i,j,k), which is to be interpolated. The edge direction detector utilizes directional correlations among pixels in order to linearly interpolate a missing line. A 3-horizontal × 2-vertical × 3-temporal 3D localized window is used to calculate directional correlations and to interpolate the current pixel, as shown in Fig. 3. Here, {N, S, E, W, P, F} represent {north, south, east, west, past, future}, respectively. For the measurement of the spatio-temporal correlation of the samples in the window, we determine six directional changes given by

C_{S,45°} = |NW − SE|,  C_{S,0°} = |N − S|,  C_{S,−45°} = |NE − SW|,
C_{T,45°} = |PW − FE|,  C_{T,0°} = |P − F|,  C_{T,−45°} = |PE − FW|.   (2)
Fig. 3. Spatio-temporal window for the direction-based deinterlacing
The parameter C_{ψ,θ} denotes a directional correlation measurement, i.e., the intensity change in the direction represented by ψ ∈ {S, T} and θ ∈ {−45°, 0°, 45°}. C_{ψ,θ} is used to determine the direction of the highest correlation, and the average value of the two samples along the direction with the minimum change is used for interpolation. If the parameter ψ of the minimum is S, the algorithm proceeds as described in the following section. Otherwise, the output of the direction-based algorithm is obtained as

x_FLD(i,j,k) = (PW + FE)/2, if min(C_{T,45°}, C_{T,0°}, C_{T,−45°}, C_{S,45°}, C_{S,0°}, C_{S,−45°}) = C_{T,45°}
x_FLD(i,j,k) = (P + F)/2, if the minimum is C_{T,0°}
x_FLD(i,j,k) = (PE + FW)/2, if the minimum is C_{T,−45°}   (3)
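A minimal sketch of this decision rule is given below, assuming the twelve window samples of Fig. 3 have already been gathered. The use of absolute differences for the directional changes and the returned None for the spatial case are choices made for the illustration, not part of the paper's specification.

```python
# Sketch of Eqs. (2)-(3): pick the spatio-temporal direction with the smallest
# change; if it is temporal, average along it, otherwise defer to the fuzzy
# spatial interpolation of Sect. 2.3.
def directional_interpolate(NW, N, NE, SW, S, SE, PW, P, PE, FW, F, FE):
    changes = {
        ("S", 45): abs(NW - SE), ("S", 0): abs(N - S), ("S", -45): abs(NE - SW),
        ("T", 45): abs(PW - FE), ("T", 0): abs(P - F), ("T", -45): abs(PE - FW),
    }
    (domain, angle), _ = min(changes.items(), key=lambda kv: kv[1])
    if domain == "S":
        return None                       # spatial case: handled by the fuzzy rules
    pairs = {45: PW + FE, 0: P + F, -45: PE + FW}
    return pairs[angle] / 2.0             # Eq. (3): average along the best temporal direction

# Example: the temporal 0-degree change is smallest, so (P + F)/2 is returned.
print(directional_interpolate(10, 10, 14, 11, 20, 22, 12, 12, 13, 12, 12, 13))
```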
2.3 Edge Direction Detector and Edge-Considered Spatial Interpolation The fuzzy rule-based spatial domain linear average algorithm uses luminance difference values to determine whether a certain missing pixel lies on a strong edge or not. It is assumed that the pixels in the (j−3)th row are assigned to t = {NW', N', NE'}, the pixels in the (j−1)th row to u = {NW, N, NE}, the pixels in the (j+1)th row to v = {SW, S, SE}, and the pixels in the (j+3)th row to w = {SW', S', SE'}. For each pixel (i,j,k) of the image, a neighborhood window is used, and each neighbor with respect to (i,j,k) corresponds to one direction. The luminance differences LD'_edge_direction x(i,j,k), LD_edge_direction x(i,j,k) and LD''_edge_direction x(i,j,k) are defined as the gradients. For example, in the case edge_direction = 45°: LD'_45 x(i,j,k) = NW' − NW, LD_45 x(i,j,k) = NW − NE and LD''_45 x(i,j,k) = SW − SE'; in the case edge_direction = 0°: LD'_0 x(i,j,k) = N' − N, LD_0 x(i,j,k) = N − S and LD''_0 x(i,j,k) = S − S'; in the case edge_direction = −45°: LD'_−45 x(i,j,k) = NE' − NE, LD_−45 x(i,j,k) = NE − SW and LD''_−45 x(i,j,k) = SW − SW'. Each edge direction corresponds to a center position (0,0). The membership functions used are BN (for the fuzzy set big negative), SN (small negative), SP (small positive) and BP (big positive). The horizontal range of these functions covers all possible gradient values, i.e., values between −255 and 255, and the vertical axis represents a membership degree between 0 and 1.
Fig. 4. The patterns used to preserve edges and peaks: (a) decimated signal, (b) edge, (c) monotonic slope, (d) peak
IF (LD_U, LD_L) ∈ {(BN, BN), (SN, SN), (SP, SP), (BP, BP)} THEN x_FLD(i,j,k) = (b + c)/2
IF (LD_U, LD_L) ∈ {(BN, BP), (SN, SP)} THEN x_FLD(i,j,k) = (b + c)/2 + δ
IF (LD_U, LD_L) ∈ {(BP, BN), (SP, SN)} THEN x_FLD(i,j,k) = (b + c)/2 − δ
IF (LD_U, LD_L) ∈ {(BN, SN), (BN, SP), (BP, SN), (BP, SP)} THEN x_FLD(i,j,k) = c
IF (LD_U, LD_L) ∈ {(SN, BN), (SN, BP), (SP, BN), (SP, BP)} THEN x_FLD(i,j,k) = b   (4)
An edge pattern recognizer can be designed using the luminance differences between adjacent pixels. We consider a one-dimensional case of line interpolation. Given four consecutive pixels a, b, c and d, the conventional linear interpolation output is xLI = (b + c)/2, as shown in Figs. 4(b), (c) and (d). However, for an edge-type signal, the ideal interpolator should yield for the pixel xFLD (which lies between b and c) a value similar either to that of b or to that of c; this shows that the linear interpolator cannot preserve edges. On the other hand, for monotonic slope and peak-type signals, the patterns shown in Fig. 4(c) and Fig. 4(d) need to be considered. For the monotonic slope type signal (Fig. 4(c)), linear interpolation is reasonable because xFLD = (b + c)/2 is the desirable result. For the peak-type signal (Fig. 4(d)), a compensation parameter δ is applied to xFLD, while the linear interpolation xLI is calculated as (b + c)/2 ≈ u ≈ v. Here, the value of δ can be determined empirically. The final interpolation result xFLD is obtained by (4).
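The rule set of (4) can be sketched as follows. The fragment is a simplified, crisp approximation (the paper uses fuzzy membership functions BN, SN, SP and BP over [−255, 255]); the threshold T and the compensation value delta below are illustrative values, not the ones used in the paper.

```python
# Crisp sketch of the edge-sensitive interpolation rules in Eq. (4).
def label(gradient, T=40):
    if gradient <= -T: return "BN"
    if gradient <   0: return "SN"
    if gradient <   T: return "SP"
    return "BP"

def fuzzy_line_double(b, c, LD_U, LD_L, delta=8):
    """b, c: pixels above/below the missing sample; LD_U, LD_L: upper/lower gradients."""
    u, l = label(LD_U), label(LD_L)
    same_sign  = {("BN", "BN"), ("SN", "SN"), ("SP", "SP"), ("BP", "BP")}
    opposite   = {("BN", "BP"), ("SN", "SP")}
    opposite_r = {("BP", "BN"), ("SP", "SN")}
    big_small  = {("BN", "SN"), ("BN", "SP"), ("BP", "SN"), ("BP", "SP")}
    if (u, l) in same_sign:  return (b + c) / 2           # monotonic slope: plain average
    if (u, l) in opposite:   return (b + c) / 2 + delta   # peak: compensate by +delta
    if (u, l) in opposite_r: return (b + c) / 2 - delta   # peak: compensate by -delta
    if (u, l) in big_small:  return c                     # edge on the upper side: copy c
    return b                                              # remaining cases: copy b

print(fuzzy_line_double(b=120, c=124, LD_U=-60, LD_L=5))   # edge-like case: returns c
```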
3 Simulation Results In this Section, the performance of the discussed FLD scheme is evaluated and compared with several other existing methods for video deinterlacing. This method can be divided into two processes: spatio-temporal mode selection and fuzzy rule-based
interpolation. Along with the proposed algorithm, some of the existing deinterlacing algorithms were also tested for comparison, which included spatial domain methods (Bob [9], ELA [10]), temporal domain methods (Weave [9]), and spatio-temporal domain methods (STELA [7]). Table 1 shows the PSNR and computational CPU time results of different deinterlacing methods for various sequences.

Table 1. PSNR and average CPU time results of different interpolation methods for seven CIF sequences (units: dB and seconds/frame); each cell gives PSNR / CPU time

Sequence    ELA                     Bob                     Weave                   STELA                   Proposed Method
Akiyo       37.931091 / 0.0287398   39.858176 / 0.0127073   43.785868 / 0.0113008   44.655406 / 0.0429552   44.660111 / 0.0437931
Flower      21.681033 / 0.0288252   22.190527 / 0.0152723   20.294957 / 0.0123821   22.990663 / 0.0444268   22.993162 / 0.0453186
Foreman     30.323657 / 0.0289796   30.172840 / 0.0132357   26.307918 / 0.0129634   30.449501 / 0.0452886   30.452755 / 0.0461502
Mobile      23.532802 / 0.0315081   25.511884 / 0.0137357   23.537331 / 0.0135650   27.260284 / 0.0483943   27.263200 / 0.0493073
News        31.474943 / 0.0294674   33.615252 / 0.0129227   36.471421 / 0.0116788   39.284877 / 0.0440650   39.289025 / 0.0449284
Stefan      26.391944 / 0.0325934   27.724832 / 0.0154430   21.549408 / 0.0144349   26.996759 / 0.0519430   26.999648 / 0.0529776
T. Tennis   27.408149 / 0.0290691   28.565849 / 0.0137642   27.996337 / 0.0131788   31.587789 / 0.0447439   31.591157 / 0.0456330
For a subjective performance evaluation, the 151st frame of the Mobile sequence was adopted. Fig. 5 compares the visual performance of the FLD with several major conventional methods. The Bob, ELA, Weave and STELA algorithms are considered sufficient for this comparison, since they are regarded as good representatives of the conventional methods.
Fig. 5. Subjective quality comparison of the 151st grayscale QCIF Mobile image: (a) original, (b) ELA, (c) Bob, (d) Weave, (e) STELA, (f) proposed method
4 Conclusion A new fuzzy rule-based deinterlacing algorithm was proposed. The proposed FLD method consists of a spatio-temporal mode selection part and a correlation-dependent interpolation part. In the spatial mode, once the edge direction is determined, fuzzy rule-based edge-sensitive interpolation is used in order to accurately reconstruct the boundaries of edges and peaks. The proposed algorithm can be widely used for deinterlacing since it can be easily implemented in hardware and provides good visual performance. Detection and interpolation results were presented. Experimental results of computer simulations show that the proposed method is able to outperform a number of methods from the literature in objective and subjective quality within a feasible amount of CPU time. The main advantage of this method is that it reduces the computational CPU time while preserving edge details. Acknowledgment. This work was sponsored by the ETRI SoC Industry Promotion Center, Human Resource Development Project for IT SoC Architect.
References 1. Fan, Y.-C., Lin, H.-S., Tsao, H.-W., Kuo, C.-C.: Intelligent intra-field interpolation for motion compensated deinterlacing. In: Proc. ITRE 2005, vol. 3, pp. 200–203 (2005) 2. Michaud, F., Dinh, C., Lachiver, G.: Fuzzy detection of edge-direction for video line doubling. IEEE Trans. Circuits and Systems for Video Technology 7(3), 539–542 (1999) 3. Tsang, D., Bensaou, B., Lam, S.: Fuzzy-based rate control for real-time MPEG video. Fuzzy Systems 6(4), 504–516 4. Prodan, R.S.: Multidimensional digital signal processing for television scan conversion. Philips Journal of Research 41(6), 576–603 (1986) 5. Ville, D.V.D., Rogge, B., Philips, W., Lemahieu, I.: Motion adaptive deinterlacing using a fuzzy-based motion detector. In: workshop on Advanced Concepts for Intelligent Vision Systems (ACIVS), pp. 21–26 (Baden-Baden, Germany) (August 1999) 6. Ville, D.V.D., Rogge, B., Philips, W., Lemahieu, I.: Deinterlacing using fuzzy-based motion detection. In: 3rd International Conference on Knowledge-Based Intelligent Information Engineering Systems, pp. 263–267 (Adelaide, Australia) (August- September 1999) 7. Oh, H.-S., Kim, Y., Jung, Y.-Y., Morales, A.W., Ko, S.-J.: Spatio-temporal edge-based median filtering for deinterlacing. In: IEEE International Conference on Consumer Electronics, pp. 52–53. IEEE, Los Alamitos (2000) 8. Russo, F.: A FIRE filter for detail-preserving smoothing of images corrupted by mixed noise. In: IEEE International Conference on Fuzzy Systems, pp. 1051–1055. IEEE, Los Alamitos (1997) 9. Bellers, E.B., de Haan, G.: Advanced de-interlacing techniques. In: Proc. ProRisc/IEEE Workshop on Circuits, Systems and Signal Processing, Mierlo, The Netherlands, pp. 7–17. IEEE, Los Alamitos (1996) 10. Doyle, T.: Interlaced to sequential conversion for EDTV applications. In: Proc. 2nd Int. Workshop Signal Processing of HDTV, pp. 412–430 (February 1990)
Fast Adaptive Graph-Cuts Based Stereo Matching Michel Sarkis, Nikolas Dörfler, and Klaus Diepold Institute for Data Processing (LDV), Technische Universität München (TUM), Munich, Germany
[email protected],
[email protected],
[email protected]
Abstract. Stereo vision is one of the central research problems in computer vision. The most difficult and important issue in this area is the stereo matching process. One technique that performs this process is the Graph-Cuts based algorithm, which provides accurate results [1]. Nevertheless, this approach is slow due to the redundant computations it involves. In this work, an Adaptive Graph-Cuts based algorithm is implemented. The key idea is to subdivide the image into several regions using quadtrees and then define a global energy function that adapts itself to each of these subregions. Results show that the proposed algorithm is 3 times faster than the Graph-Cuts algorithm of [1] while keeping the same quality of results.
1
Introduction
Extracting depth information from stereo images is a very common research topic in computer vision. The main issue of stereo matching is to find dense correspondences between the images, from which the depth map of the scene can easily be extracted. In the last decade, a number of different algorithms for stereo matching have been developed. In [2], a very good review is presented along with a methodology to compare such algorithms. The main issue in any stereo matching algorithm is to compute some matching costs using similarity measures and then define a suitable cost function whose minimum is the desired depth map. These algorithms are divided into three groups of approaches depending on how the cost function is optimized. Local optimization approaches like the adaptive window techniques are fast but prone to problems on occlusion boundaries [3,4]. Scanline-based optimization approaches like dynamic programming produce better results on occlusion boundaries and are also fast, but the results contain a lot of inconsistencies among the scanlines [2,5,6,7]. Global optimization approaches like Graph-Cuts avoid the disadvantages of the other groups and give an optimal solution for all pixels at once [1,8]. However, these algorithms are very slow due to the high complexity of the computations involved. The Graph-Cuts based stereo matching approach in [1] minimizes a global energy function. This formulation allows the disparity function to preserve discontinuities and to be piecewise smooth, which leads to high reconstruction quality,
especially in the discontinuity regions. This algorithm generally requires minimizing a non-convex function with thousands of dimensions, which is an NP-hard problem. Hence, it requires a very high computational effort, which leads to a significant amount of processing time. In this work, an adaptive Graph-Cuts based algorithm is presented. The key idea is to subdivide the image into several regions using quadtrees [9], compute the costs adaptively for each subregion, and then minimize a global energy function which is adapted to each subregion. Results show that the proposed algorithm has a faster convergence rate than the other Graph-Cuts algorithms due to the adaptivity of the cost function. This leads to a factor-of-three speed-up of the depth map computation process. Section 2 briefly presents the Graph-Cuts algorithm of [1]. Section 3 describes the proposed Adaptive Graph-Cuts stereo matching algorithm. Section 4 shows an analysis and comparison of the proposed technique with that of [1]. Finally, conclusions are drawn in Section 5.
2
The Graph-Cuts Algorithm
Computing the depth map using Graph-Cuts is equivalent to finding the optimal labeling function f for every pixel p in the set of all pixels P. f labels every pixel p ∈ P with a label from a discrete set of labels L. Each label l ∈ L corresponds to a certain depth value. Therefore, a pair ⟨p, l⟩ corresponds to a single 3D point in space. The matching between two pixels in the left and right image is formulated in terms of interactions between the pixels at the same depth label l: the corresponding 3D points in an interaction must lie on the same depth label, i.e. if ⟨p1, l1⟩, ⟨p2, l2⟩ ∈ I then l1 = l2. An interaction is said to be active if it is visible in both pixels p and q. The energy function used to compute the optimal f is defined as:

E(f) = E_data(f) + E_smoothness(f) + E_vis(f).   (1)
As can be noticed, the cost function is composed of three different parts. The photoconsistency term E_data forces the interacting points to have a similar intensity value. The smoothness term E_smoothness ensures that f is piecewise continuous; it implies that neighboring pixels normally have the same disparity, except for discontinuities at object borders. The visibility term E_vis forces the visibility constraint to be taken into account [1]. The algorithm uses the α-expansion technique to get from a configuration f to a new configuration f′. In f′, each pixel either gets relabeled with a new disparity label or keeps its old disparity label. An important issue is that the initial configuration has to conform with the disparity search constraints. In order to find the α-expansion, a graph must be constructed. This graph has two distinguished terminal nodes called the source and the sink. Every other node in this graph is connected to these two terminals and to its neighbors by weighted edges. Finding the α-expansion move of a disparity level can then be reduced to the problem of finding the minimal cut in this graph.
This is also equivalent to finding the maximal flow from the source to the sink. For this task, several algorithms can be used, e.g. [10,11]. This process is then repeated until each pixel is labeled with its optimal disparity label.
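The overall iteration can be sketched as follows. This is not the implementation of [1]: the exact expansion move, which is computed there with a min-cut/max-flow solver, is replaced by a per-pixel greedy test so that the fragment stays short and runnable, and the constants Kd and Ks are arbitrary.

```python
# Schematic alpha-expansion outer loop with a greedy stand-in for the min-cut step.
import numpy as np

def energy(f, left, right, Kd=100.0, Ks=2.0):
    """Data term of Eq. (2) plus a truncated-linear smoothness term for a disparity map f."""
    h, w = f.shape
    x = np.arange(w)
    shifted = right[np.arange(h)[:, None], np.clip(x - f, 0, w - 1)]
    data = np.minimum((left - shifted) ** 2 - Kd, 0).sum()
    smooth = np.minimum(np.abs(np.diff(f, axis=1)), Ks).sum() \
           + np.minimum(np.abs(np.diff(f, axis=0)), Ks).sum()
    return data + smooth

def alpha_expansion(left, right, labels, sweeps=2):
    f = np.zeros(left.shape, dtype=int)
    best = energy(f, left, right)
    for _ in range(sweeps):
        for alpha in labels:                    # one expansion move per disparity label
            for (y, x) in np.ndindex(*f.shape): # greedy replacement for the exact graph cut
                old = f[y, x]
                if old == alpha:
                    continue
                f[y, x] = alpha
                e = energy(f, left, right)
                if e < best:
                    best = e                    # keep the relabeling to alpha
                else:
                    f[y, x] = old               # undo it
    return f

rng = np.random.default_rng(1)
L = rng.integers(0, 255, (8, 8)).astype(float)
R = np.roll(L, -2, axis=1)                      # ground-truth disparity of 2 (up to the wrapped border)
print(alpha_expansion(L, R, labels=range(4)))
```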
3 Adaptive Graph-Cuts Algorithm
3.1 The Adaptive Cost Function
The cost function defined in (1) is composed of three terms that can usually be varied depending on the images that are used. The first term of (1), the photoconsistency or data cost, is written as

E_data = min((V_p(x, y) − V_q(x + l, y))^2 − K_d, 0),   (2)

where V_p is the intensity value of the pixel p, V_q is the intensity value of the candidate pixel q and l is the label or disparity value that is tested for the candidate pixel q [1]. This equation takes the squared difference between the left and right pixel intensities minus a predefined regularization constant K_d that will later be determined automatically for each region in the image; the minimum of this quantity and 0 is taken so that the data cost is never positive. The smoothness term in its turn is defined as

E_smoothness = min(|l_p − l_{p+1}|, K_s),   (3)

where l_p is the label (disparity value) of the pixel p, l_{p+1} is that of its neighbor pixel and K_s is the smoothness regularization constant. The last term of (1) defines the visibility of a pixel: it is set to zero if the pixel is visible and to infinity otherwise. In [1], the terms of the cost function are fixed for the entire image. They are usually set manually depending on the image under study. In the proposed algorithm, however, the image will be subdivided into several regions using quadtrees [9]. Thus, the regularization terms K_d and K_s will be varied depending on each subregion. Regions with low depth variation require more weight on the smoothness constraint, since it is more probable that the pixels in these regions have similar disparity values and that they are not occluded. In regions with high disparity variation, the pixels have a higher probability of being discontinuous and occluded. Therefore, it makes sense to vary the weights of the cost function depending on the region. Unlike in [1], K_d and K_s cannot be set manually since each image contains several regions. Therefore, the proposed algorithm will be conducted in a hierarchical structure. This means that the stereo images will be downsampled into several levels. At the coarsest level, the disparity map will be computed as proposed in [1]. K_d and K_s can either be set manually at the coarsest level or be computed by an algorithm that measures the variation of the pixels in the image. Then, the disparity values found at this level will be used as a guide for the next finer level. Depending on the disparity map found at the coarser level,
the stereo images will be subdivided using quadtrees. Then, for each region, i.e. each leaf of the quadtree, a statistical measuring algorithm will be employed to estimate the depth variation of the pixels and to vary the parameters of the cost function accordingly for each of these subregions. One simple criterion that measures the depth variation is the standard deviation of the disparities. Another measure that can be used is the skewness of the pixels [12]. Once the parameters of the cost function are determined, the disparity search at the next finer level will be conducted by refining the disparity values found at the coarser level. This process will be repeated until the final level is reached.
3.2 Narrowband Disparity Refinement
Disparity values in the coarse disparity map have half the precision of those in the finer map, so the upsampled disparity values at the next finer level are the ones obtained from the coarser level multiplied by 2. The values in between are interpolated using the nearest neighbor interpolation technique [13]. To find the true value of the disparity at the next finer level, only a search within a small search region, the narrowband, is necessary [14,15]. Therefore, the efficiency of the search is highly increased by the a priori knowledge of the guided optimization. To refine the disparity map, a narrowband matching volume limited by dmin(x, y) and dmax(x, y) is initialized. Suppose that d(x, y) is the disparity at position (x, y) in the upsampled disparity map. The true disparity df(x, y) can be found within a range d_refine of d(x, y). Hence, the narrowband is limited by dmax(x, y) = d(x, y) + d_refine and dmin(x, y) = d(x, y) − d_refine, where d_refine specifies the width of the interval around the value estimated from the coarser level. The value of d_refine depends on the disparity estimation error in the coarser disparity map. If this error is below 1.5 pixels, then d_refine at the finer level should be twice that, i.e. d_refine = 3. Such a search region is visualized in the example shown in Fig. 1a. Defining a search region alone is not enough, since at the coarse level a pixel represents a small neighborhood of pixels in the fine map. Consequently, if a discontinuity occurs at a position somewhere in this neighborhood, the search area should be expanded at this location to take that into account. In addition, occlusions might also occur at disparity discontinuities, and this also has to be taken into account. To overcome these problems, it is necessary to extend the initial search region, e.g. the one shown in Fig. 1a, with an erode and a dilate step.
– Erode: This step is applied to dmin(x, y). A new map d′min(x, y) is constructed where every d′min(x, y) gets the minimum value of the neighbors of dmin(x, y).
– Dilate: This step is applied to dmax(x, y). A new map d′max(x, y) is constructed where every d′max(x, y) gets the maximum value of the neighbors of dmax(x, y).
The final search region after the erode and dilate steps applied to Fig. 1a is shown in Fig. 1b. As can be noticed, the pixels at the edges now have a wider search range, which allows the problems mentioned before to be overcome. The narrowband refinement algorithm is summarized in Table 1.
Fig. 1. The narrowband disparity refinement region. a: Primary refinement region after expanding each disparity by dmax(x, y) = d(x, y) + d_refine and dmin(x, y) = d(x, y) − d_refine. b: Final refinement region after the erode and dilate step.

Table 1. Disparity Refinement Algorithm
Step 1: Upsample the coarse disparity map d_c and scale the disparity values by 2.
Step 2: Calculate dmin = 2d_c − d_refine and dmax = 2d_c + d_refine.
Step 3: Erode dmin and dilate dmax for every pixel.
Step 4: Refine the disparity value for each pixel using the resulting search region.
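A compact sketch of these four steps is given below; it is an assumed implementation (not the authors' code) that relies on standard grey-scale erosion and dilation with a 3×3 neighborhood.

```python
# Sketch of the narrowband bounds of Table 1.
import numpy as np
from scipy.ndimage import grey_erosion, grey_dilation, zoom

def narrowband_bounds(coarse_disp, d_refine=3):
    # Step 1: upsample by 2 (nearest neighbour) and scale the disparities by 2.
    d = 2 * zoom(coarse_disp, 2, order=0)
    # Step 2: initial band around each upsampled value.
    d_min, d_max = d - d_refine, d + d_refine
    # Step 3: erode the lower bound and dilate the upper bound so that the band
    # widens around depth discontinuities and possible occlusions.
    d_min = grey_erosion(d_min, size=(3, 3))
    d_max = grey_dilation(d_max, size=(3, 3))
    # Step 4: the per-pixel disparity search is then restricted to [d_min, d_max].
    return d_min, d_max

coarse = np.array([[2, 2, 5], [2, 2, 5], [2, 5, 5]])
lo, hi = narrowband_bounds(coarse)
print(lo.shape, hi.shape)   # (6, 6) (6, 6)
```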
3.3 Quadtree Subdivisions
The original Graph-Cuts algorithm is designed to provide a global solution for the whole image [1]. In order to apply the adaptive cost function defined in Section 3.1, the upsampled disparity map will be subdivided using quadtrees so that the cost function can be adapted separately in each region. Another technique that could equally be used to divide the image is the rectangular subdivision defined in [16], since it has similar properties to the quadtree. Nevertheless, it was recently shown in [15] that the quadtree subdivision is far more efficient than the rectangular subdivision. Given a matching volume box B with dimensions W × H × D, where W is the width of the disparity map, H is its height, and D is the number of disparity labels, the first step is to split this box into 4 children boxes B_j, where j ∈ {1, 2, 3, 4}. Then, for each child B_j, the upper and lower disparity bounds S^j_max and S^j_min have to be computed as

S^j_min = min dmin(x, y),   S^j_max = max dmax(x, y),   (4)

over all (x, y) ∈ B_j. From these bounds, the costs for merging and splitting each box are then computed as

C_merge(B) = w · h · (max_j S^j_max − min_j S^j_min),
C_split(B) = w · h · (S^j_max − S^j_min),   (5)

where w and h are the width and height of the box B. The first equation in (5), when minimized, reflects that the pixels have similar disparity labels and can thus be merged together. The second one in (5), when minimized, shows that the pixels have a lot of variation and hence have the potential to be split from each other. Consequently, it is possible to compute the cost function

C(B) = min(C_merge(B), C_split(B)),   (6)

upon which it is decided whether each box is split into 4 new boxes or merged with other boxes. Notice that the quadtree subdivision tries to find the optimal division, which minimizes the redundant calculations. Fig. 2 shows examples of these subdivisions for the example given in Fig. 1b. In Fig. 2a, the whole region is chosen and the search is done from S^0_min to S^0_max. In Fig. 2b, the area is split into two subregions with ranges (S^1_min, S^1_max) and (S^2_min, S^2_max). In Fig. 2c the subdivisions found in Fig. 2b are further subdivided, while Fig. 2d shows how some of these regions are merged back. Figure 3 shows two generated tree structures for the Tsukuba data set [2] superimposed on the finest and coarsest levels. Notice that regions with higher disparity differences were split into smaller regions than the ones with low disparity variation, a form which fits the narrowband refinement. For each subdivision, the proposed Adaptive Graph-Cuts algorithm is then executed up to a predefined number of iterations. Since the parameters of the cost function are adapted for each region, it might happen that the disparity function is not continuous among neighboring regions. To resolve this deficiency, the parameters of the cost function are computed taking into account all the neighboring divisions.
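The split/merge decision can be sketched recursively as follows. The split cost is computed here as the sum over the four children of their area times their disparity range, which is one plausible reading of (5); the function and variable names are illustrative, not taken from the paper's implementation.

```python
# Recursive sketch of the quadtree split/merge decision driven by Eqs. (4)-(6).
import numpy as np

def subdivide(d_min, d_max, y0, x0, h, w, depth, out):
    s_min = d_min[y0:y0 + h, x0:x0 + w].min()
    s_max = d_max[y0:y0 + h, x0:x0 + w].max()
    if depth == 0 or min(h, w) < 2:
        out.append((y0, x0, h, w, s_min, s_max))
        return
    # Children bounds (Eq. (4)) and the two competing costs (Eqs. (5)-(6)).
    kids = [(y0, x0, h // 2, w // 2), (y0, x0 + w // 2, h // 2, w - w // 2),
            (y0 + h // 2, x0, h - h // 2, w // 2), (y0 + h // 2, x0 + w // 2, h - h // 2, w - w // 2)]
    c_merge = w * h * (s_max - s_min)
    c_split = sum(hh * ww * (d_max[y:y + hh, x:x + ww].max() - d_min[y:y + hh, x:x + ww].min())
                  for y, x, hh, ww in kids)       # assumed per-child reading of Eq. (5)
    if c_split < c_merge:                         # Eq. (6): keep the cheaper configuration
        for y, x, hh, ww in kids:
            subdivide(d_min, d_max, y, x, hh, ww, depth - 1, out)
    else:
        out.append((y0, x0, h, w, s_min, s_max))

d_lo = np.zeros((8, 8), int)
d_hi = d_lo.copy()
d_hi[:, 4:] = 6                                   # the right half spans a larger disparity range
leaves = []
subdivide(d_lo, d_hi, 0, 0, 8, 8, depth=3, out=leaves)
print(len(leaves), "leaf regions")
```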
4
Results
The proposed Adaptive Graph-Cuts algorithm is compared with the original Graph-Cuts algorithm of [1]. In all the tests, the refinement parameter d_refine was set to 3 to account for a 1.5 pixel error in the coarser disparity map, the tree depth was set to 3, and a two-level hierarchical refinement was conducted. In these tests, two criteria are checked: the speed of the algorithm and its accuracy. The speed of the algorithms was measured on an AMD Duron 2 GHz machine with 768 MB RAM, while the accuracy was tested using the Middlebury stereo data set benchmark [17]. Both programs were written in the C language. Table 2 shows the timing results of the proposed algorithm and that of [1]. As can be noticed, the proposed algorithm outperforms the Graph-Cuts algorithm of [1].
Fig. 2. Dividing the narrowband signal of Fig. 1b into several regions. a: The complete region is chosen. b: The region is divided into two subregions. c: The regions found in b are further divided into more subregions. d: Some of the regions found in c are merged.
Fig. 3. The generated tree structure for the Tsukuba data set. The left image is the tree generated on the finest level. The right image is the tree generated for the coarsest image.
In obtaining the disparity maps of the Middlebury stereo benchmark, there is an average time improvement of 66%. This means that the proposed algorithm is three times faster, which is a very significant improvement. In order to justify the improvement in time, it is also necessary to examine the accuracy of the obtained results. In Table 3, the output of the Middlebury stereo benchmark is shown for both algorithms. In addition, the algorithm of [1]
is also presented with the parameters of the cost function chosen automatically. This was done to have a fair comparison, since the parameters of the cost function of the proposed Adaptive Graph-Cuts scheme cannot be chosen manually. When comparing the proposed algorithm to the automatic version of [1], where the parameters of the cost function were automatically chosen for the whole image, it can be noticed that both algorithms have almost the same performance. Combining this result with that of Table 2, it can be concluded that the proposed algorithm has the same accuracy as [1] but can compute the disparity map of the scene three times faster. The obtained disparity maps on the Middlebury stereo set are visualized in Fig. 4. Nevertheless, the results show that the ordinary Graph-Cuts algorithm with a manual choice of the cost function parameters has better performance. This is due to the fact that the proposed method to choose these parameters is not yet optimal and needs to be improved.

Table 2. Percentage improvement in the run-time between the proposed algorithm and that of [1]
59 450x375 Cones 59.27 187.49 68%
19 434x383 Venus 18.89 55.98 66%
Table 3. Evaluation results for the proposed algorithm and the algorithm of [1] using the Middlebury stereo benchmark. GC denotes the algorithm of [1], GC Auto denotes the algorithm of [1] with the parameters of the cost function set automatically, and AGC denotes the proposed algorithm. For each data set, the error percentages are given for the non-occluded (nonocc), all, and discontinuity (disc) regions.

Algorithm  Avg. Rank   Tsukuba              Venus                Teddy                Cones
                       nonocc  all   disc   nonocc  all   disc   nonocc  all   disc   nonocc  all   disc
GC         13.9        1.27    1.99  6.48   2.79    3.13  3.60   12.0    17.6  22.0   4.89    11.8  12.1
AGC        15.0        3.31    3.95  6.69   1.20    1.73  6.02   11.4    17.1  22.0   6.15    13.2  13.3
GC Auto    15.1        2.33    3.03  9.07   1.12    1.56  5.50   10.4    15.8  20.4   8.20    14.6  15.7
5 Conclusion and Future Work
In this work, an adaptive Graph-Cuts algorithm was presented that determines the disparity map of a stereo image using an adaptive cost function. The adaptivity of the algorithm was obtained by splitting the image into several regions using Quadtrees and then computing an adaptive cost function for each of these regions. Results show that this scheme is three times faster than other Graph-Cuts based stereo matching algorithms while keeping almost the
same accuracy. Looking into the future, a better function that measures the statistics of the image should be implemented in order to further improve the quality of the results while maintaining the current enhancement in speed.
Fig. 4. Output of both algorithms using the Tsukuba, Venus, Teddy and Cones image sets from the Middlebury stereo benchmark
Acknowledgement. This research is sponsored by the German Research Foundation (DFG) as a part of the SFB 453 project, High-Fidelity Telepresence and Teleaction.
References

1. Kolmogorov, V., Zabih, R.: Multi-camera scene reconstruction via graph cuts. In: European Conference on Computer Vision (2002)
2. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47(1) (2002)
3. Okutomi, M., Kanade, T.: A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(4) (1993)
4. Kanade, T., Okutomi, M.: A stereo matching algorithm with an adaptive window: Theory and experiment. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(9) (1994)
5. Criminisi, A., Shotton, J., Blake, A., Rother, C., Torr, P.: Efficient dense-stereo with occlusions and new view synthesis by four state DP for gaze correction. Technical report, Microsoft Research (2003)
6. Leung, C., Appleton, B., Sun, C.: Fast stereo matching by iterated dynamic programming and quadtree subregioning. In: British Machine Vision Conference (September 2004)
7. Kim, J.C., Lee, K.M., Choi, B.T., Lee, S.U.: A dense stereo matching using two-pass dynamic programming with generalized ground control points. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) (2005)
8. Kolmogorov, V., Zabih, R.: Computing visual correspondence with occlusions using graph cuts. In: International Conference on Computer Vision (2001)
9. Balmelli, L., Kovacevic, J., Vetterli, M.: Quadtrees for embedded surface visualization: Constraints and efficient data structures. In: IEEE International Conference on Image Processing (1999)
10. Roy, S., Cox, I.J.: A maximum-flow formulation of the n-camera stereo correspondence problem. In: International Conference on Computer Vision (1998)
11. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11) (1999)
12. Kenney, J.F., Keeping, E.S.: Mathematics of Statistics, 3rd edn. Van Nostrand (1964)
13. Seul, M., O'Gorman, L., Sammon, M.J.: Practical Algorithms for Image Analysis: Descriptions, Examples, and Code, 1st edn. Cambridge University Press, Cambridge (2000)
14. Falkenhagen, L.: Hierarchical block-based disparity estimation considering neighbourhood constraints. In: International Workshop on SNHC and 3D Imaging (September 1997)
15. Leung, C.: Efficient Methods for 3D Reconstruction from Multiple Images. PhD thesis, University of Queensland (2005)
16. Sun, C.: Fast stereo matching using rectangular subregioning and 3D maximum-surface techniques. International Journal of Computer Vision 47(1), 99–117 (2002)
17. Scharstein, D., Szeliski, R.: www.middlebury.edu/stereo
A Fast Level-Set Method for Accurate Tracking of Articulated Objects with an Edge-Based Binary Speed Term Cristina Darolti, Alfred Mertins, and Ulrich G. Hofmann Institute for Signal Processing, Univ. of Lübeck, Lübeck, 23538, Germany
Abstract. This paper presents a novel binary speed term for tracking objects with the help of active contours. The speed, which can be 0 or 1, is determined by local nonlinear filters, and not by the strength of the gradient as is common for active contours. The speed has been designed to match the nature of a recent fast level-set evolution algorithm. The resulting active contour method is used to track objects for which probability distributions of pixel intensities for the background and for the object cannot be reliably estimated.
1 Introduction
One of the necessary steps in making computers see is to teach them how to decide which object in the image is the one of interest. In many cases the object is completely defined by drawing a contour around the object area. Tracking involves keeping a lock on the correct contour as the object changes its position, shape and context in a video stream. In this paper we present a method for tracking objects using active contours. An active contour is a curve which evolves from a start configuration towards the boundaries of an object in an image whilst its motion is governed by image properties. The curve can be represented parametrically, for example as a spline curve [1,2,3], or non-parametrically [4,5]. Usually faster and more robust to clutter, parametric curves cannot easily describe articulated objects. This can however be simply achieved by non-parametric curves for which the representation of choice is the zero level set of a distance function [6]. The method presented here is intended for tracking articulated objects, thus active contours represented as level sets are the more suitable framework. The motion of the curve in this framework is governed by one of three forces. The first two are a force depending on the curvature of the boundary and a force depending on the strength of the image edge at the boundary [2,4]. A third force expressing the belief that a region along the boundary belongs to the tracked object has recently been added [7,8,9]. The region force is proportional to the joint probability of pixels in the region, assuming the probability distributions in the object and background are known. Active contours can be used in tracking by allowing the curve to move in each frame till it finds the boundary of the object in the respective frame. Like in the J. Blanc-Talon et al. (Eds.): ACIVS 2007, LNCS 4678, pp. 828–839, 2007. c Springer-Verlag Berlin Heidelberg 2007
case of single images, tracking makes use of region and/or edge information [7]. The feature distributions of background and object are both used in [10,11]. In [12], the vector field obtained by computing the optical flow between two images is used to track the contour around the moving object. It has been suggested [13] that a statistical distance measure between the probability distribution in the object region and a model distribution may be used to track the object, but since a distribution is independent of the objects area, the algorithm needs very special conditions for tracking to work. We intend to track objects for which the probability distributions of the intensities of pixels does not have an analytical form and where an approximation by a mixture of normal distributions is not practicable when considering time constraints. The assumption that the distributions are normal is also problematic when the distributions of object and background strongly overlap. Should we add to these characteristics an inhomogeneous texture, it becomes obvious that it is very difficult to reliably describe the region information. Methods which can eventually describe the complicated statistics of such an image exist, but they are computationally much too expensive to qualify for use in tracking. For this problem, we introduce a new reliable binary speed term into the active contour framework with the goal of tracking the boundaries of smooth objects with properties as described above. An additional requirement is that the object boundary is detected with high accuracy, i.e. the computed boundary should be less than two pixels away from the real boundary as picked by the human eye. The method is utilized to track hands during articulated motion. Specifically, we are interested in measuring hand movements during precision work, for example as performed during surgical operations, without using markers; the detected boundaries need to be accurate so that they can lead to precise measurements. To set the frame for our work, a short overview of active contours evolved using the fast level-set method is given in Section 2. In Section 3, we extend the wellknown active contours method with a novel binary speed term that was designed to match the nature of the fast level-set algorithm. The binary speed is based on local nonlinear filtering with the SUSAN edge detector and mean-shift filter, unlike the established image-gradient-based speeds. The results of applying the binary speed to real videos of different surgeons performing suturing are to be found in Section 4. Finally, we complete our paper with conclusions and outlook in Section 5.
2 Active Contours by Level Sets
A geodesic active contour is a curve which moves in time; at every time step, the curve is associated with an energy that depends on the curvature of the boundary and the image edge strength at the boundary, as introduced in [2]. If a new metric is defined on the scalar field of image edge magnitudes, one where distances are defined to be short when the path passes through points with large magnitudes, the curve's energy is written as [4]:

E(C(p)) = \int_0^L g(\nabla I(C(p)))\,|C'(p)|\,dp, \qquad (1)
where C(p) = (x(p), y(p)) is a two-dimensional curve, L is the length of the curve, |C'(p)|\,dp is the arc length of the curve and g(|\nabla I|) : [0, +\infty) \rightarrow \mathbb{R}^{+} is a strictly decreasing function. The curve is considered to be optimal when its energy is minimal, which is equivalent to finding a smooth curve of minimal length passing through the strongest edges. Using the energy's Lagrangian, an equation of motion is derived which describes the displacement of the curve in the direction of its Euclidean normal:

C_t = g(I)\,k\,n - (\nabla g \cdot n)\,n, \qquad (2)
where C_t denotes the curve's time derivative, I the image, k the Euclidean curvature and n the normal vector, each of these variables being computed for every point (x, y) on the curve. A framework was thus established where image features could be used to evolve a smooth curve. One can take into consideration edge features [4,2], region features [9] or both [7]. Osher and Sethian [6] have published the level-set method for numerical evolution of curves which move along their normal. In the level-set method, a d-dimensional curve, with d ∈ {2, 3}, can be embedded as the zero level set of a (d + 1)-dimensional function \varphi, knowing the initial curve C_0:

C(x(p), y(p)) = \{(x, y)\,|\,\varphi(x, y, t) = 0\}, \text{ with } \varphi(x, y, 0) = C_0. \qquad (3)
Osher and Sethian have shown that the curvature of C, its normal and the equation of motion (2) can be expressed in terms of the function \varphi. Furthermore, the equation can be generalized to the case where the force acting on a curve point has a curvature-dependent component F_k and an image-dependent component F_I. If \varphi_t is the time derivative of the function \varphi, \nabla\varphi is its gradient, and the curvature is expressed as the divergence of the normalized gradient of \varphi, a general equation of motion C_t = \alpha F_I\,n + \beta F_k\,k\,n may be written

\varphi_t = \alpha F_I\,|\nabla\varphi| - \beta F_k\,\mathrm{div}\!\left(\frac{\nabla\varphi}{|\nabla\varphi|}\right)|\nabla\varphi|, \qquad (4)

with \alpha and \beta being regularization parameters which control the influence of each term. In order to accomplish tracking with active contours, once the boundary of the object is found in a frame, the corresponding curve is used to initialize the active contour in the next frame; the position of the boundary is then updated by the active-contour type law of motion such as to best match the measurements in the new frame [3,14,15], and this is the choice we make within this study. An alternative is to learn a motion model for the moving object and to reposition the contour with its help in the new frame such that the measurements best confirm it [8,1,7].

The Fast Level-Set Implementation

Although very powerful, the numerical scheme for the level-set method is computationally intensive; a fair amount of research has been made to improve on
its speed, for example in [16,7]. The fast level-set method described in [17] is two orders of magnitude faster than its predecessors; its distinguishing feature is that the algorithm implementing the curve motion works entirely in the integer domain, and the computation of boundary curvature is simplified to integer operations whilst the computation of the normal is omitted altogether. The boundary of the object is considered to lie between pixels. Its position is specified by listing the object pixels bordering the curve in a list of interior pixels, called Lin, and by listing the background pixels bordering the curve in a list of exterior pixels, called Lout. The level-set function is piecewise constant, with values of -3 in the interior, -1 at pixels in the interior list Lin, 1 at pixels in the exterior list Lout, and 3 in the exterior. For every list pixel, the image-dependent speed F_I and curvature-dependent speed F_k from Eq. (4) are computed, but only the sign is retained. By choice, the curve's normals point outwards; if at an exterior pixel the speed is negative, the curve is pushed inward, otherwise it is left in place. To push the curve outward, the curve is moved at pixels with positive speed; the curve always advances at a speed of one pixel per iteration. To advance the curve outward at a pixel x from the list of exterior pixels, pixel x is deleted from Lout and the level set at x is set to the value for interior boundary pixels, ϕ(x) = −1. If for any of the four-connected neighbors y of pixel x, ϕ(y) = 3, then y is added to the exterior list by setting ϕ(y) = 1; this is called the switch procedure. When switching, it may also happen that one of the neighbors y now only has neighbors which belong to the interior of the curve, all having negative values in the level-set function; if this is true, y is no longer an interior boundary pixel, its corresponding level-set value is set to -3 and it is deleted from Lin. This is called the clean procedure, and together with the switch procedure it occurs in the pseudo code of the algorithm in Fig. 1. In [17] the clean procedure is executed after the list of exterior pixels has been iterated through, but this may leave a neighborhood temporarily incoherent; although cleaning at every step necessitates four extra comparisons, we choose to execute this operation to keep the lists coherent at every step. The symmetric process is used to advance the curve inward at a pixel x from the list of interior pixels. The motion stops when changes have to be made for none or only a very small percentage of the list pixels. Finally, an alternative to computing curvature is to smooth the curve by convolving the level-set function with a Gaussian kernel converted to integer. It has been shown in scale-space theory that this operation is equivalent to computing the Laplacian of an image; for an implicit function, its Laplacian is equal to its curvature. The size of the Gaussian kernel controls the amount of smoothing. The position of the curve is thus updated by evolving it first according to the image-dependent speed for a number of iterations, and subsequently evolving it according to the curvature-dependent speed for a number of iterations. It becomes obvious that this algorithm moves the curve exactly one or zero pixels per step. Thus, one need not compute the magnitude of the speeds F_I and F_k for the fast level-set implementation. The sole information needed here is binary
in nature and is equivalent to the answer to the question: is the list pixel an edge pixel or not and/or does it belong to object region or not. The equation of motion can be rethought in terms of a binary speed, as discussed in the next section.
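As an illustration of the sweep just described, the following minimal sketch (not the authors' implementation) shows the outward switch and clean steps driven by a binary speed image. The representation of Lin and Lout as Python sets, the parameter names and the assumption that the front stays at least two pixels away from the image border are choices made only for this example; since the tracking in this paper evolves the curve outward only, the inward pass is omitted.

```python
import numpy as np

NEIGH = ((1, 0), (-1, 0), (0, 1), (0, -1))   # 4-connected neighbours

def switch_in(phi, Lin, Lout, x, y):
    """Move the front outward by one pixel at (x, y), a pixel of Lout."""
    Lout.discard((x, y))
    phi[y, x] = -1
    Lin.add((x, y))
    for dx, dy in NEIGH:                      # exterior pixels now touching the front
        if phi[y + dy, x + dx] == 3:
            phi[y + dy, x + dx] = 1
            Lout.add((x + dx, y + dy))
    for dx, dy in NEIGH:                      # clean step: drop redundant Lin pixels
        nx, ny = x + dx, y + dy
        if phi[ny, nx] == -1 and all(phi[ny + b, nx + a] < 0 for a, b in NEIGH):
            phi[ny, nx] = -3
            Lin.discard((nx, ny))

def evolve_outward(phi, Lin, Lout, speed, max_iter=100):
    """Expand the contour while the binary speed is 1 at exterior pixels."""
    for _ in range(max_iter):
        grow = [(x, y) for (x, y) in Lout if speed[y, x] > 0]
        if not grow:
            break
        for x, y in grow:
            switch_in(phi, Lin, Lout, x, y)
```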
3 SUSAN Edge-Based Term
Curve evolution based on region and edge features has an additive form

C_t = \underbrace{\log\frac{p_{in}(v(C(x)))}{p_{out}(v(C(x)))}}_{\text{region term}} + \underbrace{\alpha F_I\,n + \beta F_k\,k\,n}_{\text{edge and smoothing term}}, \qquad (5)
where the edge term has already been introduced in the previous section. The variables p_{in} and p_{out} denote the probability distribution of the feature vector v on the inside, respectively on the outside, region of the object's boundaries; the new region term causes the curve to expand when p_{in} > p_{out}, otherwise causing the curve to shrink. In general, the region and edge terms are computed independently of each other. We observe that the region term needs to be computed solely at pixels located on the curve, which means that the region information for pixel x is gathered from its neighborhood only. The expansion of the function F_I reads as F_I(I(C(x))), so the same is true for the edge term. As stated in the introduction, we intend to track objects for which discriminative distributions p_{in} and p_{out} cannot be estimated in a useful time. Since this is the case, we decide to use filters by which a pixel and its neighborhood can be analyzed to concomitantly describe region and edge properties. A simple binary speed is defined to categorize the result of filtering as follows:

F_{sw} = \begin{cases} 1, & \text{if } result(filter(x)) \text{ is of type ``object''} \\ 0, & \text{otherwise.} \end{cases} \qquad (6)

Correspondingly, the energy and the equation of motion are:

E(C) = \int_{\Omega} F_{sw}\,ds + \beta \int_{C} ds, \qquad C_t = (F_{sw} + \beta k)\,n \qquad (7)
where Ω denotes the object’s interior region, β is a regularization parameter which controls the strength of the smoothing and ds is the arc length. The binary speed Fsw is chosen to be binary in order to match the nature of the fast level-set algorithm. Armed with this simple framework, we search for filters which can best characterize the boundaries of the sort of objects we wish to track. Because of problems in describing object regions, we choose an edge-based approach. Most edge-based active contours measure the edge as a function of the image gradient [4,15,7]. Thresholding gradient images to obtain binary edge images, like the one needed for the previously defined speed, bears well known problems, as will be discussed in the results section. We choose a nonlinear filter to analyze the intensities of
neighboring pixels when deciding if a pixel is an edge pixel or not. More precisely, the similarity between a pixel and every other pixel in its neighborhood N (x) is computed, and their sum
us(x) = \sum_{y \in N(x)} e^{-\left(\frac{I(x)-I(y)}{t}\right)^{6}} \qquad (8)
yields a similarity score over the neighborhood, known as USAN [18] and denoted here by us; the parameter t specifies how large the difference between pixel intensities may be before they start to be dissimilar. The larger the us value, the more similar neighboring pixels are to the center of the filter mask. On the other hand, the us values will be smallest (Smallest USAN) when half of the pixels or fewer are similar to the mask center, a situation which occurs when the pixel lies on an edge or a corner. Multiple responses around the edges are eliminated by searching for the minimum us value perpendicular to the edge direction; the direction vector d is obtained by computing the position of the center of gravity of the similarity responses within the mask. We may define the binary function in the simple motion equation (7) to be

F_{sw} = \begin{cases} 1, & \text{if } us(x) > sim \text{ and } \sum_{y \in N(x)} F_{sw}(y) = 1, \\ & \text{or } us(x) = \min\{us(y)\,|\,y \text{ is on } d\}, \\ & \text{or } \sum_{y \in N(x)} F_{sw}(y) > no, \\ 0, & \text{otherwise.} \end{cases} \qquad (9)

The threshold sim denotes the smallest us value for which it can be stated with certitude that most pixels in the filter mask are similar to the center pixel; it can normally be set at 3/4 of the largest possible us value. The function is adjusted to fill in missing edges in the neighborhood N(x) of a pixel and to stop zigzagged edges from causing a leak; to increase speed, this is done by simply setting an edge if the pixel has more than no neighbors which are edges. Also, singleton edges are deleted if there are no other edge pixels in the neighborhood. Looking at Function (9), one may notice that we have chosen to evolve the curve only outward. For most cases in object tracking it is possible to learn about the object and design an algorithm which finds a blotch in its interior. The boundary of this blotch is assumed to be the curve's initial position. In the next frame, the curve is evolved from its last known position to determine the blotch in the current frame. Tracking is achieved by expanding the blotch to the new correct boundary.
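A compact sketch of the USAN similarity score of Eq. (8) and of a simplified version of the binary speed is given below; the square window (instead of the 37-pixel circular mask), the similarity fraction of 3/4 and the omission of the edge-direction and fill-in rules of Eq. (9) are simplifications made for this example, and the pixel is assumed to lie at least radius pixels away from the image border.

```python
import numpy as np

def usan_score(img, x, y, radius=3, t=6.0):
    """USAN similarity sum of Eq. (8) over a square neighbourhood."""
    patch = img[y - radius:y + radius + 1, x - radius:x + radius + 1].astype(float)
    diff = (patch - float(img[y, x])) / t
    return float(np.sum(np.exp(-diff ** 6)))

def binary_speed(img, x, y, radius=3, t=6.0, sim_fraction=0.75):
    """Simplified binary speed: 1 where the neighbourhood is mostly
    similar to the centre pixel (no edge stops the front), 0 otherwise."""
    n_pix = (2 * radius + 1) ** 2
    return 1 if usan_score(img, x, y, radius, t) > sim_fraction * n_pix else 0
```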
3.1 Mean Shift Local Filter for the Binary Function
The USAN-based term defined in the previous section has the disadvantage of stopping at false edges if they form a smooth structure. Some may be eliminated by analyzing the probability distribution of pixel features in the neighborhood of an edge. For regions small enough, the probability distribution is well described by its mode since the number of samples is small enough. Let an image feature vector x be composed of spatial coordinates and the intensity value
of a pixel. The mode is then determined by using the mean shift procedure [19,14] on a three-dimensional variable. Consider the d-dimensional parametric Epanechnikov kernel density estimator K_E over n data points with bandwidth h = (h_{spatial}, h_{intensity}):

v = \frac{x - x_i}{h}, \qquad f(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K_E(v), \qquad K_E(x) = \begin{cases} c\,(1 - |x|^{2}), & |x| \le 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (10)
The constant c ensures that the p.d.f. integrates to 1. The mode can then be found by looking for stationary points of the estimator function. The gradient of the estimator function is proven to be proportional to the mean shift vector

m_h(x) = \frac{\sum_{i=1}^{n} x_i\, g(v)}{\sum_{i=1}^{n} g(v)}, \qquad g(x) = \begin{cases} 1, & |x| \le 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (11)

Two pixels that start the mean-shift procedure and converge to similar modes are considered to belong to the same probability distribution. In order to avoid a direct thresholding, and since a comparatively superior term for measuring the similarity between pixels has already been defined, the USAN score on the mean-shift filtered neighborhood of an edge pixel is computed. The speed for edge pixels with a similarity score larger than a minimum score, denoted as msmin, is reset to one. The new binary function is

F_{sw} = \begin{cases} 1, & \text{if } us(x) > sim \text{ and } \sum_{y \in N(x)} F_{sw}(y) = 1, \\ & \text{or } us(x) = \min\{us(y)\,|\,y \text{ is on } d\}, \\ & \text{or } \sum_{y \in N(x)} F_{sw}(y) > no, \\ & \text{or } us(m_h(x)) > msmin, \\ 0, & \text{otherwise.} \end{cases} \qquad (12)

The algorithm implementing the above speed is summarized in Fig. 1.
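The following is a minimal sketch of the mean-shift mode search of Eqs. (10)-(11), using the flat kernel g as the shadow of the Epanechnikov kernel; the stopping tolerance, the iteration limit and the organisation of the neighbourhood samples as an (n, 3) array of (x, y, intensity) features are assumptions introduced for this example only.

```python
import numpy as np

def mean_shift_mode(points, start, h, max_iter=20, eps=1e-3):
    """Iterate the mean shift of Eq. (11) until a mode is reached.
    `points` is an (n, d) array of joint spatial/intensity samples,
    `start` the feature vector of the pixel under test and `h` the
    per-dimension bandwidth (h_spatial, h_spatial, h_intensity)."""
    x = np.asarray(start, dtype=float)
    h = np.asarray(h, dtype=float)
    pts = np.asarray(points, dtype=float)
    for _ in range(max_iter):
        inside = np.sum(((pts - x) / h) ** 2, axis=1) <= 1.0   # flat kernel g
        if not inside.any():
            break
        new_x = pts[inside].mean(axis=0)                       # Eq. (11)
        if np.linalg.norm(new_x - x) < eps:
            return new_x
        x = new_x
    return x
```

Two pixels whose modes end up close to each other would then be treated as samples of the same local distribution, which is how spurious edges are filtered out before the USAN score of Eq. (12) is evaluated.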
4 Results
The binary speed based on SUSAN and mean-shift filtering in its fast level-set implementation is used to track hand motion. The main motivation is tracking the precise motions performed by surgeons during the suture procedure. The accurate contour is used to determine feature points, like the middle of the arm and wrist; these are useful in computing the position and trajectory of the hand with a stereo computer vision system. The suture motion can thus be analyzed or the surgical skill of the person can be measured. It has been mentioned, in Section 3, that a blotch in the objects interior is to be found first. In order to obtain such a blotch, an average background image is computed. The background image is subtracted from the current frame and the result is segmented with a double threshold. The binary image is processed with the fast level-set method with the initial curve at its last position in the previous frame to obtain two blotches. The size of the hands can only vary as restricted
Fig. 1. Pseudocode for the level set algorithm based on binary speed
by cameras depth of field. It may be possible that the curve does not find the real boundary in a frame. Should the curve not stop in a maximum number of iterations, it is assumed that tracking in the current frame has failed and the blotches are re-initialized in the next frame after background subtraction and segmentation. In the following, we observe some image properties of a typical frame from a recording of a suture operation; the frame in question is shown on the top left of Fig. 5. For this frame, Fig. 2 shows the histograms for the hand region and for the background region. The histograms were generated using the result of object/background segmentation, also shown in Fig. 5. It can be observed that the histograms overlap in the interval 25-60; pixels from shadowed parts of the hand and patches from the sleeves have many pixels with intensities in this interval, making this part of the image difficult to segment accurately. Because of the overlap, the result of segmenting the background subtracted image with an adaptive threshold, shown on the left in Fig. 4, is also unsatisfactory. Visually, the hands appear to have strong edges, it should be thus possible to find the boundaries of the object using this information. We have tested three well known edge detectors: the Sobel, the Canny and the Laplacian-ofGaussian methods, and their effect on the filtered frame can be observed in Fig. 3. The Sobel detector either does not find the boundaries of the upper shadowed hand parts - see the edges depicted in white - or introduces too many spurious edges on the hand surface - see the edges depicted in gray. The Canny edge detector reliably finds the correct edges, but introduces a few smooth ones on the hand surface and these in turn are smooth enough to make the active contour stop; additionally, because of the edge thinning and gap-closing step, the Canny
Fig. 2. Histogram for the background (left) and for the hands (right)
Fig. 3. Result of running edge detection on the top left frame from Fig. 5. Sobel edge detector with higher threshold – white edges – and lower threshold – gray edges – (top left). Canny edge detector (top right). Laplacian-of-Gaussian edge detector (bottom left). SUSAN edge detector(bottom right).
edge detector is slow compared to the SUSAN edge detector. The Laplacianof-Gaussian is also comparatively slow and displays both the problems of the Canny detector and of the Sobel detector. The SUSAN edge detector is computed with a 37 pixel circular mask and a value of 6 for the threshold t. It also introduces spurious edges, as it is obvious from Fig. 3(bottom right). To eliminate some of them a local mean-shift filtering is performed and analyzed with the USAN similarity measure on a 3 × 3
Fig. 4. The result of background subtraction and adaptive threshold segmentation (left) and mean shift filtering (right)
Fig. 5. Frames 1, 14, 23 and 40 from a recording showing a surgeon performing suture
neighborhood with the same threshold as the one used for the original image. To convey an impression of the effects of the filter, the result of filtering a frame with (hspatial , hintensity ) =(5,10) is shown on the right in Fig. 4. Finally, Fig. 5 show in blue the edges which remain after removal of edge pixels with the help of the mean-shift operation, for four different frames of a video. In the same image, the position of the final contour is shown in red. The hands of two different surgeons were tracked during suturing, as can be observed in Fig. 5 and 6. The algorithm implemented in C++, takes on average 0.18 seconds to process a frame on a desktop PC; the shortest processing time per
Fig. 6. Frames 4, 13, 18 and 20 from a recording showing a surgeon performing suture
Fig. 7. Selection gestures (first two) and positioning gestures (last two) in a 3D medical visualization
frame was 0.1 seconds, the largest 0.2, but it is our belief that the implementation can be improved by parallelizing the code. The method was also employed to track hand motion when navigating a 3D medical visualization. Fig. 7 shows frames from a video where the user makes selection-by-pointing and positioning gestures.
5 Conclusions and Future Work
A novel binary speed based on SUSAN similarity scores between a pixel and its neighboring pixels and on probability density mode detection by the mean shift procedure has been presented. The speed is designed to match the nature of the fast level-set implementation. The hands of surgeons performing suture have been tracked at an average of 0.18 seconds per frame. Some pieces of the tracked boundaries are not accurate according to our definition. Also, in the frames with no boundary found, the curve leaked through a very local misdetection of edges. In the future, more of the information from neighboring pixels will be integrated in the binary speed. Finally, we propose to use shape templates to cope with large pieces of misdetected boundary.
References 1. Isard, M., Blake, A.: Icondensation: Unifying low-level and high-level tracking in a stochastic framework. In: Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1407, pp. 893–908. Springer, Heidelberg (1998) 2. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988) 3. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models – their training and application. Comput. Vis. Image Underst. 61(1), 38–59 (1995)
4. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. Int. J. Comput. Vision 22(1), 61–79 (1997) 5. Paragios, N., Deriche, R.: Geodesic active contours for supervised texture segmentation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’99), vol. 2, p. 2422. IEEE Computer Society, Los Alamitos (1999) 6. Osher, S., Sethian, J.A.: Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics 79, 12–49 (1988) 7. Paragios, N., Deriche, R.: Geodesic active contours and level sets for the detection and tracking of moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(3), 266–280 (2000) 8. Ecabert, T.O.: Variational image segmentation by unifying region and boundary information. In: 16th International Conference on Pattern Recognition (2002) 9. Chan, T., Vese, L.: Active contours without edges. IEEE Trans. Image Processing 10, 266–277 (2001) 10. Mansouri, A.R.: Region tracking via level set pdes without motion computation. IEEE Trans. Pattern Anal. Machine Intell. 24(7), 947–961 (2002) 11. Yilmaz, A., Li, X., Shah, M.: Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Trans. Pattern Anal. Machine Intell. 26(11), 1531–1536 (2004) ´ Barlaud, M., Aubert, G.: Segmentation of a vector field: 12. Roy, T., Debreuve, E., dominant parameter and shape optimization. Journal of Mathematical Imaging and Vision 24(2), 259–276 (2006) 13. Freedman, D., Zhang, T.: Active contours for tracking distributions. Image Processing, IEEE Transactions on 13(4), 518–526 (2004) 14. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: IEEE Conf. Computer Vision and Pattern Recognition, IEEE Computer Society Press, Los Alamitos (2000) 15. Shi, Y., Karl, W.C.: Real-time tracking using level sets. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Washington, DC, USA, pp. 34–41. IEEE Computer Society Press, Los Alamitos (2005) 16. Sethian, J.: Level Set Methods and Fast Marching Methods. Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science. Cambridge University Press, Cambridge (1999) 17. Shi, Y., Karl, W.: A fast level set method without solving pdes. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE Computer Society Press, Los Alamitos (2005) 18. Smith, S.M., Brady, J.M.: Susan–a new approach to low level image processing. Int. J. Comput. Vision 23(1), 45–78 (1997) 19. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Analysis Machine Intell. 24, 603–619 (2002)
Real-Time Vanishing Point Estimation in Road Sequences Using Adaptive Steerable Filter Banks Marcos Nieto and Luis Salgado Grupo de Tratamiento de Imágenes - E.T.S. Ing. Telecomunicación Universidad Politécnica de Madrid - Madrid - Spain
[email protected] http://www.gti.ssr.upm.es Abstract. This paper presents an innovative road modeling strategy for video-based driver assistance systems. It is based on the real-time estimation of the vanishing point of sequences captured with forward looking cameras located near the rear view mirror of a vehicle. The vanishing point is used for many purposes in video-based driver assistance systems, such as computing linear models of the road, extraction of calibration parameters of the camera, stabilization of sequences, etc. In this work, a novel strategy for vanishing point estimation is presented. It is based on the use of an adaptive steerable filter bank which enhances lane markings according to their expected orientations. Very accurate results are obtained in the computation of the vanishing point for several type of sequences, including overtaking traffic, changing illumination conditions, paintings in the road, etc.
1 Introduction
Focusing on the field of driver assistance systems, two major objectives are road modeling and vehicle detection within in-vehicle vision systems. Usually, the road model is firstly computed to obtain a reliable environment description which afterwards allows to accurately detect vehicles. For this purpose, there are typically two main processing stages, features extraction, the module which extracts features from images, and model fitting, the module that uses those features to obtain the number of lanes, their width or curvature to compose an accurate model of the road. Most works found in literature detect, as features, the lane markings which delimite the road boundaries [1]-[3]. In that sense, the computation of the vanishing point may be used for many purposes in video-based driver assistance systems, such as computing linear models of the road, extraction of calibration parameters of the camera, stabilization of sequences, optical flow, etc. The vanishing point is the point in the image where parallel lines seems to converge. In road sequences, the perspective effect is basically only important in the direction of the optical axis of the camera, usually located in the middle of the image where the road seems to converge into a point in the horizon. J. Blanc-Talon et al. (Eds.): ACIVS 2007, LNCS 4678, pp. 840–848, 2007. c Springer-Verlag Berlin Heidelberg 2007
Many works in literature related to driver assistance systems make use of the vanishing point, usually after computing the road model [4] [5]. However, the computational load required is usually excessively high for real-time performing. In this work an innovative and efficient strategy for vanishing point detection and tracking is introduced by using steerable filter banks and linear road models. The algorithm detects and tracks this point in road sequences allowing real-time processing in a general purpose processor. Steerable filters are used considering their properties and overcoming their drawbacks. Basically, steerable filters give much better results than edge detectors like Sobel or Canny for the lane markings detection problem. However these results are only possible applying a large number of filter directions or by knowing a priori the orientation of the lane markings. In this work, this a priori information is obtained with an adaptive and appropriate selection of the expected lane markings direction through the computation of the Hough transform. The paper is organized as follows: section 2 depicts an overview of the whole system; section 3 explains the performance of steerable filter banks, while section 4 shows how to compute the linear road model that leads to the vanishing point estimation. The feedback stage is described in section 5. Results and conclusions are shown in sections 6 and 7 respectively.
2 Overview
The system is focused on obtaining the vanishing point and the road model that delimits the position of the lane markings at each image of the sequence. Fig. 1 depicts the whole system. The first module is the steerable filter bank, which extracts different edge images, I_{θi}, one per each steerable filter used. An enhanced edge image is then composed from these images, resulting in a clear identification of the lane markings. The last module takes this single edge image as input and, through a fast Hough transform and least-squares fitting, finds the best vanishing point of the image and the lane markings that delimit the road. The feedback module updates the filter bank so that the filters that will be used in the next image are those whose orientation coincides with the computed orientation of the lane markings of the road model.
Fig. 1. Block diagram of the system: the input image is processed by the adaptive steerable filter bank, the resulting edge images I_{θi} are combined into an enhanced edge image, from which the Hough transform and vanishing point estimation module produces the vanishing point and the road model; a feedback path updates the filter bank
3 Steerable Filter Bank
The use of steerable filters instead of other edge detectors for the lane markings detection problem is based on their following interesting properties [6]: 1)
steerable filters may be designed with just a basis of two fixed filters and an orientation parameter, θ. Lane markings are well modeled as straight lines with a clearly defined orientation, so that a filter tuned at that direction will maximize the response of the lane marking over the rest of the edges of the image; 2) the formulation of steerable filters is usually performed with derivatives of two-dimensional Gaussians, so that these filters are separable and may be implemented in two one-dimensional stages, reducing the computational cost of the filtering process.
3.1 Steerable Filters
Steerable filters are used in pyramidal decompositions of images for multiresolution analysis in a process similar to the discrete wavelet transform [7] [8], in designing wedge filters [9] and also for lane marker detection [10]. The steerable filters may be composed by a basis of n fixed filters derived from the two-dimensional Gaussian function G(x, y) [6], as follows:

G(x, y) = e^{-(x^2 + y^2)} \qquad (1)

The n-th derivative of a Gaussian is denoted as G_n and the rotated version of G_n is G_n^{\theta}, where θ is the rotation angle. The first derivatives of the Gaussian in the x and y directions, G_1^{0°} and G_1^{90°} respectively, are described as follows:

G_1^{0°} = \frac{\partial}{\partial x} e^{-(x^2+y^2)} = -2x\,e^{-(x^2+y^2)}, \qquad G_1^{90°} = \frac{\partial}{\partial y} e^{-(x^2+y^2)} = -2y\,e^{-(x^2+y^2)} \qquad (2)
A filter with an arbitrary orientation θ can be built by applying a linear combination of these two fixed filters:

G_1^{\theta}(x, y) = \cos(\theta) \cdot G_1^{0°}(x, y) + \sin(\theta) \cdot G_1^{90°}(x, y) \qquad (3)
Therefore, with a basis of two fixed filters it is possible to design a steerable filter with an arbitrary orientation [6].
Fig. 2. First derivative of Gaussian: masks of the basis functions of 11 × 11 pixels: (left) G_1^{0°} and (right) G_1^{90°}. Gray values mean zero intensity, while white and black are positive and negative values respectively.
The filters that have been used in this paper are those shown in (1), with a mask length of 11 × 11 pixels, which is the intermediate point between an excessively large mask, which would increase the computational load and the blurring effect, and a too short mask that can not adequately detect edges. Fig. 2 shows the bases set for the used fixed filters.
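A small sketch of how such a steerable filter could be assembled from the two basis kernels of Eq. (2) and steered by Eq. (3) is given below; the sampling of the continuous Gaussian on a grid normalised to [-1, 1] and the absence of any normalisation constant are assumptions of this example, not details taken from the paper.

```python
import numpy as np

def gaussian_basis(size=11):
    """First x- and y-derivative-of-Gaussian basis kernels of Eq. (2),
    sampled on a size x size grid (11 x 11 as in the paper)."""
    r = (size - 1) / 2.0
    y, x = np.mgrid[-r:r + 1, -r:r + 1] / r       # coordinates scaled to [-1, 1]
    g = np.exp(-(x ** 2 + y ** 2))
    return -2.0 * x * g, -2.0 * y * g             # G_1^0 and G_1^90

def steerable_filter(theta, size=11):
    """Kernel oriented at angle theta (radians), built as in Eq. (3)."""
    g0, g90 = gaussian_basis(size)
    return np.cos(theta) * g0 + np.sin(theta) * g90
```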
Fig. 3. Examples of some steerable filters with different orientation values (22.9°, 45.8°, 68.7°, 91.7°, 114.6°, 137.5° and 160.4°)
Fig. 4. (a) Original image; (b) edge image obtained at θ = 63°, I_{63°}; and (c) I_{171°}
Examples of some steerable filters, computed as in (3), are shown in Fig. 3. The complete set of filters is defined by the θ_step value, which defines the difference between two consecutive filter orientations, θ_i and θ_{i-1}, with the total number of filters within the bank, N, expressed as follows:

\theta_{step} = \theta_i - \theta_{i-1}, \qquad N = \frac{\pi}{\theta_{step}} \qquad (4)
As will be shown in the following sections, the use of steerable filters built with the first derivative of Gaussian functions offers excellent results for the purpose of detecting lane markings. Though higher-order derivatives may offer better results in terms of signal-to-noise ratio, they require more fixed filters as a basis [6], and for implementations with small masks like the one proposed here, there are no significant differences in the obtained results.
3.2 Generation of the Enhanced Edge Image
The steerable filter bank gives as output a set of edge images, I_{θi}(x, y), one for each computed orientation. Fig. 4 depicts how different orientations result in different edge images, where edges are detected only if their gradient direction is similar to the filter's orientation. Fig. 4 (b) shows I_{63°}, the edge image tuned at θ = 63°. As can be seen, only part of the real edges of the image are detected. Fig. 4 (c) shows I_{171°}, where only the right lane marking is clearly visible. The enhanced edge image is generated by assigning to each pixel the variance value, σ², of these edge images, computed as follows:

\sigma^2(x, y) = \frac{1}{N} \sum_{i=0}^{N-1} \left(I_{\theta_i}(x, y) - \mu(x, y)\right)^2 \qquad (5)
\mu(x, y) = \frac{1}{N} \sum_{i=0}^{N-1} I_{\theta_i}(x, y) \qquad (6)
where N is the number of filters computed as in (4), (x, y) is the position of the pixel and I_{θi}(x, y) is the response value to the filter oriented at θ_i. The variance value is used because lane markings are usually straight lines that show a high response to filters tuned in the direction of the lane marking, and a very low response in the orthogonal direction. Therefore, the variance value is usually very high for lane markings, while for objects with irregular shapes the responses may be very similar for all orientations, and the variance value is lower than for lane markings. The threshold is selected by analyzing the histogram of the σ²(x, y) values. The shape of the histogram depends on the sequence, but it is usually very similar to the histogram shown in Fig. 5 (d).
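A direct sketch of Eqs. (5)-(6), computing the per-pixel variance of the responses to the currently selected orientations, could look as follows; the use of scipy's convolve and the representation of the filter bank as a list of precomputed kernels are choices made for this illustration only.

```python
import numpy as np
from scipy.ndimage import convolve

def enhanced_edge_image(img, kernels):
    """Per-pixel variance (Eqs. 5-6) of the responses to a set of
    oriented filter kernels, one kernel per selected orientation."""
    responses = np.stack([convolve(img.astype(float), k) for k in kernels])
    return responses.var(axis=0)          # sigma^2(x, y)
```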
Fig. 5. Typical histogram of an enhanced edge image: (a) original image; (b) enhanced edge image; (c) thresholded edge image; (d) histogram h[i]; (e) i × h[i]
This histogram example corresponds to the edge image shown in Fig. 5 (b), where the values of σ²(x, y) have been scaled from 0 to 255. The road, the sky and the rest of the smooth areas that obtain small variance values correspond to the peaks of the histogram near zero, while the next peak of the histogram represents the significant edge pixels. The values of the histogram above this peak
will mainly represent the pixels belonging to the lane markings as well as to other elements. Therefore, to separate low-variance pixels from potential lane-marking pixels, the threshold is selected as the value corresponding to the main peak of the histogram, not considering the values closest to zero. This is done by multiplying the histogram function h[i] with i and then finding the maximum value of g[i] = i × h[i], as shown in Fig. 5 (e). The resulting image displays the segmentation of the regions that contain the pixels with higher σ² values, which are the candidates to belong to lane markings. In the example shown in Fig. 5 (c), this binary image contains pixels belonging to the lane markings and other objects like the horizon line.
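The threshold selection can be written down in a few lines; the scaling to 8 bits and the use of 256 histogram bins follow the description above, while everything else in this sketch is an illustrative choice rather than the authors' code.

```python
import numpy as np

def variance_threshold_mask(sigma2):
    """Scale sigma^2 to 0..255, build the histogram h[i] and threshold
    at the index that maximises g[i] = i * h[i]."""
    scaled = np.round(255.0 * sigma2 / sigma2.max()).astype(np.uint8)
    h, _ = np.histogram(scaled, bins=256, range=(0, 256))
    thr = int(np.argmax(np.arange(256) * h))
    return scaled > thr                    # candidate lane-marking pixels
```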
4 Vanishing Point Estimation
Once the lane markings are clearly identified in the thresholded edge image, the following step is to fit straight lines to the lane markings, so that the vanishing point, v_n, is computed as their intersection.
4.1 Line Fitting
The well known Hough Transform [11][12], which is robust against outliers while offering multiple line fitting, is used. The selection of the local maxima of the transform space is performed with the conjugate gradient method [13], initialized with the maxima of the previous image. The vanishing point is consequently obtained as the intersection point of the straight lines that characterize the lane markings. From the Hough transform each line is parameterized with an angle θ and a distance ρ as in (7): y · cos θ + x · sin θ = ρ
(7)
However, as there is not a unique intersection point, the vanishing point is selected as the solution of the overdetermined system of equations, shown in (8), built with the equations of each detected line:

\left[\,c \mid s\,\right] \cdot v = p \qquad (8)

where v = (y, x)^T, c = (\cos\theta_0, \ldots, \cos\theta_{r-1})^T, s = (\sin\theta_0, \ldots, \sin\theta_{r-1})^T, and p = (\rho_0, \ldots, \rho_{r-1})^T. This system is solved with singular value decomposition (SVD), giving the least-squares single solution v to the system.
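A short sketch of this least-squares step is shown below; numpy's lstsq (which is SVD-based) stands in for whatever solver the authors actually used, and the function and argument names are illustrative only.

```python
import numpy as np

def vanishing_point(thetas, rhos):
    """Solve [c | s] v = p of Eq. (8) in the least-squares sense.
    `thetas` and `rhos` are the Hough parameters (Eq. 7) of the fitted
    lane-marking lines; the returned v is (y, x)."""
    thetas = np.asarray(thetas, dtype=float)
    A = np.column_stack((np.cos(thetas), np.sin(thetas)))   # [c | s]
    p = np.asarray(rhos, dtype=float)
    v, *_ = np.linalg.lstsq(A, p, rcond=None)
    return v
```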
4.2 Low Pass Filtering
The vanishing point of the n-th image is stabilized through a low-pass time filter considering a window composed of the m previous vanishing points, as in (9):

\bar{v}_n = \bar{v}_{n-1} - \frac{1}{m}\left(v_{n-m} - v_{n}\right) \qquad (9)
where \bar{v}_k and v_k are, respectively, the vanishing point estimation and the computed vanishing point, as in (8), for the instant k. The temporal filtering ensures that outlier vanishing points, due to errors in the feature extraction processing module, do not affect significantly the final estimation \bar{v}_n. Fig. 6 shows the vanishing point estimation for several example images. The vertical and horizontal lines intersect at the estimated \bar{v}_n, while the road model is shown as straight lines drawn over the detected lane markings.
5 Filter Bank Updating
The use of steerable filters has yet an important problem for applications that require low computational cost or real-time conditions: any approach working with steerable filters is based on the definition of a set of filters, with N orientations , θi , that must be applied to the image. Therefore, the orientation resolution is directly related to the number of filters applied. A reduced number of them would help to reduce the computation, but at the cost of worse orientation resolution. In this work, the results of the road model computation of previous images are used to adapt the steerable filter bank to reduce the computational load for following images by reducing the number of orientations to be computed. The results for each image is a pair of lines that model each of the lane markings that delimite the current lane. As a feedback, the orientations, θlef t and θright corresponding to these lines computed for the last image are used in the following image at the steerable filter bank, as the lane markings are expected not to change their orientation from one image to the following. To achieve the great edge detection results shown, it is crucial to filter at least these feedback orientations and their orthogonals. This way, the variance value is high for lane markings while low for other real edges of the image.
6 Results
Several test sequences have been processed with this strategy, showing excellent results in the accuracy of the vanishing point computation. These sequences were recorded on different roads in Madrid (Spain) with a forward-looking camera located near the rear view mirror of a vehicle. The resolution used was CIF format (352×288 pixels), while processing at 30 fps on a 2 GHz Intel Centrino Duo processor. To perform in real time it is necessary to overcome some drawbacks of the proposed algorithms. For example, the two-dimensional convolution with each steerable filter may be carried out by separating the filter into two one-dimensional filters. The decomposition is done as follows:

G(x, y) = e^{-\frac{r^2}{2\sigma^2}} = e^{-\frac{x^2}{2\sigma^2}}\, e^{-\frac{y^2}{2\sigma^2}} = G(x) \cdot G(y) \qquad (10)
This operation mainly reduces an N × N operator, whose computational load is O(N²) per pixel, to two one-dimensional operators, equivalent to O(2N).
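The separable filtering of Eq. (10) can be sketched with two 1-D passes; the kernel radius, the value of σ and the use of scipy's convolve1d are assumptions of this example (the same idea applies to the x·G and y·G factors of the steerable basis filters).

```python
import numpy as np
from scipy.ndimage import convolve1d

def separable_gaussian(img, sigma=1.0, radius=5):
    """Two 1-D convolutions instead of one 2-D convolution (Eq. 10)."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    g /= g.sum()
    out = convolve1d(img.astype(float), g, axis=0)   # filter the columns
    return convolve1d(out, g, axis=1)                # then the rows
```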
Fig. 6. Several examples of vanishing point estimation and road model extraction
Also, the Hough transform is usually computationally expensive if not efficiently implemented. As it is a point-to-multiple-points transform, it is sensible to precompute a look-up table so that floating-point operations are replaced by memory accesses. The proposed strategy has shown great results in several challenging conditions, as shown in Fig. 6. Example (a) shows the simplest case, where the road is almost empty, while cases (b) and (c) are considerably more difficult, as there are overtaking traffic and road traffic signs that hinder the correct detection. Case (d) shows a particular situation where the illumination conditions have abruptly changed due to the shadow cast by a bridge on the road. As can be observed, in all cases the estimation of the vanishing point is very accurate, while the linear road model also accurately describes the lane markings' position and orientation.
7 Conclusions
Simple strategies, like the use of edge detectors, may be used to estimate the vanishing point and generate models of the road in road sequences for driver assistance systems. In this paper an efficient strategy has been proposed to apply, under real-time conditions, steerable filter banks to detect the vanishing point and a linear road model in a closed-loop strategy which tunes the steerable filter bank to enhance only lane markings against the other real edges of the image. Results have shown very accurate estimations of the vanishing point in several sequences and different situations, like overtaking traffic, illumination changes and the presence of road signals.
Acknowledgements This work has been partially supported by the European Commission 6th Framework Program under project IST-2004-027195 (I-WAY). This work is also supported by the Comunidad de Madrid under project P-TIC-0223-0505 (PROMULTIDIS).
References 1. McCall, J.C., Trivedi, M.M.: Video-Based Lane Estimation and Tracking for Driver Assistance: Survey, System, and Evaluation. IEEE Transactions on Intelligent Transportation Systems 7(1), 20–37 (2006)
2. Wang, Y., Teoh, E.K., Shen, D.: Lane detection and tracking using B-snakes. Image and Vision Computing 22, 269–289 (2004) 3. Liang, Y.-M., et al.: Video Stabilization for a Camcorder Mounted on a Moving Vehicle. IEEE Transactions on Vehicular Technology 53(6) (2004) 4. Klappstein, J., Stein, F., Franke, U.: Monocular Motion Detection Using Spatial Constraints in a Unified Manner. In: Intelligent Vehicles Symposium, June 13-15, Tokyo, Japan, pp. 261–267 (2006) 5. Simond, N.: Reconstruction of the road plane with an embedded stereo-rig in urban environments. In: Intelligent Vehicles Symposium, June 13-15, Tokyo, Japan, pp. 70–75 (2006) 6. Freeman, W.T., Adelson, E.H.: The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(9), 891–906 (1991) 7. Castleman, K.R., Schilze, M., Wu, Q.: Simplified Design of Steerable Pyramid Filters. In: Proceedings of IEEE ISCAS, IEEE Computer Society Press, Los Alamitos (1998) 8. Karasaridis, A., Simoncelli, E.: A filter design technique for steerable pyramid image transforms. In: Proceedings of ICASSP (1996) 9. Simoncelli, E., Farid, H.: Steerable wedge filters for local orientation analysis. IEEE Transactions on Image Processing 5(9), 1377–1382 (1996) 10. McCall, J.C., Trivedi, M.M.: An Integrated, Robust Approach to Lane Marking Detection and Lane Tracking. In: Proceedings of IEEE Intelligent Vehicles Symposium, June 14-17, 2004, pp. 533–537. IEEE, Los Alamitos (2004) 11. Schreiber, D., Alefs, B., Clabian, M.: Single camera lane detection and tracking. In: IEEE Proc. Intelligent Transportation Systems, pp. 302–307. IEEE, Los Alamitos (2005) 12. Macek, K., Williams, B., Kolski, S., Siegwart, R.: A Lane Detection Vision Module for Driver Assistance. In: IEEE/APS Proc. Conference on Mechatronics and Robotics, Germany, IEEE, Los Alamitos (2004) 13. Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical Recipes in C: The Art of Scientific Computing. Cambridge Press, Cambridge (1991)
Self-Eigenroughness Selection for Texture Recognition Using Genetic Algorithms Jing-Wein Wang Institute of Photonics and Communications National Kaohsiung University of Applied Sciences 415 Chien-Kung Road, Kaohsiung 807, Taiwan, R.O.C. Tel.: +886-7-3814526 Ext. 3350 Fax.: +886-7-38327712
[email protected]
Abstract. To test the effectiveness of Self-Eigenroughness, which is derived from performing principal component analysis (PCA) on each texture roughness individually, in texture recognition with respect to Eigenroughness, which is derived from performing PCA on all texture roughness, we present a novel fitness function with an adaptive threshold to evaluate the performance of each subset of genetically selected eigenvectors. Comparative studies suggest that the former is superior to the latter in terms of recognition accuracy and computation efficiency.
1 Introduction

PCA-based methods have been successfully used for supervised image classification [1]. While any image in the sample space can be approximated by a linear combination of the significant eigenvectors, this approach does not attempt to minimize the within-class variation since it is an unsupervised technique. Thus, the projection vectors chosen for optimal representation in the sense of mean square error may obscure the existence of the separate classes. In this paper, instead of using the common properties of classes in the training set, we use a given class's own scatter matrix to obtain its discriminative vectors, called the Self-Eigenvectors. We also give a Self-Eigenvector selection algorithm to test the effectiveness with respect to the Eigenroughness, where both an enrolled dataset and an invader dataset are used for experiments. This paper is organized as follows. The extraction of texture roughness is presented in Section 2. The Eigenroughness and Self-Eigenroughness techniques are introduced in Section 3, and the genetic eigenvector selection algorithm is proposed in Section 4. Experimental results are discussed in Section 5.
2 Texture Roughness

To describe texture, one obvious feature is energy [2]. The image of a real object surface is usually not uniform but contains variations of intensities which form certain
repeated patterns. The patterns can be the result of physical surface properties such as roughness, which often has a tactile quality and therefore exhibits various energy variations over the texture region. It is therefore natural to quantify the texture content by a roughness descriptor which provides measures of properties such as smoothness, coarseness, and regularity, and which is very useful as a distinctive preprocessing step for texture characterization. In order to extract the texture roughness from its background, a smoothing filter mask is moved from pixel to pixel over the image to guarantee the desired local edge enhancement. This continues until all pixel locations have been covered, and a new image is created to store the response of the linear mask. The local average μ and energy ε of the pixels in the 3 × 3 neighborhood defined by the mask are given by the expressions
μ(x, y) = (1/ρ) Σ_{i=−1}^{1} Σ_{j=−1}^{1} f(x + i, y + j),   (1)

ε(x, y) = (1/ρ) Σ_{i=−1}^{1} Σ_{j=−1}^{1} ( f(x + i, y + j) − μ(x, y) )²,   (2)
where ρ = 9 is a normalizing constant, f(x, y) is the input image, and ε(x, y) corresponds to the roughness image formed with energy enrichment.
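A minimal sketch of this roughness extraction, assuming a NumPy/SciPy environment; the function and variable names are illustrative and not from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def roughness(image, size=3):
    """Local mean (Eq. 1) and local energy (Eq. 2) over a size x size window."""
    f = image.astype(float)
    mu = uniform_filter(f, size=size)          # (1/rho) * sum over the neighborhood
    mu_sq = uniform_filter(f * f, size=size)   # local mean of squared intensities
    eps = mu_sq - mu * mu                      # equals (1/rho) * sum (f - mu)^2
    return mu, eps

# Example: roughness image of a random 256 x 256 texture patch
texture = np.random.randint(0, 256, (256, 256))
mu, eps = roughness(texture)
```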
3 Self-Eigenroughness PCA can be used to find the best set of projection directions in the sample space composed of roughness features that maximize the total scatter across all texture images. These projection directions are called the Eigenroughness, namely

Eigenroughness: T = {m, W_K},   (3)
where m denotes the mean vector of the N-dimensional observation vectors obtained from all two-dimensional roughness images, W_K = (w_1, …, w_K), and K << N. The vector w_k is the Eigenroughness corresponding to the k-th largest eigenvalue of the sample covariance matrix. The K principal components in W_K are the orthonormal axes onto which the retained energy under projection is maximal. Although PCA is an optimal representation criterion in the mean-square-error sense, it does not consider the classification aspect. Based on our observation that the variations between images of different textures, reflecting changes in texture identity, are almost always larger than the variations between images of the same texture, a single Eigenroughness set is not efficient for analyzing a nonlinear structure such as complicated textures with large variations of roughness. To overcome this drawback, a variant of the PCA technique proposed by Torres and Vilá [3] is adopted in this work. In what follows, an independent PCA is performed for each texture using its available images for recognition purposes. Like the traditional PCA, the independent PCA decorrelates the components and arranges them in order of decreasing significance but with a different amplitude distribution. The analysis results in a set of Eigenroughness for each texture, called Self-Eigenroughness,
Self-Eigenroughness: R_l = {m_l, W_J^l},  l = 1, 2, …, L,   (4)
where m_l denotes the mean vector of the N-dimensional observation vectors obtained from the two-dimensional roughness images of texture l, W_J^l = (w_1^l, …, w_J^l), and J < K. The vector w_j^l is the Self-Eigenvector corresponding to the j-th eigenvalue selected from the sample covariance matrix of texture l. Self-Eigenroughness is a generalization of PCA in the sense that it selects a small yet discriminative component subset for each texture, while Eigenroughness uses the leading component subset. In texture recognition, Self-Eigenroughness can be superior to Eigenroughness owing to its ability to represent the non-Gaussian statistics of texture images, and it can be used as a discriminative measure to minimize within-class variations for each individual texture class. We note that this work may involve high-dimensional data sets: even for a small image of size 32 × 32, such a representation leads to more than 1,000 Self-Eigenroughness components. Dimensionality reduction can circumvent this problem by reducing the number of eigenvectors in the data set. This also reduces the computation time, and the resulting classifiers take less space to store. However, it is difficult to determine the number of required Self-Eigenroughness components a priori. This can be solved by using a genetic algorithm (GA) [4], where we start with random Self-Eigenroughness subsets, accompanied by initial thresholds, and eigenvectors are then added or deleted for further selection. The procedure is repeated until only the best Self-Eigenroughness subset with a corresponding threshold is acquired.
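A compact sketch contrasting the global Eigenroughness basis with the per-texture Self-Eigenroughness bases, assuming the roughness images have already been vectorized into columns of data matrices; the names are illustrative.

```python
import numpy as np

def pca_basis(X, n_components):
    """Columns of X are vectorized roughness images; returns (mean, leading eigenvectors)."""
    m = X.mean(axis=1, keepdims=True)
    Xc = X - m
    # Eigenvectors of the sample covariance obtained via SVD of the centered data
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return m, U[:, :n_components]

def eigenroughness(all_samples, K):
    # One PCA over the pooled training set of all L textures: T = {m, W_K} (Eq. 3)
    return pca_basis(all_samples, K)

def self_eigenroughness(class_samples, J):
    # An independent PCA per texture class: R_l = {m_l, W_J^l} (Eq. 4)
    return [pca_basis(Xl, J) for Xl in class_samples]
```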
4 Self-Eigenroughness Selection A standard GA is used for evaluating the effectiveness of a Self-Eigenroughness subset and for searching for the best J components from the given Self-Eigenroughness set of K eigenvectors. The fitness function is critical to the performance of GAs. We adapt a Bayesian likelihood function as the fitness function to explore the importance of individual features in the optimal classification. The formula is:
ϑ = [(α1 · AAR − α2 · FRR) / (α3 · ARR − α4 · FAR)] · (1 − J/K).   (5)
The ratio J/K of the number of selected features to the total number of features acts as a priori knowledge. There are four possible outcomes in a recognition system operating in identification mode: the accurate acceptance rate (AAR), accurate rejection rate (ARR), false rejection rate (FRR), and false acceptance rate (FAR). The Bayesian likelihood fitness function is suitable for any security level of the operating environment by tuning the action factors (α1, α2, α3, and α4). Their values are empirically determined and are set to α1 = 100, α2 = 1, α3 = 1, and α4 = 100, respectively. The selection algorithm is detailed as follows: (a) For each 256 × 256-pixel texture with 256 gray levels, randomly sample 100 subimages for use in the training and classification phases. The roughness images of the
samples are obtained with Eqs. (1) and (2); these image samples are then arranged in a matrix x ∈ R^d, where d = 32 × 32, with one column per sample image. (b) Select J Self-Eigenroughness components and generate the projection vector for each training texture vector x_p^l:

y_p^l = W_J^T x_p^l,   (6)
where x_p^l, l = 1, 2, …, L and p = 1, 2, …, P, is the p-th image of the l-th texture. (c) Compute the simplified Mahalanobis distance [4] for the textures in the database by

y_q = W_J^T x_q,  1 ≤ q ≤ Q,   (7)

d_lq = ||y_q − m_l||² / v_l,   (8)

γ = arg min_l ( Σ_{q=1}^{Q} d_lq )  and  d_lq < θ,   (9)
where y_q is the projection vector of the q-th test texture vector, and d_lq is the distance of test texture image q from the l-th category. The mean m_l and the variance v_l of the Self-Eigenroughness projections of the l-th category are calculated with leave-one-out cross-validation. The category label is denoted γ, and θ stands for the recognition threshold. For the inside testing there are three types of recognition results, namely AAR, FAR, and FRR; ARR and FAR are measured for the outside testing.
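A sketch of the projection and distance test of Eqs. (6)-(9), under the simplifying assumption that each class keeps a single mean and variance of its projected training samples and that test vectors are centered on the class mean before projection; the names are illustrative.

```python
import numpy as np

def classify(x_test, bases, class_stats, theta):
    """bases[l] = (m_l, W_l); class_stats[l] = (mean, variance) of the class projections.
    Returns the label gamma of the nearest class, or None if the minimum distance
    exceeds the recognition threshold theta (rejection)."""
    dists = []
    for (m_l, W_l), (mu_l, v_l) in zip(bases, class_stats):
        y = W_l.T @ (x_test - m_l)            # project onto the class basis (Eq. 6/7)
        d = np.sum((y - mu_l) ** 2) / v_l     # simplified Mahalanobis distance (Eq. 8)
        dists.append(d)
    gamma = int(np.argmin(dists))             # nearest category (Eq. 9)
    return gamma if dists[gamma] < theta else None
```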
5 Results and Discussion
With a direct encoding scheme, the genetic representation is used to evolve potential solutions on a set of twelve images from the Brodatz album [5]: D3, D6, D9, D16, D21, D24, D34, D36, D52, D55, D68, and D78. Parameters for the designed GA were determined experimentally: population size = 20, number of generations = 500, crossover probability = 0.5. The mutation probability starts at 0.1 and is then varied as a step function of the number of iterations until it reaches 0.001. The images of invaders are acquired from both the Brodatz album and the MIT Vision Texture database [6]. This results in a total of 100 images for the outside testing, which have been globally equalized prior to use. Based on the genuine and impostor distributions, the receiver operating characteristic (ROC) curve is defined as the plot of the false rejection rate (FRR) against the false acceptance rate (FAR) for all possible system operating points and measures the overall performance of the system. Each point on the curve in Fig. 1 shows the recognition performance of the Self-Eigenroughness with GA selection approach.
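A schematic GA loop for selecting a Self-Eigenroughness subset with the fitness of Eq. (5), using the parameter values reported above. The evaluate() routine is a stand-in for the actual recognition experiment, and the operator choices (elitist selection of the fitter half, uniform crossover, bit-flip mutation) are assumptions, not details from the paper.

```python
import numpy as np

K, POP, GENS = 300, 20, 500
rng = np.random.default_rng(0)

def evaluate(mask):
    # Stand-in for the inside/outside recognition experiment;
    # would return the measured (AAR, ARR, FRR, FAR) rates for this subset.
    return 0.90, 0.95, 0.10, 0.005

def fitness(mask, a=(100, 1, 1, 100)):
    AAR, ARR, FRR, FAR = evaluate(mask)
    J = mask.sum()
    return (a[0] * AAR - a[1] * FRR) / (a[2] * ARR - a[3] * FAR) * (1 - J / K)  # Eq. (5)

pop = rng.integers(0, 2, (POP, K))                 # random eigenvector subsets
p_mut = 0.1
for g in range(GENS):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-POP // 2:]]  # keep the fitter half
    children = []
    while len(children) < POP - len(parents):
        a, b = parents[rng.integers(len(parents), size=2)]
        cross = rng.random(K) < 0.5                # uniform crossover, p = 0.5
        child = np.where(cross, a, b)
        child ^= (rng.random(K) < p_mut)           # bit-flip mutation
        children.append(child)
    pop = np.vstack([parents, children])
    p_mut = max(0.001, p_mut * 0.99)               # decaying mutation rate (step-like)
```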
[Fig. 1. ROC curve (false rejection rate vs. false acceptance rate; the numbers along the curve mark the corresponding recognition thresholds) of Self-Eigenroughness with GA selection approach]

Table 1. Recognition results based on Eigenroughness and Self-Eigenroughness

Eigenvectors                            Number   AAR %   ARR %   FRR %   FAR %   Threshold
Leading Eigenroughness                    300     85.5    78.8    12.5    24.0      168
Leading Self-Eigenroughness                50     55.4    44.1    61.0    39.6       38
Leading Self-Eigenroughness               100     78.4    77.2    25.3    19.3       80
Leading Self-Eigenroughness               200     91.2    89.1    12.3     7.6      106
Leading Self-Eigenroughness               300     87.6    84.8    16.6     8.8      130
Eigenroughness with GA selection          168     88.7    84.3    19.2     7.7      141
Self-Eigenroughness with GA selection      79     98.3    99.9     0.9     1.3       78
For a comparative study, we ran several experiments using the leading 50, 100, 200, and 300 Self-Eigenroughness eigenvectors, respectively, and singling out Eigenroughness components using the GA. Algorithms based on both types of eigenvectors have been shown to work well in texture discrimination. The recognition errors shown in Table 1 mostly decrease when the eigenvectors used are selectively removed from the full set, in both cases. This decrease is due to the fact that fewer parameters, used in place of the true values of the class-conditional probability density functions, need to be estimated from the same number of samples. The smaller the number of parameters that need to be estimated, the less severe the curse of dimensionality becomes. Meanwhile, we also notice that Self-Eigenroughness outperforms Eigenroughness under genetic eigenvector selection. This is because the holistic Eigenroughness approach statistically encodes the gray-scale correlations among all textures. Thus, any image variation due to changes of roughness results in a large change of the texture representation. On the other hand, since our scheme encodes the
texture representation separately, image variations are limited to each class only, which leads to a lower equal error rate (EER). Moreover, the Self-Eigenroughness components selected for discrimination are not restricted to the leading eigenvectors but also spread over the later components. The number of selected Self-Eigenroughness components in every 10 eigenvectors for the best fitness value is illustrated in Fig. 2.
[Fig. 2. Selected number in every 10 for the total 300 Self-Eigenroughness components (vertical axis: selected number, 0-10; horizontal axis: every 10 Self-Eigenroughness, 10-300)]
Acknowledgement The author would like to acknowledge the support received from NSC through project number NSC 95-2622-E-151-022-CC3.
References [1] Fredembach, C., Schröder, M., Süsstrunk, S.: Eigenregions for Image Classification. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1645–1649 (2004) [2] Zhang, J., Tan, T.: Brief Review of Invariant Texture Analysis Methods. Pattern Recognition 35, 735–747 (2002) [3] Torres, L., Vilá, J.: Automatic Face Recognition for Video Indexing Applications. Pattern Recognition 35, 615–625 (2002) [4] Duda, R.O., Hart, P.E., Stork, D.G. (eds.): Pattern Classification. John Wiley & Sons, Chichester (2000) [5] Brodatz, P., (ed.): Textures: A Photographic Album for Artists and Designers. Dover, NY (1966) [6] Vision Texture, http://vismod.media.mit.edu/pub/VisTex/
Analysis of Image Sequences for Defect Detection in Composite Materials T. D’Orazio 1, M. Leo 1, C. Guaragnella 2, and A. Distante 1
1 Institute of Intelligent Systems for Automation - C.N.R.
2 Department of Electrics and Electronics Engineering - Politecnico di Bari
[email protected]
Abstract. The problem of inspecting composite materials to detect internal defects arises in many industrial contexts, both for quality control on production lines and for maintenance operations during in-service inspections. The analysis of internal defects (not detectable by visual inspection) is a difficult task unless invasive techniques are applied. For this reason, in recent years there has been increasing interest in the development of low-cost non-destructive inspection techniques that can be applied during normal routine tests without damaging the materials, together with automatic analysis tools. In this paper we address the problem of developing an automatic signal processing system that analyzes the time/space variations in a sequence of thermographic images and allows the identification of internal defects in composite materials that otherwise could not be detected. First, a preprocessing technique is applied to the time/space signals to extract significant information; then an unsupervised classifier is used to extract uniform classes that characterize a range of internal defects. The experimental results demonstrate the ability of the method to recognize different regions containing several types of defects.
1 Introduction
The problem of guaranteeing reliable and efficient safety checks has received great attention in recent years in many industrial contexts: quality controls and maintenance operations have to be reliable but also have to be performed at low cost in order to meet frequent schedules. In particular, non-destructive testing and evaluation (NDT&E) techniques are necessary to detect damage in highly stressed and fatigue-loaded regions of a structure at an early stage. Some of these NDT&E techniques are based on the analysis of the transmission of different signals such as ultrasonics, acoustic emission, thermography, laser ultrasonics, X-radiography, eddy currents, shearography, and low-frequency methods [1]. Transient thermography is a very promising technique for the analysis of composite materials [2]. It is a non-contact technique, which uses the thermal gradient variation to inspect the internal properties of the investigated area. The materials are heated by an external source (lamps) and the resulting thermal transient is recorded using an infrared camera. Some research has been presented in the
literature on the use of thermography [3,4,5]. These works have demonstrated the effectiveness of thermography in detecting internal defects and show excellent results on all the investigated samples. Different qualitative approaches have been developed by many researchers to investigate the effects on thermographic images of a number of parameters such as material specimens, defect types, defect depths, size, and thickness [6,7,8]. Quantitative approaches are attractive in the analysis of thermographic images because of the diagnostic capabilities that they introduce. They involve the solution of the direct problem, that is, the computation of the expected response from known sound and defective materials, and the inverse problem, that is, the evaluation of defect characteristics from a known response. Due to the nonlinear and non-univocal nature of these mapping problems, the solution is rather complex. For this reason some attempts using neural networks have started to emerge in the last few years [9,10,11]. In this paper we address the problem of developing an automatic system for the analysis of sequences of thermographic images to help the safety inspector when elaborating a diagnosis. Starting from the observations that composite materials have different behaviors during the transient phase of a thermographic inspection and that the reflectivity time evolution of each pixel of the image differs when some regions contain inner defects, we devised a neural approach that analyzes the main characteristics of these thermal evolutions, extracts significant information, and then uses it to classify the investigated area as a defective or sound area. The main novelty of this work is the use of a time-space denoising technique before the application of an unsupervised neural classifier. Preprocessing a thermographic transient response of a composite material first requires a noise reduction stage. Space-domain filtering, often used, introduces a blurring effect that reduces the image resolution of the obtained video sequence. Because of the physics of thermal transmission through materials, neighboring pixels have time-correlated thermal evolutions, so their information content can be used to reduce the noise while limiting image blurring effects and enhancing the segmentation of the processed image. To this aim, the Singular Value Decomposition (SVD) technique is applied to the acquired video sequence: for each pixel in the image, its 8-pixel neighborhood is considered, and a matrix of 9 × N time signals - the central pixel and the neighboring ones - is extracted from the video and taken as the target data to be SVD-decomposed. Sliding windows of 3 × 3 × N (where N is the length of the time sequence) are used to extract the main components that retain the greatest information, suppressing the remaining components that contain noise. An unsupervised classifier was found to be particularly effective, since it can easily implement the nonlinear mapping from an input feature space to an output space and does not require a supervised training phase. Experimental results on a composite material containing several defects demonstrate the effectiveness and the potential capabilities of the proposed approach. The structure of this paper is as follows. In Section 2, the system overview and the experimental setup are presented. The denoising technique is described
in Section 3. In Section 4 the unsupervised classifier is presented. Experimental results are reported in Section 5.
2 System Overview
The thermographic image sequence was obtained using a thermo camera sensitive to infrared emissions. A quasi-uniform heating was used that guarantees a temperature variation of the composite material of around 20 C/sec. The composite material used in the experimental tests has an alloy core with a periodic honeycomb internal structure. It presents different kinds of defects: specifically, there is a hole with some water, four impact damages, and one knife cut. In all cases the defects or the internal damage are not detectable by visual inspection. The result of the thermographic analysis is a sequence of images in which the value of each pixel (i,j) represents the temperature variation during the heating and warming phases. In figure 1 the mono-dimensional signals extracted from the thermographic sequence at some points belonging to sound and defective areas are plotted. The points were selected from different regions belonging to different kinds of defects. From the graph it is clearly evident that a functional description of the intensity variations cannot be easily generalized and that the behaviors of points corresponding to different defective areas are not similar. However, our starting hypothesis was that neighboring pixels have related temperature variations that can be used for a denoising procedure on the original signals.
Fig. 1. Some mono-dimensional signals extracted from the temporal sequences of thermographic images
Therefore, for each point (i, j) we considered its 8 neighboring points and followed the 3 × 3 window through the thermographic sequence, as shown in figure 2. The Singular Value Decomposition is applied to this 9 × 100 matrix, and the most significant components are extracted and used to reconstruct the signal at (i, j). In this
way we obtain a good approximation of the original signal but with reduced noise. The second step of this work is the unsupervised classification of the signals to separate homogeneous regions and identify different defect areas. We opted for an unsupervised classifier since we want to be independent of knowledge of the specific behaviors of defect areas. In [14] a backpropagation neural network was trained to classify defective and sound areas. However, that approach required the a priori selection of sample points from the different classes that the classifier had to recognize. In this work we provide all the data to a Self-Organizing Map (SOM) that generates point aggregations according to signal similarity. A K-means procedure is then used to associate point aggregations with clusters. The classification procedure, based on signal similarity, is completely automatic and does not require any intervention by human operators.
[Fig. 2. The 3×3×100 window extracted from the temporal sequences of thermographic images (320×240 pixels, 100 frames) on which the SVD is applied: 9 time signals of 100 values each]
3 The SVD Preprocessing
We consider a 3 × 3 set of time signals coming from the acquisition of the thermal transient. For each of the extracted signal sets two things appear evident: the signal evolution is very slow, and the time evolutions of neighboring pixels are similar. The first step of the proposed procedure is thus focused on exploiting all the time correlation present in the signals, to reduce the unwanted noise as much as possible. Let M be the N × 9 data matrix containing in its columns the time evolution of the signals; the SVD decomposition is defined in matrix form as

M = U · S · V^H   (1)
where V^H represents the transpose of the matrix V. The matrices M and U are N × 9, while V is a 9 × 9 orthonormal matrix (the eigenvector matrix). The matrix S is diagonal and contains the moduli of the orthogonal signals in U, i.e. of the vector observation A defined as

A = M · V = U · S,   (2)

related to the squared mean values of the orthogonal signals contained in the matrix A but for a scaling factor of 1/N:

S_{i,i} = ( Σ_{j=1}^{N} A_{j,i}² )^{1/2}.   (3)
The eigenvectors in the matrix V are the coefficients of the "filters" used to select the correlated components of the signals in the matrix M. The first singular value is very high, while all the others very often correspond to the noise contribution, defined here as the uncorrelated signal components of the data matrix M. The decomposed image can be split into two sub-matrices, the signal matrix M_o and the noise matrix M_n. Such matrices can be obtained from the decomposition by recombining the original matrix using the singular value matrices S_o, obtained by setting to zero all the eigenvalues but the first in S, and S_n = S − S_o:

M_o = U · S_o · V^H   (4)

and

M_n = U · S_n · V^H.   (5)

The largest part of the energy content of the correlated information in the vector observation is contained in the first column of the matrix U, preserving in this way the time signal shape. Furthermore, as the first orthogonal vector of A in (2) is obtained as a weighted sum of highly correlated vector components, the signal-to-noise ratio increases roughly as the number of components in the observation vector. In our case, as we process the 9 pixels of a 3×3 pixel neighborhood, the maximum increase in the signal-to-noise ratio would be 9 (i.e. about 10 dB). The signals coming from the image reconstruction with the S_o matrix are all similar to each other, so only the signal corresponding to the center pixel is retained; a new video sequence is constructed by applying the same procedure to all the 3 × 3 pixel neighborhoods of the original video, using the principal component of the time evolution of the center pixel reconstructed by (4). The proposed procedure guarantees some advantages with respect to standard filtering procedures (a minimal code sketch of the per-pixel step is given after this list):
– the filtering procedure does not use heuristic filters but data-adaptive ones, able to exploit all the correlation inherently present in the thermal time evolution signals;
– the adaptivity of the denoising procedure allows blind filtering, i.e. the user does not have to deal with the defect type or characteristics, the material type, or the heating sources used;
– the blurring effects of spatial filtering procedures are reduced, since the filtering takes place in the time domain while its results enhance the space domain (image quality).
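A minimal sketch of the per-pixel time-domain denoising described above, assuming the sequence is stored as a NumPy array of shape (N, rows, cols); it keeps only the first singular component of each 9-signal neighborhood matrix and retains the reconstructed centre-pixel signal. It is written for clarity, not speed.

```python
import numpy as np

def svd_denoise(seq):
    """seq: (N, H, W) thermographic sequence; returns the denoised sequence."""
    N, H, W = seq.shape
    out = np.zeros_like(seq, dtype=float)
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            # 9 time signals of the 3x3 neighborhood, one per column: the N x 9 matrix M
            M = seq[:, i - 1:i + 2, j - 1:j + 2].reshape(N, 9).astype(float)
            U, s, Vh = np.linalg.svd(M, full_matrices=False)
            # Rank-1 reconstruction M_o = U * S_o * V^H (first singular value only, Eq. 4)
            M_o = s[0] * np.outer(U[:, 0], Vh[0])
            out[:, i, j] = M_o[:, 4]        # keep only the centre-pixel column
    return out
```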
4 The Unsupervised Classifier
The resulting 320×240×100 data structure is then fed into a two-step unsupervised clustering process: first a mapping into a bidimensional space is performed by a Self-Organizing Map neural network (SOM), and then unsupervised clustering is performed in this new data representation space by the k-means algorithm. A SOM is a winner-take-all artificial neural network that discovers similarity in the data and builds a bidimensional map of M neurons (20×20 = 400 in our experiments) organized in such a way that similar data are mapped onto the same neuron or onto neighboring nodes of the map. This leads to a spatial clustering of similar input patterns in neighboring parts of the SOM, and the clusters that appear on the map are themselves organized. This arrangement of the clusters on the map reflects the attribute relationships of the clusters in the input space. At first each neuron i of the map is assigned an n-dimensional weight vector, where n is the dimension of the input data. The training process of self-organizing maps may be described in terms of data presentation and weight vector adaptation. Each training iteration t starts with the random selection of one input vector x. This vector is presented to the self-organizing map and each neuron of the net determines its activation. The Euclidean distance between the weight vector and the instance is used to calculate a unit's activation. The unit with the lowest distance is then referred to as the winner, c. Finally, the weight vector of the winner, as well as the weight vectors of selected units in the vicinity of the winner, are adapted. For the units in the vicinity of the winner a gradual adaptation is performed: the lower the distance, the greater the updating. The updating formula for a neuron with weight vector W(t) can be expressed as

W(t + 1) = W(t) + Θ(v, t) α(t) (x(t) − W(t)),   (6)
where α(t) is a monotonically decreasing learning coefficient, x(t) is the input vector, and Θ(v, t) is the neighborhood function, depending on the lattice distance between the winner and neuron v. The weight vectors of the adapted units are moved slightly towards the input vector. The amount of weight vector movement is guided by the learning rate, which decreases over time. The number of units affected by the adaptation, as well as the strength of the adaptation depending on a unit's distance from the winner, is determined by the neighborhood function. This number of units also decreases over time, such that towards the end of the training process only the winner is adapted. The neighborhood function is unimodal, symmetric, and monotonically decreasing with increasing distance to the winner, e.g. Gaussian. The movement of the weight vectors has the consequence that the distance between the inputs and the weight vectors decreases. In other words, the weight vectors become more similar to the input. Hence, the respective unit is more likely to win at future
presentations of this input. At the end of the training process each input activates a neuron of the map, and similar inputs activate adjacent neurons on the bidimensional map. The resulting SOM map makes it possible to determine, for each input vector, the corresponding neuron in the map, but in order to perform on-line classification of unknown input instances a further step is necessary. Considering that the number of neurons in the map is usually lower than the number of input instances but greater than the number of classes, it is necessary to group (cluster) neighboring neurons. The number of groups has to equal the number of classes to be determined in the data. In this paper a K-means clustering of the neurons of the SOM map has been performed, where the number K of clusters has been set to the number of classes to be determined. K-means is an unsupervised clustering algorithm that defines k centroids, one for each cluster to be determined. These centroids should be placed carefully, because different locations cause different results. They are initially placed on a regular grid and converge with the iterations to the centroids of the data observations. The next step is to take each point of the given data set and associate it with the nearest centroid. When no point is pending, this step is completed and an initial grouping is obtained. At this point the k centroids are recalculated as the barycenters of the clusters resulting from the previous step. Once these k new centroids are available, a new binding is made between the same data set points and the nearest new centroid, generating a loop. As a result of this loop the k centroids change their location step by step until no more changes occur, in other words until the centroids no longer move. Finally, this algorithm aims at minimizing the objective function
J = Σ_{j=1}^{k} Σ_{i=1}^{n} ||x_i^{(j)} − c_j||²,   (7)

where ||x_i^{(j)} − c_j||² is a chosen distance measure between a data point x_i^{(j)} and the cluster center c_j.
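A condensed sketch of the two-step clustering (SOM mapping followed by k-means on the map prototypes), assuming the denoised pixel signals are stacked as rows of a data matrix; the decay schedules and iteration counts are illustrative choices, not the paper's.

```python
import numpy as np

def train_som(X, grid=(20, 20), iters=20000, seed=0):
    rng = np.random.default_rng(seed)
    n_units, dim = grid[0] * grid[1], X.shape[1]
    W = rng.random((n_units, dim))
    coords = np.array([(r, c) for r in range(grid[0]) for c in range(grid[1])], float)
    for t in range(iters):
        x = X[rng.integers(len(X))]                        # random input vector
        winner = np.argmin(np.sum((W - x) ** 2, axis=1))   # best-matching unit
        alpha = 0.5 * (1 - t / iters)                      # decreasing learning rate
        sigma = max(1.0, 10 * (1 - t / iters))             # shrinking neighborhood
        d2 = np.sum((coords - coords[winner]) ** 2, axis=1)
        theta = np.exp(-d2 / (2 * sigma ** 2))             # Gaussian neighborhood, Eq. (6)
        W += (theta * alpha)[:, None] * (x - W)
    return W

def kmeans(W, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    C = W[rng.choice(len(W), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((W[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([W[labels == j].mean(0) if np.any(labels == j) else C[j]
                      for j in range(k)])                  # recompute barycenters
    return labels, C
```

Each pixel is then labeled with the k-means cluster of its best-matching SOM unit.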
5 Experimental Results
In figure 3 one of the thermographic images is reported, with the defect type classification superimposed on the image. A number of experiments have been carried out. The original thermographic images have been segmented both using a neural network classifier trained with a backpropagation algorithm (referred to as BP) and using the SOM classifier described above. In the first case a number of examples of each class were manually selected from the images and provided to the neural network, which was trained to recognize the different defects from the background (see [14] for details). The results are reported in figure 4: on the left, the segmented image obtained with the SOM classifier; on the right, the result obtained with the BP classifier. Both of these segmentation images are affected by noise
Fig. 3. One of the thermographic images
Fig. 4. On the left: the segmented image with the SOM. On the right: the segmented image with the Backpropagation.
and the defect areas are not clearly visible. In the successive experiments we applied the SVD to decompose the original thermographic image sequence and used the first components to reconstruct the images. Figure 5 shows the results obtained by the SOM classifier applied to the thermographic images after reconstruction with the component S1 on the left and with S1 and S2 on the right. Figure 6 shows the results obtained after reconstruction with S1, S2, S3 on the left and with S1, S2, S3, S4, S5, S6, S7 on the right. The more components are used, the worse the results obtained. In particular, by using just the first component S1 to reconstruct the image sequence, the noise is greatly reduced and the defect areas are more clearly visible than in the subsequent experiments with more components. This observation demonstrates that the first component of the SVD decomposition actually contains the largest information content of the signal, while reconstructing the image sequence with the successive components does not increase the quality of the signal but introduces noise into the final segmentation. The SOM classifier is able to organize the data and to separate the defect areas, producing a segmentation image that is better than those obtained by the backpropagation classifier, with the great advantage of not requiring a set of examples of the different classes but only the maximum number of expected classes.
Fig. 5. On the left: the segmented image with the SOM after the reconstruction of the image with S1. On the right: the segmented image with the SOM after the reconstruction of the image with S1 and S2.
Fig. 6. On the left: the Segmented image with the SOM after the reconstruction of the image with S1, S2, S3. On the right: the Segmented image with the SOM after the reconstruction of the image with S1, S2, S3, S4, S5, S6, S7
6 Conclusions
In this paper we addressed the problem of developing an automatic signal processing system that analyzes the time/space variations in a sequence of thermographic images and allows the identification of internal defects in composite materials that otherwise could not be detected. First, a preprocessing technique based on the Singular Value Decomposition was applied to the time/space signals to extract significant information and reduce the noise; then an unsupervised classifier was used to extract uniform classes that characterize a range of internal defects. The experimental results demonstrate the ability of the method to reduce the unwanted noise by exploiting the time correlation of the signals and to recognize different regions containing several types of defects.
References 1. Huang, Y.D., Froyen, L., Wevers, M.: Quality Control and Nondestructive Test in Metal Matrix Composites. Journal of Nondestructive Evaluation 20(3), 113–132 (2001) 2. Gaussorgues, G.: Infrared Thermography. Chapman & Hall, Sydney, Australia (1994)
3. Jones, T.S.: Infrared thermographic evaluation of marine composite structures. In: SPIE vol. 2459 (1995) 4. Avdelidis, N.P., Hawtin, B.C., Almond, D.P.: Transient thermography in the assessment of defects of aircraft composites. NDT & E Int. 36, 433–439 (2003) 5. Wu, D., Busse, G.: Lock-in thermography for nondestructive evaluation of materials. Rev. Gen. Therm. 37, 693–703 (1998) 6. Sakagami, T., Kubo, S.: Applications of pulse heating thermography and lock-in thermography to quantitative nondestructive evaluations. Infrared Physics & Technology 43, 211–218 (2002) 7. Giorleo, G., Meola, C., Squillace, A.: Analysis of Defective Carbon-Epoxy by Means of Lock-in Thermography. Res. NonDestr. Eval. 241–250 (2000) 8. Inagaki, T., Ishii, T., Iwamoto, T.: On the NDT and E for the diagnosis of defects using infrared thermography. NDT & E Int. 32, 247–257 (1999) 9. Maldague, X., Largouet, Y., Couturier, J.P.: A study of defect depth using neural networks in pulsed phase thermography: modeling, noise, experiments. Rev. Gen. Therm. 37, 704–717 (1998) 10. Marin, J.Y., Tretout, H.: Advanced technology and processing tools for corrosion detection by infrared thermography. AITA - Advanced Infrared Technology and Applications, 128–133 (1999) 11. Saintey, M.B., Almond, D.P.: An artificial neural network interpreter for transient thermography image data. NDT & E Int. 30(5), 291–295 (1997) 12. Haykin, S.: Neural Networks: A Comprehensive Foundation. IEEE Press, Los Alamitos (1994) 13. Freeman, J., Skapura, D.: Neural Networks: Algorithms, Applications, and Programming Techniques. Addison-Wesley, London, UK (1991) 14. D’Orazio, T., Guaragnella, C., Leo, M., Spagnolo, P.: Defect detection in aircraft composites by using a neural approach in the analysis of thermographic images. NDT & E International 38, 664–673 (2005)
Remote Sensing Imagery and Signature Fields Reconstruction Via Aggregation of Robust Regularization with Neural Computing Yuriy Shkvarko and Ivan Villalon-Turrubiates CINVESTAV Jalisco, Avenida Científica 1145, Colonia El Bajío, 45010, Telephone (+52 33) 3770-3700, Fax (+52 33) 3770-3709, Zapopan Jalisco, México {shkvarko,villalon}@gdl.cinvestav.mx http://www.gdl.cinvestav.mx
Abstract. A robust numerical technique for high-resolution reconstructive imaging and scene analysis is developed, as required for enhanced remote sensing with large-scale sensor array radar/synthetic aperture radar. First, a problem-oriented modification of the previously proposed fused Bayesian-regularization (FBR) enhanced radar imaging method is performed to enable it to reconstruct remote sensing signatures (RSS) of interest, alleviating problem ill-posedness due to system-level and model-level uncertainties. Second, a modification of the Hopfield-type maximum entropy neural network (NN) is proposed that enables such an NN to perform numerically the robust adaptive FBR technique via efficient NN computing. Finally, we report some simulation results of hydrological RSS reconstruction from enhanced real-world environmental images, indicative of the efficiency of the developed method.
1 Introduction Modern applied theory of reconstructive image processing is now a mature and well-developed research field, presented and detailed in many works (see, for example, [1] through [16] and references therein). Although the existing theory offers a manifold of statistical and descriptive regularization techniques for reconstructive imaging in many application areas, there still remain some unresolved crucial theoretical and processing problems related to large-scale sensor array real-time reconstructive image processing. In this study, we consider the problem of enhanced remote sensing (RS) imaging and reconstruction of remote sensing signature (RSS) fields of the RS scenes with the use of array radars or synthetic aperture radars (SAR) as sensor systems. Two principal algorithmic-level and computational-level developments constitute the major innovative contributions of this study, namely: 1) Development of a robust version of the fused Bayesian-regularization (FBR) method [1], [5] for reconstruction of the power spatial spectrum pattern (SSP) of the wavefield scattered from the RS scene [7] and the related RSS, given a finite set of SAR signal recordings. Since this is in essence a nonlinear numerical inverse problem, we propose to alleviate the problem ill-posedness via robustification of the Bayesian
estimation strategy [6], [7] by performing non-adaptive approximations of the SSP and RSS reconstructive operators that incorporate nontrivial metrics considerations for designing the proper solution space, as well as different regularization constraints imposed on the solution. 2) Design of numerical techniques for efficient real-time computational implementation of the robust RS image enhancement and RSS field reconstruction algorithms that employ neural network (NN) computing. In particular, we propose to employ the general Li architecture of the Hopfield-type dynamic NN detailed in [5] and [8], but modify the specifications of the NN's parameters (i.e. the synaptic weights and bias inputs in all the NN's loops, as well as the NN's state update rule) to enable the modified NN to perform the real-time robust image enhancement and RSS field reconstruction tasks. Also, we propose a method to perform such a reconstruction with a controllable balance between the achievable spatial resolution and the admissible noise level in the resulting image/RSS map.
2 Problem Model Consider the measurement data wavefield u(y) = s(y) + n(y), modeled as a superposition of the echo signals s and additive noise n, which is assumed to be available for observation and recording within the prescribed time-space observation domain Y ∋ y, where y = (t, p)^T defines the time-space points in the observation domain Y = T×P. The model of the observation wavefield u is specified by the linear stochastic equation of observation (EO) in operator form [5]: u = Se + n; e ∈ E; u, n ∈ U; S: E → U, in the L2 Hilbert signal spaces E and U with the metric structures induced by the inner products

[e_1, e_2]_E = ∫_{F×X} e_1(f, x) e_2*(f, x) df dx,   [u_1, u_2]_U = ∫_Y u_1(y) u_2*(y) dy,   (1)
respectively. The operator model of the stochastic EO in the conventional integral form may be rewritten as [1]

u(y) = ∫_{F×X} S(y, x) e(f, x) df dx + n(y),   (2)

e(f, x) = ∫_T ε(t; x) exp(−j2πft) dt,   (3)
where ε(t; x) represents the stochastic backscattered wavefield fluctuating in time t, and the functional kernel S(y, x) of the signal formation operator (SFO) S in (2) is specified by the particular RS signal wavefield formation model employed [1]. The phasor e(f, x) in (1), (2) represents the backscattered wavefield e(f, ρ, θ) over the frequency-space observation domain F×P×Θ, in the slant range ρ ∈ P and azimuth angle θ ∈ Θ domains, respectively. When considering RS spectral analysis problems, radar engineers typically work in the frequency-space domain (f; ρ, θ)^T ∈ F×P×Θ; however, because of the one-to-one mapping [4], [9], only the spatial cross-range coordinates (ρ, θ)^T are usually associated with the RS scene [9], i.e. x = (ρ, θ)^T ∈ X = P×Θ. This is valid for any narrowband RS system model [13] and for the incoherent nature of the
backscattered wavefield e(f, x) that is naturally inherent to RS imaging experiments [4], [9], [13]. Following such model assumptions, the phasor e(f, x) in (2), (3) is taken to be an independent random variable at each frequency f and spatial coordinate x, with zero mean value and δ-form correlation function R_e(f, f′; x, x′) = <e(f, x)e*(f′, x′)> = b(f, x) δ(f − f′) δ(x − x′), which enables one to introduce the following definition of the spatial spectrum pattern (SSP):

B(x) = Aver^(2){e(x)} = ∫_F e(f, x) e*(f, x) |H(f)|² df;  x ∈ X.   (4)
Here, <⋅> represents the ensemble averaging operator, while Aver^(2) is referred to as the second-order statistical averaging operator defined by (4), and H(f) represents the given transfer function of the radar receiving channels, which we assume to be identical for all antenna array elements, with the conventional normalization |H(f)|² = 1 imposed for all frequencies f ∈ F in the radar receiver frequency integration band F. The RS imaging problem is stated as follows: to find an estimate B̂(x) of the SSP B(x) in the
environment X ∋ x by processing whatever values of measurements of the data wavefield u(y), y ∈ Y, are available. Following the RS methodology [5], any particular physical RSS of interest is to be extracted from the reconstructed RS image B̂(x) by applying the so-called signature extraction operator Λ. Hence, the particular RSS is mapped by applying Λ to the reconstructed image, i.e.

Λ̂(x) = Λ(B̂(x)).   (5)
Last, taking into account the RSS extraction model (5), we can now reformulate the signature reconstruction problem as follows: to map the particular reconstructed RSS of interest Λ̂(x) = Λ(B̂(x)) over the observation scene X ∋ x via the post-processing (5) of whatever values of the reconstructed scene image B̂(x), x ∈ X, are available. For an RS system with an arbitrary sensor configuration, the recorded data is traditionally expressed as a discrete-form version of the EO (1),

U = SE + N,   (6)
where E, N and U define the zero-mean vectors composed of the coefficients E_k, N_m, and U_m of the numerical approximations of the relevant operator-form EO (1), i.e. E represents the K-D vector composed of the coefficients {E_k = [e, g_k]_E, k = 1, …, K} of the K-D approximation e^(K)(x) = (P_E(K)e)(x) = Σ_k E_k g_k(x) of the backscattered wavefield e(x) integrated over the receiver frequency integration band F, and P_E(K) is a projector onto the K-D approximation subspace E(K) = P_E(K)E = Span{g_k} spanned by some chosen set of K basis functions {g_k(x)}. The M-by-K matrix S that approximates the SFO in (6) is now given by [3]

S_mk = [S g_k, ϕ_m]_U;  m = 1, …, M;  k = 1, …, K,   (7)
where the set of base functions {ϕ_m(y)} that span the finite-dimensional spatial observation subspace U(M) = P_U(M)U = Span{ϕ_m} defines the corresponding projector P_U(M) induced by the specified array spatial response characteristics {ϕ_m(y)} [1]. The vectors E, N and U in the EO (6) are characterized by the correlation matrices R_E = D = D(B) = diag(B), R_N, and R_U = SR_ES⁺ + R_N, respectively, where diag(B) defines a
diagonal matrix with the vector B on its principal diagonal. The superscript ⁺ defines the adjoint operator [5], which here becomes the Hermitian conjugate. The vector B is composed of the elements B_k = <E_k E_k*>, k = 1, …, K, and is referred to as the K-D vector-form approximation of the SSP. The RSS reconstruction problem is reformulated as follows: to derive an estimator for reconstructing the K-D approximation

Λ̂^(K)(x) = Λ(B̂^(K)(x)) = Λ( Σ_{k=1}^{K} B̂_k |g_k(x)|² ) = Λ(g^T(x) diag(B̂) g(x))   (8)
of the relevant RSS distribution in the environment X ∋ x via the post-processing (5) of whatever values of the reconstructed scene image B̂(x), x ∈ X, are available. The experiment design (ED) aspects of the SSP estimation problem, involving the analysis of how to choose the basis functions {g_k(x)} that span the signal representation subspace E(K) = P_E(K)E = Span{g_k} for a given observation subspace U(M) = Span{ϕ_m}, were investigated in more detail in the previous studies [7], [16]. Here, we employ the pixel-format basis [10] and the ED considerations [1] for inducing the metrics structure in the solution space defined by the inner product

||B||²_B(K) = [B, MB],   (9)
where M is referred to as the metrics inducing operator [1]. Hence, the selection of M provides additional geometrical degrees of freedom of the problem model. In this study, we incorporate the model of M that corresponds to a matrix-form approximation of the Tikhonov stabilizer of the second order that was numerically designed in [1]. Also, following [1], we incorporate projection-type a priori information requiring that the SSP vector B satisfy the linear constraint equation GB = C, i.e.

G⁻GB = B_P,   (10)
where B_P = G⁻C and G⁻ is the Moore-Penrose pseudoinverse of a given constraint operator G: B(K) → B(Q), and the constraint vector C ∈ B(Q) and the constraint subspace B(Q) (Q < K) are assumed to be given. In (10), the constraint operator G projects the portion of the unknown SSP onto the subspace where the SSP values are fixed by C.
3 Generalization of the FBR Method The estimator that produces the optimal estimate B̂ of the SSP vector via processing the M-D data recordings U, applying the FBR estimation strategy that incorporates nontrivial a priori geometrical and projection-type model information (9), (10), was developed in our previous study [1]. Such an optimal FBR estimate of the SSP is given by the nonlinear equation [1]

B̂ = B_P + PB_0 + W(B̂)(V(B̂) − Z(B̂)).   (11)
In (11), B_P is defined by (10) and B_0 represents the a priori SSP distribution, to be considered a zero-step approximation to the desired SSP B̂. In this study, we use all the notations from [1] for the definitions of the sufficient statistics (SS) vector V(B̂) = {F(B̂)UU⁺F⁺(B̂)}_diag ({⋅}_diag defines a vector composed of the principal diagonal of the embraced matrix), the solution-dependent SS formation operator
F = F(B̂) = D(B̂)(I + S⁺R_N⁻¹SD(B̂))⁻¹S⁺R_N⁻¹;   (12)
the SS shift vector Z(B̂) = {F(B̂)R_N F⁺(B̂)}_diag [1], and the composite solution-dependent smoothing-projection window operator [1]

W(B̂) = PΩ(B̂)   (13)
with the projector

P = (I − G⁻G)   (14)

and the solution-dependent regularizing window

Ω(B̂) = (diag({S⁺F⁺FS}_diag) + αD²(B̂)M(B̂))⁻¹,   (15)
in which the regularization parameter α is to be adaptively adjusted using the system calibration data (10). The generalization of the FBR estimator (11) to the case of RSS reconstruction in the K-D solution space can now be performed by applying the mapping (8) to (11) and taking into account the pixel format of the basis {g_k(x)} spanning the RSS solution space, which yields

Λ̂^(K)(x) = g^T(x) diag(Λ(B̂)) g(x) = Σ_{k=1}^{K} Λ(B̂_k) |g_k(x)|²;  x ∈ X.   (16)
Hence, in the adapted pixel-format solution space, the vector Λ̂ = Λ(B̂) composed of the pixels {Λ(B̂_k); k = 1, …, K} represents the desired pixel-format map of the high-resolution RSS reconstruction. Because of the complexity of the solution-dependent K-D operator inversions that need to be performed to compute the SS V(B̂) and the window W(B̂), the computational complexity of such a generalized optimal algorithm (16) is extremely high. Hence, (16) cannot be addressed as a practically realizable estimator of the RSS (i.e. a high-resolution RSS mapping technique realizable via polynomial-complexity computations [12]).
4 R-FBR Technique for RSS Reconstruction We propose a robustification scheme for quasi-real-time implementation of the generalized FBR estimator that drastically reduces the computational load of the RSS formation procedure (16) without substantial degradation of the SSP resolution and the overall RSS map performance. The robust version of the FBR estimator (referred to as the R-FBR method) is obtained by roughing P = I and performing the robustification of both the SS formation operator F(B̂) and the smoothing window Ω(B̂) in (11) by roughing D(B̂) ≈ D = βI, where β represents the expected a priori image gray level [1]. Thus, the robust SS formation operator
F = A⁻¹(ρ)S⁺  with  A(ρ) = S⁺S + ρ⁻¹I   (17)
becomes a regularized inverse of the SFO S with regularization parameter ρ⁻¹, the inverse of the signal-to-noise ratio (SNR) ρ = β/N_0 for the adopted white observation noise model R_N = N_0I. The robust smoothing window
W = Ω = (w_0I + M)⁻¹   (18)
is now completely defined by the matrix M that induces the metrics structure (9) in the solution space, with the scaling factor w_0 = tr{S⁺F⁺FS}/K [1]. Here, we adopt the practical constraint of high-SNR operational conditions [7], [9], ρ >> 1, in which case one can also neglect the constant bias Z = Z_0I in (11) because it does not affect the pattern of the SSP estimate. Following these practically motivated assumptions, we derive the resulting R-FBR estimator

Λ̂_RFBR(x) = g^T(x) diag(Λ(B_0 + ΩV)) g(x),   (19)
where V = {FUU⁺F⁺}_diag now represents the robust (solution-independent) SS vector. Thus, the principal computational load of the R-FBR estimator (19) is now associated with the operator inversions required to compute the solution operator (18) for the adaptively adjusted regularization parameter ρ⁻¹. Next, the simplest rough RSS estimator can be constructed as a further simplification of (19) by adopting the trivial prior model information (P = I and B_0 = 0I) and roughly approximating the SS formation operator F by the adjoint SFO, i.e. F ≈ γ_0S⁺ [1] (the normalizing constant γ_0 provides the balance of the operator norms, γ_0² = tr⁻¹{S⁺SS⁺S}tr{FSS⁺F⁺}). In this case, (19) is simplified to its rough version

Λ̂_MSF(x) = g^T(x) diag(Λ(ΩH)) g(x),   (20)
referred to as the matched spatial filtering (MSF) algorithm, where the rough SS H = γ_0²{S⁺UU⁺S}_diag is now formed by applying the adjoint operator S⁺, and the windowing of the rough SS is performed by applying the smoothing filter Ω = (w_0I + M)⁻¹ with nonnegative entries, the same one as was constructed numerically in [1].
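A numerical sketch of the robust SS formation of Eqs. (17) and (19) and of the rough MSF statistic of Eq. (20) for a discretized SFO matrix S, with illustrative sizes; the data are treated here as a matrix of snapshots, and the gamma_0 scaling of the MSF statistic is omitted. These are assumptions for the sketch, not details from the paper.

```python
import numpy as np

def robust_ss(S, U_data, rho):
    """Robust SS vector V = {F U U^+ F^+}_diag with F = (S^+ S + rho^-1 I)^-1 S^+ (Eq. 17)."""
    K = S.shape[1]
    F = np.linalg.solve(S.conj().T @ S + np.eye(K) / rho, S.conj().T)
    Y = F @ U_data                                    # regularized inverse applied to the data
    return np.sum(np.abs(Y) ** 2, axis=1)             # diagonal of Y Y^+

def msf_ss(S, U_data):
    """Rough (MSF) SS, proportional to {S^+ U U^+ S}_diag (Eq. 20)."""
    return np.sum(np.abs(S.conj().T @ U_data) ** 2, axis=1)

# Illustrative sizes: 64 data channels, 128 resolution cells, 16 snapshots
rng = np.random.default_rng(1)
S = rng.standard_normal((64, 128)) + 1j * rng.standard_normal((64, 128))
U_data = rng.standard_normal((64, 16)) + 1j * rng.standard_normal((64, 16))
V = robust_ss(S, U_data, rho=100.0)                   # high-SNR setting, rho >> 1
H = msf_ss(S, U_data)
```

The reconstructed SSP would then be obtained as B_0 + ΩV (or ΩH for the rough version) with the smoothing window Ω of Eq. (18).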
5 NN for Implementing the R-FBR Method We now propose an NN for efficient quasi-real-time computational implementation of the R-FBR method presented above. The main idea is to aggregate the robust regularization with NN-based computing to reduce the computational load of the R-FBR technique. We approach this goal by performing modifications of the multistate Hopfield-type NN originally developed in [5] and modified in [8]. Borrowing from [8], we define the Hopfield-type multistate NN as a massive interconnection of formal neurons, i.e. basic processing units. The outputs of all K neurons compose the output vector z = sgn(Qv + Θ), where Q represents the K×K matrix of the interconnection strengths of the NN, and Θ defines the K×1 bias vector of the NN [8]. The output vector z is used to update the state vector v of the network: v′′ = v′ + Δv, where Δv = ℜ(z) is the change of the state vector v computed by applying the state update rule ℜ(z), and the superscripts ′ and ′′ correspond to the state values before and after the network state update, respectively. We employ the same state update rule ℜ(z) that was designed previously in [8], which guarantees that the energy function of the overall NN
E_NN(v) = −(1/2)v^T Q v − Θ^T v   (21)
is decreased at each updating step, i.e. E_NN(v′′) ≤ E_NN(v′), until the NN reaches its stationary state, related to the state v_opt at which the minimum of the NN energy (21) is attained, i.e. E_NN(v_opt) = min_v E_NN(v). Next, we associate the NN's stationary state
with the solution to a hypothetical inverse problem (IP) of minimization of the following composite cost function:

E_IP(Y|λ) = (1/2)λ_1||U − SY||² + (1/2)λ_2||Y||².   (22)
If the regularization parameters in (22) are adjusted as λ_1 = 1, λ_2 = ρ⁻¹ and the NN's stationary state is associated with the solution to (22), then the minimization of E_IP(Y|λ) provides the robust constrained least squares estimate Ŷ = FU, which uniquely defines the desired high-resolution RSS vector Λ̂ = Λ(B_0 + ΩV) with the SS V = {ŶŶ⁺}_diag. Hence, the cumbersome operator inversions needed to compute the SS and reconstruct the RSS are now translated into the relevant problem of recurrent minimization of the energy function (21) of the NN and derivation of Ŷ = v_opt via specification of the NN's parameters as follows:
Q_ki = −λ_1 Σ_{j=1}^{K} S_jk S*_ji − λ_2δ_ki;  for all k, i = 1, …, K,   (23)

Θ_k = λ_1 Σ_{j=1}^{K} S_jk U_j;  for all k = 1, …, K,   (24)
where Q_ki and Θ_k represent the elements of the interconnection strength matrix Q and the bias vector Θ of the modified NN, respectively. Because the solution-dependent operator inversions (17) are excluded via the translation of the SS formation procedure into the relevant recurrent problem of minimization of the NN's energy function (21), the computational load of such an R-FBR technique (19) is drastically decreased in comparison with the original FBR method (16).
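A minimal sketch of the recurrent minimization: with Q and Θ built as in Eqs. (23)-(24), gradient-type state updates decrease the quadratic energy (21), and the stationary state approximates the regularized LS solution Ŷ = FU. The fixed step-size rule below is an illustrative stand-in for the state update rule of [8], not the authors' rule.

```python
import numpy as np

def nn_solve(S, U_vec, lam1=1.0, lam2=0.01, iters=500, step=None):
    """Iteratively minimizes E_NN(v) = -0.5 v^T Q v - Theta^T v (Eq. 21) with
    Q = -(lam1 S^+ S + lam2 I) and Theta = lam1 S^+ U (Eqs. 23, 24)."""
    K = S.shape[1]
    Q = -(lam1 * S.conj().T @ S + lam2 * np.eye(K))
    Theta = lam1 * S.conj().T @ U_vec
    if step is None:
        step = 1.0 / np.linalg.norm(Q, 2)   # safe step from the spectral norm of Q
    v = np.zeros(K, dtype=complex)
    for _ in range(iters):
        grad = -(Q @ v) - Theta             # gradient of the NN energy
        v = v - step * grad                 # state update that decreases E_NN
    return v   # approximates (S^+S + (lam2/lam1) I)^-1 S^+ U, i.e. Y_hat = F U
```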
6 Simulations In the simulations, we considered a SAR with partially synthesized aperture as the RS imaging system [4], [13]. The SFO was factorized along two axes in the image plane: azimuth and range. Following common, practically motivated technical considerations [4], [9], [11], we modeled a triangular shape of the SAR range ambiguity function of 3 pixels width, and a |sinc|² shape of the side-looking SAR azimuth ambiguity function (AF) for two typical scenarios of fractionally synthesized apertures: (i) an azimuth AF of 10 pixels width at the zero-crossing level associated with the first system model and (ii) an azimuth AF of 20 pixels width at the zero-crossing
level associated with the second system model, respectively. We examined the behavior and the corresponding performance quality metrics of the R-FBR estimator derived above for the SSP and the relevant 2-bit RSS [2], [15], for two different simulated scenes and the two fractional SAR models specified above. The results of the simulation experiment, indicative of the enhanced quality of SSP and RSS reconstruction with the proposed approach, are reported in Figures 1 to 4 for two different RS scenes borrowed from the real-world RS imagery of the metropolitan area of Guadalajara city, Mexico [16], [17]. Figures 1.a through 4.a show the original super-high-resolution test scenes (not observable in the simulation experiments with the partially synthesized SAR system models). Figures 1.b through 4.b present the results of SSP imaging with the conventional MSF algorithm (20). Figures 1.c through 4.c present the SSP reconstructed by applying the proposed R-FBR method (19) implemented using the modified NN computing technique developed in the previous section. The particular reconstructed RSS reported in the simulations in Figures 1.(d,e,f) through 4.(d,e,f) represent the so-called hydrological electronic maps (HEMs) [2], [15] extracted from the relevant SSP images (grouped in the corresponding upper rows of the figures) by applying the weighted order statistics (WOS) classification operator Λ(B̂(x)) detailed in [15]. Such HEMs are specified as 2-bit hydrological RSS [2], [15] that classify the areas in the reconstructed scene images B̂(x) into four classes: areas covered with water (black zones in the figures), high-humidity areas (dark-gray zones), low-humidity areas (light-gray zones), and dry areas/non-classified regions (white zones).
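The WOS classification operator itself is detailed in [15]; as a simplified stand-in, the sketch below merely quantizes a reconstructed SSP image into the four HEM classes using three assumed thresholds on the normalized intensity.

```python
import numpy as np

def hem_map(ssp_image, thresholds=(0.25, 0.5, 0.75)):
    """2-bit hydrological map: 0 = water, 1 = high humidity, 2 = low humidity, 3 = dry/unclassified."""
    b = ssp_image.astype(float)
    b = (b - b.min()) / (b.max() - b.min() + 1e-12)   # normalize to [0, 1]
    return np.digitize(b, thresholds).astype(np.uint8)
```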
a. Original super-high resolution scene
b. Low-resolution image formed with the MSF
c. SSP reconstructed with the R-FBR method
d. HEM extracted from the original scene
e. HEM extracted from the MSF image
f. HEM extracted from the R-FBR enhanced image
Fig. 1. Simulation results for the first scene: first system model
a. Original super-high resolution scene
b. Low-resolution image formed with the MSF
c. SSP reconstructed with the R-FBR method
d. HEM extracted from the original scene
e. HEM extracted from the MSF image
f. HEM extracted from the R-FBR enhanced image
Fig. 2. Simulation results for the first scene: second system model
a. Original super-high resolution scene
b. Low-resolution image formed with the MSF
c. SSP reconstructed with the R-FBR method
d. HEM extracted from the original scene
e. HEM extracted from the MSF image
f. HEM extracted from the R-FBR enhanced image
Fig. 3. Simulation results for the second scene: first system model
a. Original super-high resolution scene
b. Low-resolution image formed with the MSF
c. SSP reconstructed with the R-FBR method
d. HEM extracted from the original scene
e. HEM extracted from the MSF image
f. HEM extracted from the R-FBR enhanced image
Fig. 4. Simulation results for the second scene: second system model

Table 1. IOSNR values provided with the R-FBR method. Results are reported for different SNRs for two test scenes and two different simulated SAR systems.

              First Scene                         Second Scene
SNR [dB] μ    IOSNR: System1    IOSNR: System2    IOSNR: System1    IOSNR: System2
              SSP      HEM      SSP      HEM      SSP      HEM      SSP      HEM
    10        2.35     2.42     19.49    20.26    2.24     3.20     16.48    17.59
    15        5.15     5.56     20.42    21.83    3.34     4.32     19.45    18.63
    20        8.24     8.72     21.25    22.66    5.20     5.12     20.76    19.42
    25       12.71    13.19     21.13    22.54    9.55    10.24     21.52    21.36
The improvement in the output signal-to-noise ratio (IOSNR) quality metric [4] gained with the enhanced SSP and HEM imaging methods for the two simulated scenarios is reported in Table 1. All reported simulations were run for the same 512×512 pixel image format. The computational load of the enhanced RSS reconstruction with the R-FBR algorithm (19), applying the NN computational scheme proposed above, was decreased approximately 10^5 times in comparison with the original FBR method (16); the NN-based implementation of the R-FBR technique (19) required 0.38 seconds of overall computational time on a 2.8 GHz Pentium 4 computer with 512 MB of memory.
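For orientation, the IOSNR figures in Table 1 can be understood as the ratio of the squared reconstruction error of the conventional MSF image to that of the enhanced image, in decibels, which is the standard definition used in [4]. The short sketch below is an illustration only (it is not the authors' code, and the array names are hypothetical stand-ins for the SSP images).

import numpy as np

def iosnr_db(original, degraded, enhanced):
    # Improvement in output SNR (dB): error energy of the degraded (MSF)
    # image divided by the error energy of the enhanced (R-FBR) image.
    err_deg = np.sum((degraded.astype(float) - original.astype(float)) ** 2)
    err_enh = np.sum((enhanced.astype(float) - original.astype(float)) ** 2)
    return 10.0 * np.log10(err_deg / err_enh)

# Hypothetical 512x512 test arrays standing in for the SSP images.
rng = np.random.default_rng(0)
scene = rng.random((512, 512))
msf_image = scene + 0.3 * rng.standard_normal((512, 512))
rfbr_image = scene + 0.05 * rng.standard_normal((512, 512))
print(f"IOSNR = {iosnr_db(scene, msf_image, rfbr_image):.2f} dB")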
7 Concluding Remarks
We have developed and presented the R-FBR method for high-resolution SSP estimation and RSS mapping as required for reconstructive RS imagery. The developed R-FBR method was implemented in a quasi-real-time mode utilizing the proposed NN computational technique. The interconnection strengths and bias inputs of the designed multistate Hopfield-type NN were specified in such a way that the NN solves the aggregated inverse problem of high-resolution SSP estimation and corresponding HEM-RSS reconstruction from the available data recordings, as required to implement the overall R-FBR method. The developed technique performs a balanced aggregation of the data and prior model information to achieve enhanced image reconstruction and RSS mapping with improved spatial resolution and noise reduction. The presented simulation examples illustrate the overall imaging performance improvements gained with the proposed approach. The simulation experiment verified that the RSS extracted by applying the R-FBR reconstruction method provide more accurate physical information about the content of the RS scenes in comparison with the conventional MSF and the previously proposed descriptive regularization techniques [15], [16]. The presented study establishes a foundation for understanding the basic theoretical and computational aspects of multi-level adaptive RS image formation, enhancement and extraction of physical scene characteristics that aggregates robust regularization with NN-computing paradigms.
References 1. Shkvarko, Y.V.: Estimation of Wavefield Power Distribution in the Remotely Sensed Environment: Bayesian Maximum Entropy Approach. IEEE Transactions on Signal Processing 50, 2333–2346 (2002) 2. Henderson, F.M., Lewis, A.V.: Principles and Applications of Imaging Radar. In: Manual of Remote Sensing, 3rd edn. Wiley, New York (1998) 3. Shkvarko, Y.V.: Unifying Regularization and Bayesian Estimation Methods for Enhanced Imaging with Remotely Sensed Data. Part I – Theory. IEEE Transactions on Geoscience and Remote Sensing 42, 923–931 (2004) 4. Shkvarko, Y.V.: Unifying Regularization and Bayesian Estimation Methods for Enhanced Imaging with Remotely Sensed Data. Part II – Implementation and Performance Issues. IEEE Transactions on Geoscience and Remote Sensing 42, 932–940 (2004) 5. Li, H.D., Kallergi, M., Qian, W., Jain, V.K., Clarke, L.P.: Neural Network with Maximum Entropy Constraint for Nuclear Medicine Image Restoration. Optical Engineering. 34, 1431–1440 (1995) 6. Haykin, S.: Neural Networks: A Comprehensive Foundation. Macmillan, New York (1994) 7. Falkovich, S.E., Ponomaryov, V.I., Shkvarko, Y.V.: Optimal Reception of Space-Time Signals in Channels with Scattering. Radio I Sviaz, Moscow (1989) 8. Shkvarko, Y.V., Shmaliy, Y.S., Jaime-Rivas, R., Torres-Cisneros, M.: System Fusion in Passive Sensing using a Modified Hopfield Network. Journal of the Franklin Institute 338, 405–427 (2001) 9. Wehner, D.R.: High-Resolution Radar, 2nd edn. Artech House, Boston (1994) 10. Barrett, H.H., Myers, K.J.: Foundations of Image Science. Wiley, New York (2004)
11. Ponomaryov, V.I., Nino-de-Rivera, L.: Order Statistics, M Method in Image and Video Sequence Processing Applications. Journal on Electromagnetic Waves and Electronic Systems 8, 99–107 (2003) 12. Starck, J.L., Murtagh, F., Bijaoui, A.: Image Processing and Data Analysis: The Multiscale Approach. Cambridge University Press, Cambridge (1998) 13. Franceschetti, G., Iodice, A., Perna, S., Riccio, D.: Efficient Simulation of Airborne SAR Raw Data of Extended Scenes. IEEE Transactions on Geoscience and Remote Sensing 44, 2851–2860 (2006) 14. Erdogmus, D., Principe, J.C.: From Linear Adaptive Filtering to Nonlinear Information Processing. IEEE Signal Processing Magazine. 23, 14–33 (2006) 15. Perry, S.W., Wong, H.S., Guan, L.: Adaptive Image Processing: A Computational Intelligence Perspective. CRC Press, New York (2002) 16. Shkvarko, Y.V., Villalon-Turrubiates, I.E.: Dynamical Enhancement of the Large Scale Remote Sensing Imagery for Decision Support in Environmental Resource Management. In: Proceedings of the 18th Information Resource Management Association International Conference. Idea Group Inc. Vancouver (2007) 17. Space Imaging. In: GeoEye Inc. (2007) http://www.spaceimaging.com/quicklook
A New Technique for Global and Local Skew Correction in Binary Documents*
Michael Makridis, Nikos Nikolaou, and Nikos Papamarkos
Image Processing and Multimedia Laboratory, Department of Electrical & Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
[email protected]
Abstract. A new technique for global and local skew correction in binary documents is proposed. The proposed technique performs a connected component analysis and, for each connected component, the document's local skew angle is estimated, based on detecting a sequence of other consecutive connected components at certain directions within a specified neighborhood. A histogram of all local skew angles is constructed. If the histogram has one peak, global skew correction is performed; otherwise the document has more than one skew. For local skew correction, a page layout analysis is performed based on a boundary growth algorithm at different directions. The exact global or local skew is approximated with a least squares line fitting procedure. The accuracy of the technique has been tested on many documents of different skew, and it is compared with two other similar techniques.
1 Introduction
Skew distortion is a very common problem in document images. A reliable skew correction technique can be used on scanned documents or as a pre-processing stage before image segmentation, character recognition or page layout analysis, where any type of distortion can lead to errors. There are two types of skew correction in documents: global and local. Although many techniques have been proposed for global skew correction, it remains an interesting and challenging task, especially for documents with graphics, figures or various font sizes. In contrast, few techniques have been proposed for local skew correction, which remains a difficult task in the sense that an additional page layout analysis stage should be included in the technique for accurate document restoration. For global skew correction, there are several techniques, classified into five basic categories. These techniques include the Hough Transform (HT) [1-4], the Fourier Transform (FT) [5], projection profiles [6-11], nearest neighbor clustering [12-14] and interline
* This work was supported by the Archimedes (Kavala) project, co-funded by the European Union - European Social Fund & National Resources - EPEAEK II.
cross correlation [6, 15-16]. The major drawback of using the HT is its computational cost. Postl [5] proposed a method based on the FT. This method tracks the direction for which the density of the Fourier space is the largest, but its computational cost is also very high for large documents. Yan [15] and Gatos et al. [6] introduced methods for skew detection using cross-correlation between the text lines at a fixed distance. Yan's method, though, is computationally expensive as well as less accurate. The method of Gatos et al. is applicable only to documents with small skew angles. More recent methods [17-18] are very accurate in terms of skew precision, but the skew angle range is limited to 15°. Local skew correction differs from global skew correction in that an accurate page layout analysis algorithm is necessary to detect areas that locally have a different skew. The page layout analysis algorithm of many techniques is based on a horizontal run length smoothing algorithm. This can lead to errors when the skew angle of a document's text area exceeds approximately 15°. Some techniques, as in [19], use boundary growth methods in horizontal and vertical directions to detect text areas with different skew. The detection of the skew is based on estimating the skew of the top and bottom lines of the text lines. This presumes an accurate text line detection algorithm, which is difficult in documents with noise or with various fonts. The proposed technique deals with global and local skew detection and correction in binary document images. It is robust to noise and is not constrained by the skew angle size. A flowchart of the proposed method is given in Fig. 1. In the pre-processing procedure, a connected component analysis is performed and for each connected component (CC) a set of features is extracted. Based on these features, and bearing in mind that the document's resolution is higher than 100 dpi, some CCs can be considered as noise and are removed from the document. This makes the technique more precise and also reduces the computational cost. The resulting filtered document is I_f. In the next stage, the technique initially approximates the skew angle around each CC_i ∈ I_f with an integer value and afterwards estimates the exact angle using least squares. An iterative procedure is applied for each CC_i ∈ I_f. A set of 181 straight line segments {Ls_{i,-90°}, ..., Ls_{i,0°}, ..., Ls_{i,90°}} is constructed. The line segment Ls_{i,k}, k ∈ [-90°, 90°], that intersects the most CCs lying within a specified neighborhood (let their number be n) is selected as the dominant line segment, and k is considered as the local integer angle. The center points (centers of the bounding boxes) of the neighboring CCs that Ls_{i,k} intersects are used as input data for the construction of the least squares line. Locally, for CC_i, it is considered that the exact angle is the angle of the least squares line. In case there are more dominant line segments (with the same maximum n), then for each dominant segment an additional local integer angle and exact angle are assigned to CC_i. In the next stage, a histogram of all local integer angles of the CCs ∈ I_f is constructed. The histogram is filtered and the peaks of the histogram are detected.
Fig. 1. Flowchart of the proposed method
If only one peak is detected, it is assumed that the document has global skew distortion. Otherwise, it is assumed that the document has local skews. In case of local skew, a page layout analysis is performed by applying a boundary growth algorithm at certain directions, and homogeneous areas are detected. The resulting image is I_BG, where each CC ∈ I_BG forms an area with local skew. This area includes a set of CCs ∈ I_f. At the last stage, the integer skew estimate for each area is defined by the majority of local integer angles of its CCs. The average value of the exact skews of the CCs that have the selected local integer angle is considered as the exact local skew for the area. In case of global skew, the whole document is considered as a homogeneous area and the decision on the exact angle is taken as before.
2 Description of the Method
2.1 Pre-processing Stage
This pre-processing procedure decreases the overall computational cost and prevents the technique from examining CCs that are not characters. It is based on a set of structural features of the CCs. These features are:
• Pixel size, PS_i, represents the number of foreground pixels of CC_i.
• H_i, W_i express the height and width of the bounding box (BB) of CC_i.
• Elongation, E_i, takes values within [0..1] and is defined as
  E_i = min(H_i, W_i) / max(H_i, W_i)    (1)
• Density, D_i, which is defined as
  D_i = PS_i / (H_i · W_i)    (2)
After the extraction of the above features, the proposed technique removes CCs that do not satisfy a set of conditions. More specifically, for each CC_i, PS_i must be greater than 6 pixels and smaller than 100·CCMW, where CCMW is the mean width of the CCs. H_i should be greater than 4 pixels, while E_i and D_i should be greater than 0.08. These conditions have been chosen after many trials and were found to work well for documents with a resolution greater than 100 dpi. The excluded CCs are not examined in the next stages of the algorithm. The purpose of the pre-processing stage is to remove noisy components, to make the document as uniform as possible and to decrease the computational cost. PS_i and H_i remove small or very large components that are neither characters nor character fragments, while E_i and D_i remove border frames or long lines in the document.
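As an illustration of this filtering stage, the sketch below applies the stated thresholds (PS_i between 6 and 100·CCMW, H_i > 4, E_i > 0.08, D_i > 0.08) to the connected components of a binary document. It is a minimal reconstruction of the described rules, not the authors' implementation, and it assumes scipy's labelling for the connected component analysis.

import numpy as np
from scipy import ndimage

def filter_components(binary_doc):
    # Keep only connected components that satisfy the size, elongation
    # and density conditions of the pre-processing stage.
    labels, n = ndimage.label(binary_doc)
    slices = ndimage.find_objects(labels)
    widths = [s[1].stop - s[1].start for s in slices]
    ccmw = np.mean(widths) if widths else 0.0          # mean CC width
    kept = np.zeros_like(binary_doc, dtype=bool)
    for idx, sl in enumerate(slices, start=1):
        comp = labels[sl] == idx
        ps = comp.sum()                                # pixel size PS_i
        h, w = comp.shape                              # bounding box H_i, W_i
        e = min(h, w) / max(h, w)                      # elongation E_i
        d = ps / (h * w)                               # density D_i
        if 6 < ps < 100 * ccmw and h > 4 and e > 0.08 and d > 0.08:
            kept[sl] |= comp
    return kept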
Fig. 2. Straight line segments at different directions for the letter "B". The segments are plotted every 10° instead of every 1° to make the figure clearer.
Fig. 3. (a) Original image. (b) Dominant local directions histogram. The horizontal axis refers to all integer angles ranging from -90° to 90°; the vertical axis refers to the occurrences of dominant line segments. Here, 657 CCs have 12° as local integer angle.
2.2 Skew Angle Estimation
For each CC_i ∈ I_f, a set of 181 straight line segments {Ls_{i,-90°}, ..., Ls_{i,0°}, ..., Ls_{i,90°}} is constructed, as shown in Fig. 2. Each line segment Ls_{i,k}, k ∈ [-90°, 90°], forms an integer angle of k° with the horizontal axis. The length of each segment is set to 10·CCMW, where CCMW is the mean width of the CCs ∈ I_f. The start point of each segment is taken as the center of the bounding box of CC_i ∈ I_f. Then, the number of CCs that intersect each line segment is computed; let this number be n_k. The local integer angle of CC_i is defined as the angle of the segment that corresponds to the largest n, and this segment is called the dominant line segment. In Fig. 2, the local integer angle for the letter "B" is expected to lie between the line segments at 10° and 20°. The center points of the CCs that the dominant line segment of CC_i intersects, {Cp_{i,0}, Cp_{i,1}, ..., Cp_{i,j}}, are used as input data to find the corresponding least squares line, whose angle is defined as the exact angle. The calculation of the exact angle is described in more detail in section 2.4. In case there are more line segments with the same maximum occurrence n, for each one the corresponding local integer angle and exact angle are additionally assigned to CC_i. Since the local integer angles of all CCs ∈ I_f have been calculated, the histogram of all integer angles can be constructed as in Fig. 3. The horizontal axis depicts all possible integer angles and the vertical axis depicts the occurrences p of the dominant line segments that have been found. In order for the technique to decide whether there is a global skew angle or several local skew angles, the number of peaks of the histogram must be detected.
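The sketch below illustrates the core of this step for one component: rays at integer angles are cast from the component centre, the number of neighbouring component centres each ray passes close to is counted, and the centres hit by the winning ray are kept for the subsequent least-squares fit. It is a simplified reading of the procedure (a distance-to-ray test instead of pixel-level intersection with the CCs, ties resolved by the first maximum), with hypothetical names.

import numpy as np

def local_integer_angle(center, neighbor_centers, ccmw, tol=3.0):
    # Return (angle_deg, supporting_centers) for one connected component.
    # A centre supports a ray if it lies within `tol` pixels of the ray
    # and within the segment length 10*CCMW.
    length = 10 * ccmw
    pts = np.asarray(neighbor_centers, dtype=float) - np.asarray(center, dtype=float)
    best = (None, 0, None)
    for k in range(-90, 91):                      # the 181 candidate angles
        d = np.array([np.cos(np.radians(k)), np.sin(np.radians(k))])
        along = pts @ d                           # projection on the ray
        across = np.abs(pts @ np.array([-d[1], d[0]]))
        hits = (along > 0) & (along <= length) & (across <= tol)
        if hits.sum() > best[1]:
            best = (k, hits.sum(), np.asarray(neighbor_centers)[hits])
    return best[0], best[2]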
Fig. 4. (a) Original multi-skewed scanned document. (b) Dominant local directions histogram. (c) The dominant peaks detected.
In order to achieve this, a filter is applied. This filter is a 1×5 max filter with an additional threshold condition. For every integer angle ang ∈ [-90°, 90°] the following conditions should be satisfied:
• p(ang) = max(p(ang-2), p(ang-1), p(ang), p(ang+1), p(ang+2))
• p(ang) ≥ Th, where Th = max(p(-90°), ..., p(0°), ..., p(90°)) / 4
If these conditions are not valid, p(ang) is set to zero. A multi-skewed document and the dominant local directions histogram before and after the application of the filter are shown in Fig. 4. If there is one peak, the document has global skew distortion; otherwise it has local skew distortion. The peaks of the histogram are also used for the detection of the exact skew angle.
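A minimal sketch of this peak filter follows (the array name and layout are assumptions; the 1×5 max window and the Th = max(p)/4 threshold are as described above).

import numpy as np

def filter_angle_histogram(p):
    # p[i] holds the occurrences of dominant segments for angle i - 90.
    # Keep only bins that are local maxima over a 1x5 window and exceed max(p)/4.
    th = p.max() / 4.0
    filtered = np.zeros_like(p)
    padded = np.pad(p, 2, mode="constant")
    for i in range(p.size):
        window = padded[i:i + 5]
        if p[i] == window.max() and p[i] >= th:
            filtered[i] = p[i]
    return filtered          # the non-zero bins are the detected peaks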
2.3 Page Layout Analysis
In the case of local skew correction, a page layout analysis is necessary for the technique to locate areas with a different skew angle. Suppose that i local peaks {p_0, p_1, ..., p_i} have been detected in the histogram and {ang_0, ang_1, ..., ang_i} are their corresponding integer skew angles. A boundary growth algorithm at two perpendicular angles is performed for each CC_i ∈ I_f, only if CC_i is merged with another CC within a specified neighborhood. The first angle, ang_BG, can be any of {ang_0, ang_1, ..., ang_i}, on condition that CC_i is merged with another CC. The other will be its perpendicular angle ang_BG - 90°. The boundary growth algorithm is applied in four directions: bottom to top, top to bottom, left to right and right to left (see Fig. 5(a)). The threshold for
applying the algorithm is set to 2·CCMH for the left-to-right and right-to-left directions, and 4·CCMH for the top-to-bottom and bottom-to-top directions, where CCMH is the mean height of the CCs ∈ I_f. An example of the boundary growth algorithm is depicted in Figs. 5(b) and 5(c).
Fig. 5. An example of the boundary growth algorithm. (a) Boundary growth directions. (b) Original image. (c) Image after the application of the page layout analysis.
2.4 Skew Angle Detection
In section 2.2, the local integer angle was calculated for each CC_i ∈ I_f. However, a more precise angle is needed for proper skew correction of a document. In this section it is described how the exact angle for each CC_i ∈ I_f is calculated using least squares. Finally, the exact skew of the document is calculated. The center points of the CCs that the dominant line segment of CC_i intersects are the input data for the calculation of the skew of the least squares line. The least squares line is defined as:

y = a + bx    (3)

We are interested in the skew of this line, which is defined as:

b = [n·Σ_{i=1}^{n} X_i·Y_i - (Σ_{i=1}^{n} X_i)(Σ_{i=1}^{n} Y_i)] / [n·Σ_{i=1}^{n} X_i² - (Σ_{i=1}^{n} X_i)²]    (4)
where n is the number of center points {Cp_0, Cp_1, ..., Cp_n} that the dominant line segment intersects and {(x_0, y_0), (x_1, y_1), ..., (x_n, y_n)} are their coordinates. From the set of CCs whose local integer angle coincides with the global skew estimate, the exact skew of the document is defined as their average exact angle value. In the case of local skew, each CC ∈ I_BG is considered as a homogeneous area with a local skew. The local angle ang ∈ {ang_0, ang_1, ..., ang_i} of this area is defined by the majority of the local integer values of the CCs that lie in it. The exact skew is again defined as the average of the exact angle values of these CCs. Fig. 6 depicts local skew correction results. Figs. 6(a), 6(c) and 6(e) show the original document images, and the results of the skew correction procedure are shown in Figs. 6(b), 6(d) and 6(f).
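For completeness, a small sketch of the exact-angle computation from the centre points intersected by the dominant segment, following Eq. (4); converting the slope to degrees with arctan is an assumption made here for illustration, and the function name is hypothetical.

import numpy as np

def exact_angle_deg(points):
    # points: list of (x, y) centre points intersected by the dominant
    # line segment. Returns the least-squares slope angle in degrees.
    x, y = np.asarray(points, dtype=float).T
    n = x.size
    b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
    return np.degrees(np.arctan(b))

print(exact_angle_deg([(0, 0), (10, 2), (20, 4.2), (30, 6)]))  # ~11.4 degrees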
Fig. 6. Three examples of documents with different skewed areas. (a), (c), (e) Original documents. (b), (d), (f) Results after the application of the proposed technique.
3 Experimental Results
The proposed technique is compared with the techniques of Gatos et al. [6] and Chou et al. [17]. The Gatos et al. technique is designed for documents with skew angles from -45° to 45°, while the Chou et al. technique is designed for documents with skew angles from -15° to 15°. The evaluation concerns global skew detection and correction and is based on three different randomly rotated sets, each one containing 50 documents, taken from the technical document database of the University of Washington [20]. The first set contains documents with random skew ranging from -90° to 90°. The second set contains documents with random skew ranging from -45° to 45°; this set is used to compare the proposed technique with the Gatos et al. technique. The third set is used to compare the proposed technique with both the Gatos et al. and Chou et al. techniques, and the documents' skew ranges from -15° to 15°. For visual comparison of the techniques, deviation histograms have been constructed. In the deviation histogram, the horizontal axis refers to the number of the document, while the vertical axis refers to the deviation d_i:
d_i = |r_i - o_i|    (5)
where r_i is the resulting skew and o_i the original skew of the i-th document. Figs. 7(a)-(c) depict the deviation histograms of the first, second and third set. Table 1 compares these techniques in terms of the mean, max and min deviation values. In the second data set (Fig. 7(b)), documents 7, 8, 26, 39, 45, 47 and 48 are excluded from the calculation of the mean value, because their deviation was over 5° and the documents could not be restored. In Table 1, all documents are included for the Gatos et al. technique, even those that could not be restored.
Fig. 7. Deviation histograms for three data sets of documents. (a) Documents with skew angle ranging from -90° to 90°. (b) Documents with skew angle ranging from -45° to 45°. (c) Documents with skew angle ranging from -15° to 15°.
Table 1. Mean, max and min deviation values for all three techniques
Technique                                    Mean Deviation       Max Deviation      Min Deviation
Proposed Technique (150 documents)           d_mean = 0.5112      d_max = 2.7°       d_min = 0°
Gatos et al. Technique (100 documents)       d_mean = 1.8045      d_max = 41.35°     d_min = 0.01°
Chou et al. Technique (50 documents)         d_mean = 0.4226      d_max = 1.53°      d_min = 0°
Table 2. Computational cost comparisons

Technique                                        Mean Time      Max Time      Min Time
Proposed Technique (Third document set)          6.028 sec      10.8 sec      2 sec
Chou et al. Technique (Third document set)       5.596 sec      7.6 sec       3.7 sec
As far as the computational cost is concerned, a comparison has been made only between the proposed technique and the Chou et al. technique, because both have been implemented in the same visual environment (see Table 2). From the computational cost comparison, we observed that the proposed technique is faster when the documents contain text and graphics, because the total number of CCs is smaller. The time efficiency of the Chou et al. technique depends on the total number of foreground pixels; this explains why its deviation from the mean time is smaller than that of the proposed technique. The Gatos et al. technique has a significantly lower computational cost, but it has not been implemented by the present authors, so any timing comparison would not be objective.
4 Conclusion
In this paper we propose a technique for global and local skew detection in binary documents. The main advantages of the proposed technique are:
• The simplicity of the method.
• The flexibility of detecting either global or local skew with accuracy.
• An effective algorithm for page layout analysis.
• Its accuracy in skew detection.
In the future, we will focus on improving the page layout analysis part of the method, in order to achieve higher restoration rates in documents with local skew. Also we will try to reduce the overall computational cost.
References 1. Amin, A., Fischer, S.: A document skew detection method using the Hough transform. Pattern Analysis and Applications 3, 243–253 (2000) 2. Yin, P.Y.: Skew detection and block classification of printed documents. Image and Vision Computing 19, 567–579 (2001) 3. Wang, J., Leung, M.K.H., Hui, S.C.: Cursive word reference line detection. Pattern Recognition 30, 503–511 (1997) 4. Kwag, H.K., Kim, S.H., Jeong, S.H., Lee, G.S.: Efficient skew estimation and correction algorithm for document images. Image and Vision Computing. 20, 25–35 (2002) 5. Postl, W.: Detection of linear oblique structure and skew in digitized documents. In: Proceedings 8th Int. Conf. on Pattern Recognition, pp. 464–468 (1986) 6. Gatos, B., Papamarkos, N., Chamzas, C.: Skew detection and text line position determination in digitized documents. Pattern Recognition 30, 1505–1519 (1997) 7. Baird, H.S.: The skew angle of printed documents. In: O’Gorman, L., Kasturi, R. (eds.) The skew angle of printed documents, pp. 204–208. IEEE CS Press, Los Alamitos (1995) 8. Akiyama, T., Hagita, N.: Automated entry system for printed documents. Pattern Recognition. 23, 1141–1154 (1990) 9. Pavlidis, T., Zhou, J.: Page segmentation by white streams. In: Proceedings 1st Int. Conf. Document Analysis and Recognition, pp. 945–953 (1991) 10. Ciardiello, G., Scafuro, G., Degrandi, M.T., Spada, M.R., Roccotelli, M.P.: An experimental system for office document handling and text recognition. In: Proceedings 9th Int. Conf. on Pattern Recognition, Milano, pp. 739–743 (1988) 11. Kapoor, R., Bagai, D., Kamal, T.S.: A new algorithm for skew detection and correction. Pattern Recognition Letters. 25, 1215–1229 (2004) 12. Hashizume, A., Yeh, P.S., Rosenfeld, A.: A method of detecting the orientation of aligned components. Pattern Recognition. 4, 125–132 (1986) 13. Liu, J., Lee, C.M., Shu, R.B.: An efficient method for the skew normalization of a document image. In: Proceedings Int. Conf. on Pattern Recognition, vol. 3, pp. 122–125 (1992) 14. O’Gorman, L.: The document spectrum for page layout analysis. IEEE Trans. Pattern Analysis and Machine Intelligence 15, 1162–1173 (1993) 15. Yan, H.: Skew correction of document images using interline cross-correlation. Graphical Models and Image Processing 55, 538–543 (1993) 16. Chaudhuri, A., Chaudhuri, S.: Robust detection of skew in document images. IEEE Transactions on Image Processing 6, 344–349 (1997) 17. Chou, C.H., Chu, S.Y., Chang, F.: Estimation of Document Skew Angles Using Piecewise Linear Approximation of Line Objects. Pattern Recognition 40, 443–455 (2007) 18. Dhandra, B.V., Malemath, V.S., Mallikarjun, H., Hegadi, R.: Skew Detection in Binary Image Documents Based on Image Dilation and Region labeling Approach. In: Proceedings 18th Int. Conf. on Pattern Recognition, vol. 2, pp. 954–957 (2006) 19. Saragiotis, P., Papamarkos, N.: Skew correction in documents with several differently skewed text areas. In: Int. Conf. on Computer Vision Theory and Applications. Barcelona (2007) 20. Phillips, I.T.: User’s Reference manual for the UW English/Technical Document Image Database I. UW-I English/Technical Document Image Database, University of Washington (1993)
System for Estimation of Pin Bone Positions in Pre-rigor Salmon
Jens T. Thielemann1, Trine Kirkhus1, Tom Kavli1, Henrik Schumann-Olsen1, Oddmund Haugland2, and Harry Westavik3
1 SINTEF, PB 124 Blindern, N-0314 Oslo, Norway
{jtt,trk,tka,hso}@sintef.no, http://www.sintef.no/omd
2 Trio Fish Processing Machinery AS, P.O. Box 38, Forus, NO-4064 Stavanger, Norway
[email protected]
3 SINTEF Fisheries and Aquaculture AS, N-7465 Trondheim, Norway
[email protected]
Abstract. Current systems for automatic processing of salmon are not able to remove all bones from freshly slaughtered salmon. This is because some of the bones are attached to the flesh by tendons, and the fillet is damaged or the bones broken if the bones are pulled out. This paper describes a camera-based system for determining the tendon positions in the tissue, so that the tendons can be cut with a knife and the bones removed. The location of the tendons deep in the tissue is estimated from the position of a texture pattern on the fillet surface. Algorithms for locating this line-like pattern in the presence of several other similar-looking lines and significant background texture are described. The algorithm uses a model of the pattern's location to achieve precision and speed, followed by a RANSAC/MLESAC-inspired line fitting procedure. Close to the neck the pattern is barely visible; this is handled through a greedy search algorithm. We achieve a precision better than 3 mm for 78% of the fish using at most 2 seconds of processing time.
1 Introduction
Fresh salmon is excellent food, a food which most consumers prefer bone free. Currently, the salmon that is sold as fresh fillets requires manual after-processing to pick out some of the bones in the fish, the so-called pin bones (Figure 1). This is because current machinery for removal of pin bones requires that the fish has aged 4-6 days after slaughtering before processing. The reason for this delay is that after slaughtering, salmon is in a pre-rigor mortis or rigor mortis phase during which some of the bones (the so-called "pin bones") are not removable without damaging the flesh. Most current automatic filleting practices therefore wait until the salmon has exited the rigor mortis phase before attempting filleting. This means that the fish leaves the factory 5-6 days old. Previous systems for pre-rigor pin bone removal have removed the bones by cutting into the fillet from above with a knife [1]. This leaves a large scar on the fillet. Other systems have simply focused on detecting the presence of pin bones using
X-rays, without attempting to automatically remove the bones. The use of X-ray makes the system prohibitively expensive. Trio Fish Processing Machinery AS, a Norwegian company, is developing a system that allows the pin bones to be removed from fish fillets while they are still in the pre-rigor phase, not more than one to two hours after slaughtering [4]. To achieve this, the system needs to cut the tendon attachment so that the pin bones can be pulled out without damaging the flesh or breaking the bones. The cut is performed by inserting a long thin knife from the head end of the fillet, close to the skin. The knife is inserted in such a way that its tip follows a trajectory along which the tendons are estimated to be located, and thus cuts the tendons. As seen in Figure 1, the tendon attachments cannot be seen directly on the outside of the fillet. It was thus necessary to estimate the location of the tendon attachment based on features on the surface of the fillet. One alternative could be to detect the bone stumps and use those to position the cut. The bone stumps are, however, also often well hidden in the flesh, which makes any detection of them difficult and unreliable. We have therefore chosen to focus on detecting the position of a line-like texture pattern on the surface of the fillet. This line is empirically shown to be well co-located with the line of tendons deep in the flesh. This article presents an image processing system for imaging the fillets, detecting relevant patterns on the fish fillet and using those patterns to locate the tendon attachments. The algorithm needs to be both rapid and precise. A new fillet arrives every four seconds, which sets an upper limit on the processing time. Any inaccuracy in the algorithm's position estimate means that a wider knife must be used to ensure that all tendons are cut. A wide cut is not desirable for cosmetic reasons. We have set a goal of a maximum deviation of 3 mm for the position estimate. The rest of the article is structured as follows. Section 2 gives a brief background on fish anatomy. Section 3 describes the imaging system in brief. Section 4 describes the algorithm used for position estimation. Section 5 reports the results achieved with this algorithm, followed by a discussion in section 6.
2 Brief Fish Anatomy
In order to understand the procedure for automatic location of the tendon attachment, it is necessary to have a basic overview of fish fillet anatomy. Figure 1(a) shows a cross section of a salmon fillet. We see the indication of multiple bones crossing the shown plane; the bones are fastened to the tissue close to the skin, at the position marked as the tendon attachment. The cut needs to be placed close to these attachments, at the position shown in the figure. A line-like pattern, called the epaxial septum [5], appears more or less directly above the tendon attachment. Our measurements of salmon indicate that the epaxial septum can be used to predict the position of the tendon attachment with a precision of approximately 2 mm. By accurately locating the epaxial septum, a correct cut can be made, facilitating subsequent gentle bone removal.
Fig. 1. Nomenclature for describing fish anatomy. (a) MR-scan of salmon. The scan shows a single cut through the fillet, perpendicular to the length axis of the fish. The tendon attachment can be seen almost directly below the epaxial septum. The white horizontal line indicates the cut position; the dashed line indicates a pin bone. (b) Photo of fillet seen from above. Vertical parallel arrows indicate the epaxial septum that is to be identified. We see that the line is barely visible at the right end. The right part of the fish is referred to as the head part, the left part as the tail part. The dorsal boundary is indicated separately; this is the cut line after the fish was split in two. Note that the epaxial septum can be seen both in the MR-scan and in the photo. We refer to the upper part of the fillet as the dorsal loin, and the lower part as the belly loin.
3 Image Capturing System
The image capturing system is built around a standard 3-megapixel area camera, which sees the fillet from above as shown in Figure 1(b). The camera captures images at a resolution of approximately 0.3 mm/pixel. The fish moves on a conveyor belt past the camera and is imaged at a rate of approximately four images per second. A strobe is used to freeze the movement.
4 Image Processing
Each salmon is an individual with different genetics, and each is exposed to different environmental factors that influence its anatomical development and appearance.
This makes up a large variation that the algorithms for detection of the epaxial septum must be robust against. The epaxial septum is not a true line. It is made up of a bend in the marbling pattern in the dorsal loin of the fillet. This may make the line discontinuous in some individuals. In particular, the line has a tendency to get smeared out and become discontinuous in the neck region. This smearing makes the line difficult to trace even for humans. The epaxial septum is not the only white line appearing on a salmon fillet. There are several parallel lines caused by connective tissue and the marbling in the salmon loins.

4.1 Detection Algorithm
To avoid confusion with the other lines, we make a model that gives a coarse prior prediction for where we expect the epaxial septum to be relative to the dorsal boundary of the fillet (the upper boundary towards the conveyor belt). This model is based on a training set of 46 salmon fillets from 46 individuals, where the epaxial septum was manually located. The predicted position of the epaxial septum is obtained as an offset from the dorsal boundary of the fillet. The offset is normalized for the fillet size and made a function of the position along the fillet length. A region of interest (ROI) is picked ±40 pixels (approximately ±3 standard deviations in the training data) around the predicted epaxial septum position, as shown in Figure 2. The width of the ROI is the same number of pixels for all fillets, while the length is scaled to be 40% of the fillet's length. When the ROI is taken out of the image and put into a rectangular window as shown in Figure 2(c), the predicted epaxial septum is rectified into a straight horizontal line in the centre of the ROI. The epaxial septum we search for should thus, if the prediction was good, also be approximately a horizontal line in the centre of the ROI, as seen in the figure. We can thus apply a simple FIR filter to enhance horizontal lines within the ROI. We have used a filter of 8 rows and 60 columns where the first and last two rows consist of -1's and the four centre rows of +1's [2, 6]. The colour image is transformed to a monochrome intensity scale using the formula intensity = red_channel/(green_channel + blue_channel) before applying the line filter. This enhances the marbling in the fillet and compensates for some of the variable illumination over the fillet. After filtering, two candidate points that can represent the epaxial septum are identified in each column as the highest and second highest peaks in that column. To eliminate peaks caused by direct reflections, we check a 7×7 pixel neighbourhood around each candidate point for the presence of such reflections. Points that represent reflections are deleted.

4.2 Identification of the Epaxial Septum
For each column in the ROI we now have one or two candidate points, one of which is expected to lie on the epaxial septum, as shown in Figure 3. We see that the points make up fragments of multiple parallel lines.
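As a concrete illustration of the detection stage of Sect. 4.1 and the per-column candidate extraction, the sketch below applies the R/(G+B) transform, an 8×60 horizontal-line filter with -1/+1 rows, and keeps the two strongest responses per column. It is a schematic reconstruction under stated assumptions (the rectified ROI is already available; names are hypothetical; the paper selects the two highest peaks, whereas this sketch simply takes the two largest filter responses), not the deployed code.

import numpy as np
from scipy.signal import convolve2d

def line_enhance(roi_rgb):
    # roi_rgb: HxWx3 rectified ROI. Returns the line-filtered intensity image.
    r = roi_rgb[..., 0].astype(float)
    g = roi_rgb[..., 1].astype(float)
    b = roi_rgb[..., 2].astype(float)
    intensity = r / (g + b + 1e-6)              # enhances the marbling
    kernel = np.vstack([-np.ones((2, 60)),      # 8x60 horizontal-line filter
                        np.ones((4, 60)),
                        -np.ones((2, 60))])
    return convolve2d(intensity, kernel, mode="same", boundary="symm")

def candidate_points(filtered):
    # Per column, keep the row indices of the two strongest responses.
    cands = []
    for col in range(filtered.shape[1]):
        order = np.argsort(filtered[:, col])[::-1]
        cands.append((order[0], order[1]))
    return cands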
Fig. 2. (a) Fillet with the computed ROI marked with thick lines. (b) Close-up of the fillet region. The ROI is marked with thick blue lines, the target epaxial septum with a dashed red line. Note the non-quadratic aspect ratio for enhanced visualization. (c) ROI after adjusting the columns such that the upper row of the image corresponds to the line defining the upper part of the ROI, and applying the line-enhancing filter. The target epaxial septum (marked with a dashed red line) appears as near-horizontal.
The true epaxial septum generally appears more complete and less noisy than the other fragments. The task of the epaxial septum identification is thus to find the correct line among the alternatives. This is done in three steps:

Initial polynomial. The first 2/3 of the epaxial septum length, starting from the tail end, is generally smoothly curved and can be quite well approximated with a second order polynomial. A RANSAC type algorithm [3] is used to find an initial polynomial approximation to this part of the line:
a. Three non-overlapping segments/bins are defined, corresponding to the left, the middle and the right part of the line length.
b. One random point is picked among the candidate points from each of the bins. These three points define a candidate second order polynomial.
c. If the polynomial curvature is within specified limits, the degree of match of the polynomial is measured by counting the number of points
that fit the polynomial. Similar to [7], points are counted with a weight equal to 1 for points that accurately match the polynomial, and with a weight decaying down to 0 as the deviation increases up to a maximum threshold.
d. 100 random picks are performed and the polynomial with the best match is used for step 2.

Extrapolation towards the head. The fillet, when placed on the conveyor belt, can have a quite strong bending at the neck end. This can result in large variations in the curvature of the epaxial septum towards the head, and fitting this with a low order polynomial gives very inaccurate and unreliable results. It was thus chosen to extrapolate the polynomial line from step 1 above forward to the head end by means of an empirical model for how the epaxial septum depends on the upper edge of the fillet. This model is simply the average offset between the upper edge and the epaxial septum as a function of position along the epaxial septum, obtained from manual measurements on a set of sample fillets. The extrapolation model is shifted vertically to join the polynomial line from above.

Adjusting the line to match image points. In a final step, the initial line from the previous two steps is adjusted to better match the candidate points measured from the image. From experience we found that the polynomial is sufficiently accurate at the tail end. The adjustment algorithm thus starts at the centre of the ROI and moves forward towards the neck, one image column at a time. For each column the data point nearest to the line is identified. If this point is within a maximum deviation tolerance, the forward part of the line is shifted up or down by 1/10th of the distance to this nearest point. If no point is found within the deviation tolerance, the algorithm proceeds to the next column without adjusting the line. The deviation tolerance is made dependent on the number of matching points found in the previous 90 image columns. If few matching points were found, we must expect a large error in the approximating line, and the tolerance is made correspondingly large. If the density of matching points falls below a limit, the search is terminated and a signal is given that a reliable line could not be found. Figure 3 illustrates an example of how the algorithm for identification of the epaxial septum works.
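A compact sketch of the RANSAC/MLESAC-style fit of the initial second-order polynomial (steps a-d above) follows. The bin boundaries, the curvature limit and the exact soft weighting are illustrative assumptions, not the values used by the authors.

import numpy as np

def ransac_parabola(xs, ys, n_iter=100, tol=6.0, max_curv=1e-3):
    # Fit y = c2*x^2 + c1*x + c0 to candidate points by random sampling.
    # One point is drawn from each third (bin) of the x-range per trial.
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    edges = np.quantile(xs, [0.0, 1/3, 2/3, 1.0])
    bins = [np.flatnonzero((xs >= edges[i]) & (xs <= edges[i + 1])) for i in range(3)]
    rng = np.random.default_rng(0)
    best_score, best_coef = -1.0, None
    for _ in range(n_iter):
        idx = [rng.choice(b) for b in bins]
        coef = np.polyfit(xs[idx], ys[idx], 2)
        if abs(coef[0]) > max_curv:              # curvature limit (step c)
            continue
        resid = np.abs(np.polyval(coef, xs) - ys)
        score = np.clip(1.0 - resid / tol, 0.0, 1.0).sum()  # soft MLESAC-like weights
        if score > best_score:
            best_score, best_coef = score, coef
    return best_coef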
5 Experimental Results
The data was captured at the fish processing plant SalMar, Frøya, Norway, in June 2006. The data consists of fillets from fish in five size categories: 1-3 kg, 3-4 kg, 4-5 kg, 5-6 kg, and 6-7 kg. Each class consists of up to 50 fish, and only the left-side fillet was used. We took two pictures of each fillet. For each capture, the ground truth epaxial septum was manually marked. The data set was divided into a training set and a test set with different fish in the two sets. The algorithm was developed and optimized using the training set, and the
Fig. 3. Illustration of the identification of the epaxial septum from candidate points. The red and magenta points are respectively the strongest and the second strongest peak points identified from the image. These are given confidence values 1.0 and 0.5 respectively. The green line is the manually marked line used as ground truth. The blue line from column 670 to 1270 is the second order polynomial found with the RANSAC algorithm. The black line from column 1270 and towards the right is the extrapolation line after it has been adjusted to the matching points. The blue lines represent the tolerance range used for matching points.
Fig. 4. Histogram of the maximum column-wise distance between the estimated epaxial septum and the ground truth. (a) 1-4 kg fish. (b) 4-7 kg fish.
test set was used for evaluation; this is the data used for reporting results. The same algorithms and models are used for all fish size classes. To quantify the performance of the algorithm, we have calculated the maximum distance between the estimated epaxial septum and the manually marked ground truth. Histograms indicating performance using this metric are shown in Figure 4. We achieve our desired precision of 3 mm for 78% of the fish in total. More precisely, we achieve this goal for 85% of the fish above 4 kg, and for 70% of the fish below 4 kg. Typical examples of fish where the algorithm works and fails are shown in Figure 5. The typical time consumption for the algorithm, run in The MathWorks, Inc.'s Matlab®, is 1.5 seconds. The maximum observed runtime is 2 seconds.
Fig. 5. Illustration of the system working and failing. (a) Epaxial septum detected correctly. (b) Epaxial septum not detected correctly; note the slight sudden erroneous bend close to the neck.
6 Discussion and Conclusion
For 85% of the fish above 4 kg, we achieve an estimate within our goal of 3 mm precision. Where the algorithm fails, it fails close to the neck, where the line to detect has become very unclear and we thus are forced to estimate its position mainly using prior models. Smaller size classes, less than 4 kg, only attain the same accuracy for 70% of the fish. We believe that this drop in performance is due to the fact that all models were developed using the larger fillets, and that the model extrapolation employed is not sufficiently accurate. The smaller size classes are, however, generally not that important, as they constitute only a limited fraction of processed salmon (approximately 20% of total production). Apart from model tuning, we think that in order to improve the position estimates any further, a significantly more advanced texture analysis is necessary. This analysis would need to analyze not only the line itself, but also the surrounding patterns. Another strategy is to detect when the algorithm is failing, so that these fish can be processed manually.
Still, we consider the current results to be of sufficient quality to allow automatic complete filleting and pin bone removal for fresh salmon. This work may thus make it possible to serve European consumers cheaper and better salmon with extended shelf life.
References 1. Braeger, H., Moller, W.: Apparatus for gaining pinbone-free fillets of fish. US Patent 4748723 (1987) 2. Davies, E.R.: Machine Vision: Theory, Algorithms, Practicalities, pp. 269–271. Academic Press, London (1990) 3. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Comm. of the ACM 24, 381–395 (1981) 4. Haugland, O., Voll, T.: Mechanism and apparatus to ease extraction of pin bones. Norwegian Patent 319441 5. Kryvi, H., Totland, G.: Fiskeanatomi (Fish Anatomy). Høyskoleforlaget AS (1997) ISBN 82-7634-056-3-5 6. Pratt, W.K.: Digital Image Processing, 2nd edn. pp. 553–555. John Wiley & Sons Inc., NY (1991) 7. Torr, P., Zisserman, A.: MLESAC: a new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding 78(1), 138–156 (2000)
Vertebral Mobility Analysis Using Anterior Faces Detection
M. Benjelloun1, G. Rico2, S. Mahmoudi1, and R. Prévot1
1 Computer Science Department, Faculty of Engineering, rue de Houdain 9, Mons, B-7000, Belgium
{mohammed.benjelloun,said.mahmoudi,richard.prevot}@fpms.ac.be
2 Departamento de Sistemas y Tecnología Informática, Universidad Privada del Valle, Av. Ayacucho 256, Cochabamba, Bolivia
[email protected]
Abstract. In this article, we are interested in X-ray images of the spinal column in various positions. The purpose of this work is to extract parameters determining the vertebral mobility and its variation during flexion-extension movements. A modified Discrete Dynamic Contour Model (DDCM) using the Canny edge detector was the starting point for our segmentation algorithm. To address the lack of convergence due to open contours, we have developed a heuristic method appropriate to our application domain. Results on real images corresponding to the cervical spinal column, and their comparison with manual measurements, are presented to demonstrate and validate the proposed technique.
1 Introduction
Medical image processing and analysis software eases and automates tasks dealing with the interpretation of medical images. It permits the extraction of quantitative and objective parameters related to the form and texture contained in the images. The motion of the anatomy can be determined from a set of serially acquired images. In this article, X-ray images of the spinal column of the same patient are analysed in various positions. We aim at developing a computer vision tool able to determine the mobility of cervical, lumbar and dorsal vertebrae. The purpose of the diagnosis is to extract quantitative measures of particular changes between images acquired at different moments. For instance, to measure vertebral mobility, images in flexion, neutral and extension position are analyzed (Fig. 1). Measuring the movement of each vertebra allows us to determine the mobility of the vertebrae in relation to each other, and to compare the corresponding vertebrae between several images. Several methods have been applied to vertebra segmentation [1]. Techniques using the Hough Transform [2,3,4], Active Shape Models [5,6,8] and parametric deformable models (PDM) [7] are some examples of the various approaches developed. Templates are required for all of these methods.
Fig. 1. Flexion, neutral and extension position of the cervical spine
Some recent image processing techniques enable surface approximation using CT or MR images of the patient enhanced with contrast agent. These works are based on accurate surface extraction with front propagation techniques relying on the Fast-Marching and Level-Sets methods [9]. Fast-Marching and Level-Sets methods are numerical techniques which can follow the evolution of contours and surfaces that can develop sharp corners, break apart, and merge together, and are particularly useful for shape recovery of complex geometries like branching tubular structures [10]. Other techniques investigate spine segmentation approaches using volumetric CT datasets. In this context, Mastmeyer et al. [11] present, in a recent work, a hierarchical 3D segmentation method used to segment vertebral bodies from CT images of patients with osteoporosis. To capture the great variability in the shape of the vertebrae, more templates should be considered, inducing a severe computing time penalty in the case of the Hough Transform and the need for a large and appropriate training set in the case of the Active Shape Models. Furthermore, these methods are not reliable in the presence of morphological anomalies (fractures, osteophytes, spine injuries, ...). So, we chose to investigate methods not based on template matching. To get satisfying results despite the shape variability and the noise present in the images, the method developed in this analysis uses a Discrete Dynamic Contour Model (DDCM) [13] including adaptations inherent to the nature of the processed images. This method is founded on a preliminary edge detection based on the Canny filter. The edges are then exploited by the DDCM to extract the information of interest, namely the anterior faces of the vertebrae. It is indeed easier to work with this representation than with the complete vertebra contour, because of the noise located inside the vertebral body. Therefore, the mobility of the vertebrae is represented by the mobility of their anterior faces. We rely on angular variation measurements and comparisons to determine it. We note that the X-ray images used for our experiments correspond to real patients and were provided by radiologists. For each patient three images are taken, each corresponding to a different position: neutral, flexion and extension. The rotation of the neck takes place in a plane parallel to the imaging plane.
2 Vertebrae Segmentation
The algorithm proposed in this study is a combination of a preliminary edge detection, a contour segmentation using a Discrete Dynamic Contour Model (DDCM) [13], and a feature extraction step developed to find the anterior face representing the vertebra. A DDCM [13] is a contour model consisting of vertices connected by edges (Fig. 2).
Fig. 2. DDCM model [13]
The vertex V_i is represented by the position vector p_i, and the edge vector d_i is given by d_i = V_{i+1} - V_i. Vertices move under the influence of internal forces, derived from the shape of the contour model, and external forces, derived from the main characteristics of the image. The forces acting on each vertex lead to an acceleration, denoted a_i. Starting from an initial form, the contour is inflated by the internal and external forces, gets larger and tries to acquire the desired contour. The deformation process ends when the internal forces balance the external forces. In our work, we have selected Canny's edge detector because we are working with images whose characteristics change according to the source (Figure 3): a digitized image, a digital photo of the X-ray image, or an image obtained directly from the X-ray machine.

2.1 Contour Segmentation
We used a detector based on Canny's work [14]. Once our image has been processed and we have its edges (Fig. 3), we must segment our region of interest: the vertebra. The method must provide accurate and repeatable results on a large set of images. For the segmentation phase, we have worked with the DDCM technique. We have tested both the complete and the partial segmentation of a vertebra, and we have applied a convergence criterion to stop the contour segmentation process. The DDCM method has been selected basically for two reasons:
1. This model can be adapted for the segmentation of unknown elements. Our goal is to get fast and satisfying results despite the shape variability, notably
Fig. 3. Edge detection using the Canny filter
in the presence of morphological anomalies (fractures, osteophytes, ...). So, even if vertebrae have a box-like form, we chose a segmentation algorithm that is not based on template matching.
2. The growing process is less sensitive to the noise present inside the element that must be segmented. The detection process normally gives well-located contours, but in the case of the vertebrae we found different levels of noise inside the vertebral body, making the segmentation process more difficult.
In our application, the initial DDCM contour is created by the user clicking within the vertebra to be segmented. It is composed of two pairs of points placed symmetrically on the horizontal axis and on the vertical axis passing through the clicked point. Note that the convergence of the algorithm is relatively independent of the position of the click. The algorithm used to deform the DDCM contour can be summarized as follows. Trace the first four points around the clicked point. Then, while convergence is not reached:
– Find the internal and external forces.
– Remove points with the largest or shortest distance.
– Remove cycle points.
– Determine the convergence criteria.
Internal and external forces. The internal forces are used to minimize the local contour curvature. For the four initial points, and subsequently for each new set of proposed points in the contour set, we must find the internal forces. The purpose of this phase is to obtain the radial vector for the points that are analysed, and to determine the shift that a point could undergo due to these internal forces. The role of the external forces is to deform the model. In our work, we have modified the algorithm used in [15] to work with the radial direction. This modification is described in the next paragraphs.
Direction of the previous point. We consider that a point P_i(x, y) moves to another position following a direction related to the position of the previous point P_{i-1}(x, y). Working with a 3 × 3 neighborhood, we decided to keep five directions of exploration for each previous configuration (Fig. 4). This means that the point P_i(x, y) could follow one of these directions. The associated 3 × 3 mask is denoted W. For example, if the point P_{i-1} is in the left-down direction, W is:

W = | 1 1 1 |
    | 0 0 1 |    (1)
    | 0 0 1 |
Fig. 4. Masks corresponding to the direction of the previous point and to the directions of exploration, respectively
Radial direction. This direction is obtained from the internal forces. Depending on the radial direction of the point being processed, we can select the appropriate mask to determine the next position of the point. A given point P_i(x, y) can be moved in one of the eight directions of its 3 × 3 neighborhood. First, we find the position variation of point P_i due to the internal forces:
1. Given the point P_i, the point P_{i-1} located to its left and the point P_{i+1} located to its right, calculate the following vectors:
– difference between P_{i+1} and P_i: d_i = P_{i+1} - P_i
– difference between P_i and P_{i-1}: d_{i-1} = P_i - P_{i-1}
– tangential vector: t_i = d_i + d_{i-1}
– curvature vector: c_i = d_i - d_{i-1}
– radial vector: r_i, with r_{i,x} = -t_{i,y} and r_{i,y} = t_{i,x}
2. Then, calculate the shift vector of the point P_i as:

s_{int,i} = r_i \, f_{int,i}   (2)

with f_{int,i} the internal force:

f_{int,i} = c_i \cdot r_i - \frac{(c_{i-1} \cdot r_{i-1}) + (c_{i+1} \cdot r_{i+1})}{2}   (3)
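For concreteness, this internal-force step can be sketched as follows for a closed contour stored as an ordered array of vertices. This is only an illustrative reading of Eqs. (2)-(3); the variable names and the vectorized treatment are our own, and in practice the radial vectors are often normalized before applying the shift.

```python
import numpy as np

def internal_shifts(P):
    """Internal shift s_int,i = r_i * f_int,i for each vertex of a closed DDCM
    contour P (N x 2 array of (x, y) points), following Eqs. (2)-(3)."""
    d = np.roll(P, -1, axis=0) - P                     # d_i     = P_{i+1} - P_i
    d_prev = np.roll(d, 1, axis=0)                     # d_{i-1} = P_i - P_{i-1}
    t = d + d_prev                                     # tangential vector t_i
    c = d - d_prev                                     # curvature vector  c_i
    r = np.column_stack((-t[:, 1], t[:, 0]))           # radial vector r_i
    cr = np.sum(c * r, axis=1)                         # c_i . r_i
    f_int = cr - 0.5 * (np.roll(cr, 1) + np.roll(cr, -1))   # Eq. (3)
    return r * f_int[:, None]                          # Eq. (2): one 2-D shift per vertex
```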
When the mask of directions W has been found following the procedure explained in section 2.1, we must calculate the value of the external forces acting
on the point P_i(x, y). This point is the center of a convolution process between its neighborhood and the direction mask:

f_{ext,i} = \sum_{j=0}^{2} \sum_{k=0}^{2} W(j,k)\, I(y+j-1,\; x+k-1)   (4)

with I(x, y) the image data. This external force is used to calculate the displacement vector s_{ext,i} of the point P_i(x, y):

s_{ext,i} = -r_i d_i   (5)

with
– r_i the radial vector of P_i(x, y)
– d_i the shifting distance of P_i:

d_i = \begin{cases} 0 & \text{if } f_{ext,i} < k \\ 1 & \text{otherwise} \end{cases}   (6)

– k a thresholding parameter.

The total shift s_i of the point P_i is finally:

s_i = s_{int,i} + s_{ext,i}   (7)

After this process the point P_i(x, y) may or may not change its position.
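A per-point sketch of Eqs. (4)-(7), assuming the direction mask W has already been selected from the previous-point configuration. Boundary handling and the choice of the threshold k are left open here, as the text does not fix them.

```python
import numpy as np

def total_shift(I, p, W, s_int, r, k_thr):
    """Total shift s_i = s_int,i + s_ext,i for the contour point p = (x, y).
    I is the gray-level image, W the 3x3 direction mask, r the radial vector of p."""
    x, y = int(round(p[0])), int(round(p[1]))
    patch = I[y - 1:y + 2, x - 1:x + 2].astype(float)  # 3x3 neighborhood of p
    f_ext = np.sum(W * patch)                          # Eq. (4)
    d = 0.0 if f_ext < k_thr else 1.0                  # Eq. (6)
    s_ext = -r * d                                     # Eq. (5)
    return s_int + s_ext                               # Eq. (7)
```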
Fig. 5. (a): DDCM contour detection (b): Example of overflow of the DDCM
Convergence criterion. As the deformation process of the DDCM model is iterative, it is necessary to define a criterion to stop it. We defined this criterion on the basis of the correlation coefficient:

\omega_{X,Y} = \frac{\sum_i (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_i (x_i - \mu_x)^2 \, \sum_i (y_i - \mu_y)^2}}   (8)
where X and Y are the sets of points belonging respectively to two successive DDCM models. To stop the deformation when the variation between two successive iterations becomes negligible, the criterion is defined as \omega_{X,Y} > \varepsilon, with \varepsilon an arbitrary value. Fig. 5-a shows an example of the results obtained with our algorithm. This segmentation model works without any problem in images where the contour to segment is well defined and closed. However, if the contour is open, the convergence criterion is not sufficient (Fig. 5-b): the deformation process grows outside the desired region.
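The stopping rule of Eq. (8) can be written compactly as below; this sketch assumes the two successive contours have been brought to the same number of vertices, which the text does not discuss explicitly.

```python
import numpy as np

def contours_correlation(X, Y):
    """Correlation coefficient of Eq. (8) between two successive DDCM contours,
    each an N x 2 array of vertex coordinates (flattened before the sums)."""
    x, y = X.ravel().astype(float), Y.ravel().astype(float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

# stop iterating when contours_correlation(prev, curr) > eps, with eps close to 1
```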
2.2 Anterior Face Detection
We notice that the resulting contours are sometimes open. This is because the Canny detector sometimes fails in the presence of insufficient edge strength. This led us to seek another way of representing the vertebrae even when the contour is partially open. The main idea is that a vertebra is an approximately well-defined body, so we do not need the whole structure to measure its mobility; as a result, we decided to use only a portion of it: the anterior face of each vertebra [15]. Another solution would have been to introduce an edge-closing approach to solve the problem of open contours. To proceed, we have therefore added an additional control that limits the expansion process and prevents the segmentation from failing. A vertical line passing through the clicked point P is automatically associated to each vertebra. Each of these lines becomes a limit of expansion in the right direction, as shown in Fig. 6.
Fig. 6. Contour with imaginary line
On this portion of the vertebra, a partial segmentation is performed. Then the segment that represents the anterior face is detected as follows (a rough sketch is given below):
– Given the set of points S belonging to the final DDCM contour, take the point P whose y coordinate is the same as that of the starting point of the DDCM contour.
– Extract the number of points needed to represent the anterior face, taking as reference the number of points in the polyline segment obtained as part of the contour. This extraction is done in both the up and down directions.
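One possible implementation of these two steps is sketched below. The exact selection of the anterior side and of the number of points is not fully specified in the text, so the choices made here (leftmost contour point at the seed height, symmetric extraction along the polyline) are our own.

```python
import numpy as np

def anterior_face(contour, seed_y, n_points):
    """Pick the contour point at the seed's height on the anterior (left) side and
    return n_points vertices above and below it along the closed DDCM polyline."""
    pts = np.asarray(contour, dtype=float)
    at_height = np.where(np.abs(pts[:, 1] - seed_y) <= 1.0)[0]     # same y as seed
    start = at_height[np.argmin(pts[at_height, 0])]                # leftmost of them
    idx = [(start + i) % len(pts) for i in range(-n_points, n_points + 1)]
    return pts[idx]
```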
3 Experiments and Results
In Fig. 7 we show a sequence of the images obtained during the analysis process by our heuristic method. Once all the segments that represent the anterior faces have been found, we calculate the angle of each vertebra, the angular variation between two consecutive vertebrae, and the angular variation of the same vertebra in two different positions. A graphical representation of this variation is shown in Fig. 8-a. It is also necessary to know the general curve, represented by a line which contains all the vertebrae that we are analysing. From the angle measure of each vertebra, the contribution to the total curvature and the angle variation between two consecutive vertebrae can be calculated (Table 1). This information can be used to detect anomalies such as an exaggerated curvature. The analysis can also be carried out for the angular variation of each vertebra between two positions. In Fig. 8-b, we can observe a mobility reduction from the fifth cervical vertebra (C5) to the seventh (C7). It is noted that the head movement is mainly supported by C4 and C3, which is confirmed by the angle variation between the extension and flexion positions. This kind of analysis will help the specialist to interpret the disorders present in the vertebral mobility. We also note that we did not try to segment the two vertebrae C1 and C2, because they are partly embedded in the head and it is very difficult to extract their contours. As described for Fig. 8-b, the numerical data presented in Table 2 confirm the poor mobility in the flexion-extension movement for vertebrae C5, C6 and C7, which indicates a mobility problem from C5 to C7. To check the validity of these measurements, we have compared them with results obtained manually. For the manual values, a group of 15 people was asked to select, with the greatest possible precision, the anterior face of each vertebra, and we then determined the average values (Table 3). This procedure, which would normally be performed by the specialist, may give the best results, but it is time consuming. The comparison was done on several images.
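These angle-based measurements follow from elementary geometry once the anterior-face segments are available. The sketch below reduces each face to a straight segment and measures its inclination with respect to the vertical image axis; the exact angle convention is not stated in the text, so this is only one plausible choice.

```python
import numpy as np

def face_angle(p_top, p_bottom):
    """Inclination (degrees) of an anterior-face segment w.r.t. the vertical axis."""
    dx, dy = p_top[0] - p_bottom[0], p_top[1] - p_bottom[1]
    return np.degrees(np.arctan2(dx, dy))

def curvature_contributions(angles):
    """Per-vertebra contribution to the total curvature, as percentages (Table 1)."""
    a = np.abs(np.asarray(angles, dtype=float))
    return 100.0 * a / a.sum()

# variation between consecutive vertebrae:        np.diff(angles)
# mobility of one vertebra between two positions: angle_position2 - angle_position1
```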
Table 1. Angle and contribution to the curvature of each vertebra; angular variation between two consecutive vertebrae for the neutral (reference), extension and flexion positions

         Reference              Extension              Flexion
         Angle   % Curvature    Angle   % Curvature    Angle   % Curvature
C3       2.5     10.67 %        2.5     3.06 %         26.6    26.62 %
C4/C3    2.3                    6.2                    -6.8
C4       4.8     20.42 %        8.7     12.87 %        19.8    19.84 %
C5/C4    -4.8                   8.2                    -5.8
C5       0.0     0.0 %          16.9    24.9 %         14.0    14.06 %
C6/C5    4.8                    4.5                    2.9
C6       4.8     20.42 %        21.4    31.44 %        16.9    19.96 %
C7/C6    6.5                    -3.0                   5.6
C7       11.3    48.48 %        18.4    27.12 %        22.5    22.52 %
Fig. 7. Cervical spine in neutral position, extension and flexion: (a) edges detected with the Canny edge detector, (b) partial segmentation: the white lines represent the contour parts corresponding to the anterior faces and the lines in black correspond to the remaining parts of the closed contours given by the DDCM method, (c) representation of the curvature for these vertebrae
Table 2. Angle variation between each cervical spine position

      Reference vs Extension   Reference vs Flexion   Extension vs Flexion
C3    0.0                      24.1                   24.1
C4    3.9                      15.0                   11.1
C5    16.9                     14.0                   2.9
C6    16.6                     12.1                   4.5
C7    7.1                      11.2                   4.1
Table 3. Manually determined angle of each vertebra
      Reference   Extension   Flexion
C3    2.6         2.7         26.4
C4    3.5         8.8         21.0
C5    0.0         17.1        13.9
C6    5.1         21.3        16.3
C7    11.0        21.4        22.1
Fig. 8. (a): Angular variation between C4 and C5 in the extension position and angular variation of C4 between the extension and flexion positions; (b): Graphical comparisons between the three spine positions
The angle difference between the partial segmentation results and the manually obtained results is in the range of 0 to 2 degrees, which seems acceptable. During our experiments, the partial segmentation technique has proven to give repeatable and reproducible results on a large set of images. Moreover, the method has been successfully applied to thoracic and lumbar vertebrae (Fig. 9).
4 Conclusion
The aim of this work was to determine the variation between the positions of vertebrae in flexion-extension movements and to measure their mobility. This work relies on two fundamental techniques of computer vision: contour detection and image segmentation. We have implemented the Canny edge detector
Fig. 9. Lumbar spine: (a) Canny edge detector results, (b) partial segmentation, (c) curvature representation
and we have worked with the Discrete Dynamic Contour Model. The edges obtained by Canny’s detector may include holes that hinder the DDCM convergence. Therefore, we have tested both the complete and the partial segmentation, using an additional control that limits the expansion process. We did not work with the total segmentation technique because of the difficulty of always obtaining edge maps containing closed vertebra contours. After the contour segmentation, we extract the anterior face of each vertebra. The vertebral mobility is therefore represented by the angular variation measurements of the anterior faces. This allows us to calculate the angular variations between two consecutive vertebrae within the same image, as well as to measure the angular variation of a vertebra across several images, in particular between three spine positions. The applied techniques have given good results for measuring the mobility of cervical vertebrae, and they were also applied in the dorsal and lumbar regions with a positive outcome. However, future enhancements should include the detection of the superior and inferior faces and the ability to measure more variables than the angular variations, e.g., the intervertebral distances. In future work, we aim to develop a new method for vertebra contour detection based on a template matching process combined with a polar signature contour representation. We also want to investigate the use of other kinds of images, such as volumetric CT.
References 1. Duncan, J.S., Ayache, N.: Medical image analysis: progress over two decades and the challenges ahead. IEEE Transactions on PAMI 22(1) (2000) 2. Tezmol, A., Sari-Sarraf, H., Mitra, S., Long, R., Gururajan, A.: Customized Hough transform for robust segmentation of cervical vertebrae from X-ray images. In: 5th IEEE Symposium on Image Analysis and Interpretation, New Mexico, USA, IEEE, Los Alamitos (2002)
3. Howe, B., Gururajan, A., Sari-Sarraf, H., Long, L.R.: Hierarchical segmentation of cervical and lumbar vertebrae using a customized generalized Hough transform and extensions to active appearance models. In: 6th IEEE Southwest Symposium on Image Analysis and Interpretation, Lake Tahoe, Nevada, USA, IEEE, Los Alamitos (2004) 4. Zheng, Y., Nixon, M.S., Allen, R.: Automated segmentation of lumbar vertebrae in digital videofluoroscopic images. IEEE Transactions on PAMI 23(1) (2004) 5. Long, L.R., Thoma, G.R.: Use of shape models to search digitized spine X-rays. In: 13th IEEE Symposium on Computer-Based Medical Systems, Houston, USA, IEEE, Los Alamitos (2000) 6. Roberts, M.G., Cootes, T.F., Adams, J.E.: Linking sequences of active appearance sub-models via constraints: an application in automated vertebral morphometry. In: BMVC 2003 (2003) 7. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. International Journal of Computer Vision 1(4) (1988) 8. McInerney, T., Terzopoulos, D.: T-snakes: topology adaptive snakes. Medical Image Analysis 4(2) (2000) 9. Sethian, J.: Level Set Methods: Evolving Interfaces in Geometry, Fluid Mechanics, Computer Vision and Materials Sciences. Cambridge Univ. Press (1999) 10. Malladi, R., Sethian, J., Vemuri, B.: Shape modelling with front propagation: a level set approach. IEEE Trans. Pattern Anal. Mach. Intell. 17(2), 158–175 (1995) 11. Mastmeyer, A., Engelke, K., Fuchs, C.: A hierarchical 3D segmentation method and the definition of vertebral body coordinate systems for QCT of the lumbar spine. Medical Image Analysis 10(4), 560–577 (2006) 12. Niessen, W.J., ter Haar Romeny, B.M., Viergever, M.A.: Geodesic deformable models for medical image analysis. IEEE Trans. on Medical Imaging 17(4) (1998) 13. Lobregt, S., Viergever, M.A.: A discrete dynamic contour model. IEEE Transactions on Medical Imaging 14(1) (1995) 14. Canny, J.: A computational approach to edge detection. IEEE Transactions on PAMI 8(6) (1986) 15. Rico, G.: Vertebral mobility analysis using computer vision. An application in osteopathy clinic investigation. Thesis, Faculté Polytechnique de Mons (2002)
Image Processing Algorithms for an Auto Focus System for Slit Lamp Microscopy Christian Gierl, T. Kondo, H. Voos, W. Kongprawechon, and S. Phoojaruenchanachai University of Applied Sciences Ravensburg Weingarten, Germany
[email protected] University of Applied Sciences Ravensburg Weingarten, Germany Sirindhorn International Institute of Technology, Thailand Sirindhorn International Institute of Technology, Thailand National Electronics and Computer Technology Center, Thailand
Abstract. The slit lamp microscope is the most popular ophthalmologic instrument, comprising a microscope with a light source attached to it. The coupling of microscope and light source distinguishes it from other optical devices. In this paper an Auto Focus system is proposed that considers this mechanical coupling and compensates for movements of the patient. It tracks the patient’s eye during the focusing process and applies a robust contrast-measurement algorithm to an area relative to it. The proposed method proved to be very accurate, reliable and stable, even starting from very defocused positions.
1 Introduction
Since its invention in 1911 the slit lamp microscope has become the most important ophthalmologic instrument. The slit lamp - a high intensity light source that is attached to a stereomicroscope - illuminates the eye with focused or diffuse light from different angles, thus permitting the examination of all anterior eye structures [1]. The lamp can be rotated around a vertical axis which is located inside the focal plane of the microscope. The position of the illuminated area of the eye depends on this rotational angle α and the z-position of the lamp (Fig. 1). Focusing of the microscope is achieved by adjusting the z-position of the focal plane and consequently of the lamp. The position of the illuminated area therefore changes during the focusing process. In addition, when moving the microscope toward the eye, the image is magnified and appears brighter because more of the reflected light is captured (Fig. 2). Innovative technologies like mechatronics, digital cameras and communication networks have recently paved the way to telemedicine. At NECTEC, a remote eye diagnostic system has been developed where a photo slit lamp microscope is used to capture images and transmit them via the Internet. Such a device however requires an Auto Focus (AF) system that moves the microscope to the in-focus position. The system has to determine the degree of focus and apply it in automatic
Fig. 1. Movement of slit lamp and microscope during focusing
feedback control, thus controlling the motors. Herein, the main task is the measurement of the actual degree of focus. This paper presents an image processing algorithm that determines the degree of focus using contrast measurement. Our approach especially addresses the changing illumination conditions. Based on an evaluation of several standard methods, a very robust mechanism has been developed and tested. Furthermore, an eye tracking method is proposed that compensates for eye movements of the patient during the focusing process. Finally, both proposed methods are combined into a system that automatically focuses on the patient’s iris.
2 Image Processing for the AF-System

2.1 Autofocusing Using Contrast Measurement
Evaluation criteria for contrast measurement techniques. Autofocusing techniques using contrast measurement compare the contrast of images taken at different z-positions. The contrast in a focused image is higher than in the same blurred image. The z-position is therefore adjusted until the maximum contrast is reached. This is the in-focus position of the image. Contrast can be measured by evaluating image intensity, peaks of the video signal, intensity gradients, image energy ratios or Chebyshev moments [2]-[7].
Fig. 2. Image of one eye taken at far (left) and close (right) distance. With decreasing distance the image gets brighter and magnified, and the illuminated area changes (the patient’s eye is also moving).
All these techniques have in common that they generate a ”focus value” that represents the degree of focus of the image. Up to eight different criteria to evaluate the focus value have been defined [6]. Monotonicity, discrimination power, converging range and robustness with respect to noise are most relevant [5]: The focus value should have only a global maximum. On either side of this peak it should decrease monotonically. Discrimination power means that the peak should be sharp, thus providing a defined maximum position. Converging range addresses the range over which an in-focus image can be attained, i.e. the range in which the slope of the focus value still indicates the maximum position. A measure for the discrimination power is the peak width of the focus curve at 80% of the maximum value (Fig. 3). When comparing different contrast measurement techniques, the method with the smallest width has the highest discrimination power. In compliance with [6] we therefore define:

Discr. power = min. peak width of all methods at 80% / peak width at 80%

A high peak width at 20% of the maximum focus value indicates a good converging range. Analogously we define:

Conv. range = peak width at 20% / max. peak width of all methods at 20%
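Both criteria reduce to measuring peak widths on the sampled focus curve; a minimal discrete version could look as follows (interpolation of the exact crossing points is omitted here for brevity).

```python
import numpy as np

def peak_width(z, f, level):
    """Width of the focus curve f(z) at `level` (e.g. 0.8 or 0.2) times its maximum:
    distance between the outermost z positions still above that level."""
    f = np.asarray(f, dtype=float)
    above = np.where(f >= level * f.max())[0]
    return z[above[-1]] - z[above[0]]

# discrimination power = min(width at 80% over all methods) / width at 80% (this method)
# converging range     = width at 20% (this method) / max(width at 20% over all methods)
```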
Contrast measurement is only applied to a rectangular subarea of the image, called the region of interest (ROI). The location and size of the ROI influence the evaluation criteria. If the ROI includes objects located at different focal distances, any focus value will have several maxima, is no longer monotonic and cannot be evaluated. ROIs with multiple maxima are therefore not considered unless otherwise indicated.
Fig. 3. Focus value of different contrast measurement techniques

Table 1. Test of standard methods

Focus value               Discr. Pow.   Conv. Range   Monotonicity
Variance of intensity     35%           100%          very bad
Sobel (std. deviation)    76%           42%           partially
Sobel (variance)          80%           39%           partially
Laplace (std. deviation)  76%           58%           partially
Laplace (variance)        100%          21%           partially
Intensity and Gradient based methods for Contrast Measurement. The standard AF techniques have been extensively tested and their viability has been proved for standard applications [2]-[7]. However, in the area of slit lamp microscopy no related work has yet been published. We therefore examined intensity and gradient based methods for their applicability in this special case. A series of images has been captured varying the z-position of the microscope. The focus value for each image is calculated and displayed over its z-position (Fig. 3 and Tab. 1). Using the variance of image intensity as focus value is a very popular contrast measurement method [3]. It can be improved further by normalizing the focus value by the mean brightness value of the image [4]. Despite proven monotonicity in standard applications [5], tests with the captured image series showed that both methods are not monotonic and therefore not valid for application in slit lamp microscopy. When illumination is increasing, i.e. when approaching the
eye, the focus value increases because of larger differences in image intensity (Fig. 3). This increase however is not related to the degree of focus. The method is therefore not valid. The differential based methods convolve the intensity image with a first or second derivative mask like the Sobel filter or the Laplacian. The filtered image contains information about the sharpness of edges of the original image. The focus value may then be defined as the standard deviation, variance or squared sum of the filtered image. Tests have been performed using different masks of size up to 7x5 of different types and techniques for calculating the focus value. In Fig. 3, two methods are shown as representatives. In accordance with [5], all differential based methods proved to have very good discrimination power, especially those calculating the variance. However, their converging range is small. They are only partially monotonic because their slope is close to 0 in large areas. Applied to noisy images, the focus value of the methods using the Laplacian varied more than that of the methods using the Sobel filter, which is therefore more resistant to noise.

Evaluation of Proposed Method. The monotonicity of the gradient based methods can be increased by increasing the size of the convolution kernel. In the displayed Sobel filtering a kernel of size 5x5 has been used. Increasing the mask size however implies a very high need for computational power. Using a 7x7 mask instead of a 5x5 mask doubles the computational cost, an 11x11 mask requires 5 times more time. Therefore a different approach has been taken. The image is convolved with a 1x11 vertical mask:

[ -0.4 -0.4 -0.8 -1.2 -2.0 0.0 2.0 1.2 0.8 0.4 0.4 ]^T

The mask generates a weighted average of the first derivative in the vertical direction, with the weight decreasing with increasing distance to the current center position. It emphasizes the capture of large structures like the transition from pupil to iris or from iris to eyeball. The focus value is generated by computing the standard deviation of the filtered image. By including a quite large range of pixels, the degree of focus can be determined even in very defocused images. The focus value is therefore very monotonic (Fig. 4) and has a large converging range. Its discrimination power however is not as strong as that of the method using the Sobel filter. A combination of Sobel and vertical filtering combines the advantages of both methods. The overall focus value f_overall is computed by adding the Sobel filtered focus value f_Sobel and the focus value generated from vertical filtering f_vertical:

f_overall = f_vertical + 0.05 f_Sobel

f_Sobel is generated using a 5x5 mask and computing the variance of the filtered image. Using the squared value (variance) leads to low values in defocused positions and very high absolute values in proximity to the in-focus position. Especially due to the factor 0.05 for Sobel filtering, the vertical filtering is thus dominant in defocused positions, indicating the direction of the maximum. In close proximity to the maximum however, f_Sobel rises strongly and is dominant, indicating the exact in-focus position. There the influence of the vertical filtering is almost negligible.
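A sketch of the combined focus value, assuming an 8-bit gray-level image and OpenCV for the filtering. The text does not specify which derivative direction the 5x5 Sobel mask uses, so the vertical derivative is chosen here for consistency with the 1x11 mask; that detail is an assumption.

```python
import cv2
import numpy as np

VERTICAL_MASK = np.array([-0.4, -0.4, -0.8, -1.2, -2.0, 0.0,
                           2.0,  1.2,  0.8,  0.4,  0.4],
                         dtype=np.float32).reshape(11, 1)

def focus_value(gray, roi):
    """f_overall = f_vertical + 0.05 * f_Sobel inside the region of interest
    roi = (x, y, w, h) of a gray-level image."""
    x, y, w, h = roi
    patch = gray[y:y + h, x:x + w].astype(np.float32)
    f_vertical = cv2.filter2D(patch, cv2.CV_32F, VERTICAL_MASK).std()   # 1x11 mask, std. dev.
    f_sobel = cv2.Sobel(patch, cv2.CV_32F, 0, 1, ksize=5).var()         # 5x5 Sobel, variance
    return f_vertical + 0.05 * f_sobel
```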
Fig. 4. Focus value of proposed contrast measurement technique

Table 2. Test results for proposed contrast measuring method

Focus value        Discr. Pow.   Conv. Range
Sobel (variance)   100%          46%
Vertical           47%           100%
Combined           89%           71%
The resulting focus value, shown in Fig. 4, therefore features both excellent monotonicity and discrimination power. The converging range is very large and the shape of the curve is almost ideal. These excellent characteristics were verified by evaluating 41 test series (Tab. 2).

Application of proposed method in different areas of the image. The proposed contrast measurement method was applied using different ROIs of one image series (with only minor eye movements in this image series). By moving a ROI of size 100x100 pixels over the image and determining the maximum focus value at each position, a network of focused positions has been generated. The surface formed by these in-focus positions is displayed in Fig. 5. The distance between the focal plane and the microscope is constant. If the focused positions have been determined correctly, their network should reflect the actual surface of the eye. The spheric shape of the eyeball is clearly distinguishable. The elevated positions of the reflection on the cornea and of the eyebrows can also be recognized. In proximity to the reflection the surface varies strongly. This is because
Fig. 5. Maximum focus values at different points of the image
the ROI includes part of the reflection and the iris / pupil. The calculation of the in-focus image is then obstructed because the focus value has two maxima (see section 2.1). The reflection should therefore not be included in the ROI. The in-focus positions at the margins of the image are dominated by the border of illumination and therefore not suitable for focusing either.
2.2 Proposed Method for Eye Tracking
When the patient moves his eye during the focusing process, its position on the image changes (Fig. 2). If contrast measurement with a fixed ROI were now applied, different parts of the eye would be compared. The resulting focus values would not be valid. By tracking the center of the eye and adjusting the position of the ROI accordingly, it is ensured that always the same area of the eye is used for contrast measurement.

Proposed feature to be tracked. Locating the center of the eye has successfully been performed by tracking the pupil or iris using edge detection or feature extraction methods [8][9]. However, when applying these standard methods they proved to be unreliable. Two main reasons were identified for their failure: low contrast in very defocused positions obstructs edge detection, and the moving spot of light on the images interferes with the feature recognition. The light from the slit lamp is reflected on the cornea. The reflection is very bright and therefore easy to distinguish from the surrounding area [10]. Even on
Fig. 6. Influence of slit lamp angle and z-position on position of reflection
very blurred images it is clearly visible. Furthermore, its location with respect to the center of the eye is very stable and changing illumination does not obstruct its detection (see the different images in Fig. 2).

Tracking the reflection on the image. The following algorithm is proposed to locate the reflection on the image. The intensity image is thresholded with a very high value and then eroded. The resulting binary image indicates very bright areas (Fig. 7). The reflection is then identified by tracking the closest true value to the center of the image, thus avoiding the tracking of the bright eyeball or other structures.

Estimation of the position of the center of the eye. The center of the reflection on the cornea and the center of the eye are located at the same y-coordinate of the image. This is because the slit lamp and the microscope are located at the same height. The distance between the center of the reflection x_r and the center of the eye x_c on the image depends on two factors (see Fig. 6): it increases with increasing slit lamp angle α and z-position of the microscope z. In addition, it depends on whether the slit lamp is on the right or on the left side of the eye, i.e. on the sign of the slit lamp angle. This is because the camera captures images from the right optical path of the stereomicroscope. The following algorithms have been derived empirically to calculate the position of the center of the eye on the image:

for α > 0:  x_c = x_r − 3α − 0.047 z (α − 8) + 14
for α < 0:  x_c = x_r − 3α − 0.047 z α + 10
for α = 0:  x_c = x_r
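The reflection tracking and the eye-centre estimate can be sketched as follows with OpenCV. The threshold value and the 3x3 structuring element are our own choices (the text only says ”a very high value”); the x_c formulas are transcribed from the expressions above.

```python
import cv2
import numpy as np

def locate_reflection(gray, thr=240):
    """Threshold at a very high gray value, erode, and return the centroid of the
    bright blob closest to the image centre, or None if no blob remains."""
    _, bw = cv2.threshold(gray, thr, 255, cv2.THRESH_BINARY)
    bw = cv2.erode(bw, np.ones((3, 3), np.uint8))
    n, _, _, centroids = cv2.connectedComponentsWithStats(bw)
    if n < 2:                                       # label 0 is the background
        return None
    centre = np.array([gray.shape[1] / 2.0, gray.shape[0] / 2.0])
    blobs = centroids[1:]
    return blobs[np.argmin(np.linalg.norm(blobs - centre, axis=1))]

def eye_centre_x(x_r, alpha, z):
    """Empirical x coordinate of the eye centre from the reflection position x_r,
    the slit lamp angle alpha and the microscope z-position."""
    if alpha > 0:
        return x_r - 3 * alpha - 0.047 * z * (alpha - 8) + 14
    if alpha < 0:
        return x_r - 3 * alpha - 0.047 * z * alpha + 10
    return x_r
```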
Fig. 7. Evaluation of eye tracking and thresholded image (left)
The slit lamp angle has a major influence on x_c. The influence of the z-position is less important. However, it is also dependent on the angle, i.e. its impact on x_c increases with increasing slit lamp angle. The dependency on the lateral position (left or right) of the slit lamp is mainly represented by a different constant part.

Experimental testing of the eye tracking. The tracking performance was evaluated by displaying concentric circles with different radii at the calculated center position of the eye (Fig. 7). The radius of the smallest circle that includes the actual center of the eye determines the distance between these two centers and therefore the error of the calculation. The tracking has been applied to images with different z-positions, slit lamp angles, eyes and types of illumination. The test results are shown in Tab. 3. Most significant are the relative errors, i.e. the range in which the absolute error varies when varying the z-position. The low values indicate that the calculated position is very stable with respect to the center of the eye during the focusing process. When a reflection was present in the image, the detection rate was 100%. In some images no reflection was present. This was detected with sufficient accuracy. However, these images were invalid anyway.
2.3 Combination of Proposed Methods
The first step of every slit lamp examination is to focus on the iris. This procedure has been automated by combining the proposed methods. The tracking algorithm follows the patient’s movements and positions the ROI with respect to the center of the eye. Then the proposed contrast measurement is applied.
Table 3. Test results for tracking performance

Deviation (difference of calculated and real center)   Pixel   mm
Average absolute error                                  16.3    0.40
Maximum absolute error                                  50.0    1.22
Average relative error                                  6.0     0.15
Maximum relative error                                  13.0    0.32

Detection Statistics                          Number   Percentage
Total Images                                  586      100%
Images with no reflection                     136      23.2%
Wrong reflection detected when present        0        0.0%
Reflection detected although not present      10       7.4%
Reflection not detected although present      4        0.9%
In order to avoid the detection of the reflection and the eyelashes the ROI is positioned below the center of the eye (see location of the ROI in Fig. 7) and captures the transition from pupil to iris. This transition is suitable for detection with the vertical filter while the Sobel-filter additionally detects the fine structures of the iris.
3 Conclusion
Each of the proposed methods and their combination proved to be very reliable. In a test using 10 image series the in-focus position could always be detected. The average converging range of the combined method covers almost 72% of the motor range. Focusing is therefore possible even from very defocused positions. The average width at 80% of the maximum value is less than 20% of the converging range. The discrimination power is therefore still very high and the in-focus position can be reproduced very accurately. Finally, it is not affected by changing illumination and therefore proved to be robust. The combination of the two proposed methods is limited to examinations where a reflection is present. The contrast measurement alone however can also be applied without or in combination with a different tracking algorithm.
References 1. Ledford, J.K.: The Slit Lamp Primer, Slack Incorporated, Thorofare, N.J (2006) 2. Boecker, W.: A fast autofocus unit for fluorescence microscopy. Phys. Med. Biol. 42, 1981–1992 (1992) 3. Subbarao, M., Choi, T., Nikzad, A.: Focusing techniques. J. Opt. Eng. 32, 2824– 2836 (1993) 4. Yeo, T.T.E., Ong, S.H., Sinniah, J.R.: Autofocusing for tissue microscopy. Image and Vision Computing 11(10), 629–639 (1993)
5. Chun-Hung, S., Chen, H.H.: Robust Focus Measure for Low Contrast Images. Consumer Electronics, 2006. Digest of Technical Papers 24, 69–70 (2006) 6. Groen, F.C.A.: Comparison of Different Focus Functions for Use in Autofocus Algorithms. Cytometry 6, 81–91 (1985) 7. Yap, P.T., Raveendran, P.: Image focus measure based on Chebyshev moments. IEE Proc.-Vis. Image Signal Process 151(2), 128–136 (2004) 8. Park, Y., et al.: A Fast Circular Edge Detector for the Iris Region Segmentation. In: Bülthoff, H.H., Poggio, T.A., Lee, S.-W. (eds.) BMCV 2000. LNCS, vol. 1811, pp. 417–423. Springer, Heidelberg (2000) 9. Wildes, R.P.: Iris recognition: an emerging biometric technology. Proc. IEEE 85(9), 1348–1363 (1997) 10. Park, K.R., Kim, J.: A Real-Time Focusing Algorithm for Iris Recognition Camera. SMC-C35(3), 441–444 (2005)
Applying Image Analysis and Probabilistic Techniques for Counting Olive Trees in High-Resolution Satellite Images J. Gonzalez, C. Galindo, V. Arevalo, and G. Ambrosio University of Málaga (Spain)
Abstract. This paper proposes a method, integrating image analysis and probabilistic techniques, for counting olive trees in high-resolution satellite images. Counting trees is significant for surveying and inventorying forests, and in certain cases relevant for assessing estimates of the production of plantations, as is the case for olive tree fields. The method presented in this paper exploits the particular characteristics of parcels, i.e. a certain reticular layout and a similar appearance of trees, to yield a probabilistic measure that captures the confidence of each spot in the image being an olive tree. Some promising experimental results have been obtained on satellite images taken from QuickBird.
1 Introduction
Recent years have witnessed a remarkable improvement of the satellites used in remote sensing. Nowadays, commercial satellites like QuickBird, Orbview, or Ikonos provide high-resolution images that open up a promising and challenging field for the automatic detection of terrain features for a variety of purposes. Some examples of this can be found in the literature for detecting and locating human constructions, such as roads, buildings, sport fields, etc. (see [6] for a survey), and geographical features, like coastlines [7], lakes [3], mountains [11], etc. In general, the aim of remote sensing applications is to facilitate (and, insofar as it is possible, automate) monitoring tasks over large areas of terrain, for instance surveying and inventorying forests, which are normally tediously and costly performed by human operators. In this paper we propose an image-processing-based approach for counting trees, in particular olive trees, within a plantation. Counting trees is relevant for two reasons. First, it provides an inventory of the trees in the plantation that may help the farmer to better plan the irrigation or fertilization processes. On the other hand, information about the number of trees of a plantation becomes essential for assessing an estimate of the production, as well as for calculating the value of the field. In fact, the number of trees within parcels has been considered by the Spanish Government, following the European normative (UE law 154/75, 1975), as a basis for granting aid to olive-tree farmers. Typically, the process of counting trees is carried out manually by an operator who has to move around the whole plantation. Sometimes, this tedious chore is
simplified by manually counting the number of trees within a relatively small area (a sample region) and assessing the global amount in the plantation according to its extension, the number of sampled trees, and the tree density measured in the sample region. In both cases, this process is highly prone to errors. Moreover, the active participation of operators, who may falsify the results, causes suspicion about the grants. This paper proposes the integration of different image analysis and probabilistic techniques into a system for counting olive trees in high-resolution satellite images. In such images, olive trees typically appear as dark spots of different sizes and shapes, which may vary largely from one parcel to another. This makes counting processes based on image analysis complex and dependent on several parameters for each parcel. However, in general, olive trees within a particular parcel¹ meet some common characteristics that must be considered in the image analysis process to gain in robustness and reliability: they have almost the same size (but not the same shape) and usually follow a particular reticle (reticular layout). The procedure proposed in this paper takes advantage of these characteristics. Briefly, it first considers a representative portion of the image, given by an operator, where dark spots that fulfill a particular reticular layout are localized by means of a voting scheme. From this procedure we also obtain an estimate of how well each spot fits into that particular reticle: the higher this value, the higher the probability of a spot being an olive tree of the parcel represented by the selected reticle. Secondly, exploiting the similarity of trees within a given parcel (trees are usually planted at the same time and receive the same irrigation and fertilization treatment), a prototype of the typical tree is obtained by processing the olive candidates contained within the representative area given by the operator. The resulting prototype is used to assess the similarity (in size and shade) of each candidate with respect to the prototype by means of Bayesian techniques [4]. The final probability of each candidate being an olive tree will be the joint probability of both: that the spot belongs to the reticle and that it exhibits the same characteristics as the prototype. Although our work focuses on olive trees, it can also be applied to any type of plantation that follows a reticular arrangement. In the literature, only a few works have addressed the problem of counting trees through satellite images [1,2,8]. However, all of them consider a number of parameters which have to be tuned manually for each image, even for each parcel. The main advantage of the method we propose here is that it is highly automated and the participation of human operators is limited to selecting the input parcel within the image to be processed and to validating the obtained results. The structure of this paper is as follows. Section 2 gives a general description of the system. Section 3 delves into the automatic detection of tree candidates and the computation of their probabilities of being olive trees according to the
¹ A parcel is understood here as an olive field where trees were planted at the same time and with the same farming techniques, although it may not coincide with the administrative division.
reticular arrangement of the parcel. Section 4 is devoted to the computation of the prototype within a parcel and the similarity computation of candidates with respect to it. Some experimental results are shown in section 5, and finally conclusions and future work are outlined.
2 Method Description
Following [9], the diameter of olive tree crowns varies between 3 and 8 m, they exhibit a regular circular/ellipsoidal shape, and they usually follow a reticular layout with a separation between trees in the range 6 to 10 m. Figure 1 shows a typical satellite image of an olive field. Therefore, given that trees normally present a similar pattern, i.e. a dark and almost circular spot upon a lighter background, a possible solution for counting trees is to perform pattern matching by correlating such a pattern across the image (as in [12]). However, this solution does not always work well, since it is not clear that a fixed pattern can capture the shape variability of olive tree crowns (even in a single parcel), as shown further on.
Fig. 1. A high-resolution olive tree field taken by the QuickBird satellite. Olive trees appear as small dark spots regularly arranged in a reticle. Though it may seem that they all exhibit a circular shape, there is a large variability due to the irregular growth of their branches.
One approach to detecting trees while accounting for this shape variability is to locate closed contours in the image through typical computer vision techniques, e.g. the Canny edge detector. Although this solution has been adopted in some works [2], it does not prevent other objects within the parcel, like rocks, machinery, buildings, etc., from also being detected as trees. Assuming that olive trees are planted following a reticular structure within the same plantation (which holds for 85% of the Spanish olive fields), the method presented in this paper overcomes these limitations by a two-stage procedure (see figure 2 for a scheme of the method). In short, we first compute the main direction of the reticle of the parcel by processing the layout followed
by the trees contained within a representative portion of the image selected by an operator (around 35 trees in our experiments). This direction is computed by means of a voting scheme which also permits us to assess a probabilistic measure of the probability of a dark spot being an olive tree or not by attending only to its relative location within the computed reticular layout. In a second stage, the set of trees within the selected area is also used to generate a statistical pattern that characterizes the size, shape, and also shade² of the olive trees within the parcel. This pattern, also called prototype, is used to compute the probability density function characterizing the appearance of the tree crowns of that reticle. By combining both estimates for each crown c_i, named P(c_i is aligned) and p(c_i resembles the prototype), the proposed method aims to detect trees with a certain similarity to the prototype and lying in a certain layout within the image, as:

p(c_i is an olive tree) = p(c_i is aligned, c_i resembles the prototype)
                        = P(c_i is aligned) · p(c_i resembles the prototype)   (1)
In (1) we are assuming the independence of the two sources of information. The next sections describe each phase of our method in more detail.
3 Locating Olive Tree Candidates Within the Reticle
In this stage we rely on image processing techniques to locate the “center” of olive tree candidates (dark spots in the image). After that, the main direction of the reticle of the parcel is calculated based on a voting scheme applied to the trees within the representative window selected by an operator.
3.1 Localizing Centroids of Candidates
To locate olive tree candidates, we first compute the closed contours of the image with the Canny operator. Experimentally, we have checked that this operator works well on our images with σ = 0.45. Figure 3 (left) shows the result of this operation on a typical olive field. Since at this stage we are not concerned with the shape of the trees, mainly because of their variability, but with their localization, we compute the centroid of each found contour through the chamfer distance transform [5]. The result is a set of points (figure 3 (right)) that localize the centers of the tree crowns (typically near their trunks).
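A minimal OpenCV sketch of this stage (OpenCV ≥ 4 is assumed for the findContours signature). The Canny thresholds are illustrative, since the text only reports the smoothing parameter σ = 0.45, and the centroid is approximated here by the innermost point of each filled contour, i.e. the maximum of its distance transform.

```python
import cv2
import numpy as np

def candidate_centroids(gray):
    """Detect dark-spot candidates: Canny edges, closed contours, one centroid each."""
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for c in contours:
        mask = np.zeros(gray.shape, np.uint8)
        cv2.drawContours(mask, [c], -1, 255, thickness=-1)      # fill the contour
        dist = cv2.distanceTransform(mask, cv2.DIST_L2, 3)      # chamfer-like distance
        y, x = np.unravel_index(np.argmax(dist), dist.shape)
        centroids.append((int(x), int(y)))
    return centroids
```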
3.2 Detecting Candidates Within a Given Reticle
The results from the previous step are largely prone to false positives. On the one hand, elements in the field, like rocks or machinery, and shadows on the terrain
² We only use gray-scale images, which provide good results. The use of color images has not improved our results, since trees, especially olive trees, exhibit almost the same color in satellite images. Other sources of information, like infrared images (not considered in our work), could be employed to assess the vegetation rate [10].
Fig. 2. The proposed method. Initially, an operator selects a representative window of the image, from which the main orientation of the reticular layout and a prototype of the trees is computed. This information is used to probabilistically characterize a tree in the reticle and from that, to look for the rest of candidates.
Fig. 3. Locating olive tree candidates. a) Result of the Canny operator. b) Centroids computation for each contour.
may give rise to contours similar in size to those of the olive trees. On the other hand, crown shapes may induce the detection of more than one contour, and thus more than one centroid. For these reasons, we exploit the common characteristic of olive tree plantations of being arranged in a reticular structure (see fig. 4).
Fig. 4. The reticular arrangement of trees
In this reticular arrangement, each tree forms a certain angle, φ, with its neighbors, which repeats at increments of 45°. In our approach we rely on a voting scheme in which the centroids of the trees selected by the operator vote for a certain angle φ* if they form an angle φ = n · 45° + φ* with a close neighbor. Obviously, centroids are not perfectly aligned and, thus, we account for a certain tolerance in the computation of that angle. Concretely, we divide the angle range [0, 45°] into 18 buckets of 5°, which becomes the permitted angular interval to decide that two trees are aligned. Consecutive buckets overlap by 2.5°, i.e. [0, 5], (2.5, 7.5], (5, 10], (7.5, 12.5], ..., to permit angles that fall at the limit of a bucket to also vote for the adjacent one. The bucket which receives the maximum number of votes, B_w, characterizes the orientation of the reticle (see fig. 5). Once the reticle orientation is computed, a probabilistic measure for all the trees within the parcel is calculated, taking into account how well their centroids fit on it. To do this, we repeat the voting process, calculating for each candidate centroid the proportion of its votes for B_w with respect to the sum of all its votes.
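A simplified version of the voting scheme is sketched below. The neighbourhood radius used to decide which centroids are “close neighbours” is not given in the text, so it appears here as a free parameter.

```python
import numpy as np

def reticle_votes(centroids, max_dist, n_buckets=18, width=5.0, step=2.5):
    """Each centroid votes, for every close neighbour, for the overlapping 5-degree
    buckets containing their angle folded into [0, 45). Returns the per-centroid
    vote table and the index of the winner bucket B_w."""
    pts = np.asarray(centroids, dtype=float)
    votes = np.zeros((len(pts), n_buckets))
    for i, p in enumerate(pts):
        for j, q in enumerate(pts):
            d = q - p
            if i == j or np.hypot(d[0], d[1]) > max_dist:
                continue
            phi = np.degrees(np.arctan2(d[1], d[0])) % 45.0
            for b in range(n_buckets):
                if b * step <= phi <= b * step + width:     # [0,5], (2.5,7.5], (5,10], ...
                    votes[i, b] += 1
    return votes, int(np.argmax(votes.sum(axis=0)))

# P(c_j is aligned) = votes[j, winner] / votes[j].sum()     (Eq. 2)
```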
Fig. 5. Voting process. Angles between the centroids of the representative window are computed and grouped in buckets. The most voted one (in this example B1) represents the main orientation of the reticle. Then, the process is repeated for the rest of the centroids, assessing their votes for the winner bucket with respect to their votes for the others.
This ratio is taken as an estimate of the probability of the membership of each tree candidate c_j to a given reticle, that is:

P(c_j \text{ is aligned with the reticle}) = \frac{votes^j_{B_w}}{\sum_{i=1}^{\#\,buckets} votes^j_{B_i}}   (2)

4 Classification of Candidates as Olive Trees
The aim of this phase is to discard candidates that, even though they belong to the reticular arrangement of the field, do not match the olive appearance of the parcel. Olive trees within a parcel normally share some common characteristics, like their color or size, but not the same shape, which may exhibit great variability. To capture this shape variability we rely on the computation of a tree prototype based on statistical measures (mean and variance). The distance from a
candidate tree to this prototype will give us the likelihood of that candidate being an olive tree based on its appearance.
4.1 Computation of the Olive Tree Prototype
The olive tree prototype for a given parcel is calculated according to the characteristics of the representative olive trees selected by the operator. To compute that prototype, an image window centered at each centroid is considered. The size (k) of these windows should be large enough to contain the tree crown and also part of the terrain (whose color is almost constant within parcels). The size of this window (typically around 15 × 15 pixels in our experiments) is automatically calculated according to the average area of the representative contours and their relative distance within the reticle. The prototype is then characterized by a k²-dimensional mean vector (μ) and a k² × k²-dimensional covariance matrix (Σ) of the pixel gray-levels in the windows, computed as follows. Let the m representative candidates be:

c_{r_i} = [I(a_i, b_i : b_i+k-1) \;\; I(a_i+1, b_i : b_i+k-1) \;\; \ldots \;\; I(a_i+k-1, b_i : b_i+k-1)]^T, \quad i = 1, \ldots, m   (3)

where a_i, b_i are the upper-left corners of the windows centered at the centroids of the candidates c_{r_i}. The μ vector and the covariance matrix are calculated as:

\mu = \frac{1}{m} \sum_{i=1}^{m} c_{r_i}, \qquad \Sigma = \frac{1}{m} \sum_{i=1}^{m} (c_{r_i} - \mu)(c_{r_i} - \mu)^T   (4)
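In code, building the prototype amounts to stacking the k x k windows as vectors and taking their sample mean and covariance, as sketched below (windows falling partly outside the image are assumed not to occur here).

```python
import numpy as np

def olive_prototype(gray, centroids, k):
    """Mean vector mu and covariance Sigma of Eqs. (3)-(4) from the k x k windows
    centered at the representative centroids (x, y)."""
    patches = []
    for (x, y) in centroids:
        a, b = int(y) - k // 2, int(x) - k // 2           # upper-left corner
        patches.append(gray[a:a + k, b:b + k].astype(float).ravel())
    C = np.array(patches)                                 # m x k^2 matrix
    mu = C.mean(axis=0)
    Sigma = (C - mu).T @ (C - mu) / len(C)                # k^2 x k^2, Eq. (4)
    return mu, Sigma
```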
Note that μ captures the mean gray-level of pixels of trees, and thus, their mean shape, but does not consider the high variability caused by their branches, and thus techniques based on template matching [12] are not suitable here. This variability is captured by the covariance matrix Σ: lower variance indicates low variability in the gray-level of the corresponding pixel. In figure 6 these measures are illustrated by depicting μ and Σ as images for a better understanding. For that, the k² elements of μ and of the diagonal of Σ have been orderly placed forming two k × k images. In those images, note that the mean shape of the representative candidates is almost circular, and that the representation of the diagonal Σ contains dark pixels (low variance) in the center part that account for the center of tree crowns but high values (large variability) around it, capturing the variability of tree shapes. The portion that contains part of the ground also presents a low variability because of the similarity of the terrain color within parcels.
4.2 Measuring Similarity to the Prototype
Using the prototype characterized by μ and Σ, we estimate the similarity of a candidate c_i, given by a k × k window centered at a contour centroid, through the Gaussian probability density function:

p(c_i) = \frac{1}{(2\pi)^{k^2/2}\, |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(c_i - \mu)^T \Sigma^{-1} (c_i - \mu)}   (5)
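Evaluating Eq. (5) directly is numerically delicate, since with only a handful of representative windows the k² x k² covariance is rank-deficient; the small diagonal regularization in the sketch below is therefore our addition and not part of the original formulation.

```python
import numpy as np

def resemblance(window, mu, Sigma, reg=1e-3):
    """Gaussian likelihood of Eq. (5) for a k x k candidate window."""
    c = np.asarray(window, dtype=float).ravel()
    S = Sigma + reg * np.eye(len(mu))                     # regularized covariance
    diff = c - mu
    _, logdet = np.linalg.slogdet(S)
    log_p = -0.5 * (len(mu) * np.log(2.0 * np.pi) + logdet
                    + diff @ np.linalg.solve(S, diff))
    return np.exp(log_p)
```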
Fig. 6. Prototype computation. Note the differences in the shape of the candidates. This high variability is captured by the mean vector μ and the covariance matrix Σ. M(i,j) shows an image that represents the values of μ, while E(i,j) shows the diagonal of Σ, for which dark values indicate low variability (the center of the tree crowns) and lighter values high variability (the shape of the branches).
This likelihood measure can be considered as an estimate of p(c_i resembles the prototype): the higher the similarity of the candidate c_i to the prototype characterized by μ and Σ, the higher the value of p(c_i).
4.3 Classifying the Candidates
Finally, in order to decide whether a candidate c_i is an olive tree, we set a minimum threshold for its joint probability. This threshold value is taken as the lowest value of p(x) yielded by the representative trees. That is, c_i is considered an olive tree iff:

p(c_i \text{ is olive tree}) = P(c_i \text{ is aligned}) \cdot p(c_i) \geq \tau, \quad \text{where } \tau = \min(p(x),\; x \in \{c_{r_1}, \ldots, c_{r_m}\})   (6)
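The decision rule of Eq. (6) is then a one-liner over the two probability estimates. The text leaves slightly open whether τ is computed from the representatives’ joint scores or from their resemblance values alone; the joint score is used in this sketch.

```python
def classify(candidates, representatives):
    """Eq. (6): a candidate is accepted if its joint score P_aligned * p_resemblance
    reaches tau, the minimum joint score among the representative trees.
    Both arguments are lists of (P_aligned, p_resemblance) pairs."""
    tau = min(pa * pr for pa, pr in representatives)
    return [pa * pr >= tau for pa, pr in candidates]
```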
4.4 Experimental Results
Our method has been tested with panchromatic QuickBird images (0.6 meter/pixel spatial resolution) of a region in the south of Spain. We have considered images of parcels containing, on average, around 2000 trees of different varieties, sizes, and reticle orientations. The method has been implemented in C++ using the image processing library OpenCV [13]. Our implementation has been integrated as an extension of the commercial package ESRI ArcView, a GIS software package commonly used by the remote sensing community. Figure 7 shows two snapshots of the application.
Fig. 7. Two snapshots of ESRI ArcView running the olive tree counting software. In the figures, the process has been limited to a particular administrative area selected by the user.
In order to test the suitability of the proposed method we have compared its results to the number of trees visually counted by an operator from color aerial orthophotos. In this comparison we have differentiated false positives (FP) and false negatives (FN). A candidate is said to be a FP if it is erroneously detected as an olive tree and a FN if it is erroneously detected as a non-olive tree. For three of our test images we have obtained the results shown in Table 1.

Table 1. Some results of our method

Number of olive trees (Ground Truth)   Detected Trees    FP           FN
2324                                   2293 (98.66%)     10 (0.43%)   21 (0.90%)
2109                                   2072 (98.24%)     15 (0.71%)   22 (1.04%)
2549                                   2530 (99.25%)     11 (0.43%)   8 (0.31%)
Although the resulting figures of our method are promising, it still generates a number of false positives/negatives. They are mainly produced because our main assumption about the characteristics of parcels (reticular layout and similarity of tree sizes) is not always met. Concretely, FN are due to the presence of candidates misaligned with respect to the reticle, since sometimes farmers plant trees out of the reticle to make the best use of the space at the limits of their parcels (as shown in figure 8a-left). In other cases, a tree within the reticle needs to be cut and replanted, being then smaller than the rest (see figure 8a-right). In both cases, the joint probability falls below the considered threshold because the candidate deviates significantly from the representative ones. Regarding FP, candidates occasionally appear that, even though they fulfil the imposed requirements to be an olive tree, actually are not. This is the case illustrated in figure 8b, where there is a small orchard that contains trees with the same characteristics as the prototype and reticular layout of the parcel.
Fig. 8. Examples of misleading results. a) Two cases of false negatives yielded by the method: left) FN due to an olive tree misaligned with the reticle; right) FN caused by an olive tree largely different from the rest of the parcel. b) Example of a false positive. Trees (or objects in general) in the image fulfilling the requirements of size and reticular arrangement of a parcel will be detected, although, as in this case, they can be trees in a nearby orchard.
5 Conclusions and Future Works
This paper has presented a probabilistic image-based method for counting olive trees in high-resolution satellite images. The proposed procedure takes into account the inherent characteristics of olive tree fields: the reticular layout of trees and their similar size (but not shape). Our method has been implemented and tested in several images of the South of Spain taken from the QuickBird satellite with promising results. In the near future we plan to test our method with color and aerial images in order to improve the results.
Acknowledgments. DigitalGlobe QuickBird imagery used in this study is distributed by Eurimage, SpA. (www.eurimage.com) and provided by Decasat Ingenieria S.L. (www.decasat.com). This work was partly supported by the Spanish Government under research contract DPI2005-01391.
References 1. Blazquez, C.H.: Computer-Based Image Analysis and Tree Counting with Aerial Color Infrared Photography. Journal of Imaging Technology 15(4), 163–168 (1989) 2. Brandtberg, T., Walter, F.: Automated delineation of individual tree crown in high spatial resolution aerial images by multiple-scale analysis. Machine Vision and Applications 11, 64–73 (1998) 3. Firestone, L., Rupert, S., Olson, J., Mueller, W.: Automated Feature Extraction: The Key to Future Productivity. Photogrammetric Engineering and Remote Sensing 62(6), 671–674 (1996)
4. Gonzalez, R.C., Woods, R.E.: Digital image processing. Addison-Wesley, Reading, Mass (1987) 5. Butt, M.A., Maragos, P.: Optimal design of chamfer distance transforms. IEEE Transactions on Image Processing 7, 1477–1484 (1998) 6. Gruen, A., Baltsavias, E.P., Henricsson, O.: Automatic Extraction of Man-Made Objects from Aerial and Space Images (II). Birkhäuser Verlag, Basel (1997) 7. Karantzalos, K.G., Argialas, D., Georgopoulos, A.: Towards coastline detection from aerial imagery. In: Int. Conf. of Image and Signal Processing for Remote Sensing VII, Crete, Greece 8. Karantzalos, K.G., Argialas, D.: Towards Automatic Olive Tree Extraction from Satellite Imagery. Geo-Imagery Bridging Continents. XXth ISPRS Congress, July 12-23, 2004, Istanbul, Turkey (2004) 9. Kay, S., Leo, O., Peedell, S.: Computer-assisted recognition of Olive trees in digital imagery. In: ESRI User Conference, July 27-31, 1999, San Diego (1999) 10. Ko, C.C., Lin, C.S., Huang, J.P., Hsu, R.C.: Automatic Identification of Tree Crowns in Different Topology. In: Proc. of Visualization, Imaging, and Image Processing, Benidorm, Spain (2005) 11. Strozzi, T., Kaab, A., Frauenfelder, R., Wegmuller, U.: Detection and monitoring of unstable high-mountain slopes with L-band SAR interferometry. Geoscience and Remote Sensing Int. Symp., pp. 1852–1854 (July 21-25, 2003) 12. Pollock, R.J.: The automatic recognition of individual trees in aerial images of forest based on a synthetic tree crown image model. In: 1st International Airborne Remote Sensing Conference and Exhibition, France (1996) 13. Open Source Computer Vision Library. http://www.sourceforge.net/projects/opencvlibrary
An Efficient Closed-Form Solution to Probabilistic 6D Visual Odometry for a Stereo Camera

F.A. Moreno, J.L. Blanco, and J. González

Department of System Engineering and Automation, University of Málaga, Spain
[email protected], {jlblanco,jgonzalez}@ctima.uma.es
Abstract. Estimating the ego-motion of a mobile robot has traditionally been achieved by means of encoder-based odometry. However, this method presents several drawbacks, such as the existence of accumulative drift, its sensitivity to slippage, and its limitation to planar environments. In this work we present an alternative method for estimating the incremental change in the robot pose from images taken by a stereo camera. In contrast to most previous approaches to 6D visual odometry, based on iterative, approximate methods, we propose here to employ an optimal closed-form formulation which is more accurate and efficient, and does not exhibit convergence problems. We also derive the expression for the covariance associated to this estimation, which enables the integration of our approach into vision-based SLAM frameworks. Additionally, our proposal combines highly-distinctive SIFT descriptors with the fast KLT feature tracker, thus achieving robust and efficient execution in real time. To validate our research we provide experimental results for a real robot.
1 Introduction

Odometry is one of the most widely used means for estimating the motion of a mobile robot. Traditionally, odometry is derived from encoders measuring the revolutions of the robot's wheels, thus providing information for estimating the change in the robot pose. Unfortunately, the usage of encoder-based odometry is limited to wheeled robots operating on planar surfaces, and systematic errors such as drift, wheel slippage, and uncontrolled differences in the robot's wheels induce incremental errors in the displacement estimation, which cannot be properly modelled by a zero-mean Gaussian distribution. This erroneous assumption about the encoder-based odometry errors is accepted in most probabilistic filters for robot localization and SLAM [15], and may eventually lead to the divergence of the filter estimation. In order to overcome the limitations of encoder-based odometry, other non-proprioceptive sensors such as laser sensors [4, 14] and, more recently, vision-based systems [1, 16] have been used in the last years. The proper performance of laser sensors is also limited to purely planar motions, whereas vision-based odometry exploits the advantages of the wider field-of-view of cameras. Nowadays, cameras are cheap and ubiquitous sensors capable of collecting huge amounts of information from the environment. The existence of powerful methods for extracting and tracking significant features from images, along with the above-mentioned advantages of cameras, establishes a propitious framework for applying vision to ego-motion estimation.
Regarding this topic, several approaches have been proposed in the technical literature which apply different methods for estimating the displacement of a vision-equipped mobile robot from a sequence of images taken along its navigation through the environment. The work in [10] reports both a monocular and a stereo visual odometry system based on iterative methods for estimating the 3D change in robot pose, while [1] performs monocular visual odometry with uncalibrated consumer-grade cameras under the assumption of purely planar motion. In [13], a probabilistic method is presented for performing SLAM which uses visual odometry as the robot motion model. This approach looks for sets of features in the stereo images and computes their SIFT descriptors in order to establish correspondences. The camera motion is subsequently estimated using an iterative optimization algorithm which minimizes the re-projection error of the 3D points. In this paper we propose a new approach to visual odometry by estimating incremental changes in the 6D (yaw, pitch, roll, x, y, z) robot pose between consecutive stereo images. Our method estimates the complete set of angles and translations, thus there are no constraints on the potential movements of the camera as in other approaches like [2]. Our algorithm combines the speed of the Kanade-Lucas-Tomasi detector and tracker [12] with the selectivity of SIFT descriptors [8] to match features in the stereo images. Since SIFT-based stereo matching is only carried out when the number of distinctive points in the tracker falls below a given threshold, we avoid the high computational cost involved in computing and comparing the Euclidean distance between SIFT descriptors for all the features in each pair of stereo images. Another advantage of our approach over previous works is the application of a closed-form solution to estimate the changes in orientation and translation, avoiding both the complexity and the divergence problems of iterative methods. Moreover, we model the uncertainty of the pose estimate by propagating the uncertainty in the 3D positions of the observed points. The rest of the paper is organized as follows: Section 2 presents a brief outline of our proposed method for performing visual odometry, which is described in more detail in Section 3. In Section 4 we provide some experimental results, whereas Section 5 presents some conclusions and future work.
2 Method Overview

Our proposed method, depicted in Fig. 1, can be summarized by the following stages:
1. Searching for a set of interest features in a first pair of stereo images, and computation of their corresponding SIFT descriptors.
2. Stereo matching based on the Euclidean distance between descriptors and epipolar geometry restrictions.
3. Projection into 3D space of the matched features, therefore obtaining a set of three-dimensional points with coordinates relative to the current robot pose.
4. Tracking the features in the next pair of stereo images. Notice that this tracking allows us to avoid a new SIFT-based matching step.
5. These tracked features are projected into 3D space, yielding a new set of three-dimensional points with known correspondences to the previous set of 3D points.
6. Robot (camera) 6D pose estimation through a closed-form solution of the absolute orientation problem [6], given the correspondences between the two sets of 3D points.
7. If the number of tracked features falls below a certain threshold, new features are searched in the stereo images and their SIFT descriptors computed. Subsequently, they are matched according to their descriptors and added to the current set of points.
8. Repeat from step 4.
A full detailed description of all the steps of our method is presented in the next section.
Fig. 1. A schematic representation of the proposed method
3 Detailed Description of the Method

This section presents a detailed description of the different operations involved in our proposed algorithm for performing visual odometry.

3.1 Extraction and Matching of Reliable Features from Stereo Images

Several methods have been proposed in the literature for extracting interest points from images, such as the well-known detectors of Kitchen & Rosenfeld [7] and Harris [5], based on the first- and second-order derivatives of images, respectively. More recently, the SIFT detector proposed by Lowe [8] deals with this problem by identifying local extrema in a pyramid of Difference of Gaussians (DoG). It also provides the detected features with a descriptor that exhibits invariance to rotation and scale, and partial invariance to lighting changes and affine distortions. In our work, the detection
of interest points in the images is carried out by the method proposed by Shi and Tomasi [12]. In addition, their corresponding SIFT descriptors are also computed to make them sufficiently distinguishable and to improve the robustness of the matching process. Once a set of keypoints has been detected in the left and right images, they are robustly matched according to both the similarity of their descriptors and the restrictions imposed by the epipolar geometry. More precisely, for the former restriction, the Euclidean distance between the descriptor of each keypoint in the left image and those of the keypoints in the right image is computed. For a pair of keypoints to be considered as a candidate match, their descriptors must fulfill two conditions: they must be similar enough (their distance below a certain threshold) and different enough from the other candidates (their distance above a certain threshold). Moreover, the points must fulfill the epipolar constraint, i.e. they have to lie on the conjugate epipolar lines (or be close enough to them). In a stereo vision system with parallel optical axes, as the one we use here, the epipolar lines are parallel and horizontal, thus the epipolar constraint reduces to checking that both features are in the same row. Finally, each pair of matched features is assigned a unique ID which will be used to identify the point projected from their image coordinates in subsequent time steps.
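To make the matching rule above concrete, the following sketch (ours, not the authors' implementation) pairs left and right keypoints using the two descriptor-distance conditions and the same-row epipolar test; the threshold values and the row tolerance are illustrative assumptions.

```python
import numpy as np

def match_stereo_features(kps_left, desc_left, kps_right, desc_right,
                          max_dist=0.6, min_second_dist=0.8, row_tol=1.0):
    """Match keypoints using SIFT descriptor distances and the (same-row) epipolar constraint.

    kps_*  : (N, 2) arrays of (column, row) image coordinates.
    desc_* : (N, 128) arrays of SIFT descriptors.
    The thresholds are illustrative, not the values used in the paper.
    """
    matches = []
    for i, (kp_l, d_l) in enumerate(zip(kps_left, desc_left)):
        # Epipolar constraint for a rectified pair: candidates must lie on (almost) the same row.
        same_row = np.abs(kps_right[:, 1] - kp_l[1]) <= row_tol
        if not np.any(same_row):
            continue
        cand_idx = np.where(same_row)[0]
        dists = np.linalg.norm(desc_right[cand_idx] - d_l, axis=1)
        order = np.argsort(dists)
        best = dists[order[0]]
        second = dists[order[1]] if len(order) > 1 else np.inf
        # Similar enough to the best candidate, and different enough from the other candidates.
        if best < max_dist and second > min_second_dist:
            matches.append((i, cand_idx[order[0]]))  # the pair's unique ID can be its index here
    return matches
```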
3.2 Projection into 3D Space

Once the features have been robustly matched, the coordinates of their corresponding 3D points are estimated from their coordinates on the images and the intrinsic parameters of the stereo system. Formally, let (c, r) be the image coordinates of a feature in the left image (which will be taken as the reference one) and d the disparity of its conjugate feature in the right one. Then, the 3D coordinates (X, Y, Z) of the projected point are computed as:

\[
X = (c - c_0)\,\frac{b}{d}, \qquad Y = (r - r_0)\,\frac{b}{d}, \qquad Z = f\,\frac{b}{d} \tag{1}
\]
where (c0, r0) are the image coordinates of the principal point in the reference image, b is the baseline of the stereo system, and f stands for the identical focal length of the cameras. The errors in the measured variables c, r, and d are usually modeled as uncorrelated zero-mean Gaussian noises [9]. By using first-order error propagation to approximate the distribution of the variables in (1) as multivariate Gaussians, we obtain the following covariance matrix:
\[
\Sigma = \begin{pmatrix}
\sigma_X^2 & \sigma_{XY} & \sigma_{XZ} \\
\sigma_{XY} & \sigma_Y^2 & \sigma_{YZ} \\
\sigma_{XZ} & \sigma_{YZ} & \sigma_Z^2
\end{pmatrix}
= J\,\mathrm{diag}\!\left(\sigma_c^2, \sigma_r^2, \sigma_d^2\right) J^T \tag{2}
\]
where J stands for the Jacobian matrix of the functions in (1), and $\sigma_X^2$, $\sigma_Y^2$, $\sigma_Z^2$, $\sigma_c^2$, $\sigma_r^2$, and $\sigma_d^2$ are the variances of the corresponding variables. Expanding (2) we come up with the following expression for Σ:
\[
\Sigma = \begin{pmatrix}
\dfrac{b^2\sigma_c^2}{d^2} + \dfrac{b^2 (c-c_0)^2 \sigma_d^2}{d^4} &
\dfrac{(c-c_0)(r-r_0)\, b^2 \sigma_d^2}{d^4} &
\dfrac{(c-c_0)\, b^2 \sigma_d^2\, f}{d^4} \\[2mm]
\dfrac{(c-c_0)(r-r_0)\, b^2 \sigma_d^2}{d^4} &
\dfrac{b^2\sigma_r^2}{d^2} + \dfrac{b^2 (r-r_0)^2 \sigma_d^2}{d^4} &
\dfrac{(r-r_0)\, b^2 \sigma_d^2\, f}{d^4} \\[2mm]
\dfrac{(c-c_0)\, b^2 \sigma_d^2\, f}{d^4} &
\dfrac{(r-r_0)\, b^2 \sigma_d^2\, f}{d^4} &
\dfrac{f^2 b^2 \sigma_d^2}{d^4}
\end{pmatrix} \tag{3}
\]
which approximately models the uncertainty in the 3D coordinates of points computed from the noisy measurements of a stereo system. Finally, to distinguish it from the rest of the projected points, each 3D point is assigned the unique ID of the matched pair of image features from which it was generated.
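As an illustration of equations (1)-(3) (a sketch under our own assumptions, not the authors' code), the following function computes a 3D point and its covariance by propagating the pixel and disparity noise through the Jacobian of equation (2); expanding the product symbolically yields exactly the matrix in (3). The noise standard deviations are placeholder values.

```python
import numpy as np

def project_with_covariance(c, r, d, c0, r0, b, f, sigma_c=1.0, sigma_r=1.0, sigma_d=1.0):
    """3D point (eq. 1) and its covariance (eqs. 2-3) from pixel (c, r) and disparity d.
    sigma_* are illustrative standard deviations for the column, row and disparity noise."""
    X = (c - c0) * b / d
    Y = (r - r0) * b / d
    Z = f * b / d
    # Jacobian of (X, Y, Z) with respect to (c, r, d), as used in equation (2).
    J = np.array([[b / d, 0.0,   -(c - c0) * b / d**2],
                  [0.0,   b / d, -(r - r0) * b / d**2],
                  [0.0,   0.0,   -f * b / d**2       ]])
    Sigma = J @ np.diag([sigma_c**2, sigma_r**2, sigma_d**2]) @ J.T
    return np.array([X, Y, Z]), Sigma
```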
3.3 Tracking Features

In successive stereo frames, the detected features are tracked using the well-known Lucas-Kanade-Tomasi method [12] in order to determine their coordinates in the new pair of stereo images. This method computes the optical flow of a pixel in two consecutive images by minimizing the difference between the surrounding windows using a Newton-Raphson method. The correct tracking of a pair of matched features in the left and right images at time k yields another matched pair of features in the stereo images at time k+1. At this point, the epipolar constraint is checked to detect improperly tracked features and, hence, to avoid the presence of unreliable matched pairs. By using this tracking process, we avoid both the search for features and the SIFT-based stereo matching at the new camera pose. Thus, this method speeds up the process of extracting and matching features and, consequently, the computational burden of the whole visual odometry procedure is considerably reduced. The resulting set of tracked features is also projected into 3D space following the method described in Section 3.2, yielding a new set of 3D points which keep their IDs from the image features in order to maintain an implicit matching relationship with the points in the previous set. If the number of tracked features falls below a threshold, the algorithm searches for new features in the images to maintain a proper amount of elements in the 3D point sets.
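A minimal sketch of this tracking step, assuming the OpenCV implementation of the pyramidal KLT tracker (the paper does not specify an implementation, and the parameter values below are our own choices):

```python
import cv2
import numpy as np

def track_stereo_features(prev_l, curr_l, prev_r, curr_r, pts_l, pts_r, row_tol=1.0):
    """Track matched feature pairs from frame k to k+1 with the KLT tracker and
    keep only the pairs that still satisfy the (same-row) epipolar constraint.
    Images are 8-bit grayscale; pts_l, pts_r are float32 arrays of shape (N, 1, 2).
    row_tol is an assumed tolerance, not a value from the paper."""
    lk_params = dict(winSize=(15, 15), maxLevel=3,
                     criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    new_l, st_l, _ = cv2.calcOpticalFlowPyrLK(prev_l, curr_l, pts_l, None, **lk_params)
    new_r, st_r, _ = cv2.calcOpticalFlowPyrLK(prev_r, curr_r, pts_r, None, **lk_params)
    ok = (st_l.ravel() == 1) & (st_r.ravel() == 1)
    # Epipolar check: the tracked left/right features of a pair must stay on the same image row.
    ok &= np.abs(new_l[:, 0, 1] - new_r[:, 0, 1]) <= row_tol
    return new_l[ok], new_r[ok], np.where(ok)[0]  # surviving points and their original indices (IDs)
```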
3.4 Probabilistic Estimation of the Pose Change

In this section we present a method for estimating the probability distribution of the change in the robot pose between two time steps from the sets of 3D points determined as described above. Formally, let X_k be the set of 3D points obtained at time k:

\[
X_k = \left\{\mathbf{X}_k^i\right\}_{i=1\ldots N} \tag{4}
\]
where the position of each 3D point $\mathbf{X}_k^i$ is assumed to follow a Gaussian distribution with mean $\mu_{\mathbf{X}_k^i} = \left(X_k^i, Y_k^i, Z_k^i\right)$ and covariance $\Sigma_{\mathbf{X}_k^i}$ determined by equations (1) and (3), respectively:

\[
\mathbf{X}_k^i \sim N\!\left(\mu_{\mathbf{X}_k^i}, \Sigma_{\mathbf{X}_k^i}\right) \tag{5}
\]
At this point, we define q_{k,k+1} as the random variable which models the pose change between time steps k and k+1 as a function of the sets of projected 3D points X_k and X_{k+1}:

\[
\mathbf{q}_{k,k+1} = f\left(X_k, X_{k+1}\right); \qquad \mathbf{q}_{k,k+1} \sim N\left(\mu_q, \Sigma_q\right) \tag{6}
\]
Under a linear approximation of error propagation, q_{k,k+1} follows a Gaussian distribution with covariance matrix Σ_q and mean μ_q = (Δx, Δy, Δz, Δα, Δβ, Δγ), where Δx, Δy, and Δz are the increments in the X, Y, and Z coordinates, respectively, and Δα, Δβ, and Δγ stand for the increments in the yaw, pitch, and roll angles, respectively.

3.4.1 Estimation of the Mean Value μ_q

In this paper, we propose to compute μ_q through the method reported by Horn in [6], which derives a closed-form solution to the least-squares problem of finding the relationship between two coordinate systems using the measurements of the coordinates of a number of points in both systems. We use the mean values $\mu_{\mathbf{X}_k^i}$ of the positions of the 3D points as the inputs to this algorithm. This closed-form solution is in contrast to other proposals for visual odometry based on iterative methods [11, 13], which require an initial estimation of the change in pose. The closed-form solution can be summarized as follows:
1. Compute the centroids (c_k and c_{k+1}) of the two sets of points and subtract them from their coordinates in order to deal only with coordinates relative to their centroids: $\bar{\mathbf{X}}_k^i = \left(\bar{X}_k^i, \bar{Y}_k^i, \bar{Z}_k^i\right)$ and $\bar{\mathbf{X}}_{k+1}^i = \left(\bar{X}_{k+1}^i, \bar{Y}_{k+1}^i, \bar{Z}_{k+1}^i\right)$.
2. For the i-th 3D point, compute the following nine products of its coordinates at time k and k+1:

\[
P_{XX}^i = \bar{X}_k^i \bar{X}_{k+1}^i, \quad P_{XY}^i = \bar{X}_k^i \bar{Y}_{k+1}^i, \quad \ldots \quad P_{ZY}^i = \bar{Z}_k^i \bar{Y}_{k+1}^i, \quad P_{ZZ}^i = \bar{Z}_k^i \bar{Z}_{k+1}^i \tag{7}
\]
3. Accumulate the products in (7) for all the 3D points to end up with the following nine values:

\[
S_{XX} = \sum_i P_{XX}^i, \quad S_{XY} = \sum_i P_{XY}^i, \quad \ldots \quad S_{ZY} = \sum_i P_{ZY}^i, \quad S_{ZZ} = \sum_i P_{ZZ}^i \tag{8}
\]
4. Form a 4x4 symmetric matrix with the elements in (8):

\[
N = \begin{bmatrix}
S_{XX}+S_{YY}+S_{ZZ} & S_{YZ}-S_{ZY} & S_{ZX}-S_{XZ} & S_{XY}-S_{YX} \\
S_{YZ}-S_{ZY} & S_{XX}-S_{YY}-S_{ZZ} & S_{XY}+S_{YX} & S_{ZX}+S_{XZ} \\
S_{ZX}-S_{XZ} & S_{XY}+S_{YX} & -S_{XX}+S_{YY}-S_{ZZ} & S_{YZ}+S_{ZY} \\
S_{XY}-S_{YX} & S_{ZX}+S_{XZ} & S_{YZ}+S_{ZY} & -S_{XX}-S_{YY}+S_{ZZ}
\end{bmatrix} \tag{9}
\]
5. Find the eigenvector corresponding to the largest eigenvalue of N, which will be taken as the quaternion that determines the rotation between the robot pose at time steps k and k+1.
6. Compute the rotation matrix R associated to the so obtained quaternion, and compute the translation t = (Δx, Δy, Δz)^T as the difference between the centroid at time k and the scaled and rotated centroid at time k+1:

\[
\mathbf{t} = \mathbf{c}_k - R\,\mathbf{c}_{k+1} \tag{10}
\]
Finally, we extract the values of the increments in the yaw, pitch, and roll angles (Δα, Δβ, Δγ) between poses from the rotation matrix R, obtaining in this way all the components of μ_q.
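The following NumPy sketch (our illustration of the closed-form solution summarized in steps 1-6, not the authors' code) returns R and t such that, following equation (10), t = c_k - R c_{k+1}:

```python
import numpy as np

def horn_pose_change(P_k, P_k1):
    """Closed-form absolute orientation (Horn, 1987) between two corresponding
    3D point sets P_k, P_k1 of shape (N, 3)."""
    c_k, c_k1 = P_k.mean(axis=0), P_k1.mean(axis=0)           # step 1: centroids
    A, B = P_k - c_k, P_k1 - c_k1                             # centered coordinates
    S = A.T @ B                                               # steps 2-3: S_XX ... S_ZZ (3x3)
    Sxx, Sxy, Sxz = S[0]; Syx, Syy, Syz = S[1]; Szx, Szy, Szz = S[2]
    N = np.array([                                            # step 4: 4x4 symmetric matrix (9)
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx       ],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz       ],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy       ],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz ]])
    w, V = np.linalg.eigh(N)                                  # step 5: largest eigenvalue of N
    q0, qx, qy, qz = V[:, np.argmax(w)]                       # unit quaternion
    R = np.array([                                            # step 6: quaternion -> rotation matrix
        [q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz),             2*(qx*qz + q0*qy)],
        [2*(qy*qx + q0*qz),             q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
        [2*(qz*qx - q0*qy),             2*(qz*qy + q0*qx),             q0*q0 - qx*qx - qy*qy + qz*qz]])
    t = c_k - R @ c_k1                                        # equation (10)
    return R, t
```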
3.4.2 Estimation of the Covariance Matrix Σ_q

Covariance matrices are usually obtained through a linear approximation of the functions involved in a given transformation between variables (see, for example, Section 3.2). However, in the case of the closed-form solution described above, the function cannot be linearized due to the computation of the largest eigenvector. Therefore, we propose here to use the linearized version of the problem, which can be stated as the minimization of the least-squares error of the system

\[
\begin{pmatrix} X_{k+1}^i \\ Y_{k+1}^i \\ Z_{k+1}^i \\ 1 \end{pmatrix} =
\begin{pmatrix} R & \mathbf{t} \\ 0\;\;0\;\;0 & 1 \end{pmatrix}
\begin{pmatrix} X_k^i \\ Y_k^i \\ Z_k^i \\ 1 \end{pmatrix} \tag{11}
\]

for the variables which determine the pose change, i.e. μ_q = (Δx, Δy, Δz, Δα, Δβ, Δγ). Expanding (11) we obtain the position of the i-th point
at time k+1 as a function of its position at time k (represented by $\mathbf{X}_k^i$) and the increments in X, Y, Z, yaw, pitch and roll between both time steps:

\[
X_{k+1}^i = f\left(\mu_q, \mathbf{X}_k^i\right), \qquad
Y_{k+1}^i = f\left(\mu_q, \mathbf{X}_k^i\right), \qquad
Z_{k+1}^i = f\left(\mu_q, \mathbf{X}_k^i\right) \tag{12}
\]
By linearizing these equations we come to the following expression for Σ_q:

\[
\Sigma_q^{-1} = H^T \Sigma^{-1} H = \sum_i H_i^T \left(\Sigma^i\right)^{-1} H_i \tag{13}
\]
where H_i stands for the Jacobian matrix of the equations in (12) for the i-th 3D point relative to μ_q, and $\Sigma^i = \Sigma_k^i + \Sigma_{k+1}^i$ is the sum of the position covariance matrices of the i-th point at times k and k+1 as defined in equation (3). Notice that, since the 3D points are uncorrelated, the first expression in (13) can be split into the sum of its block diagonal elements.
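A possible sketch of equation (13), using numerical Jacobians of the pose function in (12); the finite-difference step and the yaw-pitch-roll (Z-Y-X) rotation convention are our own assumptions, since the paper does not spell them out:

```python
import numpy as np

def rotation_zyx(yaw, pitch, roll):
    """Rotation matrix from yaw (Z), pitch (Y), roll (X) angles -- an assumed convention."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

def pose_covariance(mu_q, P_k, Sigmas_k, Sigmas_k1, eps=1e-6):
    """Approximate Sigma_q of eq. (13). mu_q: (6,) array (dx, dy, dz, yaw, pitch, roll),
    P_k: (N, 3) points at time k, Sigmas_*: lists of 3x3 point covariances from eq. (3)."""
    def f(q, X):                                   # eq. (12): predicted point at time k+1
        return rotation_zyx(q[3], q[4], q[5]) @ X + q[:3]
    info = np.zeros((6, 6))                        # accumulates Sigma_q^{-1}
    for X, S_k, S_k1 in zip(P_k, Sigmas_k, Sigmas_k1):
        H = np.zeros((3, 6))                       # numerical Jacobian of f w.r.t. mu_q
        for j in range(6):
            dq = np.zeros(6); dq[j] = eps
            H[:, j] = (f(mu_q + dq, X) - f(mu_q - dq, X)) / (2 * eps)
        S_i = S_k + S_k1                           # combined point covariance (as in the text)
        info += H.T @ np.linalg.inv(S_i) @ H
    return np.linalg.inv(info)
```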
4 Experimental Results

We have performed a variety of experiments to compare classical encoder-based odometry with our proposed method for visual odometry in an indoor environment. In this paper, we present one of them, where our robot Sancho is equipped with a PointGrey Bumblebee stereo camera and driven through a room while gathering stereo images and odometry readings. We also use laser scans to build a map of the environment and estimate the real path of the robot, which will be taken as the ground truth in this experiment (thick lines in Fig. 2(a)-(b)). An example of the scene managed in this experiment is shown in Fig. 2(c).
Fig. 2. (a) Path of the robot estimated from the laser scanner built map (thick line) and our proposed visual odometry method (thin line). (b) Estimated paths from the laser scanner map and the encoder-based odometry readings (dashed line). (c) Example of the images managed in the experiments.
In order to compare the performance of the odometry methods, we compute the errors committed by both methods at each time step as the difference between their estimates and the ground truth. The histograms of the 3D position errors of both approaches are shown in Fig. 3. We have found that both methods perform similarly, with most of the errors in Δx and Δy below 5 cm. Notice that since the robot moves in a planar environment, Δz should be zero for the whole experiment. Consequently, our algorithm provides a coherent estimation which is always close to Δz = 0 with a small error (typically 1 cm), as can be seen in Fig. 3. The distribution of the error in the 3D position is illustrated in the last plot in Fig. 4. Regarding the estimation of the orientation, visual odometry achieves an error in yaw (the only rotational degree of freedom of a planar robot) similar to conventional odometry. However, we should highlight the accuracy of our algorithm in the other components of the orientation, where the largest error is below 1 deg (please refer to the histograms for pitch and roll in Fig. 4). Recalling the estimated paths of the robot in Fig. 2 according to both odometric methods, we can now remark on their similar accuracy in spite of the higher dimensionality of visual odometry, which, a priori, is prone to accumulate larger errors. We can
Fig. 3. Histograms of the errors committed in the estimation of the changes in the robot position for the visual odometry (top plots) and classical odometry (bottom plots) approaches
conclude that the reason for this performance is the small estimation errors of visual odometry in the dimensions not involved in planar odometry, i.e. Δz, Δβ, Δγ.
5 Conclusions

This paper has presented a new method to perform visual odometry by computing the 6D change between the poses of a camera in consecutive time steps. Our method combines the speed of the Lucas-Kanade-Tomasi detector and tracker with the capability of the SIFT descriptor to distinguish features. Another contribution of this work in comparison to previous approaches is the employment of a closed-form, optimal solution to the problem of finding the 6D transformation between two sets of corresponding points. The results show that the performance of our approach for visual odometry is quite similar to that of conventional odometry for planar environments, whereas visual odometry additionally allows movements in 6D. Further research will be aimed at integrating the presented approach into visual SLAM frameworks.
Fig. 4. Histograms of the errors committed in the estimation of the changes in the robot orientation for the visual odometry (top plots) and conventional encoder-based odometry (bottom-left plot) approaches. (bottom-right) Distribution of the errors in the estimation of the change in the robot 3D position for the visual odometry approach.
References
1. Campbell, J., Sukthankar, R., Nourbakhsh, I., Pahwa, A.: A Robust Visual Odometry and Precipice Detection System Using Consumer-grade Monocular Vision. In: Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 3421–3427 (2005)
2. Davison, A.J., Reid, I., Molton, N., Stasse, O.: MonoSLAM: Real-Time Single Camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence (2007)
3. Fernandez, D., Price, A.: Visual Odometry for an Outdoor Mobile Robot. In: Conference on Robotics, Automation and Mechatronics, pp. 816–821 (2004)
4. Hahnel, D., Burgard, W., Fox, D., Thrun, S.: An efficient FastSLAM algorithm for generating maps of large-scale cyclic environments from raw laser range measurements. In: Proc. of Int. Conference on Intelligent Robots and Systems (IROS) (2003)
5. Harris, C.J., Stephens, M.: A combined corner and edge detector. In: Proceedings of 4th Alvey Vision Conference, Manchester, pp. 147–151 (1988)
6. Horn, B.K.P.: Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A 4, 629–642 (1987)
7. Kitchen, L., Rosenfeld, A.: Gray-level corner detection. Pattern Recognition Letters 1, 95–102 (1982)
8. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
9. Matthies, L., Shafer, S.A.: Error modeling in Stereo Navigation. IEEE Journal of Robotics and Automation RA-3(3) (1987)
10. Nistér, D., Naroditsky, O., Bergen, J.: Visual Odometry. In: Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 652–659 (2004)
11. Olson, C.F., Matthies, L.H., Schoppers, M., Maimone, M.W.: Rover navigation using stereo ego-motion. Robotics and Autonomous Systems 43(4), 215–229 (2003)
12. Shi, J., Tomasi, C.: Good features to track. In: Proc. Computer Vision and Pattern Recognition, pp. 593–600 (1994)
13. Sim, R., Elinas, P., Griffin, M., Little, J.J.: Vision-based SLAM using the Rao-Blackwellised Particle Filter. In: IJCAI Workshop Reasoning with Uncertainty in Robotics, Edinburgh, Scotland (2005)
14. Stachniss, C., Grisetti, G., Burgard, W.: Recovering Particle Diversity in a Rao-Blackwellized Particle Filter for SLAM After Actively Closing Loops. In: Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), IEEE Computer Society Press, Los Alamitos (2005)
15. Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. MIT Press, Cambridge (2006)
16. Wang, H., Yuan, K., Zou, W., Zhou, Q.: Visual Odometry Based on Locally Planar Ground Assumption. In: Int. Conference on Information Acquisition, pp. 59–64 (2005)
Color Image Segmentation Based on Type-2 Fuzzy Sets and Region Merging

Samy Tehami, André Bigand, and Olivier Colot

LAGIS-UMR CNRS 8146, Univ. Lille1, Villeneuve d'Ascq Cedex, 59655, France
[email protected] http://www-lagis.univ-lille1.fr
Abstract. This paper focuses on the application of type-2 fuzzy sets (FS2) to color image segmentation. The proposed approach is based on the application of FS2 entropy and on region merging. Both local and global information of the image are employed, and FS2 make it possible to take into account the total uncertainty inherent to the segmentation operation. Fuzzy entropy is utilized as a tool to perform histogram analysis to find all major homogeneous regions at the first stage. Then a basic and fast region merging process, based on color similarity and reduction of small clusters, is carried out to avoid over-segmentation. The experimental results demonstrate that this method is suitable for finding homogeneous regions in natural images, even in noisy images.
1 Introduction
Segmentation remains one of the most important problems in color image analysis nowadays. The two main techniques described in the literature are region reconstruction (image plane analysis using a region growing process, [6], [1]) and color space analysis (the color of each pixel is represented in a color space). Many authors have tried to determine the best color space for some specific color image segmentation problems ([22]), but, unfortunately, there does not exist a unique color space suited to all segmentation problems. Computational complexity may increase significantly (in comparison with gray-scale image segmentation), so we have classically chosen to work in the (R, G, B) color space, where a color point is defined by the color component levels of the corresponding pixel: red (R), green (G) and blue (B). These two techniques have considerable drawbacks. The region-oriented techniques tend to over-segment images, and the second technique is not robust to significant appearance changes because it does not include any spatial information. Fuzzy logic is considered to be an appropriate tool for image analysis, and particularly for gray-scale segmentation ([2], [14], [30]). These techniques have been tested with success for color image analysis. Recently, fuzzy region-oriented techniques for color image segmentation have been presented ([15], [3]), defining a region as a fuzzy subset of pixels, where each pixel in the image has a membership degree to each region. These techniques are based on fuzzy logic with type-1 fuzzy sets. Other techniques
have been presented to perform color clustering in a color space ([5], [4]), based on type-1 fuzzy sets and a homogeneity measure (homogeneity of the "paths" connecting the pixels, [9], or fuzzy homogeneity calculated by fuzzy entropy, [4]). The major concern of these techniques is that spatial ambiguity among pixels has inherent vagueness rather than randomness. However, there remain some sources of uncertainty in type-1 fuzzy sets (see [18]): the meanings of the words that are used, measurements may be noisy, and the data used to tune the parameters of type-1 fuzzy sets may also be noisy. Imprecision and uncertainty are naturally present in image processing ([23]), and particularly these three kinds of uncertainty. Techniques that are not much used for the moment in color image analysis are type-2 fuzzy sets. Mendel ([8], [11], [18]) shows that type-2 fuzzy sets (or FS2) may be applied to take into account these three kinds of uncertainty (measurement noise, the data-generating mechanism, and the description of features that are all nonstationary, when the nature of the nonstationarities cannot be expressed mathematically), and we have investigated this new scheme in this paper. The concept of a type-2 fuzzy set was introduced first by Zadeh ([24]) as an extension of the concept of an ordinary fuzzy set (type-1 fuzzy set). Type-2 fuzzy sets have grades of membership that are themselves fuzzy. At each value of the primary variable (the universe of discourse X), the membership is a function (and not just a point value) - the secondary membership function - whose domain (the primary membership) is in the interval [0, 1] and whose range (the secondary grades) may also be in [0, 1]. Hence, the membership function of a type-2 fuzzy set is three dimensional, and it is this new third dimension that provides new design degrees of freedom for handling uncertainty. In this paper we propose to use FS2 for the segmentation of color images in the (RGB) color space. The paper is organized as follows:
– Section 2 briefly describes type-2 fuzzy sets
– Section 3 introduces image segmentation using type-2 fuzzy sets
– In Section 4 we present some results
– Finally, the paper is summarized with some conclusions in Section 5.

2 Type-2 Fuzzy Sets

2.1 Definition
Type-1 fuzzy sets that are used in image processing are often fuzzy numbers. However, it is not possible to say which membership function is the best one. This is the major motivation of this work: to remove the uncertainty on membership values by using type-2 fuzzy sets. For example, type-1 fuzzy sets may be useful to model the imprecision of patients in telemedicine (visual acuity tested by fuzzy logic, with an application in ophthalmology, [13]). This imprecise value allows the modelling of the visual acuity (from 0 to 10), but it is not possible to take into account the bounds of the intervals of the model. One possible approach consists in calculating an average value of the bounds observed with n patients. Another possible approach consists in
making use of the average values and the standard deviation for the two endpoints of the membership function (representing the type-1 fuzzy set), which leads to a continuum of fuzzy numbers. Let X be the universe of discourse. A type-2 fuzzy set (FS2) A, or Ã, is characterized by its membership function (MF) μ(x, u), where x ∈ X and u ∈ Jx ⊆ [0, 1], with Ã = {((x, u), μ(x, u)) | ∀x ∈ X, ∀u ∈ Jx ⊆ [0, 1]}, where 0 ≤ μ(x, u) ≤ 1. Ã may also be characterized as follows:

\[
\tilde{A} = \int_{x \in X} \int_{u \in J_x} \mu(x, u)\,/\,(x, u)\; dx\, du \tag{1}
\]

2.2 Uncertainty Representation
A way to visualize a type-2 fuzzy set Ã is to sketch its two-dimensional domain, its "footprint of uncertainty" (FOU, [18]). The heights of a type-2 MF (its secondary grades) sit atop its FOU. The FOU is the union of all primary MFs:

\[
FOU(\tilde{A}) = \bigcup_{x \in X} J_x \tag{2}
\]
The computation of general type-2 fuzzy sets is prohibitive because the general FS2 operations are complex. A special case of FS2 is the interval type-2 fuzzy set (ITFS2), where all the secondary grades equal one, so that the set operations can be simplified to interval calculations ([18]). Interval type-2 fuzzy sets are the most widely used type-2 fuzzy sets (because they are simple to use, and it is very difficult to justify the use of any other kind of type-2 fuzzy set to date). The ITFS2 is a special case of the general type-2 fuzzy set (FS2) A, or Ã, and can be expressed as follows:

\[
\tilde{A} = \int_{x \in X} \int_{u \in J_x} 1\,/\,(x, u)\; dx\, du \tag{3}
\]
Mendel ([19]) has shown that the footprint of uncertainty (FOU) represents uncertainty in the primary memberships of an ITFS2. Figure 1 presents the primary (Gaussian) membership function of an ITFS2. The upper membership function μU(x) and the lower membership function μL(x) of Ã are two FS1 membership functions that bound the FOU (the FOU is shaded in figure 1).

2.3 Type-2 Fuzzy Set Entropy
The process of selecting the necessary information to proceed with segmentation must lead here to a correct estimate of the regions of the color image. The present work presents an application of the theory of fuzzy sets to evaluate these regions with the best possible accuracy. The terms fuzziness index ([26]) and entropy ([28]) provide the measurement of fuzziness in a set and are used to define the degree of uncertainty of the segmentation process (the total amount of
uncertainty being difficult to calculate in this case). These data make it possible to define an index relevant for the process, which is used as a criterion to find fuzzy region widths and thresholds for segmentation automatically. An ordinary fuzzy set A of a set X is classically defined by its membership function μA(x), written as μA : X → [0, 1], with x ∈ X, where the membership function denotes the degree to which an event x may be a member of A. A point x for which μA(x) = 0.5 is said to be a crossover point of the fuzzy set A. The uncertainty brought by the variable is represented by the "α-cut" of the fuzzy set A. Let X be a classical set and A ⊆ X an ordinary fuzzy set characterized by its membership function μA(x). Considering a threshold α ∈ [0, 1], the membership function of the classical set Aα (the α-cut of the fuzzy set A) can be defined as

\[
\mu_A^\alpha : X \to \{0,1\}, \qquad
\mu_A^\alpha(x) = \begin{cases} 1 & \forall x \ge \alpha \\ 0 & \forall x < \alpha \end{cases}, \quad \forall x \in X.
\]

The fuzziness index γ of a fuzzy set A reflects the degree of ambiguity by measuring the distance between A and its nearest ordinary set A0.5. It is defined as ([25]):

\[
\gamma(A) = 2\, d(A, A_{0.5})\,/\,n^{1/p} \tag{4}
\]

where d(A, A0.5) denotes the distance between A and its nearest ordinary set A0.5. A positive scalar p is introduced to make γ(A) lie between zero and one. Its value depends on the type of distance function used. For example, p = 1 represents a generalized Hamming distance, whereas p = 2 represents an Euclidean distance. The term entropy of an ordinary fuzzy set A was first introduced by Deluca and Termini ([28]) as:

\[
H(A) = \Big(\sum S_n(\mu_A(x))\Big) \big/ \, (n \ln 2) \tag{5}
\]
where $S_n(\mu_A(x)) = -\mu_A(x)\ln(\mu_A(x)) - (1-\mu_A(x))\ln(1-\mu_A(x))$ (ln standing for the natural logarithm). Yager ([25]) and Kaufmann ([26]) proposed other possible measures of the entropy, motivated by the classical Shannon entropy function, that we do not present here (Fan and Ma ([10]) proposed a complete analysis of fuzzy entropy formulas). γ(A) and H(A) are such that:

γ_min = H_min = 0, for μ = 0 or 1
γ_max = H_max = 1, for μ = 0.5.

Therefore, γ and H are monotonic functions that increase in the interval [0, 0.5] and decrease in [0.5, 1], with a maximum of one at μ = 0.5. So it is possible to use one or the other expression to define the degree of uncertainty. In this work, we use the extension of the "De Luca and Termini" measure ([28]) to discrete images, proposed by Pal ([27]), which is well adapted to our problem. For an M×N image subset A ⊆ X with L gray levels g ∈ [0, L − 1], the histogram h(g)
and the membership function μX(g), the (linear) index of fuzziness can now be defined as follows:

\[
\gamma(A) = \frac{1}{MN}\sum_{g=0}^{L-1} h(g)\cdot \min\!\left[\mu_A(g),\, 1 - \mu_A(g)\right] \tag{6}
\]
There have been numerous applications of fuzzy entropies in gray-scale image segmentation ([13], [14], [30]). The entropy of FS2 has not yet been studied in the literature. However, for type-2 fuzzy sets, it is very easy to extend the previous concepts of FS1 to ITFS2, as proposed by ([30]), and to define the (linear) index of fuzziness as follows:

\[
\gamma(\tilde{A}) = \frac{1}{MN}\sum_{g=0}^{L-1} h(g)\cdot \left[\mu_U(g) - \mu_L(g)\right] \tag{7}
\]
In this last formula, μU(g) and μL(g) are defined in the following paragraph. This basic definition verifies the four conditions proposed by Kaufmann ([26]) for the measure of uncertainty of a fuzzy set, and among the numerous frameworks for modelling uncertainty, this last equation seems to be an interesting way to proceed in image processing.
2.4 Signal Processing Applications of Type-2 Fuzzy Sets
Recently, some applications of type-2 fuzzy sets have been presented in the literature. Gader et al. ([12]) presented land mine detection with very good results. Hagras ([7]) proposed a hierarchical type-2 fuzzy logic control architecture for autonomous mobile robots, and [8] and [11] proposed applications for the design of fuzzy logic systems (used for the control of plants). Rhee and Hwang studied the uncertainty associated with the parameters used in fuzzy clustering algorithms and showed that an interval type-2 fuzzy approach aids cluster prototypes to converge to a more desirable location than a type-1 fuzzy approach ([16], [17]). Tizhoosh ([30]) applied type-2 fuzzy sets to gray-scale image thresholding. He obtained good results with very noisy images. As proposed in [18], he used interval type-2 fuzzy sets, with the following FOU (figure 1):
– Upper limit: μU(x) = [μ(x)]^0.5
– Lower limit: μL(x) = [μ(x)]^2
The study he made of these functions showed that they are well adapted to image processing. So we shall use the same functions in color image segmentation. We now present the application of type-2 fuzzy sets to color image segmentation.
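To make equations (6) and (7) and the FOU limits above concrete, here is a small NumPy sketch (ours, not the authors' Matlab code) that computes both indices of fuzziness for one color component from its histogram and a type-1 membership function:

```python
import numpy as np

def fuzziness_indices(hist, mu):
    """Type-1 (eq. 6) and type-2 (eq. 7) linear indices of fuzziness for one color component.
    hist : (L,) histogram of gray levels; mu : (L,) type-1 membership values in [0, 1]."""
    n_pixels = hist.sum()                      # M * N
    mu_U = mu ** 0.5                           # upper FOU limit (Section 2.4)
    mu_L = mu ** 2                             # lower FOU limit
    gamma_fs1 = np.sum(hist * np.minimum(mu, 1.0 - mu)) / n_pixels   # eq. (6)
    gamma_fs2 = np.sum(hist * (mu_U - mu_L)) / n_pixels              # eq. (7)
    return gamma_fs1, gamma_fs2
```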
3 Color Image Segmentation with Type-2 Fuzzy Sets
In this paper, we consider color uniformity as a relevant criterion to partition an image into significant regions. We propose a fuzzy entropy approach to take into account simultaneously the color and spatial properties of the pixels.
3.1 Proposed Scheme
The segmentation scheme is divided into two steps. In the first one, the color image is considered as a combination of three color component images. A color component image is a monochromatic image where each pixel is characterized by the level of one color component. Each of these component images is analyzed using a type-2 fuzzy set (both the occurrence of the gray levels and the neighboring homogeneity value among pixels are considered) and then fuzzy entropy. So local and global information is employed in the algorithm. In the second step, the entropy is utilized as a tool to perform histogram analysis in order to find all major homogeneous regions. The classes built by the analyses of the three color component images are then combined to form the classes of pixels of the color image (merging stage).
3.2 Type-2 Fuzzy Set Entropy
The membership function of the type-2 fuzzy set is shifted over the gray-level range (corresponding to one color component, R, G or B) and the amount of fuzziness is calculated (using equation (7)). So we are able to transform an image into fuzzy domains with maximum fuzzy entropy. The proposed color image segmentation method can be described as a system whose inputs are a color image and the entropy threshold value. The output of the system is the segmented image (the threshold value is applied to each color component independently).
3.3 Algorithm
The general algorithm for color image segmentation based on type-2 fuzzy sets and the measure of fuzziness γ can be formulated as follows (a sketch of the core of this procedure is given below):
– Select the shape of the MF (here an interval-based type-2 fuzzy set)
– Calculate the image histogram for each color component of the color space (R, G, B)
– Initialize the position of the membership function
– Shift the MF along the gray-level ranges (R, G, B) (as illustrated in figure 2)
– Calculate in each position (g) the MF values μU(g) and μL(g)
– Calculate in each position (g) the amount of uncertainty γ
– Find the maximum values of γ
– Threshold the image with γmax
– Region merging process from the obtained classes of pixels
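The following sketch (ours, not the authors' code) illustrates the core of this algorithm for a single color component. The triangular shape and width of the membership function, and the choice of returning only the single best position, are simplifying assumptions; the method keeps all major maxima of γ to delimit several homogeneous regions.

```python
import numpy as np

def shifted_membership(center, width, levels=256):
    """Triangular type-1 MF centered at 'center' (an assumed shape; the method is not tied to it)."""
    g = np.arange(levels)
    return np.clip(1.0 - np.abs(g - center) / width, 0.0, 1.0)

def best_position(channel, width=40, levels=256):
    """Shift the MF over the gray-level range of one color component (uint8 2D array) and
    return the position maximizing the type-2 index of fuzziness of equation (7)."""
    hist = np.bincount(channel.ravel(), minlength=levels).astype(float)
    n_pixels = hist.sum()
    best_g, best_gamma = 0, -1.0
    for center in range(levels):
        mu = shifted_membership(center, width, levels)
        gamma = np.sum(hist * (mu ** 0.5 - mu ** 2)) / n_pixels   # eq. (7) with the FOU of Sec. 2.4
        if gamma > best_gamma:
            best_g, best_gamma = center, gamma
    return best_g

# Applying this analysis to the R, G and B components independently gives the
# per-component positions/thresholds used to build the coarse classes of pixels.
```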
3.4 Color Region Merging
At the previous stage, a coarse segmentation of the image is obtained. A color region merging technique is needed in order to refine the segmentation results. In fact, regions with small numbers of pixels should be merged, and homogeneous regions with narrow color transitions might be split into separate regions having small color differences. These cases often appear in natural images characterized by imprecise regions such as shadows, highlights and color gradients.
Fig. 1. FOU of an ITFS2

Fig. 2. Shifting of the membership function
The region merging criterion. A classical problem with region merging is how to define the merging criteria. Incorporating specific knowledge of psychophysical perception would be an ideal way, but it is not practical in applications. In this paper, the definition of a region is based on similar color (homogeneity), so we take color similarity into account to decide whether two regions are to be merged. We adopt an approach similar to [4]. In the RGB color space, we use the distance between two clusters C1 and C2: dist(C1, C2) = max(|R1 − R2|, |G1 − G2|, |B1 − B2|), where (R1, G1, B1) and (R2, G2, B2) are the average color values of clusters C1 and C2.
Region merging algorithm. The strategy we follow in this first work is the following:
– From the segmented image obtained with the application of FS2, we merge clusters whose number of pixels is less than a predefined threshold into their closest cluster (first stage of merging)
– Then region merging is performed iteratively by combining the two closest regions each time until the distances of all pairs of regions are greater than a specified global threshold.
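The following sketch (ours) illustrates the two merging stages with the cluster distance defined above; min_size and the global threshold are illustrative values, not those used in the experiments.

```python
import numpy as np

def cluster_distance(c1, c2):
    """Distance between the average RGB colors of two clusters, as defined above."""
    return np.max(np.abs(np.asarray(c1, float) - np.asarray(c2, float)))

def merge_regions(means, sizes, min_size=50, global_thresh=30.0):
    """means: list of average (R, G, B) colors per cluster; sizes: list of pixel counts.
    Returns a mapping old cluster index -> merged cluster index."""
    means = [np.asarray(m, float) for m in means]
    sizes = list(sizes)
    label = list(range(len(means)))

    def merge(i, j):                         # merge cluster i into cluster j
        means[j] = (sizes[i] * means[i] + sizes[j] * means[j]) / (sizes[i] + sizes[j])
        sizes[j] += sizes[i]
        sizes[i] = 0
        for k, l in enumerate(label):
            if l == i:
                label[k] = j

    # Stage 1: merge small clusters into their closest cluster.
    for i in range(len(means)):
        if 0 < sizes[i] < min_size:
            others = [j for j in range(len(means)) if j != i and sizes[j] > 0]
            if others:
                merge(i, min(others, key=lambda j: cluster_distance(means[i], means[j])))

    # Stage 2: iteratively merge the two closest clusters until all distances exceed the threshold.
    while True:
        alive = [i for i in range(len(means)) if sizes[i] > 0]
        pairs = [(cluster_distance(means[i], means[j]), i, j)
                 for a, i in enumerate(alive) for j in alive[a + 1:]]
        if not pairs:
            break
        d, i, j = min(pairs)
        if d > global_thresh:
            break
        merge(i, j)
    return label
```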
4 Experimental Results
In order to test the performance of the proposed technique, a classical synthetic image (named "Savoyse", composed of five areas on a uniform background, with additive Gaussian noise) is tested first. The other images, well-known natural scene images (named "House" and "Lena"), have also been tested. These images are presented respectively in figures 3, 4 and 5 (the intensity value for each color component of the test images is from 0 to 255). The algorithm has been implemented with the well-known software Matlab on a PC (it is important to note that the software has not been optimized). We can remark that the most time consuming part of our Matlab implementation is the region merging
procedure (due to the non-optimized data structure used). Without the region merging procedure, the running time is only a few seconds. The characteristics of the images, the number of colors, the CPU time, etc. are listed in Table 1.
4.1 Type-2 Fuzzy Sets Entropy Approach
We have applied the proposed unsupervised segmentation algorithm to these images; the results are presented respectively in figures 3, 4 and 5: the first image (left) is the original image, the second image is the image obtained with the FS2 segmentation method, the third image is the result after the first merging stage and the fourth image (right) is the result obtained after the second merging stage. It is interesting to analyse these results. First, we can easily remark that the number of colors is drastically lower after the segmentation process using FS2: thresholding using FS2 entropy is very effective. Mendel has shown that the amount of uncertainty associated with an FS2 is characterized by its lower and upper membership functions, so we are intuitively able to explain these results (compared with an FS1, for example). On the synthetic image (Savoyse), the two regions corresponding to the two green concentric discs are correctly extracted. This result shows that the method is able to handle unequiprobable and overlapping classes of pixels. The segmentation of the other (natural) images is challenging because of the presence of shadow and highlight effects. Segmentation results for the House image show that low-contrast regions are merged, and the number of segmented regions dropped from 17 to 4 colors. Finally, consider the well-known benchmark "Lena" image. It is well known that segmentation techniques based solely on low-level cues (such as colors) are very difficult to apply, due to the distribution of colors. Nevertheless, our method provides good results: the hat of the girl and her face remain cleanly separated from the background.
Fig. 3. Original and segmented images, Savoyse
So we are able to sum up some important results:
– The image can be transformed into fuzzy domains using type-2 fuzzy membership functions
– These fuzzy domains consider both the occurrence of the gray levels (of each color component) and the neighboring homogeneity among pixels (spatial information)
– The analysis of the entropy function (of each color component) performs image segmentation (regions and contours)
– The segmentation process is unsupervised (we do not need to know the number of clusters of pixels), and apparently the results we obtained seem
Fig. 4. Original and segmented images, House
Fig. 5. Original and segmented images, Lena
robust to noise and to membership function shapes (we obtain the same results with different kinds of membership functions).
4.2 Type-2 Fuzzy Sets Entropy vs. Type-1 Fuzzy Sets Entropy
Then, we have compared the type-1 fuzzy sets entropy approach (using equation (6)) to its counterpart with type-2 fuzzy sets. In particular, we can remark that the peak values of the entropy using type-2 fuzzy sets are higher than their counterparts using type-1 fuzzy sets (figure 6, middle), so that regions will be easier to extract in a noisy image; this qualitatively shows the advantage of this approach (more uncertainty is taken into account using type-2 fuzzy sets, as suggested previously). This result is well illustrated by the results obtained with the image "Lena". In figure 6, we present the segmented image using FS2 on the left, and the segmented image using FS1 on the right (these results are obtained without region merging to make interpretation easier). It is clear that type-2 fuzzy sets help to obtain better results. Type-2 fuzzy sets are able to model imprecision and uncertainty which type-1 fuzzy sets find difficult or impossible to handle. Local entropy in information theory represents the variance of a local region and captures the natural properties of transition regions. FS2 being able to deal with a greater amount of uncertainty than FS1, transition regions are more acute and homogeneous regions are better drawn. It is possible to illustrate this assertion using the results of Table 1. For the "Lena" image, 50 colors are obtained with FS2 instead of 17 with FS1. It will be interesting in the future to use a measure of performance to compare these two approaches (and non-fuzzy references) on different sets of images. Computational complexity and computation time are small, and should also be compared with other algorithms. In particular, a complete study of the application of our method to noisy images is in progress, to establish a link between the FOU of FS2 and the level of noise, and will be presented in a future paper.
Fig. 6. Segmented (FS2, left and FS1, right) images, Lena, and fuzzy sets entropies (middle)

Table 1. Results of the Proposed Approach in RGB Color Space

Image Name | Size (pixels) | CPU Time (sec) | Number of colors (Initial) | FS1 | FS2 | FS2 after 1st merging | FS2 after 2nd merging
SAVOYSE    | 150x150       | 0.5            | 5330                       | 15  | 16  | 7                     | 7
HOUSE      | 256x256       | 1 to 2         | 33925                      | 12  | 17  | 5                     | 4
LENA       | 512x512       | 9              | 67189                      | 17  | 50  | 7                     | 5

4.3 Color Spaces
The proposed approach operates in the RGB color space, which is the most commonly used model in the literature. The major disadvantage of RGB for color scene segmentation is the high correlation among the R, G, and B components. The HSI system is another commonly used color space in image processing, which is more intuitive to human vision. However, the non-removable singularity of hue may create spurious modes in the distribution of values resulting from nonlinear transformations, which makes the entropy computation of the hue value unreliable for segmentation. The RGB color space does not have such a problem. But for color images with high saturation, segmentation using HSI can generate good results, and a comparison between RGB results and HSI results is under investigation.
5 Conclusion
Color image segmentation is a difficult task in image processing. A unique algorithm will certainly never be established to be applied to all kinds of images. We have tried to apply a new algorithm provided by fuzzy set theory. The central idea of this paper was to introduce the application of type-2 fuzzy sets, to take into account the total amount of uncertainty present at the segmentation stage, and this idea seems to be very promising. So a new segmentation algorithm has
been presented and some examples have demonstrated the applicability of this algorithm. We now have to compare this algorithm with other ones (non-fuzzy and fuzzy algorithms) and to conduct additional experiments with different test images to confirm the results we obtain (in a relevant benchmark) and to reinforce the potential of this new method. In particular, more extensive investigations of other measures of entropy and of the effect of the parameters influencing the width (length) of the FOU are in progress. We are also working on incorporating specific knowledge of psychophysical perception to obtain better results in the merging stage of our method. So this first study, with the good results we obtain, may lead to interesting studies in the future.
References
1. Meyer, F.: Topographic distance and watershed lines. Signal Processing 38, 113–125 (1994)
2. Bigand, A., Bouwmans, T., Dubus, J.P.: Extraction of line segments from fuzzy images. Pattern Recognition Letters 22, 1405–1418 (2001)
3. Demirci, R.: Rule-based automatic segmentation of color images. Int. J. Electron. Commun. 60, 435–442 (2006)
4. Cheng, H., Jiang, X., Wang, J.: Color image segmentation based on homogram thresholding and region merging. Pattern Recognition 35(2), 373–393 (2002)
5. Chen, T.Q., Lu, Y.: Color image segmentation – an innovative approach. Pattern Recognition 35, 395–405 (2002)
6. Trémeau, A., Colantoni, P.: Regions adjacency graph applied to color image segmentation. IEEE Trans. Image Process. 9(4), 735–744 (2000)
7. Hagras, H.A.: A hierarchical type-2 fuzzy logic control architecture for autonomous mobile robots. IEEE Trans. on Fuzzy Systems 12(4), 524–539 (2004)
8. Wu, H., Mendel, J.M.: Uncertainty bounds and their use in the design of interval type-2 fuzzy logic systems. IEEE Trans. on Fuzzy Systems 10(5), 622–639 (2002)
9. Prados-Suarez, B., Chamorro-Martinez, J., Sanchez, D., Abad, J.: Region-based fit of color homogeneity measures for fuzzy image segmentation. Fuzzy Sets and Systems 158, 215–229 (2007)
10. Fan, J.-L., Ma, Y.-L.: Some new fuzzy entropy formulas. Fuzzy Sets and Systems 128, 277–284 (2002)
11. Liang, Q., Karnik, N.N., Mendel, J.M.: Connection admission control in ATM networks using survey-based type-2 fuzzy logic systems. IEEE Trans. on Systems, Man and Cyber. 30(3), 329–339 (2000)
12. Auephanwiriyakul, S., Keller, J.M., Gader, P.D.: Generalized Choquet Fuzzy Integral Fusion. Information Fusion 3 (2002)
13. Taleb-Ahmed, A., Bigand, A., Lethuc, V., Allioux, P.M.: Visual acuity of vision tested by fuzzy logic: an application in ophthalmology as a step towards a telemedicine project. Information Fusion 5, 217–230 (2004)
14. Cheng, H.D., Chen, C.H., Chiu, H.H., Xu, H.J.: Fuzzy homogeneity approach to multilevel thresholding. IEEE Trans. Image Process. 7(7), 1084–1088 (1998)
15. Philipp-Foliguet, S., Vieira, M.B., Sanfourche, M.: Fuzzy segmentation of fuzzy images and indexing of fuzzy regions. In: CGVIP02, Brazil (2002)
16. Rhee, F., Hwang, C.: An interval type-2 fuzzy k-nearest neighbor. In: Proc. Int. Conf. Fuzzy Syst., vol. 2, pp. 802–807 (May 2003)
17. Hwang, C., Rhee, F.: An interval type-2 fuzzy C spherical shells algorithm. In: Proc. Int. Conf. Fuzzy Syst., vol. 2, pp. 1117–1122 (May 2004)
18. Mendel, J.M., Bob John, R.I.: Type-2 fuzzy sets made simple. IEEE Trans. on Fuzzy Systems 10(2), 117–127 (2002)
19. Mendel, J.M., Bob John, R.I., Liu, F.: Interval Type-2 Fuzzy Logic Systems made simple. IEEE Trans. on Fuzzy Systems 14(6), 808–821 (2006)
20. Mendel, J.M., Wu, H.: Type-2 Fuzzistics for symmetric Interval Type-2 Fuzzy Sets: Part 1, Forward problems. IEEE Trans. on Fuzzy Systems 14(6), 781–792 (2006)
21. Mendel, J.M.: Advances in type-2 fuzzy sets and systems. Information Sciences 177, 84–110 (2007)
22. Finlayson, G.D.: Color in perspective. IEEE Trans. on PAMI 18(10), 1034–1035 (1996)
23. Bloch, I.: Information combination operators for data fusion: a comparative review with classification. IEEE Trans. on SMC 26, 52–67 (1996)
24. Zadeh, L.A.: The concept of a linguistic variable and its application to approximate reasoning. Information Sciences 8, 199–249 (1975)
25. Yager, R.R.: On the measure of fuzziness and negation. Int. J. Gen. Sys. 5, 221–229 (1979)
26. Kaufmann, A.: Introduction to the theory of fuzzy sets – Fundamental theoretical elements. Academic Press, New York (1975)
27. Pal, N.R., Bezdek, J.C.: Measures of fuzziness: a review and several classes. Van Nostrand Reinhold, New York (1994)
28. Deluca, A., Termini, S.: A definition of a nonprobabilistic entropy in the setting of fuzzy set theory. Information and Control 20(4), 301–312 (1972)
29. Klir, G.J., Yuan, B.: Fuzzy sets and fuzzy logic. Theory and applications. Prentice-Hall, Englewood Cliffs (1995)
30. Tizhoosh, H.R.: Image thresholding using type 2 fuzzy sets. Pattern Recognition 38, 2363–2372 (2005)
ENMIM: Energetic Normalized Mutual Information Model for Online Multiple Object Tracking with Unlearned Motions

Abir El Abed¹, Séverine Dubuisson¹, and Dominique Béréziat²

¹ Laboratoire d'Informatique de Paris 6 (LIP6/UPMC), 104 Avenue du Président Kennedy, 75016 Paris
² LIP6/UPMC, Clime project/INRIA Rocquencourt, B.P. 105, 78153 Le Chesnay Cedex, France
[email protected]
Abstract. In multiple-object tracking, the lack of prior information limits the association performance. Furthermore, to improve tracking, dynamic models are needed in order to determine the settings of the estimation algorithm. In the case of complex motions, the dynamics cannot be learned and the task of tracking becomes difficult. That is why online spatio-temporal motion estimation is of crucial importance. In this paper, we propose a new model for multiple-target online tracking: the Energetic Normalized Mutual Information Model (ENMIM). ENMIM combines two algorithms: (i) Quadtree Normalized Mutual Information (QNMI), a recursive partitioning methodology involving region motion extraction; (ii) an energy minimization approach for data association adapted to the constraint of lacking prior information about motion and based on geometric properties. ENMIM is able to handle typical problems such as large inter-frame displacements, unlearned motions and noisy images with low contrast. The main advantages of ENMIM are that it is parameterless and that it can handle noisy multi-modal images without any pre-processing step.
1 Introduction
Multiple object tracking algorithms generally rely on two basic principles: a motion detection (or estimation) algorithm coupled with a data association method. Current techniques present various kinds of problems:
– Restriction to some specific motion model and incapacity to deal with random and unlearned motions;
– Difficulty to associate a measurement with the correct target when the targets are quite similar, when there is a large interval of time between observations, when the observer has no prior information about the dynamic model, or when a measurement is equidistant from different targets.
In the last years, the use of sequential Monte Carlo methods has grown in many application domains and in particular in target tracking. They are particularly
adapted to approximate the posterior probability density function of a state. These approaches are known as particle filters and mainly consist in propagating a weighted set of particles that approximates the density function. They provide flexible tracking frameworks as they are limited neither to linear systems nor to Gaussian noise [1,2,3]. For example, according to Isard and Blake [4], dynamic contour tracking is based on predictions using dynamical models. The parameters of these models are fixed by hand to represent plausible motions, such as constant velocity or critically damped oscillations. It is far more attractive to learn dynamical models on the basis of training sets. Once a new dynamical model has been learned, it can be used to build more efficient trackers. In practice, the learned model is incorporated into the Condensation algorithm [5], an estimation process which should enable particles to be concentrated more efficiently. In this framework, the dynamics have to be learned to succeed in the task of tracking; however, tracking may fail if the motion is badly anticipated by the learned model. Usually, association effectiveness depends on prior information and on the category of observation. If prior information is lacking, the association task becomes difficult. Such a case can occur when the observed system is deformed over time, or when we have no information about motion and we track multiple objects that are quite similar, even indistinguishable. Likewise, if we only observe target positions, it is possible for a measurement to be equidistant from several targets: all target association probabilities are then nearly the same and it is difficult to associate the right measurement with the right target. So far, no association method can handle all the cases previously illustrated. The literature contains some classical approaches for data association: deterministic approaches and probabilistic ones. The simplest deterministic method is the Nearest-Neighbor Standard Filter (NNSF) [6], which selects the closest validated measurement to a predicted target. In some tracking applications, color is also exploited. Unfortunately, the color metric is not sufficient in many cases: for deformable objects, whose color distribution may differ from one frame to another, or in the case of several nearly identical objects. Probabilistic approaches are based on posterior probability and make an association decision using the probability of error. We can cite the most general one, called Multiple Hypothesis Tracking [7], for which multiple hypotheses are formed and propagated, which implies computing every possible hypothesis. Another strategy for multiple-target tracking is the Joint Probability Data Association (JPDA) [8], which uses a weighted sum of all measurements near the predicted state, each weight corresponding to the posterior probability for a measurement to come from a target. However, the number of possible hypotheses increases rapidly with the number of targets. To summarize, the performance of a tracker is based on the parametrization of its dynamic model. The variety of approaches dealing with motion feature extraction that has been proposed in the literature is huge; however, all of them suffer from different shortcomings and to date there is no satisfactory solution.
In this paper, we propose the Energetic Normalized Mutual Information Model (ENMIM), a new model for online multiple-target tracking in difficult visual environments under the constraint of a total lack of knowledge about the dynamic model. ENMIM can manage critical problems online: a total lack of information about the dynamic model (i.e. complex and random dynamics), and the tracking of quite similar and deformable targets. It also handles large inter-frame displacements, does not require parameters or prior information about the dynamic model, is not computationally intensive, is robust to noisy images, and can be applied to multi-modal images without remapping their intensity distribution. Moreover, its parametrization is adaptive and automated, which is its main advantage. It is built by combining the following two algorithms:
1. Quadtree Normalized Mutual Information (QNMI): a statistical method which can automatically select the similar regions between two images and allows spatio-temporal motions to be extracted online;
2. Energy minimization approach: based on geometric properties, it provides an energy amplitude used to associate measurements with targets.
The rest of the paper is organized as follows. Section 2 presents the general definition of the Normalized Mutual Information (NMI) and the proposed method for online motion detection using a quadtree decomposition based on NMI. In Section 3, we expose the energy minimization approach and derive its geometrical representation and its mathematical model. The proposed tracking model, ENMIM, is then evaluated and tested on several sequences in Section 4. Finally, concluding remarks and perspectives are given in Section 5.
2 Motion Region Extraction
In this Section, we present a recursive partitioning methodology for region motion extraction that can deal with multiple independently moving objects.

2.1 Normalized Mutual Information (NMI)
In recent years NMI has proven to be a robust and accurate similarity measure for multi-modal image registration [9,10,11]. The NMI of two images is expressed in terms of the entropy of the images. Entropy is a measure of the uncertainty in predicting the intensity of a point in an image: for example, the entropy of a homogeneous image is zero since there is no uncertainty about the intensity of any of its pixels. On the contrary, an image containing a large number of equally distributed intensities has a high entropy. For a discrete random variable $A$ with intensities $a$, $p_A(a)$ is the probability for $A$ to be equal to $a$, and the Shannon entropy is defined as $H(A) = -\sum_a p_A(a)\log p_A(a)$. The entropy terms needed for the computation of the NMI can be derived from the joint histogram, which is an estimation of the joint probability distribution of the intensities of two images. The joint histogram counts the number of times that intensity couples occur at corresponding positions in the images. It is assumed, if two images are registered,
that the entropy of their joint histogram corresponds to a local minimum. The Shannon entropy for a joint distribution of two discrete random variables $A$ and $B$ is defined as $H(A,B) = -\sum_{a,b} p_{AB}(a,b)\log p_{AB}(a,b)$, where $p_{AB}(a,b)$ is the probability of the intensity couple $(a,b)$ at corresponding points of the overlapping parts of $A$ and $B$. Note that an increasing joint histogram dispersion indicates a reduction of the registration quality. The NMI is based on the Shannon entropy of the images and is given by: $\mathrm{NMI}(A,B) = \frac{H(A)+H(B)}{H(A,B)}$.
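As a minimal illustration of the above definitions, the following sketch estimates the NMI of two equally sized grayscale images from their joint histogram. It is not the authors' implementation; the number of histogram bins, the natural logarithm and the use of NumPy are our own assumptions.

import numpy as np

def nmi(img_a, img_b, bins=64):
    """Normalized mutual information NMI(A,B) = (H(A) + H(B)) / H(A,B)."""
    # Joint histogram: counts of intensity couples at corresponding positions.
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_ab = joint / joint.sum()          # joint probability p_AB(a,b)
    p_a = p_ab.sum(axis=1)              # marginal p_A(a)
    p_b = p_ab.sum(axis=0)              # marginal p_B(b)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    h_ab = entropy(p_ab)
    if h_ab == 0.0:                     # both images homogeneous: NMI is maximal (= 2)
        return 2.0
    return (entropy(p_a) + entropy(p_b)) / h_ab

For two noise-free identical images the value returned is 2, consistent with the maximal NMI stated below.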
2.2 Quadtree Normalized Mutual Information (QNMI)
QNMI is designed to solve the problem of online motion detection and is based on a statistical measure. The proposed algorithm compensates for the lack of knowledge when no training set is available to estimate the dynamic model of a target. The main feature of this approach is that it can detect motion between a pair of multi-modal images without remapping their intensity distribution. Among numerous criteria, NMI has already been proven to be very efficient for defining a similarity measure between two frames. This statistical measure exploits the fact that the NMI of two images I1 and I2 has a maximal value in the following cases:
1. I1 and I2 are similar (the standard case), or I1 = I2 + a where a is a constant;
2. I1 and I2 are both homogeneous (but their intensities can be different). This property is useful when dealing with multi-modal images: for example, we can compare a gray-level image with a colored one without remapping their intensities;
3. I1 and I2 represent the same scene with different intensity distributions.
In the case of noise-free images, and independently of the intensity distribution of the images, the maximum value of NMI is fixed and equals 2. The proposed approach, QNMI, consists in a spatial partitioning of images to localize the regions of difference between two frames, using a recursive method that gives a sub-block representation based on NMI computation. A fixed threshold is used to stop partitioning when the NMI of both sub-blocks is maximal. We apply a quadtree partitioning to represent the image as a hierarchical quadtree data structure in which the root of the tree is the initial image and each node contains four sub-nodes. A node represents a square image portion and its four sub-nodes correspond to its four quadrants. In the presence of noise, the threshold to stop the partitioning is given by: NMI = NMImax − NMInoise, where NMImax = 2 and NMInoise depends on the type of noise. Our approach can therefore deal with noisy images and avoid preprocessing steps. The taxi video sequence shows three moving vehicles (see surrounding ellipses on Figure 1.(a-b), inverted frames 2 and 21). We observe that the vehicle on the left has a very low contrast resolution and is barely visible to the human eye (it is more visible in Figure 1.c). Our goal is to detect the region of difference between frames 2 and 21. The presence of two vehicles almost indistinguishable from the background constitutes the major difficulty of this noisy sequence. To expose the result of
QNMI in a multi-modal context, we give as input to our algorithm frame 21 (Figure 1.(c)) and the inverted frame 2 (Figure 1.(a)). We can see that the intensity distributions of both images are different. We estimate NMImax − NMInoise by extracting two similar areas from both images and computing their NMI, which allows us to determine NMInoise. For this test, we have obtained NMInoise = 0.3 and NMI = 1.7. After applying our QNMI algorithm, we obtain as output Figure 1.(f), which contains the region of difference between the two considered frames. The black background corresponds to common regions between the frames and the white characterizes the moving regions. Figures 1.(d-e) represent the detected moving regions: the vehicles are well located in both frames in spite of the presence of noise and the low contrast between some vehicles and the background.
Fig. 1. Taxi sequence. (a-b) Inverted frames 2 and 21 and ellipses surrounding moving objects; (c) Frame 21; (d-e) Visualization of moving objects in frames (a) and (b); (f) Detection of moving region between frames (a) and (c) by using the QNMI.
The main advantages of the proposed algorithm are that it avoids a preprocessing step and is unsupervised. Because the criterion of partitioning is constant, it is independent of the constraint of luminosity conservation, and can also deal with multi-modal images. Moreover, it is robust to noisy images with low contrast.
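The following sketch illustrates the recursive quadtree partitioning described above, reusing the nmi function given earlier. The stopping threshold and the minimum block size are assumptions for illustration, not values prescribed by the paper (which uses NMI = NMImax − NMInoise).

import numpy as np

def qnmi_motion_mask(frame1, frame2, threshold=1.7, min_size=8):
    """Return a binary mask marking sub-blocks whose NMI falls below the threshold."""
    mask = np.zeros(frame1.shape, dtype=np.uint8)

    def split(r0, r1, c0, c1):
        block1, block2 = frame1[r0:r1, c0:c1], frame2[r0:r1, c0:c1]
        if nmi(block1, block2) >= threshold:       # blocks considered similar: stop
            return
        if (r1 - r0) <= min_size or (c1 - c0) <= min_size:
            mask[r0:r1, c0:c1] = 1                 # smallest scale reached: mark as moving
            return
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2    # recurse on the four quadrants
        split(r0, rm, c0, cm); split(r0, rm, cm, c1)
        split(rm, r1, c0, cm); split(rm, r1, cm, c1)

    split(0, frame1.shape[0], 0, frame1.shape[1])
    return mask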
3 Energy Minimization Approach
In this Section, we propose an algorithm for data association restricted to one category of measurement: the position. Furthermore, we assume a total lack of prior information concerning the targets: only the two previous predicted positions, at t − 1 and t − 2, are used as input to our algorithm. We first give the concept of our approach before starting its mathematical modeling. We define a novel energy according to the evolution of the dynamic model of the target. The dynamics are described in terms of displacements in the target space
(x, y). The dynamic scene is observed by a sensor which provides exactly one observation at instant t, containing at least one measurement, which can be associated with a specific object or can be a false alarm. Our goal is to associate one measurement per target. We call $(y_1,\dots,y_{M_i})$ the vector containing the $M_i$ measurements at a particular instant, also called the observation. Each measurement is defined as a position in the target space. We denote by $A$ the position of target $k$, by $\hat{A}(t)$ its prediction at $t$, and by $y_j$ the measurement available at instant $t$. We distinguish two dynamic models: (i) the initial dynamic model, $\hat{A}_1(t+1) = f_1(\hat{A}(t)) + B_1$; (ii) the updated dynamic model, $\hat{A}_2(t+1) = f_2(y_j) + B_2$; where $B_1$ and $B_2$ are Gaussian noises, $f_1$ is a function representing the initial movement and $f_2$ is the new function after updating its parameters when the measurement $y_j$ is associated to the target. The energy between the target $k$ and the measurement $y_j$ is defined by $E(k, y_j) = \frac{1}{\sqrt{3}} \sum_{l=1}^{3} \alpha_l E^l(k, y_j)$, where $\alpha_l = 1 / \sum_{k=1}^{K} E^l(k, y_j)$ is a weighting factor introduced to emphasize the relative importance attached to the energy quantities $E^l$. If we only consider linear translation in one direction, the data association problem is limited to the computation of the Mahalanobis distance energy. Thus, in the case of complex dynamics such as non-linear displacements, oscillatory motions and non-constant velocities, we incorporate a second energy which measures the absolute accuracy between the dynamic features and indicates how close their parameters are. Moreover, we distinguish some dynamic cases, clarified by geometric descriptions afterward, for which we need to compensate with a third energy, the proximity energy, to improve the data association. Finally, the measurement $y_j$ is associated to target $k$ by minimizing the total energy:

$$D_{y_j \to k} = \operatorname*{argmin}_{k=1,\dots,K} E(k, y_j) = \operatorname*{argmin}_{k=1,\dots,K} \left\{ \frac{1}{\sqrt{3}} \sqrt{\sum_{l=1}^{3} \alpha_l^2 \, \big(E^l(k, y_j)\big)^2} \right\} \quad (1)$$
with $0 \le \alpha_l \le 1$ and $0 \le E(k,y_j) \le 1$. Prediction is based on the use of a dynamic model whose parameters are generally fixed by learning from a training sequence to represent plausible motions such as constant velocities or critically damped oscillations [12,4]. For complex dynamics, such as non-constant velocities or non-periodic oscillations, the choice of the parameters for an estimation algorithm is difficult. Furthermore, the learning step becomes particularly difficult in the case of missing data, because the dynamics between two successive observations are unknown. For these reasons, the parameters of our dynamic model are set in an adaptive and automated way once a measurement is available [13]. The energy $E(k,y_j)$ is a linear combination of three energies, $\{E^1, E^2, E^3\}$, given by:
1. The Mahalanobis distance, $E^1(k,y_j)$, measures the distance between a measurement $y_j$ available at $t$ and the prediction of $A$ at $(t-1)$. This energy is sufficient if the motion is limited to translations (the case of linear displacements). It is given by $E^1(k,y_j) = (y_j - \hat{A}(t-1))^T \hat{\Sigma}_k^{-1} (y_j - \hat{A}(t-1))$,
Fig. 2. (a-b-e) Intersection surfaces {S1, S2, S}; (c-d) Difference between the surfaces S1 and S2 extracted from two dynamical models; (f) Intersection surfaces when two predictions at instant t, Â1 and Â2, are equidistant from yj
where $\hat{\Sigma}_k$ is the covariance matrix of target $k$ (denoted $A$ in the equation); we suppose that the coordinates are independent and we fix their variances.
2. To consider the case of complex dynamics, such as oscillatory motions or non-constant velocities, we have added the absolute accuracy evolution energy $E^2(k,y_j)$. It introduces the notion of geometric accuracy between two sets of features whose dynamic evolution is different. The two models are described as follows:
– The updated dynamic model considers that the measurement $y_j$ at $t$ is generated by the $k$-th target and updates the parameters of its dynamic model to predict the new state of target $k$ at $(t+1)$;
– The not-updated dynamic model predicts the new state at $(t+1)$ without considering the presence of any measurement, i.e. without updating the parameters of the dynamic model.
$E^2(k,y_j)$ provides a numerical estimation of the closeness between the two dynamic models. Our idea is to evaluate the parameters of the dynamic model in two steps, whether or not the measurement $y_j$ arises from a target. We first predict the states $\hat{A}_1(t+1)$ and $\hat{A}_2(t+1)$ of the target at $(t+1)$. We then determine $S_1$, the intersection surface between the two circumscribed circles of the triangles $(\hat{A}(t-2), \hat{A}(t-1), \hat{A}(t))$ and $(\hat{A}(t-1), \hat{A}(t), \hat{A}_1(t+1))$, and $S_2$, the intersection surface between the two circumscribed circles of the triangles $(\hat{A}(t-2), \hat{A}(t-1), y_j)$ and $(\hat{A}(t-1), y_j, \hat{A}_2(t+1))$ (see Figures 2.(a-b)). $E^2(k,y_j)$ is minimized when the similarity between both dynamic models is maximized, and is given by: $E^2(k,y_j) = |S_1 - S_2|$. A question might be asked: is the component $E^2$ able to handle all types of motions? Indeed, $E^2$ evaluates a numerical measure of similarity between
dynamic models. This measurement depends on the difference between two surfaces. It is considered reliable if both positions, $\hat{A}(t)$ and $y_j$, are on the same side of the axis $(\hat{A}_{t-2}\hat{A}_{t-1})$, see Figure 2.c. In Figure 2.d, we show the case where both surfaces $S_1$ and $S_2$ are quite similar, which implies that $E^2$ is null. This case can occur when the positions of $\hat{A}(t)$ and $y_j$ are diametrically opposite, or when they lie on different sides of the axis $(\hat{A}_{t-2}\hat{A}_{t-1})$. In such cases, this energy is not a sufficient source of information to achieve the task of association. To compensate for it, we incorporate the third energy $E^3$.
3. The proximity energy evolution, $E^3(k,y_j)$, is the inverse of the surface $S$ defined by the intersection of the two triangles $(\hat{A}(t-2), \hat{A}(t-1), y_j)$ and $(\hat{A}(t-2), \hat{A}(t-1), \hat{A}(t))$ (see the dotted area of Figure 2.e). This energy evaluates the absolute accuracy between the prediction $\hat{A}(t)$ and the measurement $y_j$ at instant $t$. Increasing $S$ means that the prediction and the measurement at instant $t$ are close. This energy is given by: $E^3(k,y_j) = \frac{1}{S}$. Another question could be asked: why use the intersection surface instead of only calculating the distance between the measurement $y_j$ and the prediction of the target's position at instant $t$? In Figure 2.f, we have two predictions at instant $t$, $\hat{A}_1$ and $\hat{A}_2$, that are both equidistant from the measurement $y_j$. If we only compute the distance to measure the proximity energy, we will conclude that both models have the same degree of similarity with the initial model defined by the dynamic model of points $(\hat{A}(t-2), \hat{A}(t-1), y_j)$. This result leads to a contradiction with reality. This problem can be explained by the fact that if they both have the same degree of similarity with the third dynamic model, we can conclude that their corresponding targets have the same dynamics. For this reason, we have chosen to evaluate the similarity by extracting the intersection surface between triangles. We can remark in Figure 2.f that these intersection surfaces are very different, which leads to different measures of the degree of similarity.
We have described a novel approach for data association based on the minimization of an energy magnitude whose components are extracted from geometrical representations constructed with measurements, previous states and predictions. The purpose of choosing a geometrical definition for these energies is to:
– show the geometrical continuity of the system between predictions and previous states using two different dynamic models;
– measure the similarity between predictions, at a particular time for the same object, using two different dynamic models, which logically must be quite similar because they represent the same system.
ENMIM is the combination of QNMI and the energy minimization approach previously described. We first detect moving areas between two frames with QNMI, giving the measurements, and then associate these measurements with targets. This yields a robust multiple object tracking method which is evaluated in the next Section.
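To make the association step concrete, the sketch below shows how the three energies could be combined and minimized over targets, following Eq. (1). The three energy callables are hypothetical placeholders standing in for E1, E2 and E3 (the latter two require the circumscribed-circle and triangle-intersection constructions of Fig. 2, which are not spelled out here); only the normalization, combination and argmin are what the sketch illustrates.

import numpy as np

def associate(measurements, targets, energies):
    """Assign each measurement to the target minimizing the combined energy of Eq. (1).

    energies: list of three callables [E1, E2, E3], each mapping
    (target, measurement) to a non-negative scalar.
    """
    assignment = {}
    for j, y in enumerate(measurements):
        # Per-energy normalization: alpha_l = 1 / sum_k E^l(k, y_j)
        alphas = [1.0 / sum(E(t, y) for t in targets) for E in energies]
        totals = []
        for t in targets:
            total = np.sqrt(sum((a * E(t, y)) ** 2
                                for a, E in zip(alphas, energies))) / np.sqrt(3.0)
            totals.append(total)
        assignment[j] = int(np.argmin(totals))    # D_{y_j -> k}
    return assignment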
4 Results and Discussions
In this Section, we present some tracking results obtained with the proposed approach on the Tennis man and Ant sequences. The most difficult problem when tracking a table-tennis ball is that the motion is oscillatory with a duration that is not a multiple of the period of oscillation. The dynamics of the ball are complex and undergo vertical and horizontal oscillations with different periods, coupled with translations in both directions. Furthermore, its velocity is non-constant: the movement accelerates and decelerates according to the stroke given by the player. In such a system, it is very difficult, even impossible, to learn the motion from a training set because of its non-linearity and non-periodicity. We use our ENMIM model to track this ball with NMInoise = 0.3 and NMI = 1.7. Figures 3.(a-b) show two frames of the sequence and Figure 3.(c) shows the region of difference between these frames detected by QNMI. Figures 3.(d-e) visualize the moving regions (the ball and the racket). Figure 3.(f) shows the real trajectory of the ball (solid line) and the results of tracking using the ENMIM model (red dots). We note that ENMIM gives very good results. Figure 4 shows frames of the Ant sequence. In this sequence, ants are quite similar, even indistinguishable, and characterized by the same gray level
Fig. 3. Tennis man sequence. (a-b) Frames 13 and 17; (c) Detection of moving regions between frames (a) and (b) using QNMI; (d-e) Visualization of moving objects in frames 13 and 17; (f) The real trajectory of the ball (solid line) and our tracking results with ENMIM (red dots).
Fig. 4. Ant sequence. Acquisitions at t − 2, t − 1, t, t + 1.
Table 1. Numeric values and amplitudes of different energies when a measurement Mj is associated to a target Ti
        T1                T2                T3                T4                T5                T6
M1   6.5  1.5  0.03   22.5  3.1  6.8    15.1 14.2 22.2    21.4  6.6 38.8    18.3 74.4  8.5    16.1  0.3 23.7
M2  47.2  1.8 45.8     5.8  1.2  0.01   24.9  0.2  8.5     1.7  2.2 14.8     6.1 85.8 10.5    13.5 0.25 20.1
M3  25.8  1.4  0.8    21.4  3.1 24.6     4.1  0.3  1.8    21.6  2.7 27.8    17.9 92.2 12.6     9.3  0.2 32.4
M4  43.9  1.1 14.1     6.5  9.6  0.2    17.7  0.2  5.6     9.4  0.3  0.7    12.2 96.2  3.2    10.9  1.1 76.3
M5  48.7 11.3 48.1     2.5 54.5 14.6    23.7  1.3  4.7     5.4 24.1 17.8     9.3  6.9  4.05   10.3  1.9 10.8
M6  46.6  1.1 43.2    12.1  1.6 13.6    19.7  0.1 24.5    10.2  4.4 12.1     5.9 92.4  6.2     5.5  0.4  0.35
(each cell lists α1E1(Ti, Mj), α2E2(Ti, Mj), α3E3(Ti, Mj))

E(Ti, Mj)
M1   3.9   38.1   14.9   26.7   40.1   36.7
M2  13.7    3.4   18.9    6.7   32.6   10.5
M3  17.5   15.2    2.6   10.8   14     18.1
M4  25.9    8.7   20.4    5.4   17.6    9.5
M5  44.4   50.1   54.7   56.1    7.1   53.6
M6  16.6   13.9   19.4   44.5    8.7    3.2
distribution. We remark that their displacements are erratic, with non-constant velocities. They change direction, accelerate, decelerate, stop moving, and rotate around their axis. The sensor provides, at t, an observation containing six measurements corresponding to positions in the (x, y) space. In such a scene, only motion information is used. Figure 4 shows the acquisitions at {t−2, t−1, t, t+1}, corresponding to frames {10, 25, 35, 45}. Notice that the frame at t is the available observation. Table 1 contains the numerical values of the energy components when a measurement Mj is associated to a target Ti. The NNSF method associates measurements {M2, M4, M5} respectively to targets {T4, T2, T2}, which contradicts reality (see α1E1(Ti, Mj) in Table 1). We remark from Table 1 that α2E2(2, M2) < α2E2(4, M2) and α3E3(2, M2) < α3E3(4, M2), which compensates the error given by α1E1(2, M2). Finally, E(k, M2) is minimized when M2 is associated with target T2. Let us take another example to show the necessity of using the energy E3 in our formulation. If we only use the energies α1E1(Ti, Mj) and α2E2(Ti, Mj) to associate data, we get E(6, M5) < E(5, M5), and the measurement M5 would be associated to target T6, which is wrong. We can remark from Table 1 that α3E3(5, M5) < α3E3(6, M5), which compensates the other energy errors. Finally, we observe that each measurement is correctly associated to its corresponding target. We notice that our energy minimization approach for data association is not computationally intensive: in Matlab, the total computation time of all these energies is 0.25 seconds. For this sequence, we have obtained NMInoise = 0.3 and NMI = 1.7. Figure 5.(c) shows the regions of difference between the two treated frames: the black background corresponds to the common regions between both frames and
Fig. 5. (a-b) Two frames from the Ant sequence; (c) Detection of moving regions between frames (a) and (b) using QNMI; (d-e) Visualization of detected ants; (f) Ants trajectory: the red triangles and the blue ’+’ represent the tracking results respectively obtained with ENMIM and JPDAF
the white characterizes the moving ones. Figures 5.(d-e) represent the detected moving regions: ants are well located in both frames in spite of the presence of noise. Considering one observation every eight frames, we have compared the tracking results obtained with our ENMIM approach and with JPDAF, which provides an optimal object tracking solution in the Bayesian framework by coupling a particle filter with the JPDA method (see [8] for more details). We show in Figure 5.(f) the real trajectory (solid line) of each ant separately (for better visibility), our tracking results (red triangles) and JPDAF's tracking results (blue '+'). As we can see, JPDAF fails to follow the ants, most of the time because of an
association error (when the '+' is not attached to the solid line). However, we remark that our tracking results correspond to the real trajectory, which means that ENMIM follows each ant well.
5 Conclusions
In practice, to improve tracking, learning the motion from a training set is required to define the dynamic model for an estimation algorithm. Learning can be handled in the case of plausible motions such as constant velocity, but a problem arises in the case of non-linear dynamics (e.g. non-periodic oscillations, non-regular accelerations, ...). Moreover, the data association problem is of crucial importance for improving online multiple-target tracking. In this work, we have combined two approaches, QNMI and an energy minimization approach, to build ENMIM, a model for online multiple-target tracking. ENMIM is not restricted to object position tracking but can also deal with deformations and rotations from one frame to another. It can track objects with random and unlearned motions. The main advantage of ENMIM is that it is parameterless and can handle noisy and multi-modal images without needing any preprocessing step. Future work will involve the integration of a particle filter into our model to predict the state of objects when we have a large interval of time between two successive acquisitions. Likewise, we will extend our model to take into account the superimposition of multi-modal images, i.e. when a non-linear function relates the two images.
References
1. Doucet, A., Godsill, S., Andrieu, C.: On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 197–208 (2000)
2. Doucet, A., Gordon, N., de Freitas, J.: An introduction to sequential Monte Carlo methods. In: Sequential Monte Carlo Methods in Practice, Springer, New York (2001)
3. Kitagawa, G.: Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 1–25 (1996)
4. Blake, A., Isard, M.: Active contours. Springer, Heidelberg (1998)
5. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. Int. J. Computer Vision (1998)
6. Rong, L., Bar-Shalom, Y.: Tracking in clutter with nearest neighbor filter: analysis and performance. IEEE Transactions on Aerospace and Electronic Systems (1996)
7. Vermaak, J., Godsill, S., Pérez, P.: Monte Carlo filtering for multi-target tracking and data association. IEEE Transactions on Aerospace and Electronic Systems (2005)
8. Fortmann, T., Bar-Shalom, Y., Scheffe, M.: Sonar tracking of multiple targets using joint probabilistic data association. IEEE Journ. Oceanic Engineering (1983)
9. Viola, P.: Alignment by maximization of mutual information. Ph.D. thesis, Massachusetts Institute of Technology, Boston, MA, USA (1995)
10. Collignon, A.: Multi-modality medical image registration by maximization of mutual information. Ph.D. thesis, Catholic University of Leuven, Leuven, Belgium (1998)
11. Knops, Z.F., Maintz, J., Viergever, M., Pluim, J.: Registration using segment intensity remapping and mutual information. In: Barillot, C., Haynor, D.R., Hellier, P. (eds.) MICCAI 2004. LNCS, vol. 3216, pp. 805–812. Springer, Heidelberg (2004)
12. North, B., Blake, A., Isard, M., Rittscher, J.: Learning and classification of complex dynamics. IEEE Transactions on Pattern Analysis and Machine Intelligence (2000)
13. Abed, A.E., Dubuisson, S., Béréziat, D.: Comparison of statistical and shape-based approaches for non-rigid motion tracking with missing data using a particle filter. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2006. LNCS, vol. 4179, pp. 185–196. Springer, Heidelberg (2006)
Geometrical Scene Analysis Using Co-motion Statistics Zoltán Szlávik, László Havasi, and Tamás Szirányi Computer and Automation Research Institute, Hungarian Academy of Sciences, H-1111 Budapest, Kende u. 13-17, Hungary {szlavik,havasi,sziranyi}@sztaki.hu
Abstract. Deriving the geometrical features of an observed scene is pivotal for better understanding and detection of events in recorded videos. In this paper, methods are presented for the estimation of various geometrical scene characteristics. The estimated characteristics are: point correspondences in stereo views, the mirror pole, the light source and the horizon line. The estimation is based on the analysis of dynamical scene properties using co-motion statistics. Various experiments prove the feasibility of our approach.
1 Introduction

The analysis of scene dynamics is a fundamental task in a number of applications involving multi-camera systems, such as stereo vision, three-dimensional reconstruction, or object tracking/observation in surveillance systems. Estimation of the geometrical properties of scenes is usually required for better description, understanding and detection of the observed objects and events. In the case of scenes including several objects with random motion, successful estimation of scene geometry conventionally requires some a priori object definition or some human interaction. In this paper, methods are proposed for the estimation of different geometrical scene characteristics such as: point correspondences, mirror pole, light source, horizon line. Most of the existing methods for the estimation of these characteristics are still-image based, calculating scene characteristics from the structure (e.g. edges, corners) or appearance (e.g. color, shape) of the observed scene and objects [2][13][14]. Such methods may fail if the chosen primitives or features cannot be reliably detected. The views of the scene from the various cameras may be very different, so we cannot base the decision solely on the color or shape of objects in the scene. In a multi-camera observation system, the video sequences recorded by the cameras can be used for estimating matching correspondences between different views. Video sequences in fact also contain information about the scene dynamics besides the static frame data. Dynamics is an inherent property of the scene, independent of the camera positions, the different zoom-lens settings and lighting conditions. The basic task of scene geometry estimation from multiple views is the estimation of point correspondences. Based on the point correspondences extracted from two or more different views of the same scene, various further tasks can be solved, such as: registration of camera views [3][10][11], reconstruction of scene structure [8], calibration of cameras. Several motion-based methods were proposed for registration between camera views [4][9][10][11]. The results of the registration are highly
influenced by the accuracy and robustness of the object tracker. It was assumed that only a limited number of objects are observed in both views, which is not applicable in practical situations. In [23], co-motion statistics were used for the alignment of two overlapping views. In that approach, instead of the trajectories of moving objects, the statistics of concurrent motions – the so-called co-motion statistics – were used to locate point correspondences in pairs of images. The main advantage of using co-motion statistics is that no a priori information about motion, objects or structures is required. The determination of the position of the vanishing point [20] or the mirror pole [21] in cases where the input is a noisy outdoor video sequence which contains some specularly reflective planar surface within the field of view is a task which has rarely been investigated. The importance of this task lies in the fact that knowledge of the position of the mirror pole (henceforward: MP) enables the geometrical modeling of a planar reflective surface on the wall or of shadows cast on the ground-plane. These situations are often found in surveillance feeds, and they almost always cause problems in further processing steps and reduce the performance. Previous publications have focused on the use of a mirror to accomplish the 3D reconstruction task [15][16][20][21][22]. Most of these works rely on hand-selected point correspondences. The vanishing line is useful for camera orientation and extrinsic parameter determination [17]. For still images [8][18], it can be successfully determined only when there are detectable parallel lines; and in image sequences, only when certain assumptions are satisfied which enable us to detect and track known objects [17]. However, the precise detection of such non-rigid objects is a very challenging task in outdoor images. Additionally, in videos captured by analog surveillance cameras the contrast and focus are often badly adjusted, and thus precise measurements are not possible in individual frames. The evaluation of images is always influenced by lighting conditions and shadows. The estimation of the light source is very important in shadow modeling [19]. Shadows are important features in modeling the 3D visual world; they provide additional visual cues for depth and shape [8]. They are also useful for other computer vision applications such as detection and tracking of objects in surveillance systems. By detecting the location of the light source, geometrical terms could be included into traditional color-based shadow detection methods. Hence, more precise shadow detection could be possible [29]. The primary aim of the present paper is to show that, by using co-motion statistics, different characteristic features can be extracted about the scene dynamics and various geometrical scene properties can be estimated within a single framework.
2 Co-motion Statistics

The estimation of geometrical scene characteristics is based on point correspondences and shape properties extracted from video sequences. The extraction of these features is performed using co-motion statistics [23]. Briefly, the co-motion statistics are a numerical estimation of the probability of concurrent motion of different pixels in the image plane (or between the image planes of different cameras).
For the purpose of this paper, we assume that change detection results are available through whatever method is preferred by the user, e.g. by implementing the background modeling method proposed in [3] or simple change detection. For simplicity of description, let us consider a single pixel in the image plane at location $\vec{x}$. The extension of the procedure to the whole image is straightforward. Let $m_1(t,\vec{x})$ denote a binary motion mask, where $t$ is the time and the 2D vector $\vec{x}$ is the position in the image. $m_1(t,\vec{x})$ takes the value "1" (motion) or "0" (no motion). Then the probability of observing motion (change) at a given location $\vec{x}$ can be defined as in (1), where $\Delta t$ denotes the frame count (because of the discrete time-steps):

$$P_g(\vec{x}) = \frac{\sum_t m_1(t,\vec{x})}{\Delta t} \quad (1)$$

The temporal collection (accumulation) of 2D binary motion masks provides useful information about the parts of the image where temporally-concurrent motions occur. In general, the conditional probability of detecting motion (change) at an arbitrary image-point $\vec{u}$ when motion (change) is detected at another image-point $\vec{x}$ can be defined with the conditional-probability formula (2):

$$f(\vec{u},\vec{x}) = P_{co}(\vec{u}\mid\vec{x}) = \frac{\sum_t m_1(t,\vec{x})\, m_2(t,\vec{u})}{\sum_t m_1(t,\vec{x})} \quad (2)$$

Points $\vec{x}$ and $\vec{u}$ in (2) can be taken from the same image or from different ones.
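The following sketch shows one straightforward way to accumulate the statistics of Eqs. (1)-(2) for a reference pixel from a sequence of binary change masks. It is only an illustration under our own assumptions; the paper's actual implementation details are given in [23].

import numpy as np

def comotion_map(masks1, masks2, x):
    """Estimate P_co(u | x) of Eq. (2) for a reference pixel x (row, col).

    masks1, masks2: iterables of binary change masks (same shape), one per frame;
    for intra-image statistics the same mask sequence is passed twice.
    """
    num = None
    den = 0
    n_frames = 0
    for m1, m2 in zip(masks1, masks2):
        n_frames += 1
        if m1[x]:                          # motion detected at x in this frame
            den += 1
            num = m2.astype(np.float64) if num is None else num + m2
    if den == 0:
        return None                        # x never moved: statistics undefined
    # Note: P_g(x) of Eq. (1) would simply be den / n_frames.
    return num / den                       # conditional co-motion probability map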
Fig. 1. The upper two images show co-motion statistics within a single view (Shop sequence). Below them are examples between two views of the same scene.
In the first case concurrent motions are described within an image while in the latter case concurrent motions are described between different images of the same scene. Concurrent motions within a single image will appear when mirror images are observed in the scene because they move together with the object reflected in the mirror.
Inter-image concurrent motions will always occur if a dynamic scene is observed with two or more cameras from different locations. Shadows also move together with the objects that cast them. By associating the output of a shadow detector with one of the motion masks in (2), co-motion statistics between the image of detected motions and the image of detected shadows can be defined. Example statistics are shown in Fig. 1. For a detailed description of the implementation issues, we refer to [23]. Since $P_{co}(\vec{u}\mid\vec{x})$ can be assigned to every pixel in the image, the 2D discrete PDF (probability distribution function) will have local maxima in the position(s) where concurrent motions were often detected. The number of these most-probable peaks depends on the investigated statistics (and of course on the scene geometry): one peak is probable in co-motion statistics of areas from different images and for shadows, while two peaks are probable in the case of "local statistics" where there is some visible reflective surface. Thus the PDFs can be modeled with a simple Gaussian mixture model (GMM) with one or two components (3):

$$P_{co}(\vec{u}\mid\vec{x}) \approx \sum_i w_i^{\vec{x}}\, \mathcal{N}(\vec{u}, \vec{\mu}_i^{\vec{x}}, \Sigma_i^{\vec{x}}), \quad \text{where} \quad \sum_i w_i^{\vec{x}} = 1 \quad (3)$$

The model parameters can be established by using the EM algorithm [24].
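As an illustration of this modeling step, the sketch below fits a one- or two-component GMM to a co-motion map by treating it as a weighted sample of pixel positions. The use of scikit-learn's GaussianMixture and the resampling trick are our own assumptions, not the authors' implementation.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_comotion_gmm(pco, n_components=2, n_samples=5000, seed=0):
    """Fit a GMM to a co-motion probability map (2D array) and return it."""
    rng = np.random.default_rng(seed)
    rows, cols = np.indices(pco.shape)
    points = np.stack([cols.ravel(), rows.ravel()], axis=1)   # (x, y) positions
    weights = pco.ravel() / pco.sum()
    # Draw positions proportionally to the co-motion probability, then fit by EM.
    idx = rng.choice(len(points), size=n_samples, p=weights)
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm.fit(points[idx])
    return gmm    # gmm.means_ are the candidate corresponding points (mu_i^x)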
2.1 Corresponding Point Extraction

Most of the inspected scene modeling tasks are based on point correspondences. In a parametric model of co-motion maps, the corresponding point pairs in multiple views or in single images (for mirror or shadow) are the centers of the GMM parameters ($\vec{\mu}_i^{\vec{x}}$) in the local (intra-image) and remote (inter-image) statistics. For the computation of the scene geometry in the case of a scene with a reflective planar surface, the two centers in the local statistics can be used (see Fig. 1). The correspondences for shadow modeling come from the local and shadow statistics. In the following, corresponding point pairs will be identified by the two weighted Gaussian functions $C_p = w\,\mathcal{N}(\vec{u},\vec{\mu},\Sigma)$ and $C_{p'} = w'\,\mathcal{N}(\vec{u}',\vec{\mu}',\Sigma')$. Depending on the scene configuration, not every moving point will have a visible corresponding point-pair. The extracted set of point correspondences will contain many false matches. To reduce the number of these outliers, they are filtered according to their directions. Earlier it was assumed that the observed motions are on the ground-plane, which means that the inlier point correspondences will have the same direction. Thus, by filtering the directions of point correspondences, most of the outliers can be excluded from the set before further processing. The idea of this outlier rejection is illustrated in Fig. 2.

2.2 Extraction of the Average Size of Detected Objects

From the accumulated co-motion statistics, the average size of detected objects at a given pixel can easily be extracted. The dimensions and orientation of the average shape come from the eigenvalue decomposition of the covariance matrix:

$$\Sigma_{\vec{x}}\, v_{\vec{x},i} = \lambda_{\vec{x},i}\, v_{\vec{x},i}, \quad i = 1,2 \quad (4)$$
Fig. 2. Illustration of outlier rejection on the “Shop” sequence. Only the directions corresponding to the main peak (mode) of the histogram (determined from the line directions) will be used for later computations. a) before rejection, c) after rejection; b) and d) show the corresponding histograms of angles.
These statistical characteristics are displayed in Fig. 3.
Fig. 3. Example for shape properties: axes of normal distributions, derived from the eigen-value decomposition of the covariance matrix
Finally, the height measurement comes from the projection (vertical component) of the most vertical eigenvector:
$$(\lambda_{\vec{x},\max},\, v_{\vec{x},\max}) = \operatorname*{argmax}_{(\lambda,\,v)} \big( \lambda_{\vec{x},1}\,\langle e, v_{\vec{x},1}\rangle,\; \lambda_{\vec{x},2}\,\langle e, v_{\vec{x},2}\rangle \big), \qquad h_j = h_{\vec{x}} = \lambda_{\vec{x},\max}\,\langle e, v_{\vec{x},\max}\rangle \quad (5)$$

where $e$ denotes the vertical unit vector $e = [\,0 \;\; 1\,]$ and $\langle\cdot,\cdot\rangle$ is the dot product.
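A minimal sketch of this shape extraction is given below: per-pixel second-order statistics of the local co-motion map are reduced to an orientation and an apparent object height as in Eqs. (4)-(5). Estimating the covariance Σx as a weighted covariance of the co-motion map is our own assumption for illustration.

import numpy as np

def apparent_height(pco):
    """Estimate the apparent object height h_x from a local co-motion map."""
    rows, cols = np.indices(pco.shape)
    w = pco / pco.sum()
    mean = np.array([np.sum(w * cols), np.sum(w * rows)])
    d = np.stack([cols - mean[0], rows - mean[1]])
    cov = np.einsum('i...,j...->ij', d * w, d)          # weighted covariance Sigma_x
    lam, vec = np.linalg.eigh(cov)                      # eigen-decomposition, Eq. (4)
    e = np.array([0.0, 1.0])                            # vertical unit vector
    proj = lam * np.abs(vec.T @ e)                      # lambda_i * <e, v_i>
    return proj.max()                                   # height measurement, Eq. (5)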
3 Extraction of Geometrical Scene Properties

In this chapter we show that co-motion statistics can be used for the estimation of different geometrical scene characteristics. The estimation of the investigated models is based on point correspondences or on shape properties of the observed objects.

3.1 Matching of Camera Views

A specific form of the transformation between images produced by cameras with overlapping fields of view is the homography matrix [8]. During this point-to-point transformation one assumes that the objects are on the ground-plane (or any flat plane). In this case the parameter matrix $H$ is a projective transformation that can be represented by a 3×3 matrix, expressed by the following transformation:

$$Hx = x' \quad (6)$$
where $x$ and $x'$ are the corresponding points in the two views in homogeneous coordinate form. For most scenes we can assume that moving objects are small enough to appear in the recorded videos as "moving blobs on the ground". Therefore, the set of point correspondences obtained by estimating co-motion statistics will contain points which are on the ground-plane. Then $H$ can be estimated from at least 4 corresponding points by implementing the standard Direct Linear Transformation algorithm [1]. For the robust estimation of the transformation $H$ that maps points of one view onto another, and for the rejection of outliers from the set of point correspondences, we have implemented the RANSAC algorithm [6]. When $H$ has been determined, the views can be aligned with each other.

3.2 Estimation of the Mirror Pole

Now we present a searching algorithm to find an optimal mirror pole (MP), first published in [7]. In the following formulas the corresponding point-pairs are contained in the two sets $A = \{\vec{\mu}_1^{\vec{x}}\}$ and $A' = \{\vec{\mu}_2^{\vec{x}}\}$. Lower-case symbols $\vec{a}$ and $\vec{a}'$ are used to denote the elements of these sets. (Note that at this stage we are not able to identify which is the original and which is the reflection.) To solve this accuracy problem we introduce a fitness function to measure the fitting of a possible mirror pole position; the "best" mirror pole is the argument of the fitness function at its global maximum:

$$\vec{MP} = \operatorname*{argmax}_{\vec{u}} \sum_{\vec{a}\in A} P_g(\vec{a})\, P_{coll}\big(\delta(\vec{a},\vec{u}) \mid \vec{a}\big) \quad (7)$$
Because the motion statistics are included in this function, it is not only completed with a weighting component ($P_g(\vec{a})$) but it also permits a small correction of the positions. We define the function $\delta(\vec{a},\vec{u})$ that returns the 2D position $\vec{v}$ corresponding to the largest value of the Gaussian function $P_{coll}(\vec{v}\mid\vec{a})$ such that the points $\vec{a}$, $\vec{u}$ and $\vec{v}$ are collinear:
$$\delta(\vec{a},\vec{u}) = \operatorname*{argmax}_{\vec{v}\in S} P_{coll}(\vec{v}\mid\vec{a}) \quad \text{and} \quad \vec{v}\in \overrightarrow{\vec{a}\vec{u}} \quad (8)$$

The optimization is carried out via an unconstrained nonlinear optimization.
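A rough sketch of how the fitness of a candidate mirror pole could be evaluated and maximized is shown below. The helper delta standing in for δ of Eq. (8), the probability callables and the use of SciPy's Nelder-Mead optimizer are our own assumptions; the paper only states that an unconstrained nonlinear optimization is used.

import numpy as np
from scipy.optimize import minimize

def fitness(u, points_a, p_g, p_coll, delta):
    """Eq. (7): sum over a in A of P_g(a) * P_coll(delta(a, u) | a)."""
    return sum(p_g(a) * p_coll(delta(a, u), a) for a in points_a)

def estimate_mirror_pole(points_a, p_g, p_coll, delta, u0):
    # Maximize the fitness by minimizing its negative, starting from an initial
    # guess u0 (e.g. obtained with the plane-symmetry method of [21]).
    res = minimize(lambda u: -fitness(u, points_a, p_g, p_coll, delta),
                   x0=np.asarray(u0, dtype=float), method='Nelder-Mead')
    return res.x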
3.3 Estimation of the Light Source

For the estimation of the light source in the observed scene, the shadows cast by moving objects are analyzed. Shadows move together with the objects that cast them, so their "footprint" must appear in the accumulated co-motion statistics. An example is shown in Fig. 4. In the evaluation, a motion detection method was used which is based on the background model introduced by Stauffer, and the initial color-based shadow detection is a modification of the similar part of SAKBOT [5]. The use of the shadow and motion masks together is possible after the following modification of (2):

$$f_{sh}(\vec{u},\vec{x}) = P_{co}(\vec{u}\mid\vec{x}) = \frac{\sum_t m(t,\vec{x})\, s(t,\vec{u})}{\sum_t \big(m(t,\vec{x}) + s(t,\vec{u})\big)} \quad (9)$$
In the formula, $s(t,\vec{u})$ denotes the binarized shadow mask, as demonstrated in Fig. 4.
Fig. 4. Example co-motion statistics with shadow included into the binary motion mask
Shafer [27] points out that an object and its cast shadow share a geometrical relationship similar to that found in the camera-mirror case, and consequently the method introduced in Section 3.2 is applicable to the cast-shadow case as well. The steps of the method are the same as in the extraction of the mirror pole in Section 3.2.

3.4 Horizon Line (Vanishing Line) Estimation

Parallel planes in 3-dimensional space intersect the plane at infinity in a common line, and the image of this line is the horizontal vanishing line, or horizon. The vanishing line (VL) depends only on the orientation of the camera [8]. In this paper we describe the VL with the parameters of the line. The determination of the vanishing line is possible with knowledge of at least three corresponding line segments, see Fig. 5. These line segments can be computed from the apparent height of the same object as seen at different positions (depths) on the ground-plane. The objects may for instance be pedestrians [17], and the line segments
Fig. 5. Illustration of the computation of vanishing line
denote their height. The precise detection of such non-rigid objects is a highly challenging task in outdoor images. However, in our framework the necessary height information can be easily determined from the local statistics, as described in Section 2.2. The information derived from the statistics is valid only if the following assumption is satisfied: there are regions where the same objects are moving with equivalent probability (e.g. a pathway or road). In general, without making any prior assumptions about the scene, every point may be paired with every other point. But the practical processing of this huge data-set requires an effective way to drop "outlier" points and extract information for VL estimation. First, we describe simple conditions which can be used to reduce the size of the data-set. The outlier rejection in this case amounts to dropping points where the two moving objects are not of the same size. We consider two points as corresponding points (which is probable where same-sized objects are concerned) if
$$\sigma_1 < \frac{\lambda_1/\lambda_2}{\lambda_1'/\lambda_2'} < \sigma_2 \quad (10)$$

and

$$\alpha(v_1, v_1') < \delta \quad (11)$$
where the notations come from the eigenvalue decomposition of the covariance matrices $\Sigma$ and $\Sigma'$ (4) of the two points, and $\alpha(\cdot)$ denotes the angle between two vectors (the deviation of the eigenvectors in our case). These simple conditions lead to a set of points where the objects have similar orientation and aspect ratio. Fig. 6a demonstrates the extracted point set (marked by circles) corresponding to a point (marked by x). The corresponding heights are displayed in Fig. 6b. After this preprocessing, every point will have several probable corresponding points (pixel locations where the same average object was observed). However, several outliers remain; thus we use all points to determine a plane which lies on these points. The intersection of this plane and the ground-plane (the plane at zero height) represents a guess for the vanishing line, see Fig. 6c.
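The sketch below applies the two filtering conditions (10)-(11) to a pair of candidate points described by their covariance eigen-decompositions. The default thresholds reuse the empirical values reported in Section 4; the data layout (eigenvalues sorted in descending order, eigenvectors as columns) is our own assumption.

import numpy as np

def similar_shape(lam, vec, lam_p, vec_p, sigma1=0.8, sigma2=1.25, delta_deg=10.0):
    """Check conditions (10) and (11) for two points.

    lam, lam_p: eigenvalues (descending); vec, vec_p: matching eigenvectors (columns).
    """
    ratio = (lam[0] / lam[1]) / (lam_p[0] / lam_p[1])          # condition (10)
    cos_a = abs(np.dot(vec[:, 0], vec_p[:, 0]))
    angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))   # condition (11)
    return (sigma1 < ratio < sigma2) and (angle < delta_deg)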
Fig. 6. VL computation based on heights. Corresponding points (marked with circles in a) and also demonstrated in b)) are related to an arbitrarily selected image point (marked by the large 'x'). The estimation of the corresponding vanishing line is determined by the intersection of the ground-plane and the plane fitted to the points (see c)). The estimated horizon line is superimposed on a).
Because there are several estimations of the vanishing line (since every pixel of the image is processed), we describe these lines with their parameters in the space of a Hough transformation [25][26]. In this space a line is represented by a distance ($s$) and an orientation angle ($\phi$) which must satisfy equation (12):

$$s = x\cos(\phi) + y\sin(\phi) \quad (12)$$
Thus, the guess about the horizon line at point $\vec{x}$ is determined by $\vec{\mu}_H^{\vec{x}} = [\,\phi \;\; s\,]$. Then, the parameters of the optimal vanishing line are selected as the parameters corresponding to the global maximum in the parameter space.
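A compact sketch of this estimation step is given below: for a set of (x, y, h) triples (pixel positions with their apparent heights from Section 2.2), a plane is fitted and intersected with the ground-plane to obtain one (phi, s) line hypothesis, and the hypotheses are accumulated in a discretized Hough space. Bin sizes and the least-squares plane fit are our own choices for illustration.

import numpy as np

def line_hypothesis(points_xyh):
    """Fit plane h = a*x + b*y + c to (x, y, h) samples; its h = 0 intersection is a line."""
    x, y, h = np.asarray(points_xyh, dtype=float).T
    A = np.stack([x, y, np.ones_like(x)], axis=1)
    a, b, c = np.linalg.lstsq(A, h, rcond=None)[0]
    # Intersection with the ground-plane (h = 0): a*x + b*y + c = 0,
    # rewritten in the normal form x*cos(phi) + y*sin(phi) = s of Eq. (12).
    norm = np.hypot(a, b)
    return np.arctan2(b, a), -c / norm            # (phi, s)

def best_horizon(hypotheses, phi_bins=180, s_bins=200, s_max=2000.0):
    """Accumulate (phi, s) guesses in a Hough space and return the densest cell."""
    acc = np.zeros((phi_bins, s_bins))
    for phi, s in hypotheses:
        i = int((phi % np.pi) / np.pi * phi_bins) % phi_bins
        j = int(np.clip((s + s_max) / (2 * s_max) * s_bins, 0, s_bins - 1))
        acc[i, j] += 1
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return (i + 0.5) / phi_bins * np.pi, (j + 0.5) / s_bins * 2 * s_max - s_max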
4 Experimental Results

We performed a practical evaluation where both indoor and outdoor videos were used as input. The parameters introduced in the previous sections are assigned the following empirical values: $\varepsilon_1 = 1$, $\varepsilon_2 = 5$, $\sigma_1 = 0.8$, $\sigma_2 = 1.25$ and $\delta = 10°$. To determine the binary motion mask, a motion-detection method was used which is based on the background model introduced by Stauffer [12]. The detection of shadow areas was carried out using a color-based method which is implemented in
SAKBOT [5]. The computation time for the co-motion statistics is about 10-25 milliseconds/frame (depending on the motion-intensity). The objective-function maximization takes 3-25 seconds. The homography computation was tested using videos taken by surveillance cameras located in large outdoor public squares. In these sequences the size of the moving objects is sufficiently small to prove the feasibility of our technique.
Fig. 7. Homography computation in “Traffic” [23] videos
The determination of the mirror pole was evaluated in case of videos with a large reflective planar surface (a shop window). Because the mirror pole is near infinity, only the lines through the original point, its reflection and the mirror pole are displayed. The estimation of the light source was tested on an outdoor video sequence. Example results can be seen in Fig. 8.
Fig. 8. Example results of mirror pole estimation in "Shop" sequence [7] (left image) and for light source estimation in "Shadow" sequence [7]. Note that the lines connect corresponding body parts.

Table 1 shows the numerical results of the mirror pole and light source estimation.

Table 1. Experimental results: #Point denotes the number of automatically-extracted correspondences; derived coordinates of the vanishing point are in the column "Optimized"

Sequence   #Point   Mirror pole / light source
                    Initial [21]    Optimized      True
Shop       790      104, 165        4881, -1272    4500, -1300
Shadow     3509     -13, 53         -1918, 680     -2110, 850
Fig. 9. Horizon line computation in “Indoor” and “Square” videos
The estimation of the horizon line was tested on the “Indoor” and “Square” sequences. The obtained results can be seen in Fig. 9. The manual extrapolation of the vanishing line is a difficult task, because: i) there are not enough static features for accurate alignment; and ii) the objects are usually too small in case of outdoor images.
5 Conclusions

A general framework for the determination of geometric scene properties in surveillance recordings has been introduced. The extended co-motion statistics are the novel features used as a basis for the estimation of various geometric parameters of the model: the vanishing line, mirror pole, light source and the homography between two views. We have shown that, with the only assumption being that something is changing in a video, important geometry-based dynamic features can be extracted in a unified framework. In future work, we intend to investigate the estimation of the necessary time and video length, and the time needed to accomplish synchronization between videos; the use of the method for handling image sequences from non-static cameras will also be explored.
References
1. Abdel-Aziz, Y.I., Karara, H.M.: Direct Linear Transformation from Comparator Coordinates into Object Space Coordinates in Close-range Photogrammetry. In: Proc. ASP/UI Symp. Close-Range Photogrammetry, pp. 1–18 (1971)
2. Barnard, S.T., Thompson, W.B.: Disparity analysis of images. IEEE Trans. PAMI 2, 333–340 (1980)
3. Benedek, Cs., Havasi, L., Szirányi, T., Szlávik, Z.: Motion-based Flexible Camera Registration. In: Proc. IEEE AVSS'05, pp. 439–444. IEEE, Los Alamitos (2005)
4. Caspi, Y., Simakov, D., Irani, M.: Feature-based sequence-to-sequence matching. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, Springer, Heidelberg (2002)
5. Cucchiara, R., Grana, C., Neri, G., Piccardi, M., Prati, A.: The Sakbot System for Moving Object Detection and Tracking, Video-Based Surveillance Systems–Computer Vision and Distributed Processing, 145–157 (2001)
6. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. CACM 24(6), 381–395 (1981) 7. Havasi, L., Szirányi, T.: Estimation of Vanishing Point in Camera-Mirror Scenes Using Video, Optics Letters 1367–1566 (2006) 8. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2003) 9. Khan, S., Shah, M.: Consistent Labeling of Tracked Objects in Multiple Cameras with Overlapping Fields of View. IEEE Trans. PAMI 25(10), 1355–1360 (2003) 10. Lee, L., Romano, R., Stein, G.: Monitoring activities from multiple video streams: establishing a common coordinate frame, IEEE Trans. PAMI, 22 (2000) 11. Nunziati, W., Alon, J., Sclaroff, S., Bimbo, A.D.: View registration using interesting segments on planar trajectories. In: Proc. of IEEE AVSS’05, pp. 75–80. IEEE, Los Alamitos (2005) 12. Stauffer, C., Eric, W., Grimson, L.: Learning patterns of activity using real-time tracking. IEEE Trans. PAMI 22(8), 747–757 (2000) 13. Weng, J., Ahuja, N., Huang, T.S.: Matching two perspective views. IEEE Trans. PAMI 14, 806–825 (1992) 14. Zhang, Z., Deriche, R., Faugeras, O., Luong, Q.-T.: A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence Journal 78, 87–119 (1995) 15. Hu, B., Brown, C., Nelson, R.: Multiple-view 3-D reconstruction using a mirror, Technical Report (2005) 16. Zabrodsky, H., Weinshall, D.: Utilizing symmetry in the reconstruction of 3-dimensional shape from noisy images. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 801, pp. 403– 419. Springer, Heidelberg (1994) 17. Lu, F., Zhao, T., Nevatia, R.: Self-Calibration of a camera from video of a walking human. In: Proc. of ICPR pp. 1513–1518 (2000) 18. Criminisi, A., Reid, I., Zisserman, A.: Single view metrology. In: Proc. of ICCV, 434–442 (1999) 19. Cao, X., Foroosh, H.: Camera calibration and light source orientation from solar shadows. Computer Vision and Image Understanding 105(1), 60–72 (2007) 20. Mitsumoto, H., Tamura, S.: 3-D reconstruction using mirror images based on a plane symmetry recovering method. IEEE Trans. PAMI, 941-946 (1992) 21. Penne, R.: Mirror Symmetry in Perspective. In: Blanc-Talon, J., Philips, W., Popescu, D.C., Scheunders, P. (eds.) ACIVS 2005. LNCS, vol. 3708, pp. 634–642. Springer, Heidelberg (2005) 22. Francois, A.R.J., Medioni, G.G., Waupotitsch, R.: Reconstructing mirror symmetric scenes from a single view using 2-view stereo geometry. In: Proc. of ICPR, pp. 12–16 (2002) 23. Szlávik, Z., Szirányi, T., Havasi, L.: Video camera registration using accumulated co-motion maps. ISPRS Journal of Photogrammetry and Remote Sensing 61(1), 298–306 (2007) 24. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis. Prentice Hall, Englewood Cliffs (2002) 25. Nguyen, V., Martinelli, A., Tomatis, N., Siegwart, R.: A Comparison of Line Extraction Algorithms using 2D Laser Rangefinder for Indoor Mobile Robotics. In: Proc. of IROS, pp. 1929–1934 (2005) 26. Ji, Q., Xie, Y.: Randomised hough transform with error propagation for line and circle detection. Pattern Analysis and Applications 6, 55–64 (2003) 27. Shafer, S.A.: Shadows and Silhouettes in Computer Vision. Kluwer Academic Publisher, Dordrecht (1985) 28. Szlávik, Z., Szirányi, T.: Stochastic view registration of overlapping cameras based on arbitrary motion. IEEE Transactions on Image Processing 16(3), 710–720 (2007) 29. 
Havasi, L., Sziranyi, T., Rudzsky, M.: Adding geometrical terms to shadow detection process. In: EUSIPCO (14th European Signal Processing Conference), Florence (2006)
Cascade of Classifiers for Vehicle Detection
Daniel Ponsa and Antonio López
Centre de Visió per Computador, Universitat Autònoma de Barcelona, Edifici O, 08193 Bellaterra, Barcelona, Spain
{daniel,antonio}@cvc.uab.es, url: www.cvc.uab.es/ADAS
Abstract. Being aware of other vehicles on the road ahead is key information that helps driver assistance systems increase driver safety. This paper addresses this problem, proposing a system to detect vehicles from the images provided by a single camera mounted in a mobile platform. A classifier-based approach is presented, based on the evaluation of a cascade of classifiers (COC) at different scanned image regions. The Adaboost algorithm is used to determine the COC from training sets. Two proposals are made to reduce the computation needed by the detection scheme: a lazy evaluation of the COC, and the customization of the COC by a wrapping process. The benefits of these two proposals are quantified in terms of the average number of image features required to classify an image region, achieving a 58% reduction while scarcely penalizing the detection accuracy of the system.
1 Introduction
The research in Computer Vision applied to intelligent transportation systems is mainly devoted to providing them with situational awareness [1]. An essential task for demanded applications like Adaptive Cruise Control or Autonomous Stop&Go Driving is determining the position of other vehicles on the road. This provides key information to Advanced Driver Assistance Systems (ADAS) in order to increase driver safety. Traditionally this task has been addressed using active sensors like radar or lidar. However, since vision sensors (CCD/CMOS) are passive and cheaper, and provide a richer description of the environment, many research efforts have also been devoted to applying computer vision techniques to this topic [2]. The variety of vehicle appearances, due to the heterogeneity of this class of objects (many different types of cars, vans and trucks) and to the uncontrolled acquisition conditions (different daytime and weather conditions, presence of strong shadows, artificial illumination, etc.), poses a big challenge for this detection task. Our recent work [3] has focused on developing a system to detect vehicles from a car equipped with a single monochrome camera, mounted close to the rearview mirror and facing the road ahead. The system follows the detection methodology proposed in [4] for face detection, based on scanning video frames with a cascade of classifiers (COC). Evaluated image regions are sorted into positive and negative categories (i.e., vehicle vs. non-vehicle). The general procedure to construct the COC is the following. Given an initial training set, a classifier is
learnt, selecting from the training data a subset of features that efficiently distinguishes the two classes. This classifier forms the first level of the COC. The cascade is then applied on a training sequence, where false positives (i.e., non-vehicle regions classified as vehicle) are usually generated. These misclassifications are collected to construct a new training set, which is then used to learn the next classification level of the COC. This process is iterated, improving the COC until an acceptable performance on the training sequence is reached. With this strategy, we developed a vehicle detector showing qualitatively good results. This paper extends our previous work on vehicle detection, proposing two techniques to improve the efficiency of evaluating a COC at different regions in an image. First, a lazy evaluation of the classifiers in the COC is proposed, in order to minimize the number of features computed to classify each inspected region. Given a testing set of video frames, the benefits of this lazy evaluation have been quantified by registering, for each inspected region, the number of features needed to evaluate each COC classifier. This information provides detailed insight into how detection takes place, and allows us to identify the bottlenecks of the process. Using this information, a tuning of the COC is proposed, replacing the critical classifiers in the COC with several new ones, which significantly reduce the average number of features required to classify each inspected region. The experimental work shows that the resultant COC has a detection accuracy practically identical to that of the original COC. The structure of the paper is as follows. The next section justifies why vehicle detection has been posed as a classification problem, and gives details on how the vehicle classifier is constructed. Then, the methodology to scan frames is presented, based on knowledge of the geometry of image formation. Section 3 proposes a lazy evaluation of the vehicle classifier, and presents the study done to determine how vehicle detection based on a COC takes place. Section 4 proposes tuning the COC in order to improve detection efficiency, and Section 5 quantifies the accuracy of the final vehicle detector and discusses the obtained results.
2 Vehicle Detector
Traditional approaches to detect vehicles are based on guessing in advance which are the best image features for detecting vehicles and only vehicles. Examples are works proposing the use of line structures and shadows [5], or symmetry measurements [6]. However, these features do not really account for all the different appearances that a vehicle may present, due to the effect of the uncontrolled illumination conditions and the high variability of appearance of the different sorts of vehicles (figure 1). For this reason, it seems proper to determine the features used to detect vehicles in a learning process. In our work, the Real Adaboost algorithm [7] has been used to construct a vehicle classifier. Given a training set $T = \{(H_1, l_1), \ldots, (H_{n_r}, l_{n_r})\}$, where $H_i = \{f_i\}_{1}^{N}$ is an over-complete set of $N$ Haar-like¹ features describing the $i$-th example, and $l_i \in \{v, nv\}$ a flag indicating
¹ The responses of the filters proposed in [4] and their absolute values are considered.
if this example is a vehicle or not, the Adaboost algorithm selects a subset of $n_f$ features $F = \{f_i\}_{1}^{n_f} \subset H_i$, each one with an associated weak classifier $r_i$, that when combined correctly classify the training examples. The resultant classifier follows the expression

$$R(F) = \sum_{i=1}^{n_f} r_i(f_i) , \qquad (1)$$
where $r_i$ is a decision stump on $f_i$ that returns a positive ($v_i^+$) or negative ($v_i^-$) value according to its classification decision. That is,

$$r_i(f_i) = \begin{cases} v_i^+ & \text{if } f_i \le (\text{or} \ge) \text{ threshold}, \\ v_i^- & \text{otherwise}. \end{cases}$$

Given the features $F$ computed in an image region, $R(F)$ returns a value whose sign provides the final classification decision (positive for vehicles, negative for non-vehicles). Haar-like filters are used because of their reduced computational cost, which is independent of their evaluation scale. This is very relevant for vehicle detection, as vehicles are observed in frames at a wide range of scales (the proposed system considers regions from 24 × 18 up to 334 × 278 pixels), and Haar-like features do not demand an explicit size normalization of image regions. In order to achieve a desired detection performance, several classifiers are iteratively learnt and arranged in a cascade. Figure 2 sketches details of the Adaboost learning algorithm. Once we have a COC, the next step is to scan images with it to detect vehicles. A scanning process is proposed, derived from the assumption that the road where vehicles move (the one holding the camera, and the observed
Fig. 1. Top) Heterogeneity of the objects to be detected, just from their rear–view. Bottom) Pairs of the same vehicle, acquired under different illumination conditions.
Fig. 2. a) Training set normalization: aspect-ratio and size normalization to 24×18 pixels, the region size of a car at approximately 70 meters. b) Process to construct the COC: Haar feature computation (1830 positive examples, 176214 features per example), Adaboost learning and manual false-positive selection, iterated to add classifiers R_1, R_2, ..., R_n. c) Evaluation of the COC. True positives should be processed by all the COC layers.
ones) conforms to a flat surface. Knowing this plane and the geometry of image formation, the image regions onto which putative vehicles at different road locations project are determined, and these regions are then evaluated with the COC to verify the presence of a vehicle (figure 3). Describing how the road plane is estimated from images is out of the scope of this paper; details can be found in [8]. In our acquisition system, estimating the road plane is equivalent to estimating the pose of the camera with respect to a world coordinate system placed on the road that sustains the vehicles. In the experiments done in this paper, this information is obtained from ground truth data.
Fig. 3. Frame scanning process: the camera is placed at height h over the road plane (world origin at [0,0,0]^T, camera at [0,h,0]^T), and a regular grid with spacings dx and dz on the ground plane is projected into the image. For each inspected road point, rectangular regions of different widths and aspect ratios are evaluated.
The image regions to be scanned are determined from the projection onto image coordinates of a regular grid on the ground plane, inspecting the road ahead up to 70 meters away. For each projected grid point, several regions of different sizes are considered, to account for vehicles of different widths. Obviously, different grid points projecting onto the same image pixels are considered just once.
For a dense scanning of the road ahead (that is, dx = 10 cm and dz = 10 cm in figure 3), this means classifying between 350,000 and 500,000 image regions per frame, depending on the acquisition system parameters. This is a remarkably huge number of regions compared with the amount inspected in other application domains (for instance, in [9] a dense scanning to detect pedestrians consists of evaluating just 12,800 regions per frame). Thus, improving the efficiency of the COC evaluation is very important for the described application. Once a frame has been scanned, a list of image regions that may contain a vehicle is obtained. As the same vehicle is usually detected in several neighboring overlapping regions, a clustering algorithm fuses them in order to provide a single detection per vehicle.
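To make the scanning geometry concrete, the following is a minimal sketch of how a regular ground-plane grid can be projected into candidate image regions under a simple pin-hole model with the camera at height h over a flat road. The grid spacings match the 10 cm values quoted above, but the focal length, lateral range, vehicle widths and the helper name (project_ground_grid) are illustrative assumptions, not the calibration or code used by the authors.

```python
import numpy as np

def project_ground_grid(h=1.2, focal_px=800.0, cx=320.0, cy=240.0,
                        dx=0.10, dz=0.10, z_max=70.0, widths=(1.6, 2.0, 2.5)):
    """Sketch of the frame-scanning geometry: project a regular grid on a flat
    road (camera at height h, looking along Z) and propose one candidate
    rectangle per grid point and putative vehicle width."""
    regions = set()
    for z in np.arange(4.0, z_max, dz):             # road points ahead of the car
        for x in np.arange(-5.0, 5.0, dx):
            u = int(round(cx + focal_px * x / z))   # image column of the road point
            v = int(round(cy + focal_px * h / z))   # image row (contact with the road)
            for w in widths:                        # putative vehicle widths (meters)
                w_px = int(round(focal_px * w / z))
                h_px = int(round(0.75 * w_px))      # fixed aspect ratio, as in the 24x18 model
                # grid points projecting onto the same pixels are kept only once
                regions.add((u - w_px // 2, v - h_px, w_px, h_px))
    return regions

if __name__ == "__main__":
    print(len(project_ground_grid()), "candidate regions for one frame")
```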
3 Lazy COC Evaluation
Each level of the COC is constituted by a classifier R as defined in (1). Since classification is done in terms of the sign of the value returned by R, it is not always necessary to evaluate all its weak classifiers $r_i$ to give the classification decision. The decision can be taken as soon as the sum of the accumulated $r_i$ responses has a magnitude bigger than the summation of the responses of opposite sign in the remaining rules. More formally, given a classifier R and the set of features $F = \{f_i\}_{1}^{n_f}$ of the region evaluated, the number of features $n_{eff}$ required to establish a classification decision corresponds to the minimum $n$ accomplishing

$$\left|\sum_{i=1}^{n} r_i(f_i)\right| \;>\; \sum_{j=n+1}^{n_f} \left| v_j^{-\operatorname{sign}\left(\sum_{i=1}^{n} r_i(f_i)\right)} \right| .$$
Thus, the number of features of F evaluated at each COC level depends on the image region content. It is proposed to take advantage of this to minimize the computation required by the described vehicle detection process. To quantify the significance of this lazy evaluation scheme, the presented vehicle detector has been applied on a set of testing frames, registering for each processed region the details of the COC evaluation, namely:
– the number of COC layers evaluated to give a classification decision²,
– the number of features evaluated at each COC layer.
Figure 4 displays the statistics of the obtained results, showing the percentage of processed regions that receive a final classification decision at each COC layer and, for each layer, the percentage of regions that require the evaluation of a given number of features. For each layer, the expectation of the number of features evaluated vs. the number of features $n_f$ of the classifier is presented. Results show that, on average, the standard evaluation of the COC requires computing 102.82 features per region, while the lazy evaluation requires just 76.06. This means a reduction of 26% in the number of features computed.
² Note that only positive regions are expected to be evaluated in all COC layers.
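The early-stopping rule above can be sketched as follows. This is a minimal illustration, assuming each weak rule is stored as a tuple (feature index, threshold, direction, v⁺, v⁻); the function names and data layout are hypothetical and the code is not the authors' implementation.

```python
import numpy as np

def evaluate_layer_lazily(features, stumps):
    """Lazy evaluation of one COC layer (illustrative sketch).  `stumps` is a
    list of weak rules (feat_idx, threshold, direction, v_plus, v_minus) with
    v_plus > 0 and v_minus < 0; the loop stops as soon as the accumulated
    score can no longer change sign."""
    # suffix sums: worst-case positive / negative contribution of rules i..end
    rest_pos = np.cumsum([s[3] for s in stumps][::-1])[::-1]
    rest_neg = np.cumsum([-s[4] for s in stumps][::-1])[::-1]
    score, used = 0.0, 0
    for i, (feat_idx, thr, direction, v_plus, v_minus) in enumerate(stumps):
        f = features[feat_idx]
        fires = (f <= thr) if direction <= 0 else (f >= thr)
        score += v_plus if fires else v_minus
        used += 1
        if i + 1 < len(stumps):
            # magnitude of the opposite-sign responses still to come
            remaining = rest_neg[i + 1] if score > 0 else rest_pos[i + 1]
        else:
            remaining = 0.0
        if abs(score) > remaining:        # the decision cannot flip any more
            break
    return score > 0, used                # label (vehicle?) and features evaluated

def evaluate_coc_lazily(features, layers):
    """A region is labelled vehicle only if every layer accepts it; it is
    rejected at the first layer that votes negative."""
    total = 0
    for stumps in layers:
        positive, used = evaluate_layer_lazily(features, stumps)
        total += used
        if not positive:
            return False, total
    return True, total
```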
Fig. 4. Statistics of the lazy evaluation of a COC. The top panel shows the percentage of regions rejected (labelled non-vehicle) in layers 1–6 and finally classified in layer 7 (96.00%, 3.23%, 0.54%, 0.10%, 0.06%, 0.02% and 0.03%, respectively); the per-layer panels show the percentage of regions whose classification requires a given number of features, together with the average number of features evaluated vs. the total number of features of its classifier (e.g., 64/90 for layer 1).
Results also show that, on average, 96% of the regions are discarded (i.e. classified as non-vehicles) at the first COC layer. This comes from the fact that the processed images present a large homogeneous area (the road), and the image regions evaluated there are easy to distinguish from vehicles. However, although most image regions just require the evaluation of a single COC layer, they require on average the evaluation of 64.34 features, which results in a noteworthy amount of computation due to the large number of image regions that are inspected. In order to obtain a more efficient vehicle detector, fewer features should be used to discard this greater part of the analyzed regions. The next section proposes a methodology to tune the learned COC in order to achieve that.
4 Tuning a COC
In order to implement the task of a given level of a COC with a lower computational cost, it is proposed to substitute its corresponding classifier R by another COC. Ideally, this COC should achieve an equivalent classification performance while requiring the analysis of fewer features when a frame is processed. The proposed method is based on a partition of the training set T used to generate R, in order to obtain new classifiers of lower complexity. Let T_⊕ and T_⊖ denote the positive and negative examples in T (i.e. T = T_⊕ ∪ T_⊖). Using the classifier R learned from T, the elements in T_⊖ are classified, selecting then the ones whose classification remains negative during the evaluation of the last 90% of the weak rules r_i in (1). This selection groups negative examples according to the similarity of how they are classified (that is, from the evaluation of the first 10% of weak classifiers in R onwards, they are always considered as negative). This partitions T_⊖ into two groups:
– one with elements easily distinguishable from positive examples (T_1);
– the other with elements more difficult to classify (T_2).
Heuristically, it is expected that from these two sets new classifiers will be learned that jointly require a lower complexity than R. Since the set {T_⊕ ∪ T_1} contains clearly negative examples, it seems logical to expect that they can be classified with fewer features. For {T_⊕ ∪ T_2} it is also possible to obtain a classifier of lower complexity, as Adaboost will select a different subset of features F_2 specially tuned to distinguish just the elements in T_2³. Thus, in this paper we propose to recursively apply such a divide-and-conquer strategy, attempting to obtain classifiers of a desired complexity. This procedure can be seen as a wrapper method devoted to iteratively selecting negative examples that simplify (in terms of the number n_f in R) the learned classifier. Figure 5 sketches the specific proposed strategy. The subset T_1 is recursively purged using the described method, until either a classifier with a constrained maximum complexity is obtained, or the complexity of the obtained classifier does not decrease. Then, the examples discarded during this process are grouped in a new training set T', and the process is started again. The process is stopped when no significant improvement is achieved.
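As an illustration of the partition criterion just described, the following sketch selects the "easy" negatives from the per-rule score traces of the negative examples. The array layout and helper name are hypothetical, written only from the description above, and do not reproduce the authors' code.

```python
import numpy as np

def split_negatives_by_trace(neg_scores_per_rule, keep_fraction=0.9):
    """Sketch of the training-set partition used to tune a COC layer.
    `neg_scores_per_rule` is an (n_neg, n_f) array whose entry (j, i) holds the
    partial sum of weak-rule responses of negative example j after rule i.
    Examples whose partial score stays negative during the last 90% of the
    rules go to the "easy" subset T_1, the rest to T_2."""
    n_f = neg_scores_per_rule.shape[1]
    start = int(round((1.0 - keep_fraction) * n_f))      # skip the first 10% of rules
    always_negative = np.all(neg_scores_per_rule[:, start:] < 0, axis=1)
    easy_idx = np.flatnonzero(always_negative)            # indices forming T_1
    hard_idx = np.flatnonzero(~always_negative)           # indices forming T_2
    return easy_idx, hard_idx
```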
Fig. 5. Strategy used to substitute a classifier R by a COC: the negative examples of T are recursively partitioned (T_1, T_11, T_12, ..., T_2), classifiers R_1, R_11...1 are learned from the resulting subsets, and the discarded examples are regrouped into T', from which R' is learned.
Using this strategy, the first level of the cascade analyzed in figure 4 has been replaced by 4 new sub-levels which, when applied to testing frames, display the statistics of figure 6.
Fig. 6. Statistics of the COC levels that replace the first layer of the COC in figure 4. The four new sub-layers reject 40.08%, 39.37%, 10.61% and 6.35% of the regions, respectively, with average vs. total numbers of features per classifier of 5/9, 25/30, 30/35 and 66/75.
The joint performance of these 4 new layers is compared in figure 7 with the performance of the replaced layer. Now 96% of the analyzed regions require on average the evaluation of 33 features, whereas the original COC required 64 features. Considering the overall COC performance, the average number of features required per inspected region is now 43.35, which, with respect to the 76.06 of the original COC, means a reduction of 43%.
³ If this does not happen, one can just use the original R for classifying T_2.
Fig. 7. Performance of the initial COC layer vs. the new learned sub-layers: percentage of regions whose classification requires a given number of features, with average vs. total numbers of features of 64/90 for the original layer 1 and 33/149 for the new sub-layers 1–4.
5 Classification Rate Evaluation
To objectively evaluate the performance of the proposed method, the following experiment has been carried out. First, sequences different from the ones used for training have been used, which were acquired using three different vehicles with different video cameras. Each camera has premounted optics, and has been roughly calibrated assuming a pin-hole camera model with zero skew. The images provided by each camera are significantly different, due to their different behavior with respect to the automatic control of the camera gain, and their spectral sensitivity. Sequences have been acquired at different times of the day (midday and sunset) and under different environmental conditions (cloudy day, sunny day, etc.). From them, 500 frames have been selected in order to construct a testing set to validate the system. The selection criterion has been to collect frames significant with respect to the different kinds of vehicles acquired and to the lighting conditions (presence of shadows, specularities, under-illuminated environments, etc.). All selected frames satisfy the restriction that a user can easily annotate a planar surface approximating the observed road. This annotation is easy if parallel road structures (lane markings, road limits, etc.) are clearly observed in the image. The annotated plane provides the ground truth information used to determine the frame regions that are inspected. With this information, an ideal scanning of the video frames is carried out, and the best performance achievable by the proposed method can be quantified. The vehicles in the testing frames have also been manually annotated, labeling them depending on whether their detection should be mandatory, or whether they may be miss-detected due to some of the following causes:
– they present partial occlusions;
– they are farther than the maximum operative detection distance (70 meters);
– they lie in a plane different from the one used for scanning the image.
The labeling of the observed vehicles into these two disjoint classes is done to better quantify the detection performance (i.e., to count properly the number of false positives and false negatives). The miss-detection of a miss-detectable vehicle does not have to be interpreted as a false negative, as the objective in this paper is not to evaluate the detection performance in these challenging cases. On the other hand, miss-detectable vehicles, whether detected or not, are counted neither as true nor as false positives, in order not to distort the results. Thus, classification ratios are computed taking into consideration just the vehicles that should be
detected obligatorily. Table 1 shows the results obtained for a dense scanning of the testing frames, using the original and the tuned COC respectively. Using the tuned COC, a slightly lower detection rate is achieved (93.91% versus the 94.13% of the original COC), but also a lower false positive rate per region evaluated. The detection accuracy achieved is remarkable, given the complexity of the problem faced (detection of vehicles up to 70 meters away) and the challenging conditions considered in the testing (different acquisition cameras, daytime conditions, frontal and rear vehicle views, etc.).

Table 1. Detection results of the original (top) and tuned (bottom) COC

Original COC - True Positives Detection rates
         Car               Van               Truck              Acum.
Rear     547/570  95.96%   163/169  96.45%   67/78    85.90%    777/817  95.10%
Front     67/80   83.75%    11/12   91.67%   11/11   100.00%     89/103  86.41%
Acum.    614/650  94.46%   174/181  96.13%   78/89    87.64%    866/920  94.13%
Original COC - False Positives Detection rates
FP per Window evaluated: 1.509e-004    FP per Frame: 1.07

Tuned COC - True Positives Detection rates
         Car               Van               Truck              Acum.
Rear     545/570  95.61%   162/169  95.86%   68/78    87.18%    775/817  94.86%
Front     67/80   83.75%    11/12   91.67%   11/11   100.00%     89/103  86.41%
Acum.    612/650  94.15%   173/181  95.58%   79/89    88.76%    864/920  93.91%
Tuned COC - False Positives Detection rates
FP per Window evaluated: 1.426e-004    FP per Frame: 1.02
The detector performs better at detecting the back of vehicles, probably due to the fact that frontal views are underrepresented in the training set (they constitute less than 10% of the positive training examples). Concerning the type of vehicles, the ones most difficult to detect are trucks. We guess that this is due to two factors. On the one hand, trucks form a more heterogeneous class than other types of vehicles. On the other hand, the appearance of their back side usually varies very significantly depending on the camera viewpoint. This does not happen with the other types of vehicles, whose backside commonly approximates a vertical plane, and for this reason their appearance scarcely varies with the camera viewpoint. Another point worth mentioning is the number of false positives. On average 1.02 false positives per frame are generated, but this does not mean that when a real sequence is processed a false alarm is generated at every frame. In real sequences it can be seen that false positives do not present spatio-temporal coherence, while true vehicles do. Using this fact, it is easy to tell false from true detections with the help of tracking.
6 Conclusions
A system has been presented to detect vehicles from images acquired from a mobile platform. Based on the Adaboost algorithm, a COC has been learnt from training data. Two proposals have been presented to reduce the computational cost of the detection process, namely the lazy evaluation of classifiers and a wrapping process to tune the initially learned COC. Thanks to these two proposals, the average number of features computed per inspected region has been reduced from the 102.82 of the original COC with standard evaluation to the 43.35 of the tuned COC with lazy evaluation (a reduction of around 58%). The detection accuracy of the tuned COC is scarcely inferior to that of the original COC, while also showing a lower false detection rate.

Acknowledgments. This research has been partially funded by Spanish MEC project TRA2004-06702/AUT.
References
1. Dickmanns, E.: The development of machine vision for road vehicles in the last decade. In: Int. Symp. on Intelligent Vehicles, Versailles, vol. 1, pp. 268–281 (2002)
2. Sun, Z., Bebis, G., Miller, R.: On-road vehicle detection: A review. IEEE Trans. on Pattern Analysis and Machine Intelligence 28, 694–711 (2006)
3. Ponsa, D., López, A., Serrat, J., Lumbreras, F., Graf, T.: 3d vehicle sensor based on monocular vision. In: Int. Conf. Intelligent Transportation Systems, pp. 1096–1101 (2005)
4. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE Conf. Computer Vision and Pattern Recognition, pp. 511–518. IEEE Computer Society Press, Los Alamitos (2001)
5. Maurer, M., Behringer, R., Fürst, S., Thomanek, F., Dickmanns, E.D.: A compact vision system for road vehicle guidance. In: 13th Int. Conference on Pattern Recognition, Vienna, Austria, vol. 3, pp. 313–317 (1996)
6. Broggi, A., Cerri, P., Antonello, P.: Multi-resolution vehicle detection using artificial vision. In: IEEE Intelligent Vehicles Symposium, pp. 310–314. IEEE Computer Society Press, Los Alamitos (2004)
7. Schapire, R.E., Singer, Y.: Improved boosting using confidence-rated predictions. Machine Learning 37, 297–336 (1999)
8. Sappa, A., Gerónimo, D., Dornaika, F., López, A.: On-board camera extrinsic parameter estimation. IEE Electronics Letters 42, 645–747 (2006)
9. Zhu, Q., Avidan, S., Yeh, M.C., Cheng, K.T.: Fast human detection using a cascade of histograms of oriented gradients. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1491–1498. IEEE, Los Alamitos (2006)
Aerial Moving Target Detection Based on Motion Vector Field Analysis
Carlos R. del-Blanco, Fernando Jaureguizar, Luis Salgado, and Narciso García
Grupo de Tratamiento de Imágenes, Universidad Politécnica de Madrid, 28040, Madrid, Spain
{cda,fjn,L.Salgado,narciso}@gti.ssr.upm.es
http://www.gti.ssr.upm.es
Abstract. An efficient automatic detection strategy for aerial moving targets in airborne forward-looking infrared (FLIR) imagery is presented in this paper. Airborne cameras induce a global motion over all objects in the image, which invalidates motion-based segmentation techniques designed for static cameras. To overcome this drawback, previous works compensate the camera ego-motion. However, this approach depends too much on the quality of the ego-motion compensation, tending towards over-detection. In this work, the proposed strategy estimates a robust motion vector field, free of erroneous vectors. Motion vectors are classified into different independent moving objects, corresponding to background objects and aerial targets. The aerial targets are directly segmented using their associated motion vectors. This detection strategy has a low computational cost, since no compensation process or motion-based technique needs to be applied. Excellent results have been obtained over real FLIR sequences.
1 Introduction
Automatic target detection in FLIR imagery is a challenging problem due to the low signal-to-noise ratio, the non-repeatability of target signatures and changes in illumination. Moreover, the airborne camera induces a global motion in the sequence (called ego-motion), which can cause static background objects to be detected as moving targets. To overcome the camera ego-motion problem, most works apply a compensation stage that follows the scheme: computation of the motion vector field, parameter estimation of the global motion and compensation of the global motion [1]-[6]. Each one of these sub-stages has several drawbacks that, as a whole, produce a low-quality image compensation. Erroneous motion vectors in the motion vector field computation are the most significant drawback, as they can cause an erroneous global motion estimation. A low-quality or erroneous image compensation directly affects motion-based techniques, which only produce satisfactory results on static images or on perfectly compensated images. Besides, these techniques are based on the subtraction of consecutive images [1][2]. Therefore, they usually do not segment
entire moving objects but only some parts of them, due to the overlap of the objects themselves between consecutive images. On the other hand, almost all works deal with terrestrial targets. This implies that the airborne camera aims at high-textured earth regions. This work, instead, addresses aerial target detection, as in [3] and [7]. Consequently, the camera aims at low-textured sky regions, which are not valid to estimate motion due to the aperture problem [8]. In this case, the camera ego-motion compensation depends on a reduced set of high-textured cloud and earth regions (if they exist), decreasing its quality. In this paper a new aerial target detection strategy is presented, which is able to detect moving aerial targets in low-textured sky sequences affected by camera ego-motion. This is achieved by computing an error-free motion vector field, in which only high-textured regions are considered. The motion vector field is analyzed to classify its motion vectors as belonging to background or aerial target regions. Aerial targets are morphologically segmented using the previous motion vector classification. As a result, an accurate and low-complexity target detection is obtained, as no static-camera-oriented motion-based technique is applied. This paper is organized as follows: Section 2 presents an overview of the proposed strategy. Section 3 describes the robust image motion estimation. The background and aerial target detection are presented in Sections 4 and 5, respectively. Section 6 shows experimental results obtained over real FLIR sequences. Finally, conclusions are presented in Section 7.
2 Strategy Overview
The proposed detection strategy is carried out in three different stages, as shown in Fig. 1. The Image Motion Estimation stage automatically detects the edge regions in two consecutive images (I^{n-1} and I^n) of the FLIR sequence. An error-free sparse motion vector field (SMVF^n) is computed, using only those image regions where edges were detected. The Background Detection stage analyzes SMVF^n to find out whether a set of motion vectors corresponding to background objects exists. If so, those motion vectors are discarded and the rest (MV_AT) are classified as belonging to aerial targets. The Target Detection stage segments all the aerial targets (AT^n) present in I^n, by morphologically processing those edge regions corresponding to MV_AT.
Fig. 1. Stages of the proposed detection strategy: Image Motion Estimation (Edge Detection and Edge-Matching on I^{n-1}, I^n, producing SMVF^n), Background Detection (producing MV_AT) and Target Detection (producing AT^n).
3 Image Motion Estimation
This stage detects the edges of a pair of consecutive images, I^{n-1} and I^n, and performs an edge matching to compute a motion vector field, which represents the local motion in the image.

3.1 Edge Detection
A Laplacian of Gaussian based edge detector, along with an automatic thresholding, is used to detect all the relevant edges in a pair of consecutive images, I^{n-1} and I^n. A Laplacian of Gaussian filter LoG is applied to the image I^n to highlight those regions with high intensity variation. As a result, I^n_LoG is obtained, whose intensity values follow a Laplacian distribution, assuming an additive Gaussian noise in the image [11],[12]. The expression of the Laplacian distribution is given by (1):

$$L(x) = \frac{1}{2b} \, e^{-|x-\mu|/b} \qquad (1)$$

where μ is the mean and b is a scale parameter. These parameters are estimated through a robust parameter estimation technique composed of two parts. In the first part, a preliminary estimation, μ_p and b_p, is carried out through the maximum likelihood parameter estimation algorithm. In the second part, μ and b are obtained using the same technique but only over a range of values of I^n_LoG, determined by (2):

$$-4\sigma < I^n_{LoG} < 4\sigma \qquad (2)$$

where σ² = 2b_p² is the variance of a Laplace distribution with a scale parameter equal to b_p. An adaptive threshold T_LoG is computed from μ and b as in (3):

$$T_{LoG} = \mu - b \cdot \ln\!\left( 1 - 2\left| \frac{P_f}{2} - 0.5 \right| \right) \qquad (3)$$

where P_f is the acceptable proportion of false edges (a high value will produce more false edges but detect more true ones, and vice versa). The intensity values of I^n_LoG smaller than T_LoG are set to zero, obtaining I^n_ThLoG. Then, a zero-crossing technique is applied to I^n_ThLoG to obtain a binary edge image E^n, which contains all relevant edges. This process is also applied to I^{n-1} to obtain the edge image E^{n-1}. Fig. 2 shows the automatic edge detection process. The FLIR image presented in Fig. 2(a) is filtered by LoG. The intensity distribution of the resulting filtered image is fitted by a Laplacian distribution, as shown in Fig. 2(b). An optimum threshold is computed from the parameters of the previously fitted Laplacian distribution. Applying this threshold, the edge image is obtained (Fig. 2(c)). As can be observed, this edge image contains the main edges in the FLIR image, while correctly rejecting those intensity variations due to the noise.
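A minimal sketch of this thresholding step is given below. It assumes the closed-form ML estimators of a Laplace distribution (location = median, scale = mean absolute deviation from the median) and the reconstruction of Eq. (3) above; the value of sigma and the helper name are illustrative, and the zero-crossing step that produces the binary edge map E^n is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def adaptive_log_threshold(image, sigma=2.0, p_false=1.0 / (512 * 512)):
    """Sketch of the LoG thresholding of Sect. 3.1 (illustrative, not the
    authors' code): fit a Laplacian distribution to the LoG response, refine
    the fit on the central range of Eq. (2), and derive T_LoG from Eq. (3)."""
    response = gaussian_laplace(image.astype(float), sigma)
    x = response.ravel()
    # ML estimates for a Laplace distribution: location = median, scale = mean |x - mu|
    mu_p = np.median(x)
    b_p = np.mean(np.abs(x - mu_p))
    sd = np.sqrt(2.0) * b_p                     # sigma, with sigma^2 = 2 * b_p^2
    core = x[np.abs(x) < 4.0 * sd]              # restriction of Eq. (2)
    mu = np.median(core)
    b = np.mean(np.abs(core - mu))
    t_log = mu - b * np.log(1.0 - 2.0 * abs(p_false / 2.0 - 0.5))   # Eq. (3)
    thresholded = np.where(response >= t_log, response, 0.0)        # small responses -> 0
    return thresholded, t_log
```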
3.2 Edge Matching
E^{n-1} and E^n are morphologically dilated with a square structuring element of size 5 × 5 (an acceptable size for selecting the edge itself and its most significant neighborhood, which will be used to find the edge-based correspondences), obtaining DE^{n-1} and DE^n. The dilated edge pixels from DE^{n-1} are divided into a set of k_c clusters, C^{n-1}_DE. These clusters are calculated by means of a k-means algorithm, which uses the spatial coordinates of the dilated edge pixels as a feature vector. The number of clusters k_c is computed as in (4):

$$k_c = \frac{N_{DE}}{N_{pix}} \qquad (4)$$
where N_DE is the number of pixels corresponding to the dilated edges, and N_pix is a predefined average number of pixels for each cluster. A high value of N_pix will produce a better-quality correspondence but less resolution in the generated motion vector field, and vice versa. Fig. 3 depicts the clustering of the dilated edge regions DE^{n-1}, obtained by applying a k-means algorithm over the coordinates of the dilated edge regions. Each cluster in C^{n-1}_DE is composed of a set of pixel coordinates that are used to form clusters of pixels in I^{n-1}, whose set is denominated C^{n-1}_I. The Edge-Matching sub-stage (called in this way because each cluster of C^{n-1}_I is formed by pixels belonging to edge regions) compares each cluster of C^{n-1}_I with the corresponding regions in I^n (using the same cluster shape) and its adjacent neighborhood located inside a predefined search area S_a. The search area S_a is constrained to the dilated edge pixels of DE^n, since the best correspondence should be another edge region. The best matching is computed by minimizing the mean absolute difference cost function (MAD), whose expression is given in (5):

$$MAD(d_x, d_y) = \frac{1}{N_{pc}} \sum_{(x,y) \in C^{n-1}_{I,i}} \left| I^{n-1}(x, y) - I^n(x + d_x, y + d_y) \right| \qquad (5)$$

where C^{n-1}_{I,i} is the cluster i of C^{n-1}_I, of size N_pc pixels, and (d_x, d_y) are the coordinates of each candidate motion vector inside S_a. The best matching produces a motion vector that defines the movement of one cluster in I^{n-1} to the corresponding one in I^n. The set of estimated motion vectors, related to all the clusters of I^{n-1}, forms a sparse forward motion vector field, SFMVF^n. Erroneous vectors can be obtained in SFMVF^n due to the aperture problem [8], the low signal-to-noise ratio of FLIR images and objects that appear or disappear between consecutive images. To discard these erroneous vectors (which could otherwise be detected as aerial targets), each motion vector in SFMVF^n is analyzed. This analysis consists of computing the sparse backward motion vector field SBMVF^n between I^n and I^{n-1}, following the same procedure as for computing SFMVF^n, but now the clusters of I^n are those resulting from the best matching in the forward motion estimation process and the search area is constrained
Fig. 2. (a) Original FLIR image, (b) Laplacian fitting of the LoG-filtered image intensity distribution, (c) detected edges using the threshold computed through the estimated Laplacian parameters.
by DE^{n-1}, the dilated edges of E^{n-1}. Then, the coherency between the forward and backward motion vector fields is verified, by imposing that each couple of associated vectors must satisfy (6):

$$\left( d_x^{SFMVF^n}, d_y^{SFMVF^n} \right) = - \left( d_x^{SBMVF^n}, d_y^{SBMVF^n} \right) \qquad (6)$$
As a result, an accurate sparse motion vector field SMVF^n is obtained, free of erroneous motion vectors.
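The matching and consistency test can be sketched as follows. This is an illustration under simplifying assumptions (a square search window, cluster coordinates supplied as an (N, 2) array), with hypothetical helper names; it is not the authors' implementation.

```python
import numpy as np

def mad_match(cluster_coords, img_prev, img_curr, search_mask, search_radius=8):
    """For one cluster of edge pixels of I^{n-1}, find the displacement
    (dx, dy) inside the search area that minimizes the MAD of Eq. (5).
    `search_mask` plays the role of the dilated edge map DE^n that constrains
    the candidate displacements."""
    ys, xs = cluster_coords[:, 0], cluster_coords[:, 1]
    ref = img_prev[ys, xs].astype(float)
    best, best_d = np.inf, (0, 0)
    h, w = img_curr.shape
    for dy in range(-search_radius, search_radius + 1):
        for dx in range(-search_radius, search_radius + 1):
            yq, xq = ys + dy, xs + dx
            if (yq < 0).any() or (yq >= h).any() or (xq < 0).any() or (xq >= w).any():
                continue
            if not search_mask[yq, xq].any():     # candidate must land on edge regions
                continue
            mad = np.mean(np.abs(ref - img_curr[yq, xq]))
            if mad < best:
                best, best_d = mad, (dx, dy)
    return best_d

def coherent(forward_d, backward_d):
    """Forward/backward consistency test of Eq. (6): the backward displacement
    of the matched cluster must be the opposite of the forward one."""
    return forward_d[0] == -backward_d[0] and forward_d[1] == -backward_d[1]
```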
Fig. 3. Clustering of the dilated edge regions using a k-means algorithm
4 Background Detection
The purpose of this stage is to determine whether background objects, mainly earth and cloud regions, appear in the FLIR image, and if so, to detect them. The presence of background objects is based on the evaluation of two conditions: the quantity and majority conditions. First, the quantity condition is evaluated, which consists of checking whether the number of motion vectors in SMVF^n is larger than a predetermined threshold. Since the background object size is significantly larger than the target size, the number of motion vectors in SMVF^n in the presence of background objects will be much larger than in the presence of only aerial targets. If the quantity condition is fulfilled, then the majority condition is evaluated. This condition establishes that at least 50% of the motion vectors must follow a coherent motion (corresponding to the camera ego-motion). This avoids aerial targets being considered as background objects in the rare situation that an image composed of numerous aerial targets has passed the quantity condition. Notice that the coherent motion corresponding to the background objects can have a magnitude different from zero, even though the background objects are actually static, due to the ego-motion induced by the airborne camera. The coherent motion is modeled through a restricted-affine transformation, RAT. This transformation is adequate, as the long distance between the camera and both target and background objects allows simplifying the projective camera model into an orthogonal one [7]. The RAT only considers translations, rotations and zooms, as shown in (7):

$$\begin{bmatrix} x^{n-1} \\ y^{n-1} \\ 1 \end{bmatrix} = \begin{bmatrix} s\cos\theta & s\sin\theta & t_x \\ -s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} x^{n} \\ y^{n} \\ 1 \end{bmatrix} \qquad (7)$$

where s, θ, t_x and t_y are respectively the zoom, angle of rotation, horizontal translation and vertical translation; and x^{n-1}, y^{n-1}, x^n, y^n are the coordinates
of a given pixel in I^{n-1} and I^n respectively, which are related by the RAT transformation. The RAT parameters are estimated by means of a robust parameter estimation technique, based on the RANSAC [9], Least Median of Squares [9] and Median Absolute Deviation algorithms [10]. This estimation technique starts by randomly sampling S pairs of motion vectors from SMVF^n. S is calculated to ensure with a probability P_s that at least one pair of motion vectors is free of outliers (a high value of P_s will produce a better estimation but more computations, and vice versa). Its expression is given by (8):

$$S = \frac{\log(1 - P_s)}{\log\left[ 1 - (1 - \varepsilon)^2 \right]} \qquad (8)$$
where ε is the expected maximum fraction of outliers in SMVF^n. For each pair of motion vectors, P_mv, the RAT parameters are estimated by solving the equation system in (7). The squared residual distance r_i² is calculated between each motion vector of SMVF^n and the one obtained from the estimated RAT parameters. Then, the median of all r_i² is computed, which is used as a measure of the goodness of each RAT parameter estimation. Therefore, the best fitting estimate, denoted RÂT, is the RAT parameter estimation with the minimum value of the median. The set of inlier vectors S_in is determined through the Median Absolute Deviation algorithm [10]. This uses the set of r_i² related to RÂT to calculate S_in as in (9):
$$S_{in} = \left\{ mv_i \in SMVF^n \;\middle|\; r_i^2 < (2.5 \cdot \hat{\beta})^2 \right\} \qquad (9)$$

where mv_i is a motion vector from SMVF^n with associated squared residual distance r_i², and β̂ is the inlier scale estimator given by (10):

$$\hat{\beta} = 1.4826 \cdot \left( 1 + \frac{5}{N_{mv} - 2} \right) \cdot \sqrt{\operatorname{median}_i \{ r_i^2 \}} \qquad (10)$$

where N_mv is the total number of motion vectors in SMVF^n. The majority condition is passed if the cardinality of S_in is equal to or larger than N_mv/2, and if so, the members of S_in correspond to background objects. On the contrary, all motion vectors in SMVF^n will correspond to one or more aerial targets.
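A compact sketch of this robust estimation is given below; it takes matched point arrays in place of the motion vector field and follows Eqs. (8)-(10) as reconstructed above. The function and parameter names are hypothetical and the code is illustrative, not the authors'.

```python
import numpy as np

def estimate_rat_lmeds(prev_pts, curr_pts, p_s=0.9999, eps=0.4, rng=None):
    """Robust RAT estimation (sketch): sample S pairs of matches, solve (7) for
    (s*cos(theta), s*sin(theta), tx, ty) from each pair, keep the estimate with
    the least median of squared residuals, and select inliers with Eqs. (9)-(10).
    prev_pts/curr_pts are (N, 2) float arrays of coordinates in I^{n-1} and I^n."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(prev_pts)
    S = int(np.ceil(np.log(1.0 - p_s) / np.log(1.0 - (1.0 - eps) ** 2)))   # Eq. (8)

    def fit(idx):
        # two point pairs give four linear equations in a = s*cos(theta), b = s*sin(theta), tx, ty
        A, y = [], []
        for (xc, yc), (xp, yp) in zip(curr_pts[idx], prev_pts[idx]):
            A.append([xc,  yc, 1, 0]); y.append(xp)
            A.append([yc, -xc, 0, 1]); y.append(yp)
        return np.linalg.lstsq(np.asarray(A, float), np.asarray(y, float), rcond=None)[0]

    def residuals(p):
        a, b, tx, ty = p
        pred_x = a * curr_pts[:, 0] + b * curr_pts[:, 1] + tx
        pred_y = -b * curr_pts[:, 0] + a * curr_pts[:, 1] + ty
        return (prev_pts[:, 0] - pred_x) ** 2 + (prev_pts[:, 1] - pred_y) ** 2

    best_p, best_med = None, np.inf
    for _ in range(S):
        p = fit(rng.choice(n, size=2, replace=False))
        med = np.median(residuals(p))
        if med < best_med:
            best_med, best_p = med, p
    beta = 1.4826 * (1.0 + 5.0 / (n - 2)) * np.sqrt(best_med)               # Eq. (10)
    inliers = residuals(best_p) < (2.5 * beta) ** 2                          # Eq. (9)
    return best_p, inliers
```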
5 Target Detection
This stage detects aerial targets using the set of motion vectors related to aerial targets, S_AT. If the background detection fails, S_AT is set to SMVF^n. On the contrary, if the background detection succeeds, S_AT is set to SMVF^n − S_in, which represents the set of outlier motion vectors in the previous inlier scale estimation process.
The edge regions associated with the members of S_AT are processed by means of a morphological close, using as structuring element a square of size D × D, where D is the mean size of an aerial target. As a result, a set of one or more connected regions is obtained, each one representing a different aerial target.
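As an illustration of this final step, the sketch below closes the target-related edge pixels with a D × D structuring element and labels the connected components; the value of D and the helper name are assumptions, not taken from the paper.

```python
import numpy as np
from scipy import ndimage

def segment_targets(edge_mask, target_vector_coords, d=15):
    """Sketch of the target segmentation: keep only the edge pixels whose
    motion vectors were classified as aerial targets, close them with a D x D
    square structuring element and label the resulting connected regions."""
    target_mask = np.zeros_like(edge_mask, dtype=bool)
    ys, xs = target_vector_coords[:, 0], target_vector_coords[:, 1]
    target_mask[ys, xs] = edge_mask[ys, xs]                 # edge regions of S_AT
    closed = ndimage.binary_closing(target_mask, structure=np.ones((d, d), bool))
    labels, n_targets = ndimage.label(closed)               # one label per aerial target
    return labels, n_targets
```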
6 Results
The system has been tested with real FLIR sequences captured by an interlaced gray-level infrared camera in the 8-12 μm range with a resolution of 512 × 512 pixels. For all the tested sequences one field per frame was selected; therefore the image aspect ratio was modified to 1:2. The camera was mounted on a moving platform that produced a global motion in the sequences. These sequences are mainly composed of low-textured sky regions, and only in some frames of reduced cloud and earth regions. In addition, the sequences are affected by varying illumination conditions. Fig. 4 shows the motion vector field estimation process, accomplished in the Image Motion Estimation stage. Fig. 4(a) shows the original FLIR image with one aerial target and some cloud and earth regions. Fig. 4(b) presents the forward sparse motion vector field, computed between I^{n-1} and I^n, where I^n is used to search for the best correspondences with the dilated edge regions of I^{n-1}. Fig. 4(c) shows the backward sparse motion vector field, computed as in Fig. 4(b) but between I^n and I^{n-1}. Fig. 4(d) presents the error-free SMVF^n, composed of those motion vectors of Fig. 4(b) that are coherent with the motion vectors of Fig. 4(c), i.e. have the same modulus but opposite direction. As can be observed, Fig. 4(b), (c) and (d) are sparse motion vector fields, since only those regions detected as edges in Section 3.1 are used in the image motion estimation. Notice that some motion vectors from Fig. 4(b) have been discarded in Fig. 4(d); these correspond to regions that have appeared/disappeared between consecutive images due to camera ego-motion, or to regions that suffer from the aperture problem [8], and therefore have a low reliability. Fig. 5 depicts the aerial target detection process. The SMVF^n of Fig. 4(d), resulting from the Image Motion Estimation stage, is analyzed to detect motion vectors belonging to background or aerial target regions, as shown in Fig. 5(a) (background and aerial target motion vectors are enclosed by a dashed rectangle and a dashed oval, respectively). Only aerial target motion vectors are morphologically processed to segment aerial targets. In this case, the only aerial target is satisfactorily segmented, as shown in Fig. 5(b). Fig. 6 shows another example of the aerial target detection, but with two different aerial targets and without any background regions, as shown in Fig. 6(a) (the image has been cropped around the aerial targets to show the process with more clarity). The analysis of the corresponding SMVF^n classifies both connected regions as belonging to aerial target regions (Fig. 6(b); as in Fig. 5(a), aerial target motion vectors are enclosed by dashed ovals), since the background presence conditions of Section 4 have not been passed. Finally, both aerial targets are segmented through morphological operations, as shown in Fig. 6(c).
Fig. 4. (a) Original FLIR image, (b) forward sparse motion vector field, (c) backward sparse motion vector field and (d) error-free SMVF.
Fig. 5. (a) Motion vector classification of Fig. 4(d) into background objects and aerial targets; (b) aerial target segmentation, obtained by the morphological processing of the regions associated with aerial target motion vectors.
Fig. 6. (a) The cropped original FLIR image containing two aerial targets; (b) motion vector classification of the SMVF^n, obtained from (a) and the previous image in the sequence; (c) morphological segmentation of the two aerial targets present in (a), using the motion vectors from (b).
The proposed target detection is efficient with targets of reduced size. However, when the target area is less than 50 pixels, its performance begins to decrease. The entire set of FLIR sequences has been processed, obtaining an average detection rate of 98.2% and an average false alarm rate of 3.8%, using the following parameters: P_f = 1/(512×512), N_pix = 256, ε = 0.4 and P_s = 0.9999. These results demonstrate the excellent performance of this detection strategy.
7 Conclusions
A novel strategy for detecting aerial moving targets in airborne FLIR imagery has been presented in this paper. Instead of compensating the camera ego-motion in order to apply static-camera-oriented motion-based techniques, the proposed strategy directly analyzes the image motion (calculated as a motion vector field) to separately cluster background and aerial target regions. The aerial targets are segmented by morphologically processing the aerial target regions. In order to achieve this detection, the computation of an error-free motion vector field is required. This is accomplished by the combination of two strategies: using only the edge regions to compute the motion vectors, and testing the coherency of the motion vectors belonging to the forward and backward motion vector fields. In addition to the gained reliability, a low complexity is achieved, since only a reduced set of image regions is processed. The results presented in Section 6 demonstrate the high efficiency of this detection strategy, which is able to accurately detect multiple aerial targets under ego-motion and clutter conditions.
Acknowledgements. This work has been partially supported by the Ministerio de Ciencia y Tecnología of the Spanish Government under project TIN2004-07860 (Medusa) and by the Comunidad de Madrid under project P-TIC-0223-0505 (Pro-Multidis).
References
1. Strehl, A., Aggarwal, J.K.: Detecting moving objects in airborne forward looking infra-red sequences. In: Proc. IEEE Workshop on Computer Vision Beyond Visible Spectrum, pp. 3–12. IEEE Computer Society Press, Los Alamitos (1999)
2. Strehl, A., Aggarwal, J.K.: MODEEP: a Motion-Based Object Detection and Pose Estimation Method for Airborne FLIR Sequences. Machine Vision and Applications 11(6), 267–276 (2000)
3. Estalayo, E., Salgado, L., Jaureguizar, F., García, N.: Efficient image stabilization and automatic target detection in aerial FLIR sequences. Automatic Target Recognition XVI. In: Proc. of the SPIE, vol. 6234 (2006)
4. Seok, H.D., Lyou, J.: Digital Image Stabilization using Simple Estimation of the Rotational and Translational Motion. Acquisition, Tracking and Pointing XIX. Proc. of SPIE 5810, 170–181 (2005)
5. Yilmaz, A., Shafique, K., Lobo, N., Li, X., Olson, T., Shah, M.A.: Target-tracking in FLIR imagery using mean-shift and global motion compensation. In: Proc. IEEE Workshop on Computer Vision Beyond Visible Spectrum. IEEE Computer Society Press, Los Alamitos (2001)
6. Yilmaz, A., Shafique, K., Shah, M.: Target Tracking in Airborne Forward Looking Infrared Imagery. Image and Vision Computing Journal 21(7), 623–635 (2000)
7. Meier, W., Stein, H.: Estimation of object and sensor motion in infrared image sequences. In: Proc. IEEE Int. Conf. on Image Processing, vol. 1, pp. 568–572. IEEE, Los Alamitos (1994)
8. Wechsler, H., Duric, Z., Fayin, L., Cherkassky, V.: Motion estimation using statistical learning theory. IEEE Trans. on Pattern Analysis and Machine Intelligence 26(4), 466–478 (2004)
9. Stewart, C.V.: Robust parameter estimation in computer vision. SIAM Reviews 41(3), 513–537 (1999)
10. Meer, P., Stewart, C.V., Tyler, D.: Robust computer vision: an interdisciplinary challenge. Computer Vision and Image Understanding 78(1), 1–7 (2000)
11. Rosin, P.: Edges: Saliency measures and automatic thresholding. Machine Vision and Applications 9(4), 139–159 (1999)
12. Srivastava, A., Lee, A.B., Simoncelli, E.P., Zhu, S.C.: On Advances in Statistical Modeling of Natural Images. Journal of Mathematical Imaging and Vision 18, 17–33 (2003)
Embedding Linear Transformations in Fractal Image Coding
Michele Nappi and Daniel Riccio
University of Salerno, via Ponte Don Melillo, 84084 Fisciano, Salerno, Italy
{mnappi,driccio}@unisa.it
Abstract. Many desirable properties make fractals a powerful mathematical model applied in several image processing and pattern recognition tasks: image coding, segmentation, feature extraction and indexing, just to cite some of them. Unfortunately, they are based on a strongly asymmetric scheme, and therefore suffer from very high coding times. On the other hand, linear transforms are quite time balanced, allowing them to be usefully integrated in real-time applications, but they do not provide comparable performance with respect to the image quality at high bit rates. Owing to their potential for preserving the original image energy in a few coefficients in the frequency domain, linear transforms have also seen widespread use in side applications such as selecting representative features or defining new image quality measures. In this paper, we investigate different levels of embedding linear transforms in a fractal based coding scheme. Experimental results have been organized so as to point out the contribution of each embedding step to the objective quality of the decoded image.
1 Introduction
The literature about fractal image compression has grown uninterruptedly since the preliminary definition of Partitioned Iterated Function Systems (PIFS) due to Jacquin in 1989 [3]; most of the interest in fractal coding is due to its side applications in fields such as image database indexing [2] or face recognition [4]. These applications both utilize some sort of coding, and they can reach a good discriminating power even in the absence of a high PSNR from the coding module. The majority of works on fractal image compression set the speed-up of the coding process as their main goal, while still preserving desirable properties of fractal coding such as a high compression rate, fast decoding and scale invariance. Many different solutions have been proposed to speed up the coding phase [3], for instance modifying the partitioning process or providing new classification criteria or heuristic methods for the range/domain matching problem. All these approaches can be grouped in three classes: classification methods, feature vectors and local search. Generally, speed-up methods based on nearest neighbour search with feature vectors outperform all the others in terms of decoded image quality at a comparable compression rate, but they often suffer from the high dimensionality of the feature vector; Saupe's operator represents a
suitable example. To cope with this, dimension reduction techniques are introduced. Saupe reduced the dimension of the feature vector by averaging pixels, while in [7] the DCT is used to cut out redundant information. In the same way, linear transforms have also been widely exploited to extract representative features or to codify groups of pixels in image indexing and compression applications. Indeed, linear transforms form the basis of many compression systems, as they de-correlate the image data and provide good energy compaction. For example, the Discrete Fourier Transform (DFT) [8] is used in many image processing systems, while the Discrete Cosine Transform (DCT) [8] is used in standards like JPEG, MPEG and H.261. Still others are the Walsh-Hadamard Transform (WHT) [8] and the Haar Transform (HT) [8]. In particular, linear transforms have also been a matter of study in the field of objective quality measure definition. The HVS, based on some preliminary DCT filtering, is just an example [5], but also the magnitude and phase of the DFT coefficients have been used to define new objective quality measures [1]. This is motivated by the fact that standard objective measures such as the Root Mean Square Error (RMSE) and the Peak Signal to Noise Ratio (PSNR) are very far from the human perception in some cases. Hence, this paper sets as its main goal to investigate the ways of embedding a generic linear transform T into the standard PIFS coding scheme. In more detail, at first the linear properties of T are exploited to dramatically reduce the computational costs of the coding phase, by arranging its coefficients in a suitable way. Subsequently, the RMSE, commonly used to upper bound the collage error, is replaced by a new objective distance measure based on the T coefficients.
2 Theoretical Concepts
In order to shed light on the further discussions about the hybrid scheme proposed in the paper, it may be useful to draw the reader's attention to some basic concepts about fractal compression and linear transforms.

2.1 Partitioned Iterated Function Systems
A PIFS consists of a set of local affine contractive transformations, which exploit the image self-similarities to cut out redundancies while extracting salient features. In more detail, given an input image I, it is partitioned into a set R = {r_1, r_2, ..., r_|R|} of disjoint square regions of size |r| × |r|, named ranges. Another set D = {d_1, d_2, ..., d_|D|} of larger regions is extracted from the same image I. These regions are called domains and can overlap. Their size is |d| × |d|, where usually |d| = 2|r|. Since a domain is four times the size of a range, it must be shrunk by a 2×2 averaging operation on its pixels. This is done only once, down-sampling the original image and obtaining a new image that is a quarter of the original. An overall representation of the PIFS compression scheme is reported in Fig. 1. The image I is encoded range by range: for each range r, it is necessary to find a domain d and two real numbers α and β such that
Fig. 1. The architecture of our fractal coder: the input image is segmented into ranges and domains, the domains are classified in a KD-tree, a list of candidate domains is retrieved for each range during the range search, and the best domain is selected by error estimation (RMSE) in the coding stage.
$$\min_{d \in D} \; \min_{\alpha, \beta} \; \left\| r - (\alpha d + \beta) \right\|^2 . \qquad (1)$$
Doing so minimizes the quadratic error with respect to the Euclidean norm. It is customary to impose that |α| ≤ 1 in order to ensure convergence in the decoding phase. The inner minimum on α and β is immediate to compute by solving a minimum square error problem. The outer minimum on d, however, requires an exhaustive search over the whole set D, which is an impractical operation. Therefore, ranges and domains are classified by means of feature vectors in order to reduce the cost of searching the domain pool: if the range r is being encoded, only the domains having a feature vector close to that of r are considered.
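For illustration, the inner minimum of Eq. (1) has the usual closed-form least-squares solution for (α, β); the sketch below computes it for a list of candidate domains and applies the |α| ≤ 1 constraint. It is a naive exhaustive loop written only to make the formula concrete, not the paper's feature-vector-based search, and the helper name is hypothetical.

```python
import numpy as np

def match_range_to_domains(r, domains, alpha_max=1.0):
    """For a range block r and a list of (down-sampled) domain blocks, compute
    the least-squares (alpha, beta) for each domain and return the best
    (domain index, alpha, beta, error) according to the collage error of Eq. (1)."""
    r = r.astype(float).ravel()
    best = (None, 0.0, 0.0, np.inf)
    for k, d in enumerate(domains):
        d = d.astype(float).ravel()
        dc = d - d.mean()
        denom = np.dot(dc, dc)
        alpha = 0.0 if denom == 0 else np.dot(r - r.mean(), dc) / denom
        alpha = np.clip(alpha, -alpha_max, alpha_max)        # contractivity constraint
        beta = r.mean() - alpha * d.mean()
        err = np.sum((r - (alpha * d + beta)) ** 2)          # collage error
        if err < best[3]:
            best = (k, alpha, beta, err)
    return best
```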
2.2 Linear Transforms
A Linear Transform (LT) T is characterized by two mathematical properties:

T(x + y) = T(x) + T(y)   (additivity)
T(αx) = α T(x)           (homogeneity)

A third property, shift invariance, is not a strict requirement for linearity, but it is a mandatory property for most image processing techniques. These three properties form the mathematics of how linear transformation theory is defined and used. Homogeneity and additivity play a critical role in linearity, while shift invariance is something on the side. This is because linearity is a very broad concept, encompassing much more than just signals and systems. In other words, when there are no signals involved, shift invariance has no meaning, so it can be thought of as an additional aspect of linearity needed when signals and systems are involved. Linear-transform-domain features are very effective when the patterns are characterized by their spectral properties; so, in this paper, the feature extraction capabilities of the Discrete Fourier Transform (DFT), the Discrete Cosine Transform (DCT) and the Haar Transform (HT) are investigated.
3 Linear Transforms Can Speed-Up the Coding Phase
In order to reduce the computational cost of the exhaustive search while still preserving good image quality, we define feature vectors that help us choose the most promising candidate domains for encoding a given range. Thus, let r and d be a range and a domain block respectively, with r = α·d + β̄, and let T be a two-dimensional linear transform (FFT, DCT or HT); a feature vector u can be extracted from r and d by reorganizing the coefficients of the transform T:

$r = \alpha \cdot d + \bar{\beta}$
$\Rightarrow\; T(r) = T(\alpha \cdot d + \bar{\beta})$ (applying $T$)
$\Rightarrow\; T(r) = \alpha \cdot T(d) + T(\bar{\beta})$ (linearity of $T$)
$\Rightarrow\; T(r) = \alpha \cdot T(d) + \bar{B}$ (transforming $\bar{\beta}$),
where

$\bar{B} = \begin{bmatrix} \beta & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}.$

Denoting by Γ the transformed domain T(d), the transformed range can be rewritten as:

$T(r) = \begin{bmatrix} \alpha\Gamma_{00} + \beta & \alpha\Gamma_{01} & \cdots & \alpha\Gamma_{0n} \\ \alpha\Gamma_{10} & \alpha\Gamma_{11} & \cdots & \alpha\Gamma_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha\Gamma_{n0} & \alpha\Gamma_{n1} & \cdots & \alpha\Gamma_{nn} \end{bmatrix}.$
Notice that only the first term of T(r) is affected by β, and it represents the mean of r. As the main desired property of the feature vector is independence from α and β, the first element of the T(r) matrix is discarded, while the remaining ones are rearranged in a linear vector u of dimension n² − 1 by means of a zig-zag scanning that starts from position (0, 1). In order to also cancel out the effect of α on u, its elements are divided by the quantity E[u]; indeed:

$E[u] = \frac{1}{n^2-1} \sum_{i=0}^{n^2-1} \alpha \cdot \Gamma_i = \alpha \cdot \bar{\Gamma}, \quad \text{where } \bar{\Gamma} = \frac{1}{n^2-1} \sum_{i=0}^{n^2-1} \Gamma_i .$

Finally, the actual feature vector ū is given by:

$\bar{u} = \{\alpha\Gamma_0/E[u],\, \alpha\Gamma_1/E[u],\, \ldots,\, \alpha\Gamma_{n^2-1}/E[u]\} = \{\alpha\Gamma_0/\alpha\bar{\Gamma},\, \ldots,\, \alpha\Gamma_{n^2-1}/\alpha\bar{\Gamma}\} = \{\Gamma_0/\bar{\Gamma},\, \Gamma_1/\bar{\Gamma},\, \ldots,\, \Gamma_{n^2-1}/\bar{\Gamma}\}.$
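The feature extraction step can be sketched as follows (a hypothetical helper of ours, using SciPy's 2-D DCT as the transform T and a simple anti-diagonal scan in place of the exact zig-zag; the block size and normalization follow the description above):

import numpy as np
from scipy.fft import dctn  # 2-D type-II DCT, used here as the linear transform T

def lt_feature_vector(block):
    # Transform the block, drop the (0, 0) coefficient (it absorbs beta),
    # order the remaining n*n - 1 coefficients by anti-diagonals starting
    # at (0, 1), and divide by their mean so that alpha cancels out as well.
    n = block.shape[0]
    coeffs = dctn(block.astype(float), norm='ortho')
    order = sorted(((i, j) for i in range(n) for j in range(n) if (i, j) != (0, 0)),
                   key=lambda p: (p[0] + p[1], p[0]))
    u = np.array([coeffs[i, j] for i, j in order])
    mean = u.mean()
    return u / mean if mean != 0 else u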
4 Linear Transforms Can Improve the Image Quality
A major problem in evaluating lossy techniques is the extreme difficulty in describing the type and amount of degradation in reconstructed images. Because of the inherent drawbacks associated with subjective measures of image quality, there has been a great deal of interest in developing quantitative measures that can consistently be used as a substitute. All these measures have been largely used to assess the quality of the whole image after a coding process has been applied: in other words, the original image is compressed/decompressed by means of an encoder and then the overall amount of distortion introduced by the coding scheme is measured. Thus, objective measures represent an effective way to compare different coding schemes in terms of the percentage of distortion introduced for a fixed compression ratio. Here, the key idea is to embed quality measures based on linear transforms into the coding process, rather than confining them to a mere analysis tool. The compression scheme we adopted for this study, which is represented in Fig. 1, lends itself to a direct replacement of the RMSE by other quality measures.

4.1 LT Based Measures
Many objective quality measures [1] have been defined to replace subjective evaluations while retaining, as much as possible, fidelity with the human perception of image distortions introduced by coding schemes. The most common measures are undoubtedly the RMSE (Root Mean Square Error) and the PSNR (Peak Signal to Noise Ratio) [1]. They owe their widespread use to the fact that they work well on average and have a very low computational cost. However, there are cases in which the quality estimates given by the PSNR are very far from human perception (see Fig. 2), and this led many researchers to define new quality metrics providing better performance in terms of distortion measurement, even if at a higher computational cost. The most significant examples of image quality measures defined in the frequency domain are the Human Visual System [5] (HVS) and the FFT Magnitude Phase Norm [1]. Human Visual System Norm: a few models of the HVS have been developed in the literature; in [5], dealing with the Discrete Cosine Transform, Nill defined the model as a band-pass filter with a transfer function in polar coordinates. The image quality is therefore calculated on pictures processed through such a spectral mask and then inverse discrete cosine transformed. FFT Magnitude Phase Norm: a spectral distance-based measure is the Fourier magnitude and/or phase spectral discrepancy on a block basis [1]. In general, while the mean square error is among the best measures for additive noise, local phase-magnitude measures are more suitable for coding and blur artifacts. In particular, the FFT magnitude/phase norm is most sensitive to distortion artifacts, but at the same time least sensitive to the typology of images.
Fig. 2. Two pictures with the same objective quality (PSNR 26.5 dB), but very different subjective quality
Both these measures have drawbacks. The HVS is too complex to be profitably used in several applications, while the FFT-based distance has two main limitations: a) the phase is significantly smaller than the magnitude and its contribution to the overall distance value is almost negligible; b) the n-norm and the arctangent, needed to compute magnitude and phase, are computationally intensive to calculate, in particular for complex coefficients. Hence it appears that fractal image coding can significantly profit from a simpler image quality measure exploiting the properties of linear transforms. This represents a further level of embedding of the linear transforms into the fractal coding scheme. In more detail, we can define such a distance as follows. Let Γ(u, v) and Γ̂(u, v) be the transformed coefficients of the original and coded image. Considering that some transforms produce coefficients with a real and an imaginary part, we introduce the operator Ψ(Γ(u, v)), defined as follows:

$\Psi(u, v) = |\mathrm{Re}(\Gamma(u, v))| + |\mathrm{Im}(\Gamma(u, v))|$

Thus, the LT distance function can be defined as follows:

$LT = \frac{1}{n^2} \sum_{u=0}^{n-1} \sum_{v=0}^{n-1} \left( \Psi_R(u, v) - \Psi_{\hat{R}}(u, v) \right)^2 . \qquad (2)$
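A direct transcription of (2), assuming the FFT as the linear transform (the function name is ours, not from the paper):

import numpy as np
from scipy.fft import fft2

def lt_distance(block, coded_block):
    # Psi sums the absolute real and imaginary parts of each coefficient;
    # the LT distance is the mean squared difference of Psi over the block.
    n = block.shape[0]
    g = fft2(block.astype(float))
    g_hat = fft2(coded_block.astype(float))
    psi = np.abs(g.real) + np.abs(g.imag)
    psi_hat = np.abs(g_hat.real) + np.abs(g_hat.imag)
    return float(np.sum((psi - psi_hat) ** 2) / (n * n))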
4.2 Embedding Quality Measures in PIFS
In PIFS coding the whole image is partitioned into a set of ranges (see Sect. 2.1). For each range, the coding scheme looks for an approximating domain to be assigned to it, while the domain is mapped into the corresponding range by an affine transformation. For a given range R, PIFS associates the domain providing the
smallest approximation error in a root mean square sense, so exactly at that point it is possible to embed a different quality measure to decide the best range/domain association. The key idea underlying this strategy is that quality measures outperforming the RMSE from a subjective point of view can improve the subjective appearance of the whole image by improving the quality of each range. In other words, in the original definition of the PIFS coding scheme as proposed by Jacquin, the range is approximated by the transformation R̂ = α·D + β by minimizing the error function ‖R − (α·D + β)‖². In this paper, both the HVS and the function in (2) have been investigated as replacements for the RMSE. In particular, α and β are still computed by solving a mean square error problem, while the distance between the original and the transformed range is measured by a new quality measure f(R, R̂). As the HVS is already based on the DCT transform, we only experimented with the LT quality measure (LT in all figures) based on the FFT coefficients.
5 Experimental Results
Tests have been conducted on a dataset of twenty images, twelve of them coming from the Waterloo BragZone standard database [9] and the remaining eight from the web. A large variability in testing conditions has been ensured by selecting test images containing patterns, smooth regions and details. They are all 8-bit grayscale images at a resolution of 512 × 512 pixels. The performance of the algorithm has been assessed from different points of view. The main aim of the tests is to underline the efficiency of the LT based feature vector and the improvements given by LT based quality measures. The compression ratio has been calculated as the ratio between the original image size and the coded image size. Because of the partial reversibility of the coding process, the fractal compression of the image adds noise to the original signal. Less added noise means greater image quality, and therefore a better algorithm. Noise is usually measured by the Peak Signal-to-Noise Ratio (PSNR), which in dB can be computed as follows:

$\mathrm{PSNR} = 10 \cdot \log_{10} \frac{M \cdot N \cdot 255^2}{\sum_{m,n} (s_{m,n} - \hat{s}_{m,n})^2},$

where M and N are the image width and height, 255 is the maximum pixel value, s_{m,n} is the pixel value in the original image and ŝ_{m,n} is the corresponding pixel in the decoded image. In order to further assess the performance of the hybrid scheme, we also compared it with Saupe's algorithm [6].
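For reference, the PSNR defined above amounts to a few lines of NumPy (a sketch of ours, assuming 8-bit images):

import numpy as np

def psnr(original, decoded):
    # M * N * 255^2 over the summed squared error, in dB.
    err = (original.astype(float) - decoded.astype(float)) ** 2
    return 10.0 * np.log10(original.size * 255.0 ** 2 / err.sum())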
5.1 The Contribution of LT Based Feature Vectors and Quality Measures
A comparison with Saupe's algorithm, as shown in Figs. 4 and 5, shows the particular behavior of the three variants of the hybrid scheme (DCT, Haar, FFT). From Fig. 4, it clearly comes out that the FFT provides very poor performance, which is a further confirmation that LTs yielding real and imaginary coefficients are not effective at all when applied within PIFS coding. Fig. 5 also points out that the DCT and Haar based feature vectors have almost comparable performance. Furthermore, they show better performance than the FFT and Saupe's vector. The main reason for the superiority of the DCT and Haar transforms is that they retain most of the image information in their first coefficients, so when a shorter vector is obtained by truncating the original one to a small number of coefficients, more representative features are retained. On the contrary, this does not happen for Saupe's vector, which is usually reduced by averaging its components. In this further experiment an objective assessment of the decoded images in terms of PSNR (Peak Signal to Noise Ratio) is still possible, because quality measures are only used to decide whether the current range must be split or not, while the α and β parameters are still computed to minimize the mean square approximation error. For each test image, twenty different compression ratios were selected for degradation. They range from 4.5:1 to 50:1 with an increment of about 4. This is repeated for all the quality measures (RMSE, HVS, LT) from Section 4.1. Figure 6 shows the PSNR curves for one of the sample images (mandrill), while Fig. 7 reports the mean curves over all test images.
Fig. 3. LT and RMSE searching for a given range (threshold Th = 5.0)
An important observation made in applying the LT based measure to the test images is that it can give PSNR values larger than those obtained with the RMSE (even though the PSNR is maximized where the RMSE reaches its minimum). The explanation of why this happens resides in the range/domain matching process. As soon as the coder finds a domain giving an approximation error lower than a fixed threshold, the domain pool search stops and the range is coded by this domain. The LT metric induces the coder to a more thorough domain search, since it is more selective than the RMSE and provides a small approximation error (lower than the fixed threshold) only for range/domain comparisons which result in small
Fig. 4. PSNR curves on Mandrill image (PIFS with LT based feature vectors)

Fig. 5. Average PSNR curves over all the test images (PIFS with LT based feature vectors)

Fig. 6. PSNR curves on Mandrill image (PIFS with LT based quality measures)

Fig. 7. Average PSNR curves over all the test images (PIFS with LT based quality measures)

Fig. 8. Average PSNR curves over all the test images (PIFS with both LT based feature vectors and quality measures)
RMSE values; on the other hand, the number of range/domain matchings for each range is upper bounded by a fixed constant l (50 in our case), so that the coding time is not significantly affected by the additional comparisons. Fig. 3 reports a graphical example of this kind of situation. A further gain in terms of PSNR is obtained by integrating both heuristics, as shown in Fig. 8, where the PSNR of the hybrid Fractal-LT scheme with both the feature vectors and the LT quality measure (with FFT) integrated into the PIFS scheme is reported. While the FFT feature vector and Saupe's vector are comparable at small compression ratios, the best performance is still given by the DCT and Haar transforms, confirming the results of the previous independent experiments.
6 Conclusion and Remarks
In this paper we proposed a new hybrid approach for fractal image compression, which embeds linear transforms in the PIFS scheme. We described both a new range/domain feature vector, exploiting the homogeneity and additivity of linear transforms, and a new LT based quality measure. Experimental results have shown a significant reduction of the bit rate needed to represent a good fractal code for a given image, and consequently a performance improvement of the coding process. Furthermore, comparisons with a similar coding method show that the proposed algorithm performs better, bringing into evidence that it is able to find a fine approximation for each range, and thus for the whole image, quite efficiently.
There are still many aspects to analyze: further optimization of the LT embedding in the hybrid scheme, the replacement of the current transforms with other linear transforms such as the Hadamard transform, or the use of linear transforms to further compact residual information.
References

1. Avcibas, I., Sankur, B., Sayood, K.: Statistical evaluation of image quality measures. Journal of Electronic Imaging 11(2), 206–223 (2002)
2. Distasi, R., Nappi, M., Tucci, M.: FIRE: Fractal Indexing with Robust Extensions for Image Databases. IEEE Transactions on Image Processing 12(3), 373–384 (2003)
3. Fisher, Y.: Fractal Image Compression: Theory and Application. Springer, New York (1994)
4. Komleh, H.E., Chandran, V., Sridharan, S.: Face Recognition Using Fractal. In: Proceedings of IEEE International Conference on Image Processing (ICIP 2001), vol. 3, pp. 58–61. IEEE, Los Alamitos (2001)
5. Nill, N.B.: A visual model weighted cosine transform for image compression and quality assessment. IEEE Transactions on Communications 3(6), 551–557 (1985)
6. Distasi, R., Nappi, M., Riccio, D.: A Range/Domain Approximation Error Based Approach for Fractal Image Compression. IEEE Transactions on Image Processing 15(1), 89–97 (2006)
7. Wohlberg, B., de Jager, G.: Fast image domain fractal compression by DCT domain block matching. Electronics Letters 31(11), 869–870 (1995)
8. Wu, J.-L., Duh, W.-J.: Feature extraction capability of some discrete transforms. In: Proceedings of the IEEE International Symposium on Circuits and Systems, vol. 5, pp. 2649–2652. IEEE, Los Alamitos (1991)
9. Kominek, J.: Waterloo BragZone and Fractals Repository (January 25, 2007), http://links.uwaterloo.ca/bragzone.base.html
Digital Watermarking with PCA Based Reference Images

Erkan Yavuz¹ and Ziya Telatar²

¹ Aselsan Electronic Ind. Inc., Communications Division, 06172, Ankara, Turkey
[email protected]
² Ankara University, Faculty of Eng., Dept. of EE, 06100, Besevler, Ankara, Turkey
[email protected]
Abstract. Principal Components Analysis (PCA) is a valuable technique for dimensionality reduction of huge datasets. Principal components are linear combinations of the original variables. The projection of the data on this linear subspace keeps most of the original characteristics. This helps to find robust characteristics for watermarking applications. Most PCA based watermarking methods operate in the projection space, i.e., on eigen images. In this study, different from the other methods, PCA is used to obtain a reference of the cover image by exploiting the compression property of PCA. PCA and block-PCA based methods are proposed by using some of the principal vectors in reconstruction. The watermarking is done according to the difference between the original image and its reference. The method is compared with a Discrete Wavelet Transform (DWT) based approach and its performance against some attacks is discussed.
1 Introduction

The rapid development of digital multimedia has attracted people's attention; with the help of compression algorithms and increased internet connection speeds, it is easy to share content with other internet users in a reasonable time. Besides, as a consequence of digital technology, it is easy to generate identical but unauthorized copies. Thus, the protection of multimedia items gets harder day by day. Digital watermarking systems have been proposed to provide content protection, authentication and copyright protection, protection against unauthorized copying and distribution, etc. Fragile or semi-fragile watermarking methods are proposed for content protection applications. Robust watermarking, a way of copyright protection among the other methods, aims at ensuring that the watermark cannot be removed or damaged by malicious or non-malicious attacks by third parties. Watermarking methods have some common properties, known as imperceptibility, robustness, security and capacity. Robustness is not required for fragile methods, while capacity is not very important for authentication purposes. Imperceptibility and security are therefore the more common and important features of watermarking systems. Watermarking can be classified according to different criteria. As for the working domain, it can be grouped into two categories, spatial domain and frequency (transform) domain methods. In spatial domain approaches the watermark is embedded directly at the pixel locations. Least Significant Bit (LSB) modification [1] is a well-
known example of this type of method. In frequency domain approaches, the watermark is embedded by changing the frequency components. The Discrete Cosine Transform (DCT) ([2], [3], [4]) and the DWT ([5], [6]) are the most common transforms used in watermarking. Spatial domain methods are not preferred since they are not robust to common image processing operations and especially to lossy compression. The embedding region can be another classification criterion. One can use a secret key, use human perception criteria, or use the whole image without any special selection. For robustness, it is preferred to embed the watermark into the perceptually most significant components [2], but in this way the visual quality of the image may degrade and the watermark may become visible. If perceptually insignificant components are used, the watermark may be lost during lossy compression. Determining the place of the watermark is therefore a tradeoff between robustness and invisibility. After choosing the embedding place, another question is how to embed. It may be additive [2] or quantization [7] based. The embedding method more or less determines the detection method. The method can be blind, semi-blind or non-blind. In blind schemes, the original cover image is not necessary to extract the watermark; in semi-blind schemes, the watermarked image with some side information is needed; lastly, non-blind or private schemes require the original image. Generally speaking, quantization based methods enable blind detection. PCA has been used in different ways in watermarking methods. Pu et al. applied PCA to the watermark to improve their DCT based watermarking method [8]. For codebook based watermarking, the adjacent codewords may be so distinct that the watermarked image quality becomes very low. Chang and Lin [9] proposed a method with a PCA sorted Vector Quantization (VQ) codebook to solve this problem. The long term average attack is one of the problems of video watermarking methods. To overcome this, the embedding locations should be selected from the varying regions of the video segment. Wang et al. found the embedding locations with PCA for their method [10]. Kaarna and Toivanen [11] proposed a method for multi-band spectral images. They apply PCA to the spectral images, obtain eigen images, apply DWT to the eigen images and embed the watermark there. Hien et al. [12] embedded the watermark into block based eigen images. They made a tradeoff between robustness and invisibility and did not use the first principal vector. Kang et al. [13] used the Multi-Band Wavelet Transform (MWT) to decompose the image, formed vectors from the same spatial locations and applied PCA on these vector sets. They embed the watermark into the first principal vectors for better robustness. In this study, different from the above examples, PCA is used for reference image generation by exploiting the compression property of PCA. There are some watermarking studies using a reference image derived from the original. These are robust against attacks, unlike most of the spatial domain methods. In Joo et al.'s method [14], an n-th level DWT is applied to the image. The DWT is applied once more to the n-th level LL band. The resulting subbands are set to zero except LL_{n+1}, and the inverse DWT is applied. In this way the reference of LL_n is obtained (LL′_n). The absolute difference of LL_n and LL′_n is calculated, sorted in descending order, and some of the coefficients are chosen to embed the watermark. The watermark here is a pseudo-random sequence containing +1s and -1s. The watermark is added to the selected coefficients.
Since the modified coefficients in LL_n change its reference LL′_n
further, the method is repeated iteratively. In each iteration the Peak Signal to Noise Ratio (PSNR) decreases, and the iteration stops when the PSNR comes down to an acceptable limit. The original image is needed in extraction to determine the embedding places. The watermark is extracted according to the differences between the original and reference pixel values. In Liu et al.'s scheme [15], a one-level DWT is applied to the image, A, and the reference image, A′, is obtained by setting all subbands except the LL band to zero and applying the inverse DWT. The absolute difference of the original and reference image is calculated. The pixels satisfying the condition (s < |A(i,j) − A′(i,j)| < t) are selected for watermark embedding.
2 Principal Components Analysis

PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on. Unlike other linear transforms, PCA does not have a fixed set of basis vectors; its basis vectors depend on the data set. This technique orthogonalizes the components of the input vectors so that they are uncorrelated with each other, orders the resulting orthogonal components (principal components), and eliminates those components that contribute the least to the variation in the data set, without much loss of information. In other words, the lower-order components contain the most important characteristics of the original information. PCA is thus useful for dimensionality reduction, highlighting similarities and differences, and compression ([9], [13]). For the mathematical representation, let X be a data set of dimension M×N. A mean vector (m_x) is formed with the means of each row. It is then subtracted from the columns to make the data set zero mean (X_C). The covariance matrix of the zero-mean data set is calculated (M×M):

$C = \mathrm{E}\left( X_C \cdot X_C^T \right) \qquad (1)$
The sorted eigenvectors (M×M) of the covariance matrix are the basis vectors of PCA:

$C V = \lambda V \qquad (2)$
Then, the projection of the data in X onto the basis vectors in V is found as:

$Y = V^T X \qquad (3)$
For dimensionality reduction (L×N instead of M×N), the first L basis vectors (L < M) can be used in the projection:

$Y_L = V_L^T X \qquad (4)$
3 Proposed Method

The cover image is gray-scale with M×N dimension. The watermark is a random sequence of 1 and -1 values generated by a seed. The method is based on randomly selecting some pixels of the difference between the original image and its reference that satisfy the condition (s < |A(i,j) − A′(i,j)| < t), and modifying the selected pixel values according to the watermark bits.
Fig. 1. PCA and block-PCA based reference images for Lena: (a) 30 principal vectors; (b) 16×16 block, 1 principal vector
3.1 Watermark Embedding

• Obtain the reference image of the cover image by PCA or block-PCA, A′
• Obtain the difference of the cover and reference images, A − A′
• Find the pixels satisfying s < |A(i,j) − A′(i,j)| < t
• Modify the selected pixels (p_i, p_j) as:

$A_w(p_i, p_j) = \begin{cases} A'(p_i, p_j) + \alpha, & \text{if } w(k) = 1 \text{ and } A(p_i, p_j) > A'(p_i, p_j) \\ A'(p_i, p_j) - \alpha, & \text{if } w(k) = -1 \text{ and } A(p_i, p_j) < A'(p_i, p_j) \end{cases} \qquad (5)$

Here, α = (s + t)/2 [15], k = 1 to N, and N is the watermark length.
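A hedged sketch of the embedding rule in (5); how candidate pixels are paired with watermark bits is not fully specified above, so the pairing below (the next candidate whose difference sign matches the bit, visited in a seeded random order) is our own assumption:

import numpy as np

def embed_watermark(A, A_ref, w, s=6, t=8, seed=250):
    alpha = (s + t) / 2.0
    diff = A.astype(float) - A_ref.astype(float)
    candidates = np.argwhere((np.abs(diff) > s) & (np.abs(diff) < t))
    rng = np.random.default_rng(seed)
    rng.shuffle(candidates)
    pos = [tuple(p) for p in candidates if diff[tuple(p)] > 0]  # A above its reference
    neg = [tuple(p) for p in candidates if diff[tuple(p)] < 0]  # A below its reference
    Aw = A.astype(float).copy()
    locations = []
    for bit in w:
        i, j = pos.pop() if bit == 1 else neg.pop()
        Aw[i, j] = A_ref[i, j] + alpha if bit == 1 else A_ref[i, j] - alpha
        locations.append((i, j))
    return Aw, locations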
Fig. 2. Block diagram of the proposed method: (a) embedding, (b) extraction
3.2 Watermark Extraction

• Find the reference image of the watermarked image, A′_w
• Extract the watermark according to the difference between the watermarked image and its reference at the embedding locations as:

$w'(k) = \begin{cases} 1, & \text{if } A_w(p_i, p_j) \ge A'_w(p_i, p_j) \\ -1, & \text{if } A_w(p_i, p_j) < A'_w(p_i, p_j) \end{cases} \qquad (6)$

• Compare the extracted and original watermark:

$\mathrm{Sim}(w, w') = \frac{w \cdot w'}{\sqrt{w' \cdot w'}} \qquad (7)$
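Eqs. (6) and (7) translate directly into a short extraction routine (a sketch of ours; the list of embedding locations is assumed to be known, as the scheme is semi-blind):

import numpy as np

def extract_and_compare(Aw, Aw_ref, locations, w):
    # Read a +1/-1 bit from the sign of the difference at each location,
    # then compute the normalized similarity with the original watermark.
    w_prime = np.array([1 if Aw[i, j] >= Aw_ref[i, j] else -1 for i, j in locations])
    w = np.asarray(w, dtype=float)
    sim = float(np.dot(w, w_prime) / np.sqrt(np.dot(w_prime, w_prime)))
    return w_prime, sim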
4 Experiments

The gray-scale image size is 256×256 in the experiments. The same parameters and measurement methods proposed in [15] are used so that our results can be compared with theirs: s is 6 and t is 8, so α is obtained as 7, and system performance is measured with the normalized similarity. In [15], the test images were Lena and Baboon. Cameraman, Boat, Bridge, Peppers and Goldhill are included in the test set for this study. The watermark is a pseudo-random sequence of 1s and -1s with length 1000, generated with seed number 250. MATLAB and the Image Processing Toolbox are used for the simulations and attacks. The proposed method is tested against resizing, JPEG compression, Gaussian blur, average blur, median filtering, Gaussian noise, salt and pepper noise, sharpening, histogram equalization, contrast enhancement and cropping. To obtain the reference image, 10 and 30 principal vectors are used in reconstruction for the PCA based reference (b = 10 or 30), and the first principal vector is used in reconstruction for 8×8, 16×16 and 32×32 block sizes for the block-PCA based reference (d = 1). An example of the original and watermarked image is given in Figure 3. In Table 1, the comparative performance of the method with respect to the DWT based one is given for the Lena image. The values in the table are normalized by the maximum possible similarity value (e.g., the maximum similarity is 31.6 for a watermark of length 1000). The proposed method is robust up to 10% JPEG compression quality and gives better results than the DWT based one, especially at 40% JPEG compression quality (Figure 4). The simulation is repeated for each attack with 1000 different fake watermarks, and the maximum similarity achieved with fake watermarks is below 0.15 (Figure 5). The experiments with the other images in the test set give similar results. Since there is no special criterion for embedding place selection other than selecting the pixels randomly, the performance of the method degrades slightly for highly textured images like Baboon, Bridge, etc. This is because textured regions are weak against compression and filtering attacks. To overcome this effect, selecting smooth regions as the embedding locations at the beginning of the algorithm was tested, and an improvement was achieved. In Figure 6, test results
for the Baboon, Bridge and Boat images for the region-selection and no-selection cases against JPEG compression are given. For the cropping attack in [15], it is assumed that there is no change in image size and half of the image is covered partially (i.e., the gray-scale value of the covered regions is 255). The block-PCA based approach gives similar results, but the PCA based approach does not. During the tests we saw that the information loss due to cropping changes the reference image significantly, relative to the non-cropped reference, for the PCA approach more than for the block-PCA and DWT approaches. The significant change in the reference image affects the performance.
Fig. 3. Example of the method: (a) original Lena, (b) watermarked Lena, PSNR: 71

Table 1. Attack performance of the proposed system
Attack type                DWT [15]  Block PCA 8x8  Block PCA 16x16  Block PCA 32x32  PCA 30 pri. vec.  PCA 10 pri. vec.
(No attack PSNR)           70.1      71.1           71.2             71.0             71.0              70.9
No Attack                  1         1              1                1                1                 1
Resizing 1/4 -> 1/1        0.79      0.53           0.68             0.72             0.59              0.79
JPEG, 10% quality          0.52      0.23           0.46             0.58             0.51              0.69
Gaussian Blur 5x5          0.77      0.47           0.62             0.68             0.54              0.75
Average Blur 5x5           0.69      0.29           0.52             0.60             0.41              0.67
Median Filter 5x5          0.79      0.39           0.59             0.70             0.48              0.72
Gaussian Noise 0.005       0.35      0.33           0.34             0.34             0.36              0.35
Salt Pepper Noise 0.01     0.98      0.83           0.81             0.91             0.69              0.90
Sharpening 0.2             0.68      0.73           0.69             0.72             0.45              0.53
Histogram Equalization     0.95      0.95           0.97             0.94             0.93              0.86
Contrast Enhancement       0.96      0.94           0.94             0.91             0.70              0.74
Cropping [15]              0.40      0.41           0.43             0.34             0.11              0.11
Fig. 4. Comparison of PCA, Block-PCA and DWT based methods against JPEG compression
Fig. 5. Performance against fake watermarks (400th watermark is the correct one)
Fig. 6. Comparison of region selection and no-selection cases for embedding place against JPEG compression
5 Conclusions

In this study, a robust image watermarking method is proposed using PCA or block-PCA based reference images. The main idea is to find the difference between the original image and its reference, select suitable embedding locations, and change the selected pixel values while preserving the difference. The proposed method gives better results for JPEG compression when compared to the DWT based one [15]. Different numbers of principal vectors for PCA reconstruction and different block sizes for block-PCA reconstruction were studied. For the PCA based reference, if the number of principal vectors used in reconstruction increases, the reference image comes closer to the original one. Similarly, for the block-PCA based reference, the reference image comes closer to the original one if the block size decreases. We saw from the test results that if the reference image is perceptually similar to the original and the limit values (s, t) are selected below 10 as in [15], the candidate embedding locations become limited to high-detail (textured) regions. Since high-detail regions are weak against compression and filtering attacks, system performance decreases. According to the tests, using 10 principal vectors for the PCA based reference and using a 32×32 block size (with 1 principal vector in reconstruction) give better results. If the limit values (s, t) are selected close to each other, their mean value will be close to them. Then the change in the pixel value will be perceptually invisible. This makes it possible to use smooth regions for embedding without degrading the invisibility criterion. It is shown that performance is increased by selecting smooth regions for cover images with high detail (like Baboon, Bridge, etc.). The information loss due to the cropping attack changes the reference significantly for the PCA based reference approach, which leads to a decrease in performance, but this is not the case for the block-PCA based reference approach. As future work, we will try to modify the method such that it will not be necessary to know the embedding locations in extraction; the method will then change from semi-blind to blind. Secondly, to improve the performance further, we will try to implement automatic selection of the smooth regions as embedding places.
References

1. Schyndel, R.G., Tirkel, A.Z., Osborne, C.F.: A Digital Watermark. In: Proceedings of IEEE International Conference on Image Processing (ICIP94), Austin, USA, vol. 2, pp. 86–90. IEEE, Los Alamitos (1994)
2. Cox, I.J., Kilian, J., Thomson, L., Shamoon, T.: Secure Spread Spectrum Watermarking for Multimedia. IEEE Transactions on Image Processing 6(12), 1673–1687 (1997)
3. Barni, M., Bartolini, F., Cappellini, V., Piva, A.: A DCT-Domain System for Robust Image Watermarking. Signal Processing 66(3), 357–372 (1998)
4. Suhail, M.A., Obaidat, M.S.: Digital Watermarking-Based DCT and JPEG Model. IEEE Transactions on Instrumentation and Measurement 52(5), 1640–1647 (2003)
5. Hsieh, M-S., Tseng, D-C.: Hiding Digital Watermarks Using Multiresolution Wavelet Transform. IEEE Transactions on Industrial Electronics 48(5), 875–882 (2001)
6. Kundur, D., Hatzinakos, D.: Towards Robust Logo Watermarking Using Multiresolution Image Fusion. IEEE Transactions on Multimedia 1(2), 185–198 (2004)
7. Chen, B., Wornell, G.W.: Quantization Index Modulation: A Class of Provably Good Methods for Digital Watermarking and Information Embedding. IEEE Transactions on Information Theory 47(4), 1423–1443 (2001)
8. Pu, Y., Liao, K., Zhou, J., Zhang, N.: A Public Adaptive Watermark Algorithm for Color Images Based on Principal Component Analysis of Generalized Hebb. In: Proceedings of International Conference on Information Acquisition, pp. 484–488 (2004)
9. Chang, C-C., Lin, P-Y.: A Compression-Based Data Hiding Scheme Using Vector Quantization and Principle Component Analysis. In: International Conference on Cyber Worlds, Tokyo, Japan, pp. 369–375 (2004)
10. Wang, R., Cheng, Q., Huang, T.: Identify Regions of Interest (ROI) for Video Watermark Embedment with Principle Component Analysis. In: ACM Multimedia, Los Angeles, CA, USA, pp. 459–461. ACM Press, New York (2000)
11. Kaarna, A., Toivanen, P.: Digital Watermarking of Spectral Images in PCA/Wavelet-transform Domain. In: Proceedings of the International Geoscience and Remote Sensing Symposium, IGARSS'03, Toulouse, France, vol. VI, pp. 3564–3567 (2003)
12. Hien, T.D., Chen, Y-W., Nakao, Z.: The PCA Based Digital Watermarking. In: Palade, V., Howlett, R.J., Jain, L. (eds.) KES 2003. LNCS, vol. 2774, pp. 1427–1434. Springer, Heidelberg (2003)
13. Kang, X., Zeng, W., Huang, J., Zhuang, X., Shi, Y-Q.: Digital Watermarking Based on Multi-band Wavelet and Principal Component Analysis. In: Proceedings of the SPIE Visual Communications and Image Processing, vol. 5960, pp. 1112–1118 (2005)
14. Joo, S., Suh, Y., Shin, J., Kikuchi, H.H.: A New Robust Watermark Embedding into Wavelet DC Components. ETRI Journal 24(5), 401–404 (2002)
15. Liu, J-L., Lou, D-C., Chang, M-C., Tso, H-K.: A Robust Watermarking Scheme Using Self-Reference Image. Computer Standards & Interfaces 28(3), 356–367 (2006)
JPEG2000 Coding Techniques Addressed to Images Containing No-Data Regions

Jorge González-Conejero, Francesc Aulí-Llinàs, Joan Bartrina-Rapesta, and Joan Serra-Sagristà

Universitat Autònoma de Barcelona, Dep. of Information and Communications Engineering, UAB-ETSE, Cerdanyola del Vallès, 08290 Spain
Abstract. This work introduces techniques addressed to enhance the coding performance obtained when compressing images that contain areas with irrelevant information, here called no-data regions. No-data regions can be produced by several factors, such as geometric corrections, the overlapping of successive layers of information, a malfunction of the sensor used to capture the image, etc. Most coding systems are not devised to consider such regions separately from the rest of the image, sometimes causing an important loss in coding efficiency. Within the framework of JPEG2000, we propose five techniques that address this issue. Experimental results suggest that the application of the proposed techniques can achieve, in some cases, a compression gain of 130 over a compression without the proposed techniques.
1 Introduction
Nowadays, the compression, manipulation, and transmission of images are becoming an important issue for applications in many different fields of our society. In Geographic Information Systems (GIS), in Remote Sensing (RS), in telemedicine, and even in navigation systems, images are an important source of data to enhance the abilities of devices and applications. In some cases, however, images present a special characteristic: they contain areas that are completely irrelevant for the final user or application. In RS, for instance, spatial adjustments carried out to correct the image geometry generate images containing empty areas. Also in RS, images captured through sensors located on satellites or airplanes often contain areas of irrelevant information due to atmospheric events. The successive overlapping of layers of information in GIS applications [1], on the other hand, produces images that contain areas that are permanently out of sight. The medical community also deals with this issue: the use of high resolution sensors produces images that, in some cases, contain large regions that are irrelevant for the medical diagnosis. Even in compound documents, where images are combined with graphics and text (Mixed Raster Content, MRC [2]), these regions arise. We refer to these areas of irrelevant information as no-data regions. JPEG2000 is the most recent image compression standard. It supplies advanced features for the efficient coding, manipulation, and transmission of
images. Some of these features are: state-of-the-art coding performance, five different progression orders, random code-stream access and processing, scalability by quality, resolution, component, and spatial area, etc. The abilities of JPEG2000 fulfill most of the requirements of applications and scenarios where images are used and, nowadays, it is the most important reference for still image compression. Even so, JPEG2000 does not consider that images might contain no-data regions. Coding systems able to consider no-data regions can be found in the literature. On the one hand, extensions of state-of-the-art coding systems to deal with no-data regions are introduced, for instance, in [3,4]. These coding systems rely on the main idea of skipping coefficients within no-data regions, although such extensions can penalize the coding performance since the original coding system is not devised to dismiss regions of the image (see [5]). On the other hand, there are coding systems that are specifically designed for the encoding of images that contain no-data regions. As an example, in [6], a set partitioning stage and a later shrink stage are used to avoid the no-data coefficients in the encoding process. None of these coding systems is devised for JPEG2000. The purpose of this work is precisely to develop techniques addressed to the coding of images containing no-data regions within the framework of JPEG2000. This goal is achieved by developing techniques that take into account the complete lack of importance of the no-data regions within some of the stages of the coding system. The paper is structured as follows: Section 2 introduces the techniques developed for JPEG2000 for coding images containing no-data regions; Section 3 assesses the coding performance and computational load of these techniques, presenting several experimental results; and the last section summarizes this work, pointing out some conclusions.
2 Proposed Coding Techniques Considering No-Data Regions
JPEG2000 consists of 12 parts. Part 1 [7] contains the core coding system; it is the basis of the other parts and was conceived from the coding paradigm Embedded Block Coding with Optimized Truncation (EBCOT) [8]. The main coding stages of the JPEG2000 core coding system are depicted in Figure 1. Figure 1 also depicts the five techniques proposed for the coding of images containing no-data regions. The first two techniques are simple modifications of the image samples belonging to the no-data regions. The aim of these modifications is to reduce the coding cost of no-data regions, considering the intrinsic features of the coding system. Both techniques are applied during the pre-processing stage of the JPEG2000 core coding system and are called Average Data Region (ADR) and Phagocyte. The third technique we propose is based on one of the features provided by JPEG2000: Region Of Interest (ROI) coding. Although two different types of ROI coding are supported in JPEG2000, only the MaxShift method is able
Fig. 1. White boxes depict the main coding stages of the JPEG2000 core coding system (multi-component transform, discrete wavelet transform, quantization, tier-1 encoding, rate control, tier-2 encoding). Gray boxes depict the proposed techniques devised for the coding of images containing no-data regions (ADR/Phagocyte, MaxShift ROI coding, DWT-SA, BPE-SA).
to code the ROI separately from the background. MaxShift is applied after the quantization stage of the coding system. Besides its application, we also slightly modify the rate control stage to stop the encoding process when the data region is already coded, completely avoiding the coding of no-data regions. The last two techniques imply in-depth modifications of two important stages of the coding system: the Discrete Wavelet Transform (DWT) and the Bit-Plane Encoder (BPE) defined in tier-1. These modifications break JPEG2000 compliance, constructing code-streams that can only be correctly decoded by decoders that also implement the proposed modifications. These two techniques are called DWT Shape Adaptive processing (DWT-SA) and BPE Shape Adaptive processing (BPE-SA).
2.1 Average Data Region and Phagocyte Techniques
Figure 2 (a) depicts an image of Barcelona acquired through a Landsat mission in 2003. This image is used in a GIS application where the overlapping of successive maps [1] covers the buildings of the image and thus produces no-data regions, which are set by default to the highest value of the allowed bit-depth. Notice the separation of the no-data regions (white areas) from the data regions: a sharp boundary is clearly noticeable between both regions. Taking into account that the DWT stage used in most image coding systems is precisely devised to de-correlate the information of the image, detecting smooth areas and sharp edges, one can easily infer that the separation between data and no-data regions is also captured by the high-pass filters of the DWT and thus reflected in the high-frequency subbands produced by the DWT. This evidence is given in Figure 2 (b), which depicts the DWT subbands produced by the application of one level of DWT using the common 9/7 filter-banks. Note that, except for the residual subband (top-left), the boundary between data and no-data regions is distinguishable in the three remaining subbands. The problem here is that these sharp boundaries are precisely the areas that have the highest coding cost, because they contain coefficients with high values, and therefore several bit-planes have to be encoded, generating longer code-streams.
Fig. 2. Image Barcelona2. (a) Original image (no-data regions are the white areas). (b) One level of DWT applied to the original image. (c) Image produced by the Average of Data Region. (d) Image produced by Phagocyte using a window size of 32×32.
The main idea inspiring ADR and Phagocyte is to smooth the boundary between data and no-data regions. ADR computes the arithmetic mean of the data regions and sets all coefficients belonging to no-data regions to the computed mean. Figure 2 (c) depicts the image obtained when this simple operation is carried out. Note that the boundary surrounding the no-data regions is smoother than in the original image. However, the arithmetic mean of the data regions does not take into account that images have areas with different textures and characteristics, which might eventually cancel the gain obtained with ADR. To consider local image variations, Phagocyte computes the arithmetic mean of the coefficients within a window of specified size. This technique was first implemented in 1992 in the GIS application MiraMon [9], developed by the Center for Ecological Research and Forestry Applications (CREAF) [10]. Figure 2 (d) depicts the image obtained when Phagocyte is applied. In this case, the boundaries are even smoother than with ADR.
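Both pre-processing techniques amount to a few lines of NumPy. The sketch below is ours; in particular, Phagocyte is approximated with non-overlapping windows, which may differ from the original MiraMon implementation:

import numpy as np

def adr(image, data_mask):
    # Average Data Region: every no-data sample is set to the mean of the data region.
    out = image.astype(float).copy()
    out[~data_mask] = image[data_mask].mean()
    return out

def phagocyte(image, data_mask, window=32):
    # Each no-data sample is replaced by the mean of the data samples falling
    # in the same window, so the boundary is smoothed using local statistics.
    out = image.astype(float).copy()
    h, w = image.shape
    global_mean = image[data_mask].mean()
    for i in range(0, h, window):
        for j in range(0, w, window):
            blk = (slice(i, min(i + window, h)), slice(j, min(j + window, w)))
            sub, m = out[blk], data_mask[blk]
            data = image[blk][m]
            sub[~m] = data.mean() if data.size else global_mean
    return out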
2.2 MaxShift ROI Coding Technique
ROI coding is a feature that some image coding systems provide. The main idea of ROI coding is to assign different priorities to specified image regions. JPEG2000
Fig. 3. Scanning order skipping no-data coefficients. Coefficients belonging to the no-data regions are depicted in black.
provides two ROI coding methods, the MaxShift and the Scaling-based method, defined in Part 1 [7] and Part 2 [11] respectively. The main difference between these two methods is that MaxShift compels the encoder to code the complete ROI before the rest of the image (considered background), whereas the Scaling-based method allows different priorities for the ROIs, enabling the coding of the ROI jointly with the background. In the last seven years, 15 new approaches have been proposed in the literature, most of them aimed at finely combining the ROI with the background. Some of them are [12, Chapter 16.2],[13,14,15,16]. The only ROI coding method suitable to manage the coding of no-data regions is the MaxShift method, since it is the only one that separates the coding of the ROI from the coding of the background. Although this may become a shortcoming for some applications, it fits perfectly with the coding of no-data regions. The main operation carried out by MaxShift is an up-shift of the coefficients within the ROI. By simply considering the no-data region as the background and stopping the encoding process when the complete ROI is encoded, the regions containing no-data are completely excluded from the final code-stream.
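A toy illustration of the up-shift on coefficient magnitudes (our own simplification, not the exact Part 1 procedure, which also handles signs and quantization) could look as follows:

import numpy as np

def maxshift_upshift(coeffs, roi_mask):
    # Shift ROI coefficients up by s bit-planes, with s large enough that every
    # ROI coefficient ends up above the largest background (no-data) coefficient,
    # so the decoder can tell the two apart from the magnitude alone.
    mags = np.abs(coeffs).astype(np.int64)
    background = mags[~roi_mask]
    s = int(background.max()).bit_length() if background.size else 0
    shifted = mags.copy()
    shifted[roi_mask] <<= s
    return shifted, s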
2.3 Shape Adaptive Processing Techniques
The one dimensional (1D) DWT can be understood as a successive application, to an original sequence of samples, of a pair of low-pass and high-pass filters, called the analysis filter-bank. The extension to the second dimension of an image just considers the application of the 1D DWT to the columns of the image and then to the resulting rows. An issue that comes up when applying the DWT to an image is what happens at the image boundaries, i.e., how to deal with the non-existent sample locations outside the image that are needed for the application of the filter-bank. This is commonly addressed by carrying out a mirror effect that virtually extends the image boundaries. For example, the sample x[1] would be the mirror of the non-existent sample x[−1], where x[n] denotes the samples of an image row. The main idea behind the DWT-SA technique is to apply a mirror effect at the boundaries of the data region, with the aim of minimizing the coding cost of these boundaries. The first time this approach was proposed in the literature
for still image compression was by Li and Li in [17]. After the application of the shape-adaptive DWT, DWT-SA also sets to 0 all the wavelet coefficients belonging to the no-data regions. In addition to the operations carried out by DWT-SA, the BPE-SA technique avoids the coding of coefficients belonging to no-data regions by skipping them in the fractional bit-plane coder of the tier-1 coding stage. This modification is rather simple (see Figure 3). A similar approach is proposed in [4] to separate the coding of ROIs with lossless compression from the coding of the background with lossy compression. Shape adaptive encoding has obvious benefits when dealing with no-data regions. Nevertheless, it entails drawbacks such as less compaction of the image information by the shape-adaptive DWT, compression performance losses due to the irregular shapes of the no-data regions, etc. In [5] the costs and advantages of shape adaptive encoding are analyzed.
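The essence of the shape-adaptive processing along one row can be sketched as follows (our own simplification: a plain FIR filter with mirroring at each data run, rather than the actual lifting-based 9/7 transform):

import numpy as np

def mirror_extend(segment, ext):
    # Symmetric extension: x[-1] is mirrored from x[1], and likewise at the end.
    left = segment[1:ext + 1][::-1]
    right = segment[-ext - 1:-1][::-1]
    return np.concatenate([left, segment, right])

def filter_data_runs(row, data_mask, kernel):
    # Filter each contiguous run of data samples separately, mirroring at the
    # run boundaries, and leave the no-data samples set to zero.
    out = np.zeros_like(row, dtype=float)
    ext = len(kernel) // 2
    idx = np.flatnonzero(data_mask)
    if idx.size == 0:
        return out
    runs = np.split(idx, np.where(np.diff(idx) > 1)[0] + 1)
    for run in runs:
        seg = row[run].astype(float)
        if seg.size > ext:
            out[run] = np.convolve(mirror_extend(seg, ext), kernel, mode='valid')[:seg.size]
        else:
            out[run] = seg
    return out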
3 Experimental Results
All the techniques introduced in the previous section have been implemented in our JPEG2000 Part 1 implementation BOI [18]. Tests have been carried out using the lossless and lossy modes of the JPEG2000 standard, with 5 DWT levels, derived quantization for the lossy mode, code-blocks of size 64×64 and the restart coding variation. The proposed techniques need to transmit the shape of the no-data regions for a clear distinction between data and no-data regions using, for instance, methods such as [19] or [20]. Since this does not influence the comparison among the proposed techniques, the transmission of the shape of the no-data regions is considered a complementary problem that is not addressed in this work. The proposed techniques have been evaluated using three different image corpora. The first one contains four natural images of the ISO 12640-1 corpus [21]. The three images of the second image corpus have been provided by CREAF [10] and were captured through the Landsat IV mission [22] in 2002 and 2003, belonging to different areas of Catalonia (a region of Spain). The five images of the third image corpus have been provided by UDIAT [23] and have been captured through different devices used by the medical community (computed tomography, mammography, and computed radiography). All images are gray-scale with a bit-depth of 8 bits per sample (bps). We set different no-data regions that fill different percentages of the image to fairly assess the coding performance of the proposed techniques. All images, with their corresponding no-data regions, are depicted in Figure 4. When more than one no-data region is defined for the same image, they are denoted as imageName-a, imageName-b, and so on. We present two types of experimental results: the coding performance achieved when encoding the images at different bit-rates in lossy mode, and the computational load, defined as the time needed to carry out different stages of the encoding process when the five proposed techniques are applied.
ISO 12640-1 corpus
Portrait 2048 × 2560 68% of data region
Fruit Basket 2560 × 2048 54% of data region
Wine and tableware 2560 × 2048 44% of data region
Musicians 2560 × 2048 a) 30% of data region b) 70% of data region
Landsat corpus
Barcelona1 7200 × 4800 71% of data region
Barcelona2 7200 × 5000 82% of data region
Garrotxa 7109 × 4864 a) 32% of data region b) 68% of data region
Medical corpus
CR1 1760 × 2136 57% of data region
CR2 2048 × 2495 59% of data region
CT 512 × 512 57% of data region
Mammo1 1914 × 2294 35% of data region
Mammo2 1914 × 2294 35% of data region
Fig. 4. Images of corpora with their corresponding data/no-data regions. White areas depict data regions. The percentage of data region is shown below the image name.
3.1 Coding Performance
In these tests, each image of the corpora has been encoded at 100 equally spaced bit-rates from 0.001 to 4 bps, except for the images of the medical corpus, which have been encoded at 100 bit-rates from 0.001 to 2.5 bps. For each target bit-rate, each image has been encoded using the five proposed techniques, decoded and compared to the original image in
terms of the Peak Signal to Noise Ratio (PSNR). Obviously, the PSNR is computed only for the data regions. In addition to the coding performance achieved by the proposed techniques, we also report the results obtained when the original image is encoded without using any of the proposed techniques. Table 1 reports the results obtained for two Landsat images encoded at 0.5, 1.0 and 2.0 bps, and the results obtained for three medical images encoded at 0.001, 0.01 and 0.1 bps. Bit-rates are selected approximately to decode images with a PSNR in the range of 30 to 50 dB. Figure 5 (a) depicts the average coding performance achieved by the five techniques when encoding the images of the ISO 12640-1 corpus. To ease the visual interpretation, the graphic plots the PSNR difference achieved between the coding using the proposed techniques and the coding without using any of them.

Table 1. Coding performance evaluation of the proposed techniques for some images of the Landsat corpus (top) and Medical corpus (bottom). Results report the PSNR in dB. The percentage of data region is shown next to the image name.

Landsat corpus     0.5 bps                    1.0 bps                    2.0 bps
Technique    Garrotxa-a 32%  Barcelona2 82%   Garrotxa-a 32%  Barcelona2 82%   Garrotxa-a 32%  Barcelona2 82%
ADR          29.31           30.72            38.62           34.63            54.31           41.59
Phagocyte    29.48           30.82            38.71           34.67            54.31           41.54
MaxShift     28.80           30.63            37.92           34.46            52.55           41.25
DWT-SA       29.61           31.05            39.02           35.08            56.09           42.11
BPE-SA       29.69           31.14            39.18           35.22            56.91           42.31
Original     21.64           30.09            25.42           33.60            31.26           39.26

Medical corpus     0.001 bps                          0.01 bps                           0.1 bps
Technique    CT 57%  CR1 57%  Mammo1 35%        CT 57%  CR1 57%  Mammo1 35%        CT 57%  CR1 57%  Mammo1 35%
ADR          28.33   30.63    38.95             34.25   39.32    45.34             45.84   42.35    48.09
Phagocyte    28.85   30.04    39.91             34.48   39.72    46.96             46.63   42.42    50.85
MaxShift     27.43   31.45    38.95             33.76   41.14    48.38             48.99   42.64    51.11
DWT-SA       27.97   31.28    41.46             34.69   41.18    48.41             48.98   42.66    51.13
BPE-SA       28.57   33.00    41.58             35.34   41.21    48.41             49.56   42.66    51.15
Original     27.43   31.11    38.95             33.76   41.09    48.38             48.99   42.50    51.11
Two important points are worth noting in these results. The first one is that the more sophisticated the coding technique is, the better the coding performance is. This is noticeable, for instance, in Figure 5 (a), where both the DWT-SA and BPE-SA techniques obtain the best results among the proposed techniques, clearly outperforming the results obtained by ADR and Phagocyte. However, between DWT-SA and BPE-SA, and also between ADR and Phagocyte, the differences are not very meaningful. The second point we want to stress is that the higher the bit-rate is, the greater the efficiency achieved by the proposed techniques is. At 2 bps, for instance, the
Fig. 5. Comparison of rate distortion for the ISO 12640-1 corpus. (a) JPEG2000 techniques addressed to images containing no-data regions. (b) JPEG2000 techniques addressed to images containing no-data regions and BISK.
difference between BPE-SA and ADR is, in some images, more than 2.5 dB. Conversely, at very low bit-rates, all techniques obtain similar coding results. On the other hand, the ROI coding technique MaxShift is not adequate for the coding of no-data regions, since in most images it penalizes the coding performance because some no-data coefficients become data to allow a correct recovery. It is also worth noticing the low coding performance achieved when none of the proposed techniques is applied. Table 2 reports the results obtained by the five techniques for all the images of the corpora. Results report the bit-rate achieved when the complete code-stream is generated with the lossless mode (in bps) and the achieved compression gain¹, which relates the lengths of the code-streams from different compressions. For images that have large no-data regions, it can be as much as 130. This means a saving of 73% in the final code-stream. These results suggest that: 1) it is important to apply techniques for the coding of images containing no-data regions in order to enhance the coding efficiency, 2) shape adaptive techniques obtain the best results, 3) at very low bit-rates, all techniques obtain similar results, and 4) the higher the percentage of data region is, the larger the compression gain is. Our last evaluation of the proposed techniques is carried out against the coding system BISK [6], which is entirely devised to support shape processing, and thus is ideal for the coding of no-data regions. We evaluate the average coding performance achieved by BISK and by the proposed techniques for the images of the ISO 12640-1 corpus. Figure 5 (b) depicts the results obtained, taking BISK as the reference and computing the PSNR difference between it and three of the proposed techniques. Results suggest that BPE-SA and DWT-SA achieve slightly better results than BISK at low bit-rates, approximately in the range
(1) The compression gain is computed as 100 ln(reference / compressed), where "reference" denotes the size of the code-stream encoded without applying the proposed techniques and "compressed" denotes the size of the code-stream encoded with our techniques.
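As a concrete illustration of the footnote's formula, the short sketch below (plain Python; the code-stream sizes are taken from Table 2 and the function names are ours, not part of the paper) reproduces one of the reported gains and relates a gain of about 130 to the quoted 73% code-stream saving.

```python
import math

def compression_gain(reference_bps, compressed_bps):
    """Compression gain as defined in the footnote: 100 * ln(reference / compressed)."""
    return 100.0 * math.log(reference_bps / compressed_bps)

def codestream_saving(gain):
    """Fraction of the reference code-stream saved for a given compression gain."""
    return 1.0 - math.exp(-gain / 100.0)

# "Portrait" row of Table 2: 4.43 bps without the techniques, 3.27 bps with BPE-SA.
print(round(compression_gain(4.43, 3.27), 2))   # ~30.36, as reported in Table 2
# A gain of about 130 corresponds to roughly a 73% shorter code-stream.
print(round(codestream_saving(130.0), 2))       # ~0.73
```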
Table 2. Bit-rate (in bps) achieved when the whole code-stream is generated in lossless mode, applying the five no-data-region coding techniques to four images from each corpus. The compression gain is given in brackets; the percentage of data region follows the image name.

Image (data %)             ADR            Phagocyte      MaxShift       DWT-SA         BPE-SA         Original
Portrait (68%)             3.32 (28.84)   3.32 (28.84)   3.41 (26.17)   3.29 (29.75)   3.27 (30.36)   4.43
Fruit Basket (54%)         2.69 (47.60)   2.69 (47.60)   2.75 (45.40)   2.65 (49.10)   2.63 (49.86)   4.33
Wine and tableware (44%)   2.44 (63.84)   2.43 (64.25)   2.48 (62.21)   2.37 (66.75)   2.34 (68.02)   4.62
Musicians-a (30%)          1.58 (126.71)  1.58 (126.71)  1.63 (123.60)  1.55 (128.63)  1.54 (129.28)  5.61
Garrotxa-a (32%)           2.09 (115.44)  2.09 (115.44)  2.14 (113.08)  2.05 (117.38)  2.04 (117.87)  6.63
Garrotxa-b (68%)           4.68 (34.83)   4.68 (34.83)   4.71 (34.19)   4.63 (35.90)   4.62 (36.12)   6.63
Barcelona1 (71%)           4.25 (28.30)   4.29 (27.36)   4.35 (25.97)   4.10 (31.89)   4.03 (33.61)   5.64
Barcelona2 (82%)           4.40 (17.09)   4.42 (16.64)   4.47 (15.51)   4.28 (19.85)   4.22 (21.27)   5.22
Mammo1 (35%)               0.72 (-1.40)   0.71 (0.00)    0.76 (-6.80)   0.69 (2.86)    0.69 (2.86)    0.71
Mammo2 (35%)               0.75 (29.76)   0.73 (32.47)   0.78 (25.84)   0.71 (35.24)   0.71 (35.24)   1.01
CR1 (57%)                  1.93 (8.92)    1.93 (8.92)    1.99 (5.86)    1.89 (11.01)   1.89 (11.01)   2.11
CR2 (59%)                  1.80 (20.52)   1.80 (20.52)   1.85 (17.78)   1.76 (22.77)   1.76 (22.77)   2.21
of 0 to 1 bps. At medium and high bit-rates, BISK outperforms both of them. On the other hand, Phagocyte and the remaining techniques that keep JPEG2000 compliance achieve, in general terms, worse coding performance than BISK.

3.2 Computational Load
The proposed techniques may penalize the computational load of the JPEG2000 encoder. To assess how the proposed techniques affect the complexity of the encoder, Table 3 reports the computational load of the stages modified by the proposed techniques, together with the total computational load of the coder. The results reported in this table are the times, in seconds, obtained when generating the complete code-stream for two images of the corpora. Tests have been carried out on a Pentium IV (3 GHz) with the JVM 1.5 and GNU/Linux.
Table 3. Computational load of the different coding stages when applying the proposed techniques. Results are reported in seconds. Columns report the computational load of: sample - stage needed for ADR and Phagocyte; max - MaxShift stage; DWT - DWT stage; tier-1 - BPE and MQ coding; Total - total time spent by the encoding process. The percentage of data region is shown to the right of the image name.

                 Portrait (68%)                            Garrotxa-a (32%)
Technique   sample   max     DWT     tier-1   Total    sample   max     DWT      tier-1   Total
ADR         1.630    -       0.849   7.096    11.715   13.485   -       10.421   32.061   62.361
Phagocyte   1.855    -       0.860   7.084    11.840   16.047   -       10.342   31.742   67.447
MaxShift    -        1.851   0.818   17.430   14.451   -        6.416   10.359   95.832   63.868
DWT-SA      -        -       2.384   6.988    11.667   -        -       18.995   31.170   59.579
BPE-SA      -        -       2.389   6.826    11.668   -        -       19.056   28.142   60.723
Original    -        -       0.867   8.226    11.223   -        -       10.478   65.775   88.075
For images containing large no-data regions, the application of the proposed techniques greatly reduces the computational load. This is due to the reduction in the number of bit-planes to be coded when the proposed techniques are applied. However, when the no-data region is small, the proposed techniques slightly increase the computational load of the encoding process, although the penalization is, in the worst case, about 5%.
4 Conclusions
This work introduces five techniques for the coding of images containing no-data regions within JPEG2000. The ADR and Phagocyte techniques are simple modifications carried out on the image samples, aimed at smoothing the sharp boundaries between data and no-data regions. The ROI coding method MaxShift is able to account for no-data regions by defining the data region as the ROI and the no-data region as the background. The shape-adaptive techniques DWT-SA and BPE-SA modify two stages of the JPEG2000 coding system, taking into account the lack of importance of no-data regions. Experimental results suggest that the best techniques to code images containing no-data regions are the shape-adaptive ones, although they produce non-compliant JPEG2000 code-streams. Phagocyte is the technique that, while keeping JPEG2000 compliance, obtains the best results in terms of coding performance. The ROI coding method MaxShift is clearly not adequate for the coding of images containing no-data regions. The results also suggest that the coding performance achieved by BISK and by the shape-adaptive techniques is equivalent at low bit-rates. Although the degree of improvement varies from image to image and depends on the size of the no-data region with respect to the complete area of the image, the proposed techniques enhance the coding efficiency of JPEG2000 for images containing no-data regions.
Acknowledgments. This work has been partially supported by the Spanish Government (MEC), by FEDER, and by the Catalan Government, under Grants TSI2006-14005-C02-01 and SGR2005-00319.
References
1. Zabala, A., Pons, X., Diaz-Delgado, R., Garcia-Vilchez, F., Auli-Llinas, F., Serra-Sagrista, J.: Effects of JPEG and JPEG2000 Lossy Compression on Remote Sensing Image Classification for Mapping Crops and Forest areas. In: International Conference on Geoscience and Remote Sensing Symposium, pp. 790-793 (2006)
2. ITU-T, Study Group-8 Contribution: Mixed Raster Content (MRC), Recommendation T.44 (1998)
3. Cagnazzo, M., Poggi, G., Verdoliva, L., Zinicola, A.: Region Oriented compression of multispectral images by shape-adaptive wavelet transform and SPIHT. In: IEEE International Conference on Image Processing, vol. 4, pp. 2459-2462. IEEE, Los Alamitos (2004)
4. Hua, J., Liu, Z., Xiong, Z., Wu, Q., Castleman, K.R.: Microarray BASICA: Background Adjustment, Segmentation, Image Compression and Analysis of Microarray Images. EURASIP Journal on Applied Signal Processing 2004(1), 92-107 (2004)
5. Cagnazzo, M., Parrilli, S., Poggi, G., Verdoliva, L.: Costs and Advantages of Object-Based Image Coding with Shape-Adaptive Wavelet Transform. EURASIP Journal on Image and Video Processing, Article ID 78323, 13 (2007)
6. Fowler, J.: Shape-Adaptive Coding Using Binary Set Splitting with K-D Trees. In: IEEE International Conference on Image Processing, vol. 2, pp. 1301-1304. IEEE, Los Alamitos (2004)
7. ISO/IEC 15444-1: JPEG2000 image coding system. Part 1: Core coding system (2000)
8. Taubman, D.: High performance scalable image compression with EBCOT. IEEE Transactions on Image Processing 7(7), 1158-1170 (2000)
9. Pons, X.: MiraMon, Geographic Information Systems and Remote Sensing software. Center for Ecological Research and Forestry Applications (CREAF), Bellaterra (1994)
10. CREAF: Center for Ecological Research and Forestry Applications (2007), Available: http://www.creaf.uab.es
11. ISO/IEC 15444-2: JPEG2000 image coding system. Part 2: Extensions (2004)
12. Taubman, D., Marcellin, M.: JPEG2000: Image Compression Fundamentals, Standards, and Practice. Kluwer International Series in Engineering and Computer Science 642 (2002)
13. Wang, Z., Banerjee, S., Brian, L., et al.: Generalized Bitplane-by-Bitplane shift method for JPEG2000 ROI Coding. In: Proceedings of IEEE International Conference on Image Processing, vol. 5637-III, pp. 81-84. IEEE Computer Society Press, Los Alamitos (2002)
14. Liu, L., Fan, G.: A new JPEG2000 Region-of-Interest image coding method: Partial significant bitplanes shift. IEEE Signal Processing Letters 10(2), 35-38 (2003)
15. Liang, Y., Liu, W.: A new JPEG2000 Region-of-interest coding method: generalized partial bitplanes shift. In: Proceedings of the SPIE, vol. 5637, pp. 365-371 (2005)
16. Chen, C.C., Chen, O.T.C.: Region of Interest Determined by Perceptual-Quality and Rate-Distortion Optimization in JPEG2000. In: IEEE International Symposium on Circuits and Systems, vol. 3, pp. 23-26. IEEE, Los Alamitos (2004)
17. Li, S., Li, W.: Shape Adaptive Discrete Wavelet Transforms for Arbitrarily Shaped Visual Object Coding. IEEE Transactions on Circuits and Systems for Video Technology 10(5), 725-743 (2000)
18. Group on Interactive Coding of Images, Department of Information and Communications Engineering, Universitat Autonoma de Barcelona: BOI software (2006), Available: http://www.gici.uab.es/BOI
19. Aghito, S.M., Forchhammer, S.: Context-Based Coding of Bilevel Images Enhanced by Digital Straight Line Analysis. IEEE Transactions on Image Processing 15, 2120-2130 (2006)
20. Akimov, A., Kolesnikov, A., Fränti, P.: Lossless compression of map contours by context tree modeling of chain codes. Pattern Recognition 40(3), 944-952 (2007)
21. ISO/IEC 12640-1: Graphic technology - Prepress digital data exchange - CMYK standard colour image data (CMYK/SCID) (1997)
22. U.S. Geological Survey: Landsat project website (2007), Available: http://landsat.usgs.gov
23. Parc Tauli Corporation: UDIAT Diagnosis Center website (2007), Available: http://www.cspt.es/webcspt/udiat
A New Optimum-Word-Length-Assignment (OWLA) Multiplierless Integer DCT for Lossless/Lossy Image Coding and Its Performance Evaluation

Somchart Chokchaitam¹ and Masahiro Iwahashi²

¹ Department of Electrical Engineering, Thammasat University, Phatoom Thani, Thailand
² Department of Electrical Engineering, Nagaoka University of Technology, Niigata, Japan
Abstract. Recently, we proposed a multiplierless 1D Int-DCT, improved from our previously proposed Int-DCT by approximating floating multiplications with bit-shift and addition operations. The multiplierless 1D Int-DCT operates well in both lossless and lossy coding. However, that design did not address how to assign the shortest possible word lengths to the floating-multiplier approximations in order to reduce hardware complexity. In this paper, we propose a new Optimum-Word-Length-Assignment (OWLA) multiplierless Int-DCT. Besides its low hardware complexity, the new OWLA multiplierless 1D Int-DCT achieves high coding performance in both lossless and lossy coding. A lossless/lossy coding criterion is applied to evaluate the coding performance of the proposed Int-DCT in comparison with other Int-DCTs.
1 Introduction

The Discrete Cosine Transform (DCT) [1] is a well-known transform in many coding standards, such as lossy JPEG [2] for still image coding and MPEG [3] for moving image coding. The DCT-based coding system provides a high compression ratio, but it is limited to lossy coding only. For compatibility with the conventional DCT-based algorithms, the Integer DCT (Int-DCT) [4-10] has been developed from the conventional lossy DCT into a lossless transform by employing lifting structures [11] and "rounding" operations [12]. So far, many kinds of Int-DCT have been proposed. For example, Fukuma's group proposed an 8-point Int-DCT [4-6] composed of the 4-point integer Hadamard transform (4-IHT) and the 2-point integer rotation transform (2-IRT). Next, Y-J. Chen's group proposed a low-cost 8-point Int-DCT [7] composed of the 8-point Walsh-Hadamard Transform and integer lifting. G. Charith's group proposed an N-point I2I-DCT-II [8] by applying recursive methods and lifting techniques. Recently, we proposed a 1D multiplierless Int-DCT [9-10], improved from our first proposed 1D Int-DCT by removing its multipliers, i.e., by approximating the floating multiplications with bit-shift and addition operations. The proposed multiplierless 1D Int-DCT operates well in both lossless coding, for a high-quality decoded image, and lossy coding, for compatibility with the conventional DCT-based coding system. However, our multiplierless 1D Int-DCT did not address how to assign the shortest possible word lengths to the floating-multiplier approximations in order to reduce hardware complexity.
In this paper, we propose a new Optimum-Word-Length-Assignment (OWLA) multiplierless Int-DCT. The new OWLA multiplierless 1D Int-DCT not only has low hardware complexity but also still operates well in both lossless and lossy coding. This paper is organized as follows: in Section 2, we review three kinds of existing 1D Int-DCTs. The proposed OWLA multiplierless 1D Int-DCT is introduced in Section 3. The lossless/lossy criteria for coding-performance evaluation are reviewed in Section 4. Simulation results are presented in Section 5, and conclusions are given in Section 6.
2 The Existing 1D Int-DCTs [4-6]

2.1 Fukuma's 1D Int-DCT [4-6]

Fukuma's 1D Int-DCT [4-6] is constructed with the 4-point integer Hadamard transform (4-IHT) and the 2-point integer rotation transform (2-IRT), as illustrated in Fig. 1(a). The structures of the 4-IHT and the 2-IRT are illustrated in Fig. 1(b) and 1(c), respectively. The multiplication by 0.5 in the 4-IHT can be realised by shifting one bit to the right, and the multiplier vectors of the 2-IRTi in Fig. 1(c) are given by
M_A = M_B = \left[\; 1-\sqrt{2} \quad \tfrac{1}{\sqrt{2}} \quad 1-\sqrt{2} \;\right], \qquad
M_C = \left[\; \tfrac{\sin(\pi/8)-1}{\cos(\pi/8)} \quad \cos(\pi/8) \quad \tfrac{\cos(3\pi/8)-1}{\cos(\pi/8)} \;\right],
M_D = \left[\; \tfrac{1-\cos(3\pi/16)}{\sin(3\pi/16)} \quad -\sin(3\pi/16) \quad \tfrac{1-\cos(3\pi/16)}{\sin(3\pi/16)} \;\right], \qquad
M_E = \left[\; \tfrac{\cos(\pi/16)-1}{\sin(\pi/16)} \quad \sin(\pi/16) \quad \tfrac{\cos(\pi/16)-1}{\sin(\pi/16)} \;\right]    (1)
Fig. 1. Fukuma's 1D Int-DCT and its components 4-IHT and 2-IRTi: (a) the 8-point Integer DCT (Int-DCT), (b) the 4-IHT, (c) the 2-IRTi. "R" (circled) denotes the rounding operation. Parameter Mi = [mi1 mi2 mi3] is the multiplier vector.
2.2 Y-J. Chen's 1D Int-DCT [7]

Y-J. Chen's group proposed a low-cost 8-point Int-DCT [7] composed of the 8-point Walsh-Hadamard Transform and integer lifting, as illustrated in Fig. 2. The Y-J. Chen Int-DCT is designed with the objective of reducing the complexity of the Int-DCT. It is therefore easy to implement, because it does not require floating multiplier coefficients; it requires only simple integer arithmetic such as bit shifts and additions. The matrices H_W and B in Fig. 2 are
the 8-point Walsh-Hadamard transform matrix, whose entries are all ±1, and an 8×8 permutation matrix that reorders the Walsh-Hadamard outputs, respectively.    (2)
Fig. 2. The Y-J. Chen’s 1D Int-DCT
2.3 Charith's N-point I2I-DCT-II [8]

G. Charith's group proposed an N-point I2I-DCT-II [8] by applying recursive methods and lifting techniques, as illustrated in Fig. 3. The N-point I2I-DCT-II is very flexible because it can perform not only the 8-point Int-DCT but also
Fig. 3. The Charith’s N-point I2I-DCT-II
the 2^M-point Int-DCT, where M is an integer value. However, we consider only the 1D 8-point I2I-DCT-II, which is compatible with the conventional 8-point DCT.
3 The Proposed OWLA Multiplierless 1D Int-DCT

3.1 Our First Proposed 1D Int-DCT [9]

Our first proposed Int-DCT [9] was designed by applying a simple concept of rounding operations and lifting structures [11]. It requires only 8 rounding operations, as illustrated in Fig. 4. The filter coefficients operating from subband i to subband j (F_ij) are as follows:

\begin{bmatrix} F_{00} & \cdots & F_{07} \\ \vdots & & \vdots \\ F_{70} & \cdots & F_{77} \end{bmatrix} =
\begin{bmatrix}
0      & 0.2071  & -0.2071 & -1      & -0.5    & 0       & 0      & 0.5     \\
0.0733 & 0       & -0.3536 & -1.2803 & 0.5     & 0       & 0      & 0       \\
0.4142 & 0.8284  & 0       & -1.9142 & 0.4142  & 0       & 0      & 0       \\
0.5858 & 0.1716  & 0.4142  & 0       & 0.5858  & 0       & 0      & 0       \\
0      & 0       & 0       & 0       & 0.1989  & -0.7071 & 0.7351 & -0.1989 \\
0      & -0.0994 & -0.5    & 0       & 0.2832  & 0       & 0.1989 & -0.4239 \\
0      & -0.5    & 0       & 0       & -0.3536 & -0.1913 & 0      & -0.3536 \\
0      & 0       & 0       & 0       & 0.1913  & 0.8155  & 0.5665 & 0
\end{bmatrix}    (3)
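To complement Fig. 4, the sketch below (Python; the function names and the input vector are ours and purely illustrative) shows schematically why a lifting step followed by rounding yields an integer-to-integer, exactly invertible mapping. The coefficients used are taken from one column of eq. (3); the actual ordering of the lifting steps T0-T9 in Fig. 4 is not reproduced here.

```python
# One lifting step: subband j is updated from the other subbands through the
# coefficients F[i][j] of eq. (3), with the result rounded to an integer.
def lift_forward(x, j, col):
    y = list(x)
    y[j] = x[j] + int(round(sum(col[i] * x[i] for i in range(8) if i != j)))
    return y

def lift_inverse(y, j, col):
    # The untouched subbands are identical to the input, so the rounded sum is
    # recomputed exactly and the step is perfectly (losslessly) undone.
    x = list(y)
    x[j] = y[j] - int(round(sum(col[i] * y[i] for i in range(8) if i != j)))
    return x

# Coefficients feeding subband 0 (column F_i0 of eq. (3)) and a made-up integer input.
col0 = [0, 0.0733, 0.4142, 0.5858, 0, 0, 0, 0]
x = [12, -7, 3, 25, -1, 8, 0, 4]
y = lift_forward(x, 0, col0)
assert lift_inverse(y, 0, col0) == x   # exact inversion despite the rounding
```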
3.2 Our Multiplierless Int-DCT [10]

Our multiplierless Int-DCT [10] improves on our floating-point 1D Int-DCT by approximating the floating multiplications in eq. (3) with bit-shift and addition operations. In our previous paper, the same 8-bit word length was applied to all filter coefficients. To achieve that goal, the filter coefficients operating from subband i to subband j (F_ij) are approximated with an 8-bit word-length assignment as follows:

\begin{bmatrix} F_{00} & \cdots & F_{07} \\ \vdots & & \vdots \\ F_{70} & \cdots & F_{77} \end{bmatrix} = \frac{1}{256} \cdot
\begin{bmatrix}
0   & 53   & -53  & -256 & -128 & 0    & 0   & 128  \\
19  & 0    & -91  & -327 & 128  & 0    & 0   & 0    \\
106 & 212  & 0    & -490 & 106  & 0    & 0   & 0    \\
149 & 43   & 106  & 0    & 149  & 0    & 0   & 0    \\
0   & 0    & 0    & 0    & 0    & -181 & 188 & -51  \\
0   & -25  & -128 & 0    & 73   & 0    & 51  & -109 \\
0   & -128 & 0    & 0    & -91  & -49  & 0   & -91  \\
0   & 0    & 0    & 0    & 49   & 209  & 145 & 0
\end{bmatrix}    (4)
The filter coefficients in eq. (4) can be implemented using only shift and addition operations. For example, F01 can be implemented as follows:

F_{01} = 53/256 = 32/256 + 16/256 + 4/256 + 1/256 = 1/8 + 1/16 + 1/64 + 1/256 = 2^{-3} + 2^{-4} + 2^{-6} + 2^{-8}    (5)
From eq. (5), we can replace the multiplication by F01 with a summation of the results of 3-bit, 4-bit, 6-bit and 8-bit shift operations, as illustrated in Fig. 5. In total, 205 bit-shift operations and 116 addition operations are required to perform the 1D multiplierless Int-DCT.
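A minimal sketch (Python, illustrative only) of the decomposition in eq. (5): a coefficient k/2^W is split into powers of two, and the multiplication is replaced by shifts and additions. With truncating integer shifts the result differs slightly from the floating-point product; that difference is precisely the word-length error analysed in Section 3.3.

```python
def shift_add_multiply(x, k, wordlength=8):
    """Multiply the integer x by k / 2**wordlength using only shifts and additions.

    k is assumed positive here; a negative coefficient would subtract instead.
    """
    acc = 0
    for bit in range(wordlength):
        if (k >> bit) & 1:                  # bit set -> contribution x * 2**(bit - wordlength)
            acc += x >> (wordlength - bit)  # integer shift stands in for the power of two
    return acc

# F01 = 53/256 = 2**-3 + 2**-4 + 2**-6 + 2**-8 (eq. (5))
x = 1000
print(shift_add_multiply(x, 53))   # 205 (shift-and-add with truncation)
print(x * 53 / 256)                # 207.03125 (exact floating-point product)
```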
Fig. 4. Our first proposed 1D Int-DCT: (a) our first proposed Int-DCT; (b) signal processing of the lifting transformation.
Fig. 5. Approximation of a floating multiplication by bit-shift and addition operations
3.3 Our Proposed Optimum-Word-Length-Assignment Multiplierless Int-DCT

Recently, we proposed the "SNR sensitivity" [6], defined as the effect of the finite-word-length expression on the quality of the decoded image. In this paper, we apply the
SNR sensitivity to design a new optimum-word-length-assignment multiplierless Int-DCT, as follows.

3.3.1 The Optimum Word Length Assignment Method

In this paper, we optimize the word-length assignment of the 26 floating multiplier coefficients of our Int-DCT, as listed in Table 1. First, the 26 floating multiplier coefficients F_ji are expressed as h_k (k = 0, 1, ..., 25) by

h_k = (-1)^{B_0} \sum_{j=1}^{\infty} B_j 2^{-j},    k = 0, 1, ..., 25    (6)
where B_j (j = 0, 1, ...) is 0 or 1. Under the finite-word-length expression used in this paper, h_k is truncated to a W_k-bit binary value h_k'. Namely,

h_k' = (-1)^{B_0} \sum_{j=1}^{W_k} B_j' 2^{-j},    k = 0, 1, ..., 25    (7)
The value Δh_k is defined as the difference between h_k and the binary value h_k':

Δh_k = h_k - h_k'    (8)
Then, we calculate the error generated in the decoded image by the finite word-length allocation (N_TF) [14] as

N_{TF} = \sum_{k=0}^{25} \left( S_{Hk} \cdot Δh_k \right)    (9)
where S_Hk, called the "SNR sensitivity", is defined as the effect of the finite-word-length expression on the quality of the decoded image. Next, we calculate the "relative SNR sensitivity" SR_k by

SR_k = S_{Hk} / \left( \prod_{p=0}^{25} S_{Hp} \right)^{1/26},    k = 0, 1, ..., 25    (10)
The optimum word-length assignment is then given by the relative SNR sensitivity SR_k as follows:

ΔW_k = W_k - \bar{W} = \log_2 \left( S_{Hk} / \bar{S}_H \right) = \log_2 SR_k,    k = 0, 1, ..., 25    (11)
3.3.2 Our Proposed Optimum-Word-Length-Assignment Multiplierless Int-DCT

In our previous paper [14], we found that the optimum word-length assignment depends on the input signal. To find the optimum word-length assignment for our existing multiplierless Int-DCT, the AR(1) model is applied as a representative input for image data. We theoretically calculate the optimum word-length assignment by applying the AR(1) model with correlation coefficients ρ = 0.95, 0.8, 0.65 and 0.5 as input signals, whose frequency spectrum is

X(e^{jω}) = \frac{1 - ρ^2}{1 + ρ^2 - 2ρ \cos ω}    (12)
Table 1 lists the optimum word-length assignment (ΔW_k) of our existing multiplierless Int-DCT based on the AR(1) model with various correlation coefficients. We confirm that the optimum word-length assignment depends on the input data, and we use the AR(1) model with correlation coefficient ρ = 0.8 as a representative of image data. Examples of the number of assigned bits when ρ = 0.8 are shown in Table 1. Notice that at least 1 bit must be assigned to represent each floating multiplier coefficient.

Table 1. The optimum-word-length-assignment results based on the AR(1) model
Fji   hk    ΔWk (ρ=0.95)  ΔWk (ρ=0.8)  ΔWk (ρ=0.65)  ΔWk (ρ=0.5)  Bits (4-bit avg., ρ=0.8)  Bits (8-bit avg., ρ=0.8)
F51   h0     3.96          2.31         1.64          1.3          6                         10
F54   h1    -1.33         -1.05        -1.26          1.74         3                          7
F56   h2     2.76          0.29         1.61         -3.74         4                          8
F57   h3     1.47          1.88         2.03          2.24         6                         10
F64   h4    -1.3          -1.03        -1.23         -1.71         3                          7
F65   h5    -1.65         -1.07         0.43          0.27         3                          7
F67   h6     1.5           1.91         2.06          2.27         6                         10
F01   h7     4.55          2.88         2.11          1.61         7                         11
F02   h8     3.95          2.18         1.21          0.42         6                         10
F30   h9    -2.12          0.33         1.25          1.85         4                          8
F31   h10    5.27          3.6          2.82          2.33         7                         11
F32   h11    4.67          2.89         1.93          1.14         7                         11
F34   h12   -0.61         -0.34        -0.54         -1.02         3                          7
F20   h13   -3.27         -0.81         0.1           0.7          3                          7
F21   h14    4.12          2.45         1.68          1.18         6                         10
F23   h15    3.75          2.07         1.36          1.03         6                         10
F24   h16   -1.76         -1.49        -1.69         -2.17         2                          6
F10   h17   -3.5          -1.04        -0.12          0.47         3                          7
F12   h18   -3.58         -3.77         2.27         -1.3          1                          5
F13   h19    3.52          1.84         1.13          0.8          6                         10
F45   h20   -2.17         -1.59        -0.95         -0.24         2                          6
F46   h21   -4.15         -3.35        -2.63         -1.79         1                          5
F47   h22    0.98          1.39         1.54          1.75         5                          9
F74   h23   -8.76         -5.58        -4.63         -3.69         1                          5
F75   h24   -2.14         -1.56        -0.92         -0.22         2                          6
F76   h25   -4.12         -0.33        -2.6          -1.76         1                          5
4 Lossless/Lossy Coding Criterion [13]

The lossless/lossy coding criterion [13] consists of three parameters: the "bit-rate lossless coding criterion" as the lossless coding criterion, and the "quantization-lossy coding gain" and the "rounding error" as the lossy coding criteria.
4.1 Lossless Coding Criterion

The bit-rate lossless coding criterion (C_LSL) is defined as a ratio between the total bit rate of PCM (B_PCM) and that of lossless coding (B_LSL) by

C_{LSL} = 20 \log_{10} \frac{2^{B_{PCM}}}{2^{B_{LSL}}}    (13)
The bit-rate lossless coding criterion represents the total bit rate of the Int-DCT compared with that of PCM in lossless coding.

4.2 Lossy Coding Criterion

The conventional lossy coding gain (C_LSY) is generally defined by

C_{LSY} = 10 \log_{10} \frac{σ^2_{PCM}}{σ^2_{LSY}}    (14)
where σ²_PCM denotes the variance of the total error in PCM coding and σ²_LSY denotes the variance of the total error in lossy coding, calculated from

σ^2_{LSY} = σ^2_{N_Q} + σ^2_{N_R}    (15)
where σ²_{N_Q} and σ²_{N_R} denote the variances of the errors generated by quantization and by the rounding operations, respectively. The variance of the error generated by the rounding operations is approximately constant, whereas that generated by quantization depends on the quantization step size. Therefore, the lossy coding criterion is divided into two criteria: the rounding error and the quantization-lossy coding gain. The quantization-lossy coding gain is defined from the conventional lossy coding gain by neglecting the rounding error:

C_{LSY,Q} = 10 \log_{10} \frac{σ^2_{PCM}}{σ^2_{N_Q}}    (16)
The rounding error describes the lossy coding performance at high bit rates, whereas the quantization-lossy coding gain describes the lossy coding performance at low bit rates. Moreover, the conventional lossy coding gain can be determined from the rounding error and the quantization-lossy coding gain as

C_{LSY} = 10 \log_{10} \frac{σ^2_{PCM}}{σ^2_{PCM} \cdot 10^{-C_{LSY,Q}/10} + σ^2_{N_R}}    (17)
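To make eqs. (13)-(17) concrete, the following sketch (Python; the numbers are placeholders, not measurements from the paper) computes the three criteria from a PCM bit rate, a lossless bit rate and the two noise variances.

```python
import math

def lossless_criterion(b_pcm, b_lsl):
    """Eq. (13): C_LSL = 20 log10(2**B_PCM / 2**B_LSL), i.e. about 6.02 (B_PCM - B_LSL) dB."""
    return 20.0 * math.log10(2.0 ** b_pcm / 2.0 ** b_lsl)

def quantization_lossy_gain(var_pcm, var_nq):
    """Eq. (16): lossy coding gain with the rounding error neglected."""
    return 10.0 * math.log10(var_pcm / var_nq)

def lossy_gain(var_pcm, c_lsy_q, var_nr):
    """Eq. (17): conventional lossy gain combining quantization and rounding errors."""
    return 10.0 * math.log10(var_pcm / (var_pcm * 10.0 ** (-c_lsy_q / 10.0) + var_nr))

c_lsl = lossless_criterion(b_pcm=8.0, b_lsl=6.5)             # ~9.03 dB
c_q = quantization_lossy_gain(var_pcm=100.0, var_nq=10.0)    # 10 dB
print(c_lsl, c_q, lossy_gain(100.0, c_q, var_nr=1.0 / 12.0))
```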
4.3 Assumption for Calculating the Variance of the Rounding Error

The rounding operation is a non-linear operation that transforms a signal from a floating value to the nearest integer value. To approximate the variance of the rounding errors, we assume that 1) the correlations between each of the errors and the signals are zero (statistical independence), and 2) the power spectrum of the rounding error is flat. From these
assumptions, we can find an equivalent expression of the rounding operation, as shown in Fig. 6, by replacing the non-linear operation with an additive noise [12]:

S_{Ro}(z) = S_{Ri}(z) + N_R(z)    (18)
where S_Ri, S_Ro and N_R denote the input signal, the output signal and the additive noise of the rounding operation, respectively. Since the correlations between each of the errors and the signals are zero (based on the previous assumptions), the variance of the output signal of the rounding operation is

σ^2_{S_{Ro}} = σ^2_{S_{Ri}} + σ^2_{N_R}    (19)
where σ²_{S_Ri}, σ²_{S_Ro} and σ²_{N_R} denote the variances of the input signal, the output signal and the additive noise of the rounding operation, respectively. If we further assume that the power spectrum of the additive noise is approximately flat, the variance of the additive noise of the rounding operation is

σ^2_{N_R} = \int_{-0.5}^{0.5} x^2 \, dx = \frac{1}{12}    (20)
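Under the independence and flat-spectrum assumptions above, the rounding error behaves like a uniform variable on [-0.5, 0.5]; a quick Monte-Carlo check (Python, with an arbitrary synthetic signal) of eq. (20):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.uniform(-1000.0, 1000.0, size=1_000_000)  # arbitrary floating-valued samples
error = np.round(signal) - signal                      # rounding error of each sample
print(error.var())                                     # close to 1/12 ~ 0.0833
```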
Fig. 6. Rounding operation and its equivalent expression
5 Simulation Results

In this paper, we compare five kinds of 1D Int-DCT with respect to the lossless/lossy coding criterion. Several standard images are used to evaluate the existing Int-DCTs. Notice that all floating values are truncated to 8 bits (on average) for the simulation results in this section.

5.1 Lossless Coding Criterion

In this paper, "the Y-DCT", "the F-DCT", "the C-DCT", "the M-DCT" and "the P-DCT" denote the Y-J. Chen Int-DCT-based coding system, the Fukuma Int-DCT-based coding system, the Charith N-point I2I-DCT-II-based coding system, our multiplierless Int-DCT-based coding system, and our proposed OWLA Int-DCT-based coding system, respectively. For example, "the Y-DCT" denotes the coding system in which we apply the Y-J. Chen Int-DCT as the analysis and synthesis filters. From the results in Table 2, the bit-rate lossless coding criteria of the existing Int-DCTs are almost the same, except that the Y-J. Chen 1D Int-DCT is the worst. The coding performance of the Y-J. Chen 1D Int-DCT is worse than that of PCM coding, so its bit-rate lossless
Table 2. Bit-rate-lossless coding criterion of the existing Int-DCTs
Image name   Couple   Aerial   Girl    Chest-X Ray   Moon    Barbara   Average
P-DCT         9.16     7.06     9.62    -5.55         7.97    13.53      6.97
M-DCT         9.14     7.05     9.58    -5.54         7.96    13.48      6.95
F-DCT         9.05     7.09     9.59    -5.50         7.97    13.48      6.95
Y-DCT        -6.68    -7.69    -6.32   -19.89        -8.09    -1.69     -8.39
C-DCT         8.87     7.06     9.50    -5.51         7.92    13.30      6.85
coding criterion is negative. This is because its output signal is scaled by 8^{1/2}. Based on the lossless coding criterion, the proposed OWLA Int-DCT is the best on average.
(21)
where Δb denotes quantization step size in bth subband, G b is calculated from Gb =
∑∑g k2
2 b
(k1 , k 2 )
(22)
k1
and gb(k1,k2) are filter coefficients of the synthesis filter Gb. If we write a relation between the quantization-lossy coding gain and bit-rate-lossless coding criterion as
CLSY,Q = CLSL − Ω
(23)
In this case (optimum bit allocation), Ω becomes 7
Ωopt = 10 log10 ∏ b =0
( G )w 2
−1 b
(24)
b
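A small sketch (Python; the synthesis filters are hypothetical and not those of any transform in the paper) of eqs. (21)-(22): the synthesis gain G_b of each subband is the sum of the squared synthesis-filter coefficients, and the optimum quantization steps are chosen inversely proportional to the square root of G_b.

```python
import numpy as np

def synthesis_gain(g):
    """Eq. (22): G_b = sum of squared coefficients of the synthesis filter g_b."""
    return float(np.sum(np.asarray(g, dtype=float) ** 2))

def optimum_steps(synthesis_filters, base_step=1.0):
    """Eq. (21): Delta_b proportional to 1/sqrt(G_b), normalised to the first subband."""
    gains = [synthesis_gain(g) for g in synthesis_filters]
    return [base_step * (gains[0] / g) ** 0.5 for g in gains]

# Two hypothetical 1D synthesis filters used only to exercise the formulas.
filters = [[0.5, 1.0, 0.5], [0.25, -0.5, 1.5, -0.5, 0.25]]
print(optimum_steps(filters))   # subbands with larger synthesis gain get smaller steps
```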
Table 3. Lossy coding criterion of the existing Int-DCTs
Criterion                                       P-DCT   M-DCT   F-DCT   Y-DCT    C-DCT
Number of rounding operations                   8       8       21      15       51
Variance of the rounding error                  0.11    0.11    0.25    0.07     0.54
Ωopt                                            0       0       0       -18.06   0
Quantization-lossy coding gain (on average)     6.97    6.95    6.95    9.67     6.85
Fig. 6. An image decoded by the proposed OWLA Int-DCT-based coding system (PSNR = 33.6 dB at 1 bpp)
Fig. 7. An image decoded by the multiplierless Int-DCT-based coding system (PSNR = 32.9 dB at 1 bpp)
From the results in Table 3, the lossy coding criterion of the Y-J. Chen Int-DCT is the best, followed by our proposed OWLA Int-DCT. This confirms the effectiveness of the proposed OWLA Int-DCT. Fig. 6 and Fig. 7 illustrate images decoded by the proposed OWLA Int-DCT-based coding system and by our (non-optimized) multiplierless Int-DCT, respectively. Fig. 6 and Fig. 7 also confirm the effectiveness of our proposed Int-DCT.
6 Conclusion

In this paper, a new OWLA multiplierless 1D Int-DCT was proposed for unified lossless/lossy coding. The new OWLA multiplierless 1D Int-DCT does not require any floating multipliers, and its word lengths for the floating-multiplier approximation are kept as short as possible, so its hardware complexity is not high. The proposed method achieves better coding performance than our previous Int-DCT-based method, whose hardware complexity is higher. The lossless/lossy criteria are applied to confirm the effectiveness of the proposed OWLA Int-DCT.
Acknowledgement This work was financially supported by the CAT telecom public company limited, Thailand.
References
1. Rao, K.R., Hwang, J.J.: Techniques and standards for image, video and audio coding. Prentice Hall, Inc., NJ (1996)
2. Pennebaker, W.B., Mitchell, J.L.: JPEG still image data compression standard. Van Nostrand Reinhold, NY (1993)
3. Mitchell, J.L., Pennebaker, W.B., Fogg, C.E., LeGall, D.J.: MPEG Video compression standard. Chapman and Hall, NY (1997)
4. Fukuma, S., Ohyama, K., Iwahashi, M., Kambayashi, N.: Lossless 8-Point Fast Discrete Cosine Transform Using Lossless Hadamard Transform. Technical report of IEICE, DSP99-103, pp. 37-44 (October 1999)
5. Chokchaitam, S., Iwahashi, M., Zavarsky, P., Kambayashi, N.: A Bit-Rate Adaptive Coding System Based on Lossless DCT. IEICE Trans. on Fundamentals E85-A(2), 403-413 (2002)
6. Chokchaitam, S., Iwahashi, M., Kambayashi, N.: Optimum word length allocation of integer DCT and its error analysis. Signal Processing: Image Communication 19(6), 465-478 (2004)
7. Chen, Y.J., Oraintara, S., Nguyen, T.: Integer Discrete Cosine Transform (IntDCT). Invited paper, the 2nd International Conference on Information, Communications and Signal Processing, Singapore (December 1999)
8. Charith, G., Abhayaratne, K.: N-Point Discrete Cosine Transforms that Map Integers to Integers for Lossless Image/Video Coding. In: Proc. Picture Coding Symposium (PCS), pp. 417-422 (2003)
9. Chokchaitam, S., Iwahashi, M., Jitapanakul, S.: A New Lossless-DCT for Unified Lossless/Lossy Image Coding. In: MWSCAS 2004, Midwest Symposium on Circuits and Systems, vol. II, pp. 409-412 (2004)
10. Chokchaitam, S., Iwahashi, M.: A New Lossless/Lossy Image Coding based on a Multiplierless Integer DCT. ITC-CSCC 2006, Chiang Mai (July 2006)
11. Sweldens, W.: The Lifting Scheme: A Construction of Second Generation Wavelets. Tech. Rep. 1995:6, Industrial Math. Initiative, Dept. of Math., Univ. of South Carolina (1995)
12. Reichel, J., Menegaz, G., Nadenau, M.J., Kunt, M.: Integer Wavelet Transform for Embedded Lossy to Lossless Image Compression. IEEE Transactions on Image Processing 10(3), 383-392 (2001)
13. Chokchaitam, S., Iwahashi, M.: Performance Evaluation of the Lossless/Lossy Wavelet for Image Compression under Lossless/Lossy Coding Gain. IEICE special section on Digital Signal Processing 85-A(8), 1882-1891 (2002)
On Hybrid Directional Transform-Based Intra-band Image Coding

Alin Alecu¹, Adrian Munteanu¹, Aleksandra Pizurica², Jan Cornelis¹, and Peter Schelkens¹

¹ Dept. of Electronics and Informatics, Vrije Universiteit Brussel - Interdisciplinary Institute for Broadband Technology (IBBT), Pleinlaan 2, 1050 Brussels, Belgium
{aalecu,acmuntea,jpcornel,pschelke}@etro.vub.ac.be
² Dept. of Telecommunications and Information Processing, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium
[email protected]
Abstract. In this paper, we propose a generic hybrid oriented-transform and wavelet-based image representation for intra-band image coding. We instantiate it for three popular directional transforms having similar powers of approximation but different redundancy factors. For each transform type, we design a compression scheme wherein we exploit intra-band coefficient dependencies. We show that our schemes outperform alternative approaches reported in the literature. Moreover, on some images, we report that two of the proposed codec schemes outperform JPEG2000 by over 1 dB. Finally, we investigate the trade-off between oversampling and sparsity and show that, at low rates, hybrid coding schemes with transform redundancy factors as high as 1.25 to 5.8 are in fact capable of outperforming JPEG2000 and its critically-sampled wavelets.
1 Introduction

For some time now, the wavelet transform [1] has been successfully employed in signal and image processing applications such as compression, denoising and feature extraction, to mention but a few. Indeed, it is known that wavelet transforms manifest good non-linear approximation properties for piecewise smooth functions in one dimension [2], and it is these properties in particular that have led to their widespread use in still image coding, for instance. Nonetheless, the two-dimensional (2D) wavelets commonly used in image compression applications are obtained by a tensor-product of one-dimensional (1D) wavelets. As such, they are adapted only to point singularities, and cannot efficiently model higher-order curvilinear singularities, which are abundant in images. In contrast, recent literature reveals multiscale directional geometric image representations [3-6] that are quickly emerging as the new successors to classical wavelets. These transforms overcome the limited abilities of 2D tensor-product wavelets and in this sense are capable of accurately capturing geometric image structures such as smooth contours. When analyzing or designing a transform-based image compression scheme, a few properties need to be taken into consideration. First, the approximation power of the employed basis gives an indication of how well the basis provides an N-term
nonlinear approximation of the input signal, which is expressed in terms of the decay of the approximation error obtained by retaining the N largest coefficients. Evidently, a high power of approximation is desirable, but it must be followed by appropriate compression [2]. This leads us to compression power, which is given by the number of bits required to approximate the signal up to a given error (distortion) D. One can express here a distortion-rate trade-off, i.e. a relation of the form D(R). Secondly, the statistical dependencies that are present between the transform coefficients represent yet another factor to be considered. Different types of dependencies can be enumerated here, starting from the inter-scale dependencies exploited by the zero-tree structures of the EZW [7] and SPIHT [8] schemes, the intra-band dependencies exploited in the quadtree-based algorithms [9, 10] and the EBCOT coder [11], or the composite intra-band/inter-scale dependencies used in ECECOW [12] and EZBC [13]. As a side note, the oriented transforms of [4, 5] introduce a new type of dependencies, namely inter-orientation. The choice as to which of these dependencies the coding scheme tries to exploit will undoubtedly influence coding performance. Finally, the degree of transform redundancy is another issue that will affect compression. For instance, the image representations of [4, 5] yield a higher power of approximation compared to critically-sampled wavelets [1], but also manifest higher oversampling. While most of the work so far on oriented transforms has targeted denoising applications, recent literature reports their use also in image compression. In this respect, transform "orientability" can be seen as a specific partitioning of the frequency plane, while the design of such a transform-based compression scheme implies the encoding of the quantized coefficients, including an efficient exploitation of coefficient dependencies. For instance, Chappelier et al. propose an iterative algorithm that uses a contourlet decomposition at high frequencies and a wavelet decomposition at lower frequencies, respectively [14]. A similar partitioning of the frequency spectrum is presented in [15], combined with a clustering of the transform coefficients using morphological operations. An alternative frequency plane tiling is proposed in [16], wherein a wavelet transform is employed in place of the Laplacian pyramid of [5], such that a contourlet-alike partitioning of each wavelet frequency plane tile is finally achieved; a SPIHT-alike coding algorithm supplements the transform. In this paper, we chose a frequency partitioning generically similar to that of [14, 15]. We design three hybrid compression schemes employing wavelets and three types of oriented transforms, i.e. curvelets [4] and two variants of the contourlet transform [17, 18], respectively. Unlike previous approaches, we propose to exploit intra-band coefficient dependencies within each hybrid transform. This is achieved through the use of quadtree-based coding, followed by adaptive context-based entropy coding. The justification for focusing on this type of dependencies is based on the sufficiency of intra-band dependency models and the mild mutual-information gains reported in the literature for the more complex intra-band/inter-scale/inter-orientation models of each separate transform (see [19] for wavelets, [20] for contourlets and our recent work [21] for curvelets). We report that in terms of coding performance, this type of architecture clearly outperforms previous approaches [14-16]. Moreover, we show that on a series of images, the proposed codecs outperform JPEG2000 [22]. In this sense, for "Barbara", we report gains of over 1 dB. Finally, we
We report that in terms of coding performance, this type of architecture clearly outperforms previous approaches [14-16]. Moreover, we show that on a series of images, the proposed codecs outperform JPEG2000 [22]. In this sense, for “Barbara”, we report gains of over 1dB. Finally, we
Fig. 1. Pyramidal wavelet decomposition (left), and frequency plane tiling (right)
investigate the impact of the transform oversampling factor and the number of decomposition levels on the coding performance. In the latter sense, the redundancy increases with the number of oriented transform decomposition levels. Nonetheless, we find that for the least oversampled scheme investigated, up to three such levels can be employed while remaining competitive with respect to JPEG2000. The paper is organized as follows. In section 2 we give a brief overview of the transforms employed; the proposed codec architecture is presented in section 3; we show experimental results in section 4; finally, we draw the conclusions in section 5.
2 Pyramidal Subband Decompositions

Let f ∈ L²(ℝ²) be a measurable 2D signal with finite energy. Let a discrete approximation A_{j+1}f of f at a certain resolution be further decomposed into an approximation A_j f at a coarser resolution and a number of detail signals D_j f. A multiresolution representation on J levels of the signal A_0 f is then written as:

\left( A_{-J} f, \; \left( D_j^l f \right)_{j,l} \right)    (1)
where -1 ≥ j ≥ -J denotes the scale 2^j and l corresponds to the different detail signals at level j. In the following, we instantiate (1) for wavelets, a series of oriented transforms and a more generic hybrid transform, respectively.

2.1 Discrete Wavelets
The 1D Discrete Wavelet Transform (DWT) decomposes a signal f ∈ L²(ℝ) into a multiresolution representation of the form (1), where l = 1. The term A_{-J}f can be interpreted here as a low-pass filtering of f and D_j^1 f as a band-pass filtering, respectively, each followed by uniform sampling at the rate 2^j [1]. It is known that the separable 2D DWT is obtained from a tensor-product of 1D wavelets. Hence, a multiresolution representation of f ∈ L²(ℝ²) is again written as (1), in which this time
Fig. 2. Pyramidal curvelet decomposition (left), and frequency plane tiling (right)
we denote with l = 1, 2, 3 the horizontal, vertical and diagonal detail signals. We illustrate in Fig. 1, for J = 2 levels, the pyramidal decomposition of an image f into wavelet subbands and the corresponding tiling of the frequency plane. For a more detailed overview of the topic, we refer the reader to the literature [1].

2.2 Discrete Curvelets
Similar to the DWT, the Discrete Curvelet Transform (DCuT) can also be seen as a multiscale pyramid, but with more directions and positions at each scale. Thus, while offering multiscale and time-frequency localization properties similar to those of wavelets, the DCuT introduces additional geometric features, such as a high degree of directionality and anisotropy. Moreover, this transform provides an optimally sparse representation of objects with edges, making it far sparser than its wavelet counterpart. A DCuT multiresolution representation on J levels of an image f can be written as (1), where -1 ≥ j ≥ -J denotes the scale 2^j and l = 0, 1, … corresponds to the different orientations (i.e., to rotation angles 0 ≤ θ_l ≤ 2π). At every other finer scale, the number of orientations is doubled [4]. The detail signals (D_j^l f)_{j,l} are composed of fine-scale directional anisotropic curvelet elements, while the approximation A_{-J}f corresponds to coarse-scale isotropic wavelets [4, 23]. We illustrate in Fig. 2, for l = 8 directions at the coarsest curvelet scale, the DCuT subband decomposition of an image f and the corresponding tiling of the frequency plane. Recent literature reports two variants of the DCuT, in which the implementation is based on Unequally Spaced Fast Fourier Transforms (USFFTs) and on wrapping techniques, respectively [4, 23].
The Discrete Contourlet Transform (DCoT) is a multiscale pyramid with properties essentially similar to those of the DCuT. Nonetheless, a major difference is that it allows a filter bank implementation, being constructed as an iterated filter bank structure composed of a Laplacian pyramid followed by a directional filter bank
Fig. 3. The proposed partitioning of the frequency plane
decomposition. Furthermore, while providing the same nonlinear power of approximation as the DCuT, it is characterized by a lower redundancy. A DCoT multiresolution representation on J levels of an image f can be written as the representation (1) used for the DCuT, except that the detail signals (D_j^{l_1} f)_{j,l_1} and (D_j^{l_2} f)_{j,l_2} with θ_{l_1} = θ_{l_2} + π are no longer disjoint. The partitioning of the frequency plane is similar to that shown in Fig. 2, with the additional observation that, due to the decoupling of the multiscale and directional steps, the DCoT allows for a different number of directions at different scales. Hence, Fig. 2 can be seen as a particular case of the DCoT in which the frequency tiling is similar to that of the DCuT. The spatial decomposition follows that of Fig. 2, but one must also take into account the directional non-separability observation mentioned previously. For a full overview, we refer the reader to the literature [17]. Recent variations on this transform include a critically-sampled variant [24] and a contourlet with sharp frequency localization [18].
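For readers who want to reproduce the wavelet part of the representation (1) (and later of (2)), a minimal sketch using PyWavelets is given below; the 'bior4.4' filter and the random input are placeholders and not the exact setup used in Section 4.

```python
import numpy as np
import pywt

img = np.random.default_rng(0).standard_normal((256, 256))  # stand-in for an input image f

# J_W-level separable 2D DWT: returns [A_{-J} f, (D^1, D^2, D^3) per scale, coarse to fine]
J_W = 3
coeffs = pywt.wavedec2(img, wavelet='bior4.4', level=J_W)
approx, details = coeffs[0], coeffs[1:]
print(approx.shape)                                 # coarsest approximation A_{-J} f
for level, (dh, dv, dd) in enumerate(details):
    print(level, dh.shape, dv.shape, dd.shape)      # horizontal, vertical, diagonal subbands
```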
3 The Compression Scheme

In this paper, we employ a transformation stage consisting of a J_C-level directional transform decomposition at high frequencies and a J_W-level wavelet decomposition at lower frequencies, respectively. Using (1), it follows that the multiresolution representation of an image f ∈ L²(ℝ²) on J = J_C + J_W levels can be written as:

\left( A_{-J} f, \; \left( D^1_{W,j_W} f, \, D^2_{W,j_W} f, \, D^3_{W,j_W} f \right)_{-J \le j_W \le -J_C-1}, \; \left( D^{l_C}_{C,j_C} f \right)_{-J_C \le j_C \le -1,\, l_C} \right)    (2)
where D_{W,j_W}^{l_W} f and D_{C,j_C}^{l_C} f denote the wavelet and directional detail signals, respectively, and A_{-J}f is the approximation of f at the coarsest scale. An example of the corresponding frequency plane tiling is shown in Fig. 3, for J_C = 1 and J_W = 1. The justification for this frequency partitioning choice can be found in [14] for the particular case of wavelets and contourlets. Studies in the literature reveal that DWT intra-band models capture most of the dependencies between coefficients [19]. Furthermore, we have recently shown in [21] that intra-band modeling of DCuT dependencies (in particular, of its USFFT variant) can be deemed sufficiently significant, with marginal gains for the more complex intra-band/inter-scale or intra-band/inter-orientation models. Finally, an information-
theoretic analysis of DCoT dependencies reveals that similar conclusions can be drawn for this transform as well [20]. In view of these observations, we choose to adopt in this paper an intra-band coding strategy for the encoding of the oriented transform and DWT coefficients, respectively. More specifically, we employ a 2D variant of the QuadTree-Limited (QT-L) codec proposed in [25]. The QT-L is an intra-band multi-pass quadtree-based scalable coding scheme that uses successive approximation quantization (SAQ) in order to determine the significance of the quantized coefficients with respect to a series of decreasing thresholds T_p = 2^p, 0 ≤ p ≤ p_max. The set of coding passes performed by
QT-L for each coding stage p, 0 ≤ p ≤ p_max, includes [26]: (i) the significance pass, encoding the positions of coefficients that were non-significant at previous coding stages q, p < q ≤ p_max, but become significant with respect to the current T_p; (ii) a non-significance pass, encoding the positions of coefficients that become significant with respect to T_p and are located in the neighborhood of coefficients found to be significant at previous coding stages q; and (iii) a refinement pass, refining the magnitudes of the coefficients found to be significant at previous coding stages q. For the first coding stage p = p_max, the QT-L codec performs only a significance pass. In order to encode the locations of significant coefficients at each coding stage p, the QT-L coding algorithm performs a quadtree decomposition wherein the matrix of quantized coefficients is successively divided into a set of quadrants (matrices). The partitioning process is limited, such that quadtrees are not built down to pixel level. Instead, once the area (the number of coefficients) of the current node in the quadtree is lower than a predefined minimal quadrant area, the partitioning process is stopped and the
[Plot: PSNR (dB) versus rate (bpp) for the "Barbara" image; curves for JPEG2000, WBCT, CurvWav, ContWav, ContSDWav, Hybrid (Chappelier) and Hybrid (Liu).]
Fig. 4. Rate-distortion results for the “Barbara” image obtained using (i) JPEG2000, (ii) the proposed schemes and (iii) a series of hybrid schemes recently reported in literature
coefficients within the quadrant are further entropy coded. The coding scheme is supplemented by a context conditioning phase and context-based entropy coding of the symbols generated in the coding passes. For DWT subbands, we employ the context models of JPEG2000 [22]. Furthermore, horizontal and vertical models [22] have been used to encode the oriented transform subbands, wherein we classify subbands as being mostly horizontal or mostly vertical, respectively. Given the representation (2), each subband D_j^l f and A_{-J}f is encoded independently within the described scheme. The compression framework is supplemented by a rate-distortion (R-D) optimization technique that allows the generation of an optimal scalable bit-stream representation.
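The sketch below (Python; all names are ours) illustrates only the successive-approximation side of the QT-L idea described above: significance of the quantized coefficients against the decreasing thresholds T_p = 2^p and a recursive quadrant split that stops at a minimal quadrant area. The three separate coding passes, the context modelling and the actual entropy coder are deliberately omitted.

```python
import numpy as np

def significance_map(coeffs, p):
    """Coefficients significant with respect to the threshold T_p = 2**p."""
    return np.abs(coeffs) >= (1 << p)

def quadtree_split(sig, top, left, size, min_area, out):
    """Recursively record (top, left, size, any-significant) decisions, stopping at min_area."""
    block = sig[top:top + size, left:left + size]
    any_sig = bool(block.any())
    out.append((top, left, size, any_sig))
    if any_sig and size * size > min_area:
        half = size // 2
        for dt, dl in ((0, 0), (0, half), (half, 0), (half, half)):
            quadtree_split(sig, top + dt, left + dl, half, min_area, out)

rng = np.random.default_rng(1)
subband = np.round(rng.laplace(scale=4.0, size=(16, 16)))   # stand-in for a quantized subband
p_max = max(0, int(np.log2(np.abs(subband).max() + 1)))     # coarsest bit-plane
for p in range(p_max, -1, -1):                              # decreasing thresholds T_p = 2**p
    decisions = []
    quadtree_split(significance_map(subband, p), 0, 0, 16, 4, decisions)
    print(p, len(decisions))   # these decisions would be context-modelled and entropy coded
```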
4 Experimental Results

In this section, we report and discuss the compression results obtained using the proposed scalable hybrid compression schemes, for a set of JPEG2000 test images. Specifically, we have designed a DCuT/DWT-based codec that uses a DCuT via USFFTs, which we denote by "CurvWav". It should be mentioned that the DCuT work of [4, 23] reports the use of either wavelets or curvelets at the finest scale. Please note that in this work all D_{C,j_C}^{l_C} f detail signals refer to curvelets. Furthermore, we have designed two variants of a DCoT/DWT-based compression scheme. The first codec employs an instantiation of the original DCoT of [17], which we denote as "ContWav", while the second is based on the DCoT with sharp frequency localization of [18], which we refer to as "ContSDWav". The DWT employed is the biorthogonal 6.8 transform. Each transform setup is then described by a hybrid transform with J = J_C + J_W levels, wherein the combination of J_C, J_W is optimally determined for each image. A similar remark holds for the orientations l_C at each level j_C. We plot in Fig. 4 and Fig. 5, for a set of 512x512 and 256x256 natural, fingerprint and seismic images, the scalable R-D compression results obtained using our proposed codecs and JPEG2000, respectively. The transform employed is of the form (2) with J = J_C + J_W levels, wherein we choose J_C = 1 and J_W = 4 or J_W = 5, depending on the image size. The same number of levels J has been used for JPEG2000. It can be seen from these figures that on "Barbara" the two DCoT-based codecs clearly outperform JPEG2000 at all bit-rates, with gains of up to 0.80 dB for ContWav and 1.10 dB for ContSDWav, respectively. On "Fingerprint" all three codecs are comparable with or even slightly outperform JPEG2000. A similar observation can be made for the "Seismic" image. For the remainder of the images, the proposed schemes remain competitive with respect to JPEG2000 for rates up to 0.25 bpp. A first conclusion that can be drawn from these results is that, as intuitively expected, the hybrid codecs are comparable with or outperform JPEG2000 in particular on images with strong directional features. Indeed, "Fingerprint" and "Seismic" are highly directional, with almost no texture, while "Barbara" is a combination of both features. Another observation is that the two DCoT-based codecs consistently
Fig. 5. Rate-distortion results obtained using the proposed schemes and JPEG2000, for (upper-left) "Lena", (upper-right) "Fingerprint", (lower-left) "Seismic" and (lower-right) "Cameraman". Note that on some images the ContWav and ContSDWav curves may overlap.
outperform the DCuT-based scheme. Also, on the average, the ContSDWav scheme is better than its ContWav counterpart. In terms of visual results, the reconstructed “Barbara” image, compressed at 0.1 bpp, is depicted in Fig. 6, for JPEG2000 and ContSDWav. It can be seen that the obtained 0.9143 dB difference between the two images clearly translates into a visual quality difference as well, in particular in regions with a high degree of directionality (i.e., trousers, books,…). Similar observations regarding visual quality can be made for bit-rates all the way up to 0.5 bpp. Furthermore, we compare our proposed codecs with the “Hybrid” codec of Chappelier et al [14], the “Hybrid” codec of Liu et al [15] and the Wavelet-based Contourlet Transform (“WBCT”) codec of [16]. The R-D results are plotted in Fig. 4 for “Barbara”. Note that the results for the codecs of [14] and [16] have been reproduced from the graphical illustrations of these papers. It can be seen that the best results are obtained by the ContSDWav and ContWav codecs, followed by the two hybrid schemes. Finally, we analyze the impact of oversampling on compression performance. In this sense, we illustrate in Fig. 7 for “Barbara”, for a constant number of decomposition levels J = J C + JW and increasing values of J C (i.e., J C = 1, 2,3 ), the coding results obtained with our proposed CurvWav, ContWav and ContSDWav schemes, respectively. It can be seen from this figure that for all codecs the compression performance gradually decreases as J C increases. Similar results have been obtained for other images. The explanation for this consists in the fact that an increase in J C is associated with an increase in the redundancy factor, as can be seen from Table 1, in
Fig. 6. "Barbara" compressed at 0.1 bpp, for which we obtain a PSNR of (above) 25.0405 dB for JPEG2000 and (below) 25.9548 dB for ContSDWav
which we show the oversampling factors for different J C . These results lead us to conclude that highly oversampled hybrid schemes can indeed be designed to be competitive or even outperform JPEG2000 at low rates, but there is a limit up to which one may trade redundancy in return for sparsity. Furthermore, this limit is
Fig. 7. For the "Barbara" image, the rate-distortion results obtained using the proposed codecs with the same number of levels J = J_C + J_W and increasing values of J_C. The results for JPEG2000 are also shown.

Table 1. Transform oversampling factors for the proposed codecs
                   JPEG2000   CurvWav   ContWav   ContSDWav
Max (large J_C)    1          7.2       1.33      2.33
Used (J_C = 1)     1          5.8       1.25      2
image-dependent. In the case of "Barbara", for instance, the ContWav scheme remains competitive up to J_C = 2, or at certain bit-rates even up to J_C = 3, while ContSDWav loses coding performance beyond J_C = 1. These observations coincide with the lower redundancy factor of the former transform over the latter (see Table 1). We end this section by concluding that the redundancy of oriented transforms such as those of [4, 5, 18] is not necessarily a drawback for compression applications. Indeed, the results of this paper show that an optimal trade-off between oversampling and sparsity, combined with an adequate exploitation of coefficient dependencies, can lead to competitive coding results. In this sense, we have reported codecs with redundancy factors as high as 1.25 to 5.8 and have shown that at low rates such schemes can in fact outperform JPEG2000 and its critically-sampled wavelets.
5 Conclusions

In this paper, we have proposed a generic hybrid image representation consisting of an oriented transform at high frequencies and a wavelet transform at low frequencies.
We have designed separate compression schemes using three popular directional transform instantiations. Although the tiling of the frequency plane is not unique, we show that our choice of frequency partitioning, combined with the choice of exploiting intra-band coefficient dependencies (i.e., the use of an intra-band coding architecture), leads to image compression schemes that clearly outperform other approaches reported in literature. Moreover, on some images, we show that two of the proposed codec instantiations outperform JPEG2000 by over 1dB. Finally, we have investigated the trade-off between oversampling and sparsity and shown that, at low rates, hybrid coding schemes with transform redundancy factors as high as 1.25 to 5.8 can in fact outperform classical wavelet-based schemes.
Acknowledgments This research was funded by the Fund for Scientific Research - Flanders (JCA-SVC&R project and the post-doctoral fellowships of A. Munteanu, P. Schelkens and A. Pizurica).
References 1. Mallat, S.: A Theory for Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 674–693 (1989) 2. Vetterli, M.: Wavelets, approximation and compression. IEEE Signal Processing Magazine 18, 59–73 (2001) 3. Candès, E.J., Donoho, D.: Ridgelets: a key to higher-dimensional intermittency. Phil. Trans. R. Soc. Lond. A. 357, 2495–2509 (1999) 4. Candès, E.J., Donoho, D.: New Tight Frames of Curvelets and Optimal Representations of Objects with Piecewise C2 Singularities. Comm. Pure Appl. Math 57, 219–266 (2004) 5. Do, M.N., Vetterli, M.: Contourlets. In: Welland, G.V. (ed.) Beyond Wavelets, Academic Press, London (2003) 6. Le Pennec, E., Mallat, S.: Sparse Geometric Image Representations with Bandelets. IEEE Transactions on Image Processing 14, 423–438 (2005) 7. Shapiro, J.M.: Embedded Image Coding Using Zerotrees of Wavelet Coefficients. IEEE Transactions on Signal Processing 41, 3445–3462 (1993) 8. Said, A., Pearlman, W.: A New Fast and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees. IEEE Trans. on Circuits and Systems for Video Tech. 6, 243–250 (1996) 9. Munteanu, A., Cornelis, J., Van der Auwera, G., Cristea, P.: Wavelet Image Compression - The Quadtree Coding Approach. IEEE Transactions on Information Technology in Biomedicine 3, 176–185 (1999) 10. Pearlman, W.A., Islam, A., Nagaraj, N., Said, A.: Efficient, low-complexity image coding with a set-partitioning embedded block coder. IEEE Trans. Circuits and Systems for Video Technology 14, 1219–1235 (2004) 11. Taubman, D.: High Performance Scalable Image Compression with EBCOT. IEEE Transactions on Image Processing 9, 1158–1170 (2000) 12. Wu, X.: High-order context modeling and embedded conditional entropy coding of wavelet coefficients for image compression. In: Thirty-First Asilomar Conference on Signals, Systems & Computers, vol. 2, pp. 1378–1382 (1997)
13. Hsiang, S.-T., Woods, J.W.: Embedded image coding using zeroblocks of subband/wavelet coefficients and context modeling. In: IEEE International Symposium on Circuits and Systems (ISCAS), Geneva, Switzerland, vol. 3, pp. 662–665. IEEE, Los Alamitos (2000) 14. Chappelier, V., Guillemot, C., Marinkovic, S.: Image Coding with Iterated Contourlet and Wavelet Transforms. In: Proc. IEEE International Conf. on Image Processing, Singapore, pp. 3157–3160. IEEE Computer Society Press, Los Alamitos (2004) 15. Liu, Y., Nguyen, T.T., Oraintara, S.: Low Bit-Rate Image Coding Based on Pyramidal Directional Filter Banks. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, IEEE, Los Alamitos (2006) 16. Eslami, R., Radha, H.: Wavelet-based Contourlet Coding using an SPIHT-like Algorithm. In: Proc. of Conference on Information Sciences and Systems, NJ, pp. 784–788 (2004) 17. Do, M.N., Vetterli, M.: The Contourlet Transform: an Efficient Directional Multiresolution Image Representation. IEEE Trans. Image Proc. 14, 2091–2106 (2005) 18. Lu, Y., Do, M.N.: Multidimensional Directional Filter Banks and Surfacelets. IEEE Trans. Image Processing (to appear) 19. Liu, J., Moulin, P.: Information-Theoretic Analysis of Interscale and Intrascale Dependencies between Image Wavelet Coefficients. IEEE Transactions on Image Processing 10, 1647–1658 (2001) 20. Po, D.D.-Y., Do, M.N.: Directional multiscale modeling of images using the contourlet transform. IEEE Transactions on Image Processing 15, 1610–1620 (2006) 21. Alecu, A., Munteanu, A., Pizurica, A., Philips, W., Cornelis, J., Schelkens, P.: Information-Theoretic Analysis of Dependencies between Curvelet Coefficients. In: IEEE International Conference on Image Processing (ICIP), Atlanta, GA, USA, pp. 1617–1620. IEEE, Los Alamitos (2006) 22. Taubman, D., Marcelin, M.W.: JPEG2000: Image Compression Fundamentals, Standards, and Practice. Kluwer Academic Publishers, Norwell, Massachusetts (2002) 23. Candès, E.J., Demanet, L., Donoho, D.L., Ying, L.: Fast Discrete Curvelet Transforms. Applied and Computational Mathematics, California Institute of Technology (2005) 24. Lu, Y., Do, M.N.: CRISP-Contourlets: a Critically Sampled Directional Multiresolution Image Representation. In: Proc. SPIE Conf. on Wavelet Applic. in Signal and Image Proc. X, San Diego, USA (2003) 25. Schelkens, P., Munteanu, A., Barbarien, J., Galca, M., Giro-Nieto, X., Cornelis, J.: Wavelet Coding of Volumetric Medical Datasets. IEEE Trans. on Medical Imag. 22, 441–458 (2003) 26. Munteanu, A.: Wavelet Image Coding and Multiscale Edge Detection: Algorithms and Applications. PhD Thesis. Vrije Universiteit Brussel, Brussels (2003)
Analysis of the Statistical Dependencies in the Curvelet Domain and Applications in Image Compression Alin Alecu1, Adrian Munteanu1, Aleksandra Pizurica2, Jan Cornelis1, and Peter Schelkens1 1
Dept. of Electronics and Informatics, Vrije Universiteit Brussel – Interdisciplinary Institute for Broadband Technology (IBBT), Pleinlaan 2, 1050 Brussels, Belgium Phone: +32-2-629-1896 {aalecu,acmuntea,jpcornel,pschelke}@etro.vub.ac.be 2 Dept. of Telecommunications and Information Processing, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium {
[email protected]}
Abstract. This paper reports an information-theoretic analysis of the dependencies that exist between curvelet coefficients. We show that strong dependencies exist in local intra-band micro-neighborhoods, and that the shape of these neighborhoods is highly anisotropic. In this respect, it is found that the two immediately adjacent neighbors that lie in a direction orthogonal to the orientation of the subband convey the most information about the coefficient. Moreover, taking into account a larger local neighborhood set than this brings only mild gains with respect to intra-band mutual information estimations. Furthermore, we point out that linear predictors do not represent sufficient statistics if applied to the entire intra-band neighborhood of a coefficient. We conclude that intra-band dependencies are clearly the strongest, followed by their inter-orientation and inter-scale counterparts; in this respect, the more complex intra-band/inter-scale or intra-band/inter-orientation models bring only mild improvements over intra-band models. Finally, we exploit the coefficient dependencies in a curvelet-based image coding application and show that the scheme is comparable with JPEG2000 and in some cases even outperforms it. Keywords: curvelet, coefficient dependency, mutual information, compression.
1 Introduction For some time now, geometric image representations [1-4] have been emerging as the successors to classical wavelets [5]. These transforms overcome the limited ability of 2D tensor-product wavelets to capture directional information and, as such, are capable of providing optimally sparse representations of objects with C^2 edges. While most of the work in the literature has so far focused on the transforms themselves, practical applications that make use of these representations are only slowly coming to light. Carefully assessing the statistical dependencies between the resulting coefficients is of paramount importance in various applications. For instance, the evolution from the original independence assumption [5] between wavelet
Fig. 1. An image decomposition into curvelet subbands
coefficients towards the observation of strong inter- and intra-scale statistical dependencies [6-8] has led to the design of successful image coding and denoising applications. It is clear that in order to repeat the success of wavelets, a similar investigation of the statistical dependencies is required for the recently emerging geometric transforms. In this respect, we investigate in this paper a representation that appears to hold particular promise for future image processing applications, namely the curvelet transform [1]. The paper is organized as follows: section 2 gives a brief description of a curvelet image decomposition; we analyze the curvelet coefficient dependencies in terms of mutual information in section 3; we exploit these dependencies in a practical image coding application in section 4; finally, we draw the conclusions in section 5.
2 Curvelet Decompositions The curvelet-based decomposition scheme employed in this work is the Digital Curvelet Transform via Unequispaced FFTs (DCT-USFFT) of Candès et al. While we will not go into an extensive discussion of the transform itself - instead referring the reader directly to the literature [1, 9] - we would like to clarify here a few concepts and notations that will be used throughout the paper. Fig. 1 illustrates the decomposition of an image into curvelet subbands (shown as rectangles), each corresponding to a certain scale and orientation. At each finer scale, the number of orientations doubles w.r.t. the next coarser scale [1]. Subbands located at the same scale are displayed along concentric coronae, the outermost corresponding to the highest frequencies. The subbands are grouped as being mostly horizontal/vertical (MH/MV), according to their orientation. We employ hereafter the terminology of [10, 11], such that, given a subband coefficient X, P denotes its parent at the next resolution level, C_k is a cousin at the same scale but in a different orientation band,
Table 1. Mutual information estimates between X and its single neighbors N_{i,j} within a 5x5 neighborhood, as averages over a test set of 7 images
         i = -2    i = -1    i = 0     i = 1     i = 2
j = -2   0.1225    0.1462    0.1562    0.1466    0.1243
j = -1   0.1379    0.1767    0.1972    0.1683    0.1314
j =  0   0.2032    0.4905      X       0.488     0.1968
j =  1   0.1337    0.1686    0.1963    0.1732    0.1351
j =  2   0.1244    0.1469    0.1552    0.1448    0.1154
and N is a local (intra-band) neighbor of X. We refer to “adjacent” cousins as those belonging to subbands located at adjacent orientations, and we use the notation C_op to denote the cousin belonging to the band with opposite orientation to the one containing X. Finally, the DCT-USFFT transform coefficients consist of wavelet coefficients at the coarsest scale, and curvelet coefficients at all other finer scales, respectively [1].
3 Curvelet Coefficient Dependencies In this paper, we express coefficient dependencies in terms of mutual information (MI). In general, the mutual information I(X;Y) between two random variables X and Y can be reasonably estimated using existing methods (e.g., the log-scale histogram method, the adaptive partitioning method [12], and so on). Nonetheless, it is well known that as the number of variables involved increases, one is confronted with the so-called curse of dimensionality, in which the difficulty of accurately estimating the joint pdfs increases exponentially with the number of variables. Hence, we adopt the approach of [6-8] that replaces a multi-dimensional Y by its sufficient statistic T = f(Y), such that I(X;Y) = I(X;T). We start by illustrating in Table 1 the intra-band MI estimates I(X; N_{i,j}) between a curvelet coefficient X and each of its single neighbors N_{i,j}, (i,j) ∈ {-2,-1,0,1,2}^2 \ {(0,0)}, of the symmetrical 5x5 neighborhood (X corresponds to the central position i = 0, j = 0). The MI values are computed as averages over the curvelet subbands of the last two finest scales, over a test set of 7 images. It can be seen from these results that N_{-1,0} and N_{1,0} convey more information about X (by a factor of four) than any other neighbor, the next strongest dependencies being observed amongst the horizontally and vertically located neighbors. Finally, we notice that the MI estimates gradually decrease as the distance from X increases. We now focus on the MI estimates between curvelet coefficients X and their entire local neighborhoods, i.e., sets of the form N = {N_{i,j}}, i ∈ I, j ∈ J. In order to derive
such estimates for multi-dimensional random variables Y = {Y_1, Y_2, ..., Y_N}, we employ a linear predictor of the magnitudes of the coefficients, i.e., we assume that T = Σ_i a_i |Y_i| is a sufficient statistic of Y, where a_i are weights that minimize the
Fig. 2. Curvelet intra-band mutual information estimates I(X;N) as averages over a test set of 7 images, for successive values of card(N) = 1, ..., 24
Fig. 3. Curvelet intra-band mutual information estimates I(X;N) as averages over a test set of 7 images, for successive values of card(N) = 1, ..., 24; a linear predictor of the entire neighborhood set N is employed
expected squared error [6]. Furthermore, we use a greedy algorithm in order to dynamically add the most informative neighbors to the set N. The algorithm starts with N = ∅ and, at each iteration, extends the neighborhood set as N = N ∪ {N_{i,j}} (i.e., card(N) = card(N) + 1, where card(·) denotes the cardinality of a set). The term N_{i,j} denotes a single neighbor of X, chosen from among the available i ∈ I, j ∈ J, such that the MI for the given card(N) is maximized.
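To make the procedure concrete, the following Python sketch combines a histogram-based MI estimator (standing in for the log-scale-histogram or adaptive-partitioning methods cited above) with a least-squares linear predictor of neighbor magnitudes and greedily grows the neighborhood. It is an illustration under stated assumptions, not the authors' implementation; in particular, the data layout (a dictionary mapping neighbor offsets to sample arrays) is hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=64):
    """Histogram-based estimate of I(X;Y) in bits (assumed estimator)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

def greedy_neighbor_selection(samples, max_card=24):
    """samples: dict mapping (i, j) offsets -> 1-D arrays of neighbor values,
    plus key 'X' -> array of central coefficients (hypothetical layout).
    Returns the ordered list of selected offsets and the MI curve, using a
    least-squares linear predictor T = sum_i a_i |Y_i| as statistic."""
    x = samples['X']
    candidates = [k for k in samples if k != 'X']
    selected, mi_curve = [], []
    for _ in range(min(max_card, len(candidates))):
        best = None
        for cand in candidates:
            trial = selected + [cand]
            A = np.abs(np.column_stack([samples[k] for k in trial]))
            # weights a_i minimizing E[(|X| - T)^2]
            a, *_ = np.linalg.lstsq(A, np.abs(x), rcond=None)
            mi = mutual_information(x, A @ a)
            if best is None or mi > best[1]:
                best = (cand, mi)
        selected.append(best[0])
        candidates.remove(best[0])
        mi_curve.append(best[1])
    return selected, mi_curve
```

The sketch follows the variant of Fig. 3, in which the predictor is applied to the whole selected set; reproducing the curve of Fig. 2 would require handling N_{-1,0} and N_{1,0} outside the predictor, as described in the text.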
Fig. 4. Wavelet intra-band mutual information estimates I(X;N) as averages over a test set of 7 images, for successive values of card(N) = 1, ..., 24
The curvelet MI estimates I(X;N) for the symmetrically-located 5x5 neighborhood of X are calculated for increasing values of card(N), card(N) = 1, ..., 24. These estimates are again computed as averages over the test set of 7 images. In a first experiment, we estimate I(X;N) by employing a linear predictor T only for the exclusive set N \ {N_{-1,0}, N_{1,0}}. In other words, we calculate I(X; N_{-1,0}, N_{1,0}, T), which, despite the curse of dimensionality, is still within reasonable computational limits. The obtained MI values are illustrated in Fig. 2. We exclude the set {N_{-1,0}, N_{1,0}} because we have experimentally found that linear magnitude predictors of N, when N_{-1,0} ∈ N or N_{1,0} ∈ N, do not behave well. Indeed, Fig. 3 plots similar results to those of Fig. 2, except that a linear predictor T of N when {N_{-1,0}, N_{1,0}} ⊂ N is now used. It can be clearly seen from Fig. 3 that, after an initial abrupt increase, the MI decreases rapidly. Apparently this is in contradiction with the chain rule for MI, which states that I(X; Y_1, ..., Y_k) ≥ I(X; Y_1, ..., Y_{k-1}) [8]. Nonetheless, let us recall that I(X;N) can be estimated through its bound I(X;T) ≤ I(X;N) if T is a sufficient statistic for N, in which case I(X;T) = I(X;N). The results of Fig. 3 indicate that I(X;T) decreases rapidly for card(N) > 2. Hence, T can no longer be considered a sufficient statistic for N when {N_{-1,0}, N_{1,0}} ⊂ N. This comes as an important observation if one recalls that, in the case of wavelets, it is shown that linear predictors are indeed sufficient statistics for the entire local neighborhood of a coefficient [7, 8]. In fact, for the sake of comparing curvelet MI behavior with that of a thoroughly-studied transform, we illustrate in
Fig. 5. The shape of curvelet intra-band neighborhoods N = {N_{i,j}}. The ordering of the N_{i,j} is denoted by the shades of grey, black signifying the strongest dependency.
Table 2. Mutual information estimates. C_k denotes a cousin of X located k orientations away, C_op is the opposite-orientation cousin, and finally P denotes the parent.
           I(X;P)    I(X;C_1)   I(X;C_4)   I(X;C_12)   I(X;C_op)
Lena       0.1310    0.0806     0.0334     0.0102      0.1536
Peppers    0.0851    0.0293     0.0055     0.00001     0.0938
Fig. 4 results similar to those shown in Fig. 3, but for high-frequency wavelet subbands (the wavelet transform employed here is the (4,4) symmetrical biorthogonal transform, and the results refer to the horizontal detail subbands). A comparison of the results of Fig. 2 and Fig. 4 reveals that the MI estimates increase more abruptly for curvelets than for wavelets. Indeed, curvelets require only two coefficients to approximately reach an MI maximum, while wavelets require four. Additionally, for both transforms, it can be noticed that after a certain card(N), the MI estimates exhibit a slow decay. This can be explained by the fact that these values of card(N) correspond to neighbors N_{i,j} of X located further away. At such distant locations, the correlation with respect to X decreases significantly, such that T deviates from the sufficient-statistic assumption. The fact that, for curvelets, the slow decay of the MI starts at a low value of card(N) points to the conclusion that, although very strong, curvelet coefficient dependencies are limited to local micro-neighborhoods. Additionally, the difference in magnitude between the overall curvelet and wavelet MI estimates is due to the high oversampling of the curvelet transform [1], compared to the critically-sampled wavelet transform. Indeed, oversampling induces redundancy and thus stronger dependencies. We conclude the analysis of intra-band curvelet coefficient dependencies by illustrating the shape of the curvelet neighborhood N, for the first few values of card(N). These results are shown in Fig. 5, and correspond to the neighborhood employed in Fig. 2, the ordering of the N_{i,j} being denoted by the decreasing shades of grey. It can be seen from Fig. 5 that the strongest dependencies can be found for the immediate horizontal neighbors, followed by the next horizontal and the immediate vertical neighbors, respectively. In addition, this ordering appears to match the single MI results I(X; N_{i,j}) of Table 1. This is an interesting observation, especially if compared with the known classical wavelet dependencies [8]. Indeed, curvelet
Table 3. Mutual information estimates between X and its parent P , neighbors N and cousins C
           I(X;P)   I(X;N)   I(X;C)   I(X;N,P)   I(X;N,C)   I(X;P,C)
Lena       0.1310   0.9123   0.2051   1.0318     1.1555     0.4294
Peppers    0.0851   0.8138   0.0871   0.8668     0.8671     0.218
Barbara    0.0456   0.9092   0.2582   0.9124     1.1338     0.3538
neighborhoods appear to have a strongly anisotropic shape. A possible explanation for this is the fact that curvelets themselves possess anisotropic scaling laws, the support of a curvelet being contained in a ‘parabolic’ shape that obeys such laws [1]. Next, we briefly extend our investigation of curvelet MI estimates to inter-scale and inter-orientation coefficient dependencies (a discussion of their joint statistics can be found in [10]). We illustrate in Table 2, for “Lena” and “Peppers”, the MI estimates between a coefficient X and some of its cousins C_k, between X and P, and finally between X and C_op. The results are derived for subbands located at the last two finest scales. It can be observed that the MI decreases as the difference between orientations increases, the most significant cousin in this sense being the orientation-adjacent C_1. Nonetheless, the opposite-orientation cousin C_op appears to be the most significant of all, outperforming even the parent coefficient P. We believe that this is a result of the real-valued curvelet transform implementation. Indeed, the DCT-USFFT investigated in this paper builds complex coefficient subbands that correspond to a single direction. Real-valued pairs of subbands and their “opposites” are then constructed from such single complex-valued subbands. As such, it is expected that the obtained “opposite” coefficients still display significant dependencies. We end this section by showing in Table 3, for a few images, the MI estimates between X and its “generalized” neighborhood set G = {N, P, C}, i.e., between X and its parent P, its intra-band neighbor set N and its cousin set C, respectively.
Based on the previous findings, we chose N = {N_{-1,0}, N_{1,0}} and C = {C_{-1}, C_1, C_op}, where C_{-1}, C_1 denote the two orientation-adjacent cousins of X. The first choice is motivated by the fact that N_{-1,0} and N_{1,0} convey the most information about X, the inclusion of additional neighbors beyond these two bringing insignificant gains with respect to MI; the second choice is based on the observed ordering of the I(X;C_k) estimates. From Table 3, we find that I(X;P) < I(X;C) ≪ I(X;N) (i.e., the local neighbors provide the most information about X), and, furthermore, that I(X;P,C) ≪ I(X;N,P) < I(X;N,C). At this point, the results lead us to conclude that intra-band models capture most of the dependencies between curvelet coefficients, with marginal gains for the more complex intra-band/inter-scale or intra-band/inter-orientation models.
4 Image Coding In this section, we target a potential application of the curvelet transform, namely coding. In particular, we describe how the statistical coefficient dependencies investigated in the previous section have been exploited in the design of a competitive curvelet-based image compression scheme. Furthermore, we show that the proposed codec is comparable with JPEG2000 [13] and in some cases even outperforms it.
Fig. 6. Context models and associated neighborhoods for (left) curvelet MV subbands, and (right) wavelet subbands
Table 4. The coding gain obtained using the proposed context models versus those of JPEG2000, for a few images
                     Lena     Barbara   Seismic
Average gain (dB)    0.1345   0.0804    0.0938
Max gain (dB)        0.2887   0.2562    0.1318
The architecture of our scheme follows the general structure of a transform-based codec. Thus, at the encoder, a forward decomposition concentrates the energy of the signal in a few coefficients, followed by quantization, coding of the quantized coefficients to a set of symbols, and finally entropy coding. In the final stage, the scheme performs a context-based entropy coding that is steered by some parameters from the coding step. In the following, we will focus on the encoding of the transform coefficients and on the context models of the entropy coder, respectively. First, let us recall that the results of section 3 show that intra-band modeling of the curvelet transform captures most of the dependencies between curvelet coefficients, with marginal gains for the more complex intra-band/inter-scale or intra-band/inter-orientation models. In view of these observations, we choose to adopt in this paper an intra-band coding strategy, wherein we encode the quantized curvelet coefficients using a 2D variant of the QuadTree-Limited (QT-L) codec of [14]. Furthermore, we have shown in Fig. 5 the shape of the curvelet intra-band neighborhood, for the set of coefficients exhibiting the highest dependencies. Based on these findings, we have designed context models for the curvelet transform, for the MH and MV subbands, respectively. The models have been derived using a training set of 9 representative images. An example of the associated coefficient neighborhoods is depicted in Fig. 6, for a) curvelet MV subbands, and b) wavelet subbands (i.e., as employed in the context models of JPEG2000 [13]). The coding
gains (i.e., the gains in PSNR) obtained using the proposed anisotropic context models versus the JPEG2000 models are shown in Table 4, for a few images. The gains are expressed here as averages over an extensive range of bit-rates. It can be seen from this table that the proposed models bring a considerable gain in compression performance compared to the context models of JPEG2000.
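As an illustration of how such an anisotropic neighborhood can be turned into a context label for the entropy coder, the sketch below weights the significance of the immediate horizontal neighbors most heavily, then the second horizontal and the immediate vertical ones, following the ordering of Fig. 5. The grouping and weights are hypothetical and do not reproduce the trained models of the paper.

```python
import numpy as np

def anisotropic_context(significance, r, c):
    """Illustrative context label for a coefficient at (r, c) in a curvelet
    MV subband.  'significance' is a 2-D 0/1 array assumed to be zero-padded
    by two samples on each side; the grouping below is an assumption."""
    h1 = significance[r, c - 1] + significance[r, c + 1]   # immediate horizontal
    h2 = significance[r, c - 2] + significance[r, c + 2]   # second horizontal
    v1 = significance[r - 1, c] + significance[r + 1, c]   # immediate vertical
    # base-3 combination so that horizontally adjacent significance dominates
    return min(2, h1) * 9 + min(2, h2) * 3 + min(2, v1)
```

A JPEG2000-style model would instead combine the eight immediate neighbors; the point of an anisotropic variant is to spend context states where Table 1 indicates the strongest dependencies.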
Fig. 7. For “finger”, the rate-distortion results obtained using the proposed scheme and JPEG2000, respectively
Fig. 8. For “seismic”, the rate-distortion results obtained using the proposed scheme and JPEG2000, respectively
Finally, we illustrate in Fig. 7 and Fig. 8, for the “finger” and “seismic” images of the JPEG2000 test set, the rate-distortion curves obtained using our curvelet-based coding scheme and JPEG2000, respectively. It can be seen that at the targeted rates
the proposed scheme is comparable with JPEG2000 and, moreover, in some cases outperforms it. These results are all the more important if we note that the transform employed has an oversampling factor of over 5.8. In this sense, to the best of our knowledge, this is the first work that shows that the high redundancy typical of the new geometric transforms [1-4] is not necessarily an impediment for coding applications, and that a correct exploitation of the dependencies that exist between the transform coefficients can lead to competitiveness with respect to the JPEG2000 standard and its critically-sampled wavelets [13].
5 Conclusions This paper reports an information-theoretic analysis of the dependencies that exist between curvelet coefficients. We show that strong dependencies exist in local intra-band micro-neighborhoods, and that the shape of these neighborhoods is highly anisotropic. Specifically, we find that the two immediately adjacent neighbors that are located orthogonal to the orientation of the subband convey the most information about the coefficient. Moreover, taking into account a larger local neighborhood set brings only mild gains with respect to intra-band mutual information estimations. Furthermore, we point out that, unlike the case of wavelets [8], linear predictors do not represent sufficient statistics if applied to the entire intra-band neighborhood of a coefficient. Instead, such predictors should be used for a local neighborhood that does not include the two mentioned coefficients. Regarding inter-orientation dependencies, we observe that these strongly depend on the direction; in this sense, it is shown that the set of most significant predictors contains only three coefficients. We conclude that intra-band dependencies are clearly the strongest, followed by their inter-orientation and inter-scale counterparts; the more complex intra-band/inter-scale or intra-band/inter-orientation models bring only mild improvements. Finally, we exploit the coefficient dependencies in a curvelet-based image coding application and show that the proposed scheme is comparable with JPEG2000 [13] and in some cases even outperforms it.
References 1. Candès, E.J., Donoho, D.: New Tight Frames of Curvelets and Optimal Representations of Objects with Piecewise C2 Singularities. Comm. Pure Appl. Math 57, 219–266 (2004) 2. Do, M.N., Vetterli, M.: Contourlets. In: Welland, G.V (ed.) Beyond Wavelets, Academic Press, London (2003) 3. Le Pennec, E., Mallat, S.: Sparse Geometric Image Representations with Bandelets. IEEE Transactions on Image Processing 14, 423–438 (2005) 4. Candès, E.J., Donoho, D.: Ridgelets: a key to higher-dimensional intermittency. Phil. Trans. R. Soc. Lond. A. 357, 2495–2509 (1999) 5. Mallat, S.: A Theory for Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 674– 693 (1989)
6. Buccigrossi, R.W., Simoncelli, E.P.: Image Compression via Joint Statistical Characterization in the Wavelet Domain. IEEE Transactions on Image Processing 8, 1688–1701 (1999) 7. Simoncelli, E.P.: Modeling the joint statistics of images in the wavelet domain. SPIE 44th Annual Meeting, Denver, CO. (1999) 8. Liu, J., Moulin, P.: Information-Theoretic Analysis of Interscale and Intrascale Dependencies between Image Wavelet Coefficients. IEEE Transactions on Image Processing 10, 1647–1658 (2001) 9. Candès, E.J., Demanet, L., Donoho, D.L., Ying, L.: Fast Discrete Curvelet Transforms. Applied and Computational Mathematics, California Institute of Technology (2005) 10. Alecu, A., Munteanu, A., Pizurica, A., Philips, W., Cornelis, J., Schelkens, P.: Information-Theoretic Analysis of Dependencies between Curvelet Coefficients. In: IEEE International Conference on Image Processing (ICIP), Atlanta, GA, USA, IEEE, Los Alamitos (2006) 11. Po, D.D.-Y., Do, M.N.: Directional multiscale modeling of images using the contourlet transform. IEEE Transactions on Image Processing (to appear) 12. Darbellay, G.A., Vajda, I.: Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory 45, 1315–1321 (1999) 13. Taubman, D., Marcelin, M.W.: JPEG2000: Image Compression Fundamentals, Standards, and Practice. Kluwer Academic Publishers, Norwell, Massachusetts (2002) 14. Schelkens, P., Munteanu, A., Barbarien, J., Galca, M., Giro-Nieto, X., Cornelis, J.: Wavelet Coding of Volumetric Medical Datasets. IEEE Transactions on Medical Imaging 22, 441–458 (2003)
A Novel Image Compression Method Using Watermarking Technique in JPEG Coding Process Hideo Kuroda, Shinichi Miyata, Makoto Fujimura, and Hiroki Imamura Nagasaki University, 1-14 bunkyou-machi, Nagasaki, 852-8521, Japan
Abstract. Watermarking is a technique used to embed copyright information in an image. In this paper, we propose a novel image compression method which embeds a part of the coding parameters, instead of the copyright information, into the image itself. The proposed method is adapted to the JPEG coding process. In the proposed method for JPEG, the DC coefficients of the DCT are embedded into low-to-middle frequency terms of the AC coefficients. Therefore, the DC coefficients need not be transmitted separately, which results in less data being needed for encoding. On the decoder side, the data for the DC coefficients embedded in the AC coefficients is first extracted. After this, the data of the DC and AC coefficients allows for the reconstruction of the image. Experiments on the relation between compression ratio and PSNR, using the quantization scale factor as a parameter, are carried out. The experimental results show that the proposed method achieves a 3.65% reduction of the quantity of image data, compared with the standard JPEG method, while maintaining nearly the same image quality.
1
Introduction
Recently, we have witnessed a boom in the use of digital cameras, including cellular phone cameras, together with personal computers and the internet. With it, the number of people who transmit pictures taken by themselves over the internet has greatly increased. This has led to an increase in two important demands on coded image data. One is the assertion of copyright on one's own pictures. The other is better image compression, either for storing large numbers of images or for transmitting images efficiently. For the former, namely the assertion of copyright, there are digital watermarking techniques. Watermarking techniques, which embed copyright information in an image, have two kinds of important requirements. One is robust watermarking, which means that the embedded information remains even after attempts to tamper with the image data. The other is that good picture quality is maintained even after embedding copyright information in the image. Much research has been carried out on these points [1]-[6]. For the latter, namely image compression, there is also a large body of research [7]-[11]. In this paper, we propose a novel image compression method using a watermarking technique in the widely used JPEG coding process. This paper is organized as follows: in Sections 2 and 3, watermarking and JPEG
are presented, respectively. In Section 4, the proposed method is discussed, and experiments and experimental results are presented in Section 5.
2
Watermarking
Watermarking is a technique to embed copyright information in digital content. In research on watermarking techniques, there are two important issues. One of them is robustness. Embedded copyright information may be attacked through illegal use of the content by a third person, without the owner's permission. Such attacks can disturb the extraction of the correct information, so that the owner could be unable to claim copyright after the attack. Therefore, it is important that the embedded copyright information can still be extracted correctly. The other issue is image quality. The image information is changed when copyright information is embedded, so the image quality is lowered by the embedding. It is important that the image quality does not degrade too much. For these reasons, there has been research on improving robustness against attacks on the embedded information without degrading image quality [1]-[6]. Research on watermarking techniques is closely related to several image coding techniques, including the wavelet transform [4], the DCT [3],[4] and JPEG [5]. On the other hand, there is the technique of steganography, which aims at concealing the communication itself [6]. In watermarking, copyright information is embedded in an image; in steganography, secret text is embedded. As mentioned above, different kinds of information can be embedded for each purpose. In this paper, a part of the coding parameters is used as the embedded information.
3
JPEG
JPEG is an internationally standardized image compression algorithm and file format, and it is widely used. The flow of a JPEG coding system is shown in Fig. 1. On the coder side, first, an input image (in YUV format) is divided into 8 × 8 blocks of pixels. The pixel values of each block are transformed by the DCT,
[Block diagram of the JPEG coding system: DCT, then quantization; the DC path goes through DPCM and Huffman coding, the AC path through Huffman coding; both are multiplexed into the output data. The decoder applies the inverse steps (inverse multiplexing, Huffman decoding, inverse quantization, inverse DPCM, inverse DCT) to produce the output picture.]
Fig. 1. Flow of JPEG coding system
and a DC coefficient and 63 AC coefficients are obtained for each block, so that DCT blocks of size 8 × 8 are constructed. Secondly, each DCT coefficient is quantized according to the widely used quantization matrix of Fig. 2. By multiplying the quantization matrix by a quantization scale factor, the compression ratio is controlled: when the quantization scale factor is large, the compression is high, and when it is small, the compression is low. An example of quantized coefficients is shown in Fig. 3.
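The following Python sketch illustrates this quantization step. The matrix shown is the standard JPEG luminance table, which is assumed to correspond to the "wide-use" matrix of Fig. 2; the 128 level shift and the orthonormal DCT are implementation choices for the sketch, not details taken from the paper.

```python
import numpy as np
from scipy.fftpack import dct

# Standard JPEG luminance quantization matrix (assumed to match Fig. 2).
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def quantize_block(block, scale=1.0):
    """Forward 8x8 DCT followed by quantization with Q scaled by the
    quantization scale factor; a larger scale gives coarser quantization."""
    coeffs = dct(dct(block - 128.0, axis=0, norm='ortho'), axis=1, norm='ortho')
    return np.round(coeffs / (Q * scale)).astype(int)
```

Dividing by Q * scale and rounding is the step that discards information; a larger scale factor zeroes out more of the high-frequency AC coefficients.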
Fig. 2. Wide-use quantization matrix
Fig. 3. Examples of quantized coefficients
Fig. 4. Zigzag scanning
In the high-frequency coefficients of DCT blocks, the value "0" occurs quite often. This means that run-length coding is well suited to the high-frequency coefficients when the zigzag scanning of Fig. 4 is used: the "0" run-lengths of the AC coefficients are typically longer under zigzag scanning, which improves the efficiency of JPEG compression. The amount of image information is thus reduced by quantization, zigzag scanning and run-length coding. Thirdly, the DC terms and AC terms are processed in different ways. First, we explain the processing of the DC terms. The DC coefficients of adjacent DCT blocks are strongly correlated. Therefore, the JPEG coding system takes the difference values of the DC coefficients between adjacent DCT blocks, and these difference values are encoded using Huffman codes. By this process, the quantity of DC information is reduced. The widely used Huffman code tables (Table 1, Table 2) are used for the DC coefficients; similarly, a widely used Huffman table is used for the AC coefficients. Finally, the Huffman-encoded DC and AC coefficients and the header information are multiplexed by a multiplexer. On the decoder side, the coded data are input to the inverse multiplexer, and the output image is then reconstructed by a process inverse to the encoding process.
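For concreteness, a small sketch of the zigzag scan and of the run-length grouping it enables is given below. It is an illustration only; the actual JPEG standard additionally uses end-of-block and 16-zero-run symbols, which are omitted here.

```python
import numpy as np

def zigzag_order(n=8):
    """Return the (row, col) visiting order of an n x n block under the
    zigzag scan of Fig. 4 (anti-diagonals, alternating direction)."""
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order

def run_lengths_of_zeros(block):
    """Zigzag-scan a quantized 8x8 block and return (zero_run, value) pairs
    for the non-zero AC coefficients, as used by JPEG run-length coding."""
    scan = [block[r, c] for r, c in zigzag_order(len(block))][1:]  # skip DC
    pairs, run = [], 0
    for v in scan:
        if v == 0:
            run += 1
        else:
            pairs.append((run, int(v)))
            run = 0
    return pairs
```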
4
Proposed Method
Our proposed method uses a watermarking technique to achieve higher efficiency than the JPEG method. In the proposed method, a part of the coefficients of each DCT block is embedded into other coefficients of the block, instead of copyright information.
Table 1. Wide-use Huffman code table for DC coefficients

Data                              Huffman code
-2047,...,-1024, 1024,...,2047    111111110
-1023,...,-512, 512,...,1023      11111110
-511,...,-256, 256,...,511        1111110
-255,...,-128, 128,...,255        111110
-127,...,-64, 64,...,127          11110
-63,...,-32, 32,...,63            1110
-31,...,-16, 16,...,31            110
-15,...,-8, 8,...,15              101
-7,...,-4, 4,...,7                100
-3,-2,2,3                         011
-1,1                              010
0                                 00
Table 2. Wide-use table of additional bits appended to the Huffman codes of Table 1 for DC coefficients

Data                              Additional bits
-2047,...,-1024, 1024,...,2047    00000000000,...,01111111111, 10000000000,...,11111111111
-1023,...,-512, 512,...,1023      0000000000,...,0111111111, 1000000000,...,1111111111
-511,...,-256, 256,...,511        000000000,...,011111111, 100000000,...,111111111
-255,...,-128, 128,...,255        00000000,...,01111111, 10000000,...,11111111
-127,...,-64, 64,...,127          0000000,...,0111111, 1000000,...,1111111
-63,...,-32, 32,...,63            000000,...,011111, 100000,...,111111
-31,...,-16, 16,...,31            00000,...,01111, 10000,...,11111
-15,...,-8, 8,...,15              0000,...,0111, 1000,...,1111
-7,...,-4, 4,...,7                000,...,011, 100,...,111
-3,-2,2,3                         00,01,10,11
-1,1                              0,1
0                                 none
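A compact way to express the two tables in code is the usual size-category scheme: the Huffman code selects the number of additional bits, and the additional bits identify the value within the category (positive values directly, negative values via the pattern shown in Table 2). The sketch below only illustrates that convention and is not the full JPEG entropy coder.

```python
# Huffman codes of Table 1, indexed by size category (= number of additional bits).
DC_CODES = ['00', '010', '011', '100', '101', '110', '1110', '11110',
            '111110', '1111110', '11111110', '111111110']  # categories 0..11

def encode_dc_difference(diff):
    """Encode a DC difference as category Huffman code + additional bits,
    following the convention of Tables 1 and 2 (sketch only)."""
    if diff == 0:
        return DC_CODES[0]
    category = diff.bit_length() if diff > 0 else (-diff).bit_length()
    # additional bits: the value itself for positive differences, and the
    # complemented pattern of Table 2 for negative differences
    value = diff if diff > 0 else diff + (1 << category) - 1
    return DC_CODES[category] + format(value, '0{}b'.format(category))
```

For example, encode_dc_difference(3) yields '01111', the five bits embedded for the second block in the example of Fig. 9.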
Fig. 5 shows the flow of the proposed method. In Fig. 5, all blocks except the thick-line blocks are the same as the blocks of the JPEG coding system (Fig. 1); hereafter, we explain only the thick-line blocks. On the coder side, the data for the DC coefficient is embedded into the AC coefficients. Because the data for the DC coefficient is extracted from the AC coefficients at the decoder side, the data for the DC coefficients does not need to be transmitted separately by the coder. In the JPEG method, the multiplexed DC and AC data are output; in the proposed method, only the AC information is output. At the decoder side, the DC coefficients that were embedded at the coder side are extracted from the AC coefficients, and the DCT blocks are then reconstructed from the DC coefficient and the AC coefficients. In this way, the JPEG method needs the information of both the DC and AC terms, whereas our proposed method needs only the AC terms. Therefore, the quantity of image data obtained by the proposed method is smaller than that obtained by JPEG.
[Block diagram of the proposed method: the JPEG pipeline of Fig. 1 with two additional (thick-line) blocks, "Embedding DC into AC coefficients" at the encoder and "Extracting DC from AC coefficients" at the decoder; only the AC path is multiplexed into the output data.]
Fig. 5. Flow of proposed method
Next, the embedding domain and the embedding procedure are explained. As shown in Fig. 3, the high-frequency coefficients of DCT blocks contain many zeros. Therefore, if the values of the high-frequency coefficients are changed from "0" to "1" by the embedding, the quantity of output data becomes large. A diagonal band of low-to-middle frequency AC coefficients is selected as the embedding domain by many watermarking techniques [1][3][4], so we also choose a diagonal band of low-to-middle frequency coefficients. Preliminary experiments were carried out to select the coefficient positions for embedding. We measured the PSNR of the reconstructed image when 1 is added, as assumed noise, to the value of one of the coefficients, whose position is given in zigzag scanning order. The experimental results are shown in Fig. 6 for Lenna and Fig. 7 for Mandrill. The horizontal axis shows the position in zigzag scanning order; the vertical axis shows the PSNR. The characteristics of Fig. 6 and Fig. 7 are quite similar, and the characteristics for Airplane, Barbara and Earth are quite similar as well. The positions where the influence of embedding is small are similar for the five test images. If these positions are selected as the embedding domain, the image quality is not affected very much by the embedding. On the other hand, Fig. 8 shows the values of the quantization matrix. The horizontal axis shows the position in zigzag scanning order; the vertical axis shows the values of the quantization matrix. In order to compare the characteristics of Fig. 6, Fig. 7 and Fig. 8 easily, the vertical axis of Fig. 8 is turned upside down. The characteristics of Fig. 6, Fig. 7 and Fig. 8 are also very similar to each other. Therefore, the influence of embedding on image quality is closely related to the values of the quantization matrix: when positions with small quantization matrix values are selected as embedding positions, the influence on image quality is small. The proposed method embeds the Huffman-encoded difference value of the DC coefficients between adjacent DCT blocks into the AC terms of the same block. From Table 1 and Table 2, this embedded information requires from two to twenty bits, so at most twenty bits per DCT block must be embedded in the proposed method. Therefore, the embedding positions are chosen as the 1st-14th, 16th-20th and 24th positions in zigzag scanning order, which correspond to small values of the quantization matrix in Fig. 2. Next, the embedding method is explained. The quantized coefficients at the embedding positions are modified to even or odd numbers; namely, if the
Fig. 6. PSNR of the reconstructed image when the value 1 is added as assumed noise to one of the coefficients, whose position is given in zigzag scanning order (Lenna)
Fig. 7. PSNR of the reconstructed image when the value 1 is added as assumed noise to one of the coefficients, whose position is given in zigzag scanning order (Mandrill)
Fig. 8. Values of wide-use quantization matrix
embedding data bit is "0", the coefficient is modified to be even, and if it is "1", to be odd. If the embedding data bit is "0" and the coefficient is already even, the coefficient is not modified; likewise, if the embedding data bit is "1" and the coefficient is already odd, the coefficient is not modified. When a coefficient must be modified, one is subtracted from its absolute value: for example, "58" is modified to "57" and "-54" is modified to "-53". If the value of the coefficient is zero, one is added to it. An embedding example (the two leftmost blocks) is shown in Fig. 9. The difference values of the DC coefficients are taken between horizontally adjacent DCT blocks. If the block is on the left edge, there is no difference value; instead, the Huffman-encoded DC coefficient of the block itself is embedded. The Huffman-encoded "127" (= 1111011111111) is embedded in the first block, and the Huffman-encoded "3" (= 01111), which is the difference value of the DC coefficients, is embedded in the second block. Thus, the original information of the DC coefficients is no longer transmitted separately. At the decoder side, when extracting the embedded information, the Huffman code (Table 1) indicates the length of the embedded data. The image quality is degraded by the embedding. We take into account the degradation of image quality caused by the embedding and calculate the compression ratio of the proposed method; the experiments and experimental results are given in Section 5.
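The parity rule above can be sketched as follows (Python, illustrative only). The embedding positions are those given in the text; the bit-string interface is an assumption, and the corresponding changes to the run-length/Huffman coding of the modified AC coefficients are not shown.

```python
# Zigzag positions used for embedding: the 1st-14th, 16th-20th and 24th AC
# coefficients in zigzag order, as selected in the text.
EMBED_POSITIONS = list(range(1, 15)) + list(range(16, 21)) + [24]

def embed_bits(zigzag_coeffs, bits):
    """Embed a bit string (at most 20 bits) into quantized coefficients given
    in zigzag order, by forcing even parity for '0' and odd parity for '1'.
    zigzag_coeffs[0] is the DC term and is left untouched."""
    coeffs = list(zigzag_coeffs)
    for pos, bit in zip(EMBED_POSITIONS, bits):
        c = coeffs[pos]
        if (abs(c) % 2) == int(bit):            # parity already matches
            continue
        # subtract one from the absolute value, or add one if the value is zero
        coeffs[pos] = 1 if c == 0 else (c - 1 if c > 0 else c + 1)
    return coeffs

def extract_bits(zigzag_coeffs, nbits):
    """Read back the embedded bits from the coefficient parities."""
    return ''.join(str(abs(zigzag_coeffs[pos]) % 2)
                   for pos in EMBED_POSITIONS[:nbits])
```

Since extraction only needs the coefficient parities, the decoder can read the Huffman prefix bit by bit against Table 1 to learn how many additional bits to extract.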
5
Experiment and Experimental Results
The test images are Airplane, Barbara, Earth, Lenna and Mandrill; 8-bit grayscale Y signals are used. Because we want to count the information in the header, we
Fig. 9. Examples of embedding the DC value into the AC terms of the DCT coefficients for the first and second blocks
use the command "cjpeg", and the U, V signals are set to zero. The Lenna and Mandrill test images are shown in Fig. 10 and Fig. 11, respectively. We measured the PSNR and the sizes of the compressed image data, for quantization scale factors in the range from 0.100 to 2.500, for both the proposed and the JPEG methods. The experimental results are shown in Fig. 12 for Lenna and Fig. 13 for Mandrill. The horizontal axis shows the PSNR values, and the vertical axis shows the image data size, measured in bytes. The solid line corresponds to the proposed method, and the dotted line to the JPEG method. The region for which the solid line is below the dotted line is the application domain of our proposed method. When the quantization scale factor is small, in other words when the compression ratio is low, the proposed method is superior to the JPEG method in compression efficiency. However, as the quantization scale factor becomes larger, the JPEG method becomes superior to the proposed method. By comparing Fig. 13 with Fig. 12, we notice that the application domain of the proposed method is wider for Mandrill than for Lenna. This happens because the Mandrill image contains a lot of high-frequency content.
Fig. 10. Original image(Lenna)
Fig. 11. Original image(Mandrill)
[Fig. 12 plot: quantity of image data (bytes) versus PSNR (dB) for the proposed method and JPEG. Marked points: A (Proposed, quantization scale factor 0.100), A' (JPEG, 0.100), B (Proposed, 0.473), B' (JPEG, 0.520), C (Proposed, 2.500), C' (JPEG, 2.500); the plot indicates the domain of the proposed method and the domain of the JPEG method.]
Fig. 12. Comparison between proposed method and JPEG method(Lenna)
Next, we evaluate the compression efficiency when the PSNR is roughly the same for the proposed method and the JPEG method. Table 3 shows the improvement in compression ratio of the proposed method over the JPEG method, calculated by Eq. (1). Our proposed method improves the compression ratio by 3.65% on average, compared to the JPEG method.

Improvement in compression ratio = (N_JPEG - N_P) / N_JPEG    (1)

For Lenna, for instance, (35992 - 34493) / 35992 ≈ 4.16%, in agreement with Table 3.
[Fig. 13 plot: quantity of image data (bytes) versus PSNR (dB) for the proposed method and JPEG (Mandrill). Marked points: A (Proposed, quantization scale factor 0.100), A' (JPEG, 0.100), B (Proposed, 1.068), B' (JPEG, 1.112), C (Proposed, 2.500), C' (JPEG, 2.500); the plot indicates the domain of the proposed method and the domain of the JPEG method.]
Fig. 13. Comparison between proposed method and JPEG method(Mandrill)
Table 3. Improvements on compression ratio

Test images   Method     Quantization    PSNR    Quantity of         Difference in       Improvement in
                         scale factor    [dB]    image data [Byte]   image data [Byte]   compression ratio [%]
Airplane      Proposed   0.142           40.62   37467               1501                3.85
              JPEG       0.149           40.61   38968
Barbara       Proposed   0.142           39.88   49289               1947                3.80
              JPEG       0.147           39.88   51236
Earth         Proposed   0.137           42.44   35953               1508                4.03
              JPEG       0.146           42.44   37461
Lenna         Proposed   0.163           40.81   34493               1499                4.16
              JPEG       0.174           40.82   35992
Mandrill      Proposed   0.119           40.20   58283               1442                2.41
              JPEG       0.124           40.20   59725
Average                                                                                  3.65
where N_JPEG and N_P denote the sizes, in bytes, of the image data obtained using the JPEG method and our proposed method, respectively. Furthermore, we measured the PSNR values when the compressed image data size is roughly the same for the proposed method and the JPEG method. Table 4 shows the improvement in PSNR of our proposed method compared to the JPEG method. The proposed method improves the PSNR by 1.77 dB on average compared to the JPEG method.
Table 4. Improvements on PSNR
Test images   Method     Quantization    PSNR    Quantity of         Improvement in
                         scale factor    [dB]    image data [Byte]   PSNR [dB]
Airplane      Proposed   0.130           40.94   38037               1.26
              JPEG       0.185           39.68   38041
Barbara       Proposed   0.111           41.25   50542               2.36
              JPEG       0.174           38.89   50546
Earth         Proposed   0.137           42.44   35953               1.22
              JPEG       0.203           41.22   35954
Lenna         Proposed   0.137           41.52   34953               1.44
              JPEG       0.205           40.08   34953
Mandrill      Proposed   0.110           40.50   58380               2.57
              JPEG       0.176           37.93   58402
Average                                                              1.77

6 Conclusions
We have proposed a novel coding method, using a watermarking technique, that disposes of the DC term by embedding the DC information into the AC terms. The Huffman-encoded difference values of the DC coefficients of adjacent DCT blocks are embedded. The proposed method has achieved a 3.65% reduction in the quantity of image data, compared with the JPEG method, while maintaining nearly the same image quality. For low compression ratios, our proposed method shows good results, but the compression efficiency becomes worse as the compression ratio is raised. In our future work, we will investigate the possibility of widening the domain of application of the proposed method. We are also investigating the application of the technique to JPEG2000, MPEG, H.264/AVC, fractal coding and vector quantization.
References 1. Miller, M.L., Doerr, G.J., Cox, I.J.: Applying Informed Coding and Embedding to Design a Robust High-Capacity Watermark. IEEE Transactions On Image Processing 13(6), 792–807 (2004) 2. Kutter, M., Bhattacharjee, S.K., Ebrahimi, T.: Towards Second Generation Watermarking Schemes. In: 1999 International Conference on Image Processing, vol. 1, pp. 320–323 (1999) 3. Hernandez, J.R., Amado, M., Perez-Gonzalez, F.: DCT-Domain Watermarking Techniques for Still Images: Detector Performance Analysis and a New Structure. IEEE Transactions On Image Processing 9(1), 55–68 (2000) 4. Nikolaidis, A., Pitas, I.: Asymptotically Optimal Detection for Additive Watermarking in the DCT and DWT Domains. IEEE Transactions On Image Processing 12(5), 563–571 (2003) 5. Iwata, M., Miyake, K., Shiozaki, A.: Digital Watermarking Method to embed Indes Data into JPEG Images. IEICE Trans. Fundamentals E85-A(10), 2267–2271 (2002) 6. Iwata, M., Miyake, K., Shiozaki, A.: Digital Steganography Utilizing Features of JPEG Images. IEICE Trans Fundamentals E87-A(4), 929–936 (2002)
7. ITU T.81, Information Technology - Digital Compression and Coding of Continuous - Tone Still Images - Requirements and Guidelines, http://www.w3.org/ Graphics/JPEG/itu-t81.pdf 8. Martin, M.B., Bell, A.E.: New Image Compression Techniques Using Multiwavelets and Multiwavelet Packets. IEEE Transactions On Image Processing 10(4), 500–510 (2001) 9. Lee, K., Kim, D.S.: Regression-Based Prediction for Blocking Artifact Reduction in JPEG-Compressed Images. IEEE Transactions On Image Processing 14(1), 36–48 (2005) 10. Wu, X., Dumitrescu, S., Zhang, N.: On Multirate Optimality of JPEG2000 Code Stream. IEEE Transactions On Image Processing 14(12) (2005) 11. Lee, Y.-L., Han, K.-H., Sullivan, G.J.: Improved Lossless Intra Coding for H.264/MPEG-4 AVC. IEEE Transactions On Image Processing 15(9) (2006)
Improved Algorithm of Error-Resilient Entropy Coding Using State Information Yong Fang1,2, Gwanggil Jeon1, Jechang Jeong1, Chengke Wu2, and Yangli Wang2 1
Department of Electronic and Communication Engineering, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul, Korea
[email protected] 2 National Key Lab. on ISN, Xidian University, Xi’an, China
Abstract. This paper proposes an improved algorithm of error-resilient entropy coding (EREC) to limit error propagation (EP) in variable-length-coding (VLC) bit streams. The main novelties are twofold. First, after each stage of the EREC encoding process, the resulting states of all slots and blocks are conveyed as side information and used at the decoder to remove the EP caused by those erroneous blocks/slots that have been placed-up/filled-up. Second, the alternate placement (AP) technique is proposed to alleviate the EP caused by those erroneous blocks/slots that are still partially-placed/partially-filled. An in-depth analysis shows that less than three bits per block are required for conveying the state information. Experiments are conducted and the results show that our proposed method improves the recovery quality significantly.
1 Introduction Several video compression standards, such as H.26x and MPEG-x, have been proposed in the past decade for storage and communication purposes. However, as the video data are highly compressed, they become sensitive to errors caused by unreliable transmission channels. Often, error propagation (EP) occurs, meaning that an error at any position of the bit stream disables not only the decoding of the codeword that contains it, but also of the following ones, until a synchronization symbol is met. Furthermore, a reconstruction error in a single pixel sample will affect all the samples that are directly or indirectly predicted from it, thus leading to video quality degradation. A large number of methods have been proposed to avoid EP, and these methods can be roughly clustered into three classes: data inserting (DI), data embedding (DE) and structure modifying (SM) (see Table 1). Both the DI and the DE work by conveying redundancy information, explicitly or implicitly, to enhance error resilience. When the DI is used, extra bandwidth is required to convey the redundancy information explicitly, so that coding efficiency is decreased to some extent. When the DE is used, the redundancy information is embedded implicitly into source bits of lesser importance, so that recovery quality is unavoidably degraded to some extent. Differing from the DI and the DE, the SM works simply by modifying the structure of the bit stream to enhance error resilience. The main drawback of the SM is that higher computational complexity is often required. Below, a brief review of related work on this issue is given.
Inserting resynchronization markers (RMs) periodically or adaptively [1] is the simplest and most effective method for enhancing error resilience, but it introduces a lot of redundancy and increases the bit-rate rapidly. For example, each group-of-blocks (GOB) startcode in H.263 spends at least 31 bits and is obviously an expensive effort. The reversible VLC (RVLC) scheme [2] is capable of achieving unique decoding in both the forward and reverse directions of the bit stream. According to related reports, RVLC sacrifices 1.5%~12% of coding efficiency for motion vectors (MVs) [3] and DCT coefficients [4], compared to the traditional VLC. The partial backward decodable bit stream (PBDBS) scheme [5] works by reversing part of the bit stream, so that the reversed part can be decoded backward when forward decoding is disabled. The PBDBS is somewhat similar to RVLC but does not lead to a loss in coding efficiency. However, both RVLC and PBDBS cannot rescue data in between the first and last errors when more than one error is present in the same packet. The widely used data partitioning (DP) [6] works by splitting macroblock (MB) headers, motion vectors (MVs), direct-current (DC) coefficients and alternating-current (AC) coefficients into different segments and hence allows each segment to be isolated from errors or erasures in other segments. The combination of the DP with VLC codeword reordering [7] can be used to obtain a pseudo-embedded bit stream. The main idea of VLC codeword reordering is to extract the first VLC codewords from each block, then the second VLC codewords from each block, etc. Like the DP, VLC codeword reordering only alleviates the impact of EP rather than removing it. On the other hand, EREC [8], converting the traditional VLC blocks of data into fixed-length slots, allows the decoder to synchronize the bit stream at the start of each EREC slot. The major drawbacks of EREC are the lack of a guarantee of frame spatial synchronization and the requirement of highly protected auxiliary information. Recently, researchers have proposed several error-resilient methods based on data embedding techniques, which were originally proposed for watermarking, steganalysis, etc. They applied data embedding schemes to establish another covert channel for transmitting important information that enhances error resilience without increasing the bit-rate significantly. In [9], for each MB in I-frames, its data length is embedded into the least-significant-bits (LSBs) of the DC coefficients of its upper MB; for each non-skipped MB in P-frames, its data length and the skipped run before it are embedded into selected AC coefficients of its "host" MB (with at least one non-zero AC coefficient) by a modulo-2 operation. Besides more or less quality degradation, another obvious drawback of the DE is that errors in the embedded data may be propagated. For example, if the data length information of one MB is corrupted, its following MBs until the next RM are in fact desynchronized. What is worse, because many bits are required to represent data lengths and skipped runs (for example, in [9], 12 bits are used to represent the data length of each intra-coded MB), the embedded data become highly susceptible to channel errors. For both the DI and the DE, a key issue is what kind of auxiliary information should be conveyed to enhance error resilience effectively at as low a cost as possible. As pointed out above, conveying the data lengths of MBs and skipped runs is obviously an uneconomical choice. Hence, in this paper, we propose to combine the SM (more specifically, EREC) with the DI/DE. Instead of data lengths and skipped runs, the states
Hence, in this paper, we propose to combine the SM (more specifically, EREC) with the DI/DE. Instead of data lengths and skipped runs, the states of slots (filled-up or not) and blocks (placed-up or not) during the EREC encoding process are conveyed to assist the decoding at the receiver.
This paper is organized as follows. Section 2 gives a brief review of the original EREC, with emphasis on the decoding process. In Section 3, we define the states of blocks and slots and then analyze the cost of the state information in detail. In Section 4, an example of our proposed method is given and the alternate placement (AP) technique is described. In Section 5, experiments are presented to evaluate the performance of the proposed method. Finally, we conclude the paper in Section 6.

Table 1. Main approaches to limiting EP

Approaches      DI                       DE                       SM
Disadvantages   Lower coding efficiency  Worse recovery quality   Higher complexity
Examples        Inserting RM, RVLC       Embedding data lengths   DP, VLC codeword reordering, PBDBS, EREC
2 Review of EREC

EREC works by converting a VLC bit stream into a fixed-length structure. An EREC frame is composed of N fixed-length slots that transmit N variable-length blocks of data. The EREC encoding process includes N stages. The first stage allocates each block to one slot: starting from the beginning of each block, as many bits as possible are placed within its corresponding slot. In subsequent stages, each block with bits still to be placed searches for slots with space remaining. At stage n, block i searches slot j (j = (i + f(n)) mod N, where f(n) is a predefined offset sequence). If the searched slot has space available, as many bits as possible are placed within it. Fig. 1 shows a simple example of the encoding process with four imaginary blocks of lengths 2, 8, 5 and 8. The offset sequence in this example is {0, 3, 2, 1}.
In the absence of channel errors, the EREC decoding process is simply the inverse of the encoding process. Here, we are interested in the performance of EREC-structured bit streams over noisy channels. In the example of Fig. 1, neither block 1 nor block 3 is longer than its corresponding slot, so they can be decoded independently of the other blocks and hence are free from EP. However, the decoding of the other two blocks is more involved. At stage 2, block 2 accesses slot 1 and all of its remaining bits are placed into the space of slot 1, so these remaining bits can be decoded only after block 1 has been decoded correctly. Similarly, block 4 can be decoded correctly only after the other three blocks have been decoded correctly. Clearly, both block 2 and block 4 still suffer from EP even when EREC is used. Although not a single bit of block 4 is placed into slot 3 at stage 2, block 4 still cannot be decoded correctly if block 3 is erroneous: if block 3 is erroneous, the decoder fails to find the end of block 3 and hence has no way of knowing whether any bits of block 4 have been placed into slot 3. Therefore, the decoder cannot find the correct position at which to continue the decoding of block 4. Based on this analysis, we can conclude that:
(1) Any erroneous slot (filled-up or not) will abort the decoding of those partially decoded blocks that access it.
(2) Correspondingly and symmetrically, any erroneous block (placed-up or not) will "contaminate" those partially filled slots that it accesses.
Obviously, better results can be expected if the decoder is told the states of the slots (filled-up or not) and blocks (placed-up or not) in advance. In the example of Fig. 1, during the decoding of block 4, if the decoder knows in advance that both slot 3 and slot 2 have been filled up after stage 1, it will skip them directly, regardless of whether they are erroneous or not. Based on this idea, the rest of this paper proposes an improved EREC algorithm that aims to avoid EP by conveying the state information of blocks and slots.
Fig. 1. Example of the EREC encoding process
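As a concrete illustration (not taken from the paper), the following Python sketch reproduces an EREC placement of the kind shown in Fig. 1. The slot lengths 6, 6, 5 and 6 are an assumption: they are one possible even split of the 23 bits of the four blocks and give the behaviour described in the text (slot 3 is exactly filled at stage 1).

```python
def erec_encode(block_lens, slot_lens, offsets):
    """Sketch of EREC bit placement: returns (stage, block, slot, bits) records."""
    N = len(block_lens)
    remaining = list(block_lens)          # bits of each block still to place
    free = list(slot_lens)                # free bits left in each slot
    placements = []
    for stage, off in enumerate(offsets, start=1):
        for i in range(N):
            if remaining[i] == 0:
                continue
            j = (i + off) % N             # slot searched by block i at this stage
            if free[j] > 0:
                put = min(remaining[i], free[j])
                remaining[i] -= put
                free[j] -= put
                placements.append((stage, i + 1, j + 1, put))
    return placements

# Four blocks of 2, 8, 5 and 8 bits and offset sequence {0, 3, 2, 1} as in Fig. 1;
# the slot lengths 6, 6, 5, 6 are an assumed even split of the 23 bits.
for rec in erec_encode([2, 8, 5, 8], [6, 6, 5, 6], [0, 3, 2, 1]):
    print("stage %d: block %d -> slot %d (%d bits)" % rec)
```

Running the sketch lists, for each stage, which block places how many bits into which slot; only blocks 2 and 4 need the later stages, matching the discussion above.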
3 State Information of Blocks and Slots

To reach our goal, three problems must be solved. First, what kind of state information (SI) should be conveyed? Second, how much does the state information cost? Third, how should the state information be conveyed? In this section, we first set up a model that answers the first problem and then analyze the model to answer the second. Finally, we present several Huffman tables as an answer to the third. In addition, we propose to convey the state information robustly using an EREC structure.

3.1 Model Setup

During the EREC encoding process, each block may be placed-up or not, so we define two states for blocks: the 'C' state (completely placed) and the 'P' state (partially placed). Correspondingly, we define two states for slots: the 'C' state (completely filled) and the 'P' state (partially filled). Obviously, before encoding, all blocks and slots must be in the 'P' state; after encoding, all blocks and slots must be in the 'C' state.
For both blocks and slots, the state transfer is unidirectional, i.e., only the transfer 'P' -> 'C' is possible. We call (X, Y) the joint state, where X ('C' or 'P') represents the state of a block and Y ('C' or 'P') represents the state of the corresponding slot. The possible values of (X, Y) are ('P', 'P'), ('C', 'P'), ('P', 'C') and ('C', 'C'). Only when a 'P' block accesses a 'P' slot can a bit placement happen; that is, only when (X, Y) = ('P', 'P') can a state transfer happen. In addition, after each bit placement, either the block is placed-up or the slot is filled-up; that is, when (X, Y) = ('P', 'P'), a state transfer must happen. The joint state ('P', 'P') may be transferred to any of the other three joint states. Fig. 2 shows all possible state transfers.
Fig. 2. State transfers
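To see how these transfers arise in practice, the following sketch (an illustration, not the authors' code) logs the joint state reached after every bit placement for the same example as before, again assuming slot lengths 6, 6, 5 and 6. These logged symbols are exactly what our method conveys.

```python
def record_transfers(block_lens, slot_lens, offsets):
    """Log the joint-state transfer caused by every bit placement.
    A placement only happens when both block and slot are in state 'P'."""
    N = len(block_lens)
    remaining, free, transfers = list(block_lens), list(slot_lens), []
    for off in offsets:
        for i in range(N):
            if remaining[i] == 0:
                continue
            j = (i + off) % N
            if free[j] == 0:
                continue
            put = min(remaining[i], free[j])
            remaining[i] -= put
            free[j] -= put
            block_state = 'C' if remaining[i] == 0 else 'P'
            slot_state = 'C' if free[j] == 0 else 'P'
            transfers.append((block_state, slot_state))   # ('P','P') -> (X, Y)
    return transfers

print(record_transfers([2, 8, 5, 8], [6, 6, 5, 6], [0, 3, 2, 1]))
# [('C','P'), ('P','C'), ('C','C'), ('P','C'), ('C','P'), ('C','C')]
```

Here T = 6 transfers occur for N = 4 blocks, which lies inside the bound N <= T <= 2N - 1 derived next.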
3.2 Analysis of the Cost

The idea of conveying the states of all blocks and slots during the encoding process raises an obvious question: how much does the state information cost, i.e., how many bits are required to convey the states of all blocks and slots during the encoding process? At first glance, the cost seems huge: for an EREC frame of N blocks, 2N bits are needed to record the states of all blocks and slots after each stage (N bits for the blocks and N bits for the slots), so 2N² bits would be required in total. However, the following analysis shows that the cost is in fact much lower.
It has been clarified that a state transfer can happen only when (X, Y) = ('P', 'P'). If no state transfer happens, it is unnecessary to convey the resulting joint state after a bit placement; hence, the resulting joint state must be conveyed only when (X, Y) = ('P', 'P'). The problem is thus converted into: how many times does the joint state ('P', 'P') occur? As described above, after each bit placement at most one 'P' survives, so each bit placement converts one or two 'P's into 'C's. Before placement there are 2N 'P's in total; after placement, all 'P's have been converted into 'C's. Notice that the result of the last placement must be ('P', 'P') -> ('C', 'C'). Based on this analysis, the total number of state transfers (denoted T) must satisfy the following inequality:

N ≤ T ≤ 2N − 1.    (1)
Let Tcc be the number of times that the state transfer ('P', 'P') -> ('C', 'C') happens; Tpc and Tcp are defined similarly. Obviously, we have

Tpc = Tcp,
Tpc + Tcp + Tcc = T,
Tpc + Tcp + 2Tcc = 2N.    (2)
If any one of T, Tcc, Tpc and Tcp is known, the other three can be deduced from equation (2). Let Pcc be the probability that ('P', 'P') -> ('C', 'C') happens, i.e., Pcc = Tcc/T; Pcp and Ppc are defined similarly. The entropy of the state transfers is then

E = −(Pcc log Pcc + Pcp log Pcp + Ppc log Ppc).    (3)
Denoting the cost of the state information by C, it is obvious that C = ET. From equation (2) we have T + Tcc = 2N, so Pcc = (2N − T)/T. In addition, Ppc = Pcp = (1 − Pcc)/2 = (T − N)/T. Hence, we get

C = −[(2N − T) log(2N − T) + 2(T − N) log(T − N) − T log T].    (4)
This is a function of T. Letting C* be the peak value and T* the value of T at which C = C*, it is easy to obtain

C* ≈ 2.5N,  T* = (1 + 1/√2)N.    (5)
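Equations (4) and (5) can be checked numerically; the following sketch is only a verification aid, not part of the proposed algorithm.

```python
import math

def cost(T, N):
    # Equation (4), worst-case code length with log base 2
    return -((2*N - T) * math.log2(2*N - T)
             + 2*(T - N) * math.log2(T - N)
             - T * math.log2(T))

N = 1000
T_star = max(range(N + 1, 2*N - 1), key=lambda T: cost(T, N))
print(T_star / N, cost(T_star, N) / N)   # ~1.707 and ~2.54 bits per block
print(1 + 1 / math.sqrt(2))              # 1.7071..., matching equation (5)
```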
This means that, theoretically, at most about 2.5N bits are required to represent all the state information generated during encoding, i.e., about 2.5 bits per block on average. Of course, in practice it is often difficult to match the encoder to the statistical properties of the state information, but the practical cost is always lower than 3N. This is an encouraging result: in low-to-medium bit-rate applications, say 384 kbps, about 32 bits (384000/30/396) are required on average to code each MB, so the maximal redundancy rate of the state information is lower than 10%.
In fact, not all state transfers must be conveyed. First, it is unnecessary to convey the last state transfer, because it must be ('P', 'P') -> ('C', 'C'). Second, it is unnecessary to convey the state transfers of the last stage, because all of them must also be ('P', 'P') -> ('C', 'C'). Third, when only one 'P' block or only one 'P' slot remains, it is unnecessary to convey the following state transfers: if there is only one 'P' block, the following state transfers must be a series of ('P', 'P') -> ('P', 'C') ended by a ('P', 'P') -> ('C', 'C'); if there is only one 'P' slot, the situation is similar. Finally, for P-frames, the COD bit of each MB can be used to represent its state transfer at the first stage (if COD = 1, the state transfer must be ('P', 'P') -> ('C', 'P')). Therefore, the practical cost of the state information is usually much lower than 3N. As an example, we give three Huffman tables in Table 2. Let Ci be the cost of the i-th Huffman table; it is easy to prove that C1, C2 ≤ 3(N − 1) and C3 ≤ 4(N − 1).
Table 2. Example of Huffman tables for state transfers

Index     ('C', 'P')   ('P', 'C')   ('C', 'C')
Tab. 1        0            10           11
Tab. 2        10           0            11
Tab. 3        00           01           1
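As an illustration (not from the paper), the transfer sequence recorded by the sketch in Sect. 3.1 can be coded with Tab. 1. Dropping only the last transfer, which is always ('P', 'P') -> ('C', 'C'), the cost already stays within the 3(N − 1) bound; the further savings listed in Sect. 3.2 are ignored here.

```python
# Huffman Tab. 1 of Table 2; the last transfer need not be sent.
TAB1 = {('C', 'P'): '0', ('P', 'C'): '10', ('C', 'C'): '11'}

transfers = [('C', 'P'), ('P', 'C'), ('C', 'C'), ('P', 'C'), ('C', 'P'), ('C', 'C')]
si_bits = ''.join(TAB1[t] for t in transfers[:-1])
print(si_bits, len(si_bits))   # '01011100' 8  -> within 3(N - 1) = 9 bits for N = 4
```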
3.3 Transmission of SI and Removal of EP Inside Blocks
Now we answer the question of how to transmit the SI. We use the EREC algorithm itself to limit EP within the SI. The SI of each block (or slot) is assigned to one "state" slot (so called to distinguish it from the slots for video data), and the SI is then placed into the state slots using the EREC algorithm. Obviously, each state slot contains no more than 3 bits. As for the video data, the auxiliary information required to decode the SI is its total length C (or it is simply set to 3N for simplicity). The SI is conveyed along with the video data.
Although most of the EP between blocks can be removed with the help of the SI, the EP inside each block may still be serious. When EREC is used for video transmission, each MB is usually assigned to one slot. As is well known, each MB is composed of one header (including the MB mode, coded block pattern (CBP), MVs, and so on) and up to six 8x8 blocks. Obviously, the 8x8 blocks placed later are more likely to suffer from EP. To prevent errors from propagating between 8x8 blocks, we modify the MB structure using the VLC reordering technique [7].
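One possible reading of this transmission scheme is sketched below. The per-block SI strings are hypothetical, and the grouping of SI bits per block is an assumption made only for illustration; the point is that the SI overflowing a 3-bit state slot is spread over other state slots by the same placement rule used for the video data.

```python
# Sketch: the SI strings of the N blocks are packed into N fixed 3-bit
# "state" slots with the same EREC placement rule used for the video data.
def place(block_lens, slot_lens, offsets):
    N = len(block_lens)
    remaining, free, out = list(block_lens), list(slot_lens), []
    for stage, off in enumerate(offsets, 1):
        for i in range(N):
            j = (i + off) % N
            if remaining[i] and free[j]:
                put = min(remaining[i], free[j])
                remaining[i] -= put
                free[j] -= put
                out.append((stage, i + 1, j + 1, put))
    return out

si_per_block = ['011', '10110', '11', '0']   # hypothetical per-block SI strings
for stage, blk, slot, bits in place([len(s) for s in si_per_block],
                                    [3, 3, 3, 3], [0, 3, 2, 1]):
    print("SI of block %d: %d bit(s) into state slot %d (stage %d)"
          % (blk, bits, slot, stage))
```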
4 Example of Improved EREC Using State Information

To illustrate how the proposed algorithm works, we give an example in which the states of all slots and blocks during the encoding process are conveyed. Knowing the state information, the decoder can skip erroneous 'C' blocks/slots. However, although erroneous 'C' blocks/slots can be skipped intelligently with the help of the state information, erroneous 'P' blocks/slots may still abort the decoding process frequently. To alleviate the impact of erroneous 'P' blocks/slots, the alternate placement (AP) technique is proposed and also illustrated in this example.

4.1 Decoding Process
Fig. 3. Example of the encoding process of the proposed method

In the example of Fig. 3, block 1 and block 3 are obviously free from EP, so only block 2 and block 4 are discussed below.
We discuss block 2 first. At stage 2, block 2 accesses slot 1. Since slot 1 is the only 'P' slot (the state codeword is '0'), we can deduce that all remaining bits of block 2 must have been placed into slot 1. In addition, because the remaining bits of block 2 are placed backward, block 2 can certainly be decoded regardless of whether block 1 is erroneous or not. Hence, block 2 is also free from EP.
Then we discuss block 4. At stage 2, block 4 accesses slot 3. Since slot 3 is already in the 'C' state (the state codeword is '11'), the decoder skips slot 3 directly, regardless of whether it is erroneous or not. Similarly, at stage 3, the decoder skips slot 2 directly. At stage 4, block 4 accesses slot 1 and the situation becomes more complex. Here, a critical question is in which direction the remaining bits of block 4 have been placed into slot 1. If they are placed forward into slot 1, the decoding of block 4 depends on block 1; otherwise, it depends on block 2. Because block 2 is much longer than block 1, the risk of block 2 being corrupted is also much higher than that of block 1. Therefore, the direction of placement is critical for the decoding of block 4. This problem is dealt with in detail below.

4.2 Alternate Placement (AP)
It can be seen from the above example that if the remaining bits of 'P' blocks are placed into 'P' slots in an appropriate direction, the impact of erroneous 'P' blocks/slots can be alleviated dramatically. To achieve the best performance, we propose the alternate placement (AP) technique, whose main idea is as follows: for each 'P' slot, the direction of placement is reversed after each bit placement. The AP technique ensures that, at each stage, the risk that 'P' blocks suffer from EP is reduced to a minimum. For example, in Fig. 3, at stage 1, block 1 is placed into slot 1 forward; at stage 2, the remaining bits of block 2 are placed into slot 1 backward; and at stage 4, the remaining bits of block 4 are placed into slot 1 forward.
With the AP technique, all 'P' blocks that access the same slot are divided into two groups according to their direction of placement: a forward group and a backward group. Correspondingly, at each stage, 'P' slots are divided into three kinds: bidirectionally decodable slots, unidirectionally (forward or backward) decodable slots, and undecodable slots. Each group can be decoded independently of the other, and hence the EP between the two groups is removed. For example, in Fig. 3, the 'P' blocks that access slot 1 (block 1, block 2 and block 4) are divided into two groups: the forward group includes block 1 and block 4, while the backward group includes block 2. Block 1 and block 4 can be decoded independently of block 2 and vice versa.
As for those blocks that belong to the same group, the placement order is deterministic. The decoding of each block depends on all blocks placed earlier in the same group. For example, in Fig. 3, the decoding of block 4 depends on block 1.
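The bookkeeping behind AP can be sketched as follows. This is only an illustration under the same assumed slot lengths (6, 6, 5, 6) as in the earlier sketches, not the authors' implementation: a per-slot direction flag is flipped after every placement into that slot.

```python
def erec_encode_ap(block_lens, slot_lens, offsets):
    """Placement loop with alternate placement: the writing direction of every
    slot is flipped after each placement into that slot."""
    N = len(block_lens)
    remaining, free = list(block_lens), list(slot_lens)
    direction = ['forward'] * N                   # next direction used by each slot
    records = []
    for off in offsets:
        for i in range(N):
            if remaining[i] == 0:
                continue
            j = (i + off) % N
            if free[j] == 0:
                continue
            put = min(remaining[i], free[j])
            remaining[i] -= put
            free[j] -= put
            records.append((i + 1, j + 1, put, direction[j]))
            direction[j] = 'backward' if direction[j] == 'forward' else 'forward'
    return records

for blk, slot, bits, d in erec_encode_ap([2, 8, 5, 8], [6, 6, 5, 6], [0, 3, 2, 1]):
    print("block %d -> slot %d: %d bits, %s" % (blk, slot, bits, d))
# Slot 1 receives block 1 forward, block 2 backward and block 4 forward, so the
# forward group {1, 4} and the backward group {2} decode independently.
```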
5 Experiments and Results

Our experiments are performed on several QCIF video sequences at 10 fps (frames per second); the length of each sequence is 100 frames. The sequences are coded using a baseline H.263 encoder, and the resulting bit stream is used as the data source for EREC encoding. The group-of-pictures (GOP) structure is IPPP.... Each video frame is coded into exactly one EREC frame. The auxiliary information (the length of each EREC frame) is assumed to be error-free. Each MB is assigned to one slot. The average PSNR defined in equation (6) [10] is used to compare all schemes:

PSNR = (4 × PSNRY + PSNRCb + PSNRCr)/6.    (6)
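For completeness, the weighted average of equation (6) can be computed as below; the PSNR values used are illustrative numbers, not measured results.

```python
def average_psnr(psnr_y, psnr_cb, psnr_cr):
    """Equation (6): luma counts four times as much as each chroma component."""
    return (4 * psnr_y + psnr_cb + psnr_cr) / 6.0

print(average_psnr(32.0, 38.5, 39.1))   # 34.27 dB for these illustrative inputs
```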
In our experiments, the following six algorithms are compared:
1) O: the original EREC.
2) R: improved EREC using VLC reordering.
3) S: improved EREC using both SI and VLC reordering, without AP.
4) SA: the same as 3), but with AP.
5) SAEF: the same as 4), but with the SI assumed error-free.
6) NEP: an imaginary algorithm in which the data length of each MB is transmitted error-free at zero cost, so that no EP exists between MBs; VLC reordering is also used in NEP to limit EP inside each MB.
Among them, the S, SA and SAEF algorithms require extra bandwidth to transmit the SI. As pointed out above, the SI is coded in an EREC structure to limit EP. To decode the SI correctly, reliable transmission is required for its total length C; for simplicity, we just set C to 3N (i.e., 3 bits per MB). No channel coding is used, and only random bit errors are simulated in our experiments. No error concealment is used. For each MB, the decoding is aborted when an illegal VLC codeword is met or more than 64 TCOEFs are decoded in one 8x8 block. For the S, SA and SAEF algorithms, an MB is also considered to be in error if the decoding of the video bit stream contradicts the SI.
The results are plotted in Fig. 4. In these figures, the y-axis represents the average PSNR of the recovered sequences and the x-axis represents the bit count per MB. For the S, SA and SAEF algorithms, we simply shift the curves right by 3 bits to compensate for the increased bit rate. To ensure a fair comparison, 50 runs are conducted for each experiment.
The gain of SI is noticeable. At medium bit rates, the gain is about 0.5 dB (slightly more or less depending on the sequence). It is important to note that this gain is achieved on top of VLC reordering; if VLC reordering is not used, the gain of SI is much more significant. An important conclusion from the results is that, as the bit rate increases, the gain of SI first increases and then decreases. At low bit rates, the distortion is caused mainly by lossy source coding; as the bit rate increases, the proportion of SI becomes more and more trivial and the gain becomes more and more significant.
Fig. 4. RD comparisons between different methods, GOP = 50, BER = 1E–3. From top to bottom, the figures correspond to results concerning Foreman, Coastguard and News, respectively.
However, at high bit rates, the distortion is caused mainly by the EP inside each MB (due to the increased MB length). The aim of the SI is to limit the EP between MBs rather than the EP inside each MB; hence, the gain of SI decreases at high bit rates.
Only a slight gain is observed in our experiments when the AP is used. The main reason is that during EREC encoding the probability of (X, Y) = ('P', 'P') is very low (at most (N − 1)/N², except at the first stage), so the AP can rescue only a few bits. However, at high bit rates the gain of AP also becomes noticeable (up to 0.2 dB when QP = 2). The reason is deduced to be that at high bit rates the distribution of MB lengths fluctuates more strongly. During the EREC encoding process, the flatter the distribution of block lengths, the higher the proportion of bits placed at the first stage; a more strongly fluctuating distribution of MB lengths means that more bits are placed at the later stages, and hence more bits can be rescued by the AP.
It is also valuable to evaluate how serious the EP within the SI is. From the results, only a slight degradation (about 0.1 dB) is observed due to EP in the SI, which means that our algorithms are highly stable under different conditions. There are three reasons for this slight degradation. First, because only a few bits are required for coding the SI (fewer than 3 bits per MB), the risk of the SI being corrupted is low. Second, the EREC-structured SI is intrinsically robust to random errors. Third, as with the video data, SI placed at later stages is more likely to suffer from EP; since the video data placed at later stages is of less importance, the impact of EP in the SI is not so serious.
Now let us consider the performance in the absence of EP between MBs. For the Foreman sequence, the gap between 'SA' and 'NEP' is the smallest, only about 0.2 dB, while the gap for the News sequence is the largest, about 1 dB; the gap for the Coastguard sequence lies in between. The reason is deduced to be the difference in spatial complexity: the News sequence has higher spatial complexity (due to the string "MPEG4 WORLD") and a more strongly fluctuating distribution of MB lengths, so the risk of EP between MBs in its EREC frames is increased.
6 Conclusions

In this paper, a novel method for limiting EP in VLC bit streams has been presented, which works by combining the EREC algorithm with the DI/DE technique. Our main idea is to convey the states of all blocks and slots during the EREC encoding process. With the help of this state information, the decoder can skip erroneous 'C' slots/blocks intelligently. To validate the method, an in-depth analysis of the cost of the state information is given, which shows that fewer than three bits per block are required to convey all the state information generated during the EREC encoding process. As another novelty, we also propose the AP technique to further limit EP in the VLC bit stream; its main idea is to reverse the direction of placement after each bit placement, which alleviates the impact of erroneous 'P' slots/blocks. Experiments comparing the proposed methods with other available methods show noticeable improvements.
Acknowledgements

This work was sponsored by the BK21 project of Korea and the NSF of China under grant no. 60532060 in 2007. It was also partially supported by the ETRI SoC Industry Promotion Center, the Human Resource Development Project for IT SoC Architect, and the NSF of China under grant nos. 60672117 and 60607010.
References

1. Ferguson, T.J., Rabinowitz, J.H.: Self-Synchronizing Huffman Codes. IEEE T. Information Theory 30, 687–693 (1984)
2. Takishima, Y., Wada, M., Murakami, H.: Reversible Variable Length Codes. IEEE T. Communications 43, 158–162 (1995)
3. ISO/IEC JTC1/SC29/WG11/M2382: Report of Results on Core Experiments on Error Resilience for Motion Data with Structured RVLC-E8 (1997)
4. Li, A.H., Fong, M., Wen, J., Villasenor, J.D.: Test Results of Error Resilience with Modified Error Resilient Syntax with Data Partitioning and RVLC. ITU-T Rec. Q15-E-20 (1998)
5. Gao, S.S., Tu, G.F.: Robust H.263+ Video Transmission Using Partial Backward Decodable Bit Stream. IEEE T. CSVT 13, 182–187 (2003)
6. ISO/IEC 14496-2: Coding of Audio-Visual Objects, Part 2: Visual (MPEG-4 Visual Version 2) (1999)
7. Goshi, J., Mohr, A.E., Ladner, R.E., Riskin, E.A., Lippman, A.: Unequal Loss Protection for H.263 Compressed Video. IEEE T. CSVT 15, 412–419 (2005)
8. Redmill, D.W., Kingsbury, N.G.: The EREC: an Error-Resilient Technique for Coding Variable-Length Blocks of Data. IEEE T. Image Proc. 5, 565–574 (1996)
9. Lie, W.-N., Lin, T.-C., Lin, C.-W.: Enhancing Video Error Resilience by Using Data-Embedding Techniques. IEEE T. CSVT 16, 300–308 (2006)
10. Shyu, H.C., Leou, J.J.: Detection and Concealment of Transmission Errors in MPEG Images: a Genetic Algorithm Approach. IEEE T. CSVT 9, 937–948 (1999)
Author Index
Achard, Catherine 274 Ad´ an, Antonio 60 Aghajan, Hamid 97, 156, 310 Alecu, Alin 1049, 1061 Ambrosio, G. 920 Anti´c, Borislav 777 Arevalo, V. 920 Asimidis, Asimakis 543 Asvestas, Pantelis 497 Aul´ı-Llin` as, Francesc 1024 Bacauskiene, Marija 521 Bartrina-Rapesta, Joan 1024 Batouche, Mohamed 449 Benjelloun, M. 897 B´er´eziat, Dominique 955 Bergamaschi, Anna 543 Berthoumieu, Yannick 352 Bigand, Andr´e 943 Blanc-Talon, Jacques 132, 233 Blanco, J.L. 932 Borda, Monica 121 Bourennane, Salah 132, 233 Bri¨er, Peter 37 Bugeau, Aur´elie 628 Byun, Hae Won 417 Canchola, Sandra 406 Chabrier, S´ebastien 439 Chai, Young-Joon 732 Chang, Chung-Ching 156 Chaumette, Fran¸cois 1 Chen, Gencai 331 Chen, Ling 331 Cho, Woon 384 Chokchaitam, Somchart 1037 Colot, Olivier 943 Cook, Emily 543 Cornelis, Jan 1049, 1061 Crnojevi´c, Vladimir 777 Cyganek, Boguslaw 744 D’Orazio, T. 855 Dai, Qionghai 768 Darolti, Cristina 828
De Cock, Jan 652 de Haan, Gerard 461 De Neve, Wesley 699 De Schrijver, Davy 699 de With, Peter H.N. 285, 427, 675, 687 De Witte, Val´erie 640 De Wolf, Koen 699 del-Blanco, Carlos R. 990 Deriche, Mohamed 373 Dhondt, Yves 720 Diepold, Klaus 818 Diosi, Albert 1 Direko˘ glu, Cem 553 Distante, A. 855 Dizdaroglu, Bekir 509 Dom´ınguez, S. 25 Dong, Xiao 616 D¨ orfler, Nikolas 818 Dorval, Thierry 597 Dubuisson, S´everine 955 Economopoulos, Theodore El Abed, Abir 955 Enescu, Valentin 13 Esbrand, Colin 543 Faas, Frank G.A. 212 Fang, Yong 1084 Fant, Andrea 543 Farin, Dirk 427, 675 Florea, Corneliu 587 Florea, Laura 587 Fujimura, Makoto 1072 Gabayan, Kevin 97 Galindo, C. 920 Gangal, Ali 509 Garc´ıa, D. 25 Garc´ıa, Inmaculada 800 Garc´ıa, Narciso 990 Gautama, Sidharta 575 Gelzinis, Adas 521 Genovesio, Auguste 597 Georgiou, Harris 543 Gierl, Christian 909
497
Goebel, Peter Michael 84 G´ omez, Carlos 364 Gonz´ alez, J. 920, 932 Gonzalez-Barbosa, Jose-Joel 406 Gonz´ alez-Conejero, Jorge 1024 Goossens, Bart 190, 473 Gregori, Valent´ın 254 Griffiths, Jennifer 543 Guaragnella, C. 855 Guessoum, Zahia 449 Guillaume, Mireille 168 Hafiane, Adel 439 Hall, Geoff 543 Hashimoto, Hideo 711 Haugland, Oddmund 888 Havasi, L´ aszl´ o 968 He, Xiangjian 262 Hellicar, Andrew 242 Hintz, Tom 262 Hiraoka, Masaki 711 Hislop, Greg 242 Hofmann, Ulrich G. 828 Horv´ ath, P´eter 200 Hou, Yunshu 340 Hu, Hao 461 Huck, Alexis 168 Hurtado-Ramos, Juan B. 406 Huysmans, Toon 531, 607 Iakovidis, Dimitris K. 565 Ilse, Ravyse 340 Imamura, Hiroki 1072 Imamura, Kousuke 711 Iwahashi, Masahiro 1037 Jaureguizar, Fernando 990 Jeon, Gwanggil 810, 1084 Jeong, Jechang 810, 1084 Jiang, Jianmin 395 Jimenez-Hernandez, Hugo 406 Jones, John 543 Jonker, Pieter 37 Ju, Myung-Ho 322 Jung, Joel 789 Kang, Hang-Bong Kaspersen, Kristin Katz, Itai 97
322 543
Kavli, Tom 888 Kerre, Etienne E. 254, 640 Kim, Tae-Yong 732 Kim, Taekyung 384 Kirkhus, Trine 543, 888 Klein Gunnewiek, Rene 427 Kondo, T. 909 Kongprawechon, W. 909 Kuroda, Hideo 1072 Kwolek, Bogdan 144 Lambert, Peter 652 Laroche, Guillaume 789 Laurent, H´el`ene 439 Lavialle, Olivier 121 Leaver, James 543 Lee, Joohyun 810 Lee, Rokkyu 810 Lenseigne, Boris 597 Leo, M. 855 Letexier, Damien 233 Li, Gang 543 Li, Jianmin 262 Li, Ping 427 Li, Qiang 768 Liu, Xiaodong 768 Longo, Renata 543 L´ opez, Antonio 980 L´ opez, Manuel F. 800 Luong, Hiˆep 473 Mahmoudi, S. 897 Mai, Zhenhua 607 Makridis, Michael 877 Manthos, Nikos 543 Maroulis, Dimitris 565 Matsopoulos, George 497 Mazouzi, Smaine 449 M´egret, R´emi 352 M´elange, Tom 254, 640 Merch´ an, Pilar 60 Mertins, Alfred 828 Metaxas, Marinos G. 543 Michel, Fabien 449 Mikram, Mounia 352 Milgram, Maurice 274 Miyata, Shinichi 1072 Mokhber, Arash 274 Morb´ee, Marleen 663 Moreno, F.A. 932
Author Index Morillas, Samuel 254 Morvan, Yannick 675 Muhammad, Irfan 297 Munteanu, Adrian 1049, 1061 Mys, Stefaan 720 Nachtegael, Mike 640 Nappi, Michele 1002 Naseem, Imran 373 Ngan, King Ngi 178 Nieto, Marcos 840 Nikolaou, Nikos 877 Nixon, Mark S. 553 Notebaert, Stijn 652 Noy, Matthew 543 Nozick, Vincent 72 Ochoa, Daniel 575 Ogier, Arnaud 597 Oh, Duk-Won 732 Østby, Joar M. 543 Ozsavas, Emrah 48 Paik, Joonki 384 Pani, Silvia 543 Papamarkos, Nikos 877 Pari, L. 25 P´erez, Patrick 628 Pesquet-Popescu, B´eatrice 364, 789 Philips, Wilfried 190, 473, 640, 663 Phoojaruenchanachai, S. 909 Pinho, Romulo 531 Piˇzurica, Aleksandra 190, 640, 663, 1049, 1061 Ponsa, Daniel 980 Pop, Sorin 121 Popescu, Dan C. 242 Prades-Nebot, Josep 663 Pr´evot, R. 897 Qu, Xingtai
274
Ravyse, Ilse 13 Remazeilles, Anthony 1 Ren, Jinchang 395 Renard, N. 132 Riccio, Daniel 1002 Rico, G. 897 Roca, Antoni 663 Rosenberger, Christophe 439
Royle, Gary J. 543 Ruedin, Ana 221 Ruiz, Vicente Gonzalez
1099
800
Sahli, Hichem 13, 340 Saito, Hideo 72 Salas, Joaquin 406 Salgado, Luis 840, 990 Samet, Refik 48 S´ anchez, F.M. 25 Sarkis, Michel 818 Savelonas, Michalis A. 565 Sazlı, Murat H. 297 Schelkens, Peter 1049, 1061 Schulerud, Helene 543 Schulte, Stefan 254, 640 Schumann-Olsen, Henrik 888 Sebasti´ an, J.M. 25 ˇ Segvi´ c, Siniˇsa 1 Serra-Sagrist` a, Joan 1024 Shin, Seung-Ho 732 Shkvarko, Yuriy 109, 865 Sijbers, Jan 531, 607 Speller, Robert D. 543 Steinbuch, Maarten 37 Szir´ anyi, Tam´ as 968 Szl´ avik, Zolt´ an 968 Tehami, Samy 943 Telatar, Ziya 297, 1014 Terebes, Romulus 121 Theodoridis, Sergios 543 Thielemann, Jens T. 543, 888 Traslosheros, A. 25 Triantis, Frixos 543 Turchetta, Renato 543 Van de Walle, Rik 652, 699, 720 van der Stelt, Paul F. 543 Van Deursen, Davy 699 van Vliet, Lucas J. 212 Vazquez-Bautista, Rene 109 Venanzi, Cristian 543 Verikas, Antanas 521 Vermeirsch, Kenneth 720 Vertan, Constantin 587 Villalon-Turrubiates, Ivan 109, 865 Vincze, Markus 84 Vintimilla, Boris 575 Voos, H. 909
1100
Author Index
Wang, Jing-Wein 849 Wang, Yangli 1084 Westavik, Harry 888 Wijnhoven, Rob 285 Wu, Chen 310 Wu, Chengke 1084 Yang, Wenxian 178 Yavuz, Erkan 1014 Ye, Getian 756
Yildirim, M. T¨ ulin 485 Yong, Fang 810 Y¨ uksel, M. Emin 485 Zafarifar, Bahman 687 Zhang, Yanning 340 Zhao, Gangqiang 331 Zhao, Rongchun 340 Zheng, Guoyan 616 Zlokolica, Vladimir 640