Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5876
George Bebis Richard Boyle Bahram Parvin Darko Koracin Yoshinori Kuno Junxian Wang Renato Pajarola Peter Lindstrom André Hinkenjann Miguel L. Encarnação Cláudio T. Silva Daniel Coming (Eds.)
Advances in Visual Computing 5th International Symposium, ISVC 2009 Las Vegas, NV, USA, November 30 – December 2, 2009 Proceedings, Part II
Volume Editors
George Bebis, E-mail: [email protected]
Richard Boyle, E-mail: [email protected]
Bahram Parvin, E-mail: [email protected]
Darko Koracin, E-mail: [email protected]
Yoshinori Kuno, E-mail: [email protected]
Junxian Wang, E-mail: [email protected]
Renato Pajarola, E-mail: [email protected]
Peter Lindstrom, E-mail: [email protected]
André Hinkenjann, E-mail: [email protected]
Miguel L. Encarnação, E-mail: [email protected]
Cláudio T. Silva, E-mail: [email protected]
Daniel Coming, E-mail: [email protected]
Library of Congress Control Number: 2009939141
CR Subject Classification (1998): I.3, H.5.2, I.4, I.5, I.2.10, J.3, F.2.2, I.3.5
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-642-10519-X Springer Berlin Heidelberg New York
ISBN-13 978-3-642-10519-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12798896 06/3180 543210
Preface
It is with great pleasure that we present the proceedings of the 5th International Symposium on Visual Computing (ISVC 2009), which was held in Las Vegas, Nevada. ISVC offers a common umbrella for the four main areas of visual computing: vision, graphics, visualization, and virtual reality. Its goal is to provide a forum for researchers, scientists, engineers, and practitioners throughout the world to present their latest research findings, ideas, developments, and applications in the broader area of visual computing.

This year, the program consisted of 16 oral sessions, one poster session, 7 special tracks, and 6 keynote presentations. ISVC also hosted the Third Semantic Robot Vision Challenge. The response to the call for papers was very good; we received over 320 submissions for the main symposium, from which we accepted 97 papers for oral presentation and 63 papers for poster presentation. Special track papers were solicited separately through the Organizing and Program Committees of each track. A total of 40 papers were accepted for oral presentation and 15 papers for poster presentation in the special tracks.

All papers were reviewed with an emphasis on their potential to contribute to the state of the art in the field. Selection criteria included accuracy and originality of ideas, clarity and significance of results, and presentation quality. The review process was quite rigorous, involving two to three independent blind reviews followed by several days of discussion. During the discussion period we tried to correct anomalies and errors that might have existed in the initial reviews. Despite our efforts, we recognize that some papers worthy of inclusion may not have been included in the program. We offer our sincere apologies to authors whose contributions might have been overlooked.

We wish to thank everybody who submitted their work to ISVC 2009 for review. It was because of their contributions that we succeeded in having a technical program of high scientific quality. In particular, we would like to thank the ISVC 2009 Area Chairs, the organizing institutions (UNR, DRI, LBNL, and NASA Ames), the industrial sponsors (Intel, DigitalPersona, Equinox, Ford, Hewlett Packard, Mitsubishi Electric Research Labs, iCore, Toyota, Delphi, General Electric, Microsoft MSDN, and Volt), the International Program Committee, the special track organizers and their Program Committees, the keynote speakers, the reviewers, and especially the authors who contributed their work to the symposium. We would also like to thank Mitsubishi Electric Research Labs, Volt, Microsoft MSDN, and iCore for kindly sponsoring several “best paper awards” this year.

We sincerely hope that ISVC 2009 offered opportunities for professional growth and that you enjoy these proceedings.

September 2009
ISVC09 Steering Committee and Area Chairs
Organization
ISVC 2009 Steering Committee
Bebis George, University of Nevada, Reno, USA
Boyle Richard, NASA Ames Research Center, USA
Parvin Bahram, Lawrence Berkeley National Laboratory, USA
Koracin Darko, Desert Research Institute, USA
ISVC 2009 Area Chairs

Computer Vision
Kuno Yoshinori, Saitama University, Japan
Wang Junxian, Microsoft, USA
Computer Graphics
Pajarola Renato, University of Zurich, Switzerland
Lindstrom Peter, Lawrence Livermore National Laboratory, USA
Virtual Reality
Hinkenjann Andre, Bonn-Rhein-Sieg University of Applied Sciences, Germany
Encarnacao L. Miguel, Humana Inc., USA
Visualization
Silva Claudio, University of Utah, USA
Coming Daniel, Desert Research Institute, USA
Publicity
Erol Ali, Ocali Information Technology, Turkey
Local Arrangements
Mauer Georg, University of Nevada, Las Vegas, USA
Special Tracks
Porikli Fatih, Mitsubishi Electric Research Labs, USA
ISVC 2009 Keynote Speakers
Perona Pietro, California Institute of Technology, USA
Kumar Rakesh (Teddy), Sarnoff Corporation, USA
Davis Larry, University of Maryland, USA
Terzopoulos Demetri, University of California at Los Angeles, USA
Ju Tao, Washington University, USA
Navab Nassir, Technical University of Munich, Germany
ISVC 2009 International Program Committee (Area 1) Computer Vision Abidi Besma Aggarwal J.K. Agouris Peggy Argyros Antonis Asari Vijayan Basu Anup Bekris Kostas Belyaev Alexander Bhatia Sanjiv Bimber Oliver Bioucas Jose Birchfield Stan Bischof Horst Goh Wooi-Boon Bourbakis Nikolaos Brimkov Valentin Cavallaro Andrea Chellappa Rama Cheng Hui Chung, Chi-Kit Ronald Darbon Jerome Davis James W. Debrunner Christian Duan Ye El-Gammal Ahmed Eng How Lung Erol Ali Fan Guoliang Ferri Francesc Filev, Dimitar Foresti GianLuca
University of Tennessee, USA University of Texas, Austin, USA George Mason University, USA University of Crete, Greece Old Dominion University, USA University of Alberta, Canada University of Nevada at Reno, USA Max-Planck-Institut für Informatik, Germany University of Missouri-St. Louis, USA Weimar University, Germany Instituto Superior Tecnico, Lisbon, Portugal Clemson University, USA Graz University of Technology, Austria Nanyang Technological University, Singapore Wright State University, USA State University of New York, USA Queen Mary, University of London, UK University of Maryland, USA Sarnoff Corporation, USA The Chinese University of Hong Kong, Hong Kong UCLA, USA Ohio State University, USA Colorado School of Mines, USA University of Missouri-Columbia, USA University of New Jersey, USA Institute for Infocomm Research, Singapore Ocali Information Technology, Turkey Oklahoma State University, USA Universitat de Valencia, Spain Ford Motor Company, USA University of Udine, Italy
Fukui Kazuhiro Galata Aphrodite Georgescu Bogdan Gleason, Shaun Guerra-Filho Gutemberg Guevara, Angel Miguel Guerra-Filho Gutemberg Hammoud Riad Harville Michael He Xiangjian Heikkilä Janne Heyden Anders Hou Zujun Imiya Atsushi Kamberov George Kampel Martin Kamberova Gerda Kakadiaris Ioannis Kettebekov Sanzhar Kim Tae-Kyun Kimia Benjamin Kisacanin Branislav Klette Reinhard Kokkinos Iasonas Kollias Stefanos Komodakis Nikos Kozintsev, Igor Lee D.J. Li Fei-Fei Lee Seong-Whan Leung Valerie Li Wenjing Liu Jianzhuang Little Jim Ma Yunqian Maeder Anthony Makris Dimitrios Maltoni Davide Maybank Steve McGraw Tim Medioni Gerard Melenchón Javier Metaxas Dimitris Miller Ron
The University of Tsukuba, Japan The University of Manchester, UK Siemens, USA Oak Ridge National Laboratory, USA University of Texas Arlington, USA University of Porto, Portugal University of Texas Arlington, USA Delphi Corporation, USA Hewlett Packard Labs, USA University of Technology, Australia University of Oulu, Finland Lund University, Sweden Institute for Infocomm Research, Singapore Chiba University, Japan Stevens Institute of Technology, USA Vienna University of Technology, Austria Hofstra University, USA University of Houston, USA Keane inc., USA University of Cambridge, UK Brown University, USA Texas Instruments, USA Auckland University, New Zeland Ecole Centrale Paris, France National Technical University of Athens, Greece Ecole Centrale de Paris, France Intel, USA Brigham Young University, USA Princeton University, USA Korea University, Korea Kingston University, UK STI Medical Systems, USA The Chinese University of Hong Kong, Hong Kong University of British Columbia, Canada Honyewell Labs, USA CSIRO ICT Centre, Australia Kingston University, UK University of Bologna, Italy Birkbeck College, UK West Virginia University, USA University of Southern California, USA Universitat Oberta de Catalunya, Spain Rutgers University, USA Ford Motor Company, USA
Mirmehdi Majid Monekosso Dorothy Mueller Klaus Mulligan Jeff Murray Don Nachtegael Mike Nait-Charif Hammadi Nefian Ara Nicolescu Mircea Nixon Mark Nolle Lars Ntalianis Klimis Papadourakis George Papanikolopoulos Nikolaos Pati Peeta Basa Patras Ioannis Petrakis Euripides Peyronnet Sylvain Pinhanez Claudio Piccardi Massimo Pietikäinen Matti Porikli Fatih Prabhakar Salil Prati Andrea Prokhorov Danil Qian Gang Raftopoulos Kostas Reed Michael Regazzoni Carlo Remagnino Paolo Ribeiro Eraldo Robles-Kelly Antonio Ross Arun Salgian Andrea Samal Ashok Sato Yoichi Samir Tamer Sarti Augusto Schaefer Gerald Scalzo Fabien Shah Mubarak Shi Pengcheng
Bristol University, UK Kingston University, UK SUNY Stony Brook, USA NASA Ames Research Center, USA Point Grey Research, Canada Ghent University, Belgium Bournemouth University, UK NASA Ames Research Center, USA University of Nevada, Reno, USA University of Southampton, UK The Nottingham Trent University, UK National Technical University of Athens, Greece Technological Education Institute, Greece University of Minnesota, USA First Indian Corp., India Queen Mary University, London, UK Technical University of Crete, Greece LRDE/EPITA, France IBM Research, Brazil University of Technology, Australia LRDE/University of Oulu, Finland Mitsubishi Electric Research Labs, USA DigitalPersona Inc., USA University of Modena and Reggio Emilia, Italy Toyota Research Institute, USA Arizona State University, USA National Technical University of Athens, Greece Blue Sky Studios, USA University of Genoa, Italy Kingston University, UK Florida Institute of Technology, USA National ICT Australia (NICTA), Australia West Virginia University, USA The College of New Jersey, USA University of Nebraska, USA The University of Tokyo, Japan Ingersoll Rand Security Technologies, USA DEI, Politecnico di Milano, Italy Aston University, UK University of Rochester, USA University of Central Florida, USA The Hong Kong University of Science and Technology, Hong Kong
Shimada Nobutaka Ritsumeikan University, Japan Singh Meghna University of Alberta, Canada Singh Rahul San Francisco State University, USA Skurikhin Alexei Los Alamos National Laboratory, USA Souvenir, Richard University of North Carolina - Charlotte, USA Su Chung-Yen National Taiwan Normal University, Taiwan Sugihara Kokichi University of Tokyo, Japan Sun Zehang eTreppid Technologies, USA Syeda-Mahmood Tanveer IBM Almaden, USA Tan Kar Han Hewlett Packard, USA Tan Tieniu Chinese Academy of Sciences, China Tavares, Joao Universidade do Porto, Portugal Teoh Eam Khwang Nanyang Technological University, Singapore Thiran Jean-Philippe Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland Trucco Emanuele University of Dundee, UK Tsechpenakis Gabriel University of Miami, USA Tubaro Stefano DEI, Politecnico di Milano, Italy Uhl Andreas Salzburg University, Austria Velastin Sergio Kingston University London, UK Verri Alessandro Università di Genova, Italy Wang Song University of South Carolina, USA Wang Yunhong Beihang University, China Webster Michael University of Nevada, Reno, USA Wolff Larry Equinox Corporation, USA Wong Kenneth The University of Hong Kong, Hong Kong Xiang Tao Queen Mary, University of London, UK Xu Meihe University of California at Los Angeles, USA Yang Ruigang University of Kentucky, USA Yi Lijun SUNY at Binghampton, USA Yu Ting GE Global Research, USA Yuan Chunrong University of Tübingen, Germany Zhang Yan Delphi Corporation, USA Zhang Yongmian eTreppid Technologies, USA (Area 2) Computer Graphics Abram Greg Agu Emmanuel Andres Eric Artusi Alessandro Baciu George Barneva Reneta Bartoli Vilanova Anna
IBM T.J. Watson Reseach Center, USA Worcester Polytechnic Institute, USA Laboratory XLIM-SIC, University of Poitiers, France Warwick University, UK Hong Kong PolyU, Hong Kong State University of New York, USA Eindhoven University of Technology, The Netherlands
Belyaev Alexander Benes Bedrich Bilalis Nicholas Bimber Oliver Bohez Erik Bouatouch Kadi Brimkov Valentin Brown Ross
Max-Planck-Institut für Informatik, Germany Purdue University, USA Technical University of Crete, Greece Weimar University, Germany Asian Institute of Technology, Thailand University of Rennes I, IRISA, France State University of New York, USA Queensland University of Technology, Australia Callahan Steven University of Utah, USA Chen Min University of Wales Swansea, UK Cheng Irene University of Alberta, Canada Chiang Yi-Jen Polytechnic Institute of New York University, USA Choi Min University of Colorado at Denver, USA Comba Joao Univ. Fed. do Rio Grande do Sul, Brazil Cremer Jim University of Iowa, USA Damiand Guillaume SIC Laboratory, France Debattista Kurt University of Warwick, UK Deng Zhigang University of Houston, USA Dick Christian Technical University of Munich, Germany DiVerdi Stephen Adobe, USA Dingliana John Trinity College, Ireland El-Sana Jihad Ben Gurion University of The Negev, Israel Entezari Alireza University of Florida, USA Fiorio Christophe Université Montpellier 2, LIRMM, France Floriani Leila De University of Genova, Italy Gaither Kelly University of Texas at Austin, USA Gotz David IBM, USA Gooch Amy University of Victoria, Canada Gu David State University of New York at Stony Brook, USA Guerra-Filho Gutemberg University of Texas Arlington, USA Hadwiger Markus VRVis Research Center, Austria Haller Michael Upper Austria University of Applied Sciences, Austria Hamza-Lup Felix Armstrong Atlantic State University, USA Han JungHyun Korea University, Korea Hao Xuejun Columbia University and NYSPI, USA Hernandez Jose Tiberio Universidad de los Andes, Colombia Huang Zhiyong Institute for Infocomm Research, Singapore Joaquim Jorge Instituto Superior Técnico, Portugal Ju Tao Washington University, USA
Julier Simon J. Kakadiaris Ioannis Kamberov George Kazhdan Misha Kim Young Klosowski James Kobbelt Leif Lai Shuhua Lakshmanan Geetika Lee Chang Ha Lee Tong-Yee Levine Martin Lewis Bob Li Frederick Linsen Lars Lok Benjamin Loviscach Joern Magnor Marcus Majumder Aditi Mantler Stephan Martin Ralph McGraw Tim Meenakshisundaram Gopi Mendoza Cesar Metaxas Dimitris Myles Ashish Nait-Charif Hammadi Noma Tsukasa Oliveira Manuel M. Ostromoukhov Victor M. Pascucci Valerio Peters Jorg Qin Hong Razdan Anshuman Redon Stephane Reed Michael Renner Gabor Rushmeier, Holly Sander Pedro
University College London, UK University of Houston, USA Stevens Institute of Technology, USA Johns Hopkins University, USA Ewha Womans University, Korea IBM, USA RWTH Aachen, Germany Virginia State University, USA IBM T.J. Watson Reseach Center, USA Chung-Ang University, Korea National Cheng-Kung University, Taiwan McGill University, Canada Washington State University, USA University of Durham, UK Jacobs University, Germany University of Florida, USA Fachhochschule Bielefeld (University of Applied Sciences), Germany TU Braunschweig, Germany University of California, Irvine, USA Technical University of Vienna, Austria Cardiff University, UK West Virginia University, USA University of California-Irvine, USA NaturalMotion Ltd., USA Rutgers University, USA University of Florida, USA University of Dundee, UK Kyushu Institute of Technology, Japan Univ. Fed. do Rio Grande do Sul, Brazil University of Montreal, Canada University of Utah, USA University of Florida, USA State University of New York at Stony Brook, USA Arizona State University, USA INRIA, France Columbia University, USA Computer and Automation Research Institute, Hungary Yale University, USA The Hong Kong University of Science and Technology, Hong Kong
Sapidis Nickolas Sarfraz Muhammad Scateni Riccardo Schaefer Scott Sequin Carlo Shead Tinothy Sorkine Olga Sourin Alexei Stamminger Marc Su Wen-Poh Staadt Oliver Tan Kar Han Teschner Matthias Umlauf Georg Wald Ingo Weyrich Tim Wimmer Michael Wylie Brian Wyman Chris Yang Ruigang Ye Duan Yi Beifang Yin Lijun Yoo Terry Yuan Xiaoru Zhang Eugene Zordan Victor
University of the Aegean, Greece Kuwait University, Kuwait University of Cagliari, Italy Texas A&M University, USA University of California-Berkeley, USA Sandia National Laboratories, USA New York University, USA Nanyang Technological University, Singapore REVES/INRIA, France Griffith University, Australia University of Rostock, Germany Hewlett Packard, USA University of Freiburg, Germany University of Kaiserslautern, Germany University of Utah, USA University College London, UK Technical University of Vienna, Austria Sandia National Laboratory, USA University of Iowa, USA University of Kentucky, USA University of Missouri-Columbia, USA Salem State College, USA Binghamton University, USA National Institutes of Health, USA Peking University, China Oregon State University, USA University of California at Riverside, USA
(Area 3) Virtual Reality Alcañiz Mariano Arns Laura Behringer Reinhold Benes Bedrich Bilalis Nicholas Blach Roland Blom Kristopher Borst Christoph Brady Rachael Brega Jose Remo Ferreira Brown Ross Bruce Thomas Bues Matthias Chen Jian
Technical University of Valencia, Spain Purdue University, USA Leeds Metropolitan University, UK Purdue University, USA Technical University of Crete, Greece Fraunhofer Institute for Industrial Engineering, Germany University of Hamburg, Germany University of Louisiana at Lafayette, USA Duke University, USA Universidade Estadual Paulista, Brazil Queensland University of Technology, Australia The University of South Australia, Australia Fraunhofer IAO in Stuttgart, Germany Brown University, USA
Cheng Irene Coquillart Sabine Craig Alan Crawfis Roger Cremer Jim Figueroa Pablo Fox Jesse Friedman Doron Froehlich Bernd Gregory Michelle Gupta Satyandra K. Hachet Martin Haller Michael Hamza-Lup Felix Harders Matthias Hollerer Tobias Julier Simon J. Klinger Evelyne Klinker Gudrun Klosowski James Kozintsev, Igor Kuhlen Torsten Liere Robert van Lok Benjamin Luo Gang Majumder Aditi Malzbender Tom Mantler Stephan Meyer Joerg Molineros Jose Moorhead Robert Muller Stefan Paelke Volker Papka Michael Peli Eli Pettifer Steve Pugmire Dave Qian Gang Raffin Bruno Redon Stephane Reiners Dirk Richir Simon Rodello Ildeberto
University of Alberta, Canada INRIA, France NCSA University of Illinois at Urbana-Champaign, USA Ohio State University, USA University of Iowa, USA Universidad de los Andes, Colombia Stanford University, USA IDC, Israel Weimar University, Germany Pacific Northwest National Lab, USA University of Maryland, USA INRIA, France FH Hagenberg, Austria Armstrong Atlantic State University, USA ETH Zürich, Switzerland University of California at Santa Barbara, USA University College London, UK Arts et Metiers ParisTech, France Technische Universität München, Germany IBM T.J. Watson Research Center, USA Intel, USA RWTH Aachen University, Germany CWI, The Netherlands University of Florida, USA Harvard Medical School, USA University of California, Irvine, USA Hewlett Packard Labs, USA Technical University of Vienna, Austria University of California, Irvine, USA Teledyne Scientific and Imaging, USA Mississippi State University, USA University of Koblenz, Germany Leibniz Universität Hannover, Germany Argonne National Laboratory, USA Harvard University, USA The University of Manchester, UK Los Alamos National Lab, USA Arizona State University, USA INRIA, France INRIA, France University of Louisiana, USA Arts et Metiers ParisTech, France University of Sao Paulo, Brazil
Santhanam Anand Sapidis Nickolas Schmalstieg Dieter Schulze, Jurgen Slavik Pavel Sourin Alexei Stamminger Marc Srikanth Manohar Staadt Oliver Stefani Oliver Thalmann Daniel Varsamidis Thomas Vercher Jean-Louis Wald Ingo Yu Ka Chun Yuan Chunrong Zachmann Gabriel Zara Jiri Zyda Michael
MD Anderson Cancer Center Orlando, USA University of the Aegean, Greece Graz University of Technology, Austria University of California - San Diego, USA Czech Technical University in Prague, Czech Republic Nanyang Technological University, Singapore REVES/INRIA, France Indian Institute of Science, India University of Rostock, Germany COAT-Basel, Switzerland EPFL VRlab, Switzerland Bangor University, UK Université de la Méditerranée, France University of Utah, USA Denver Museum of Nature and Science, USA University of Tübingen, Germany Clausthal University, Germany Czech Technical University in Prague, Czech University of Southern California, USA
(Area 4) Visualization Andrienko Gennady Apperley Mark Avila Lisa Balázs Csébfalvi
Fraunhofer Institute IAIS, Germany University of Waikato, New Zealand Kitware, USA Budapest University of Technology and Economics, Hungary Bartoli Anna Vilanova Eindhoven University of Technology, The Netherlands Brady Rachael Duke University, USA Benes Bedrich Purdue University, USA Bilalis Nicholas Technical University of Crete, Greece Bonneau Georges-Pierre Grenoble University, France Brown Ross Queensland University of Technology, Australia Bühler Katja VRVIS, Austria Callahan Steven University of Utah, USA Chen Jian Brown University, USA Chen Min University of Wales Swansea, UK Cheng Irene University of Alberta, Canada Chiang Yi-Jen Polytechnic Institute of New York University, USA Chourasia Amit University of California - San Diego, USA Dana Kristin Rutgers University, USA Dick Christian Technical University of Munich, Germany
DiVerdi Stephen Doleisch Helmut Duan Ye Dwyer Tim Ebert David Entezari Alireza Ertl Thomas Floriani Leila De Fujishiro Issei Gotz David Grinstein Georges Goebel Randy Gregory Michelle Hadwiger Helmut Markus Hagen Hans Hamza-Lup Felix Heer Jeffrey Hege Hans-Christian Hochheiser Harry Hollerer Tobias Hong Lichan Hotz Ingrid Joshi Alark Julier Simon J. Kao David Kohlhammer Jörn Kosara Robert Laramee Robert Lee Chang Ha Lewis Bob Liere Robert van Lim Ik Soo Linsen Lars Liu Zhanping Ma Kwan-Liu Maeder Anthony Majumder Aditi Malpica Jose Masutani Yoshitaka Matkovic Kresimir McGraw Tim
Adobe, USA VRVis Research Center, Austria University of Missouri-Columbia, USA Monash University, Australia Purdue University, USA University of Florida, USA University of Stuttgart, Germany University of Maryland, USA Keio University, Japan IBM, USA University of Massachusetts Lowell, USA University of Alberta, Canada Pacific Northwest National Lab, USA VRVis Research Center, Austria Technical University of Kaiserslautern, Germany Armstrong Atlantic State University, USA Armstrong University of California at Berkeley, USA Zuse Institute Berlin, Germany Towson University, USA University of California at Santa Barbara, USA Palo Alto Research Center, USA Zuse Institute Berlin, Germany Yale University, USA University College London, UK NASA Ames Research Center, USA Fraunhofer Institut, Germany University of North Carolina at Charlotte, USA Swansea University, UK Chung-Ang University, Korea Washington State University, USA CWI, The Netherlands Bangor University, UK Jacobs University, Germany Kitware, Inc., USA University of California-Davis, USA CSIRO ICT Centre, Australia University of California, Irvine, USA Alcala University, Spain The University of Tokyo Hospital, Japan VRVis Forschungs-GmbH, Austria West Virginia University, USA
Melançon Guy Meyer Joerg Miksch Silvia Monroe Laura Moorhead Robert Morie Jacki Mueller Klaus Museth Ken Paelke Volker Papka Michael Pettifer Steve Pugmire Dave Rabin Robert Raffin Bruno Razdan Anshuman Rhyne Theresa-Marie Santhanam Anand Scheuermann Gerik Shead Tinothy Shen Han-Wei Sips Mike Slavik Pavel Sourin Alexei Theisel Holger Thiele Olaf Toledo de Rodrigo Tricoche Xavier Umlauf Georg Viegas Fernanda Viola Ivan Wald Ingo Wan Ming Weinkauf Tino Weiskopf Daniel Wischgoll Thomas Wylie Brian Yeasin Mohammed Yuan Xiaoru Zachmann Gabriel Zhang Eugene Zhukov Leonid
CNRS UMR 5800 LaBRI and INRIA Bordeaux Sud-Ouest, France University of California, Irvine, USA Vienna University of Technology, Austria Los Alamos National Labs, USA Mississippi State University, USA University of Southern California, USA SUNY Stony Brook, USA Linköping University, Sweden Leibniz Universität Hannover, Germany Argonne National Laboratory, USA The University of Manchester, UK Los Alamos National Lab, USA University of Wisconsin at Madison, USA Inria, France Arizona State University, USA North Carolina State University, USA MD Anderson Cancer Center Orlando, USA University of Leipzig, Germany Sandia National Laboratories, USA Ohio State University, USA Stanford University, USA Czech Technical University in Prague, Czech Republic Nanyang Technological University, Singapore University of Magdeburg, Germany University of Mannheim, Germany Petrobras PUC-RIO, Brazil Purdue University, USA University of Kaiserslautern, Germany IBM, USA University of Bergen, Norway University of Utah, USA Boeing Phantom Works, USA Courant Institute, New York University, USA University of Stuttgart, Germany Wright State University, USA Sandia National Laboratory, USA Memphis University, USA Peking University, China Clausthal University, Germany Oregon State University, USA Caltech, USA
ISVC 2009 Special Tracks

1. 3D Mapping, Modeling and Surface Reconstruction

Organizers
Nefian Ara, Carnegie Mellon University/NASA Ames Research Center, USA
Broxton Michael, Carnegie Mellon University/NASA Ames Research Center, USA
Huertas Andres, NASA Jet Propulsion Lab, USA
Program Committee
Hancher Matthew, NASA Ames Research Center, USA
Edwards Laurence, NASA Ames Research Center, USA
Bradski Garry, Willow Garage, USA
Zakhor Avideh, University of California at Berkeley, USA
Cavallaro Andrea, Queen Mary, University of London, UK
Bouguet Jean-Yves, Google, USA
2. Object Recognition

Organizers
Andrea Salgian, The College of New Jersey, USA
Fabien Scalzo, University of Rochester, USA
Program Committee
Bergevin Robert, University of Laval, Canada
Leibe Bastian, ETH Zurich, Switzerland
Lepetit Vincent, EPFL, Switzerland
Matei Bogdan, Sarnoff Corporation, USA
Maree Raphael, Universite de Liege, Belgium
Nelson Randal, University of Rochester, USA
Qi Guo-Jun, University of Science and Technology of China, China
Sebe Nicu, University of Amsterdam, The Netherlands
Tuytelaars Tinne, Katholieke Universiteit Leuven, Belgium
Vedaldi Andrea, Oxford University, UK
Vidal-Naquet Michel, RIKEN Brain Science Institute, Japan
3. Deformable Models: Theory and Applications

Organizers
Terzopoulos Demetri, University of California, Los Angeles, USA
Tsechpenakis Gavriil, University of Miami, USA
Huang Xiaolei, Lehigh University, USA
Discussion Panel
Metaxas Dimitris (Chair), Rutgers University, USA

Program Committee
Angelini Elsa, Ecole Nationale Supérieure de Télécommunications, France
Breen David, Drexel University, USA
Chen Ting, Rutgers University, USA
Chen Yunmei, University of Florida, USA
Delingette Herve, INRIA, France
Delmas Patrice, University of Auckland, New Zealand
El-Baz Ayman, University of Louisville, USA
Farag Aly, University of Louisville, USA
Kimia Benjamin, Brown University, USA
Kambhamettu Chandra, University of Delaware, USA
Magnenat-Thalmann Nadia, University of Geneva, Switzerland
McInerney Tim, Ryerson University, Canada
Metaxas Dimitris, Rutgers University, USA
Palaniappan Kannappan, University of Missouri, USA
Paragios Nikos, Ecole Centrale de Paris, France
Qin Hong, Stony Brook University, USA
Salzmann Mathieu, UC Berkeley, USA
Sifakis Eftychios, University of California at Los Angeles, USA
Skrinjar Oskar, Georgia Tech, USA
Szekely Gabor, ETH Zurich, Switzerland
Teran Joseph, University of California at Los Angeles, USA
Thalmann Daniel, EPFL, Switzerland

4. Visualization-Enhanced Data Analysis for Health Applications

Organizers
Cheng Irene, University of Alberta, Canada
Maeder Anthony, University of Western Sydney, Australia
Program Committee
Bischof Walter, University of Alberta, Canada
Boulanger Pierre, University of Alberta, Canada
Brown Ross, Queensland University of Technology, Australia
Dowling Jason, CSIRO, Australia
Figueroa Pablo, Universidad de los Andes, Colombia
Liyanage Liwan, University of Western Sydney, Australia
Malzbender Tom, HP Labs, USA
Mandal Mrinal, University of Alberta, Canada
Miller Steven, University of British Columbia, Canada
Nguyen Quang Vinh, University of Western Sydney, Australia
Shi Hao, Victoria University, Australia
Shi Jianbo, University of Pennsylvania, USA
Silva Claudio, University of Utah, USA
Simoff Simeon, University of Western Sydney, Australia
Yin Lijun, University of Utah, USA
Zabulis Xenophon, Institute of Computer Science-FORTH, Greece
Zanuttigh Pietro, University of Padova, Italy
5. Computational Bioimaging

Organizers
Tavares João Manuel R.S., University of Porto, Portugal
Jorge Renato Natal, University of Porto, Portugal
Cunha Alexandre, Caltech, USA

Program Committee
Santis De Alberto, Università degli Studi di Roma “La Sapienza”, Italy
Falcao Alexandre Xavier, University of Campinas, Brazil
Reis Ana Mafalda, Instituto de Ciências Biomédicas Abel Salazar, Portugal
Barrutia Arrate Muñoz, University of Navarra, Spain
Calco Begoña, University of Zaragoza, Spain
Kotropoulos Constantine, Aristotle University of Thessaloniki, Greece
Iacoviello Daniela, Università degli Studi di Roma “La Sapienza”, Italy
Rodrigues Denilson Laudares, PUC Minas, Brazil
Shen Dinggang, University of Pennsylvania, USA
Ziou Djemel, University of Sherbrooke, Canada
Pires Eduardo Borges, Instituto Superior Técnico, Portugal
Sgallari Fiorella, University of Bologna, Italy
Perales Francisco, Balearic Islands University, Spain
Rohde Gustavo, Carnegie Mellon University, USA
Peng Hanchuan, Howard Hughes Medical Institute, USA
Rodrigues Helder, Instituto Superior Técnico, Portugal
Pistori Hemerson, Dom Bosco Catholic University, Brazil
Zhou Huiyu, Brunel University, UK
Yanovsky Igor, Jet Propulsion Laboratory, USA
Corso Jason, SUNY at Buffalo, USA
Maldonado Javier Melenchón, Open University of Catalonia, Spain
Barbosa Jorge M.G., University of Porto, Portugal
Marques Jorge, Instituto Superior Técnico, Portugal
Aznar Jose M. García, University of Zaragoza, Spain
Tohka Jussi, Tampere University of Technology, Finland
Vese Luminita, University of California at Los Angeles, USA
Reis Luís Paulo, University of Porto, Portugal
El-Sakka Mahmoud, The University of Western Ontario London, Canada
Hidalgo Manuel González, Balearic Islands University, Spain
Kunkel Maria Elizete, Universität Ulm, Germany
Gurcan Metin N., Ohio State University, USA
Liebling Michael, University of California at Santa Barbara, USA
Dubois Patrick, Institut de Technologie Médicale, France
Jorge Renato M.N., University of Porto, Portugal
Barneva Reneta, State University of New York, USA
Bellotti Roberto, University of Bari, Italy
Tangaro Sabina, University of Bari, Italy
Newsam Shawn, University of California at Merced, USA
Silva Susana Branco, University of Lisbon, Portugal
Pataky Todd, University of Liverpool, UK
Brimkov Valentin, State University of New York, USA
Zhan Yongjie, Carnegie Mellon University, USA

6. Visual Computing for Robotics

Organizers
Chausse Frederic, Clermont Universite, France
Program Committee
Aubert Didier, LIVIC, France
Chateau Thierry, Clermont Université, France
Chapuis Roland, Clermont Université, France
Hautiere Nicolas, LCPC/LEPSIS, France
Royer Eric, Clermont Université, France
Bekris Kostas, University of Nevada, Reno, USA
7. Optimization for Vision, Graphics and Medical Imaging: Theory and Applications

Organizers
Komodakis Nikos, University of Crete, Greece
Langs Georg, University of Vienna, Austria
Program Committee
Paragios Nikos, Ecole Centrale de Paris/INRIA Saclay Ile-de-France, France
Bischof Horst, Graz University of Technology, Austria
Cremers Daniel, University of Bonn, Germany
Grady Leo, Siemens Corporate Research, USA
Navab Nassir, Technical University of Munich, Germany
Samaras Dimitris, Stony Brook University, USA
Lempitsky Victor, Microsoft Research Cambridge, UK
Tziritas Georgios, University of Crete, Greece
Pock Thomas, Graz University of Technology, Austria
Micusik Branislav, Austrian Research Centers GmbH - ARC, Austria
Glocker Ben, Technical University of Munich, Germany
8. Semantic Robot Vision Challenge

Organizers
Rybski Paul E., Carnegie Mellon University, USA
DeMenthon Daniel, Johns Hopkins University, USA
Fermuller Cornelia, University of Maryland, USA
Fazli Pooyan, University of British Columbia, Canada
Mishra Ajay, National University of Singapore, Singapore
Lopes Luis, Universidade de Aveiro, Portugal
Roehrbein Florian, Universität Bremen, Germany
Gustafson David, Kansas State University, USA
Nicolescu Mircea, University of Nevada at Reno, USA
Additional Reviewers
Vo Huy, University of Utah, USA
Streib Kevin, Ohio State University, USA
Sankaranarayanan Karthik, Ohio State University, USA
Guerrero Paul, Vienna University of Technology, Austria
Brimkov Boris, University at Buffalo, USA
Kensler Andrew, University of Utah, USA
Organizing Institutions and Sponsors
Table of Contents – Part II
Computer Graphics III Layered Volume Splatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Philipp Schlegel and Renato Pajarola
1
Real-Time Soft Shadows Using Temporal Coherence . . . . . . . . . . . . . . . . . . Daniel Scherzer, Michael Schwärzler, Oliver Mattausch, and Michael Wimmer
13
Real-Time Dynamic Wrinkles of Face for Animated Skinned Mesh . . . . . L. Dutreve, A. Meyer, and S. Bouakaz
25
Protected Progressive Meshes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Gschwandtner and Andreas Uhl
35
Bilateral Filtered Shadow Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jinwook Kim and Soojae Kim
49
LightShop: An Interactive Lighting System Incorporating the 2D Image Editing Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Younghui Kim and Junyong Noh
59
Visualization II Progressive Presentation of Large Hierarchies Using Treemaps . . . . . . . . . René Rosenbaum and Bernd Hamann
71
Reaction Centric Layout for Metabolic Networks . . . . . . . . . . . . . . . . . . . . . Muhieddine El Kaissi, Ming Jia, Dirk Reiners, Julie Dickerson, and Eve Wuertele
81
Diverging Color Maps for Scientific Visualization . . . . . . . . . . . . . . . . . . . . . Kenneth Moreland
92
GPU-Based Ray Casting of Multiple Multi-resolution Volume Datasets . . . . . . . . . . . . Christopher Lux and Bernd Fröhlich
Dynamic Chunking for Out-of-Core Volume Visualization Applications . . . . . . . . . . . . Dan R. Lipsa, R. Daniel Bergeron, Ted M. Sparr, and Robert S. Laramee
104
117
Visualization of the Molecular Dynamics of Polymers and Carbon Nanotubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sidharth Thakur, Syamal Tallury, Melissa A. Pasquinelli, and Theresa-Marie Rhyne
129
Detection and Tracking Propagation of Pixel Hypotheses for Multiple Objects Tracking . . . . . . . . Haris Baltzakis and Antonis A. Argyros
140
Visibility-Based Observation Model for 3D Tracking with Non-parametric 3D Particle Filters . . . . . . . . . . . . Raúl Mohedano and Narciso García
150
Efficient Hypothesis Generation through Sub-categorization for Multiple Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dipankar Das, Yoshinori Kobayashi, and Yoshinori Kuno
160
Object Detection and Localization in Clutter Range Images Using Edge Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dipankar Das, Yoshinori Kobayashi, and Yoshinori Kuno
172
Learning Higher-Order Markov Models for Object Tracking in Image Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Felsberg and Fredrik Larsson
184
Analysis of Numerical Methods for Level Set Based Image Segmentation . . . . . . . . . . . . Björn Scheuermann and Bodo Rosenhahn
196
Reconstruction II Focused Volumetric Visual Hull with Color Extraction . . . . . . . . . . . . . . . . Daniel Knoblauch and Falko Kuester
208
Graph Cut Based Point-Cloud Segmentation for Polygonal Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . David Sedlacek and Jiri Zara
218
Dense Depth Maps from Low Resolution Time-of-Flight Depth and High Resolution Color Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bogumil Bartczak and Reinhard Koch
228
Residential Building Reconstruction Based on the Data Fusion of Sparse LiDAR Data and Satellite Imagery . . . . . . . . . . . . . . . . . . . . . . . . . . Ye Yu, Bill P. Buckles, and Xiaoping Liu
240
Adaptive Sample Consensus for Efficient Random Optimization . . . . . . . . . . . . Lixin Fan and Timo Pylvänäinen
252
Feature Matching under Region-Based Constraints for Robust Epipolar Geometry Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Xu and Jane Mulligan
264
Applications Lossless Compression Using Joint Predictor for Astronomical Images . . . Bo-Zong Wu and Angela Chih-Wei Tang Metric Rectification to Estimate the Aspect Ratio of Camera-Captured Document Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Junhee Park and Byung-Uk Lee
274
283
Active Learning Image Spam Hunter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yan Gao and Alok Choudhary
293
Skin Paths for Contextual Flagging Adult Videos . . . . . . . . . . . . Julian Stöttinger, Allan Hanbury, Christian Liensberger, and Rehanullah Khan
303
Grouping and Summarizing Scene Images from Web Collections . . . . . . . Heng Yang and Qing Wang
315
Robust Registration of Aerial Image Sequences . . . . . . . . . . . . . . . . . . . . . . Clark F. Olson, Adnan I. Ansar, and Curtis W. Padgett
325
Color Matching for Metallic Coatings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jayant Silva and Kristin J. Dana
335
Video Analysis and Event Recognition A Shape and Energy Based Approach to Vertical People Separation in Video Surveillance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alessio M. Brits and Jules R. Tapamo
345
Human Activity Recognition Using the 4D Spatiotemporal Shape Context Descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Natasha Kholgade and Andreas Savakis
357
Adaptive Tuboid Shapes for Action Recognition . . . . . . . . . . . . . . . . . . . . . Roman Filipovych and Eraldo Ribeiro
367
Level Set Gait Analysis for Synthesis and Reconstruction . . . . . . . . . . . . . Muayed S. Al-Huseiny, Sasan Mahmoodi, and Mark S. Nixon
377
Poster Session Real-Time Hand Detection and Gesture Tracking with GMM and Model Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gabriel Yoder and Lijun Yin
387
Design of Searchable Commemorative Coins Image Library . . . . . . . . . . . . Radoslav Fasuga, Petr Kašpar, and Martin Surkovský
397
Visual Intention Detection for Wheelchair Motion . . . . . . . . . . . . . . . . . . . . T. Luhandjula, E. Monacelli, Y. Hamam, B.J. van Wyk, and Q. Williams
407
An Evaluation of Affine Invariant-Based Classification for Image Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel Fleck and Zoran Duric Asbestos Detection Method with Frequency Analysis for Microscope Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hikaru Kumagai, Soichiro Morishita, Kuniaki Kawabata, Hajime Asama, and Taketoshi Mishima
417
430
Shadows Removal by Edges Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P. Spagnolo, P.L. Mazzeo, M. Leo, and T. D’Orazio
440
Online Video Textures Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wentao Fan and Nizar Bouguila
450
Deformable 2D Shape Matching Based on Shape Contexts and Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Iasonas Oikonomidis and Antonis A. Argyros
460
3D Model Reconstruction from Turntable Sequence with Multiple-View Triangulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jian Zhang, Fei Mai, Y.S. Hung, and G. Chesi
470
Recognition of Semantic Basketball Events Based on Optical Flow Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Li Li, Ying Chen, Weiming Hu, Wanqing Li, and Xiaoqin Zhang
480
Action Recognition Based on Non-parametric Probability Density Function Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuta Mimura, Kazuhiro Hotta, and Haruhisa Takahashi
489
Asymmetry-Based Quality Assessment of Face Images . . . . . . . . . . . . Guangpeng Zhang and Yunhong Wang
Scale Analysis of Several Filter Banks for Color Texture Classification . . . . . . . . . . . . Olga Rajadell, Pedro García-Sevilla, and Filiberto Pla
A Novel 3D Segmentation of Vertebral Bones from Volumetric CT Images Using Graph Cuts . . . . . . . . . . . . Melih S. Aslan, Asem Ali, Ham Rara, Ben Arnold, Aly A. Farag, Rachid Fahmi, and Ping Xiang
499
509
519
Robust 3D Marker Localization Using Multi-spectrum Sequences . . . . . . . Pengcheng Li, Jun Cheng, Ruifeng Yuan, and Wenchuang Zhao
529
Measurement of Pedestrian Groups Using Subtraction Stereo . . . . . . . . . . Kenji Terabayashi, Yuki Hashimoto, and Kazunori Umeda
538
Vision-Based Obstacle Avoidance Using SIFT Features . . . . . . . . . . . . . . . Aaron Chavez and David Gustafson
550
Segmentation of Chinese Postal Envelope Images for Address Block Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xinghui Dong, Junyu Dong, and Shengke Wang Recognizability of Polyhexes by Tiling and Wang Systems . . . . . . . . . . . . . H. Geetha, D.G. Thomas, T. Kalyani, and T. Robinson
558 568
Unsupervised Video Analysis for Counting of Wood in River during Floods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Imtiaz Ali and Laure Tougne
578
Robust Facial Feature Detection and Tracking for Head Pose Estimation in a Novel Multimodal Interface for Social Skills Learning . . . Jingying Chen and Oliver Lemon
588
High Performance Implementation of License Plate Recognition in Image Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andreas Zweng and Martin Kampel
598
TOCSAC: TOpology Constraint SAmple Consensus for Fast and Reliable Feature Correspondence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhoucan He, Qing Wang, and Heng Yang
608
Multimedia Mining on Manycore Architectures: The Case for GPUs . . . . Mamadou Diao and Jongman Kim
619
Human Activity Recognition Based on Transform and Fourier Mellin Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pengfei Zhu, Weiming Hu, Li Li, and Qingdi Wei
631
Reconstruction of Facial Shape from Freehand Multi-viewpoint Snapshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Seiji Suzuki, Hideo Saito, and Masaaki Mochimaru
641
Multiple-view Video Coding Using Depth Map in Projective Space . . . . . . . . . . . . Nina Yorozu, Yuko Uematsu, and Hideo Saito
Using Subspace Multiple Linear Regression for 3D Face Shape Prediction from a Single Image . . . . . . . . . . . . Mario Castelán, Gustavo A. Puerto-Souza, and Johan Van Horebeek
651
662
PedVed: Pseudo Euclidian Distances for Video Events Detection . . . . . . . Md. Haidar Sharif and Chabane Djeraba Two Algorithms for Measuring Human Breathing Rate Automatically . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tomas Lampo, Javier Sierra, and Carolina Chang
674
686
Biometric Recognition: When Is Evidence Fusion Advantageous? . . . . . . . . . . . . Hugo Proença
698
Interactive Image Inpainting Using DCT Based Exemplar Matching . . . . Tsz-Ho Kwok and Charlie C.L. Wang
709
Noise-Residue Filtering Based on Unsupervised Clustering for Phase Unwrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Jiang, Jun Cheng, and Xinglin Chen Adaptive Digital Makeup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Abhinav Dhall, Gaurav Sharma, Rajen Bhatt, and Ghulam Mohiuddin Khan An Instability Problem of Region Growing Segmentation Algorithms and Its Set Median Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lucas Franek and Xiaoyi Jiang
719 728
737
Distance Learning Based on Convex Clustering . . . . . . . . . . . . . . . . . . . . . . Xingwei Yang, Longin Jan Latecki, and Ari Gross
747
Group Action Recognition Using Space-Time Interest Points . . . . . . . . . . Qingdi Wei, Xiaoqin Zhang, Yu Kong, Weiming Hu, and Haibin Ling
757
Adaptive Deblurring for Camera-Based Document Image Processing . . . . Yibin Tian and Wei Ming
767
A Probabilistic Model of Visual Attention and Perceptual Organization for Constructive Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masayasu Atsumi Gloss and Normal Map Acquisition of Mesostructures Using Gray Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yannick Francken, Tom Cuypers, Tom Mertens, and Philippe Bekaert Video Super-Resolution by Adaptive Kernel Regression . . . . . . . . . . . . . . . Mohammad Moinul Islam, Vijayan K. Asari, Mohammed Nazrul Islam, and Mohammad A. Karim Unification of Multichannel Motion Feature Using Boolean Polynomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Naoya Ohnishi, Atsushi Imiya, and Tomoya Sakai
778
788
799
807
Rooftop Detection and 3D Building Modeling from Aerial Images . . . . . . Fanhuai Shi, Yongjian Xi, Xiaoling Li, and Ye Duan
817
An Image Registration Approach for Accurate Satellite Attitude Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alessandro Bevilacqua, Ludovico Carozza, and Alessandro Gherardi
827
A Novel Vision-Based Approach for Autonomous Space Navigation Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alessandro Bevilacqua, Alessandro Gherardi, and Ludovico Carozza
837
An Adaptive Cutaway with Volume Context Preservation . . . . . . . . . . . . S. Grau and A. Puig
A 3D Visualisation to Enhance Cognition in Software Product Line Engineering . . . . . . . . . . . . Ciarán Cawley, Goetz Botterweck, Patrick Healy, Saad Bin Abid, and Steffen Thiel
847
857
869 879
889
Visual Computing for Scattered Electromagnetic Fields . . . . . . . . . . . . . . . Shyh-Kuang Ueng and Fu-Sheng Yang
899
Visualization of Gene Regulatory Networks . . . . . . . . . . . . . . . . . . . . . . . . . Muhieddine El Kaissi, Ming Jia, Dirk Reiners, Julie Dickerson, and Eve Wuertele
909
Autonomous Lighting Agents in Photon Mapping . . . . . . . . . . . . . . . . . . . . A. Herubel, V. Biri, and S. Deverly
919
Data Vases: 2D and 3D Plots for Visualizing Multiple Time Series . . . . . Sidharth Thakur and Theresa-Marie Rhyne
929
A Morse-Theory Based Method for Segmentation of Triangulated Freeform Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Wang and Zeyun Yu A Lattice Boltzmann Model for Rotationally Invariant Dithering . . . . . . . Kai Hagenburg, Michael Breuß, Oliver Vogel, Joachim Weickert, and Martin Welk
939 949
Parallel 3D Image Segmentation of Large Data Sets on a GPU Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Aaron Hagan and Ye Zhao
960
A Practical Guide to Large Tiled Displays . . . . . . . . . . . . Paul A. Navrátil, Brandt Westing, Gregory P. Johnson, Ashwini Athalye, Jose Carreno, and Freddy Rojas
970
Fast Spherical Mapping for Genus-0 Meshes . . . . . . . . . . . . . . . . . . . . . . . . . Shuhua Lai, Fuhua (Frank) Cheng, and Fengtao Fan
982
Rendering Virtual Objects with High Dynamic Range Lighting Extracted Automatically from Unordered Photo Collections . . . . . . . . . . . . Konrad Kölzer, Frank Nagl, Bastian Birnbach, and Paul Grimm
992
Effective Adaptation to Experience of Different-Sized Hand . . . . . . . . . . . . 1002 Kenji Terabayashi, Natsuki Miyata, Jun Ota, and Kazunori Umeda
Image Processing Methods Applied in Mapping of Lubrication Parameters . . . . . . . . . . . . 1011 Radek Poliščuk
An Active Contour Approach for a Mumford-Shah Model in X-Ray Tomography . . . . . . . . . . . . 1021 Elena Hoetzl and Wolfgang Ring
An Integral Active Contour Model for Convex Hull and Boundary Extraction . . . . . . . . . . . . 1031 Nikolay Metodiev Sirakov and Karthik Ushkala
Comparison of Segmentation Algorithms for the Zebrafish Heart in Fluorescent Microscopy Images . . . . . . . . . . . . 1041 P. Krämer, F. Boto, D. Wald, F. Bessy, C. Paloc, C. Callol, A. Letamendia, I. Ibarbia, O. Holgado, and J.M. Virto
A Quality Pre-processor for Biological Cell Images . . . . . . . . . . . . 1051 Adele P. Peskin, Karen Kafadar, and Alden Dima
Fast Reconstruction Method for Diffraction Imaging . . . . . . . . . . . . 1063 Eliyahu Osherovich, Michael Zibulevsky, and Irad Yavneh
Evaluation of Cardiac Ultrasound Data by Bayesian Probability Maps . . . . . . . . . . . . 1073 Mattias Hansson, Sami Brandt, Petri Gudmundsson, and Finn Lindgren
Animated Classic Mosaics from Video . . . . . . . . . . . . 1085 Yu Liu and Olga Veksler
Comparison of Optimisation Algorithms for Deformable Template Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1097 Vasileios Zografos Two Step Variational Method for Subpixel Optical Flow Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1109 Yoshihiko Mochizuki, Yusuke Kameda, Atsushi Imiya, Tomoya Sakai, and Takashi Imaizumi A Variational Approach to Semiautomatic Generation of Digital Terrain Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1119 Markus Unger, Thomas Pock, Markus Grabner, Andreas Klaus, and Horst Bischof Real-Time Articulated Hand Detection and Pose Estimation . . . . . . . . . . 1131 Giorgio Panin, Sebastian Klose, and Alois Knoll Background Subtraction in Video Using Recursive Mixture Models, Spatio-Temporal Filtering and Shadow Removal . . . . . . . . . . . . . . . . . . . . . 1141 Zezhi Chen, Nick Pears, Michael Freeman, and Jim Austin A Generalization of Moment Invariants on 2D Vector Fields to Tensor Fields of Arbitrary Order and Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . 1151 Max Langbein and Hans Hagen A Real-Time Road Sign Detection Using Bilateral Chinese Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1161 Rachid Belaroussi and Jean-Philippe Tarel Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1171 Mustafa Berkay Yilmaz, Hakan Erdogan, and Mustafa Unel Common Motion Map Based on Codebooks . . . . . . . . . . . . . . . . . . . . . . . . . 1181 Ionel Pop, Scuturici Mihaela, and Serge Miguet Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1191
Table of Contents – Part I
ST: Object Recognition Which Shape Representation Is the Best for Real-Time Hand Interface System? . . . . . . . . . . . . Serkan Genç and Volkan Atalay
1
Multi-target and Multi-camera Object Detection with Monte-Carlo Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Giorgio Panin, Sebastian Klose, and Alois Knoll
12
Spatial Configuration of Local Shape Features for Discriminative Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lech Szumilas and Horst Wildenauer
22
A Bag of Features Approach for 3D Shape Retrieval . . . . . . . . . . . . . . . . . . Janis Fehr, Alexander Streicher, and Hans Burkhardt
34
Efficient Object Pixel-Level Categorization Using Bag of Features . . . . . . David Aldavert, Arnau Ramisa, Ricardo Toledo, and Ramon Lopez de Mantaras
44
Computer Graphics I Relighting Forest Ecosystems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jay E. Steele and Robert Geist
55
Cartoon Animation Style Rendering of Water . . . . . . . . . . . . . . . . . . . . . . . . Mi You, Jinho Park, Byungkuk Choi, and Junyong Noh
67
Deformable Proximity Queries and Their Application in Mobile Manipulation Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M. Gissler, C. Dornhege, B. Nebel, and M. Teschner
79
Speech-Driven Facial Animation Using a Shared Gaussian Process Latent Variable Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Salil Deena and Aphrodite Galata
89
Extracting Principal Curvature Ridges from B-Spline Surfaces with Deficient Smoothness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Suraj Musuvathy and Elaine Cohen
101
Adaptive Partitioning of Vertex Shader for Low Power High Performance Geometry Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.V.N. Silpa, Kumar S.S. Vemuri, and Preeti Ranjan Panda
111
Visualization I Visualized Index-Based Search for Digital Libraries . . . . . . . . . . . . . . . . . . . Jon Scott, Beomjin Kim, and Sanyogita Chhabada
125
Generation of an Importance Map for Visualized Images . . . . . . . . . . . . . . Akira Egawa and Susumu Shirayama
135
Drawing Motion without Understanding It . . . . . . . . . . . . . . . . . . . . . . . . . . Vincenzo Caglioti, Alessandro Giusti, Andrea Riva, and Marco Uberti
147
Image Compression Based on Visual Saliency at Individual Scales . . . . . . Stella X. Yu and Dimitri A. Lisin
157
Fast Occlusion Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mayank Singh, Cem Yuksel, and Donald House
167
An Empirical Study of Categorical Dataset Visualization Using a Simulated Bee Colony Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . James D. McCaffrey
179
ST: Visual Computing for Robotics Real-Time Feature Acquisition and Integration for Vision-Based Mobile Robots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thomas H¨ ubner and Renato Pajarola
189
Matching Planar Features for Robot Localization . . . . . . . . . . . . . . . . . . . . Baptiste Charmette, Eric Royer, and Fr´ed´eric Chausse
201
Fast and Accurate Structure and Motion Estimation . . . . . . . . . . . . . . . . . Johan Hedborg, Per-Erik Forss´en, and Michael Felsberg
211
Optical Flow Based Detection in Mixed Human Robot Environments . . . Dario Figueira, Plinio Moreno, Alexandre Bernardino, Jos´e Gaspar, and Jos´e Santos-Victor
223
Using a Virtual World to Design a Simulation Platform for Vision and Robotic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Om K. Gupta and Ray A. Jarvis
233
Feature Extraction and Matching Accurate and Efficient Computation of Gabor Features in Real-Time Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gholamreza Amayeh, Alireza Tavakkoli, and George Bebis Region Graph Spectra as Geometric Global Image Features . . . . . . . . . . . Qirong Ho, Weimiao Yu, and Hwee Kuan Lee
243 253
Robust Harris-Laplace Detector by Scale Multiplication . . . . . . . . . . . . . . . Fanhuai Shi, Xixia Huang, and Ye Duan
265
Spatial-Temporal Junction Extraction and Semantic Interpretation . . . . . Kasper Broegaard Simonsen, Mads Thorsted Nielsen, Florian Pilz, Norbert Kr¨ uger, and Nicolas Pugeault
275
Cross-Correlation and Rotation Estimation of Local 3D Vector Field Patches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Janis Fehr, Marco Reisert, and Hans Burkhardt
287
Scene Categorization by Introducing Contextual Information to the Visual Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jianzhao Qin and Nelson H.C. Yung
297
Edge-Preserving Laplacian Pyramid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stella X. Yu
307
Medical Imaging Automated Segmentation of Brain Tumors in MRI Using Force Data Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masoumeh Kalantari Khandani, Ruzena Bajcsy, and Yaser P. Fallah Top-Down Segmentation of Histological Images Using a Digital Deformable Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F. De Vieilleville, J.-O. Lachaud, P. Herlin, O. Lezoray, and B. Plancoulaine Closing Curves with Riemannian Dilation: Application to Segmentation in Automated Cervical Cancer Screening . . . . . . . . . . . . . . . . . . . . . . . . . . . . Patrik Malm and Anders Brun Lung Nodule Modeling – A Data-Driven Approach . . . . . . . . . . . . . . . . . . . Amal Farag, James Graham, Aly Farag, and Robert Falk Concurrent CT Reconstruction and Visual Analysis Using Hybrid Multi-resolution Raycasting in a Cluster Environment . . . . . . . . . . . . . . . . Steffen Frey, Christoph M¨ uller, Magnus Strengert, and Thomas Ertl Randomized Tree Ensembles for Object Detection in Computational Pathology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thomas J. Fuchs, Johannes Haybaeck, Peter J. Wild, Mathias Heikenwalder, Holger Moch, Adriano Aguzzi, and Joachim M. Buhmann Human Understandable Features for Segmentation of Solid Texture . . . . Ludovic Paulhac, Pascal Makris, Jean-Marc Gregoire, and Jean-Yves Ramel
317
327
337 347
357
367
379
Motion Exploiting Mutual Camera Visibility in Multi-camera Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Kurz, Thorsten Thorm¨ ahlen, Bodo Rosenhahn, and Hans-Peter Seidel Optical Flow Computation from an Asynchronised Multiresolution Image Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yusuke Kameda, Naoya Ohnishi, Atsushi Imiya, and Tomoya Sakai Conditions for Segmentation of Motion with Affine Fundamental Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shafriza Nisha Basah, Reza Hoseinnezhad, and Alireza Bab-Hadiashar Motion-Based View-Invariant Articulated Motion Detection and Pose Estimation Using Sparse Point Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shrinivas J. Pundlik and Stanley T. Birchfield Robust Estimation of Camera Motion Using Optical Flow Models . . . . . . Jurandy Almeida, Rodrigo Minetto, Tiago A. Almeida, Ricardo da S. Torres, and Neucimar J. Leite Maximum Likelihood Estimation Sample Consensus with Validation of Individual Correspondences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liang Zhang, Houman Rastgar, Demin Wang, and Andr´e Vincent Efficient Random Sampling for Nonrigid Feature Matching . . . . . . . . . . . . Lixin Fan and Timo Pylv¨ an¨ ainen
391
403
415
425
435
447
457
Virtual Reality I RiverLand: An Efficient Procedural Modeling System for Creating Realistic-Looking Terrains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Soon Tee Teoh
468
Real-Time 3D Reconstruction for Occlusion-Aware Interactions in Mixed Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexander Ladikos and Nassir Navab
480
Augmenting Exercise Systems with Virtual Exercise Environment . . . . . . Wei Xu, Jaeheon Jeong, and Jane Mulligan Codebook-Based Background Subtraction to Generate Photorealistic Avatars in a Walkthrough Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anjin Park, Keechul Jung, and Takeshi Kurata
490
500
JanusVF: Adaptive Fiducial Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Malcolm Hutson and Dirk Reiners Natural Pose Generation from a Reduced Dimension Motion Capture Data Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Reza Ferrydiansyah and Charles B. Owen
511
521
ST: Computational Bioimaging Segmentation of Neural Stem/Progenitor Cells Nuclei within 3-D Neurospheres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weimiao Yu, Hwee Kuan Lee, Srivats Hariharan, Shvetha Sankaran, Pascal Vallotton, and Sohail Ahmed
531
Deconvolving Active Contours for Fluorescence Microscopy Images . . . . . Jo A. Helmuth and Ivo F. Sbalzarini
544
Image Registration Guided by Particle Filter . . . . . . . . . . . . . . . . . . . . . . . . Edgar R. Arce-Santana, Daniel U. Campos-Delgado, and Alfonso Alba
554
Curve Enhancement Using Orientation Fields . . . . . . . . . . . . . . . . . . . . . . . . Kristian Sandberg
564
Lighting-Aware Segmentation of Microscopy Images for In Vitro Fertilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alessandro Giusti, Giorgio Corani, Luca Maria Gambardella, Cristina Magli, and Luca Gianaroli Fast 3D Reconstruction of the Spine Using User-Defined Splines and a Statistical Articulated Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel C. Moura, Jonathan Boisvert, Jorge G. Barbosa, and Jo˜ ao Manuel R.S. Tavares
576
586
Computer Graphics II High-Quality Rendering of Varying Isosurfaces with Cubic Trivariate C 1 -Continuous Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thomas Kalbe, Thomas Koch, and Michael Goesele
596
Visualizing Arcs of Implicit Algebraic Curves, Exactly and Fast . . . . . . . . Pavel Emeliyanenko, Eric Berberich, and Michael Sagraloff
608
Fast Cube Cutting for Interactive Volume Visualization . . . . . . . . . . . . . . . Travis McPhail, Powei Feng, and Joe Warren
620
A Statistical Model for Daylight Spectra . . . . . . . . . . . . . . . . . . . . . . . . . . . . Martyn Williams and William A.P. Smith
632
Reducing Artifacts between Adjacent Bricks in Multi-resolution Volume Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rhadam´es Carmona, Gabriel Rodr´ıguez, and Bernd Fr¨ ohlich Height and Tilt Geometric Texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vedrana Andersen, Mathieu Desbrun, J. Andreas Bærentzen, and Henrik Aanæs
644 656
ST: 3D Mapping, Modeling and Surface Reconstruction Using Coplanar Circles to Perform Calibration-Free Planar Scene Analysis under a Perspective View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yisong Chen Parallel Poisson Surface Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthew Bolitho, Michael Kazhdan, Randal Burns, and Hugues Hoppe
668 678
3D Object Mapping by Integrating Stereo SLAM and Object Segmentation Using Edge Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masahiro Tomono
690
Photometric Recovery of Ortho-Images Derived from Apollo 15 Metric Camera Imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Taemin Kim, Ara V. Nefian, and Michael J. Broxton
700
3D Lunar Terrain Reconstruction from Apollo Images . . . . . . . . . . . . . . . . Michael J. Broxton, Ara V. Nefian, Zachary Moratto, Taemin Kim, Michael Lundy, and Aleksandr V. Segal Factorization of Correspondence and Camera Error for Unconstrained Dense Correspondence Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel Knoblauch, Mauricio Hess-Flores, Mark Duchaineau, and Falko Kuester
710
720
Face Processing Natural Facial Expression Recognition Using Dynamic and Static Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bogdan Raducanu and Fadi Dornaika
730
Facial Shape Recovery from a Single Image with an Arbitrary Directional Light Using Linearly Independent Representation . . . . . . . . . . Minsik Lee and Chong-Ho Choi
740
Locating Facial Features and Pose Estimation Using a 3D Shape Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Angela Caunce, David Cristinacce, Chris Taylor, and Tim Cootes
750
A Stochastic Method for Face Image Super-Resolution . . . . . . . . . . . . . . . . Jun Zheng and Olac Fuentes A Framework for Long Distance Face Recognition Using Dense- and Sparse-Stereo Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ham Rara, Shireen Elhabian, Asem Ali, Travis Gault, Mike Miller, Thomas Starr, and Aly Farag
762
774
Reconstruction I Multi-view Reconstruction of Unknown Objects within a Known Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stefan Kuhn and Dominik Henrich Accurate Real-Time Disparity Estimation with Variational Methods . . . . Sergey Kosov, Thorsten Thorm¨ ahlen, and Hans-Peter Seidel Real-Time Parallel Implementation of SSD Stereo Vision Algorithm on CSX SIMD Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fouzhan Hosseini, Amir Fijany, Saeed Safari, Ryad Chellali, and Jean-Guy Fontaine Revisiting the PnP Problem with a GPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . Timo Pylv¨ an¨ ainen, Lixin Fan, and Vincent Lepetit On Using Projective Relations for Calibration and Estimation in a Structured-Light Scanner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daljit Singh Dhillon and Venu Madhav Govindu Depth from Encoded Sliding Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chris Hermans, Yannick Francken, Tom Cuypers, and Philippe Bekaert
784 796
808
819
831 843
ST: Deformable Models: Theory and Applications A New Algorithm for Inverse Consistent Image Registration . . . . . . . . . . . Xiaojing Ye and Yunmei Chen A 3D Active Surface Model for the Accurate Segmentation of Drosophila Schneider Cell Nuclei and Nucleoli . . . . . . . . . . . . . . . . . . . . . . . Margret Keuper, Jan Padeken, Patrick Heun, Hans Burkhardt, and Olaf Ronneberger Weight, Sex, and Facial Expressions: On the Manipulation of Attributes in Generative 3D Face Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Brian Amberg, Pascal Paysan, and Thomas Vetter Contrast Constrained Local Binary Fitting for Image Segmentation . . . . Xiaojing Bai, Chunming Li, Quansen Sun, and Deshen Xia
855
865
875 886
Modeling and Rendering Physically-Based Wood Combustion . . . . . . . . . . Roderick M. Riensche and Robert R. Lewis
896
A Unifying View of Contour Length Bias Correction . . . . . . . . . . . . . . . . . . Christina Pavlopoulou and Stella X. Yu
906
ST: Visualization Enhanced Data Analysis for Health Applications A Novel Method for Enhanced Needle Localization Using Ultrasound-Guidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bin Dong, Eric Savitsky, and Stanley Osher
914
Woolz IIP: A Tiled On-the-Fly Sectioning Server for 3D Volumetric Atlases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zsolt L. Husz, Thomas P. Perry, Bill Hill, and Richard A. Baldock
924
New Scalar Measures for Diffusion-Weighted MRI Visualization . . . . . . . . Tim McGraw, Takamitsu Kawai, Inas Yassine, and Lierong Zhu Automatic Data-Driven Parameterization for Phase-Based Bone Localization in US Using Log-Gabor Filters . . . . . . . . . . . . . . . . . . . . . . . . . Ilker Hacihaliloglu, Rafeef Abugharbieh, Antony Hodgson, and Robert Rohling Wavelet-Based Representation of Biological Shapes . . . . . . . . . . . . . . . . . . . Bin Dong, Yu Mao, Ivo D. Dinov, Zhuowen Tu, Yonggang Shi, Yalin Wang, and Arthur W. Toga Detection of Unusual Objects and Temporal Patterns in EEG Video Recordings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kostadin Koroutchev, Elka Korutcheva, Kamen Kanev, Apolinar Rodr´ıguez Albari˜ no, Jose Luis Mu˜ niz Gutierrez, and Fernando Fari˜ naz Balseiro
934
944
955
965
Virtual Reality II DRONE: A Flexible Framework for Distributed Rendering and Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Repplinger, Alexander L¨ offler, Dmitri Rubinstein, and Philipp Slusallek Efficient Strategies for Acceleration Structure Updates in Interactive Ray Tracing Applications on the Cell Processor . . . . . . . . . . . . . . . . . . . . . . Martin Weier, Thorsten Roth, and Andr´e Hinkenjann Interactive Assembly Guide Using Augmented Reality . . . . . . . . . . . . . . . . M. Andersen, R. Andersen, C. Larsen, T.B. Moeslund, and O. Madsen
975
987 999
V-Volcano: Addressing Students’ Misconceptions in Earth Sciences Learning through Virtual Reality Simulations . . . . . . . . . . . . . . . . . . . . . . . . 1009 Hollie Boudreaux, Paul Bible, Carolina Cruz-Neira, Thomas Parham, Cinzia Cervato, William Gallus, and Pete Stelling A Framework for Object-Oriented Shader Design . . . . . . . . . . . . . . . . . . . . 1019 Roland Kuck and Gerold Wesche Ray Traced Virtual Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1031 Christian N.S. Odom, Nikhil J. Shetty, and Dirk Reiners
ST: Optimization for Vision, Graphics and Medical Imaging: Theory and Applications Stochastic Optimization for Rigid Point Set Registration . . . . . . . . . . . . . . 1043 Chavdar Papazov and Darius Burschka Multi-label MRF Optimization Via a Least Squares s − t Cut . . . . . . . . . 1055 Ghassan Hamarneh Combinatorial Preconditioners and Multilevel Solvers for Problems in Computer Vision and Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1067 Ioannis Koutis, Gary L. Miller, and David Tolliver Optimal Weights for Convex Functionals in Medical Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1079 Chris McIntosh and Ghassan Hamarneh Adaptive Contextual Energy Parameterization for Automated Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1089 Josna Rao, Ghassan Hamarneh, and Rafeef Abugharbieh Approximated Curvature Penalty in Non-rigid Registration Using Pairwise MRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1101 Ben Glocker, Nikos Komodakis, Nikos Paragios, and Nassir Navab Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1111
Layered Volume Splatting
Philipp Schlegel and Renato Pajarola
University of Zurich, Binzmühlestrasse 14, 8050 Zurich, Switzerland
[email protected],
[email protected]
Abstract. We present a new layered, hardware-accelerated splatting algorithm for volume rendering. Layered volume splatting features the speed benefits of fast axis-aligned pre-classified sheet-buffer splatting while at the same time exhibiting display quality comparable to high-quality post-classified view-aligned sheet-buffer splatting. Additionally, we enhance the quality by using a more accurate approximation of the volume rendering integral. Commonly, the extinction coefficient of the volume rendering integral is approximated by the first two elements of its Taylor series expansion to allow for simple α-blending. In our approach we use the original, exponential extinction coefficient to achieve a better approximation. In this paper we describe the layered splatting algorithm and how it can be implemented on the GPU. We compare the results in terms of performance and quality to prior state-of-the-art volume splatting methods.
Keywords: Volume rendering, volume splatting.
1 Introduction
Direct volume rendering [1] is a method for visualizing discrete datasets without extracting explicit geometry. These datasets are often generated by regularly sampling a continuous scalar field. In order to visualize a dataset, the continuous scalar field (3D function) has to be reconstructed from the discrete dataset. Once the reconstruction step is finished, the volume rendering integral needs to be evaluated along the viewing rays. This can be done either in screen or in object space. A popular method is ray casting in conjunction with trilinear interpolation. While ray casting delivers good results, it is more costly to compute and only recent developments have achieved interactive frame rates. Splatting as an object space method was introduced in [2]. Instead of evaluating every ray from screen space as with ray casting, each voxel is illuminated, classified and supplied with a footprint of an interpolation kernel and then projected onto the screen. Due to the inappropriate evaluation of the volume rendering integral, the results suffer from blurring and color bleeding. These issues were addressed by introducing axis-aligned sheet splatting [3]. Post-classified image-aligned sheet splatting has further overcome some drawbacks [4,5]. Image-aligned approaches typically split the interpolation kernel into several slabs to better approximate the volume rendering integral. Thus for every
single voxel, multiple slabs have to be rasterized. In terms of performance the multiplied rasterization costs are a major bottleneck. Our new algorithm limits the number of required splatting operations to exactly one per voxel without losing the quality advantages of splatting multiple kernel slabs per voxel. We achieve this by applying a correction term based on the previous and consecutive sheet. Hence the sheets are not independent from each other anymore and that’s why we call a sheet a layer and the method layered volume splatting. Furthermore, common approaches to volume rendering make simplifications regarding the evaluation of the volume rendering integral [6,7]. The integral in its original form cannot be solved analytically without making some confining assumptions and thus needs to be approximated. It is usually developed into a Riemann sum. Moreover, only the first two elements of the Taylor series expansion of the exponential extinction coefficient are taken. This leads directly to Porter-Duff compositing as described in [8] and is well supported in graphics hardware. We think it is now feasible to use the original exponential extinction coefficient, by virtue of fast and programmable GPUs, in order to achieve a closer approximation of the volume rendering integral and thus a better quality. The contributions of this paper are manifold. First, we introduce a novel, fast, GPU-accelerated volume splatting algorithm based on an axis-aligned layer concept. Second, we provide an effective interpolation correction solution that accounts for the overlap of blending kernels into adjacent layers. Also, we avoid the simplification of the attenuation integral in favor of a more accurate solution. Finally, we demonstrate the superior performance of layered splatting, achieving excellent quality equal to prior state-of-the-art splatting algorithms.
2 Previous Work
Volume splatting was originally introduced by Westover [2]. The algorithm works as follows: Every voxel is mapped from grid into screen space and the density and gradient values are converted into color values (pre-classification). Finally a reconstruction step and the compositing into the framebuffer are performed. Projecting the footprint of an interpolation kernel determines which pixels are affected by a voxel (forward mapping). The algorithm works quite fast but suffers from blurring and color bleeding as a result of pre-classification and improper visibility determination. In [3] a revised algorithm divides the volume into sheets along the axis most parallel to the view direction. The voxel contributions are summed up into a sheet buffer before being composited. However, the algorithm still sticks to pre-classification and pre-shading. Another drawback is the popping artifacts that may occur when the orientation of the sheet direction changes. Crawfis and Max [9] exploit texture mapping hardware to accelerate the splatting operations and introduce a new reconstruction kernel based on Max’ previous 2D optimization [10]. Seminal work on reconstruction and interpolation kernels is provided by Marschner and Lobb [11]. They study several reconstruction filters and classify them according to a new metric that includes
smoothing, postaliasing and overshoot. Carlbom [12] provides another discussion about filters. This includes research on weighted Chebyshev approximation and comparisons to piecewise cubic filters. Other quality enhancements have been proposed by Zwicker et al. [13,14] with their EWA splatting. To avoid aliasing artifacts they introduce a new splatting primitive consisting of an elliptical Gaussian reconstruction kernel with a Gaussian low-pass filter. An anti-aliasing extension including an error analysis of the splatting process has been published by Mueller et al. [15]. Hadwiger et al. [16] investigate quality issues that arise from limited precision and range on graphics hardware when using high-quality filtering with arbitrary filter kernels. In [17] they present a framework for performing fast convolution with arbitrary filter kernels to substitute linear filtering. Mueller and Crawfis introduce view-aligned sheet splatting in [4]. To overcome the popping artifacts of axis-aligned splatting when switching to a different axis, sheets that are perpendicular to the view direction are used. This requires the voxels to be resorted in every frame where the view direction changes. Because the support radius of the Gaussian interpolation kernel is larger than the distance between two sheets, the kernel is sliced into slabs and for each kernel slice its footprints are generated. This means that for a single voxel several of these footprints have to be splatted, multiplying the display costs. To reduce the blur from splatting, Mueller et al. [5,18] suggest displacing classification and shading to after projection onto the screen. In addition, not only the density volume is splatted but also the gradient volume. The gradients are required for shading calculations, which now take place after splatting. Modern, programmable graphics hardware makes it possible to greatly enhance volume rendering and splatting performance. Apart from splatting, there are approaches using 2D and 3D textures for volume rendering [19,20,21]. 3D texture techniques are generally very fast because of the hardware support but not well suited for high-order interpolation. In turn, splatting can also benefit from fast graphics hardware. Neophytou et al. [22,23] present a combined CPU/GPU method where bucket distribution of the voxels to the sheets is done on the CPU and compositing, classification and shading on the GPU. The special properties of the Gaussian kernel and footprint enable splatting of four footprint slices at a time using all color channels. Opaque pixels are marked by the shader and henceforth omitted from splatting. Furthermore, using point sprites instead of polygons for splatting reduces the amount of geometry to be sent to the graphics board [22,24]. Grau et al. [25] extended Neophytou’s method to an all GPU algorithm by doing the necessary bucket sorting on the GPU. Our new layered volume splatting approach uniquely combines the performance advantages of direct (axis-aligned) splatting and hardware acceleration, with the quality improvements of post-classified (sheet-buffered) splatting. The focus of our comparison lies on state-of-the-art splatting methods. Anyhow, compared to 3D texture based volume rendering, we achieve a better image quality due to higher-order interpolation as well as a similar performance in some cases.
3 Layered Volume Splatting
3.1 Performance Considerations
A performance analysis of volume splatting shows three areas where expensive operations may become a bottleneck:

Sorting is necessary to guarantee back-to-front or front-to-back traversal of the splats or sheets. Sheetless approaches such as the original splatting algorithm [2] have different traversal orders for every frame in which the view direction changes. Sheet splatting methods have the advantage that the individual voxels only need to be distributed to the different sheets, whereas the order within a sheet is not important. For that reason a cheap bucket sorting algorithm can be employed. When using axis-aligned sheet splatting, the distribution of the voxels to the sheets remains valid as long as a view direction change does not exceed a 45° angle. In this case another axis will become most parallel to the view direction and the sheet orientation changes. The voxels have to be redistributed to the newly oriented sheets. For view-aligned sheet splatting, since the sheets are perpendicular to the view direction, the voxels have to be resorted for every change in view direction. This is a clear disadvantage over axis-aligned sheet splatting. Resorting on the CPU causes a lot of traffic on the bus because each time the whole geometry for the splats has to be transferred to the graphics card. Point sprites can diminish the amount of data sent to the graphics card since only one vertex is required per splat instead of several when using polygons. A recent method by Grau et al. [25] does the resorting completely on the GPU, eliminating the need to resend the geometry to the graphics card. In our layered splatting approach we apply fast axis-aligned ordering such that voxel redistribution only has to be performed when crossing a 45° angle. With normal axis-aligned sheet splatting, popping artifacts may occur when this happens. However, our layered volume splatting strongly abates the popping artifacts, which can be attributed to the use of more compact interpolation kernels and the interpolation correction term (see Section 3.2), as well as to the improved attenuation integration via the exponential extinction coefficient (see Section 3.3).

Rasterization. The splatting operation itself is a kind of 2D texture mapping including point sprites. Mostly it is the real bottleneck of volume splatting because texturing of millions of effective splats drives the current graphics cards to the rasterization limits. This applies especially when using sheet-buffered splatting in conjunction with an interpolation kernel that has a large support radius. The kernel then contributes to many sheets, and hence many slabs of a kernel have to be rasterized for a single voxel as shown in Figure 1(a). Neophytou et al. [22] address this issue by using all color channels to splat four kernel slabs at a time. Although this solution is very fast, it has some weaknesses. For one, it only works if the additional color channels are not required for transmitting the normal. Furthermore, it is only well suited to a Gaussian kernel because individual, pre-integrated kernel slabs can be conveyed from a base kernel by a single factor.
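To make the sorting step above concrete, the following is a minimal CPU-side sketch (not taken from the paper) of how voxels could be bucketed into axis-aligned layers; the Voxel layout, the names, and the integer grid coordinates are assumptions. The buckets are only rebuilt when the dominant axis of the view direction changes, i.e. when a 45° boundary is crossed; per frame, traversal then simply walks the buckets front-to-back or back-to-front.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Voxel { float x, y, z, density; };   // grid-space position and scalar value

// Minimal CPU-side bucketing of voxels into axis-aligned layers. The buckets
// only have to be rebuilt when the dominant axis of the view direction changes.
struct LayerBuckets {
    int dominantAxis = -1;                                // 0 = x, 1 = y, 2 = z
    std::vector<std::vector<const Voxel*>> layers;        // one bucket per layer

    void update(const std::vector<Voxel>& voxels,
                const float viewDir[3], const std::size_t dims[3]) {
        int axis = 0;                                     // axis most parallel to the view
        for (int a = 1; a < 3; ++a)
            if (std::fabs(viewDir[a]) > std::fabs(viewDir[axis])) axis = a;
        if (axis == dominantAxis) return;                 // buckets still valid, skip sorting

        dominantAxis = axis;
        layers.clear();
        layers.resize(dims[axis]);
        for (const Voxel& v : voxels) {                   // cheap O(n) bucket sort
            const float c = (axis == 0) ? v.x : (axis == 1) ? v.y : v.z;
            layers[static_cast<std::size_t>(c)].push_back(&v);
        }
    }
};
```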
Fig. 1. (a) Sheet buffers are perpendicular to the view direction. Each Gaussian interpolation kernel with radius 2.0 spreads across five sheets, resulting in five slabs per kernel at an arbitrary position. All slabs are explicitly splatted. (b) The voxel grid with a layer overlay and two footprints of our cubic kernel. Li is the currently processed layer. Contributions from footprints in the adjacent layers are approximated. They are not explicitly splatted as kernel slab footprints when using layered splatting.
Our goal was to strictly have one single texturing operation per voxel including the possibility to provide a normal, either from the gradient volume or a gradient interpolation kernel. To achieve this, we switch from a splat centric view to a sheet centric view and, therefore, call a sheet a layer. A layer Li contains the contributions of all interpolation kernels for which the corresponding voxel centers fall directly into Li , plus correction terms from interpolation kernels from voxels in adjacent layers as illustrated in Figure 1(b). We define the invariant that only contributions from voxels centered in the current layer Li are explicitly splatted. Contributions to Li from voxels in adjacent layers are not explicitly splatted but approximated using a correction term. To minimize errors introduced by the correction terms, we no longer use a Gaussian interpolation kernel with radius 2.0 which may contribute to five layers. Instead we use a cubic interpolation filter with radius 1.0 that contributes to at most three layers. From a layer centric view this means that only voxels in the current layer plus voxels in the two adjacent layers must be considered. For a given layer Li and its adjacent layers Li−1 and Li+1 , only the parts of voxels in Li are rasterized as splats into the current layer’s frame buffer. The missing contributions from adjacent layers Li−1 and Li+1 are accounted for on a per-pixel basis. This is achieved by accumulating contributions from Li−1 and Li+1 to the current layer Li according to the ratio κ of the pre-integrated kernel intersecting Li , see also Figure 1(b). Consequently the correction addend consists of the contributions from the adjacent layers weighted by the correction factors κ. The ratio κ may change for every voxel if they do not have the same positions perpendicular to the layers. This is typically the case with view-aligned layers, since in general the volume axes do not align with the view direction. To avoid this, we exploit axis-aligned layers to keep the relative positions of the voxels
constant within a layer. Thus for a given blending kernel h(x, y, z) we can precompute the correction factors κ(x, y) once along the projection dimension, since the kernels' intersections with the adjacent layers Li−1 and Li+1 are constant for all voxels, as described in the following section.

Fig. 2. The right image, rendered without the interpolation correction term, shows significant artifacts from missing contributions of interpolation kernels overlapping adjacent layers.

Compositing. Using per-pixel post-classification and post-shading for high-quality rendering, compositing and blending become crucial from a performance point of view, especially when the number of sheets or layers rises. It gets even worse if any kind of z-supersampling, as in typical sheet-based splatting, is used to better approximate the volume rendering integral. Let us define the grid resolution of the volume as 1.0 and the distance between two sheets as 0.5. This effectively doubles the required number of compositing operations but produces a higher-quality image, particularly for low-resolution volumes. Huang et al. demonstrate this in their OpenSplat framework [26]. The compositing performance is basically independent of the effective number of voxels or splats as long as no special optimizations are made. Assuming classification and shading are done in a fragment shader, Neophytou et al. [22] show how special OpenGL extensions can be used to optimize performance. Early z-culling and depth-bounds test extensions allow dropping of fragments that are not affected during splatting or that are already opaque in a front-to-back traversal. As we use a different extinction model, we cannot use the default OpenGL blending. Thus we calculate blending within the fragment shader where classification and lighting take place, and can subsequently take advantage of the same optimizations. As z-supersampling is not required by our layered splatting approach, it benefits from a reduced number of compositing operations. This is feasible because of the compact blending kernels, the interpolation correction terms accounting for adjacent-layer contributions, and the improved attenuation factor from the exponential extinction coefficient. Excellent rendering quality is furthermore achieved due to high-resolution interpolation within layers and post-classification.
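The per-layer pass order described in this subsection could look roughly as follows. This is a CPU-style sketch, not the paper's GPU implementation: splatLayer, classifyShadeComposite and the assumption that each layer buffer holds the full pre-integrated footprints (so that the explicitly splatted layer is weighted by the central κ and the neighbors by the adjacent κ) are ours; the κ values themselves are derived in Section 3.2.

```cpp
#include <utility>
#include <vector>

// Per-pixel density accumulated for one layer (gradients etc. omitted).
using LayerBuffer = std::vector<float>;

// Correction ratios of the pre-integrated cubic kernel (derived in Sect. 3.2).
constexpr float kCenter = 0.8125f, kAdjacent = 0.09375f;

// Placeholder stages; in the actual renderer these are GPU passes.
LayerBuffer splatLayer(int /*layer*/, int numPixels) { return LayerBuffer(numPixels, 0.0f); }
void classifyShadeComposite(const LayerBuffer& /*layer*/) { /* post-classify, shade, blend */ }

void renderFrame(int numLayers, int numPixels) {
    LayerBuffer prev(numPixels, 0.0f);
    LayerBuffer curr = splatLayer(0, numPixels);
    for (int i = 0; i < numLayers; ++i) {
        LayerBuffer next = (i + 1 < numLayers) ? splatLayer(i + 1, numPixels)
                                               : LayerBuffer(numPixels, 0.0f);
        // Explicit splats of this layer plus the approximated overlap of the
        // kernels centered in the two adjacent layers (the kappa correction).
        LayerBuffer corrected(numPixels);
        for (int p = 0; p < numPixels; ++p)
            corrected[p] = kCenter * curr[p] + kAdjacent * (prev[p] + next[p]);
        classifyShadeComposite(corrected);   // front-to-back compositing per layer
        prev = std::move(curr);
        curr = std::move(next);
    }
}
```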
3.2 Cubic Interpolation Kernel
Because of the discrete resolution of a sampled scalar (or vector) field, the gaps between the sample points must be interpolated for direct rendering and zooming. In other words, a continuous 3D function has to be reconstructed from the available spatial samples. This reconstruction is not only crucial for quality but also for performance. The most common interpolation scheme is (tri-)linear interpolation, which is heavily used in ray casting based volume rendering. In the volume splatting context the Gaussian interpolation kernel is very popular. Apart from the superior quality of the Gaussian kernel over trilinear interpolation, there are some other properties that make it very attractive. The derivative of the Gaussian is a Gaussian again. Further, it can be considered spherically symmetric, making it independent of the view direction. Frequently a Gaussian with a support radius of 2.0 is used:

h(r) = [|r| < 2.0] c · e^{−2.0 r²}.

However, the Gaussian kernel does not satisfy the needs of layered splatting very well. As only the footprint in the layer where the voxel center lies is explicitly rendered, an error is introduced for every contribution of the kernel that lies outside of that central layer. A Gaussian with radius 2.0 contributes to four additional layers apart from the main layer where the voxel lies. Accordingly, it is better to use a kernel with a smaller support radius. In terms of performance this has an additional benefit: the individual footprint splats are smaller and thus fewer pixels have to be rasterized per footprint, further deferring the rasterization limit. Interpolation kernel filters can roughly be arranged in three categories: separable filters, spherically symmetric filters, and pass-band optimal discrete filters. The latter are proposed by Hsu et al. [27] and adapted to volume rendering by Carlbom [12]. Because they are quite expensive, they are not feasible for fast rendering. Given a 1D kernel f(r), a separable 3D filter can be written as

h(x, y, z) = f(x) f(y) f(z),    (1)

and we use the following 1D function from the family of cubic filters for our purposes:

f(x) = [|x| < 1.0] (1 − 3|x|² + 2|x|³).    (2)

A discussion of cubic interpolation filters can be found in [28] and [11]. We chose this particular filter because it is zero outside a box with edge length 2.0 and subsequently spans exactly the three layers Li−1, Li and Li+1 in a regular voxel grid, as indicated in Figure 1(b). Thus the error introduced by layered interpolation along the projection dimension z can significantly be reduced by a correction term. In fact, the integral of the 3D interpolation filter h(x, y, z) of Equation 1 equals zero outside [−1, 1]. The correction factors κ(x, y) for the correction term can be calculated as follows:

κ_{Li−1}(x, y) = ∫_{−1}^{−0.5} h(x, y, z) dz / ∫_{−1}^{1} h(x, y, z) dz = 0.09375
κ_{Li}(x, y)   = ∫_{−0.5}^{0.5} h(x, y, z) dz / ∫_{−1}^{1} h(x, y, z) dz = 0.8125
κ_{Li+1}(x, y) = ∫_{0.5}^{1} h(x, y, z) dz / ∫_{−1}^{1} h(x, y, z) dz = 0.09375
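As a quick numerical check of these correction factors (not part of the paper), one can integrate the 1D cubic filter of Equation 2 along the projection dimension; because the kernel is separable, the f(x) f(y) factors cancel in the ratios.

```cpp
#include <cmath>
#include <cstdio>

// 1D cubic filter from Equation 2: f(x) = 1 - 3|x|^2 + 2|x|^3 for |x| < 1, else 0.
static double f(double x) {
    const double a = std::fabs(x);
    return (a < 1.0) ? 1.0 - 3.0 * a * a + 2.0 * a * a * a : 0.0;
}

// Midpoint-rule integration of f over [lo, hi].
static double integrate(double lo, double hi, int steps = 200000) {
    const double h = (hi - lo) / steps;
    double sum = 0.0;
    for (int i = 0; i < steps; ++i) sum += f(lo + (i + 0.5) * h);
    return sum * h;
}

int main() {
    const double total = integrate(-1.0, 1.0);
    std::printf("kappa(L_i-1) = %.5f\n", integrate(-1.0, -0.5) / total);  // 0.09375
    std::printf("kappa(L_i)   = %.5f\n", integrate(-0.5,  0.5) / total);  // 0.81250
    std::printf("kappa(L_i+1) = %.5f\n", integrate( 0.5,  1.0) / total);  // 0.09375
    return 0;
}
```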
This shows that κ is independent of the position (x, y) due to the separable characteristics of the kernel. The final interpolated result for layer Li can be obtained by first splatting the voxels centered in Li, followed by κ-corrected accumulation of values from layers Li−1 and Li+1. Figure 2 demonstrates the effect of the correction term. Note that in contrast to Gaussian interpolation the number of pixels to be rasterized is reduced by a factor of 4.0 without loss of rendering quality (see also Section 4). However, there is one disadvantage when using a filter that is not spherically symmetric, as it is no longer independent of the view direction. When splatting the footprints of such kernels, the view direction has to be taken into account and an appropriate footprint has to be selected. From a performance point of view it is not feasible to generate the footprints on the fly during splatting. However, precalculating a set of oriented footprint images, storing them in a small texture cache, and choosing the most suitable one in every situation solves the problem. Selecting the right footprint image can be done in a vertex shader program. Currently we use a set of 856 pre-calculated footprint images. Each of these has a size of 64² pixels with 1 byte per pixel, requiring a total of only 3.3 MByte of texture memory. According to our experiments, the discretization of the view direction does not lead to visual artifacts.
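A possible CPU-side sketch of the footprint selection is shown below. The paper only states that 856 footprints of 64² pixels are precomputed and that the most suitable one is chosen per splat in a vertex shader; the azimuth/elevation binning and all names used here are assumptions made for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// One precomputed footprint: a 64x64, 8-bit image of the projected kernel.
struct Footprint { std::array<std::uint8_t, 64 * 64> weights; };

// Hypothetical footprint cache indexed by a discretized view direction.
struct FootprintCache {
    int azimuthBins = 0, elevationBins = 0;
    std::vector<Footprint> table;                 // azimuthBins * elevationBins entries

    const Footprint& select(float dx, float dy, float dz) const {
        const float pi = 3.14159265f;
        const float len = std::sqrt(dx * dx + dy * dy + dz * dz);
        const float azimuth   = std::atan2(dy, dx);      // [-pi, pi]
        const float elevation = std::asin(dz / len);     // [-pi/2, pi/2]
        int a = static_cast<int>((azimuth / (2.0f * pi) + 0.5f) * azimuthBins);
        int e = static_cast<int>((elevation / pi + 0.5f) * elevationBins);
        a = std::clamp(a, 0, azimuthBins - 1);
        e = std::clamp(e, 0, elevationBins - 1);
        return table[static_cast<std::size_t>(e) * azimuthBins + a];
    }
};
```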
3.3 Extinction Coefficient
The goal of every volume rendering algorithm is to approximate the volume rendering integral as closely as possible. Unfortunately, the volume rendering integral cannot be solved analytically without making some confining assumptions [7]. The volume rendering integral is based on the absorption and emission model by Max [6]. I_0 denotes the intensity of the light when it enters the volume, τ(t) is an extinction coefficient and E(s) is the light emitted by the volume (samples) itself. The integral goes along a viewing ray through the volume, calculating the resulting light intensity:

I(D) = I_0 e^{−∫_0^D τ(t) dt} + ∫_0^D E(s) τ(s) e^{−∫_s^D τ(t) dt} ds
The integral is now approximated using a Riemann sum as
I(D) ≈ I_0 ∏_{i=0}^{D/Δt} e^{−τ(t_i)Δt} + ∑_{i=0}^{D/Δt} E_i τ(t_i)Δt ∏_{j=i+1}^{D/Δt} e^{−τ(t_j)Δt}    (3)
and the extinction term is generally developed into a Taylor series as

α_j = 1 − e^{−τ(t_j)Δt} ≈ 1 − (1 − τ(t_j)Δt) ≈ τ(t_j)Δt.    (4)
Putting Equation 4 into 3 and substituting the emitted light E by the voxel color C results in:
I(D) ≈ I_0 ∏_{i=0}^{D/Δt} (1 − α_i) + ∑_{i=0}^{D/Δt} C_i α_i ∏_{j=i+1}^{D/Δt} (1 − α_j)    (5)
Fig. 3. Comparison between α (left) and the exponential extinction (right)
When inspecting Equation 5 it turns out that this attenuation and blending is equal to the over or under operator (depending on whether back-to-front or front-to-back compositing is used) from Porter and Duff [8]. This makes it attractive because it is very well supported by the hardware. Figure 4(a), however, shows the error introduced by the simplification of taking the first two elements of the Taylor series expansion (1 − α_j in Equation 5) of the extinction coefficient instead of the original exponential function (e^{−τ(t_j)Δt} in Equation 3). With today's powerful and programmable graphics hardware there is no need to use the α-attenuation; we can return to the original e^{−τ} extinction coefficient to more closely approximate the volume rendering integral (see also Figure 3). The extinction and blending therefore have to be calculated explicitly in the compositing fragment shader. On the other hand, exchanging the classical α with τ implies a new transfer function. Typically the transfer function maps from the scalar volume field ρ to color and opacity values: (ρ) → (r, g, b, α). Here α denotes the linear opacity, where α = 0.0 is completely transparent and α = 1.0 is fully opaque. Since the α parameter is exchanged for τ, the new extinction domain ranges from 0.0 to ∞. Thus existing transfer functions have to be transformed according to this new semantics as follows:

f(x) = 1 − e^{−x},    α → τ = f^{−1}(α) = ln(1/(1 − α))
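The following sketch illustrates the two pieces discussed in this section: the transfer-function remap from α to τ and front-to-back blending with the exponential extinction term. It is written as plain C++ for readability, whereas in the paper this runs in the compositing fragment shader; the function names are placeholders.

```cpp
#include <algorithm>
#include <cmath>

// Remap a classical opacity alpha in [0, 1) to the extinction coefficient tau,
// following tau = ln(1 / (1 - alpha)) from above.
inline float alphaToTau(float alpha) {
    alpha = std::min(alpha, 0.999999f);        // keep the logarithm finite for alpha -> 1
    return -std::log(1.0f - alpha);
}

// Front-to-back accumulation for one pixel and one layer, using the
// exponential extinction term instead of the (1 - alpha) approximation.
struct RayState { float r = 0, g = 0, b = 0, transmittance = 1; };

inline void compositeLayer(RayState& s, float cr, float cg, float cb,
                           float alpha, float layerThickness) {
    const float tau = alphaToTau(alpha);                    // from the remapped transfer function
    const float stepOpacity = 1.0f - std::exp(-tau * layerThickness);
    s.r += s.transmittance * cr * stepOpacity;
    s.g += s.transmittance * cg * stepOpacity;
    s.b += s.transmittance * cb * stepOpacity;
    s.transmittance *= std::exp(-tau * layerThickness);
    // Early termination is possible once s.transmittance is close to zero
    // (the opacity feedback / early z-culling mentioned in Sect. 3.1).
}
```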
4 Experimental Results
Figure 4(b) shows a brief overview of the rendering pipeline. For each dataset a full rotation over 125 frames with a constant angular step is measured. The average frame rate of this rotation is listed in Table 1; all sorting steps are included. The rendering is done in a viewport of 512² pixels. All measurements were made on an Intel Xeon 2.66 GHz machine and a GeForce 8800 GT graphics card. We use the method from [22] as a benchmark. On equal hardware our implementation of their renderer, labeled view-aligned splatter, achieves the same or slightly higher frame rates. Nevertheless, our layered splatting is able to outperform this well-recognized approach by a factor of 10 for the large foot dataset and still a factor of 2 for the tiny fuel dataset.
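The measurement protocol can be sketched as follows; renderFrame is a placeholder for issuing the actual splatting passes, and the use of std::chrono is an assumption about tooling, not taken from the paper.

```cpp
#include <chrono>

// Placeholder: render the dataset once from the given rotation angle (radians).
void renderFrame(double /*angle*/) { /* issue the splatting passes for this view */ }

// One full rotation in a fixed number of equal angular steps; report the
// average frame rate over the whole rotation.
double averageFps(int frames = 125) {
    const double step = 2.0 * 3.141592653589793 / frames;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < frames; ++i)
        renderFrame(i * step);
    const auto t1 = std::chrono::steady_clock::now();
    return frames / std::chrono::duration<double>(t1 - t0).count();
}
```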
Fig. 4. (a) Difference between the approximated attenuation term 1 − α and the original extinction term e^{−τ}. (b) The rendering pipeline of layered volume splatting. The bucket sorting step can be omitted if no 45° angle is crossed.
Fig. 5. Left-to-right: axis-aligned splatter, view-aligned splatter, layered splatter, 3D texture slicing

Table 1. Performance results

Dataset          Dimension        Effective splats   Axis-Aligned       View-Aligned       Layered      Preintegrated 3D
                                                     Sheet Splatting    Sheet Splatting    Splatting    Texture Slicing
Fuel Injection   64³              14K                39.8 fps           47.0 fps           100.3 fps    197.1 fps
Lobster          301 × 324 × 56   233K               15.0 fps           13.2 fps           43.0 fps     n/a
Aneurism         256³             79K                26.7 fps           23.0 fps           48.7 fps     47.3 fps
Neghip           64³              122K               7.9 fps            11.1 fps           30.7 fps     192.6 fps
Engine           256² × 128       1.3M               3.8 fps            3.0 fps            23.2 fps     43.0 fps
Skull            256³             1.4M               3.6 fps            2.7 fps            16.5 fps     43.6 fps
Foot             256³             4.6M               0.9 fps            0.8 fps            8.3 fps      40.9 fps
Vertebra         512³             1.6M               3.0 fps            2.4 fps            14.9 fps     n/a
Figure 5 shows the engine dataset rendered with a transparent transfer function using the renderers from Table 1. The image of the layered splatter shows a quality comparable to the images from the two other splatters but with reduced specular highlights due to the exponential extinction. The image from the 3D texture slicing renderer shows artifacts from the slices on the opaque rear panel.
5 Conclusion
We have described a new volume splatting approach called layered splatting. The goal of this new algorithm is to enhance the performance of existing volume splatting methods while maintaining an excellent display quality. This is particularly useful for interactively visualizing large datasets. Our approach shows tremendous speedups when rendering multimillion-splat datasets while maintaining an excellent quality, i.e. much closer to view-aligned sheet splatting than to sheetless rendering. Recently GPU ray casting has become popular and fast, approaching the performance known from 3D texture slicing and splatting. Even though ray casting is known for its good image quality, most implementations of GPU ray casting rely on the built-in trilinear interpolation scheme while sampling along the rays. Having a fast interpolation scheme is crucial for the performance of GPU ray casting. Unlike splatting, implementing an efficient high-order interpolation scheme using GPU ray casting may be more difficult.
References 1. Engel, K., Hadwiger, M., Kniss, J.M., Rezk-Salama, C., Weiskopf, D.: Real-Time Volume Graphics. AK Peters (2006) 2. Westover, L.: Interactive volume rendering. In: VVS 1989: Proceedings of the 1989 Chapel Hill workshop on Volume visualization, pp. 9–16. ACM, New York (1989) 3. Westover, L.: Footprint evaluation for volume rendering. In: Baskett, F. (ed.) Computer Graphics (SIGGRAPH 1990 Proceedings), vol. 24, pp. 367–376 (1990) 4. Mueller, K., Crawfis, R.: Eliminating popping artifacts in sheet buffer-based splatting. In: IEEE Visualization (VIS 1998), Washington - Brussels - Tokyo, pp. 239–246. IEEE, Los Alamitos (1998) 5. Mueller, K., M¨ oller, T., Crawfis, R.: Splatting without the blur. IEEE Visualization, 363–370 (1999) 6. Max, N.: Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics 1, 99–108 (1995) 7. Moreland, K.D.: Fast High Accuracy Volume Rendering. PhD thesis, The University of New Mexico (2004) 8. Porter, T., Duff, T.: Compositing digital images. In: Christiansen, H. (ed.) SIGGRAPH 1984 Conference Proceedings, Minneapolis, MN, July 23-27, pp. 253–259. ACM, New York (1984) 9. Crawfis, R., Max, N.: Texture splats for 3D scalar and vector field visualization. In: IEEE Visualization 1993 Proceedings, pp. 261–266. IEEE Computer Society, Los Alamitos (1993) 10. Max, N.: An optimal filter for image reconstruction. In: Arvo, J. (ed.) Graphics Gem II, pp. 101–104. Academic Press, New York (1991) 11. Marschner, S.R., Lobb, R.J.: An evaluation of reconstruction filters for volume rendering. In: Bergeron, R.D., Kaufman, A.E. (eds.) Proceedings of the Conference on Visualization, pp. 100–107. IEEE Computer Society Press, Los Alamitos (1994) 12. Carlbom, I.: Optimal filter design for volume reconstruction and visualization. In: Nielson, G.M., Bergeron, D. (eds.) Proceedings of the Visualization 1993 Conference, San Jose, CA, pp. 54–61. IEEE Computer Society Press, Los Alamitos (1993)
13. Zwicker, M., Pfister, H., van Baar, J., Gross, M.: EWA volume splatting. In: Ertl, T., Joy, K., Varshney, A. (eds.) Proceedings of the Conference on Visualization 2001 (VIS 2001), Piscataway, NJ, pp. 29–36. IEEE Computer Society, Los Alamitos (2001) 14. Chen, W., Ren, L., Zwicker, M., Pfister, H.: Hardware-accelerated adaptive EWA volume splatting. In: Proceedings of IEEE Visualization 2004 (2004) 15. Mueller, K., M¨ oller, T., II Edward Swan, J., Crawfis, R., Shareef, N., Yagel, R.: Splatting errors and antialiasing. IEEE Transactions on Visualization and Computer Graphics 4, 178–191 (1998) 16. Hadwiger, M., Hauser, H., M¨ oller, T.: Quality issues of hardware-accelerated high-quality filtering on pc graphics hardware. In: Proceedings of WSCG 2003, pp. 213–220 (2003) 17. Hadwiger, M., Viola, I., Theußl, T., Hauser, H.: Fast and flexible high-quality texture filtering with tiled high-resolution filters. In: Vision, Modeling and Visualization 2002, Akademische Verlagsgesellschaft Aka GmbH, Berlin, pp. 155–162 (2002) 18. Mueller, K., Shareef, N., Huang, J., Crawfis, R.: High-quality splatting on rectilinear grids with efficient culling of occluded voxels. IEEE Trans. Vis. Comput. Graph 5, 116–134 (1999) 19. Rezk-Salama, C., Engel, K., Bauer, M., Greiner, G., Ertl, T.: Interactive volume on standard pc graphics hardware using multi-textures and multi-stage rasterization. In: HWWS 2000: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware, pp. 109–118. ACM, New York (2000) 20. Cabral, B., Cam, N., Foran, J.: Accelerated volume rendering and tomographic reconstruction using texture mapping hardware. In: Proceedings ACM/IEEE Symposium on Volume Visualization, pp. 91–98 (1994) 21. Xue, D., Zhang, C., Crawfis, R.: isbvr: Isosurface-aided hardware acceleration techniques for slice-based volume rendering. In: Volume Graphics, pp. 207–215 (2005) 22. Neophytou, N., Mueller, K.: GPU accelerated image aligned splatting. In: Gr¨ oller, E., Fujishiro, I. (eds.) Eurographics/IEEE VGTC Workshop on Volume Graphics, Stony Brook, NY, pp. 197–205. Eurographics Association (2005) 23. Neophytou, N., Mueller, K., McDonnell, K.T., Hong, W., Guan, X., Qin, H., Kaufman, A.E.: GPU-accelerated volume splatting with elliptical RBFs. In: Santos, B.S., Ertl, T., Joy, K.I. (eds.) EuroVis 2006: Joint Eurographics - IEEE VGTC Symposium on Visualization, Lisbon, Portugal, May 8-10, pp. 13–20. Eurographics Association (2006) 24. Higuera, F.V., Hastreiter, P., Fahlbusch, R., Greiner, G.: High performance volume splatting for visualization of neurovascular data. In: IEEE Visualization, p. 35. IEEE Computer Society Press, Los Alamitos (2005) 25. Grau, S., Tost, D.: Image-space sheet-buffered splatting on the gpu. In: IADIS 2007, International Conference on Computer Graphics and Visualization 2007, pp. 75–82 (2007) 26. Huang, J., Crawfis, R., Shareef, N., Mueller, K.: Fastsplats: optimized splatting on rectilinear grids. In: IEEE Visualization, pp. 219–226 (2000) 27. Hsu, K., Marzetta, T.L.: Velocity filtering of acoustic well logging waveforms. IEEE Trans. Acoustics, Speech and Signal Processing ASSP-37(2), 265 (1989) 28. Mitchell, D.P., Netravali, A.N.: Reconstruction filters in computer graphics. In: Proceedings ACM SIGGRAPH, pp. 221–228 (1988)
Real-Time Soft Shadows Using Temporal Coherence
Daniel Scherzer¹, Michael Schwärzler², Oliver Mattausch¹, and Michael Wimmer¹
¹ Vienna University of Technology
² VRVis Research Company
Abstract. A vast number of soft shadow mapping algorithms have been presented in recent years. Most use a single sample hard shadow map together with some clever filtering technique to calculate perceptually or even physically plausible soft shadows. On the other hand, there is the class of much slower algorithms that calculate physically correct soft shadows by taking and combining many samples of the light. In this paper we present a new soft shadow method that combines the benefits of these approaches. It samples the light source over multiple frames instead of a single frame, creating only a single shadow map each frame. Where temporal coherence is low, we use spatial filtering to estimate additional samples to create correct and very fast soft shadows.
1 Introduction
Shadows are widely acknowledged to be one of the global lighting effects with the most impact on scene perception. They are perceived as a natural part of a scene and give important cues about the spatial relationship of objects. In reality most light sources are area light sources and these create soft shadows. We are not used to hard shadows and perceive them as distinct objects. For the realistic shadowing of a scene, soft shadows are therefore considered a must. Soft shadows consist of an umbra region where the light source is totally invisible and a penumbra region where only part of the light source is visible. Typical soft shadowing methods for real-time applications approximate an area light by a point light located at its center and use heuristics to estimate penumbrae, which leads to soft shadows that are not physically correct [1,2]. This is because the area visibility that is the result of an area light interacting with a scene is replaced by a simpler from-point visibility. As the human eye is not very sensitive to the correctness of soft shadows, the results can be acceptable from a perceptual point of view or can even be physically plausible. Additionally, most inherent shadow map artifacts like aliasing are often hidden by the low-frequency soft shadows. Nevertheless, some perceptually concerning artifacts remain: overlapping occluders can lead to unnatural-looking shadow edges, or large penumbrae can cause single sample soft shadow approaches to either break down or become very slow (see Figures 1 and 5).
Fig. 1. From left to right: our method (634 FPS), bitmask soft shadows with an 8×8 search area (156 FPS), our method with a bigger penumbra (630 FPS), and bitmask soft shadows with the same penumbra and a 12×12 search area (60 FPS). Even very good single sample soft shadow methods show some artifacts, like biasing problems and contact shadow undersampling, that can be avoided by using multiple samples.
Accurate methods use light source sampling: the idea is to calculate soft shadows by sampling the area of the light source. Hard shadow calculations are performed for every sample and the results are combined [3]. The primary problem of these methods is that the number of samples needed to produce smooth penumbrae is huge. N samples produce N − 1 attenuation levels. High-quality soft shadows need 256 samples or more to create the 256 attenuation levels available with an 8-bit color channel. If we have to render a shadow map for each sample and store all the shadow maps for the final combination pass, real-time frame rates become unlikely, even for simple scenes. But under the assumption of a light source with uniform color and dense enough sampling of the light source, the result of these approaches is correct soft shadows. Our approach can be described as a combination of light source area sampling over time and single sample filtering:
– The area sampling is done one sample per frame by creating a shadow map from a randomly selected position on the area light. For each screen pixel the hard shadow results obtained from this shadow map are combined with the results from previous frames (accumulated in a screen space buffer called the shadow buffer) to calculate the soft shadow for each pixel.
– When a pixel becomes newly visible and therefore no previous information is available in the shadow buffer, we use a fast single sample approach (PCSS with a fixed 4×4 kernel) to generate an initial soft shadow estimation for this pixel.
– To avoid discontinuities between sampled and estimated soft shadows, all the estimated pixels are augmented by using a depth-aware spatial filter to take their neighborhood in the shadow buffer into account.
The main contribution of this paper is the application of temporal coherence (through the use of the shadow buffer) to the soft shadowing problem together with spatial filtering (penumbra estimation and pixel neighborhood) in a soft shadow mapping algorithm that is even faster than a very fast variant of PCSS
(see Section 4 for details), but produces accurate real-time soft shadows. Additionally, we extend temporal reprojection to handle moving objects such as dynamic shadow casters and receivers.
2 Previous Work
Soft shadows are a very active research topic, therefore we can only give a brief overview of the most relevant publications for our work. A still valuable survey of a number of different soft shadow methods is due to Hasenfratz et al. [1]. There are two major paradigms: methods that use shadow volumes (object-based algorithms) and methods that use shadow maps (image-based algorithms). We concentrate entirely on algorithms that employ shadow mapping due to their higher performance in real-time applications. Filtering: Since physically based soft shadow mapping requires many light source samples, it was previously considered too costly for real-time rendering. Therefore a number of algorithms were proposed that offer cheaper approximations. Most popular among those is Fernando's [4] Percentage Closer Soft Shadows (PCSS), which estimates the soft shadow from a single sample: it employs a blocker search to estimate the penumbra width and then uses PCF filtering accordingly to soften the shadows. Very fast speeds can be achieved by using small fixed-size kernels and only adapting the sample spacing to the penumbra estimation. Unfortunately this introduces artifacts for large penumbrae (see Figure 5). We will use this approach for our initial guess for freshly disoccluded fragments due to its speed. Annen et al. use a weighted summation of basis terms [5]. Using a variation of convolution shadow maps and the average blocker search from Fernando [4], Annen et al. [6] extract area light sources from environment maps. Back projection: A whole class of methods use back projection to get a more physically based estimation of the soft shadow [7,8,9,2]. These methods treat the shadow map as a discrete approximation of the blocker geometry. By back projecting shadow map samples onto the light source, an accurate calculation of the percentage of light source visibility can be done. Convincing results are produced, but a region (related to the size of the penumbra) of the shadow map has to be sampled for each fragment, which becomes costly for large penumbrae. Sampling: Maybe the most straightforward approach to computing soft shadows is by sampling an area or volume light source. Such methods are mostly targeted at off-line or interactive rendering. Heckbert and Herf [3] propose to sample the light source at random positions, render the scene and accumulate the results. Our approach moves this method into the real-time domain by exploiting temporal coherence. Agrawala et al. [10] create a layered attenuation map out of the shadow maps, which allows interactive frame rates. St-Amour et al. [11] combine the visibility information of many shadow maps into a compressed 3D visibility structure which is precomputed, and use this structure for rendering.
Temporal reprojection: Finally, temporal reprojection was used by Scherzer et al. [12] to improve the quality of hard shadow mapping and by Velázquez-Armendáriz et al. [13] and Nehab et al. [14] in a more general way to accelerate real-time shading. Our algorithm takes up the idea of temporal reprojection and uses it to compute shadows from area light sources.
3 The Algorithm
If we want to find the contribution of an area light source to a specific fragment, we have to calculate the fraction of the area of the light source that is visible from the fragment. For our purposes we will use the reverse value, which we call the occlusion percentage or soft shadow result ψ(x, y) for a fragment at screen space position (x, y). ψ(x, y) is 0 for a fragment that is illuminated by the whole area of the light source and 1 for a fragment that is not illuminated by the light source at all. Because most calculations we perform are done per fragment, and for the sake of notational simplicity, we will only use the (x, y) notation when introducing a function and will afterward omit it.

3.1 Estimating Soft Shadows from n Samples
To make the calculation of the soft shadow value feasible for rasterization hardware, we use sampling and shadow maps. We approximate an area light source by n different point light sources and compute a shadow map for each of them. A shadow map allows us to evaluate for every screen space fragment if it is illuminated by its associated point light:

\tau_i(x, y) = \begin{cases} 0 & \text{lit from point light } i \\ 1 & \text{in shadow of point light } i \end{cases} \qquad (1)

τ_i(x, y) is the result of the hard shadow test for the i-th shadow map for the screen space fragment at position (x, y). Under the assumption that our point light placement on the area light source is sufficiently random, the soft shadowing result ψ (i.e., the fractional light source area occluded from the fragment) can be estimated by the proportion ψ̂n of shadowed samples

\hat{\psi}_n(x, y) = \frac{1}{n} \sum_{i=1}^{n} \tau_i(x, y) \qquad (2)
The number of shadowed samples n ψ̂n has a Binomial distribution with variance n ψ(1 − ψ). We can thus give an unbiased estimator for the variance of the proportion ψ̂n as

\widehat{\mathrm{var}}(\hat{\psi}_n(x, y)) = \frac{\hat{\psi}_n(x, y)\,(1 - \hat{\psi}_n(x, y))}{n - 1} \qquad (3)
The importance of Equation 3 is that it allows us to estimate the quality of our soft shadow solution after taking n samples. We will later use this estimate (the
standard error derived from this estimate) to judge if sampling alone will give sufficient quality. Table 1 shows these formulas applied to some real-world τ_i and increasing sample sizes. Please note that although the standard error decreases when the sample size is increased, the estimator for the standard error is not guaranteed to do so.

Table 1. Evaluation of the presented formulas for one fragment. Increasing the sample size generally reduces variance and standard error, ŝ = √(var̂(ψ̂n)).

n             1     2     3     4     5     6     7
τ_n           1     0     1     1     1     0     1
ψ̂n           1.00  0.50  0.67  0.75  0.80  0.67  0.71
var̂(ψ̂n)     0     0.25  0.11  0.06  0.04  0.04  0.03
ŝ             0     0.50  0.33  0.25  0.20  0.21  0.18
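As an illustration (not part of the published implementation), a small C program can reproduce the numbers of Table 1 from the hard shadow test results τ = (1, 0, 1, 1, 1, 0, 1) using Equations 2 and 3; the variable names below are ours.

#include <math.h>
#include <stdio.h>

/* Reproduces Table 1: running proportion of shadowed samples (Eq. 2),
 * the variance estimator (Eq. 3) and the resulting standard error.   */
int main(void)
{
    const int tau[] = {1, 0, 1, 1, 1, 0, 1};  /* hard shadow tests per frame */
    int sum = 0;                              /* rho_n: sum of tests so far  */
    for (int n = 1; n <= 7; ++n) {
        sum += tau[n - 1];
        double psi = (double)sum / n;                                /* Eq. 2 */
        double var = (n > 1) ? psi * (1.0 - psi) / (n - 1) : 0.0;    /* Eq. 3 */
        printf("n=%d  tau=%d  psi=%.2f  var=%.2f  s=%.2f\n",
               n, tau[n - 1], psi, var, sqrt(var));
    }
    return 0;
}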
3.2 Temporal Coherence
We want to be able to solve Equation 2 iteratively, so that we only need to create one shadow map each frame. We will then use the temporal coherence of the current frame with previous frames to increase convergence. The temporal coherence method introduced by Scherzer et al. [12] improves the resolution of standard hard shadow maps. It is based on the assumption that most screen space fragments stay the same from frame to frame. It can be determined which fragments of the new frame have also been present in the last frame by reprojecting (to account for camera movement) fragments from the new frame into the old one and comparing their respective depths. If the depth difference is smaller than a predefined ε, the two fragments are considered equal and therefore fragment data from the previous frame is reused. If d_k(x, y) is the depth of the fragment at position (x, y) of the current frame and d_{k−1}(x, y) is the same for the previous frame, then the test for fragment equality is given by

|d_k(x, y) - d_{k-1}(x, y)| < \varepsilon \qquad (4)
If on the other hand the difference is greater, the fragment was not present in the last frame and is therefore new (a.k.a. disoccluded) and no previous data for this fragment is available. For our approach we want to be able to compute ψ̂n iteratively for each frame and fragment from the information saved from previous frames together with the information gathered for the current frame. We do this by keeping ρ_n(x, y) := \sum_{i=1}^{n} \tau_i(x, y) from the previous frame. ρ_n stores all the shadow map tests already performed for the n previously calculated shadow maps. We also need the sample size n, which is equal to the number of shadow map tests we have already performed for this fragment. If the fragment was occluded in the last frame, we get n = 0 and ρ_n(x, y) = 0 because no previous information is available. Therefore n can be different for each screen space fragment. We can now calculate ψ̂cur(x, y) for the current frame as
\hat{\psi}_{cur}(x, y) = \frac{\tau_{cur}(x, y) + \rho_n(x, y)}{n(x, y) + 1} \qquad (5)
This formula only needs access to n (the count of samples available for this fragment, stored in the previous frame), ρ_n (i.e. the sum of the shadow map tests up to the previous frame) and the current shadow map. We provide access to these values by storing in each frame the updated sample size n + 1 and τ_cur + ρ_n for every fragment into a screen-sized off-screen buffer called the shadow buffer for use in the next frame. Now this formula can be evaluated very quickly in a fragment shader and real-time frame rates pose no problem. Note that we also have to store the depth of each fragment to be able to evaluate the test in Equation 4.
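The following C sketch shows the per-fragment work behind Equations 4 and 5; the ShadowTexel layout and function names are assumptions chosen for illustration and do not come from the paper, and in practice this runs in the fragment shader rather than on the CPU.

#include <math.h>

typedef struct {      /* one shadow buffer texel (hypothetical layout) */
    float rho;        /* sum of previous hard shadow tests, rho_n      */
    float n;          /* number of samples accumulated so far          */
    float depth;      /* linear depth stored for the reprojection test */
} ShadowTexel;

/* tau_cur is the current hard shadow test (0 or 1); prev is the texel
 * reprojected from the previous frame's shadow buffer (or NULL).      */
float update_soft_shadow(ShadowTexel *out, const ShadowTexel *prev,
                         float tau_cur, float depth_cur, float eps)
{
    float n = 0.0f, rho = 0.0f;
    if (prev && fabsf(depth_cur - prev->depth) < eps) {  /* Eq. 4: same surface */
        n   = prev->n;
        rho = prev->rho;
    }                                                    /* else: disocclusion  */
    float psi  = (tau_cur + rho) / (n + 1.0f);           /* Eq. 5 */
    out->rho   = tau_cur + rho;                          /* stored for next frame */
    out->n     = n + 1.0f;
    out->depth = depth_cur;
    return psi;                                          /* soft shadow estimate */
}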
Fig. 2. Convergence after 1, 3, 7, 20 and 256 frames. Upper row: sampling of the light source, one sample per frame (using Equation 5); lower row: our new algorithm.
The disadvantage of this approach is that for newly disoccluded fragments (i.e. fragments with a small n) the results have large errors (see also Figure 2). This becomes clearer when we take a look at the example in Table 1: at n = 2 we have a standard error of 0.5, which means the real ψ probably lies within 0.5 ± 0.5, so the quality of ψ̂2 as an estimate is very poor. A second, closely connected problem that aggravates the situation even further is discontinuity in time – the difference between ψ̂n and ψ̂n+1 is still large. For instance ψ̂1 − ψ̂2 = 0.5, which means the soft shadow results can be 128 attenuation values apart. This is a noticeable jump and will cause flickering artifacts. In the next section we will show how to avoid this by using spatial filtering.
3.3 Spatial Filtering
Due to the use of temporal coherence and Equation 5 we have already constructed a soft shadow algorithm that takes little time to evaluate each frame, but two closely related problems remain:
– For a good estimation of ψ with ψ̂n using temporal coherence alone, we need the fragment to be visible for many frames.
– During the frames following the disocclusion of a fragment, ψ̂n(x, y) has a large standard error and may change drastically, resulting in flickering of these fragments (see also Figure 2, upper row).
Our first observation in this respect is that for the first few frames after a fragment has been disoccluded, a single sample soft shadow mapping approach will probably have better quality than using ψ̂n (a.k.a. the sampled approach), which is also suggested by the high variance in this case. So our first improvement over using only Equation 5 is to use a very fast single sample soft shadow approach, PCSS, as a starting point and then refine it by using Equation 5 (see Figure 3). Note that PCSS itself is also a spatial filter. For the refinement, we have to initialize n and ρ_n for the sample generated using PCSS. The PCSS shadow test will give us a result in the range [0..1]. The most natural choice here is to use this result directly as ρ_n and set n = 1. Note that it can make sense to use higher values for n, meaning that the PCSS sample will have a greater weight than the following "normal" samples. Using bigger values for n can lead to faster convergence if the PCSS result is near the correct solution (see Figure 3). This approach can considerably shorten the period a fragment has to be visible to achieve a good estimation of ψ with ψ̂n. Please note that other single sample soft shadow approaches could also be used together with our sampling approach. We have chosen PCSS mainly for its speed.
For small n, each new sample potentially causes drastic changes to the estimated soft shadow solution. On the one hand these changes need to happen to guarantee swift convergence, but on the other hand we want to avoid the resulting flickering artifacts in the rendered shadows. Therefore we propose to introduce an additional smoothing by using a neighborhood filter in screen space just for the rendered shadows, without any impact on the soft shadow information (namely n + 1 and ρ_{n+1}) stored in the shadow buffer for the next frame (see Figure 2, lower row, for results). The next paragraph describes in detail how this filter is constructed. We still have to decide when to apply the neighborhood filtering. For this we use the standard error of the variance estimator of Equation 3. This gives us an estimated error for our sampling approach. We choose an error threshold t down to which neighborhood filtering will be used. If the error is smaller, only sampling will be applied.
Neighborhood Filter. The neighborhood filter should remove noise and flickering artifacts in the resulting shadow by smoothing. We use a box filter and
only include samples within a certain depth range of the current fragment, since those are likely to have a similar ψ̂n(x, y).

Fig. 3. Convergence for the first 4 frames when sampling the light source one sample per frame (upper row) can be greatly increased by using PCSS as its first sample (lower row). Here we have used n = 4 for the PCSS sample, so it is weighted like 4 normal samples.

The main question is how to set the filter width to achieve the correct amount of smoothing. Since we want to render soft shadows, a good filter size is given by the penumbra width. Although we don't know the exact penumbra width, we can estimate it efficiently. We base our estimation on the one used in Fernando [4] because it is fast and simple. It assumes that blocker, receiver and light source are parallel and is given by

pw = lightsize \cdot \frac{receiver - avg(blocker)}{avg(blocker)} \cdot \frac{near}{depth_{eye}} \qquad (6)

where pw denotes the penumbra width (projected to screen space) we want to estimate, receiver is the depth of the current fragment and lightsize is the size of the light source; near is the near plane distance and depth_{eye} the fragment depth used for the projection. The calculation of the average blocker depth avg(blocker) is one of the costliest steps in this algorithm. It works by averaging the depths of the k nearest texels in the shadow map for each fragment each frame. In our approach, we can avoid doing this by using temporal coherence again: when a fragment first becomes visible we perform PCSS, including the blocker depth estimation, anyway. We save this value avg(blocker), generated from k depth samples, as a starting value, and in each successive frame we refine it using one additional depth sample depth_i from the new shadow map:

avg(blocker)_i = \frac{depth_i + avg(blocker)_{i-1} \cdot (i - 1 + k)}{i + k} \qquad (7)
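A minimal sketch of these two estimates in C (function and parameter names are illustrative, not taken from the paper's implementation):

/* Incremental refinement of the average blocker depth (Eq. 7): the initial
 * PCSS blocker search averaged k depth samples; every later frame folds in
 * one additional shadow map sample.                                        */
float refine_avg_blocker(float avg_prev, float depth_i, int i, int k)
{
    return (depth_i + avg_prev * (float)(i - 1 + k)) / (float)(i + k);
}

/* Screen space penumbra width estimate (Eq. 6), used as the width of the
 * depth-aware neighborhood filter.                                        */
float penumbra_width(float light_size, float receiver, float avg_blocker,
                     float near_plane, float depth_eye)
{
    return light_size * (receiver - avg_blocker) / avg_blocker
                      * near_plane / depth_eye;
}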
3.4 Accounting for Moving Objects
The original temporal reprojection paper by Scherzer et al. [12] does not account for moving objects, and up to now we have only accounted for camera movement by reprojection. If we want to be able to display moving objects that cast and receive soft shadows, we have to investigate how these influence the evaluation of our algorithm. Moving objects will appear in the shadow map as potential casters of dynamic shadows and also in the scene as dynamic objects onto which shadows are cast. With moving objects disocclusions become very frequent and therefore temporal reprojection does not work as well. Our approach is therefore to first identify moving objects and shadows that are cast by moving objects in order to handle these cases. First we change the generation of the shadow map by including for each texel the information whether it belongs to a moving object or not. Because the depth in the shadow map is always positive we can store this inside the depth by using the sign, thereby generating no additional memory cost. A negative sign means that this texel is from a moving object (and therefore a potential dynamic shadow caster). We can now retrieve this information for each screen-space fragment when we do the shadow map test. If we encounter a negative depth we know that the shadow that falls on this fragment is cast by a moving object. If we detect such a case we must assume that the data stored in the shadow buffer for the current fragment is probably invalid, and we therefore handle it like a normal disocclusion and apply PCSS. This alone would lead to unsatisfactory results because we generate a different shadow map each frame at a different position on the light source. This would cause the PCSS shadow to jump around each frame. Therefore we additionally apply the neighborhood filter from Section 3.3 to smooth out any jumps and decrease discontinuities between the primarily PCSS-based moving shadows and the sampled static shadows.
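A possible encoding of the moving-object flag in the sign of the stored shadow map depth (a sketch under the stated assumption that depths are always positive; names are illustrative):

#include <math.h>

/* Tag a shadow map texel as belonging to a moving object without extra storage. */
float encode_shadow_depth(float depth, int is_moving)
{
    return is_moving ? -depth : depth;   /* negative sign = dynamic caster */
}

int cast_by_moving_object(float stored_depth)
{
    return stored_depth < 0.0f;          /* triggers the PCSS fallback     */
}

float decode_shadow_depth(float stored_depth)
{
    return fabsf(stored_depth);          /* actual depth for the shadow test */
}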
4 Implementation and Results
We implemented the algorithm in three passes: first, render the shadow map; second, render into the new shadow buffer (applying the algorithm) and the final color buffer (using the shadow buffer from the previous frame as input); and third, copy the final color buffer to the framebuffer. For our tests, we used an Intel Core 2 Duo E6600 CPU with an NVIDIA 280GTX graphics card and DirectX 10. All images were taken using shadow maps with a resolution of 1024². For selecting the sample position on the area light we used a Halton sequence. Other quasi-random sequences showed similar behavior. The screen space neighborhood filter uses a Poisson disk centered at the current fragment with a fixed sample size (16 samples). The shadow maps were rendered using standard uniform shadow mapping, a 32-bit floating point texture and linear depth. Hardware 2×2 PCF filtering was not used. We used the multiple render target functionality for the second pass to render into the shadow buffer and into an 8-bit RGB buffer. The shadow buffer is a 4-channel 16-bit floating point texture. It contains ρ_n, n, the linear
depth of the fragment and the average blocker depth avg(blocker). To have meaningful neighborhood texels at the frame buffer borders, we have chosen the resolution of the shadow buffer about 5% larger than the framebuffer, so if we assume an 800×600 framebuffer, the shadow buffer would be 840×630. Please note that 5% was sufficient for our movement speeds. If faster movements occur, a larger "overscan" should be chosen. Two instances of the shadow buffer are required (one for reading and one for writing), resulting in an additional memory requirement of 2 × 840 × 630 × 4 × 2 ≈ 8 MB. We used an error threshold of t = 1/50. Our goal was to develop a soft shadow approach that should be faster than PCSS but provide better quality at the same time; therefore we compared against a fast PCSS version using only 16 texture lookups for the blocker search and 16 texture lookups for the PCF kernel. Figure 4 shows a benchmark of a typical walkthrough of one of our test scenes, using a viewport of 1024×768 pixels. Timings are given in ms. Our algorithm tends to be slower if there are many disocclusions, because here it has to perform the blocker search that also PCSS
Fig. 4. A sample walkthrough in one of our test scenes with our new method and with PCSS using 16/16 samples for blocker/PCF lookup
Fig. 5. From left to right: overlapping occluders (our method, PCSS 16/16) and bands in big penumbrae (our method, PCSS 16/16) are known problem cases for single sample approaches
has to perform. Our shader is more complex (more ifs) than the PCSS shader, so we can be slower than PCSS in such circumstances. The used PCSS16 always performs 16+16 lookups, while our shader only has to do those 16+16 lookups for disocclusions. Our shader performs at least one shadow map lookup and one shadow buffer lookup every frame. 16 additional lookups are used for the neighborhood filter, which is applied when the standard error is higher than the threshold t and for every fragment where the single sample soft shadow approach is active. Figure 5 shows typical problems of PCSS that are solved with our approach. A better comparison can be seen in the accompanying videos. We also did a comparison with a more elaborate PCSS with 32 lookups for the occluder search and 32 lookups for the PCF kernel, which was considerably slower than our algorithm.

4.1 Limitations
Although we presented a method to handle moving objects, there is still room for improvement. Especially when dynamic shadows overlap with static shadows with large penumbrae, some flickering artifacts remain (see the accompanying videos).
5 Conclusions and Future Work
We presented a very fast soft shadow approach based on shadow maps that uses temporal reprojection to achieve physical correctness. Where temporal reprojection is insufficient we use spatial filtering to allow for soft shadows on recently disoccluded fragments. As a future direction of our research we would like to investigate multiple light sources because they lend themselves naturally to this approach: nowhere do we assume that the soft shadow data in the shadow buffer comes from the same light source, so we can extend the approach to multiple light sources simply by calculating a shadow map for each light source each frame. The values ρ_n and n can then accumulate contributions from all the light sources. Moving light sources could also be possible. We think that an approach that weights older light source samples less, together with age factors for shadow buffer fragments, could work. Moving objects would benefit from calculating a second shadow map at the center of the light source each frame. Maybe the two shadow maps could be calculated in a combined fashion. Our statistical approach is currently based on uniform distributions of samples. Maybe non-uniform distributions could improve convergence.
Acknowledgements. This work was supported by the Austrian Science Fund (FWF), project number P21130-N13.
References
1. Hasenfratz, J.M., Lapierre, M., Holzschuch, N., Sillion, F.: A survey of real-time soft shadows algorithms. Computer Graphics Forum 22, 753–774 (2003)
2. Schwarz, M., Stamminger, M.: Quality scalability of soft shadow mapping. In: GI 2008: Proceedings of Graphics Interface 2008, Toronto, Ont., pp. 147–154. Canadian Information Processing Society, Canada (2008)
3. Heckbert, P.S., Herf, M.: Simulating soft shadows with graphics hardware. Technical Report CMU-CS-97-104, CS Dept., Carnegie Mellon University (1997), http://www.cs.cmu.edu/~ph
4. Fernando, R.: Percentage-closer soft shadows. In: SIGGRAPH 2005: ACM SIGGRAPH 2005 Sketches, p. 35. ACM, New York (2005)
5. Annen, T., Mertens, T., Bekaert, P., Seidel, H.P., Kautz, J.: Convolution shadow maps. In: Kautz, J., Pattanaik, S. (eds.) Rendering Techniques 2007: Eurographics Symposium on Rendering, Eurographics / ACM SIGGRAPH Symposium Proceedings, Grenoble, France, vol. 18, pp. 51–60 (2007)
6. Annen, T., Dong, Z., Mertens, T., Bekaert, P., Seidel, H.P., Kautz, J.: Real-time, all-frequency shadows in dynamic scenes. In: SIGGRAPH 2008: ACM SIGGRAPH 2008 Papers, pp. 1–8. ACM, New York (2008)
7. Guennebaud, G., Barthe, L., Paulin, M.: Real-time soft shadow mapping by backprojection. In: Eurographics Symposium on Rendering (EGSR), Nicosia, Cyprus, pp. 227–234 (2006)
8. Guennebaud, G., Barthe, L., Paulin, M.: High-quality adaptive soft shadow mapping. In: Proceedings of Computer Graphics Forum, Eurographics 2007, vol. 26, pp. 525–534 (2007)
9. Schwarz, M., Stamminger, M.: Microquad soft shadow mapping revisited. In: Eurographics 2008, Annex to the Conference Proceedings: Short Papers, pp. 295–298 (2008)
10. Agrawala, M., Ramamoorthi, R., Heirich, A., Moll, L.: Efficient image-based methods for rendering soft shadows. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 375–384. ACM Press/Addison-Wesley Publishing Co. (2000)
11. St-Amour, J.F., Paquette, E., Poulin, P.: Soft shadows from extended light sources with penumbra deep shadow maps. In: Graphics Interface 2005 Conference Proceedings, pp. 105–112 (2005)
12. Scherzer, D., Jeschke, S., Wimmer, M.: Pixel-correct shadow maps with temporal reprojection and shadow test confidence. In: Kautz, J., Pattanaik, S. (eds.) Rendering Techniques 2007: Proceedings Eurographics Symposium on Rendering, pp. 45–50. Eurographics Association (2007)
13. Velázquez-Armendáriz, E., Lee, E., Bala, K., Walter, B.: Implementing the render cache and the edge-and-point image on graphics hardware. In: GI 2006: Proceedings of Graphics Interface 2006, Toronto, Ont., pp. 211–217. Canadian Information Processing Society, Canada (2006)
14. Nehab, D., Sander, P.V., Lawrence, J., Tatarchuk, N., Isidoro, J.R.: Accelerating real-time shading with reverse reprojection caching. In: Graphics Hardware (2007)
Real-Time Dynamic Wrinkles of Face for Animated Skinned Mesh
L. Dutreve, A. Meyer, and S. Bouakaz
Université de Lyon, CNRS, Université Lyon 1, LIRIS, UMR5205, F-69622, France
PlayAll
Abstract. This paper presents a method to add fine details, such as wrinkles and bulges, to a virtual face animated by common skinning techniques. Our system is based on a small set of reference poses (combinations of skeleton poses and wrinkle maps). At runtime, the current pose is compared with the reference skeleton poses, and wrinkle maps are blended and applied where similarities exist. The pose evaluation is done using the skeleton's bone transformations. Skinning weights are used to associate rendered fragments with reference poses. This technique was designed to be easily inserted into a conventional real-time pipeline based on skinning animation and bump mapping rendering.
1 Introduction
Recent rendering and animation advances improve the realism of complex scenes. However, realistic facial animation for real-time applications is still hard to obtain. Several reasons explain this difficulty. The first one stems from the limited computational and memory resources supplied by an interactive system, and another one from the difficulty of simulating small deformations of the skin, like wrinkles and bulges, when muscles deform its surface. As small as these details are, they greatly contribute to the recognition of facial expressions, and thus to the realism of virtual faces. While many works have been proposed for large-scale animations and deformations, only a few of them deal with real-time animation of small-scale details. We consider small-scale details in an animation context, i.e., wrinkles and bulges appearing during muscle contractions, rather than the micro-structures of the skin that are independent of facial expressions. Recent progress in motion capture and wrinkle generation techniques allows the production of these details in the form of high-resolution meshes that are not fully usable for real-time animation applications such as video games. Nevertheless, converting these detailed meshes into a few wrinkle maps may be a good input for our real-time and low-memory technique. Oat [1] proposed a technique to blend wrinkle maps to render animated details on a human face. A mask map defines different facial areas; coefficients are manually tuned for each area to blend the wrinkle maps. This technique provides good results at interactive rates, but the task of defining wrinkle map coefficients at each frame or key-frame of the animation is a long and tedious
work. Furthermore, a new animation will require new coefficients and previous work cannot be reused. This is why we intend to improve this approach and adapt it to a face animated by skinning techniques. Our dynamic wrinkling system is based on a small set of reference poses. A reference pose is a pair of a large-scale deformation (a skeleton pose) and its associated small-scale details (stored in the form of a wrinkle map). During the animation, the current pose is compared with the reference poses and coefficients are computed automatically. Notice that the comparison is done at a "bone level", resulting in local blends and independence between areas of the face. The main contribution of our approach is a technique to easily add animation details to a face animated by a skinning technique. Wrinkle map coefficients are computed automatically by using a small set of reference poses. We do not use a mask map to separate areas of the face; instead we propose to use skinning weights as a correlation between bone movements and reference wrinkle map influences. Moreover, it was designed to be easily inserted into a conventional runtime pipeline based on skinning and bump mapping, such as new generation video game engines or any other interactive application. Indeed, by blending wrinkle maps on the GPU, our approach does not modify the per-pixel lighting and no additional rendering pass is required. And by taking only the bone positions as input, our approach does not need to change the skinning-based animation. Thus, the implementation requires little effort. The supplementary material needed compared to a skinning-based animation system is only a small set of reference poses; two are sufficient for the face. The runtime computation cost is not much higher than for classical animation and rendering pipelines. Finally, our approach greatly improves realism by increasing the expressiveness of the face.
2 Related Work
Although many articles have been published on large-scale facial animation, adding details to an animation in real-time remains a difficult task. Some methods proposed physical models for skin aging and wrinkling simulation [2,3,4,5]. Some research focused on detail acquisition from real faces, based on intensity ratio images which are then converted to normal maps [6], shape-from-shading [7] or other similar self-shadowing extraction techniques [8,9], or with the help of structured light projectors [10,11]. Some works proposed detail transfer techniques to apply existing details to new faces [6,11,9]. We focus this state of the art on wrinkling systems for animated meshes. Volino et al. [12] presented a technique to add wrinkles to an existing animated triangulated surface. Based on edge length variations, the amplitudes of the applied height maps are modified. For each edge and texture map, a shape coefficient is calculated to determine the influence of an edge on the given map. The more perpendicular an edge is to the wrinkles, the more its compression or elongation will disturb the height map. The rendering is done by a ray tracer with bump mapping shading and an adaptive mesh refinement. Bando et al. [13] proposed a method to generate fine and large scale wrinkles on human body parts.
Fine scale wrinkles are rendered using bump mapping and are obtained by using a direction field defined by the user. Large scale wrinkles are rendered by displacing vertices of a high-resolution mesh and are obtained by using Bezier curves manually drawn by the user. Wrinkle amplitudes are modulated during the mesh animation by computing triangle shrinkage. Na et al. [11] extended the Expression Cloning technique [14] to allow a hierarchical retargeting. The transfer can be applied to different animation levels of detail, from a low-resolution mesh up to a high-resolution mesh where dynamic wrinkles appear. While some techniques were proposed to generate wrinkles, few approaches focused on real-time applications [15,1]. Larboulette et al. [15] proposed a technique to simulate dynamic wrinkles. The user defines wrinkling areas by drawing a segment perpendicular to the wrinkles (wrinkling area), followed by the choice of a 2D discrete control curve (wrinkle template). The control curve conserves its length during the mesh deformation, generating amplitude variations. Wrinkles are obtained by mesh subdivision and displacement along the wrinkle curve. Many methods need a high-resolution mesh or a costly on-the-fly mesh subdivision scheme to generate or animate wrinkles [9,7,11,4]. Due to real-time constraints, these techniques are difficult to use in an efficient way. Recent advances in GPU computing allow fine details to be rendered efficiently by using bump maps for interactive rendering. Oat presented in [1] a GPU technique to easily blend wrinkle maps applied to a mesh. Maps are subdivided into regions; for each region, coefficients allow blending between the wrinkle maps. This technique has low computational and storage costs; three normal maps are used for a neutral, a stretched and a compressed expression. Furthermore it is easily implemented and added to an existing animation framework. As explained in the introduction, the main drawback of this method is that it requires manual tuning of the wrinkle map coefficients for each region. Our method aims to propose a real-time dynamic wrinkling system for skinned faces that generates wrinkle map coefficients automatically and does not require a mask map.
3 Overview
Figure 1 shows the complete framework of our wrinkling animation technique. The first step is to create o reference poses. Each of them consists of a skeleton pose (large-scale deformation) and a wrinkle map (fine-scale deformation). During the runtime animation, the current skeleton pose is compared with the reference poses and coefficients are calculated for each bone of each reference pose. Skinning and reference pose/bone weights are used to blend the wrinkle maps with the default normal map and render the face with surface lighting variations. This last step is done on the GPU during the per-pixel lighting process. The remainder of this paper is organized as follows. In Section 4, we briefly present the existing large-scale deformation technique and the input reference poses. Section 5 presents the main part of the paper: we show how we use the reference poses in real-time to obtain dynamic wrinkling and fine-detail animation. In Section 6, we present some results and conclude this paper.
Fig. 1. Required input data are a classic skinned mesh in a rest pose and some reference poses (reference pose = skeleton pose + wrinkle map). At runtime, each pose of the animation is compared with the reference poses, bone by bone; then the skinning influences are used as masks to apply the bone pose evaluation. Wrinkle maps are blended on the GPU and per-pixel lighting renders the current frame with dynamic wrinkles and details.
4 Large-Scale Deformations and Reference Poses
As mentioned above, our goal is to adapt the wrinkle maps method to the family of skinning techniques which perform the large-scale deformations of the face. They offer advantages in terms of memory cost and simplicity of posing. Many algorithms have been published [16,17,18,19] about skinning and its possible improvements. Our dynamic wrinkling technique can be used with any of the skinning methods cited. We only require that mesh vertices are attached to one or more bones with a convex combination of weights. Reference poses are manually created by a CG artist, who deforms the facial "skeleton" to obtain the expressions where wrinkles should appear. These expressions should be strongly pronounced to provide better results, i.e., if the artist wants wrinkles to appear while the eyebrows rise up, he should pose the face with the eyebrows raised to the maximum, when the wrinkle furrows are deepest. Since influences work with bones or groups of bones, it is possible to define a pose where wrinkles appear in various areas of the face. Having details in different areas will not cause all of these details to appear at the same time in an arbitrary frame. For example, details around the mouth would appear independently of forehead details, even if they are described in the same reference pose.
5 Animated Wrinkles
Our goal is to use preprocessed wrinkle data to generate visual dynamic fine deformations of the face surface, instead of computing costly functions and algorithms in order to generate wrinkles on the fly. We explain in this section
how reference poses are used in real-time and at arbitrary poses. The runtime algorithm consists of three main steps:
– Computing the influence of each reference pose on each bone.
– Associating reference poses and vertices by using skinning weights and the influences computed in the first step.
– Blending the wrinkle maps and the default normal map during the per-pixel lighting process.

5.1 Pose Evaluation
The first step is to compute the influence of each reference pose on each bone. This consists in finding how the bone position at an arbitrary frame differs from its position at the reference poses¹. Computing these influences at the bone level instead of at the full pose level allows us to determine regions of interest. This offers the possibility to apply different reference poses at the same time, resulting in the need for fewer reference poses (only 2 are sufficient for the face: a stretched and a compressed expression). We define the influence of the pose Pk for the bone Ji at an arbitrary frame f by this equation:

I_{P_k}(J_{if}) = \begin{cases} 0 & \text{if } J_{iP_0} = J_{iP_k} \\ \alpha_{ifk} & \text{if } 0 \le \alpha_{ifk} \le 1 \\ 0 & \text{if } \alpha_{ifk} < 0 \\ 1 & \text{if } \alpha_{ifk} > 1 \end{cases}
\qquad\text{with}\qquad
\alpha_{ifk} = \frac{(J_{if} - J_{iP_0}) \cdot (J_{iP_k} - J_{iP_0})}{\| J_{iP_k} - J_{iP_0} \|^2}

where · denotes the dot product and ‖·‖ the Euclidean norm. This computation consists in projecting J_{if} orthogonally onto the segment (J_{iP_0}, J_{iP_k}). So, at each frame, the reference pose influences for bone Ji can be written as a vector of dimension o + 1 (o reference poses plus the rest pose), α_{if} = <α_{if0}, α_{if1}, ..., α_{ifk}, ..., α_{ifo}>, where α_{if0} is the influence of the rest pose:

\alpha_{if0} = \max\Big(0,\; 1 - \sum_{k=1}^{o} \alpha_{ifk}\Big)

¹ Note that we deal with bone positions in the coordinate system of the head root bone, so we can assume that the head's rigid transformation will not cause problems during the pose evaluation.
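A minimal C sketch of this per-bone evaluation (the Vec3 type and function names are ours; the denominator uses the squared segment length, i.e. the standard orthogonal projection parameter, so that the influence reaches 1 when the bone arrives at its reference position):

typedef struct { float x, y, z; } Vec3;

static float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  sub(Vec3 a, Vec3 b) { Vec3 r = {a.x-b.x, a.y-b.y, a.z-b.z}; return r; }

/* Influence of reference pose Pk on bone Ji: project the current bone
 * position onto the segment from the rest position to the reference
 * position and clamp the projection parameter to [0,1].               */
float pose_influence(Vec3 j_cur, Vec3 j_rest, Vec3 j_ref)
{
    Vec3  d  = sub(j_ref, j_rest);
    float dd = dot(d, d);
    if (dd == 0.0f) return 0.0f;   /* bone identical in rest and reference pose */
    float a = dot(sub(j_cur, j_rest), d) / dd;
    return a < 0.0f ? 0.0f : (a > 1.0f ? 1.0f : a);
}

/* Rest pose weight: whatever influence the o reference poses do not claim. */
float rest_pose_influence(const float *alpha, int o)
{
    float s = 0.0f;
    for (int k = 0; k < o; ++k) s += alpha[k];
    return s > 1.0f ? 0.0f : 1.0f - s;
}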
5.2 Bones Masks
Once we know the reference pose influences for each bone, we can use them for the per-pixel lighting. The main idea is to use the skinning weights to compute the influence of the reference poses for each vertex and, by the interpolation done during the rasterization phase, for each fragment (Fig. 2). Since wrinkles and
expressive details are greatly related to the face deformations, we can deduce that these details are related to the bone transformations too. So we associate the bone influences with the reference pose influences. We assume that the skinning weights are convex, resulting in a simple equation (influence of the pose Pk for vertex v_{jf} at frame f):

Inf(v_{jf}, P_k) = \sum_{i=1}^{n} \big( w_{ji} \times I_{P_k}(J_{if}) \big)
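In a vertex-shader-like formulation this is just a weighted sum over the bones that influence the vertex; a C sketch with illustrative names:

/* Per-vertex influence of reference pose Pk, obtained by reusing the
 * skinning weights as masks: weights/bone_ids describe the rig of one
 * vertex, bone_influence[b] holds I_Pk(J_b) computed per bone.        */
float vertex_pose_influence(const float *weights, const int *bone_ids,
                            int num_bones, const float *bone_influence)
{
    float inf = 0.0f;
    for (int i = 0; i < num_bones; ++i)
        inf += weights[i] * bone_influence[bone_ids[i]];
    return inf;   /* interpolated per fragment during rasterization */
}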
Fig. 2. The two left images show the skinning influences of the two bones of the right eyebrow. A reference pose has been set with these two bones rising up. The last image shows the influence of this reference pose for each vertex attached to these bones.
Skeletons may vary greatly across applications; complex animations require many bones per face. This can generate some redundancy between bone transformations. If the CG artist defines a pose which requires a similar displacement of 3 adjacent bones, a wrinkle will appear along the areas of these 3 bones. At runtime, artifacts or wrinkle breaks may appear if the 3 bones do not move similarly. To avoid this, we simply give the possibility to group some bones together. Their transformations remain independent, but their pose weights are linked, i.e., the average of their coefficients is used instead of their individual values.

5.3 Details Blending
The final step of our method is to apply the wrinkle maps to our mesh by using the coefficients Inf(v_{jf}, P_k) computed in the previous step. Two methods can be used depending on how the wrinkle maps have been generated:
– Wrinkle maps are improvements of the neutral map (i.e., details of the neutral map such as pores, scars and other static fine details are present in the wrinkle map).
– Wrinkle maps only contain deformations associated with the expression.
In the first case, a simple blending is used. Since the same static details are present in both the neutral and the wrinkle maps, blending will not produce a loss of static deformations, while in the second case a simple averaging will cause such a loss. For example, a fragment influenced 100 percent by a wrinkle map will be drawn without using the neutral map at all, so that the details of the neutral map will not appear. A finer blending is required. [1] proposed one for normal maps in tangent space. Let WN be the final normal, W the normal provided by the blending of the wrinkle maps and N the normal provided by the default normal map:

WN = normalize(W.x + N.x, W.y + N.y, W.z × N.z)

The addition of the first two coordinates makes a simple averaging between the directions of the two normals, giving the desired direction. The z components are multiplied, which increases the details obtained from the two normals: the smaller the z value is, the more bumpy the surface is, and the multiplication allows us to add the neutral surface variation to the wrinkled surface.
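In shader code this blend is a one-liner; a C sketch of the same operation (type and function names are ours):

#include <math.h>

typedef struct { float x, y, z; } Normal;

/* Blend a wrinkle map normal W with the neutral map normal N in tangent
 * space: average the tangent-plane offsets, multiply the z components to
 * accumulate bumpiness, then renormalize.                                */
Normal blend_wrinkle_normal(Normal W, Normal N)
{
    Normal r = { W.x + N.x, W.y + N.y, W.z * N.z };
    float  len = sqrtf(r.x*r.x + r.y*r.y + r.z*r.z);
    r.x /= len;  r.y /= len;  r.z /= len;
    return r;
}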
6 Results and Conclusion
Our test model Bob contains 9040 triangles for the whole body, which is a typical size for video games. The face is rigged with 21 bones. The animation runs at more than 100 fps on a laptop equipped with a Dual Core 2.20 GHz CPU, 2 GB RAM and an NVIDIA 8600M GT GPU. Rendering is done using OpenGL and NVIDIA Cg for GPU programming. We use tangent space normal maps as wrinkle maps in our experiments. We focus our tests on the forehead wrinkles because they are the most visible expressive wrinkles and generate the largest visual deformations. Furthermore, this area is subject to different muscle actuations; compressed and stretched expressions generate different small surface variations. Similar to [20], we use a feature-point based animation transfer to animate our characters. This facial animation retargeting algorithm is based on feature-point displacements and Radial Basis Function regression (Fig. 3). Figure 4 shows 4 facial expressions with and without dynamic wrinkling. By analyzing the top row, you may notice that expression recognition is not easy. The neutral and the angry expressions of the eyebrows are not very distinct.
Fig. 3. Example frames of an animation provided by a 2D tracker and transferred in real-time to our skinned face enhanced with our wrinkling system
Fig. 4. The first row shows our character without dynamic wrinkles. The second row shows the reference pose influences. The last row shows our character with details. Notice that the second and the third columns define the two reference poses.
However, the bottom row shows that wrinkles improve the expression recognition. Figure 5 shows the independence between facial areas without additional mask maps. The pose evaluation is done on the CPU and the resulting coefficients are sent to the GPU. The bone masks step is performed in the vertex shader, i.e., skinning weights are multiplied by the pose coefficients, resulting in o values (one value for each reference pose). After rasterization we obtain the interpolated values for each fragment in the pixel shader, where we use these coefficients to blend the normal maps and compute the lighting. The large-scale deformation as well as the rendering are not modified, so adding our method to an existing implementation is easy. No additional rendering pass is required; only a few functions need to be added to the different steps cited above. The data sent to the GPU amounts to o × n floating-point values, with o the number of reference poses and n the number of bones. Our technique is greatly artist-dependent. Three steps are important to obtain good results. First, a good rigging is essential since we directly use the skinning weights as bone masks, and so it defines how each vertex will be influenced by the different reference poses. Second, the reference poses should consist of a set of skeleton poses that is as orthogonal as possible, to avoid overfitting. Notice that blending reference poses in the same area is possible (last column of Fig. 4), but it becomes a problem if similar bone displacements lead to different fine details. Finally, the quality of the detail maps greatly influences the visual results. Our choice to use skinning weights as bone masks offers many advantages. They allow us to relate large-scale and small-scale deformations, and so we do not
Fig. 5. This figure demonstrates that the reference pose influences are independent between areas of the face. Only the right part of the forehead is influenced by the stretch wrinkle map in the middle image, while the whole forehead is influenced in the right image.
need additional mask textures. They ensure that the vertices influenced by reference poses are vertices which are also displaced along with the bones. However, weights become smaller the farther they are from the bone locations, and so the wrinkle depth becomes smaller too, even if the wrinkles should be as visible as those near the bones. We have presented a technique that uses pre-generated reference poses to generate, in real-time, wrinkles and fine details appearing during an arbitrary skinned face animation. In addition to providing interesting visual results, the requirements that we considered necessary and/or important have been met. Our dynamic wrinkle animation runs in real-time; the use of per-pixel lighting allows us to dispense with high-resolution meshes or costly subdivision techniques. Furthermore, it is based on widely-used techniques such as skinning and bump mapping. Its implementation does not present technical difficulties and does not modify the usual animation and rendering pipeline. However, the results depend greatly on the quality of the input data provided by the CG artist. We plan to investigate this issue by developing specific tools to help him/her in the creation of the reference poses.
References
1. Oat, C.: Animated wrinkle maps. In: ACM SIGGRAPH 2007 Courses, pp. 33–37 (2007)
2. Wu, Y., Kalra, P., Magnenat-Thalmann, N.: Simulation of static and dynamic wrinkles. In: Proc. Computer Animation 1996, pp. 90–97 (1996)
3. Boissieux, L., Kiss, G., Magnenat-Thalmann, N., Kalra, P.: Simulation of skin aging and wrinkles with cosmetics insight. In: Proc. of Eurographics Workshop on Animation and Simulation (2000)
4. Kono, H., Genda, E.: Wrinkle generation model for 3D facial expression. In: ACM SIGGRAPH 2003 Sketches & Applications, p. 1 (2003)
5. Venkataramana, K., Lodhaa, S., Raghava, R.: A kinematic-variational model for animating skin with wrinkles. Computers & Graphics 29, 756–770 (2005)
6. Tu, P.H., Lin, I.C., Yeh, J.S., Liang, R.H., Ouhyoung, M.: Surface detail capturing for realistic facial animation. J. Comput. Sci. Technol. 19, 618–625 (2004)
7. Lo, Y.S., Lin, I.C., Zhang, W.X., Tai, W.C., Chiou, S.J.: Capturing facial details by space-time shape-from-shading. In: Proc. of the Computer Graphics International, pp. 118–125 (2008)
8. Bickel, B., Botsch, M., Angst, R., Matusik, W., Otaduy, M., Pfister, H., Gross, M.: Multi-scale capture of facial geometry and motion. ACM Trans. Graph. 26 (2007)
9. Bickel, B., Lang, M., Botsch, M., Otaduy, M., Gross, M.: Pose-space animation and transfer of facial details. In: Proc. of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (2008)
10. Zhang, L., Snavely, N., Curless, B., Seitz, S.M.: Spacetime faces: High-resolution capture for modeling and animation. In: ACM Annual Conference on Computer Graphics, pp. 548–558 (2004)
11. Na, K., Jung, M.: Hierarchical retargetting of fine facial motions. Computer Graphics Forum 23, 687–695 (2004)
12. Volino, P., Magnenat-Thalmann, N.: Fast geometrical wrinkles on animated surfaces. In: Proc. of the 7th International Conference in Central Europe on Computer Graphics, Visualization and Interactive Digital Media (1999)
13. Bando, Y., Kuratate, T., Nishita, T.: A simple method for modeling wrinkles on human skin. In: Proc. of the 10th Pacific Conference on Computer Graphics and Applications, pp. 166–175 (2002)
14. Noh, J.Y., Neumann, U.: Expression cloning. ACM Trans. Graph., 277–288 (2001)
15. Larboulette, C., Cani, M.P.: Real-time dynamic wrinkles. Computer Graphics International (2004)
16. Lewis, J.P., Cordner, M., Fong, N.: Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. ACM Trans. Graph., 165–172 (2000)
17. Kry, P.G., James, D., Pai, D.: Eigenskin: Real time large deformation character skinning in hardware. In: Proc. of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 153–160 (2002)
18. Kurihara, T., Miyata, N.: Modeling deformable human hands from medical images. In: Proc. of the 2004 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 357–365 (2004)
19. Rhee, T., Lewis, J.P., Neumann, U.: Real-time weighted pose-space deformation on the GPU. Computer Graphics Forum 25, 439–448 (2006)
20. Dutreve, L., Meyer, A., Bouakaz, S.: Feature points based facial animation retargeting. In: Proc. of the 15th ACM Symposium on Virtual Reality Software and Technology, pp. 197–200 (2008)
Protected Progressive Meshes
Michael Gschwandtner and Andreas Uhl
Multimedia Signal Processing and Security Lab (WaveLab), Department of Computer Sciences, Salzburg University, Austria
{mgschwan,uhl}@cosy.sbg.ac.at
Abstract. In this paper we propose a protection scheme for 3D geometry data. This field has received hardly any attention, despite the fact that 3D data can have a decisive industrial impact on the success of a product or business. With current reproduction techniques the loss of a 3D model virtually enables a malicious party to create copies of a product. We propose a protection mechanism for 3D data that provides fine-grained access control, preview capabilities and reduced processing cost.
1 Introduction
In recent years the influence of 3D data has risen steadily. It is used in obvious applications like computer simulation and computer games, but also in rapid prototyping, movies and even biometrics [1]. Advancements in rapid prototyping and 3D acquisition techniques enable the reproduction of objects (e.g., machine parts) that can be used as full replacements for the original parts. This means that if 3D models are stolen, accidentally disclosed or simply intercepted during transmission, the financial success of a product can be seriously impaired. Common rapid prototyping techniques are fused deposition modeling, stereolithography, selective laser sintering and CNC milling. Those systems can be used to create small to medium-sized batches of a product from 3D data. Another example of the value of high-resolution 3D objects is digital animation. Today more and more movies are augmented with 3D objects or whole 3D scenes which take many man-hours to create. Sometimes even whole actors are recreated as 3D meshes (e.g., Arnold Schwarzenegger in Terminator IV). Even the field of biometrics makes more and more use of 3D data, namely three-dimensional face biometrics [2,3,4,5]. Those 3D objects represent a certain value and should be sufficiently protected. There are basically two distinct protection techniques for any kind of data: encryption (the first line of defense) and watermarking (the second line of defense). Watermarking augments the data with (visible or invisible) information that helps to identify, for example, the source of the object (in owner copyright protection) or, if multiple watermarks are embedded, the distribution chain. But watermarking does not actively prevent the data from being accessed by unauthorized parties. And while watermarking has already been extensively researched in the field of 3D data ([6,7,8,9]), the protection against unauthorized
access has received hardly any attention. The only two approaches are [10], which protects the data through a remote rendering system, and [11], which is basically a watermarking system with an additional random encryption of parts of the 3D mesh. In [12] it was analyzed how encryption affects progressive 3D data in general. Those tests focused on geometry encryption alone and formulated several attack scenarios. The attacks were tested on several systems to determine which ones are the most resistant. The work in [12] showed that the few existing protection systems are designed only for very special scenarios and are hardly applicable to common real-world tasks. The results showed that further work was needed to create an efficient and versatile 3D mesh protection system. The mostly theoretical approach of [12] provided the fundamental technology for the design of the Protected Progressive Meshes (PPM) system. In this work we continue the previous research by addressing the lack of a generic mesh protection system. We designed and implemented a system with the following goals in mind:
– Efficient protection of geometry data.
– Several different protection modes: "Full security with moderate decryption cost", "Preview ability" (transparent encryption), "Multilevel access" and "Adequate security with minimal decryption cost".
The remainder of this paper is organized as follows. Section two gives an overview of the Protected Progressive Meshes (PPM) format; section three explains the different security modes. Section four describes the principal attacks against the different operation modes. Section five analyzes the impact of these attacks with experiments, section six discusses the results of the security analysis and section seven concludes with an overview of the applicability of this protection system.
2 Principle
The vertex positions of uncompressed 3D objects are highly redundant [12]. If three-dimensional data is to be partially encrypted, one has to solve this redundancy problem. If the vertex positions to be encrypted are chosen randomly [11], the partial encryption scheme becomes very ineffective in terms of security [12]. To optimize the effectiveness of partial encryption it is necessary to remove the redundant information before encrypting the data. This is achieved by converting the object into a progressive representation (Figure 1a), so-called progressive meshes, before applying any encryption. The basic structure of a progressive mesh (independent of the algorithm) contains six (possibly empty) parts (Figure 1b): the geometry, connectivity and attribute information of the base mesh (which may be empty) and the geometry, connectivity and attribute information of the (several) refinement data. Every part can be encrypted completely or partially. At the present stage our algorithm, which is a variant of the Compressed Progressive Meshes (CPM) [13]
Fig. 1. a) PPM flow diagram b) Basic structure of progressive meshes c) Encrypting the whole mesh d) Detail encryption e) Base mesh encryption f) Onion-Layer encryption
At the present stage our algorithm, which is a variant of the Compressed Progressive Meshes (CPM) [13] algorithm, supports geometry and connectivity information but could easily be extended to support attribute information as well. The Protected Progressive Meshes (PPM) algorithm groups several refinement steps, which are called vertex-split operations ([14,15]), into batches called refinement levels. A vertex-split operation (Figure 2) needs three types of information: the vertex that has to be split (split-vertex), the edges along which the mesh is cut (split-edges), and the error vector that displaces the new vertices from the predicted to the correct split-positions. The index of the split-vertex and the edges along which the mesh is split are called the connectivity information, and the error vector is called the geometry information. In contrast to the CPM approach, PPM does not interleave the connectivity and the geometry information. The encoding process of the PPM system works as follows:

1. Select two vertices whose removal has the lowest possible impact on the mesh. This can be measured through the quadric error metric ([16]).
2. Collapse the two vertices into one and update the quadrics. The two vertex positions (A', B') are estimated from the surrounding vertices. The vertices ak, bk, ck, v1, and v2 shown in Figure 2 are used to predict A', and ak, bk, dk, v1, and v2 are used to predict B'. The vector between the two vertices is called the distance vector D = A − B. The PPM algorithm stores only the error vector E = D − D' between the real distance vector D and the estimated distance vector D' (computed from the predicted positions A' and B').
3. Repeat steps 1 and 2 until no more vertices can be removed in this level. Every collapse operation blocks several vertices from being removed in the current level, namely the ones used to predict the positions A', B' for every collapsed vertex. These are the vertices au, bv, cw, dr, v1, and v2 in Figure 2.
Fig. 2. a) predicted vertices and predicted distance vector after the vertex split b) real vertex positions after adding the error vector to the predicted distance vector
Fig. 3. a) Basic structure of a refinement level b) Structure of a PPM file
4. Once no more vertices can be removed, a level of detail is finished. The PPM encoder builds a spanning tree over the mesh starting from a well-known vertex. This spanning tree is traversed in pre-order. Every time a collapsed vertex is visited, the encoder appends a binary 1 to the refinement data, followed by the two indices of the split-edges. Those indices are counted counter-clockwise starting from the edge to the parent vertex in the tree. If a vertex is traversed that has not been collapsed, the encoder appends a binary 0 (Figure 3a). The possible indices of the split-edges depend on the valence d of the collapsed vertex, so the encoder can store the indices with a dynamic number of storage bits. The number of bits for the two split-edge indices is $\log_2 \binom{d}{2}$.
5. If the mesh has not yet been simplified to the desired level of detail, the encoder starts a new level and continues with step 1. Within one level of detail the encoder is usually able to reduce the mesh by ∼10% before all remaining vertices block further simplification operations ([13]).

A refinement layer produced by this encoding process looks as follows (Figure 3a): a run of 0 bits followed by a 1 (split-marker). The number of zero bits tells the decoder how many vertices in the tree are traversed until a split-vertex is reached. Following the split-marker, the indices of the two split-edges are stored with $\log_2 \binom{d}{2}$ bits. The index is relative to the edge to the parent vertex in the tree, indexed counter-clockwise. The split-tree has as many nodes as the mesh has vertices. The markers of split-vertices and the indices of the
split-edges together are called the split-vertex bitmap. Following the split-vertex bitmap (without any padding) are the error vectors. They can be stored either as floating point vectors or as quantized values. The error vectors are stored in the order of the split-markers: the vertex marked by the first split-marker corresponds to the first error vector. Optionally, the split-vertex bitmap and the error vectors can be stored in compressed form. Once all refinement layers are constructed and the mesh is simplified down to the base mesh, the encoder writes the final PPM file (Figure 3b), which consists of the following parts:

1. The Header section contains basic decoding information such as the refinement layer count, compression flags, quantization parameters, etc. This section is not encryptable, as every decoder must be able to read it.
2. The Key Management section comes in two flavors, standard and extended. The different types only affect the way users and keys are managed, not the way the data is encrypted. This section holds the information about the keys that were used to encrypt the different layers. It is possible that each layer, including the base mesh, has been encrypted with its own key. In general, however, several layers are grouped together to achieve a recognizable change in quality of the decoded mesh (a different level of quality); refinement layers that are meant to be grouped together can be encrypted with the same key. The Key Management section not only stores the information about which layers have been encrypted with the same key but, depending on the usage scenario, also stores the actual keys in encrypted form. The layer keys can be encrypted with a master key or through an external encryption mechanism. However, there can only be one master key, so in this case it is not possible to provide different levels of access for different users. This section can be encrypted partially (namely the encrypted layer keys).
3. The Base Layer section stores the coarsest representation of the 3D object. It is stored as a standard 3D Object File Format (OFF) file in ASCII representation. The way the base mesh is stored does not affect the overall PPM structure, so every representation (even a binary one) can be used. The decoder, however, needs to know which representation was used, because there is no way to specify a different representation of the base mesh in the PPM file itself. This base mesh can be seen as a mesh from which as much redundancy as possible has been removed. This layer can be completely encrypted.
4. The Offset Table is simply a list of layer sizes. It tells the decoder how big a layer is. The size is the real size the layer occupies in the PPM file (including any padding); when layers are encrypted they need to be padded depending on the encryption algorithm used. This section must not be encrypted.
5. The last section holds the Refinement Layers, which contain the data that is needed to decode the full detail mesh. Each layer can be individually encrypted.
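To make the layer format above more concrete, the following sketch shows how the split-vertex bitmap of one refinement level could be assembled. It is an illustration only, not the reference encoder: the function and data-structure names (encode_refinement_layer, preorder_vertices, splits) are hypothetical, bit packing is done as a character string for readability, and the error vectors that would follow the bitmap are omitted.

```python
from itertools import combinations
from math import ceil, log2

def encode_refinement_layer(preorder_vertices, splits):
    """Assemble the split-vertex bitmap of one refinement level.

    preorder_vertices: vertices of the spanning tree in pre-order.
    splits: maps a split-vertex to (edge_a, edge_b, valence), the two
            split-edge indices counted counter-clockwise from the edge
            to the parent vertex (0-based here).
    """
    bits = []
    for v in preorder_vertices:
        if v not in splits:
            bits.append("0")                 # traversed, but not a split-vertex
            continue
        edge_a, edge_b, valence = splits[v]
        bits.append("1")                     # split-marker
        # the two split-edges form one of the C(valence, 2) unordered pairs
        pairs = list(combinations(range(valence), 2))
        pair_index = pairs.index((min(edge_a, edge_b), max(edge_a, edge_b)))
        width = max(1, ceil(log2(len(pairs))))   # ~log2(C(d, 2)) bits
        bits.append(format(pair_index, f"0{width}b"))
    return "".join(bits)
```

A real encoder would pack these bits directly into the layer and append the corresponding error vectors in split-marker order.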
3 Security Modes
The ability to partially or completely encrypt several parts of the PPM file can be used to adapt the system to several scenarios. It would be possible to encrypt only parts of a refinement layer, but PPM always encrypts whole layers (for example, in Figure 1a the base layer and refinement layer 1). One refinement layer is relatively small compared to the full mesh, so further dividing those layers would only add complexity without additional benefit. The PPM system supports four different encryption modes, distinguished by which layers are encrypted (a small illustrative helper follows the list):

– Full Encryption. In this case the base layer and all refinement layers are encrypted (Figure 1c). Obviously this could also be achieved with conventional full file encryption, but here the header information remains unencrypted, which allows software without a valid key to at least verify the type of the file. This is not the sole benefit of PPM Full Encryption over conventional full file encryption: even if all data is encrypted, access to the individual detail layers can still be controlled. One user may be allowed to decode the mesh only up to a certain refinement layer, while another user may be allowed to decode the mesh up to the highest detail.
– Detail Encryption. The Detail Encryption mode (Figure 1d) is the classic application of transparent encryption ([17,18,19]). It protects the detail data and keeps the base layer and possibly several lower detail levels unprotected to allow general access to low detail versions. The high detail versions or the original data are only available to users who have the correct decryption keys.
– Base Encryption. The Base Encryption mode (Figure 1e) encrypts the whole base mesh but no refinement layers. This mode aims at an optimal balance between security and efficiency: it tries to minimize the encryption/decryption cost while maintaining a high level of security. Every vertex-split operation relies either directly or indirectly on the geometry information that is stored in the base mesh, so the impact of base mesh encryption propagates through all refinement steps.
– Onion Layer Encryption. This mode is called onion layer encryption because only a few refinement layers are encrypted (Figure 1f), followed by unencrypted layers. The higher unencrypted layers depend on the encrypted onion layers and are thus not decodable without knowing the information contained in the encrypted layers; the encrypted parts protect the unencrypted layers like the skin of an onion. This protection mode is not aimed at maximum security, but at reasonable protection with minimum decryption cost.
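The following hypothetical helper summarizes which sections each mode encrypts. It is only a sketch of the mode semantics described above; the function name, the boundary between clear and protected layers in detail mode, and the number of onion layers are all assumptions, and key management is ignored entirely.

```python
def layers_to_encrypt(mode, num_layers, first_protected=1, num_onion=1):
    """Return which PPM sections a given security mode encrypts."""
    layers = list(range(num_layers))
    if mode == "full":      # base mesh and every refinement layer
        return {"base": True, "layers": layers}
    if mode == "detail":    # transparent encryption: low detail stays readable
        return {"base": False, "layers": layers[first_protected:]}
    if mode == "base":      # only the base mesh is encrypted
        return {"base": True, "layers": []}
    if mode == "onion":     # a few low layers shield the unencrypted rest
        return {"base": False, "layers": layers[:num_onion]}
    raise ValueError(f"unknown mode: {mode}")
```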
4 Security Analysis
For our analysis we did not compress the split-vertex bitmap and the error vectors, which is the best-case scenario for an attacker, because one can guess the
real size of the refinement layer very accurately. In this case we can assume that an attacker can separate the split-vertex bitmap and the error vectors in the uncompressed but undecodable higher layers with a high degree of certainty.

4.1 Assessing the Quality of an Attack
In every attack (described in section 5) one needs to determine whether the decisions that are made are correct and lead towards the correct unencrypted mesh. Obviously this is no exact science and depends heavily on the type of input data. To make an exact decision on whether or not an attack is successful, one would need to calculate the distance to the unencrypted model, which in turn would make the attack obsolete. In a real-world attack the correctness of the steps has to be evaluated by some heuristic. For example:

– The most important evaluation criterion is the magnitude of the decoded error vectors. The error vectors are stored as 32-bit IEEE floating point numbers, so if there is an error in the decoding process some or all of the bits will be wrong. This happens, for example, if the list of error vectors is read from the wrong starting position; the error vectors are then very likely to assume very small or unusually big values. If the attack decodes the data successfully, all error vectors should be within a reasonable range (a simple check of this kind is sketched below).
– Another way to determine whether an attack decodes the data correctly is the overall smoothness of the mesh. If we assume that most meshes have a smooth surface (skin, the shape of a car, ...), the smoothness of the mesh after the attack can indicate whether the right steps were made during the attack. Additionally, the encoder tries to make small changes in the higher detail levels and the bigger changes in the lower detail levels, which should be considered during an attack. For example, if layers in lower detail levels are attacked, the smoothness criterion may not be the best choice, as those coarse meshes may be rough, but the plausibility of the error vectors should still be a working criterion.
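A minimal sketch of the first heuristic is given below, assuming the attacker has already decoded a candidate list of error vectors. The thresholds are scene-dependent guesses and the function name is hypothetical; the paper does not prescribe a concrete test.

```python
import numpy as np

def fraction_plausible(error_vectors, min_norm=1e-9, max_norm=1.0):
    """Fraction of decoded error vectors whose magnitude looks reasonable.

    Misaligned reads of the 32-bit float stream typically produce NaNs,
    infinities, denormally small values, or absurdly large ones.
    """
    norms = np.linalg.norm(np.asarray(error_vectors, dtype=np.float32), axis=1)
    ok = np.isfinite(norms) & (norms > min_norm) & (norms < max_norm)
    return float(np.mean(ok))
```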
4.2 Basic Attack Scenarios
To correctly decode a mesh, the decoder needs to know which vertices have to be split (split-vertices) and how big the error relative to the predicted position is (error vector). The split-vertex bitmap which marks the split-vertices relies on the fact that the spanning tree for each level of detail, built during the encoding step, is the same as the spanning tree built during the decoding step. If this tree is somehow built in a different way, the split-markers will mark the wrong vertices for a vertex-split operation, leading to a corrupted decoding process. This error persists throughout all detail levels. For the full encryption scenario there is obviously no attack possible if we assume that the chosen encryption algorithm is secure. The same applies to detail encryption, because all data starting from a certain point is encrypted.
The data in lower detail levels is unencrypted and thus does not need to be attacked. The only thing an attacker can do is to subdivide the mesh to increase the level of detail, but this cannot be considered a real attack because it is always possible even if the encrypted data were not available at all. For an attack on the base mesh encryption one needs to at least guess the correct number of vertices and the connectivity of the real base mesh. This by itself is already a very complex task, covered in Section 5.1. For the onion layer encryption there are several techniques worth investigating. The encrypted data itself (the onion layers) can still be considered secure, but the unencrypted higher detail layers may be subject to an attack. For that, an attacker may choose between two different approaches:

– Focus on the remaining unencrypted information. An attacker can try to find the correct split-vertex by relying only on the error vector and completely ignoring the connectivity information (split-vertex bitmap). In that case one has to find a connection between those two types of information. If the correct split-vertex is found, one can test all split-edge combinations to make the correct vertex-split operation. In such a case the mesh connectivity is very likely different from the original data, but the overall shape may be correct.
– Focus on the information in the encrypted layers. An attacker can try to brute-force the split-vertex bitmap of the encrypted onion layers. It should be safe to assume that the corresponding error vectors are very small and can be substituted with zero. This leads to a slightly different final mesh, but all remaining unencrypted layers can then be decoded without any further intervention.
5 Experimental Security Analysis
We analyzed the complexity and the practicability of techniques that allow an attacker to accomplish the attacks outlined in section 4.2. For our analysis we used the following meshes: the cara model (Figure 5a) is a face scan with 10045 vertices, the cow model (Figure 5b) has 2903 vertices, the hand model (Figure 5c) has 7609 vertices, and the heptoroid (Figure 5e) has 17878 vertices. Figure 5f shows the horse model with 3430 vertices and Figure 5g the moomoo model with 3890 vertices.

5.1 Brute-Forcing the Base Mesh for an Attack on the Base Mesh Encryption
The smallest possible base mesh is a tetrahedron and thus has only four vertices. If we apply Base Encryption to a PPM with such a small base mesh, it is obvious that an attacker can use a brute force approach to find roughly the correct vertex positions and thus decode the mesh to some extent. For a brute force attack on the base mesh the attacker needs to find the correct connectivity of the guessed vertices, which is by far the most complex task in this attack.
Table 1. Component based correlation coefficients

Model      r_EX,VX   r_EX,VY   r_EX,VZ   r_EY,VX   r_EY,VY   r_EY,VZ   r_EZ,VX   r_EZ,VY   r_EZ,VZ
cow         0.0693    0.0170   -0.0137   -0.0220    0.0954   -0.0318   -0.0326   -0.0135    0.0384
hand        0.0544   -0.0337    0.0102   -0.0298    0.0407   -0.0353    0.0045   -0.0052    0.0392
heptoroid   0.0092   -0.0056    0.0109   -0.0005   -0.0017   -0.0066    0.0055    0.0085    0.0240
moomoo      0.0264    0.0066    0.0242   -0.0096    0.0578   -0.0214    0.0536   -0.0221    0.1068
cara        0.0437   -0.0378    0.0232   -0.0218    0.0566   -0.0317    0.0188    0.0601    0.0528
sphere      0.0621    0.0282    0.0014   -0.0313    0.1910    0.0273    0.0041   -0.0224   -0.1655
horse_hq    0.0691   -0.0493   -0.0017   -0.0313    0.0622    0.0594   -0.0374   -0.0177    0.1408
When the base mesh gets bigger, this brute force attack becomes more and more unusable due to the increasing number of possible combinations. In order to roughly assess the possible combinations we need to look at the number of possible triangulations for a given set of points, which is a topic of ongoing research even in 2D. If a set of points in 2D is triangulated, it can easily be transformed into a polyhedron by embedding the point set in R3 and displacing the third coordinate of all points in such a way that the boundary of the 2D triangulation can be triangulated in the 3D representation without any intersections. Even if this is only a very small subset of all possible combinations of a set of points in R3, we can argue that this is a (weak) lower bound for the complexity of this attack. However, currently there is no exact formula for the number of triangulations of a set of points in R2. In [20] Sharir and Welzl found that the number of possible triangulations is between $2.5^n$ and $43^n$. In our case we know that there are many more possible combinations in 3D than in 2D. Therefore a brute force attack becomes unviable very fast.

5.2 Guessing the Corresponding Split-Vertex from the Error-Vector Pairs for an Attack on the Onion Layer Encryption
In section 4.2 we described an attack on the onion layer encryption that does not try to recover the information of the encrypted layers but rather focuses on the recovery of the unencrypted but protected (by the onion layer) data. In that case an attacker needs to use some heuristic (section 4.1) to determine the corresponding error vector/split-vertex pairs. For all our experimental results we assume that an attacker can determine the correct split-edges, which may not be trivial but is in most cases much simpler than the task of finding the correct split-vertex.

Using the component based correlation to determine corresponding error vector/split-vertex pairs. The idea here is that there may be some correlation between the components of the split-vertex and the error vector. If there were any correlation, the components of the error vector would provide a lead to the vertex it belongs to. This would enable an attacker to split this vertex and move the new vertices to their correct positions.
The error vectors (E) and the split vertices (V) are elements of $\mathbb{R}^3$ and thus each consists of three components, $E = (E_X, E_Y, E_Z)^T$ and $V = (V_X, V_Y, V_Z)^T$. We calculated the correlation coefficient (Equation 1) between the components of the error vector and the split-vertex for the previously mentioned test data. To ensure that there is no correlation between the different components, we calculated the correlation between all possible combinations (Table 1) of the components of the split-vertex ($V_X$, $V_Y$, $V_Z$) and the components of the error vector ($E_X$, $E_Y$, $E_Z$). For example, $E_{X_i}$ is the first component of the i-th error vector and $V_{X_i}$ is the corresponding first component of the split-vertex.

$$r_{E_k,V_u} = \frac{n \sum_i E_{k_i} V_{u_i} - \sum_i E_{k_i} \sum_i V_{u_i}}{\sqrt{\left(n \sum_i E_{k_i}^2 - \left(\sum_i E_{k_i}\right)^2\right)\left(n \sum_i V_{u_i}^2 - \left(\sum_i V_{u_i}\right)^2\right)}} \qquad (1)$$
As we can see, nearly every coefficient lies between -0.1 and 0.1 and many are even much closer to 0 (e.g., $r_{E_X,V_X}$ for the heptoroid). This means there is essentially no correlation, and thus an attacker cannot use this information to find the correct error vector/split-vertex pairs.
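For reference, the coefficients in Table 1 can be reproduced from Equation (1) with a few lines of NumPy. This is a sketch under the assumption that paired lists of error vectors and split-vertex positions are available; the function name is ours, not the paper's.

```python
import numpy as np

def component_correlations(error_vectors, split_vertices):
    """3x3 matrix of r_{E_k,V_u} (Equation 1): rows index the error-vector
    component, columns the split-vertex component."""
    E = np.asarray(error_vectors, dtype=np.float64)   # shape (n, 3)
    V = np.asarray(split_vertices, dtype=np.float64)  # shape (n, 3)
    n = E.shape[0]
    r = np.empty((3, 3))
    for k in range(3):
        for u in range(3):
            num = n * np.sum(E[:, k] * V[:, u]) - np.sum(E[:, k]) * np.sum(V[:, u])
            den = np.sqrt((n * np.sum(E[:, k] ** 2) - np.sum(E[:, k]) ** 2) *
                          (n * np.sum(V[:, u] ** 2) - np.sum(V[:, u]) ** 2))
            r[k, u] = num / den
    return r
```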
Fig. 4. Distribution of angles between the error vectors and the corresponding split-vertices for the a) cara model and b) heptoroid model. Distribution of angles between the predicted distance vector and the error vector for the c) cow model and d) heptoroid model.
Using the angle between the error vectors and the possible split-vertices. To further investigate possible indications that could help an attacker in finding the correct split-vertex, we look at the angle between the error vector and the split-vertex. Figures 4a and b show a roughly Gaussian-shaped angle histogram centered at $\pi/2$, which means that many split-vertices have an angle of about $\pi/2$ to the error vector. Given that in normal meshes the number of vertices that are roughly orthogonal to a given vertex (the line that cuts the cow model in Figure 5d) is much higher than the number of vertices that are very close (the dots surrounding the arrow in Figure 5d), this is not an advantage to an attacker: one would need to test a substantial fraction of the available vertices for every error vector.
Fig. 5. a,b,c,e,f,g) Test meshes d) cow model cut by a plane that is normal to a split-vertex (marked with red arrow) h) Regular triangular mesh with blocking vertices caused by a vertex split (grey area)
Using the angle between the predicted distance vector and the error vector. Another, more promising approach is to compare the angle between the error vector and (all) the predicted distance vectors D' (Figure 2) to identify the vertex that has to be split. For that attack one needs to calculate two predicted vertices A' and B' for every vertex in the mesh and every possible combination of split-edges. The selection of the two split-edges (V v1 and V v2 in Figure 2) influences the prediction for A' and B'. This means that for every vertex in the mesh there are several different predictions for the positions of the vertices (A and B) after the vertex split. The number of possible choices for the split-edges at a single split-vertex is $\binom{k}{2}$, where k is the valence of the split-vertex. One needs to compare every single error vector with every possible D' at every vertex of the mesh. If the mesh has n vertices, one needs to calculate $\sum_{i=1}^{n}\binom{k_i}{2}$ possible distance vectors D'. If we assume an average valence of six, the complexity is $\sum_{i=1}^{n}\binom{6}{2} = 15n$ and thus O(n). After each vertex-split the topology of the mesh has changed, so the local neighborhood of the new vertices has to be updated. All other vertices are not affected, which means that the possible distance vectors calculated in the previous step can be reused; in normal meshes this results in a constant update time after each vertex-split. Figures 4c and d show that there is a very strong trend for the error vectors to have the same direction as the predicted distance vectors, because the majority of the angles is smaller than $\pi/6$. However, this is still only a very weak criterion. In fact, our tests showed that for a single vertex-split an attacker would still only
be able to dismiss about 20%–40% of the possible vertices. So for every error vector the number of possible split-vertices is in O(n), which leads to an overall complexity of $O(n^n)$.
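The pruning step just described can be written down compactly; the sketch below, with hypothetical names, keeps only the candidate split-vertices whose predicted distance vector D' forms an angle below a chosen threshold (π/6 here) with the decoded error vector E.

```python
import numpy as np

def candidate_split_vertices(error_vec, predicted_distance_vectors,
                             max_angle=np.pi / 6):
    """Indices of candidates whose D' is roughly aligned with E."""
    E = np.asarray(error_vec, dtype=np.float64)
    D = np.asarray(predicted_distance_vectors, dtype=np.float64)  # one D' per candidate
    cos_angle = (D @ E) / (np.linalg.norm(D, axis=1) * np.linalg.norm(E) + 1e-20)
    angles = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    return np.nonzero(angles < max_angle)[0]
```

As noted above, this typically removes only 20%–40% of the candidates per error vector, so it does not make the attack tractable on its own.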
5.3 Brute-Forcing the Split-Tree for an Attack on the Onion Layer Encryption
The second attack on the onion layer encryption mentioned in section 4.2 is an attack on the split-tree of the encrypted refinement layers. Providing a single formula for the theoretical complexity of this attack is virtually impossible because the range of input meshes is too diverse; the complexity depends not only on the number of vertices but also heavily on the connectivity information, which makes a prediction hard. To be able to give some estimate anyway, we assume that the input mesh before the encrypted layer is a regular triangular mesh. In this case every vertex-split blocks 13 additional vertices from being split in the current level (Figure 5h). But even if vertices are blocked in a level, they can still be direct or indirect neighbors of other vertex-splits. Figure 5h shows the minimum distance between two vertex splits such that no blocked vertex is split. In the case of a regular mesh where every vertex has a valence of six, every vertex has 15 possible split-edge pairs. We know that about 10% of the mesh vertices are split during a refinement batch. This means that a brute force attack on the split-tree needs to test $15^{n/10}$ possible combinations. For a mesh with 5000 vertices where the seventh refinement layer is encrypted (∼100 vertices), a brute force attack would need to test ∼$15^{10} \approx 5.8 \times 10^{11}$ possible combinations. This is, however, only for a very simple mesh type. For real meshes, which are much bigger and arbitrarily connected, an estimate would be far more complex, not to say impossible, because one cannot make any assumptions about the input data.
6 Discussion
The onion layer encryption mode of section 3 offers the lowest security. However, this mode is only intended to achieve reasonable security or, in other words, a degradation of the original data. It is not intended for high security applications where resources do not matter; it is targeted at multimedia content delivery, where a content owner wants to ensure that an attack is infeasible within a given time frame. Additionally, the recovery of the correct input mesh is still very expensive in terms of processing power, because the decision whether a vertex is the correct split-vertex is not binary: there are always several other candidates, which leaves a fair amount of uncertainty. The base encryption is a good alternative to full encryption if a progressive representation of the mesh is also desired. The detail encryption mode provides complete security for the encrypted detail data. This is obvious because, unlike the onion layer encryption, all data that an attacker wants to access is encrypted; one would need to break the encryption itself, which is not an aspect of the PPM scheme. One may argue that for detail encryption
the PPM scheme is not necessary. But this is only true to some extent. If one wants to apply detail encryption to a 3D mesh, one needs to carefully select which data is encrypted, because in these scenarios the goal is to maintain an intact low resolution representation. If the parts are not carefully selected, one may end up with a mesh where some parts are encrypted completely while other parts are not encrypted at all. This careful selection is done by the PPM scheme automatically: due to the conversion into a progressive representation, the unimportant detail parts are grouped together in the highest refinement levels and the important low resolution part is stored in the base mesh and the lowest refinement levels. The full encryption mode is included only for completeness. If one merely requires a mesh to be encrypted, with no additional requirements, one should choose a conventional encryption approach. However, all modes obviously profit from the fact that the data is also a progressive mesh, so it can be viewed and processed at different detail levels at no additional cost.
7 Conclusion
We presented a protection scheme for 3D geometry data that provides several modes with varying security and performance and thus covers a broad range of possible application scenarios. This provides a protection domain currently not available in this field: previous mesh protection schemes are either solely based on watermarking or do not provide protection at all. In addition to being the first genuine mesh protection system, PPM also provides multilevel access, an intrinsic progressive representation, and preview capabilities.
References

1. Scheenstra, A., Ruifrok, A., Veltkamp, R.C.: A survey of 3D face recognition methods. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 891–899. Springer, Heidelberg (2005)
2. Bronstein, E.M., Bronstein, M.M., Kimmel, R.: Expression-invariant 3D face recognition, pp. 62–69. Springer, Heidelberg (2003)
3. Chang, K.I., Bowyer, K.W., Flynn, P.J.: An evaluation of multi-modal 2D+3D face biometrics. IEEE PAMI, Los Alamitos (2005)
4. Gupta, S., Markey, M.K., Aggarwal, J., Bovik, A.C.: Three dimensional face recognition based on geodesic and Euclidean distances. In: SPIE Symposium on Electronic Imaging: Vision Geometry XV (2007)
5. Ben Amor, B., Ouji, K., Ardebilian, M., Chen, L.: 3D face recognition by ICP-based shape matching. In: The Second International Conference on Machine Intelligence (ACIDCA-ICMI 2005) (2005)
6. Chou, C.M., Tseng, D.C.: Technologies for 3D model watermarking: A survey. International Journal of Computer Science and Network Security 7, 328–334 (2007)
7. Zafeiriou, S., Tefas, A., Pitas, I.: A blind robust watermarking scheme for copyright protection of 3D mesh models. III, 1569–1572 (2004)
8. Chao, M.W., Lin, C.H., Yu, C.W., Lee, T.Y.: A high capacity 3D steganography algorithm. IEEE Transactions on Visualization and Computer Graphics 15, 274–284 (2009)
9. Wang, Y.P., Hu, S.M.: A new watermarking method for 3D models based on integral invariants. IEEE Transactions on Visualization and Computer Graphics 15, 285–294 (2009)
10. Koller, D., Turitzin, M., Levoy, M., Tarini, M., Croccia, G., Cignoni, P., Scopigno, R.: Protected interactive 3D graphics via remote rendering. ACM Trans. Graph. 23, 695–703 (2004)
11. Cho, M., Kim, S., Sung, M., On, G.: 3D fingerprinting and encryption principle for collaboration. In: AXMEDIS 2006: Proceedings of the Second International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, Washington, DC, USA, pp. 121–127. IEEE Computer Society, Los Alamitos (2006)
12. Gschwandtner, M., Uhl, A.: Toward DRM for 3D geometry data. In: Delp III, E.J., Wong, P.W., Dittmann, J., Memon, N.D. (eds.) Proceedings of SPIE, Security, Forensics, Steganography, and Watermarking of Multimedia Contents X, San Jose, CA, USA, vol. 6819, p. 68190. SPIE (2008)
13. Pajarola, R., Rossignac, J.: Compressed progressive meshes. IEEE Transactions on Visualization and Computer Graphics 6, 79–93 (2000)
14. Floriani et al.: A survey on data structures for level-of-detail models. In: Advances in Multiresolution for Geometric Modelling, pp. 49–74 (2004)
15. Hoppe, H., DeRose, T., Duchamp, T., McDonald, J., Stuetzle, W.: Surface reconstruction from unorganized points. Computer Graphics 26, 71–78 (1992)
16. Garland, M., Heckbert, P.S.: Surface simplification using quadric error metrics. In: SIGGRAPH 1997: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pp. 209–216. ACM Press/Addison-Wesley Publishing Co. (1997)
17. Engel, D., Stütz, T., Uhl, A.: A survey on JPEG2000 encryption. Multimedia Systems (to appear, 2009)
18. Massoudi, A., Lefèbvre, F., Vleeschouwer, C.D., Macq, B., Quisquater, J.J.: Overview on selective encryption of image and video, challenges and perspectives. EURASIP Journal on Information Security (2008)
19. Lu, X., Eskicioglu, A.M.: Selective encryption of multimedia content in distribution networks: Challenges and new directions. In: Proceedings of the IASTED International Conference on Communications, Internet and Information Technology (CIIT 2003), Scottsdale, AZ, USA (2003)
20. Sharir, M., Welzl, E.: Random triangulations of planar point sets. In: SCG 2006: Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, pp. 273–281. ACM, New York (2006)
Bilateral Filtered Shadow Maps

Jinwook Kim¹ and Soojae Kim²

¹ Imaging Media Research Center, Korea Institute of Science and Technology
[email protected]
² Dept. of HCI and Robotics, University of Science and Technology, Korea
[email protected]
Abstract. We present a novel shadow smoothing algorithm based on the bilateral filter. From the observation that shadow leaking, which is often found in filter based approaches, occurs at depth discontinuities seen from the viewpoint of the eye, we apply the bilateral filter, which is conceptually a product of two Gaussian filters, one for smoothing the shadow map and the other for handling depth discontinuity. Consequently, bilateral filtered shadow maps smooth the shadow boundaries effectively and do not suffer from shadow leaking artifacts.
1 Introduction
Shadow is one of the most important features in modern 3D graphics applications, and there has been a vast amount of research on real-time shadow techniques. Among these approaches, we focus on shadow map based techniques [1]. The shadow map algorithm generates a depth buffer by rendering the scene from the viewpoint of the light source and uses this information to determine whether pixels of the image are lit or in shadow. Due to its image based nature, the shadow map algorithm can handle complex and dynamic scenes efficiently and can be implemented easily on commodity GPUs. A major downside of shadow maps is aliasing artifacts, which come from the buffer resolution mismatch between the shadow map and the render buffer for the final scene, as shown in Figure 1(a). To solve this aliasing problem, one can use very high resolution shadow maps, warp the shadow map considering the perspective transformation for a better fit [2,3,4], or filter the shadow map to remove the jagged appearance of its boundary [5,6,7,8]. One of the critical problems often found in filter based approaches is the shadow leaking artifact, an unwanted deterioration of shadow intensity. We observe that shadow leaking happens due to discontinuities in the depth buffer rendered from the viewpoint of the eye. Figure 1(b) shows a typical example of the shadow leaking artifact occurring with the Gaussian blur filter. Obviously the part of the shadow silhouette occluded by a frontward object should not be blurred. The question then is which part of the shadow should be smoothed, or how we can apply the filter selectively. Figure 1(d) shows a depth buffer rendered from a normal viewpoint of the eye. We see that filtering is applicable only to the regions where depth changes continuously.
Fig. 1. Bilateral filtered shadow maps: a) conventional shadow map b) Gaussian filtered shadow map c) bilateral filtered shadow map d) depth buffer e) difference
In this regard, we propose to apply the bilateral filter instead of the standard Gaussian filter to blur shadow boundaries correctly. The bilateral filter can gracefully adjust the smoothing level on shadow boundaries by taking the depth discontinuity into account. Figures 1(c) and 1(e) show an anti-aliased shadow using the bilateral filter and the difference between the Gaussian filtered shadow map and the bilateral filtered shadow map. As a result, the bilateral filtered shadow map does not suffer from shadow leaking artifacts.
2 Previous Work
There is a rich literature on computing shadows. We refer to Woo et al. [9] and Hasenfratz et al. [10] for a complete survey of existing shadow algorithms. Among the various shadow algorithms, we highlight only image based approaches.
Williams [1] proposes an image based shadow algorithm that computes a view of the scene from the point of view of the light source and stores the depth values of the scene in what is called the shadow map. The shadow map is then used to determine whether each pixel of the scene seen from the viewpoint of the eye is lit or in shadow. This technique is accelerated by most current graphics hardware and is widely used for interactive applications. One of the disadvantages of shadow mapping is the aliasing problem due to the limited resolution of the shadow map.

Percentage closer filtering is an algorithm to anti-alias shadow map boundaries [11]. Its key idea is to perform the depth comparison first and filter the comparison results instead of filtering the depths directly. However, the shadow test depends on the distance to the point to shade, and this distance is only available at run time. This characteristic makes pre-filtering difficult.

One of the popular methods to smooth aliased shadow boundaries is to use a Gaussian filter. A conventional shadow map is used to cast shadows onto the scene, but only the shadow is rendered, without illumination. The shadow image is blurred using a two-dimensional Gaussian filter or a separable Gaussian filter and then modulated with the illuminated scene without shadow. The method requires one more composition pass, but typical rendering pipelines usually require a final composition stage anyway to handle glow effects, transparency, and so on; hence the additional pass is negligible. The problem is that the Gaussian filter smooths all shadow boundaries without considering the structure of the scene. Therefore it suffers from shadow leaking artifacts even with a moderately sized filter kernel.

Variance shadow maps [7] are a probabilistic method to approximate shadow intensity. When the shadow map is generated, the depth value and its square are stored and used to estimate the probability that a pixel is in shadow. The algorithm supports pre-filtering and additional convolutions but produces noticeable light leaking artifacts for complex scenes. An extension of the algorithm partitions the shadow map frustum into multiple depth ranges to reduce the light leaking artifacts [12]. Convolution shadow maps [8] achieve anti-aliased shadows by approximating the shadow test with a Fourier series expansion; they support pre-filtering and produce fewer light leaking artifacts. However, the algorithm requires a Fourier series expansion of high order for reliable shadow anti-aliasing, which results in higher memory requirements and processing costs. To address this problem, exponential shadow maps [13] approximate the shadow test using an exponential function; they are reported to be faster and to consume less memory while producing fewer artifacts.

It is also worth noting that bilateral filtering has been used in global illumination rendering [14,15]. However, those techniques were developed for non-interactive applications.
3 Bilateral Filter
We briefly review the bilateral filter in this section. The bilateral filter is a technique to smooth images while preserving edges [16,17]. Conceptually, the bilateral filter is a non-linear product of two Gaussian filters in different domains. A standard Gaussian filter can be described as

$$I'(p) = \sum_{q \in S} G_{\sigma}(\|p - q\|)\, I(q), \qquad (1)$$

where $I'(p)$ and $I(q)$ represent the filtered image intensity at pixel location p and the original image intensity at pixel location q, and S denotes the set of all pixel locations in the given image. The Gaussian distribution function is defined as

$$G_{\sigma}(x) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2}{2\sigma^2}\right). \qquad (2)$$

Since the Gaussian filter calculates a weighted average of the intensity of nearby pixels, it is unaware of the image content and therefore cannot handle image edges selectively. Like the Gaussian filter, the bilateral filter is also defined as a weighted average of nearby pixel values:

$$I'(p) = \frac{1}{W_p} \sum_{q \in S} G_{\sigma_s}(\|p - q\|)\, G_{\sigma_r}(|I(p) - I(q)|)\, I(q), \qquad (3)$$

where $W_p$ is a normalization factor:

$$W_p = \sum_{q \in S} G_{\sigma_s}(\|p - q\|)\, G_{\sigma_r}(|I(p) - I(q)|). \qquad (4)$$

The bilateral filter weights are a product of two Gaussian filter weights, one of which averages intensity in the spatial domain, and the other of which takes the intensity difference into account. Hence, as soon as one of the weights is close to 0, no smoothing occurs, which means that the product becomes negligible in regions where intensity changes rapidly, such as near sharp edges. As a result, the bilateral filter preserves sharp edges. Even though the bilateral filter is an efficient way to smooth an image while maintaining its discontinuities, it can be time consuming if the filtering range is large, because it requires estimating weights over large neighborhoods. The complexity of a brute force implementation of the bilateral filter is $O(N^2)$, where N is the number of pixels. A typical treatment is to restrict the filtering range, that is, to consider only the pixels q such that $\|p - q\| \le 2\sigma_s$. This approximation reduces the overall complexity to $O(N \sigma_s^2)$. It is also worth mentioning that applying the bilateral filter to a two-dimensional image can be accelerated by applying a one-dimensional filter in the horizontal direction followed by another one-dimensional filter in the vertical direction. This scheme is called the separable bilateral filter and its complexity decreases to $O(2N \sigma_s)$.
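The separable approximation mentioned above can be sketched in a few lines of NumPy. This is an illustration rather than the paper's GPU implementation: the window radius of 2σs, the wrap-around border handling of np.roll, and the function names are simplifying assumptions, and the constant factor of the Gaussian is dropped because it cancels in the normalization.

```python
import numpy as np

def bilateral_1d(image, sigma_s, sigma_r, axis):
    """One-dimensional pass of the separable bilateral approximation."""
    radius = int(2 * sigma_s)
    out = np.zeros_like(image, dtype=np.float64)
    weight_sum = np.zeros_like(out)
    for offset in range(-radius, radius + 1):
        shifted = np.roll(image, offset, axis=axis)   # simplified border handling
        w = (np.exp(-offset ** 2 / (2 * sigma_s ** 2)) *
             np.exp(-(image - shifted) ** 2 / (2 * sigma_r ** 2)))
        out += w * shifted
        weight_sum += w
    return out / weight_sum

def separable_bilateral(image, sigma_s, sigma_r):
    """Horizontal pass followed by a vertical pass (approximates Eq. (3))."""
    horizontal = bilateral_1d(image, sigma_s, sigma_r, axis=1)
    return bilateral_1d(horizontal, sigma_s, sigma_r, axis=0)
```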
4 Bilateral Filtered Shadow Maps

4.1 Definition
As described in the previous section, the bilateral filter can smooth an image while respecting its boundaries. This bilateral feature of the filter inspires us to apply it to smooth shadow boundaries while suppressing shadow leaking artifacts. The basic idea is to use the depth information: near shadow boundaries where the depth value changes discontinuously, the filter should not be applied. Hence we modify the bilateral filter by using the depth information instead of the image intensity. The bilateral filter for shadow maps is defined as

$$I'(p) = \frac{1}{W_p} \sum_{q \in S} G_{\sigma_s}(\|p - q\|)\, G_{\sigma_r}(|d(p) - d(q)|)\, I(q), \qquad (5)$$

$$W_p = \sum_{q \in S} G_{\sigma_s}(\|p - q\|)\, G_{\sigma_r}(|d(p) - d(q)|), \qquad (6)$$

where I(q) denotes the shadow intensity value at pixel q and d(p) represents the depth value of pixel p stored in the depth buffer image. Like the bilateral filter, our shadow map filter smooths the hard shadow boundary by averaging the neighborhood values only when the depth values do not change discontinuously. If the depth value of a nearby pixel q is significantly different from the depth value of the target pixel p, then $G_{\sigma_r}(|d(p) - d(q)|)$ is close to 0 and the resulting weight becomes negligible, meaning that the pixel q contributes little to the smoothing process.
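A direct (non-separable) evaluation of Equations (5)–(6) on CPU arrays is sketched below, assuming the shadow intensity image and the eye-space depth buffer are available as same-sized arrays; the window radius, the wrap-around border handling, and the names are simplifications of ours, not details from the paper.

```python
import numpy as np

def bilateral_shadow_filter(shadow, depth, sigma_s, sigma_r):
    """Smooth the shadow image; the range weight is driven by the depth buffer."""
    radius = int(2 * sigma_s)
    out = np.zeros_like(shadow, dtype=np.float64)
    weight_sum = np.zeros_like(out)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shadow_q = np.roll(np.roll(shadow, dy, axis=0), dx, axis=1)
            depth_q = np.roll(np.roll(depth, dy, axis=0), dx, axis=1)
            w = (np.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2)) *
                 np.exp(-(depth - depth_q) ** 2 / (2 * sigma_r ** 2)))
            out += w * shadow_q
            weight_sum += w
    return out / weight_sum
```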
4.2 Implementation
First we generate a depth buffer by rendering the scene from the viewpoint of the light source (pass 1). Then we render the scene from the eye onto two render targets (pass 2). The first render target contains the illuminated scene without shadow in its RGB channels and the shadow cast result separately in its alpha channel. The second render target stores depth values from the viewpoint of the eye in a 32-bit floating point texture format. We then apply the separable bilateral filter to the shadow image stored in the alpha channel of the first render target; when applying the bilateral filter, the second render target is compared with the depth buffer obtained in pass 1. Finally, the bilaterally filtered shadow image is modulated with the illuminated scene without shadow stored in pass 2 (pass 3). Figure 2 shows a flow diagram of our algorithm. To avoid expensive exponential function evaluations in shaders, we generate a floating point texture containing pre-calculated exponential function values and use it as a lookup table. We also skip over several pixels when sampling nearby pixels, to obtain a wider blurring range while keeping the number of sampling operations constant.
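The two small implementation tricks mentioned above, the pre-computed exponential lookup table and the interleaved sampling offsets, could look as follows on the CPU side; the table size, value range, and names are assumptions, not details from the paper.

```python
import numpy as np

def gaussian_lookup_table(sigma, max_value, size=256):
    """Pre-computed Gaussian falloff; index i maps to x = i/(size-1)*max_value.
    On the GPU this would be uploaded as a 1D floating point texture."""
    x = np.linspace(0.0, max_value, size)
    return np.exp(-x ** 2 / (2 * sigma ** 2)).astype(np.float32)

def interleaved_offsets(num_taps=7, stride=2):
    """Sampling offsets, e.g. 7 taps spaced 'stride' pixels apart."""
    half = num_taps // 2
    return [i * stride for i in range(-half, half + 1)]
```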
Fig. 2. Flow diagram of bilateral filtered shadow maps
4.3 Discussion
Figure 3 shows how smoothly the bilateral filter removes aliasing artifacts depending on the filter coefficient. Images were rendered into a frame buffer of 1024×768 resolution using a shadow map of 512×512 resolution, and a 7×7 separable bilateral filter was applied. Since the depth is normalized to [0, 1), the depth values as well as the filter coefficient vary on a small scale in the figure. As the coefficient decreases, the bilateral filtered shadow maps remove the shadow leaking artifact more effectively. Table 1 shows a performance comparison between conventional shadow maps, Gaussian filtered shadow maps, and bilateral filtered shadow maps. All tests were measured on an nVidia GTX 280. When using separable filters with 7×7 or fewer sampling points on a shadow map of 512×512 resolution, the bilateral filtered shadow maps show a performance comparable to Gaussian filtered shadow maps. Note, however, that increasing the rendering buffer resolution can hamper the bilateral filtered shadow maps, because modulating the filtered shadow image with the illuminated scene without shadow occurs in the render buffer, and therefore the performance may decrease for a large render buffer resolution. Figures 4 and 5 show examples of Gaussian filtered shadow maps and bilateral filtered shadow maps in various settings. All images are generated in a frame buffer of 1024×768 resolution. Note that the bilateral filter with 7×7 sample points shows an anti-aliasing quality comparable to 15×15 sample points while saving a considerable number of frames per second (Table 1).
Fig. 3. Effect of changing the bilateral filter coefficient: a) Gaussian filter σs = 5.0 b) bilateral filter σr = 0.003 c) bilateral filter σr = 0.001001 d) bilateral filter σr = 0.000572 e) bilateral filter σr = 0.000215 f) bilateral filter σr = 0.000144 g) bilateral filter σr = 0.000037 h) no filter
Fig. 4. Conventional shadow maps and Gaussian filtered shadow maps, σs = 5.0: a) no filter b) Gaussian filtered 512×512 shadow map, 7×7 samples interleaving 2 pixels
Fig. 5. Bilateral filtered shadow maps on various settings, σs = 5.0, σr = 0.000037: a) bilateral filtered 512×512 shadow map (3×3 samples interleaving 2 pixels) b) bilateral filtered 1024×1024 shadow map (7×7 samples interleaving 2 pixels) c) bilateral filtered 512×512 shadow map (7×7 samples interleaving 3 pixels) d) bilateral filtered 512×512 shadow map (15×15 samples interleaving 2 pixels)
Table 1. Performance comparison

Filter type        Sampling points   Shadow map resolution   Frames per second
No filter          –                 512                     478
No filter          –                 1024                    474
Gaussian filter    3×3               512                     471
Gaussian filter    5×5               512                     471
Gaussian filter    7×7               512                     470
Gaussian filter    7×7               1024                    469
Gaussian filter    15×15             512                     384
Bilateral filter   3×3               512                     468
Bilateral filter   5×5               512                     467
Bilateral filter   7×7               512                     420
Bilateral filter   7×7               1024                    404
Bilateral filter   15×15             512                     270
5 Conclusion
The bilateral filtered shadow map is a novel shadow algorithm that smooths shadow boundaries without suffering from shadow leaking artifacts. The proposed algorithm can be implemented easily on commodity GPUs and shows a performance comparable to other filter based approaches. As an extension of our approach, the bilateral filter could be applied to existing anti-aliased hard shadow techniques such as variance shadow maps or exponential shadow maps without significant modification of the processing pipeline. Adjusting the penumbra size to simulate soft shadows from an area light source could also benefit from the bilateral filter.

Acknowledgements. This work was supported in part by the IT R&D program of MKE/MCST/IITA (2008-F-033-02, Development of Real-time Physics Simulation Engine for e-Entertainment) and the Sports Industry R&D program of MCST (Development of VR based Tangible Sports System).
References

1. Williams, L.: Casting curved shadows on curved surfaces. ACM SIGGRAPH Computer Graphics 12, 270–274 (1978)
2. Stamminger, M., Drettakis, G.: Perspective shadow maps. In: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pp. 557–562. ACM, New York (2002)
3. Wimmer, M., Scherzer, D., Purgathofer, W.: Light space perspective shadow maps. In: Eurographics Symposium on Rendering 2004, pp. 92–104 (2004)
4. Martin, T., Tan, T.: Anti-aliasing and continuity with trapezoidal shadow maps. In: Eurographics Symposium on Rendering, pp. 153–160 (2004)
5. Zhang, H.: Forward shadow mapping. In: Proc. Eurographics Rendering Workshop 98, pp. 249–252 (1998)
6. Fernando, R., Fernandez, S., Bala, K., Greenberg, D.: Adaptive shadow maps. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 387–390. ACM, New York (2001)
7. Donnelly, W., Lauritzen, A.: Variance shadow maps. In: I3D 2006: Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games, pp. 161–165. ACM, New York (2006)
8. Annen, T., Mertens, T., Bekaert, P., Seidel, H., Kautz, J.: Convolution shadow maps. In: Eurographics (2007)
9. Woo, A., Poulin, P., Fournier, A.: A survey of shadow algorithms. IEEE Computer Graphics and Applications 10, 13–32 (1990)
10. Hasenfratz, J., Lapierre, M., Holzschuch, N., Sillion, F., Gravir, A.: A survey of real-time soft shadows algorithms. Computer Graphics Forum 22, 753–774 (2003)
11. Reeves, W., Salesin, D., Cook, R.: Rendering antialiased shadows with depth maps. In: Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, pp. 283–291. ACM, New York (1987)
12. Lauritzen, A., McCool, M.: Layered variance shadow maps. In: Proceedings of Graphics Interface 2008, pp. 139–146. Canadian Information Processing Society, Toronto (2008)
13. Annen, T., Mertens, T., Seidel, H., Flerackers, E., Kautz, J.: Exponential shadow maps. In: Proceedings of Graphics Interface 2008, pp. 155–161. Canadian Information Processing Society, Toronto (2008)
14. Durand, F., Holzschuch, N., Soler, C., Chan, E., Sillion, F.X.: A frequency analysis of light transport. ACM Trans. Graph. 24, 1115–1126 (2005)
15. Sloan, P.P., Govindaraju, N.K., Nowrouzezahrai, D., Snyder, J.: Image-based proxy accumulation for real-time soft global illumination. In: PG 2007: Proceedings of the 15th Pacific Conference on Computer Graphics and Applications, pp. 97–105. IEEE Computer Society, Washington, DC (2007)
16. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Sixth International Conference on Computer Vision, pp. 839–846 (1998)
17. Paris, S., Kornprobst, P., Tumblin, J., Durand, F.: A gentle introduction to bilateral filtering and its applications. In: International Conference on Computer Graphics and Interactive Techniques. ACM Press, New York (2007)
LightShop: An Interactive Lighting System Incorporating the 2D Image Editing Paradigm

Younghui Kim and Junyong Noh

Graduate School of Culture Technology, KAIST, Republic of Korea
Abstract. Lighting is a fundamental and important process in the 3D animation pipeline. The conventional lighting workflow is time-consuming and labor-intensive: a user must continually fiddle with a range of unintuitive parameters, one by one, for a large set of lights to achieve the desired effect. LightShop, introduced here, provides the user with an intuitive and interactive interface employing the paradigm of 2D image editing software: direct sketching on objects and simultaneous control of the overall look of the lighting. The system then determines the optimal number of lights and their parameters automatically and rapidly. This is achieved by converting the user inputs to a data map and mining the light information from the data map via data clustering while measuring the cluster validity. Experiments show that LightShop dramatically simplifies the laborious and tedious lighting process and helps the user generate high-quality and creative lighting conditions with ease.
1 Introduction
One of the most important steps in the 3D animation pipeline is setting up the proper lighting for a scene. Good lighting improves the quality of the rendered final images, emphasizes the key elements of the scene, produces dramatic moods, and enhances the overall storytelling effect [1,2]. Skillful digital lighting can help to reduce under- and over-illuminated areas, enhance the contrast of the scene [3], and make the rendering process more effective by minimizing the number of lights that are used. The lighting process is time-consuming and labor-intensive. For a scene rendered in 3D animation production, typically dozens of lights have to be created and placed in space, while several unintuitive parameters for each light have to be tuned. Consequently, designing lighting for a scene often involves manipulating numerous parameters one by one until the desired result is obtained. Furthermore, substantial trial and error is inevitable even for a single parameter, due to the relatively unpredictable relationship between parameter variations and their results. Clearly, lighting is a daunting task for novices and even for professionals. Common 2D image editing software such as Adobe Photoshop employs a sketching metaphor combined with simple but diverse adjustments. The sketching metaphor provides the user with a familiar and intuitive interface, and image adjustments involve straightforward applications of image editing algorithms to
modify the entire color condition of the image. Diverse image adjustments are instrumental in creating a high-quality and creative image quickly, as the overall look of the image can be adjusted simultaneously.

In the spirit of the inverse-lighting approach, methods have been proposed to make the lighting process more interactive and convenient. These methods estimate the proper placements and configurations of lights from user input such as rough sketches [4,5,6], direct dragging [7], images [8], or paintings [2]. However, these earlier methods share similar limitations. As demonstrated with the general lighting model addressed in a previous study [9], the lights estimated by those methods do not have sufficient DOFs in the context of cinematography [4,5]. In addition, unlike 2D image editing, it is difficult to control the overall look of the lighting simultaneously. This can cause inefficiencies in designing high-quality and creative lighting conditions; allowing simultaneous control of the overall look would speed up the light setup process compared to the parameter-by-parameter manipulations associated with previous methods.

Another aspect to consider in setting up lights is the rendering time. Rendering is typically a huge computational bottleneck in the CG pipeline: the required computation for each frame is generally high and the sheer number of frames to be rendered is overwhelming. Observation shows that the rendering time increases linearly with the number of lights deployed. Consequently, one means of reducing the rendering time is to minimize the number of lights in the scene without compromising the original goal. Unfortunately, the number of lights is not considered during light estimation in earlier methods [2,4,5].

Inspired by the workflow of common 2D image editing software, we propose LightShop, an interactive lighting system that efficiently produces desired lighting conditions via intuitive sketching combined with simple adjustment controls. We believe that this adaptation greatly expedites the light setup process. Our contributions are summarized below.

Determining the optimal number of lights. The number of lights has a great influence on the rendering computation, and it is a burden for the user to determine the appropriate number of lights during the lighting setup process. LightShop minimizes the use of unnecessary lights: the optimal number of lights is determined for the scene conforming to the user inputs.

Simultaneous control of the overall look of lighting. Dealing with complicated 3D light parameters one by one is inefficient, unintuitive, and tedious. Borrowing the paradigm of 2D image editing software, LightShop allows the user to adjust the overall look of a scene efficiently through simple 2D image editing variables such as gain, contrast, hue, and saturation.
2 Related Work
In this section, a survey of various approaches to the inverse-lighting problem that are most relevant to our methods is given. See [10] for a more comprehensive discussion of the subject.
Schoeneman et al. [4] proposed a method to predict the most suitable intensities, utilizing the known number and positions of the lights, when a region to be lit is specified under global illumination. The method of Poulin et al. [5] computes the positions of the lights by exploiting the shadows and highlighted regions that the user specifies as constraints on a ball-shaped geometry. The approach by Pellacini et al. [7] allows the user to drag multiple shadows interactively. Shesh and Chen [6] suggested a method to optimize the parameters of lights given bright and dark areas from rough sketches by the user, assuming a local point-light based illumination model. Pellacini et al. [2] presented a complete interactive lighting system that produced outstanding results with diverse extensions; the method works with varied rendering techniques such as global illumination (GI) and non-photorealistic rendering (NPR). These methods are well tuned for calculating the parameters of lights in accordance with the user inputs. However, unlike our system, the user has the burden of determining the number of lights and adjusting the parameters of all the lights individually to achieve the desired overall look for a scene. For example, there is no single parameter that regulates the overall contrast of the scene.

Kristensen et al. [11] and Ragan-Kelly et al. [12] reported methods that help the user tune the parameters of lights and check the result interactively by concentrating on a rapid preview of the final rendering. Design galleries [13] assist the user in controlling numerous parameters to produce desired effects. Shacked and Lischinski [14] suggested an automatic lighting technique based on the scene geometry information. Wang and Samaras [8] proposed a method to infer lighting conditions from an image with a known geometric shape. Anrys et al. [3] suggested an image-based approach that relights a real image captured by LightStage according to user inputs through a sketching interface. These methods are similar to ours in spirit, particularly with respect to simplifying the lighting process.
3 Method

3.1 Overview
Fig. 1 shows an overview of LightShop. Each iteration has five stages: user input (b)(h), data mapping (c), data adjustment (j), data clustering (e)(k), and light estimation (f)(l). Initially, a sketching interface receives user inputs in the form of colored vertices ((a)(b), Section 3.2). The user inputs are mapped to a bounding sphere to create a generalized data map in the data mapping stage ((c), Section 3.3). This data map is analyzed and partitioned to form clusters in the data clustering stage ((e), Section 3.5). Each cluster is converted to a light with appropriate parameters and positions in 3D space in the light estimation stage ((f), Section 3.6). Finally, LightShop provides feedback to the user that reflects the influence of the newly created lights by updating the viewport. At this point, the system is ready for another user interaction. Upon new inputs, LightShop invalidates all of the previously estimated lights and repeats the process with the existing data map. If the overall lighting look
requires modification, the parameter interface (h) accepts inputs from the user to change the data map in the data adjustment stage ((j), Section 3.4). At the end of each process, a new set of lights is provided to the user (l). The process of defining the optimal number of lights is also shown in Fig. 1. After the user removes the saturation of the lighting (h), all mapped data have a similar color. Consequently, the number of lights estimated is reduced to one in (l), as compared to two in (f).
Fig. 1. Overview of LightShop. The upper images show the process from the user input through the sketching interface and the lower images show the process from the user input through the parameter interface. The dots without contours represent user inputs and the dots with contours represent the mapped data. The gray triangles in the rightmost images are estimated lights.
3.2 User Interface
LightShop provides two main interfaces: a sketching interface and a parameter interface. The sketching interface allows the user to specify the regions to be lit and the colors of the light. The parameter interface accepts various user inputs to modify the overall look of the lighting.

Sketching Interface. Once the user determines the color, intensity, and radius of the brush, the sketching interface toggles between two different modes: the mapping mode and the brush mode. In the mapping mode, the user prescribes the characteristic of the target material, whether it is general or highlighted, for example. In the brush mode, the user decides how the movement of the brush influences the accumulated data. For example, selecting the Add option increases the brightness of the painted area, whereas Subtract decreases it. Replace changes an old color to a new color. During the sketching process, the selected vertices display the brush color along with the shading from the previous light setup. This real-time feedback helps predict the result in the next iteration.

Parameter Interface. One means of controlling the overall look of the lighting easily is to provide sliders to adjust the Gain, Contrast, Hue, Saturation, Brightness, and Color balance of the scene, employing the paradigm of common 2D commercial image editing software. Overshoot and Shadow, once checked,
are applied to all of the estimated lights. Offset, which is responsible for the minimum brightness of the scene, also proves to be useful.

3.3 Data Mapping
The data mapping stage converts the user inputs to a data map. The data map embodies lighting information and infers regions to be illuminated on the bounding sphere. For instance, Fig. 2 shows the front of the head area in green mapped to a bounding sphere. Once the map is created, further manipulations to change the lighting conditions are applied to this data map. In order to build a data map, a bounding sphere that is large enough to enclose the target objects is placed. The colored vertices are then mapped to the bounding sphere according to a mapping function. The mapping functions can be custom-made depending on the specific illumination effects to be achieved. We have constructed three different mapping functions that work for most cases - general mapping, highlight mapping, and environment mapping.
Fig. 2. General mapping maps the user inputs (a) to the bounding sphere (b) through the direction of the average of normals (yellow arrow). A region containing the colored vertices then would be illuminated from the estimated light (d). The dots on the sphere (b), in turn, are represented at their corresponding locations in a 2D data map unfolded on the polar coordinate (c).
General Mapping. One way to estimate a light direction is to employ the Lambertian shading model, as the greater part of the regions illuminated by a light are due to diffusion on the surface. According to the Lambertian shading model, a painted vertex $V_i \in [\text{position}, \text{color}, \text{normal}]$, $i = \{1, ..., n\}$, is lit with the brightest intensity when the light arrives from the direction opposite to the vertex normal $V_i^{normal}$. Here, $n$ denotes the number of painted vertices. Observation suggests that a good candidate direction for the light is the direction opposite to the average of the normals, $\bar{V}^{normal}$, of the vertices painted by the user's strokes. The use of averaging also suppresses noise from bumpy surface geometry. The intersection of this normal direction with the bounding sphere determines $P_i \in [\text{position}, \text{color}, \text{direction}, \text{target}]$, $i = \{1, ..., n\}$, which represents the result of the mapping. $P_i$ contains the following values:

\[
P_i = \begin{bmatrix} (V_i^{position} + t \cdot \bar{V}^{normal}) / 2r \\ V_i^{color} \\ -\bar{V}^{normal} \\ V_i^{position} \end{bmatrix} \tag{1}
\]
Here, $r$ is the radius of the bounding sphere, and $t$, which can be computed from $\| V_i^{position} + t \cdot \bar{V}^{normal} \| = r$, is the distance between the painted vertex on the object and the mapped point on the bounding sphere. Fig. 2 shows the general mapping method. Each painted vertex $V_i$ has a unique position $P_i^{position}$ in the 2D data map.

Highlight Mapping. The user-selected vertices can act as a highlight region. The Phong specular model dictates that light reflected in the direction toward the camera exhibits the most intense highlight. Therefore, the opposite of the reflected camera direction at the painted vertex is the candidate light direction, $P_i^{direction}$.

Environment Mapping. The LightShop architecture facilitates easy incorporation of image-based lighting (IBL). Instead of painting specific areas on the model, the user can provide a spherical environment image. The environment image contains color information at each pixel, which is projected as light toward the center of the sphere. Each (u, v) location in the image is mapped to the sphere using a polar coordinate system. This mapping determines the colors on the bounding sphere corresponding to the pixels in the image. Our candidate light directions are the opposite of the normal directions at each point on the bounding sphere, and they naturally point to the center of the sphere. The number of pixels in the environment image is typically very large, and using all of them incurs a heavy computational load. We simply sample 20x20 data points, balancing quality and performance.
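To make the general mapping concrete, the following Python sketch computes the averaged normal, intersects a ray from each painted vertex along that normal with a bounding sphere centered at the origin, and assembles the record of Equation (1). It is an illustration only: the origin-centered sphere, the data layout, and all function and variable names are our assumptions and are not part of the actual LightShop implementation (which is a Maya plug-in).

```python
import numpy as np

def general_mapping(positions, colors, normals, r):
    """Sketch of the general mapping (Eq. 1): map painted vertices onto a
    bounding sphere of radius r centered at the origin. Returns one record
    P_i = {position, color, direction, target} per painted vertex."""
    n_avg = np.mean(normals, axis=0)
    n_avg /= np.linalg.norm(n_avg)                 # averaged vertex normal
    records = []
    for v_pos, v_col in zip(positions, colors):
        # Solve ||v_pos + t * n_avg|| = r for the positive root t: the ray
        # from the vertex along the averaged normal hits the bounding sphere.
        b = 2.0 * np.dot(v_pos, n_avg)
        c = np.dot(v_pos, v_pos) - r * r
        t = (-b + np.sqrt(b * b - 4.0 * c)) / 2.0
        records.append({
            "position": (v_pos + t * n_avg) / (2.0 * r),   # Eq. (1), first row
            "color": np.asarray(v_col, dtype=float),
            "direction": -n_avg,                           # candidate light direction
            "target": np.asarray(v_pos, dtype=float),
        })
    return records

# Example: three painted vertices near the front of an object, sphere radius 2.
pos = np.array([[0.3, 0.1, 0.9], [0.2, 0.2, 0.95], [0.25, 0.0, 1.0]])
col = np.array([[0.0, 1.0, 0.0]] * 3)                      # green strokes
nrm = np.array([[0.0, 0.1, 1.0], [0.1, 0.0, 1.0], [0.0, 0.0, 1.0]])
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
mapped = general_mapping(pos, col, nrm, r=2.0)
```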
3.4 Data Adjustment
One of the factors that makes LightShop extremely versatile is the employment of a 2D image processing paradigm, obviating the need for the conventional 3D lighting setup procedure. The data map created in the previous step contains all of the lighting information that the user has established. This map can be freely modified at this stage to better reflect the desired lighting conditions by changing 2D parameters such as the Hue, Saturation, Brightness, Gain, Contrast, and Color balance. An Offset parameter, which is responsible for the minimum brightness of the lighting, is also provided to influence areas in which there is no mapped data. For example, the overall lighting can become brighter by turning the Offset value up even if there is no data on the data map. This 2D procedure is much faster than a typical 3D lighting setup, and the time savings become considerable when numerous lights need to be deployed in a scene. Consequently, it helps the user generate lighting conditions that are more creative and of a higher quality.
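The paper does not give the exact formulas behind these sliders, so the sketch below only illustrates the idea with standard image-editing definitions of gain, contrast, hue, and saturation applied per mapped sample; the function and parameter names are ours. Offset is not shown here because it is converted into ambient light in Section 3.6.

```python
import colorsys

def adjust_data_map(samples, gain=1.0, contrast=1.0, hue_shift=0.0, saturation=1.0):
    """Apply 2D image-editing style adjustments to the colors of the mapped
    data. `samples` is a list of dicts whose 'color' is an RGB triple in [0, 1]."""
    adjusted = []
    for sample in samples:
        r, g, b = sample["color"]
        # Gain: multiplicative brightness change.
        r, g, b = (min(1.0, c * gain) for c in (r, g, b))
        # Contrast: scale the distance from mid-gray.
        r, g, b = (min(1.0, max(0.0, 0.5 + (c - 0.5) * contrast)) for c in (r, g, b))
        # Hue and saturation are easiest to express in HSV space.
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        h = (h + hue_shift) % 1.0
        s = min(1.0, max(0.0, s * saturation))
        new_sample = dict(sample)
        new_sample["color"] = colorsys.hsv_to_rgb(h, s, v)
        adjusted.append(new_sample)
    return adjusted

# Removing the saturation pushes all samples toward a common color, which
# lets the clustering stage of Section 3.5 merge them into fewer lights.
samples = [{"color": (0.1, 0.8, 0.2)}, {"color": (0.7, 0.2, 0.2)}]
gray_map = adjust_data_map(samples, saturation=0.0)
```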
3.5 Data Clustering
This stage determines the number of lights by identifying the proper set of clusters. As more lights imply a longer rendering time, it is critical to minimize the number of lights deployed while reflecting the user’s intention correctly.
Fig. 3. Comparison of the diverse cluster validity functions. The second column (b) shows the normalized variations of the cluster validity according to the number of clusters. The blue line is a graph of Ray's function, the green line is that of Dunn's index, and the red line is that of the Davies-Bouldin index. The colored dots at the top of the graphs denote the maximum value of the cluster validity at $k_{optimal}$. In the data maps (c)(d)(e)(f), circles of different colors indicate different clusters.
Two cardinal conditions are taken into account. First, each cluster should be circle-shaped on the bounding sphere. An ellipse or any other non-circular region cannot be lit by a general light in ordinary circumstances. Second, the computations have to be fast for real-time interaction. The data map is partitioned with a K-means method satisfying these two requirements. The K-means assignment can be defined as $P \in C_i$ if $d(P, C_i^{center}) < d(P, C_j^{center})$, $i \neq j$. The position $P_i^{position}$ and color $P_i^{color}$ reflect the principal characteristics of the data. The sum of the weighted Euclidean distances determined by the color and the position defines the distance between two data points, $d(P_i, P_j)$:

\[
d(P_i, P_j) = \alpha \left\| P_i^{position} - P_j^{position} \right\| + \beta \left\| P_i^{color} - P_j^{color} \right\| \tag{2}
\]
Here, the relative contributions of color and position can be modulated using the weights $\alpha$ and $\beta$. It was found that setting $\alpha = 1$ and $\beta = 2$ worked well for all of the examples in this paper. The K-means method has two main drawbacks. The first is that the result is greatly influenced by the initial positions of the cluster centers. The second is that the number of clusters, $k$, must be determined a priori by the user. To overcome these drawbacks, the information provided by the accumulated data is exploited. This helps determine the initial positions of the cluster centers. In addition, various separation measures were tested with K-means clustering, and the best method is adopted to determine the optimal number of clusters.
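A compact sketch of the weighted distance of Equation (2) and a Lloyd-style assignment/update loop built on top of it is shown below. The sample layout (a 'position' and a 'color' vector per mapped data point), the fixed iteration count, and the handling of empty clusters are our assumptions; the paper specifies only the distance itself and the weights α = 1, β = 2.

```python
import numpy as np

ALPHA, BETA = 1.0, 2.0        # weights from the paper: position vs. color

def distance(p, q):
    """Weighted distance of Eq. (2) between two mapped data points/centers."""
    return (ALPHA * np.linalg.norm(p["position"] - q["position"])
            + BETA * np.linalg.norm(p["color"] - q["color"]))

def weighted_kmeans(samples, centers, iterations=20):
    """Assign every sample to its nearest center under Eq. (2) and update the
    centers as per-cluster means. Returns (clusters, centers)."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for s in samples:
            k = min(range(len(centers)), key=lambda i: distance(s, centers[i]))
            clusters[k].append(s)
        centers = [
            {"position": np.mean([s["position"] for s in c], axis=0),
             "color": np.mean([s["color"] for s in c], axis=0)}
            if c else centers[i]                  # keep an empty cluster's old center
            for i, c in enumerate(clusters)
        ]
    return clusters, centers

# Initial centers: in LightShop these come from the per-stroke means (Sec. 3.5);
# here two arbitrary samples merely stand in for them.
rng = np.random.default_rng(0)
samples = [{"position": rng.random(3), "color": rng.random(3)} for _ in range(40)]
clusters, centers = weighted_kmeans(samples, [samples[0], samples[20]])
```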
The user's strokes are a reasonable suggestion for the initial positions of the cluster centers, as most strokes are well separated in the data map. Moreover, the strokes implicitly contain the intentions of the user. The mean of the data from each stroke is used as a candidate for the initial position of a cluster center. The maximum number of clusters is the number of strokes: $k_{max}$ = number of strokes. For a data map generated by environment mapping, which does not have any user stroke information, the data is simply sampled at a regular interval to prevent conglomeration of the initial centers. One hundred data points (10x10) were sufficient in our tests.

The optimal number of clusters can be determined by measuring the cluster validity when the maximum number of clusters is given. The cluster validity indicates how well the clusters are separated. If the number of clusters is denoted as $k$ and the cluster validity function as $v(k)$, the optimal number of clusters can be formulated as $k_{optimal} = \arg\max_k(v(k))$. To make a fair comparison when evaluating the validity function, random changes of the initial positions of the cluster centers are disallowed. First, the maximum number of clusters $k_{max}$ is set according to the aforementioned method. The number of centers is then reduced one by one by combining the nearest pair of centers.

Many methods for measuring cluster validity have been developed. We tested many different data distribution scenarios with typical validity functions, including the Davies-Bouldin index [15], a generalized Dunn's index [16,17], and Ray's validity function [18]. As demonstrated in Fig. 3, these methods provide feasible results in most cases in our tests, while Ray's function [18] performed best at extracting sets of semi-circular clusters. Consequently, the optimal number of lights $k_{optimal}$ is defined as

\[
k_{optimal} = \arg\max_k \left( \frac{\min\left( d(C_i^{center}, C_j^{center}) \right)}{\frac{1}{k} \sum_{i=1}^{k} \sum_{P \in C_i} d(P, C_i^{center})^2} \right) \tag{3}
\]
Here, $1 < k \le k_{max}$ and $1 \le i, j \le k$, $i \neq j$. For further details, refer to [18]. Ray's function [18] cannot compute the cluster validity for a single cluster ($k = 1$). Therefore, an instance of virtual data that should not be clustered with other data is placed on the data map. This virtual data serves as an additional initial position of a cluster center. The actual optimal number of clusters is then one less than the number returned by the validity function. The location of the virtual data is set to the maximum distance on the data map, $d_{max}(\vec{0}, \vec{1})$ in Equation 2, from the cluster center when $k = 1$. The underlying rationale is that if the virtual data were positioned too close, it might merge with other clusters.
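The selection of the number of clusters can be sketched as follows, reusing the `distance` and `weighted_kmeans` helpers from the previous listing. For readability, the virtual-data handling for k = 1 is omitted, so only 2 ≤ k ≤ k_max is evaluated; the merge of the nearest center pair mirrors the reduction step described above. All names are again ours, not part of the actual system.

```python
from itertools import combinations
import numpy as np

def ray_validity(clusters, centers):
    """Ray's cluster validity (Eq. 3): the smallest inter-center distance
    divided by the average intra-cluster scatter. Larger means better separated."""
    inter = min(distance(a, b) for a, b in combinations(centers, 2))
    intra = np.mean([sum(distance(p, c) ** 2 for p in cl)
                     for cl, c in zip(clusters, centers)])
    return inter / intra

def optimal_clustering(samples, stroke_centers):
    """Evaluate k = k_max down to 2, merging the nearest pair of centers after
    each step, and return (validity, clusters, centers) of the best k."""
    centers, best = list(stroke_centers), None
    while len(centers) >= 2:
        clusters, centers = weighted_kmeans(samples, centers)
        v = ray_validity(clusters, centers)
        if best is None or v > best[0]:
            best = (v, clusters, centers)
        # Merge the nearest pair of centers to obtain the next smaller k.
        i, j = min(combinations(range(len(centers)), 2),
                   key=lambda ij: distance(centers[ij[0]], centers[ij[1]]))
        merged = {"position": (centers[i]["position"] + centers[j]["position"]) / 2,
                  "color": (centers[i]["color"] + centers[j]["color"]) / 2}
        centers = [c for t, c in enumerate(centers) if t not in (i, j)] + [merged]
    return best
```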
3.6 Light Estimation
LightShop generates two types of lights: ambient light and spot lights. The offset, which is defined as the minimum brightness in a scene, is converted into ambient light. Each cluster $C_k$ is converted to a spot light $L_k$ in this stage by calculating the light parameters: the position of the light source, the light
direction, the color, and the spot angle. The position of the mapped data, $P_i^{position}$, is the point on the bounding sphere that will be lit to illuminate the corresponding vertices $V_i$ on the object. The mean value of each parameter is initially computed. Assuming that $C_{k,i}$ is the $i$-th data point belonging to the $k$-th cluster $C_k$, the mean can be easily computed as $\bar{C}_k = \sum_{i=1}^{n} C_{k,i} / n$, where $n$ is the number of elements in the $k$-th cluster. For the spot light $L_k \in [\text{source}, \text{direction}, \text{color}, \text{spot angle}]$ shown in Fig. 2(d), the light direction $L_k^{direction}$ and the color $L_k^{color}$ are determined by computing the mean value directly. The position of the light source $L_k^{source}$ is determined as a position some distance away from the cluster center in the direction of $L_k^{direction}$. The spot angle is the maximum angle between the light direction and the direction towards the target positions of the data $C_{k,i}^{target}$. The computation of each light $L_k$ is summarized below:

\[
L_k = \begin{bmatrix} \bar{C}_k^{position} - l \cdot L_k^{direction} \\ \bar{C}_k^{direction} \\ \bar{C}_k^{color} \\ 2 \max\left( \arccos\left( L_k^{direction} \cdot (L_k^{source} - C_{k,i}^{target}) \right) \right) \end{bmatrix} \tag{4}
\]
The value of l, the distance between the light source and the bounding sphere, is set by the user. If Overshoot and Shadow were checked in the parameter interface, they are applied now to all of the generated Maya lights.
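A sketch of the per-cluster conversion in the spirit of Equation (4) follows. Here the cluster samples are assumed to carry the un-normalized point on the bounding sphere as 'position' (rather than the scaled map coordinate), and the spot angle is computed from the normalized direction toward each painted target vertex; the default distance l and all names are placeholders of ours, not the paper's.

```python
import numpy as np

def estimate_spot_light(cluster, l=5.0):
    """Convert one cluster of mapped samples into a spot light. Each sample
    holds 'position' (point on the bounding sphere), 'direction', 'color',
    and 'target' (the painted vertex it originated from)."""
    mean_pos = np.mean([s["position"] for s in cluster], axis=0)
    direction = np.mean([s["direction"] for s in cluster], axis=0)
    direction /= np.linalg.norm(direction)
    color = np.mean([s["color"] for s in cluster], axis=0)
    # Place the source l units away from the cluster center, against the
    # light direction, so the light shines back toward the object.
    source = mean_pos - l * direction
    # Spot angle: twice the largest angle between the light direction and the
    # direction from the source toward any painted target vertex.
    angles = []
    for s in cluster:
        to_target = s["target"] - source
        to_target /= np.linalg.norm(to_target)
        angles.append(np.arccos(np.clip(np.dot(direction, to_target), -1.0, 1.0)))
    return {"source": source, "direction": direction,
            "color": color, "spot_angle": 2.0 * max(angles)}

# One spot light per cluster; the Offset slider becomes a single ambient light.
# lights = [estimate_spot_light(c) for c in clusters]
```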
4 Results and Discussion
LightShop was implemented as a plug-in for Autodesk Maya 2008 using C++ and the Maya Embedded Language (MEL). We utilized the Paint Vertex Color Tool in Maya 2008 as the sketching interface. All of the tests were performed on an Intel Core2 2.4 GHz system with 4 GB of RAM and a GeForce 8800 GT graphics chipset. Only one core was used, as our implementation is single-threaded. The accompanying video (http://143.248.249.138/isvc09/paper183) shows live demonstrations of the results with each interface. Fig. 4(a) illustrates the process of setting up lighting through the sketching interface. Each row represents one iteration. From left to right, the columns show the viewport before the user input, the user input, the feedback, and the number of estimated lights. As can be seen in the figure, LightShop faithfully estimates the various light parameters with many DOFs, including the locations, directions, colors, and spot angles, in accordance with the user's intentions. How well LightShop reacts to the user input in determining the optimal number of lights is also demonstrated. As can be seen in the first and second iterations, the number of lights increases with the added user strokes. In the third iteration, the number of lights is reduced despite the increased number of user strokes, as the separated regions are combined by the new user input at the forehead. Fig. 4(b) shows the results from user input through the parameter interface. In a scenario where the user adjusts the gain, hue, saturation, and color
Fig. 4. Results from light setting scenarios using the sketching interface (a), using the parameter interface (b), and with complex and multiple objects (c). The white circles in (a) indicate the regions that have user inputs. Each colored bar of the color histogram (b) represents the frequency of each component (R, G, B).
balance for a scene sequentially, we compared the color histograms of the before and after images at each iteration. To eliminate the effects of the material on the color histogram, a white diffuse material was used for the target object. The variation of the histogram under the gain adjustment shows that the proportion of dark pixels is reduced while that of bright pixels is enhanced compared to the histogram of the 'before' image, especially for the red component. The variations of the other histograms show that the RGB component ratios of the data likewise change appropriately. Here, the saturation adjustment affects the number of lights as well as the color of the lights. Fig. 4(c) shows the performance for scenes containing complex and multiple objects. All of the results were produced by an amateur artist who had two years of experience. He was requested to set up the lighting conditions using LightShop for each given scene and background image shown in the figure. Each lighting setup took between 10 and 20 minutes including the conception time. Multiple objects were used, including two elephants, with a total of 85,024 triangles. The rightmost scene, which has the greatest number of accumulated data points, 11,016 points, required 0.374 seconds to estimate 8 lights at the last iteration. The turnaround time is fast enough for comfortable user interaction. Clearly, the use of additional triangles requires a longer computation time. Fortunately, however, a case with many triangles allows us to use a subset of the data. Appropriate sampling may lead to aggressive data reduction, if necessary. Fig. 5 compares the time spent and the final rendered quality of our result (center and right) using environment mapping with the result from IBL in Mental Ray (left). All of the images were produced under the same conditions and rendered at a resolution of 640x480. Typically, IBL produces a high-quality result after heavy rendering computation. In situations where speed is of
Fig. 5. Results from the environment mapping. The lowest row represents the number of estimated lights and the rendering time for a frame.
importance, it is difficult for IBL to determine the proper trade-off between speed and image quality. In contrast, LightShop determines the proper configurations of lights conforming to the data map itself. As seen in the figure, LightShop can dramatically reduce the rendering computation while producing a similar quality of rendered image, due to the optimized number of lights. The rendering of the center image required approximately 6 seconds using 19 estimated lights. It was found that the final rendered quality of the image from LightShop was as good as that from a typical IBL method. The rightmost image was produced by hardware rendering; it achieved only 10 fps with shadows under the same lighting conditions as the center image.

Although LightShop is very effective in setting up diverse lighting conditions rapidly, it also has limitations. While the framework of LightShop allows the incorporation of various custom-built mapping functions, the achievable effects remain limited to those under local illumination. Other types of illumination, such as GI or NPR, would be better handled with an optimization-based method [2,6]. We leave this issue as future work. LightShop assumes static lighting conditions. As the data clustering only addresses the spatial configuration of the data map, temporal coherency across frames is not guaranteed. For differing lighting conditions across frames, more sophisticated clustering algorithms with proper temporal constraints may be necessary.

Further extensions of the proposed approach are possible. Additional desired functions can be incorporated into the current LightShop system. For example, the effects of shadowing from a light can be added through a new mapping function, where the position of the light source is deduced from the specified shadow regions. Diverse image filters can also be applied in the data adjustment stage, again following the image editing paradigm. This will make the lighting setup process even more efficient and creative.
5 Conclusion
This paper introduces LightShop, a novel lighting system that helps a user generate high-quality and creative lighting conditions interactively and rapidly. The user sets up the lights for the scene while conveniently employing the paradigm of common 2D image editing software. The user interacts with LightShop through
direct sketching on the objects or through a simple parameter interface. LightShop then automatically determines the optimal number of lights conforming to the user inputs. Additionally, the user can adjust the overall look of the scene simultaneously, as in 2D image editing, without having to control numerous light parameters individually.
References

1. Alton, J.: Painting with light. Republished in 1995 by University of California Press, Berkeley (1949)
2. Pellacini, F., Battaglia, F., Morley, R.K., Finkelstein, A.: Lighting with paint. ACM Trans. Graph. 26, 9 (2007)
3. Anrys, F., Dutre, P., Willems, Y.D.: Image based lighting design. In: The 4th IASTED International Conference on Visualization, Imaging and Image Processing (2004)
4. Schoeneman, C., Dorsey, J., Smits, B., Arvo, J., Greenberg, D.: Painting with light. In: Proc. SIGGRAPH 1993, pp. 143–146 (1993)
5. Poulin, P., Ratib, K., Jacques, M.: Sketching shadows and highlights to position lights. In: Proc. CGI, p. 56 (1997)
6. Shesh, A., Chen, B.: Crayon lighting: sketch-guided illumination of models. In: Proc. GRAPHITE 2007, pp. 95–102 (2007)
7. Pellacini, F., Tole, P., Greenberg, D.P.: A user interface for interactive cinematic shadow design, vol. 21, pp. 563–566 (2002)
8. Wang, Y., Samaras, D.: Estimation of multiple illuminants from a single image of arbitrary known geometry. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, pp. 272–288. Springer, Heidelberg (2002)
9. Barzel, R.: Lighting controls for computer cinematography. J. Graph. Tools 2, 1–20 (1997)
10. Patow, G., Pueyo, X.: A survey of inverse rendering problems. Comput. Graph., 663–687 (2003)
11. Kristensen, A.W., Akenine-Möller, T., Jensen, H.W.: Precomputed local radiance transfer for real-time lighting design. In: Proc. SIGGRAPH 2005, pp. 1208–1215 (2005)
12. Ragan-Kelley, J., Kilpatrick, C., Smith, B.W., Epps, D., Green, P., Hery, C., Durand, F.: The lightspeed automatic interactive lighting preview system. In: Proc. SIGGRAPH 2007, p. 25 (2007)
13. Marks, J., Andalman, B., Beardsley, P.A., Freeman, W., Gibson, S., Hodgins, J., Kang, T., Mirtich, B., Pfister, H., Ruml, W., Ryall, K., Seims, J., Shieber, S.: Design galleries: a general approach to setting parameters for computer graphics and animation. In: Proc. SIGGRAPH 1997, pp. 389–400 (1997)
14. Shacked, R., Lischinski, D.: Automatic lighting design using a perceptual quality metric. Computer Graphics Forum 20, 215–226 (2001)
15. Davies, D., Bouldin, D.: A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell. 1, 224–227 (1979)
16. Bezdek, J., Pal, N.: Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B 28, 301–315 (1998)
17. Dunn, J.: A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Cybernetics and Systems 3, 32–57 (1973)
18. Ray, S., Turi, R.H.: Determination of number of clusters in k-means clustering and application in colour image segmentation. In: Proc. 4th Int. Conf. Advances in Pattern Recognition and Digital Techniques, Calcutta, India, pp. 137–143 (1999)
Progressive Presentation of Large Hierarchies Using Treemaps

René Rosenbaum and Bernd Hamann

Institute of Data Analysis and Visualization (IDAV), Department of Computer Science, University of California, Davis, CA 95616, U.S.A.
Abstract. The presentation of large hierarchies is still an open research question. In particular, the time-consuming calculation of the visualization and the cluttered display lead to serious usability issues on the viewer side. Existing solutions mainly address appropriate visual representation and usually neglect considering system resources. We propose a holistic approach for the presentation of large hierarchies using treemaps and progressive refinement. The key feature of the approach is the mature use of multiple incremental previews of the data. These previews are well designed and lead to reduced visual clutter and a causal flow in terms of a tour-through-the-hierarchy. The inherent scalability of the data thereby allows for a reduction in the consumed resources and short response times. These characteristics are substantiated by the results we achieved from a first implementation. Due to its many beneficial properties, we conclude that there is much potential for the use of progressive refinement in visualization.
1 Introduction
The trend for creating and collecting massive amounts of data has led to a strong demand for its meaningful analysis. With the prime objective being to visually convey the information inherent in the data, interactive visualization has always been an integral component to accomplish this task. In recent years, many valid approaches for the display of hierarchical datasets have been developed. One of the most prevalent tree visualization techniques is the treemap display [1]. It is easy to understand [2], has been applied to a million items [3], and has gained increasing acceptance in research and industry [4]. However, when it comes to the display of large datasets, it shares a common drawback with many other visualizations: details of the data are difficult to comprehend. Furthermore, significant resources for data processing, transmission, and display are usually required.

We show how to improve the treemap information display by progressive data handling and presentation. Instead of displaying all data at once, our approach takes advantage of the inherent scalability of the data to provide previews which initially require little data and are refined over time. This general benefit of progression is used (1) to provide a pre-defined or interactive tour-through-the-hierarchy and (2) to use the consumed system resources efficiently. Thereby, the approach combines the mandatory stages of compression, transmission, and demand-driven display into a single holistic strategy and blends nicely with most of the enhancements proposed for treemaps. Thus, it is a meaningful extension of many existing treemap implementations.

After introducing related work for treemap displays (Section 2) and the key properties of progressive information presentation (Section 3), we show how both approaches can be combined into a single presentation system (Section 4). While Section 4.1 is concerned with the technical implementation of progressive treemaps, Section 4.2 shows how this can lead to an advanced information presentation. Section 5 states the achieved results regarding semantic as well as technical aspects. Conclusions and directions for future research are presented in Section 6.

(The author gratefully acknowledges the support of Deutsche Forschungsgemeinschaft (DFG) for funding this research, #RO3755/1-1.)
2 The Treemap Display and Related Work
Visualization of hierarchical data has been extensively researched in the past leading to numerous approaches. The treemap display belongs to the class of visualizations implicitly communicating parent-child relationships. This is achieved by representing the tree as recursively nested rectangular cells (see Figure 1). This containment approach illustrates the structure of the tree and clearly identifies parent and child nodes belonging to the same subtree. Size, color and other graphical cell attributes thereby can be varied in order to represent nominal or ordinal data properties.
Fig. 1. Example of a progressive treemap providing a tour-through-the-hierarchy. Its scalability property significantly decreases resource consumption whenever it is suitable to reduce the number of displayed nodes.
The advantages of the treemap approach are obvious. It fully uses the available space for data display and applies intuitive means to provide information about the topology of the hierarchy. It has been demonstrated to perform well for providing overviews. Furthermore, when combined with an appropriate color scheme, treemaps make it immediately obvious which elements are significant and might require more attention. Thus, they can help to spot outliers and quickly find patterns.

However, there are also drawbacks that arise as the amount of data increases. Large datasets can result in several thousand cells, many of which are so small that they become indiscernible. In this case the treemap loses its ability to efficiently convey information. Especially in deep hierarchies it is difficult to determine size or containment and thus to see relationships between nodes.

To overcome these disadvantages, the treemap display has been steadily improved. Existing publications focus on layout and appearance. The traditional slice-and-dice approach [1] has been enhanced by squarified [5], ordered [5], Voronoi [6], or generalized layouts [4]. There are also radial layouts [7] and different three-dimensional approaches [8]. Cascaded treemaps [9] have been developed to emphasize containment within the tree structure, but at the cost of space. There are strategies to emphasize important nodes by applying spatial distortion [10]. However, even when these enhancements are applied, the available screen space is usually too small to allow for the legible display of large trees. One possible solution for this problem is to apply suitable interaction [11]. Browsing large hierarchies, however, is still a demanding task, usually coupled with many single browsing steps and loss of context.

Interestingly, only a few research efforts have addressed the underlying data handling required to process large hierarchies. Besides the fact that layout calculation is computationally expensive, all data must be available in advance. This makes the visualization difficult to use for large data volumes and for environments requiring data transmission. Similar problems in image communication have been overcome by the development and suitable application of progressive refinement [12]. By taking advantage of encoding schemes that convert the image into a multiscale representation, the data can be sent, processed, and displayed piecewise. With this approach, the consumed resources no longer depend on the original data volume but only on the required parts. The concepts of Regions of Interest (RoI) and Levels of Detail (LoD) used there correspond closely to the Degree of Interest (DoI) approach that is well known and successfully applied in computer graphics and visualization.
3 Progressive Information Presentation
Progressive refinement allows for a novel kind of information display that can basically be seen as a predefined or interactive animation of the data tightly coupled with a highly resource-saving processing and transmission system. Besides overcoming bandwidth-limitations for raster imagery, it has also been applied
to other kinds of data [13] and system constraints [14]. The main features and beneficial properties can be summarized as follows:

First conclusions with much less data. Progressive contents are organized data sequences whereby the reconstruction of a truncated sequence leads to an abstraction of the content with less detail. Thus, content previews can be provided at any point. With little data available, first conclusions can already be drawn.

Efficient use of system resources. Progressive contents are compressed and stored in a modular hierarchical structure allowing for flexible and highly efficient access and delivery of different abstraction levels. This implements the paradigm "Compress Once: Decompress Many Ways" [15, p. 410] and leads to the advantage that just the data required on the viewer's side is handled.

Additional properties of the data become visible. The different data views produced during refinement may allow for the conveyance of additional properties and characteristics during presentation, which leads to a deeper understanding of the data. Because a progressive presentation is usually designed to show important data in early stages, that data stands out in the representation. RoI, LoD, or similar concepts like Geometry of Interest (GoI) may be applied to describe this procedure formally [12,16].

Progression is data-seeing. Progressive refinement provides a preview sequence. As each subsequent view adds detail to the display, an incremental buildup of knowledge about the data can be achieved (see Figure 2). Such a tour-through-the-data supports well-accepted visualization principles such as the information-seeking mantra [17]: Overview first, zoom and filter, then details-on-demand.

To apply progressive refinement, the classic visualization pipeline must be enhanced by a few basic principles [16]. A client-server environment is typically assumed in order to use its full potential. On the server side, the data is preprocessed and stored for multiple use. The preprocessing stage basically transfers the data into a hierarchical data structure that is further compressed for permanent storage and resource-efficient transmission. During a viewing request, this data structure is accessed and flexibly traversed to create meaningful previews, which are sequentially transmitted to the viewing device. The traversal can be predefined by a presentation author and interactively modified dependent on current interests. By using modular compression schemes it is ensured that the contents can be delivered without requiring further processing. The client serves only as the viewing and interaction component. For progressive presentation, the different previews are successively extracted, decoded, and, if necessary, postprocessed before display.
Fig. 2. A progressive treemap designed to convey tree topology and structural characteristics of important nodes, applying simple breadth-first (top) and authoring-based (bottom) traversal
4 Progressive Treemaps
Progressive refinement provides many advantages for data visualization. We show how to implement and take advantage of this approach for the treemap display in order to reduce consumed resources (Section 4.1) and to improve the conveyance of information about the topology of the hierarchy (Section 4.2). The basic strategy is similar to the general progression procedure: instead of presenting all data at once, the different nested sets of hierarchy nodes are refined top-down, whereby certain nodes and their subtrees can be prioritized before or during interactive exploration.
4.1 Progressive Refinement for Treemaps
A basic requirement for progressive refinement is the conversion of the original data into a hierarchical structure. Due to the fact that the result of the treemap layout calculation is already a hierarchy of rectangles, however, this stage can be skipped for progressive treemaps. To encode the rectangles appropriately, an approach similar to delta coding [18] taking advantage of the nesting property of the rectangles is applied. As the respective positions can be stated relative to the parent cell, a much smaller value range is required. This in turn can be used to reduce the number of bits needed to encode position. As all the values of the hierarchy are independently accessible, the approach fully supports random access within the encoded data and does not limit inherent scalability. It is also possible to increase compression performance by further encoding single values. Due to the modular structure of the encoded data, nodes and subtrees can be almost arbitrarily assigned to the different previews. During traversal and sequencing of the hierarchy, it is only necessary to ensure that prior to including a
node, all of its ancestor nodes have already been part of the sequence. This ensures that screen positions can be decoded and restored properly. Due to the common overview-then-detail refinement, this is usually not a limitation.

To describe prioritization between nodes and the corresponding subtrees, we vary the commonly used term Region of Interest to Node of Interest (NoI) to take into account the terminology used for hierarchical data. The traditional term Levels of Detail nicely corresponds to the different tree levels and is therefore kept. A single NoI is specified by a unique index, its LoD by the desired number of subtree levels. Based on these specifications, appropriate traversal orders and previews can be developed by an author or interactively modified by the viewer.

Once created and traversed, decoding and viewing the progressive contents on the viewer side is straightforward. Based on a low-complexity protocol indicating the parent-child relationships of incoming geometry, the data hierarchy is piecewise reconstructed and the respective data values are decoded. Preview switches indicate which data belongs to which preview. When a switch is signalled, all currently available data is released for display. As the treemap layout has been computed without prior knowledge of the respective screen properties, a postprocessing step scales the geometry to the screen dimensions before rendering.
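The following sketch illustrates the idea in Python: cell rectangles are delta-coded relative to their parent, the record stream is emitted in breadth-first order (so every ancestor precedes its descendants), chopped into previews, and reconstructed incrementally on the client. Bit-level compression, NoI/LoD prioritization, and the actual protocol are omitted; all names and the fixed chunk size are our assumptions, not part of the described implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]        # x, y, width, height (absolute)

@dataclass
class Cell:
    rect: Rect
    children: List["Cell"] = field(default_factory=list)

def encode(root: Cell):
    """Delta-code every cell relative to its parent, in breadth-first order."""
    records, queue, next_id = [], deque([(root, None, None)]), 0
    while queue:
        cell, parent_id, parent_rect = queue.popleft()
        x, y, w, h = cell.rect
        if parent_rect is not None:
            x, y = x - parent_rect[0], y - parent_rect[1]   # small relative offsets
        records.append((next_id, parent_id, (x, y, w, h)))
        for child in cell.children:
            queue.append((child, next_id, cell.rect))
        next_id += 1
    return records

def previews(records, nodes_per_preview=2):
    """Chop the record stream into previews; after each chunk a preview switch
    would be signalled so the client releases the data for display."""
    for i in range(0, len(records), nodes_per_preview):
        yield records[i:i + nodes_per_preview]

def decode_preview(chunk, absolute):
    """Client side: restore absolute rectangles using already-decoded parents."""
    for nid, parent_id, (x, y, w, h) in chunk:
        if parent_id is not None:
            x, y = x + absolute[parent_id][0], y + absolute[parent_id][1]
        absolute[nid] = (x, y, w, h)
    return absolute

root = Cell((0, 0, 100, 100), [Cell((0, 0, 60, 100)),
                               Cell((60, 0, 40, 100), [Cell((60, 0, 40, 50))])])
shown = {}
for chunk in previews(encode(root)):
    decode_preview(chunk, shown)      # each pass yields the next, refined preview
```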
4.2 Progressive Information Presentation
Appropriate information presentation by meaningful preview sequences depends strongly on the traversal of the data. Due to the variety of possible presentation goals and user preferences, however, the strategies and options for valid hierarchy traversal are manifold. The simplest refinement strategy is a top-down breadth-first traversal. Even though straightforward, it can still reveal important knowledge about the data, like the number of hierarchy levels or variations in the depth of subtrees. However, often not all nodes are of the same importance. In such cases it is appropriate to introduce prioritization or to limit the shown LoD. To influence the corresponding traversal strategies, we propose to apply either predefined authoring-based or dynamic interaction-based refinement, or a combination of both.

Authoring-based refinement. Besides keeping the parent-to-child traversal order, the author has no limitations when creating a presentation sequence. The author is free to select an appropriate number of previews and to manually specify an LoD for every node and preview. For simpler authoring, however, we propose a semi-automatic specification approach. It interpolates the LoD values of single nodes whenever there is no explicit declaration. This allows for quick authoring even when the hierarchy and the number of previews are large.

Keeping in mind our goal to convey the structure, localization, and relation of nodes and subtrees to others, the example depicted in Figure 2 demonstrates the beneficial properties of a typical progressive treemap presentation. In order to provide an overview and to indicate structural dependencies and relations, the first previews show the first three levels of the node hierarchy only (Figure 2, top). It can be seen how the data refines globally. In order to prioritize and display an
important node that resides at a much deeper hierarchy level, only its ancestor nodes are added to the next previews. This leads to a local refinement of the desired geometry as well as its corresponding context (Figure 2, bottom, left). Once the selected node is displayed at the desired LoD, the refinement may be finished. Our example, however, continues with the refinement of the next higher level relative to the prior node in order to provide further context to local relatives (Figure 2, bottom, middle). This prioritization and refinement of one or multiple items is continued until all data is displayed (Figure 2, bottom, right).
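One possible reading of the semi-automatic LoD specification mentioned above is sketched below: the author declares an LoD for a node in a few previews only, and the remaining previews are filled in by linear interpolation (rounded to whole levels) and clamped at the ends. Linear interpolation and the rounding are our assumptions; the paper only states that undeclared values are interpolated.

```python
def interpolate_lod(declared, num_previews):
    """Fill in per-preview LoD values (number of shown subtree levels) for one
    node from a sparse authored specification. `declared` maps a preview index
    to an LoD; missing previews are linearly interpolated between the nearest
    declarations and clamped before the first / after the last declaration."""
    keys = sorted(declared)
    lods = []
    for p in range(num_previews):
        if p <= keys[0]:
            lods.append(declared[keys[0]])
        elif p >= keys[-1]:
            lods.append(declared[keys[-1]])
        elif p in declared:
            lods.append(declared[p])
        else:
            lo = max(k for k in keys if k <= p)
            hi = min(k for k in keys if k >= p)
            t = (p - lo) / (hi - lo)
            lods.append(round(declared[lo] + t * (declared[hi] - declared[lo])))
    return lods

# The author pins the node at LoD 1 in preview 0 and LoD 4 in preview 6;
# previews 1-5 are filled in automatically.
print(interpolate_lod({0: 1, 6: 4}, num_previews=8))   # [1, 2, 2, 2, 3, 4, 4, 4]
```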
Fig. 3. The interaction-based refinement of a node postpones an authoring-based sequence until the desired LoD for the node is achieved (variation of the authoring-based presentation sequence shown in Figure 2)
Interaction-based refinement. In order to simplify the conveyance of local properties, we support interactive node selection and prioritization at any time. This can easily be implemented based on the introduced processing scheme. Once a node is selected, the traversal component reacts by postponing a possible authoring-based traversal and starts refining the desired NoI (see Figure 3, left). The node is refined until its specified LoD is achieved or the user moves to the next node (Figure 3, middle). It is also possible to assign the same priority to multiple nodes, leading to simultaneous node refinements. Once all desired nodes are displayed at the respective LoD, the traversal continues with the prior refinement sequence (Figure 3, right). To avoid redundancies, it is always ensured that already displayed nodes are not handled twice. Typical control elements for animations, such as pause and forward buttons, complete the suite of supported interaction means.
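The interplay of authored and interactive refinement can be pictured as a small scheduler: interactively selected Nodes of Interest are served from a priority queue, the authored order resumes afterwards, and nodes that have already been displayed are never transmitted twice. Ancestor bookkeeping and LoD limits are left out; class and method names are ours, not part of the described system.

```python
from collections import deque

class RefinementScheduler:
    """Interleave an authored refinement sequence with interactive NoI requests."""

    def __init__(self, authored_order):
        self.authored = deque(authored_order)   # node ids in authored order
        self.priority = deque()                 # node ids requested interactively
        self.sent = set()

    def select_noi(self, node_ids):
        """The viewer picked one or more nodes; refine them next."""
        self.priority.extend(node_ids)

    def next_node(self):
        """Return the next node id to transmit, or None when everything is sent."""
        for queue in (self.priority, self.authored):
            while queue:
                nid = queue.popleft()
                if nid not in self.sent:        # never handle a node twice
                    self.sent.add(nid)
                    return nid
        return None

# The authored tour covers nodes 0..5; after node 0 the viewer interrupts to
# inspect node 9 (and node 4 as extra context), then the tour resumes.
sched = RefinementScheduler([0, 1, 2, 3, 4, 5])
first = sched.next_node()                       # 0
sched.select_noi([9, 4])
rest = [sched.next_node() for _ in range(7)]    # [9, 4, 1, 2, 3, 5, None]
```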
5 Results and General Properties
Semantic aspects. By providing different well-designed incremental previews, progressive refinement allows for a novel kind of “animated” data presentation. It supports and fulfills many of the principles and requirements stated in the literature for successful data visualization and provides options to align the presentation to the respective goals. We show how answering different questions raised by the 2003 InfoVis contest concerning the display of hierarchical data can be significantly simplified by applying progressive refinement.
We found that a progressive treemap mainly supports answering questions about topology, without requiring the widely applied color coding. Typical overall characteristics of the hierarchy (e.g., What is the deepest branch? ) are inherently communicated by breadth-first traversal, which is simple and does not require any external authoring. The same applies to questions that usually require interaction to be answered (e.g., What are the properties of the first nodes? ). The provided means to suspend refinement, however, help to overcome possible time constraints. Questions related to single data items (e.g., What is the path of this node? ) or to their local relatives (e.g., What are the children or siblings of this node? ) can be easily answered by prioritization using authoring- or interactionbased refinement. In early presentation stages, local refinement is also a great means to emphasize the importance of nodes. Other problems, like appropriate labeling, may also be overcome by taking advantage of the uncluttered views in early progression stages, but have not been considered in our implementation.
Fig. 4. A typical progressive presentation of a large hierarchy and the amount of data required to display the previews with respect to the whole volume (0.4%, 1.9%, 15%, 18%, 20%, 32%, 72%, and 100%). (Data set: logs A 0301-01.xml, part of the 2003 InfoVis contest concerning hierarchical data.)
Technical aspects. Another major advantage of progressive refinement is its general resource-saving characteristic. It tightly combines the different processing stages into a single strategy and thus is rather useful for large data volumes. The modular and scalable structure of the data allows for accurate delivery of exactly those parts needed for a certain representation and purpose. It decouples the required resource consumption from the original amount of data. This allows for fast feedback even for large hierarchies as well as in low-bandwidth environments and is a significant advantage especially for strongly limited viewing devices. As illustrated in Figure 1, such devices can usually process and display only a fraction of all available data in an appropriate manner. In this example just 1.03% (left) and 18.08% (middle) of all data is needed to provide
an appropriate and uncluttered overview of the hierarchy. The same applies when the viewer has gained the desired knowledge and aborts the presentation before all detail is shown. Figure 4 shows the relative data volumes required during an example presentation. It can be seen that early previews can be provided quickly and require few system resources. Because data volumes and therefore processing times increase with progression, the response times in early progression stages are much shorter. If interactivity is mandatory, only the number of displayed nodes must be adapted. This can be accomplished by selecting relevant NoIs only or by limiting LoDs. Contrary to the traditional approach requiring and displaying all nodes, this is of great importance for many applications.

General properties. The introduced approach can be applied together with almost all of the different layouts proposed for the treemap display. As many interactions can be specified by the NoI/LoD concept, it can also be combined with existing browsing techniques, e.g., [11]. Furthermore, the use of a single server-side "multi-purpose" data structure avoids the costly layout computation and adaptation when the presentation is to be shown multiple times and with different goals. This is a significant benefit for smart environments [19].

However, there is no approach without shortcomings. Probably the main limitation of progressive refinement is the fact that not all visualization goals can be supported. This applies especially to goals avoiding or inverting the overview-then-detail principle. Furthermore, abstractions provided in previews may lead to wrong conclusions. As the proposed postprocessing stage changes the cell aspect ratio, the approach cannot be applied to squarified treemaps. To overcome this limitation, it must be ensured that the layout and screen dimensions match accordingly.
6 Conclusions and Directions for Future Research
Progressive treemaps can be used to overcome the different problems imposed by large hierarchies. Progressive refinement is able to provide multiple data previews and can reduce the consumed system resources significantly. It further allows for authoring-based tours-through-the-hierarchy and interactive manipulation of the refinement sequence. The strategy can be combined with other existing treemap approaches and applied to hierarchies of different size and topology. It enhances the extraction of knowledge from the data and increases system performance. Although the feedback we received concerning our first implementation is promising, future research should provide an empirical proof of the usefulness of progression by means of a user study. By taking advantage of the foundations laid in this research effort, it also seems meaningful to apply the approach to other visualizations, e.g., the balloon focus [10], in order to reduce resource consumption.
References

1. Shneiderman, B.: Tree visualization with tree-maps: 2-d space-filling approach. ACM Transactions on Graphics 11, 92–99 (1992)
2. Johnson, B.: Treemaps: Visualizing Hierarchical and Categorical Data. PhD thesis, University of Maryland (1993)
3. Fekete, J.D., Plaisant, C.: Interactive information visualization of a million items. In: IEEE Symposium on Information Visualization, vol. 0, p. 117 (2002)
4. Vliegen, R., van der Linden, E.J.: Visualizing business data with generalized treemaps. IEEE Transactions on Visualization and Computer Graphics 12, 789–796 (2006)
5. Bruls, M., Huizing, K., van Wijk, J.: Squarified treemaps. In: Proceedings of the Joint Eurographics and IEEE TCVG Symposium on Visualization, pp. 33–42 (1999)
6. Balzer, M., Deussen, O.: Voronoi treemaps. In: Proceedings of IEEE Symposium on Information Visualization, pp. 49–56 (2005)
7. O'Donnell, R., Dix, A., Ball, L.: Exploring the pietree for representing numerical hierarchical data. In: Proceedings of HCI 2006, London. Springer, Heidelberg (2006)
8. Bladh, T., Carr, D.A., Kljun, M.: The effect of animated transitions on user navigation in 3d tree-maps. In: Proceedings of the Ninth International Conference on Information Visualisation, Washington, DC, USA, pp. 297–305 (2005)
9. Lü, H., Fogarty, J.: Cascaded treemaps: examining the visibility and stability of structure in treemaps. In: Graphics Interface, pp. 259–266 (2008)
10. Tu, Y., Shen, H.W.: Balloon focus: a seamless multi-focus+context method for treemaps. IEEE Transactions on Visualization and Computer Graphics 14, 1157–1164 (2008)
11. Blanch, R., Lecolinet, É.: Browsing zoomable treemaps: Structure-aware multi-scale navigation techniques. IEEE Transactions on Visualization and Computer Graphics 13, 1248–1253 (2007)
12. Rosenbaum, R., Schumann, H.: Progressive raster imagery beyond a means to overcome limited bandwidth. In: Proceedings of Electronic Imaging - Multimedia on Mobile Devices (2009)
13. Lee, H., Desbrun, M., Schröder, P.: Progressive encoding of complex isosurfaces. In: Proceedings of ACM SIGGRAPH, pp. 471–476. ACM Press, New York (2003)
14. Pascucci, V., Laney, D.E., Frank, R.J., Scorzelli, G., Linsen, L., Hamann, B., Gygi, F.: Real-time monitoring of large scientific simulations. In: Proceedings of ACM Symposium on Applied Computing (2003)
15. Taubman, D., Marcellin, M.: JPEG 2000: Image compression fundamentals, standards and practice. Kluwer Academic Publishers, Boston (2001)
16. Rosenbaum, R., Schumann, H.: Progressive refinement - more than a means to overcome limited bandwidth. In: Proceedings of Electronic Imaging - Visualization and Data Analysis (2009)
17. Shneiderman, B.: The eyes have it: A task by data type taxonomy for information visualizations. In: Proceedings of the IEEE Symposium on Visual Languages, pp. 336–343 (1996)
18. Deering, M.: Geometry compression. In: Proceedings of the ACM SIGGRAPH Conference on Computer Graphics and Interactive Techniques, pp. 13–20. ACM, New York (1995)
19. Thiede, C., Schumann, H., Rosenbaum, R.: On-the-fly device adaptation using progressive contents. In: Proceedings of the International Conference on Intelligent Interactive Assistance and Mobile Multimedia Computing (2009)
Reaction Centric Layout for Metabolic Networks

Muhieddine El Kaissi 1, Ming Jia 2, Dirk Reiners 1, Julie Dickerson 2, and Eve Wuertele 2

1 CACS, University of Louisiana, Lafayette, LA 70503
2 VRAC, Iowa State University, Ames, IA 50011
Abstract. The challenge is to understand and visualize large networks with many elements while at the same time giving the biologist enough of an overview of the whole network to see and understand the relationships between subnets. Classical biological visualization systems like Cytoscape use standard graph layout algorithms based on the molecules (DNA, RNA, proteins, metabolites) involved in the processes. We propose a new approach that instead focuses on the reactions that transform molecules, using higher-level macro-glyphs that summarize a large number of molecules in a compact unit, thus forming very natural and automatic clusters. We also employ natural clustering approaches in other areas of typical metabolic networks. The result is a graph with about 50-60% of the original node count and 20-30% of the original edge count, which simplifies efficient layout and interaction significantly.
1 Introduction
Developments in modern analysis and computer processing technologies have revolutionized many fields of the life sciences. The complete genomes of biologically relevant organisms like Arabidopsis, soybean, and E. coli are readily available. Combined with gene expression data from affordable gene chips, which can report the activity levels of thousands of genes in a cell in one step, this results in a vast amount of raw data that can help understand the biological processes in a cell. However, this is only part of the picture of the cell; another important part is how the cell chemistry is controlled.

Integrating chemistry with biology was introduced in the 1930s with molecular biology [1], which is the study of biology at a molecular level. The field overlaps with other areas of biology and chemistry, particularly genetics and biochemistry. Biochemistry studies the chemical properties of important biological molecules, like proteins, in particular the chemistry of enzyme-catalyzed reactions.

Chemical reactions are important to all levels of biology. In the simplest terms, a reaction requires reactants and products. Reactants are the atoms or molecules that are involved in the reaction, and products are the atoms or molecules resulting from the reaction. In most biological reactions, enzymes act
82
M. El Kaissi et al.
as catalysts to increase the rate of a reaction. Cells are continuously sensing and processing information from their environments and responding to it in sensible ways. The communication networks on which such information is handled often consist of systems of chemical reactions, such as signaling pathways and/or metabolic networks. Systems of chemical reactions like these are pervasive in biology. A metabolic network can be defined as the complete set of processes that determine the physiological and biochemical properties of a cell. As such, metabolic networks contain the chemical reactions of metabolism as well as the regulatory interactions that guide these reactions. A metabolic pathway is a series of chemical reactions occurring within a cell leading to a major end-product. In each pathway, a principal chemical is modified by chemical reactions. Enzymes catalyze these reactions, and often require dietary minerals, vitamins and other cofactors in order to function properly. The study of the structure and synthesis of proteins has become one of the principal objectives of biochemists. Biologists have identified pathways and genes critical to cell function, and some of this information has been captured in public databases, such as Biocyc, MetNet, Reactome . Recent work has begun to reveal universal characteristics of metabolic networks, which shed light on the evolution of the networks [18]. Understanding these processes can be challenging. In most cases there are many influencing factors that control activity levels in many different ways, which in turn be controlled by all kinds of substances in the cell. However, the actual processes and their relationships cannot be automatically extracted from the raw data collected from genome sequences. Developing hypotheses and an understanding of the network of relations between genes, RNA, proteins and metabolites is a very difficult problem which biologists need to use as many as available tools to help their understanding. One of the most powerful tools in the biologists arsenal is visualization. The relationships and connections between the elements in a biochemical reaction have or pathway been shown graphically, to try to visualize more complex metabolic networks as a natural continuation. The biology community has used a vast array of methods developed for generic graph visualization (see section 2), resulting in a number of systems partially or largely specialized for biological data, the most widely used one being the Cytoscape system [16]. Most of the current visualization systems use standard graph layout methodologies, without taking into account the characteristics of biological, specifically metabolic, networks. In this paper we propose a different approach to thinking about visualizing metabolic networks, called Reaction Centric Layout (RCL). The rest of the paper is structured as follows. We present related work in graph drawings and visualization tools. Then, we introduce Reaction Centric Layout and its consequences for visualizing metabolic networks. Some results show that RCL can significantly reduce the complexity of the metabolic graph, resulting in simpler and more efficient as well as easier to understand layouts as compared to classical layout, e.g. using Cytoscape. The paper is closed with a conclusion and ideas for future work.
2 Related Work
A limited number of research efforts have tried to adapt standard algorithms to better suit the needs of biological visualization. One example is BioLayout [9]. This layout can generate 2D or 3D representations of similarity data. BioLayout employs an extensively modified Fruchterman-Reingold graph layout algorithm for the purpose of similarity analysis in biology [11]. The Fruchterman-Reingold graph layout algorithm is based on the force-directed layout [13]. On the software side more work has been done to support the specific needs of biological network analysis. CellDesigner is a process diagram editor for gene-regulatory and biochemical networks [12]. PATIKA is a web interface for analyzing pathway data through graph visualization [5]. ProViz is a tool for protein interaction visualization based on the Tulip platform [15,2]. Exploring metabolic networks and gene expression can also be done in a virtual reality environment like MetNet3D, where complex data can be displayed and manipulated intuitively [19]. The most prominent example and the most widely used system in metabolic biology is Cytoscape [16]. Cytoscape is a software environment for integrating biomolecular interaction networks with expression data. A recent survey paper from 2007 explores most of the previously cited tools together with thirty others [17,3,4]. All of these previously mentioned tools represent substances, i.e., proteins, as nodes and reactions as edges.
3 Reaction Centric Layout
One of the core problems of visualizing metabolic networks is the large number of nodes and edges in the graph, which makes it difficult to find a layout that works well with regular layout algorithms. The goal of Reaction Centric Layout and the other macro-glyph operations described below is to reduce the number of nodes and edges in the graph to simplify the final layout task. The methods themselves are orthogonal to the actual layout; the examples in this paper use the GraphViz package to find the final positioning of the nodes, but any other layout algorithm can be used [8].
3.1 Basic Concept
In graph visualization the core data is a network of nodes connected by edges. Practically all existing visualizations for metabolic and signaling networks represent the reactants and products as nodes and the reactions as the edges between them, see fig. 1b. This kind of representation is logical, as the nodes represent the physical objects, the different kinds of molecules from DNA to simple molecules like water, and the reactions transform one physical state into another one. However, in a metabolic network the reactions can also be considered the focus area, because it is these activities of the cell that need to be understood.
Fig. 1. Classical Layout vs. Reaction Centric Layout: (a) Reaction Centric Layout, (b) Regular Layout
We have modified the representation of biological data in node-link graphs which allows biologists to see data from a different perspective. In reaction centric layout, reactions are now represented as nodes, and reactants/products are represented as links. For example, a reactant from reaction R1 that is a product to reaction R2 is now represented by an edge (or link) between reactions R1 and R2. The name of the reactant/product is used to label the edge. The main benefit of this new representation of metabolic/signaling networks is that it leads to a very natural and intuitive clustering process that can be used to reduce the number of nodes and edges in the graph significantly. We employ this clustering by forming macro-glyphs that compactly represent a large number of nodes and edges.
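The construction of such a reaction-centric graph can be sketched in a few lines of Python. The following fragment is a hypothetical illustration only; the class and function names and the example reactions are made up for this sketch and are not taken from MNV or MetNetDB.

```python
from collections import defaultdict

class Reaction:
    def __init__(self, name, reactants, products):
        self.name = name
        self.reactants = set(reactants)
        self.products = set(products)

def reaction_centric_graph(reactions):
    """Return edges (producer, consumer, substance) between reaction nodes.

    A substance that is produced by one reaction and consumed by another
    becomes a directed, labeled edge between the two reaction nodes."""
    producers = defaultdict(set)   # substance -> reactions producing it
    consumers = defaultdict(set)   # substance -> reactions consuming it
    for r in reactions:
        for p in r.products:
            producers[p].add(r.name)
        for s in r.reactants:
            consumers[s].add(r.name)

    edges = []
    for substance in set(producers) | set(consumers):
        for src in producers[substance]:
            for dst in consumers[substance]:
                if src != dst:
                    edges.append((src, dst, substance))
    return edges

# Hypothetical two-step fragment resembling the methionine cycle example.
r1 = Reaction("R1", ["L-methionine", "ATP"], ["S-AdoMet", "phosphate"])
r2 = Reaction("R2", ["S-AdoMet"], ["ACC"])
print(reaction_centric_graph([r1, r2]))   # [('R1', 'R2', 'S-AdoMet')]
```

In this toy example the two reaction nodes are connected by a single edge labeled with the shared substance, while ATP and phosphate do not generate edges between reactions.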
3.2 Macro-glyph Representation
At the core of Reaction Centric Layout is a macro-glyph that represents a reaction, the involved reactants and products. This representation can combine a large number of nodes in the classical layout in a very compact form. Additional edges, such as positive/negative regulations, that affect reactions can also be linked directly into the reaction glyph. It also guarantees that all components that are relevant to a reaction are laid out together and will fit on a single screen, relieving the user from having to possibly follow a large number of different edges to positions outside the viewing area. An analysis of the metabolic networks shows that there are a number of different types of reactants and products that can and should be handled in different ways. The basic ones are just connections between reactions. These appear in just a small number of reactions, in most cases as the product of just one and the reactant of one or a number of them. They form the actual backbone of the reaction pathway, and their understanding is the main goal of the analysis process. They are visualized as labeled edges between the reactions that they are a product of and a reactant for (L-methionine, SAM synthease and S-AdoMet in fig. 2). They can come from any direction (based on the follow-up layout chosen), and are directed. Incoming edges are reactants, outgoing edges are products.
Fig. 2. Photophosphorylation using Reaction Centric Layout
In addition to these inner reactants (resp. products), there are the edge reactants (resp. products). These do not come from another reaction, or they do not go into another reaction. There can be several different reasons for their existence. In most cases, the network being analyzed is only a part of the whole cell, therefore there are many molecules that are created by parts of the cell that are currently not considered. One example for this is the ethylene in the widely known ethylene biosignaling pathway [14]. In addition to this they might come from outside the cell, e.g. as nutrients. Similarly for products, many of them might go into other pathways in the cell that are not part of the current visualization. In addition to those they might just be components that are not used by the cell in any other process, i.e. they are waste products. These edge reactants/products are kept as individual nodes for visualization. As they form an important component of the pathway's definition and function as well as their interactions with the rest of the cell they are important enough to be represented explicitly and very visibly. The last kind of reactants/products are the elementary and basic small molecules that are used in many reactions, like water, carbon dioxide, oxygen etc. There are a number of known ones, or they can be extracted from the pathway based on the number of reactions they are involved in. In general they are involved in a very large number of reactions, and in most cases they are practically ubiquitous in the cell. Therefore, considering all of them as a single node instance would skew the layout to the point of unusability. Every visualization system for metabolic networks faces this problem, and in most cases these ubiquitous small molecules are just represented with duplicated nodes. In our case we just add them as reactants/products to the macro-glyph. Note that they can be either reactant or product, in which case they would be shown on the top or bottom part of the macro-glyph. Fig. 2 shows a close-up example from the ethylene signaling pathway shown in fig. 5. In it an enzymatic reaction transforms L-methionine into S-AdoMet. It needs ATP, which is split into phosphate and pyrophosphate. The whole reaction is catalytically supported by the SAM synthease coming in from the bottom.
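The three roles just described (inner, edge, and ubiquitous reactants/products) can be derived automatically from the pathway data. The following Python sketch is an illustration only; the ubiquity threshold and the list of known small molecules are assumptions made for this sketch, not values taken from the paper.

```python
def classify_substances(reactions,
                        known_ubiquitous=("water", "ATP", "ADP", "CO2", "O2"),
                        ubiquity_threshold=5):
    """Classify substances into the three roles described above.

    `reactions` is an iterable of (reactants, products) pairs of substance
    names.  'inner' substances connect two reactions and become labeled
    edges, 'edge' substances stay as explicit nodes, and 'ubiquitous' small
    molecules are attached locally to each reaction macro-glyph.  The
    threshold and the list of known small molecules are assumptions of this
    sketch."""
    produced, consumed, involvement = set(), set(), {}
    for reactants, products in reactions:
        for s in set(reactants) | set(products):
            involvement[s] = involvement.get(s, 0) + 1
        produced.update(products)
        consumed.update(reactants)

    roles = {}
    for s, count in involvement.items():
        if s in known_ubiquitous or count > ubiquity_threshold:
            roles[s] = "ubiquitous"
        elif s in produced and s in consumed:
            roles[s] = "inner"
        else:
            roles[s] = "edge"
    return roles
```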
Overall the use of domain-specific macro-glyphs can result in semantically more meaningful graphs with significantly fewer nodes and edges, resulting in smaller graphs that are easier to lay out and visualize.
3.3 Natural Clustering for Gene Expression
At some point all reactions in the cell go back to the DNA that encodes the proteins involved in the reactions, either as catalysts, regulators or simply as reactants. But DNA is not useful directly for the cell, it needs to be processed before it can have any effect on the metabolic reactions. On a very simplified level a gene's DNA, which is stored in the cell nucleus, is first copied into RNA, a process called transcription. This RNA can then leave the nucleus into the cytoplasm, where it is read and turned into a protein, a process called translation. Except for extremely rare cases, every gene goes through these steps: DNA→RNA→Protein. Given the large number of genes possibly involved in a certain metabolic pathway, there will be many of these steps in every single pathway. Unless one of the steps is controlled/regulated by an outside influence they all look exactly the same (see fig. 3a). We can employ two different merging strategies to reduce the number of nodes without loss of information. The first strategy is to merge the transcription/translation steps into one node. This directly removes two nodes and two edges from the graph for each transcription/translation step, however given that all genes follow the same pattern no information is lost. In a second step, similar to merging a set of reactants/products and the reaction together to reduce the complexity of the graph, we can merge many transcription/translation steps together into a single "Trans." macro-glyph as long as they are controlled by the same signals or no signals at all (see fig. 3b). Additional simplifications are possible. In most metabolic pathways, the function of a large number of genes is regulatory but not understood in detail, i.e., it is known that the genes, or respectively their products, will all control a certain reaction, but not exactly how. In the network, this is shown by special nodes that represent an OR or an AND composition of all the gene products with an edge (or link) into them. As a consequence of the macro-glyph clustering described above, in many instances these composition-OR/AND nodes will only have one incoming edge now. For those cases we can merge the two nodes by specializing the "Trans." macro-glyph into a "Trans.-OR", or a "Trans.-AND" macro-glyph depending on the Boolean function connected to it. Fig. 3a shows a conventional representation of transcription/translation reactions where these reactions are represented by links. Fig. 3a is generated using a curator tool from MetNetDB [10]. After applying reaction centric layout to the same ethylene signaling pathway, we get fig. 3b. As can be seen, we have two nodes: the Composition-OR and the clustered transcription/translation node. The final clustering is realized by merging these two nodes shown in fig. 3c. If necessary the user can at any time interactively unfold the clustered node to analyze or manipulate its encapsulated data.
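The grouping step of this natural clustering can be sketched as follows. The data layout (one tuple per DNA→RNA→Protein chain, tagged with its controlling signals) and all names are assumptions made for this illustration; the specialization into "Trans.-OR" or "Trans.-AND" glyphs would follow the same grouping while additionally inspecting the Boolean composition node.

```python
from collections import defaultdict

def cluster_transcription_translation(chains):
    """Group DNA->RNA->Protein chains into 'Trans.' macro-glyphs.

    `chains` is an iterable of (gene, protein, controlling_signals) tuples,
    where controlling_signals is a set of regulator names (empty if the
    chain is unregulated).  Chains controlled by the same signals are merged
    into one macro-glyph, mirroring the natural clustering described above."""
    groups = defaultdict(list)
    for gene, protein, signals in chains:
        groups[frozenset(signals)].append((gene, protein))

    macro_glyphs = []
    for signals, members in groups.items():
        macro_glyphs.append({
            "type": "Trans.",
            "controlled_by": sorted(signals),
            "genes": sorted(g for g, _ in members),
            "proteins": sorted(p for _, p in members),
        })
    return macro_glyphs
```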
(a) Conventional Representation of Transcriptions/Translations
(b) Natural clustering of multiple translation/transcription reactions
(c) Natural clustering with merging Composition-OR
Fig. 3. Stages of Natural Clustering
The main benefit of Reaction Centric Layout and Natural Clustering is that it can reduce the number of nodes and edges in the graph in a way that does not lose any information for the biologist user. On the contrary, combining a multitude of nodes into a single one in a well-structured manner can make the understanding of the data in the graph simpler and therefore more efficient.
4 Results
4.1 System Platform
To test the benefit of our new approach we decided to use Cytoscape as a basis for comparison. Cytoscape is the de-facto standard for biological graph visualization and is used in many different biological applications all around the world [16]. Our new algorithm was implemented as part of the MNV (Metabolic Network Visualization) tool [6]. MNV allows easy addition of the macro-glyphs and other specific rendering functionality. MNV has been written in Python and uses C++ modules for performance-critical operations like graph rendering. The GUI is based on Qt and OpenGL. MNV supports a wide variety of layout algorithms. In this paper, we used the dot and circo algorithms from the GraphViz package.
4.2 "Ethylene Biosynthesis and Methionine Cycle" Pathway Test
We picked the "ethylene biosynthesis and methionine cycle" pathway from MetNetDB as an example, because it is a fairly average representative of metabolic pathways [10]. It contains several transcription/translation reactions as well as a large number of enzymatic reactions. Fig. 4 shows an overall snapshot of the pathway in Cytoscape. The sheer number of nodes in the pathway can make it impossible to lay them out in a way that gives enough room to each node, and to keep the labels readable. In this version the user needs to zoom in a significant amount to be able to read the labels and understand the graph. Fig. 5 shows the same data using Reaction Centric Layout to drive a GraphViz circo layout. The natural clustering reduces a large number of transcription, translation, and Composition-OR reactions into a single "Trans.-OR" node. This natural clustering minimizes the number of connections and creates one logical entity that represents the relation between the DNAs and their respective proteins. We also have achieved a more informative visual representation that more clearly shows the relationship between enzymatic reactions. The methionine cycle is clearly shown with Reaction Centric Layout, much better than in the same graph drawn by Cytoscape. Reaction Centric Layout differentiates between highly connected nodes involved in many other reactions, and nodes directly involved between two enzymatic reactions. The nodes which are involved between two enzymatic reactions are shown as a direct link.
4.3 Discussion
For the "ethylene biosynthesis and methionine cycle" pathway, we could achieve a result that is pretty similar to manually drawn pathway images from biology textbooks. This is made possible by the significant reduction in complexity of the graph. The original graph contains 77 nodes and 49 edges. In comparison the Reaction Centric Layout version only contains 43 nodes (55% of the original). The advantage becomes much more significant when looking at the edges, which
Fig. 4. Cytoscape representation of “ethylene biosynthesis and methionine cycle”
Fig. 5. Reaction Centric Layout, as input to circo layout, representation of “ethylene biosynthesis and methionine cycle”
often have more impact on the quality of the graph layout. Here the original graph has 49 edges, which are reduced to 10 (20% of the original) by the Reaction Centric Layout preprocessing. We also applied the same method to larger examples. Running the Reaction Centric Layout on the “superpathway of histidine, purine and pyrimidine biosynthesis” pathway reduces the nodes from 835 to 616 (73%) and the edges from 920 to 151 (16%). At the high end we used a conglomerate of all “arabidopsis” pathways from MetNetDB, which contain 15066 nodes and 17993 edges. Reaction Centric Layout reduces them to 9416 nodes (62%) and 5694 edges (31%).
5 Conclusion and Future Work
We presented a new biological layout approach called Reaction Centric Layout that can be added to any 2D layout. As the main difference to existing algorithms it represents the reactions in the network as nodes and the products and reactants as edges. This enables us to better merge similarly structured subparts of the graphs, primarily transcriptions/translations, into macro-glyphs resulting in graphs that have noticeably fewer nodes and significantly fewer edges than the input graph. For future work, we are looking into integrating more information into the graph, namely gene activity data. The challenge is to keep the gene representation very compact in order to display a large number of them simultaneously. We are also working on a native 3D layout based on the Reaction Centric Layout in order to represent larger graphs in space, as well as to work with large parts of the full pathway sets of our target organisms. Reaction Centric Layout could also be integrated with the gene regulatory networks in order to get a more compact view of these interesting networks[7].
Acknowledgments The work presented in this paper is supported by the National Science Foundation under contract NSF 0612240.
References 1. A History of Molecular Biology. Harvard University Press, Cambridge (2000) 2. Auber, D.: Exploring InfoVis Publication History with Tulip. In: Mutzel, P., Jünger, M. (eds.) Graph Drawing Softwares, Mathematics and Visualization, pp. 10–126. Springer, Heidelberg (2004) 3. Bourqui, R., Auber, D., Lacroix, V., Jourdan, F.: Metabolic network visualization using constraint planar graph drawing algorithm. In: IV 2006: Proceedings of the conference on Information Visualization, Washington, DC, USA, pp. 489–496. IEEE Computer Society Press, Los Alamitos (2006) 4. Bourqui, R., Cottret, L., Lacroix, V., Auber, D., Mary, P., Sagot, M.-F., Jourdan, F.: Metabolic network visualization eliminating node redundance and preserving metabolic pathways. BMC Systems Biology 1(1), 29 (2007)
5. Dogrusoz, U., Erson, E.Z., Giral, E., Demir, E., Babur, O., Cetintas, A., Colak, R.: PATIKAweb: a Web interface for analyzing biological pathways through advanced querying and visualization. Bioinformatics 22(3), 374–375 (2006) 6. Kaissi, M.E., Dickerson, J., Wuertele, E., Reiners, D.: Mnv: Metabolic network visualization 7. Kaissi, M.E., Dickerson, J., Wuertele, E., Reiners, D.: Visualization of gene regulatory networks. LNCS. Springer, Heidelberg (2009) 8. Ellson, J., Gansner, E., Koutsofios, L., North, S., Woodhull, G.: Graphviz open source graph drawing tools (2002) 9. Enright, A.J., Ouzounis, C.A.: Biolayout–an automatic graph layout algorithm for similarity visualization. Bioinformatics 17(9), 853–854 (2001) 10. Wurtele, E.S., Li, L., Berleant, D., Cook, D., Dickerson, J.A., Ding, J., Hofmann, H., Lawrence, M., Lee, E.K., Li, J., Mentzen, W., Miller, L., Nikolau, B.J., Ransom, N., Wang, Y.: Metnet: Software to build and model the biogenetic lattice of arabidopsis. In: Concepts in Plant Metabolomics, pp. 145–158. Springer, Heidelberg (2007) 11. Fruchtermanand, T.M.J., Reingold, E.M.: Graph drawing by force-directed placement. Softw. Pract. Exper. 21(11), 1129–1164 (1991) 12. Funahashi, A., Morohashi, M., Kitano, H., Tanimura, N.: Celldesigner: a process diagram editor for gene-regulatory and biochemical networks. BIOSILICO 1(5), 159–162 (2003) 13. Tamassia, R., di Battista, G., Eades, P., Tollis, I.G.: Graph Drawing:Algorithms for the Visualization of Graphs. Prentice-Hall, Englewood Cliffs (1999) 14. Guo, H., Ecker, J.R.: The ethylene signaling pathway: new insights. Current Opinion in Plant Biology 7(1), 40–49 (2004) 15. Iragne, F., Nikolski, M., Mathieu, B., Auber, D., Sherman, D.: ProViz: protein interaction visualization and exploration. Bioinformatics 21(2), 272–274 (2005) 16. Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., Ideker, T.: Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Research 13(11), 2498–2504 (2003) 17. Suderman, M., Hallett, M.: Tools for visually exploring biological networks. Bioinformatics 23(20), 2651–2659 (2007) 18. Shinfuku, Y., Ono, N., Hogiria, T., Furusawa, C., Shimizu, H.: Analysis of metabolic network based on conservation of molecular structure. Biosystems 95(3), 175–178 (2008) 19. Yang, Y., Engin, L., Wurtele, E.S., Cruz-Neira, C., Dickerson, J.A.: Integration of metabolic networks and gene expression in virtual reality. Bioinformatics 21(18), 3645–3650 (2005)
Diverging Color Maps for Scientific Visualization Kenneth Moreland Sandia National Laboratories
Abstract. One of the most fundamental features of scientific visualization is the process of mapping scalar values to colors. This process allows us to view scalar fields by coloring surfaces and volumes. Unfortunately, the majority of scientific visualization tools still use a color map that is famous for its ineffectiveness: the rainbow color map. This color map, which naïvely sweeps through the most saturated colors, is well known for its ability to obscure data, introduce artifacts, and confuse users. Although many alternate color maps have been proposed, none have achieved widespread adoption by the visualization community for scientific visualization. This paper explores the use of diverging color maps (sometimes also called ratio, bipolar, or double-ended color maps) for use in scientific visualization, provides a diverging color map that generally performs well in scientific visualization applications, and presents an algorithm that allows users to easily generate their own customized color maps.
1 Introduction
At its core, visualization is the process of providing a visual representation of data. One of the most fundamental and important aspects of this process is the mapping of numbers to colors. This mapping allows us to pseudocolor an image or object based on varying numerical data. Obviously, the choice of color map is important to allow the viewer to easily perform the reverse mapping back to scalar values. By far, the most common color map used in scientific visualization is the rainbow color map, which cycles through all of the most saturated colors. In a recent review on the use of color maps, Borland and Taylor [1] find that the rainbow color map was used as the default in 8 out of the 9 toolkits they examined. Borland and Taylor also find that in IEEE Visualization papers from 2001 to 2005 the rainbow color map is used 51 percent of the time. Despite its popularity, the rainbow color map has been shown to be a poor choice for a color map in nearly all problem domains. This well-studied field of perception shows that the rainbow color map obfuscates, rather than clarifies, the display of data in a variety of ways [1]. The choice of a color map can be a complicated decision that depends on the visualization type and problem domain, but the rainbow color map is a poor choice for almost all of them. One of the major contributors to the dominance of the rainbow color map is the lack of a clear alternative, especially in terms of scientific visualization. There
are many publications that recommend very good choices for color maps [2,3,4,5]. However, each candidate has its features and flaws, and the choice of the “right” one is difficult. The conclusion of all these publications is to pick from a variety of color maps for the best choice for a domain-specific visualization. Although this is reasonable for the designer of a targeted visualization application, a general purpose application, designed for multiple problem domains, would have to push this decision to the end-user with a dizzying array of color map choices. In our experience the user, who seldom has the technical background to make an informed decision, usually chooses a rainbow color map. This paper recommends a good default color map for general purpose scientific visualization. The color map derived here is an all-around good performer: it works well for low and high frequency data, orders the data, is perceptually linear, behaves well for observers with color-deficient vision, and has reasonably low impact on the shading of three-dimensional surfaces.
2 Previous Work
This previous work section is divided into two parts. First is a quick review on previously proposed color maps that lists the pros and cons of each. Second is a quick review on color spaces, which is relied upon in subsequent discussions.
2.1 Color Maps
As stated previously, the rainbow color map is the most dominant in scientific visualization tools. Based on the colors of light at different wavelengths, the rainbow color map's design has nothing to do with how humans perceive color. This results in multiple problems when humans try to do the reverse mapping from colors back to numbers. First, the colors do not follow any natural perceived ordering. Perceptual experiments show that test subjects will order rainbow colors in numerous different ways [5]. Second, perceptual changes in the colors are not uniform. The colors appear to change faster in the cyan and yellow regions than in the blue, green, and red regions. These nonuniform perceptual changes simultaneously introduce artifacts and obfuscate real data [1]. Third, the colors are sensitive to deficiencies in vision. Roughly 5% of the population cannot distinguish between the red and green colors. Viewers with color deficiencies cannot distinguish many colors considered "far apart" in the rainbow color map [6]. A very simple color map that is in many ways more effective than the rainbow is the grayscale color map. This map features all the shades of gray between black and white. The grayscale color map is used heavily in the image processing and medical visualization fields. Although a very simple map to create and use, this map is surprisingly effective as the human visual system is most sensitive to changes in luminance [7,5]. However, a problem with using only luminance is that a human's perception of brightness is subject to the brightness of the surrounding area (an effect called simultaneous contrast [8]). Consequently, when asked to
compare the luminance of two objects separated by distance and background, human subjects err up to 20% [9]. Another problem with grayscale maps and others that rely on large luminance changes is that the luminance shifts interfere with the interpretation of shading on 3D surfaces. This effect is particularly predominant in the dark regions of the color map. A type of color map often suggested for use with 3D surfaces is an isoluminant color map. Opposite to the grayscale map, an isoluminant map maintains a constant luminance and relies on chromatic shifts. An isoluminant color map is theoretically ideal for mapping onto shaded surfaces. However, human perception is less sensitive to changes in saturation or hue than changes in luminance, especially for high frequency data [10]. These color maps comprise those most commonly used in the literature and tools today. Other color maps are proposed by Ware [5] as well as several others. Most are similar in spirit to those here with uniform changes in luminance, saturation, hue, or some combination thereof.
2.2 Color Spaces
All color spaces are based on the tristimulus theory, which states that any perceived color can be uniquely represented by a 3-tuple [11]. This result is a side effect of the fact that there are exactly 3 different types of color receptors in the human eye. Limited space prevents more than a few applicable additive color spaces from being listed here. Any textbook on color will provide more spaces in more detail with conversions between them [11, 12]. The color space most frequently used in computer applications is the RGB color space. This color space is adopted by many graphics packages such as OpenGL and is presented to users by nearly every computer application that provides a color chooser. The three values in the RGB color space refer to the intensity output of each of the three light colors used in a monitor, television, or projector. Although it is often convenient to use RGB to specify colors in terms of the output medium, the display may have nonlinearities that interfere with the blending and interpolation of colors [11]. When computing physical light effects, it is best to use a color space defined by the physical properties of light. XYZ is a widely used color space defined by physical light spectra. There is a nonlinear relationship between light intensity and color perception. When defining a color map, we are more interested in how a color is perceived than how it is formed. In these cases, it is better to use a color space based on how humans perceive color. CIELAB and CIELUV are two common spaces. The choice between the two is fairly arbitrary; this paper uses CIELAB. CIELAB is an approximation of how humans perceive light. The Euclidean distance between two points is the approximate perceived difference between the two colors. This Euclidean distance in CIELAB space is known as ΔE and makes a reasonable metric for comparing color differences [12]. This paper uses the notation ΔE{c1, c2} to denote the ΔE for the pair of colors c1 and c2.
3 Color Map Requirements
Our ultimate goal is to design a color map that works well for general-purpose scientific visualization and a wide variety of tasks and users. As such we have the following requirements. These criteria conform to many of those proposed previously [13, 3, 6].
– The map yields images that are aesthetically pleasing.
– The map has a maximal perceptual resolution.
– Interference with the shading of 3D surfaces is minimal.
– The map is not sensitive to vision deficiencies.
– The order of the colors should be intuitively the same for all people.
– The perceptual interpolation matches the underlying scalars of the map.
The reasoning behind most of these requirements is self explanatory. The requirement that the color map be “pretty,” however, is not one often found in the scientific literature. After all, the attractiveness of the color map, which is difficult to quantify in the first place, has little to do with its effectiveness in conveying information. Nevertheless, aesthetic appeal is important as users will use that as a criterion in selecting visualization products and generating images.
4 Color Map Design
There are many color maps in existence today, but very few of them satisfy all of the requirements listed in Section 3. For inspiration, we look at the field of cartography. People have been making maps for thousands of years, and throughout this history there has been much focus on both the effectiveness of conveying information as well as the aesthetics of the design. Brewer [2] provides excellent advice for designing cartographic color maps and many well-designed examples.1 This paper is most interested in the diverging class of color maps (also known as ratio [5], bipolar [14], or double-ended [4]). Diverging color maps have two major color components. The map transitions from one color component to the other by passing through an unsaturated color (white or yellow). The original design of diverging color maps is to show data with a significant value in the middle of the range. However, our group has also found it useful to use a diverging color map on a wide variety of scalar fields because it divides the scalar values into three logical regions: low, midrange, and high values. These regions provide visual cues that are helpful for understanding data. What diverging color maps lack in general is a natural ordering of colors. To impose a color ordering, we carefully chose two colors that most naturally have “low” and “high” connotations. We achieve this with the concept of “cool” and “warm” colors. Studies show that people identify red and yellow colors as warm and blue and blue-green colors as cool across subjects, contexts, and cultures. Furthermore, 1
Brewer’s color maps are also available on her web site: www.colorbrewer.org
people associate warmth with positive activation and coolness with negative activation [15]. Consequently, mapping cool blues to low values and warm reds to high values is natural [13].
4.1 Perceptual Uniformity
An important characteristic of any color map is that it is perceptually uniform throughout. For a discrete color map, perceptual uniformity means that all pairs of adjacent colors will look equally different from each other. That is, the ΔE for each adjacent pair is (roughly) the same. For a continuous color map, we want the perceptual distance between two colors to be proportional to the distance between the scalars associated with each. If we characterize our color map with function c(x) that takes scalar value x and returns a color vector, the color map is perceptually uniform if

    \frac{\Delta E\{c(x),\, c(x+\Delta x)\}}{\Delta x}    (1)

is constant for all valid x. Strictly speaking, we cannot satisfy Equation 1 for diverging color maps because the map necessarily passes through three points in CIELAB space that are not in a line. However, it is possible to ensure that the rate of change is constant. That is,

    \lim_{\Delta x \to 0} \frac{\Delta E\{c(x),\, c(x+\Delta x)\}}{\Delta x}    (2)

is constant for all valid x. This relaxed property is sufficient for describing a perceptually linear color map so long as we make sure that the curve does not return to any set of colors. We can resolve Equation 2 a bit by applying the ΔE operation and splitting up the c function into its components ($\Delta E\{c_1, c_2\} = \|c_1 - c_2\| = \sqrt{\sum_i (c_{1,i} - c_{2,i})^2}$):

    \lim_{\Delta x \to 0} \frac{\sqrt{\sum_i \left(c_i(x+\Delta x) - c_i(x)\right)^2}}{\Delta x}
      = \lim_{\Delta x \to 0} \sqrt{\sum_i \left(\frac{c_i(x+\Delta x) - c_i(x)}{\Delta x}\right)^2}    (3)

In the final form of Equation 3, we can clearly see that the limit is the definition of a derivative. So replacing the limit with a derivative, we get $\sqrt{\sum_i (c_i'(x))^2}$. With some abuse of notation, let us declare c′(x) as the piecewise derivative of c(x). Using this notation, the constant rate of change requirement reduces to the following.

    \left\| c'(x) \right\|    (4)
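Equations 1 and 2 also suggest a simple numerical check of perceptual uniformity for a sampled color map: compute ΔE between adjacent samples and compare the smallest and largest step. The sketch below assumes the samples are already given as CIELAB triples.

```python
import math

def delta_e(lab1, lab2):
    """Euclidean distance in CIELAB, the Delta-E measure used above."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lab1, lab2)))

def uniformity(samples):
    """Return (min, max) Delta-E between adjacent samples of a color map.

    `samples` is a list of CIELAB triples taken at evenly spaced scalar
    values; for a perceptually uniform map the two numbers should be
    (nearly) equal."""
    steps = [delta_e(samples[i], samples[i + 1]) for i in range(len(samples) - 1)]
    return min(steps), max(steps)
```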
The easiest way to ensure that Equation 4 is constant is to linearly interpolate colors in the CIELAB color space. However, that is not entirely possible to do for diverging color maps. Lines from red to blue will not go through white. A piecewise linear interpolation is mostly effective, but can create an artificial Mach band at white where the luminance sharply transitions from increasing to decreasing as demonstrated in Figure 1.
Fig. 1. Using piecewise linear interpolations in CIELAB color space causes Mach bands in the white part of diverging color maps (left image). The transition can be softened by interpolating in Msh space (right image).
Having this sharp transition is fine, perhaps even desirable, when the white value has special significance, but to use the divergent color map in general situations we require a "leveling off" of the luminance as the color map approaches white. To compensate, the chromaticity must change more dramatically in this part of the color map. A method for designing this type of color map is defined in the next section.
4.2 Msh Color Space
To simplify the design of continuous, diverging color maps, we derive a new color space called Msh. Msh is basically a polar form of the CIELAB color space. M is the magnitude of the vector, s (the saturation) is the angle away from the L∗ axis, and h (the hue) is the angle of the vector's projection in the a∗-b∗ plane. Conversion between the two color spaces is straightforward:

    M = \sqrt{L^{*2} + a^{*2} + b^{*2}}, \quad
    s = \arccos\left(\frac{L^*}{M}\right), \quad
    h = \arctan\left(\frac{b^*}{a^*}\right)    (5)

    L^* = M \cos s, \quad
    a^* = M \sin s \cos h, \quad
    b^* = M \sin s \sin h
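Equation 5 translates directly into code. The following Python sketch is a straightforward transcription; atan2 is used so the hue lands in the correct quadrant, and the conversion between CIELAB and displayable RGB is not shown here.

```python
import math

def lab_to_msh(L, a, b):
    """CIELAB -> Msh following Equation 5 (polar form of CIELAB)."""
    M = math.sqrt(L * L + a * a + b * b)
    s = math.acos(L / M) if M > 0.0 else 0.0
    h = math.atan2(b, a)          # atan2 keeps the hue in the correct quadrant
    return M, s, h

def msh_to_lab(M, s, h):
    """Msh -> CIELAB, the inverse of Equation 5."""
    L = M * math.cos(s)
    a = M * math.sin(s) * math.cos(h)
    b = M * math.sin(s) * math.sin(h)
    return L, a, b
```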
An ideal way to build a diverging color map in Msh space is to start at one color, linearly reduce s to 0 (to get white), flip h to the appropriate value for the last color, and then linearly increase s to the desired value. In fact, we can show that if s changes linearly while M and h are held constant, Equation 4 is constant,
which is our criterion for a uniform color map. We can characterize a c(x) that behaves in this way in CIELAB space as

    c(x) = \left[\, M \cos s(x), \;\; M \sin s(x) \cos h, \;\; M \sin s(x) \sin h \,\right]    (6)

where M and h are constant and s(x) is a linear function of slope s_m. To show that linear saturation changes in Msh are perceptually linear, we plug Equation 6 into Equation 4 and resolve to show that perceptual changes are indeed constant.

    \left\| c'(x) \right\|
      = \left\| \left[\, -M s_m \sin s(x), \;\; M s_m \cos s(x) \cos h, \;\; M s_m \cos s(x) \sin h \,\right] \right\|
      = \sqrt{M^2 s_m^2 \left( \sin^2 s(x) + \cos^2 s(x) \cos^2 h + \cos^2 s(x) \sin^2 h \right)}
      = M s_m    (7)
Clearly Equation 7 resolves to a constant and therefore meets our criterion for a "uniform" color space. There is still a discontinuity when we flip h. However, because this discontinuous change of hue occurs when there is no saturation, it is not noticeable. And unlike the piecewise linear interpolation in CIELAB space, this piecewise linear interpolation in Msh space results in a smooth change in luminance throughout the entire color map. A common problem we run into with interpolating in Msh space is that the interpolated colors often leave the gamut of colors displayable by a video monitor. When trying to display many of these colors, you must "clip" to what can be represented. This clipping can lead to noticeable artifacts in the color map. We have two techniques for picking interpolation points. The first is to uniformly reduce the M of each point. Dropping M will bring all the interpolated colors toward the gamut of displayable colors. Although you will always be able to pull all the colors within the display gamut by reducing M, it usually results in colors that are too dim. Thus, a second technique is to allow M to be smaller for the endpoints than for the middle white color. This breaks the uniformity of the color map because a smaller M will mean that a change in s will have a smaller effect. We can restore the uniformity of the color map again by adding some "spin" to the hue. Even though h is interpolated linearly, the changes have a greater effect on the color when s is larger, which can counterbalance the growing M (although a large change can still cause a noticeable pointing of the luminance). The next section describes how to choose an appropriate hue change.
4.3 Choosing a Hue Spin
Let us consider the transition from a saturated color, cs = (Ms , ss , hs ), at an end of the color map to an unsaturated “white” color, cu = (Mu , 0, hu ), at the middle of the color map. As the color map moves from cs to cu , the M , s, and h coordinates are varied linearly. The slope of these coordinates can be characterized as Mm = Mu − Ms , sm = −ss , and hm = hu − hs . (Note that hu
has no effect on the unsaturated color, but is provided to conveniently define the rate of change.)
Fig. 2. A small linear movement in Msh space. The three axes, L∗, a∗, and b∗, refer to the three dimensions in CIELAB space. Linear movements in Msh space (a polar version of CIELAB) result in nonlinear movements of the CIELAB coordinates.
Figure 2 shows how a small movement in this linear Msh function behaves in CIELAB space. The distance measurements take advantage of the property that if you rotate a vector of radius r by some small angle Δα, then the change in the vector is $\lim_{\Delta\alpha \to 0} r\,\Delta\alpha$. Clearly the ΔE, the magnitude of change in CIELAB space, is

    \sqrt{(M_m \Delta x)^2 + (s_m \Delta x\, M)^2 + (h_m \Delta x\, M \sin s)^2}    (8)

Equation 8 will not be constant unless M_m and h_m are zero, which, as described in the previous section, is unacceptable. However we can get pretty close to constant by choosing h_u so that Equation 8 is equal for c_s and c_u.

    \sqrt{(M_m \Delta x)^2 + (s_m \Delta x\, M_s)^2 + (h_m \Delta x\, M_s \sin s_s)^2}
      = \sqrt{(M_m \Delta x)^2 + (s_m \Delta x\, M_u)^2}    (9)

Note that the right side of Equation 9 is missing a term because it evaluates to 0 for the unsaturated color. We can safely get rid of the square roots because there is a sum of square real numbers inside them both.

    (M_m \Delta x)^2 + (s_m \Delta x\, M_s)^2 + (h_m \Delta x\, M_s \sin s_s)^2
      = (M_m \Delta x)^2 + (s_m \Delta x\, M_u)^2
    h_m^2 M_s^2 \sin^2 s_s = s_m^2 \left(M_u^2 - M_s^2\right)
    h_m = \pm \frac{s_m \sqrt{M_u^2 - M_s^2}}{M_s \sin s_s}    (10)

Remember that s_m = −s_s. We can use Equation 10 to determine a good hue to use for the white point (from the given side).

    h_u = h_s \pm \frac{s_s \sqrt{M_u^2 - M_s^2}}{M_s \sin s_s}    (11)

Note that Equation 11 will most certainly yield a different value for each of the saturated colors used in the diverging color map. The direction in which the hue
is “spun” is unimportant with regard to perception. The examples here adjust the hue to be away from 0 (except in the purple hues) because it provides slightly more aesthetically pleasing results.
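Equation 11 can be implemented directly. In the sketch below the hue is spun away from zero as suggested above; the threshold that treats strongly negative (purple-ish) hues differently is an assumption of this sketch, although it does reproduce the white-point hues listed in Table 1 in the next section.

```python
import math

def adjust_hue(m_sat, s_sat, h_sat, m_unsat):
    """Hue for the unsaturated middle point as seen from one saturated end
    point (Equation 11)."""
    if m_sat >= m_unsat:
        return h_sat            # Equation 11 has no real solution; keep the hue
    spin = s_sat * math.sqrt(m_unsat ** 2 - m_sat ** 2) / (m_sat * math.sin(s_sat))
    if h_sat > -math.pi / 3.0:
        return h_sat + spin     # spin away from zero ...
    return h_sat - spin         # ... except for purple-ish hues

# Reproduces the white-point hues of Table 1 (next section):
# adjust_hue(80, 1.08, 0.5, 88)  -> ~1.061   (seen from the red end)
# adjust_hue(80, 1.08, -1.1, 88) -> ~-1.661  (seen from the blue end)
```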
5 Results
Applying the design described in Section 4, we can build the cool to warm color map shown in Plate I. The control points, to be interpolated in Msh space, are given in Table 1.

Table 1. Cool to warm color map control points

Color   M    s     h
Red     80   1.08  0.5
White   88   0     1.061 / -1.661
Blue    80   1.08  -1.1
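Interpolating these control points in Msh space can be sketched as follows. This is a simplified illustration of the procedure from Section 4, not the implementation added to VTK; the two white-point hues are the values from Table 1 (obtained via Equation 11), and the conversion from CIELAB to displayable RGB is omitted.

```python
import math

# Table 1 control points (M, s, h); the white hue depends on which half of
# the map is being interpolated (1.061 toward red, -1.661 toward blue).
RED = (80.0, 1.08, 0.5)
BLUE = (80.0, 1.08, -1.1)
WHITE_M = 88.0
WHITE_H = {"red": 1.061, "blue": -1.661}

def msh_to_lab(M, s, h):
    return (M * math.cos(s),
            M * math.sin(s) * math.cos(h),
            M * math.sin(s) * math.sin(h))

def cool_warm(x):
    """CIELAB color for scalar x in [0, 1]; 0 maps to blue, 1 to red.

    Each half of the map interpolates M, s and h linearly in Msh space
    between a saturated end point and the white point, as described in
    Section 4.  At the white point s is zero, so the hue flip is invisible."""
    if x < 0.5:                       # blue -> white
        t = x / 0.5
        m0, s0, h0 = BLUE
        m1, s1, h1 = WHITE_M, 0.0, WHITE_H["blue"]
    else:                             # white -> red
        t = (x - 0.5) / 0.5
        m0, s0, h0 = WHITE_M, 0.0, WHITE_H["red"]
        m1, s1, h1 = RED
    M = (1 - t) * m0 + t * m1
    s = (1 - t) * s0 + t * s1
    h = (1 - t) * h0 + t * h1
    return msh_to_lab(M, s, h)
```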
This diverging color map works admirably for all of our requirements outlined in Section 3. The colors are aesthetically pleasing, the order of the colors is natural, the rate of change is perceptually linear, and the colors are still easily distinguished by those with dichromatic vision. The map also has a good perceptual range and minimally interferes with shading. Plate II compares the cool-warm color map to some common alternatives as well as some recommended by Rheingans [4] and Ware [5]. The cool-warm color map works well in all the cases demonstrated here. The rainbow color map exhibits problems with irregular perception and sensitivity to color deficiencies. The grayscale and heated-body color maps work poorly in conjunction with 3D shaded surfaces. The isoluminant color map has a low dynamic range and performs particularly poorly with high frequency data. The common choice of green-red isoluminant color maps is also useless to most people with color-deficient vision. The blue-yellow map works reasonably well in all these cases, but has a lower resolution than the cool-warm map, which yields poorer results with low contrast. In addition, despite having a relatively large perceptual response, the color map still allows for a significant amount of annotation or visual components to be added, as shown in Plate III. Using the techniques described in Section 4, we can also design continuous diverging color maps with different colors. Such color maps may be useful in domain-specific situations when colors have specific meaning. Some examples are given in Figure 3. An implementation of using the Msh color space to create diverging maps has been added to the vtkColorTransferFunction class in the Visualization Toolkit (VTK), a free, open-source scientific visualization library.2 Any developers or
www.vtk.org
Plate I. A continuous diverging color map well suited to scientific visualization
Plate II. Comparison of color map effectiveness. The color maps are, from left to right, cool-warm, rainbow, grayscale, heated body, isoluminant, and blue-yellow. The demonstrations are, from top to bottom, a spatial contrast sensitivity function, a low-frequency sensitivity function, high-frequency noise, an approximation of the color map viewed by someone with deuteranope color-deficient vision (computed with Vischeck), and 3D shading.
Plate III. Examples of using the color map in conjunction with multiple other forms of annotation
users of scientific visualization software are encouraged to use these color map building tools for their own needs. This diverging color map interpolation has also been added to ParaView, a free, open-source end-user scientific visualization application,3 and was first featured in the 3.4 release in October 2008. Although ParaView does let users change the color map and there is no way to track who does so, in our experience few users actually do this. In the nearly 3000 messages on the ParaView users’ mailing list from October 2008 to July 2009, there was no mention of the change of color map from rainbow to cool-warm diverging. Users seem to have accepted the change with little notice despite most users’ affinity for rainbow color maps.
Fig. 3. Further examples of color maps defined in Msh space
6 Discussion
This paper provides a color map that is a good all-around performer for scientific visualization. The map is an effective way to communicate data through colors. Because its endpoints match those of the rainbow color map most often currently used, it can be used as a drop-in replacement. Diverging color maps have not traditionally been considered for most scientific computing due to their design of a “central” point, which was originally intended to have some significance. However, with the addition of the Msh color space, the central point becomes a smooth neutral color between two other colors. The middle point serves as much to highlight the two extremes as it does to highlight itself. In effect, the divergent color map allows us to quickly identify whether values are near extrema and which extrema they are near. This paper also provides an algorithm to generate new continuous diverging color maps. This interaction is useful for applying colors with domain specific meaning or for modifying the scaling of the colors. Although we have not been able to do user studies, the design of this color map is based on well established theories on color perception. This map is a clear improvement over what is commonly used today, and I hope that many will follow in adopting it.
Acknowledgements Thanks to Russell M. Taylor II for his help on color space and color map design. Thanks to Patricia Crossno, Brian Wylie, Timothy Shead, and the ParaView development team for their critical comments and general willingness to be guinea pigs. 3
www.paraview.org
This work was done at Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
References 1. Borland, D., Taylor II, R.M.: Rainbow color map (still) considered harmful. IEEE Computer Graphics and Applications 27, 14–17 (2007) 2. Brewer, C.A.: Designing better MAPS: A Guide for GIS Users. ESRI Press (2005) ISBN 1-58948-089-9 3. Levkowitz, H., Herman, G.T.: Color scales for image data. IEEE Computer Graphics and Applications 12, 72–80 (1992) 4. Rheingans, P.: Task-based color scale design. In: Proceedings of Applied Image and Pattern Recognition 1999, pp. 35–43 (1999) 5. Ware, C.: Information Visualization: Perception for Design, 2nd edn. Morgan Kaufmann, San Francisco (2004) 6. Light, A., Bartlein, P.: The end of the rainbow? Color schemes for improved data graphics. EOS, Transactions, American Geophysical Union 85, 385–391 (2004) 7. Mullen, K.T.: The contrast sensitivity of human colour vision to red–green and blue–yellow chromatic gratings. The Journal of Physiology 359, 381–400 (1985) 8. Stone, M.C.: Representing colors as three numbers. IEEE Computer Graphics and Applications 25, 78–85 (2005) 9. Ware, C.: Color sequences for univariate maps: Theory, experiments, and principles. IEEE Computer Graphics and Applications 8, 41–49 (1988) 10. Rogowitz, B.E., Treinish, L.A., Bryson, S.: How not to lie with visualization. Computers in Physics 10, 268–273 (1996) 11. Stone, M.C.: A Field Guide to Digital Color. A K Peters (2003) 1-56881-161-6 12. Wyszecki, G., Stiles, W.: Color Science: Concepts and Methods, Quantitative Data and Formulae. John Wiley & Sons, Inc., Chichester (1982) 13. Fortner, B., Meyer, T.E.: Number by Colors: a Guide to Using Color to Understand Technical Data. Springer, Heidelberg (1997) 14. Spence, I., Efendov, A.: Target detection in scientific visualization. Journal of Experimental Psychology: Applied 7, 13–26 (2001) 15. Hardin, C., Maffi, L. (eds.): Color categories in thought and language. Cambridge University Press, Cambridge (1997)
GPU-Based Ray Casting of Multiple Multi-resolution Volume Datasets Christopher Lux and Bernd Fröhlich Bauhaus-Universität Weimar Abstract. We developed a GPU-based volume ray casting system for rendering multiple arbitrarily overlapping multi-resolution volume data sets. Our efficient volume virtualization scheme is based on shared resource management, which can simultaneously deal with a large number of multi-gigabyte volumes. BSP volume decomposition of the bounding boxes of the cube-shaped volumes is used to identify the overlapping and non-overlapping volume regions. The resulting volume fragments are extracted from the BSP tree in front-to-back order for rendering. The BSP tree needs to be updated only if individual volumes are moved, which is a significant advantage over costly depth peeling procedures or approaches that use sorting on the octree brick level.
1 Introduction
The oil and gas industry is continuously improving the seismic coverage of subsurface regions in existing and newly developed oil fields. Individual seismic surveys are large volumetric datasets, which have precise coordinates in the Universal Transverse Mercator (UTM) coordinate system. Often the many seismic surveys acquired in larger areas are not merged into a single dataset. They may have different resolutions, different orientations and can be partially or even fully overlapping due to reacquisition during oil production. Dealing with individual multi-gigabyte datasets requires multi-resolution techniques [1,2,3], but the problem of handling many such datasets has not been fully addressed. We developed an approach for the resource management of multiple multi-resolution volume representations, which is the base infrastructure for our efficient GPU-based volume ray casting system. Through the use of shared data resources in system and graphics memory we are able to support a virtually unlimited number of simultaneously visualized volumetric data sets, where each dataset may exceed the size of the graphics memory or even the main memory. We also demonstrate how efficient volume virtualization allows for multi-resolution volumes to be treated exactly the same way as regular volumes. The overlapping and non-overlapping volume regions are identified and sorted in front-to-back order using a BSP tree. The main advantage of our approach compared to very recent work by Lindholm et al. [4] is that only bounding boxes of the cube-shaped volumes are dealt with in the BSP tree instead of the view-dependent brick partitions of the involved volumes. As a result our approach requires recomputations of the BSP tree only if the spatial relationship of the volumes changes.
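The box-level BSP partitioning and front-to-back fragment extraction can be sketched as follows. For brevity this Python sketch assumes axis-aligned bounding boxes and axis-aligned split planes, whereas the actual system handles arbitrarily oriented volume boxes; all names and the example data are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[Tuple[float, float, float], Tuple[float, float, float]]   # (lo, hi) corners

@dataclass
class BSPNode:
    axis: Optional[int] = None           # split axis; None marks a leaf
    coord: float = 0.0                   # split plane coordinate
    low: Optional["BSPNode"] = None      # child on the lower-coordinate side
    high: Optional["BSPNode"] = None     # child on the higher-coordinate side
    bounds: Optional[Box] = None         # leaf: cell bounds
    volumes: Tuple[int, ...] = ()        # leaf: indices of volumes covering the cell

def scene_bounds(boxes: List[Box]) -> Box:
    lo = tuple(min(b[0][i] for b in boxes) for i in range(3))
    hi = tuple(max(b[1][i] for b in boxes) for i in range(3))
    return lo, hi

def build_bsp(boxes: List[Box], cell: Box) -> BSPNode:
    """Split `cell` at every volume face plane that strictly crosses it."""
    planes = [(axis, box[side][axis])
              for box in boxes for side in (0, 1) for axis in range(3)]
    lo, hi = cell
    for axis, coord in planes:
        if lo[axis] < coord < hi[axis]:
            lo2 = tuple(coord if i == axis else lo[i] for i in range(3))
            hi2 = tuple(coord if i == axis else hi[i] for i in range(3))
            return BSPNode(axis=axis, coord=coord,
                           low=build_bsp(boxes, (lo, hi2)),
                           high=build_bsp(boxes, (lo2, hi)))
    # No plane crosses the cell: it lies entirely inside or outside every box.
    center = tuple(0.5 * (lo[i] + hi[i]) for i in range(3))
    covering = tuple(v for v, (blo, bhi) in enumerate(boxes)
                     if all(blo[i] <= center[i] <= bhi[i] for i in range(3)))
    return BSPNode(bounds=cell, volumes=covering)

def front_to_back(node: BSPNode, eye):
    """Yield (bounds, volume ids) of non-empty leaf cells sorted front-to-back."""
    if node.axis is None:
        if node.volumes:
            yield node.bounds, node.volumes
        return
    near, far = ((node.low, node.high) if eye[node.axis] < node.coord
                 else (node.high, node.low))
    yield from front_to_back(near, eye)
    yield from front_to_back(far, eye)

# Two partially overlapping, axis-aligned volumes viewed from the -x direction.
boxes = [((0, 0, 0), (2, 2, 2)), ((1, 0, 0), (3, 2, 2))]
tree = build_bsp(boxes, scene_bounds(boxes))
for bounds, volumes in front_to_back(tree, eye=(-5.0, 1.0, 1.0)):
    print(bounds, volumes)   # single-volume, overlap, single-volume fragments
```

Because only the box bounds enter the tree, the partition needs to be rebuilt only when a volume is moved, not when the viewer moves; the viewer position merely steers the traversal order.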
2 Related Work
Visualizing a large volume data set requires the use of level-of-detail and multi-resolution techniques to balance between rendering speed and memory requirements. Multi-resolution rendering techniques are typically based on hierarchical data structures to represent the volume data set at various resolutions. LaMar et al. [1] and Boada et al. [2] use an octree data structure to generate a multi-resolution volume hierarchy. Plate et al. [3] focused on out-of-core resource management in multi-resolution rendering systems. Until recently all multi-resolution volume rendering systems achieved the visualization of the multi-resolution volume hierarchy by rendering each individual sub-volume block in a single volume rendering pass and used frame buffer composition to generate the final image. This approach has limitations with respect to algorithmic flexibility and rendering quality, e.g. the implementation of advanced volume ray casting techniques such as early ray termination and empty space skipping is cumbersome and inefficient compared to their implementation in a single pass algorithm. The virtualization of multi-resolution data representations enables the implementation of single pass rendering algorithms, which can be implemented such that they are mostly unaware of the underlying multi-resolution representation of the data set. Kraus and Ertl [5] describe how to use a texture atlas to store the individual volume sub-blocks in a single texture resource. They use an index texture for the translation of the spatial data sampling coordinates to the texture atlas cell containing the corresponding data. Based on this approach single pass multi-resolution volume ray casting systems were introduced by Gobbetti et al. [6] and Crassin et al. [7]. Both are based on a classic octree representation of the volume data set and store the octree cut in a 3D-texture atlas. Instead of an index texture to directly address the texture atlas they use a compact encoding of the octree similar to [8]. The leaf nodes of the octree cut hold the index data for accessing the sub-blocks from the texture atlas. Individual rays are traversed through the octree hierarchy using a similar approach to kd-restart [9], which is employed for recursive tree-traversal in real-time ray tracing algorithms on the GPU. Our multi-resolution volume virtualization approach is similar to the one used by Gobbetti et al. but we employ an index texture for direct access to the texture atlas cells. This way we trade increased memory requirements for reduced octree traversal computations. Jacq and Roux [10] introduced techniques for rendering multiple spatially aligned volume data sets, which can be considered a single multi-attribute volume. Leu and Chen [11] made use of a two-level hierarchy for modeling and rendering scenes consisting of multiple non-intersecting volumes. Nadeau [12] supported scenes composed of multiple intersecting volumes. The latter approach, however, required costly volume re-sampling if the transformation of individual volumes changes. Grimm et al. [13] presented a CPU-based volume ray casting approach for rendering multiple arbitrarily intersecting volume data sets. They identify multi-volume and single-volume regions by segmenting the view rays at volume boundaries. Plate et al. [14] demonstrated a GPU-based multi-volume rendering system capable of handling multiple multi-resolution data sets.
They identify overlapping volume regions by intersecting the bounding geometries of the individual volumes and they also need to consider the individual sub-blocks of the multi-resolution octree hierarchy. They still rely on a classic slice-based volume rendering method, and thus the geometry processing overhead becomes quickly the limiting factor when moving either individual volumes or the viewer position. Roessler et al. [15] demonstrated the use of ray casting for multi-volume visualization based on similar intersection computations. Both approaches rely on costly depth sorting operations of the intersecting volume regions using a GPU-based depth peeling technique. Very recently Lindholm et al. [4] demonstrated a GPU-based ray casting system for visualizing multiple intersecting volume data sets based on the decomposition of the overlapping volumes using a BSP-tree [16]. This allows for efficient depth sorting of the resulting volume fragments on the CPU. They describe a multi-pass approach for rendering the individual volume fragments using two intermediate buffers. While they support the visualization of multi-resolution volume data sets, their approach is based on the insertion of the volume sub-blocks in the BSP-tree resulting in a very large amount of volume fragments and rendering passes. Even though our approach is also using a BSP-tree for efficient volume-volume intersection and fragment sorting, we do not need to insert volume sub-blocks into the BSP-tree and our efficient volume virtualization technique can deal with a large number of volumes.
3 Rendering System
In this section we will describe the most important parts of our multiple multi-resolution volume ray casting system. We first give a brief overview over all system components and their relationships followed by more detailed descriptions of the resource management, our virtualization approach for multiple multi-resolution volumes and the rendering method.
3.1 System Overview
Our main goal with this rendering system is to efficiently visualize multiple arbitrarily overlapping multi-gigabyte volume data sets. Even a single data set is potentially larger than the available graphics memory and might even exceed the size of the system memory. For this reason we also need to support out-of-core handling of multiple multi-resolution data sets. The rendering system consists of three main parts: the brick cache, the atlas texture and the renderer (cf. figure 1). We use an octree as the underlying data structure for the volume representation similar to [1,2,3]. The original volume data sets are decomposed into small fixed-size bricks. These bricks represent the leaf nodes of the octree containing the highest resolution of the volume. Coarser resolutions are represented through inner nodes, which are generated bottom-up by down-sampling eight neighboring nodes from the next finer level. Inner nodes have the same size as their
Fig. 1. The multi-volume rendering system. The renderer maintains the octree representations of the individual volumes. The brick cache asynchronously fetches requested data from the external brick pool to system memory. The atlas texture holds the current working set of bricks for all volumes. For each volume an individual index texture is generated to provide the address of the actual brick data in the atlas texture during rendering.
child nodes. Consequently, all nodes in the octree are represented by bricks of the same fixed size, which acts as the basic paging unit throughout our system. Each brick shares at least one voxel layer with neighboring bricks to avoid rendering errors at brick boundaries due to texture interpolation, as suggested by [17]. The pre-processing of all volume data sets is done completely on the CPU, and the resulting octree representations are stored in an out-of-core brick data pool located on a hard drive. During rendering, a working set of bricks of the individual volumes is defined by cuts through their octree representations as described in [2]. The renderer maintains these octree cuts and updates them incrementally at runtime using a greedy-style algorithm. This method is guided by view-dependent criteria and a fixed texture memory budget. For updating the octree cuts, only data currently resident in the brick data cache is used to represent the multi-resolution volumes. Unavailable brick data is requested from the out-of-core brick pool. This way the rendering process is not stalled by slow data transfers from the external brick data pool. The brick cache asynchronously fetches requested brick data from the brick pool on the hard disk, making the data available for the update method as soon as it is loaded. After the update method has finished refining the octree cuts, the current working set of bricks in graphics memory is updated to mirror the state of the octree representations. We use a single large, pre-allocated atlas texture of a fixed size to store the working sets of all volumes. This enables us to balance the texture resource distribution over all volumes in the scene (cf. section 3.2). For each volume an individual index texture is maintained in graphics memory. These index textures encode the individual octree subdivisions of the different volumes and allow direct access to the volume data stored in the atlas texture.
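The structures described above can be summarized in a compact form. The following C++ sketch is not taken from the authors' implementation; the type and member names (BrickID, OctreeNode, OctreeCut) and the brick edge length are hypothetical and only illustrate one possible layout of the octree hierarchy, the fixed-size brick paging unit, and the mapping from cut leaves to atlas cells.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the per-volume multi-resolution structures.
// The brick edge length is an example value; each brick shares a one-voxel
// layer with its neighbors to avoid interpolation artifacts at brick borders.
constexpr int kBrickEdge = 64;

struct BrickID {            // identifies a brick in the out-of-core brick pool
    std::uint32_t volume;   // which volume the brick belongs to
    std::uint32_t level;    // octree level (0 = coarsest root brick)
    std::uint32_t x, y, z;  // brick coordinates within that level
};

struct OctreeNode {
    BrickID brick;          // fixed-size brick backing this node
    std::array<int, 8> children{{-1, -1, -1, -1, -1, -1, -1, -1}};  // node-pool indices, -1 = none
    bool inCut = false;     // true if this node is a leaf of the current cut
    int atlasCell = -1;     // cell in the shared atlas texture, -1 if not resident
};

// The renderer keeps one octree, and one cut through it, per volume; the cut
// leaves form the working set that must be resident in the atlas texture.
struct OctreeCut {
    std::vector<OctreeNode> nodes;  // flattened node pool, nodes[0] is the root
    std::vector<int> cutLeaves;     // indices of the leaves forming the current cut
};
```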
The volume ray casting approach used in our system makes use of the individual index texture of each volume to locate the corresponding brick volume data in the shared atlas texture during ray traversal. Based on a BSP-tree, we differentiate overlapping from non-overlapping volume regions. This approach provides us with a straightforward depth ordering of the resulting convex volume fragments. These volume fragments are traversed by the rays in front-to-back order to generate the final image.
3.2 Resource Management
Our out-of-core volume rendering system is able to handle multiple extremely large volumes. The biggest memory resources are the brick data cache and the atlas texture. We chose both to be global, shared resources for all volumes that need to be handled at a time. In contrast to individual non-shared resources attached to every volume, this allows us to balance the memory requirements of all volumes against each other. If, for example, volumes are moved out of the viewing frustum or are less prominent in the current scene, the unused resources can easily be shifted to other volumes without costly reallocation operations in system and graphics memory (cf. figure 2). The brick cache acts as a large second-level cache in system memory, which holds the most recently used brick data. We employ an LRU (least recently used) strategy when replacing data cells in this cache. The atlas texture then acts as the first-level cache for the ray casting algorithm. The atlas texture contains the leaf nodes of the current octree cuts of all volumes. We also employ an LRU strategy for managing unused brick cells of the atlas texture to allow for caching if the atlas is not fully occupied by bricks involved in rendering. However, this is rarely the case, since the handled volumes are orders of magnitude larger than the available texture memory resources. The greedy-style algorithm incrementally updates the octree cut representations of the individual volumes on a frame-to-frame basis. This algorithm is
Fig. 2. This figure shows the texture resource distribution between two seismic volume data sets during a zoom-in operation. As the left volume is moved closer to the viewer, a larger share of the fixed texture resources is assigned to it, leaving fewer resources for the right volume. The size of the bricks in the generated octree cuts, shown as wire frame overlays, indicates the local volume resolution. Blue boxes represent the highest volume resolution while green boxes show lower resolutions.
Fig. 3. The main tasks performed by the rendering system during one rendering frame. After updating the octree cuts, the requested brick data is transferred from system to graphics memory in parallel to the rendering process. After uploading and rendering have finished, the transferred data is swapped to the actual atlas texture. The HDD-to-RAM fetching process is completely decoupled from rendering.
constrained in two ways. First, it tries to stepwise approximate the optimal octree cuts for all volumes under the limit of the available atlas texture memory budget. Second, due to the limited bandwidth from system to graphics memory, only a certain number of bricks are inserted into or removed from the octree cuts in each update step. As shown in figure 2, the update method distributes the available texture resources amongst all volumes. The method terminates if the octree cuts are considered optimal under the current memory budget constraints according to view-dependent criteria or if no more required brick data is resident in the brick cache. Unavailable brick data is fetched asynchronously from the hard disk for future use. Once the data becomes available, the update method is able to insert the requested nodes. In addition to the explicitly required data, the update method also pre-fetches data into the brick data cache. Stalling of the rendering process due to atlas texture updates needs to be avoided to guarantee optimal performance of the rendering system. Updating a texture that is currently in use by the rendering process would implicitly stall the rendering process until the current rendering commands are finished. We employ an asynchronous texture update strategy using a dedicated brick upload buffer, which can be written asynchronously during rendering. After the current rendering frame is finished, the content of this buffer is swapped to the actual atlas texture. This leads to a parallel rendering system layout essentially consisting of three parallel tasks, as shown in figure 3. Due to the asynchronous update of the atlas texture, the rendering process always uses the data prepared and uploaded during the previous rendering frame.
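The interplay between the LRU-managed atlas cells and the budget-constrained cut update can be sketched as follows. This is not the authors' code: the AtlasManager interface and its method names are assumptions made for illustration, and the per-frame limit on inserted or removed bricks is left to the caller.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>

// Hypothetical manager for the cells of the shared atlas texture. Cells that
// are not referenced by any current octree cut form an LRU list; the least
// recently released cell is reused first, so recently evicted bricks stay in
// texture memory as long as possible. For brevity, a released cell forgets
// which brick it held; the real system keeps that association so the brick can
// be reused without a new upload if it re-enters an octree cut.
class AtlasManager {
public:
    explicit AtlasManager(std::size_t cellCount) {
        for (std::size_t i = 0; i < cellCount; ++i) freeCells_.push_back(i);
    }

    // Acquire a cell for a brick that entered an octree cut; returns false if
    // the fixed texture budget is exhausted and the cut must stop growing.
    bool acquire(std::size_t brickKey, std::size_t& cellOut) {
        if (freeCells_.empty()) return false;
        cellOut = freeCells_.front();      // least recently released cell
        freeCells_.pop_front();
        residentCell_[brickKey] = cellOut;
        return true;
    }

    // Release the cell of a brick that left an octree cut; it becomes the most
    // recently released cell and is therefore reused last.
    void release(std::size_t brickKey) {
        auto it = residentCell_.find(brickKey);
        if (it == residentCell_.end()) return;
        freeCells_.push_back(it->second);
        residentCell_.erase(it);
    }

private:
    std::list<std::size_t> freeCells_;
    std::unordered_map<std::size_t, std::size_t> residentCell_;
};
```

A frame-to-frame cut update would call acquire and release for at most a fixed number of bricks per frame, reflecting the limited CPU-to-GPU upload bandwidth mentioned above.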
3.3 Volume Virtualization
The key to combining multi-resolution volume representations with single-pass ray casting systems is an efficient virtualization of the multi-resolution texture hierarchy. Texture virtualization refers to the abstraction of a logical texture resource from the underlying data structures, effectively hiding the physical characteristics of the chosen memory layout. We chose a single atlas texture of a fixed size to represent the multi-resolution octree hierarchies of the volumes by storing their working sets of bricks. Due to the fact that all bricks representing any node
Fig. 4. Volume ray casting using virtualized multi-resolution textures. (a) Atlas texture containing brick data of two volumes. (b) Index textures of the two volumes in the scene. (c) Final rendering based on volume ray casting.
in the octree hierarchy are of the same size, exchanging brick data in the atlas texture is a straightforward task that does not introduce complex memory management problems like memory fragmentation. While similar approaches using an atlas texture for GPU-based volume ray casting systems have been proposed in [6,7], our method uses index textures to encode the octree subdivisions for direct access to the volume data cells in the atlas texture, as suggested by [5]. Gobbetti et al. as well as Crassin et al. [6,7] use compact octree data structures on the GPU, introducing logarithmic octree traversal costs. While they describe how to efficiently traverse such a data structure, their approach lacks the flexibility for arbitrary texture lookups required for, e.g., gradient calculations. In contrast, using an index texture for direct access to the atlas data reduces the lookup computations to a constant calculation overhead. A texture lookup into a virtualized volume texture requires the following two steps: first, the index texture is sampled at the requested location, which yields an index vector containing the location of the corresponding brick data in the atlas texture as well as scaling information; second, using this information, the requested sampling position is transformed into the atlas texture coordinate system and the respective sample is returned. Using this index texture approach, we accept a moderately larger memory footprint in exchange for fast access to the required atlas texture indexing information. The index textures are several orders of magnitude smaller than the actual volumes because they describe the octree representation at the brick level. Integrating our multi-resolution volume virtualization approach with single-pass volume ray casting is realized as a straightforward extension of the ray traversal. The basic ray casting algorithm remains completely unaware of the underlying octree hierarchy. Only the data lookup routine has to be extended, which hides all of the complexity from the rest of the ray casting method. Because of the relatively small size of the index textures, we achieve good texture cache performance when accessing the index textures in a regular pattern, as is the case with volume ray casting and gradient calculations. Figure 4 shows an example of a scene consisting of two seismic volumes, which is rendered with our virtualization approach using volume ray casting on the GPU.
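The two-step lookup can be expressed compactly. The sketch below is a CPU-side illustration of the logic that the GLSL data lookup routine performs; the index texture encoding (a per-brick offset and scale chosen so that a linear transform maps volume coordinates into the brick's atlas cell) is an assumption about one possible layout, not the authors' exact shader code.

```cpp
#include <array>

// One texel of the per-volume index texture. It is assumed to store an offset
// and a scale chosen so that a linear transform maps volume-space coordinates
// directly into the corresponding brick's cell of the atlas texture.
struct IndexEntry {
    std::array<float, 3> offset;
    std::array<float, 3> scale;
};

// Stubs standing in for GPU texture fetches; a real implementation samples the
// small index texture and the large atlas texture in a GLSL shader.
IndexEntry sampleIndexTexture(const std::array<float, 3>&) {
    IndexEntry e;
    e.offset = {0.0f, 0.0f, 0.0f};
    e.scale = {1.0f, 1.0f, 1.0f};
    return e;
}
float sampleAtlasTexture(const std::array<float, 3>& c) { return c[0]; }

// Virtualized data lookup: the ray caster calls this instead of sampling a
// monolithic 3D texture and stays unaware of the multi-resolution layout.
float sampleVirtualVolume(const std::array<float, 3>& volumeCoord) {
    // Step 1: fetch brick placement and scaling from the index texture.
    const IndexEntry e = sampleIndexTexture(volumeCoord);
    // Step 2: transform the sampling position into the atlas coordinate system
    // of that brick and fetch the actual data sample.
    std::array<float, 3> atlasCoord;
    for (int i = 0; i < 3; ++i)
        atlasCoord[i] = e.offset[i] + volumeCoord[i] * e.scale[i];
    return sampleAtlasTexture(atlasCoord);
}
```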
Fig. 5. Decomposition of two multi-resolution volumes into homogeneous volume fragments using an auto-partitioning solid-leaf BSP-tree. Only the bounding geometries of the volumes are used for the decomposition.
3.4 Ray Casting Multiple Multi-resolution Volumes
For visualizing multiple arbitrarily overlapping volume data sets it is important to differentiate between mono-volume and multi-volume segments, as emphasized by Grimm et al. [13]. They identified different segments along the ray paths for a CPU-based ray casting implementation. In contrast, we segment the overlapping volumes and use a GPU-based ray casting approach. We use a BSP-tree-based approach similar to Lindholm et al. [4] to identify overlapping and non-overlapping volume fragments. While they also support multi-resolution volume data sets, their approach treats each brick in the multi-resolution hierarchy as a separate volume. Thus, the BSP process generates an immense number of volume fragments, which need to be rendered in sequential rendering passes. As a consequence, changing the view causes updates of the multi-resolution hierarchy, which forces them to recreate the complex BSP-tree. In contrast, our efficient volume virtualization enables us to treat multi-resolution volumes in exactly the same way as regular volumes by processing only their bounding geometries, which needs to happen only during setup or if the actual volumes are moved. Our BSP implementation is based on an auto-partitioning solid-leaf BSP-tree [16]. The BSP-tree is generated from the bounding geometries of the individual volumes, which also define the split planes. Figure 5 shows an exemplary volume decomposition for two volumes, creating four convex polyhedra, of which only one fragment is overlapped by both volumes. The BSP-tree allows efficient depth sorting of the resulting volume fragments on the CPU, which is required for correct traversal during the actual volume ray casting process. The volume ray casting method processes all visible volume fragments in front-to-back order. Each fragment is processed in a single ray casting pass, independent of the number of contained bricks. We use shader instantiation to generate specialized ray casting programs for the different numbers of volumes overlapping a particular fragment. Two intermediate buffers are used during this multi-pass rendering process: an integration buffer stores the intermediate volume rendering integral for all rays. Another buffer stores the ray exit positions
of the currently processed volume fragment. We need to generate the exit positions explicitly using rasterization because the irregular geometry of the volume fragments generated by the BSP process does not allow simple analytical ray exit computations on the GPU. During the exit point generation, the accumulated opacity from the integration buffer is copied to this buffer to circumvent potential read-write conflicts during ray casting when writing to the integration buffer. The accumulated opacity is used for early ray termination of individual rays. A volume rendering frame of our system consists of the following steps: First, the intermediate buffers are cleared. Then the ray exit positions for each volume fragment are generated by rendering the back faces of the fragment polyhedra. In the following step, the single-pass volume ray casting is triggered by rendering the front faces, which generates the ray entry points into the volume fragment while the exit points are read from the second image buffer. The result of the individual ray casting passes is composited into the integration buffer, incrementally accumulating the complete volume integral for each ray.
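Putting these steps together, a rendering frame can be outlined as follows. This is a simplified sketch, not the authors' implementation: the BSP node layout and the helper names (renderExitPoints, rayCastFragment) are hypothetical, and all GPU work is hidden behind placeholder calls.

```cpp
#include <array>
#include <memory>
#include <vector>

using Vec3 = std::array<float, 3>;

// Convex volume fragment produced by the BSP decomposition; it records which
// volumes overlap it so a specialized ray casting shader can be instantiated.
struct VolumeFragment {
    std::vector<int> overlappingVolumes;
    // bounding polyhedron geometry omitted in this sketch
};

// Solid-leaf BSP node built from the volumes' bounding geometries only.
struct BspNode {
    Vec3 planeNormal{};                    // split plane (one face of a bounding box)
    float planeDist = 0.0f;
    std::unique_ptr<BspNode> front, back;  // front = positive half-space (assumed)
    std::vector<VolumeFragment> fragments; // non-empty only in leaves
};

// Front-to-back traversal relative to the eye: visit the child containing the
// eye first, then the leaf fragments, then the far child. This yields the
// depth ordering needed for compositing without any depth peeling.
void collectFrontToBack(const BspNode* node, const Vec3& eye,
                        std::vector<const VolumeFragment*>& out) {
    if (!node) return;
    if (!node->front && !node->back) {     // leaf: emit its fragments
        for (const auto& f : node->fragments) out.push_back(&f);
        return;
    }
    const float side = node->planeNormal[0] * eye[0] + node->planeNormal[1] * eye[1] +
                       node->planeNormal[2] * eye[2] - node->planeDist;
    const BspNode* nearChild = (side >= 0.0f) ? node->front.get() : node->back.get();
    const BspNode* farChild  = (side >= 0.0f) ? node->back.get()  : node->front.get();
    collectFrontToBack(nearChild, eye, out);
    collectFrontToBack(farChild, eye, out);
}

// Placeholder GPU passes standing in for the renderer interface.
void clearIntermediateBuffers() {}
void renderExitPoints(const VolumeFragment&) {}  // back faces, copies accumulated opacity
void rayCastFragment(const VolumeFragment&) {}   // front faces, single-pass ray casting

void renderFrame(const BspNode* bspRoot, const Vec3& eye) {
    clearIntermediateBuffers();
    std::vector<const VolumeFragment*> sorted;
    collectFrontToBack(bspRoot, eye, sorted);
    for (const VolumeFragment* f : sorted) {
        renderExitPoints(*f);  // exit positions + opacity for early ray termination
        rayCastFragment(*f);   // composites this pass into the integration buffer
    }
}
```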
4 Results
We implemented the described rendering system using C++, OpenGL 3.0 and GLSL. The evaluation was performed on a 2.8GHz Intel Core2Quad workstation with 8GiB RAM and a single NVIDIA GeForce GTX 280 graphics board running Windows XP x64 Edition. We tested our system with various large data sets with sizes ranging from 700MiB up to 40GiB. Most datasets were seismic volumes from the oil and gas domain. For this paper we used scenes composed of multiple large seismic multi-resolution volumes as shown in figure 6. For confidentiality reasons we are only able to show one small data set containing 1915 × 439 × 734 voxels at 8 bits/sample. We duplicated the volume several times to show the ability of our system to interactively handle very large amounts of data. The chosen brick size for all volumes was 64³, the atlas texture size was 512MiB and the brick data cache size 3GiB. Images were rendered using a view port resolution of 1280 × 720. For the evaluation we emulated potential oil and gas application scenarios for multi-volume rendering techniques. Seismic models of large oil fields contain multiple potentially overlapping seismic surveys, which can only be inspected one at a time, or adjacent surveys have to be merged. Using our multi-resolution multi-volume rendering approach, arbitrary configurations can be directly rendered without size limitations, the need for resampling, or adjacency relationships. Figure 6 shows some artificial configurations of sets of surveys. Our system is able to handle many large volumes simultaneously through the described resource management for multiple multi-resolution volumes. Figure 6c shows a scene composed of nine multi-resolution volumes managed through the shared brick data cache and atlas texture. The rendering performance of our multi-volume ray casting system for virtualized multi-resolution volumes depends mainly on the screen projection size of the volumes, the volume sampling rate used, and the chosen transfer functions. The memory transfers between the shared resources have only a small influence on the rendering performance.
Fig. 6. Example scenes containing three to nine multi-resolution volumes: (a) Scenario 1: three volumes, at most two overlapping volumes; (b) Scenario 2: three volumes, at most three overlapping volumes; (c) Scenario 3: nine separate volumes. The left images show the final rendering. The right images show the current volume BSP-tree decomposition. The brightness of the volume fragments represents the respective depth ordering. Fragments being part of multiple volumes are shown in red.
Table 1 shows the very short BSP-update times and the fast frame rendering times for the example scenarios presented in figure 6. We also experimented with artificial scenarios containing up to nine completely overlapping volumes.
Table 1. BSP-tree update and frame rendering times for the example scenarios shown in figure 6. The brick size for all volumes was 64³, the atlas texture size was 512MiB and the brick data cache size 3GiB, using a view port resolution of 1280 × 720.

Example Scenario   Volumes   Overlapping Volumes   Volume Fragments   BSP-Tree Update Time   Rendering Frame Time
1                  3         2                     13                 0.15ms                 22Hz
2                  3         3                     44                 0.28ms                 16Hz
3                  9         0                     9                  0.29ms                 14Hz
The number of volume fragments grew quickly to 500 fragments, requiring BSP-update times of up to several milliseconds, which resulted in a large number of required rendering passes. While the BSP-update affects rendering performance during volume manipulation, viewer navigation remained fluent. For the scenario shown in figure 6c, containing nine non-overlapping volumes, we also compared the performance of our multi-volume rendering approach to a single-volume implementation, which renders and blends non-overlapping volumes sequentially. We observed frame rates very similar to those of our multi-volume approach. Thus, the overhead of maintaining the multiple render targets for the ray casting method in such a scenario is small compared to the actual cost of ray casting.
5 Conclusions and Future Work
We presented a GPU-based volume ray casting system for multiple arbitrarily overlapping multi-resolution volume data sets. The system is able to simultaneously handle a large number of multi-gigabyte volumes through a shared resource management system. We differentiate overlapping from non-overlapping volume regions by using a BSP-tree-based method, which additionally provides us with a straightforward depth ordering of the resulting convex volume fragments and avoids costly depth peeling procedures. The resulting volume fragments are efficiently rendered by custom instantiated shader programs. Through our efficient volume virtualization method we are able to base the BSP volume decomposition solely on the bounding geometries of the volumes. As a result, the BSP needs to be updated only if individual volumes are moved. The generation of our multi-resolution representation is currently based on view-dependent criteria only. Transfer function-based metrics and the use of occlusion information [6] can greatly improve the volume refinement process and the guidance of the resource distribution among the volumes in the scene. Introducing additional and user-definable composition modes for the combination of overlapping volumes can increase visual expressiveness [18]. Our ultimate goal is to interactively roam through and explore multi-terabyte volume data sets. Such scenarios already exist in the oil and gas domain, where large oil fields are covered by various potentially overlapping seismic surveys. Surveys are additionally repeated to show the consequences of the oil
production, which generates time-varying seismic surveys. Currently no infrastructure exists to handle such extreme scenarios.
Acknowledgments

This work was supported in part by the VRGeo Consortium and the German BMBF InnoProfile project “Intelligentes Lernen”. The seismic data set from the Wytch Farm oil field is courtesy of British Petroleum, Premier Oil, Kerr-McGee, ONEPM, and Talisman.
References

1. LaMar, E., Hamann, B., Joy, K.I.: Multiresolution Techniques for Interactive Texture-Based Volume Visualization. In: Proceedings of IEEE Visualization 1999, pp. 355–361. IEEE, Los Alamitos (1999)
2. Boada, I., Navazo, I., Scopigno, R.: Multiresolution Volume Visualization with a Texture-Based Octree. The Visual Computer 17, 185–197 (2001)
3. Plate, J., Tirtasana, M., Carmona, R., Fröhlich, B.: Octreemizer: A Hierarchical Approach for Interactive Roaming Through Very Large Volumes. In: Proceedings of the Symposium on Data Visualisation 2002, pp. 53–60. IEEE, Los Alamitos (2002)
4. Lindholm, S., Ljung, P., Hadwiger, M., Ynnerman, A.: Fused Multi-Volume DVR using Binary Space Partitioning. In: Computer Graphics Forum, Proceedings Eurovis 2009, Eurographics, vol. 28(3), pp. 847–854 (2009)
5. Kraus, M., Ertl, T.: Adaptive Texture Maps. In: Proceedings of SIGGRAPH/EG Graphics Hardware Workshop 2002, Eurographics, pp. 7–15 (2002)
6. Gobbetti, W., Marton, F., Guitián, J.A.I.: A Single-pass GPU Ray Casting Framework for Interactive Out-of-Core Rendering of Massive Volumetric Datasets. The Visual Computer 24, 797–806 (2008)
7. Crassin, C., Neyret, F., Lefebvre, S., Eisemann, E.: Gigavoxels: Ray-Guided Streaming for Efficient and Detailed Voxel Rendering. In: ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D), pp. 15–22. ACM, New York (2009)
8. Lefebvre, S., Hornus, S., Neyret, F.: Octree Textures on the GPU. In: GPU Gems 2, pp. 595–613. Addison-Wesley, Reading (2005)
9. Horn, D.R., Sugerman, J., Houston, M., Hanrahan, P.: Interactive k-d Tree GPU Raytracing. In: ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D), pp. 167–174. ACM, New York (2007)
10. Jacq, J.J., Roux, C.: A Direct Multi-Volume Rendering Method Aiming at Comparisons of 3-D Images and Models. IEEE Transactions on Information Technology and Biomedicine 1, 30–43 (1997)
11. Leu, A., Chen, M.: Modeling and Rendering Graphics Scenes Composed of Multiple Volumetric Datasets. Computer Graphics Forum 18, 159–171 (1999)
12. Nadeau, D.: Volume Scene Graphs. In: Proceedings of the 2000 IEEE Symposium on Volume Visualization, pp. 49–56. IEEE, Los Alamitos (2000)
13. Grimm, S., Bruckner, S., Kanitsar, A., Gröller, E.: Flexible Direct Multi-Volume Rendering in Interactive Scenes. In: Vision, Modeling, and Visualization (VMV), pp. 386–379 (2004)
14. Plate, J., Holtkaemper, T., Froehlich, B.: A Flexible Multi-Volume Shader Framework for Arbitrarily Intersecting Multi-Resolution Datasets. IEEE Transactions on Visualization and Computer Graphics 13, 1584–1591 (2007)
15. Roessler, F., Botchen, R.P., Ertl, T.: Dynamic Shader Generation for Flexible Multi-Volume Visualization. In: Proceedings of IEEE Pacific Visualization Symposium 2008 (PacificVis 2008), pp. 17–24. IEEE, Los Alamitos (2008)
16. Fuchs, H., Kedem, Z.M., Naylor, B.: On Visible Surface Generation by a priori Tree Structures. In: Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques, pp. 124–133. ACM, New York (1980)
17. Weiler, M., Westermann, R., Hansen, C., Zimmerman, K., Ertl, T.: Level-of-Detail Volume Rendering via 3D Textures. In: Proceedings of the 2000 IEEE Symposium on Volume Visualization, pp. 7–13. IEEE, Los Alamitos (2000)
18. Cai, W., Sakas, G.: Data Intermixing and Multi-Volume Rendering. Computer Graphics Forum 18, 359–368 (1999)
Dynamic Chunking for Out-of-Core Volume Visualization Applications

Dan R. Lipsa¹, R. Daniel Bergeron², Ted M. Sparr², and Robert S. Laramee³

¹ Armstrong Atlantic State University, Savannah, GA 31411, USA
² University of New Hampshire, Durham, NH 03824, USA
³ Swansea University, Swansea SA2 8PP, Wales, United Kingdom
Abstract. Given the size of today’s data, out-of-core visualization techniques are increasingly important in many domains of scientific research. In earlier work a technique called dynamic chunking [1] was proposed that can provide significant performance improvements for an out-of-core, arbitrary direction slicer application. In this work we validate dynamic chunking for several common data access patterns used in volume visualization applications. We propose optimizations that take advantage of extra knowledge about how data is accessed or knowledge about the behavior of previous iterations and can significantly improve performance. We present experimental results that show that dynamic chunking has performance close to regular chunking but has the added advantage that no reorganization of data is required. Dynamic chunking with the proposed optimizations can be significantly faster on average than chunking for certain common data access patterns.
1 Introduction
Disk drive size and processor speed have increased significantly in the last couple of years. This enables scientists to generate ever larger simulation data and to acquire and store ever larger real-world data. While visualization is one of the most effective techniques to analyze these large data sets, it usually requires loading the entire data into main memory. Often, this is impossible given the size of the data. This problem has led to a renewed interest in out-of-core visualization techniques, which execute the visualization by loading only a small part of the data into main memory at any given time. While multidimensional arrays are usually stored in a file using linear storage, a common way of improving access to them is to reorganize the file to use chunked storage. In this case the data is split into chunks (cubes or bricks) of equal size and the same dimensionality as the original volume, and each individual chunk is stored contiguously in the file using linear storage. Chunks are stored in the file in linear fashion by traversing the axes of the volume in a certain order using nested loops. The order of traversing the axes and the size of the chunks are determined by the expected access pattern. However, chunked file storage typically results in better file system performance (compared to linear storage) even when the access pattern does not match the “expected” one.
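To make the difference between the two storage layouts concrete, the following sketch computes the offset of a voxel for linear storage and for chunked storage. It is a generic illustration with invented helper names, not code from any of the systems discussed here; it assumes cubic chunks of edge length c that evenly divide the volume and an x-fastest traversal order both within and across chunks.

```cpp
#include <cstdint>

// Voxel offset (in voxels) for conventional linear storage: x varies fastest.
std::uint64_t linearOffset(std::uint64_t x, std::uint64_t y, std::uint64_t z,
                           std::uint64_t dimX, std::uint64_t dimY) {
    return x + dimX * (y + dimY * z);
}

// Voxel offset for chunked storage with cubic chunks of edge length c. Each
// chunk is stored contiguously; chunks follow each other in x-fastest order.
std::uint64_t chunkedOffset(std::uint64_t x, std::uint64_t y, std::uint64_t z,
                            std::uint64_t dimX, std::uint64_t dimY, std::uint64_t c) {
    const std::uint64_t chunksX = dimX / c, chunksY = dimY / c;
    const std::uint64_t cx = x / c, cy = y / c, cz = z / c;  // chunk coordinates
    const std::uint64_t lx = x % c, ly = y % c, lz = z % c;  // position inside the chunk
    const std::uint64_t chunkIndex = cx + chunksX * (cy + chunksY * cz);
    return chunkIndex * c * c * c + (lx + c * (ly + c * lz));
}
```

With chunked storage, voxels that are spatially close in all three dimensions end up close in the file, which is what improves disk access for non-axis-aligned traversals.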
Dynamic chunking views a dataset stored using linear storage as if it were chunked into blocks of configurable size and shape. As soon as an item in any block is accessed, the entire block is read. Loading a block into memory may require many read operations. Previously, dynamic chunking was described in the context of a slicer visualization application [1] and it was shown that it provides some of the benefits of file chunking without having to reorganize or maintain multiple copies of the file. In this paper we extend this work to an arbitrary-direction slicer and to ray casting. We propose two optimizations: one takes advantage of further knowledge about the iteration pattern, and the other takes advantage of knowledge about the behavior of previous iterations; both can significantly improve performance. We show that dynamic chunking has performance close to that of regular chunking without the need to reorganize the data, and that it can be significantly faster on average with the proposed optimizations.
2 Related Work
The field of out-of-core visualization has a large body of work. Engel et al. [2] present techniques used to render large data on the GPU, while Silva et al. [3] present a survey of external memory techniques. In this section we summarize the work that is most closely related to ours.

Chunking and related techniques. Sarawagi et al. [4] introduced chunking as a way to improve access to multidimensional arrays stored in files. Many other techniques use chunks as a unit of storage for data [5] or as a unit of communication between a server that contains the data and clients that visualize the data [6]. A technique similar to chunking but applied to irregular grids is the meta-cell technique [7, 8, 9]. Meta-cells contain several spatially close cells, have a fixed size (several disk blocks), and are loaded from disk as a unit, which enables them to be stored in a space-saving format. Pascucci and Frank [10] describe a scheme that reorganizes regular grid data in a way that leads to efficient disk access and enables extracting multiple resolution versions of the data. While chunking is very effective and widely used, it has the disadvantage that the data needs to be reorganized using chunked storage. Dynamic chunking works directly with a linear storage file, so no reorganization is required. While reading a dynamic chunk from a linear storage file is less efficient than reading a chunk from a chunked file, dynamic chunking can benefit from choosing the dynamic chunk size based on the available memory and iteration parameters.

Caching and prefetching. Common ways to mediate the effect of slow I/O are to use prefetching while the user is thinking what to do, as in [11], or to use a separate thread to overlap rendering with data I/O, as in [12]. Brown et al. [13] use a compiler to analyze future access patterns in an application and issue prefetch and release hints and use the operating system to
manage these requests for all applications. Rhodes et al. [14] use information about the access pattern provided by an iterator object to calculate a cache block shape that reduces the number of reads from the file. Chisnall et al. [15] use data inferred from previous accesses to an out-of-core octree to improve out-of-core rendering of a point dataset using discrete ray tracing. Cox et al. [16] speed up out-of-core visualization of Computational Fluid Dynamics using application-controlled demand paging. This works similarly to memory-mapped files but with the additional advantage that the application can specify the page size and can translate from an external storage format. Dynamic chunking differs from this work in how we decide which data is chosen to be prefetched and cached. The dynamic chunking module prefetches blocks with the same dimensionality as the data, regardless of where that data is stored on disk. Multiple application-level read operations are required to load such a block, but most of those reads would have occurred eventually anyway.
3 Dynamic Chunking Validation
We use dynamic chunking [1] to speed up out-of-core execution for two visualization applications using four different data access patterns. One pattern is implemented in an arbitrary-direction slicer application that reads slices from a volume of data and composes them to build a maximum intensity projection (MIP) [2] representation of the volume. A ray casting application supports three different access patterns: it reads individual voxels, blocks of size 2³ voxels, and blocks of size 4³ voxels along rays, and builds either a MIP representation or a volume rendering [17] representation of the volume. Our applications access 3D data through the Granite library [14] datasource, which hides the location of the data from its users. Our applications can access data from main memory or from disk without any change in the source code. A datasource is conceptually an n-dimensional volume of voxels where each voxel can store one or more attributes. Our dynamic chunking module uses a datasource to read data from disk, and the module, together with the datasource it reads from, is encapsulated behind a datasource interface as shown in Figure 1. This enables us to switch between accessing data from main memory, accessing data from disk, and accessing data from disk through dynamic chunking without changing the application source code or the data representation. An application can easily be converted to use dynamic chunking by using a datasource for all data accesses.
Fig. 1. The Dynamic Chunking module provides the same interface as the Datasource module. This allows for easy integration into an application.
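The encapsulation shown in Figure 1 can be expressed as a small interface. The authors' implementation builds on the Granite library in Java; the C++ sketch below only illustrates the idea, and the class and method names are invented for this illustration. The block table and on-demand loading hidden behind value() are sketched further below.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Minimal n-dimensional datasource interface: callers ask for voxels by index
// and do not know whether the data comes from memory, from disk, or from a
// caching layer such as dynamic chunking.
class Datasource {
public:
    virtual ~Datasource() = default;
    virtual std::vector<std::uint32_t> dimensions() const = 0;
    virtual float value(const std::vector<std::uint32_t>& index) const = 0;
};

// Dynamic chunking wraps another datasource and exposes the same interface,
// so an application is converted by swapping the object it reads from.
class DynamicChunkingDatasource : public Datasource {
public:
    explicit DynamicChunkingDatasource(std::shared_ptr<Datasource> backing)
        : backing_(std::move(backing)) {}
    std::vector<std::uint32_t> dimensions() const override { return backing_->dimensions(); }
    float value(const std::vector<std::uint32_t>& index) const override {
        // A full implementation would locate the enclosing block, load it on
        // demand into an LRU-managed cache, and serve the voxel from there.
        return backing_->value(index);
    }
private:
    std::shared_ptr<Datasource> backing_;
};
```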
Lipsa et al. [1] describe dynamic chunking in detail. We summarize the main idea here. We view the entire volume as composed of blocks of configurable size and shape and we create a block table that stores a reference to each of these blocks. Each reference can point to a block from the volume, which has the same number of dimensions as the original volume, or it can be nil. Loading a block is done on demand, as soon as any element from the block is needed. We use the Least Recently Used (LRU) block replacement algorithm to maintain cache relevance. We apply dynamic chunking to regular grid data stored using conventional linear storage. We use dynamic chunking to speed up out-of-core execution for two visualization applications: a slicer application and a ray casting application.

Slicer application. The slicer application builds a MIP representation of a subvolume by using a slice iterator. This iterator allows a user to specify a subvolume of the data and a vector which represents a plane normal. We move the plane along its normal, using a unit step, for as many iterations as the plane touches the subvolume. At every step, we determine the voxels in the intersection of the subvolume and the plane. This intersection polygon is computed using a 3D scan-conversion algorithm for polygons [18]. We use nearest-neighbor interpolation to determine the voxels that form the intersection polygon. We read them from the file and store them as a texture. Textures are composed using graphics hardware to obtain a final image which is the MIP representation of the volume viewed from the direction specified by the vector normal to the plane.

Ray casting application. We implemented a ray casting application and used it to generate a MIP image or a volume rendering [17] representation of a subvolume. The subvolume is an arbitrarily oriented cuboid (rectangular parallelepiped) subset of the entire dataset. The ray casting application accesses the data file by reading equally spaced voxels along each ray such that the distance between any two of these voxels is equal to the distance between two neighboring voxels. For the MIP image we use either nearest-neighbor interpolation or trilinear interpolation to determine data values along the ray. These three possibilities (MIP with nearest-neighbor interpolation, MIP with trilinear interpolation, and volume rendering) require three different access patterns through the data file: reading voxels, reading blocks of size 2³ voxels, and reading blocks of size 4³ voxels along each ray. For each of the three access patterns, the distance between two sample points along a ray is the same as the distance between two neighboring voxels. The resolution of the image produced is determined by the resolution of the data; for instance, a volume of size 256³ voxels will produce an image of size 256² pixels. Volume rendering [17] is implemented by compositing the color and opacity of equally spaced points along a viewing ray. For each point along a ray we read a 4³ block of voxels from the data file, which is used to calculate the color and opacity at the current position on the ray. While the slicer and the ray casting applications do not load the visualized subvolume into main memory, they cache blocks of data through the
dynamic chunking module. Denning [19] defines the working set of a program as “the smallest collection of information that must be present in main memory to assure efficient execution” of the program. For the slicer application, the working set consists of all blocks that cover the current slice in an iteration. For the ray casting application, the working set consists of all blocks that completely enclose all ray segments along one rectangular side of the iteration cuboid.
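The block table and on-demand loading summarized above can be sketched as follows, again as an illustrative C++ approximation of the Java implementation rather than the actual code; the cache key, the cubic block shape, and the single loadBlock placeholder (which in practice issues many file reads against the linearly stored data) are simplifying assumptions.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <vector>

// Sketch of the dynamic chunking block table with LRU replacement. The volume
// is viewed as a grid of blocks of configurable (here cubic) size; a block is
// read in its entirety as soon as any voxel inside it is accessed.
class BlockCache {
public:
    BlockCache(std::size_t blockEdge, std::size_t maxResidentBlocks)
        : blockEdge_(blockEdge), maxResident_(maxResidentBlocks) {}

    float voxel(std::size_t x, std::size_t y, std::size_t z) {
        const std::size_t key = blockKey(x / blockEdge_, y / blockEdge_, z / blockEdge_);
        auto it = table_.find(key);
        if (it == table_.end()) {                      // block not resident: load it
            if (table_.size() == maxResident_) evictLeastRecentlyUsed();
            it = table_.emplace(key, loadBlock(key)).first;
        }
        touch(key);                                    // mark as most recently used
        const std::size_t lx = x % blockEdge_, ly = y % blockEdge_, lz = z % blockEdge_;
        return it->second[lx + blockEdge_ * (ly + blockEdge_ * lz)];
    }

private:
    // Placeholder: in a real system this gathers one n-dimensional block from
    // the linearly stored file, which requires many read operations.
    std::vector<float> loadBlock(std::size_t /*key*/) {
        return std::vector<float>(blockEdge_ * blockEdge_ * blockEdge_, 0.0f);
    }
    std::size_t blockKey(std::size_t bx, std::size_t by, std::size_t bz) const {
        return bx + 1024 * (by + 1024 * bz);           // assumes < 1024 blocks per axis
    }
    void touch(std::size_t key) {
        lru_.remove(key);
        lru_.push_back(key);
    }
    void evictLeastRecentlyUsed() {
        table_.erase(lru_.front());
        lru_.pop_front();
    }

    std::size_t blockEdge_, maxResident_;
    std::unordered_map<std::size_t, std::vector<float>> table_;  // resident blocks
    std::list<std::size_t> lru_;                                 // LRU order of block keys
};
```

If the working set of an iteration fits into maxResidentBlocks blocks, every block is loaded exactly once; otherwise blocks are evicted and later reloaded, which is the cache thrashing addressed by the optimizations in the next section.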
4 Dynamic Chunking Optimizations
Dynamic chunking is a general technique because it does not know about a particular data access pattern the application may want to use. If the application makes this information known, or if the module can infer this information, further optimizations can be applied that improve performance. These optimizations are based on the observation that a larger block size results in better performance for dynamic chunking because of larger reads from disk. At the same time, increasing the block size results in a larger cache memory required to store the working set of the application, and a block size that is too large can result in cache thrashing. Our goal is to find the maximum block size that allows us to store the working set of the application in the available physical memory. Our optimizations work when the iteration subvolume is aligned with the principal axes. If the iteration subvolume is not aligned with the principal axes, larger cache blocks result in reading more data that is not used by the application. In this case increasing the cache block size may not increase the performance of the iteration. We present two block size optimizations: analytical and adaptive. The analytical optimization uses information provided by the application to calculate the optimal block size for certain iteration patterns, while the adaptive optimization uses information gathered from previous iterations to optimize the block size used by the dynamic chunking module.

Analytical block size optimization. We present an algorithm that finds the maximum block size for the dynamic chunking module such that we can store all blocks that cover a slice in memory. While our algorithm works with either a slicing application or a ray casting application for which the iteration subvolume is aligned with the principal axes, the iteration direction and the view are arbitrary. For simplicity we present our algorithm in 2D, but the same reasoning can be applied in 3D. A slicing application provides two extra parameters to the dynamic chunking module: the iteration subvolume and the orientation of the iteration slice. Our module then calculates the maximum block size that can be used by the dynamic chunking module without causing cache thrashing. Suppose that the iteration area (see Figure 2) has an edge size of l and that the iteration area is partitioned by the dynamic chunking module into n × n squares. Suppose that the angle between the iteration line and the horizontal axis is α and that the amount of available memory is M. We want to minimize n (this in turn will maximize the square size) such that the working set of the application (the squares that cover the iteration line) still fits in the amount of
Fig. 2. An area with edge length l partitioned into n² squares, where n = 4. The intersection between square ABCD and the iteration line can be determined by looking at the projections of the corners of the square on the normal to the iteration line.
available memory M. We can assume that α is between 0° and 45°; all other angles can be treated similarly through symmetry. We denote with IB the maximum number of squares intersected by the iteration line as it traverses the whole subvolume. Note that the number of squares intersected by the iteration line in Figure 2 varies through the iteration: it starts with one square at the beginning of the iteration, then it increases, and then it decreases back to one square at the end of the iteration. We denote with MB the size (area) of a square, where MB(n) = l²/n². Our goal is to find the maximum square size, or equivalently to find the minimum n, such that all squares intersected by the iteration line can be stored in the available memory M. That can be done with a loop that starts with n = 1 and tests at each step whether IB ∗ MB < M. If the test is true, then we have found n and, in turn, the square size; if it is not true, we continue by doubling n (which decreases the square size by a factor of 4) until a minimum square size is reached, at which point we give up. In that case we do not have enough memory to optimize the execution of the application through dynamic chunking. So, the only problem left is to find IB, the maximum number of squares intersected by the iteration line as it traverses the subvolume. We start by looking at the intersection between the iteration line and a square from the partition of the subvolume. We can deduce that an iteration line sliding along its normal n (which corresponds to an angle α ∈ [0, 45] degrees) intersects the square with the lower-left corner at position (i, j) if

d ∈ ( (l/n)(nx(i+1) + ny j), (l/n)(nx i + ny(j+1)) ),

where d = |ON| is the distance from the origin to the line, n is the number of squares per edge of the iteration area, l is the iteration area edge size, and nx and ny are the components of the normal n to the iteration line. A similar result can be deduced in 3D.
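A compact 2D version of this search can be written directly from the description above; the sketch below is illustrative rather than the authors' code. It computes IB with the interval-endpoint sweep described in the next paragraph, uses the minimum and maximum corner projections instead of a fixed sign convention for the normal, and assumes an upper bound on n as the give-up condition.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// 2D sketch of the analytical block size optimization. The iteration area has
// edge length l and is partitioned into n x n squares; (nx, ny) is the unit
// normal of the iteration line. Returns IB, the maximum number of squares the
// line intersects at any position d along its normal.
int maxIntersectedSquares(int n, double l, double nx, double ny) {
    // For every square, the line intersects it while d lies between the
    // minimum and maximum projections of the square's corners onto the normal.
    std::vector<std::pair<double, int>> events;  // (position, +1 start / -1 end)
    const double s = l / n;                      // square edge length
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            const double proj[4] = { nx * i * s + ny * j * s,
                                     nx * (i + 1) * s + ny * j * s,
                                     nx * i * s + ny * (j + 1) * s,
                                     nx * (i + 1) * s + ny * (j + 1) * s };
            events.push_back({*std::min_element(proj, proj + 4), +1});
            events.push_back({*std::max_element(proj, proj + 4), -1});
        }
    std::sort(events.begin(), events.end());
    int overlap = 0, best = 0;
    for (const auto& e : events) {               // sweep over the sorted endpoints
        overlap += e.second;
        best = std::max(best, overlap);
    }
    return best;
}

// Double n (quartering the square size) until the working set IB * MB fits in
// the memory budget M; returns 0 to signal giving up.
int chooseSubdivision(double l, double nx, double ny, double M,
                      double bytesPerCell, int maxN = 256) {
    for (int n = 1; n <= maxN; n *= 2) {
        const double MB = (l / n) * (l / n) * bytesPerCell;  // memory per square
        if (maxIntersectedSquares(n, l, nx, ny) * MB < M) return n;
    }
    return 0;  // too little memory to optimize through dynamic chunking
}
```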
For each of the n³ blocks, we have an interval that gives us the positions of the slice that intersect the block. Overlapping intervals correspond to positions where the slice intersects several blocks. We can find the maximum number of overlapping intervals by creating a sorted list of the left and right ends of all intervals. We initialize a counter to 0 and traverse the sorted list of interval ends: if we see a left end of an interval we increment the counter, and if we see a right end we decrement it. While we do this we keep track of the maximum value of the counter. That maximum value is IB, the maximum number of blocks intersected by the slice as it traverses the subvolume. To overcome the large time requirements (O(n³ log n)) of this algorithm, especially for partitions with more than 64³ blocks, we note that the maximum number of blocks intersected by a slice depends only on the direction of iteration and on the number of blocks in the partition of the subvolume; it does not depend on the data itself. In our implementation the maximum number of blocks intersected by a slice is calculated and stored in the application for the expected directions of iteration and numbers of blocks in the partition of the subvolume.

Adaptive block size optimization. For iteration patterns for which the working set is not a slice but a more complex shape, such as a part of a sphere, or for applications that do not provide the required additional information, it is not possible to analytically calculate the largest block size that does not cause cache thrashing for a certain memory size. In this case the dynamic chunking module can learn from the behavior of past iterations and adjust the block size for best performance. The application provides a unique identifier for the iteration pattern used. This identifier allows the dynamic chunking module to keep track of previous iterations and decide if a certain block size worked well for a certain iteration and memory size. The identifier differentiates between different working set shapes, working set sizes and cache memory sizes. The dynamic chunking module starts with a default block size, possibly provided by the application. While the application executes an iteration, the dynamic chunking module tests for cache thrashing by keeping track of the blocks used. If a block is discarded and then reused, we conclude that the working set does not fit in memory and so the cache is thrashed. If an iteration thrashes the cache for a certain block size, the iteration is restarted with a smaller block size that has half the edge size of the original block. We continue this process until we complete an iteration without thrashing or a minimum block size is reached. If the minimum block size is reached, we have too little memory to use dynamic chunking. If we complete an iteration without thrashing, the dynamic chunking module stores the block size for the current identifier together with an optimal flag, which signals that no other block size besides the one stored will be tried for future iterations. If instead an iteration completes without thrashing for a certain block size, the dynamic chunking module stores the block size for the current identifier. A future iteration tries to use a larger block that has an edge size double the edge size of
the stored block, and if that completes without thrashing, the block size will be stored for the current identifier. The process continues until a block size results in cache thrashing, in which case the block size with edge size half the edge size of the original block is stored for the current identifier together with the optimal flag. This algorithm results in progressively larger block sizes used for the dynamic chunking module, which results in progressively better performance. If the initial block size provided by the application does not result in thrashing, the only price paid for this improved performance is one iteration that will thrash the cache, which will be quickly detected by the fact that a block discarded is being reused. If the initial block size provided by the application results in thrashing, several iterations that result in thrashing are possible until a block size that does not result in thrashing is found or the dynamic chunking module gives up. If a block size that does not result in thrashing is found, that block size is optimal.
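The adaptive strategy can be summarized as a small state machine keyed by the iteration identifier. The sketch below only mirrors the description above; the class layout and method names are invented, and the thrashing test itself (a discarded block being reused) is assumed to be reported by the cache.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Sketch of the adaptive block size optimization. One record is kept per
// iteration-pattern identifier; the identifier encodes working set shape,
// working set size and cache memory size, as described above.
class AdaptiveBlockSize {
public:
    AdaptiveBlockSize(std::size_t defaultEdge, std::size_t minEdge)
        : defaultEdge_(defaultEdge), minEdge_(minEdge) {}

    // Block edge length to use for the next run of iteration `id`.
    std::size_t edgeFor(const std::string& id) const {
        auto it = records_.find(id);
        if (it == records_.end()) return defaultEdge_;
        // Probe twice the stored edge unless that edge is already known optimal.
        return it->second.optimal ? it->second.edge : it->second.edge * 2;
    }

    // Report the outcome of a run; `thrashed` means a discarded block was
    // reused, i.e. the working set did not fit into the cache. The return
    // value is the edge to restart the iteration with (0 = no restart needed).
    std::size_t reportRun(const std::string& id, std::size_t usedEdge, bool thrashed) {
        if (!thrashed) {
            const bool foundByShrinking = shrinking_.erase(id) > 0;
            records_[id] = {usedEdge, foundByShrinking};  // optimal if reached by halving
            return 0;
        }
        auto it = records_.find(id);
        if (it != records_.end() && it->second.edge * 2 == usedEdge) {
            it->second.optimal = true;     // upward probe thrashed: previous size is optimal
            return it->second.edge;
        }
        if (usedEdge / 2 >= minEdge) {     // halve the edge and retry the iteration
            shrinking_.insert(id);
            return usedEdge / 2;
        }
        shrinking_.erase(id);
        return 0;                          // give up: too little memory for dynamic chunking
    }

private:
    struct Record { std::size_t edge; bool optimal; };
    std::size_t defaultEdge_, minEdge_;
    std::unordered_map<std::string, Record> records_;
    std::unordered_set<std::string> shrinking_;
};
```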
5 Results
To test the dynamic chunking module we traversed a subvolume of size 256³ voxels with three bytes per voxel (the subvolume has 48 MB) located inside a data volume of size 1024 × 1216 × 2048 voxels (the data volume has 7.2GB), with 25 MB allocated to the dynamic chunking module for caching. The subvolume is centered at (256, 256, 256) and we vary either the orientation of the slice for the slicer application or the orientation of the subvolume for ray casting. Our tests measure the traversal time through the subvolume for 84 orientations specified by the following pairs of heading and pitch Euler angles [20]: {0, 30, . . . , 330} × {−90, −60, . . . , 90}. We summarize all times required for a traversal for all possible orientation angles using a box plot. For the ray casting application, the orientation of the iteration cuboid uses a bank angle of 20 degrees. For the slicer application, the orientation of the normal to the slice can be specified only using heading and pitch angles. Dynamic chunking improves the time required to read data for various access patterns used in visualization algorithms. Our tests measure only the time required to read the data when using the same access pattern as the visualization application. Note that we set the Java Virtual Machine memory to 30MB (using the -Xms and -Xmx switches) and we set the cache memory to 25MB, both less than the size of the subvolume traversed, so our visualization applications and our tests run with data out-of-core. Before each run of a traversal for a particular orientation, we clear the file system cache by running a separate program called thrashcache. For the Linux operating system, this program unmounts and then mounts the file system that contains the data. In the graphs presenting our test results, we use acronyms for the data access patterns tested: SLICE for the slicer application, MIP N for the ray casting application that calculates a MIP image and uses nearest neighbor interpolation, MIP T for the ray casting application that calculates a MIP image and uses
Fig. 3. Data read time for dynamic chunking (DC) versus chunking (C) optimizations for four access patterns: slicing (SLICE), ray casting with nearest neighbor interpolation (MIP N), ray casting with trilinear interpolation (MIP T), and volume rendering (VOL)
trilinear interpolation, and VOL for volume rendering. We also use acronyms for the optimization technique used when the test is run: DC for dynamic chunking, C for chunking, and FS for file system cache only. Our test machine has a dual Intel Xeon 3GHz processor with 512KB L2 cache and 1GB of RAM. The disk drive has a rotational speed of 7200 RPM, an average seek time of 8.5 ms, and 2 MB of cache. The machine runs the Fedora Core 5 GNU/Linux operating system and Java 1.5.0. All our applications are built in Java, using the Java binding to OpenGL (JOGL) [21] for rendering. This makes our applications platform independent. The results presented report only the time required to read the data; the time needed for the actual visualization is not included. This makes our results relevant to any implementation of a visualization algorithm that accesses data using one of the patterns tested.

Dynamic chunking versus chunking. Figure 3 displays the time to read a 48MB subvolume from a 7.2GB chunked file and from a 7.2GB file with data stored using linear storage. The box plots labeled C show the time to read the subvolume from a chunked file. The graphs labeled DC show the time to read the subvolume from a linear file using dynamic chunking with block size 16³ voxels. The file is chunked with chunks of size 16³ voxels, the same as the size of the pages in the paging module. Paging over a chunked file works about 17% faster than dynamic chunking over a linear file, but it requires reorganization of the file. The big speed advantage that would be expected from chunking is not seen because of the file system and disk drive caches, which are able to avoid many actual hard drive operations. For example, when the dynamic chunking module loads a block of size 16³ voxels, many of the 16 × 16 read operations sent to the file system are served from the file system or hard drive cache.
[Figure 4: mean data read times. DC with block size 16³: 48.29 s; chunking with chunk size 16³: 40.45 s; DC with analytical block size: 30.74 s; DC with adaptive block size: 33.9 s.]
Fig. 4. Data read time for traversing a 48MB subvolume from a 7.2GB volume with a cache of 25MB. We test the SLICE iteration pattern for four optimizations: dynamic chunking (DC) with fixed blocks of size 16³ voxels, chunking with chunks of size 16³ voxels, dynamic chunking with analytical block size optimization and dynamic chunking with adaptive block size optimization. For adaptive block size optimization we show the average of ten iterations.
Block size optimizations. To test the block size optimizations we ran the same tests as before for the SLICE access pattern, but we turned on either the analytical block size optimization or the adaptive block size optimization. For the analytical optimization the block size used by the dynamic chunking module is determined before the iteration based on the information provided by the application (iteration subvolume and normal to the iteration plane) and information determined from the system (the amount of available memory). Figure 4 shows that we get about 25% better performance than chunking on average, while chunking is still better for certain traversal angles that match the way chunks are stored in the file. For testing the adaptive optimization we still use the SLICE access pattern but we do not provide the extra information needed to calculate the best block size for the given amount of cache memory. In this case, the block size used by the dynamic chunking module is determined from knowledge gathered from previous iterations. The graph for the adaptive block size optimization shows the mean of ten iterations for each iteration angle. For iterations that are interrupted because of cache thrashing we add the iteration time but we do not count the iteration when we calculate the mean. As an example, we present the sequence of block sizes used for iteration direction (heading, pitch) = (0, −90) in the ten iterations tested. The first iteration uses block size 16³ (supplied by the application), and then the adaptive optimization adjusts that to 32³ and then 64³. For block size 128³ cache thrashing occurs, which signals that the optimal block size is 64³. Cache thrashing is quickly detected from the fact that a block that was discarded is reused by the application. The remaining six iterations are run with the optimal block size 64³. Other iteration
directions behave similarly, the only difference being the block size at which thrashing occurs, which in turn determines the optimal block size.
6 Conclusions and Future Work
We have presented performance tests for dynamic chunking, a technique that can speed up common data iterations used in volume visualization algorithms. Our test results show that dynamic chunking performs 5.3 times better than the file system cache alone and that dynamic chunking over a linear file yields about 83% of the performance of paging over a chunked file. We presented optimizations that can improve performance further if additional information is either provided by the application or inferred by the dynamic chunking module. In the future, we plan to explore other possible dynamic chunking optimizations that can be applied to different iterations and visualization applications. We also plan to investigate different ways to get more information from the application about the access pattern.
Acknowledgments

We would like to thank Radu G. Lipsa for a valuable discussion on the block size optimization. This research was partially funded by the Welsh Institute of Visual Computing (WIVC).
References

1. Lipsa, D.R., Rhodes, P.J., Bergeron, R.D., Sparr, T.M.: Spatial Prefetching for Out-of-Core Visualization of Multidimensional Data. In: Proc. of SPIE, Visualization and Data Analysis, San Jose, CA, USA, vol. 6495–0G, pp. 1–8 (2007)
2. Engel, K., Hadwiger, M., Kniss, J.M., Lefohn, A.E., Salama, C.R., Weiskopf, D.: Real-Time Volume Graphics, Course Notes. In: Proc. of ACM SIGGRAPH, p. 29. ACM Press, New York (2004)
3. Silva, C., Chiang, Y., El-Sana, J., Lindstrom, P.: Out-of-Core Algorithms for Scientific Visualization and Computer Graphics, Course Notes for Tutorial 4. In: IEEE Visualization, Boston, MA, USA. IEEE Computer Society, Washington, DC, USA (2002)
4. Sarawagi, S., Stonebraker, M.: Efficient Organizations of Large Multidimensional Arrays. In: Proc. of the Tenth International Conference on Data Engineering, Washington, DC, USA, pp. 328–336. IEEE Computer Society, Los Alamitos (1994)
5. Chang, C., Kurc, T., Sussman, A., Saltz, J.: Optimizing Retrieval and Processing of Multi-Dimensional Scientific Datasets. In: Proc. of the Third Merged IPPS/SPDP Symposiums. IEEE Computer Society Press, Los Alamitos (2000)
6. Wetzel, A., Athey, B., Bookstein, F., Green, W., Ade, A.: Representation and Performance Issues in Navigating Visible Human Datasets. In: Proc. Third Visible Human Project Conference, NLM/NIH (2000)
7. Chiang, Y.J., Silva, C.T., Schroeder, W.J.: Interactive Out-Of-Core Isosurface Extraction. In: IEEE Visualization, pp. 167–174. IEEE Computer Society, Los Alamitos (1998)
8. Chiang, Y.J., Farias, R., Silva, C.T., Wei, B.: A Unified Infrastructure for Parallel Out-of-Core Isosurface Extraction and Volume Rendering of Unstructured Grids. In: Proc. of the IEEE Symposium on Parallel and Large-Data Visualization and Graphics, Piscataway, NJ, USA, pp. 59–66. IEEE Press, Los Alamitos (2001)
9. Farias, R., Silva, C.T.: Out-Of-Core Rendering of Large, Unstructured Grids. IEEE Computer Graphics and Applications 21, 42–50 (2001)
10. Pascucci, V., Frank, R.J.: Global Static Indexing for Real-Time Exploration of Very Large Regular Grids. In: Supercomputing 2001: Proc. of the 2001 ACM/IEEE Conference on Supercomputing (CDROM), p. 2. ACM Press, New York (2001)
11. Doshi, P., Rundensteiner, E., Ward, M.: Prefetching for Visual Data Exploration. In: Proc. Eighth International Conference on Database Systems for Advanced Applications, vol. 8, pp. 195–202 (2003)
12. Gao, J., Huang, J., Johnson, C., Atchley, S.: Distributed Data Management for Large Volume Visualization. In: IEEE Visualization (2005)
13. Brown, A., Mowry, T.: Compiler-Based I/O Prefetching for Out-of-Core Applications. ACM Trans. on Computer Systems 19 (2001)
14. Rhodes, P.J., Tang, X., Bergeron, R.D., Sparr, T.M.: Iteration Aware Prefetching for Large Multidimensional Scientific Datasets. In: SSDBM 2005: Proc. of the 17th International Conference on Scientific and Statistical Database Management, pp. 45–54. Lawrence Berkeley Laboratory, Berkeley (2005)
15. Chisnall, D., Chen, M., Hansen, C.: Knowledge-Based Out-of-Core Algorithms for Data Management in Visualization. In: EUROVIS - Eurographics/IEEE VGTC Symposium on Visualization, pp. 107–114 (2006)
16. Cox, M., Ellsworth, D.: Application-Controlled Demand Paging for Out-of-Core Visualization. In: IEEE Visualization, p. 235. IEEE Computer Society Press, Los Alamitos (1997)
17. Levoy, M.: Display of Surfaces from Volume Data. IEEE Computer Graphics and Applications 8, 29–37 (1988)
18. Kaufman, A., Shimony, E.: 3D Scan-Conversion Algorithms for Voxel-Based Graphics. In: SI3D 1986: Proc. of the 1986 Workshop on Interactive 3D Graphics, pp. 45–75. ACM Press, New York (1987)
19. Denning, P.J.: The Working Set Model for Program Behavior. Commun. ACM 11, 323–333 (1968)
20. Dunn, F., Parberry, I.: 3D Math Primer for Graphics and Game Development. Wordware Publishing Inc., Plano (2002)
21. Java.net: Java Bindings for OpenGL (JSR-231), online document (2008), https://jogl.dev.java.net/
Visualization of the Molecular Dynamics of Polymers and Carbon Nanotubes

Sidharth Thakur1, Syamal Tallury2, Melissa A. Pasquinelli2, and Theresa-Marie Rhyne1

1 Renaissance Computing Institute, North Carolina, USA
2 College of Textiles, North Carolina State University, USA
[email protected], [email protected], melissa [email protected], [email protected]
Abstract. Research domains that deal with complex molecular systems often employ computer-based thermodynamics simulations to study molecular interactions and investigate phenomena at the nanoscale. Many visual and analytic methods have proven useful for analyzing the results of molecular simulations; however, these methods have not been fully explored in many emerging domains. In this paper we explore visual-analytics methods to supplement existing standard methods for studying the spatial-temporal dynamics of the polymer-nanotube interface. These methods are a first step toward the overall goal of understanding macroscopic properties of the composites by investigating the dynamics and chemical properties of the interface. We discuss a standard computational approach for comparing polymer conformations using numerical measures of similarity and present matrix- and graph-based representations of the similarity relationships for some polymer structures.
1 Introduction
Polymers are important constituents in many modern materials such as textile fibers, packaging, and artificial implants. Although polymer materials have many desirable and controllable characteristics, some properties like mechanical strength and electrical conductance can be enhanced by reinforcing them with carbon nanotubes (CNT) [1], which are tube-like molecules comprised of a honeycomb network of carbon atoms. The properties of a polymer-CNT composite are dictated not only by the individual properties of the polymer and CNT, but also by the synergy that arises from their interfacial regions [2,1]. Information about the interface provides researchers the opportunity to tune desired material properties. For example, effective dispersion of CNTs in a polymer nanocomposite can be achieved by reducing the surface reactivity of CNTs through wrapping or bonding polymer molecules to its surface. The complex secondary structures of the polymers as well as the chemical composition of the interface are considered to have a strong influence on the physical properties of the nanocomposites. Due to the minute size of the interfacial regions, usually on the order of nanometers, conventional characterization techniques to measure properties at
the interface are not always applicable. Computer simulations such as molecular dynamics and Monte Carlo simulations are a complementary tool to experiments as they can zoom into molecular details in interfacial regions, and have been used to look at many systems, including polymer-CNT composites[2,3,4,5,6,7]. By post-processing the results of these simulations with visual-analytics, complex relationships can be generated that can then be used to tune the desired characteristics of the material and thus accelerate the development process. The overall goal of this ongoing work is to develop relationships that can be used to correlate molecular-level details to experimentally measurable quantities, such as mechanical properties. The first step is to develop visual-analytic methodologies for exploring how the characteristics of a material are dictated by the conformational dynamics and chemical composition of the molecular constituents, particularly at interfacial regions. This paper presents our initial efforts toward understanding and quantifying the key interfacial relationships: (a) visualization of spatial and temporal dynamics of polymer molecules during their interactions with CNTs, and (b) analysis of the spatial structures or conformations of polymers using computational and visual-analytic methods. We begin the rest of the paper in Section 2 with a brief background of our work. In Section 3 we present our approach for visualizing polymer conformations. Section 4 discusses a quantitative approach for analyzing the conformations. Finally, we conclude and discuss future work in Section 5.
2 Background
Polymers are long chain molecules that are composed of several repeating units of one or more types. Although polymers exhibit processibility as well as special mechanical, electronic, and flow-related (rheological) properties, advancements in nanotechnology and polymer engineering have enabled the fabrication of polymer-CNT composites that enhance the material characteristics beyond what can be obtained from polymers themselves due to the unique properties of the interfacial regions[1]. However, since experimental measurements of the interfacial regions of nanocomposites are challenging to perform, there is limited empirical information on how the structure, conformation, and chemical conformation of the interface dictate the properties of these materials. One important approach to study the properties of nanocomposites involves computer-based simulations of polymers interacting with nanoparticles such as CNTs. One such tool is molecular dynamics (MD) simulations, which employ time evolution of atom-level interactions to produce an ensemble of spatial structures of complex molecular assemblies such as polymer-CNT systems [3,4,8]. MD simulations take initial molecular configurations (e.g., bonding and atomic positions) as input and apply Newtonian mechanics to evolve the atomic coordinates as a function of time, typically on the order of nanoseconds. The output of a single MD run is a trajectory that includes the time evolution of the atomic coordinates (or the molecular conformations) with the corresponding energies of the molecular assembly at each time step. This trajectory can be further analyzed by
taking time averages that can produce information such as statistical thermodynamics quantities and radial distribution functions for atom-atom interactions. However, it is difficult to extract relationships directly from these trajectories that correlate the interfacial characteristics to the properties of a material, and thus its use as a tool for tuning such interfaces for producing desired material properties is limited. Our goal is to use visual-analytics tools to fill this gap. This paper outlines our application of standard visual-analytics methods to supplement the post-processing of data from MD simulations of polymer-CNT systems. We contrast the spatial-temporal patterns in two standard spatial groupings of the polymer backbone atoms. The two groupings include: (a) paths of polymers over the multiple time steps in an MD simulation, and (b) trajectories of the backbone atoms.
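To make the two groupings concrete, the short sketch below (our illustration, not part of the authors' tool chain; the array layout and names are assumed) treats an MD trajectory as a T x N x 3 array of backbone-atom coordinates and slices it both ways.

```python
import numpy as np

# Hypothetical trajectory: T time steps, N backbone atoms, 3 coordinates each.
T, N = 1000, 60
traj = np.random.rand(T, N, 3)          # stand-in for coordinates read from an MD run

# (a) Path of the polymer over time: one conformation (N x 3 point set) per time step.
conformations = [traj[t] for t in range(T)]

# (b) Trajectories of individual backbone atoms: one (T x 3) polyline per atom.
atom_trajectories = [traj[:, i, :] for i in range(N)]

# Example time average: mean squared displacement of each atom from its initial position.
msd = ((traj - traj[0]) ** 2).sum(axis=2).mean(axis=0)   # shape (N,)
```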
3 Visualizations of the Conformations of Polymers
As our first objective, we exploit straightforward graphical representations to generate insightful visualizations of the results of the MD simulations of polymer molecules. Our visualizations consist of a direct representation of the paths of a polymer over all of the time steps in an MD simulation and are created by plotting the positions of all of the atoms on the polymer backbone over all of the time steps as three-dimensional anti-aliased points¹. Unique colors are used for color-coding the points to distinguish individual conformations and to expose the temporal evolution of the polymer conformations. The visualization of molecular trajectories using color-coding is quite common in many domains [9,10,11] and is exploited in our work for displaying and exploring time-varying characteristics of polymer molecules. We also visualize trajectories of individual backbone carbon atoms by plotting polylines through locations of each backbone atom over all of the time steps. These visualizations are not commonly used for MD trajectories, but they allow domain researchers to compare potentially different spatial-temporal characteristics of individual atoms or groups of atoms on the polymer backbone and can visualize chemical driving forces that may arise during interfacial mechanisms.

Results and Discussion. We chose two polymers as test cases that have similar chemical composition but different stiffness of connectivity among atoms: poly(propylene) (PP) has flexible backbone connectivity, and poly(acetylene) (PA) has stiff backbone connectivity. The data was obtained from a 3.2 ns run with the DL POLY 2.19 [13] software program; refer to [2] for more details. Figure 1 and figure 2 are visualizations of the paths of each polymer as it evolves in time. The conformation visualizations provide the spatial arrangements of polymer chains about the CNT (in gray) and migration of the chains as a function of time.
¹ We use our OpenGL-based prototype to create the conformation visualizations; our application reads molecular data in the DL POLY 2.19 data format and exploits standard exploratory tools like three-dimensional transformations and selections for closer inspection of the polymer conformations.
Fig. 1. Visualization of the paths of the polymer PP over all of the time steps in an MD simulation. Colors correspond to the time steps in the simulation. (inset) Two distinct conformations of the polymer from an animation in VMD[12].
Fig. 2. Visualization of the paths of the polymer PA over all the time steps. (inset A) Projections of the polymer PA in three orthogonal planes during the onset of wrapping of the polymer around the nanotube (see the projection XY). (inset B) Conformations of the polymer during two different time steps shown using VMD[12].
For example, a comparison of figures 1 and 2 reveals the difference in coverage of the CNT by both polymers, and that the stiffer connectivity for the PA polymer provides greater surface coverage. In the figures we also contrast our visualizations to typical MD output from a standard molecular visualization tool called Visual Molecular Dynamics (VMD) [12], where polymer structures at various time steps are superimposed onto the CNT. Our visualizations provide a global context for comparing the polymer conformations, which is often missing or difficult to obtain using the standard MD analysis tools. In addition, our visualizations optionally display 2D projections of the polymers during animations of the MD simulations (inset A in figure 2). From the projections, the domain researchers can easily decipher the wrapping behavior along each direction, such as along the diameter of the CNT (xy projection). What follows are other specific insights that can be gained from our conformation visualizations.

Visual overview of the hull of the conformations: Our visualizations provide global overviews of the spatial-temporal evolution of the polymer molecules during MD simulations and enable domain experts to compare conformations of polymers with different chemical composition or connectivity or results of simulations with different initial conditions.
Distribution of the conformations: The visualizations reveal the relative density of the conformations in different phases of a simulation; for example, whether the transition between distinct sets of conformations is smooth or abrupt. Visual inspection for a detailed analysis: Our conformation visualizations can be used for identifying unusual or unexpected spatial-temporal behavior of groups of atoms or polymer chain segments during an MD simulation. These sub-structures may subsequently be studied in greater detail. Finally, the custom exploratory tools in our visualizations can supplement other standard tools (e.g., VMD) for exploring complex dynamics of polymers-CNT systems. Although we have listed some useful features of our methods for visualizing polymer conformation, there are some limitations to our approach. For example, the PA polymer aligns parallel to the CNT during a portion of the MD trajectory (see inset B in figure 2); however, the parallel alignment is not immediately apparent due to the dense representation in our visualizations. Another general limitation is that the visualizations provide mostly global qualitative information about the conformations and are often lacking in detail. Therefore, we overcome some of its limitations by supplementing our visualization approach with a more quantitative computational method that we will discuss in the next section.
4 Analysis of the Conformations of Polymers
In many domains dealing with complex molecular systems, computational analytic approaches are an indispensable tool for studying molecular conformations. One standard computational technique is based on using numerical metrics for describing conformational similarities. While this and other quantitative approaches have been used for studying molecular conformations in some domains [14,15,16,11,17,18], few previous studies have explored these methods for investigating the molecular dynamics of polymer-CNT systems. As the second objective in our work, we employ a computational approach to compare similarities among polymer conformations. We also discuss matrix- and graph-based representations for visualizing the similarity relationships.
4.1 Theory-Computational Analysis of Molecular Conformations
A standard quantitative comparison of molecular conformations is based on a two-step analytic procedure described in [15]. The procedure involves (1) computation of a numerical measure describing each conformation, and (2) construction and analysis of a correlation matrix that catalogs the similarities between every pair of conformations based on the chosen numerical measure. Table 1 lists some of the numeric measures we use for exploring polymer conformations that are based on geometric properties of the polymer chains. Some other potentially useful measures are also available; for example, a standard measure is the net moment of inertia of a polymer molecule with respect to a fixed CNT [2], which can provide information on the wrapping behavior of some polymers. Another metric relevant to chain-like molecules such as polymers
is persistence length [8], which represents the stiffness of the molecule strands. The measures can also use non-geometric information such as energies of polymer conformations computed during the MD simulations. Due to limited space, however, we show results only for the two metrics discussed in Table 1.

Table 1. Numeric similarity measures

Inter-atomic distances: A useful similarity measure is inter-atomic distances [19], and it is particularly suitable because the distances remain invariant under three-dimensional transformations like rotations and translations [15]. The measure is constructed in two steps by first defining feature vectors based on the distances:

d = \{\, d_{ij}(x) = |x_i - x_j|, \; i, j \in 0, \dots, N-1 \,\}    (1)

where the set x = (x_1, x_2, \dots, x_{N-1}) corresponds to the three-dimensional positions of atoms or centroids of groups of atoms on the backbone of a given polymer chain. The second step involves the computation of distances between every pair of conformations using a root mean square error:

D(x, y) = \sqrt{\frac{1}{N(N-1)/2} \sum_{i>j} \bigl(d_{ij}(x) - d_{ij}(y)\bigr)^2}    (2)

Bond-orientational order: Bond-orientational order is a geometry-based metric [4], which represents the global alignment of a polymer chain with a given fixed axis. The measure is defined as:

d = \Bigl\{\, d_i = \frac{1}{N-3} \sum_{i=2}^{n-1} \frac{3\cos^2\psi_i - 1}{2}, \; i \in 0, \dots, N-1 \,\Bigr\}    (3)

where \psi_i is the angle between average vectors between pairs of every other backbone atom (called sub-bond vectors) and the z axis. The conformational distances are computed as D_{ij} = (d_i - d_j), where i, j \in 0, \dots, N-1.
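As a concrete reading of Table 1 (our own sketch, not the authors' code; the trajectory layout, the choice of second-neighbor backbone vectors as sub-bonds, and the absolute difference in the last line are assumptions), the two measures yield conformation-by-conformation distance matrices as follows.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def interatomic_distance_matrix(traj):
    """Eqs. (1)-(2): RMS difference of inter-atomic distance vectors between conformations.
    traj: (T, N, 3) backbone positions (assumed layout)."""
    T = traj.shape[0]
    feats = np.stack([pdist(traj[t]) for t in range(T)])   # d_ij(x) for each conformation, shape (T, N(N-1)/2)
    n_pairs = feats.shape[1]
    # D(x, y) = sqrt( (1 / (N(N-1)/2)) * sum_{i>j} (d_ij(x) - d_ij(y))^2 )
    return squareform(pdist(feats) / np.sqrt(n_pairs))

def bond_orientational_distance_matrix(traj, axis=np.array([0.0, 0.0, 1.0])):
    """Eq. (3): per-conformation bond-orientational order w.r.t. a fixed (unit) axis,
    then the difference |d_i - d_j| as the conformational distance."""
    subbonds = traj[:, 2:, :] - traj[:, :-2, :]            # vectors between every other backbone atom
    cos_psi = (subbonds @ axis) / np.linalg.norm(subbonds, axis=2)
    order = ((3.0 * cos_psi**2 - 1.0) / 2.0).mean(axis=1)  # one scalar per conformation
    return np.abs(order[:, None] - order[None, :])
```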
4.2 Matrix-Based Visualization of Polymer Conformations
Matrix-based representations are a standard approach for visualizing pair-wise correlations in dense data and can be obtained in a straightforward manner by assembling the pair-wise similarity values in a square symmetric matrix as shown in figure 3(a). Matrix-based visualizations have been used previously for effective exploration of molecular conformations [18,20,21]; we exploit the matrix visualizations to examine similar spatial and temporal patterns of polymer conformations during MD simulations. The results for comparing the similarity relationships among polymer conformations for the polymers PP and PA are given in figure 3. We will interpret these results below.
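A matrix view like those in Fig. 3 can be produced directly from any of the similarity matrices; a minimal matplotlib sketch (ours, with an assumed colormap choice) is given below.

```python
import matplotlib.pyplot as plt

def show_similarity_matrix(D, title="Conformation similarity matrix"):
    """D: square symmetric matrix of pairwise conformational distances/similarities."""
    fig, ax = plt.subplots()
    im = ax.imshow(D, cmap="viridis", origin="lower")   # row/column index = conformation (time step)
    ax.set_xlabel("conformation j")
    ax.set_ylabel("conformation i")
    ax.set_title(title)
    fig.colorbar(im, ax=ax, label="distance")
    plt.show()
```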
Fig. 3. (a) Conceptual diagram of a matrix of similarities (Cij ) between conformations i and j. (b-c) Similarity matrices corresponding to the time paths of polymers PP and PA based on inter-atomic distances, and (d-e) similarity matrices for the trajectories of backbone atoms in the two polymers using the inter-atomic distances. Note that the matrices have different minimum and maximum values for the similarity measures.
4.3 Graph-Based Visualization of Polymer Conformations
Graph-based representation is another useful standard approach for exploring similarity relationships between molecular conformations [15]. In this approach, a two-dimensional undirected graph is used wherein nodes represent individual conformations and Euclidean distances between the nodes represent the similarities between the corresponding conformations. The graph-based representations are useful for displaying clusters of similar conformations, and the graph topology can reveal higher-order relationships among the conformations, such as transitions between distinct conformations, which may not be immediately apparent using matrix-based visualizations.

Graph Generation. Some standard techniques for constructing the 2D graphs of conformational similarities include multi-dimensional scaling (MDS) [22], graph drawing or graph layout algorithms (GLA) [23,24], and numerical optimization techniques like conjugate gradients. The basic principle in these methods is the force-directed placement of the nodes of the graph by minimizing the following global error due to total inter-node distances [25]:

E = \sum_{i>j} \bigl( |a_i - a_j| - D_{ij} \bigr)^2,

where D_{ij} is the similarity measure or distance between the two conformations C_i and C_j, and a_i, a_j \in \mathbb{R}^2 are points in the 2D graph. While GLA-based methods can sometimes be more suitable for visualizing similarities of a large number of entities [23], graphing techniques based on efficient versions of MDS are also equally applicable [26,27]. In our work we employ the Kamada-Kawai graph layout algorithm [25] that is available in a standard network analysis tool called Pajek [28]. We chose this option primarily for the convenience in using graph drawing methods and other useful network analysis tools in Pajek. The following procedure was used for generating the graphs of similarity relationships among polymer conformations:

1. Load the polymer conformations into our prototype visualization application
2. Compute conformational similarities and the associated correlation matrices
3. Export the correlation matrix as a Pajek network (.net) data file (see the sketch after this list)
4. Generate the graph associated with the exported .net file in Pajek:
   – select "Layout > Energy > Starting Positions - Random"
   – turn off arcs under "Options > Lines > Draw Lines"
   – select "Options > Values of lines - Dissimilarities"
   – run the algorithm under "Layout > Energy > Kamada-Kawai - Free"
5. Save and export the final graph as a .net file
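For step 3, a Pajek .net file is just a plain-text list of vertices and weighted edges; the sketch below (ours, with an assumed layout for a complete weighted graph and assumed vertex labels) writes a similarity matrix in that format, and also shows a Kamada-Kawai embedding computed directly in networkx as an alternative to steps 4-5.

```python
import numpy as np
import networkx as nx

def export_pajek(D, path="conformations.net"):
    """Write a symmetric dissimilarity matrix D as a complete weighted Pajek network."""
    n = D.shape[0]
    with open(path, "w") as f:
        f.write(f"*Vertices {n}\n")
        for i in range(n):
            f.write(f'{i + 1} "conf{i + 1}"\n')
        f.write("*Edges\n")
        for i in range(n):
            for j in range(i + 1, n):
                f.write(f"{i + 1} {j + 1} {D[i, j]:.6f}\n")

def kamada_kawai_embedding(D):
    """Force-directed 2D embedding; edge weights are used by networkx as target distances."""
    G = nx.complete_graph(D.shape[0])
    for i, j in G.edges():
        G[i][j]["dist"] = float(D[i, j])
    return nx.kamada_kawai_layout(G, weight="dist")   # dict: node -> 2D position
```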
Results and Discussion. Figure 4 shows graph-based representations of similarity relationships for polymer paths and trajectories of backbone atoms for the polymer PA using two different similarity measures, namely inter-atomic distances (top) and bond-orientational order (bottom). Our first observation pertains to the paths of the polymer PA during the MD simulation. In the graphs in figures 4 (a) and (c), the highly dense clusters at later time steps represent very similar and thermodynamically stable polymer configurations. The transitions between the clusters represent the onset of polymer-CNT interactions and are clearly visible in the graph. Standard methods like animations can also be used for investigating some of the behavior of the polymer-CNT systems during MD simulations. However, the existing tools often lack the customized exploratory tools and features needed to pose and explore questions specific to our domain. Our matrix- and graphbased representations allow domain researchers to explore some other interesting spatial-temporal associations. For example, the graphs in figures 4 (b) and (d) indicate different characterization of the trajectories of the PA backbone atoms based on two different metrics. According to figure 4(b), the inter-atomic distances along the trajectories of backbone atoms remain locally the same but change gradually over the entire length of the polymer chain. However, figure 4(d) suggests that the bond-orientational orders in the trajectories change by large amounts, though some loose local clusters can be seen in the graph. It is interesting also to compare the matrix- and graph-based representations of the similarity relationships. Figures 3(c) and 4(a) show the two representations in which it is relatively straightforward to identify distinct clusters of similar conformations that correspond to the wrapping behavior of the polymer PA. However, while it is easy to isolate the clusters in the graph-based views, the matrix-based representation can provide information on the distribution of similar polymer conformations, which is harder to ascertain using the graphs. Finally, a comparison of the matrices in figure 3(d) and (e) highlights some of the differences in the trajectories of backbone carbon atoms for the polymers PP and PA. The correlations among the trajectories in the matrix shown in figure 3(d) are generally weak; on the contrary, the correlations in the matrix for PA in figure 3(e) are relatively stronger and more complex - the intricate patterns in the matrix represent the regularities in inter-atomic distances because of persistent loops in the PA molecular chain (inset in figure 2). These general differences in correlations among atom trajectories of PP and PA can be attributed to the differences in rigidity of the backbones of the two polymers, i.e., a flexible chain in PP and a rigid backbone in PA.
Fig. 4. Graphs representing the conformations of the polymer PA using two different similarity measures: (top) inter-atomic distances, and (bottom) bond-orientational order. Note that the nodes are color coded using different color-mapping scheme to expose the temporal evolutions of polymer chains (left) and backbone atoms (right).
5 Conclusion and Future Work
We have highlighted some standard visual-analytics techniques that can be used for investigating the characteristics of the molecular dynamics of the interface of polymers and CNTs. Our visualizations provide domain researchers with informative overviews of spatial-temporal patterns of polymers during the simulations. We describe some numeric metrics that can be used in a standard computational method to compare polymer conformations and discuss matrix- and graph-based representations of the computational analysis method. We are currently working on exploring other statistically relevant numeric metrics (e.g., moment of inertia and persistence length) to investigate properties of polymer conformations and compare them with experimentally measurable properties of the polymer-CNT systems. We also plan to compare our current results with the dynamics of polymers without a CNT present.
Acknowledgments This work was conducted at the Renaissance Computing Institute’s (Renci) Engagement Facility at North Carolina State University (NCSU). We thank our visualization colleagues in Renci for their valuable feedback. Thanks to Ketan Mane for providing his code for converting similarity matrices into Pajek input file format. This work was partially funded by startup funds provided to M.A.P by the Department of Textile Engineering, Chemistry & Science, NCSU.
References
1. Tasis, D., Tagmatarchis, N., Bianco, A., Prato, M.: Chemistry of carbon nanotubes. Chemical Reviews 106, 1105–1136 (2006)
2. Tallury, S.S., Pasquinelli, M.A.: Molecular dynamics simulations of flexible polymer chains wrapping single-walled carbon nanotubes (Submitted to J Phys Chem B)
3. Tuzun, R., Noid, D., Sumpter, B., Christopher, E.: Recent advances in polymer molecular dynamics simulation and data analysis. Macromol. Theory Simul. 6, 855–880 (1997)
4. Yang, H., Chen, Y., Liu, Y., Cai, W.S., Li, Z.S.: Molecular dynamics simulation of polyethylene on single wall carbon nanotube. J. Chem. Phys. 127, 94902 (2007)
5. Fujiwara, S., Sato, T.: Molecular dynamics simulations of structural formation of a single polymer chain: Bond-orientational order and conformational defects. J. Chem. Phys. 107, 613–622 (1997)
6. Zheng, Q., Xue, Q., Yan, K., Hao, L., Li, Q., Gao, X.: Investigation of molecular interactions between SWNT and polyethylene/polypropylene/polystyrene/polyaniline molecules. Journal of Physical Chemistry C 111, 4628–4635 (2007)
7. Gurevitch, I., Srebnik, S.: Monte Carlo simulation of polymer wrapping of nanotubes. Chemical Physics Letters 444, 96–100 (2007)
8. Flory, P.J.: Statistical Mechanics of Chain Molecules. Hanser Publishers (1989)
9. Callahan, T.J., Swanson, E., Lybrand, T.P.: MD Display: An interactive graphics program for vis of MD trajectories. J Mol Graph 14, 39–41 (1996)
10. Schmidt-Ehrenberg, J., Baum, D., Hege, H.C.: Visualizing dynamic molecular conformations. In: IEEE Vis 2002, pp. 235–242. IEEE Comp. Soc. Press, Los Alamitos (2002)
11. Bidmon, K., Grottel, S., Bös, F., Pleiss, J., Ertl, T.: Visual abstractions of solvent pathlines near protein cavities. Comput. Graph. Forum 27, 935–942 (2008)
12. Humphrey, W., Dalke, A., Schulten, K.: VMD – Visual Molecular Dynamics. Journal of Molecular Graphics 14, 33–38 (1996)
13. Smith, W.: Guest editorial: DL POLY - applications to molecular simulation II. Molecular Simulation 12, 933–934 (2006)
14. Huitema, H., van Liere, R.: Interactive visualization of protein dynamics. In: IEEE Vis, pp. 465–468. IEEE Computer Society Press, Los Alamitos (2000)
15. Best, C., Hege, H.C.: Visualizing and identifying conformational ensembles in molecular dynamics trajectories. Computing in Science and Engg. 4, 68–75 (2002)
16. Yang, H., Parthasarathy, S., Ucar, D.: A spatio-temporal mining approach towards summarizing and analyzing protein folding trajectories. Alg. Mol. Bio. 2, 3 (2007)
17. Grottel, S., Reina, G., Vrabec, J.: Visual verification and analysis of cluster detection for molecular dynamics. IEEE TVCG 13, 1624–1631 (2007)
18. Shenkin, P.S., McDonald, D.Q.: Cluster analysis of molecular conformations. J. Comput. Chem. 15, 899–916 (1994)
19. Maggiora, G., Shanmugasundaram, V.: Molecular similarity measures. Methods Mol. Biol. 275, 1–50 (2004)
20. Wu, H.M., Tzeng, S., Chen, C.H.: Matrix Visualization. In: Handbook of Data Visualization, pp. 681–708. Springer, Heidelberg (2008)
21. Chema, D., Becker, O.M.: A method for correlations analysis of coordinates: Applications for molecular conformations. JCICS 42, 937–946 (2002)
22. Kruskal, J.: Multidimensional scaling: A numerical method. Psychometrica 29, 115–129 (1964)
23. DeJordy, R., Borgatti, S.P., Roussin, C., Halgin, D.S.: Visualizing proximity data. Field Methods 19, 239–263 (2007)
24. Herman, I., Melançon, G., Marshall, M.S.: Graph visualization and navigation in information visualization: a survey. IEEE TVCG 6, 24–43 (2000)
25. Kamada, T., Kawai, S.: An algorithm for drawing general undirected graphs. Inf. Process. Lett. 31, 7–15 (1989)
26. Andrecut, M.: Molecular dynamics multidimensional scaling. Physics Letters A 373, 2001–2006 (2009)
27. Agrafiotis, D.K., Rassokhin, D.N., Lobanov, V.S.: Multidim. scaling and visualization of large molecular similarity tables. J. Comput. Chem. 22, 488–500 (2001)
28. Batagelj, V., Mrvar, A.: Pajek - program for large network analysis. Connections 21, 47–57 (1998)
Propagation of Pixel Hypotheses for Multiple Objects Tracking

Haris Baltzakis and Antonis A. Argyros

Institute of Computer Science, FORTH
{xmpalt,argyros}@ics.forth.gr
http://www.ics.forth.gr/cvrl/
Abstract. In this paper we propose a new approach for tracking multiple objects in image sequences. The proposed approach differs from existing ones in important aspects of the representation of the location and the shape of tracked objects and of the uncertainty associated with them. The location and the speed of each object is modeled as a discrete time, linear dynamical system which is tracked using Kalman filtering. Information about the spatial distribution of the pixels of each tracked object is passed on from frame to frame by propagating a set of pixel hypotheses, uniformly sampled from the original object’s projection to the target frame using the object’s current dynamics, as estimated by the Kalman filter. The density of the propagated pixel hypotheses provides a novel metric that is used to associate image pixels with existing object tracks by taking into account both the shape of each object and the uncertainty associated with its track. The proposed tracking approach has been developed to support face and hand tracking for human-robot interaction. Nevertheless, it is readily applicable to a much broader class of multiple objects tracking problems.
1 Introduction
This paper presents a novel approach for multiple object tracking in image sequences, intended to track skin-colored blobs that correspond to human hands and faces. Vision-based tracking of human hands and faces constitutes an important component in gesture recognition systems with many potential applications in the field of human-computer and/or human-robot interaction. Some successful approaches for hand and face tracking utilize ellipses to model the shape of the objects on the image plane [1–5]. Typically, simple temporal filters such as linear, constant-velocity predictors are used to predict/propagate the locations of these ellipses from frame to frame. Matching of predicted ellipses with the extracted blobs is done either by correlation techniques or by using statistical properties of the tracked objects. In contrast to blob tracking approaches, model based ones [6–11] do not track objects on the image plane but, rather, on a hidden model-space. This is commonly facilitated by means of sequential Bayesian filters such as Kalman or
particle filters. The state of each object is assumed to be an unobserved Markov process which evolves according to specific dynamics and which generates measurement predictions that can be evaluated by comparing them with the actual image measurements. Model based approaches are commonly assumed to be more suitable to track complex and/or deformable objects whose image projections cannot be modeled with simple shapes. Human hands, especially when observed from a short distance, fall in this category. Despite the fact that standard Bayesian filtering does not explicitly handle observation-to-track assignments, the sophisticated temporal filtering which is inherent to model based approaches allows them to produce better data association solutions. This is particularly important for multiple objects tracking, where it is common for tracked objects to become temporarily occluded by other tracked or non-tracked objects. Among model-based approaches, particle filtering [12] has been successfully applied to object tracking, both with edge-based [12] and kinematic [7, 8] imaging models. With respect to the data association problem, particle filtering offers a significant advantage over other filtering methods because it allows for different, locally-optimal data association solutions for each particle which are implicitly evaluated through each particle’s likelihood. However, as with any other modelbased approach, particle filters rely on accurate modeling, which in most cases leads to an increased number of unknown parameters. Since the number of required particles for effective tracking is exponential to the number of tracked parameters, particle filter based tracking is applicable only to problems where the observations can be explained with relatively simple models. In this paper we propose a blob-tracking approach that differs significantly from existing approaches in (a) the way that the position and shape uncertainty are represented and (b) the way that data association is performed. More specifically, information about the location and shape of each tracked object is maintained by means of a set of pixel hypotheses that are propagated from frame to frame according to linear object dynamics computed by a Kalman filter. Unlike particle filters which correspond to object pose hypotheses in the model space, the proposed propagated pixel hypotheses correspond to single pixel hypotheses in the observation space. Another significant difference is that, in our approach, the distribution of the propagated pixel hypotheses provides a representation for the uncertainty in both the position and the shape of the tracked object. Moreover, as it will be shown in the following sections, the local density of pixel hypotheses provides a meaningful metric to associate observed skin-colored pixels with existing object tracks, enabling an intuitive, pixel-based data association approach based on the joint-probabilistic paradigm. The proposed approach has been tested in the context of a human-robot interaction application involving detection and tracking of human faces and hands. Experimental results demonstrate that the proposed approach manages to successfully track multiple interacting deformable objects, without requiring complex models for the tracked objects or their motion.
Fig. 1. Block diagram of the proposed approach
2 Problem Description and Methodology
A tracking algorithm must be able to maintain the correct labeling of the tracked objects, even in cases of partial or full occlusions. Typically, this requirement calls for sophisticated modeling of the objects' motion, shapes and dynamics (i.e. how the shape changes over time). In this paper we present a blob tracker that handles occlusions, shape deformations and similarities in color appearance without making explicit assumptions about the motion or the shape of the tracked objects. The proposed tracker uses a simple linear model for object trajectories and the uncertainty associated with them. Moreover, it does not rely on an explicit model for the shape of the tracked object. Instead, the shapes of the tracked objects and the associated uncertainty are represented by a set of pixel hypotheses that are propagated over time using the same linear dynamics as the ones used to model the object's trajectory. An overview of the proposed approach is illustrated in Fig. 1. The first step in the proposed approach is to identify pixels that are likely to belong to tracked objects. In the context of the application under consideration, we are interested in tracking human hands and faces. Thus, the tracker implemented in this paper tracks skin-colored blobs¹.
¹ The proposed tracking method can also be used to track blobs depending on properties other than skin color.
Fig. 2. Object’s state representation. (a) Observed blob (b)-(e) Examples of possible states. Ellipses represent iso-probability contours for the location of the object (i.e. the first two components of xt ). Dots represent the pixel hypotheses.
To identify pixels belonging to such objects we employ a Bayesian approach that takes into account their color as well as whether they belong to the foreground or not. Image pixels with high probability to belong to hand and face regions are then grouped into connected blobs using hysteresis thresholding and connected components labeling, as in [3]. Blobs are then assigned to objects which are tracked over time. More specifically, for each tracked object the following two types of information are maintained:

– The location and the speed of the object's centroid, in image coordinates. This is encoded by means of a 4D vector x(t) = [c_x(t), c_y(t), u_x(t), u_y(t)]^T, where c_x(t) and c_y(t) are the image coordinates of the object's centroid at time t and u_x(t) and u_y(t) are the horizontal and vertical components of its speed. A Kalman filter is used to maintain a Gaussian estimate x̂(t) of the above-described state vector and its associated 4 × 4 covariance matrix P(t).
– The spatial distribution of the object's pixels. This is encoded by means of a set H = {(x_i, y_i) : i = 1 . . . N} of N pixel hypotheses that are sampled uniformly from the object's blob and propagated from frame to frame using the dynamics estimated by the Kalman filter.

The representation described above is further explained in Fig. 2. Figure 2(a) depicts the blob of a hypothetical object (a human hand in this example). Figures 2(b)-(e) depict four possible states of the proposed tracker. The distribution of the propagated pixel hypotheses provides the metric used to associate measured evidence to existing object tracks. During the data association step, observed blob pixels are individually processed one-by-one in order to associate them with existing object tracks. After skin-colored pixels have been associated with existing object tracks, the update phase follows in two steps: (a) the state vector (centroid's location and speed) is updated using the Kalman filter's measurement-update equations and (b) pixel hypotheses are updated by resampling them from their associated blob pixels. The resampling step is important to avoid degenerate situations and to allow the object hypotheses to closely follow the blob's shape and size. Finally, track management techniques are employed to ensure that new objects are generated for blobs with pixels that are not assigned to any of the existing tracks and that objects which are not supported by observation are eventually removed from further consideration.
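One compact way to picture the per-object bookkeeping just described (a sketch of ours; the class name, field names, and default values are assumptions, not the authors' implementation):

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class TrackedObject:
    # Kalman estimate of the centroid state x = [cx, cy, ux, uy] and its covariance.
    x: np.ndarray = field(default_factory=lambda: np.zeros(4))
    P: np.ndarray = field(default_factory=lambda: np.eye(4) * 1e3)
    # Spatial distribution of the object's pixels: N pixel hypotheses (image coordinates).
    hypotheses: np.ndarray = field(default_factory=lambda: np.empty((0, 2)))
    # Bookkeeping used by track management (e.g. frames without supporting observations).
    missed_frames: int = 0

    def resample_from_blob(self, blob_pixels: np.ndarray, n: int = 200):
        """Re-draw the N pixel hypotheses uniformly from the pixels assigned to this track."""
        idx = np.random.choice(len(blob_pixels), size=n, replace=True)
        self.hypotheses = blob_pixels[idx].astype(float)
```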
Fig. 3. Blob detection. (a) Initial image, (b) foreground pixels, (c) skin-colored pixels, (d) resulting skin-colored blobs.
3 The Proposed Tracking Method
In this section we provide a detailed description of the proposed multiple objects tracking method.
3.1 Segmentation of Skin-Colored Foreground Blobs
The first step of the proposed approach is to detect skin-colored regions in the input images. For this purpose, a technique similar to [3, 13] is employed. Initially, background subtraction [14] is used to extract the foreground areas of the image. Then, for each pixel, P (s|c) is computed, which is the probability that this pixel belongs to a skin-colored foreground region s, given its color c. This can be computed according to the Bayes rule as P (s|c) = P (s)P (c|s)/P (c), where P (s) and P (c) are the prior probabilities of foreground skin pixels and foreground pixels having color c, respectively. Color c is assumed to be a 2D variable encoding the U and V components of the YUV color space. P (c|s) is the prior probability of observing color c in skin colored foreground regions. All three components in the right side of the above equation can be computed based on offline training. After probabilities have been assigned to each image pixel, hysteresis thresholding is used to extract solid skin color blobs and create a binary mask of foreground skin-colored pixels. A connected components labeling algorithm is then used to assign different labels to pixels that belong to different blobs. Size filtering on the derived connected components is also performed to eliminate small, isolated blobs that are attributed to noise. Results of the intermediate steps of this process are illustrated in Fig. 3. Figure 3(a) shows a single frame extracted out of a video sequence that shows a man performing various hand gestures in an office-like environment. Fig. 3(b) shows the result of the background subtraction algorithm and Fig. 3(c) shows skin-colored pixels after hysteresis thresholding. Finally, the resulting blobs (i.e. the result of the labeling algorithm) are shown in Fig. 3(d).
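A rough sketch of this segmentation stage (ours; it assumes pre-trained look-up tables for P(c|s) and P(c) over quantized U/V bins, a scalar prior P(s), an externally supplied foreground mask, and placeholder thresholds):

```python
import numpy as np
from scipy import ndimage

def skin_blobs(frame_uv, fg_mask, p_c_given_s, p_c, p_s=0.4, t_low=0.3, t_high=0.6, min_size=50):
    """frame_uv: (H, W, 2) U/V values quantized to histogram bins (assumed input format)."""
    u, v = frame_uv[..., 0], frame_uv[..., 1]
    p_skin = p_s * p_c_given_s[u, v] / np.maximum(p_c[u, v], 1e-9)   # Bayes rule, per pixel
    p_skin = p_skin * fg_mask                                        # only foreground pixels count

    # Hysteresis thresholding: keep low-threshold components that contain a high-threshold pixel.
    low_labels, _ = ndimage.label(p_skin > t_low)
    keep = np.unique(low_labels[(p_skin > t_high) & (low_labels > 0)])
    mask = np.isin(low_labels, keep)

    # Connected components labeling plus size filtering of the final skin mask.
    blobs, n_blobs = ndimage.label(mask)
    sizes = ndimage.sum(mask, blobs, index=np.arange(1, n_blobs + 1))
    for lbl in np.where(sizes < min_size)[0] + 1:
        blobs[blobs == lbl] = 0
    return blobs
```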
Fig. 4. Tracking hypotheses over time. (a), (b) uncertainty ellipses corresponding to predicted hypotheses locations and speed, (c), (d) propagated pixel hypotheses.
3.2 Tracking Blob Position and Speed
The dynamics of each tracked object are modeled by means of a linear dynamical system which is tracked using the Kalman filter [15, 16]. The state vector x(t) at time t is given as x(t) = (c_x(t), c_y(t), u_x(t), u_y(t))^T, where c_x(t), c_y(t) are the horizontal and vertical coordinates of the tracked object's centroid, and u_x(t), u_y(t) are the corresponding components of the tracked object's speed. The Kalman filter described above is illustrated in Figures 4(a) and 4(b) which show frames extracted from the same sequence as the one in Fig. 3. The depicted ellipses correspond to 95% iso-probability contours for the predicted location (smaller, red-colored ellipses) and speed (larger, purple-colored ellipses) of each tracked object's centroid. As can be verified, objects that move rapidly (e.g., object 2 in Fig. 4(a)) or objects that are not visible (e.g., object 2 in Fig. 4(b)) have larger uncertainty ellipses. On the other hand, objects that move slowly (e.g., faces) can be predicted with more certainty.
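A minimal constant-velocity Kalman filter for this four-dimensional state (our sketch; the frame period and noise magnitudes are placeholders, not values reported in the paper):

```python
import numpy as np

dt = 1.0                                   # one frame
F = np.array([[1, 0, dt, 0],               # constant-velocity transition for x = [cx, cy, ux, uy]
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], float)
H = np.array([[1, 0, 0, 0],                # only the centroid position is measured
              [0, 1, 0, 0]], float)
Q = np.eye(4) * 1.0                        # process noise (placeholder)
R = np.eye(2) * 4.0                        # measurement noise (placeholder)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):                       # z: measured centroid of the associated pixels
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P
```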
3.3 Pixel Hypotheses Propagation
Pixel hypotheses are propagated using the predicted state estimate x̂(t|t−1) and the predicted error covariance P(t|t−1) of the Kalman filter discussed in the previous section. More specifically, each pixel hypothesis (x_i, y_i) in H = {(x_i, y_i) : i = 1 . . . N} is propagated in time by drawing a new sample from

\mathcal{N}\left( \begin{pmatrix} x_i + \hat{u}_x(t|t-1) \\ y_i + \hat{u}_y(t|t-1) \end{pmatrix}, \; P_h(t|t-1) \right)    (1)

where û_x(t|t−1) and û_y(t|t−1) are the predicted velocity components (i.e. the third and fourth elements of x̂(t|t−1)) and P_h(t|t−1) is the top-left 2 × 2 submatrix of P(t|t−1). Figures 4(c) and 4(d) depict the predicted pixel locations (i.e. pixel hypotheses) that correspond to the object tracks shown in Figs. 4(a) and 4(b), respectively. As can be verified, tracks with larger uncertainty ellipses correspond to less concentrated pixel hypotheses. On the other hand, propagated pixel hypotheses tend to have higher spatial density for object tracks that are predictable with higher confidence.
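Propagating the pixel hypotheses according to Eq. (1) then reduces to adding the predicted velocity and sampling positional noise from the top-left block of the predicted covariance; a sketch with assumed names:

```python
import numpy as np

def propagate_hypotheses(hypotheses, x_pred, P_pred, rng=None):
    """hypotheses: (N, 2) pixel positions; x_pred, P_pred: Kalman-predicted state and covariance."""
    if rng is None:
        rng = np.random.default_rng()
    u_pred = x_pred[2:4]                       # predicted velocity (u_x, u_y)
    P_h = P_pred[:2, :2]                       # top-left 2x2 block: positional uncertainty
    noise = rng.multivariate_normal(np.zeros(2), P_h, size=len(hypotheses))
    return hypotheses + u_pred + noise         # one sample of Eq. (1) per hypothesis
```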
Fig. 5. Three objects merged into a single blob. Predicted pixel locations for each of the three objects (1st row), pixels finally assigned to each object (2nd row).
3.4 Associating Pixels with Objects
The purpose of the data association step is to associate observations with existing object tracks. In this paper, data association is performed on a pixel basis rather than a blob basis; i.e. each observed skin-colored pixel is individually associated to existing tracks. This permits pixels that belong to the same blob to be associated with different object tracks. The metric used to provide the degree of association between a specific skin-colored pixel with image coordinates (x, y) and a specific object track o_i is assumed to be equal to the local density of the propagated pixel hypotheses of this track at the location of this specific pixel. More specifically, to estimate the degree of association A(p, o_i) between pixel p and track o_i, we make use of the following metric:

A(p, o_i) = \alpha_i \, \frac{C^{P}_{N(p)}}{C_{N(p)}},    (2)

where N(p) = \{ p_k : \|p - p_k\| \le D \} is a neighborhood of pixel p, C^{P}_{N(p)} is the number of propagated pixel hypotheses of object track o_i within N(p), and C_{N(p)} is the total number of pixels in N(p). \alpha_i is a normalizing factor ensuring that the sum of all data association weights of (2) remains constant for each track over time. An 8-neighborhood (D = \sqrt{2}) has proven sufficient in all experiments. After pixels have been associated with tracked objects, weighted means (according to A(p, o_i)) are computed for each tracked object and used for the Kalman filter update phase. Pixel hypotheses are also resampled from the weighted distribution of the observed pixels. The above-described data association scheme follows the joint-probabilistic paradigm by combining all potential association candidates in a single, statistically most plausible, update.
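Eq. (2) can be evaluated for all skin pixels at once by rasterizing each track's propagated hypotheses into a count image and box-filtering it over the 3 x 3 (8-connected) neighborhood. In this sketch of ours, the (col, row) ordering of the hypotheses and the reduction of the normalizing factor alpha_i to a per-track sum-to-one rescaling are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def association_weights(hypotheses_per_track, skin_mask):
    """hypotheses_per_track: list of (N_i, 2) arrays of propagated hypotheses (col, row) per track;
    skin_mask: boolean (H, W) mask of observed skin-colored pixels."""
    H, W = skin_mask.shape
    weights = []
    for hyp in hypotheses_per_track:
        counts = np.zeros((H, W))
        cols = np.clip(np.round(hyp[:, 0]).astype(int), 0, W - 1)
        rows = np.clip(np.round(hyp[:, 1]).astype(int), 0, H - 1)
        np.add.at(counts, (rows, cols), 1.0)
        # uniform_filter(counts, 3) = (hypotheses of this track in the 3x3 box) / (pixels in the box),
        # i.e. C^P_N(p) / C_N(p) of Eq. (2); restrict it to observed skin pixels.
        A = uniform_filter(counts, size=3) * skin_mask
        if A.sum() > 0:
            A /= A.sum()                       # stand-in for the per-track normalization alpha_i
        weights.append(A)
    return weights
```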
Fig. 6. Tracking results for twelve segments of the office image sequence used in the previous examples. In all cases the algorithm succeeds in tracking the three hypotheses.
A notable case that is often encountered in practice is when all pixels of a single blob are assigned to a single track and vice versa (i.e. no propagated pixel hypotheses are associated with pixels of other blobs). In this case, resampling of pixel hypotheses is performed by uniformly sampling blob pixels. This permits pixel hypotheses to periodically re-initialize themselves and exactly follow the blob position and shape when no data association ambiguities exist. Figure 5 demonstrates how the proposed tracking algorithm behaves in a case where three objects simultaneously occlude each other, leading to difficult data association problems. The top row depicts the predicted pixel locations for each of the three valid tracks. The bottom row depicts the final assignment of blob pixels to tracks, according to the density of the predicted pixel hypotheses.
4 Experimental Results
Figure 6 depicts the tracker’s output for a number of frames of the image sequence comprising the running example used in Figs 3, 4 and 5. As can be observed, the tracker succeeds in keeping track of all the three hypotheses despite the occlusions introduced at various fragments of the sequence. The proposed tracker comprises an important building block of a vision-based, hand- and face-gesture recognition system which is installed on a mobile robot. The purpose of the system is to facilitate natural human-robot interaction while guiding visitors in large public spaces such as museums and exhibitions. The performance of the system has been evaluated for a three-weeks time in a large public place. Figure 7 depicts snapshots of three different image sequences captured at the installation site. Despite the fact that the operational requirements
148
H. Baltzakis and A.A. Argyros
Fig. 7. Tracking results from a real-world application setup
of the task at hand (i.e. unconstrained lighting conditions, unconstrained hand and face motion, varying and cluttered background, limited computational resources) were particularly challenging, the tracker operated for three weeks with results that, in most cases, proved sufficiently accurate to provide input to the hand- and face-gesture recognition system of the robot. During these experiments the algorithm ran on a standard laptop computer, operating on 640 × 480 images. At this resolution, the algorithm achieved a frame rate of 30 frames per second. Several video sequences obtained at the actual application site are available on the web².
5 Conclusions and Future Work
In this paper we have presented a novel approach for tracking multiple objects. The proposed approach differs from existing approaches in the way used to associate perceived blob pixels with existing object tracks. For this purpose, information about the spatial distribution of blob pixels is passed on from frame to frame by propagating a set of pixel hypotheses, uniformly sampled from the original blob, to the target frame using the object’s current dynamics, as estimated by means of a Kalman filter. The proposed approach has been tested in the context of face and hand tracking for human-robot interaction. Experimental results show that the method is capable of tracking several deformable objects that may move in complex, overlapping trajectories.
Acknowledgments This work was partially supported by the EU-IST project INDIGO (FP6-045388) and the EU-IST project GRASP (FP7-IP-215821).
² http://www.ics.forth.gr/~xmpalt/research/handfacetrack_pixelhyps/index.html
References
1. Birk, H., Moeslund, T., Madsen, C.: Real-time recognition of hand alphabet gestures using principal component analysis. In: Proc. Scandinavian Conference on Image Analysis, Lappeenranta, Finland (1997)
2. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: Real-time tracking of the human body. IEEE Trans. Pattern Analysis and Machine Intelligence 19, 780–785 (1997)
3. Argyros, A.A., Lourakis, M.I.A.: Real-time tracking of multiple skin-colored objects with a possibly moving camera. In: Proc. European Conference on Computer Vision, Prague, Czech Republic, pp. 368–379 (2004)
4. Argyros, A.A., Lourakis, M.I.A.: Vision-based interpretation of hand gestures for remote control of a computer mouse. In: ECCV Workshop on HCI, Graz, Austria, pp. 40–51 (2006)
5. Usabiaga, J., Erol, A., Bebis, G., Boyle, R., Twombly, X.: Global hand pose estimation by multiple camera ellipse tracking. Machine Vision and Applications 19 (2008)
6. Rehg, J., Kanade, T.: Digiteyes: Vision-based hand tracking for human-computer interaction. In: Workshop on Motion of Non-Rigid and Articulated Bodies, Austin, Texas, pp. 16–24 (1994)
7. Deutscher, J., Blake, A., Reid, I.: Articulated body motion capture by annealed particle filtering. In: IEEE Conference on Computer Vision and Pattern Recognition 2000, Proceedings, vol. 2, pp. 126–133 (2000)
8. Sidenbladh, H., Black, M.J., Fleet, D.J.: Stochastic tracking of 3d human figures using 2d image motion. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 702–718. Springer, Heidelberg (2000)
9. Stenger, B., Mendonca, P.R.S., Cipolla, R.: Model-based hand tracking using an unscented kalman filter. In: Proc. British Machine Vision Conference (BMVC), vol. 1, pp. 63–72 (2001)
10. Shamaie, A., Sutherland, A.: Hand tracking in bimanual movements. Image and Vision Computing 23, 1131–1149 (2005)
11. Stenger, B., Thayananthan, A., Torr, P.H.S., Cipolla, R.: Model-based hand tracking using a hierarchical bayesian filter. IEEE Trans. Pattern Analysis and Machine Intelligence 28, 1372–1384 (2006)
12. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. Int. Journal of Computer Vision 29, 5–28 (1998)
13. Baltzakis, H., Argyros, A., Lourakis, M., Trahanias, P.: Tracking of human hands and faces through probabilistic fusion of multiple visual cues. In: Gasteratos, A., Vincze, M., Tsotsos, J.K. (eds.) ICVS 2008. LNCS, vol. 5008, pp. 33–42. Springer, Heidelberg (2008)
14. Grimson, W.E.L., Stauffer, C.: Adaptive background mixture models for real time tracking. In: Proc. IEEE Computer Vision and Pattern Recognition (CVPR), Ft. Collins, USA, pp. 246–252 (1999)
15. Kalman, R.E.: A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering 82, 35–42 (1960)
16. Bar-Shalom, Y., Li, X.: Estimation and Tracking: Principles, Techniques, and Software. Artech House Inc., Boston (1993)
Visibility-Based Observation Model for 3D Tracking with Non-parametric 3D Particle Filters

Raúl Mohedano and Narciso García

Grupo de Tratamiento de Imágenes, Universidad Politécnica de Madrid, 28040, Madrid, Spain
{rmp,narciso}@gti.ssr.upm.es
http://www.gti.ssr.upm.es
Abstract. This paper presents a novel and powerful Bayesian framework for 3D tracking of multiple arbitrarily shaped objects, allowing the probabilistic combination of the cues captured from several calibrated cameras directly in the 3D world without assuming ground plane movement. This framework is based on a new interpretation of the Particle Filter, in which each particle represents the situation of a particular 3D position and thus particles aim to represent the volumetric occupancy pdf of an object of interest. The particularities of the proposed Particle Filter approach have also been addressed, resulting in the creation of a multi-camera observation model taking into account the visibility of the individual particles from each camera view, and a Bayesian classifier for improving the multi-hypothesis behavior of the proposed approach.
1 Introduction
Robust tracking in complex environments is a challenging task in computer vision and has been extensively addressed in the literature [1]. Due to the inherent limitations of monocular tracking systems, unable to cope with persistent occlusions, the interest in algorithms for tracking in networks composed of multiple cameras with semi-overlapped fields of view has rapidly increased in recent years. Multi-camera methods can be classified according to two main different criteria. The first criterion considers the geometrical constraints and assumptions of the algorithm, mainly whether ground plane movement of the objects of interest is assumed or not. Systems assuming a visible ground plane usually relate different views by means of planar homographies [2], while more general and flexible systems require fully calibrated cameras [3]. The second criterion refers to the manner in which mono-camera cues are combined in order to jointly perform tracking: in other words, whether monocular observations are fused directly in a common multi-camera (and usually probabilistic) framework [4] or if 2D tracking is performed individually in each camera, combining later these 2D tracks according to certain geometrical reasoning [5].
Methods performing 2D tracking with later combination into multi-camera tracks can assume either the visible ground plane constraint [2] or fully calibrated cameras [5]. Both cases, independently of the geometrical assumptions performed, are effective in simple environments, but tend to fail when severe occlusions are encountered, as errors are frequent in the individual modules they rely on. 2D tracking must perform certain hard decisions at mono-camera level, losing permanently some potentially useful information. Thus, systems performing direct fusion of monocular cues (usually by means of probabilistic integration) aim to consider all observable information when hard decisions are performed. In the line of probabilistic fusion, outstanding works assuming the ground plane constraint can be found [4,6]. Although proved very robust in a great variety of situations, these methods fail when no visible dominant ground plane can be assumed (either because there are multiple movement planes, or because the plane is not visible due to occlusions). Systems using fully calibrated cameras can, however, fuse multi-camera info directly in the 3D world. Following that line, [7] assume complex articulated human models for 3D tracking using Particle Filters. In [3], the authors propose voxelization of the scene for probabilistic 3D segmentation. Although very interesting, voxelization seems too costly computationally for 3D tracking, and even unnecessary if temporal consistency is taken into account. The present work presumes a set of multiple fully calibrated cameras, performing 3D probabilistic integration of monocular cues, not assuming any ground plane constraint. Moving objects have been modeled by means of volumetric occupancy probability density functions, representing mainly their localization in space and their appearance, that evolve over time according to a certain first order Markov dynamic model. Occupancy pdfs are handled by means of Particle Filters [8], and new observation and dynamic models according to this novel application of Particle Filters are proposed.
2 Framework Overview
This work proposes a novel probabilistic framework for multiple object 3D tracking in environments monitored using two or more fully calibrated cameras. The aim of the framework is to allow robust 3D location and tracking even when severe occlusions are suffered from some of the cameras of the system, without assuming any predefined movement (instead of the common ground plane constraint) or shape for the objects of interest (instead of parameterized templates used in usual Particle Filter-based tracking systems [7]). For that purpose, we model each moving object of the scene as a certain non-parametric volumetric occupancy pdf that evolves over time according to a certain dynamic model, and that can be observed following certain rules. Fig. 1 depicts the rationale behind the proposed framework in a 2D version of the 3D actual model. Let us assume a certain moving object Hk (for example purposes, a human, although the system is perfectly suitable for tracking generic objects) can be modeled as a 3D occupancy pdf, in which a higher probability density value at a
Fig. 1. Visual description of the system operation. Occupancy probabilities, represented by means of 3D particles, are predicted using the dynamic model (a rigid transform plus uncertainty), updated according to the multi-camera observation z_t from cameras 1, ..., N_c, and finally used to compute a bounding volume (the 3D particle convex hull).
certain 3D position indicates higher certainty of being actually part of the tracked object. Let us denote the pdf at time step t by p_k(x_t | Z_t), where x_t represents a certain 3D position (and also some information about the appearance of the object at that point), and Z_t is the set of multi-camera observations {z_1, ..., z_t} available at time step t. Ideally, if the volume occupied by the object could be perfectly delimited with total certainty, the 3D occupancy pdf representing it would be a uniform probability density inside that volume and zero outside (as depicted in the first image of Fig. 1). Using Bayes' theorem, the pdf at time t can be written as

p_k(\mathbf{x}_t \mid Z_t) \propto p(\mathbf{z}_t \mid \mathbf{x}_t, Z_{t-1})\, p_k(\mathbf{x}_t \mid Z_{t-1}).   (1)
The factor pk (xt |Zt−1 ) represents the predicted pdf for time t before taking into account current observations. That prediction can be performed according to
p_k(\mathbf{x}_t \mid Z_{t-1}) = \int p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, Z_{t-1})\, p_k(\mathbf{x}_{t-1} \mid Z_{t-1})\, d\mathbf{x}_{t-1},   (2)
where p(x_t | x_{t-1}, Z_{t-1}) is the dynamic model, governing the temporal evolution of the 3D occupancy pdf. The prediction process is depicted in the second image of Fig. 1. The factor p(z_t | x_t, Z_{t-1}) in Eq. (1) can be considered the likelihood of the observation (or observation model), and can be seen as a function of the 3D position x_t once z_t has been fixed. Eq. (1) expresses, then, how the prediction p_k(x_t | Z_{t-1}) is "corrected" or "reshaped" once observations at time t are taken into account, as shown in the third image of Fig. 1. Eqs. (1) and (2) are not easily solvable in general. The prediction equation consists of the integration of the product of two arbitrarily shaped distributions, while the observation likelihood p(z_t | x_t, Z_{t-1}) is, in the considered case, a non-linear function due to the non-linear nature of perspective projections. To handle the previous expressions, the involved pdfs are thus approximated using Monte Carlo methods, resulting in the well-known Particle Filter algorithm [8]. However, unlike usual Particle Filter tracking methods in which x_t represents a possible complete state of the tracked object [7], the proposed system considers x_t to represent 3D points that are potentially part of the tracked object. That difference of meaning must also be reflected in both the dynamic and likelihood models, so that they capture the local meaning of the generated particles. The particularities of the likelihood model have been addressed by means of a smart combination of independent mono-camera likelihood models, explained in Section 3. The dynamic model and its extension to multi-hypothesis prediction are addressed in Section 4.
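To make the recursion concrete, the following minimal Python sketch (illustrative only, not the authors' implementation; the rigid transform, drift value and likelihood function are assumptions) propagates a weighted set of 3D particles through Eq. (2) and re-weights it according to Eq. (1):

import numpy as np

def predict(particles, rigid_transform, drift_sigma, rng):
    # Dynamic model of Eq. (2): a common rigid motion plus Gaussian drift.
    moved = particles @ rigid_transform[:3, :3].T + rigid_transform[:3, 3]
    return moved + rng.normal(scale=drift_sigma, size=moved.shape)

def update_weights(weights, particles, likelihood_fn):
    # Eq. (1): re-weight every 3D particle by the observation likelihood p(z_t | x_t).
    w = weights * np.array([likelihood_fn(x) for x in particles])
    total = w.sum()
    return w / total if total > 0 else np.full_like(w, 1.0 / len(w))

def resample(particles, weights, rng):
    # Standard multinomial resampling keeps the particle count constant.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# One tracking step: predict, update_weights, resample.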
3 Visibility-Based Observation Model
Usual Particle Filter methods (for both mono-camera and multi-camera tracking) combine multiple information sources (e.g., multiple cameras) into a straightforward likelihood model by assuming conditional independence between them [9]. If the complete observation at time t, z_t, can be decomposed into {z_t^{c_1}, ..., z_t^{c_M}}, where M is the number of cameras in the system, the conditional independence assumption yields

p_k(\mathbf{z}_t \mid \mathbf{x}_t) = \prod_{j=1}^{M} p_j,   (3)
where p_j denotes the individual likelihood model for camera c_j, p_k(z_t^{c_j} | x_t). However, if the particle x_t cannot be directly seen by some camera, the conditional independence assumption does not hold and prevents a consistent weighting of the particles. In the proposed framework, since each particle x_t physically represents a point contained in a moving object, that effect is especially detrimental. This
Fig. 2. Limitations of the conditional independence assumption. Moving object H2 is not visible from camera c1 due to a static occluding object, causing measurements from c1 to differ from H2 appearance.
situation is depicted in Fig. 2: if a 3D position actually occupied by a moving object cannot be directly seen from one of the cameras, the difference measured between the expected and the observed data will evidently be considerable. That difference will erroneously decrease the likelihood of the measure if conditional independence is assumed. For that reason, we propose an observation model considering the visibility of the particles from each camera of the system, expressed as a visibility confidence term α_j (0 ≤ α_j ≤ 1), and not limited to the conditional independence assumption. The proposed observation model follows the expression

p_k(\mathbf{z}_t \mid \mathbf{x}_t) = \prod_{j=1}^{M} p_j^{\,\alpha_j M / \sum_{k=1}^{M} \alpha_k}.   (4)
This observation model combines the observations performed individually in each camera of the system (also with their individual observation model) taking into account the credibility of each comparison, expressed by the parameters αj . That weighted combination of the observation of the cameras is performed by means of a geometric mean, that is able to minimize the contribution of the cameras with lower visibility confidence. It can be proved, by simple derivation, that when a particle can be directly seen from every camera in the system (in our formulation, αj = 1, ∀j ∈ {1, . . . , M }) the result of the proposed joint likelihood model expressed in Eq. (4) coincides with the conditional assumption simplification in Eq. (3). That fact is highly desirable, as the direct visibility of a particle actually conditions completely the observations from different cameras of the system.
Fig. 3. Depth maps of the indoor scene considered in the tests
Another interesting aspect of the visibility-based observation model is its behavior when V cameras of the system have direct visibility of a particle while the remaining cameras do not, due to static foreground objects or because the particle lies outside their fields of view. Assuming, without any loss of generality, that the first V cameras of the system are those with direct visibility of the particle, the visibility conditions of the situation can be expressed as α_j = 1, ∀j ∈ {1, ..., V} and α_j = 0, ∀j ∈ {V + 1, ..., M}. In these conditions, the joint observation model of Eq. (4) yields

p_k(\mathbf{z}_t \mid \mathbf{x}_t) = \left(p_1 p_2 \cdots p_V\right)^{M/V} = p_1 p_2 \cdots p_V\, \bar{p}^{\,(M-V)},   (5)

where

\bar{p} = \left(p_1 p_2 \cdots p_V\right)^{1/V}.   (6)
That result shows that, in the considered situation, the observation likelihood of each camera without direct visibility is considered unreliable and replaced with the geometric mean of the visible cameras. The estimation of the parameters α_j might take into account the field of view of camera c_j and the static objects of the scene (which could occlude the particle), but also occlusions caused by other moving objects. The last aspect has not been included in this paper, as it presents some difficulties that are not yet satisfactorily solved. However, the first two aspects can be handled in a straightforward manner. Evidently, if a particle is outside the field of view of camera c_j, then α_j = 0. Occlusions due to static objects can be determined completely if depth maps for each camera view are available: if a particle cannot be directly seen from the point of view of c_j, its α_j is likewise set to 0. Fig. 3 shows the depth maps manually generated for the test environment evaluated in Section 5.
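As a sketch of how the visibility term and the weighted geometric mean of Eq. (4) can be computed (assuming a standard 3x4 pinhole projection matrix whose third coordinate recovers depth, and a per-camera depth map; this is not the authors' code):

import numpy as np

def visibility_weight(point3d, P, depth_map, tol=1e-3):
    # alpha_j: 0 if the point is outside camera j's field of view or behind
    # a static occluder according to the precomputed depth map.
    u, v, z = P @ np.append(point3d, 1.0)
    if z <= 0:
        return 0.0                      # behind the camera
    u, v = u / z, v / z
    h, w = depth_map.shape
    if not (0 <= u < w and 0 <= v < h):
        return 0.0                      # outside the field of view
    return 1.0 if z <= depth_map[int(v), int(u)] + tol else 0.0

def joint_likelihood(per_camera_likelihoods, alphas):
    # Eq. (4): visibility-weighted geometric mean; with every alpha_j = 1 it
    # reduces to the plain product of Eq. (3) (conditional independence).
    p = np.asarray(per_camera_likelihoods, dtype=float)
    a = np.asarray(alphas, dtype=float)
    if a.sum() == 0:
        return 1.0                      # no camera sees the point
    return float(np.prod(p ** (a * len(a) / a.sum())))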
As for the individual likelihood models considered for each camera c_j, two cues have been used, assumed conditionally independent of each other: motion detection and CIE L*a*b* color. Binary motion detection m_t^{c_j} has been performed according to [10], and its associated likelihood model can be expressed as

p\big(m_t^{c_j} \mid \mathbf{x}_t\big) = \begin{cases} 1 - p_B & \text{if } m_t^{c_j}(\mathbf{v}_{c_j}) = 1 \\ p_B & \text{otherwise,} \end{cases}   (7)

where v_{c_j} is the projection of x_t onto the image plane of camera c_j, and p_B is a certain background probability accounting for possible occlusions and segmentation mistakes. The color likelihood model, in turn, considers that each particle x_t has an associated L*a*b* color value x_t^{Lab}, and that the distribution of the L*a*b* color observed at v_{c_j} given the particle x_t is N(x_t^{Lab}, Σ_{Lab}), where Σ_{Lab} has been taken to be a multiple of the identity matrix.
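For completeness, a per-camera likelihood combining the motion cue of Eq. (7) with the L*a*b* color cue could be sketched as follows (the pixel coordinates u, v are assumed to come from the projection step above, and the Gaussian is evaluated only up to its normalization constant, which suffices for particle weighting):

import numpy as np

def camera_likelihood(motion_mask, lab_image, u, v, particle_lab,
                      p_b=0.2, sigma_lab=10.0):
    # Motion cue, Eq. (7): reward projections that land on foreground pixels.
    p_motion = (1.0 - p_b) if motion_mask[v, u] else p_b
    # Color cue: isotropic Gaussian in CIE L*a*b* centred on the particle's
    # stored color value.
    diff = lab_image[v, u].astype(float) - np.asarray(particle_lab, dtype=float)
    p_color = np.exp(-0.5 * float(diff @ diff) / sigma_lab ** 2)
    # The two cues are assumed conditionally independent.
    return p_motion * p_color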
4 Multi-hypothesis Support for Movement
In the usual Particle Filter approach for tracking, each particle represents a complete possible location of the tracked object, and so each particle can be considered as a different movement hypothesis for the object. In contrast, the proposed framework uses particles to model the spatial distribution of the objects, losing the capability of considering multiple possibilities for the global movement of the object. To alleviate this effect, we propose the use of a Bayes classifier over a certain limited set of discrete movements. Let us consider multiple possible dynamic models M_d, with d ∈ {1, ..., D}, for the time evolution of a certain tracked object. It would be possible to determine which model is the most probable (given the available observations) by means of a naive Bayes classifier. To establish this classifier, the posterior probability of a certain dynamic model M_d is

p_k(M_d \mid Z_t) \propto p_k(\mathbf{z}_t \mid M_d, Z_{t-1})\, p_k(M_d \mid Z_{t-1}),   (8)
where p_k(M_d | Z_{t-1}) is the prior probability of the dynamic model M_d given the observations up to t − 1, and p_k(z_t | M_d, Z_{t-1}) is the likelihood of the current observation given both M_d and the previous observations (everything for the object H_k). The estimation of p_k(z_t | M_d, Z_{t-1}) can be performed within the proposed Particle Filter framework. Assuming that the measuring process at time step t only depends on x_t (and not on M_d or Z_{t-1}), the volumetric occupancy pdf given the dynamic model M_d can be written as

p_k(\mathbf{x}_t \mid Z_t, M_d) = \frac{p(\mathbf{z}_t \mid \mathbf{x}_t)\, p_k(\mathbf{x}_t \mid M_d, Z_{t-1})}{p_k(\mathbf{z}_t \mid M_d, Z_{t-1})},   (9)

where the volumetric occupancy pdf predicted according to dynamic model M_d is

p_k(\mathbf{x}_t \mid M_d, Z_{t-1}) = \int p_d(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, p_k(\mathbf{x}_{t-1} \mid Z_{t-1})\, d\mathbf{x}_{t-1}.   (10)
Fig. 4. Four different views of the monitored environment (centroid trajectories and 3D convex hull-based segmentation superimposed)
The desired term, for the purposes of the Bayes classifier, appears in the denominator of Eq. (9), and follows the expression

p_k(\mathbf{z}_t \mid M_d, Z_{t-1}) = \int p(\mathbf{z}_t \mid \mathbf{x}_t)\, p_k(\mathbf{x}_t \mid M_d, Z_{t-1})\, d\mathbf{x}_t.   (11)

The discussed multi-hypothesis approach allows several possibilities for the dynamic model of the tracked objects to be tested between time steps t − 1 and t, selecting the one with the greatest posterior probability according to the observations at time t. However, it drastically increases the computational cost of the system, as the consideration of D different dynamic models would increase the number of handled particles from N_S to D × N_S. Thus, in practice, the product of the number of samples and the number of considered dynamic models must be limited. The performed tests have considered dynamic models M_d composed of a static part, common to every M_d and relative to a local coordinate system, and a dynamic part, addressing the global movement of the local coordinate system and aiming to represent global movements of the tracked object. The static part considers particle color and 3D position remaining constant over time (except for a certain Gaussian drift representing uncertainty). The dynamic part considers rigid movements of the objects, such as translations and rotations, and forms the specific part of each M_d. In the presented tests, 9 possible rigid movements have been considered, obtained by the combination of 3 possible displacements and 3 rotations.
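A Monte Carlo version of this model selection, in which the integral of Eq. (11) is approximated by the weighted mean of the per-particle likelihoods under each predicted particle set, might look as follows (a sketch with hypothetical motion parameters, not the authors' implementation):

import numpy as np

def make_rigid_motion(dx, dtheta, drift=0.02):
    # One candidate dynamic model M_d: translation along x plus a rotation
    # about the vertical axis, followed by Gaussian drift.
    def move(particles, rng):
        c, s = np.cos(dtheta), np.sin(dtheta)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        return (particles @ R.T + np.array([dx, 0.0, 0.0])
                + rng.normal(scale=drift, size=particles.shape))
    return move

def select_dynamic_model(particles, weights, models, priors, likelihood_fn, rng):
    # Evaluate Eq. (8) for every candidate M_d and keep the MAP model.
    posteriors = []
    for model, prior in zip(models, priors):
        predicted = model(particles, rng)                       # Eq. (10)
        lik = np.array([likelihood_fn(x) for x in predicted])   # p(z_t | x_t)
        posteriors.append(prior * float(np.dot(weights, lik)))  # Eq. (11), approx.
    return int(np.argmax(posteriors)), posteriors

# e.g. 9 rigid movements: 3 displacements x 3 rotations
# models = [make_rigid_motion(dx, th) for dx in (-0.1, 0.0, 0.1) for th in (-0.2, 0.0, 0.2)]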
5 Results
The proposed scheme has been tested in different environments. The results show the performance of the framework when tracking several humans, although its non-parametric nature makes it suitable for the tracking of generic moving objects. Here, we have considered a representative complex indoor scenario: a computer laboratory. So, the environment is full of furniture and equipment and there are moving persons undergoing frequent occlusions from multiple points of view. This scenario is monitored by four calibrated cameras [11], located in the top corners, with semi-overlapped fields of view.
Fig. 4 shows the 4 different views at a certain time step, including the trajectory of each object over time, and an approximated 3D segmentation of the moving objects at that time step, computed as the 3D convex hull of the minimum set of 3D particles xt accumulating a total probability of more than 0.95. These results have been generated with a background probability pB of 0.2 (see Eq. (7)), although the performed tests show that values in between 0.05 and 0.4 produce also satisfying results. As for the covariance matrix ΣLab of the color observation model noise, each L*a*b dimension has been considered independent with an equal standard deviation of 10, while further tests show that certain variations of this value do not drastically affect system performance. Visibility confidence parameters αj have been calculated as explained in Sec. 3, using the depth maps showed in Fig. 3. Fig. 4 (specially camera 1), shows the excellent behavior of the system, correctly tracking and locating in the 3D world two different moving objects that have occluded each other from some of the views when they crossed their trajectories, and whose lower half is permanently occluded in two camera views due to furniture occlusions. Even though occlusions caused by moving objects have not been yet included in the visibility-based model, the system proves itself able to cope with interacting moving objects (however, its inclusion might even improve tracking precision). Note also that the process of leaving the field of view of some of the cameras (see Fig. 4, camera 4) is consistently addressed by the proposed visibility-based observation model.
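The 3D segmentation shown in Fig. 4 can be reproduced schematically as follows (a sketch assuming normalized particle weights and SciPy's ConvexHull; not the authors' code):

import numpy as np
from scipy.spatial import ConvexHull

def object_hull(particles, weights, mass=0.95):
    # Smallest set of particles accumulating at least 95% of the probability
    # mass, followed by their 3D convex hull (at least 4 particles are kept).
    order = np.argsort(weights)[::-1]
    cumulative = np.cumsum(weights[order])
    k = max(4, int(np.searchsorted(cumulative, mass)) + 1)
    return ConvexHull(particles[order[:k]])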
6 Conclusions
This paper addresses the problem of 3D tracking of generic real objects with general displacements, without restricting them to the usual ground plane constraint. For this purpose, we propose a novel framework for 3D tracking using multiple calibrated cameras with semi-overlapped fields of view, in which objects are modeled by means of non-parametric 3D occupancy probability density functions. These occupancy pdfs are evolved over time following a Bayesian approach, taking into account low-level cues observed from the different views of the system, and thus combining multi-camera information directly without performing any mono-camera decision on object tracking. The process is performed by means of 3D Particle Filters, where each particle represent a 3D point composing the 3D tracked object. As Particle Filters are applied in a conceptually different way as in usual tracking algorithms, both observation and dynamic models have been adapted for the presented approach. Therefore a visibility-based observation model has been created, inspired in the weighted geometric mean. This model consistently updates the occupancy pdfs by giving more credibility to those views with better visibility of the particles. A specific dynamic model has also been created, considering multiple hypothesis for the dynamic evolution of the objects and choosing the best one by means of a Bayes classifier. Tests performed on the systems show its excellent 3D tracking (and localization) robustness, even in situations with severe occlusions and multiple interacting objects.
Acknowledgements This work has been partially supported by the Ministerio de Ciencia e Innovación of the Spanish Government under project TEC2007-67764 (SmartVision) and by the Comunidad de Madrid under project S0505/TIC-0223 (Pro-Multidis). Also, R. Mohedano wishes to thank the Comunidad de Madrid for a personal research grant.
References 1. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Trans. Systems, Man and Cybernetics 34(3), 334–352 (2004) 2. Black, J., Ellis, T., Rosin, P.: Multi view image surveillance and tracking. In: IEEE Workshop on Motion and Video Computing, pp. 169–174 (2002) 3. Landabaso, J.L., Pard´ as, M.: Foreground regions extraction and characterization towards real-time object tracking. In: Multimodal Interaction and Related Machine Learning Algorithms, pp. 241–249 (2005) 4. Khan, S.M., Shah, M.: Tracking multiple occluding people by localizing on multiple scene planes. IEEE Trans. Pattern Analysis and Machine Intelligence 31(3), 505–519 (2009) 5. Focken, D., Stiefelhagen, R.: Towards vision-based 3d people tracking in a smart room. In: IEEE Int. Conf. Multimodal Interfaces, pp. 400–405 (2002) 6. Mittal, A., Davis, L.S.: M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo. Int. Journal of Computer Vision 51(3), 189–203 (2002) 7. Deutscher, J., Reid, I.: Articulated body capture by stochastic search. Int. Journal of Computer Vision 61(2), 185–205 (2005) 8. Doucet, A., Godsill, S.J., Andrieu, C.: On sequential monte carlo sampling methods for bayesian filtering. Statistics and Computing 10(3), 197–208 (2000) 9. Wang, Y.D., Wu, J.K., Kassim, A.A.: Adaptive particle filter for data fusion of multiple cameras. Journal of VLSI Signal Processing Systems 49(3), 363–376 (2007) 10. Zivkovic, Z., van der Heijden, F.: Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters 27(7), 773–780 (2006) 11. Carballeira, P., Ronda, J.I., Vald´es, A.: 3d reconstruction with uncalibrated cameras using the six-line conic variety. In: IEEE Int. Conf. Image Processing, pp. 205–208 (2008)
Efficient Hypothesis Generation through Sub-categorization for Multiple Object Detection Dipankar Das, Yoshinori Kobayashi, and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570, Japan {dipankar,yosinori,kuno}@cv.ics.saitama-u.ac.jp
Abstract. Hypothesis generation and verification technique has recently attracted much attention in the research on multiple object category detection and localization in images. However, the performance of this strategy greatly depends on the accuracy of generated hypotheses. This paper proposes a method of multiple category object detection adopting the hypothesis generation and verification strategy that can solve the accurate hypothesis generation problem by sub-categorization. Our generative learning algorithm automatically sub-categorizes images of each category into one or more different groups depending on the object’s appearance changes. Based on these sub-categories, efficient hypotheses are generated for each object category within an image in the recognition stage. These hypotheses are then verified to determine the appropriate object categories with their locations using the discriminative classifier. We compare our approach with previous related methods on various standards and the authors’ own datasets. The results show that our approach outperforms the state-of-the-art methods.
1 Introduction
Multiple object category detection with large appearance changes in images is one of the most complex tasks in computer vision. The appearance of object categories changes due to many factors, such as intra-category variation (different textures and colors), illumination, viewpoints, and poses. When the appearance changes due to object viewpoints, a sub-categorization technique improves the detection performance [1,2]. However, when the appearance changes due to multiple factors, it is hard to find one dominating property by which to divide an object category into an appropriate number of sub-categories. Manually assigning sub-category labels [3,4] to the training samples could be difficult and time consuming. In this paper, we propose a method of multiple category object detection adopting the hypothesis generation and verification approach that can solve the accurate hypothesis generation problem (due to appearance changes) by automatic sub-categorization. In our approach, the generative model, pLSA [5], is fitted to the training data without knowledge of labels of bounding boxes, and
topics are assigned based on the image specific topic probability under each category. Probabilistic latent semantic analysis (pLSA) model was originally developed for topic discovery in a text corpus. In image analysis, the model considers images as documents and discovers topics as object categories, so that an image containing instances of several objects is modelled as a mixture of topics. In our flexible learning strategy, a single object category can be represented with multiple topics and the model can be adapted to diverse object categories with large appearance variations. The appropriate number of sub-categories (generated from multiple topics) is optimized on a validation dataset. In the testing stage, promising hypotheses are generated efficiently for multiple object categories with their positions and scales. For this purpose, our algorithm considers the bag-of-visual-words (BOVW) extracted from the image and the number of sub-subcategories generated during the learning stage. Once hypotheses have been generated, a discriminatively trained SVM classifier verifies these hypotheses using merging features. Our merging feature is computed using pyramid HOGs (PHOG) and pre-computed (in the generative stage) visual words based on SIFT descriptors. Since the hypothesis generation stage effectively acts as a pre-filter, the discriminant power is applied only where it is needed. Thus, our system is able to detect and localize multiple objects with a large number of categories. Unsupervised generative learning for hypothesis generation or image classification has recently begun to attract attention of researchers. Our method is inspired by Sivic et al. [6] that applies the unsupervised topic discovery technique to discover categories in images. However, an important distinction is that they restricted the number of topics searched for to the number of categories truly present in their datasets, whereas our method automatically detects and fits to an appropriate number of topics of input categories. Fergus et al. [7] use an unsupervised generative learning algorithm to build representations of particular object categories. However only objects from a single category are presented to the system, which is then tested in a category versus background setting. In contrast, our generative model simultaneously learns representation for multiple object categories coexisting within the same image and generates hypotheses for each object category. The combined approach of hypothesis generation and verification procedures produces good detection and localization results [8,9]. However, in [9] the same feature is used for both generative and discriminative classifiers without subcategorization and is not sufficient enough to distinguish complex object categories in images with multiple objects. Our approach differs from these in using different features and techniques for both in hypothesis generation and SVM verification stages. Moreover, we compare our detection performance using the sub-categorization technique to that of [9]. Our research is most closely related to the approach of Das et al. [8] who also used a combination of pLSA and supervised classification for specific object recognition. However, our approach goes beyond this with respect to the followings: (i) we propose and implement a new algorithm that automatically sub-categorizes input object categories into
appropriate number of topics; (ii) we modify the hypotheses generation stage of Das et al. [8] so that our algorithm can generate hypotheses much faster from sub-categories of multiple object categories; (iii) we perform experiments over some standard datasets to compare the performance of our method with some state of the art recognition frameworks; and (iv) finally, we compare the effectiveness of the experimental results on our database using hypothesis generation and verification technique with and without sub-categorization.
2 Sub-categorization Approach
The sub-categorization algorithm uses the output of the pLSA model to sub-categorize the input training object categories. The model associates each observation of a visual word, w, within an image, d, with a topic variable, z. In the formulation of the pLSA model for objects, we first seek a visual vocabulary of visual words for the training images that are insensitive to changes in viewpoint, scale and illumination. This visual vocabulary is formed by vector quantizing the SIFT descriptors [10] using the k-means clustering algorithm. In order to construct the visual vocabulary and visual words, we first detect and describe interest points in all training images.

2.1 Interest Points Detection and Description
A promising recent research direction in computer vision is to use local features and their descriptions. The combination of interest point detectors and invariant local descriptors has shown interesting capabilities of describing images and objects. In this research, local interest points are detected in two phases. In the first phase, corner points are detected in the image using the technique proposed by He and Yung [11] and all of the corner points are selected as keypoints. Then, in the second phase, the rest of the keypoints are selected by taking uniform samples on the object edges. Duplicate keypoints, if any, are eliminated. Edge strengths are used as the weights of the samples. Both corner points and uniform samples on the edges make the model more shape informative, which is important to obtain an overall estimate of the object boundary. Thus the model can give an estimate of the possible object shape in addition to probable object locations during hypothesis generation. Each generated keypoint is described using the 128-dimensional SIFT descriptor. The SIFT descriptors are computed over a circular patch with radius r = 10.

2.2 Sub-categorization and Topics Optimization
The pLSA model determines P(w|z) and P(d|z) by using the maximum likelihood principle. The mixing coefficients P(z_k|d_j), for all training images, can be seen as an object feature and be used for classification purposes. They can also be used to rank the training images with respect to the latent topic z_k, and are computed as:

P(z_k \mid d_j) = \frac{P(d_j \mid z_k)\, P(z_k)}{\sum_{l=1}^{K} P(d_j \mid z_l)\, P(z_l)}.   (1)
Based on equation 1, our sub-categorization algorithm is given by the following steps:

1. Let K be the total number of topics and T_c be the number of object categories to be learned.
2. Repeat the following steps for K = T_c to mT_c, where the value of m depends on the intra-category appearance variation. In our experiments, the value of m is set to 3.
3. Learn the pLSA model for K topics.
4. For all training objects, compute the probabilities P(z_k | d_j) by equation 1.
5. Compute the following normalized summation under each topic:

P(Z|D)_i = \frac{\sum_{l=1}^{n_i} P(z_k \mid d_l)_i}{n_i},   (2)

where n_i is the total number of objects for the i-th category.
6. Assign object category i to the topic with the highest P(Z|D)_i. If the topic with the highest P(Z|D)_i has already been assigned to another category, then try to find the topic with the second highest P(Z|D)_i, and so on. At the end of this process, one topic is assigned to each category. If K > T_c, to each of the remaining T_r = K − T_c topics, assign the object category with the maximum value of P(Z|D).

At each step K of the above algorithm, the performance of the sub-categorization is measured with respect to correctly generated hypotheses on the validation set. Our hypothesis evaluation criterion is based on:

\frac{\mathrm{Area}(B_{ph} \cap B_{gt})}{\mathrm{Area}(B_{ph} \cup B_{gt})} \geq 50\%,   (3)

where B_{ph} and B_{gt} are the bounding boxes of the predicted hypothesis and the ground truth object, respectively. Then the number of topics giving the maximum performance is chosen as the optimal number of topics for the system. We propose to consider the background as another object category. In this case, the value of K in step 2 varies from T_c + B_c to mT_c + B_c, where B_c is the number of sub-categories for the background, and in step 6 the background sub-categories are determined first.
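The assignment in steps 5 and 6 can be sketched as follows (p_zd is assumed to be the K x N matrix of P(z_k | d_j) produced by a fitted pLSA model; the order in which categories claim topics is a free choice that the paper does not fix):

import numpy as np

def assign_topics(p_zd, labels, num_categories):
    # p_zd: (K topics x N training objects) matrix of P(z_k | d_j);
    # labels: category index (0 .. num_categories - 1) of every training object.
    K = p_zd.shape[0]
    # Step 5, Eq. (2): normalized summation P(Z|D)_i for every (topic, category).
    score = np.stack([p_zd[:, labels == i].mean(axis=1)
                      for i in range(num_categories)], axis=1)        # K x Tc
    topic_to_cat = np.full(K, -1)
    # Step 6: each category claims its best still-unassigned topic ...
    for cat in range(num_categories):
        for z in np.argsort(-score[:, cat]):
            if topic_to_cat[z] < 0:
                topic_to_cat[z] = cat
                break
    # ... and every remaining topic goes to the category with maximal P(Z|D).
    rest = topic_to_cat < 0
    topic_to_cat[rest] = np.argmax(score[rest], axis=1)
    return topic_to_cat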
3 Promising Hypotheses for Optimized Model
In order to generate promising hypotheses using the optimized model, we use a modified version of the algorithm proposed by Das et al. [8]. We make modifications in two major sections. First, we generate hypotheses for the optimized model that consists of one or more sub-categories for each object category. Second, we reduce the search space in our algorithm so that the recognition time of the system can decrease. The algorithm consists of the following steps:
Fig. 1. Hypothesis generation and verification results: (a) detected visual words (small circles) and ROI (rectangular box) for the object cup noodle, (b) local maxima based on the number of generated visual words for cup noodle object, (c) local maxima after suppressing overlapping windows, (d) detected target objects with their locations
1. Repeat the following steps, with their corresponding rectangular region of interest (ROI), for all object categories. The ROI is the smallest rectangular window within the image that contains all possible visual words for a particular object category (Fig. 1(a)).
2. Compute the average aspect ratio M_{a_i} of the window for each object category i as M_{a_i} = M_{w_i}/M_{h_i}, where M_{w_i} and M_{h_i} are the mean width and mean height of object category i, computed during the training stage using the ground truth bounding boxes.
3. For each object category, slide a window of size M_{w_i} × H_{ROI_i} and count the number of visual words, N_{vw} = \sum_{z \in t_s} n_{vw_{iz}}, where n_{vw_{iz}} is the number of visual words for object category i and sub-topic z of its sub-topic set t_s, and H_{ROI_i} is the height of the ROI for category i.
4. Determine the local maxima (Fig. 1(b)) based on the number of visual words N_{vw} in each of the sliding windows.
5. For all local-maxima regions within an image, find and suppress the windows, if any, which overlap by 75% or more with the window that contains the maximum number of visual words. This step is similar to the non-maximum suppression technique (Fig. 1(c)).
6. After suppressing the non-maximum windows in each neighborhood, slide the object window with average aspect ratio M_{a_i} within the remaining windows and determine the window that contains the maximum number of visual words in each local region. Such windows are selected as the promising hypotheses for our optimized model.
Since we reduce the search space significantly, our hypothesis generation algorithm is much faster than that of [8]. If the size of the rectangular ROI is m × n and the object window size is p × q, then the algorithm proposed by Das et al. [8] requires (m − p + 1)(n − q + 1) sliding-window evaluations for hypothesis generation. On the other hand, if there are r promising hypotheses (r is much smaller than (n − q + 1), typically in the range 1 to 10 in our case), then our algorithm requires only (m − p + 1)·r + (n − q + 1) sliding-window evaluations. For ten object categories, our modified algorithm requires 0.084 s to generate all hypotheses, whereas [8] requires 7.72 s on a 2.40 GHz PC.
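A simplified single-category version of steps 3-6 might look like the following sketch (word_xy holds the image coordinates of the visual words of all sub-topics of that category; the suppression rule and thresholds are simplifications of the procedure above, not the authors' implementation):

import numpy as np

def generate_hypotheses(word_xy, roi, win_w, win_h, overlap=0.75):
    # word_xy: (N, 2) array of (x, y) visual-word positions for one category;
    # roi = (x0, y0, x1, y1) is its region of interest.
    x0, y0, x1, y1 = roi
    xs_range = range(x0, max(x0 + 1, x1 - win_w + 1))
    # Step 3: slide a win_w-wide column over the ROI and count visual words.
    counts = [int(((word_xy[:, 0] >= x) & (word_xy[:, 0] < x + win_w) &
                   (word_xy[:, 1] >= y0) & (word_xy[:, 1] < y1)).sum())
              for x in xs_range]
    # Steps 4-5: keep local maxima, suppressing columns that overlap a
    # stronger one by 75% or more (simple non-maximum suppression).
    kept = []
    for i in np.argsort(counts)[::-1]:
        if all(abs(int(i) - j) >= (1.0 - overlap) * win_w for j in kept):
            kept.append(int(i))
    # Step 6: inside every surviving column, place a win_w x win_h box at the
    # vertical position covering the most visual words.
    hypotheses = []
    for i in kept:
        x = x0 + i
        def score(y, x=x):
            return int(((word_xy[:, 0] >= x) & (word_xy[:, 0] < x + win_w) &
                        (word_xy[:, 1] >= y) & (word_xy[:, 1] < y + win_h)).sum())
        y_best = max(range(y0, max(y0 + 1, y1 - win_h + 1)), key=score)
        hypotheses.append((x, y_best, x + win_w, y_best + win_h))
    return hypotheses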
4 Discriminative Learning and Verification
Object detection and localization algorithms using a discriminative method with combined features have been shown to perform well in the presence of cluttered background, viewpoint changes, partial occlusion, and scale variations [12]. In our approach, along with pLSA, a multi-class SVM classifier is also learned in parallel using shape and appearance features. To represent the shape of an object, spatial shape descriptors are extracted from the object of interest. In order to describe the spatial shape of an object we follow the scheme proposed by Bosch et al. [13]. We extract the Pyramid Histogram of Orientation Gradients (PHOG) features only from the target object using the ground truth bounding box. Although shape representation is a good measure of object similarity for some objects, it is not sufficient to distinguish among all types of objects. In such cases, object appearance represented by the bag of visual words is a better feature for finding the similarity between them. The appearance patches and descriptors are computed in a similar manner as described in subsection 2.1. Then the normalized histogram of visual words for each object is computed. Finally, the shape and appearance features of an object O are merged as

H(O) = \alpha H_{PHOG}(O) + \beta H_{BOVW}(O),   (4)

where α and β are the weights for the shape histogram H_{PHOG}(O) and the appearance histogram H_{BOVW}(O), respectively. The multi-class SVM classifier is learned using the above merged feature, in which a higher weight is given to the more discriminative feature. The values of α and β in equation 4 are determined using the cross-validation datasets. We use the LIBSVM [14] package for our experiments in multi-class mode with the RBF kernel. In the verification step, merged features are extracted from the regions bounded by the windows of the promising hypotheses and fed into the multi-class SVM classifier in recognition mode. Only the hypotheses for which a positive confidence measurement is returned are kept for each object. Objects with the highest confidence level are detected as the correct objects (Fig. 1(d)). The confidence level is measured using the probabilistic output of the SVM classifier.
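The verification stage can be sketched with scikit-learn's SVC (which wraps LIBSVM). Reading Eq. (4) as a weighted concatenation of the two normalized histograms is an assumption of this sketch, as are the weight, C and gamma values:

import numpy as np
from sklearn.svm import SVC

def merge_features(h_phog, h_bovw, alpha=0.7, beta=0.3):
    # One plausible reading of Eq. (4): weight the two normalized histograms
    # and stack them into a single feature vector H(O).
    return np.concatenate([alpha * np.asarray(h_phog, dtype=float),
                           beta * np.asarray(h_bovw, dtype=float)])

def train_verifier(features, labels, C=10.0, gamma=0.5):
    clf = SVC(kernel="rbf", C=C, gamma=gamma, probability=True)
    clf.fit(np.asarray(features), np.asarray(labels))
    return clf

def verify(clf, feature, threshold=0.5):
    # Keep a hypothesis only when the classifier is sufficiently confident.
    probs = clf.predict_proba(np.asarray(feature).reshape(1, -1))[0]
    best = int(np.argmax(probs))
    if probs[best] >= threshold:
        return clf.classes_[best], probs[best]
    return None, probs[best]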
5 Experimental Results
In this section we carry out a series of experiments to investigate the benefits of our sub-categorization approach with merged features. We evaluate our approach on different datasets. In order to compare our approach with [9,15], we use four categories of objects. Most of these datasets are taken from the PASCAL database collection [16] and contain a single object per image. For multiple objects per image the performance of the system is evaluated on our own datasets. It consists of 10 categories of everyday objects related to our application (service robot) in different environments against cluttered, real-world background with occlusion, scale, and minor viewpoint changes. Our datasets are created with ground truth bounding boxes that contain multiple objects per image. There
Table 1. Details on the datasets for our experiments

Dataset and category    #category  #train object  #test object  #validation object
Weizmann (horse)        1          164            164           50
TUD (cow)               1          40             71            50
Caltech (motorbike)     1          398            400           50
UIUC (car)              1          200            170           50
Authors' (multiple)     10         582            1420          307
are a total of 713 images containing 2002 objects. Table 1 summarizes the total number of training, testing and validation objects for our experiment. Given a completely unlabeled image of multiple object categories, our goal is to automatically detect and localize the object categories coexisting in the image. In our approach, object presence detection means determining if one or more categories of objects are present in an image, and localization means finding the locations of the objects in that image. Based on the object presence detection and localization, an object is counted as a true positive if it satisfies equation 3. Otherwise, the detected object is counted as a false positive. In the experimental evaluation, the detection and localization rate (DLR) is defined as:

\mathrm{DLR} = \frac{\#\text{ of true positive objects}}{\#\text{ of annotated objects}}.   (5)

The false positive rate (FPR) is defined as:

\mathrm{FPR} = \frac{\#\text{ of false positive objects}}{\#\text{ of annotated objects}}.   (6)

5.1 Sub-categorization and Hypothesis Generation
One of the main contributions of this paper is to generate efficient hypotheses by sub-categorizing an object category into appropriate number of topics. The sub-categorization and topic optimization algorithm has been described in subsection 2.2. In this section, we present experimental results to investigate the benefits of our algorithm with respect to correctly generated hypotheses. Our hypotheses evaluation criteria are based on equation 3. We examined two cases: case with background information and case without background information. In the former case, the sub-categorization algorithm considers background information as a separate category. We used our own dataset with the number of training and validation objects as indicated in the last row of the Table 1. Sub-categorization without Background Information. The pLSA model was learned for ten categories of objects and sub-categorization was performed by varying the number of topics from 10 to 30. At each step the performance of the system was measured on the validation set. Fig. 2(a) shows percentage of
Fig. 2. Sub-categorization performance on the validation set: (a) the number of topics versus correctly generated hypotheses without background category, (b) with background category, (c) & (d) generated hypothesis regions for different target objects
correctly generated hypotheses against number of topics on the validation set. The optimal number of topics for the experiments is selected at the maximum value as indicated by the circular marker on the graph. Sub-categorization with Background Information. Here an additional category was included as the background category with the ten object categories. In the training period, the background features were taken from the areas other than the target objects in the images. However, since the background features may vary significantly, we need to restrict the number of sub-categories for the background up to a certain value, such as 2 or 3. In this experiment, it was fixed to 2 sub-categories. The reason for adding background category is to give the system the opportunity of discovering background features generated from the areas other than the target object areas within the test images. It reduces the confusion between objects’ features and background features in the hypothesis generation stage. The optimization result, when we included background category, is shown in Fig.2(b). The optimal numbers of topics for the ten categories of objects are 18 without background category and 21 with two background sub-categories. With these optimal number of topics, performance values of the hypotheses generation algorithm are 84.0% and 86.6%, respectively. It has been shown that the number of sub-categories of a particular object category depends not only on the variation of this particular object’s appearance features, but also on the existence of the other object categories in the training process. In our experiments, the number of sub-categories for an object category was typically in the range from 1 to 3. During hypothesis generation, promising hypotheses were generated for each object category by taking all sub-categories in consideration. Figs. 2(c) and (d) show some examples of the best hypothesized locations along with the percentage of overlapping areas with their ground truth bounding boxes. From the graphs in Fig. 2, it is clear that the hypothesis generation algorithm produces better results when the background category is included in the training object categories.
Table 2. Performance comparison with the other methods

Category and dataset    ISM [15]  IRD [9]  Authors
Car (UIUC)              87.6%     88.6%    95.9%
Cow (TUD)               92.5%     93.2%    95.7%
Horse (Weizmann)        88.5%     88.5%    99.4%
Motorbike (Caltech)     80.0%     84.0%    98.5%
Table 3. Comparison of cross-category confusion with the other methods

Category     Car (ISM / IRD / Auth.)   Cow (ISM / IRD / Auth.)   Horse (ISM / IRD / Auth.)   Motorbike (ISM / IRD / Auth.)
Car          −                         0.07 / 0.00 / 0.09        0.02 / 0.01 / 0.69          0.18 / 0.03 / 0.12
Cow          1.00 / 0.49 / 0.23        −                         0.18 / 0.11 / 0.26          1.05 / 0.05 / 0.05
Horse        0.71 / 0.16 / 0.29        0.53 / 0.08 / 0.03        −                           0.68 / 0.05 / 0.03
Motorbike    1.07 / 0.08 / 0.13        0.29 / 0.09 / 0.00        0.22 / 0.00 / 0.06          −
Table 4. Experimental results on authors' dataset

                 Without sub-category   With sub-category   Sub-cat. + background cat.
Category         DLR     FPR            DLR     FPR         DLR     FPR
Coffee jar       0.76    0.11           0.82    0.09        0.85    0.09
Coffee mug       0.34    0.37           0.61    0.33        0.73    0.32
Spoon            0.79    0.07           0.82    0.08        0.84    0.06
Hand soap        0.64    0.30           0.73    0.22        0.72    0.25
Cup noodle       0.62    0.72           0.86    0.70        0.90    0.56
Monitor          0.80    0.02           0.92    0.02        0.97    0.02
Keyboard         0.85    0.03           0.98    0.01        0.96    0.01
Mouse            0.53    0.14           0.72    0.14        0.75    0.08
CD               0.22    0.40           0.67    0.26        0.78    0.22
Book             0.39    0.95           0.66    0.78        0.80    0.81
Avg. rate        0.61    0.27           0.78    0.23        0.83    0.21
5.2 Comparison with Other Methods
The performance of our system was compared to the integrated representative and discriminative (IRD) representation of Fritz et al. [9], and the implicit shape model (ISM) of Leibe et al. [15], using the same datasets that were tested in [9]. The comparison was performed on four object categories using the multicategory discrimination results that are presented in [9]. The four object categories were fitted on 7 optimal number of topics using the validation sets. We used the SVM classifier with merged features to verify the hypotheses generated by our algorithm as discussed in section 3. Table 2 summarizes detection and localization performance of our method with the other methods. In the case of horse dataset, the performance of [9,15] was calculated from the recall-precision
Fig. 3. Example detection and localization results on authors’ dataset
curves presented in [9]. Table 3 shows the comparison of cross-category confusion (false positives per test image) on all four object categories. Note that in all cases our method achieved better detection and localization results than the other two methods. The superior performance compared to the other methods might be due to the use of better features and the sub-categorization technique for hypothesis generation. In [9], the same feature is used for both the generative and discriminative classifiers. In our approach, different features were used for the generative and discriminative parts. Although the recognition task is different from our multiple object detection and localization, we performed this experiment to compare the basic performance of our method with others.

5.3 Performance on Authors' Datasets
In this sub-section we show the detection and localization performance of our approach on our own datasets. In the training period, both the pLSA and SVM models were fitted for all training object categories. In the generative learning stage, the pLSA model learned the optimal number of topics using the validation datasets and sub-categorized each object category to an appropriate number. The object specific feature weights (α and β), optimal threshold
value, the penalty parameter (C) and the kernel parameter (γ) were also determined using a validation and cross-validation (v=5) datasets. In most of the studies [9,12,6], a small number of categories (two to five) were used for categorization purposes. Thus, we collected the dataset consisting of ten categories of objects in different environments and backgrounds. Table 4 shows the detection and localization rates. The false positive rates are also indicated in the adjacent columns. The pLSA model was fitted for 18 topics without background and 21 topics including two background sub-categories in the experiments on our dataset. As shown in Table 4, without sub-categorization our system produces an average DLR 61% with the FPR of 27%. On the other hand, using the subcategorization technique, the system increases the average DLR to 78% with a reduction of the false positive rate from 27% to 23%. The best performance is obtained by including the background sub-category in the training categories. In this case, the average DLR for ten categories of objects is 83%. Since the background information reduces confusion between object features and background features, it also reduces the false positive rate. Some detection and localization results on our datasets are shown in Fig. 3.
6 Conclusion
In this paper, we have proposed a new approach of automatically sub-categorizing of an object category to an appropriate number by fitting the pLSA model to the optimal number of topics. We have also demonstrated how to generate hypotheses efficiently from these sub-categories with results equivalent to an exhaustive search of a quality function over all rectangular region of interest. The system has shown the ability to accurately detect and localize coexisting objects even in the presence of cluttered background, substantial occlusion, and significant scale changes. Our experimental results have demonstrated that the sub-categorization technique significantly improves the accuracy of generated hypotheses, which in turn increases the detection and localization results for all object categories. The SVM verification stage, on the other hand, uses category specific weighted merged feature to enrich the performance of the system. In the future, we will explore the possibility of detecting pose based on the window of the detected object by the SVM classifier and its surrounding visual words.
Acknowledgments This work was supported in part by the Ministry of Education, Culture, Sports, Science and Technology under the Grant-in-Aid for Scientific Research (KAKENHI 19300055).
References 1. Wu, B., Nevatia, R.: Cluster boosted tree classifier for multi-view, multi-pose object detection. In: IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil, pp. 1–8 (2007)
2. Seemann, E., Leibe, B., Schiele, B.: Multi-aspect detection of articulated objects. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR(2)), pp. 1582–1588. IEEE Computer Society, New York (2006) 3. Mansur, A., Kuno, Y.: Improving recognition through object sub-categorization. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part II. LNCS, vol. 5359, pp. 851–859. Springer, Heidelberg (2008) 4. Huang, C., Ai, H., Li, Y., Lao, S.: Vector boosting for rotation invariant multi-view face detection. In: IEEE International Conference on Computer Vision, Beijing, China, pp. 446–453. IEEE Computer Society, Los Alamitos (2005) 5. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42, 177–196 (2001) 6. Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering objects and their localization in images. In: IEEE International Conference on Computer Vision, Beijing, China, pp. 370–377. IEEE Computer Society, Los Alamitos (2005) 7. Fergus, R., Li, F.F., Perona, P., Zisserman, A.: Learning object categories from google’s image search. In: IEEE International Conference on Computer Vision, Beijing, China, pp. 1816–1823. IEEE Computer Society, Los Alamitos (2005) 8. Das, D., Mansur, A., Kobayashi, Y., Kuno, Y.: An integrated method for multiple object detection and localization. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part II. LNCS, vol. 5359, pp. 133–144. Springer, Heidelberg (2008) 9. Fritz, M., Leibe, B., Caputo, B., Schiele, B.: Integrating representative and discriminative models for object category detection. In: IEEE International Conference on Computer Vision, Beijing, China, pp. 1363–1370. IEEE Computer Society, Los Alamitos (2005) 10. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004) 11. He, X.C., Yung, N.H.C.: Curvature scale space corner detector with adaptive threshold and dynamic region of support. In: ICPR (2), pp. 791–794 (2004) 12. Murphy, K.P., Torralba, A.B., Eaton, D., Freeman, W.T.: Object detection and localization using local and global features. In: Ponce, J., Hebert, M., Schmid, C., Zisserman, A. (eds.) Toward Category-Level Object Recognition. LNCS, vol. 4170, pp. 382–400. Springer, Heidelberg (2006) 13. Bosch, A., Zisserman, A., Mu˜ noz, X.: Representing shape with spatial pyramid kernel. In: ACM Int. Conf. on Image and Video Retrieval (CIVR), Amsterdam, The Netherlands, pp. 401–408 (2007) 14. Chang, C.C., Lin, C.J.: Libsvm: A library for support vector machines (2008), http://www.csie.ntu.edu.tw/cjlin/libsvm/ 15. Leibe, B., Leonardis, A., Schiele, B.: Combined object categorization and segmentation with an implicit shape model. In: Workshop on Statistical Learning in Computer Vision, Prague, Czech Republic, pp. 17–32 (2004) 16. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes challenge (voc2007) results (2007), http://www.pascal-network.org/challenges/VOC/voc2007/ workshop/index.html
Object Detection and Localization in Clutter Range Images Using Edge Features Dipankar Das, Yoshinori Kobayashi, and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570, Japan {dipankar,yosinori,kuno}@cv.ics.saitama-u.ac.jp
Abstract. We present an object detection technique that uses local edgels and their geometry to locate multiple objects in a range image in the presence of partial occlusion, background clutter, and depth changes. The fragmented local edgels (key-edgels) are efficiently extracted from a 3D edge map by separating them at their corner points. Each key-edgel is described using our scale-invariant descriptor, which encodes the local geometric configuration by joining the edgel with the edgels adjacent to its start and end points. Using key-edgels and their descriptors, our model generates promising hypothetical locations in the image. These hypotheses are then verified using more discriminative features. The approach is evaluated on ten diverse object categories in a real-world environment.
1 Introduction
The field of robot vision is developing rapidly as robots become capable of operating with people in natural human environments. For robots to work effectively in the home and office, they need to be able to identify a specific or a general category of objects requested by the user. A great deal of research has been done on detecting objects in 2D intensity images. However, it has been challenging to design a system based on 2D intensity images that can handle problems associated with 3D pose, lighting, and shadows effectively. A range sensor provides a robust estimate of geometric information about objects, which is less sensitive to the above imaging problems and is useful for service robots to accurately locate objects in a scene. As a result, the design of an object detection and localization system using 3D range data has received significant attention over the years. Most previous 3D object recognition methods compare unknown range surfaces with the model in the database and recognize the one with the smallest metric distance [1,2,3]. These methods require a global model, which needs an extra surface modeling process from multiple range images. In this paper, we propose an alternative approach that measures the similarity between images using local feature sets. In recent years, 3D object recognition with local feature characteristics has been an active research topic in computer vision. For example, Li and Guskov [4] have proposed a method to find salient keypoints on a local surface of 3D objects and similarity has been measured using the pyramid kernel function. In [5], Chen and Bhanu have introduced an integrated local surface
descriptor for surface representation and 3D object recognition. However, their methods have been applied for only some 3D free-form objects with a great deal of surface variation in an image without any background clutter. Object recognition in range images in real-world cluttered environments is a difficult problem with widespread applications. Although the local surface descriptors perform well for some 3D free-form object classes, local 3D edge features and their descriptors are better representations for shape informative 3D objects with background clutter and partial occlusion. This paper presents a simple, effective, and fast method for detecting and localizing multiple objects in real-cluttered environments where objects are represented by sets of local and spatial distribution of 3D edges. A two-stage process is used to detect and localize objects. In the first stage, a set of hypotheses is generated using key-edgels and their descriptors. For this purpose, we first construct a 3D edge map using both boundary and fold edges. Then an edgel list is constructed from the edge map. The edgel list consists of edge segments between two corner points, the segments with a single or without any corner points and loop edges. Each edgel in the edgel list (key-edgel, ek ) is described along with its start and end points adjacent edgels. Our descriptor captures the local geometric distribution of edges, and invariant to depth and minor viewpoints changes. The key-edgels and their descriptors efficiently make the local surface boundary distinguishable among different object categories. Once the hypotheses have been generated, in the second stage of our approach, each hypothesis location is verified using additional object features along with the previous edgel features in order to make the final decision of object presence. Our additional features consist of a pyramid histogram of orientation gradient (PHOG) and are obtained using the technique proposed by Anna Bosch et al. [6]. By combining the hypothesis generation and verification techniques with efficiently detected 3D edge features, we are able to quickly detect and localize multiple object categories within an image. Experiments with real imagery containing clutter background, partial occlusion and significant depth changes demonstrate the efficiency of the algorithm. A wide variety of approaches have been proposed on object detection and localization using edges or contours. Recent research on 2D intensity images has shown that the local object edges and their networks can effectively detect shape for some object categories [7,8,9]. However, the surfaces of many object categories consist of irregular textures and colors, which lead to generating a great deal of surface edge variations within a category and makes the model difficult to detect such type of object category. However, we can get more reliable 3D edge boundaries from range images than intensity images. In object extraction from range images, some segmentation based approaches have been proposed that utilize edge features of objects. The edge-region-based range image segmentation approaches proposed by the authors of [10,11], were limited to only three simple objects and is not applicable for multiple complex objects within an image. A model based edge detection in range images of piled box-like objects using modified scan line approximation technique has been proposed by Katsoulas and
Weber [12]. However, all of the previous methods use global edges of the 3D range image for object extraction and obtain good results only on some regular objects. Their reliance on global edge properties makes them vulnerable to clutter and occlusion. In this paper, we propose a new method to detect and localize multiple objects in a range image using local 3D edge features, which can tolerate partial occlusion and cluttered backgrounds.
2 Range Data Collection and Preprocessing
We collected range images of ten object categories. All images were acquired using the MESA SwissRanger SR-4000 at a resolution of 144 (h) × 176 (w) in office and kitchen environments. Fig. 1 shows the object categories that are used in our experiments.
Fig. 1. Example range images of the ten object categories: (a) coffee mug, (b) teapot, (c) spray cleaner, (d) toy horse, (e) cup noodle, (f) pineapple, (g) electric iron, (h) kitchen bowl, (i) fry pan, (j) tissue box
In images from the MESA SwissRanger SR-4000, saturation noise occurs when the signal amplitude or the ambient light is too great. In this case, the MSB of the measured pixel is flagged as 'saturated'. This can be avoided by reducing the integration time, increasing the distance, changing the angle of the faces of reflective objects or, if necessary, by shielding the scene from ambient light. However, in an image of a scene that contains multiple objects, it is difficult to completely eliminate saturation noise. In the preprocessing stage, our filtering method removes the saturation noise from the image (Fig. 2). Since the saturation noise sometimes occurs contiguously, a conventional averaging or median filter is not appropriate for this purpose. In our filtering approach, instead of searching over the whole image, we only consider the surroundings of the positions flagged as saturation noise. If (x, y) is a flagged position, our weighted averaging filter of size m × n is given by:

g(x, y) = \frac{\sum_{i=-s}^{s} \sum_{j=-t}^{t} w(i, j)\, f(x+i, y+j)}{\sum_{i=-s}^{s} \sum_{j=-t}^{t} w(i, j)}    (1)
Fig. 2. (a) Range image with saturation noise, (b) filtered image
where

w(i, j) = \begin{cases} 0 & \text{if the } (i, j)\text{-th pixel is flagged} \\ 1 & \text{otherwise,} \end{cases}

m = 2s + 1, n = 2t + 1, and f(x, y) is the pixel value at (x, y).
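To make the preprocessing step concrete, the following is a minimal NumPy sketch of the weighted averaging filter in (1). It is not the authors' implementation (which is in Matlab); the boolean mask `flagged`, derived from the sensor's saturation flag, and the window parameters s and t are illustrative assumptions.

```python
import numpy as np

def remove_saturation_noise(depth, flagged, s=1, t=1):
    """Eq. (1): replace each flagged (saturated) pixel by the mean of the
    non-flagged pixels in its (2s+1) x (2t+1) neighbourhood; flagged
    neighbours receive weight 0."""
    out = depth.astype(float).copy()
    H, W = depth.shape
    ys, xs = np.nonzero(flagged)                 # only visit flagged positions
    for y, x in zip(ys, xs):
        y0, y1 = max(y - s, 0), min(y + s + 1, H)
        x0, x1 = max(x - t, 0), min(x + t + 1, W)
        w = ~flagged[y0:y1, x0:x1]               # weight 1 where not flagged
        if w.any():
            out[y, x] = depth[y0:y1, x0:x1][w].mean()
    return out
```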
3 Edge Feature Extraction and Generative Learning
In order to learn the generative model, pLSA [13], we first compute a co-occurrence table, where each of the range images is represented by a collection of visual words drawn from a visual vocabulary.
3.1 Key-Edgel Detection and Description
To extract reliable visual words, we first determine key-edgels on the range images that are insensitive to changes in depth and viewpoint. These key-edgels are detected by constructing an edgel list from the 3D edge map. In this research, the edges of 3D range images (Fig. 3 (e)) are formed by combining both jump edges and fold edges. The jump edges (Fig. 3 (b)) are detected using the depth discontinuity, whereas the fold edges (Fig. 3 (d)) are detected using the surface normal discontinuity. A singular value decomposition filter (SVDF) generates more reliable and noise-free fold edges from the normal-enhanced range image.
SVDF. The normal-enhanced image is sensitive to noise produced by the range sensor. As a result, when we use it to determine edges, the edge map includes a great deal of false edges, as shown in Fig. 3 (c). In order to eliminate this effect, we apply an SVDF to the normal-enhanced images. To construct the SVDF, we first create a normal-enhanced image from the range data and obtain the singular values S = {s1, s2, . . . , sk}, where k = 144 in our experiment. The singular value decomposition produces singular values in decreasing order: s1 > s2 > . . . > sk. We then normalize each si by si / max(S). The filtered image is formed by keeping the first n singular values such that sn ≥ δ. We found that δ = 0.1 produces approximately noise-free fold edges in our experiments (Fig. 3 (d)).
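The SVDF step can be summarized by the following NumPy sketch, given here only to illustrate the truncation rule s_n / max(S) ≥ δ; in practice it would be applied to each component of the normal-enhanced image separately, and the function name and interface are assumptions rather than the authors' code.

```python
import numpy as np

def svd_filter(normal_channel, delta=0.1):
    """SVD-based filtering of one channel of the normal-enhanced image:
    keep only the leading singular values whose normalized magnitude is
    at least delta (delta = 0.1 as reported in the paper)."""
    U, S, Vt = np.linalg.svd(normal_channel, full_matrices=False)
    keep = S / S.max() >= delta                  # s_n / max(S) >= delta
    return (U[:, keep] * S[keep]) @ Vt[keep, :]
```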
Fig. 3. 3D edge map for the object teapot: (a) example range image, (b) detected jump edges, (c) edge map represented by the Ny normal component, (d) result of using the SVDF, (e) combined 3D edge map

Table 1. Edgel list construction based on detected corner points on a 3D edge map

Number of corner points and their positions on the edge | Classified edgel type | Number of edgels
(i) Loop edge without corner points; (ii) one corner point and a loop edge | Isolated-1 | 1
(i) Edge without corner points; (ii) one corner point either at the start or end of the edge; (iii) two corner points at the start and end of the edge | Isolated-2 | 1
n corner points, excluding the start and end points | Connected | n + 1
(i) n corner points including either the start or the end point of the edge; (ii) n corner points on a loop | Connected | n
n corner points including both the start and end points | Connected | n − 1
Edgel List Construction. One of the principal contributions of this paper is the construction of a reliable edgel list and the description of the edgels in the list together with their geometric relationships. The edgel list is constructed from the combined 3D edge map of the range image. Edgels in the edge map are classified into two categories depending on the corner points detected in the edge map: connected edgels and isolated edgels. Corner points are detected using the technique proposed by He and Yung [14]. Table 1 summarizes the edgel partition strategy and the resulting classification. Our edgel partition strategy segments long edges into key-edgels, ek, of appropriate length. All key-edgels in the edge map make up the edgel list. The edgel list also stores the edgel type of each key-edgel.
Key-Edgel Description. In order to describe each ek in the edgel list, we first determine the edgel geometry over the local region of an object. For this purpose, the edgels adjacent to the start and end points of each ek are determined based on the edgel type stored in the edgel list. If the edgel type is isolated-1, the edgel is considered separately, without any connected components.
Fig. 4. Adjacent edgel detection: (a) coordinate window with subwindows P1–P4 around an endpoint P, used for isolated edgels; (b) 3 × 3 coordinate window around Px,y, used for connected edgels; (c) examples of detected adjacent edgel lists in a coffee mug object (e.g., the adjacent list of edgel 2 is {1, 3}, found using window (b), and the adjacent list of edgel 4 is {2, 3}, found using window (a))
For isolated-2, each of the edgel endpoints is extended slightly along its direction in order to find the adjacent edgels. This task is done efficiently using an m × m coordinate window. According to the coordinate window shown in Fig. 4 (a), if the majority of the pixels of ek fall within the P1 subwindow, we search the P3 subwindow for adjacent edgels. Here, P is the endpoint coordinate of ek. Similarly, if the majority of pixels fall in the P2 subwindow, we search P4, and so on. If the edgel is connected, the adjacent connected edgels are determined simply using the window given in Fig. 4 (b). Fig. 4 (c) shows two of the adjacent lists detected using the above coordinate windows. Here, circles on the object edges indicate detected corner points. Once the adjacent edgels of ek have been determined, each ek is described using the following descriptors:
1. For each key-edgel ek, find the number of adjacent edgels, N, of ek.
2. Calculate the normalized length of ek as lk = l_ek / L, where l_ek is the length of the key-edgel ek and L is the maximum edgel length within the image.
3. Calculate the midpoint distance dj between ek and each of its adjacent edgels, and normalize each distance by the maximum distance D among the adjacent edgels:
   d1/D, d2/D, . . . , dN/D .    (2)
4. Construct the gradient orientation histogram for ek and its adjacent edgels, and normalize the histogram to sum to unity.
To make the descriptor a fixed size, the midpoint-distance part is fixed to M dimensions by taking at most M of the N values. If N < M, the remaining M − N midpoint distance values are set to 0. If the gradient orientation histogram has K bins, the total length of the descriptor is (M + K + 2) dimensions. In order to calculate the midpoint distances, the adjacent edgel segments are ordered with respect to ek, from top to bottom; if two edgels have the same y-coordinate in the adjacent edgel list, the order is determined by the x-coordinate, from left to right. Our descriptor encodes the geometric properties of the key-edgel (orientation histogram, length, and midpoint distances). The normalization at the different steps ensures that our descriptors are scale invariant.
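The following Python sketch illustrates how the (M + K + 2)-dimensional descriptor of steps 1-4 could be assembled. The values M = 5 and K = 8, the (x, y) coordinate convention and the helper functions are illustrative assumptions; they are not specified in the paper.

```python
import numpy as np

def edgel_descriptor(edgel, adjacent, L, M=5, K=8):
    """Sketch of the (M + K + 2)-dimensional key-edgel descriptor: number of
    adjacent edgels N, normalized length, up to M normalized midpoint
    distances, and a K-bin orientation histogram.  Every edgel is an (n, 2)
    array of (x, y) edge-pixel coordinates."""
    def length(e):
        return np.sum(np.linalg.norm(np.diff(e, axis=0), axis=1))

    def midpoint(e):
        return e[len(e) // 2]

    def orientations(e):
        seg = np.diff(e, axis=0)
        return np.arctan2(seg[:, 1], seg[:, 0]) % np.pi

    N = len(adjacent)
    lk = length(edgel) / L                          # L: maximum edgel length in the image

    # order adjacent edgels top-to-bottom, ties resolved left-to-right
    adjacent = sorted(adjacent, key=lambda e: (midpoint(e)[1], midpoint(e)[0]))
    d = [np.linalg.norm(midpoint(e) - midpoint(edgel)) for e in adjacent]
    D = max(d) if d else 1.0
    d = ([di / D for di in d] + [0.0] * M)[:M]      # pad with zeros if N < M

    # K-bin orientation histogram over e_k and its adjacent edgels
    ang = np.concatenate([orientations(e) for e in [edgel] + list(adjacent)])
    hist, _ = np.histogram(ang, bins=K, range=(0.0, np.pi))
    hist = hist / max(hist.sum(), 1)                # normalize to sum to one

    return np.concatenate([[N, lk], d, hist])       # (M + K + 2)-dimensional
```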
Codebook Construction and pLSA Learning. Before using key-edgel features for generative learning, we construct a codebook or visual vocabulary [15] of feature types by clustering a set of training key-edgels according to their descriptors. For clustering, we use the K-means clustering algorithm with a codebook size of 300. The codebook is necessary to generate a bag-of-visual-words (BOVW) representation for each training or testing image. The pLSA model is then learned, for a number of topics given to the system, using the BOVW representations extracted from the training objects. The model associates each observation of a visual word, w, within an object, o, with a topic variable z ∈ Z = {z1, z2, . . . , zk}. Here, our goal is to determine P(w|z) and P(z|o) using the maximum likelihood principle. The model is fitted for all of the training object categories without knowledge of the labels of the bounding boxes. The topics are then assigned to categories based on the object-specific topic probability P(zk|oj) under each category.
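As an illustration of the codebook and BOVW construction described above, the following sketch uses SciPy's K-means routines as a stand-in for the authors' clustering implementation; the unnormalized word-count histogram and the function interfaces are assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq   # K-means for codebook construction

def build_codebook(train_descriptors, codebook_size=300):
    """Cluster training key-edgel descriptors into visual words (codebook
    size 300 as reported in the paper)."""
    X = np.vstack(train_descriptors).astype(float)
    codebook, _ = kmeans2(X, codebook_size, minit='points')
    return codebook

def bag_of_visual_words(descriptors, codebook):
    """Count how often each visual word occurs in one image or object; this
    BOVW histogram is the input to pLSA learning and to the SVM features."""
    words, _ = vq(np.asarray(descriptors, dtype=float), codebook)
    return np.bincount(words, minlength=len(codebook)).astype(float)
```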
4 Discriminative Learning with Merging Features
In our learning approach, along with pLSA, a multi-class SVM classifier is also learned using the merging features. Our objective is to represent an object category by its local shape and the spatial layout of the shape. Thus, in addition to the BOVW, we extract a PHOG descriptor. The descriptor consists of a histogram of orientation gradients over object subregions at each resolution level, i.e., a Pyramid Histogram of Orientation Gradient (PHOG). We compute the PHOG descriptor using the technique described in [6]. However, our PHOG descriptor is more reliable because we extract the PHOG only from the object area, using the bounding box, instead of from the whole image. The merging features for an object, o, are given by:
H(o) = H_BOVW(o) + H_PHOG(o) .    (3)
The merging operation appends the two histograms into one large histogram, H(o). The multi-class SVM classifier is learned using the features given by equation 3. We use the LIBSVM package [16] in multi-class mode with the RBF kernel.
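A possible realization of the merging features of equation 3 and of the multi-class SVM training is sketched below, using scikit-learn's SVC as a stand-in for the LIBSVM package the authors used; the parameter values C = 10 and γ = 2.5 are those reported in Section 6, while the function names and interfaces are illustrative.

```python
import numpy as np
from sklearn.svm import SVC   # stand-in for LIBSVM's C-SVC with an RBF kernel

def merge_features(h_bovw, h_phog):
    """Eq. (3): append the BOVW and PHOG histograms into one feature vector."""
    return np.concatenate([h_bovw, h_phog])

def train_object_classifier(features, labels, C=10.0, gamma=2.5):
    """Multi-class RBF-kernel SVM over the merged features; C and gamma are
    the grid-search values reported in the experiments."""
    clf = SVC(C=C, gamma=gamma, kernel='rbf')
    return clf.fit(np.vstack(features), labels)
```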
5 Hypothesis Generation and Verification
In the recognition stage, promising hypotheses are generated within the range image for probable object locations. For this purpose visual words are extracted from the range image using the same technique as described in section 3. Each visual word is classified under the topic with the highest topic specific probability P (wi |zk ). Then promising hypotheses are generated using the modified hypothesis generation algorithm of Das et al. [17]. In our approach, the visual word that is represented by the key-edgel descriptor is treated differently. When a key-edgel supports a topic, it is assumed that all the coordinate points of the edgel support that topic. In order to generate the initial region of interest and the promising hypotheses, all supported coordinate points under different topics are taken into consideration instead of taking simply the visual word. Once
promising hypotheses have been generated, merging features are extracted from the regions bounded by the windows of the promising hypotheses and fed into the multi-class SVM classifier in the recognition mode. Only the hypotheses for which a positive confidence measurement is returned are kept for each object. Objects with the highest confidence level are detected as the correct objects.
6 Experimental Results
To validate our approach, we detect and localize multiple objects in images with partial occlusion, cluttered backgrounds and significant depth changes. The dataset used in this experiment includes range images of ten object categories (Fig. 1). There were a total of 590 range images. Among them, 300 images (single object per image) were used for training, and the remaining 290 images, containing 685 objects, were used for testing the system. Some objects are symmetric (kitchen bowl, pineapple, etc.) and others are asymmetric (coffee mug, teapot, etc.) with respect to rotation. Since objects were presented randomly within an image, there exist differences in depth, position, and rotation. Depth changes caused a significant amount of scale variation among objects. In our approach, object presence detection means determining if one or more objects are present in an image, and localization means finding the locations of objects in that image. The localization performance is measured by comparing the detected window area with the ground truth object window. Based on the object presence detection and localization, an object is counted as a true positive if the detected object boundary overlaps by 50% or more with the ground truth bounding box for that object. Otherwise, the detected object is counted as a false positive. In the experimental evaluation, the detection and localization rate (DLR) is defined as:

DLR = \frac{\text{\# of true positive objects}}{\text{\# of annotated objects}} .    (4)

The false positive rate (FPR) is defined as:

FPR = \frac{\text{\# of false positive objects}}{\text{\# of annotated objects}} .    (5)
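The evaluation protocol can be made concrete with the following sketch. The 50% criterion is stated informally in the text, so measuring the overlap relative to the ground-truth box area is one possible reading, and the data structures used here are assumptions.

```python
def overlap_with_gt(det, gt):
    """Fraction of the ground-truth box (x1, y1, x2, y2) covered by the
    detected box; one reading of the 50% overlap criterion in the text."""
    ix = max(0.0, min(det[2], gt[2]) - max(det[0], gt[0]))
    iy = max(0.0, min(det[3], gt[3]) - max(det[1], gt[1]))
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return ix * iy / gt_area if gt_area > 0 else 0.0

def dlr_fpr(matched_pairs, n_annotated, thr=0.5):
    """Eqs. (4)-(5).  matched_pairs is a list of (detected_box, gt_box or
    None); a detection without a ground-truth match, or with less than 50%
    overlap, counts as a false positive."""
    tp = sum(gt is not None and overlap_with_gt(det, gt) >= thr
             for det, gt in matched_pairs)
    fp = len(matched_pairs) - tp
    return tp / n_annotated, fp / n_annotated
```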
Performance of the New Descriptor. Before investigating the accuracy of the system, we first show how well visual words are generated on different objects within an image. Fig. 5 shows the generated probable visual words for different object categories within images. The visual words are indicated by different markers for different object categories (Figs. 5 (a) and (b)). From the figure, it is clear that our hypothesis generation algorithm predicts nearly accurate visual words for the object categories. When objects are rotated by 180°, our approach also predicts object categories accurately. In this case, during the learning stage, our generative model subdivides each rotationally asymmetric object category into two subtopics; rotationally symmetric object categories, however, are grouped under a single topic (Table 2). Figs. 5 (c) and (d) show the final detected objects with their locations within the range image.
Fig. 5. Detected probable object edgels (a) and (b), and verified locations (c) and (d)

Table 2. Example sub-categories with Z = 17 for our dataset

Category  | Coffee mug | Teapot | Spray cleaner | Toy horse | Cup noodle | Pineapple | Electric iron | Fry pan | Tissue box | Kitchen bowl
# topics  |     2      |   2    |       2       |     2     |     1      |     1     |       2       |    2    |     2      |      1

Table 3. Experimental results on our dataset

                 | Coffee mug | Teapot | Spray clean. | Toy horse | Cup noodle | Pineapple | Elect. iron | Fry pan | Tissue box | Kitch. bowl | Avg. rate
BOVW DLR         |    0.56    |  0.88  |     0.88     |   0.83    |    0.89    |   0.92    |    0.91     |  0.76   |    0.56    |    0.76     |   0.79
BOVW FPR         |    0.09    |  0.03  |     0.19     |   0.15    |    0.19    |   1.52    |    0.19     |  5.69   |    1.56    |    0.16     |   0.40
BOVW+PHOG DLR    |    0.63    |  0.93  |     0.94     |   0.99    |    0.91    |   0.98    |    0.95     |  1.00   |    0.66    |    0.78     |   0.86
BOVW+PHOG FPR    |    0.04    |  0.04  |     0.21     |   0.10    |    0.19    |   0.77    |    0.15     |  2.92   |    1.09    |    0.09     |   0.25
Accuracy and Computational Efficiency on Our Dataset. In this subsection, we measure the detection and localization performance on our dataset. The values of the different parameters are obtained using validation and cross-validation datasets. In this experiment, the best codebook size (300) and number of topics (17) were obtained on the validation dataset. The values of the cost parameter, C, and kernel parameter, γ, were obtained by five-fold (v = 5) cross-validation using the grid search technique. The best values of C and γ for our experiments are 10 and 2.5, respectively. Table 3 shows the detection and localization rate at the false positive rate indicated in the adjacent row. Images with multiple objects, along with their ground truth bounding boxes, are used to determine the accuracy of the system. As shown in Table 3, the BOVW feature alone produces an average DLR of 79% with an FPR of 40%. The merging features (BOVW + PHOG), however, increase the DLR to 86% while reducing the FPR from 40% to 25%. Thus, we get more reliable results using the merging features. Fig. 6 shows the detection and localization results on our dataset. In each pair of rows in Fig. 6, the first row shows the detection results of our approach on range images, documenting the performance under cluttered backgrounds, partial occlusion, and significant depth and small viewpoint changes. The second row displays the corresponding color images for visual clarity.
Fig. 6. Detected objects in range images and their corresponding color images. In each pair of rows, the first row shows the detection results on range images and the second row displays the corresponding color images for visual clarity.
Our detection and localization approach is very fast. The average times for the different stages of our approach are given in Table 4. The system is implemented in Matlab and execution times are measured on an Intel(R) Core(TM)2 CPU at 2.4 GHz. The computational time of our system depends on the total number of object categories. The second row of Table 4 shows the average time required to
Table 4. Execution time (in sec) of the system

                     | Preprocessing | Feature extraction & description | Codebook generation | Hypo. gen. & verification | Total time
Avg. time per image  |     0.17      |              1.08                |        0.03         |           0.17            |    1.45
Avg. time per object |     0.07      |              0.46                |        0.01         |           0.07            |    0.61
search for all possible ten object categories within an image. Since the last row shows the average time per object, calculated over the 290 test images containing 685 objects of ten categories, this average time will decrease if the number of actual target objects within an image increases.
7 Conclusion
In this paper, we have presented an efficient approach for multiple-object-category detection and localization using local edge features from range images. Since we use local edgels and their geometry, the approach is robust to partial occlusion, background clutter and significant depth changes. The detection rate and computational efficiency suggest that our technique is suitable for real-time use. The method is useful for service robot applications, because it can use 3D information to know the exact position and pose of the target objects. The current system can detect rotationally symmetric objects under large viewpoint changes; for rotationally asymmetric object categories, however, it can handle only small viewpoint changes. In theory, if the training data consist of images with large viewpoint changes, then the generative model automatically subcategorizes objects in a given category into multiple topics and generates nearly accurate hypotheses. However, to do this, we need to show many images from various viewpoints, and the number of subcategories may increase greatly. We are now working on how to deal with this problem.
Acknowledgments. This work was supported in part by the Ministry of Education, Culture, Sports, Science and Technology under the Grant-in-Aid for Scientific Research (KAKENHI 19300055).
References 1. Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Trans. Pattern Anal. Mach. Intell. 21, 433–449 (1999) 2. van Dop, E.R., Regtien, P.P.L.: Object recognition from range images using superquadric representations. In: IAPR Workshop on Machine Vision Applications, Tokyo, Japan, pp. 267–270 (1996)
3. Li, S.Z.: Object recognition from range data prior to segmentation. Image Vision Comput. 10, 566–576 (1992) 4. Li, X., Guskov, I.: 3d object recognition from range images using pyramid matching. In: IEEE Int. Conf. on Computer Vision, Rio de Janeiro, Brazil, pp. 1–6 (2007) 5. Chen, H., Bhanu, B.: 3d free-form object recognition in range images using local surface patches. Pattern Recognition Letters 28, 1252–1262 (2007) 6. Bosch, A., Zisserman, A., Mu˜ noz, X.: Representing shape with spatial pyramid kernel. In: ACM Int. Conf. on Image and Video Retrieval, Amsterdam, The Netherlands, pp. 401–408 (2007) 7. Ferrari, V., Fevrier, L., Jurie, F., Schmid, C.: Group of adjacent contour segment for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 30, 30–51 (2008) 8. Shotton, J., Blake, A., Cipolla, R.: Contour-based learning for object detection. In: IEEE Int. Conf. on Computer Vision, Beijing, China, pp. 503–510. IEEE Computer Society, Los Alamitos (2005) 9. Opelt, A., Pinz, A., Zisserman, A.: A boundary-fragment-model for object detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 575–588. Springer, Heidelberg (2006) 10. Wani, M.A., Batchelor, B.G.: Edge-region-based segmentation of range images. IEEE Trans. Pattern Anal. Mach. Intell. 16, 314–319 (1994) 11. Sood, A.K., Al-Hujazi, E.: An integrated approach to segmentation of range images of industrial parts. In: IAPR Workshop on Machine Vision Applications, Kokubunji, Tokyo, Japan, pp. 27–30 (1990) 12. Katsoulas, D., Werber, A.: Edge detection in range images of piled box-like objects. In: Int. Conf. on Pattern Recognition, pp. 80–84. IEEE Computer Society, Los Alamitos (2004) 13. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42, 177–196 (2001) 14. He, X.C., Yung, N.H.C.: Curvature scale space corner detector with adaptive threshold and dynamic region of support. In: Int. Conf. on Pattern Recognition, pp. 791–794 (2004) 15. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: European Conf. on Computer Vision, Workshop on Statistical Learning in Computer Vision, Prague, Czech Republic (2004) 16. Chang, C.C., Lin, C.J.: Libsvm: A library for support vector machines (2008), http://www.csie.ntu.edu.tw/cjlin/libsvm/ 17. Das, D., Mansur, A., Kobayashi, Y., Kuno, Y.: An integrated method for multiple object detection and localization. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part II. LNCS, vol. 5359, pp. 133–144. Springer, Heidelberg (2008)
Learning Higher-Order Markov Models for Object Tracking in Image Sequences Michael Felsberg and Fredrik Larsson Department of Electrical Engineering, Linköping University {mfe,larsson}@isy.liu.se
Abstract. This work presents a novel object tracking approach, where the motion model is learned from sets of frame-wise detections with unknown associations. We employ a higher-order Markov model on position space instead of a first-order Markov model on a high-dimensional state-space of object dynamics. Compared to the latter, our approach allows the use of marginal rather than joint distributions, which results in a significant reduction of computation complexity. Densities are represented using a grid-based approach, where the rectangular windows are replaced with estimated smooth Parzen windows sampled at the grid points. This method performs as accurately as particle filter methods with the additional advantage that the prediction and update steps can be learned from empirical data. Our method is compared against standard techniques on image sequences obtained from an RC car following scenario. We show that our approach performs best in most of the sequences. Other potential applications are surveillance from cheap or uncalibrated cameras and image sequence analysis.
1 Introduction
Object tracking is a common vision problem that requires temporal processing of visual states. Assume that we want to estimate the position of an object moving in 3D space, given its observed and extracted position (i.e. coordinates) in 2D image data, taken from an uncalibrated moving camera. Our focus is on temporal filtering; however, this problem is specific to vision-based tracking since the association problem between visual detections and objects does not exist in many classical sensors, e.g., accelerometers. The output of the proposed method is 2D trajectories of physical objects. The objects' dynamics are assumed to be unknown and non-linear and the noise terms non-Gaussian. This setting constitutes a hard, weakly-supervised learning problem for the motion and measurement models since no point-to-point correspondences between the observations are available. Once learned, the motion models are applied in a Bayesian tracking framework to extract trajectories from sequences of sets of detections, i.e., also solving the association problem between detections and objects. The major advantage of our approach compared to other learning methods is that sets of frame-wise detections with unknown correspondences are much easier to extract than strictly corresponding detections or fully stable tracking of object appearances. We
The research leading to these results has received funding from the European Community’s 7th Framework Programme (FP7/2007-2013) under grant agreement n◦ 215078 DIPLECS.
employ a higher-order Markov model on position space instead of a first-order Markov model on a high-dimensional state-space of object dynamics. Compared to the latter, our approach allows the use of marginal rather than joint distributions, which results in a significant reduction of computation complexity. Densities are represented using a grid-based method, where the rectangular windows are replaced with a smooth Parzen window estimator sampled at the grid points, where sampling is meant in the signal processing sense (i.e. not stochastic sampling) throughout this paper. This method is as accurate as particle filter methods [1] with the additional advantage that the prediction and update steps can be learned from empirical data. The densities are estimated and processed in the channel representation and thus the employed tracking approach is called channel-based tracking (CBT). 1.1 Related Work Relevant literature research can be found in the area of non-linear, non-Gaussian Bayesian tracking [2,3]. In Bayesian tracking, the current state of the system is represented as a probability density function of the system’s state space. At the time update, this density is propagated through the system model and an estimate for the prior distribution of the system state is obtained. At the measurement update, measurements of the system are used to update the prior distribution, resulting in an estimate of the posterior distribution of the system state. Gaussian, (non-)linear Bayesian tracking is covered by (extended) Kalman filtering. Common non-Gaussian approaches are particle filters and grid-based methods [2]. Whereas particle filters apply Monte Carlo methods for approximating the relevant density function, grid based methods discretize the state-space, i.e., apply histogram methods for the approximation. In the case of particle filters, densities are propagated through the models by computing the output for individual particles. Grid-based methods use discretized transition maps to propagate the histograms and are closely related to Bayesian occupancy filtering [4]. An extension to grid based methods is to replace the rectangular histogram bins with overlapping, smooth Parzen windows, that are regularly sampled. This method is called channel-based tracking [1]. CBT implements Bayesian tracking using channel representations [5] and linear mappings on channel representations, so-called associative networks [6]. The main advantage compared to grid-based methods is the reduction of quantization effects and computational effort. Also, it has been shown that associative networks can be trained from data sets with unknown element-wise correspondence [7]. As pointed out above, channel representations are sampled Parzen window estimators [8], implying that CBT is related to kernel-based prediction for Markov sequences [9]. In the cited work, system models are estimated in a similar way as in CBT, but the difference is that sampled densities make the algorithm much faster. Another way to represent densities in tracking are Gaussian mixtures (e.g. [10]) and models based on mixtures can be learned using the EM algorithm, cf. [11], although the latter method is restricted to uni-modal cases (Kalman filter) and therefore disregarded. A vision-specific problem in tracking is the associations of observations and objects, in particular in multiple object tracking [12]. Standard solutions are probabilistic multiple-hypothesis tracking (PMHT) [13] and the probabilistic data association filter
(PDAF) [14]. Thus, in our experiments, we have been comparing our approach to a combination of PMHT and a set of Kalman filters, based on an implementation of [15] and [16], and our own implementation of PDAF. The main novelties of this paper compared to the approach of CBT as defined in [1] are: 1: The multi-dimensional state-space is embedded in a probabilistic formulation (previous work only considered a 1D state and just concatenated channel vectors, leading to sums of densities). 2: The higher-order Markov model for the CBT is embedded into a first-order model. This allows us to use the Baum-Welch algorithm to learn models from datasets without known associations. 3: The Baum-Welch algorithm has been adapted to using channels. 4: The tracking is applied for visual tracking among multiple objects with a moving camera, and compared to PMHT and PDAF.
1.2 Organization of the Paper
After the introduction, the methods required for further reading are introduced: Bayesian tracking, channel representations of densities, and CBT. The novelties of this paper are covered in Section 3: First, probabilistic multi-dimensional formulations for CBT are considered. Second, the CBT method is extended to embed the higher-order Markov model into a first-order model and we show that it is sufficient to use the marginals of a higher-order Markov model to track multiple objects. Third, we adapt the Baum-Welch algorithm to the CBT formulation. Fourth, we provide empirical evidence that correspondence-free learning works with the Baum-Welch algorithm applied to the first-order model embedding. In Section 4, the whole algorithm is evaluated on image sequences acquired from an RC car. In Section 5 we discuss the achieved results.
2 Channel-Based Bayesian Tracking
Channel-based tracking (CBT) is a generalization of grid-based methods for implementing non-linear, non-Gaussian Bayesian tracking. Hence we give a brief overview on Bayesian tracking and channel representations before we describe CBT. The material of this section summarizes the material from [1].
2.1 Bayesian Tracking
For the introduction of concepts from Bayesian tracking we adopt the notation from [2]. Bayesian tracking is commonly defined in terms of a process model f and a measurement model h, distorted by i.i.d. noise v and n:

x_k = f_k(x_{k-1}, v_{k-1}),    z_k = h_k(x_k, n_k).    (1)

The symbol x_k denotes the system state at time k and z_k denotes the observation made at time k. Both models are in general non-linear and time-dependent. The current state is estimated in two steps. First, given that the posterior density of the previous state and all previous observations are known and assuming a Markov process of order one, the prior density of the current state is estimated in the time update as

p(x_k | z_{1:k-1}) = \int p(x_k | x_{k-1})\, p(x_{k-1} | z_{1:k-1})\, dx_{k-1}.    (2)
Second, the posterior is obtained from the measurement update as

p(x_k | z_{1:k}) = \frac{p(z_k | x_k)\, p(x_k | z_{1:k-1})}{\int p(z_k | x_k)\, p(x_k | z_{1:k-1})\, dx_k}.    (3)
In the case of non-linear problems with multi-modal densities, two approaches for implementing (2) and (3) are commonly used: the particle filter and the grid-based method. Since CBT is a generalization of the grid-based method, we focus on the latter. Grid-based methods assume a discrete state space such that the continuous densities are approximated with histograms. Thus, conditional probabilities of state transitions are replaced with linear mappings. In contrast to [2], where densities were formulated using Dirac distributions weighted with discrete probabilities, we assume band-limited densities and apply sampling theory, since this is more consistent with the formulation of CBT. Sampling the densities p(x_{k_1} | z_{1:k_2}) gives us

w^i_{k_1|k_2} = p(x_{k_1} | z_{1:k_2}) * \delta(x^i - x_{k_1}),    k_1, k_2 \in \{k-1, k\}    (4)
where * denotes convolution and \delta(x^i - x) is the Dirac impulse at x^i. Combining (2) and (4) and applying the power theorem gives us

w^i_{k|k-1} = \sum_j f^{ij}_k\, w^j_{k-1|k-1}    (5)

where f^{ij}_k = p(x_k | x_{k-1}) * \delta(x^i - x_k) * \delta(x^j - x_{k-1}). Accordingly, combining (3) and (4) results in

w^i_{k|k} = \frac{h^i_k(z_k)\, w^i_{k|k-1}}{\sum_j h^j_k(z_k)\, w^j_{k|k-1}}    (6)
where h^i_k(z_k) = p(z_k | x_k) * \delta(x^i - x_k). Grid-based methods require more samples the higher the upper band limit of the pdf is, i.e., the wider the characteristic function \varphi_x(t) = E\{\exp(i t^T x)\}.
2.2 Channel Representations of Densities
The channel representation [5,17] can be considered as a way of sampling continuous densities or, alternatively, as histograms where the bins are replaced with smooth, overlapping basis functions b(x), see e.g. [18]. Consider a density function p(x) as a continuous signal that is sampled with a smooth basis function, e.g., a B-spline. It is important to realize here that the sampling takes place in the dimensions of the stochastic variables, not along the time axis k. It has been shown in the literature that an averaging of a stochastic variable in channel representation is equivalent to the sampled Parzen window (or kernel density) estimator with the channel function as kernel function [8]. For the remainder of this paper it is chosen as [19]

b(x) = \prod_n \frac{2a}{\pi} \cos^2(a x_n) \quad \text{if } |x_n| < \frac{\pi}{2a}, \quad 0 \text{ otherwise.}    (7)

Here a determines the relative width, i.e., the sampling density. For the choice of a the reader is referred to [20]. According to [8], the channel representation reduces the quantization effect compared to ordinary histograms by a factor of up to 20. Switching from histograms to channels allows us to reduce the computational load by using fewer bins, to increase the accuracy for the same number of bins, or a mixture of both.
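For illustration, a one-dimensional NumPy sketch of the cos² channel encoding in (7) is given below. The choice of channel centers (e.g., spaced by π/(3a)) and the averaging over samples, which yields the sampled Parzen window estimate mentioned above, follow the text, but the interface itself is an assumption.

```python
import numpy as np

def cos2_basis(x, centers, a):
    """cos^2 channel basis of Eq. (7), evaluated at the distances x - centers."""
    d = x - centers
    return np.where(np.abs(d) < np.pi / (2 * a),
                    (2 * a / np.pi) * np.cos(a * d) ** 2, 0.0)

def encode(samples, centers, a):
    """Channel vector of a set of samples: averaging the channel-encoded
    samples gives a sampled Parzen window (kernel density) estimate."""
    # centers could be chosen as, e.g., offset + np.arange(K) * np.pi / (3 * a)
    return np.mean([cos2_basis(s, centers, a) for s in samples], axis=0)
```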
For performing maximum likelihood or maximum a posteriori (MAP) estimation using channels, a suitable algorithm for extracting the maximum of the represented distribution is required. For cos²-channels with a spacing of \pi/(3a), an optimal algorithm in the least-squares sense is obtained in the one-dimensional case as [19]

\hat{x}_{k_1} = l + \frac{1}{2a} \arg\!\left( \sum_{j=l}^{l+2} w^j_{k_1|k_1} \exp(i\, 2a\, (j - l)) \right).    (8)
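A direct one-dimensional implementation of the decoding rule (8), including the selection of the index l by the maximum sum of three consecutive coefficients described below, could look as follows (a sketch, with positions expressed in units of the channel grid):

```python
import numpy as np

def decode(w, a):
    """Local decoding of Eq. (8): pick the triplet of consecutive channel
    coefficients with the largest sum and recover the mode; channel index l
    corresponds to position l on the channel grid."""
    sums = w[:-2] + w[1:-1] + w[2:]
    l = int(np.argmax(sums))
    z = np.sum(w[l:l + 3] * np.exp(1j * 2 * a * np.arange(3)))
    return l + np.angle(z) / (2 * a)
```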
N-dimensional decoding is obtained by local marginalization in a window of size 3^N and subsequent decoding of the N marginals. The index l of the decoding window is chosen using the maximum sum of a consecutive triplet of coefficients: l = \arg\max_j (w^j_{k_1|k_1} + w^{j+1}_{k_1|k_1} + w^{j+2}_{k_1|k_1}).
2.3 Channel-Based Tracking
Channel-based tracking (CBT) is defined by replacing the sampled densities (4) with

w^i_{k_1|k_2} = p(x_{k_1} | z_{1:k_2}) * b(x^i - x_{k_1})    (9)
where b(x) is the channel basis function (7). The power theorem which has been used to derive (5) and (6) does not hold in general if we sample with channels instead of impulses, because some high-frequency content might be removed. However, if the densities are band-limited from the start, the regularization by the channel basis functions removes no or only little high-frequency content and (5) and (6) can be applied for the channel-based density representations as well. For what follows, the coefficients of (5) are summarized in the matrix F_k = \{f^{ij}_k\} and the coefficients of (6) are summarized in the vector-valued function h_k(z_k) = \{h^j_k(z_k)\}. Both operators can be learned from a set of training data if both remain stationary and we remove the time index k (not from z_k though): F and h(z_k). The prior and posterior densities are now obtained by

w_{k|k-1} = F\, w_{k-1|k-1},    w_{k|k} = \frac{h(z_k) \cdot w_{k|k-1}}{h^T(z_k)\, w_{k|k-1}},    (10)
k=1
k−1|k−1
k=1
where 1 denotes a one-vector of suitable size and the quotient is evaluated point-wise. For the initial time step, no posterior of the previous state is available and the time update cannot be computed using the model matrix above. Instead, the empirical distribution for the initial state is stored and used as w0|0 .
Learning Higher-Order Markov Models for Object Tracking in Image Sequences
189
The measurement model h(zk ) and its estimation is not considered in more detail here, since our experiments are restricted to the case where zk are noisy and cluttered observations of wk . A more general case of observation models has been considered in [1]. In summary the algorithm is just iterating (10) and (8) over k.
3 Learning Higher-Order Markov Models In this section, we generalize (10) for multi-dimensional input.1 In a next step, we show why higher-order marginalized Markov models are suitable for tracking multiple objects. We describe further how they can be embedded in a first-order model in order to apply standard algorithms like the Baum-Welch algorithm [22]. Finally, we explain how these models can be learned from data without point-wise correspondences. 3.1 Multi-dimensional Case Consider the conditional density of a certain state dimension m given N previous states and apply Bayes’ theorem: 1 N 1 N m m 1 N p(xm k |xk−1 , . . . , xk−1 ) = p(xk−1 , . . . , xk−1 |xk )p(xk )/p(xk−1 , . . . , xk−1 ) . (12)
Since channel representations of densities are closely related to robust statistics [8] and since robust matching of states allows to assume mutual independence of the old states x1k−1 , . . . , xN k−1 [23], we obtain 1 N p(xm k |xk−1 , . . . , xk−1 ) =
N
n m (p(xnk−1 |xm k )/p(xk−1 ))p(xk )
(13)
n=1
and applying Bayes’ theorem once more results in 1 N m 1−N p(xm k |xk−1 , . . . , xk−1 ) = p(xk )
N
n p(xm k |xk−1 ) .
(14)
n=1
Note that the new states xm k still depend on all old states but these conditional densities are computed by separable products of pairwise conditional densities and a proper normalization. This factorization is of central importance to avoid a combinatorial explosion while producing only a small approximation error. A practical problem is, however, that the densities are represented by channels and repeatedly multiplying these representations will lead to extensive low-pass filtering of the true densities. Their product might not even be a valid channel vector! Considering the basis functions (7) more in detail, it turns out that taking the squareroot of the product of channel vectors is a good approximation of the channel representation of the corresponding density function product
(p_1 p_2) * b(x^i) \approx \sqrt{(p_1 * b(x^i))\,(p_2 * b(x^i))}    (15)
and whenever multiplying densities in channel representation, we applied a square-root to the factors. This product is directly related to the Bhattacharyya coefficient [24].
Note that the method proposed in [1], namely to concatenate channel vectors, is not correct in full generality.
3.2 Higher-Order Markov Models for Tracking In order to have low computational complexity during runtime and a moderate number of learning samples during training, joint densities of high-dimensional state spaces should be avoided. Replacing joint densities with the respective marginals is possible in case of statistical independence, but in the practically relevant case of tracking multiple objects, using marginals means to mix up properties from different objects. Instead of having two objects with two different properties each, one ends up with four objects: The correct ones and two ghost objects with mixed properties, see below. The approach to drastically reduce dimensionality is no option either. The state space should contain sufficiently many degrees of freedom to describe the observed phenomena. Typical choices for tracking are first or second order Euclidean motion states or higher-order Markov models, albeit less frequently used. Euclidean motion states appear more attractive when the system model is to be engineered, since we are used to thinking in physical models. However, this is not relevant when it comes to learning systems. Contrary, learning systems are easier to design when the in-data lives in the same space and hence we consider n-tuples of positions instead of motion states. Actually, higher-order Markov models have an important advantage compared to motion states: In case of several moving objects, it is important to have correspondence between the different dimensions of the motion state, i.e. to bind it to a particular physical object. Otherwise one ends up with a grid of possible (and plausible) ghost states.The same happens if e.g. a second-order Markov model of position is used (which corresponds to position and velocity) and correspondence between the consecutive states is lost. However, depending on the absolute placement of the states, the ghost states quickly diverge into regions outside an area which can be assumed to be plausible. Furthermore, it is very unlikely that there will be consistent measurements far away from the correct locations, i.e., the wrong hypotheses will never get support from measurements. In expectation value sense, all wrong hypotheses will vanish, which is a direct consequence of the proof in [7]. Hence, if joint densities are no option due to computational complexity, the higher-order Markov model is more suitable for multi-object tracking than motion state spaces. Using higher-order Markov models has already been proposed in [1], however not in a proper product formulation as derived in Sect. 3.1. 3.3 Embedding nD Higher-Order Models Higher-order Markov models depend on more states that just the previous one. In order to make use of the Markov property, n consecutive states need to be embedded in a larger state vector. However, as shown in Sect. 3.1, we have to multiply all densities obtained from different dimensions and time-steps according to (14), i.e., we may not propagate the new state vectors through a linear mapping. Instead, we obtain m wk|k−1 = (wkm )1−N/2
N n Fm n wk−1|k−1 .
(16)
n=1
What remains is how to learn the models F^m_n, h and the prior. Note that due to the separable product in (16), all linear models can be learned separately by the Baum-Welch algorithm [22], and we omit the indices n and m in what follows.
In its initial formulation, the Baum-Welch algorithm is supposed to work on discrete state spaces. It can thus be applied to grid-based methods, but it has to be modified according to Sect. 3.1 for being applicable to channel vectors. Hence, all products of densities occurring in the Baum-Welch algorithm are replaced with square-root products of channel vectors. The α-factor from the algorithm is identical to the update equations (10) (\alpha_k = w_{k|k}), which are modified accordingly. The β-factor from the algorithm is computed by propagating backwards through the measurement model and the system model

\beta_k = (F^T (h(z_{k+1}) \cdot \beta_{k+1}))^T    (17)

where · denotes the element-wise product. Again, all products are replaced with square-root products in the case of channels. The computation for the system model update is straightforward, given that the factors α and β are known:

F \leftarrow \frac{1}{N} \sum_k \beta_{k+1} \cdot h(z_{k+1}) \cdot (F \alpha_k)    (18)

for a suitable normalization N. The measurement model is implemented as a mapping to measurement channels, i.e., a matrix as well, and it is updated as

h(z) \leftarrow \frac{1}{N} \sum_k z_k (\alpha_k \cdot \beta_k).    (19)

In the tracking evaluation, the β-factor has not been used, due to its anti-causal computation.
3.4 Correspondence-Free Learning
It has been shown that certain types of learning algorithms on channel representations do not require element-wise correspondence for learning and that learning becomes even faster if sets of samples are shown simultaneously [7]. This scenario is exactly the one that we meet when we try to learn associations of observations and objects: detections might fail or might be irrelevant. Consider e.g. Fig. 1 for the type of learning problem that we face: we get detection results without correspondence, including drop-outs, outliers, and displacements. From these detections we train a system model and a measurement model. Using these models, we track individual cars through the sequence. The correspondence-free learning has been shown in [7] by proving equivalence to a stochastic gradient descent on the learning problem with correspondence. The central question here is whether the Baum-Welch algorithm will lead to a similar result, i.e., whether the expectation of the algorithm will be the solution of the learning problem with correspondence. We will not give a formal proof here, but in the subsequent section we give empirical evidence that this is the case. The Baum-Welch algorithm is initialized with the covariance-based estimates that were used in [1]. In our experiments, the algorithm converged after four iterations. The fact that the algorithm found a model capable of tracking an individual object empirically supports that correspondence-free learning also works in this scenario. Thus, CBT using Baum-Welch gives a solution for the hard association problem of detections and objects without additional optimization of discrete assignment problems.
Fig. 1. Two consecutive frames, 321 (left) and 322 (right), from the first RC car (rightmost box) sequence with detections (boxes)
4 Experiments We evaluated the proposed method in comparison to PMHT [13] and to PDAF [14]. Both methods are extensions to the Kalman filter, which try to overcome the problem of object tracking in the presence of multiple, false and/or missing measurements. PMHT uses a sliding window that incorporates previous, current and future measurements and applies the expectation-maximization approach to get the new state estimates. PDAF makes probabilistic associations between target and measurements by combined innovation. That is, PDAF makes a weighted update based on all measurements where the weights are given by the probability for the respective measurement given the current prediction. The experimental setup is illustrated in Fig. 1: From (partly very low quality) image sequences, each consisting of several hundred frames showing the front view, we detect cars. These detections are indicated by the colored boxes. Note that we do not use visual tracking of vehicles by e.g. least-squares, since the cars might change their appearance significantly from different views, due to shadows, highlights, change of aspect etc. For detection we use a real-time learned cascade classifier [25]2 . The input to our tracking algorithm is an unsorted list of coordinates with corresponding likelihood-ratios for each frame. We trained the system in leave-one-out manner on the detections from all sequences except for the respective evaluation sequence. We were only interested in the other RC car in the sequences, i.e., we were only interested in the primary occurring trajectory. The parameters were chosen as follows. For the PMHT method, we obtained most stable results with a motion model including velocity but with constant size; size-change estimates did not improve the results. For the cost function of an association, we chose distance instead of probability, since the latter did not give reasonable results for large parts of the trajectories. The PMHT implementation is based on the implementation3 of [15] and [16]. For the PDAF method, we also obtained the most stable results with the 2
3
We would like to thank our project partners from Prague (J. Matas, T. Werner and A. Shekhovtsov) for providing the detections. http://www.anc.ed.ac.uk/demos/tracker/
Table 1. The RMSE of each method compared to manually labeled ground truth. The number in parentheses denotes the maximum deviation that was used; if the actual deviation was larger, it was replaced with this value.

         | CBT | PMHT | PDAF | Detector
RC1 (∞)  | 7.3 | 13.1 | 18.4 | 40.43
RC1 (20) | 6.6 |  8.9 |  9.1 | 10.54
RC1 (5)  | 3.9 |  3.2 |  4.0 |  4.17
RC2 (∞)  | 6.7 | 15.6 | 23.3 | 42.72
RC2 (20) | 6.5 |  7.1 |  8.5 | 10.58
RC2 (5)  | 3.9 |  3.5 |  3.9 |  4.23
Fig. 2. Result on RC car sequences. Left/center/right column shows the result of CBT/PDAF/PMHT on RC sequence 1. The red crosses indicate the detections and the blue curve is the result obtained by each method. We have only plotted the result for the x-axis.
model described above. We computed the weights of each measurement as the probabilities for the respective measurement given the prediction based on previous timesteps. For the CBT method, we chose 20 channels per state vector dimension and a model of order 3. After some optimization, all three methods delivered reasonable trajectories in all test cases. However, for PMHT and PDAF the association of detections to the correct object sometimes failed, e.g., for PMHT around frame 80 in RC1 and for PDAF around frame 150 in RC1. Figure 2 shows the obtained result for this sequence. The initial detections are indicated as red crosses, and the obtained results as blue curves. The
accuracy of PMHT is very good in all segments. However, spurious trajectories had previously been removed by thresholding the length of the trajectories. PDAF shows slightly worse accuracy but, on the other hand, suffers less from completely losing track of the object. Notice that CBT does not lose track of the object at all and also shows the best average accuracy in most cases. The root mean squared error (RMSE) of the three methods compared to manually labeled ground truth is shown in Table 1. In order to make the comparison fair for PMHT, which is designed to track multiple objects instead of a single object in clutter, we always choose the reported object that is closest to the ground truth. This means that we ignore each time PMHT loses track of the object, as long as it reintroduces a new object at the correct position. We have also included the performance of the pure detector in the last column. For the detector we used the most likely measurement, given by the log-likelihood, when there were multiple reported detections. All results are shown for three different outlier rejection strategies: no rejection, rejection of outliers larger than 20, and rejection of outliers larger than 5. We can see that the performance of PMHT is slightly better than that of CBT when we almost disregard outliers, see RCX(5). CBT performs best when we put some weight on outliers, RCX(20), and there is an even greater difference when no thresholding is done, cf. RCX(∞). Note that when no target is reported at all, e.g. PMHT at frame 350 in RC1, that frame did not contribute to the RMSE for that method. The CBT kept track of the car in front in nearly all cases, until it got out of view. No association problems occurred. Again, the estimates are very accurate in comparison to the localization of the original detections and in comparison to the other two methods.
5 Conclusion
We have extended the framework of CBT in several ways. The multi-dimensional case has been re-formulated in a sound probabilistic approach. The previously suggested higher-order Markov model has been embedded into a first-order model, allowing the Baum-Welch algorithm to be applied for learning the system model and the measurement model for the tracking. The learning algorithm itself has been extended and has been shown to work on weakly labeled data. The association problem of observations and objects is solved without additional discrete optimization steps. The resulting tracking algorithm has been shown to extract individual objects from noisy detections of multiple objects and compares favorably with existing techniques. We have discussed the advantages of using marginals of higher-order Markov models compared to motion states. As a result of working on marginals, the algorithm runs in full real-time. The proposed method shows the best accuracy and robustness in most of the evaluated sequences. Potential application areas are visual surveillance from cheap and/or uncalibrated cameras and image sequence analysis of objects with unknown system models.
References 1. Felsberg, M., Larsson, F.: Learning Bayesian tracking for motion estimation. In: International Workshop on Machine Learning for Vision-based Motion Analysis (2008) 2. Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Sig. P. 50, 174–188 (2002)
3. Isard, M., Blake, A.: CONDENSATION – conditional density propagation for visual tracking. International Journal of Computer Vision 29, 5–28 (1998) 4. Cou´e, C., Fraichard, T., Bessi`ere, P., Mazer, E.: Using Bayesian programming for multisensor multitarget tracking in automotive applications. In: ICRA (2003) 5. Granlund, G.H.: An Associative Perception-Action Structure Using a Localized Space Variant Information Representation. In: Proceedings of the AFPAC Workshop (2000) 6. Johansson, B., et al.: The application of an oblique-projected landweber method to a model of supervised learning. Mathematical and Computer Modelling 43, 892–909 (2006) 7. Jonsson, E., Felsberg, M.: Correspondence-free associative learning. In: ICPR (2006) 8. Felsberg, M., Forss´en, P.E., Scharr, H.: Channel smoothing: Efficient robust smoothing of low-level signal features. PAMI 28, 209–222 (2006) 9. Georgiev, A.A.: Nonparamtetric system identification by kernel methods. IEEE Trans. on Automatic Control 29 (1984) 10. Han, B., Joo, S.W., Davis, L.S.: Probabilistic fusion tracking using mixture kernel-based Bayesian filtering. In: IEEE Int. Conf. on Computer Vision (2007) 11. North, B., Blake, A.: Learning dynamical models using expectation-maximisation. In: ICCV 1998 (1998) ˚ om, K., Berthilsson, R.: Real time viterbi optimization of hidden markov mod12. Ard¨o, H., Astr¨ els for multi target tracking. In: Proceedings of the WMVC (2007) 13. Streit, R.L., Luginbuhl, T.E.: Probabilistic multi-hypothesis tracking. Technical report, 10, NUWC-NPT (1995) 14. Shalom, B.Y., Tse, E.: Tracking in a cluttered environment with probabilistic data association. Automatica 11, 451–460 (1975) 15. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intell. 22, 747–757 (2000) 16. Jonker, R., Volgenant, A.: A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38, 325–340 (1987) 17. Snippe, H.P., Koenderink, J.J.: Discrimination thresholds for channel-coded systems. Biological Cybernetics 66, 543–551 (1992) 18. Pampalk, E., Rauber, A., Merkl, D.: Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps. In: Dorronsoro, J.R. (ed.) ICANN 2002. LNCS, vol. 2415, pp. 871–876. Springer, Heidelberg (2002) 19. Forss´en, P.E.: Low and Medium Level Vision using Channel Representations. PhD thesis, Link¨oping University, Sweden (2004) 20. Felsberg, M.: Spatio-featural scale-space. In: Tai, X.-C., et al. (eds.) SSVM 2009. LNCS, vol. 5567, pp. 235–246. Springer, Heidelberg (2009) 21. Yakowitz, S.J.: Nonparametric density estimation, prediction, and regression for markov sequences. Journal of the American Statistical Association 80 (1985) 22. Baum, L.E., et al.: A maximization technique occuring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41, 164–171 (1970) 23. Rao, R.P.N.: An optimal estimation approach to visual perception and learning. Vision Research 39, 1963–1989 (1999) 24. Therrien, C.W.: Decision, estimation, and classification: an introduction into pattern recognition and related topics. John Wiley & Sons, Inc., Chichester (1989) 25. Sochman, J., Matas, J.: Waldboost - learning for time constrained sequential detection. In: Proc. Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 150–157 (2005)
Analysis of Numerical Methods for Level Set Based Image Segmentation
Björn Scheuermann and Bodo Rosenhahn
Institut für Informationsverarbeitung, Leibniz Universität Hannover
{scheuerm,rosenhahn}@tnt.uni-hannover.de
Abstract. In this paper we analyze numerical optimization procedures in the context of level set based image segmentation. The Chan-Vese functional for image segmentation is a general and popular variational model. Given the corresponding Euler-Lagrange equation to the Chan-Vese functional, the region based segmentation is usually done by solving a differential equation as an initial value problem. While most works use the standard explicit Euler method, we analyze and compare this method with two higher order methods (second and third order Runge-Kutta methods). The segmentation accuracy and the dependence of these methods on the involved parameters are analyzed by numerous experiments on synthetic images as well as on real images. Furthermore, the performance of the approaches is evaluated in a segmentation benchmark containing 1023 images. It turns out that our proposed higher order methods perform more robustly, more accurately and faster compared to the commonly used Euler method.
1 Introduction
One popular problem in the field of computer vision is image segmentation. The problem has been formalized by Mumford and Shah as the minimization of a functional [1]. With the use of level set representations of active contours [2] one obtains a very efficient way to find the minimizers of such a functional. As shown by many seminal papers and textbooks on segmentation using these variational frameworks [2,3,4] there has been a lot of progress, but the approach still faces several difficulties. The reason for these difficulties is in most cases a violation of model assumptions. For example, the model usually assumes homogeneous [2] or smooth [1] object regions. Due to noise, occlusion, texture and shading this model is often not appropriate to delineate object regions. A successful remedy is the statistical modeling of regions [3] and the supplement of additional information such as texture [5] and motion [6], which increases the number of scenes where image segmentation can succeed. To find a minimizer it is a common technique to numerically solve the corresponding Euler-Lagrange equation using the explicit Euler method for initial value problems [2]. The main contribution of this paper is the analysis of numerical methods like the explicit Euler method (EU) and higher order Runge-Kutta
methods (RK-2 and RK-3). We will compare the segmentation accuracy and further analyze the dependence of these methods on the involved parameters, such as the timestep of the numerical methods or the weighting parameters. The advantages of the higher order methods are demonstrated by several experiments on synthetic and real images. Paper Organization. In Section 2 we continue with a short review of the variational approach for image segmentation, which is the basis for our segmentation framework. Section 3 introduces the various numerical methods and also describes how to find a minimizer for the functional described in Section 2. Experiments in Section 4 will demonstrate the advantages of the chosen higher order numerical methods over the standard method and other segmentation methods. The paper will finish with a short conclusion.
2 Image Segmentation Using a Variational Framework
The variational approach for image segmentation used in our framework is based on the works of [2,7,8,9]. Using the level set formulation for the general problem of image segmentation has several advantages. To allow a convenient and sound interaction between constraints that are imposed on the contour itself and constraints that act on the two regions separated by the contour, the 1-D curve is embedded into a 2-D, image-like structure. Another important advantage of the level set representation is that it naturally handles topological changes of the 1-D curve. This is especially important if the object is partially occluded by another object or if the object consists of multiple parts. In the case of a two-phase segmentation, the level set function ϕ : Ω → R splits the image domain Ω into the two regions Ω1, Ω2 ⊆ Ω with

ϕ(x) ≥ 0 if x ∈ Ω1,   ϕ(x) < 0 if x ∈ Ω2.   (1)

The boundary between the object that is sought to be extracted and the background is represented by the zero-level line of the function ϕ. Like most works on level set segmentation, we focus on this special segmentation case with two regions. The interested reader can find an extension of the presented method to multiple regions in [10,11]. Another successful remedy to extend the number of situations in which image segmentation can succeed is the use of additional constraints like the restriction to a certain object shape [4,12]. The following three constraints are imposed as an optimality criterion for contour extraction: (i) the data within each region should be similar, (ii) the data between the object and the background should be dissimilar, (iii) the contour dividing the regions should be minimal. As shown in [2] these model assumptions can be expressed by the so called Chan-Vese energy functional

E(ϕ) = − ∫_Ω [ H(ϕ) log p1 + (1 − H(ϕ)) log p2 ] dΩ + ν ∫_Ω |∇H(ϕ)| dΩ   (2)
where ν ≥ 0 is a weighting parameter between the three given constraints, pi are probability densities and H(s) is a regularized Heaviside function with

(i) lim_{s→−∞} H(s) = 0,   (ii) lim_{s→∞} H(s) = 1,   (iii) H(0) = 0.5.   (3)
The regularized Heaviside function is needed to build the Euler-Lagrange equation and to make it possible to indicate at each iteration step to which region a pixel belongs. Minimizing the first term maximizes the total a-posteriori probability given the two probability densities p1 and p2 of Ω1 and Ω2, i.e., pixels are assigned to the most probable region according to the Bayes rule. The second term minimizes the length of the contour and acts as a smoothing term. Minimization of the Chan-Vese energy functional (2) can be easily performed by solving the corresponding Euler-Lagrange equation for ϕ

∂ϕ/∂t = δ(ϕ) [ log(p1/p2) + ν div(∇ϕ/|∇ϕ|) ],   (4)

where δ(s) is the derivative of H(s) with respect to its argument. Starting with some initial contour ϕ^0 and given the probability densities p1 and p2, one has to solve the following initial value problem

ϕ(x, 0) = ϕ^0 for x ∈ Ω,
∂ϕ/∂t = δ(ϕ) [ log(p1/p2) + ν div(∇ϕ/|∇ϕ|) ].   (5)

The way the two probability densities p1 and p2 are modeled is a very important factor for the quality of the segmentation process. In this paper we restrict ourselves to the very simple full Gaussian density using gray values [3]. This restriction is made because we only want to analyze several numerical methods and therefore it is not necessary to use different statistical models. Other possibilities for image cues to use for the density model are color and texture [5,13] or motion [6]. There are also various other possibilities to model the probability densities given these image cues, e.g., a Gaussian density with fixed standard deviation [2], a generalized Laplacian [14] or nonparametric Parzen estimates [5]. Let now μ1 and μ2 be the mean gray values in Ω1 and Ω2, and σ1 and σ2 the standard deviations of the two regions Ω1 and Ω2. Then the probability of u(x) ∈ Ω to be in Ωi is

pi(u(x)) = 1/(√(2π) σi) · exp( −(u(x) − μi)² / (2σi²) )   for i ∈ {1, 2},   (6)

where the probability densities p1 and p2 have to be updated after each iteration step. For our full Gaussian density model this comes down to updating

μ1 = ∫_Ω u(x) H(ϕ) dΩ / ∫_Ω H(ϕ) dΩ ;   σ1 = ( ∫_Ω (u(x) − μ1)² H(ϕ) dΩ / ∫_Ω H(ϕ) dΩ )^{1/2}
μ2 = ∫_Ω u(x)(1 − H(ϕ)) dΩ / ∫_Ω (1 − H(ϕ)) dΩ ;   σ2 = ( ∫_Ω (u(x) − μ2)²(1 − H(ϕ)) dΩ / ∫_Ω (1 − H(ϕ)) dΩ )^{1/2}   (7)
using the Heaviside function to indicate the two separated regions.
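The following NumPy sketch illustrates one possible implementation of the statistics update (7) and of the Gaussian model (6). The concrete regularized Heaviside chosen here is only one common choice and the function names are illustrative, so the sketch is not meant as a literal description of our implementation.

```python
import numpy as np

def heaviside(phi, eps=1.5):
    # One common regularized Heaviside satisfying the limits in (3).
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def region_statistics(u, phi):
    # Mean and standard deviation of both regions, cf. (7).
    H = heaviside(phi)
    w1, w2 = H, 1.0 - H
    mu1 = np.sum(u * w1) / np.sum(w1)
    mu2 = np.sum(u * w2) / np.sum(w2)
    sigma1 = np.sqrt(np.sum((u - mu1) ** 2 * w1) / np.sum(w1))
    sigma2 = np.sqrt(np.sum((u - mu2) ** 2 * w2) / np.sum(w2))
    return (mu1, sigma1), (mu2, sigma2)

def gaussian_log_density(u, mu, sigma):
    # Logarithm of the full Gaussian density (6); the log form is what
    # enters the data term log(p1/p2) of the evolution equation.
    return -0.5 * ((u - mu) / sigma) ** 2 - np.log(np.sqrt(2.0 * np.pi) * sigma)
```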
Fig. 1. Comparison of the different numerical methods for the initial value problem y′(t) = sin(t)² · y(t) with y(0) = 2 (curves shown: EU, EU with halved timestep, RK2, RK3, and the exact solution). The Euler method clearly fails to compute an accurate solution, whereas both Runge-Kutta methods are almost exact.
3 Numerical Methods
3.1 Euler Method
Several methods exist to numerically compute the solution of an initial value problem of the type (5). The easiest method, used by most previous works, is the simple Euler method (EU) [2], which is an explicit first order numerical procedure for solving initial value problems. The idea of the Euler method is to assume that the rate of change is constant for an interval Δt. For a given initial value problem y′(t) = f(t, y(t)), y(t0) = y0, the Euler method is defined by

y_{n+1} = y_n + Δt f(t_n, y_n)   for n ≥ 0,   (8)

where Δt is the timestep and t_{n+1} = t_n + Δt. For the initial value problem given by Equation (5) this method leads to

ϕ^0 = initial contour,
ϕ^{n+1} = ϕ^n + Δt L(ϕ^n)   for n ∈ N,   (9)

where L(ϕ^n) is defined as the following operator:

L(ϕ^n) := δ(ϕ^n) [ log(p1/p2) + ν div(∇ϕ^n/|∇ϕ^n|) ]   for n ≥ 0,   (10)

and Δt is the timestep. To increase the accuracy of the solution one has two possibilities. The first is to reduce Δt and the other is to choose a method with a higher order of convergence.
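For illustration, the sketch below implements the operator L(ϕ) of (10) on a pixel grid with central differences and performs one Euler update (9). The curvature discretization and the Dirac function are assumptions consistent with the regularized Heaviside of the previous sketch; the actual finite difference scheme used in our implementation may differ in detail.

```python
import numpy as np

def dirac(phi, eps=1.5):
    # Derivative of the regularized Heaviside used in the previous sketch.
    return (eps / np.pi) / (eps ** 2 + phi ** 2)

def curvature(phi, tiny=1e-8):
    # div(grad(phi) / |grad(phi)|) with central differences.
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + tiny
    dyy, _ = np.gradient(gy / norm)   # d/dy of the y-component
    _, dxx = np.gradient(gx / norm)   # d/dx of the x-component
    return dxx + dyy

def level_set_operator(phi, log_p1, log_p2, nu):
    # Right-hand side L(phi) of (4) and (10).
    return dirac(phi) * (log_p1 - log_p2 + nu * curvature(phi))

def euler_step(phi, log_p1, log_p2, nu, dt):
    # One explicit Euler update, cf. (9).
    return phi + dt * level_set_operator(phi, log_p1, log_p2, nu)
```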
Fig. 2. (a) Synthetic image and level set initialization; (b) detail of the synthetic image; (c) detail of the final segmentation using EU and ν = 4; (d) detail of the final segmentation using EU and ν = 10; (e) detail of the final segmentation using RK-2 and ν = 4; (f) detail of the final segmentation using RK-2 and ν = 10
3.2 Runge-Kutta Methods
The Runge-Kutta schemes are well known methods with a higher order of convergence compared to EU. The Runge-Kutta methods used in this paper are explicit iterative methods for the approximation of solutions to initial value problems. Considering an initial value problem (8), an explicit second order Runge-Kutta method (RK-2), also called Heun's method or modified Euler method, is given by:

ỹ_{n+1} = y_n + Δt f(t_n, y_n),
y_{n+1} = y_n + (Δt/2) ( f(t_n, y_n) + f(t_{n+1}, ỹ_{n+1}) )   for n ≥ 0.   (11)

For the initial value problem (5) RK-2 leads to

ϕ^0 = initial contour,
ϕ̃^{n+1} = ϕ^n + Δt · L(ϕ^n)   for n ∈ N,
ϕ^{n+1} = ϕ^n + (Δt/2) · ( L(ϕ^n) + L(ϕ̃^{n+1}) )   for n ∈ N.   (12)

A third order Runge-Kutta method (RK-3) for the initial value problem given by (5) can be defined analogously to Shu and Osher [15] by:

ϕ^0 = initial contour,
ϕ̃^{n+1} = ϕ^n + Δt · L(ϕ^n)   for n ∈ N,
ϕ̃^{n+1/2} = ϕ^n + (Δt/4) · ( L(ϕ^n) + L(ϕ̃^{n+1}) )   for n ∈ N,
ϕ^{n+1} = ϕ^n + (Δt/6) · ( L(ϕ^n) + L(ϕ̃^{n+1}) + 4 L(ϕ̃^{n+1/2}) )   for n ∈ N.   (13)
A simple 1-D example of these methods is shown in Figure 1. Obviously the accuracy of the solution increases with the choice of a smaller timestep and the choice of a higher order method. Remark: In our algorithm the spatial discretization is done using finite differences.
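The corresponding RK-2 and RK-3 update steps can be sketched as follows. The operator L is passed in as a callable, for instance a small wrapper around level_set_operator from the sketch above, so the code is generic and only illustrative.

```python
def rk2_step(phi, L, dt):
    # Heun's method (11)/(12): predictor followed by an averaged corrector.
    phi_tilde = phi + dt * L(phi)
    return phi + 0.5 * dt * (L(phi) + L(phi_tilde))

def rk3_step(phi, L, dt):
    # Third order scheme (13) in the Shu-Osher form.
    k1 = L(phi)
    phi_full = phi + dt * k1                 # corresponds to phi~(n+1)
    k2 = L(phi_full)
    phi_half = phi + 0.25 * dt * (k1 + k2)   # corresponds to phi~(n+1/2)
    k3 = L(phi_half)
    return phi + dt / 6.0 * (k1 + k2 + 4.0 * k3)
```

Note that RK-2 evaluates L twice and RK-3 three times per step, so the gain comes from being able to use larger timesteps and fewer iterations.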
Fig. 3. (a) Segmentation error as a function of the weighting parameter ν (the segmentation error of RK-2 and RK-3 is exactly the same); (b) segmentation error as a function of the noise level (the segmentation error of EU and RK-3 is almost the same)
4 Experiments
In this Section we demonstrate the impact of the higher order methods RK-2 and RK-3 applied to level set based image segmentation. Experiments are performed on synthetic images, real images and on the segmentation benchmark developed by Feng Ge and Song Wang [16].
4.1 Synthetic Images
For the analysis of the three presented numerical methods, we first use the synthetic image and the initialization of the level set function shown in Figure 2a. The dimension of this image was 400 × 300 pixels. We chose this image because the object consists of parts with high and low curvature. We define the region-based segmentation accuracy analogously to Ge et al. [16] by

P(R; G) = |R ∩ G| / |R ∪ G| = |R ∩ G| / ( |G| + |R| − |R ∩ G| ),   (14)

where G is the ground-truth foreground object and the region R is the segment derived from the segmentation result using one of the numerical methods. The properties of this definition are discussed in [16]. Equation (14) leads to the following definition of the segmentation error

εSE := 1 − P(R; G).   (15)
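A direct implementation of (14) and (15) is straightforward; the sketch below assumes that both the segmentation result R and the ground truth G are available as boolean masks of equal size.

```python
import numpy as np

def segmentation_error(R, G):
    # Region-based accuracy (14) and error (15) for boolean masks
    # R (segmentation result) and G (ground-truth foreground).
    R = np.asarray(R, dtype=bool)
    G = np.asarray(G, dtype=bool)
    union = np.count_nonzero(R | G)
    if union == 0:
        return 0.0                      # both masks empty: no error
    accuracy = np.count_nonzero(R & G) / union
    return 1.0 - accuracy
```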
Figure 2b shows a detail of the synthetic image, while Figures 2c and 2d demonstrate that the smoothness parameter ν has a large influence on the final segmentation if EU is used. Figures 2e and 2f show that ν has little influence when RK-2 is used. It can be seen from Figure 3a that for EU only the choice ν = 0 leads to a segmentation error εSE = 0, while choosing ν = 20 results in εSE = 0.01
Fig. 4. (a) Synthetic image with noise; (b) final segmentation using EU; (c) final segmentation using RK-2
corresponding to 1200 wrongly segmented pixels. Conversely, RK-2 and RK-3 lead to the segmentation error εSE = 0 for ν ∈ {0, . . . , 20}. Obviously the higher order methods are more robust to choices of the smoothness parameter. In Figure 3b we added Gaussian pixel noise to the synthetic image shown in Figure 2a. It can be seen from Figure 3b that the segmentation obtained by RK-2 is more robust to noise. Figure 4 shows the final segmentation results for a noisy synthetic image (4a) using EU (4b) and RK-2 (4c). Apparently the segmentation method using EU converges to a local minimum of the energy functional while RK-2 reaches the global minimum.
4.2 Real Images
To analyze the numerical methods on real images, we apply the segmentation methods on the image benchmark presented in [16]. To demonstrate the robustness of the level set based segmentation using RK-2 and RK-3 with respect to the weighting parameter ν, we use the real image of the benchmark shown in Figure 5a. For RK-2 and RK-3 the segmentation error εSE is between 0.055 and 0.057 for ν ∈ {0, . . . , 20}, which implies that the variance of the final segmentation is less than 0.5%. Using EU the error εSE is between 0.056 and 0.1, implying a variance bigger than 4% (see Figure 5d). Figures 5b and 5c show the final segmentation using EU and RK-2 with the weighting parameter ν = 10. The dependence of all methods on the timestep Δt is shown in Figure 5e. Obviously, the final segmentation using RK-2 or RK-3 is almost identical for all Δt ∈ {1 . . . 30}, whereas for EU the segmentation error εSE increases with a bigger timestep Δt.

Table 1. Comparison of the average performance of four image segmentation methods

  Method                   avg. perf.
  Normalized-cut method    0.39
  Euler method             0.50
  2nd order RK method      0.52
  3rd order RK method      0.52
Fig. 5. (a) Real image from the segmentation benchmark presented in [16]; (b) final segmentation using EU and ν = 10; (c) final segmentation using RK-2 and ν = 10; (d) segmentation error as a function of the weighting parameter ν (the segmentation error of RK-2 and RK-3 is almost identical); (e) segmentation error as a function of the timestep Δt (the segmentation error of RK-2 and RK-3 is almost identical).
We find that RK-2 converges more reliably than EU, even for large timesteps (cf. Figure 1). This is an important fact because a bigger timestep leads to a reduction of the total number of iterations and thereby to a reduction of the computational time needed to segment an object. Figure 6 shows the total segmentation accuracy of the different methods. The curves indicate the performance distribution of the methods on all 1023 images of the segmentation benchmark. The horizontal axis denotes the proportion of images and the y-axis indicates the segmentation accuracy p(x). A specific point (x, p(x)) on the curve indicates that 100·(1 − x) percent of the images are segmented with an accuracy better than p(x). The curve NC describes the performance of the Normalized-cut method (NC) implemented by Shi et al. [17], included to compare our results to another segmentation strategy. We decided to compare our methods to the normalized-cut method, since it was the method with the best average performance on the segmentation benchmark [16]. Because the performance curves of EU, RK-2 and RK-3 are almost indistinguishable, Table 1 shows the average performance of our three segmentation methods in comparison to NC. The average performance of the level set based segmentation methods is clearly better than NC, and RK-2 and RK-3 are slightly better than EU (the average performance increases by 4%).
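Such a performance distribution curve can be obtained by simply sorting the per-image accuracies, as the following short sketch (with illustrative names) indicates.

```python
import numpy as np

def performance_curve(accuracies):
    # Sort per-image accuracies; at curve position x, roughly a fraction
    # (1 - x) of the images has an accuracy better than p(x).
    p = np.sort(np.asarray(accuracies, dtype=float))
    x = np.arange(1, p.size + 1) / p.size
    return x, p
```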
Fig. 6. Performance curve of the three numerical methods and the Normalized-cut method (NC) on the 1023 images of the segmentation benchmark
To compare the computational time of the methods we chose 100 images of the segmentation benchmark where the segmentation accuracy was better than 90% for all numerical methods. Using the same timestep Δt = 1 it took 8.4 minutes to segment the images with the Euler method and 6.8 min using the described second order Runge-Kutta method. Choosing a bigger timestep for RK-2 and EU further reduces the computational time. Table 2 shows the computational time needed to segment the 100 images for various methods and timesteps.

Table 2. Comparison of the computational time needed to segment the 100 images from the segmentation benchmark
  Method                              comp. time   avg. perf.
  Euler method, with Δt = 1           8.4 min      0.9440
  Euler method, with Δt = 2           3.6 min      0.9415
  Euler method, with Δt = 4           2.9 min      0.9413
  2nd order RK method, with Δt = 1    6.6 min      0.9441
  2nd order RK method, with Δt = 2    3.1 min      0.9415
  2nd order RK method, with Δt = 4    2.5 min      0.9422
In Figure 7, we present more segmentation results on real images. We chose these images to show that RK-2 is able to find the global minimum in cases where EU converges to a local minimum (see Figures 7a - 7c). Figures 7d - 7g show that, using the same timestep Δt = 1 and the same smoothing parameter ν = 4, the segmentation accuracy increases using RK-2 instead of EU. Besides, the total number of iterations is much smaller, which leads to a reduction of the computational time by a factor of 2, even though the computational time for one iteration is bigger. These results clearly show that RK-2 more reliably achieves accurate segmentations.
Fig. 7. (a) Level set initialization; (b) final segmentation using EU; (c) final segmentation using RK-2; (d) level set initialization; (e) segmentation result after 180 it. using EU; (f) final segmentation using EU (900 it.); (g) final segmentation using RK-2 (180 it.). The computational time is reduced by a factor of 2.
5 Conclusion
In this paper we proposed to use higher order optimization schemes to solve the well-known variational approach to image segmentation, and we compared our approach with the traditional method for this problem. By synthetic and real image experiments we showed that the use of higher order Runge-Kutta methods improves the average accuracy of the final segmentation and reduces the dependence on the timestep and smoothing parameters, which critically influence the performance of the Euler method. We showed that using the second order Runge-Kutta method more reliably achieves accurate segmentations. Using our proposed scheme increases the number of scenes in which image segmentation using the variational approach can succeed. Furthermore, the computational time decreases in most cases.
References
1. Mumford, D., Shah, J.: Boundary detection by minimizing functionals. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, June 1985, pp. 22–26. IEEE Computer Society Press, Springer (1985)
2. Chan, T., Vese, L.: Active contours without edges. IEEE Transactions on Image Processing 10(2), 266–277 (2001)
3. Zhu, S.C., Yuille, A.: Region competition: unifying snakes, region growing, and bayes/mdl for multiband image segmentation. IEEE Transaction on Pattern Analysis and Machine Intelligence 18(9), 884–900 (1996)
4. Cremers, D., Tischhäuser, F., Weickert, J., Schnörr, C.: Diffusion snakes: introducing statistical shape knowledge into the mumford-shah functional. International Journal of Computer Vision 50(3), 295–313 (2002)
5. Rousson, M., Brox, T., Deriche, R.: Active unsupervised texture segmentation on a diffusion based feature space. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison, WI, pp. 699–704 (2003)
6. Cremers, D., Yuille, A.L.: A generative model based approach to motion segmentation. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, pp. 313–320. Springer, Heidelberg (2003)
7. Malladi, R., Sethian, J., Vemuri, B.: Shape modelling with front propagation: A level set approach. IEEE Transaction on Pattern Analysis and Machine Intelligence 17(2), 158–174 (1995)
8. Paragios, N., Deriche, R.: Unifying boundary and region based information for geodesic active tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Fort Collins, Colorado, vol. 2, pp. 300–305. IEEE Computer Society Press, Los Alamitos (1999)
9. Rosenhahn, B., Brox, T., Weickert, J.: Three-dimensional shape knowledge for joint image segmentation and pose tracking. International Journal of Computer Vision 73(3), 243–262 (2007)
10. Zhao, H.K., Chan, T., Merriman, B., Osher, S.: A variational level set approach to multiphase motion. Journal of Computational Physics 127, 179–195 (1996)
11. Brox, T., Weickert, J.: Level set based segmentation of multiple objects. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds.) DAGM 2004. LNCS, vol. 3175, pp. 415–423. Springer, Heidelberg (2004)
12. Rousson, M., Paragios, N.: Shape priors for level set representations. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 78–92. Springer, Heidelberg (2002)
13. Brox, T., Weickert, J.: A tv flow based local scale estimate and its application to texture discrimination. Journal of Visual Communication and Image Representation 17(5), 1053–1073 (2006)
14. Heiler, M., Schnörr, C.: Natural image statistics for natural image segmentation. International Journal of Computer Vision 63(1), 5–19 (2005)
15. Shu, C.W., Osher, S.: Efficient implementation of essentially non-oscillatory shock-capturing schemes. Journal of Computational Physics 77, 439–471 (1988)
16. Ge, F., Wang, S.: New benchmark for image segmentation evaluation. Journal of Electronic Imaging 16(3) (2007)
17. Cour, T., Yu, S., Shi, J.: Normalized cut image segmentation source code (2004), http://www.cis.upenn.edu/~jshi/software/
Focused Volumetric Visual Hull with Color Extraction
Daniel Knoblauch and Falko Kuester
University of California, San Diego
Abstract. This paper introduces a new approach for volumetric visual hull reconstruction, using a voxel grid that focuses on the moving target object. This grid is continuously updated as a function of object location, orientation, and size. The benefit is a reduced number of voxels that have to be evaluated, or voxels that can be allocated towards capturing the target at higher resolution. This technique particularly improves reconstructions where the total reconstruction space is larger than the moving reconstruction target. The higher resolution of the voxel grid also reduces the computational cost per voxel reprojection since a one voxel to one input pixel reprojection ratio is approximated. In addition, the appropriate view independent color of the surface voxels is computed, allowing for realistic visual hull texturing. All color calculations are performed locally, based on approximated surface voxel normals and the input images. A color outlier detection approach is introduced, which reduces the influence of occlusions in the color evaluation. The parallel nature of the presented focused visual hull reconstruction technique lends itself to hardware acceleration, allowing interactive rates to be achieved by performing most computations on the GPU. A set of case studies is provided for well-defined static and dynamic data sets.
1 Introduction
3D model reconstruction has received wide attention, yet complexity and usability challenges have proven to be persistent, in particular when the reconstruction of high-quality models is desired at interactive rates. Application domains requiring dynamic model or scene reconstructions include tele-presence and augmented reality. The creation of avatars, for example, allowing users to intuitively and naturally collaborate in virtual worlds, has to occur continuously and nearly instantaneously to capture user posture, movements and actions realistically. The presented 3D model reconstruction technique uses a voxel-based visual hull. This volumetric approach has the advantage of explicit geometry and opens the possibility of easier skeleton extraction in later stages of the project. The most common approach to speed up the volumetric visual hull reconstruction is to precompute a look-up table for voxel classification. Due to this precomputation, the area of the reconstruction is limited and fixed. The size of voxels is based on the total reconstruction space, covering the possible object movement, and not on the object to be reconstructed. In this paper we introduce a focused visual hull reconstruction, which allows the voxel grid to be adjusted and focused on the reconstruction target. This may result in fewer voxels than in the common approach, but the calculated voxels are concentrated on the target object. As a result the object's reconstruction resolution is higher because the voxel size can be reduced.
Since the reconstructed models will be used in a later stage of this project for a tele-immersion system, the texturing of the object is of great importance. Based on the calculated visual hull, the surface voxels are evaluated and the corresponding normals approximating the target surface are calculated. Once the normal of each voxel is known, the cameras that influence the voxel color can be evaluated and the resulting color can be calculated as a weighted combination of corresponding camera pixel colors. The color evaluation is performed locally; therefore, occlusions from cameras may occur. A color outlier detection is introduced to reduce the influence of occluded cameras and improve the voxel coloring. Combining the two main contributions of this paper, focused visual hull reconstruction and local color extraction, results in real-time object reconstruction with high geometric and visual quality.
2 Previous Work
Due to the high complexity of 3D reconstruction, approximations are utilized or expensive high-end hardware [1] [3] is used to allow interactive frame rates. The visual hull [6] is an approximation of the reconstruction target representing the intersection of silhouette cones from different viewing angles. These silhouette cones are the result of the projection of the silhouettes into space. This representation guarantees the inclusion of the entire reconstruction target. Based on the definition of the visual hull, the concavities of an object cannot be detected or reconstructed. In many applications such as real-time avatar reconstruction or movement tracking of users, the trade-off is between speed and a perfect reconstruction. There are two basic methods to calculate a visual hull, namely the polyhedral and the volumetric approach. In the polyhedral approach geometric properties are used to calculate the visual hull [8]. This results in a set of polygons that define the surface of the visual hull. In this approach the accuracy of the visual hull depends on the discretization of the silhouettes. The cost increases with finer discretization and higher numbers of involved silhouettes. In the volumetric approach, a voxel representation is used for computations [11] [13]. In this approach the reconstruction area is discretized into voxels. Each of these voxels is backprojected to the input silhouettes. If all the backprojections result in a silhouette pixel, the corresponding voxel is part of the visual hull. The backprojection of the voxel into the silhouette images represents the silhouette cones. This classification is highly parallelizable as the voxels are independent. A volumetric visual hull representation has further advantages, as it facilitates subsequent calculations such as skeleton extraction or avatar interaction with the virtual world, since the volumes are already known. It is also much easier to adjust the reconstruction resolution by increasing or decreasing the voxel size than by changing the discretization in 2D. There have also been several approaches that do not explicitly calculate the visual hull but render images of the visual hull from a given viewpoint [7] [9]. It is common to precompute a look-up table for the voxel backprojection, to allow fast reconstruction. Based on the size of the voxels and the resolution of the input images, normally more than one point in each voxel is reprojected in order to reduce the amount of errors in voxel identification [2]. However, this also means that the voxel grid is fixed
Fig. 1. Left: Comparison between fixed and focused voxel grid. The fixed voxel grid covers the entire possible object movement range. The focused voxel grid covers the moving object. Only dark grey voxels are used for visual hull evaluation. Right: Red voxels are the extracted border voxels and the green neighborhoods show the introduced normal calculation approach.
and the larger the possible reconstructed space to moving reconstruction object ratio is, the smaller the resolution of the visual hull will be. Ladikos et al. [5] implemented such an approach in CUDA. The same paper introduces an approach that reduces the resolution of the input images, so that every voxel completely reprojects into one pixel. This can be done efficiently thanks to the highly parallel nature of the GPU. Most explicit volumetric visual hull reconstructions are used for human motion tracking [10]. This explains why there is normally no explicit voxel colorization. Most visual hull approaches that integrate color information do this by view dependent object texturing. Li et al. [7] calculate the directional color by weighting the color information from different cameras by the angle from the visual hull normal. However, this approach is only used as a view dependent visual hull texturing. A similar approach has been introduced by Matusik et al. [9].
3 Focused Volumetric Visual Hull
This paper introduces a novel approach to volumetric visual hull computation. Instead of fixing the voxel grid and the corresponding size, a voxel grid focused on the reconstructed object is introduced by evaluating its position, orientation, and size in each frame and propagating this information to the next frame to better enclose the target. Figure 1 shows on the left side the usual approach with a fixed voxel grid big enough to cover the whole possible object movement range, and a focused voxel grid covering the moving object. It can be seen that by focusing, the number of voxels to evaluate can be decreased and the voxel grid resolution increased. The grey voxels are the voxels that have to be evaluated in each approach. The input values for the following steps are the intrinsic and extrinsic camera parameters, known through a prior calibration, and the respective silhouettes.
3.1 Grid Center
The first step to focus the visual hull is to evaluate its approximate location in space. Knoblauch and Kuester solved a similar problem for their focused depth map
calculation in [4]. The centroids of two of the input silhouettes are calculated. Due to the assumption that the two centroids represent the same point in space, the location of the grid center can be calculated. The voxel grid is now wrapped around the calculated grid center. Subsequently, it is necessary to evaluate which voxels are part of the visual hull. Each voxel center is reprojected into the input silhouettes and if all the reprojections result in a silhouette pixel the voxel is part of the visual hull. As every voxel can be calculated independently, real-time performance is achieved by implementing this process on the GPU using CUDA as programming language. The center of the resulting visual hull is evaluated and the difference between the silhouette based grid center and the visual hull center is used to adjust the initial grid center in the next frame.
3.2 Grid Orientation
The orientation of the reconstructed object is evaluated to align the voxel grid to the target object in the following frame, assuming the target has small changes in orientation from frame to frame. This assumption can be made as interactive rates are achieved and in the final setup humans interacting with each other or the virtual environment are reconstructed. To calculate the orientation, the center of each surface voxel is taken as input for a principal component analysis (PCA). The PCA is performed by doing a singular value decomposition (SVD) of the covariance matrix C of the mean-subtracted input points represented in the columns of the matrix M of size 3×N:

C = (1/N) M M^T = W Σ V^T   (1)
The eigenvector matrix V, sorted by decreasing eigenvalue, represents the optimal orientation of the object aligned voxel grid. If this is done every frame, it can be seen that the voxel grid smoothly moves with the target object.
3.3 Grid Size
By orienting the voxel grid it is possible to evaluate how much of the voxel grid is actually occupied by the visual hull in each principal axis. It is also known that the voxel grid is centered on the target object. This means that evaluating the minimum min and maximum max coordinate in each voxel grid orientation occupied by the visual hull provides the dimensions of the reconstructed object in reference to the voxel grid coordinate system, in order to adjust the size of the voxel grid to the target. For efficiency reasons the maximum number of voxels in each direction is fixed, but the size of the voxels can be adjusted. Another limitation to the adjustment is that the voxel size has to be the same in all three principal directions. This means that if we evaluate the voxel grid axis with the highest visual hull coverage a, the voxel size can be scaled by the factor s given in the following equation:

s = (max_a − min_a + 2 · threshold) / gridSize_a   (2)
Here threshold is a buffer added to the voxel grid size; in our system it is set to 5% of the voxel grid dimension. By increasing threshold, the robustness against fast
movements and object changes can be increased. This leads to a very flexible voxel grid that adjusts to the reconstructed object over time even if the object changes size or shape. In the case where not all the voxel grid dimensions are used to the same extent, the calculations can be reduced to only the covered voxels plus two times the threshold (once on each side). This results in a speed-up as computations and read back from the GPU are restricted to areas that could be part of the visual hull. The reduction of evaluated voxels can be seen in a 2D example in Figure 1, where the dark grey voxels are the ones that have to be evaluated and the light gray ones are discarded because of the object dimensions. In the remainder of the paper we will refer to voxels for which reprojections are computed as evaluated voxels. The voxel grid is initialized in the first frame, so that it is aligned to the world coordinate system and covers the volume of the entire possible reconstruction area.
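The following NumPy sketch summarizes the two focusing steps described above, the PCA-based grid orientation (1) and the scale factor (2). It is a simplified CPU version with illustrative names and is not the CUDA implementation used in our system.

```python
import numpy as np

def grid_orientation(surface_voxel_centers):
    # PCA of the surface voxel centers (Section 3.2); the returned columns
    # are the principal axes of the object-aligned grid, sorted by
    # decreasing eigenvalue.
    P = np.asarray(surface_voxel_centers, dtype=float)   # N x 3
    M = (P - P.mean(axis=0)).T                           # 3 x N, mean-subtracted
    C = M @ M.T / P.shape[0]                             # covariance, cf. (1)
    _, _, Vt = np.linalg.svd(C)
    return Vt.T

def grid_scale(axis_coords, grid_size, threshold_ratio=0.05):
    # Scale factor (2) for the grid axis with the highest coverage;
    # axis_coords are visual-hull voxel coordinates along that axis.
    c = np.asarray(axis_coords, dtype=float)
    threshold = threshold_ratio * grid_size              # 5% buffer
    return (c.max() - c.min() + 2.0 * threshold) / grid_size
```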
4 Colored Visual Hull
4.1 Normal Extraction
The volumetric visual hull is an approximation to the real object. In order to calculate the color of each voxel, a normal approximating the unknown object surface has to be calculated. A simple and efficient way to approximate the normal is to take the 27-neighborhood of each surface voxel into account. A voxel is considered to be a surface voxel if one of the voxels in the 27-neighborhood is not part of the visual hull. Figure 1 shows on the right side an example in 2D representing the border voxels in red. Each of the voxels in the 27-neighborhood has a kernel coordinate ranging from -1 to 1 in each axis direction. By adding the coordinate vectors of all visual hull voxels in the kernel, the direction of the approximate normal can be evaluated. Figure 1 shows an example of the resulting normalized vectors in 2D.
4.2 Color Calculation
The voxel colors are calculated in a localized manner allowing fast computation on the GPU. To evaluate the best cameras for color input, the angles between the normal and the camera directions are evaluated. The following equations show the calculation of the voxel color c:

c = Σ_{i=0}^{n} v(i) w(i) cc(i)   (3)

w(i) = arccos(n · cd)   (4)

v(i) = { 1 if w(i) ≤ 90° and out(i) = 0,  0 if w(i) > 90° or out(i) = 1 }   (5)

out(i) = { 0 if d(i) ∈ [m − √var, m + √var],  1 else }   (6)
Table 1. Comparison of the Fixed and Focused approach based on evaluated voxels, actual visual hull voxels and the average voxel dimension

  Visual Hull   #Evaluated    #Visual Hull   Voxel Dimension
  Fixed         16,777,200    339,530        0.05³
  Focused       6,117,291     1,280,499      0.03³
where n is the number of cameras, v(i) is the visibility term for the camera, w(i) is the color weight and cc(i) is the color extracted by voxel center back-projection from camera i. This equation has also been used by view dependent color calculation approaches [7] [9]. The weights are calculated based on the dot product of the normalized normal n and the camera direction cd. The weights w(i) are normalized so that their sum equals one. To evaluate a reliable color for the voxel, only cameras with no occlusions are considered. Due to the local approach the visibility cannot be calculated efficiently, thus an approximation is introduced. All cameras with an angle greater than 90 degrees are considered occluded. Another way to evaluate if the camera is seeing the right voxel is to compare the colors of the corresponding pixels in the input images. To calculate whether cc(i) is an outlier out(i), the assumption is made that the input colors follow a Gaussian distribution. By doing so, the mean m and variance var of the colors can be calculated and all input color values can be flagged as outliers using Equation (6), with d(i) being the Euclidean difference between the mean color m and cc(i). This outlier calculation approximates the visibility function based on the assumption that if there is an occlusion, the corresponding color will be different from the desired color. There are two cases where this assumption is not valid: first, if the occluding color is similar, which does not affect our color calculation significantly; second, if more cameras are occluded than not. In this case our approach fails, but considering that the calculations are performed locally for high performance, this drawback is acceptable.
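The sketch below shows one way to realize the weighting and the outlier test of (3)-(6). Note two simplifications that are assumptions of the sketch rather than properties of our GPU implementation: the clamped dot product is used directly as the (unnormalized) weight, and the outlier test thresholds the color distance to the mean by the root mean squared deviation.

```python
import numpy as np

def voxel_color(normal, cam_dirs, cam_colors):
    # normal:     unit surface normal of the voxel, shape (3,)
    # cam_dirs:   unit vectors from the voxel towards each camera, (n, 3)
    # cam_colors: colors sampled by back-projecting the voxel center, (n, 3)
    cosines = cam_dirs @ normal
    visible = cosines > 0.0                     # angle <= 90 degrees, cf. (5)
    if not np.any(visible):
        return None

    colors = cam_colors[visible]
    m = colors.mean(axis=0)                     # mean input color
    d = np.linalg.norm(colors - m, axis=1)      # distance to the mean, cf. (6)
    inlier = d <= np.sqrt(np.mean(d ** 2)) + 1e-8

    w = cosines[visible][inlier]
    if w.size == 0 or w.sum() == 0.0:
        return m                                # fall back to the plain mean
    w = w / w.sum()                             # normalized weights, cf. (3)
    return (w[:, None] * colors[inlier]).sum(axis=0)
```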
5 Results
Tests with the Middlebury Multi-View data sets [12] known as 'Temple' and 'Dino', a dynamic data set by Vlasic [14] called 'crane' and our 'teapot' data set are conducted. A machine with a Quad Core 2.66 GHz CPU, 4 GB of RAM and a GeForce 8800 GTX with 800 MB of memory has been used. The 'teapot', 'Temple' and 'Dino' data sets have 16 input images with resolution 640x480. The 'crane' data set has 8 input images with resolution 1600x1200.
5.1 Focused Visual Hull
To show the benefits of a focused visual hull compared to a fixed visual hull, tests have been conducted with moving and still target objects. The first test was conducted based on the 'teapot' data set. The teapot has been rotated by 360 degrees and simultaneously moved in the reconstruction area over a time period of 120 frames. Additionally the
Fig. 2. Focused visual hull reconstruction of 'crane' data set [14]. First frame, before focus completed (left), subsequent frame, focused reconstruction (middle), and further frame of the sequence (right).
size of the teapot has been scaled over time to simulate a deformable object. The fixed voxel grid had a size of 256³. The voxel dimensions are 0.05³, to cover the desired reconstruction area. The total amounts of evaluated voxels and resulting visual hull voxels are counted and investigated for each frame. Table 1 shows the averaged results. Fewer voxels are evaluated thanks to focusing. This means fewer computations on the same voxel grid size. Despite the lower number of evaluated voxels in the focused voxel grid, more voxels end up being part of the visual hull. This results in a higher resolution reconstruction. To allow the same resolution with the fixed voxel grid, the grid dimensions have to be expanded to 424³ with a voxel size of 0.03³, the average voxel size of the focused visual hull. Traditional approaches would need a look-up table with a size greater than the available memory on the CPU or GPU of the test machine. In contrast, the focused voxel grid is calculated on the fly and needs no look-up table. To show that this approach is suited for tele-immersion systems, tests on the 'crane' data set have been performed. The results can be seen in Figure 2. The result for the first frame, before full focus is reached, is shown on the left side. The subsequent frame with focus can be seen in the middle. To demonstrate the flexibility of this approach, the data sets called 'Temple' and 'Dino' have also been tested. It can be seen in Figure 3 that this approach works for still objects too, as long as the reconstruction is done over several frames to allow convergence.
5.2 Color Extraction
As mentioned in previous sections, a normal has to be extracted for every surface voxel to calculate the corresponding voxel color. Figure 3 shows results of the normal calculation on the top row. The colors of the voxels represent the direction of the corresponding normal. It can be seen that with all the input data sets a smooth normal distribution can be obtained. Discretization artifacts do not significantly affect the color calculation, as the normals are only used to evaluate the best input cameras. The occlusion handling based on color outlier estimation is tested by reconstructing a colored focused visual hull in three different modes. Figure 4 shows the different results. In the left image, the color evaluation is performed based on the best camera, which is the one with the
Fig. 3. Color extraction for focused visual hull is based on surface normals (left side of each pair), to evaluate the possible input cameras and weight their color input (right side)
Fig. 4. Color extraction for focused visual hull computation. Color extracted from best camera (left), color extracted with weighted colors (middle) and color extraction with weighted colors and outlier evaluation (right).
smallest angle to the estimated surface normal of the corresponding voxel. It can be seen that the colors in the area under the handle are influenced by occlusions. In the middle image the colors from all possible cameras are added up and weighted based on the angle between the normal and the corresponding voxel-camera direction. The occlusion artifacts under the handle get smaller, but the weighting introduces new artifacts as now the occlusions from different cameras can influence the surface colors. Finally we run the proposed algorithm with color outlier detection, which produces the result on the right. Most occlusion artifacts disappear, and the remaining artifacts stay because the occlusion colors are similar to the resulting voxel color. These artifacts are acceptable as the occlusion handling is calculated locally without global knowledge for real-time performance.
5.3 Performance
The first performance test is based on the moving 'teapot' data set. Different voxel grid sizes are tested. The results can be seen in Table 2. The main steps of the focused visual hull reconstruction, namely voxel evaluation (visual hull), normal calculation (including border voxel evaluation) and color extraction, are listed. The data transfer from and to the GPU is also measured and shown in the R/W field in Table 2. Voxel evaluation and color extraction are the most expensive steps. Nevertheless, even with a voxel grid of dimension 256³, a frame rate of around 4 fps is achieved. With the hardware used in this
Table 2. Performance analysis of moving teapot scene with different resolutions. Normal calculation includes border voxel detection and the read and write to the GPU is combined.

  Grid   Visual Hull (ms)   Normal (ms)   Color (ms)   R/W (ms)   Total (ms)
  64³    1                  0.87          1.63         9.22       11.94
  128³   7                  6.7           12.79        11.9       38.4
  256³   70.66              52.4          78.9         31.2       233.16
test, good interactive rates of around 25 fps are achieved with voxel grid dimensions of 128³. It can also be said that the cost of the reconstruction grows approximately linearly with the amount of evaluated voxels. Further measurements have been done with the 'crane' data set. The average frame rate with a voxel grid dimension of 256³ is around 8 fps. The speed-up compared to the 'teapot' data set can be explained by the fact that a human body has one dominant axis, and thanks to the focusing the other dimensions do not influence the performance as much. At this point it is necessary to mention that the frame rate before full focusing is at about 3 fps. This leads to the conclusion that for tele-immersion system based data, the focused visual hull not only improves the visual result as seen in Figure 2, but it also speeds up reconstruction. In the case where color extraction is not needed, such as motion tracking, the focused visual hull calculation without color extraction of the 'crane' data set can run at about 12 fps with a grid size of 256³. A grid size of 128³ results in frame rates over 60 fps. This is due to a reduction in calculations and less information transfer from the GPU to the CPU. This allows a smooth, high resolution user reconstruction for future motion tracking systems.
6 Conclusion
This paper introduces a focused volumetric visual hull reconstruction with view independent color extraction. The visual hull is focused based on orientation, dimension and location of the reconstruction target. By assuming small changes of the object from frame to frame, this results in an object aligned voxel grid with adjusted voxel dimensions to enclose the object as tightly as possible. This approach leads to a higher resolution voxel grid by concentrating the voxel evaluation on the target object instead of covering the entire possible object movement range. Depending on the further use of the resulting visual hull, a view independent color extraction is performed. This color extraction is based on the normal of each surface voxel to evaluate the possible color input cameras. These calculations are all done locally in the voxel neighborhood and are therefore prone to occlusions. To overcome this drawback, a local approximation to occlusion detection is introduced. This leads to fewer color artifacts caused by object occlusions. All the visual hull and color extraction related calculations are performed on the GPU with CUDA as programming language, which leads to interactive rates. Tests have
been conducted with static and dynamic data sets. These tests show that the number of evaluated voxels is reduced and that the voxels are adjusted to enclose the target object closely, given fixed voxel grid dimensions. This leads to a higher resolution colored focused visual hull reconstruction with interactive rates without look-up table precomputation.
References
1. Allard, J., Menier, C., Raffin, B., Boyer, E., Faure, F.: Grimage: markerless 3d interactions. In: International Conference on Computer Graphics and Interactive Techniques. ACM, New York (2007)
2. Cheung, G., Kanade, T., Bouguet, J., Holler, M.: A real time system for robust 3D voxel reconstruction of human motions. In: IEEE Conference on Computer Vision and Pattern Recognition, Proceedings, vol. 2 (2000)
3. Hasenfratz, J., Lapierre, M., Sillion, F.: A real-time system for full body interaction with virtual worlds. In: Eurographics Symposium on Virtual Environments, pp. 147–156 (2004)
4. Knoblauch, D., Kuester, F.: VirtualizeMe: interactive model reconstruction from stereo video streams. In: Proceedings of the 2008 ACM symposium on Virtual reality software and technology, pp. 193–196. ACM, New York (2008)
5. Ladikos, A., Benhimane, S., Navab, N.: Efficient visual hull computation for real-time 3D reconstruction using CUDA. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2008, pp. 1–8 (2008)
6. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(2), 150–162 (1994)
7. Li, M., Magnor, M., Seidel, H.: Hardware-accelerated visual hull reconstruction and rendering. In: Graphics Interface Proceedings 2003: Canadian Human-Computer Communications Society, p. 65. AK Peters, Ltd., Wellesley (2003)
8. Matusik, W., Buehler, C., McMillan, L.: Polyhedral visual hulls for real-time rendering. In: Proceedings of the Eurographics Workshop in Rendering Techniques 2001, London, United Kingdom, June 25-27, p. 115. Springer, Heidelberg (2001)
9. Matusik, W., Buehler, C., Raskar, R., Gortler, S., McMillan, L.: Image-based visual hulls. In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 369–374. ACM Press/Addison-Wesley Publishing Co., New York (2000)
10. Mikić, I., Trivedi, M., Hunter, E., Cosman, P.: Articulated body posture estimation from multi-camera voxel data. In: IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii (2001)
11. Potmesil, M.: Generating octree models of 3D objects from their silhouettes in a sequence of images. Computer Vision, Graphics, and Image Processing 40(1), 1–29 (1987)
12. Seitz, S., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: Int. Conf. on Computer Vision and Pattern Recognition, pp. 519–528 (2006)
13. Szeliski, R.: Rapid octree construction from image sequences. CVGIP: Image Understanding 58(1), 23–32 (1993)
14. Vlasic, D., Baran, I., Matusik, W., Popović, J.: Articulated mesh animation from multi-view silhouettes. ACM Trans. Graph. 27(3), 1–9 (2008)
Graph Cut Based Point-Cloud Segmentation for Polygonal Reconstruction
David Sedlacek and Jiri Zara
Czech Technical University in Prague, Faculty of Electrical Engineering
Abstract. The reconstruction of 3D objects from a point-cloud is based on sufficient separation of the points representing objects of interest from the points of other, unwanted objects. This operation, called segmentation, is discussed in this paper. We present an interactive unstructured point-cloud segmentation based on the graph cut method, where the cost function is derived from the Euclidean distance between point-cloud points. The graph topology and the direct 3D point-cloud segmentation are the novel parts of our work. The segmentation is presented on a real application, the terrain reconstruction of a complex miniature paper model, the Langweil model of Prague. Keywords: segmentation, graph cut, point-cloud, terrain reconstruction, Langweil model.
1 Introduction
The reconstruction of 3D objects from a set of photos often utilizes a 3D point-cloud to separate various objects of interest. If the shape of the reconstructed objects is unknown, automatic segmentation does not provide satisfactory results. Interactive processing of the input point-cloud is thus more efficient. Such a user-driven segmentation can be performed either in image space (by selecting areas in input photos) or in the 3D space of the point-cloud. A set of segmented points is finally converted into a polygonal model representing boundaries of the reconstructed objects (see Fig. 1). The main contribution of this paper is in 3D point-cloud interactive segmentation performed completely in 3D space. The segmentation process derives a
Fig. 1. Three steps of the reconstruction process (from left to right): input point-cloud, segmented point-cloud (ground points only), reconstructed polygonal ground
benefit from a graph cut algorithm, i.e. from searching for the minimal cut or maximum flow in a weighted graph. The weight function is based on Euclidean distance and no other information is needed (e.g. normals, colours). The graph topology presented in this paper has not been used with the graph cut algorithm so far. Yuan et al. already describe graph cut based point-cloud segmentation in [1]. A hybrid approach to 3D segmentation is described in their paper. The point-cloud is pre-segmented into clusters based on certain properties (point distances, colours, normals). The user can freely manipulate the point-cloud into a desired view position; then the point-cloud is rendered into an image. The graph cut algorithm is applied to this image, where the segmentation criterion is derived from point colour and normal. Finally, the image segmentation is projected back to the pre-segmented 3D point-cloud clusters. Contrary to this approach, our method is completely 3D based and does not require any additional information about the point-cloud, which results in more generic usage. The paper is organized as follows. Section 2 introduces previous work related to graph cut and segmentation. Section 3 is the core of this paper – it describes the point-cloud segmentation and graph construction. The application of segmentation is presented in Section 4 on the example of the Langweil model ground reconstruction. The main contributions of our approach are summarized in Section 5.
2 Related Work
Our work combines two previously unconnected techniques, unstructured point-cloud segmentation and the powerful graph cut algorithm [2], in order to obtain interactive segmentation. The graph cut algorithm is designed to minimize an energy function in a weighted graph, where the energy function defines the segmentation. This technique is widely used in standard image segmentation [3,4,5] or in segmentation of range images and stereo matching [6,7]. Several works benefit from graph cuts for 3D mesh segmentation [8,9] or surface extraction [10,11,12]. Segmentation refers to the task of labelling a set of measurements in the 3D object space (point-cloud). The points with the same label should satisfy several similarity conditions (e.g. close distance, colour, normal) and segmentation methods differ in the way this is achieved. Segmentation of unstructured point-clouds was discussed in several works: clustering [13,1], region growing [14,15], divide and conquer, mesh segmentation [15]. Our method is similar to region growing but it provides globally optimized results thanks to the graph cut algorithm behaviour. In the area of point-cloud segmentation, the graph cut has been used in several works. Similarly to the previously mentioned work [1], the other works use some kind of pre-segmentation (they do not segment the point-cloud directly) and they require additional information from 2D images like color, gradient or depth. Quan et al. [16] use a ratio of 3D Euclidean distance and color gradient from corresponding pictures as a segmentation criterion. They build a weighted graph, pre-segment
it with thresholding, and then the graph cut is applied to the wrongly pre-segmented parts according to the user's wishes. Anguelov et al. applied Markov Random Fields to point-cloud and range data segmentation in [17], where the object segmentation is based on local surface features. The trained Markov Random Fields can be solved through graph cuts.
3 Segmentation
We present point-cloud segmentation as a graph cut problem. This means finding a minimal cut in a weighted graph containing all point-cloud points. The graph must sufficiently represent an unstructured point-cloud, and the segmentation process minimizes the energy function. The energy function reflects the equilibrium between segmentation and graph topology. The graph construction is described first, followed by the segmentation description.
Graph Construction
The weighted graph G contains all 3D points (vertices) V from the input point-cloud and two imaginary vertices called terminals. The terminals represent the assignment of points from V to two sets representing Object (S) or Background (T) points. The terminals corresponding to these sets are called source s and sink t. Every vertex from V is initially connected with its N nearest neighbours from V, where each edge E connecting two vertices is weighted with capacity C_E

    C_E = κ · e^(−D_E² / σ)    (1)

where D_E is the Euclidean distance of the two vertices and σ controls the behaviour of the exponential function,

    D_E² = (x_1 − x_2)² + (y_1 − y_2)² + (z_1 − z_2)².    (2)

The constant κ is a scaling factor from float to integer values in the range (0; κ]. A simple graph is shown in figure 2a, where every point is connected with N = 4 nearest neighbours.

The edges between vertices V and the terminals are not set in the initialization phase. This is a special behaviour in comparison with the standard graph cut as presented in [3], where edges between a point p ∈ {V} and the two terminals s and t exist and the edge capacities are set to the probability of assigning the point to the Object or Background set based on certain a priori known information. For example, the probability may reflect how the point fits a known colour model of the object and background. The omission of edges to the terminals has two advantages for the graph cut algorithm. The first advantage is a decrease in the number of edges in the final graph, which results in a higher speed of the max-flow computation. Boykov in [2] proved that the asymptotic complexity of their graph cut implementation is O(mn²|C|), where n is the number of vertices, m is the number of edges, and |C| is the size of the minimal cut. The second advantage for point-cloud segmentation concerns the graph cut behaviour during segmentation.
Based on our observations, the algorithm generates too many small areas (labelled either Object or Background) when it is initialized with many edges connected to the terminals [18]. Having no such initial edges allows us to better control the algorithm output interactively.
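To make the construction concrete, the following sketch builds such an N-nearest-neighbour graph with the Gaussian edge capacities of equation (1). This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes SciPy and NetworkX are available, points is an (n, 3) array of point-cloud coordinates, and the function and parameter names are invented for the example (the defaults mirror the N = 4 of figure 2a and the κ, σ values reported in Section 4).

```python
import numpy as np
import networkx as nx
from scipy.spatial import cKDTree

def build_pointcloud_graph(points, n_neighbours=4, sigma=0.25, kappa=4096):
    """Weighted N-nearest-neighbour graph with the capacities of eq. (1).

    points : (n, 3) array of point-cloud coordinates.
    Terminal nodes "s" and "t" are added without any t-links; those are only
    created later from user strokes (Section 3.2).
    """
    tree = cKDTree(points)
    # each point is its own closest neighbour, so query k + 1 and skip column 0
    dists, idx = tree.query(points, k=n_neighbours + 1)
    g = nx.Graph()
    g.add_nodes_from(range(len(points)))
    for p in range(len(points)):
        for d, q in zip(dists[p, 1:], idx[p, 1:]):
            capacity = int(kappa * np.exp(-(d ** 2) / sigma))  # eq. (1), scaled to integers
            if not g.has_edge(p, int(q)):
                g.add_edge(p, int(q), capacity=capacity)
    g.add_node("s")  # Object terminal (source)
    g.add_node("t")  # Background terminal (sink)
    return g
```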
Fig. 2. Graph topology and segmentation example. Vertices (white dots) are connected with 4 nearest neighbours here. The edge costs are reflected by the edges' thickness. a) Points specified interactively by a user (seeds) for Object and Background are shown as arrows with the appropriate letter. b) Seeds are replaced by corresponding terminals; newly created edges to terminals inherit their capacities from the corresponding edges previously connecting the seeds with their neighbours. c) Segmented input graph. Light grey vertices are Objects, dark grey ones are Background.
3.2 Graph Cut Based Segmentation
Segmentation in our approach is formulated as a minimal graph cut problem. We want to separate the set V into two disjunctive sets labelled S and T while the energy function E(L), describing the equilibrium between the segmentation and the point-cloud behaviour, is minimized. The energy function is defined as

    E(L) = λ Σ_{p∈V} R_p(L_p) + Σ_{p,q∈M} B_{p,q}(L_p, L_q)    (3)

where L = {L_p | p ∈ V} is a labelling of the point-cloud V. The coefficient λ ≥ 0 specifies the relative importance of the region term R_p() versus the boundary properties term B_{p,q}(). M is the set of all N-nearest-neighbour pairs from V. The boundary properties term reflects a distance function between neighbouring points and should be taken into account only in the case of different labelling of neighbouring points:

    B_{p,q}(L_p, L_q) = C_E · δ(L_p, L_q)    (4)

where C_E is the edge capacity between points p, q from equation (1) and

    δ(L_p, L_q) = 1 if L_p ≠ L_q, 0 otherwise.    (5)
The region term R_p() influences the segmentation process through interaction. A user specifies the point-cloud labelling by selecting several points. We connect these points either to terminal s or t (see Fig. 2a). For the case of the Object terminal s, the capacities from a point p to the terminals will be

    C_sp = ∞,  C_pt = 0    (6)
and similarly for the Background terminal t. Since we do not want to increase the number of edges during interactive work, we apply the following simplifications to the flow graph. First, we do not construct any edge to a terminal when the edge capacity is 0. Second, a point connected to a terminal with maximum (infinite) edge capacity can be considered a terminal itself. In this case, we can disconnect this point from the graph, reconnect its edges to the adequate terminal, and keep the capacities of these edges unchanged (see fig. 2b). The segmentation is computed using the max flow algorithm, implemented as in [2]. After each user input (a stroke in our case) the graph cut is computed and visualised, so the user can decide to add more strokes. The segmentation process is illustrated in figure 3.
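A hedged sketch of this interactive step is given below: user seeds are turned into terminal links with effectively infinite capacity (the simplification of equation (6)) and the labelling is read off a min-cut computation. NetworkX's generic minimum_cut is used here only as a stand-in for the max-flow implementation of [2] that the authors use; the graph g is assumed to come from a construction like the one sketched in Section 3.1, and all names are illustrative.

```python
import networkx as nx

INF = 10 ** 12  # effectively infinite capacity for seed edges

def segment(g, object_seeds, background_seeds):
    """Label the point-cloud graph g by a minimal s-t cut.

    g must contain the terminal nodes "s" and "t" and integer edge capacities.
    Seed points are linked to their terminal with (effectively) infinite capacity,
    which mimics merging them with the terminal as described above (eq. (6)).
    Returns (object_points, background_points) as sets of node indices.
    """
    for p in object_seeds:
        g.add_edge("s", p, capacity=INF)   # C_sp = inf, C_pt = 0
    for p in background_seeds:
        g.add_edge(p, "t", capacity=INF)   # C_pt = inf, C_sp = 0
    cut_value, (reachable, non_reachable) = nx.minimum_cut(g, "s", "t", capacity="capacity")
    return reachable - {"s"}, non_reachable - {"t"}

# after each user stroke, only the new seed edges are added and the cut is recomputed:
# objects, background = segment(g, object_seeds={12, 57}, background_seeds={301})
```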
Fig. 3. Segmentation process. The first row represents points in the S set, the points of the T set are in the bottom row. From left to right: initially all points are interpreted as members of the S set, and the seeds are represented by arrows with the appropriate letter. In the second column the points are segmented depending on the seeds from the first image, and an additional background seed is specified. The final segmentation is on the right.
4 Application - Terrain Reconstruction
The proposed segmentation method was intensively used in the terrain reconstruction of the Langweil model of Prague. The Langweil model is described in Section 4.1. The terrain reconstruction, with details about the point-cloud segmentation and the final triangulation, is presented in Section 4.2. Other tests were carried out with
LiDAR data from [http://opentopography.org], a Northern San Andreas Fault subset. We chose LiDAR data because they are well suited to our application and prove that the algorithms are applicable to more common problems than the main application purpose.
4.1 Langweil Model of Prague
The oldest model of Prague was created by Antonín Langweil in the years 1826-1837 and is placed at the City of Prague Museum [www.muzeumprahy.cz]. It is made from paper and illustrated by pen-and-ink drawings. The model size is about 3.5 m × 6 m in scale 1:480; the corresponding area of the real city is about 1.6 km × 2.6 km. There are more than 2000 buildings corresponding to the land register and almost 7000 other unique objects like shelters, small walls, statues, and trees. The ground varies throughout the city. The Old Town is mainly planar, going down to the Vltava river and then up to the hilly part of Prague, the Prague Castle and the gardens around it. The model itself consists of 60 parts, each with a different number of objects and a different complexity, see Fig. 4.
Fig. 4. From left to right: Langweil model of Prague at City of Prague Museum, one model part, detail photo - a wall with millimetre ruler
Because of the paper nature of the model, museum experts prohibited standard types of scanners, lasers, high-powered flashing, and touching the model. For this reason, a special robot was developed for the reconstruction purposes. The robot automatically took photos of one model part from several camera positions and orientations. The whole scanning took two months of non-stop running. Almost 300 000 photos with 4K resolution were obtained.
4.2 Terrain Reconstruction
The terrain reconstruction is performed in a three-stage process. The first stage is point-cloud generation from top-view photos. The point-cloud is then interactively segmented using the segmentation described in Section 3 - we present implementation details in the following text. The last step is a polygonal terrain reconstruction preceded by point-cloud filtration.
Segmentation: The aim of the application of our novel method has been to separate ground points from other objects in the paper model, such as walls or trees. For this reason we decided to adjust the distance function from equation (2) to

    D_E² = (x_1 − x_2)² + (y_1 − y_2)² + γ · (z_1 − z_2)²    (7)
where γ is a penalisation of height in the Euclidean distance. In equations (1) and (7) we experimentally pre-set the values σ = 0.25 and γ = 200. These values depend on the point-cloud density and the real ground behaviour. If the ground is almost planar, γ should be increased for faster segmentation convergence, and vice versa for hilly terrain. Similarly with the σ value: if the point-cloud density is low, σ should be higher so that the distance function does not penalize distant points. The neighbour count was set to N = 16, the scaling factor was set to κ = 4096, and we did not experiment with those values. Several σ values were tested: 0.05, 0.07, 0.1, 0.25, 0.5. The values from 0.05 to 0.1 inclusive were too low; the segmentation could not cross small discontinuities between points and the user needed to use more strokes. On the other hand, σ = 0.5 was too high and the segmentation process connected distant points together, so the user needed to add more opposite strokes. The γ value depends more on the point-cloud behaviour than σ. The γ value should be lowered for hilly terrain (the tested values 50 and 100 work well) and should be higher for flat parts (tested values 200 and 250).

User interaction is necessary during the segmentation process. The point-cloud is shown to the user in 2D form. The user looks at the terrain from an orthographic top view and selects which part should be ground. The points are separated into two sets - ground points T and the other points S representing buildings, roofs, walls and trees. The consecutive segmentation is visualised by green and red pixels (the T and S sets, respectively) and the point depth is reflected through the colour intensity. The user adds strokes to the image, which is interpreted as adding edges between the selected point-cloud points and the terminals s or t as described in Section 3.2. Initially, all point-cloud points are interpreted and visualised as members of the S set, i.e. objects. The user adds strokes at the places where the points are assigned to the wrong set. After each stroke the minimal cut is computed and the segmentation is visualised, so the user can decide whether it is necessary to add more strokes or whether the point-cloud is well segmented, see Fig. 3. When the user is satisfied with the segmentation, all points of the S set are deleted and only the T points are used in the next step, the filtration.

Filtration and polygonal ground reconstruction: The segmented point-cloud contains all points marked as ground, but the points are still too dense and are not suitable for a polygonal reconstruction, i.e. the polygonal ground reconstructed from those points would have too many triangles. For this reason we should perform point-cloud filtration before the geometry reconstruction. The second possibility is to decimate triangles after the polygonal reconstruction [19]. The point-cloud structure from the previous step directs us to filtration before reconstruction.
The final step of our process is the transformation of the point-cloud into a polygonal representation. Existing approaches are mainly based on marching cubes [20] or Voronoi diagrams [21,22]. Those approaches try to represent a 3D point-cloud as a solid 3D polygonal object. In the case of a terrain model, the data behaviour is more 2D than 3D. The segmented and filtered point-cloud looks like a height map, and for this reason a 2D Voronoi-based triangulation is sufficient.
Fig. 5. Point-cloud segmentation and polygon terrain reconstruction examples. From left to right: input point-cloud, segmented points, reconstructed polygonal ground.
Table 1. Point-cloud sizes, initialization times, and user interaction counts. The initialization phase is preprocessed off-line. Parts 10 and 13 are in figure 5, part 23 is in figures 1 and 3. Part 39 and the LiDAR data are not shown.

    model     points    init time   update time   background seeds   object seeds
    part 10   235 672   2m 40s      1s            18                 0
    part 13    86 225   1m 10s      1s             2                 0
    part 23   147 490   1m 52s      1s            11                 5
    part 39   754 998   6m 36s      1s            46                 7
    LiDAR     500 000   5m 27s      1s            10                 2

5 Results
We have successfully applied the graph cut method to unstructured 3D point-cloud segmentation. We have utilized this novel algorithm on a large set of input data. In the case of the Langweil model of Prague, the point-clouds had a typical size of half a million points. For each such point-cloud, human operators were able to separate the ground data using several strokes and/or clicks, usually from 10 to 50. All these user inputs were processed immediately at interactive speed (less
than 2 seconds); detailed results are given in Table 1. The final reconstructed polygonal terrain has become a solid and precise base on which the other objects of the whole 3D digital Langweil model were positioned. Examples taken from the project [www.langweil.eu] are shown in fig. 6.
Fig. 6. Final reconstruction of two different parts shown as wire-frame and textured models. For better illustration, the ground is displayed together with the other objects, as in the final model.
Acknowledgements
This research has been partially supported by MSMT under the research programs MSM 6840770014 and LC-06008 (Center for Computer Graphics). We also appreciate the many scientific suggestions given by Daniel Sýkora.
References
1. Yuan, X., Xu, H., Nguyen, M.X., Shesh, A., Chen, B.: Sketch-based segmentation of scanned outdoor environment models. In: Igarashi, T., Jorge, J.A. (eds.) Eurographics Workshop on Sketch-Based Interfaces and Modeling (2005)
2. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(9), 1124–1137 (2004)
3. Boykov, Y.Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In: Proceedings of the International Conference on Computer Vision (ICCV 2001), vol. 1, pp. 105–112 (2001)
4. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 888–905 (2000)
5. Wu, Z., Leahy, R.: An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 15, 1101–1113 (1993)
6. Kahler, O., Rodner, E., Denzler, J.: On fusion of range and intensity information using graph cut for planar patch segmentation. Int. J. Intell. Syst. Technol. Appl. 5, 365–373 (2008)
7. Bleyer, M., Gelautz, M.: Graph-cut-based stereo matching using image segmentation with symmetrical treatment of occlusions. Image Commun. 22, 127–143 (2007)
8. Katz, S., Tal, A.: Hierarchical mesh decomposition using fuzzy clustering and cuts. In: SIGGRAPH 2003, pp. 954–961. ACM, New York (2003)
9. Golovinskiy, A., Funkhouser, T.: Randomized cuts for 3D mesh analysis. In: SIGGRAPH Asia 2008, pp. 1–12. ACM, New York (2008)
10. Hornung, A., Kobbelt, L.: Robust reconstruction of watertight 3D models from non-uniformly sampled point clouds without normal information. In: SGP 2006: Proceedings of the Fourth Eurographics Symposium on Geometry Processing, pp. 41–50. Eurographics Association, Aire-la-Ville, Switzerland (2006)
11. Sinha, S., Pollefeys, M.: Multi-view reconstruction using photo-consistency and exact silhouette constraints: a maximum-flow formulation. In: Tenth IEEE International Conference on Computer Vision (ICCV 2005), vol. 1, pp. 349–356 (2005)
12. Labatut, P., Pons, J.P., Keriven, R.: Efficient multi-view reconstruction of large-scale scenes using interest points, Delaunay triangulation and graph cuts. In: IEEE 11th International Conference on Computer Vision (ICCV 2007), pp. 1–8 (2007)
13. Dorninger, P., Nothegger, C.: 3D segmentation of unstructured point clouds for building modelling, p. 191 (2007)
14. Rabbani, T., van den Heuvel, F., Vosselmann, G.: Segmentation of point clouds using smoothness constraint (2006)
15. Jiang, X.Y., Bunke, H.: Fast segmentation of range images into planar regions by scan line grouping. Machine Vision and Applications, 115–122 (1994)
16. Quan, L., Tan, P., Zeng, G., Yuan, L., Wang, J., Kang, S.B.: Image-based plant modeling. ACM Trans. Graph. 25(3), 599–604 (2006)
17. Anguelov, D., Taskar, B., Chatalbashev, V., Koller, D., Gupta, D., Heitz, G., Ng, A.: Discriminative learning of Markov random fields for segmentation of 3D scan data. In: CVPR 2005: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, pp. 169–176. IEEE Computer Society, Los Alamitos (2005)
18. Sýkora, D., Dingliana, J., Collins, S.: LazyBrush: Flexible painting tool for hand-drawn cartoons. Comput. Graph. Forum 28(2), 599–608 (2009)
19. Hussain, M., Okada, Y., Niijima, K.: Efficient and feature-preserving triangular mesh decimation. Journal of WSCG, 167–174 (2004)
20. Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3D surface construction algorithm. Computer Graphics 21(4), 163–169 (1987)
21. Aurenhammer, F.: Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Comput. Surv. 23(3), 345–405 (1991)
22. Amenta, N., Bern, M., Kamvysselis, M.: A new Voronoi-based surface reconstruction algorithm. In: SIGGRAPH 1998: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp. 415–421. ACM, New York (1998)
Dense Depth Maps from Low Resolution Time-of-Flight Depth and High Resolution Color Views

Bogumil Bartczak and Reinhard Koch
Institute of Computer Science, University of Kiel, Hermann-Rodewald-Str. 3, 24118 Kiel, Germany
{bartczak,rk}@mip.informatik.uni-kiel.de

Abstract. In this paper a systematic approach to the processing and combination of high resolution color images and low resolution time-of-flight depth maps is described. The purpose is the calculation of a dense depth map for one of the high resolution color images. Special attention is paid to the different nature of the input data and their large difference in resolution. This way the low resolution time-of-flight measurements are exploited without sacrificing the high resolution observations in the color data.
1 Introduction
The reconstruction of dense depth information for a color image has its use in many applications. Machines or vehicles equipped with color cameras can use the depth information for navigation and map building. A particular advantage of the direct association of depth and color is found in the domain of building digital textured models of scenes and objects. These models serve the purposes of documentation, planning, and visualization. The representation of large scenes and fine details in such models is difficult. Approaches trying to alleviate this challenge are image based rendering techniques [1], where new viewpoints of a scene are generated from a set of color images. These techniques largely benefit from the use of geometric priors, which can have the form of dense depth maps [2]. Due to its compact representation, the image based rendering approach is also attractive for application in free viewpoint video and 3D television. In order to be useful in future television applications, a dense depth reconstruction scheme has to be able to cope with a large variety of scenes. Unfortunately, the often applied image based matching techniques strongly depend on the scene's texturing. Therefore these approaches tend to fail in homogeneous image regions and at repetitive patterns. An alternative to image matching is to use active devices like laser scanners or correlating time-of-flight (ToF) cameras [3]. Both devices use the principle of measuring the traveling time of actively sent out light and do not depend on the scene's color distribution. Laser scanners produce high quality scans, but due to their processing scheme and their bulk they
Fig. 1. High resolution color image and corresponding low resolution ToF depth map. Due to the low resolution of the ToF camera the lamp arm is not observed in the depth map although visible in the color image.
are best suited for the reconstruction of static scenes. On the other hand, ToF cameras deliver depth images with high frame rates and are thus applicable in dynamic scenes. ToF cameras however have a low resolution (204 × 204 pixel)1 when compared with today’s Full HD (1920 × 1080 pixel) television standard. When combining high resolution color with low resolution ToF depth maps, the mismatch in resolution leads to misalignments at object boundaries. In addition small or thin objects might not be observed by the ToF camera (see fig. 1). Given multiple high resolution calibrated color views and one calibrated low resolution ToF depth map, this work describes an algorithm for the retrieval of a dense depth map for one of the color images. The combination of color images and ToF depth for the estimation of dense depth maps is not new. However, to our understanding, previous contributions to this specific topic do not regard the different natures of the two data sources properly. This is discussed in the next section. Section 3 then gives an introduction to the proposed approach, while sections 4 and 5 describe the system’s parts in detail. In section 6 some results achieved with our algorithm are presented and discussed. The work is concluded in section 7.
2 State of the Art
Different propositions have been made of how to combine ToF data with color information. A rough classification of methods is found by separating approaches that only use a single color view and those that use multiple color views. In [4,5] monocular schemes are discussed. These approaches basically upsample low resolution depth maps by the use of a color controlled bilateral filtering scheme, so that color edges and depth discontinuities become aligned. Although outliers can be detected from a local neighborhood, these algorithms solely rely on the low resolution depth measurements and cannot correctly assess the depth's validity. It is moreover not possible to reintroduce information that has been lost due to the low resolution of the depth map. If multiple color views are available, the photo consistency across views can be used to estimate depth by triangulation. Dense stereo matching [6] has difficulties with ambiguities, in particular in structureless or repetitive image regions. These
1 www.pmdtec.com
difficulties can be alleviated in combination with ToF measurements, which are not influenced by such factors. In [4,7,8] approaches are described that initialize a reconstruction with the ToF camera observations and try to refine it using photo consistency. The problem these approaches have to face is that photo consistency is not a convex measure. As such it will only yield improved results with close to correct observations from the ToF camera, and therefore situations like shown in fig. 1 will be difficult to handle. [9] describes an approach for dense depth map estimation where photo consistency and ToF depth maps are evaluated simultaneously. This is achieved by the use of a discrete hypothesis cost volume (see [6]), where the costs per pixel and per hypothesis are a weighted sum of a photo consistency and a robust distance measure to the ToF depth input. Additionally a smoothness cost is introduced and an optimized depth map is derived by using belief propagation. However, two important aspects of this approach are not discussed in the paper. Firstly ToF depth measurements and the photo consistency have different domains. While the distance measure will give differences in some metric unit, the photo consistency will measure differences in the color value range. A proper combination of these measures does not only have to consider the different value ranges but also the ambiguities in the photo consistency. This means that although it can be expected that a correct hypothesis will have low photo consistency costs, it cannot generally be assumed that a wrong hypothesis will have significantly higher costs. In combination with a wrong ToF depth measurement, this can lead to an error in the reconstruction, although photo consistency on its own might have come to the right conclusion. The second aspect is that due to the interpolation process involved in the upsampling, the ToF depth will appear smooth on the resolution level of the color images. Any sort of smoothness constraint will therefore implicitly prefer the low resolution ToF observations. One way to preserve the high resolution depth information despite smoothness constraints, is to strengthen them via the data term. The definition of a data term well fitting these challenges is the main contribution of this paper.
3 General Approach
The main idea behind the proposed scheme is to find a cost function C for a designated high resolution color image, which will be called the reference image from now on. The positions of the function’s per pixel minima shall closely correspond to the correct depth per pixel. This function has to be derived from multiple high resolution color images and a low resolution ToF depth map. The previous section discussed the difficulties when depth measurements and photo consistency are combined. In order to remedy this, we in a first instance estimate depth maps, which are solely based on the color information. This is done via dense local image matching for each (suitable) color image pair containing the reference image. Although it is generally agreed that local image matching is not reliable for dense reconstruction tasks, its use in finding matches between salient image regions is accepted. Instead, a global optimization approach could
Fig. 2. Overview of the proposed processing scheme
have been applied. At this stage however, it is our interest to reliably find high resolution depth estimates for structures not visible in the ToF depth map. These structures will typically have sufficient saliency, but due to their size, the smoothness constraints used in global optimization might hinder their detection. A scheme well suited for the reconstruction of such structures is proposed in [10]. For the removal of unreliable or wrong matches contained in the estimated depth maps many propositions have been made. Most reliable are the cross-validation of the left/right consistency [11] and the invalidation of all depth estimates that lie in image regions where the local color variance is found to be below the anticipated noise level. Other than that, the cost's uniqueness and a maximal cost threshold can be used [12] to find unreliable matches. After the depth estimation and validation, one or multiple sparse but reliable depth maps for the reference image are available. By using such pairwise stereo estimates rather than simultaneously correlating all views for depth estimation, we follow the idea of multi-view stereo reconstruction approaches, where pairwise matches are only used if their costs reach a predefined quality [13,14]. This technique avoids dealing with conflicts in the matching measure, which appear at (semi-)occlusions or non-Lambertian lighting effects. In such situations, it can be expected that different image pairs will indicate a conflict through inconsistent depth estimates. On the other hand, reliable reconstructions will be indicated through a consensus in the depth estimates. This includes the observations made by the ToF camera. In order to compare the ToF observations with the high resolution stereo depth maps, we warp the ToF depth map into the reference view. Since this has to be done via a forward mapping from a low to a high resolution, a triangle surface mesh is generated from the ToF depth map and the target map is rendered using computer graphics techniques [15]. Due to the change in viewpoint, wrong occlusions can appear at object boundaries. These disocclusions can be detected by comparing the angle between a triangle's normal and its line of sight. A triangle is removed if this angle is close to 90 degrees [16]. After these processing steps, a varying number of depth observations is available for each pixel of the reference view. As stated before, we assume that a
consensus in these observations is a good measure for the correctness of a reconstruction. Therefore, the measurement of the consensus is the key motivation when defining the cost functions. Thereby we not only look at the depth observed per pixel but also in a local neighborhood. Since this is a smoothness assumption, which prefers the ToF camera’s low resolution, a high resolution photo consistency measure is added as a soft constraint to the cost function. In order to spread the measured consensus, the depth with the lowest costs is selected and fed back into the process. Fig. 2 gives an overview of this process. The details are explained in the next section.
4 Measuring Consensus
In this section, we define the calculation of the cost C(u, v, z) for a depth hypothesis z per pixel (u, v) of the reference image. As sketched in fig. 2, this cost consists of the distance from the given depth observations and the photo consistency in a local neighborhood. At this point, it has to be considered that the depth distances and the photo consistency have different domains. In order to combine them, we individually transform them into the value range [0, 1] ⊂ R. Herein, a value of 0 shall indicate that an observation believes a hypothesis z to be correct, while the value 1 expresses an observation's disbelief in the correctness of a hypothesis z. Thus, we do not combine values from different domains but rather compare indications of the correctness of a hypothesis. The normalization of all used costs to the range [0, 1] moreover allows us to deal with a varying number of depth values per pixel. The strategy followed during normalization is to prefer depth hypotheses with the higher count of approving observations.
4.1 Distance Costs
The distance costs D_i(u, v, z) for a hypothesis z per given depth map Z_i are defined as:

    D_i(u, v, z) = min((Z_i(u, v) − z)² / μ_Z², 1)   if Z_i(u, v) ∈ [z_min, z_max]
    D_i(u, v, z) = 1                                 else.    (1)

These costs equal 1 if the value Z_i(u, v) found in the depth map is not within the valid search range [z_min, z_max] or if z lies outside the uncertainty range μ_Z. Otherwise D_i is the normalized squared distance between z and Z_i(u, v). The truncation to the maximal value 1 is not only done to account for outliers, but also to implicitly detect close depth observations. This is visualized in fig. 3. Here cost graphs induced by four hypothetical depth measurements and their mean are depicted. Due to the constant cost contribution outside the uncertainty range, a minimum is formed by the mean where observations are close to each other. The constant cost contribution when no valid depth observation is given allows us to implicitly count the number of contributions to a minimum, since costs are potentially lower if more observations are available.
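The truncated distance cost of equation (1) can be read directly as array arithmetic; the sketch below is an illustrative interpretation (not the authors' code), with assumed array names and an assumed convention that pixels without a valid observation simply hold an out-of-range value.

```python
import numpy as np

def distance_cost(z_map, z, mu_z, z_range):
    """Truncated distance cost D_i(u, v, z) of eq. (1) for one input depth map.

    z_map  : (H, W) array of depth observations Z_i; pixels without an observation
             are assumed to hold an out-of-range value (e.g. 0)
    z      : scalar depth hypothesis
    mu_z   : uncertainty range (same metric unit as the depths)
    z_range: (z_min, z_max) valid search range
    """
    z_min, z_max = z_range
    cost = np.minimum((z_map - z) ** 2 / mu_z ** 2, 1.0)  # normalized squared distance, truncated at 1
    invalid = (z_map < z_min) | (z_map > z_max)           # observation outside the search range
    cost[invalid] = 1.0                                   # constant contribution for missing/outlier data
    return cost
```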
Fig. 3. Hypothetical depth observations and corresponding distance costs (dashed). The consensus on the correctness of a depth is indicated by the strength of a minimum in the mean costs (solid). The maxima in the mean costs reflect the overall agreement on the falseness of a depth.
4.2 Photo Consistency Costs
In order to find an appropriate mapping of the photo consistency into the value range [0, 1], the minimal photo consistency costs induced by a depth hypothesis between the reference image I_r and any other input image I_i are considered:

    AD(u, v, z) = min_i ( ||I_r(u, v) − I_i(H_i(u, v, z))||_1 , μ_PC ).    (2)

In this formula the photo consistency is derived from the absolute color difference observed at hypothetical correspondences. These correspondences are found by the homography mapping H_i utilizing a plane parallel to the reference image plane and with distance z to the reference camera center. The maximal distance value returned is μ_PC. This empirically determined threshold expresses the global uncertainty range of the photo consistency measure. It is assumed that the smallest photo consistency costs observed at a pixel correspond with the correct depth. On the other hand, the largest observed costs will correspond to a wrong depth. Therefore, the final costs PC are defined as follows:

    AD_min(u, v) = min_z (AD(u, v, z)),   AD_max(u, v) = max_z (AD(u, v, z)),    (3)

    PC(u, v, z) = (AD(u, v, z) − AD_min(u, v)) / (AD_max(u, v) − AD_min(u, v))   if AD_max(u, v) − AD_min(u, v) > Δ_noise
    PC(u, v, z) = 1                                                              else.    (4)
This form of mapping effectively increases the contrast between the lowest and highest observed color difference, which also increases the sensitivity to noise in homogeneous image regions. By comparing the distance between the minimal and maximal values, occlusions and homogeneous image regions can be detected. Therefore we disable the contribution of the photo consistency based costs if the distance range is below the anticipated noise level.
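As an illustration of equations (3) and (4), the sketch below contrast-stretches a precomputed volume of truncated color differences AD(u, v, z) into [0, 1] and flags pixels whose AD range stays below Δ_noise as invalid. The cost volume layout, the array names and the returned validity mask are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def photo_consistency_cost(ad_volume, delta_noise):
    """Normalize an AD cost volume to the PC values of eqs. (3) and (4).

    ad_volume  : (M, H, W) truncated absolute color differences AD(u, v, z)
                 for M sampled depth hypotheses
    delta_noise: anticipated noise level used to detect occlusions and
                 homogeneous regions
    Returns (pc, valid) where valid marks pixels whose photo consistency
    may contribute to the aggregation.
    """
    ad_min = ad_volume.min(axis=0)              # AD_min(u, v) of eq. (3)
    ad_max = ad_volume.max(axis=0)              # AD_max(u, v) of eq. (3)
    ad_range = ad_max - ad_min
    valid = ad_range > delta_noise              # eq. (4): informative contrast only
    pc = np.ones(ad_volume.shape, dtype=float)  # disabled/invalid pixels keep the value 1
    pc[:, valid] = (ad_volume[:, valid] - ad_min[valid]) / ad_range[valid]
    return pc, valid
```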
4.3 Aggregation
In the aggregation the distance costs D_i and the photo consistency costs PC observed in a local image neighborhood are fused into a single cost value. This is done in two stages. First, the mean costs per pixel are calculated. Afterwards a weighted mean is applied to combine costs from neighboring pixels. Given N depth maps and a neighborhood N(u, v) containing (u, v), this is expressed by:

    C(u, v, z) = [ Σ_{(k,l)∈N(u,v)} w(u, v, k, l, z) · ( PC(k, l, z) + Σ_{i≤N} D_i(k, l, z) ) / (N + 1) ] / [ Σ_{(k,l)∈N(u,v)} w(u, v, k, l, z) ].    (5)
As proposed in [10,17], the weights w(u, v, k, l, z) are introduced for the handling of discontinuities. Three features are used to calculate these weights. First, it can be expected that with increasing spatial distance of neighboring pixels the observed z distance will differ. Moreover, it can be assumed that depth discontinuities and color discontinuities coincide, so that neighboring pixels with different colors are expected to belong to different depth levels. Finally, jumps in the photo consistency often indicate differences in the depth of neighboring pixels [11]. All three features are expressed in the form of distances Δ and transformed into the value range [0, 1] via the function ω(Δ, γ) = e^(−Δ/γ). The distances used are the spatial distance Δ_s = ||(u, v) − (k, l)||_2, the color distance in the reference image Δ_C = ||I_r(u, v) − I_r(k, l)||_2, and the photo consistency distance Δ_pc = |PC(u, v, z) − PC(k, l, z)|. Using these distances the weights are defined as:

    w(u, v, k, l, z) = ω(Δ_s, γ_s) · ω(Δ_C, γ_C) · ω(Δ_pc, γ_pc).    (6)

In the case that PC(u, v, z) is invalid, ω(Δ_pc, γ_pc) is set to 1.
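A compact, illustrative sketch of the adaptive aggregation of equations (5) and (6) for a single pixel and hypothesis follows. It is not the authors' implementation: the per-pixel mean cost (PC + Σ_i D_i)/(N + 1) is assumed to be precomputed for the current hypothesis, the default γ values are taken from Table 1 (fusion column), and dropping the photo-consistency factor whenever either of the two pixels is invalid is an assumption that extends the rule stated above.

```python
import numpy as np

def omega(delta, gamma):
    """Feature weight omega(delta, gamma) = exp(-delta / gamma)."""
    return np.exp(-delta / gamma)

def aggregate_cost(u, v, per_pixel_cost, ref_image, pc, pc_valid, radius,
                   gamma_s=9.5, gamma_c=13.1, gamma_pc=0.1):
    """Adaptive weighted mean C(u, v, z) of eq. (5) for one pixel and one hypothesis.

    per_pixel_cost : (H, W) array holding (PC + sum_i D_i) / (N + 1) for the hypothesis z
    ref_image      : (H, W, 3) reference color image
    pc, pc_valid   : photo-consistency costs and validity mask for the hypothesis z
    radius         : half size of the square neighborhood N(u, v)
    """
    h, w = per_pixel_cost.shape
    num, den = 0.0, 0.0
    for k in range(max(0, u - radius), min(h, u + radius + 1)):
        for l in range(max(0, v - radius), min(w, v + radius + 1)):
            d_s = np.hypot(k - u, l - v)                           # spatial distance
            d_c = np.linalg.norm(ref_image[u, v].astype(float)
                                 - ref_image[k, l].astype(float))  # color distance
            w_kl = omega(d_s, gamma_s) * omega(d_c, gamma_c)
            if pc_valid[u, v] and pc_valid[k, l]:
                w_kl *= omega(abs(pc[u, v] - pc[k, l]), gamma_pc)  # photo-consistency jump, eq. (6)
            num += w_kl * per_pixel_cost[k, l]
            den += w_kl
    return num / den
```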
4.4 Feedback
After the aggregation the fused cost for a depth hypothesis z is available, which is based on the number and closeness of observations in a local image neighborhood. Depending on the number of input images and the approach used for acquiring depth observations (section 3), these costs will contain ambiguities, i.e. weak minima. Using a feedback mechanism we try to strengthen minima and at the same time propagate the conclusions drawn from a local neighborhood to adjacent neighborhoods. For this, the depth value with the smallest costs is selected for each pixel, which delivers a fused depth map Z_f. Hereby pixels whose lowest costs are 1 are assigned the depth value z⊥ = 0, which is an invalid depth z⊥ ∉ [z_min, z_max] (compare eq. (1)). The fused depth map is added to the set of initial depth maps and the cost functions are recalculated. This can be repeated, replacing Z_f after every iteration, which leads to a diffusion-like process. In this process the original input depth maps remain unchanged, which prevents details from becoming smoothed out. Please observe that due to the non-linear, adaptive aggregation of costs, discontinuities are still preserved.
5 Refinement
The result of the processing steps described in section 4 is a fused depth map Z_f for a single reference view. Applying this scheme to different views allows us to compare their consistency and this way to detect errors, which in particular will occur at occlusions. Depth values that are not consistent with any other view's fused depth are removed from Z_f by setting them to z⊥ = 0. In order to fill the generated holes, we use a simple inpainting technique, where the missing depth values are calculated using the depth values of neighboring pixels with similar color. This is achieved through a monocular adaptation of the scheme described in section 4, which thus basically becomes a variant of [4]. For this we only add Z_f into the set of input depth maps and remove the contribution of the photo consistency costs. This change affects the aggregation in eq. (5) and the weight calculation in eq. (6). The repetitive application of this monocular adaptation diffuses the mean depth value of border pixels into invalidated regions, if the border pixels have a color similar to the invalidated pixels. Pixels within larger holes are not filled immediately, since their minimal costs remain 1 until the hole sizes shrink sufficiently (compare section 4.4 and eq. (1)).
6 Experiments
We tested our proposals on two real indoor data sets, consisting of four color views in Full HD resolution (1920 × 1080 pixels) and one low resolution ToF depth map (176 × 144 pixels). This data is shown in fig. 4 and 5. We used the approach described in [18] to find the cameras' parameters; thereafter we estimated three sparse stereo depth maps for each color view and warped the ToF depth map into every color view, as described in section 3. Fig. 6 shows an extract of the calculated input depth maps. These input depth maps were fused into four depth maps as discussed in section 4. Finally, a refined depth map was calculated for one of the views, following the explanations in section 5. In order to find a close approximation to the depth value with minimal costs, the cost function was densely sampled using M equally spaced hypotheses in the search range, which in the given situation was approximately z_min = 2 m to z_max = 7 m. After selecting the depth sample with the lowest costs z̃, the final depth approximation was found by computing the minimum of a quadratic polynomial, which was fitted to the z positions z̃ − Δz, z̃, z̃ + Δz and their corresponding costs. The same sampling was used to determine approximations to the minimal and maximal matching costs AD_min and AD_max. The parameters used for the fusion and the refinement are listed in table 1. The right columns of fig. 7 show the results we achieved with this approach. The center right column shows the result if only a single camera pair is used. The right most column displays the fused depth maps if all camera pairs are combined. The gain in using multiple pairs is, on the one hand, the increased number of depth observations for fine structures. On the other hand it allows us to resolve occlusion problems. This is clearly visible when comparing the top images of fig. 7. Due to the use of multiple image pairs, the reconstruction
Table 1. Used parameters during the fusion and the refinement

    Parameter                        Fusion                  Refinement
    Number of Iterations             5                       5
    Distance Tolerance               μ_Z = 7.5 cm            μ_Z = 3.75 cm
    Number of Hypotheses (Spacing)   M = 260 (Δz ≈ 2 cm)     M = 260 (Δz ≈ 2 cm)
    Neighborhood                     41 × 41 pxl             81 × 81 pxl
    Spatial distance weight          γ_S = 9.5               γ_S = 19
    RGB-Color distance weight        γ_C = 13.1              γ_C = 4.35
    PC-distance weight               γ_pc = 0.1              not used
    AD-truncation                    μ_PC = 25               not used
    Expected noise level             Δ_noise = 9             not used
in the area between the leaves of the plant in the upper right corner is improved as well as the image part between the person’s chin and its torso. In order to show the improvements between our approach and previous propositions (see section 2) the depth maps shown in the left columns of fig. 7 were calculated. The left most column displays the outcome of using the monocular scheme proposed in [4]. Due to the strong dependence on the ToF data this approach fails at fine structures. This can be observed at the plants’ leaves in the top row or the lamp arm in the lower image row. Moreover erroneous measurements cannot be detected. These results however demonstrate that the use of low resolution ToF depth can be an asset, especially in the structureless regions where color matching is questionable. In the left center column the effect of adding a photo consistency cost like proposed in [9] can be seen. Some severe errors in the top left corner of the ToF depth in the beergarden data set are improved. Furthermore fine structures and edges are reconstructed with greater precision. However the lamp arm in the living room data was not recovered. The reason for this is the ambiguity in the photo consistency, especially when using a window based measure. The soft color adaptive steering of the aggregation, like used in [9], does in some cases not suffice to prevent the bias towards wrong depth hypotheses. The situation is worsened if those wrong hypotheses are additionally supported by the depth measurements of the ToF camera. Three mechanisms in our algorithm address this problem. Firstly the use of cross validated stereo depth maps, which deliver an unbiased high resolution opinion on the depth distribution from the color views. Secondly the normalization of the photo consistency and the distance costs to the range of [0, 1]. This way the pixel based photo consistency can compete with the distance costs. Finally the integration of the photo consistency distance into the adaptive aggregation (see eq. (6)), so that discontinuities are handled properly even if the required color contrast is too low. Please observe that despite the feedback mechanism and the refinement (section 5) the reconstruction is derived from local observations alone. Nevertheless the achieved results (fig. 7) are very homogeneous. In many cases this is owed to the low resolution ToF depth maps. Due to the increased contribution of the high resolution color data in our proposal, the results achieved with only two
Fig. 4. Captured color views of the beergarden scene (top row) and the living room scene (bottom row); columns: Camera 1 to Camera 4
Fig. 5. Captured depth maps: beergarden scene (left) and living room scene (right)
Fig. 6. Excerpt of used input depth maps for camera 3 (panels: Warped ToF and Sparse Stereo Depth). The stereo depth maps shown are generated from matches between cameras 2 and 3.
Fig. 7. Results achieved for camera 3 (column titles: Monocular ToF / Color; ToF / Color / Photo Consistency; Proposed Cam. 2 & 3; Proposed All Views). Left: color controlled upsampling of ToF depth maps. Center Left: simultaneous evaluation of distance costs and photo consistency. Center Right and Right: proposed approach using camera views 2 and 3 or using all views.
color views can become slightly more noisy than the results of previous works. On the other hand, it was shown that the proposed cost term allows us to reliably preserve high resolution structures, which makes it a good candidate for a data term in approaches that integrate smoothness models [6].
7 Conclusion
In this work we discussed a robust approach to the fusion of low resolution ToF depth images and stereo reconstruction from high resolution color images. The presented algorithm can exploit the use of more than two color views and delivers dense depth maps for a high resolution color image. Fine structures are well preserved even if contradictions are present in the low resolution ToF depth maps. In the current state the approach is heuristic in nature. However, the use of uncertainty ranges and normalization already brings it close to a probabilistic formulation. The investigation and proper modeling of uncertainties will be the concern of future work, especially in the face of the additional information that is available from ToF cameras. Future work will also deal with the integration of sophisticated smoothness constraints to resolve ambiguities. As described in section 2, the challenge here is to avoid a bias towards the low resolution ToF depth maps in the reconstruction.
Acknowledgment
This work was partially supported by the German Research Foundation (DFG), KO-2044/3-2, and the Project 3D4YOU, Grant 215075 of the Information Society Technologies area of the EU's 7th Framework programme.
References
1. Buehler, C., Bosse, M., McMillan, L., Gortler, S., Cohen, M.: Unstructured lumigraph rendering. In: SIGGRAPH 2001: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 425–432. ACM, New York (2001)
2. Shade, J., Gortler, S., He, L.-W., Szeliski, R.: Layered depth images. In: SIGGRAPH 1998: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques. ACM, New York (1998)
3. Xu, Z., Schwarte, R., Heinol, H., Buxbaum, B., Ringbeck, T.: Smart pixel - photonic mixer device (PMD). In: M2VIP 1998 - International Conference on Mechatronics and Machine Vision in Practice, pp. 259–264 (1998)
4. Yang, Q., Yang, R., Davis, J., Nister, D.: Spatial-depth super resolution for range images. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition CVPR 2007, pp. 1–8 (2007)
5. Chan, D., Buisman, H., Theobalt, C., Thrun, S.: A noise-aware filter for real-time depth upsampling. In: Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications with ECCV (2008)
6. Scharstein, D., Szeliski, R., Zabih, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In: Proceedings of the IEEE Workshop on Stereo and Multi-Baseline Vision, Kauai, HI (2001)
7. Beder, C., Bartczak, B., Koch, R.: A combined approach for estimating patchlets from PMD depth images and stereo intensity images. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 11–20. Springer, Heidelberg (2007)
8. Gudmundsson, S., Aanaes, H., Larsen, R.: Fusion of stereo vision and time-of-flight imaging for improved 3D estimation. In: Proceedings of the DAGM Dyn3D Workshop, Heidelberg, Germany (2007)
9. Zhu, J., Wang, L., Yang, R., Davis, J.: Fusion of time-of-flight depth and stereo for high accuracy depth maps. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition CVPR 2008, pp. 1–8 (2008)
10. Yoon, K.-J., Kweon, I.S.: Adaptive support-weight approach for correspondence search. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 28(4), 650–656 (2006)
11. Egnal, G., Wildes, R.: Detecting binocular half-occlusions: empirical comparisons of five approaches. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 24, 1127–1133 (2002)
12. Muhlmann, K., Maier, D., Hesser, R., Manner, R.: Calculating dense disparity maps from color stereo images, an efficient implementation. In: Proc. IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001), pp. 30–36 (2001)
13. Goesele, M., Curless, B., Seitz, S.: Multi-view stereo revisited. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2402–2409 (2006)
14. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multi-view stereopsis. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition CVPR 2007, pp. 1–8 (2007)
15. Bartczak, B., Schiller, I., Beder, C., Koch, R.: Integration of a time-of-flight camera into a mixed reality system for handling dynamic scenes, moving viewpoints and occlusions in real-time. In: Proceedings of the 3DPVT Workshop, Atlanta, GA, USA (2008)
16. Pajarola, R., Sainz, M., Meng, Y.: DMesh: Fast depth-image meshing and warping. Int. J. Image Graphics 4, 653–681 (2004)
17. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Proc. Sixth International Conference on Computer Vision, pp. 839–846 (1998)
18. Schiller, I., Beder, C., Koch, R.: Calibration of a PMD camera using a planar calibration object together with a multi-camera setup. In: Proceedings of the ISPRS Congress, Beijing, China (2008)
Residential Building Reconstruction Based on the Data Fusion of Sparse LiDAR Data and Satellite Imagery*

Ye Yu1, Bill P. Buckles2, and Xiaoping Liu1
1 College of Computer and Information, Hefei University of Technology, China
2 Department of Computer Science and Engineering, University of North Texas, USA
[email protected], [email protected], [email protected]
Abstract. Reconstruction of residential building models in an urban environment is a challenging task yet has many applications such as urban planning, simulation of disaster scenarios, cartography, wireless network planning, line-of-sight analysis, virtual tours, and many others. This paper presents a novel method for residential building reconstruction and modeling in urban areas using airborne light detection and ranging (LiDAR) data and satellite imagery. The main contribution is the automatic isolation of building roofs and roof reconstruction based on the fusion of LiDAR data and satellite imagery. By using cue lines which are generated from satellite imagery to separate buildings from other objects (including other buildings), we are able to automatically identify individual buildings from residential clutter and re-create a virtual representation with improved accuracy and reasonable computation time. We applied the method to urban sites in the city of New Orleans and demonstrated that it identified building measurements successfully and rendered 3D models effectively. Our experiments show that our method can successfully reconstruct small buildings with relatively sparse LiDAR sampling and in the presence of noise. Keywords: LiDAR, Building rendering, Satellite imagery, Roof extraction.
1 Introduction

Urban landscape modeling benefits planners in drainage system design, street improvement project selection, disaster management, and other tasks. Reconstruction of building geometry in an urban environment is a key component of the modeling task. Consider flood events as an example for which the rendering of urban landscapes may be useful. The Federal Emergency Management Agency (FEMA) defined a threshold equivalent to a 100-year flood. Via a realistic three-dimensional (3D) model, then, given the inundation level of the 100-year flood, we can construct a map that marks the buildings affected, estimates the number of floors to which the water extends, identifies survivable spaces, and charts a search sequence. Building modeling using current methods is tedious, requires tremendous amounts of effort, and relies on onsite surveys accompanied by manual drafting. In order to
* This research was supported in part by the National Science Foundation, grant numbers IIS-0737861 and IIS-0722106.
reduce the time and effort for producing accurate and high-resolution models of an area of interest, rapid and highly automatic methods for building reconstruction from sensor data are required. Light detection and ranging (LiDAR) sensors are deployed for land surveys and flood plain mapping. The instrument acquires elevation data over broad regions from an airborne vehicle. The use of LiDAR has resulted in the availability of massive amounts of elevation data from which accurate reconstruction of 3D urban models is possible. Satellite imagery is another source of sensor data. It can compensate for the low density of LiDAR data and provide further detail about building contours. Methods have been developed for 3D urban reconstruction from LiDAR data. The emphasis lies in building detection and building boundary determination. Most methods begin by gridding LiDAR elevation points to a depth image, such as a digital surface model (DSM), and then use image segmentation techniques to detect building footprints. Rottensteiner [1] applied segmentation using the variance among neighboring DSM normal vectors as a feature to find planar segments. Then roof planes were grouped to locate the building boundary. Zhang et al. [2] separated the ground and non-ground LiDAR measurements using a progressive morphological filter. Building points were labeled via a region-growing algorithm. The building boundary was derived by connecting boundary points. In order to avoid interpolation errors, others have proposed operating directly on the original elevation points (the point cloud). In these methods, usually, building data points are classified based on geometric properties such as elevation, elevation discontinuity [3] and elevation variance [4]; building faces are determined by local plane fitting and clustering based on the similarity of normals [5]. To improve geometrical precision, researchers have integrated aerial images into the process to improve the geometric quality of the models. The approaches to integrating the two modalities are quite varied. Fujii and Arikawa [6] extracted objects by analyzing vertical geometric patterns of LiDAR elevation images, and texture mapping was realized by snipping textures from aerial images. Chen [7] achieved building reconstruction by integrating the edges extracted from aerial imagery and the planes derived from LiDAR. Lee [8] and Yong [9] extracted building corner features and building edges, respectively, using LiDAR and aerial imagery. The results were integrated in order to increase the detection accuracy. Despite the progress that has been made, there is still much room for improvement, especially for low data density. In our study, the density is 0.16 points/m2, which is very sparse compared with the LiDAR density used by others (see Table 1). Given this low data density, we propose a novel method for residential building reconstruction and modeling using LiDAR data and satellite imagery. Our method combines the gross geometric features identifiable from LiDAR data with feature details visible in satellite imagery. In our approach, we first locate a region of interest (ROI) from LiDAR data where an ROI contains candidate roof points. An ROI is then divided into separate buildings using cue lines from the visual data. Finally, via model matching and texture mapping, we obtain realistic renderings of geometric models of buildings, even small ones, which closely fit the LiDAR elevation data. Three sections follow.
In section 2, a three-phase workflow is given and the individual processing steps are described in detail. Experimental results are presented in section 3 for two neighborhoods for which residence-sized buildings have minimal horizontal separation. A significant outcome is the brief execution time required. Finally, the conclusions are presented in section 4.
Table 1. LiDAR point data densities employed in representative building extraction studies; the data density for our study is 0.16 points/m2

    Citation        LiDAR Data Density    Data Form
    Reference [1]   10 points/m2          Depth image
    Reference [2]   0.25-1 points/m2      Depth image
    Reference [3]   7 points/m2           LiDAR point cloud
    Reference [4]   9 points/m2           LiDAR point cloud
    Reference [5]   0.67 points/m2        LiDAR point cloud
    Reference [7]   1.6 points/m2         Depth image
    Reference [8]   0.7 points/m2         Depth image
    Reference [9]   0.8 points/m2         Depth image
2 Methodology

Figure 1 illustrates the overall framework of our method. There are three steps - preprocessing, building roof isolation and building reconstruction.
Fig. 1. Framework of our methodology
In the preprocessing step, roof points were extracted based on elevation information, and then grouped based on the connectivity of points in the TIN structure. In order to categorize the roof type and then reconstruct the building, candidate roof points must be separated into several clusters. The points in each cluster belong to one building; we call this step building roof isolation. In building roof isolation, satellite imagery was fused with LiDAR data to provide separating information. Once roofs were isolated, individual roofs were categorized using relationships among the normal vectors. Then by means of model matching and texture mapping, the building reconstruction was achieved. The following subsections provide details.

2.1 Preprocessing

Preprocessing consists of extracting the candidate roof points and separating them into nonintersecting sets.
Initially, the LiDAR point cloud was gridded to create a depth image. Figure 2a depicts a depth image for an urban neighborhood. While residential blocks can be easily separated by tracing local streets, it is nontrivial to cluster points for a building roof. Figure 2b illustrates a zoomed-in view of the quadrangle highlighted in Figure 2a. From this, we can see that the LiDAR data is too sparse to detect building contours, even though we used data interpolation in the gridding process. Due to the low data density, buildings blend together after interpolation. The same blending phenomenon occurs when using LiDAR data to create a DSM. Due to this phenomenon, we returned to the raw LiDAR point data to measure geometrical properties. We constructed from the point cloud a TIN model, in which the edges and triangles maintain the relationship of the sample points.
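Since the LiDAR data are essentially 2.5D, such a TIN can be obtained, for example, as the 2D Delaunay triangulation of the planimetric coordinates. The short sketch below shows this assumed construction using SciPy; the paper does not state how its TIN is built, so this is an illustrative choice rather than the authors' implementation.

```python
import numpy as np
from scipy.spatial import Delaunay

def build_tin(points):
    """2D Delaunay triangulation of the planimetric (x, y) coordinates as a TIN.

    points : (n, 3) array of LiDAR returns (x, y, z).
    Returns an (m, 3) array of triangle vertex indices; because the indices refer
    back to the original points, the edges and triangles maintain the neighbourhood
    relationships of the sample points.
    """
    return Delaunay(points[:, :2]).simplices
```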
a) Gridded LiDAR data
b) Zoom-in view of region A
Fig. 2. Gridded LiDAR data set (depth image)
Candidate roof points were extracted based on the following rules:

1. Height, i.e., distance above terrain, exceeds a threshold (e.g., two meters).
2. Distance from a neighboring roof point does not exceed an adaptive threshold based on the average nearest-neighbor distance among all points.

To apply the rules, we first needed to estimate the terrain elevation. Terrain points must exhibit local minima [10]. In urban areas, terrain undulation is mild, so we divided the LiDAR points into blocks and calculated the minimum elevation in each block. Then in each block, we extracted candidate roof points based on the above two rules. Edges and triangles connected to the candidate roof points were preserved. At the same time, points exhibiting large height variances with respect to their neighbors were removed as tree points. Based on the connectivity of roof points in the TIN model, we clustered them into sets. If the buildings are sufficiently distant from each other, each set corresponds to one building. Otherwise, one set represents several buildings. In order to further divide the points into individual roofs, we developed a roof isolation method based on cue lines from satellite imagery. This is described in the next subsection.
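A minimal sketch of this rule-based filtering, assuming the point cloud is an N x 3 NumPy array; the block size and the factor on the mean nearest-neighbor distance are illustrative assumptions, not values taken from the paper:

```python
import numpy as np
from scipy.spatial import cKDTree

def candidate_roof_points(points, block_size=50.0, height_thresh=2.0, k_adaptive=2.0):
    """Rule-based roof-candidate filter (sketch); block_size and k_adaptive
    are assumed parameters, not the paper's settings."""
    xy, z = points[:, :2], points[:, 2]

    # Estimate terrain as the minimum elevation inside each block.
    blocks = np.floor(xy / block_size).astype(int)
    terrain = {}
    for key, elev in zip(map(tuple, blocks), z):
        terrain[key] = min(terrain.get(key, np.inf), elev)
    ground = np.array([terrain[tuple(b)] for b in blocks])

    # Rule 1: height above the local terrain exceeds the threshold (e.g., 2 m).
    high = (z - ground) > height_thresh

    # Rule 2: distance to the nearest neighbor does not exceed an adaptive
    # threshold based on the average nearest-neighbor spacing.
    tree = cKDTree(xy)
    nn_dist, _ = tree.query(xy, k=2)          # column 0 is the point itself
    adaptive_thresh = k_adaptive * nn_dist[:, 1].mean()
    near = nn_dist[:, 1] <= adaptive_thresh

    return points[high & near]
```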
2.2 Building Roof Isolation Based on Cue Lines The remainder of this subsection is summarized by the shaded area in Figure 3. The steps include building roof isolation entailing separating boundary estimation, cue line generation, and ancillary processing.
Fig. 3. Building Roof Isolation Process
2.2.1 Separating Boundary Estimation Candidate roof points have been separated into sets as described above. It is at this point that the imagery comes into play. If we obtain tight boundaries of each set of points, termed separating boundaries, the boundaries may be overlaid onto the imagery. From the imagery we can obtain the ROIs we need to process. The roof point sets have irregular boundaries. The nonconvex hull is used to express the tight boundary of each set. The algorithm for constructing the nonconvex hull is given below. Edges of each triangle in the TIN are assigned counterclockwise directionality with respect to the triangle interior. Clearly, each edge has a left triangle, lTri, and some edges will be adjacent to a right triangle, rTri, as shown in Figure 4. Edges not adjacent to a right triangle are termed boundary edges, and a set of ordered boundary edges is denoted as Bj=<e1, e2,…,en>. Each Bj circumscribes a nonconvex polygon. For each polygon, the sequence of angles, one at the head of each edge, is denoted as <θ1, θ2,..., θn>. A polygon is classified as exterior if it contains roof points and interior if not. Exterior and interior polygons are illustrated in Figure 5 by black and white boundaries, respectively. For Bj, let Tθ = Σθi. Clockwise angles, e.g., θ2, θ3, θ4, θ5, and θ6 in Figure 6a, are expressed as negative values. Counterclockwise angles, e.g., θ1, θ2, θ3, θ4, and θ5 in Figure 6b, are expressed as positive values. If Tθ = −360° then Bj is the boundary of an interior polygon. If Tθ = 360° then Bj is the boundary of an exterior polygon. An exterior polygon, less any contained interior polygons, constitutes the nonconvex hull of the LiDAR roof point set.
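The interior/exterior test above amounts to summing the signed turning angles along a boundary loop. A minimal sketch, assuming the loop is available as an ordered list of 2D vertices (the tolerance is an illustrative choice):

```python
import numpy as np

def classify_boundary(vertices, tol=1.0):
    """Classify a closed boundary loop by its total signed turning angle
    (a sketch of the rule described above)."""
    v = np.asarray(vertices, dtype=float)
    edges = np.roll(v, -1, axis=0) - v                 # e1 ... en
    total = 0.0
    for i in range(len(edges)):
        a, b = edges[i], edges[(i + 1) % len(edges)]
        # Signed angle between consecutive edges; counterclockwise is positive.
        cross = a[0] * b[1] - a[1] * b[0]
        total += np.degrees(np.arctan2(cross, np.dot(a, b)))
    if abs(total - 360.0) < tol:
        return "exterior"
    if abs(total + 360.0) < tol:
        return "interior"
    return "undetermined"
```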
Fig. 4. Edge and triangle structure of TIN model
Fig. 5. Boundary of candidate roof points
a) Interior polygon
b) Exterior polygon
Fig. 6. Angle change between edges of a polygon
It is convenient in the remaining analysis to use the simplest structured ROI that is derivable from the boundary edge set Bj. Doing so avoids introducing excessive calculation in the steps that follow. Let B̆j, B̄j, and B̂j be, respectively, the nonconvex hull of the points enclosed by Bj, the convex hull of the point set, and the minimum bounding rectangle (MBR) of B̆j. Let α(·) indicate the area of its argument. Note the following facts:

• B̆j ⊆ B̄j ⊆ B̂j
• B̄j \ B̆j and B̂j \ B̆j each denote sets of polygonal regions, {p1, p2, …, pr}, containing no roof points (i.e., empty spaces within the respective polygons)
• If r ≠ 0, there exists pm such that α(pm) ≥ α(pi), for i = 1, 2, …, r.

The ROI is chosen by applying the following rules in order:

• B̆j if α(B̄j) >> α(B̆j) or α(pm)/α(B̄j \ B̆j) is large (i.e., a significant percentage of the area void of roof points is within one empty-space polygon of B̄j)
• B̄j if α(B̂j) >> α(B̆j) or α(pm)/α(B̂j \ B̆j) is large
• B̂j otherwise
2.2.2 Cue Line Generation and Roof Point Isolation From imagery of urban neighborhoods many straight lines can be extracted. Shadows, occlusion, and building texture make it a challenge to identify the ones relevant to building boundaries as previous research has shown. Thus we developed a cue line generation scheme obviating much of the computation that encumbers other approaches. Cue lines are the long, salient lines in the image that further separate a roof point set. It is much easier to extract cue lines than to extract all lines surrounding a building. Each ROI was enhanced to increase contrast. A Canny edge detector was applied followed by a Hough transform to detect lines. Disconnected, co-linear lines near each other were connected. Lines having lengths below a threshold were deleted. Lines coinciding with an ROI boundary were ignored. Figure 7 illustrates contrast enhancement, edge detection, and cue line discovery.
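A minimal sketch of this cue-line pipeline using OpenCV; the input is assumed to be an 8-bit grayscale ROI, and the Canny and Hough thresholds are illustrative assumptions rather than the settings used by the authors:

```python
import cv2
import numpy as np

def cue_lines(roi_gray, min_length=30, merge_gap=5):
    """Cue-line extraction sketch: contrast enhancement, Canny edges,
    probabilistic Hough transform, then a length filter."""
    # Contrast enhancement via histogram equalization.
    enhanced = cv2.equalizeHist(roi_gray)

    # Edge detection.
    edges = cv2.Canny(enhanced, 50, 150)

    # Line detection; maxLineGap bridges small gaps between nearly
    # co-linear, disconnected segments.
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=40,
                            minLineLength=min_length, maxLineGap=merge_gap)
    if lines is None:
        return []

    # Keep only sufficiently long lines; short ones are discarded.
    keep = []
    for x1, y1, x2, y2 in lines[:, 0]:
        if np.hypot(x2 - x1, y2 - y1) >= min_length:
            keep.append((x1, y1, x2, y2))
    return keep
```

Lines coinciding with the ROI boundary would additionally be discarded, as described above.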
a) Contrast enhancement
b) Edge detection
c) Cue lines
Fig. 7. Cue line generation
As shown in Figure 7c, some roof crease lines were also detected. These were eliminated by comparing with the LiDAR elevation information.
a) One example
b) A more complex case
Fig. 8. Building roof isolation result
Using cue lines individual building roofs were isolated. In one example, shown in Figure 8a, the LiDAR data were separated into four areas by cue lines. In area 4, there are no roof points. In area 1, there are only a few. These areas can be ignored in the reconstruction process. The LiDAR points isolated in areas 2 and 3 were rendered as buildings as described in the next subsection.
One further issue: the cue lines so extracted may not be parallel. In this case, selected ones are extended to intersect with the nearest non-parallel cue line. As shown in Figure 8b, A, B, C, D are the intersection loci to be calculated. Then, based on this set of line orientations, we separated the roof points into eight regions. 2.3 Building Reconstruction Our building reconstruction process was achieved by roof type categorization and model matching. In our study area, several types of roofs exist, such as gable and hip roofs. We created a model template database for the identified roof types. In general, a building roof consists of one or more planes. If we represent every plane with its normal vector, differentiating spatial correlations exist that uniquely define each type. Without loss of generality, we use the gable roof model to demonstrate our method. Figure 9 shows the spatial relationships of the normal vectors of a gable roof (Fig. 9. Normal vectors for a gable roof). The normal vectors of the two roof planes are denoted as v1 and v2; the vertical direction is denoted as Z; the angles among the three vectors are denoted by θ1, θ2 and θ3 as illustrated. Normally, the pitch angle of a gable roof plane is in the range of 25° to 45°. To account for noise in LiDAR data, we relax the pitch angle to the range 15° to 65°. The features that identify a gable roof are as follows: 1) Two planes with distinct normal vectors exist. 2) The angles satisfy the following rules: 15°<θ1, θ2<65°, θ1≈θ2. 3) The sample counts in the two roof planes are roughly equal. To match roof features to our database of templates, we clustered the normal vectors of the triangular facets in the TIN model according to their orientations. Because of the presence of noise, there are directional perturbations of the normal vectors among triangular facets of the same roof plane. These perturbations can be easily addressed by relaxing the threshold and allowing normal vectors with slightly different directions to be grouped together. As a result, several sets of triangular facets were identified, each set representing one roof plane. We sorted the triangle sets in descending order according to the number of triangles in each. We compared pairs of sets of LiDAR roof points using the three criteria above. If a pair of sets exists that meets the criteria, a match was found. Then building reconstruction can be achieved by plane fitting, roof boundary determination, and ridge crease identification. Furthermore, texturing was achieved by extracting textures from satellite images in order to increase the rendered scene's realism.
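For illustration, a sketch of the gable-roof test applied to the two largest facet clusters; the tolerances here are assumed values, not those of the paper:

```python
import numpy as np

def is_gable(n1, n2, count1, count2,
             angle_range=(15.0, 65.0), angle_tol=5.0, count_ratio=0.7):
    """Check the three gable-roof criteria for two candidate roof planes.
    n1, n2 are unit normals of the two largest facet clusters; count1,
    count2 are their LiDAR sample counts. angle_tol and count_ratio are
    assumed tolerances."""
    z = np.array([0.0, 0.0, 1.0])

    def angle(a, b):
        return np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))

    theta1, theta2 = angle(n1, z), angle(n2, z)   # pitch angles of the planes

    # 1) Two planes with distinct normal vectors exist.
    distinct = angle(n1, n2) > angle_tol
    # 2) Both pitch angles fall in the relaxed range and are roughly equal.
    in_range = all(angle_range[0] < t < angle_range[1] for t in (theta1, theta2))
    symmetric = abs(theta1 - theta2) < angle_tol
    # 3) The two planes contain roughly equal numbers of samples.
    balanced = min(count1, count2) / max(count1, count2) > count_ratio

    return distinct and in_range and symmetric and balanced
```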
3 Experimental Results We used LiDAR data obtained from the Louisiana LiDAR Project managed by Region VI of FEMA. It is the pre-Katrina data acquired using the Leica Geosystems ALS40
Table 2. LiDAR data specification

Characteristic                  Specification
Field of view                   40°
Overlap                         30%
Pulses per second               30,000
Acquisition height              8,000 feet
Sample point spacing            3 meters
Points per quarter quadrangle   8.5 million
Vertical accuracy               0.5 to 1 feet
Horizontal accuracy             3 to 6 feet
a) TIN model
b) Georeferenced satellite imagery
c) Reconstructed residential buildings Fig. 10. Residential building reconstruction example 1
instrument. The specifications of the data are listed in Table 2. The corresponding satellite imagery was obtained from Google Map. After applying our method to LiDAR data of urban sites in New Orleans, we evaluated the accuracy and efficiency. The data density in the study area is 0.16 points/m2. Typically, less than 20 LiDAR points are present for each roof plane. Examples of the rendering result are shown in Figures 10 and 11. Accuracy is assessed by comparing to the georeferenced high-resolution visual imagery.
a) TIN model
b) Georeferenced satellite imagery
c) Reconstructed residential buildings Fig. 11. Residential building reconstruction example 2
In example 1, of 20 residential buildings, 18 were successfully recognized and reconstructed, i.e., a reconstruction rate of 90% (see Table 3). In example 2, out of 21 residential buildings, 17 were successfully reconstructed, a reconstruction rate of 81%. By comparing to the georeferenced satellite image, we can judge where and how inaccuracies occur. Buildings missing in the reconstruction (highlighted with white ellipses in Figures 10 and 11) possess roof types that were not matched. At this time, the roof template database has only the simplest types such as gable and hip roofs. This problem can be solved either by introducing more templates into our database or by decomposing complex roofs into several simpler ones. In Figure 11b, the area enclosed within a white rectangle may appear to contain a building. However, from Figure 11a we can see that it is not a building roof given the elevation information.
Table 3. Reconstruction rate assessment

            Total Building Number   Reconstructed Building Number   Reconstruction Rate
Example 1   20                      18                              90%
Example 2   21                      17                              81%
The efficiency of our reconstruction process is assessed by measuring the processing time. The computation includes building the TIN model, preprocessing, cue line generation, model matching and reconstruction. Table 4 summarizes the time spent reconstructing each square block in the above examples. Note that our code is implemented in C++ and uses the MATLAB library for image processing. Testing was conducted on a desktop system with an Intel Core 2 Duo CPU at 2.4 GHz and 2 GB of memory.

Table 4. Computational time for reconstruction

Parameter                           Example 1      Example 2
Number of LiDAR samples             1448           1623
Number of triangle facets           2874           3226

Average time for individual steps
Build TIN model                     2 seconds      2.5 seconds
Preprocessing                       1 second       1 second
Cue line generation                 2.5 seconds    2.5 seconds
Model matching                      1 second       1 second
Reconstruction                      1 second       1 second
Total time cost                     7.5 seconds    8.0 seconds
4 Conclusion We have described a novel method for automatic 3D residential building reconstruction and modeling in an urban area using airborne LiDAR data and satellite imagery. Using building roof isolation via a new cue line method and using model-based reconstruction, we are able to automatically identify individual buildings in cluttered residential areas and render them realistically. As demonstrated on urban sites in the city of New Orleans, the method identified building measurements successfully and rebuilt 3D models effectively. This was possible in spite of data noise, relatively small buildings, and, especially significant, very low data density. We evaluated the accuracy of our method by comparing with georeferenced visual imagery, and assessed the efficiency in terms of processing time. Remarkably, the instances in which reconstruction failed are not attributable to the method per se; they were due to the incompleteness of the building template database. This can be improved by aggregating more templates or by extending the method to decompose complex roofs into simple ones.
References [1] Rottensteiner, F.: Automatic Generation of High-quality Building Models from LiDAR Data. IEEE Computer Graphics and Applications 23(6), 42–50 (2003) [2] Zhang, K., Yan, J., Chen, S.: Automatic Construction of Building Footprints from Airborne LiDAR Data. IEEE Transactions on Geoscience and Remote Sensing 44(9), 2523–2532 (2006) [3] Vosselman, G.: Building Reconstruction using Planar Faces in Very High Density Height Data. In: Proceedings of ISPRS Conference on Automatic Extraction of GIS Objects from Digital Imagery, München, vol. 32/3-2W5, pp. 87–92 (1999) [4] Verma, V., Kumar, R., Hsu, S.: 3D Building Detection and Modeling from Aerial LiDAR Data. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. v2, pp. 2213–2220 (2006) [5] Zeng, Q., Lai, J., Li, X., Mao, J., Liu, X.: Simple Building Reconstruction from LiDAR Point Cloud. In: Proceedings of International Conference on Audio, Language and Image Processing (ICALIP 2008), July 2008, pp. 1040–1044 (2008) [6] Fujii, K., Arikawa, T.: Urban Object Reconstruction using Airborne Laser Elevation Image and Aerial Image. IEEE Transactions on Geoscience and Remote Sensing 40(10), 2234–2240 (2002) [7] Chen, L.C., Teo, T.A., Rau, J.Y., et al.: Building Reconstruction from LIDAR Data and Aerial Imagery. In: Proceedings of IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2005), vol. 4, pp. 2846–2849 (2005) [8] Lee, L., Shyue, S., Huang, M.: Building Corner Feature Extraction Based on Fusion Technique with Airborne LiDAR Data and Aerial Imagery. In: Proceedings of the 3rd International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP 2007), vol. 1, pp. 43–46 (2007) [9] Yong, L., Huayi, W.: Adaptive Building Edge Detection by Combining LiDAR Data and Aerial Images. In: The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Beijing, Part B1, vol. XXXVII, pp. 197–202 (2008) [10] Matei, B.C., Sawhney, H.S., Samarasekera, S., et al.: Building Segmentation for Densely Built Urban Regions using Aerial LiDAR Data. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1–8 (2008)
Adaptive Sample Consensus for Efficient Random Optimization Lixin Fan and Timo Pylv¨an¨ainen Nokia Research Center Tampere, Finland
[email protected],
[email protected]
Abstract. This paper approaches the random optimization problem with adaptive sampling, which exploits knowledge about data structure obtained from historical samples. The proposal distribution is adapted so that it invests more search effort in high likelihood regions. In this way, the probability of reaching the global optimum is improved. The method demonstrates improved performance compared with standard RANSAC and related adaptive methods for line/plane/ellipse fitting and pose estimation problems.
1 Introduction Finding the global maximum or minimum of an objective function is a challenging yet pervasive problem in many computer vision applications. For instance, fitting noisy data with parametric models is often formulated as maximum a posteriori (MAP) estimation of the model parameter θ:

θ̂ = arg max_θ p(θ|D),   (1)
where the input data D may represent range data measurements, edge points of geometric primitives or correspondences between image feature points. Random sampling approaches such as RANSAC [1–4] are widely used to restrict the search to a finite data-dependent subspace. In order to increase the probability of finding the global solution, numerous RANSAC improvements have been proposed that reduce the search space by using different prior knowledge [5–7]. Despite the randomness involved in searching, these methods adopt a "fixed" search strategy, i.e., the probability of sampling a new model is independent of the models previously evaluated. This is in contrast to standard optimization approaches such as gradient descent, in which promising search directions are adapted according to gradients of previously evaluated function values. How to incorporate a similar adaptive mechanism into random sampling approaches is the key problem addressed in our work. This paper presents an adaptive sampling technique which, at iteration t, conditions the probability qt of sampling a new model θ on the models θ1, ..., θt−1 evaluated in previous iterations:

qt(θ | D, L(θ1), ..., L(θt−1)),   (2)

in which L ∝ p is the likelihood function proportional to the posterior distribution to be maximized. More specifically, this paper proposes two data weighting schemes,
which adapt sampling to (re-)visit high likelihood models more often (Section 3). Convergence of the adaptive sampling is rigorously proved. We demonstrate the efficiency of the adaptive methods on a number of data fitting problems, including line fitting, 3D plane fitting, ellipse fitting and pose estimation. It is observed that adaptive sampling methods outperform standard RANSAC in terms of speed and robustness (see Section 4).
2 Related Work Our method is most similar to the ensemble inlier set based method [8], in which data points were weighted based on previously evaluated models, and the weighted points were used to adapt new model sampling. In this paper, we present a theoretical analysis in Theorem 1 which ensures the convergence of adaptive weighting and sampling. Also, the weighting scheme (12) used in this paper is more efficient than [8]. Our adaptive sampling method is closely related to algorithms [9–11] that call forth a local optimization around the best-model-thus-far. One pitfall of local optimization methods, however, is the tendency to become trapped at local minima, especially when data with multiple structures are presented (see Section 4.1). Adaptive sampling is loosely related to guided sampling, in which prior knowledge such as match scores is used to weight data points and, consequently, improve sampling efficiency [5–7]. However, such information is obtained from offline preprocessing and remains unchanged during the course of optimization. Other research efforts to improve RANSAC speed include [12–14], which focus on reducing the computational complexity of each model evaluation.
3 Adaptive Sample Consensus We propose to adapt the proposal distribution qt so that it invests more search effort in high likelihood regions, improving the probability of reaching the globally optimal state. The detailed adaptive sampling strategy is described below. Notice that we use the terms models and states interchangeably in the rest of this paper. Definition 1. Assume a data set D, the set of permissible models Θ and a parametric model function f(x, θ), where x ∈ D and θ ∈ Θ. Then the minimal set M(θ) is any subset of s data points, where s is the minimum number of data points for which there is a unique θ such that f(x, θ) = 0, ∀x ∈ M(θ). In this paper, we emphasize that a minimal set defines a model. Even though two distinct minimal sets give the identical parametric function, they are still regarded as two models. Definition 2. The ε-inlier set for model θ is the set of points satisfying

Iε(θ) = {x ∈ D | |f(x, θ)| ≤ ε},   (3)

where ε is a threshold.
Definition 3. ε-Offspring of model θj is the set of models whose minimal sets are subsets of the inlier set of θj:

Oε(θj) = {θi ∈ Θ | M(θi) ⊂ Iε(θj)}.   (4)
Given positive weights W(θj) = {w(x) | x ∈ Iε(θj)} associated with each inlier of model θj, offspring models can be proposed by randomly selecting s points according to the weights W(θj). Thus, the probability q^o(θi), θi ∈ Oε(θj), of proposing an offspring model is proportional to the product of its minimal set weights:

q^o(θi) = ∏_{x ∈ Mε(θi)} w(x) / Σ_{z ∈ Oε(θj)} q^o(z).   (5)

AdaSAC Algorithm: Given a dataset D and a parametric model function f(x, θ), an Adaptive Sample Consensus process evolves through a sequence of states X1, X2, ..., Xm:

1. Initialize X1 as any model from Θ with equal probability q0, and set W1: {w(x) = 0 | x ∈ D}.
2. At a new step m+1, an existing model θj is randomly selected from X1, X2, ..., Xm, with probability proportional to the respective evaluation scores L(θj).
3. A new model θi is proposed either from the offspring Oε(θj) with probability qt = (1 − γ) q^o|_{θi ∈ Oε(θj)}, or from the entire model space Θ with probability qt = γ q0, in which γ ∈ [0, 1] is a pre-defined scalar.
4. A threshold εi is estimated for θi, using the steps elaborated in Section 3.2.
5. Model θi is accepted, i.e., Xm+1 = θi, with probability

   α(θj, θi) = min(1, q(θj) p(θi) / (q(θi) p(θj))),   (6)

   where p is the target (posterior) distribution in question. The corresponding weights Wm+1 are updated according to (10) or (12).
6. If θi is rejected, the current model is retained, Xm+1 = θj, and Wm+1 = Wm.

The balance equation (6) reveals that AdaSAC is in general a Metropolis-Hastings method [15]. Yet it is special in that its proposal distribution qt(θ|D, W) is conditioned on both the data D and the weights W, that is, on a statistical measure based on historical samples (see Theorem 1). Also, (6) ensures that accepted models are fair samples with respect to the target distribution p, regardless of q. There are two random search steps involved in this sampling scheme: a) existing models, according to their likelihood scores, are randomly selected as parent models to generate new models; b) new models are randomly proposed according to the offspring model distribution (5). Compared with RANSAC sampling, AdaSAC visits high likelihood regions more often, and low likelihood areas less frequently (by a factor of γ).

3.1 Why AdaSAC Is Efficient? Definition 4. Following [8], the set of associate models for a data point x ∈ D is defined as the collection of permissible models that accept x as an inlier, i.e.,

Aε(x) = {θ ∈ Θ | |f(x, θ)| ≤ ε}.   (7)
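For concreteness, a minimal sketch of steps 2 and 5 of the algorithm above (parent selection and the Metropolis-Hastings acceptance of Eq. (6)), assuming problem-specific callables for the target and proposal densities:

```python
import numpy as np

def select_parent(states, scores, rng=None):
    """Step 2: select a parent model with probability proportional to its
    likelihood score (sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    s = np.asarray(scores, dtype=float)
    return states[rng.choice(len(states), p=s / s.sum())]

def mh_accept(theta_new, theta_old, target, proposal, rng=None):
    """Step 5: Metropolis-Hastings acceptance test of Eq. (6) (sketch).
    `target` is the posterior density p and `proposal` the density q,
    both assumed to be supplied by the application."""
    rng = rng if rng is not None else np.random.default_rng()
    alpha = min(1.0, (proposal(theta_old) * target(theta_new)) /
                     (proposal(theta_new) * target(theta_old)))
    return rng.random() < alpha
```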
For reasons to be clear soon, we are interested in two measures on the associate models of a data point x:

Pa(x) = Σ_{θ ∈ Aε(x)} p(θ) ∝ Σ_{θ ∈ Aε(x)} L(θ)   (8)

and

Pm(x) = max_{θ ∈ Aε(x)} p(θ) ∝ max_{θ ∈ Aε(x)} L(θ).   (9)
Theorem 1. For an AdaSAC process which has m accepted models, it holds that:

1. If weights are updated according to

   {Wm+1(x) = Wm(x) + 1 | x ∈ Iε(θi)},   (10)

   then the normalized weights w′(x) = w(x)/m asymptotically approach the constant

   lim_{m→∞} w′(x) = Pa(x),   (11)

   with variance Var(w′(x)) → 0 as m → ∞.

2. Similarly, if weights are updated according to

   {Wm+1(x) = max(Wm(x), L(θj)) | x ∈ Iε(θj)},   (12)

   then the weights w(x) asymptotically approach the constant

   lim_{m→∞} w(x) = max_{θ ∈ Aε(x)} L(θ) ∝ Pm(x),   (13)
with variance Var(w(x)) → 0 as m → ∞. Please refer to the Appendix for the proof. Normalized weights w′(x) were used to classify inliers from outliers in [8]; however, the statistical interpretation was not given there. For the globally optimal model θ*, its minimal set points M(θ*) tend to accumulate high weights, since a minimal set point is very likely also an inlier of the offspring models Oε(θ*). Updating W by (10), qt has a higher probability of reaching the globally optimal model, as compared to the naive proposal q0. Weighting scheme (12) leads to more efficient sampling, since now w is strictly proportional to the posterior of the best model explored thus far in Aε(x). Our experimental results also confirm that (12) leads to improved performance compared with (10). It is worth mentioning that a similar weighting scheme has been used in a local hill-climbing method [11]:

Wm+1(x) = |Iε(θ̂*)| if x ∈ Iε(θ̂*), and 1 otherwise,   (14)

where θ̂* is the best-model-thus-far. Compared with (12), (14) focuses on the inlier set of the best model and suppresses all other points. This hill-climbing strategy turns out to be very effective when the data contains a single instance of the model; nevertheless, it suffers from local minima when the data contains multiple instances of the same model (see Section 4.1 for experimental results).
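A minimal sketch of the two weight-update schemes (10) and (12), assuming the weights are kept in a NumPy array indexed by data point:

```python
import numpy as np

def update_weights_count(weights, inlier_mask):
    """Scheme (10): increment the weight of every inlier of the accepted
    model; weights / m then approaches Pa(x)."""
    weights[inlier_mask] += 1.0
    return weights

def update_weights_max(weights, inlier_mask, score):
    """Scheme (12): keep, for every inlier, the best likelihood score of any
    model that accepted it; the weight approaches max L(theta) over A_eps(x)."""
    weights[inlier_mask] = np.maximum(weights[inlier_mask], score)
    return weights
```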
3.2 Estimation of ε The median absolute deviation (MAD) of the fitting error has been used to iteratively (re-)estimate the model threshold in [8]:

ε̂_MAD = median(|f(x, θ) − median(f(x, θ))|); x ∈ D.   (15)

MAD as such is not directly applicable to AdaSAC, and is improved in two ways. First, we estimate multiple thresholds for different models, since there are multiple search paths in AdaSAC. As demonstrated in our experiments, multiple thresholds are important for fitting multi-structure data. Second, a random perturbation is added so that it is possible to recover from underestimated thresholds.

1. A threshold εp from the parent model is used to classify the inliers Iεp(θi), where θi is the model in question.
2. The MAD is computed for these inlier points:

   ε̂_MAD(θi) = median(|f(x, θi) − median(f(x, θi))|); x ∈ Iεp(θi).   (16)

3. A random perturbation is added:

   εo = ε̂_MAD · (1 + δ),   (17)
where δ is uniformly distributed in [−0.08, 0.02]. The threshold εo is then used by the offspring models (in step 3).

3.3 Extra Time Cost The extra time cost of AdaSAC is mainly attributed to the weighted model sampling¹, and the time cost of the Metropolis-Hastings step is negligible. The average extra time is approximately 0.53 ms per iteration, which amounts to about 10% of the standard RANSAC computation time per iteration. The elapsed time was measured with the MATLAB profiler on a laptop with a 1.6 GHz Intel Core Duo CPU.

3.4 Stopping Criterion One possible stopping criterion for AdaSAC is to re-estimate the maximum number of iterations as

N = log(1 − p) / log(1 − (Σ_{x ∈ I(θ̂*)} wn(x))^s),   (18)

where p is the required probability of finding the optimal model, wn are the normalized data weights such that Σ_{x ∈ D} wn(x) = 1, and I(θ̂*) is the (current) best model's inlier set. This is a modified version of the standard RANSAC stopping criterion, taking into account the proposed weighting scheme.
¹ Our implementation of weighted sampling involves (1) computing the cumulative sum of w(x) and (2) finding the index of the cumulative-sum element whose ratio to the total sum exceeds a random number U ∈ [0, 1].
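A sketch of the cumulative-sum weighted sampling described in the footnote; drawing without replacement until a full minimal set is collected is an assumption about how repeated indices are handled:

```python
import numpy as np

def weighted_sample(weights, size, rng=None):
    """Weighted sampling of a minimal set by inverse-CDF lookup: build the
    cumulative sum of w(x) and locate the first element whose normalized
    cumulative sum exceeds a uniform random number (sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    cdf = np.cumsum(weights) / np.sum(weights)
    picks = set()
    while len(picks) < size:                  # assumes enough nonzero weights
        u = rng.random()                      # U in [0, 1)
        picks.add(int(np.searchsorted(cdf, u)))
    return sorted(picks)
```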
Fig. 1. Hyperplane single structure fitting results with dimensionality increased from 2 to 10. Curves show average number of returned true inliers, and error bars for corresponding standard deviation. Left: X-axis represents inlier noise levels increased from 0.005 to 0.05, and Y-axis the numbers of returned true inliers. Middle left: number of returned true inliers w.r.t. dimensionality. Middle right: X-axis represents increasing outlier ratio from 0.1 to 0.9, Y-axis the number of returned true inliers. Right: number of returned true inliers w.r.t. dimensionality.
4 Experimental Results Two sets of experiments are used to compare the AdaSAC methods with related works. Weighting schemes (10) and (12) lead to two AdaSAC algorithms, the former denoted as AdaSAC I and the latter as AdaSAC II. 4.1 Random Optimization with Given Threshold In the first experiment, we compare simplified AdaSAC methods with three adaptive algorithms, i.e., the locally optimized RANSAC methods [9, 10], denoted as LoSAC-I and LoSAC-II², and Hill-climbing RANSAC (HC-SAC) [11]. We assume an optimal threshold based on the true inlier noise scale ε is given in this experiment. The Metropolis-Hastings step is not used for the AdaSAC methods. This way, all methods are tested on the same ground. We test sampling efficiency on hyperplane fitting problems of dimensionality varying from 2 to 10. Inlier points are randomly generated and different levels of Gaussian noise are added to the inliers. Outliers are generated from a uniform distribution over the d-dimensional hypercube. Inliers and outliers altogether amount to 1000 points. For each dimension, 100 data sets are generated. The number of iterations is set to 400 for all methods. Returned inlier sets are compared with the ground truth, and the average number of true inliers is measured. Figure 1 shows the test results for single-structure data with different levels of inlier noise and outlier ratios. It is shown that all adaptive methods improve RANSAC efficiency, regardless of inlier noise levels and outlier ratios. More pronounced improvements are observed when the dimensionality increases. Among all methods, the best improvement is due to HC-SAC, with a small margin above AdaSAC-II. However, HC-SAC performance deteriorates when data with multiple instances of the same model is presented
² Two local optimization RANSAC algorithms from [9], i.e., Simple (LoSAC-I) and Inner RANSAC (LoSAC-II), are implemented and compared in this experiment. The other two methods are not compared due to unspecified implementation details and parameter settings.
Fig. 2. Hyperplane multiple structure fitting results with the number of structures increased from 2 to 5. Curves show average number of returned true inliers, and error bars for corresponding standard deviation. Left: 2 hyperplanes with 810 and 790 inliers respectively. Only inliers from the first structure are counted as true inliers. 400 noisy outliers are also added. Total number of points amounts to 2000. HC-SAC and AdaSAC-II performances are almost identical for this data. Middle left: 3 hyperplanes with 543,533 and 524 inliers. AdaSAC-II outperforms HC-SAC for this data. Middle right: 4 hyperplanes with 410, 400,400 and 390 inliers. Right: 5 hyperplanes with 330,320,320,320 and 310 inliers.
(see Figure 2). If the data has more than 2 structures, HC-SAC is often trapped by local minima, and AdaSAC-II becomes the preferred solution under this condition. 4.2 Random Optimization with Unknown Threshold In the second set of experiments, no prior knowledge about the inlier noise scale is given. The adaptive methods have to automatically estimate the optimal threshold as an optimization parameter. This is a much more challenging problem than the first experiment. We compare AdaSAC with the ensemble inlier set method (EIS) [8]. The thresholds estimated by AdaSAC I are used by a standard RANSAC for reference. The maximum number of iterations is set to a fixed number to compare the performance of the different methods. In practice, one can use stopping condition (18) to terminate AdaSAC iterations. We monitor the best likelihood scores over different iterations. For each method, the mean best likelihood scores are measured by repeating the test K (=15) times for each data set. In order to quantify performance improvements, the average ratio over different datasets of the mean best scores to the corresponding RANSAC scores was computed. Unless stated otherwise, the algorithm parameter is set to γ = 0.1 for all the tests. Multiple 2D line and 3D plane fitting: Multiple line segments are randomly generated within a 2D region with x, y coordinates between [0, 100]. The number of segments is randomly chosen between 1 and 4. Each segment consists of 200 to 500 inlier points, which are corrupted by Gaussian noise with σ randomly chosen in [0.15, 3.15]. 250 uniformly distributed outliers are added. 20 datasets are generated and each is tested 15 times. Figure 3 illustrates some test datasets and fitting results. It is shown in Table 1 that for line fitting, the best scores of EIS, AdaSAC I and AdaSAC II are about 10% above that of the RANSAC method.
Fig. 3. Example Line Fitting Results. Top: Fitted inliers (red) and outliers (green), using AdaSAC I. Bottom: Performance comparison. X-axis: number of iterations. Y-axis: the mean best model likelihoods, for different methods.
Fig. 4. Example Plane Fitting Results (as above)
Fig. 5. Example Ellipse Fitting Results (as above)
Fitting of multiple 3D planes is tested in a similar way. 20 datasets are generated and each is tested 15 times. Figure 4 illustrates some test datasets and fitting results. Table 1 shows that for plane fitting, the best scores of AdaSAC I and II are 17% and 29% above the RANSAC scores. It is worth mentioning that EIS performance starts to deteriorate as the number of planes increases (see the plots in Figure 4).
Fig. 6. Example P3P Fitting Results. Top: test images with SURF feature points (blue) and projections of 3D points (red). Top middle: AdaSAC I pose estimation, with the optimal (red) and accepted poses (green). Blue points represent 3D points (viewed from top). Bottom middle: RANSAC pose estimation. Notice differences from AdaSAC I pose estimation. Bottom: performance comparison. X-axis: number of iterations. Y-axis: the mean best model likelihoods, for different methods.
Ellipse fitting: Multiple ellipses are randomly generated, and the number of ellipses is randomly chosen between 1 and 4. Each ellipse consists of 200 to 500 inlier points, which are corrupted by Gaussian noise. 20 datasets are generated and each is tested 15 times. Figure 5 illustrates some test datasets and fitting results. It is observed in Table 1 that EIS and AdaSAC I achieve, respectively, 28% and 33% improvements above RANSAC. Again, EIS performance deteriorates as the number of ellipses increases. P3P pose estimation: Estimation of camera pose from images with a known 3D reference is called the perspective n-point (PnP) problem, which was the first problem to be solved using RANSAC. In this experiment, 3D reference points are generated by a structure-from-motion system from a pair of images. Subsequent images are matched with the 3D points, while more reconstructed points are added to the 3D point set.³ Local SURF features [16] are extracted from the images and used for pose estimation.
³ A detailed description of the system is outside the scope of this paper and is reported elsewhere.
Fig. 7. Fundamental matrix estimation results. Top: test image pairs. Bottom: performance comparison. X-axis: number of iterations. Y-axis: the mean best model likelihoods, for different methods.

Table 1. Summary of the average ratio of the best scores to corresponding RANSAC scores, for different fitting problems

            Line   Plane   Ellipse   P3P    F
EIS         1.09   1.22    1.28      1.44   1.77
AdaSAC I    1.09   1.17    1.33      1.48   1.84
AdaSAC II   1.12   1.29    1.01      1.56   1.90
We test 12 images from two different scenes. Figure 6 illustrates images from one dataset and the fitting results for P3P pose estimation. It is observed that in fewer than 500 iterations, AdaSAC I and II achieve reasonable estimates and improve RANSAC performance by 48% and 56%, respectively. Fundamental matrix estimation: We test the different methods on 5 datasets, including the public Corridor, Church and College datasets and two private datasets. Figure 7 illustrates some test images and fitting results, obtained using a standard eight-point algorithm [17] and SURF features [16]. It is observed that AdaSAC I and II achieve 84% and 90% improvement over RANSAC, while the improvement varies from dataset to dataset. Table 1 summarizes the performance comparison. First, the adaptive sampling schemes EIS, AdaSAC I and AdaSAC II improve RANSAC efficiency and thus the best likelihood scores. The score ratio increases from 1.09 to 1.90 as the model dimensionality increases from line fitting (s = 2) to fundamental matrix estimation (s = 8). Second, AdaSAC and EIS successfully estimate the threshold even for high-noise data. Third, EIS performance deteriorates as the number of structures increases. It is shown that EIS can hardly recover once it mistargets a suboptimal structure with an underestimated threshold. In contrast, AdaSAC I and II are more error resilient for multi-structure data.
5 Conclusion This paper describes a general sampling principle for efficient random optimization. The proposed sampling strategy exploits knowledge about data structure obtained from
previously evaluated models. Beyond elaborating two specific sampling schemes, we also give their statistical interpretations and a rigorous proof of convergence. As a practical data fitting method, AdaSAC estimates the optimal inlier set and threshold simultaneously. Improved performance is demonstrated, as compared with standard RANSAC and other related methods, for line/plane/ellipse fitting and pose estimation problems.
References 1. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 381–395 (1981) 2. Subbarao, R., Meer, P.: Beyond RANSAC: User independent robust regression. In: Proceedings of the Conference on Computer Vision and Pattern Recognition Workshop (2006) 3. Torr, P.H., Zisserman, A.: MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understaring 78, 138–156 (2000) 4. Torr, P.H., Davidson, C.: IMPSAC: Synthesis of importance sampling and random sample consensus. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 354–364 (2003) 5. Tordoff, B.J., Murray, D.W.: Guided-MLESAC: Faster image transform estimation by using matching priors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1523–1535 (2005) 6. Chum, O., Matas, J.: Matching with PROSAC - progressive sample consensus. In: Proceedings of the Computer Vision and Pattern Recognition, pp. 220–226 (2005) 7. Goshen, L., Shimshoni, I.: Balanced exploration and exploitation model search for efficient epipolar geometry estimation. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 151–164. Springer, Heidelberg (2006) 8. Fan, L., Pylv¨an¨ainen, T.: Robust scale estimation from ensemble inlier sets for random sample consensus methods. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 182–195. Springer, Heidelberg (2008) 9. Chum, O., Matas, J., Kittler, J.: Locally optimized RANSAC. In: Michaelis, B., Krell, G. (eds.) Proceedings of the 25th DAGM Symposium Pattern Recognition, pp. 236–243 (2003) 10. Chum, O., Matas, J., Obdrzalek, S.: Enhancing ransac by generalized model optimization. In: Proceedings of Asian Conference on Computer Vision, pp. 812–817 (2004) 11. Pylv¨an¨ainen, T., Fan, L.: Hill climbing algorithm for random sampling consensus methods. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Paragios, N., Tanveer, S.-M., Ju, T., Liu, Z., Coquillart, S., Cruz-Neira, C., M¨uller, T., Malzbender, T. (eds.) ISVC 2007, Part I. LNCS, vol. 4841, pp. 672–681. Springer, Heidelberg (2007) 12. Nist´er, D.: Preemptive ransac for live structure and motion estimation. In: Proceedings of the International Conference on Computer Vision, pp. 199–206 (2003) 13. Matas, J., Chum, O.: Randomized ransac with sequential probability ratio test. In: Proceedings of the International Conference on Computer Vision, pp. 1727–1732 (2005) 14. Raguram, R., Frahm, J.M., Pollefeys, M.: A comparative analysis of ransac techniques leading to adaptive real-time random sample consensus. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 500–513. Springer, Heidelberg (2008) 15. Hastings, W.: Monte carlo sampling methods using markov chains and their applications. Biometrika 5, 97–109 (1970)
16. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded Up Robust Features. In: Proceedings of the ninth European Conference on Computer Vision (2006) 17. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2003)
Appendix: Proof of Theorem 1

Proof. 1. The weight increment Wm+1 − Wm is a Bernoulli-distributed random event with success probability Pa, which is the total probability of selecting any model in the associate set of a data point x. The total number of weight increments in m trials is then binomially distributed, with mean mPa(x) and variance mPa(1 − Pa). For w′ = w/m, E(w′) = E(w)/m = Pa and Var(w′) = Pa(1 − Pa)/m → 0 when m → ∞.

2. For this proof, we denote the associated models Aε(x) as {θ1, θ2, ..., θK}, ordered so that L(θi) ≤ L(θj) for i ≤ j. From (12) it follows that the probability that the weight wm(x) at the m-th iteration is at most L(θl) is the probability that no higher model was visited in the m iterations, i.e.,

P(wm(x) ≤ L(θl)) = (1 − Σ_{i=l+1}^{K} p(θi))^m,   (19)

and thus

P(wm(x) = L(θl)) = (1 − Σ_{i=l+1}^{K} p(θi))^m − (1 − Σ_{i=l}^{K} p(θi))^m.   (20)

It follows that P(wm(x) = L(θl)) → 0 for all l < K as m → ∞, and P(wm(x) = L(θK)) → 1 as m → ∞. By direct computation then

E(wm(x)) = Σ_{i=1}^{K} L(θi) P(wm(x) = L(θi))
         = Σ_{i=1}^{K−1} L(θi) P(wm(x) = L(θi)) + L(θK) P(wm(x) = L(θK)) → L(θK), as m → ∞.   (21)

Var(wm(x)) = Σ_{i=1}^{K} (L(θi) − E(wm(x)))² P(wm(x) = L(θi))
           = Σ_{i=1}^{K−1} (L(θi) − E(wm(x)))² P(wm(x) = L(θi)) + (L(θK) − E(wm(x)))² P(wm(x) = L(θK)) → 0, as m → ∞.   (22)
Feature Matching under Region-Based Constraints for Robust Epipolar Geometry Estimation Wei Xu and Jane Mulligan Department of Computer Science, University of Colorado at Boulder, Boulder, Colorado 80309-0430 USA {Wei.Xu,Jane.Mulligan}@Colorado.edu
Abstract. Outlier-free inter-frame feature matches are important for accurate epipolar geometry estimation in many vision and robotics applications. We discover a set of high-level geometric and appearance constraints on low-level feature matches by exploiting reliable region matching results. A new outlier filtering scheme based on these constraints is proposed that can be combined with traditional robust statistical methods to identify outlier feature matches more reliably and efficiently. The proposed filtering scheme is tested in a real application of outdoor mobile robot navigation involving far-field scenes and scenes that contain repeated structures.
1 Introduction Epipolar geometry estimation (or relative pose estimation) between different views [1] is a hot topic in vision and robotics research and has many applications such as short-baseline/wide-baseline stereo, 3-D reconstruction, object tracking, mobile robot navigation, and visual odometry. Since the estimate is computed from image feature correspondences, researchers have developed and are still endeavoring to develop more and more prominent and stable image features. The focus has shifted from the traditional Harris corner features [2], which are computed from a simple local signature (i.e., a local neighborhood in the image), to the more recently developed view-independent features such as SIFT [3], GLOH [4] and MSER [5] that exploit high-level information from a much larger and more sophisticated signature aiming for global invariance. However, the preliminary feature matching schemes of the feature detectors are usually not outlier-free. To address this problem, robust statistical methods, such as RANSAC [6], MAPSAC [7] and LMedS [8], are usually combined with the epipolar geometry estimator to refine the preliminary inter-frame feature correspondences and estimate the geometry at the same time. These statistical methods identify whether a correspondence is an inlier or an outlier based on whether or not it is consistent with the major structure of the whole data (i.e., all feature correspondences). A characteristic of many view-independent image features (e.g., SIFT) is that although the features are computed from a local signature, the matching of them is global. This design is necessary for applications where the relative pose between a pair of views is large (e.g., wide-baseline stereo). In this case, the matching scheme has to search over the whole peer image to find the optimal correspondences of the features in the base image. However, global matching may also introduce more outliers, especially in an outdoor
environment where similar texture patterns (e.g., trees and grasses), repeated structures (e.g., buildings) and lighting changes may all cause false matching. In this paper, we propose a filtering scheme to overcome or alleviate this shortcoming of global matching by applying region-based locality and appearance constraints to refine global feature matching results.
2 Related Work As introduced earlier, robust statistical methods are usually used to refine global feature matching results. Among these methods, RANSAC [6] and its variants (e.g., MAPSAC [7]) are the most popular ones. They repeatedly apply random sampling and hypothesis-and-verification tests to locate the major structure of the data. However, the success of RANSAC and its variants is limited by the following practical issues: 1) They require the user to specify the fraction of outliers in the data (ε) and the residual threshold between inliers and outliers (t), but in practice these parameters are usually unknown and have to be guessed. 2) They guarantee to locate the major structure of the data at a designated confidence level (δ) in N trials, where N = ln(1 − δ)/ln(1 − (1 − ε)^m) and m is the minimal number of data points needed to generate a hypothesis model. If the confidence level δ is set too high or the fraction of outliers ε is high, hundreds of thousands of trials may be needed to locate the major structure of the data. This is impractical for many vision and robotics applications. In practice, many applications (e.g., [9]) would rather trade performance for time and limit the number of trials (N) to several hundred or a few thousand. In addition to robust statistical methods, methods making use of high-level geometric objects (e.g., line segments [10], texture-invariant regions [11]) or constraints (e.g., the sidedness constraint [12]) have also been proposed to guide low-level feature matching. However, the efficacy of these methods is limited for outdoor vision and robotics applications: the texture-invariant regions only exist in texture-abundant areas and thus usually have a sparse and uneven distribution in outdoor scenes; the matching scheme based on line segments only works for indoor environments; and the sidedness constraint needs several different views to apply. Our filtering scheme is motivated by the previous work of Tao et al. [13], which proposed a global matching framework for stereopsis that makes use of low-level segmentation in depth space to restrict the search space of dense stereo correspondences. Following its general idea, we propose to exploit high-level geometric and appearance constraints based on reliable region matches to refine preliminary correspondences of view-independent features. However, unlike Tao's work, we do not intend to integrate the region-based constraints into any particular feature detector, but are more interested in developing a general scheme for filtering the preliminary correspondences generated by any view-independent feature detector or any feature detector that employs global matching. Another piece of previous work closely related to ours is the enhanced feature matching scheme using saliency region correspondences [14]. It integrated image segmentation and region matching in the joint image space and solved them together using the normalized cuts technique. The approach is more general than ours because it aims at wide-baseline applications. However, according to the examples provided in the paper,
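For reference, a small helper computing the trial count N from the formula above; the example numbers are illustrative:

```python
import math

def ransac_trials(outlier_frac, m, confidence=0.99):
    """Number of RANSAC trials N = ln(1 - delta) / ln(1 - (1 - eps)^m)
    needed to draw an all-inlier minimal sample with probability delta."""
    return math.ceil(math.log(1.0 - confidence) /
                     math.log(1.0 - (1.0 - outlier_frac) ** m))

# For example, 50% outliers and a 7-point minimal sample already require
# ransac_trials(0.5, 7) = 588 trials at 99% confidence.
```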
it may still produce false region matches due to the dramatic appearance changes between wide-baseline views. Compared to [14], the approach proposed in this paper aims at more restricted short-baseline applications but tries to provide a more effective solution to them, since the region matches can be obtained more reliably given a short baseline. It also decouples image segmentation and region matching and solves them separately in a much simpler manner than normalized cuts. The target applications of the proposed scheme include those in which the relative pose between a pair of views is small or moderate, such as short-baseline stereo, object tracking, and mobile robot navigation at a high image sampling rate.
3 The Approach

To make the proposed filtering scheme work, we need to solve the following problems: 1) how to obtain good representations of regions and match them correctly, and 2) what kind of constraints from region matches can be used to filter low-level feature correspondences, and how? Details of our solutions are given as follows.

3.1 Inter-frame Region Matching

We developed an inter-frame region matching scheme based on the similarities of a sequence of geometric and appearance statistics of image segmentation results. Considering that our target task is epipolar geometry estimation for outdoor mobile robot navigation, we adopted the Color Structure Code (CSC) algorithm [15] for segmenting the sampled image frames. CSC is reported to be good at segmenting natural color scenes in a test with over 5,000 outdoor natural images [15]. Our own experiments confirmed this report and showed that in most cases the segments generated by CSC are conceptually coherent with real-world regions when the scene is relatively distant or flat. Moreover, CSC is very fast and has been successfully used for real-time vehicle tracking [16]. We assume the image segments generated by CSC are reasonable representations of real-world regions in the scene, so we can match the regions by comparing the similarities between their corresponding segments. The pair-wise region similarities are computed from a sequence of statistics measuring regional geometric and appearance properties, including color (Scolor), centroid (Scentroid), area (Sarea), texture (Stexture) and shape (measurements on the proportion and orientation of the equivalent ellipse (See_prop and See_ori) and on the bounding rectangle (Sbr)). Given a pair of regions A and B, Scolor(A, B) and Scentroid(A, B) are calculated as the normalized Euclidean distances between the mean color vectors (in the CIE-Lab color space) and the mean position vectors of the pixels in A and B. The area similarity Sarea is defined as

S_{area}(A,B) = \frac{|s_A - s_B|}{s_A + s_B}

where s_A and s_B are the sizes of regions A and B, respectively. The shape similarity is measured by the similarities of the equivalent ellipse (See_prop and See_ori) and of the bounding rectangle (Sbr), which are defined as [17]:
S_{ee\_prop}(A,B) = \frac{1}{2}\left(\frac{|m_A - m_B|}{m_A + m_B} + \frac{|n_A - n_B|}{n_A + n_B}\right)

where m_A and n_A (m_B and n_B) are the lengths of the major and minor axes of the equivalent ellipse of region A (B).

S_{ee\_ori}(A,B) = \left(1 - \frac{1}{e^{4(\Pi - 1)}}\right) \cdot \frac{\min(\Delta\theta,\ 180 - \Delta\theta)}{90}

where θ_A (θ_B) denotes the angle of the major axis of the equivalent ellipse of region A (B), Δθ = |θ_A − θ_B|, and Π = min(m_A/n_A, m_B/n_B).

S_{br}(A,B) = \frac{1}{2}\left(\frac{|w_A - w_B|}{w_A + w_B} + \frac{|h_A - h_B|}{h_A + h_B}\right)
where w_A and h_A (w_B and h_B) denote the width and height of the bounding rectangle of region A (B), respectively. The definition and computation of the equivalent ellipse and the bounding rectangle are described in [18]. The texture similarity is measured in two steps. First, we compose a texture vector of six fields that measure the mean, standard deviation, smoothness, third moment, uniformity and entropy of the intensity distribution in a region [19]. Next, the texture similarity Stexture(A, B) is calculated as the normalized Euclidean distance between the texture vectors of A and B. All the above similarity measurements are already normalized as defined. Finally, the overall similarity score is calculated as a weighted sum of the individual measures:

S(A,B) = \alpha_1 S_{color} + \alpha_2 S_{centroid} + \alpha_3 S_{area} + \alpha_4 S_{ee\_ori} + \alpha_5 S_{ee\_prop} + \alpha_6 S_{br} + \alpha_7 S_{texture}

The values of the weights α_i in the above equation are application dependent and are set by experience. Specifically, we use larger weights for color, area and centroid position and smaller weights for the other features, since our target applications contain many outdoor far-field scenes. Given the region set R^1 = {r_p^1 ∈ I_1, p = 1 ... m_1} of a frame I_1 and the region set R^2 = {r_q^2 ∈ I_2, q = 1 ... m_2} of another frame I_2, the matching between R^1 and R^2 is the following combinatorial optimization problem:

O = \arg\min_{r_p^2,\, r_q^1} \left\{ \sum_{p=1}^{m_1} S(r_p^1, r_p^2) + \sum_{q=1}^{m_2} S(r_q^1, r_q^2) \right\}

subject to:

r_p^2 \in R^2, \quad \bigcup_{p=1}^{m_1} r_p^2 = R^2, \quad r_{p_1}^2 \neq r_{p_2}^2 \ \text{if}\ p_1 \neq p_2,

r_q^1 \in R^1, \quad \bigcup_{q=1}^{m_2} r_q^1 = R^1, \quad r_{q_1}^1 \neq r_{q_2}^1 \ \text{if}\ q_1 \neq q_2.
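As an illustration of how the pairwise score S(A, B) used in the optimization above could be computed, the following sketch (not the authors' code; the normalizations, the default weights, and the truncated two-field texture vector are illustrative assumptions) evaluates S for two segments given as boolean masks over a CIE-Lab image:

import numpy as np

def region_stats(lab_img, mask):
    """Geometric/appearance statistics of one segment (boolean mask)."""
    ys, xs = np.nonzero(mask)
    pix = lab_img[mask]                                # N x 3 Lab values
    centroid = np.array([xs.mean(), ys.mean()])
    cov = np.cov(np.vstack([xs, ys]))                  # second moments
    evals, evecs = np.linalg.eigh(cov)
    minor, major = 2.0 * np.sqrt(np.maximum(evals, 1e-9))   # ellipse axes
    theta = np.degrees(np.arctan2(evecs[1, 1], evecs[0, 1])) % 180.0
    return dict(area=float(len(xs)), centroid=centroid, color=pix.mean(axis=0),
                major=major, minor=minor, theta=theta,
                w=xs.max() - xs.min() + 1.0, h=ys.max() - ys.min() + 1.0,
                texture=np.array([pix[:, 0].mean(), pix[:, 0].std()]))

def rel(a, b):                                         # |a - b| / (a + b)
    return abs(a - b) / (a + b + 1e-9)

def similarity(sa, sb, diag, weights=(2, 2, 2, 1, 1, 1, 1)):
    """Weighted dissimilarity S(A, B); smaller means more similar."""
    s_color    = np.linalg.norm(sa['color'] - sb['color']) / 255.0
    s_centroid = np.linalg.norm(sa['centroid'] - sb['centroid']) / diag
    s_area     = rel(sa['area'], sb['area'])
    d_theta    = abs(sa['theta'] - sb['theta'])
    pi_min     = min(sa['major'] / sa['minor'], sb['major'] / sb['minor'])
    s_ee_ori   = (1 - np.exp(-4 * (pi_min - 1))) * min(d_theta, 180 - d_theta) / 90
    s_ee_prop  = 0.5 * (rel(sa['major'], sb['major']) + rel(sa['minor'], sb['minor']))
    s_br       = 0.5 * (rel(sa['w'], sb['w']) + rel(sa['h'], sb['h']))
    s_texture  = np.linalg.norm(sa['texture'] - sb['texture']) / 255.0
    terms = (s_color, s_centroid, s_area, s_ee_ori, s_ee_prop, s_br, s_texture)
    return sum(w * t for w, t in zip(weights, terms))

Here diag (the image diagonal) normalizes the centroid distance, and the larger weights on color, centroid and area mirror the weighting described above for far-field scenes.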
In practice, we revise the above problem slightly to account for practical factors. We define two constraints on the color and area statistics of the regions: (a) Scolor ≤ τ1, and
(b) sA > τ2 and sB > τ2, since in an outdoor environment the measurement of regions in far-field scenes or under varying lighting conditions may not be accurate. τ1 and τ2 are thresholds determined by the application. Constraint (a) requires that the appearance change of matched regions not be too large; constraint (b) excludes small image segments from matching because they are less reliable at representing a region. Due to constraint (b), the completeness constraints in the above problem (the first two constraints) have to be relaxed, since not all regions are now included. Also, considering that in our target applications the relative pose is not too large, we add another constraint on motion: (c) Scentroid ≤ τ3, where the threshold τ3 is co-determined by the speed of the vehicle and the image sampling rate. These practical constraints guarantee that the obtained region matches are reliable. We adopted the "perfect-matching" scheme proposed by Rehrmann [17] to solve the revised optimization problem: for a region in one image frame, its optimal match is the region peer in the other frame that is most similar to it. This examination is performed for every inter-frame pair of regions. Region pairs whose members are mutually optimal to each other with very high similarity are classified as reliable "perfect matches". Previous work [16] and our own experience show that the "perfect-matching" scheme works well for vehicle tracking and outdoor mobile robot navigation applications with a normal image sampling rate (15 fps of 320x240 frames). However, it may be less effective (i.e., only a few "perfect matches" are obtained) when a dramatic viewpoint change occurs between a pair of frames. Even if this situation occurs, it only reduces the number of region-based constraints that can be used to refine the low-level feature matches; it does not affect the correctness of those constraints.

3.2 Region-Based Constraints

In general conditions the geometric transformation between a pair of views of a static scene is a similarity transformation under which the geometric properties of a region, including position, area, orientation and shape, are not consistent. However, if the relative pose between a pair of views is relatively small, these properties may change only a little and can still be used as measures for region matching, as our region matching scheme does. We also consider that the appearance of a region may change with time in outdoor vision and robotics applications: uncontrolled lighting, changes of viewpoint and non-diffuse reflections may cause the same scene point to have different color/intensity values in different images. Taking both of these geometric and appearance aspects into consideration, we bring forward the following region-based constraints on low-level feature correspondences:

Region ownership constraint: Feature matches whose two features reside in non-matched regions, called "cross-region" matches, are detected and identified as outliers. Considering that region boundaries may not be accurately detected in practice, we do not count as outliers those "cross-region" matches in which either of the two features is close to the boundary of the matching region of its owner region.

Appearance constraint: For all pairs of feature matches residing in the same pair of matched regions, re-compute the matches of these features taking into account the color difference between these two regions.
This may result in some features being assigned conjugates different from those in the original matches.
Input: Image I1 and its segments R_p^1 ∈ I1, p = 1 ... m1; image I2 and its segments R_q^2 ∈ I2, q = 1 ... m2; inter-frame region matches RM = {(R_i^1, R_i^2) | R_i^1 ∈ I1, R_i^2 ∈ I2, i = 1 ... m}; inter-frame feature matches FM_all = {(X_j^1, X_j^2) | X_j^1 ∈ I1, X_j^2 ∈ I2, j = 1 ... n}; adjacency threshold τ
Output: Feature match inlier set FM_inlier ⊆ FM_all and outlier set FM_outlier ⊆ FM_all

Update FM_all using RM and the appearance constraint;
FM_inlier = ∅, FM_outlier = ∅;
for j = 1 to n do
    find regions R_p^1 ∈ I1 and R_q^2 ∈ I2 such that X_j^1 ∈ R_p^1 and X_j^2 ∈ R_q^2;
    if (R_p^1, R_q^2) ∈ RM then
        FM_inlier = FM_inlier ∪ {(X_j^1, X_j^2)};
    else
        find region R_p^2 ∈ I2 such that (R_p^1, R_p^2) ∈ RM;
        find region R_q^1 ∈ I1 such that (R_q^1, R_q^2) ∈ RM;
        d1 = computeMinDist(X_j^1, R_q^1);
        d2 = computeMinDist(X_j^2, R_p^2);
        if min(d1, d2) < τ then
            FM_inlier = FM_inlier ∪ {(X_j^1, X_j^2)};
        else
            FM_outlier = FM_outlier ∪ {(X_j^1, X_j^2)};
        end
    end
end
Algorithm 1. An outlier filtering scheme based on the proposed region-based constraints

In our implementation, the appearance constraint is enforced by first compensating for the average color difference of a pair of matched regions and then updating the feature matches within this pair. It is not able to identify outlier feature matches, but it may change the original assignment. The region ownership constraint directly identifies outlier feature matches that violate it. An outlier filtering scheme based on these constraints is summarized in Algorithm 1.
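A minimal sketch of the ownership test in Algorithm 1 (not the authors' code; the region-lookup callbacks, the boundary-distance helper and the data layout are assumptions) could look like this:

def filter_matches(fm_all, rm, region_of_1, region_of_2, min_dist, tau):
    """Split feature matches into inliers/outliers by the ownership constraint.

    fm_all      : list of (x1, x2) feature-point pairs
    rm          : set of matched region-id pairs {(r1, r2), ...}
    region_of_1 : x1 -> owning region id in image 1 (region_of_2 analogous)
    min_dist    : (point, region_id) -> distance to that region's boundary
    tau         : adjacency threshold
    """
    match_12 = {r1: r2 for r1, r2 in rm}      # region match lookup, image 1 -> 2
    match_21 = {r2: r1 for r1, r2 in rm}      # and image 2 -> 1
    inliers, outliers = [], []
    for x1, x2 in fm_all:
        rp, rq = region_of_1(x1), region_of_2(x2)
        if (rp, rq) in rm:
            inliers.append((x1, x2))
            continue
        # cross-region match: tolerate it if either feature is close to the
        # boundary of the region matched to its owner region
        d1 = min_dist(x1, match_21[rq]) if rq in match_21 else float("inf")
        d2 = min_dist(x2, match_12[rp]) if rp in match_12 else float("inf")
        (inliers if min(d1, d2) < tau else outliers).append((x1, x2))
    return inliers, outliers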
4 Experimental Results

We have tested the proposed constraint filtering scheme with respect to epipolar geometry estimation for different target applications. View-independent SIFT features, the eight-point algorithm and MAPSAC [7] (a maximum a posteriori (MAP) variant of RANSAC) are adopted to compute the epipolar geometry between a pair of frames. The evaluation criterion for the estimated epipolar geometry is the sum of the back-projection errors of the scene points onto the image planes. Our evaluation strategy is to compare the accuracy of the estimated epipolar geometry with and without applying the filtering
Fig. 1. (a) Original image pair from a motion sequence. (b) Color segments with yellow lines connecting matching segment centroids.
Fig. 2. (a) Inlier matches of SIFT features (connected by green lines) identified by a run of MAPSAC with N = 1,000 trials. (b) The two outliers (connected by red lines) that passed the examination of MAPSAC.
scheme on the preliminary feature matches before running MAPSAC. The comparison results show whether there is a performance gain from incorporating the region-based constraints. Fig. 1a shows a pair of images of different views of an outdoor far-field scene. The image pair was collected in an application of mobile robot navigation based on far-field scenes. The images are first segmented using the CSC algorithm, and then the inter-frame region match set RM is computed using the proposed region matching scheme. In far-field scenes, textured regions are usually smoothed out by distance and agglomerated into large color-coherent regions, so both the color-based CSC segmentation algorithm and the proposed region matching scheme work very well, as Fig. 1b shows. SIFT features are detected in each image and their preliminary matches are examined by MAPSAC (Fig. 2a). Fig. 2b shows two obvious outlier matches in Fig. 2a that have passed the examination of MAPSAC. As introduced earlier, this problem can be
[Fig. 3 plots: mean estimation error versus SIFT threshold (0.5-0.95) for "MAPSAC only", "Random discarding + MAPSAC" and "Constraint filtering + MAPSAC"; (a) on the far-field scene, (b) on the "two doors" scene.]
Fig. 3. Comparison with MAPSAC. (a) Epipolar geometry estimation errors under different SIFT threshold levels for the far-field scene. (b) Experiment results with SIFT features for the "two doors" scene: incorporating the proposed constraint filtering scheme reduces the epipolar geometry estimation errors by around 50% compared with applying MAPSAC alone for this scene.
overcome if the user can provide the true values of the outlier fraction ε and the inlier identification threshold t. However, in practice the true values of these parameters are usually unknown and vary for different scenes, and guessed values have to be used instead. Thus, this "missing outlier" problem of robust statistics methods can often happen if the guess is wrong, and efforts in addition to robust statistical methods, such as the proposed region-based constraints, are needed to refine the matches. We then applied the "Constraint filtering + MAPSAC" scheme to identify outlier feature matches and estimate the epipolar geometry for this far-field scene example. Fig. 3(a) shows the mean estimation errors of 100 runs of both the "MAPSAC only" and the "Constraint filtering + MAPSAC" schemes. The "Constraint filtering + MAPSAC" scheme achieves lower estimation errors than applying MAPSAC alone, and the improvement is more significant for noisy data, which correspond to higher SIFT thresholds. To verify that the performance improvement is not purely due to the drop in the number of matches, we also tested a "Random discarding + MAPSAC" scheme that randomly discards the same number of preliminary matches as the number of outliers identified by the proposed constraint filtering scheme before going to MAPSAC. The performance of this "Random discarding + MAPSAC" scheme is also shown in Fig. 3(a). It should be noted that the power of the proposed constraint filtering scheme is not fully exerted in this far-field example because there are only a few outliers in total in the preliminary SIFT feature matches, which leaves little room for the proposed filtering scheme to improve upon. Fig. 4 shows another example, a close-range scene containing repeated structures, where the proposed constraint filtering scheme can better exert its power. The two doors in the scene have the same structure and similar local texture patterns. The global matching scheme of the SIFT feature detector is easily confused by this kind of scene with repeated structures. Fig. 3(b) shows that the proposed constraint filtering scheme can greatly improve the accuracy of the estimated epipolar geometry for such scenes: it reduces the mean estimation errors (over 100 runs) by around 50% for different SIFT threshold levels.
Fig. 4. (a) Original image pair of the “two doors” scene. (b) Color segments with yellow lines connecting matching segment centroids. (c) Inlier matches of SIFT features identified by using the “Constraint filtering + MAPSAC” scheme. SIFT feature match threshold is 0.50. (d) Outliers identified by the proposed constraint filtering scheme alone. Note that these outliers are not always correctly identified by using MAPSAC alone (see also Fig. 3(b)).
5 Discussion and Conclusion

In this paper we propose a set of high-level region-based constraints for refining low-level image feature matches. We use color image segmentation techniques to extract coherent regions over a pair of images of the same scene. High-level geometric and appearance information, in the form of region-based constraints, is extracted from reliable region matching results. These constraints are used to identify and remove outliers from the preliminary feature matches before a robust statistical method, MAPSAC, is applied to estimate the epipolar geometry. Experiments with different applications show that combining a pre-filtering scheme based on the proposed ownership (spatial) and appearance constraints with MAPSAC achieves better performance than applying MAPSAC alone. The effectiveness of the proposed constraints relies on the quality of the image segmentation and region matching results. Since we target short-baseline applications, reliable region matching results can be obtained in most cases. Extending the proposed scheme to wide-baseline applications is a possible direction for future work.
References 1. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) 2. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. 4th Alvey Vision Conference, pp. 147–151 (1988)
3. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004) 4. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 10, 1615–1630 (2005) 5. Matas,J., et al.: Robust wide baseline stereo from maximally stable extremal regions. In: Proc. 9th European Conference on Computer Vision (ECCV 2002), vol. 1, pp. 384–393 (2002) 6. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM 24, 381–395 (1981) 7. Torr, P.H.S., Murray, D.W.: The development and comparison of robust methods for estimating the fundamental matrix. International Journal of Computer Vision 24, 271–300 (1997) 8. Zhang, Z.: Estimating motion and structure from correspondences of linesegments between two perspective images. IEEE Transactions on Pattern Analysis and Machine Intelligence 17, 1129–1139 (1995) 9. Zhang, W., Kosecka, J.: A new inlier identification procedure for robust estimation problems. In: Proc. Robotics: Science and Systems Conference 2006, RSS 2006 (2006) 10. Bay, H., Ferrari, V., Van Gool, L.: Wide-baseline stereo matching with line segments. In: Proc. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition(CVPR 2005), vol. 1, pp. 329–336 (2005) 11. Schaffalitzky, F., Zisserman, A.: Viewpoint invariant texture matching and wide baseline stereo. In: Proc. 18th IEEE International Conference on Computer Vision (ICCV 2001), pp. 636–643 (2001) 12. Ferrari, V.: Wide-baseline multiple-view correspondence. In: Proc. 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition(CVPR 2003), vol. 1, pp. 718–725 (2003) 13. Tao, H., Sawhney, H.S., Kumar, R.: A golbal matching framework for stereo computation. In: Proc. 18th IEEE International Conference on Computer Vision (ICCV 2001), pp. 532–539 (2001) 14. Toshev, A., Shi, J., Daniilidis, K.: Image matching via saliency region correspondences. In: Proc. 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition(CVPR 2007), pp. 1–8 (2007) 15. Rehrmann, V., Priese, L.: Fast and robust segmentation of natural color scenes. In: Chin, R., Pong, T.-C. (eds.) ACCV 1998. LNCS, vol. 1352, pp. 598–606. Springer, Heidelberg (1998) 16. Ross, M.: Segment clustering tracking. In: Proc. 2nd Europ. Conf. on Colour in Graphics, Imaging, and Visualization, pp. 598–606 (2004) 17. Rehrmann, V.: Object oriented motion estimation in color image sequences. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 704–719. Springer, Heidelberg (1998) 18. Hornberg, A.: Handbook of Machine Vision. Wiley, Chichester (2006) 19. Gonzalez, R.C., Woods, R., Eddins, S.L.: Digital Image Processing Using MATLAB. Prentice Hall, Englewood Cliffs (2003)
Lossless Compression Using Joint Predictor for Astronomical Images Bo-Zong Wu and Angela Chih-Wei Tang Visual Communications Laboratory Department of Communication Engineering National Central University
Jhongli, Taiwan
Abstract. Downloading astronomical images through the Internet is slow due to their huge size. Although several lossless image coding standards with good performance have been developed in the past years, none of them is specifically designed for astronomical data. Motivated by this, this paper proposes a lossless coding scheme for astronomical image compression. We design a joint predictor which combines an interpolation predictor and a partial MMSE predictor. This strategy benefits from a high compression ratio and low computational complexity. Moreover, scalable and embedded coding can be further supported. The interpolation predictor is realized by upsampling the downsampled input image using bi-cubic interpolation, while the partial minimum mean square error (MMSE) predictor predicts the background and foreground (i.e., stars) separately. Finally, we design a simplified Tier-1 coder from JPEG2000 for entropy coding. Our experimental results show that the proposed encoder achieves a higher compression ratio than JPEG2000 and JPEG-LS.
1 Introduction

Digital imaging arrays such as CCDs (charge-coupled devices) used by astronomy organizations such as the European DENIS and the American Sloan Digital Sky Survey have produced several TBytes of astronomical images in the past years. For accurate analysis of astronomical data with efficient storage and transmission, lossless image compression is required. On the other hand, to transmit astronomical data that can adapt to diverse user needs through the Internet, progressive transmission supported by scalabilities such as spatial scalability or SNR scalability is highly desirable [1]. For example, astronomers can browse astronomical images at lower resolution first and then view interesting objects in the image at high resolution if needed. Image coding standards such as JPEG2000 [2] and JPEG-LS [3] provide lossless and near-lossless compression in addition to lossy compression. JPEG-LS is a fast lossless image compression algorithm with a median edge detector (MED) predictor based on LOCO-I [3]. JPEG2000 includes coding tools such as the Discrete Wavelet Transform, bit-plane context modeling, and binary arithmetic coding. With these coding tools, JPEG2000 performs well in low bit-rate compression. In most cases,
JPEG-LS outperforms JPEG2000 in lossless compression. However, JPEG2000 supports the lossless mode with a scalability profile, whereas JPEG-LS does not. Although JPEG2000 and JPEG-LS perform well, they are not specifically designed for astronomical images. We observe that astronomical images consist of two major parts, stars and noisy background. Noise in the background is hard to predict and compress. On the other hand, the regions containing stars are usually circular and are hard to predict with the MED predictor in JPEG-LS. To achieve better prediction, a minimum mean square error (MMSE) predictor can be applied individually at the block level [4]. In [5], the relaxation-labeled prediction encoder (RLPE) is proposed for the lossless/near-lossless compression of astronomical images. This algorithm calculates MMSE predictors for different image blocks and uses fuzzy C-means to cluster the predictors and thus reduce their huge number. The lossless compression ratio of RLPE is higher than those of JPEG2000 and JPEG-LS. However, its computational complexity is also much higher than those of JPEG2000 and JPEG-LS. Similar to JPEG-LS, RLPE does not support scalabilities. In [6], a two-layer image coding scheme extended from the H.264 lossless image coder is proposed. Its compression ratio is higher than that of JPEG-LS. The interpolation operation benefits from its low computational complexity. However, only a few previous works adopt it in data compression. In [7], a lossy image compression scheme is proposed where uniform downsampling is applied before encoding. Then, collaborative adaptive down-sampling and upconversion (CADU) interpolation reconstructs the JPEG2000 decoded image [7]. In the scalable video coding (SVC) extension of the H.264/AVC standard, spatial scalability is accomplished by taking the downsampled video frame as the base layer. The enhancement layers are then implemented by coding the spatial redundancy between the upsampled reference layer and the current layer. An interpolation-based predictor provides prediction performance similar to that of an MMSE predictor for most astronomical images. By upsampling the downsampled image, we can reduce the spatial redundancy so that only the difference between the reconstructed image and the original image is compressed. Although downsampling may result in the loss of high-frequency components, few astronomical images have high-frequency contents. Thus, we apply bi-cubic interpolation following the downsampling stage in this paper. A joint predictor which includes the interpolation predictor and the partial MMSE predictor is adopted. Since the computational complexity of the MMSE predictor is higher than those of the interpolation predictor and the MED predictor, our joint predictor reduces the number of MMSE predictors while achieving prediction performance similar to calculating distinct MMSE predictors for different image blocks. Moreover, the design of our joint predictor is capable of supporting scalable coding and progressive transmission. We also simplify the Tier-1 coder with three coding passes in JPEG2000 [8] into a coder with only one coding pass. This reduces the computational complexity while the compression performance remains good. The rest of this paper is organized as follows. In Section 2, we introduce the interpolation predictor and the partial MMSE predictor adopted in this paper. In Section 3, we propose the lossless image coding algorithm for astronomical data.
Section 4 gives the experimental results with discussions, and conclusions are given in Section 5.
2 Predictors for Astronomical Images

DPCM (Differential Pulse Code Modulation) aims to decorrelate the input signals without loss of information. Its performance is strongly affected by the estimator it adopts. Examples of estimators include the Wiener filter, the MMSE-based LMS filter, and the Kalman filter [8]. In lossless image compression, estimators such as the GAP [9], MED, MMAE [10], and MMSE predictors have been adopted. Among them, the MMSE filter is the most popular due to its impressive improvements over earlier context-based adaptive predictors such as MED and GAP and its easy implementation. The MMAE predictor provides better prediction performance, but its computational complexity is much higher than that of the MMSE predictor [10]. The idea of a joint predictor has been proposed in [11], where the GAP predictor and the MMSE predictor are switched. The MMSE predictor works well in both smooth regions and edge areas [5], but MMSE prediction is time consuming. Furthermore, it is unnecessary to apply MMSE prediction to all blocks of an astronomical image, since such images lack high-frequency components. Therefore, we adaptively use the interpolation predictor and the MMSE predictor to reduce computational complexity.
Fig. 1. Nearest-neighbor (NN) downsampling and interpolation
The main idea of the interpolation predictor is to predict the target signal by interpolation. As shown in Fig. 1, we downsample the input image by nearest-neighbor downsampling with a factor of two. After downsampling, we generate the reconstructed image by interpolating the downsampled image. One quarter of the pixels (gray circles) in the reconstructed image are identical to those in the original image. The residual image is the difference between the original image and the reconstructed image, so one quarter of the pixels in the residual image are surely zero. Thus, we only need to code three quarters of the residual image. In MMSE prediction, the current pixel value is predicted by referring to neighboring pixels that have already been coded. Since the image characteristics of local regions are distinct, better performance can be achieved by splitting the input image into macroblocks and calculating an MMSE predictor for each block B(n). That is, the set of prediction coefficients varies from block to block. An MMSE estimator finds an optimal predictor a for the target block B(n) by minimizing the mean square error

MSE = \frac{1}{T}\sum_{i=1}^{T}\left(x(n_i) - \sum_{k=1}^{N} a_k\, x(n_{ik})\right)^2,    (1)

where N is the prediction order, T is the number of reference pixels, x(·) denotes a pixel value, and n_{ik} denotes the k-th neighbor of pixel n_i.
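A minimal sketch of the interpolation predictor described above (not the authors' code; OpenCV's bicubic resize is used here as one possible interpolator):

import numpy as np
import cv2

def interpolation_residual(img):
    """NN-downsample by 2, bicubic-upsample back, and return the residual
    R_L = img - reconstruction, as in the scheme of Fig. 1."""
    h, w = img.shape
    small = img[::2, ::2]                                   # NN downsampling I_s
    recon = cv2.resize(small, (w, h), interpolation=cv2.INTER_CUBIC)
    residual = img.astype(np.int16) - recon.astype(np.int16)
    return small, residual

# usage: small, r = interpolation_residual(gray_uint8_image)
# `small` is coded (or predicted further by the partial MMSE predictor) and
# the residual is entropy coded; per the paper, the quarter of samples kept
# by the downsampler need not be coded again.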
Our experimental results show that the variances of the residual image obtained with one MMSE predictor per background block and with a single MMSE predictor for all background blocks are very close. That is, one MMSE predictor for all blocks in the background is enough. Based on this, in order to predict accurately while reducing the computational complexity of calculating MMSE predictors, we propose the idea of partial MMSE prediction. Before calculating the prediction coefficients of the MMSE predictor, we identify the characteristics of the current block using statistical measurements. Then, we calculate one MMSE predictor for each block in the foreground and one common MMSE predictor for all blocks in the background. The algorithm is described as follows. First, we compute the mean and variance of each block. Meanwhile, we compute the mean and variance of the global image. Next, a block is classified as foreground or background by rule (2), where Mb is the local mean of the n-th block, M is the global mean of the image Is, Var is the variance of Is, and Max is the maximum pixel value of the n-th block.
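A least-squares fit of the block predictor in Eq. (1) might look like the following sketch (not the authors' implementation; the causal neighbor layout for N = 4 is an assumption, since the paper does not specify it):

import numpy as np

# assumed causal neighbors for prediction order N = 4: W, N, NW, NE
NEIGHBORS = [(0, -1), (-1, 0), (-1, -1), (-1, 1)]

def fit_mmse_predictor(block_ctx):
    """Solve Eq. (1) by least squares over the reference pixels of a block.

    block_ctx: 2D array holding the block plus its causal context, i.e. one
    extra row on top and one extra column on the left and on the right, so
    every predicted pixel has all four neighbors available.
    Returns the coefficient vector a (length N)."""
    rows, targets = [], []
    for y in range(1, block_ctx.shape[0]):
        for x in range(1, block_ctx.shape[1] - 1):
            rows.append([block_ctx[y + dy, x + dx] for dy, dx in NEIGHBORS])
            targets.append(block_ctx[y, x])
    A = np.asarray(rows, dtype=np.float64)
    b = np.asarray(targets, dtype=np.float64)
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

In the partial scheme described above, this fit would be run once per foreground block and only once, on pooled reference pixels, for the whole background.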
3 The Proposed Lossless Coding Scheme for Astronomical Images In our proposed encoder, we design the joint predictor including interpolation predictor and partial MMSE predictor to reduce the spatial redundancy. Next, we adopt the context-based arithmetic coding to encode the residual image. The flow chart of our encoder is shown in Fig. 2. The details of joint predictor and context-based arithmetic coding are stated in subsections 3.1 and 3.2.
Fig. 2. The proposed encoding scheme
3.1 Joint Predictor for Astronomical Images
Fig. 3. (a) The original image. (b) The residual image RL by bilinear interpolation. (c) RL by MED predictor. (d) RL by bi-cubic interpolation.
For an input image I, we adopt the interpolation predictor to obtain the residual image RL and the downsampled image Is. We use partial MMSE prediction to generate the small residual image Rs if the global variance of Is is large. Otherwise, Is is sent to the context-based arithmetic coder directly. We adopt bi-cubic interpolation to upsample Is to Ir. Compared with NN interpolation and bilinear interpolation, bi-cubic interpolation achieves better image quality in regions containing edges. The residual images RL based on bilinear interpolation, bi-cubic interpolation, and the MED predictor are shown in Fig. 3. As we can see, the variance of RL resulting from bi-cubic interpolation is smaller than those from the other two methods. To further reduce spatial redundancy in Is, we adopt the partial MMSE predictor rather than the interpolation predictor. The reason is that stars in Is are usually small circles and their information may be lost after downsampling. To solve this problem and reduce the computational complexity, we use static MMSE prediction, which calculates only one common MMSE predictor for all blocks in the background. Then, we use adaptive MMSE prediction to calculate an MMSE predictor that is adaptive to the image content of each block in the foreground. As a result, we can predict well in the foreground and save calculations in the background. The partial MMSE prediction significantly reduces the number of MMSE predictors. With partial MMSE prediction, on average, we only need to calculate 60 MMSE predictors for Is (240x240) of all test images I (480x480). However, without our partial MMSE prediction, which separates foreground and background, we would need to calculate 255 MMSE predictors for Is. Compared with RLPE [5], where 900 MMSE predictors in I are computed, the computation reduction of our partial MMSE predictor is 93%.

3.2 A Fast One-Pass Tier-1 Coder for Astronomical Images

The Tier-1 coder is part of the embedded block coding with optimized truncation (EBCOT) adopted in JPEG2000 [12]. The Tier-1 encoder divides a residual image into one sign bit-plane and several magnitude bit-planes. The scanning order is from the most significant bit (MSB) to the least significant bit (LSB). In order to improve the coding
Fig. 4. An example of the bit-plane coding
performance, fractional bit-plane coding is adopted. Each magnitude bit-plane is encoded by three passes: the significance propagation pass (Pass 1), the refinement pass (Pass 2) and the cleanup pass (Pass 3). These three passes are composed of four primitives: Zero Coding (ZC), Sign Coding (SC), Magnitude Refinement (MR), and Run-Length Coding (RLC). When we encounter the first symbol 1 in a magnitude bit-plane, we code the sign bit-plane immediately (Fig. 4). The main idea of the Tier-1 coder is to code the image on the basis of bit-planes, and thus progressive transmission becomes possible by adopting the Tier-1 coder. Although EBCOT enables a high compression ratio and progressive transmission, it is the most time-consuming module in JPEG2000. Typically, it costs more than 50% of the encoding time in a software implementation [13], and several fast algorithms have been proposed [14]-[17]. In JPEG2000, each sample is coded by one specific pass while the other two passes still scan it; these scanned-but-skipped samples are called wasted samples [18]. In order to save the computation for wasted samples, we combine the original three coding passes into one pass. In our observations, we find that the RLC primitive cannot improve the compression ratio of a residual image with low variance. Therefore, we combine the first two passes into one pass and remove the third pass. We scan stripe by stripe in this single pass. If the current sample is insignificant, we use zero coding (ZC). If the current sample becomes significant, the sign coding (SC) primitive is used. If the current sample is already significant, we use the magnitude refinement (MR) primitive.
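A schematic of this single-pass primitive selection is sketched below (only the decision logic; the JPEG2000 context modeling and arithmetic coder are omitted, and the stripe height of 4 follows the usual EBCOT convention):

import numpy as np

def one_pass_tier1(residual, num_planes=8, stripe_height=4):
    """Emit (primitive, bit) decisions bit-plane by bit-plane, MSB to LSB."""
    mag = np.abs(residual)
    sign = residual < 0
    significant = np.zeros(residual.shape, dtype=bool)
    symbols = []                                   # would feed an MQ-coder
    h, w = residual.shape
    for p in range(num_planes - 1, -1, -1):        # MSB -> LSB
        bit_plane = (mag >> p) & 1
        for y0 in range(0, h, stripe_height):      # stripe-by-stripe scan
            for x in range(w):
                for y in range(y0, min(y0 + stripe_height, h)):
                    bit = int(bit_plane[y, x])
                    if not significant[y, x]:
                        symbols.append(("ZC", bit))               # zero coding
                        if bit:                                   # first 1 seen
                            significant[y, x] = True
                            symbols.append(("SC", int(sign[y, x])))  # sign coding
                    else:
                        symbols.append(("MR", bit))               # refinement
    return symbols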
4 Experimental Results Astronomical images can be classified into five classes [19]: clump, cluster, galaxy, spiral and interaction. (Fig. 5). In our experiment, we test all classes, where each class includes 3 images. All the 15 test images (480X480) are downloaded from Digital Sky Survey (DSS) system [20]. For the convenience of comparisons with other codecs, the format of input images is in Portable Gray Map (PGM) where images are grayscale with 8-bits dynamic range. In our tests, the block size is 16X16 pixels, the
Fig. 5. The representative images corresponding to the five classes. (a)-(c) Clump. (d)-(f) Cluster. (g)-(i) Galaxy. (j)-(l) Spiral. (m)-(o) Interaction.
prediction order N is 4, and the number of reference pixels T is 14X14 in MMSE predictors. The comparisons of the proposed encoder with JPEG2000 and JPEG-LS are shown in Table 1. The compression ratio of the proposed encoder is higher than those of JPEG2000 and JPEG-LS for all classes of astronomical images. From Table 1, we can find the compression ratio of astronomical images with large or many objects e.g. Cluster3, is lower than that of images with just a few small objects like clump1. This is because Cluster3 includes many details and it is hard to predict by the MMSE predictor and interpolation predictor. Although the compression ratio of our one-pass Tier-1 encoder is almost the same as the original Tier-1 coder, our coder saves around 14% of execution time compared
Table 1. Comparisons of bit rate among three lossless compression methods

Images         JPEG2000 (bps)   JPEG-LS (bps)   Proposed (bps)
Clump1         1.81             1.58            1.45
Clump2         1.85             1.63            1.54
Clump3         2.05             1.82            1.72
Cluster1       2.78             2.67            2.51
Cluster2       2.07             1.8             1.74
Cluster3       3.04             2.95            2.74
Galaxy1        2.72             2.63            2.47
Galaxy2        1.9              1.69            1.64
Galaxy3        1.86             1.65            1.58
Spiral1        2.04             1.86            1.74
Spiral2        1.91             1.71            1.64
Spiral3        1.99             1.77            1.79
Interaction1   2.04             1.82            1.79
Interaction2   2.77             2.70            2.37
Interaction3   2.79             2.70            2.38
Average        2.24             2.07            1.94
with the original Tier-1 coder. On a 2.83 GHz Pentium PC, the average computation time of our Tier-1 encoder is 0.103 seconds. The proposed coding scheme provides a good solution for astronomical image compression, since its compression ratio is higher than those of JPEG2000 and JPEG-LS. On the other hand, although RLPE provides a higher compression ratio than JPEG-LS, the computational complexity of RLPE is twice that of JPEG2000, as shown in [5], while our proposed encoder achieves a 50% speedup compared with JPEG2000. Moreover, our proposed scheme can further support spatial scalability and progressive transmission.
5 Conclusions

In this paper, we propose a fast, efficient, and flexible coding scheme designed for astronomical images. The joint predictor reduces spatial redundancy efficiently with low computational complexity. In entropy coding, we simplify the Tier-1 coder while keeping its benefits such as a high compression ratio and embedded coding. Our one-pass Tier-1 coder saves 14% of the execution time compared with the traditional Tier-1 encoder. Compared with JPEG2000, our proposed encoder saves 13% of the bit rate and more than 50% of the execution time. Compared with JPEG-LS, our proposed encoder achieves a higher compression ratio and can further support progressive transmission.
References 1. Starck, J.L., Murtagh, F.: Astronomical Image and Data Analysis, 2nd edn. Springer, Heidelberg (2002) 2. Christopoulos, C.A., Skodras, A.N., Ebrahimi, T.: The JPEG 2000 Still Image Coding System: An Overview. IEEE Transactions on Consumer Electronics 46(4), 1103–1127 (2000) 3. Weinberger, M., Seroussi, G., Sapiro, G.: The LOCO-I Lossless Image Compression Algorithm: Principles and Standardization into JPEG-LS. IEEE Transactions on Image Processing 9(6), 1309–1324 (2000)
4. Li, X., Orchard, M.T.: Edge-directed Prediction for Lossless Compression of Natural Images. IEEE Transactions on Image Processing 10(6), 813–817 (2001) 5. Lastri, C., Aiazzi, B.: Virtually Lossless Compression of Astrophysical Images. EURASIP Journal on Applied Signal Processing 2005, 2521–2535 (2005) 6. Ding, J.R., Chen, J.Y., Yang, F.C., Yan, J.F.: Two-layer and Adaptive Entropy Coding Algorithm for H.264-based Lossless Image Coding. In: Proc. IEEE International Conference on Acoustic, Speech and Signal Processing, pp. 1369–1372 (2008) 7. Wu, X., Zhang, X., Wang, X.: Low Bit-Rate Image Compression via Adaptive DownSampling and Constrained Least Squares Upconversion. IEEE Transactions on Image Processing 18(3), 552–561 (2009) 8. Simon, H.: Adaptive Filter Theory, 4th edn. Prentice Hall, Inc., New Jersey (2002) 9. Wu, X., Memon, N.: Context-based Adaptive Lossless Image Coding. IEEE Transactions on Communications 45(4), 437–444 (1997) 10. Hashidume, Y., Morikawa, Y.: Lossless Image Coding Based on Minimum Mean Absolute Error Predictors. In: Society of Instrument and Control Engineers Annual Conference, pp. 2832–2836 (2007) 11. Tiwari, A.K., Kumar, R.V.R.: Least Squares Based Optimal Switched Predictors for Lossless Compression of Image. In: Proc. IEEE International Conference on Multimedia and Expo, pp. 1129–1132 (2008) 12. Taubman: High Performance Scalable Image Compression with EBCOT. IEEE Transactions on Image Processing 9(7), 1158–1170 (2000) 13. Adams, M.D., Kossentini: JasPer: a software-based JPEG-2000 codec implementation. In: Proc. IEEE International Conference on Image Processing, vol. 2, pp. 53–56 (2000) 14. Du, W., Sun, J., Ni, Q.: Fast and Efficient Rate Control Approach for JPEG2000. IEEE Transactions on Consumer Electronics 50(4), 1218–1221 (2004) 15. Li, N., Bayoumi, M.: Three-Level Parallel High Speed Architecture for EBCOT in JPEG2000. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 5–8 (2005) 16. Rathi, S., Wang, Z.: Fast EBCOT Encoder Architecture for JPEG 2000. In: Proc. 2007 IEEE Workshop on Signal Processing System, pp. 595–599 (2007) 17. Varma, K., Damecharla, H.B., Bell, A.E.: A Fast JPEG2000 Encoder that Preserves Coding Efficiency: The Split Arithmetic Encoder. IEEE Transactions on Circuits and Systems for Video Technology 55(11), 3711–3722 (2008) 18. Chiang, J.S., Chang, C.H.: High Efficiency EBCOT with Parallel Coding Architecture for JPEG2000. EURASIP Journal on Applied Signal Processing 2006, 1–14 (2006) 19. Sloan Digital Sky Survey, http://cas.sdss.org/dr6/en/tools/places/page1.asp 20. Digital Sky Survey System, http://archive.stsci.edu/cgi-bin/dss_form
Metric Rectification to Estimate the Aspect Ratio of Camera-Captured Document Images Junhee Park and Byung-Uk Lee Department of Electronics Engineering, Ewha W. University, Seoul 120-750, Korea
[email protected],
[email protected]
Abstract. Document images from mobile phone cameras and digital cameras suffer from geometric distortion; therefore distortion correction is desired for better character recognition. We propose a method to calculate the aspect ratio of planar documents using 3D perspective projection of quadrangles without using vanishing points. The proposed method is based on estimation of direction of a pair of parallel lines from their projected image. We verify the accuracy of our method from experimental results at various viewing angles. Our contribution can be applied to increase the character recognition rate by compensating for distorted documents with the accurate aspect ratio. Keywords: Metric rectification, aspect ratio, 3D shape recovery, camera projection, perspective projection.
1 Introduction

Automatic character recognition using mobile phone cameras and digital cameras has been an active research area. However, there is unavoidable geometric distortion in camera-captured document images because the camera orientation is not perpendicular to the documents being imaged. This distortion deteriorates the character recognition rate, so after image capture the geometric distortion needs to be corrected. In this paper, we propose a method to calculate the aspect ratio of rectangular documents using their 3D perspective projection image. The proposed correction increases the character recognition rate by compensating for distorted documents with the accurate aspect ratio. Hartley and Zisserman [2] described metric rectification in detail using the concept of vanishing points. Liebowitz and Zisserman [6] proposed a rectification method by estimating vanishing points with a known angle and length ratio. Zhang et al. [11] applied the above method to moving vehicles for traffic scene surveillance. Kanatani [8] proposed an algorithm to compute the 3D orientations of three corner edges from their projection images if the corner edges are known to be perpendicular. Liu et al. [10] presented a method for determining a camera location from straight line correspondences. They assumed that the focal length of the camera is known and that the 2D to 3D line or point correspondences are given. Chen et al. [7] proposed a metric rectification method for planar homography based on a closed-form algebraic solution of the IAC (image of the absolute conic). They used vanishing lines and the image of one arbitrary circle for rectification. Wilczkowiak et al. [4] proposed a method to calibrate cameras,
recover shapes of parallelepipeds, and estimate the relative pose of all entities based on the observation of constraints such as coplanarity, parallelism, or orthogonality that are often embedded in parallelepipeds. They first calculated intrinsic and orientation parameters using a factorization-based algorithm and then found position and size parameters using linear least square estimation. This approach is adopted to interactive 3D reconstruction. Lucchese [9] presented a closed-form pose estimation from metric rectification using 3D points of a planar pattern and its image. Jagannathan et al. [12] described a spectrum of algorithms for rectification of document images for camera-based analysis and recognition. Clues like document boundaries, page layout information, organization of text and graphics components, a prior knowledge of the script or selected symbols etc. are effectively used for estimating vanishing points to remove the perspective effect and computing the frontal view needed for a typical document image analysis algorithm. Recently Liang et al. [1] presented a geometric rectification framework for restoring the frontal-flat view of documents from a single camera-captured image. Their approach estimated the 3D document shape from texture flow information obtained directly from the image without requiring additional 3D data or prior camera calibration. They used directional filters to extract linear structures of characters, estimated vertical and horizontal texture flows, and vanishing points, and then determined the camera focal length and the plane orientation. Finally they estimated 3D document shapes, and rectified geometric distortion of camera-captured document images. Their framework provided a unified solution for both planar and curved documents. Iwamura et al. [5] proposed a layout-free rectification method which does not require information such as the borders of a document, parallel textlines, a stereo image or a video image. However this method requires English font information to calculate vanishing lines from a specific English character font. In this paper, we present an algorithm to calculate the 3D orientation of parallel lines from their projected image without estimating vanishing points. First we calculate two 3D direction vectors from parallel lines of a document plane and then obtain the normal vector of the document plane from their cross product. We propose a method to calculate the aspect ratio of the document using the plane normal vector. Our intuitive derivation is easy to understand, and yields the same result as the one using the concept of vanishing points. This paper is organized as follows. Section 2 describes the relationship between a 3D line and its projection, Section 3 presents the orientation of the plane composed of two pairs of perpendicular lines and Section 4 contains the calculation of the aspect ratio. Experimental results and conclusions are given in Section 5 and 6.
2 Relationship between a 3D Line and Its Projection

We show the relationship between a line direction in 3D space and in a projected image plane, which will be used to recover the 3D direction from projections. Let l be a line in 3D, and let Q0 = (X0, Y0, Z0) and Q1 = (X1, Y1, Z1) be two points on the line l. Let l' be the projection of the line l, and let P0 and P1 be the two points on l' corresponding to Q0 and Q1, respectively, as shown in Fig. 1. Assume that the 3D line l is projected onto the image plane with focal length f.
Fig. 1. A line l in 3D and its projection l '
A point (X_0, Y_0, Z_0)^T is on the line l, and let the slope of the line be (a_0, b_0, c_0)^T; then the equation of this line is as follows:

l = \begin{pmatrix} X_0 \\ Y_0 \\ Z_0 \end{pmatrix} + \begin{pmatrix} a_0 \\ b_0 \\ c_0 \end{pmatrix}.

We use the perspective projection model, and the focal length is f; therefore P_0 and P_1 are given by

P_0 = f \begin{pmatrix} X_0/Z_0 \\ Y_0/Z_0 \end{pmatrix}, \qquad
P_1 = f \begin{pmatrix} X_1/Z_1 \\ Y_1/Z_1 \end{pmatrix}
    = f \begin{pmatrix} (X_0 + a_0)/(Z_0 + c_0) \\ (Y_0 + b_0)/(Z_0 + c_0) \end{pmatrix}.

We assume that the focal length f is known, and the principal point is the origin of the 2D image plane. The slope P_1 − P_0 of l', which is the line on the image plane, is calculated as follows:

P_1 - P_0 = \frac{f}{Z_0 + c_0}\left( \begin{pmatrix} a_0 \\ b_0 \end{pmatrix} - c_0 \frac{P_0}{f} \right).

Therefore the slope of l' is proportional to \begin{pmatrix} a_0 \\ b_0 \end{pmatrix} - c_0 \frac{P_0}{f}.

The 3D point Q_0 is projected to P_0 on the 2D image plane, and the slope of l' is \begin{pmatrix} a_0 \\ b_0 \end{pmatrix} - c_0 \frac{P_0}{f}; therefore the equation of the line l' is

l' = P_0 + \lambda \left( \begin{pmatrix} a_0 \\ b_0 \end{pmatrix} - c_0 \frac{P_0}{f} \right),

where λ is a scalar parameter representing the line l'.
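As a quick numerical sanity check of the slope relation above (the values below are arbitrary and chosen only for illustration):

import numpy as np

f = 800.0                                   # assumed focal length (pixels)
Q0 = np.array([10.0, -5.0, 100.0])          # a 3D point (X0, Y0, Z0)
d  = np.array([2.0, 3.0, 0.5])              # line direction (a0, b0, c0)
Q1 = Q0 + d                                 # a second point on the line

P0 = f * Q0[:2] / Q0[2]                     # projection of Q0
P1 = f * Q1[:2] / Q1[2]                     # projection of Q1

slope_img  = P1 - P0                        # observed slope of l'
slope_pred = d[:2] - d[2] * P0 / f          # (a0, b0)^T - c0 * P0 / f

# parallel vectors have a vanishing 2D cross product (up to rounding)
print(slope_img[0] * slope_pred[1] - slope_img[1] * slope_pred[0])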
3 The Orientation of the Plane Composed of the Two Pairs of Perpendicular Lines

Using the mapping of a 3D line direction vector to a 2D perspective image plane, we can derive the plane normal vector from a parallelogram in 3D. Let the vertices of a parallelogram in 3D be Q1, Q2, Q3 and Q4, and their perspective projection vertices on the 2D image plane be P1, P2, P3 and P4, respectively, as shown in Fig. 2.
Fig. 2. A parallelogram in 3D and its projection
Let the orientation of the line Q1Q2 be (a_1, b_1, c_1)^T. Define P_{ij} ≡ P_j − P_i. Then P_{12} is represented by equation (1):

P_{12} = \begin{pmatrix} a_1 \\ b_1 \end{pmatrix} - c_1 \frac{P_1}{f}.    (1)

Since the lines Q1Q2 and Q4Q3 are parallel, they have the same direction vector (a_1, b_1, c_1)^T; therefore the direction vector P_{43} is given as (2):

P_{43} = k \left( \begin{pmatrix} a_1 \\ b_1 \end{pmatrix} - c_1 \frac{P_4}{f} \right),    (2)

where k is a scale factor. If equation (2) is divided by k, we have

\frac{P_{43}}{k} = \begin{pmatrix} a_1 \\ b_1 \end{pmatrix} - c_1 \frac{P_4}{f}.    (3)

By (1) − (3), we obtain

P_{12} - \frac{P_{43}}{k} = c_1 \left( \frac{P_4}{f} - \frac{P_1}{f} \right) = c_1 \frac{P_{14}}{f}.    (4)
Here, we interpret the geometric meaning of equation (4). The point P5 in Fig. 3 is the intersection of P_{14} and the line l'', which is parallel to P_{43} and passes through P2. The RHS of (4) is a point on the line P_{14}, and the LHS of (4) is a point on the line passing through P2 with direction P_{43}, which is the line l'' in Fig. 3. Therefore c_1 P_{14}/f is the intersection of the line l'' and P_{14}, which is P5.
Fig. 3. A point P5 is the intersection of P14 and the line l′′ , which passes through P2 and is parallel to P43
Therefore we conclude that:

P_{12} - \frac{P_{43}}{k} = \frac{c_1}{f} P_{14} \equiv P_{15},

-\frac{1}{k} P_{43} \equiv P_{25},

P_{12} + P_{25} = P_{15}.
Let the 90-degree rotation matrix in the 2D plane be

Rot_{90} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}.

Then P_{14}^{\perp} = Rot_{90} P_{14} is the 90-degree rotated vector of P_{14} in the 2D plane, which is perpendicular to P_{14}. Therefore, we can obtain k from equation (4) after applying the dot product with P_{14}^{\perp}:

P_{14}^{\perp} \cdot \left( P_{12} - \frac{P_{43}}{k} \right) = P_{14}^{\perp} \cdot c_1 \frac{P_{14}}{f} = 0,

k = \frac{P_{14}^{\perp} \cdot P_{43}}{P_{14}^{\perp} \cdot P_{12}}.    (5)
If we know f, we obtain c_1 by applying k in equation (4), and a_1 and b_1 by applying c_1 in equation (1), as shown in equations (6) and (7).
c_1 = f \left( \frac{P_{14}^T P_{12}}{P_{14}^T P_{14}} - \frac{P_{14}^T P_{43}}{k\, P_{14}^T P_{14}} \right),    (6)

\begin{pmatrix} a_1 \\ b_1 \end{pmatrix} = P_{12} + \left( \frac{P_{14}^T P_{12}}{P_{14}^T P_{14}} - \frac{P_{14}^T P_{43}}{k\, P_{14}^T P_{14}} \right) P_1.    (7)

We observe that the focal length f influences the estimation of c_1, which is the component of the line direction perpendicular to the image plane. The focal length f does not affect a_1 and b_1, which are the components of the line direction parallel to the image plane. The image point (u, v)^T, which is the projection onto the 2D image of the 3D point (X, Y, Z)^T, is given as follows, where (u_0, v_0)^T denotes the principal point:

\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} =
\begin{pmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} =
\begin{pmatrix} fX + u_0 Z \\ fY + v_0 Z \\ Z \end{pmatrix} =
\begin{pmatrix} fX/Z + u_0 \\ fY/Z + v_0 \\ 1 \end{pmatrix}.
The direction of the line Q1Q2 is (a_1, b_1, c_1)^T. Since Q4Q3 is parallel to Q1Q2, its direction vector is the same. We obtain the direction of the other parallel line pair using the same method; if vanishing points were employed to calculate the direction of two parallel lines, we would obtain the same result. Let (a_2, b_2, c_2)^T be the direction of the two parallel lines Q1Q4 and Q2Q3. We first calculate the 3D line direction (a_1, b_1, c_1)^T from the projected image of the first pair of parallel 3D lines, and then calculate the other line direction (a_2, b_2, c_2)^T using the same method. Thereby we obtain two 3D line directions, and we calculate the normal vector n of the 3D plane containing Q1, Q2, Q3 and Q4 by the cross product of the two line direction vectors:

n = (a_1, b_1, c_1)^T \times (a_2, b_2, c_2)^T \equiv (a, b, c)^T.
4 Calculation of the Aspect Ratio

The equation of a 3D plane r = (X, Y, Z)^T with normal vector n is represented as

n^T r = aX + bY + cZ = d.

We calculate the 3D plane normal vector n = (a, b, c)^T by the above equations (6) and (7), and the plane parameter d can be obtained from Q1 on this plane. Let \tilde{P}_1 = (u_1, v_1, f)^T be the 3D coordinates of P1(u_1, v_1) on the 2D image plane, since the z coordinate of the 2D image plane is f. Q1 is on the line connecting the origin and \tilde{P}_1; therefore it is represented by a scale factor s_1:

Q_1 = s_1 \tilde{P}_1.

Since the point Q1 is on the plane r,

d = n^T Q_1 = s_1 n^T \tilde{P}_1.

If the value of s_1 changes, the size of the document on the plane changes correspondingly, but the aspect ratio stays the same. Therefore we can set s_1 to 1 for simplicity. Then the above equation becomes d = n^T \tilde{P}_1. We calculate the coordinates of Q2, Q3 and Q4 using this plane equation:

Q_2 = s_2 \tilde{P}_2, \quad Q_3 = s_3 \tilde{P}_3, \quad Q_4 = s_4 \tilde{P}_4,

s_2 n^T \tilde{P}_2 = d, \quad s_3 n^T \tilde{P}_3 = d, \quad s_4 n^T \tilde{P}_4 = d,

s_2 = d / n^T \tilde{P}_2, \quad s_3 = d / n^T \tilde{P}_3, \quad s_4 = d / n^T \tilde{P}_4.

We calculate the normal vector of the plane from the two pairs of parallel lines. The 2D image-plane quadrangle is then back-projected onto the 3D plane to obtain the undistorted positions Q1, Q2, Q3 and Q4. Therefore, the aspect ratio is restored to the original value. Using this method, we correct the geometric distortion using the perspective mapping parameters between the four vertices of the parallelogram.
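A compact NumPy sketch of the whole estimation (not from the paper; the corner ordering, the principal-point convention and the returned width/height ratio are assumptions):

import numpy as np

def line_direction(P1, P2, P3, P4, f):
    """3D direction (a, b, c) of the parallel lines Q1Q2 and Q4Q3 from their
    projected endpoints, following Eqs. (1)-(7)."""
    P12, P43, P14 = P2 - P1, P3 - P4, P4 - P1
    P14_perp = np.array([-P14[1], P14[0]])                    # Rot90 * P14
    k = np.dot(P14_perp, P43) / np.dot(P14_perp, P12)         # Eq. (5)
    c = f * (np.dot(P14, P12) - np.dot(P14, P43) / k) / np.dot(P14, P14)  # Eq. (6)
    a, b = P12 + (c / f) * P1                                 # Eq. (7)
    return np.array([a, b, c])

def aspect_ratio(corners, f):
    """corners: four image points P1..P4 in parallelogram order
    (P1P2 parallel to P4P3, P1P4 parallel to P2P3), principal point subtracted."""
    P1, P2, P3, P4 = [np.asarray(p, dtype=float) for p in corners]
    d1 = line_direction(P1, P2, P3, P4, f)     # direction of Q1Q2
    d2 = line_direction(P1, P4, P3, P2, f)     # direction of Q1Q4
    n = np.cross(d1, d2)                       # plane normal (a, b, c)
    P_tilde = np.column_stack([np.vstack([P1, P2, P3, P4]), np.full(4, f)])
    d = n @ P_tilde[0]                         # plane offset with s1 = 1
    s = d / (P_tilde @ n)                      # scale factors s1..s4
    Q = s[:, None] * P_tilde                   # back-projected 3D corners
    return np.linalg.norm(Q[1] - Q[0]) / np.linalg.norm(Q[3] - Q[0])

For a fronto-parallel rectangle this reduces to the ratio of the projected side lengths, which is a convenient check of the implementation.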
5 Experimental Results Fig. 4 is a typical example of distortion correction using our proposed method. The restored aspect ratio of the document is 1.40, and the ground truth is 1.41.
Fig. 4. (a) Distorted image and (b) distortion-corrected image
Fig. 5. Test images from various viewing angles

Table 1. The aspect ratio of our proposed method and the Liang method. (1) Focal length f calculated from the orthogonality constraint; (2) accurate focal length obtained from separate camera calibration.

image #   proposed method   Liang method (1)   Liang method (2)
1         0.78              0.78               0.78
2         0.78              0.81               0.78
3         0.78              0.75               0.78
4         0.78              0.82               0.78
5         0.79              0.89               0.79
6         0.79              0.63               0.79
7         0.78              1.36               0.78
8         0.80              0.52               0.80
9         0.79              0.20               0.79
We experiment with various viewing angles between a 3D document plane and the principal axis of a camera and calculate the aspect ratio using our proposed method. Fig. 5 shows nine test images with various viewing angles. The ground truth is 0.78. Table 1 compares the experimental results obtained by our proposed method and by the Liang method. Our method shows stable aspect ratio of 0.78-0.80. Liang’s method
calculates the focal length using the orthogonality of horizontal and vertical text lines, which is subject to error. Therefore the resulting aspect ratio spans a range of 0.20-1.36. However, Liang's results are the same as ours if a more accurate focal length is applied. From the table, we verify that our proposed algorithm restores the aspect ratio with high accuracy. We also observe that the estimated aspect ratio is stable against viewing angle changes.
6 Conclusions

In this paper, we derive equations to calculate the 3D direction of a line from a projected image of a pair of parallel lines in 3D space without estimating vanishing points. We calculate two 3D direction vectors of a document plane and then obtain the normal vector of the document from the cross product of the two vectors on the plane. We propose a method to calculate the aspect ratio using the plane normal vector. We correct the geometric distortion using the perspective mapping parameters between the four vertices of a distorted image and the four vertices of an undistorted quadrangle. The experimental results show that we obtain an accurate aspect ratio, independent of the angle between the 3D document plane and the principal axis of the camera. Acknowledgments. This work was supported in part by the Ministry of Knowledge Economy (MKE), Korea Industrial Technology Foundation (KOTEF) through the Human Resource Training Project for Strategic Technology, and the Acceleration Research Program of the Ministry of Education, Science and Technology of Korea and the Korea Science and Engineering Foundation.
References 1. Liang, J., DeMenthon, D., Doermann, D.: Geometric rectification of camera-captured document images. IEEE Trans. Pattern Analysis and Machine Intelligence 30(4), 591–605 (2008) 2. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. CUP, Cambridge (2003) 3. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Analysis and Machine Intelligence 22(11), 1330–1334 (2000) 4. Wilczkowiak, M., Strurm, P., Boyer, E.: Using Geometric Constraints through Parallelepipeds for Calibration and 3D Modeling. IEEE Trans. Pattern Analysis and Machine Intelligence 27(2), 194–207 (2005) 5. Iwamura, M., Niwa, R., Horimatsu, A., Kise, K., Uchida, S., Omachi, S.: Layout-Free Dewarping of Planar Document Images. In: SPIE Conf. Document recognition and retrieval (2009) 6. Liebowitz, D., Zisserman, A.: Metric Rectification for Perspective Images of Planes. In: IEEE Conf. Computer Vision and Pattern Recognition, pp. 482–488 (1998) 7. Chen, Y., Ip, H.H.S.: Planar Metric Rectification by Algebraically Estimating The Image of the Absolute Conic. In: IEEE Conf. Pattern Recognition, vol. 4, pp. 88–91 (2004) 8. Kanatani, K.: 3D recovery of polyhedra by rectangularity heuristics. Industrial Applications of Machine Intelligence and Vision, 210–215 (1989)
9. Lucchese, L.: Closed-Form Pose Estimation from Metric Rectification of Coplanar Points. IEE Proceedings: Vision, Image, and Signal Processing 153(3), 364–378 (2006) 10. Liu, Y., Huang, T.S., Faugeras, O.D.: Determination of Camera Location from 2-D to 3-D Line and Point Correspondences. IEEE Trans. Pattern Analysis and Machine Intelligence 12(1), 28–37 (1990) 11. Zhang, Z., Li, M., Huang, K., Tan, T.: Robust automated ground plane rectification based on moving vehicles for traffic scene surveillance. In: IEEE Conf. Image Processing, pp. 1364–1367 (2008) 12. Jagannathan, L., Jawahar, C.V.: Perspective Correction Methods for Camera-Based Document Analysis. Camera-Based Document Analysis and Recognition, 148–154 (2005)
Active Learning Image Spam Hunter Yan Gao and Alok Choudhary Dept. of EECS, Northwestern University, Evanston, IL, USA
[email protected],
[email protected]
Abstract. Image spam annoys email users around the world. Most previous work on image spam detection focuses on supervised learning approaches. However, it is costly to obtain enough trustworthy labels for learning, especially for an adversarial problem where spammers constantly modify patterns to evade the classifier. To address this issue, we employ the principle of active learning, where the learner guides the user to label as few images as possible while maximizing the classification accuracy. Active learning is well suited for online image spam filtering since it dramatically reduces the labeling cost with negligible overhead while maintaining high recognition performance. We present and compare two active learning algorithms, based on an SVM and a Gaussian process classifier, respectively. To the best of our knowledge, we are the first to apply active learning to the task of spam image filtering. Experimental results demonstrate that our active learning based approaches quickly achieve a detection rate above 99% and a false positive rate below 0.5% with only a small number of images labeled.
1 Introduction
Global spam volumes have increased rapidly over the past five years. Email spam accounted for 96.5% of incoming emails received by businesses by June 2008 [1], and cost the US government more than $70 billion in management expenses annually. Among all spam emails, approximately 30% are image spam, which embeds the spam message in an image attachment, as reported by McAfee [2] in 2007. Detecting image spam is a typical image content recognition problem. In the arms race with anti-spam technology, spammers constantly employ different image manipulation techniques, such as the tricks used in CAPTCHAs (Completely Automated Public Turing Test to Tell Computers and Humans Apart), to embed spam messages into images. These tricks include adding speckles and dots in the image background, varying borders, randomly inserting subject lines, rotating the images slightly, and so on. Figure 1 shows some examples of image spam. Previous work has leveraged OCR techniques and text classifiers for image spam detection. However, CAPTCHA-style manipulations easily degrade the recognition rate of an OCR system, which in turn affects the accuracy of the text classifier. As an improvement, many recent works have been targeting
Fig. 1. Spam image examples
automated and adaptive content-based image spam detection, e.g., Gao et al.’s image spam hunter [3], Dredze et al.’s fast image spam classifier [4], and near-duplicate image spam detection [5]. Most of them employ supervised statistical machine learning algorithms to build a classifier that filters spam images using discriminative image features. Although supervised learning algorithms have achieved good accuracy for image spam detection, getting sufficient labeled images for robust training is always expensive, especially for an adversarial problem where the model needs to be retrained quite often. By leveraging the principle of active learning [6,7,8,9,10], we can drastically reduce the labeling cost by identifying the most informative examples for users to label. Hence in this paper we propose a system prototype of an active learning image spam hunter to solve the adversarial spam detection problem. Our goal is to create a strong classifier while requesting as few labels as possible. We present and compare two active learning algorithms, based on an SVM and a Gaussian process classifier, respectively. These two algorithms are tested on an image spam dataset collected from January 2006 to March 2009, which contains both positive spam images collected from our email server and negative natural images downloaded from the Internet. Our approaches on average require very few labeled images in a corpus to achieve a >99% detection rate and a <0.5% false positive rate. The remainder of this paper is organized as follows. Section 2 presents an overview of the system design and operation flow of the active learning image spam hunter. In Section 3, we describe two active learning algorithms, one based on an SVM classifier and the other on a Gaussian process classifier. We present the image statistical features adopted to discriminate natural and spam images in Section 4. In Section 5, we use extensive experiments to validate and compare the effectiveness of the proposed system and algorithms. Finally, we conclude and summarize our future work in Section 6.
Fig. 2. Prototype system diagram (a stream of emails with image attachments is fed to the image spam classifier, which outputs the result (spam or not?); the active learning process decides whether to request a label, and newly labeled images update the classifier)
2 System Framework
In this section, we present an active learning system prototype of the image spam hunter, as shown in Figure 2, to differentiate spam images from normal image attachments. The whole dataset is split into two parts: the labeled dataset and the unlabeled dataset. The labeled dataset is denoted as $X_L = \{x_i \mid i \in L\}$, with labels $Y_L = \{y_i \in \{-1, +1\} \mid i \in L\}$, where $+1$ represents a spam image and $-1$ a non-spam image. The unlabeled dataset is denoted as $X_U = \{x_i \mid i \in U\}$. We assume $L = [1, n]$ and $U = [n+1, N]$. Let $X = X_L \cup X_U$. When the system is first used, $X_L$ is the empty set $\emptyset$ and $X_U$ may cover the full dataset $X$. We randomly choose a few ($< 10$) spam images and non-spam images to label and take them as the initial labeled dataset for training the first-round classifier. The core of this prototype system is an active learning algorithm with a data sample selection criterion $AL(y(x))$, where $y(x)$ is the classifier induced by the learning algorithm. As long as an appropriate mathematical quantity $AL(y(x))$ is defined, we can turn any supervised learning algorithm into an active learning algorithm. The active learning criterion $AL(y(x))$ efficiently guides the users to label as few images as possible while maximizing the recognition performance of the classifiers. More formally, at each step of the active learning algorithm, we first run the supervised learning algorithm with the current $X_L$ and build the image spam classifier $y(x)$. Next we select

$$x^{*} = \arg\max_{x \in X_U} AL(y(x)) \quad (1)$$

to label and get

$$X_L \Leftarrow X_L + x^{*} \quad (2)$$
$$X_U \Leftarrow X_U - x^{*}. \quad (3)$$
With the new $X_L$, the above active learning step is iterated until the recognition accuracy of the classifier reaches a satisfactory level. We discuss the choice of the number of iterations in Section 5. In this way, a continuously adapting classifier is generated and is ready to filter the incoming batch of new emails with image attachments.
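As an illustration only, a minimal Python sketch of this selection loop (Eqs. 1-3) is given below; the callables `train`, `al_score` and `request_label` are hypothetical placeholders for the classifier training routine, the criterion $AL(y(x))$ and the user labeling step, and are not part of the described system.

```python
import numpy as np

def active_learning_loop(X_labeled, y_labeled, X_unlabeled,
                         train, al_score, request_label, n_rounds):
    """Generic pool-based active learning loop (cf. Eqs. 1-3).

    train(X, y)      -> classifier fitted on the current labeled pool
    al_score(clf, X) -> one AL(y(x)) value per unlabeled sample
    request_label(x) -> label supplied by the user (+1 spam, -1 non-spam)
    """
    for _ in range(n_rounds):
        clf = train(X_labeled, y_labeled)                # retrain on current X_L
        scores = al_score(clf, X_unlabeled)              # AL(y(x)) for every x in X_U
        i = int(np.argmax(scores))                       # Eq. (1): most informative sample
        x_star = X_unlabeled[i]
        y_star = request_label(x_star)                   # ask the user for a label
        X_labeled = np.vstack([X_labeled, x_star])       # Eq. (2): X_L <- X_L + x*
        y_labeled = np.append(y_labeled, y_star)
        X_unlabeled = np.delete(X_unlabeled, i, axis=0)  # Eq. (3): X_U <- X_U - x*
    return train(X_labeled, y_labeled)
```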
3 Active Learning Algorithms
We present two different active learning algorithms in this section. One is adapted from the probabilistic output of an SVM (support vector machine) [11,12]. The other is built on top of a Gaussian process (GP) classifier [13,14].
3.1 Active Learning SVM
Given the labeled data set $X_L = \{x_i, y_i\}_{i=1}^{n}$, the primal problem of a linear SVM solves the following quadratic program to obtain the maximum-margin linear classifier [15,16]:

$$\min_{w} \ \frac{\|w\|^2}{2} + C \sum_i \xi_i \quad (4)$$
$$\text{s.t. } y_i (x_i \cdot w + b) \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0 \ \ \forall i. \quad (5)$$

The solution of the above constrained optimization problem is usually obtained by solving the Wolfe dual problem,

$$\max \ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, x_i \cdot x_j \quad (6)$$
$$\text{s.t. } 0 \le \alpha_i \le C \ \ \forall i \ \text{ and } \ \sum_i \alpha_i y_i = 0. \quad (7)$$

It shows that the solution is given by

$$w = \sum_{i=1}^{N_s} \alpha_i y_i x_i, \quad (8)$$

where $N_s$ indicates the number of support vectors for the classifier. Therefore, the classification result of a new data vector $x$ is

$$y = \operatorname{sign}\left( \sum_{i=1}^{N_s} \alpha_i y_i \, x_i \cdot x + b \right). \quad (9)$$
It is easy to observe that in both the Wolfe dual problem, Equation 6, and the final classifier, Equation 9, the data vectors appear only in the form of dot products. This enables us to construct a nonlinear SVM by leveraging the kernel trick [16], i.e., to solve the following problem

$$\max \ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, k(x_i, x_j) \quad (10)$$
$$\text{s.t. } 0 \le \alpha_i \le C \ \ \forall i, \quad (11)$$
$$\sum_i \alpha_i y_i = 0, \quad (12)$$

where $k(x_i, x_j)$ is a kernel function which defines the dot product of the nonlinearly transformed data vectors $\phi(x_i)$ and $\phi(x_j)$ in a reproducing kernel Hilbert space (we use the Gaussian radial basis kernel in our experiments), i.e.,

$$k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j). \quad (13)$$

Similarly, the final nonlinear SVM classifier is

$$y = \operatorname{sign}\left( \sum_{i=1}^{N_s} \alpha_i y_i \, k(x_i, x) + b \right). \quad (14)$$
Note that we do not need to explicitly define the nonlinear transformation $\phi(x)$, since both the optimization problem in Equation 10 and the solution in Equation 14 involve only the kernel function. As shown by Madevska-Bogdanova et al. [11], we can transform the output of a support vector machine into a posterior distribution by using a sigmoid function, i.e.,

$$p(y = 1 \mid x) = \frac{1}{1 + \exp\left\{ k \left( \sum_{i=1}^{N_s} \alpha_i y_i \, k(x_i, x) + b \right) \right\}}, \quad (15)$$

where $k$ is a constant quantity which can be estimated from the training data. With this posterior probability of the predicted label given the data point, a natural active learning criterion is based on the uncertainty of the predicted label. Let $p_1 = p(y = 1 \mid x)$. The uncertainty is naturally defined by an entropy term,

$$H(y(x)) = -p_1 \log p_1 - (1 - p_1) \log(1 - p_1). \quad (16)$$

Therefore, for this active learning SVM, we define

$$AL(y(x)) = H(y(x)). \quad (17)$$

The rationale behind the criterion is that the active learning algorithm should guide the users to label the images about which the classifier is least confident.
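For illustration, the sigmoid mapping of Equation 15 and the entropy criterion of Equations 16-17 can be sketched as follows; using scikit-learn's SVC and the particular sign convention of the constant k are assumptions, not details given in the text.

```python
import numpy as np

def svm_posterior(decision_values, k=1.0):
    """Map SVM decision values f(x) to p(y=1|x) with a sigmoid (cf. Eq. 15).
    The sign and magnitude of k depend on the label convention and are
    estimated from training data in the paper."""
    return 1.0 / (1.0 + np.exp(k * np.asarray(decision_values)))

def entropy_criterion(p1, eps=1e-12):
    """AL(y(x)) = H(y(x)) from Eqs. 16-17."""
    p1 = np.clip(p1, eps, 1.0 - eps)   # avoid log(0)
    return -p1 * np.log(p1) - (1.0 - p1) * np.log(1.0 - p1)

# Possible usage with an RBF-kernel SVM (library choice is an assumption):
#   from sklearn.svm import SVC
#   clf = SVC(kernel="rbf").fit(X_labeled, y_labeled)
#   scores = entropy_criterion(svm_posterior(clf.decision_function(X_unlabeled)))
#   query_index = int(np.argmax(scores))
```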
3.2 Active Learning Gaussian Process Classifier
Given the labeled dataset $X_L$, an unlabeled data point $x_u$, and $X_{Lu} = X_L + x_u$, we introduce a latent variable $z_i$, which is the soft label of the data point $x_i$. We denote $Z_{Lu} = \{z_i \mid i \in L + u\}$. In a GP classifier, the joint distribution of $Z_{Lu}$ is assumed to be a joint Gaussian with zero mean and covariance defined by a kernel function $k(\cdot,\cdot)$ applied to $x_i$ and $x_j$, i.e.,

$$p(Z_{Lu} \mid X_{Lu}) \sim \mathcal{N}(0, K), \quad (18)$$

where $K$ is an $N \times N$ matrix with elements $k_{ij} = k(x_i, x_j)$. We denote by $K_{LL}$ the sub-matrix of $K$ that is induced by $X_L$. Following Kapoor et al. [14], we assume $p(y \mid z)$ to be a Gaussian distribution $\mathcal{N}(y, \sigma^2)$. We immediately have

$$p(Z_{Lu} \mid X_{Lu}, Y_L) \propto p(Z_{Lu} \mid X_{Lu}) \, p(Y_L \mid Z_{Lu}) \quad (19)$$
$$= p(Z_{Lu} \mid X_{Lu}) \prod_{i \in L} p(y_i \mid z_i). \quad (20)$$

Denoting by $y_u$ the label of $x_u$ we would like to predict, we are interested in inferring the following quantity:

$$p(y_u \mid X_{Lu}, Y_L) = \int_{Z_{Lu}} p(y_u \mid Z_{Lu}) \, p(Z_{Lu} \mid X_{Lu}, Y_L) \, dZ_{Lu}. \quad (21)$$

Denoting $k(x_u) = [k(x_u, x_1), k(x_u, x_2), \ldots, k(x_u, x_n)]^T$ and letting $I$ be the identity matrix, by following [14] we have

$$p(y_u \mid X_{Lu}, Y_L) = \mathcal{N}(\bar{y}_u, \bar{\sigma}_u^2), \quad (22)$$

where

$$\bar{y}_u = k(x_u)^T (\sigma^2 I + K_{LL})^{-1} Y_L, \quad (23)$$
$$\bar{\sigma}_u^2 = k(x_u, x_u) - k(x_u)^T (\sigma^2 I + K_{LL})^{-1} k(x_u) + \sigma^2. \quad (24)$$

Denoting $p_1 = p(y_u = 1 \mid X_{Lu}, Y_L)$, we can define the entropy using Equation 16, and the active learning criterion takes exactly the form of Equation 17. It is worth noticing that Kapoor et al. [14] defined their active learning criterion for this GP classifier as

$$AL(y(x)) = -\frac{|\bar{y}_u|}{\bar{\sigma}_u}. \quad (25)$$

In this binary classification problem, it is easy to verify that this is equivalent to our entropy uncertainty measure.
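The predictive mean and variance of Equations 22-24 and the criterion of Equation 25 reduce to a few lines of linear algebra; the NumPy sketch below is an illustration of these formulas rather than the authors' implementation.

```python
import numpy as np

def gp_predict(K_LL, k_u, k_uu, Y_L, sigma2):
    """Predictive mean and variance of the soft label (cf. Eqs. 22-24).

    K_LL   : n x n kernel matrix over the labeled data
    k_u    : length-n vector k(x_u) = [k(x_u, x_1), ..., k(x_u, x_n)]
    k_uu   : scalar k(x_u, x_u)
    Y_L    : length-n vector of labels in {-1, +1}
    sigma2 : noise variance of p(y|z)
    """
    A = sigma2 * np.eye(len(Y_L)) + K_LL
    w = np.linalg.solve(A, k_u)        # (sigma^2 I + K_LL)^{-1} k(x_u)
    mean = w @ Y_L                     # Eq. (23)
    var = k_uu - k_u @ w + sigma2      # Eq. (24)
    return mean, var

def gp_uncertainty(mean, var):
    """Kapoor et al.'s criterion AL = -|mean|/std (Eq. 25);
    larger values correspond to less confident predictions."""
    return -abs(mean) / np.sqrt(var)
```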
4 Image Features
We extract 23 discriminant image statistical features [17] for our active learning image spam hunter. They cover the properties of color, texture, shape and appearance.
For color statistics, we first build a 10³-bin color histogram in the joint RGB space by quantizing each color band into 10 different levels. The entropy of this histogram is computed as the first statistic. We further set up one 100-bin histogram for each of the 3 color channels. Then the discreteness, mean, variance, skewness, and kurtosis of each of the three histograms are calculated, which adds another 5 × 3 = 15 statistics. Here the discreteness is the sum of the absolute differences between any two consecutive bins. So altogether we collect 16 color statistics. The local binary pattern (LBP) [18] is used to analyze the texture statistics. We extract a 59-bin texture histogram, including 58 bins for all the different uniform local binary patterns, i.e., the patterns with at most two 0–1 transitions in an 8-bit stream, and an additional bin for all other non-uniform local binary patterns. The entropy of the LBP histogram is calculated as one texture statistic. Shape information is also considered as an important feature in our system. A 40 × 8 = 320-dimensional gradient magnitude-orientation histogram is built to describe the shape information. The entropy of the histogram is the first shape feature, and the second feature is the difference between the energies in the lower and higher frequency bands. Then we use the total number of edges and the average length of the edges, obtained by running a Canny edge detector [19], as another two shape features. Thus there are 4 shape statistics in total. Last but not least, we use the spatial correlogram [20] of the grey-level pixels within a 1-neighborhood to represent appearance information. The first feature is the average variance ratio over all slices, i.e., the ratio between the variance of a slice and the radius of the symmetric range around the mean of the slice that accounts for 60% of the total counts of the slice. Then histograms are built from each slice of the correlogram, and the average skewness of these histograms is calculated as the second feature.
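To make the color statistics concrete, the sketch below computes the joint-histogram entropy and the per-channel statistics described above; the binning range and normalization are assumptions where the text leaves them unspecified.

```python
import numpy as np

def color_statistics(image):
    """Compute the 16 color statistics of Section 4 for an H x W x 3 uint8 image."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    # joint RGB histogram with 10 levels per band (10^3 bins)
    joint, _ = np.histogramdd(pixels, bins=(10, 10, 10), range=[(0, 256)] * 3)
    p = joint.ravel() / joint.sum()
    p = p[p > 0]
    features = [-(p * np.log(p)).sum()]                  # joint histogram entropy
    for c in range(3):                                   # per-channel statistics
        h, _ = np.histogram(pixels[:, c], bins=100, range=(0, 256))
        h = h / h.sum()
        centers = np.arange(100) + 0.5
        mean = (h * centers).sum()
        var = (h * (centers - mean) ** 2).sum()
        std = np.sqrt(var) + 1e-12
        skew = (h * ((centers - mean) / std) ** 3).sum()
        kurt = (h * ((centers - mean) / std) ** 4).sum()
        discreteness = np.abs(np.diff(h)).sum()          # sum of |consecutive bin differences|
        features += [discreteness, mean, var, skew, kurt]
    return np.array(features)                            # 1 + 5 * 3 = 16 values
```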
5 Experiment
In our experiments, we report the recognition accuracy on both the active learning pool $X = X_L \cup X_U$ and the hold-out dataset $X_h$. We keep track of the recognition accuracy as active learning progresses. We also compare with a baseline setting where at each step we randomly choose an image sample from $X_U$ for the users to label. We call the active learning process active supervision and the baseline setting random supervision. We adopt the Gaussian radial basis kernel for both the SVM and the GP classifier. In the following, we present our data collection first, followed by the detailed experimental results.
5.1 Data Collection
We collected an image dataset which contains 1190 spam images and 1760 normal images. The spam images are extracted from real spam images received by 10 graduate students in our department between Jan. 2006 and Mar. 2009. These spam images were extracted from the original spam emails and all of them are converted to JPEG format. For normal image attachments, we collect photo
Fig. 3. The changes of the accuracy with the progress of active learning. (a) The progressive changes of the overall recognition accuracy, false positive and true positive rates on the active learning pool X. (b) The progressive changes of the overall recognition accuracy, false positive and true positive rates on the hold-out data pool Xh. (Each plot compares the SVM and GP classifiers under active and random supervision; x-axis: number of labeled examples added.)
images by either downloading from the photo sharing site Flickr.com, or fetching photo images from popular image search engines such as Microsoft Live image search (http://www.live.com/?scope=images).
5.2 Results Comparison
Since typical users usually deal with hundreds of emails in a one-day batch, we randomly extract a subset of 10% of the images from the whole data corpus as the test subset in each experiment. To test the generalization performance of the classifiers induced from active learning, each time we randomly sample 20% of the data from the test subset as a hold-out dataset Xh. The remaining 80% is adopted as the active learning pool X. We randomly select 10 samples from the active learning
Fig. 4. The recognition accuracy of running active learning SVM on an initialized classifier
pool to initialize the system. Figure 3 presents the experimental results averaged over 100 runs. Part (a) of Figure 3 presents the progressive changes of the overall recognition accuracy, false positive and true positive rates on X as the human adds more and more labels, while part (b) shows the results on Xh. In general, the classifiers induced from active supervision achieve much better results than those from random supervision; in other words, far fewer labels are needed for active supervision to achieve the same recognition accuracy as random supervision. In particular, the active learning SVM requires labeling fewer than 50 images in X to achieve over 99% recognition accuracy. This is also observed on the hold-out dataset, where the recognition accuracy approaches the saturation point more quickly than with the algorithms under random supervision. Moreover, with our feature setting and the selected kernel function, the active learning SVM consistently shows better performance than the active learning GP classifier. The recognition performance on Xh also shows that the induced classifier generalizes well, so that it may be employed for fully automated image spam filtering. However, it is preferable to always run in the active learning mode, as we can then ensure more than 99% accuracy by the end of the learning process. Setting aside the initialization of the system, the number of labels required to adapt the classifier to the next batch of emails is even smaller. Figure 4 presents the recognition performance of continuously running the active learning SVM on a second subset of data, initialized from the SVM classifier obtained on the first subset. The reported results are also averaged over 100 different runs. As we can clearly observe, with a well-trained initial SVM, the active learning SVM only requires 20 (<7%) images to be labeled in order to achieve over 99% recognition accuracy. That is to say, our active learning image spam hunter only needs <7% of the data to be labeled to reach the desired high detection rate. This ratio may decrease further as the dataset grows.
6 Conclusion
In conclusion, we propose to employ active learning for online image spam filtering. The design of a prototype system is presented and two different active learning algorithms are evaluated. Our extensive comparative experiments demonstrate that the active learning SVM is the better choice for this task, given the image statistical features we adopted.
Acknowledgements This work was supported in part by DOE FASTOS award number DE-FG0208ER25848, NSF HECURA CCF-0621443, NSF SDCI OCI-0724599, CNS-0830927, and NSF ST-HEC CCF-0444405.
References 1. Sophos Plc: http://www.sophos.com/pressoffice/news/articles/ 2008/07/dirtydozj-ul08.html
2. McAfee: http://www.avertlabs.com/research/blog/?p=170 3. Gao, Y., Yang, M., Zhao, X., Pardo, B., Wu, Y., Pappas, T., Choudhary, A.: Image spam hunter. In: Proc. of the 33th IEEE International Conference on Acoustics, Speech, and Signal Processing, Las Vegas, NV, USA (2008) 4. Dredze, M., Gevaryahu, R., Elias-Bachrach, A.: Learning fast classifiers for image spam. In: Proc. the 4th Conference on Email and Anti-Spam (CEAS), California, USA (2007) 5. Wang, Z., Josephson, W., Lv, Q., Charikar, M., Li, K.: Filtering image spam with near-duplicate detection. In: Proc. the 4th Conference on Email and Anti-Spam (CEAS), California, USA (2007) 6. Freund, Y., Seung, H.S., Shamir, E., Tishby, N.: Selective sampling using the query by committee algorithm. Machine Learning 28, 133–168 (1997) 7. Tong, S., Koller, D., Kaelbling, P.: Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 999–1006 (2001) 8. Goh, K.S., Chang, E.Y., Lai, W.C.: Multimodal concept-dependent active learning for image retrieval. In: Proceedings of the 12th annual ACM international conference on Multimedia. ACM, New York (2004) 9. Lawrence, N.D., Seeger, M., Herbrich, R.: Fast sparse gaussian process methods: The informative vector machine. In: Advances in Neural Information Processing Systems, vol. 15, pp. 609–616. MIT Press, Cambridge (2003) 10. MacKay, D.J.C.: Information-based objective functions for active data selection. Neural Computation 4, 590–604 (1992) 11. Madevska-Bogdanovaa, A., Nikolikb, D., Curfsc, L.: Probabilistic svm outputs for pattern recognition using analytical geometry. Neurocomputing 62, 293–303 (2004) 12. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20, 273–297 (1995) 13. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006) 14. Kapoor, A., Grauman, K., Urtasun, R., Darrell, T.: Active learning with gaussian processes for object categorization. In: Eleventh IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil (2007) 15. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121–167 (1998) 16. Aizerman, A., Braverman, E.M., Rozoner, L.I.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, 821–837 (1964) 17. Ng, T.T., Chang, S.F., Tsui, M.P.: Lessons learned from online classification of photo-realistic computer graphics and photographs. In: IEEE Workshop on Signal Processing Applications for Public Security and Forensics, SAFE (2007) 18. M¨ aenp¨ a, T.: The local binary pattern approach to texture analysis extensions and applications. PhD thesis, Infotech Oulu, University of Oulu, Finland (2003) 19. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 679–698 (1986) 20. Huang, J., Kumar, S.R., Mitra, M., Zhu, W.J., Zabih, R.: Image indexing using color correlograms. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Los Alamitos, CA, USA (1997)
Skin Paths for Contextual Flagging Adult Videos
Julian Stöttinger1,2, Allan Hanbury3, Christian Liensberger4, and Rehanullah Khan1
1 PRIP, Vienna University of Technology, Austria
2 CogVis Ltd., Vienna, Austria
3 Information Retrieval Facilities, Vienna, Austria
4 Microsoft, Redmond, Washington, USA
Abstract. User generated video content has become increasingly popular, with a large number of internet video sharing portals appearing. Many portals wish to rapidly find and remove objectionable material from the uploaded videos. This paper considers the flagging of uploaded videos as potentially objectionable due to sexual content of an adult nature. Such videos are often characterized by the presence of a large amount of skin, although other scenes, such as close-ups of faces, also satisfy this criterion. The main contribution of this paper is to introduce to this task two uses of contextual information in the form of detected faces. The first is to use a combination of different face detectors to adjust the parameters of the skin detection model. The second is through the summarization of a video in the form of a path in a skin-face plot. This plot allows potentially objectionable segments of videos to be found, while ignoring segments containing close-ups of faces. The proposed approach runs in real-time. Experiments are carried out on challenging, per-pixel annotated on-line videos from an on-line service provider to validate our approach. Large scale experiments are carried out on 200 popular public video clips from web platforms. These are chosen from the community (top-rated) and cover a large variety of different skin colors, illumination conditions, image qualities and difficulty levels. We find a compact and reliable representation for videos to flag suspicious content efficiently.
1 Introduction
User generated content has become very popular in the last decade and has significantly changed the way we consume media [2]. With the international success of several Web 2.0 websites (platforms that concentrate on the interaction aspect of the internet), the amount of publicly available content from private sources is vast and still growing rapidly. The amount of video material being uploaded every day is too large to allow the operating companies to manually classify the content of every submitted video as appropriate or objectionable. The predominant methods to overcome this problem block content based on keyword matching over user-generated tags or comments. Additionally, connected URLs can be used to check the context of origin to trap these websites [8]. This does not hold true for
Fig. 1. Most popular videos from youtube.com on July 4th, 2009
websites like YouTube that allow uploading of videos. The uploaded videos are not always labeled by (valid) keywords for the content they contain (compare Fig. 1). As no reliable automated process exists, the platforms rely on their user community: Users flag videos and depending on this, the administrators may remove the videos flagged as objectionable. This method is rather slow and does not guarantee that inappropriate videos are immediately withdrawn from circulation. A possible solution for rapid detection of objectionable content is a system that detects such content as soon as it is uploaded. As a completely automated system is not feasible at present, a system that flags potentially objectionable content for subsequent judgement by a human is a good compromise. Such a system has two important parameters: the number of harmless videos flagged as potentially objectionable (false positive rate), and the number of objectionable videos not flagged (false negative rate). In the context of precision and recall of a classification application, these two parameters present a trade-off. For a very low false negative rate, a larger amount of human effort will be needed to examine the larger number of false positives. These parameters should be adjustable by the end-users depending on the local laws (some regions have stricter restrictions on objectionable content) and the amount of human effort available. A further enhancement to reduce the amount of time required by the human judges is to flag only the segments of videos containing the potentially objectionable material, removing the need to watch the whole video, or search the video manually. One reason why videos may be considered objectionable is due to explicit sexual content. Such videos are often characterized by a large amount of skin being visible in the frame, so a commonly used component for their detection is a skin detector [8,15]. However, this characteristic is also satisfied by frames not considered as objectionable, most importantly close-ups of faces. This paper considers the flagging of user-uploaded videos as potentially objectionable. The main contribution of this paper is to introduce two uses of contextual information in the form of detected faces. The first is to use tracked faces
to adjust the parameters of the skin detection model. As it is shown in Fig. 1, user generated content contains many faces. We develop classification rules based upon a prior face detection using the well known approach from Viola et al. [13]. This work builds on [7] where it is shown that more precise adaptive color models outperform more general static models especially for reducing the high number of false positive detections. In [9] it is shown that humans need contextual information to interpret skin color correctly. We extend their approach by using a combination of face detectors: We combine frontal face detection and profile face detection in a combined tracking approach for more contextual information in the skin color representation. The second use of face information is through the summarization of a video in the form of a path in a skin-face plot. This plot allows potentially objectionable segments of videos to be extracted, while ignoring segments containing close-ups of faces. We show that the properties of the skin paths give a reliable representation of the nature of videos. The proposed approach was kept algorithmically simple, and currently runs at over 30 frames per second. A high level of performance is required in such an application to cope with the large number of uploaded videos. In Section 2 we summarize some related work, while Section 3 describes our multiple model approach for fast skin detection with face information, as well as our summarization of the videos on the skin-face plot. The experiments and results are presented in Section 4. Section 5 concludes.
2 Related Work
In computer vision, skin detection is often used as a first step in face detection, e.g. [11], and for localization in the first stages of gesture tracking systems, e.g. [1]. It has also been used in the detection of naked people [4,8]. The latter application has in most cases been developed for still images. The approaches to classify skin in images or videos can be grouped into three types of skin modeling: parametric, nonparametric and explicit skin cluster definition methods. The parametric models use a Gaussian color distribution since they assume that skin can be modeled by a Gaussian probability density function [14]. Non-parametric methods estimate the skin-color from the histogram that is generated by the used training data [5]. An efficient and widely used method is the definition of classifiers that build upon the approach of skin clustering. This thresholding of different color space coordinates is used in many approaches, e.g. [10] and explicitly defines the boundaries of the skin clusters in a given color space. The underlying hypothesis is that skin pixels have similar color coordinates in the chosen color space, which means that skin pixels are found within a given set of boundaries in a color space. Although this approach is extremely rapid, its main drawback is a comparably high number of false detections [6]. We are able to compensate for this issue in our approach by using a multiple adaptive model approach and contextual information in the form of faces.
3 Method
In this section we describe the adaptive skin-color modeling in detail. We address the problem of changing light conditions, different skin colors and varying image quality in videos by adapting the skin color model according to reliably detected faces. Figure 2 gives an overview of the main steps. The face detection and tracking (see Section 3.1) and the color conversion (Section 3.2) can be carried out in parallel on the input frame. With this data, we build our skin model and propagate it to adjust to the skin-color variations and illumination changes present (Section 3.3) for a robust skin color classification.
Fig. 2. Overview of the proposed method (input frame → face detection with all detectors and face tracking, in parallel with color space conversion → skin-color model estimation → model propagation → set of models → apply all models → classification result)
3.1 Face Detection and Tracking
Due to its real-time performance, we use the face detector proposed by Viola et al. [13], as done by Khan et al. [7]. In contrast to their approach, we run profile face detectors and frontal face detectors in parallel. We track faces in the videos to adjust the model propagation strategy. Additionally, false positive detections are likely to “pop out” of the background for a short time, which can easily be suppressed with a tracking algorithm. Color or feature based tracking techniques may fail when there are large changes in the illumination, the facial expression or the viewpoint. We rely on a geometrical approach that removes every false positive in the annotated data set and lets us track faces from one detector to the other. For every given detection of detector Di, where i = 1..n is the detector
identifier and n the number of detectors, we merge every dependable detection for any frame m by

$$\bigcup_{i=1}^{n} (D_m^i \cap D_{m-1}^i) \wedge (D_m^i \cap D_{m+3}^i) > 0.5. \quad (1)$$
We merge cases where multiple detectors give the same faces. When a head is turned, profile face detections are merged with frontal faces over time.
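One way to implement this consistency check is sketched below; interpreting the 0.5 of Eq. (1) as a relative box-overlap threshold, and the choice of overlap measure, are assumptions made purely for illustration.

```python
def overlap_ratio(a, b):
    """Relative overlap of two detection boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    return inter / min(a[2] * a[3], b[2] * b[3]) if inter > 0 else 0.0

def dependable(det, prev_dets, future_dets, thresh=0.5):
    """Keep a detection only if it overlaps a detection in the previous frame
    and in a frame a few steps ahead (cf. Eq. 1). Detections of all n
    detectors are pooled, so frontal and profile detections of the same head
    are merged over time."""
    hit_prev = any(overlap_ratio(det, d) > thresh for d in prev_dets)
    hit_next = any(overlap_ratio(det, d) > thresh for d in future_dets)
    return hit_prev and hit_next
```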
3.2 Skin-Color Modeling
Choosing a color space that is relatively invariant to minor illuminant changes is crucial to any skin color tracking system. The transformation simplicity and the explicit separation of luminance and chrominance components make YCbCr attractive for skin color modeling [12]. For 24-bit color depth, the following values apply:

Y = (0.299 * (R − G)) + G + (0.114 * (B − G))
Cb = (0.564 * (B − Y)) + 128
Cr = (0.713 * (R − Y)) + 128

The favorable property of this color space for skin color detection is the stable separation of luminance and chrominance, and its fast conversion from RGB. These points make it suitable for our real-time skin detection. The static values used when initially no face is detected are [3]: Cbmax = 127, Cbmin = 77, Crmax = 173, Crmin = 133. If we do not detect any face in a video, these values form the general static skin-color model. In that case, we do not gain any advantage from this approach. These values apply to a very broad range of illumination circumstances and a range of skin colors, and thus produce a large number of false positives. However, the model overlaps significantly with the idea humans have about skin color [9]. Our approach for more specific skin-color models is explained in the next section.
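A minimal NumPy sketch of this conversion and of the static cluster classification is shown below; the function names are illustrative only.

```python
import numpy as np

# Static skin-color boundaries from [3], used until the first face is found.
CB_MIN, CB_MAX = 77, 127
CR_MIN, CR_MAX = 133, 173

def rgb_to_ycbcr(image):
    """Convert an H x W x 3 uint8 RGB image with the formulas given above."""
    rgb = image.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * (r - g) + g + 0.114 * (b - g)
    cb = 0.564 * (b - y) + 128
    cr = 0.713 * (r - y) + 128
    return y, cb, cr

def static_skin_mask(image):
    """Per-pixel skin classification with the static cluster boundaries."""
    _, cb, cr = rgb_to_ycbcr(image)
    return ((cb >= CB_MIN) & (cb <= CB_MAX) &
            (cr >= CR_MIN) & (cr <= CR_MAX))
```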
3.3 Skin Color Model Instance Initialization and Destruction
Any detected and tracked face introduces a new skin color model instance, which allows skin of different colors and under different light conditions to be detected. After a face has been detected, its color is examined: the ranges of the Cb and Cr components (of the YCbCr color space) are used to generate a newly adapted range model. The Y component is ignored since it encodes only the luminance. In a first step, we have to estimate how much skin is present in the detected face. Detections usually contain parts that are not skin, such as hair, open eyes, mouth, eyebrows, etc. Therefore we statically cluster skin color with the borders defined in Section 3.2. Having extracted the possible skin region,
Fig. 3. Video 2 example frame and its classification result with a near skin-color background. (a) Original frame with one detected face. (b) Static skin-color model; green indicates detected skin. (c) One adapted model applied; green indicates detected skin.
Fig. 4. Video 5 example frame and its classification result. There are 4 different skin colors present in the scene and 4 detected faces from two different detectors, leading to 3 connected color models. (a) Original frame with 4 detected faces; the left one is detected twice. (b) Static skin-color model; green indicates detected skin. (c) 3 adapted models applied; green indicates detected skin.
we adjust the model: the average skin pixel color gives the median value for the new skin-color model. As evaluated in [9], the Cb channel gives the best performance using a range of 30% of the static color model, while the Cr channel is more stable with a range of 17.5%. With these adapted cluster definitions, we are able to classify every pixel with 4 simple threshold operations per model. This contextual information makes the approach more precise, reducing the number of false positives (compare Fig. 3). Additionally, in the same manner as we track multiple faces, we are able to track multiple skin colors per scene robustly, as can be seen in Fig. 4. A tracked face adapts its skin-color model dynamically per frame. When the face is lost by the tracker, a possible re-detection of the face gives a second, very similar skin-color model. We do not discard any model we create in the course of the video. This assumption works well for rather short online video clips with a limited number of persons, but does not hold for long movies with a large variation of illumination circumstances and actors.
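One possible realization of this adaptation is sketched below, reusing rgb_to_ycbcr and static_skin_mask from the previous sketch. Taking the mean skin color as the model center and reading the 30% / 17.5% values as full widths of the static Cb/Cr ranges are assumptions made for illustration.

```python
import numpy as np

STATIC_CB_RANGE = 127 - 77    # width of the static Cb cluster
STATIC_CR_RANGE = 173 - 133   # width of the static Cr cluster

def adapt_model_from_face(face_region):
    """Build an adapted (cb_min, cb_max, cr_min, cr_max) model from the
    pixels of a detected face region (H x W x 3 uint8)."""
    mask = static_skin_mask(face_region)      # statically clustered skin pixels
    if not mask.any():
        return None                           # no skin found inside the face box
    _, cb, cr = rgb_to_ycbcr(face_region)
    cb_center, cr_center = cb[mask].mean(), cr[mask].mean()
    cb_half = 0.30 * STATIC_CB_RANGE / 2.0
    cr_half = 0.175 * STATIC_CR_RANGE / 2.0
    return (cb_center - cb_half, cb_center + cb_half,
            cr_center - cr_half, cr_center + cr_half)

def apply_models(image, models):
    """A pixel is classified as skin if it falls inside any adapted model."""
    _, cb, cr = rgb_to_ycbcr(image)
    skin = np.zeros(cb.shape, dtype=bool)
    for cb_lo, cb_hi, cr_lo, cr_hi in models:
        skin |= (cb >= cb_lo) & (cb <= cb_hi) & (cr >= cr_lo) & (cr <= cr_hi)
    return skin
```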
3.4 Skin Paths for Video Classification
In detecting adult video material, we are interested in the amount of skin visible. It is possible to visualize this information in the form of skin graphs, showing the number of skin pixels detected per frame [7]. However, a major problem of such skin color based classification systems is that portrait shots, e.g. interviews and news, also have a large amount of skin present in the scene, which makes a decision based on skin pixel count difficult. After a successful face detection, we overcome this problem by estimating the relation between the skin inside the face region and the skin in the whole frame. This measure gives an idea of the scale of the people in that shot. Plotting this measure against the overall skin detection, we are able to describe the properties of the given frame meaningfully. Videos differ heavily from scene to scene and from shot to shot. To characterize a video, we have to provide a compact representation for our detection method. We introduce skin paths, which average the described measure over a fixed number of frames and give an intuitive idea of the character of a given video. In Fig. 5, the skin path for video number 9 and the corresponding frames are given. On the x-axis, the mean quotient of the skin color area inside a tracked face and the skin coverage over a fixed number of frames is given. The y-axis represents the mean total skin color coverage in these frames. The path starts at the very left, as in the beginning there is no face detected and much sand is detected as skin. Following the path, more faces are detected and the skin color model adjusts towards the right color model, giving a better idea of the amount of skin present in the scene. We show in Section 4 that certain areas of the skin graph are correlated with the properties and content of the videos. From the information in the skin paths, we can categorize the nature of videos reliably by the position of the data points of the graphs. Additionally, data points with x = 0 tend to be less reliable, as no or few faces are detected there. This gives a confidence measurement for the skin detection itself.
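The skin-path points can be computed directly from per-frame skin masks and tracked face boxes, as in the sketch below; the 80-frame window follows Fig. 5, while the remaining implementation details are assumptions.

```python
import numpy as np

def skin_path(frame_masks, face_boxes, window=80):
    """Compute skin-path points from per-frame data.

    frame_masks : list of boolean H x W skin masks, one per frame
    face_boxes  : list of lists of tracked face boxes (x, y, w, h) per frame
    window      : number of frames averaged per path point

    Returns (x, y) points: x is the mean ratio of facial skin to all detected
    skin, y is the mean skin coverage of the frames in the window.
    """
    points = []
    for start in range(0, len(frame_masks), window):
        ratios, coverages = [], []
        for mask, boxes in zip(frame_masks[start:start + window],
                               face_boxes[start:start + window]):
            total_skin = mask.sum()
            coverages.append(total_skin / mask.size)
            face_skin = sum(mask[y:y + h, x:x + w].sum() for x, y, w, h in boxes)
            ratios.append(face_skin / total_skin if total_skin > 0 else 0.0)
        points.append((float(np.mean(ratios)), float(np.mean(coverages))))
    return points
```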
4 Experiments
In this section, we develop a robust classification rule for flagging adult videos and validate our approach on a large data set containing the most popular online videos. In Section 4.1 we describe the videos used in more detail. Section 4.2 evaluates the approach on a per pixel basis and proposes the classification rules that are evaluated in a large scale experiment in Section 4.3.
4.1 Data Sets
15 videos have been provided by an internet service provider that requires a skin detection application for their on-line platform. Their aim was to choose challenging videos with near skin-color backgrounds. Pink and brown backgrounds
Fig. 5. Skin path for the classification of Video 9 and the corresponding key frames. The detection and incorporation of facial skin makes the results more reliable.
Fig. 6. Example frames from the annotated video data-set used
such as beaches, sand, cork boards or similar are easily detected as false positives (see Fig. 3 and compare Fig. 6). We added 10 videos to introduce additional challenges such as a larger variety of skin colors, especially different skin colors in one frame. Most of the sequences also contain scenes with multiple people and/or multiple visible body parts and shots both indoors and outdoors, with a steady or moving camera. The lighting varies from natural light to directional stage lighting. Sequences contain shadows and minor occlusions. The collected sequences vary in length from 100 to 500 frames. They also contain data errors and are generally of poor quality, varying size and frame rate. Ground truth has been generated for all of the 25 videos on a per pixel basis by manually annotating 10764 frames. The second data set consists of 200 publicly available videos. To provide an objective collection of videos, we chose the 100 most popular videos from YouTube (http://www.youtube.com) on July 7th, 2009. As no more videos are available in this category, we additionally gathered 50 videos “being watched right now” which are not in the previous category. For the adult material, we chose the 50 most popular videos from YouPorn (http://www.youporn.com), as this platform provides explicit adult material only and is publicly available. We want to make sure that our classification is not biased by the
Fig. 7. The F-score of the classification results. The proposed approach using multiple face detectors outperforms the adaptive skin-color modeling [9] and the static skin-color classification [3].
two different data sources. There is a possibility that the two classes of video material differ, e.g., in frame rate, size, video quality or noise level simply because of the two platforms they are downloaded from. Such criteria would nullify any classification success. Therefore we chose 10 of the adult videos that contain rather extended non-adult scenes and removed the adult scenes from them. Finally, the second data set consists of 160 videos with non-adult material (100 most popular, 50 being watched, 10 edited adult material) and 40 videos with explicit content. Example frames are shown in Fig. 1.
4.2 Adult Video Classification and Detection
Fig. 7 shows that the use of multiple face detectors provides a significant increase in classification performance of almost 10% compared to single face detection and a combined color space voting [9]. The weighted harmonic mean indicates the per pixel evaluated classification results. In Fig. 8 the skin paths for the whole data-set are given. As shown by the red paths, adult material tends to have few and small faces compared to a large amount of skin present. This intuitive criterion is well suited for classifying videos in the skin path diagram: we make the assertion that the skin path of suspicious video material enters the area defined by x < 0.08 and y > 0.55. We classify our whole data-set correctly with two false positive detections of videos. These two contain desert shots without faces present. With this classification technique, we can detect adult material reliably, with a tendency towards false positive detections but very few false negative ones. Additionally, the distance to the upper left corner gives an idea of the character of the scene: Videos 8 and 21, showing girls in bikinis, have paths beginning below x = 0.5, y = 0 (and therefore near adult material) that smoothly adapt towards the lower right (towards the unsuspicious space). Videos without much non-facial skin visible (e.g. interviews) have skin paths significantly towards the bottom. The main areas are summarized in Table 1: adult is the zone where we encounter adult material to be flagged, while suspicious videos tend to show persons in full with lots of visible skin, as they appear e.g. in sports clips. 5 videos are classified as suspicious material. This is intuitively correct as they are beach, dance and massage scenes. We separate videos with
Fig. 8. Skin paths of the relation between facial skin pixels and other skin pixels, drawn against the overall skin coverage. Red indicates adult material, green unsuspicious video material.

Table 1. The 3 main characters of videos that can be extracted from the skin paths reliably and their classification performance on the annotated data set

character of video | classification rule | classification accuracy | absolute false negatives
adult material     | x < 0.08, y > 0.55  | 0.85                    | 0
suspicious         | x < 0.08, y > 0.43  | -                       | -
portrait           | x >= 0.08, y < 0.25 | 1                       | 0
portrait shots, as they occur in news, interviews and most “webcam” video messages, robustly into the portrait area with an accuracy of 1. We show in the next section that these values hold for arbitrary online videos as well.
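For illustration, the rules of Table 1 can be applied to a computed skin path as sketched below; the thresholds are taken from Table 1, while triggering on any window that enters a region is an assumption about how the rule is evaluated.

```python
def flag_video(path_points):
    """Classify a video from its skin-path points (x, y) using the Table 1 zones.
    Returns 'adult', 'suspicious', 'portrait' or 'unsuspicious'."""
    labels = set()
    for x, y in path_points:
        if x < 0.08 and y > 0.55:
            labels.add("adult")
        elif x < 0.08 and y > 0.43:
            labels.add("suspicious")
        elif x >= 0.08 and y < 0.25:
            labels.add("portrait")
    for verdict in ("adult", "suspicious", "portrait"):
        if verdict in labels:
            return verdict
    return "unsuspicious"
```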
4.3 Flagging Adult Online Videos
We repeat the experiment from the previous section on the 200 online videos described in Section 4.1. As can be seen in Fig. 9, adult material again shows a strong trend towards the upper left corner. We apply the classification rules defined for adult material and reach an accuracy of 0.91. The two false negatives are both classified as suspicious, which can be seen in the second line of Table 2. The reason for this wrong classification lies in the nature of the two videos: the first one is explicit adult material, but almost no skin and no face is visible as the actors are fully dressed. The other false negative contains a couple that apparently records the video with their own webcam. In contrast to all other adult videos, the actors appear rather small in the image. Although the skin color is detected precisely, the amount of skin is not enough for our adult material classification rule. For portrait videos, 54 videos are classified as such, and they all contain portrait shots. We do not encounter false positive detections of tracked faces, although we do not know precisely how many we missed.
Fig. 9. Skin paths of the relation between facial skin pixels and other skin pixels, drawn against the overall skin coverage. Red indicates adult material, green unsuspicious video material.

Table 2. Classification accuracy and absolute number of false negatives and false positives. The first line shows the flagging performance for adult material, the second line for both adult and suspicious material.

classification rule | classification accuracy | nr. of false negatives | nr. of false positives
x < 0.08, y > 0.55  | 0.91                    | 2                      | 15
x < 0.08, y > 0.43  | 0.82                    | 0                      | 37
5 Conclusion
We present a practical approach to detecting skin in on-line videos in real-time. Instead of using solely color information, we include contextual information in the scene through multiple face detection and combined face tracking. By using a combination of face detectors and an adaptive multiple model approach to dynamically adapt skin color decision rules we are able to significantly reduce the number of false positive detections and the classification results become more reliable compared to static color threshold based approaches or approaches using multiple color spaces. The runtime of the algorithm is still real-time and can be carried out in parallel. We give the skin path as a compact and powerful representation of videos. We are able to extract reliable features from facial and non facial skin and classify on-line videos successfully. The number of false negatives is very low, providing a reliable flagging of adult material. The approach is computationally inexpensive and can be carried out in real-time.
Acknowledgment. This work was partly supported by the Austrian Research Promotion Agency (FFG), project OMOR 815994, and CogVis Ltd. (http://www.cogvis.at/). However, this paper reflects
only the authors’ views; the FFG or CogVis Ltd. are not liable for any use that may be made of the information contained herein.
References 1. Argyros, A.A., Lourakis, M.I.: Real-time tracking of multiple skin-colored objects with a possibly moving camera. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3023, pp. 368–379. Springer, Heidelberg (2004) 2. Cha, M., Kwak, H., Rodriguez, P., Ahn, Y.-Y., Moon, S.: I tube, you tube, everybody tubes: analyzing the world’s largest user generated content video system. In: Int. Conf. Internet Measurement, pp. 1–14 (2007) 3. Chai, D., Ngan, K.N.: Locating facial region of a head-and-shoulders color image. In: Int. Conf. Automatic Face and Gesture Recognition, pp. 124–129 (1998) 4. Fleck, M.M., Forsyth, D.A., Bregler, C.: Finding naked people. In: Buxton, B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1064, pp. 593–602. Springer, Heidelberg (1996) 5. Jones, M.J., Rehg, J.M.: Statistical color models with application to skin detection. IJCV 46(1), 81–96 (2002) 6. Kakumanu, P., Makrogiannis, S., Bourbakis, N.: A survey of skin-color modeling and detection methods. PR 40(3), 1106–1122 (2007) 7. Khan, R., St¨ ottinger, J., Kampel, M.: An adaptive multiple model approach for fast content-based skin detection in on-line videos. In: Int. Workshop Analysis and Retrieval of Events/Actions and Workflows in Video Streams (2008) 8. Lee, J.-S., Kuo, Y.-M., Chung, P.-C., Chen, E.-L.: Naked image detection based on adaptive and extensible skin color model. PR 40(8), 2261–2270 (2007) 9. Liensberger, C., St¨ ottinger, J., Kampel, M.: Color-based and context-aware skin detection for online video annotation. In: MMSP (to appear, 2009) 10. Phung, M.-S.L., Bouzerdoum, S. M.-A., Chai, S. M.-D.: Skin segmentation using color pixel classification: Analysis and comparison. PAMI 27(1), 148–154 (2005) 11. Senior, A., Hsu, R.-L., Mottaleb, M.A., Jain, A.K.: Face detection in color images. PAMI 24(5), 696–706 (2002) 12. Vezhnevets, V., Sazonov, V., Andreev, A.: A survey on pixel-based skin color detection techniques. In: ICCGV, pp. 85–92 (2003) 13. Viola, P., Jones, M.J.: Robust real-time face detection. IJCV 57(2), 137–154 (2004) 14. Yang, M., Ahuja, N.: Gaussian mixture model for human skin color and its application in image and video databases. In: SPIE, pp. 458–466 (1999) 15. Zheng, H., Daoudi, M., Jedynak, B.: Blocking adult images based on statistical skin detection. ELCVIA 4(2), 1–14 (2004)
Grouping and Summarizing Scene Images from Web Collections Heng Yang and Qing Wang School of Computer Science and Engineering Northwestern Polytechnical University, Xi’an 710072, P.R. China
[email protected]
Abstract. This paper presents an efficient approach to group and summarize the large-scale image dataset gathered from the internet. Our method firstly employs the bag-of-visual-words model which has been successfully used in image retrieval applications to give the similarity between images and divides the large image collections into separated coarse groups. Next, in each group, we match the features between each pair of images by using an area ratio constraint which is an affine invariant. The number of matched features is taken as the new similarity between images, by which the initial grouping results are refined. Finally, one canonical image for one group is chosen as the summarization. The proposed approach is tested on two datasets consisting of thousands of images which are collected from the photo-sharing website. The experimental results demonstrate the efficiency and effectiveness of our method.
1 Introduction

With the explosion of digital photography on the internet, people can easily find scene views of famous places they are interested in on photo-sharing websites such as Flickr [1]. However, people are often confused, as it is difficult for them to grasp the highlights when they face thousands of unorganized search results. As a result, automatic image grouping and summarization for web-scale collections has become a hot issue in recent years. Solving this problem is also a big challenge, due to the huge size of the datasets and the heavy contamination by images with wrongly associated tags. Recent literature presents a number of approaches for addressing the above-mentioned issue. The Photo Tourism system developed by Snavely et al. [2] can organize large-scale image collections by 3D modeling and visualization. One of the key steps of this system is to group the large amount of unordered images automatically by exhaustively computing feature matches between all possible image pairs using the ANN (Approximate Nearest Neighbor) search algorithm [7]. Therefore, the computational complexity of their grouping strategy is O(n²), where n is the number of images in the dataset. When n is huge, the computational cost of their grouping strategy is very heavy and even unacceptable. A similar grouping scheme was also adopted in [3], which aimed at summarizing scene images. Zheng et al. [16] presented a web-scale landmark recognition system, in which the landmark visual models are built using image matching and an unsupervised clustering approach. The clustering algorithm
is carried out on the matched local region graph. However, constructing the graph requires performing image feature matching on all image pairs in the entire dataset, which also suffers from a heavy computational burden. Zeng et al. [13] proposed an annealing-based algorithm which optimizes an objective function for grouping images. However, their method also relies on knowing the similarity of most image pairs. A more efficient way of grouping images is presented by Li et al. [4]. They first divided the large amount of images into small groups by clustering low-dimensional global GIST descriptors [5] and chose one representative image for each cluster as the iconic view. Then the iconic views are organized in an iconic scene graph using keypoint-based methods. Chum et al. [6] proposed a web-scale image clustering method based on randomized data mining. They first used the min-Hash algorithm to find cluster seeds, and then the seeds are used as visual queries to obtain clusters. What the above algorithms have in common is that they have to perform either ANN searching or clustering over millions of feature descriptors in order to obtain the similarity between images. This makes them inefficient or even impractical when the size of the web image collection grows very large. In contrast, we obtain image similarities by resorting to the bag-of-visual-words model, which has proven successful in image retrieval applications [8-10]. These methods are inspired by text retrieval and can rank image similarities very efficiently using the inverted file scheme [8]. In this paper, we compute the similarity between images using image retrieval methods and coarsely partition the large image dataset into separated content-related groups. Then, in each group, we refine the grouping result by performing feature matching. Finally, one canonical image per group is selected by our system as the summarization of the large dataset. The remainder of the paper is organized as follows. Section 2 presents the grouping and summarizing algorithms, and Section 3 shows the experimental results and related analyses. Finally, conclusions are described in Section 4.
2 Novel Image Sorting Algorithm
Our method can be divided into two main stages: coarse grouping and refining. Algorithm 1 summarizes our approach.
2.1 Coarse Grouping Using the Bag-of-Visual-Words Model
The vocabulary tree is a bag-of-visual-words algorithm that has proven successful in image retrieval applications [9, 10]. The vocabulary tree nodes are built by hierarchical k-means clustering of SIFT feature vectors from the training data. In our experiment, a vocabulary tree with a branching factor of 10 and 5 levels is built using the whole dataset as the training data. Next, the GNP search algorithm is employed to enhance retrieval performance, since it considers several candidates instead of one at each level of the tree. The feature vectors are thus quantized to visual words by the GNP algorithm and organized in an inverted file structure, which keeps track of the number of times each visual word appears in each image. Then each image is in turn taken as the query and retrieved against the whole dataset.
Algorithm 1. Scene Images Grouping and Summarizing
Input: Image collection from a photo-sharing website
Output: Image grouping results and canonical images for summarization
1. Extract SIFT [11] features from all the images.
2. Coarse grouping using the bag-of-visual-words model:
   (1) Choose training images to build a vocabulary tree [9].
   (2) Quantize feature vectors to their corresponding visual words by the GNP (Greedy N-Best Paths) [10] algorithm.
   (3) Use every image as a query to retrieve against the whole image collection and calculate the similarity of every image pair (the pairwise similarities are organized in a matrix Mr).
   (4) Use a view-spanning tree [12] to divide the image set into groups based on Mr.
3. For each group, refine the result:
   (1) For every image pair in the group, features are initially matched if they are quantized to the same visual word.
   (2) Use the affine invariant constraint to verify the initial matches. The number of remaining inliers is taken as the new similarity between image pairs (the similarity matrix is denoted Mc).
   (3) Use the view-spanning tree to partition the images of the group again based on Mc.
   (4) Choose a representative image as the canonical image summarizing each sub-group.
The similarity between images is calculated by the TF-IDF scheme [8] and is organized in the adjacency matrix Mr. The retrieval process is very efficient: in our experiments, the n × n similarity matrix can be computed in about 26 seconds for n = 1249 and about 21 seconds for n = 1112, respectively. Finally, a spanning-tree algorithm is employed to partition the large image dataset into groups based on Mr. The spanning-tree algorithm first adds the most similar image pair to the tree; it then repeatedly adds the new image that has the highest similarity with the images already in the tree. The process is repeated until no remaining image has a similarity value exceeding a pre-set threshold Tr. The remaining images, which are not added to any group, are considered contaminated and are rejected by our system. A sketch of this grouping step is given below.
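The following is a minimal sketch (Python/NumPy, illustrative only) of the greedy spanning-tree partitioning described above, assuming a symmetric similarity matrix Mr is already available. Repeating the tree growth over leftover images to form multiple groups, and all function and variable names, are our own assumptions rather than details given in the paper.

import numpy as np

def spanning_tree_groups(M, Tr):
    """Greedy spanning-tree partitioning given a symmetric similarity
    matrix M (n x n). Trees are grown until no remaining image is similar
    enough (>= Tr); leftover images are returned as rejected."""
    n = M.shape[0]
    unassigned = set(range(n))
    groups = []
    while len(unassigned) >= 2:
        idx = sorted(unassigned)
        sub = M[np.ix_(idx, idx)]          # similarity among remaining images (copy)
        np.fill_diagonal(sub, -np.inf)
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[i, j] < Tr:                 # no pair similar enough to seed a new tree
            break
        tree = {idx[i], idx[j]}
        unassigned -= tree
        while unassigned:
            # unassigned image with the highest similarity to any tree image
            cand = max(unassigned, key=lambda u: max(M[u, t] for t in tree))
            best = max(M[cand, t] for t in tree)
            if best < Tr:
                break
            tree.add(cand)
            unassigned.remove(cand)
        groups.append(sorted(tree))
    return groups, sorted(unassigned)      # rejected (contaminated) images

Under these assumptions, groups, rejected = spanning_tree_groups(Mr, Tr=0.1) would reproduce the coarse grouping step.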
2.2 Refining the Grouping Results
The grouping results can be refined group-by-group by considering matched features, since features hold more information than their corresponding visual words, such as their positions in the image. Our initial feature matching scheme reuses the inverted files obtained in the previous phase: two features are considered to correspond when they are quantized to the same visual word. Algorithm 2 details how we choose the matched feature pairs for a given visual word. It is worth noting that Algorithm 2 guarantees a one-to-one mapping between the feature sets, which is important for the next step. Next, the initial matches should be verified by a constraint in order to reach higher matching precision and thus provide a higher quality similarity measure. The most commonly used constraint is geometric verification with RANSAC-like algorithms. However, this process is very time consuming and is stronger than necessary for the grouping purpose. In this paper, we propose a simple yet effective constraint, the area ratio of triangles, which is invariant under affine transformation [14]. The image transformation between multiple views can be considered, at least approximately, an affine transformation in most cases, so this constraint is theoretically effective. Algorithm 3 gives the details of rejecting outliers between two images using the affine invariant constraint. The number of remaining matched features is then taken as the new, enhanced similarity measure between images. The similarity values of image pairs in a group are organized in the adjacency matrix Mc. The spanning-tree algorithm is carried out once more to refine the grouping result based on Mc, and the spanning-tree threshold Tc is set to 20 in our experiments.
Finally, a canonical image is selected as the representative of each group. The selection proceeds in two steps. Criterion (i) chooses the image that has the most connections with other images in the spanning tree. If only one image is returned, it is regarded as the canonical image of that group; if more than one image is returned, criterion (ii), choosing the image that has the maximum sum of matched features with all its connected neighbors, is further applied among the returned images to pick out the canonical one. The final output of our system is the m groups of images and the corresponding m canonical images as the summarization of the input image dataset.

Algorithm 2. Feature matching within the same visual word Wi
Input: Fi1 and Fi2, the feature subsets in two images I1 and I2 whose features are quantized to the same visual word Wi
Output: ni matched feature pairs
1. ni = min{|Fi1|, |Fi2|}.
2. Find the minimum distance among the |Fi1| × |Fi2| candidate distances and mark the corresponding features f1 ∈ Fi1 and f2 ∈ Fi2 as a matched feature pair.
3. Remove f1 from Fi1 and f2 from Fi2.
4. If Fi1 ≠ ∅ and Fi2 ≠ ∅, go to step 2; otherwise output the results.

Algorithm 3. Verify matched features by the affine invariant constraint
Input: Initial matched feature pairs between I1 and I2
Output: The matched feature pairs that pass the constraint verification
1. Choose four matched feature pairs in sequence, denoted A↔a, B↔b, C↔c, D↔d, making sure that no three of them are collinear. Under an affine transformation the triangle area ratios are invariant, i.e. S_ABC/S_abc = S_BCD/S_bcd = S_CDA/S_cda = S_DAB/S_dab, where S denotes the area of a triangle.
2. Let V1 = (S_ABC, S_BCD, S_CDA, S_DAB) and V2 = (S_abc, S_bcd, S_cda, S_dab), and let v1 and v2 be the corresponding normalized vectors. Calculate the scalar product p of v1 and v2.
3. If p exceeds a pre-set threshold Tp, keep the four matched feature pairs; otherwise, reject them from the initial set.
4. If all pairs in the initial set have been checked, output the results; otherwise, go to step 1.
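A minimal sketch of the area-ratio check in Algorithm 3 (Python/NumPy, illustrative only). Matched keypoints are assumed to be given as (n, 2) coordinate arrays in the order produced by the initial matching; the collinearity tolerance and the silent handling of a leftover group of fewer than four pairs are our assumptions.

import numpy as np

def tri_area(p, q, r):
    # Unsigned area of the triangle (p, q, r).
    return 0.5 * abs((q[0]-p[0])*(r[1]-p[1]) - (r[0]-p[0])*(q[1]-p[1]))

def verify_affine_invariant(pts1, pts2, Tp=0.9):
    """Keep matched pairs whose consecutive quadruples satisfy the triangle
    area-ratio constraint. Returns a boolean inlier mask over the matches."""
    n = len(pts1)
    keep = np.zeros(n, dtype=bool)
    for s in range(0, n - 3, 4):
        idx = [s, s + 1, s + 2, s + 3]
        A, B, C, D = (pts1[i] for i in idx)
        a, b, c, d = (pts2[i] for i in idx)
        V1 = np.array([tri_area(A, B, C), tri_area(B, C, D),
                       tri_area(C, D, A), tri_area(D, A, B)])
        V2 = np.array([tri_area(a, b, c), tri_area(b, c, d),
                       tri_area(c, d, a), tri_area(d, a, b)])
        if V1.min() < 1e-6 or V2.min() < 1e-6:
            continue                        # nearly collinear triple; skip this quadruple
        p = np.dot(V1 / np.linalg.norm(V1), V2 / np.linalg.norm(V2))
        if p > Tp:
            keep[idx] = True                # all four pairs pass
    return keep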
2.3 Complexity Analysis
The computational complexity of our approach is composed of two parts: coarse grouping by retrieval methods, and the refinement process within each coarse group. The first part is quite efficient due to the inverted file structure; obtaining the similarity matrix takes just seconds on our experimental datasets, so its time cost is negligible compared with the second part. The second part is the most time consuming, since it performs matching between feature sets. The total time complexity T can therefore be computed from the second part alone as in (1), where ni is the number of images in the i-th group, t denotes the average time cost of matching features between an image pair, and n is the total number of images in the dataset.

    T = t \sum_{i=1}^{m} C_{n_i}^{2}, \quad \text{s.t.} \quad \sum_{i=1}^{m} n_i \le n        (1)

If n̄ = max(n1, n2, ..., nm) and n̄ ≫ m, the total time complexity of our approach is O(n̄²). In the case of n̄ ≪ n, the time cost of our approach is much lower than that of the previous approaches [2,3], which are O(n²).
3 Experimental Results
We have tested our approach on two datasets: the Bell Tower of Xi'an, China, and the Pantheon of Rome, Italy. The images of the first dataset were automatically downloaded from the Flickr.com website using keyword search, and the second dataset was downloaded from the website [15]. The experiments were run on a common PC with an Intel(R) Pentium dual-core 2.0 GHz processor and 2 GB of memory.
3.1 Grouping Results and Evaluation on the Bell Tower Dataset
The Bell Tower dataset consists of 1249 images, from which a total of 867,352 SIFT features are extracted. We use all these features to train a vocabulary tree with a branching factor of 10 and 5 levels. After the coarse grouping stage, 956 images are automatically partitioned into 14 groups. The other images are considered contaminated and rejected by our system. One of the coarse grouping results is shown in Figure 1(a). At the refining stage, the enhanced similarity is recalculated between image pairs in each coarse group. As a result, some weakly related images may be rejected, connections in the view-spanning tree of a group may be readjusted, or a coarse group may even be split into several finer groups. Figure 1 shows an example in which a coarse group is split into two finer groups. The images in Figure 1(a) seem similar to each other, since they all contain a tower, buildings, and a street view, but they are actually taken from different views. The refined grouping results (b) and (c) are more accurate: the tower appearing in group (b) is actually the Drum Tower, which is located near the Bell Tower appearing in group (c).
Fig. 1. A coarse group is split into two smaller groups at the refining stage. The images in red rectangles in (a) denote the images rejected by refinement, and the images in blue rectangles in (b) and (c) denote the canonical images, respectively.
Fig. 2. The canonical images of the clusters on the Bell Tower dataset. The text above each image denotes the cluster name and the number of images in that cluster.
After the refining stage, 874 images remain, while the other images are rejected. The 874 images are partitioned into 17 clusters, and the corresponding 17 canonical images, which summarize the Bell Tower dataset, are automatically returned by our system as listed in Figure 2. Figure 3 shows five examples of the grouping results. In order to evaluate the grouping performance of our approach, we label the images by hand in advance.
The labeling distinguishes the orientation of the Bell Tower building, the different street views, and even whether a picture was taken in the daytime or at night. These ground-truth labels allow precision and recall to be measured at different steps, so that the grouping performance can be quantified; a sketch of this evaluation is given below. Figure 4 shows recall-precision curves for the different stages of our approach on the Bell Tower dataset. We can see that the refining stage greatly increases the precision of the coarse grouping result. Furthermore, using the proposed constraint in the refining stage further improves the grouping performance compared with refining without the constraint.
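The paper does not state exactly how precision and recall are computed for a grouping; one common convention is the pairwise definition sketched below, in which every pair of images placed in the same cluster counts as a retrieval. This is offered only as one plausible reading, with recall measured over the grouped (non-rejected) images.

from itertools import combinations

def pairwise_precision_recall(groups, labels):
    """Pairwise clustering precision/recall against hand labels.
    groups: list of lists of image ids in the same cluster.
    labels: dict mapping image id to its ground-truth label."""
    same_cluster = set()
    for g in groups:
        same_cluster.update(frozenset(p) for p in combinations(g, 2))
    grouped = [i for g in groups for i in g]
    same_label = {frozenset(p) for p in combinations(grouped, 2)
                  if labels[p[0]] == labels[p[1]]}
    tp = len(same_cluster & same_label)
    precision = tp / len(same_cluster) if same_cluster else 0.0
    recall = tp / len(same_label) if same_label else 0.0
    return precision, recall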
Fig. 3. Selected images from five of the clusters (C1, C2, C6, C10, C12). The images within red rectangles denote false positives which are wrongly categorized.

Fig. 4. Recall-precision curves of the grouping results at the different stages (coarse grouping, refining without the constraint, and refining with the constraint).
3.2 Analysis of the Constraint
In Algorithm 3 we propose to screen outliers with the affine invariant constraint. Here, we discuss the performance of the constraint for feature correspondence.
Table 1 lists the feature matching results of the group shown in Figure 1(b) to demonstrate the effectiveness of the proposed constraint; we list only the results of connected pairs in the tree. The homography matrix between each image pair is estimated by the RANSAC algorithm from the initial matched features and is used as the ground truth to verify the correctness of the matches (a sketch of this verification is given after the table). From Table 1 we find that the proposed constraint effectively rejects outliers, although it also rejects some inliers. The higher precision obtained with the proposed constraint provides a more accurate similarity between images, and therefore better grouping performance, as Figure 4 shows. In addition, Figure 5 shows an example of feature matching between two images; most of the outliers are rejected after applying the proposed constraint.

Table 1. Correspondence results without and with the constraint. Each entry gives the number of matched features (true matches, precision).

Image pair | Without the constraint | With the constraint
(a,d)      | 49 (30, 61%)           | 25 (24, 96%)
(b,c)      | 79 (65, 80%)           | 65 (65, 100%)
(c,d)      | 51 (27, 53%)           | 29 (26, 90%)
(d,e)      | 70 (55, 79%)           | 49 (45, 92%)
(e,g)      | 95 (74, 78%)           | 61 (59, 97%)
(f,g)      | 157 (138, 88%)         | 129 (127, 98%)
(f,i)      | 88 (67, 76%)           | 60 (59, 98%)
(g,h)      | 92 (76, 83%)           | 78 (73, 94%)
(g,j)      | 81 (56, 69%)           | 54 (52, 96%)
(g,k)      | 51 (33, 65%)           | 29 (28, 97%)
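A sketch of how such a RANSAC-homography ground truth could be computed with OpenCV; the 3-pixel reprojection tolerance is an assumed value, not one reported by the authors.

import numpy as np
import cv2

def verify_against_homography(pts1, pts2, reproj_tol=3.0):
    """Estimate a homography from the initial matches with RANSAC and count
    how many matches are consistent with it (used as a ground-truth proxy).
    pts1, pts2: (n, 2) arrays of matched coordinates."""
    H, mask = cv2.findHomography(pts1.astype(np.float32),
                                 pts2.astype(np.float32),
                                 cv2.RANSAC, reproj_tol)
    if H is None:
        return 0, np.zeros(len(pts1), dtype=bool)
    inliers = mask.ravel().astype(bool)
    return int(inliers.sum()), inliers

# Precision of a filtered match set, judged against these inliers:
# precision = (filtered_mask & inliers).sum() / filtered_mask.sum()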
Fig. 5. An example of feature correspondence. (a) Original image pair (f,i) from Figure 1(b). (b) The result without the constraint. (c) The result with the constraint.
3.3 Discussions
There are two important thresholds in our approach: Tr (in Section 2.1) and Tp (in Algorithm 3). Tr is the threshold used in the coarse grouping stage.
As Tr increases, the precision of the grouping results increases while the recall decreases. As a result, more small groups are produced, some of which should have been merged into a single large group. In the coarse grouping stage, a high recall is required to guarantee that images belonging to the same group are assigned to one large group; the precision of the grouping can then be improved by the refining step. Therefore, Tr is set to 0.1 in our experiments, a relatively small value. Tp is the threshold used in the constraint to reject outliers. The larger Tp is, the smaller the number of remaining matched features: the precision of feature matching increases while the recall decreases. Tp should therefore be set to balance the precision and recall of feature matching so that good grouping results can be reached. We find in our experiments that better results are achieved when Tp is set to 0.9.
Applying the same threshold settings, we evaluate our approach on another dataset, the Pantheon dataset that was also used in [3]. The results of [3] are shown on the website [15], which reports 15 clusters consisting of 1112 images in total. In our implementation, 1,089,822 SIFT features are extracted from the images and a vocabulary tree with a branching factor of 10 and 5 levels is trained on all these features. Our grouping and summarizing results are shown in Figure 6. In total, 10 groups are produced and 86 images are rejected by our approach. The grouping results are convincing, since the images in each group are indeed visually connected and content-related, which again demonstrates the effectiveness of our approach. The results of our approach can help users browse web image collections at two levels. The canonical images form level 1, which helps the user quickly grasp the highlights of a large collection, and the images content-related to each canonical image in its group can be considered level 2, which gives users more detailed information.
Fig. 6. The canonical images of the clusters on the Pantheon dataset. The text above each image denotes the cluster name and the number of images in that cluster.
4 Conclusion
We have presented an approach to automatically group and summarize large-scale image collections from the web. In the coarse grouping stage, our method employs successful image retrieval algorithms to greatly speed up the grouping process. In the refining stage, a constraint is proposed based on an affine invariant.
The constraint effectively rejects outliers among the matched feature pairs without any RANSAC-like estimation algorithm. The experimental results demonstrate the efficiency and effectiveness of our method. The output of our system provides users with an efficient way to navigate large-scale databases and photo-sharing websites. In addition, the results of our system can serve as input to a 3D reconstruction system based on well-known structure-from-motion methods; this is left as future work.
Acknowledgments. This work is supported by the National Natural Science Fund (60873085) and the National Hi-Tech Development Program under grant No. 2007AA01Z314, P.R. China.
References
1. Flickr, http://www.flickr.com/
2. Snavely, N., Seitz, S., Szeliski, R.: Photo Tourism: exploring photo collections in 3D. SIGGRAPH 25(3), 835–846 (2006)
3. Simon, I., Snavely, N., Seitz, S.M.: Scene summarization for online image collections. In: ICCV (2007)
4. Li, X., Wu, C., Zach, C., Lazebnik, S., Frahm, J.M.: Modeling and recognition of landmark image collections using iconic scene graphs. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 427–440. Springer, Heidelberg (2008)
5. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV 42, 145–175 (2001)
6. Chum, O., Matas, J.: Web scale image clustering. Research Report, Czech Technical University (2008)
7. Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y.: An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM 45(6), 891–923 (1998)
8. Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: ICCV, pp. 1470–1477 (2003)
9. Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. In: CVPR, pp. 2161–2168 (2006)
10. Schindler, G., Brown, M., Szeliski, R.: City-scale location recognition. In: CVPR (2007)
11. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60, 91–110 (2004)
12. Yao, J., Cham, W.K.: Robust multi-view feature matching from multiple unordered views. Pattern Recognition 40, 3081–3099 (2007)
13. Zeng, X., Wang, Q., Xu, J.: Map model for large-scale 3d reconstruction and coarse matching for unordered wide-baseline photos. In: BMVC (2008)
14. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2003)
15. http://grail.cs.washington.edu/projects/canonview/pantheon_index/pantheon.html/
16. Zheng, Y., Zhao, M., Song, Y., Adam, H., Buddemeier, U., Bissacco, A., Brucher, F., Chua, T.S., Neven, H.: Tour the World: Building a Web-Scale Landmark Recognition Engine. In: CVPR (2009)
Robust Registration of Aerial Image Sequences
Clark F. Olson (1), Adnan I. Ansar (2), and Curtis W. Padgett (2)
(1) University of Washington, Bothell, Computing and Software Systems, Box 358534, 18115 Campus Way N.E., Bothell, WA 98011-8246
[email protected]
(2) Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099
Abstract. We describe techniques for registering images from sequences of aerial images captured of the same terrain on different days. The techniques are robust to changes in weather, including variable lighting conditions, shadows, and sparse intervening clouds. The primary underlying technique is robust feature matching between images, which is performed using both robust template matching and SIFT-like feature matching. Outlier rejection is performed in multiple stages to remove incorrect matches. With the remaining matches, we can compute homographies between images or use non-linear optimization to update the external camera parameters. We give results on real aerial image sequences.
1 Introduction
Persistent, high resolution aerial imagery can provide substantial scene detail over wide areas. Sensor platforms such as Angel Fire, Constant Hawk and ARGUS that dwell over an area give wide area coverage at high resolution. Both Angel Fire and ARGUS serve these images at low latency to multiple ground users at update rates sufficient for real time situational awareness. To increase the utility of these persistent surveillance platforms to ground users, there is a need to align the imagery to existing databases (road, city, waterway maps, etc.) and to fuse the data from multiple platforms servicing the same area, possibly of different modalities. One of the simplest ways to accomplish this is to use onboard Inertial Navigation Sensors (INS) and GPS to geo-register the collected imagery to a terrain map (either a pre-existing Digital Elevation Map, or one generated from the collected imagery). Requested portions of the image product can then be served directly to ground users from the aerial platform thus eliminating the need to send the full image stream (requiring greater than 100 Mbits/second over limited bandwidth) to the ground for further processing. A well registered image stream also improves the efficiency of the image compression routines allowing for higher quality imagery (or more coverage) to be passed directly to the user.
Image analysts need information much more than they need raw data, and utility grows at each level of information that is successfully and reliably extracted. Fusion of data from multiple sensors has long been known as an effective way of highlighting and interpreting significant events while reducing false indications. Often the challenge, however, is ensuring proper association between data items collected from dissimilar sensors. In other words, you need to line images up before you can find out where information coincides. The value of even simple pixel-level image fusion has been shown for well over a decade. Targets pop out and images become easy to interpret [1]. Pixel-level fusion, however, depends upon lining up images to a very high level of precision, a task that is difficult enough when the sensors are mounted on the same platform. Although very precise INS/GPS sensors exist, the challenge of localizing a fraction of a square meter (the typical pixel size on the ground) from more than a mile away is extremely high, even frame to frame on the same image. Further, future persistent surveillance platforms will come in all size classes (Shadow, Scan Eagle, Predator, etc.) precluding the use of the best INS sensors due to size/power constraints. Biases in the INS/GPS pointing systems also introduce misalignments that can result in substantial ground errors. Ideally, the data products collected onboard should be referenced to a standard, canonical map to align the locally produced imagery prior to dissemination. This would provide a unified view of the data products collected across time (prior flights of the same sensor), platforms (multiple views of the same ground), and modalities (different sensors, wavelengths, resolution, etc.), providing a robust, easy to manipulate view of a scene for exploitation by image analysts or automated recognition algorithms. Given these constraints, we are interested in a problem where persistent surveillance of a site is performed on multiple days and it is desirable to align the imagery from the current day with the previously acquired data in real-time. This requires the computation of an offset in the external camera parameters over some initial set of images that can be propagated to future images allowing precise registration of the future images with minimal processing. In this scenario, we have high resolution data (the captured images are 4872×3248, although we perform most processing at 1218×812). In general, our scenario allows us to assume that the images are captured at roughly the same elevation, since this is typical with the surveillance flights (and we can rescale according to the estimated altitude, if necessary). Similarly, since the aircraft circles a particular area of interest, we can extract images from the previous sequence from approximately the same viewpoint. To accomplish the registration, we use a feature matching strategy with careful outlier rejection. We then optimize the offset in the external camera parameters using multiple images in each sequence in order to determine a precise relative positioning between the sequences and allow real-time alignment. We can also use these techniques to align multiple sequences with a single previously generated map.
2 Previous Work
There has been extensive work on image registration [2,3,4], including work on aerial images [5,6,7,8] and aerial image sequences [9,10,11]. A common strategy has been feature detection and matching, followed by a process to optimize the alignment of features. Zheng and Chellappa [5] described such a technique for finding the homography aligning the ground planes for the registration of oblique aerial images. Tuo et al. [6] perform registration after modifying the images to fit a specified brightness histogram. Features are then detected and aligned. Yasein and Agathoklis [7] solve only for a similarity transformation, but use an iterative optimization where the points are weighted according to the current residual. Xiong and Quek [8] perform registration up to similarity transformations without explicitly finding correspondences. After detecting features in both images, an orientation is computed for each feature and all possible correspondences are mapped into a histogram according to the orientation differences. The peak in the histogram is chosen as the rotation between the images. Scale is determined through the use of angle histograms computed with multiple image patch sizes and selecting the histogram with the highest peak. Niranjan et al. [9] build upon the work of Xiong and Quek in order to register images in an image sequence up to a homography. Lin et al. [10] concentrate on registering consecutive aerial images from an image sequence. They use a reference image (such as a map image) in order to eliminate errors that accumulate from local methods. Their two-step process first performs registration between images and then uses this as an initial estimate for registration with the reference image. Wu and Luo [11] also examine registration in an aerial image sequence. In their technique, the movement of the camera is predicted from previous results in the sequence. This information is used to rank possible correspondences. A variation of RANSAC [12] is used to validate correspondences and to refine the motion estimate.
3 Feature Matching
In order to register images from different sequences (and from the same sequence), we match features or landmarks identified in the images. Two methods are used for feature matching that may be used independently or in combination. In both methods, we select a set of discrete features to be matched in one or more of the images. Given our mission scenario, we work under the assumption that the images are captured from roughly the same altitude and that we can find images captured with roughly the same camera axis from the two image sequences. This allows us to neglect major scale and orientation differences in the feature matching process and gains us robustness to false positives that might occur between features of different scales or orientations. The techniques are able to handle variation of up to 15 degrees or 15% scale change. In cases where these assumptions are not
warranted, we can easily replace the techniques with those invariant to scale and rotation changes. This is expected to be rare, since the INS provides us with data that can be used to warp the images into roughly the same scale and rotation.
3.1 Feature Selection
Features are selected in images using a two-step process. First, each pixel in the image is assigned a score based on the second moment matrix computed in a neighborhood around each pixel:

    \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}        (1)

Following Shi and Tomasi [13], we use the smallest eigenvalue of this matrix as a measure of how easy the feature is to track. (If the larger eigenvalue is low, this indicates a featureless region. If the larger eigenvalue is high, but the smaller eigenvalue is low, this indicates a linear edge.) Given the scores, we select features from rectangular subimages in order to distribute the features over the entire image. Features are selected greedily, starting with the largest scores and discarding those that are too close to previously selected features. The current implementation uses 16 subimages in a 4×4 grid with 16 features selected in each, for a total of 256 features. A sketch of this selection step follows.
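A minimal sketch of the grid-distributed selection using OpenCV's smallest-eigenvalue (Shi-Tomasi) score; the neighborhood size and the minimum spacing between features are assumed values, since the paper does not give them.

import numpy as np
import cv2

def select_features(gray, grid=4, per_cell=16, min_sep=10):
    """Greedy, grid-distributed feature selection by the smallest eigenvalue
    of the second-moment matrix. 'gray' is an 8-bit grayscale image;
    'min_sep' (pixels) is an assumed minimum spacing between features."""
    score = cv2.cornerMinEigenVal(gray, 5)          # per-pixel smallest eigenvalue
    h, w = gray.shape
    features = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * h // grid, (gy + 1) * h // grid
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            cell = score[y0:y1, x0:x1]
            # pixel coordinates within the cell, ordered by decreasing score
            order = np.dstack(np.unravel_index(
                np.argsort(cell, axis=None)[::-1], cell.shape))[0]
            chosen = []
            for (r, c) in order:
                p = (y0 + int(r), x0 + int(c))
                if all((p[0]-q[0])**2 + (p[1]-q[1])**2 >= min_sep**2 for q in chosen):
                    chosen.append(p)
                    if len(chosen) == per_cell:
                        break
            features.extend(chosen)
    return features                                  # list of (row, col) positions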
3.2 Robust Template Matching
Of the two feature matching techniques we use, template matching using gradient or entropy images is the more robust, but also the more time consuming, unless multi-resolution search techniques are used. The technique is based on previous work for aligning entropy images [14]. We first compute a new representation of each image, replacing each pixel with a local measure of either the image gradient or entropy. Feature matches are detected using normalized correlation between the templates encompassing the detected features and the search image. We look for matches over the entire search image efficiently by using the Fast Fourier Transform (FFT) to implement normalized correlation. This allows each template to be processed in O(n log n) time, where n is the number of pixels in the search image. In order to improve the speed, a multi-resolution search option is included. Featureless regions in the search image are undesirable and lead to poor quality matches, so we discount template windows in the search image with below-average root-mean-square (RMS) intensity (in the gradient or entropy image) using the following function:

    S(r, c) = \begin{cases} NC(r, c), & \text{if } w(r, c) \ge \bar{w} \\ \dfrac{2\, w(r, c) \cdot NC(r, c)}{w(r, c) + \bar{w}}, & \text{if } w(r, c) < \bar{w} \end{cases}        (2)

where NC(r, c) is the normalized correlation of the template with the window centered at (r, c), w(r, c) is the RMS intensity of the window, and w̄ is the average window intensity over the search image.
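A sketch of the discounted score in (2), assuming gradient or entropy images are supplied as float32 arrays. OpenCV's matchTemplate stands in for the FFT-based normalized correlation described in the text, and aligning the RMS map with the correlation map by cropping is our simplification.

import numpy as np
import cv2

def discounted_match_score(search, template):
    """Score map S(r, c): normalized correlation, down-weighted where the
    local RMS intensity of the search image is below its average."""
    th, tw = template.shape
    # normalized correlation of the template with every valid window position
    nc = cv2.matchTemplate(search, template, cv2.TM_CCOEFF_NORMED)
    # RMS intensity of each window, via a normalized box filter on the squared image
    mean_sq = cv2.boxFilter(search * search, -1, (tw, th))
    rms_full = np.sqrt(np.maximum(mean_sq, 0.0))
    # crop so window centers line up with the 'valid' correlation map
    rms = rms_full[th // 2: th // 2 + nc.shape[0], tw // 2: tw // 2 + nc.shape[1]]
    w_bar = rms.mean()
    s = np.where(rms >= w_bar, nc, 2.0 * rms * nc / (rms + w_bar))
    return s    # best match location: np.unravel_index(np.argmax(s), s.shape)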
3.3 SIFT-Like Feature Matching
We also use a method based on the SIFT technique [15]. However, we do not employ the scale and rotation invariance aspects of SIFT, since the images are already (or can be transformed to be) at roughly the same scale and rotation. Feature extraction is first performed on both images using the technique described above. Each feature is characterized using the SIFT method as a vector with 128 entries representing a histogram of gradients in the feature neighborhood at various positions and orientations. Features are compared using normalized correlation and the best match is tentatively accepted if the normalized correlation exceeds 0.75. One disadvantage of this technique is that, even if the same feature is located in both images, it may be localized at slightly different locations. To correct this, we refine each feature match using a brute-force search that considers positive and negative displacements in row and column of up to two pixels. A sketch of this matching step is given below.
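A sketch of the descriptor comparison and the ±2 pixel refinement. Whether the "normalized correlation" of descriptors is zero-mean and what similarity is used in the refinement search are not specified in the paper; the zero-mean normalization, the SSD refinement criterion, and the patch size are our assumptions.

import numpy as np

def match_descriptors(desc1, desc2, min_ncc=0.75):
    """Tentatively match 128-D SIFT-style descriptors by normalized
    correlation; a match is accepted if its best correlation exceeds 0.75."""
    def normalize(d):
        d = d - d.mean(axis=1, keepdims=True)
        return d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-12)
    a = normalize(np.asarray(desc1, float))
    b = normalize(np.asarray(desc2, float))
    ncc = a @ b.T                       # pairwise normalized correlations
    best = ncc.argmax(axis=1)
    return [(i, j) for i, (j, v) in enumerate(zip(best, ncc.max(axis=1)))
            if v > min_ncc]

def refine_match(img1, img2, p1, p2, half=7, radius=2):
    """Refine a match location in img2 by brute-force search over row/column
    displacements of up to 'radius' pixels, minimizing patch SSD.
    Points are assumed to lie away from the image border."""
    r1, c1 = p1
    patch = img1[r1-half:r1+half+1, c1-half:c1+half+1].astype(float)
    best, best_err = p2, np.inf
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r2, c2 = p2[0] + dr, p2[1] + dc
            cand = img2[r2-half:r2+half+1, c2-half:c2+half+1].astype(float)
            if cand.shape != patch.shape:
                continue
            err = np.sum((cand - patch) ** 2)
            if err < best_err:
                best, best_err = (r2, c2), err
    return best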
3.4 Comparison
Over a test set consisting of 108 image pairs, the entropy techniques averaged 119.3 inliers found, while the gradient techniques averaged 116.6 inliers and the SIFT-based techniques averaged 79.7 inliers. The entropy and gradient techniques required approximately 22 seconds working on 1024×812 images (reduced from 1218 columns to 1024 in order to quarter the time required by the FFT) using a multi-resolution search, including file I/O, feature detection, matching, and refinement. The SIFT-based techniques required approximately 30 seconds, but were able to work on the full 1218×812 images. When the multi-resolution search is not used, the average number of inliers increases to 143.1 for entropy matching and 140.4 for gradient matching, but the computation time increases to 350 seconds. Figure 1 shows examples of features extracted and matched using aerial images.
4 Outlier Rejection
We use a multi-step process to reject outliers in the detected feature matches. The first step is designed to reject gross outliers using sampling. The second step (which is iterated) computes a homography between the points and discards those with larger residuals. Both steps are very efficient and require a fraction of the time used for feature matching.
4.1 Gross Outlier Rejection
Gross outliers are detected using a variation of the RUDR strategy for model fitting [16]. In this strategy, trials are used that hypothesize sets of correct matches that are one match smaller than the number sufficient to calculate the model parameters. The parameters are estimated by combining the hypothesized matches with each possible remaining match and detecting clusters among the estimated parameters.
Fig. 1. Feature selection and matching. (a) Features selected in first image. (b) Best matches found in second image. Outlier rejection has not been performed at this stage.
A trial is expected to succeed whenever the hypothesized set of matches contains no outliers. To reject gross outliers, we use a simple model of the motion that allows only similarity transformations between the images (rotation, translation, and scale). For this model, two matches are sufficient to solve for the model parameters. Therefore, each trial fixes one match and considers it in combination with all of the other matches. Since the number of matches is not large, we perform a trial for each match, rather than using random sampling as previously described [16]. In our scenario, the clustering in each trial is relatively simple, since the images have roughly the same scale and orientation. We can simply eliminate those matches that produce a rotation estimate that varies significantly from zero or a scale estimate that varies significantly from one. (Our experiments allow 15% scale change and 15◦ rotation.) The trial with the largest set of inliers (that is, the fewest eliminated matches) is selected and the remaining matches are not considered further. A sketch of this step is given below.
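A sketch of the trial-based gross outlier rejection under the similarity-transform model, using the stated 15° and 15% tolerances; estimating the per-pair rotation and scale from the vectors joining each match to the fixed match is our reading of the clustering step.

import numpy as np

def gross_outlier_rejection(p1, p2, max_rot_deg=15.0, max_scale_dev=0.15):
    """One trial per match: fix a match, pair it with every other match, and
    keep pairings whose implied rotation is near zero and implied scale is
    near one. The trial retaining the most matches wins.
    p1, p2: (n, 2) arrays of matched point coordinates."""
    n = len(p1)
    best_mask = None
    for i in range(n):                          # trial with match i fixed
        v1 = p1 - p1[i]                         # vectors from the fixed match
        v2 = p2 - p2[i]
        with np.errstate(divide='ignore', invalid='ignore'):
            scale = np.linalg.norm(v2, axis=1) / np.linalg.norm(v1, axis=1)
        ang1 = np.arctan2(v1[:, 1], v1[:, 0])
        ang2 = np.arctan2(v2[:, 1], v2[:, 0])
        rot = np.degrees(np.abs(np.angle(np.exp(1j * (ang2 - ang1)))))
        ok = (np.abs(scale - 1.0) <= max_scale_dev) & (rot <= max_rot_deg)
        ok[i] = True                            # the fixed match itself is kept
        if best_mask is None or ok.sum() > best_mask.sum():
            best_mask = ok
    return best_mask                            # boolean inlier mask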
4.2 Careful Outlier Rejection
The matches that survive the previous step are examined more carefully for further outliers. In this step, we solve for the homography that best aligns the matches using a least-squares criterion and compute the residual for each match. Any match with a residual greater than twice the median residual is eliminated. This is iterated until one of the following conditions holds (see the sketch after this list):
1. No outliers are found.
2. The median residual falls below a threshold (1.0).
3. The number of matches falls below a threshold (20).
Note that the third condition was never met in our experiments, but it is present in the implementation to ensure that sufficient matches remain to perform the optimization. Figure 2 shows the matches from Fig. 1 with outliers removed.
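A sketch of the iterated least-squares homography fit with median-based rejection, using OpenCV's plain least-squares homography estimator; the exact ordering of the three stopping checks within one iteration is our assumption.

import numpy as np
import cv2

def careful_outlier_rejection(p1, p2, res_thresh=1.0, min_matches=20):
    """Iteratively fit a least-squares homography and drop matches whose
    reprojection residual exceeds twice the median residual."""
    p1 = np.asarray(p1, np.float32)
    p2 = np.asarray(p2, np.float32)
    keep = np.ones(len(p1), dtype=bool)
    while True:
        H, _ = cv2.findHomography(p1[keep], p2[keep], 0)   # 0 = plain least squares
        proj = cv2.perspectiveTransform(p1[keep].reshape(-1, 1, 2), H).reshape(-1, 2)
        resid = np.linalg.norm(proj - p2[keep], axis=1)
        med = np.median(resid)
        outliers = resid > 2.0 * med
        if not outliers.any():            # condition 1: no outliers found
            break
        if med < res_thresh:              # condition 2: median residual small enough
            break
        idx = np.flatnonzero(keep)
        keep[idx[outliers]] = False
        if keep.sum() < min_matches:      # condition 3: too few matches remain
            break
    return keep, H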
Fig. 2. Features after outlier rejection. (a) Features selected with rejected matches excluded. (b) Best matches found in second image with rejected matches excluded.
5 Nonlinear Parameter Optimization
Given the matches computed between the image sequences, we can now perform a nonlinear optimization step to refine the motion between the sequences. Our goal is to compute a single six degree-of-freedom (DOF) transformation between the two sequences, under the assumption that the relative errors within each sequence are small. However, only five parameters can be extracted without additional information, owing to the scale ambiguity. We set the scale by requiring the average depth of the points from the camera to agree with the elevation specified by the INS for the first image. This allows us to optimize the six motion parameters without ambiguity. As in previous work [17], our optimization uses a state vector that includes not only the motion parameters, but also the elevation of each feature point matched. (Initially, each point is estimated to be on a flat ground plane.) With this formulation, we can use an objective function based on the distance between the predicted feature location (according to the estimated motion) and the matched feature location. The objective function is augmented with a penalty term that enforces the scale constraint. Overall, this yields a state vector with m + 6 variables to optimize, where m is the number of distinct features matched, and an objective function with 2n + 1 constraints, where n is the number of feature matches between the sequences. The values of m and n are not necessarily the same, since a feature may be matched in multiple images of a sequence. We optimize the objective function using the Levenberg-Marquardt method with lmfit, a public domain software package based on MINPACK routines [18]. A sketch of the optimization setup is given below.
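The authors use lmfit (MINPACK); the sketch below only illustrates the stated structure of the problem, a residual vector with 2n + 1 entries over m + 6 variables minimized with a Levenberg-Marquardt-style solver from SciPy. The camera model 'project', the match representation, and the penalty weight are placeholders, not the authors' exact formulation.

import numpy as np
from scipy.optimize import least_squares

def residuals(state, project, matches, target_mean_depth, penalty_weight=1.0):
    """Residual vector with 2n + 1 entries: for each of the n matches, the
    row/column difference between predicted and matched feature locations,
    plus one penalty term tying the mean point depth to the INS elevation.
    state = [6 motion parameters, elevation of each of the m distinct points].
    project(motion, elevations, match) is an assumed camera model returning
    (predicted_rc, depth) for one match."""
    motion, elev = state[:6], state[6:]
    res, depths = [], []
    for match in matches:                    # each match knows its point index and observed (r, c)
        pred_rc, depth = project(motion, elev, match)
        res.extend(pred_rc - match.observed_rc)
        depths.append(depth)
    res.append(penalty_weight * (np.mean(depths) - target_mean_depth))
    return np.asarray(res)

# Levenberg-Marquardt refinement of the m + 6 parameters:
# result = least_squares(residuals, x0, method='lm',
#                        args=(project, matches, ins_elevation))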
6 Results
Figure 3 shows three registration results using these techniques. In each case, an image has been warped into the frame of an image from a previous sequence. The images are merged by taking the average of the pixel values after the warping has been performed.
Fig. 3. Registration results. Pairs of images are merged by taking the average of the pixel values after warping the second image into the frame of the first image.
Locations outside of the image are left as white. (Areas covered by only one image are lightly illustrated.) The first example is the relatively straightforward case that has been used to illustrate the components of the system in Figures 1 and 2. It can be observed that the registration is good, since there is little blurring in the averaged pixels and the landmarks (such as roads) align well. The second example shows a scene with buildings and increased warping between the images. The left side of these images is dark owing to occlusion. The final example shows a more complex case, where clouds occluded the terrain in one of the sequences and a data box produced distractors that did not move with the terrain. Furthermore, the overlap between the images is reduced in this case. Despite these issues, our algorithm was able to correctly register the images. In practice, our operational scenario will yield images with greater overlap and less rotation than in this example.
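For reference, a minimal sketch of the averaging merge used to produce figures like Figure 3, assuming the homography H mapping the second image into the frame of the first is available; the handling of regions covered by only one image is simplified here.

import numpy as np
import cv2

def merge_registered(img1, img2, H):
    """Warp img2 into the frame of img1 and blend by averaging where both
    images overlap; img1 is kept elsewhere."""
    h, w = img1.shape[:2]
    warped = cv2.warpPerspective(img2, H, (w, h))
    covered = cv2.warpPerspective(np.ones(img2.shape[:2], np.uint8), H, (w, h)) > 0
    out = img1.copy()
    out[covered] = ((img1[covered].astype(np.float32) +
                     warped[covered].astype(np.float32)) / 2).astype(img1.dtype)
    return out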
7 Summary
The registration of aerial image sequences is important in persistent surveillance applications in order to accurately fuse current images with previously collected data. We have described techniques for the robust registration of such image sequences. We first match landmarks between the sequences by selecting interesting image features and using robust matching techniques. Outlier rejection is performed carefully in order to extract a set of high quality matches. Finally, the external camera parameters are refined using nonlinear optimization. Results on real image sequences indicate that the method is effective.
References
1. Waxman, A.M., Aguilar, M., Fay, D.A., Ireland, D.B., Racamato Jr., J.P., Ross, W.D., Carrick, J.E., Gove, A.N., Seibert, M.C., Savoye, E.D., Reich, R.K., Burke, B.E., McGonagle, W.H., Craig, D.M.: Solid-state color night vision: Fusion of low-light visible and thermal infrared imagery. Lincoln Laboratory Journal 11, 41–60 (1998)
2. Brown, L.G.: A survey of image registration techniques. ACM Computing Surveys 24, 325–376 (1992)
3. Maintz, J.B.A., Viergever, M.A.: A survey of medical image registration. Medical Image Analysis 2, 1–16 (1998)
4. Zitova, B., Flusser, J.: Image registration methods: A survey. Image and Vision Computing 21, 977–1000 (2003)
5. Zheng, Q., Chellappa, R.: Automatic registration of oblique aerial images. In: Proceedings of the IEEE International Conference on Image Processing, vol. 1, pp. 218–222 (1994)
6. Tuo, H., Zhang, L., Liu, Y.: Multisensor aerial image registration using direct histogram specification. In: Proceedings of the IEEE International Conference on Networking, Sensing and Control, pp. 807–812 (2004)
7. Yasein, M.S., Agathoklis, P.: A robust, feature-based algorithm for aerial image registration. In: Proceedings of the IEEE International Symposium on Industrial Electronics, pp. 1731–1736 (2007)
8. Xiong, Y., Quek, F.: Automatic aerial image registration without correspondence. In: Proceedings of the 4th International Conference on Computer Vision Systems (2006)
9. Niranjan, S., Gupta, G., Mukerjee, A., Gupta, S.: Efficient registration of aerial image sequences without camera priors. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part II. LNCS, vol. 4844, pp. 394–403. Springer, Heidelberg (2007)
10. Lin, Y., Yu, Q., Medioni, G.: Map-enhanced UAV image sequence registration. In: Proceedings of the IEEE Workshop on Applications of Computer Vision (2007)
11. Wu, Y., Luo, X.: A robust method for airborne video registration using prediction model. In: Proceedings of the International Conference on Computer Science and Information Technology, pp. 518–523 (2008)
12. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 381–396 (1981)
13. Shi, J., Tomasi, C.: Good features to track. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 593–600 (1994)
14. Olson, C.F.: Image registration by aligning entropies. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 331–336 (2001)
15. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
16. Olson, C.F.: A general method for geometric feature matching and model extraction. International Journal of Computer Vision 45, 39–54 (2001)
17. Xiong, Y., Olson, C.F., Matthies, L.H.: Computing depth maps from descent images. Machine Vision and Applications 16, 139–147 (2005)
18. Wuttke, J.: lmfit - a C/C++ routine for Levenberg-Marquardt minimization with wrapper for least-squares curve fitting, version 2.4 (2008). Based on work by B.S. Garbow, K.E. Hillstrom, J.J. More, and S. Moshier. http://www.messen-und-deuten.de/lmfit/ (retrieved June 2, 2009)
Color Matching for Metallic Coatings
Jayant Silva and Kristin J. Dana
Electrical and Computer Engineering Department, Rutgers University, Piscataway, NJ, USA
Abstract. Metallic coatings change in appearance as the viewing and illumination direction change. Matching of such finishes requires a match of color as viewed under different lighting and viewing conditions, as well as texture, i.e. matching the spatial distribution of color. The system described in this paper provides a purely objective and automatic match measure for evaluating the goodness of a color match. A new texture camera has been utilized in this work to capture multiview images of the coating to be analyzed under different illumination angles. These multiview images are used to characterize the color travel of a coating as well as analyze the appearance attributes of the finish such as orange peel, sparkle, and texture of the flakes within the coating.
1 Introduction
Modern finishes for automotive vehicles make use of metallic paints that change in appearance with viewing angle, presenting a unique challenge to obtaining an accurate color match. Metallic paints are produced by combining metallic flakes with colored particles in the paint substrate [1][2]. The flakes in the finish are oriented almost parallel to the surface, like tiny shining mirrors (sparkling effect) [3][4]. The sparkles are the texture of the paint, and this texture changes in appearance with viewing and illumination direction [5]. Besides this change in brightness, the gloss provided by the clear coat as well as its irregular surface (called the orange peel effect) alters the clarity of the reflected image and therefore the visual appearance of the coating [3]. These appearance-based characteristics of the finish add to its visual complexity and are illustrated in Figure 1. With automotive manufacturers establishing stringent color matching standards for virtually every colored component in the vehicle, traditional methods of visual color evaluation are no longer acceptable. The widespread use of special effect pigments containing aluminum flakes, micas, and pearlescent and interference pigments has made color design capabilities virtually limitless, increasing the complexity of obtaining a perfect color match. The system described in this paper provides a purely objective match measure for evaluating the goodness of a color match. The texture camera described in [6][7][8] has been utilized in this work to capture multiview images of the coating to be analyzed under different illumination angles. These multiview images are used to characterize the color travel of a coating as well as analyze individual appearance attributes of the finish such as the orange peel, sparkle, and texture of the flakes within the coating.
Fig. 1. Left: Color travel of an automotive coating. The presence of metallic flakes causes the blue paint to show a change in brightness with viewing angle. The coating appears brightest at near-specular angles. The brightness then decreases as the aspecular angle increases. This change in brightness is termed color travel [9]. Notice how the color travel enhances the visual appearance of the curved surface. The sparkle and texture of the flakes is also seen clearly. An orange peel effect can also be seen by observing the reflected image in the coating. Right: Three black paint samples showing varying degrees of orange peel effect. Shown in the figure is a reflection of a fluorescent light fixture seen in the panels. Notice that while the reflected image appears hazy in sample A, it is clear in samples B and C. The orange peel effect is clearly seen when one observes the roughness of the grid-like pattern of the fixture reflected in Panel C.
2 Matching Technique
The variation in brightness with viewing and illumination angle exhibited by a metallic flake finish is captured by the BRDF (bidirectional reflectance distribution function). Since the BRDF of each finish is unique, it can be used as a match measure to differentiate between coatings that closely resemble each other. Measurement of the BRDF using traditional methods such as a gonioreflectometer is cumbersome [10]: in these traditional devices, the sensor and light source are moved to multiple combinations of viewing and illumination angles. Instead, we make use of a texture camera [6][7][8] that can measure multiple views of a surface using a simple imaging procedure. The device instantaneously records reflectance from multiple viewing directions over a partial hemisphere and conveniently controls the illumination direction over the hemisphere. There are no angular movements of parts, only planar translations of an aperture and a mirror.
2.1 Measurement Device
The texture camera (Texcam), described in [6][7][8], consists of a parabolic mirror, a CCD camera, and translation stages. The imaging components and their arrangement are illustrated in Figure 2. The beam splitter allows simultaneous control of viewing and illumination direction. A concave parabolic mirror section is positioned so that its focus is coincident with the surface point to be measured. The illumination source is a collimated beam of light parallel to the global plane of the surface and passing through a movable aperture.
Fig. 2. BRDF/BTF Measurement Device. The surface point is imaged by a CCD video camera observing an off-axis concave parabolic mirror to achieve simultaneous observation of a large range of viewing directions. The device achieves illumination direction variations using simple translations of the illumination aperture. Measurements of bidirectional texture are accomplished by translating the surface in the X - Y plane. The lab prototype (right) can be miniaturized for a compact device. The diagram on the left shows the light source along the Z axis, while the prototype shows the light source positioned along the Y axis. Both configurations are valid.
An incident ray reflecting off the mirror will reach the surface at an angle determined by the point of intersection with the mirror. Therefore, this aperture can be used to conveniently orient the illumination direction. The aperture ensures that only a spot of the concave mirror is illuminated, and therefore one illumination direction is measured for each aperture position. The light reflected at each angle is reflected from the mirror to a parallel direction and diverted by the beam splitter to the camera. The camera is equipped with an orthographic or telecentric lens that images the light parallel to the optical axis. The camera, positioned so that its optical axis lies along the X-axis, views the image of the mirror. The image of the mirror as seen by the camera corresponds to reflectance measurements from all angles in a partial hemisphere, as seen in Figure 3. The images captured by the CCD camera can be used to extract specific viewing angles.
2.2 Imaging Procedure
The spatial sampling of the imaged surface is obtained through planar translations of the sample surface. The spatial sampling, or equivalently the size of the texture image pixel, is therefore controlled by the movement of the surface. The stage can be made to move in very small increments (0.0025 mm for the current prototype), although the optical imaging principles limit the system resolution. To obtain a texture image, the illumination aperture is positioned to achieve the correct illumination direction. A single pixel that corresponds to the desired viewing direction is identified in the camera image. The surface is then translated along a two-dimensional grid in the X-Y plane. The pixel is acquired for each surface position, and the pixels from each of the images are then put together in the proper sequence to obtain the texture image for one viewing direction.
Fig. 3. Instantaneous multiple view images. The texture camera is able to image all views of a point on the sample surface within a partial hemisphere simultaneously within a single image. Information about the surface, such as the specularity and body color, is obtained instantaneously through the same image. Left: The Texcam image (i.e. the BRDF) of a piece of glossy blue cardboard. Right: Texcam image of blue metallic paint. Notice how the BRDF of the metallic paint is complex. Besides the primary specularity and the blue body color, the flakes in the paint also produce secondary peaks in the image.
This procedure is illustrated in Figure 4. To obtain texture images for several viewing directions, several such pixels, each corresponding to a different viewing direction, are selected from each image and are then put together. An example of these texture images and their corresponding viewing angles is shown in Figure 4. A sketch of this reconstruction is given below.
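A minimal sketch of assembling per-view texture images from a raster scan of Texcam frames; the row-major scan ordering, the mapping from viewing directions to mirror-image pixels, and the names used are assumptions for illustration.

import numpy as np

def assemble_texture_images(frames, view_pixels, grid_shape):
    """Assemble one texture image per viewing direction from a raster scan.
    frames: list of camera images, one per surface position, in row-major scan order.
    view_pixels: dict mapping a view index to the (row, col) pixel of the mirror
                 image that corresponds to that viewing direction.
    grid_shape: (rows, cols) of the scan grid."""
    rows, cols = grid_shape
    assert len(frames) == rows * cols
    textures = {}
    for view, (pr, pc) in view_pixels.items():
        tex = np.empty((rows, cols) + frames[0].shape[2:], dtype=frames[0].dtype)
        for k, frame in enumerate(frames):
            tex[k // cols, k % cols] = frame[pr, pc]   # one pixel per surface position
        textures[view] = tex
    return textures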
3 Experimental Results
The goal of the experiments performed using metallic coatings was to evaluate the usefulness of the Texcam in producing color matches that show a high correlation with the visual judgement of color matching experts. Two sets of panels were used in the experiments. The first set consists of metallic black panels, while the second set consists of three subsets of colored panels: red, green, and blue. The panels in each set were subjected to visual assessment by color matching experts, performed in a controlled environment. The panels were then imaged by the Texcam and matched based on their reflectance measurements across multiple viewing directions, to determine whether the matches produced by the Texcam agreed with those of the color matching experts. The texture camera was used to reconstruct a 5mm×5mm patch of each of the panels as seen from 50 different viewing directions. The different views of the patch were then put together in sequence to form a composite spatial reconstruction. The advantage of such a composite spatial reconstruction is that it allows all the visual characteristics of the coatings to be visualized simultaneously. The precise color travel of the coating, surface texture, primary specularity, and sparkle due to the flakes may all be observed in a single image, as seen in Figure 5. The sequence of different viewing directions of the panel represents the color travel of the coating.
Fig. 4. Left: Imaging procedure of the Texcam. A viewing direction is selected by choosing the appropriate pixel from the multiview image. The surface to be scanned is then translated along the X-Y plane and several such pixels from each multiview image are put together in the scanning sequence to produce a spatial reconstruction of the surface as seen from that viewing direction. Right: An illustration of the viewing directions that are reconstructed by the camera. Shown in the figure are the spatial reconstructions of the square patch of the panel as the polar angle is changed. 50 viewing directions are captured; the first 25 are for a fixed azimuth angle of −90◦ with the polar angle varying from 0◦ to 24◦ in increments of 1◦, and the next 25 are for a fixed azimuth angle of +90◦ with the polar angle varying from 0◦ to 22◦ in increments of 1◦. The square patch on the extreme left would therefore correspond to a viewing direction that has a polar angle of 24◦ and an azimuth angle of −90◦, while the square patch on the extreme right would correspond to a viewing direction that has a polar angle of 22◦ and an azimuth angle of 90◦.
The color travel is quantified as the BRDF of the coating, and the BRDFs of different coatings can be compared to determine whether they match. It is important to note that, unlike a conventional multi-angle spectrophotometer, reflectance measured by the texture camera is independent of the placement and orientation of the panel [11]. To determine whether a particular coating matches the standard, we first find the average RGB value of the patch for each viewing direction. The images are intensity normalized, and thresholding is used to remove the saturated measurements that occur close to the specularity. The difference in the RGB components for each viewing direction gives the error between the sample panel and the standard and is used to determine the goodness of the match; a sketch of this computation is given below.
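A sketch of this match measure; the normalization, the saturation threshold, and the RMS summary of the per-view RGB differences are our assumptions, since the paper does not give exact values.

import numpy as np

def per_view_mean_rgb(textures, sat_thresh=250):
    """Average RGB per viewing direction, after a simple intensity
    normalization and removal of near-saturated pixels close to the
    specularity. textures: array of shape (n_views, H, W, 3)."""
    views = textures.astype(np.float32)
    views = views / views.max()                          # assumed normalization
    means = []
    for v in views:
        unsat = (v * 255.0 < sat_thresh).all(axis=-1)    # drop saturated pixels
        means.append(v[unsat].mean(axis=0) if unsat.any() else np.zeros(3))
    return np.array(means)                               # shape (n_views, 3)

def match_error(sample_means, standard_means):
    """Per-view RGB difference between a sample panel and the standard,
    summarized as an overall RMS error."""
    diff = sample_means - standard_means
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))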
3.1 Experiment 1
The first experiment consisted of matching 15 metallic black panels. The panels are divided into three subsets of five panels each, named A, B, and C. The evaluation of the panels by the color matching experts can be summarized as follows: panels from set A represent the color standard and match each other; panels from set B match each other, but do not quite match those from set A; the panels from set C do not match each other and are considered a poor match for the panels from sets A and B. The set of black panels was then studied with the Texcam using the BRDF-based matching technique.
Fig. 5. The spatial reconstruction of a square patch from a black panel showing its three distinct components.(a): surface texture (b): primary specularity (c): subsurface reflectance. Note the sparkle that is observed in the subsurface reflectance; also note that an orange peel effect is visible in the surface texture. The measurements differ in texture appearance indicating that the texture camera is capable of imaging each component distinctly. Each square represents a different view of the same patch. When viewed as a sequence, the spatial reconstruction shows the precise change in color with viewing direction.
the Texcam using the BRDF based matching technique. Figure 6 shows a plot of the BRDF of a black panel. Panels that have the same color travel will have similarly shaped BRDF curves. We therefore overlap and compare the curves of different panels to classify them into the sets A, B and C. Figure 6 shows the BRDF curves for the panels of set A, B and C. Note that although the curves for panels A1 and A3 have the same shape as the rest of the panels in the set, they appear to be displaced toward the right. This is due to warping of the panels. The curves indicate that all the panels from set A match each other. The BRDF curves for the panels of set B overlap each other almost perfectly and hence all panels from set B match each other. The BRDF curves for the panels from set C do not overlap with each other indicating that the panels from set C do not match other panels within the same set. Furthermore, when the BRDF curves of sets A, B and C are overlapped, as shown in Figure 7, it is clear that the panels from set A nearly match set B, but those of set C do not match B at all. These observations are in direct agreement with the experts' opinion.
3.2 Experiment 2
The second experiment involved the matching of colored panels using the Texcam. Three sets of colored panels were used - red, green and blue. Each set consists of four panels. The first panel is a color standard. The remaining three panels in each set were subjected to visual assessments, performed in a controlled environment, and were unanimously ranked by three color matching experts
Fig. 6. Left to right: Plot of the logarithm of the average RGB value against viewing direction for the sets of black panels A, B and C. The viewing directions are shown in Figure 4. The peak represents the brightness observed when the panel is viewed at the specular angle, which then drops off on either side as the viewing direction moves away from the specular angle. Notice that in the BRDF curves of panels belonging to set A, A1 and A3 are displaced to the right of those of A2, A4 and A5. This is due to the warping of panels A1 and A3. The characteristic shape of the BRDF however, remains unchanged. All the panels in set B have the same color and appearance. They match each other perfectly as seen by their overlapping BRDF curves. The non overlapping BRDF curves of the panels in set C on the other hand suggest that they do not match each other in appearance.
Fig. 7. Overlapping plots of the logarithm of the average RGB value against viewing direction for two sets of black panels. Left: Plots of the logarithm of the average RGB value against viewing direction of set A vs those of set B. Right: Plots of the logarithm of the average RGB value against viewing direction of set B vs those of set C. Notice that the panels of set B do not match those of A, although they match the other panels within the set. Similarly, panels from set C do not match those in set B, nor do they match each other. The difference in the characteristic curves is in agreement with the opinion of the color matching experts regarding which panels match each other.
independently of one another, as an acceptable (A), borderline (B), or unacceptable (C) match for the standard. Our first set of colored panels analyzed using the Texcam is the metallic red color. Figure 8 shows a plot of the RGB error versus viewing angle. Panel C is clearly an unacceptable match and shows a large error across most of the viewing directions. Visual evaluation confirms that it indeed appears darker than the standard on flop. The error in viewing directions 12 through 15 for Panel B indicates that the color travel from light
Fig. 8. Top to bottom: RGB error versus viewing direction for the three sets of colored panels: an OEM red, green and light blue. In each figure, A, B and C are the acceptable, borderline and unacceptable match as ranked by the color matching experts. An error for viewing directions 12 through 17 indicates that the panel shows a different color travel, while an error for viewing directions 20 through 50 indicates a difference in the flop color of the panel as compared to the standard. C is clearly the unacceptable match, since it has a different flop color as compared to the standard, indicated by error peaks for almost all viewing directions. B shows an error for viewing directions 12 through 17, indicating that it has a slightly different color travel from the standard, but the same flop color. It is therefore a borderline match. A shows the least error across all viewing directions, indicating that it has the same color travel and flop color as the standard. Panel A is therefore an acceptable match. This trend is true except for the light blue panels. For the light blue panels, panel A, considered an acceptable match by visual judgement, shows a significant error when compared to the standard. This can be explained by observing the spatial reconstruction of the standard and panel A and noticing that although the color travel of the two panels is the same, the flop color of A is lighter than that of the standard. This causes the greater RGB error value for panel A seen in the error plot.
to dark occurs slightly later as compared to the standard. Panel B can therefore be considered as a borderline match. Panel A shows negligible error across all viewing directions, indicating that it has the correct color and travel, and so is an acceptable match for the standard. The second set of panels is a metallic green color. Panels A and B show a significant error in viewing directions 14 through 17. This indicates that the color travel for these panels slightly differs from the standard. Apart from this minor difference in the color travel, panel A does not show a significant error for the flop color and is an acceptable match when compared to the color standard. Panel B shows some error in flop color for a few viewing directions and is therefore a borderline match for the color standard. Panel C shows the same color travel as the standard, which is reflected in the insignificant error for viewing directions 14 through 17. Panel C does, however, show a significant error between viewing directions 20 through 50, indicating that it does not match the color standard in its flop color. Panel C is therefore an unacceptable match for the standard. The third set of panels is a metallic light blue. Panel A shows a large error for viewing directions 20 through 50. This indicates a mismatch in flop color. Visual evaluation confirms this, with panel A being lighter in color as compared to the standard. The error plot for viewing directions 12 through 15 however indicates that it does show the same color travel. Panel B shows an error in viewing directions 12 through 15 indicating a slight difference in color travel from light to dark. It also does not show a large error in viewing directions 20 through 50, indicating that it is similar in flop color to the standard. Panel B can therefore be regarded as a borderline match. Panel C shows a large error in flop color, which can clearly be seen by noticing the error plot for Panel C between viewing directions 20 through 50. Panel C is therefore an unacceptable match for the color standard.
4 Discussion and Implications
A completely automated and objective matching technique for automotive coatings has been developed using the texture camera. The texture camera was used to reconstruct a 5mm×5mm patch of each of the panels as seen from 50 different viewing directions. The different views of the patch were then put together to form a composite spatial reconstruction. The precise color travel of the coating, as well as its appearance-based characteristics such as the orange peel, primary specularity and sparkle due to the flakes, can all be observed simultaneously in the composite spatial reconstruction. This image also makes it possible to quantify the color travel of the coating as its BRDF and match the panels based on their BRDF. These matches produced by the Texcam were shown to have a high correlation with the professional judgement of the color matching experts. Furthermore, the analysis and matching of a metallic coating by the Texcam is extremely fast - the imaging of the surface from 50 different viewing directions and the subsequent comparison is achieved in under 15 seconds. The large number of viewing directions that can simultaneously be imaged by the Texcam
makes it superior to a conventional multi-angle spectrophotometer. Metallic effect pigments that contain aluminum or mica flakes as well as newer families of pigments like Xirallic exhibit an intense sparkle when viewed under direct sunlight. This sparkle can neither be captured nor sufficiently characterized with traditional multi-angle spectrophotometers [12]. Also, unlike the conventional multi-angle spectrophotometer, the color travel information obtained using the Texcam is independent of the orientation of the panel [11]. These features of the Texcam make it a potentially viable solution for automated matching of metallic flake finishes.
References

1. Besold, R.: Metallic effect - characterization, parameter and methods for instrumental determination. Die Farbe, 79–85 (1990)
2. Buxbaum, G., et al.: Industrial Inorganic Pigments (1993)
3. McCamy, C.S.: Observation and measurement of the appearance of metallic materials. Part I: Macro appearance. Color Research and Application, 292–304 (1996)
4. Rodrigues, A.: Color technology and paint. In: Color and Paints Interim Meeting of the International Color Association Proceedings, pp. 103–108 (2004)
5. Ershova, S., Kolchina, K., Myszkowski, K.: Rendering pearlescent appearance based on paint-composition modelling. In: The European Association for Computer Graphics 22nd Annual Conference: EUROGRAPHICS 2001, vol. 20(3), pp. 227–238 (2001)
6. Dana, K., Wang, J.: Device for convenient measurement of spatially varying bidirectional reflectance. Journal of the Optical Society of America, 1–12 (2004)
7. Dana, K., Wang, J.: A novel approach for texture shape recovery. In: International Conference on Computer Vision, pp. 1374–1380 (2003)
8. Dana, K.: BRDF/BTF measurement device. In: International Conference on Computer Vision, pp. 460–466 (2001)
9. Westlund, H.B., Meyer, G.W.: Applying appearance standards to light reflection models. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 501–510 (2001)
10. Li, H., Foo, S.C., Torrance, K.E., Westin, S.H.: Automated three-axis gonioreflectometer for computer graphics applications. In: Proc. SPIE Advanced Characterization Techniques for Optics, Semiconductors, and Nanotechnologies II, vol. 5878 (2005)
11. Davis, D.J.: An investigation of multi-angle spectrophotometry for colored polypropylene compounds. In: SPE/ANTEC Proceedings, pp. 2663–2671 (1996)
12. Streitberger, H.J., Dössel, K.F.: Automotive paints and coatings, p. 404 (2008)
A Shape and Energy Based Approach to Vertical People Separation in Video Surveillance

Alessio M. Brits and Jules R. Tapamo

School of Computer Science, University of KwaZulu-Natal, Westville Campus, Durban 4000, South Africa
[email protected],
[email protected]
Abstract. In this paper we explore various methods which can be used to vertically separate groups of people in video sequences. Firstly, we discuss the technique used to create a horizontal height projection histogram from the shape of a group of people. We then use two techniques to split this histogram, and develop a vertical seam from the splitting points. The vertical seam is calculated by maximizing the energy of all possible seams using the intensity of the edges. Testing was performed on the CAVIAR data set. We achieved promising results, with the highest average segmentation accuracy at 93.38%.
1 Introduction
Video surveillance has become a widely used tool in several domains. A considerable amount of video footage is collected in many of these domains and, as a consequence of this large amount of data that needs to be processed, automated and intelligent systems are essential in order to extract useful knowledge. Video surveillance systems are traditionally used in the forensic analysis of crimes; recently, however, they have also become popular for the prevention and detection of such crimes. Finding methods and techniques to interpret and analyze a video scene has received considerable attention in recent years. In [1] crowd analysis is performed by estimating the number of people in a scene containing large crowds. This is done by combining several existing image processing and machine learning techniques. With the use of a Support Vector Machine the system is trained to recognize the shape and contours of a person's head. The features used are extracted by applying statistical methods on the Haar Wavelet Transform of the original image. However, this process is only accurate when the people in a crowd are on the same horizontal plane. This problem is then solved by applying a perspective transform to the images before any feature is calculated. By calculating the vanishing point in the scene, people's heads are then extracted on any horizontal plane in a given image. Once the heads in the scene have been identified, they are grouped. By splitting each group into intervals and making the assumption that the spacing between heads is constant, an estimate of the
size of the crowd is calculated. The authors have then established that by combining the shape and contour of a human head, an accurate estimate of crowd density can be obtained, even without prior knowledge of the scene.

Some methods used for crowd analysis involve identifying the contour of a person in order to determine their actions and intentions. One such method is presented by Yokoyama and Poggio in [2]. They first apply an adapted form of optical flow combined with an edge detector. Then, by thresholding edges with little motion and removing edges that were present in the background model, the sets of edges for each foreground object are obtained. Using the Nearest-Neighbor technique, these sets are grouped into distinct objects and then snakes [3] are used to find the contour of each separate object. The above process does not cater for occluded objects, which are only taken care of when the tracking starts. As each object is tracked, several states are detected in order to identify whether an object is occluded, reappeared, merged or separated.

Lara and Hirata [4] discuss a method for contour extraction in an image sequence. They apply the external morphological gradient to first calculate the edges of an entire frame. Then a rough contour of the target object is obtained by modeling the background from these edges using the median of a set of previous frames, and calculating the difference between the edges of the background and the current frame. The contour is refined by applying several other morphological steps including noise filtering. As Lara and Hirata primarily make use of mathematical morphological operators, the computational time complexity of this technique is reasonable considering the results achieved. Minor further improvements to this technique would allow its use in real time applications. This method does not cater for occlusion and further improvements would be needed to extract the correct contours of occluded objects.

Most methods that analyze video scenes require the analysis of the objects represented in the scene. Several key problems need to be addressed when attempting to identify and calculate feature descriptors for each object. Motion detection is required to locate areas of interest in a given video scene; this is often solved by applying a background subtraction technique. The blobs detected by the background subtraction technique will need to be separated and identified. A connected component algorithm, as used in [5], is commonly applied to extract, separate and identify these blobs. In our context we assume that every moving object is a person or a group of people. In crowd analysis most of the methods emphasize locating the object for which feature descriptors are required; in this paper, however, the main focus will be the separation of people within a group. This is to ensure that feature extraction techniques will be applied to single objects, instead of being applied to the combination of two or more people. Often this problem is solved by applying several derivations of the classic occlusion detection techniques discussed in [6]. One common approach is described by Elgammal and Davis in [7]. In this technique a person is split into three distinct areas (head, torso and legs); color and spatial features of each of these areas are then calculated. Using a technique to maximize
the likelihood that a pixel belongs to one object, the object can then accurately be tracked over a video sequence. Occlusion is modeled using two methods. The first is to reason whether one object is in front of or behind another object. Using elliptical regions to label each object, the relative depth of each object can then be evaluated. Each pixel is labeled as belonging to one object or the other, and the labeling that yields the smallest error determines which object is being occluded. The second occlusion modeling method uses the relative depth of an object by evaluating the probability that a ray from a given pixel would pass through one object before another. Once it is determined that an object is occluded by another object, the occluded object is then correctly re-detected by comparing its current features to the features extracted earlier. The results obtained using this method are shown to be reasonably good at solving the occlusion problem while tracking people. The main problem with this technique is that it requires the objects to be separated at first, before occlusion takes place.

Many methods for tracking using shapes have been proposed in the literature. In [8] a system is developed which uses the shape of a silhouette of an object to identify the number of people present in that object. This is done by locating areas within the silhouette that resemble the head, arms and legs of a person. Then, applying the assumption that the head is always above the torso, each person's head is located and counted. The area of the silhouette to which each person belongs is then extrapolated by evaluating the normalized distance from the head to the torso, etc. Once this is finalized, the tracking of each person proceeds based on their detected heads, and matching them is based on the appearance model that is stored for each person. The position of each object is also predicted. This is to ensure that a completely occluded object is still tracked. Their results seem promising; however, most of the segmentation boundaries that separate people are straight lines.

In this paper we will discuss a novel method for segmenting people in order to reduce the number of errors in tracking due to occluded or touching objects (see example in figure 1). The method will not require any prior information, such as features of the objects, thereby decreasing the number of possible errors in techniques that require initialization such as in [7].
Fig. 1. Example of two people that need to be separated vertically. Ideal Separation Boundary Highlighted.
2 Description of Methods and Techniques
In order to separate groups of people vertically, the moving objects (blobs) in the video sequence are first converted into a histogram. A rough segmentation is obtained by splitting this histogram into separate unimodal histograms. This segmentation is then further refined by using an energy function to find a vertical seam.

2.1 Component Isolation
In order to analyze and detect how many people need to be separated and where to separate them, each connected region's shape is considered. All the connected regions are extracted by first running a background subtraction on the video, then performing a connected component detection on the foreground mask. In our case we use a Mixture of Gaussians for background subtraction [9] and apply two dilations followed by two erosions in order to remove unwanted holes in the foreground. To isolate each component a connected component algorithm is applied and all components that are smaller than a preset threshold are removed.
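A possible implementation of this isolation step is sketched below using OpenCV's Mixture-of-Gaussians background subtractor, two dilations followed by two erosions, and connected-component labeling; the structuring element size and the area threshold are illustrative choices, not the values used by the authors.

```python
import cv2
import numpy as np

def isolate_components(frame, subtractor, min_area=500):
    """Foreground blobs via Mixture-of-Gaussians background subtraction.

    subtractor: e.g. cv2.createBackgroundSubtractorMOG2(); min_area is an
    illustrative size threshold. Returns a list of binary masks, one per
    sufficiently large connected component.
    """
    fg = subtractor.apply(frame)
    fg = (fg > 127).astype(np.uint8) * 255      # drop shadow labels, keep foreground
    kernel = np.ones((3, 3), np.uint8)
    fg = cv2.dilate(fg, kernel, iterations=2)   # two dilations ...
    fg = cv2.erode(fg, kernel, iterations=2)    # ... followed by two erosions
    num, labels, stats, _ = cv2.connectedComponentsWithStats(fg)
    masks = []
    for i in range(1, num):                     # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            masks.append((labels == i).astype(np.uint8))
    return masks
```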
2.2 Shape Analysis
To perform a vertical separation, the horizontal shape of each component needs to be considered. This is done by projecting the maximum height of each vertical column onto a histogram, see figure 2.
Fig. 2. Left: Original Image. Middle: Detection of Significant Blob. Right: Horizontal Height Projection Histogram.
In other words we can first find the bounding box [x_b, y_b] × [x_e, y_e] (see figure 3) around the connected component B to be split. Then let F be the binary foreground mask obtained from the background subtraction. We will have:

B = { (x, y) ∈ [x_b, y_b] × [x_e, y_e] : F(x, y) = 1 } .    (1)

The horizontal height projection histogram h(y) (see figure 3) is thus defined as follows:

h(y) = x_e − x_min(y) + 1   for y ∈ [y_b, y_e] ,    (2)

where

x_min(y) = min_{x ∈ [x_b, x_e]} { x : F(x, y) = 1 } .    (3)
Fig. 3. Graph describing the formulation of the Horizontal Height Projection Histogram
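The histogram of Eqs. (1)–(3) can be computed directly from the cropped foreground mask, as in the following sketch; treating rows as the x coordinate and columns as the y coordinate is an assumption made for illustration.

```python
import numpy as np

def height_projection_histogram(mask):
    """Horizontal height projection histogram of a connected component.

    mask: 2D boolean array, True for foreground, cropped to the component's
    bounding box (rows play the role of x, columns the role of y).
    Returns h(y) for every column y of the bounding box.
    """
    num_rows, num_cols = mask.shape
    x_e = num_rows - 1                      # bottom row of the bounding box
    h = np.zeros(num_cols, dtype=int)
    for y in range(num_cols):
        xs = np.flatnonzero(mask[:, y])     # foreground rows in column y
        if xs.size:
            h[y] = x_e - xs.min() + 1       # Eq. (2): height above the box bottom
    return h
```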
As we are separating people, the distinct shape of a person can be easily described as a unimodal horizontal height projection histogram around each mode. Therefore when analyzing the horizontal height projection histogram of any component, it can be concluded that the number of people within this component is at least the number of modes detected in the histogram. In figure 2, it is clearly shown that there are two modes, therefore two people. In order to minimize errors that could occur in the horizontal height projection histogram h(y), a smoothing is performed on h(y) by adaptively calculating the standard deviation of the Gaussian [10] and applying the Gaussian convolution on h(y). Although the Gaussian smoothing does reduce errors when calculating the horizontal height projection histogram, to minimize them further a median filter is applied to the Gaussian smoothed histogram. This removes any small changes in the gradient that might otherwise indicate a trough in the histogram. Thereafter we count the number of times the gradient of the histogram crosses the zero mark. This number indicates the number of modes within the histogram and therefore the number of people the component should be split into.
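A minimal sketch of this mode-counting step is given below; the Gaussian spread and median-filter width are fixed illustrative values rather than the adaptively chosen parameters of [10], and the zero-crossing count is expressed here as the number of positive-to-negative sign changes of the gradient (i.e., the number of peaks).

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, median_filter

def count_modes(h, sigma=2.0, median_size=5):
    """Estimate the number of modes (people) in a height projection histogram."""
    smoothed = gaussian_filter1d(np.asarray(h, dtype=float), sigma)
    smoothed = median_filter(smoothed, size=median_size)
    gradient = np.diff(smoothed)
    sign = np.sign(gradient)
    sign = sign[sign != 0]                        # ignore flat runs of the gradient
    # A mode corresponds to a peak: the gradient changing sign from + to -.
    peaks = np.sum((sign[:-1] > 0) & (sign[1:] < 0))
    return max(int(peaks), 1)
```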
2.3 Mode Separation
Once the number of modes in the horizontal height projection histogram is calculated, the separation between the modes must be found in order to separate the people that the histogram represents. Two methods are described below. Simple Method: A method of separating the histogram into unimodal segments is to separate it whenever the gradient of the histogram crosses the zero mark. In figure 4 this method has been used to split the component into unimodal segments for two different frames.
Fig. 4. Separation by the Simple Method on two selected frames
Using Otsu's Method: A common method to separate a multi-modal histogram into separate unimodal parts is Otsu's method [11]. Otsu's method calculates the optimal thresholds {t_1, t_2, ..., t_M}, where M is the number of modes minus one, by calculating the maximum between-class variance σ²_BCV:

{t*_1, t*_2, ..., t*_M} = arg max { σ²_BCV(t_1, t_2, ..., t_M) } ,    (4)

where

σ²_BCV(t_1, ..., t_M) = Σ_{i=1}^{M} w_i (μ_i − μ_T)² ,  with  w_i = Σ_{k=t_{i−1}}^{t_i} f(k)/N  and  μ_i = Σ_{k=t_{i−1}}^{t_i} k f(k)/(N w_i) ,    (5)

where f(x) is the histogram, N = Σ_{i=1}^{L} f(i), and L is its length.
The original version of this method is computationally expensive; a faster version of it was developed by Liao et al [12]. In figure 5 Otsu’s method has been used to split the component into unimodal segments for two different frames. Note how using either method gives the same result in the first frame shown in figures 4 and 5, however in the second selected frame the results are quite different.
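For reference, a brute-force version of multilevel Otsu thresholding is sketched below; it maximizes the between-class variance of Eq. (5) by exhaustive search, which is only practical for a small number of thresholds — the recursive formulation of Liao et al. [12] is what makes the method fast in practice. The class-boundary convention used here is an assumption.

```python
import numpy as np
from itertools import combinations

def multilevel_otsu(hist, num_thresholds):
    """Exhaustively search for thresholds maximizing the between-class variance."""
    hist = np.asarray(hist, dtype=float)
    L = len(hist)
    N = hist.sum()
    bins = np.arange(L)
    mu_T = (bins * hist).sum() / N

    def bcv(thresholds):
        # Classes are the ranges [0, t1), [t1, t2), ..., [tM, L).
        edges = [0] + list(thresholds) + [L]
        total = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            w = hist[lo:hi].sum() / N
            if w == 0:
                continue
            mu = (bins[lo:hi] * hist[lo:hi]).sum() / (N * w)
            total += w * (mu - mu_T) ** 2
        return total

    return list(max(combinations(range(1, L), num_thresholds), key=bcv))
```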
Fig. 5. Separation by using Otsu's Method on two selected frames
2.4 Energy Based Segmentation
The separation based on the horizontal height projection histogram of a component into different objects only depends on the first pixel. This is hardly accurate,
Fig. 6. Left: Vertical Seam Calculated using the Greedy Algorithm. Right: Overlaid over original frame.
especially further away from the initial pixel. Therefore a method is needed to analyze each pixel below the initial pixel to separate the component more accurately. A method developed by Avidan and Shamir [13] reduces the horizontal size of an image by removing a vertical seam, which corresponds to the least energy, thereby not greatly reducing the overall quality of the image. They define a vertical seam to be a set of 8-connected pixels from the top of the image to the bottom with only one pixel in each row. In contrast to Avidan and Shamir, we want to find a vertical seam which corresponds to the highest energy, that is, we want to find the pixels of an image that exhibit large changes in their neighborhood. One method of computing the maximum deviation between two pixels is to calculate the edges of the image and use those edges in the energy calculation. There are many edge detection algorithms; we compare edges calculated from the Sobel edge detection (only vertical edges are required, as we only want a vertical separation) with edges calculated using the external morphological gradient. The latter is calculated as the difference between the dilation and the erosion of the image. The energy E(i, j) at pixel (i, j) is the pixel intensity of the edge detection of image I at (i, j).

Greedy Algorithm: The vertical seam from the initial point derived from the separation of the horizontal height projection histogram, to the bottom of the component, can be calculated as follows:

1. Initialize the current pixel to the initial pixel found by the splitting method.
2. Find all pixels in the next row which are 8-connected to the current pixel and calculate their energy.
3. Choose the pixel with the maximum energy and set it as the current pixel. Go to step 2 and repeat until the entire seam is calculated.

Figure 6 shows the result of applying the Greedy Algorithm on a selected frame. This is a simple and computationally efficient way of calculating the vertical seam; however, it will not necessarily choose the seam with the maximum energy, since choosing the local maximum at each iteration does not guarantee the global maximum.
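A sketch of the greedy seam tracing described above is given below; it assumes the energy image is already cropped to the component and that the seam runs from the initial split point to the last row.

```python
import numpy as np

def greedy_vertical_seam(energy, start_row, start_col):
    """Trace a vertical seam downward, greedily following the highest energy.

    energy: 2D array of edge intensities E(i, j).
    (start_row, start_col): initial pixel from the histogram split.
    Returns the list of (row, col) pixels on the seam.
    """
    rows, cols = energy.shape
    seam = [(start_row, start_col)]
    col = start_col
    for row in range(start_row + 1, rows):
        # 8-connected candidate columns in the next row.
        candidates = [c for c in (col - 1, col, col + 1) if 0 <= c < cols]
        col = max(candidates, key=lambda c: energy[row, c])
        seam.append((row, col))
    return seam
```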
Fig. 7. Left: Vertical Seam Calculated using the Dynamic Algorithm. Right: Overlaid over original frame.
Dynamic Algorithm: A more efficient method for calculating the global maximum uses dynamic programming. However, in some cases the global maximum seam jumps from one extreme to another between frames. This occurs when there are two seams with similar energy. It causes the segmentation method to become less focused on the initial starting point, which often results in it choosing regions on the outside of the objects and makes the segmentation line move erratically over the entire object between frames. To remove this effect a limiter is placed on horizontal movement. This limiter is applied by multiplying the energy by the Gaussian at that point. Therefore the energy used in the vertical seam is calculated by:

E*(i, j) = E(i, j) · (1 / (σ√(2π))) exp( −x² / (2σ²) ) ,    (6)

where x is the horizontal distance from the initial point, σ = (N − 1)/6, and N is the maximum height of the vertical seam (the height from the initial point to the bottom of the object). Therefore, to calculate the global maximum using dynamic programming, a method similar to the one used in [13] is applied:

M(i, j) = E*(i, j) + max( M(i − 1, j − 1), M(i − 1, j), M(i − 1, j + 1) ) .    (7)
Figure 7 shows the results of applying the Dynamic Algorithm on a selected frame.
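The Gaussian-limited dynamic-programming search of Eqs. (6)–(7) could be implemented along the following lines; the handling of image borders and the backtracking step are filled in as reasonable assumptions, since the paper does not spell them out.

```python
import numpy as np

def dynamic_vertical_seam(energy, start_row, start_col):
    """Maximum-energy vertical seam below (start_row, start_col), Eqs. (6)-(7)."""
    rows, cols = energy.shape
    n = rows - start_row                       # maximum height of the seam, N
    sigma = max((n - 1) / 6.0, 1e-6)
    x = np.arange(cols) - start_col
    weight = np.exp(-x ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    e_star = energy[start_row:].astype(float) * weight          # E*(i, j)

    m = np.full_like(e_star, -np.inf)          # cumulative maximum energy M(i, j)
    m[0, start_col] = e_star[0, start_col]
    for i in range(1, n):
        left = np.empty(cols);  left[0] = -np.inf;  left[1:] = m[i - 1, :-1]
        right = np.empty(cols); right[-1] = -np.inf; right[:-1] = m[i - 1, 1:]
        m[i] = e_star[i] + np.maximum(m[i - 1], np.maximum(left, right))

    # Backtrack from the best column in the last row.
    cols_on_seam = [int(np.argmax(m[-1]))]
    for i in range(n - 1, 0, -1):
        j = cols_on_seam[-1]
        lo, hi = max(j - 1, 0), min(j + 2, cols)
        cols_on_seam.append(lo + int(np.argmax(m[i - 1, lo:hi])))
    cols_on_seam.reverse()
    return [(start_row + i, j) for i, j in enumerate(cols_on_seam)]
```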
3 Experimental Results and Discussions
All algorithms were programmed in un-optimized and non-parallelized C++ code, using videos from the CAVIAR data set. Several experiments were conducted which include the combinations of Histogram Separation (Simple or Otsu), Edge Detection (Sobel or External Morphological Gradient) and Vertical Seam Calculation (Greedy or Dynamic). The following eight combinations were used:

1. Simple Separation + Morphological Edge + Greedy Energy Algorithm
2. Simple Separation + Morphological Edge + Dynamic Algorithm
3. Simple Separation + Sobel Edge Detector + Greedy Energy Algorithm
4. Simple Separation + Sobel Edge Detector + Dynamic Algorithm
5. Otsu Separation + Morphological Edge + Greedy Energy Algorithm
6. Otsu Separation + Morphological Edge + Dynamic Algorithm
7. Otsu Separation + Sobel Edge Detector + Greedy Energy Algorithm
8. Otsu Separation + Sobel Edge Detector + Dynamic Algorithm
The results of the above experiments are displayed in Table 1. Accuracy was tested on selected key frames, and the percentage of pixels correctly segmented over the total number of pixels is recorded. In order to interpret the accuracy of the segmentation, each result was manually checked and any pixels belonging to another object were counted as incorrectly classified pixels. Mode counting is not always accurate in counting the number of people in a group (usually because of background subtraction problems, or because people are situated directly behind other people), so there are several cases in our experiments where the number of separation boundaries did not equate to the number of people. Figure 8b shows an example where the background subtraction created large holes in the foreground objects thereby distorting the shape of the object. Figure 8c shows an example where a person is walking directly behind another thereby hiding her mode. This impacts negatively on the accuracy; therefore another set of results is shown that only includes the accuracy when the number of modes was calculated correctly. Table 1 shows that experiment number 6 gave the highest accuracy over all the frames, while experiment number 5 gave the highest accuracy when the modes were calculated correctly. This shows that allowing a greater area for the vertical seam to exist, using the Dynamic method, can reduce the errors that are obtained when the number of modes did not equate to the number of people. With experiment number 6, 1289 frames were further analyzed and categorized into three sections: the first, containing 658 frames, was considered to have been perfectly segmented (see figure 8a); the second, with 430 frames, was considered to have been correctly segmented, but not with 100% accuracy (see figure 8f); the third contained 201 frames with an incorrect segmentation, such as in figure 8b, where two people were segmented into three.

Table 1. Results of Tests

Experiment No.   Mean Accuracy (%)   Accuracy (%) excl. Incorrect Modes   Processing Speed (fps)
1                91.06               97.68                                6.97
2                90.65               97.66                                3.79
3                89.23               93.95                                7.71
4                90.22               95.52                                4.12
5                91.91               98.40                                7.43
6                93.38               97.86                                3.12
7                88.10               95.22                                8.36
8                92.55               96.92                                3.29
Fig. 8. Overall Best Experiment: Selected frames showing results (Segmentation line highlighted in orange) overlaid with original data using experiment 6 (best results)
Fig. 9. Overall Worst Experiment: Selected frames showing results (Segmentation line highlighted in orange) overlaid with original data using experiment 7 (worst results)
Figures 8 and 9 show the difference between the best and worst (experiment 6 and experiment 7) results respectively. Notice that experiment 7 used the Greedy method to calculate the vertical seam, so when the number of modes is calculated incorrectly the error is compounded. For example see figure 9b.
4 Conclusion and Future Work

4.1 Conclusions
We have presented a method to separate groups of people walking together in order to facilitate tracking of single people instead of entire groups. Results show that experiment 6 reported the highest accuracy, while experiment 7 ran in the shortest time. In cases where a person's head is not well separated, as in figures 8c and 9c, the number of modes was incorrectly detected, which reduced the accuracy of the overall result. It can be seen that this method can vertically separate people in video surveillance. One can apply this technique instead of, or in conjunction with, known occlusion detection and handling techniques. With optimized and parallelized code, this method could easily run in real time.
4.2 Future Work
There are many ways to extend this method; alternative techniques for mode detection and separation in the horizontal height projection histogram could be found. It may also be possible to devise different energy functions instead of using the intensity of the edges calculated with Sobel edge detection or the external morphological gradient. However, a more direct improvement to segmentation accuracy would be achieved by more accurately detecting the number of people in a single component. This technique can also be applied to other algorithms to improve occlusion detection and handling. For example, this technique could be applied to the technique in [7], which could improve their results.
References

1. Lin, S.-F., Chen, J.-Y., Chao, H.-X.: Estimation of number of people in crowded scenes using perspective transformation. IEEE Transactions on Systems, Man, and Cybernetics, Part A 31(6), 645–654 (2001)
2. Yokoyama, M., Poggio, T.: A contour-based moving object detection and tracking. In: ICCCN 2005: Proceedings of the 14th International Conference on Computer Communications and Networks, Washington, DC, USA, pp. 271–276. IEEE Computer Society, Los Alamitos (2005)
3. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988)
4. Lara, A.C., Hirata Jr., R.: A morphological gradient-based method to motion segmentation. In: dos Campos, S.J. (ed.) Proceedings, October 10–13, 2007, vol. 2, pp. 71–72. Universidade de São Paulo (USP), Instituto Nacional de Pesquisas Espaciais, INPE (2007)
5. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 780–785 (1997)
6. Gabriel, P.F., Verly, J.G., Piater, J.H., Genon, A.: The state of the art in multiple object tracking under occlusion in video sequences. In: Advanced Concepts for Intelligent Vision Systems (ACIVS), pp. 166–173 (2003)
7. Elgammal, A., Davis, L.S.: Probabilistic framework for segmenting people under occlusion. In: Proc. of IEEE 8th International Conference on Computer Vision, pp. 145–152 (2001)
8. Haritaoglu, I., Harwood, D., Davis, L.S.: Hydra: Multiple people detection and tracking using silhouettes. In: International Conference on Image Analysis and Processing, p. 280 (1999)
9. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 246–252 (1999)
10. Lin, H.-C., Wang, L.-L., Yang, S.-N.: Automatic determination of the spread parameter in Gaussian smoothing. Pattern Recognition Letters 17(12), 1247–1252 (1996)
11. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics 9(1), 62–66 (1979)
12. Liao, P.S., Chen, T.S., Chung, P.C.: A fast algorithm for multilevel thresholding. Journal of Information Science and Engineering 17, 713–727 (2001)
13. Avidan, S., Shamir, A.: Seam carving for content-aware image resizing. ACM Trans. Graph. 26(3), 10 (2007)
Human Activity Recognition Using the 4D Spatiotemporal Shape Context Descriptor

Natasha Kholgade and Andreas Savakis

Department of Computer Engineering, Rochester Institute of Technology, Rochester NY 14623
[email protected],
[email protected]
Abstract. In this paper, a four-dimensional spatiotemporal shape context descriptor is introduced and used for human activity recognition in video. The spatiotemporal shape context is computed on silhouette points by binning the magnitude and direction of motion at every point with respect to a given vertex, in addition to the binning of radial displacement and angular offset associated with the standard 2D shape context. Human activity recognition at each video frame is performed by matching the spatiotemporal shape context to a library of known activities via k-nearest neighbor classification. Activity recognition in a video sequence is based on majority classification of the video frame results. Experiments on the Weizmann set of ten activities indicate that the proposed shape context achieves better recognition of activities than the original 2D shape context, with overall recognition rates of 90% obtained for individual frames and 97.9% for video sequences.
1 Introduction

Human activity recognition has important applications in the areas of video surveillance, medical diagnosis, smart spaces, robotic vision and human computer interaction. Several strategies have been employed in activity recognition to develop features such as principal components [1, 2] and skeletonization [4, 5, 6] from the human subjects in video sequences, and to recognize these features by HMMs, neural networks and nearest neighbors. Improvement in recognition can be obtained when spatial and temporal features are combined into one representation. Pioneering work in this field was done in [3] by using motion energy and motion history images. The authors of [16] use spatiotemporal features such as sticks, balls and plates developed as solutions of the Poisson equation on a volume constructed by concatenating frames of a subject's silhouette, obtained by background subtraction, along the temporal axis. The authors of [20] use spatiotemporal features found by linear filters for activity recognition; these have also been used by [18] in a hierarchical model (of the constellation of bags-of-features type) together with spatial features found with the 2D shape context [7]. In [19], the authors store distances between all combinations of frame pairs in a self-similarity matrix and analyze the inherent similarities in action recognition. In [21], a two-stage recognition process is used together with local spatiotemporal discriminant embedding; in the first stage, the silhouette is projected into a space
where discrimination is enhanced between classes that are further apart in the spatial domain, and if such discrimination is not obtained, then in the second stage, a short segment of frames centered at the present frame is used to form a temporal subspace for discrimination.

The precursor to the descriptor proposed in this paper is the 2D shape context [7]. The 2D shape context is a shape descriptor that bins points in the contour of an object using a log-polar histogram. For every point along the contour, a log-polar histogram is maintained that counts how the surrounding points fall within its various sections. Contour points are binned according to their radial distance and angular offset from the point under consideration. The use of these two criteria for binning makes its histograms two-dimensional, resulting in the 2D shape context, but typically the histograms are vectorized, and histogram vectors for various points are stacked to form a matrix. In [7] and [8], an in-depth description of the shape context and its use in applications such as matching numerical digits, objects and trademarks has been provided. In the area of human subject representation, the 2D shape context was used for human body configuration estimation in still images [9], pose estimation in motion sequences [11], and action recognition [10]. Extension of this descriptor to 3D was proposed in [12] where the log-polar histogram is stretched out into the third dimension to form a cylindrical histogram. A spherical-histogram based 3D shape context has been used in [13] for action recognition on a spatiotemporal volume formed (as in [16]) by concatenating silhouettes along the temporal dimension. Points for the 3D shape contexts are usually obtained from the surface of the volumes (3D objects or spatiotemporal); however, in [14], they have been generated by using all the voxels within the 3D object.

In this paper, we propose a 4D spatiotemporal shape context (STSC) descriptor, which captures both spatial and temporal description of the object based on its contour shape and motion over consecutive frames. For a contour point under consideration, in addition to the radial distance and angular offset of surrounding points used for spatial representation, the STSC uses two more criteria, namely the magnitude and direction of the velocity at the surrounding points, for development of histograms. The use of velocity magnitude allows distinction between fast and slow moving objects, for example running versus walking, while the use of direction allows distinction between the trajectories of object parts, for example bending versus jumping. Since there are two additional criteria of magnitude and direction for binning velocity information, the resulting histograms in the STSC are four-dimensional. The four-dimensional histograms for various points are vectorized and concatenated into a matrix for processing. For activity recognition applications, the advantage of the STSC descriptor over the traditional 2D shape context is that it incorporates local motion information that makes it possible to distinguish between similarly-shaped stances of activities that are separable due to the subject's motion. In comparison with the 3D shape context in [12], it requires fewer frames and can be applied for activity recognition in multiple-action sequences. The spatiotemporal shape context is applicable to other areas besides activity recognition, such as expression recognition, semantic annotation and content-based video retrieval.
This paper is organized as follows. Section 2 outlines the feature extraction process for generating the spatiotemporal shape context, Section 3 presents a comparison of results for activity recognition on the Weizmann dataset [16] using the 2D and spatiotemporal shape contexts, and comparison of our results against those from other works on the same dataset, and Section 4 includes concluding remarks.
2 Methodology

This section outlines the process of generating the STSC descriptor and the methodology for using it for activity recognition.

2.1 Extraction of Points and Generation of Motion Vectors

For each video frame, background subtraction is performed by subtracting a background frame obtained by concatenating subject-free halves of the corresponding subject's walking video. After background subtraction, the contour of the subject's silhouette is obtained at each frame using the chain code. Uniformly spaced points on the silhouette contour are selected for generating the STSC. The number of silhouette points N is chosen to be 50 in this paper, but it is a parameter that may be varied to optimize performance or efficiency. To estimate the motion vectors, a 7x7 square centered at each contour point is selected in the current video frame and the best-matching 7x7 square within the next video frame is found by searching within a 13x19 window. The displacement between their centers provides a motion estimate for that particular point. Fig. 1 shows the process of getting motion vectors. Examples of the resulting contour points and their motion vectors are provided in Fig. 2.
Fig. 1. Motion vector extraction for a point along the boundary of a walking subject: (a) contour of the subject, (b) 50 equidistant points from the contour; (c) for a given silhouette point a 7x7 window (white) around the point is selected; (d) the best match 7x7 window (light gray) in the next frame within a 13x19 search region (dark gray) is found and the displacement between the centers (black arrow) represents the motion vector.
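A simple block-matching sketch of this motion estimation is given below; the sum-of-absolute-differences cost and the assignment of 13 rows by 19 columns to the search window are assumptions, since the paper only states the window sizes and that the closest-matching block is chosen. The point and its search window are assumed to lie inside the frame.

```python
import numpy as np

def motion_vector(curr, nxt, point, block=7, search_rows=13, search_cols=19):
    """Block-matching motion estimate for one contour point.

    curr, nxt: consecutive grayscale frames as 2D arrays.
    point: (row, col) of the contour point in `curr`.
    Returns the (drow, dcol) displacement of the best-matching block.
    """
    r, c = point
    hb = block // 2
    ref = curr[r - hb:r + hb + 1, c - hb:c + hb + 1].astype(float)

    best_cost, best_disp = np.inf, (0, 0)
    for dr in range(-(search_rows // 2), search_rows // 2 + 1):
        for dc in range(-(search_cols // 2), search_cols // 2 + 1):
            rr, cc = r + dr, c + dc
            cand = nxt[rr - hb:rr + hb + 1, cc - hb:cc + hb + 1].astype(float)
            if cand.shape != ref.shape:        # candidate leaves the frame
                continue
            cost = np.abs(cand - ref).sum()    # sum of absolute differences (assumed cost)
            if cost < best_cost:
                best_cost, best_disp = cost, (dr, dc)
    return best_disp
```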
Fig. 2. Boundary points, with 50 points per silhouette, and corresponding motion vectors for (a) bend, (b) run, (c) jump forward, (d) side-shuffle, (e) walk, (f) wave with two hands, (g) jumping jacks, (h) jump in place, (i) skip, (j) wave with one hand.
2.2 Generation of the Spatiotemporal Shape Context The STSC can be generated by specifying the number of bins required along the radial, angular, motion magnitude and motion direction dimensions denoted by nr, nθ, nvr, and nvθ respectively, and the bounds for the distance and motion magnitudes, namely rmin, rmax, vmax, and vmin=0. Given points Pi and Pj on the contour of an object, the individual bin for the distance r between Pi and Pj is determined as:
b_r = arg min_i | r − t_r(i) |   subject to   r < t_r(i) ,    (1)

where t_r is the n_r-element threshold vector given as

t_r(i) = 10^( log(r_min) + (i/n_r)(log(r_max) − log(r_min)) ) .    (2)

Similarly, the individual bin for the motion change magnitude v_r at point P_j is determined as:

b_v = arg min_i | v_r − t_vr(i) |   subject to   v_r < t_vr(i) ,    (3)

where t_vr is an n_vr-element threshold vector given by Equation (4):

t_vr(i) = (i − 2) · v_max / (n_vr − 2) .    (4)
The vectors t_r and t_vr contain the thresholds for the various bins within which the r- and v_r-values of the point P_j get categorized. The selected bins form an upper cap on these values. Bins for angular offset θ and motion change direction v_θ are obtained as
b_θ = 1 + ⌊ θ · n_θ / (2π) ⌋    (5)

b_vθ = 1 + ⌊ v_θ · n_vθ / (2π) ⌋    (6)
The index for the final bin in the STSC-vector for Pi in which to place Pj is given as:
f = (b_r − 1) n_θ n_vr n_vθ + (b_θ − 1) n_vr n_vθ + (b_v − 1) n_vθ + b_vθ    (7)
Let H_i be the shape context vector for P_i; then H_i is incremented by one at position f to reflect the inclusion of P_j in the shape context vector description of P_i:

H_i(f) := H_i(f) + 1    (8)
The process is repeated for all points to get a matrix of size N × nr nθ nvr nvθ which forms the STSC. In our implementation, nr = 5, nθ = 12, nvr = 5, nvθ = 12. These parameters may be varied in order to obtain optimal performance. Fig. 3 gives an image version of the STSC matrix; it tends to be sparse.
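The binning of Eqs. (1)–(8) can be collected into a single routine, as in the sketch below; the default bounds r_min, r_max and v_max, the angle convention, and the clamping of out-of-range values are illustrative assumptions rather than the authors' exact choices.

```python
import numpy as np

def stsc_histograms(points, velocities,
                    nr=5, ntheta=12, nvr=5, nvtheta=12,
                    r_min=1.0, r_max=None, v_max=None):
    """Build the N x (nr*ntheta*nvr*nvtheta) spatiotemporal shape context matrix.

    points: (N, 2) contour coordinates; velocities: (N, 2) motion vectors.
    """
    points = np.asarray(points, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    n = len(points)

    diff = points[None, :, :] - points[:, None, :]      # P_j relative to P_i
    r = np.linalg.norm(diff, axis=-1)
    theta = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)
    vr = np.linalg.norm(velocities, axis=-1)
    vtheta = np.arctan2(velocities[:, 1], velocities[:, 0]) % (2 * np.pi)

    r_max = r.max() if r_max is None else r_max
    v_max = max(vr.max(), 1e-6) if v_max is None else v_max

    # Log-spaced radial thresholds, Eq. (2); linear velocity thresholds, Eq. (4).
    t_r = 10 ** (np.log10(r_min)
                 + (np.arange(1, nr + 1) / nr) * (np.log10(r_max) - np.log10(r_min)))
    t_vr = (np.arange(1, nvr + 1) - 2) * v_max / (nvr - 2)

    H = np.zeros((n, nr * ntheta * nvr * nvtheta), dtype=int)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Smallest threshold that caps the value, Eqs. (1) and (3), 0-based.
            br = min(int(np.searchsorted(t_r, r[i, j], side='right')), nr - 1)
            bv = min(int(np.searchsorted(t_vr, vr[j], side='right')), nvr - 1)
            btheta = int(theta[i, j] * ntheta / (2 * np.pi))     # Eq. (5), 0-based
            bvtheta = int(vtheta[j] * nvtheta / (2 * np.pi))     # Eq. (6), 0-based
            f = ((br * ntheta + btheta) * nvr + bv) * nvtheta + bvtheta   # Eq. (7)
            H[i, f] += 1                                          # Eq. (8)
    return H
```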
Fig. 3. Spatiotemporal shape context (STSC) developed for the frames in Fig. 1: the STSC matrix shown here has 50 rows corresponding to the 50 points in Fig. 1 (b), and 3600 columns corresponding to nr nθ nvr nvθ = 5x12x5x12; each row has a total of 50 points that have been binned according to their radial displacement from the point corresponding to that row, angular offset, magnitude of motion vector and direction of motion vector.
2.3 Matching and Classification with the Spatiotemporal Shape Context

For a given pair of images or video frames, a match value can be attributed by computing the STSCs of the objects in the two images, and by finding an optimal correspondence between the two STSCs. This correspondence is essential prior to matching. Since rows of the STSC correspond to points from the object contour, row-wise ordered matching is not guaranteed between the STSCs of a pair of images, even if they
are visually similar in shape and motion. To introduce an ordered match, an optimal permutation must be computed for the rows of one STSC. The authors of [7] compute the correspondence between their 2D shape contexts with the Hungarian algorithm [15], which we adopt in our implementation. The Hungarian algorithm computes correspondence using an NxN cost matrix as input, which can be obtained by computing pair-wise distances between each of the N rows in one STSC and each of the N rows in the other STSC. The distances are computed using the χ² metric; if H_i is the i-th row in the first STSC, and H_j is the j-th row in the second STSC, then the element of the cost matrix at the (i,j)-th location is given using the χ² metric as:

C_ij = (1/2) Σ_{k=1}^{N} ( H_i(k) − H_j(k) )² / ( H_i(k) + H_j(k) ) .    (9)
Once the Hungarian algorithm is used to compute the optimal correspondence between rows of the two STSCs, then a match value can be assigned to the pair by computing the sum of χ² distances between optimally corresponded rows of the STSCs. A small match value indicates that two activity stances are similar in their shapes and motion, while a large match value indicates the opposite. This allows kNN classification to be used on the match values in order to identify the activity. For the purpose of classification, a library of action stances and their STSCs can be maintained, and one can compute the STSC of an input frame from itself and the next adjacent frame using the equations listed in section 2.2. Match values can then be computed between the input frame's STSC and the STSCs in the library after optimal correspondence is found using the Hungarian algorithm, and the k nearest match values can be used to classify the input frame's action. Such a classification is done for every frame in the sequence and the activity for the entire video is found by majority voting on the results of individual frames within the video.
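A compact sketch of this matching and classification procedure is given below, using SciPy's linear_sum_assignment as the Hungarian solver; the small epsilon guarding against empty bins and the (descriptor, label) library structure are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_value(H1, H2, eps=1e-12):
    """Chi-squared match value between two STSC matrices after optimal row correspondence."""
    # Pairwise chi-squared distances between rows, Eq. (9).
    cost = 0.5 * np.array([[np.sum((a - b) ** 2 / (a + b + eps)) for b in H2] for a in H1])
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return cost[rows, cols].sum()

def classify_frame(H_input, library, k=1):
    """k-NN classification of one frame's STSC against a labeled library.

    library: list of (H, activity_label) pairs.
    """
    scores = sorted((match_value(H_input, H), label) for H, label in library)
    top = [label for _, label in scores[:k]]
    return max(set(top), key=top.count)        # majority vote among the k nearest
```

Applying classify_frame to every frame and taking a majority vote over the per-frame labels then yields the video-level decision described above.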
3 Results

Recognition was performed on the ten activities of the Weizmann dataset [16]. Two experiments were conducted for the purpose of comparison: one using the 2D shape context, and one using the STSC. For each experiment, a library of the corresponding shape contexts was maintained by hand-picking 10 frames per subject per activity, and computing shape contexts for each frame. The spatiotemporal shape context calculation was done using silhouette points on each frame and motion information between successive frames. Classification was done using the leave-one-out technique, i.e. for every test subject, library shape contexts corresponding to all activities for that subject were left out. During testing, shape contexts were computed from each frame in the video sequence and were matched to the library using k-NN with k=1, i.e. the closest matching library shape context was selected. The activity corresponding to the closest match was used to classify the input frame, and for a video sequence, a majority vote of the activity classifications for individual frames was used to make the activity decision for the video. Results from the two experiments were used to generate confusion matrices that contain percentages of frames classified as various activities for individual video frames and entire video sequences.
Table 1. Results from individual frame recognition using the 2D shape context
      B      JF     R      SS     W      V2     JJ     JP     S      V1
B     93.5   5.1    0      0.5    0      0      0      0.8    0      1.9
JF    4.3    79.5   2.0    0.2    5.0    0      0      0      7.4    0
R     0      1.6    72.4   0.2    3.6    0      0      0      31.1   0
SS    0      0.2    1.3    86.2   0      0.3    2.5    5.3    0      0
W     0      3.1    8.4    0.7    88.3   0      0      0      6.5    0
V2    0      0      0      0.2    0      95.9   2.6    0      0      0
JJ    0      0.2    0      0.9    0.3    2.8    87.4   2.4    0.2    0
JP    1.1    0.2    0.2    11.1   0      0.8    7.5    91.3   0      1.5
S     0      10.1   15.7   0      2.8    0      0      0      54.8   0
V1    1.1    0      0      0      0      0.2    0      0.2    0      96.6
Table 1 provides results for classifying individual video frames and entire video sequences respectively using the 2D shape context. The nomenclature is as follows: B=bend, JF=jump forward, R=run, SS=side-shuffle, W=walk, V2=wave with two hands, JJ=jumping jacks, JP=jump in place, S=skip and V1=wave with one hand. Individual frame classification rates for most activities are above 80%; however, jumping, running and skipping show lower classification. There is room for improvement in the recognition of several actions, since bending and jumping confuse with each other due to the hunched-back posture of the subject. Several stances of profile-based actions such as walking, running and jumping overlap with one another. Similarly, shapes of frontally-faced actions such as side-shuffling, jumping jacks, waving with two hands and jumping in place overlap. Some frames of waving with one hand confuse with bending due to incorrect silhouette extraction that causes the elbow to connect with the head. Correct video classification is attained for all actions except skipping which is recognized at a rate of 70%. An overall frame recognition rate of 86% and video recognition rate of 96.8% is obtained. Table 2 provides results of individual frame recognition and video recognition when the STSC is used during classification. We observe that the mismatches due to the 2D shape context between bending and jumping, and between walking and running (on account of similarities in shape) are reduced when STSC is used. Mismatches of several jumping and running frames with the skipping action are reduced as well. Mismatches of jumping in place with two-handed waving and jumping jacks, and of side-shuffling with jumping in place are reduced considerably. Some misclassifications are introduced for jumping jacks frames with jumping in place, jumping in place frames with one-handed waving, and side-shuffling with walking, primarily due to the added similarity of motion vectors. The classification rate for the bending activity is much higher, and the classification rate for waving with one hand now attains 100% rate. All videos are classified correctly except skipping which is recognized at a rate of 80%. Overall recognition rates of 90% for frames and 97.9% for videos are obtained. Thus, we observe that there is an overall increase in classification over the original shape context.
Table 2. Results from individual frame recognition using the STSC
      B      JF     R      SS     W      V2     JJ     JP     S      V1
B     98.4   0      0      0      0.3    0      0.7    0.2    0      0
JF    0      86.6   2.5    1.9    0.6    0      0      0      20.4   0
R     0      0.9    86.1   0.5    0.4    0      0      0      17.4   0
SS    0      2      0.2    93.3   2.1    0      0      0      1.0    0
W     0      0.7    1.3    4.1    95.6   0      0      0      3.4    0
V2    0      0      0      0      0      97.1   1.4    1.1    0      0
JJ    0      0      0      0.2    0      0      83.1   0.2    0      0
JP    0      0      0      0      0      0      13.7   93.8   0      0
S     0      9.8    9.9    0      1      0      0      0      57.8   0
V1    1.6    0      0      0      0      2.9    1.1    4.7    0      100
Table 3. Individual frame and video recognition rates of the 2D and STSC compared against other works (- indicates that result was not provided,* recognition was done on space-time cubes with 8 frames per cube as opposed to individual frames)
                                                       Frame recognition rate   Video recognition rate
STSC (this paper)                                      90%                      97.9%
2D Shape Context                                       86%                      96.8%
Grunmann et al [13]                                    -                        94.4%
Gorelick et al [16]*                                   -                        97.8%
Jhuang et al (skip excluded, using C3 features) [17]   -                        98.8%
Niebles et al (skip excluded) [18]                     55%                      72.8%
Junejo et al [19]                                      -                        95.3%
Jia et al (using LSTDE) [21]                           90.9%                    -
Table 3 shows the overall frame and video recognition rates for the work in this paper compared against other works done using the Weizmann dataset. It must be noted that for [16], recognition has been done on space-time cubes consisting of 8 frames per cube, and hence cannot necessarily be categorized as frame or video classification. However, our video recognition rates are similar to the space-time cube classification rates. Our frame-based activity recognition with the STSC is on par with that in [21], and is better than [13] and [19]. In [17] and [18], recognition has been done by excluding the skipping action from the dataset: we observe that our frame and video recognition rates even with skipping are higher than those from [18], and since we obtain 100% video classification on all actions except skipping, then if skipping were excluded from the library, we obtain 100% overall video recognition rate which is higher than the rate in [17].
4 Conclusions This paper proposes a new 4D spatiotemporal shape context, a descriptor for objects in video sequences that encodes information about the shape of the object and the change in its motion from one frame to the next. Using the spatiotemporal shape context for activity recognition demonstrates that it performs at least as well as other leading methods and is better than the shape context for most activities, since it allows the separation of similarly shaped stances using the differences in their motion over frames. The spatiotemporal shape context can be used to represent shape and motion in other areas of object or category recognition, such as content-based video retrieval, expression or face recognition in video, and distinction between individual actions in multiple-action sequences. Future work includes analyzing the performance of the spatiotemporal shape context for activity recognition over short segments of video for the purpose of recognizing changes in activities using a sliding window.
Acknowledgements This research is supported in part by the Eastman Kodak Company and the Center for Electronic Imaging Systems (CEIS), a NYSTAR-designated Center for Advanced Technology in New York State.
References 1. Masoud, O., Papanikolopoulos, N.: Recognizing Human Activities. In: IEEE Conference on Advanced Video and Signal Based Surveillance, Miami, Florida, pp. 157–162 (2003) 2. Niu, F., Mottaleb, M.: View-invariant Human Activity Recognition Using Shape and Motion Features. In: IEEE Sixth International Symposium on Multimedia Software Engineering, Miami, Florida (2004) 3. Davis, J., Bobick, A.: The Representation and Recognition of Human Movement Using Temporal Templates. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp. 928–934 (1997) 4. Guo, Y., Xu, G., Tsuji, S.: Understanding Human Motion Patterns. In: International Conference on Pattern Recognition, Jerusalem, Israel, pp. 325–329 (1994) 5. Fujiyoshi, H., Lipton, A.: Real-Time Human Motion Analysis by Image Skeletonization. In: IEEE Workshop on Applications of Computer Vision, vol. 15, Princeton, New Jersey (1998) 6. Chen, D.Y., Liao, H.Y.M., Shih, S.W.: Continuous Human Action Segmentation and Recognition Using a Spatiotemporal Probabilistic Framework. In: Eighth IEEE International Symposium on Multimedia, San Diego, California, pp. 275–282 (2006) 7. Belongie, S., Malik, J., Puzicha, J.: Shape context: A new descriptor for shape matching and object recognition. In: NIPS (2000) 8. Belongie, S., Malik, J., Puzicha, J.: Shape Matching and Object Recognition Using Shape Contexts. Technical Report UCB//CSD-00-1128, Berkeley (2001)
9. Mori, G., Malik, J.: Estimating Human Body Configurations Using Shape Context. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, pp. 666–680. Springer, Heidelberg (2002) 10. Kholgade, N., Savakis, A.: Human activity recognition in video using two methods for matching shape contexts of silhouettes. In: SPIE Defense and Security Symposium, Orlando, Florida (2008) 11. Qiu, X., Wang, Z., Xia, S., Li, J.: Estimating Articulated Human Pose from Video Using Shape Context. In: IEEE International Symposium on Signal Processing and Information Technology, Athens, Greece (2005) 12. Kortgen, M., Park, G., Novotni, M., Klein, R.: 3D Shape Matching with 3D Shape Contexts. In: Seventh Central European Seminar on Computer Graphics, Budmerice, Slovakia (2003) 13. Grundmann, M., Meier, F., Essa, I.: 3D Shape Context and Distance Transform for Action Recognition. In: International Conference on Pattern Recognition, Tampa, Florida (2008) 14. Huang, K.S., Trivedi, M.: 3D Shape Context Based Gesture Analysis Integrated with Tracking using Omni Video Array. In: IEEE Workshop on Vision for Human-Computer Interaction in conjunction with IEEE CVPR Conference, San Diego (2005) 15. Kuhn, H.W.: The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly 2, 83–97 (1955) 16. Gorelick, L., Blank, M., Shechtman, E., Irani, M.: Actions as Space-Time Shapes. PAMI 29(12), 2247–2253 (2007) 17. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action recognition. In: IEEE International Conference on Computer Vision (2007) 18. Niebles, J.C., Fei-Fei, L.: A hierarchical model of shape and appearance for human action classification. In: IEEE Conference on Computer Vision and Pattern Recognition (2007) 19. Junejo, I.N., Dexter, E., Laptev, I., Pérez, P.: Cross-view action recognition from temporal self-similarities. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 293–306. Springer, Heidelberg (2008) 20. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior Recognition via Sparse Spatio-Temporal Features. In: 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72 (2005) 21. Jia, K., Yeung, D.: Human Action Recognition using Local Spatio-Temporal Discriminant Embedding. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2008)
Adaptive Tuboid Shapes for Action Recognition Roman Filipovych and Eraldo Ribeiro Computer Vision and Bio-Inspired Computing Laboratory Department of Computer Sciences Florida Institute of Technology Melbourne, FL 32901, USA {rfilipov,eribeiro}@fit.edu
Abstract. Encoding local motion information using spatio-temporal features is a common approach in action recognition methods. These features are based on the information content inside subregions extracted at locations of interest in a video. In this paper, we propose a conceptually different approach to video feature extraction. We adopt an entropy-based saliency framework and develop a method for estimating tube-like salient regions of flexible shape (i.e., tuboids). We suggest that the local shape of spatio-temporal subregions defined by changes in local information content can be used as a descriptor of the underlying motion. Our main goal in this paper is to introduce the concept of adaptive tuboid shapes as a local spatio-temporal descriptor. Our approach's original idea is to use changes in local spatio-temporal information content to drive the tuboid's shape deformation, and then use the tuboid's shape as a local motion descriptor. Finally, we conduct a set of action recognition experiments on video sequences. Despite the relatively lower classification performance when compared to state-of-the-art action-recognition methods, our results indicate a good potential for the adaptive tuboid descriptor as an additional cue for action recognition algorithms.
1 Introduction
Human action recognition has received significant attention in the computer vision community over the past decade. Action recognition is a challenging problem with a number of applications including surveillance, video retrieval, and human-computer interaction. In this paper, we address the issue of extracting descriptive features from motion videos. Inspired by object recognition methods [9,10,1], recent action recognition approaches have demonstrated the effectiveness of using local motion descriptors extracted at spatio-temporal locations across the video volume [4,11]. Local motion descriptors are usually built using filter responses or motion measurements calculated inside spatio-temporal subregions [7,4]. For example, Kläser et al. [7] calculate video descriptors from histograms of oriented gradients (HoG). Dollár et al. [4] consider spatio-temporal subregions of cuboid shape, and obtain descriptors using normalized pixel values, brightness gradients, and windowed optical flow.
Motion descriptors are usually extracted at the locations provided by spatio-temporal region detectors. For example, Laptev and Lindeberg [8] extended the Harris corner detector [5] to the spatio-temporal domain. In [8], interest points are detected by analyzing spatio-temporal filter responses over increasing scales, where the scale of the operator kernel determines the scale of the spatio-temporal subregion. Dollár et al. [4] proposed a spatio-temporal corner detector by modifying the temporal component of the operator kernel. Another class of approaches works on the adaptation of the entropy-based salient region detector originally introduced by Kadir and Brady [6]. Their method works by considering changes in local information content over different scales. An extension of this detector to the video domain was recently introduced by Oikonomopoulos et al. [11]. Current rigid spatio-temporal regions (i.e., cuboids, ellipsoids, cylinders) [11,7,4] do not allow for the use of the regions' shape as a cue for video analysis. As a result, these regions may not be able to capture nontrivial motion variations due to the articulated nature of human motion. Additionally, methods based on cuboid- or elliptic-shaped subregions strongly rely on the availability of descriptive information content inside the analyzed subregions. In fact, changes in human appearance, illumination, and viewpoint may compromise the descriptiveness of the subregions' content. At the same time, the shape of traditional spatio-temporal subregions carries little information about the motion's local spatial properties. In this paper, we propose new video features that are designed to "follow" the local spatio-temporal information flow. We adopt the region saliency framework [6,11] and develop a method for estimating tube-like salient regions (i.e., tuboids) of adaptive shape. We argue that the shape of the local spatio-temporal information flow is important to describe local motion. We show how features that are invariant to scale, and partially invariant to viewpoint changes, can be extracted from videos. Our main goal in this paper is to introduce the concept of adaptive tuboid shapes as a local spatio-temporal descriptor. Our main idea is to use changes in local spatio-temporal information content to drive the tuboid's shape deformation, and then use the tuboid's shape as a local motion descriptor. Finally, we conduct a set of experiments on real motion videos, and show that our new adaptive-shape descriptors can be effective for action recognition. The remainder of this paper is organized as follows. In Section 2, we review the spatio-temporal subregion saliency measure from [11]. In Section 3, we introduce our spatio-temporal subregions of flexible tube-like shape, and describe the tuboid parameter-estimation procedure. In Section 4, we develop a set of descriptors based on the shapes of the extracted tuboids. In Section 5, we describe the action learning and recognition method used in our paper. Experimental results are reported in Section 6, with the paper concluding in Section 7.
2 Measuring Spatio-Temporal Information Content
In this section, we describe the saliency measure introduced by Kadir and Brady [6], and extended to the spatio-temporal domain by Oikonomopoulos et al. [11].
The method begins by calculating Shannon's entropy of local image attributes (e.g., intensity, filter response) inside cylindrical spatio-temporal volumes over a range of scales. This entropy is given by:

H_D(s) = -\int_{q \in D} p_D(q, s) \log_2 p_D(q, s) \, dq,    (1)

where p_D(q, s) is the probability density function (pdf) of the signal in terms of scale s, and descriptor q which takes on values from descriptors in the video volume D. Here, the pdf can be approximated by a pixel intensity histogram or by a kernel-based method such as Parzen windows. In our case, we follow [11], and use a histogram of the values obtained from the convolution of the image sequence with a Gaussian derivative filter. The spatio-temporal subregion D is assumed to be extracted at the origin of the coordinate system (i.e., at the spatio-temporal location x = (0, 0, 0)^T). The scales s = (s_1, \ldots, s_n) represent the size parameters of the analyzed volumes (e.g., the spatio-temporal cylinder's radius and length). Once the local entropy values are at hand, a set of candidate scales is selected for which the entropy H_D has local maxima, i.e.,

S = \left\{ s : \frac{\partial H_D(s)}{\partial s} = 0, \; \frac{\partial^2 H_D(s)}{\partial s^2} < 0 \right\}.    (2)

A saliency metric, Y_D, as a function of scales s, can be defined as:

Y_D(s) = H_D(s) W_D(s),    \forall s \in S,    (3)

where, for candidate scales in S, the entropy values are weighted by the following interscale unpredictability measure defined via the magnitude change of the pdf as a function of scale:

W_D(s) = \sum_i s_i \int_{q \in D} \left| \frac{\partial}{\partial s_i} p_D(q, s) \right| dq,    \forall s \in S.    (4)

An important property of the saliency measure in Equation 3 is that it does not depend on the content inside a subregion. Instead, the measure is based on changes in information content over scales. This makes the saliency metric Y_D particularly robust. Next, we introduce our tuboid regions used in this paper, and propose a tuboid parameters estimation algorithm.
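As a concrete illustration of the saliency computation just described (our own sketch, not code from the paper), the following Python/NumPy fragment estimates the pdf p_D by a histogram of values inside a cylindrical region, evaluates H_D over a range of radii at a fixed temporal length, and weights each local entropy maximum by a finite-difference approximation of the interscale term of Equation 4. The bin count, the value range, and the use of a single scale dimension are simplifying assumptions.

```python
import numpy as np

def region_histogram(volume, center, radius, length, bins=32):
    """Histogram estimate of p_D(q, s) inside a spatio-temporal cylinder."""
    x0, y0, t0 = center
    T, H, W = volume.shape
    t_lo, t_hi = max(0, t0 - length // 2), min(T, t0 + length // 2 + 1)
    yy, xx = np.mgrid[0:H, 0:W]
    disk = (xx - x0) ** 2 + (yy - y0) ** 2 <= radius ** 2
    values = volume[t_lo:t_hi][:, disk].ravel()
    hist, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))            # histogram form of H_D, Eq. 1

def saliency_over_radii(volume, center, radii, length=9):
    """Return (radius, Y_D) pairs at radii where H_D has a local maximum (Eqs. 2-4)."""
    hists = [region_histogram(volume, center, r, length) for r in radii]
    H = np.array([entropy(h) for h in hists])
    salient = []
    for i in range(1, len(radii) - 1):
        if H[i] > H[i - 1] and H[i] > H[i + 1]:                       # candidate scale, Eq. 2
            dp_ds = np.abs(hists[i + 1] - hists[i - 1]).sum() / (radii[i + 1] - radii[i - 1])
            W_D = radii[i] * dp_ds                                    # interscale weight, Eq. 4
            salient.append((radii[i], H[i] * W_D))                    # Y_D = H_D * W_D, Eq. 3
    return salient

# Toy usage on a random "filter response" volume normalised to [0, 1]
vol = np.random.rand(30, 64, 64)
print(saliency_over_radii(vol, center=(32, 32, 15), radii=list(range(2, 12))))
```

In the full measure both the radius and the temporal length would act as scale parameters s = (s_1, ..., s_n), and the pdf would be built from Gaussian-derivative filter responses as in [11].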
3 Tuboids
Spatio-temporal subregions considered in [11] have a very simple shape (i.e., a cylinder). In this section, we propose to use subregions of more complex shape by introducing a parametrization for tube-like video volumes. Our tuboid model consists of a disk of variable radius that slides along a curve describing the temporal evolution of the spatial information content. We assume
Fig. 1. Examples of tuboids and their parametric components as a function of time. Top: orthogonal projection curves describing the spatio-temporal variation of (x, y) coordinates, as well as the temporal variation of the flexible cylinder radius. Bottom: examples of tuboids.
that a spatio-temporal subregion is centered at the origin of the coordinate system. At time t, a sliding disk D_t of radius r_t is given by:

(p - c_t) \cdot (p - c_t) \le r_t^2,    (5)

where p = (x_t, y_t)^T is a point on the disk, and c_t = (x_{0,t}, y_{0,t})^T is the disk's center point. Let g = (x_{0,t}, y_{0,t}, r_t)^T represent our tuboid model. We model the temporal evolution of tuboid g (i.e., medial-axis points and disk radius) using quadratic parametric equations given by:

g_t = \sum_{k=0}^{2} a_k t^k    for    t \in [t_s, t_e],    (6)
where a_k = (a_k^x, a_k^y, a_k^r)^T. The time values t belong to a bounded interval starting at t_s and ending at t_e. Equation 6 defines the tuboid's shape. Figure 1 shows examples of shapes described by Equations 5 and 6. In the case of cylindrical subregions considered in [11], the components in (6) are independent of time, and carry little information about the motion's local properties. Note that the point (x_{0,0}, y_{0,0}) (i.e., the medial-axis point for t = 0) may lie outside the tuboid, and thus will not necessarily coincide with the local coordinate system's origin. This characteristic is illustrated in Figure 1 (top row). The allowed deviation of the axial curve from the local coordinate origin is controlled by the parameter estimation procedure, and will be discussed later in this paper. Examples of tuboids of different shapes are shown in the last two rows of Figure 1. The figure also shows plots of individual components of Equation 6. Our goal is to estimate the parameters of a salient subregion described by (5) and (6). However, a direct optimization over the saliency measure Y_D may not be applicable for several reasons. First, the increase in the subregion's dimensionality creates the possibility of degenerate cases [6]. Secondly, due to noise and space discretization, the set of candidate scales as described in (2) may be empty if the number of shape parameters is large. Finally, an exhaustive search over tuboid parameters is computationally intensive. We address these issues by using an alternative scale function, and by employing a gradient-ascent approach.

3.1 Estimating Tuboid Parameters
Following Equation 6, a tuboid can be completely defined by a set of eleven parameters, \Theta = (a_0^x, a_1^x, a_2^x, a_0^y, a_1^y, a_2^y, a_0^r, a_1^r, a_2^r, t_s, t_e). Unfortunately, obtaining a set of candidate scales for the set S described in Equation 2 may not be achievable. Indeed, in our implementation we noticed that even for three parameters the set of candidate scales S is often empty. To proceed, we define a scale function F(\Theta) that is independent of the parameters of the tuboid's axial curve. This independence allows us to adapt the shape of the axial curve solely based on the tuboid's information content. The scale function is given by F(\Theta) = r_{t_s} + r_0 + r_{t_e} + t_e - t_s. Here, F(\Theta) defines a tuboid's scale in terms of the radii of disks obtained by XY-plane slices at times t = t_s, t = 0, and t = t_e, as well as by the tuboid's temporal length described by t_s and t_e. We begin with a set of initial tuboid parameters \Theta_0, and employ a gradient ascent search to maximize the saliency measure Y_D over scales F(\Theta). Assuming that the parameters in \Theta are discrete, the set of parameters at the next step of the gradient ascent algorithm is updated as:

\Theta_{n+1} = \Theta_n + \lambda \nabla Y_D(F(\Theta_n)),    (7)

where \lambda is a small constant controlling the convergence speed of the search process. While in Equation 7 the direction of the search is determined by single-valued scales F(\Theta_n), the region D during iteration n + 1 is defined by \Theta_n.

Initialization. In order to avoid local minima, the estimation procedure in Equation 7 requires good initialization of \Theta_0. We initialize the algorithm with the
parameters of a cylindrical subregion that corresponds to the maximal change in information entropy as defined by Equation 1. The initialization procedure is given by the following maximization:

\Theta_0 = \arg\max_{\Theta} H_D(F(\Theta)),  s.t.  \frac{\partial H_D(F(\Theta))}{\partial F(\Theta)} = 0,  \frac{\partial^2 H_D(F(\Theta))}{\partial F(\Theta)^2} < 0,  x_t = 0,  y_t = 0,  r_{t_s} = r_0 = r_{t_e}.    (8)

In (8), the conditions x_t = 0, y_t = 0, and r_{t_s} = r_0 = r_{t_e} result in an initial spatio-temporal subregion of cylindrical shape. A crucial difference between our initialization procedure and the candidate scales estimation in Equation 2 is that we consider entropy changes over single-valued scales F. This is different from the joint maximization over individual parameters in (2). In this way, our parameters initialization approach is more likely to yield a solution. Nevertheless, for some spatio-temporal locations, the initialization in (8) may not result in a solution. In this case, the location is deemed non-salient and is removed from further consideration. The percentage of these points is small, due to the use of an interest detector pre-processing step that we will discuss later in this paper.
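The parameter search of Equations 7 and 8 can be sketched as follows (a hypothetical illustration, not the authors' implementation). The `saliency` callable standing in for Y_D(F(Θ)) evaluated on the video, the central finite-difference gradient, and the step size λ are our own assumptions.

```python
import numpy as np

# Theta = (a0x, a1x, a2x, a0y, a1y, a2y, a0r, a1r, a2r, ts, te), as in Section 3.1
def tuboid_at(theta, t):
    """Evaluate the quadratic medial-axis point and radius of Eq. 6 at time t."""
    a = np.asarray(theta[:9], dtype=float).reshape(3, 3)   # rows: coefficients of x_{0,t}, y_{0,t}, r_t
    x0t, y0t, rt = a @ np.array([1.0, t, t * t])
    return x0t, y0t, max(rt, 0.0)

def estimate_tuboid(theta0, saliency, steps=200, lam=0.25):
    """Gradient ascent of Eq. 7: Theta_{n+1} = Theta_n + lam * grad Y_D(F(Theta_n))."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for k in range(theta.size):
            step = np.zeros_like(theta)
            step[k] = 1.0                                   # unit step per parameter
            grad[k] = (saliency(theta + step) - saliency(theta - step)) / 2.0
        new_theta = theta + lam * grad
        if saliency(new_theta) <= saliency(theta):          # stop once no ascent is possible
            break
        theta = new_theta
    return theta

# Toy usage with a smooth synthetic stand-in for Y_D(F(Theta))
target = np.array([1, 0, 0, 2, 0, 0, 3, 0, 0, -4, 4], dtype=float)
fake_saliency = lambda th: -np.sum((np.asarray(th) - target) ** 2)
theta0 = np.zeros(11)
theta0[-2:] = (-3.0, 3.0)                                   # rough cylindrical guess for (ts, te)
theta_hat = estimate_tuboid(theta0, fake_saliency)
print(np.round(theta_hat, 2), tuboid_at(theta_hat, t=0.0))
```

In practice Θ_0 would come from the cylindrical initialization of Equation 8, and the saliency would be re-evaluated on the region D defined by the current parameters at every iteration.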
4 Feature Extraction
General Features. The quadratic curves described by Equation 6 can be characterized by three points. For simplicity, we choose the triplets (x_{t_s}, x_0, x_{t_e}) and (y_{t_s}, y_0, y_{t_e}) to define the axial curve's projections onto the XT and YT planes, respectively. We also choose the triplet (r_{t_s}, r_0, r_{t_e}) to define r_t. We then define a general tuboid feature vector using eleven components as d_g = (x_{t_s}, x_0, x_{t_e}, y_{t_s}, y_0, y_{t_e}, r_{t_s}, r_0, r_{t_e}, t_s, t_e). While the general features d_g completely describe the shapes of the underlying tuboids, they may not be suitable for scenarios with varying camera parameters. Next, we propose a scale-invariant tuboid descriptor that is less sensitive to scale variations.

Scale-Invariant Features. We commence by assuming that the videos are obtained by static cameras with different scale parameters. The spatial components of scale-invariant tuboid shape descriptors can be obtained using unit vectors defined by the points P_{t_s} = (x_{t_s}, y_{t_s}), P_0 = (x_0, y_0), and P_{t_e} = (x_{t_e}, y_{t_e}). The components of the scale-invariant descriptor are given by:

u = \frac{P_{t_s} - P_0}{\| P_{t_s} - P_0 \|},    v = \frac{P_0 - P_{t_e}}{\| P_0 - P_{t_e} \|},    and    w = \frac{P_{t_e} - P_{t_s}}{\| P_{t_e} - P_{t_s} \|}.    (9)

Examples of vectors u, v, and w are shown in Figure 2. Additionally, scale effects on r_t can be reduced by considering the additional feature components \xi = r_{t_s}/r_0 and \zeta = r_{t_e}/r_0. The scale-invariant tuboid descriptor is given by:

d_s = (u_1, u_2, v_1, v_2, w_1, w_2, \xi, \zeta, t_s, t_e),    (10)

where u = (u_1, u_2), v = (v_1, v_2), and w = (w_1, w_2). The two temporal components t_s and t_e can be incorporated into d_s since scale variations do not affect the temporal evolution of the motion.
Fig. 2. Scale-invariant components u = (u1 , u2 ), v = (v1 , v2 ), and w = (w1 , w2 )
Viewpoint Invariant Features. Here, we assume a simplified camera-viewing geometry in which camera parameters are limited to vertical or horizontal rotations about its center. We further assume that all cameras face the scene of interest, and are located at the same distance from it. Additionally, the scene is assumed to be far from the camera centers. These assumptions allow us to approximate projective camera transformations by affine transformations. In this paper, we consider descriptors that are invariant to horizontal or vertical circular translations of the camera (i.e., X-view-invariant and Y-view-invariant features). The X-view-invariant feature vector consists of five components and is given by d_{xv} = (y_{t_s}, y_0, y_{t_e}, t_s, t_e). Similarly, it is possible to consider camera vertical circular translation. In this case, the Y-view-invariant feature vector is given by: d_{yv} = (x_{t_s}, x_0, x_{t_e}, t_s, t_e). The radial components (r_{t_s}, r_0, r_{t_e}) are not considered as they are not view-invariant.
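Given the three characteristic points and radii of an estimated tuboid, the four descriptor types can be assembled directly; the sketch below is our own illustration of Equations 9 and 10 and of the view-invariant vectors (the function name and the handling of degenerate cases are assumptions).

```python
import numpy as np

def tuboid_descriptors(P_ts, P_0, P_te, r_ts, r_0, r_te, ts, te):
    """Build the general (d_g), scale-invariant (d_s) and view-invariant
    (d_xv, d_yv) descriptors from the three characteristic points of Eq. 6."""
    P_ts, P_0, P_te = map(np.asarray, (P_ts, P_0, P_te))

    d_g = np.array([P_ts[0], P_0[0], P_te[0],
                    P_ts[1], P_0[1], P_te[1],
                    r_ts, r_0, r_te, ts, te])

    unit = lambda a: a / np.linalg.norm(a)                  # zero-length (cylindrical) cases are discarded upstream
    u, v, w = unit(P_ts - P_0), unit(P_0 - P_te), unit(P_te - P_ts)   # Eq. 9
    xi, zeta = r_ts / r_0, r_te / r_0
    d_s = np.array([*u, *v, *w, xi, zeta, ts, te])          # Eq. 10

    d_xv = np.array([P_ts[1], P_0[1], P_te[1], ts, te])     # keeps only the y components
    d_yv = np.array([P_ts[0], P_0[0], P_te[0], ts, te])     # keeps only the x components
    return d_g, d_s, d_xv, d_yv

# Toy usage
d_g, d_s, d_xv, d_yv = tuboid_descriptors((2.0, 1.0), (0.5, 0.0), (-1.5, 2.0),
                                          r_ts=3.0, r_0=4.0, r_te=2.5, ts=-6, te=6)
print(d_s.round(3))
```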
5 Learning and Recognition
Our recognition method is described as follows. Let (d_1, \ldots, d_M) be a set of descriptors extracted from training video sequences of a specific action. The descriptors d_j are assumed to be of the same type (e.g., general, scale-invariant, X-view-invariant, or Y-view-invariant). We model the distribution of descriptors as a mixture of K Gaussian densities given by p(d|\theta) = \sum_{i=1}^{K} \gamma_i p_i(d|\theta_i), where d is a descriptor, \theta_i are the parameters of the i-th mixture component, \gamma_i are the mixing weights such that \sum_{i=1}^{K} \gamma_i = 1, p_i is a multivariate Gaussian density function parametrized by \mu_i and \Sigma_i (i.e., the mean and covariance matrix, respectively), and \theta represents the set of model parameters (\gamma_1, \ldots, \gamma_K, \mu_1, \Sigma_1, \ldots, \mu_K, \Sigma_K). The parameters can be estimated by using the Expectation-Maximization (EM) algorithm [3]. Given a set of features (\hat{d}^1, \ldots, \hat{d}^M) extracted from a novel video sequence, the classification score is given by \tau = \frac{1}{M} \sum_{i=1}^{M} p(\hat{d}^i | \theta).
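A minimal sketch of this learning and scoring scheme, assuming scikit-learn's GaussianMixture as a stand-in for the EM-fitted mixture of [3] (the feature dimensionality, K, and the toy data are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_action_model(descriptors, K=5, seed=0):
    """Fit a K-component Gaussian mixture to the training descriptors of one action."""
    return GaussianMixture(n_components=K, covariance_type='full',
                           random_state=seed).fit(np.asarray(descriptors))

def classification_score(model, descriptors):
    """tau = (1/M) * sum_i p(d_i | theta); score_samples returns log-densities."""
    return float(np.mean(np.exp(model.score_samples(np.asarray(descriptors)))))

def recognise(models, descriptors):
    """Pick the action whose mixture gives the highest score tau."""
    return max(models, key=lambda name: classification_score(models[name], descriptors))

# Toy usage with random 10-D "scale-invariant" descriptors for two actions
rng = np.random.default_rng(0)
train = {'walk': rng.normal(0.0, 1.0, (200, 10)), 'wave': rng.normal(3.0, 1.0, (200, 10))}
models = {name: train_action_model(d) for name, d in train.items()}
test = rng.normal(3.0, 1.0, (40, 10))
print(recognise(models, test))          # expected: 'wave'
```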
6 Experiments
The goal of our experiments is to demonstrate the potential of our tuboid-shaped descriptor. For this, we performed a set of classification experiments on the Weizmann human action dataset [2]. The dataset consists of videos of nine actions performed by nine individuals under similar camera conditions. Estimating tuboid parameters for every spatio-temporal location in a video is computationally intensive. Instead, we used an off-the-shelf spatio-temporal interest point detector [8,4] to obtain a set of candidate interest locations. In particular, we report our results using the detector from [4]. We applied a simple motion filter by convolving the image sequence with a derivative of a Gaussian along the temporal domain. The preprocessed sequences were then used to estimate tuboids' parameters at the locations of interest. In our experiments, we noticed that about 10% of the tuboids estimated at the provided locations had cylindrical shape (i.e., x_t, y_t, and r_t did not change over time). Cylindrical tuboids represent degenerate cases for scale-invariant features due to division by zero in Equation 9. As a result, we discarded those tuboids where the values x_t, y_t, and r_t had small variances. We first extracted general features d_g from the estimated tuboids, and adopted a leave-one-out scheme for evaluation by taking videos of actions performed by one individual for testing, and using sequences of the remaining individuals for training. The recognition rate achieved using these features was 85.2%. Similarly, we extracted scale-invariant features d_s, X-view-invariant features d_{xv}, and Y-view-invariant features d_{yv}. Recognition rates were 75.3%, 61.7%, and 75.3% for scale-, X-view-, and Y-view-invariant features, respectively. Next, we assessed the effect of camera parameter changes on the performance of our descriptors. The main goal of this experiment is to show the robustness of the proposed approach in the presence of camera parameter variations. However, an interest point detector will generate different sets of interest locations under different camera conditions. As a result, using videos with uncontrolled scale (or viewpoint) variations does not allow us to eliminate the influence of the interest region detector. To circumvent this issue, we resorted to a simulated change-of-scale approach. We rescaled frames in the original videos from the Weizmann dataset. For every video, a random scale was selected from the interval [0.5, 2], and all frames in the sequence were rescaled to the same size. Additionally, the interest locations obtained for the original sequences were transformed accordingly. In this way, the effect of the interest point detector was removed from our experiments. Examples of the transformations considered in this paper are shown in Figure 3. Again, we extracted general, scale-invariant, X-view-invariant, and Y-view-invariant descriptors, and performed a leave-one-out validation for every descriptor type. Finally, we simulated horizontal camera translations by rescaling the videos' horizontal dimension to a random scale. The performance of the different features was assessed in the classification task. A similar experiment was performed for vertically rescaled sequences. The obtained recognition results are shown in Figure 4. The graph shows the recognition performance of our proposed
Fig. 3. Effects of transformations considered in our experiments
Fig. 4. Classification performance of our features under different camera conditions
features under different camera conditions. The plot suggests that the tuboid general features allow for superior action recognition performance for all considered camera parameters. This effect is primarily due to the probabilistic nature of our learning method. Using X-view-invariant features resulted in the worst performance in our experiments. At the same time, Y-view-invariant features allow for much better recognition performance compared to X-view-invariant features. This suggests that, for the actions in the Weizmann dataset, the X-component of the motions is more descriptive than the Y-component (i.e., viewpoint changes in the horizontal plane affect recognition performance more significantly than viewpoint changes in the vertical plane).
7 Conclusion
In this paper, we proposed a novel approach to video feature extraction. Rather than using the information inside video subregions, our features are based on the shapes of tuboid regions designed to follow the local information flow. We developed a set of general descriptors based on tuboid shapes, as well as scale-invariant and partially view-invariant descriptors. Preliminary experiments performed on the Weizmann dataset suggest that the descriptor works well, but there are a number of issues that still need attention. Among these issues is
the need for a more comprehensive evaluation of the method on other motion datasets containing actual viewpoint and scale variations. Additionally, our best recognition result was only 85.2%, while state-of-the-art action recognition methods have achieved 100% recognition on the same dataset used in our experiments. This might be in part because our descriptor uses shape information only. While the tuboid's shape seems to be useful, we might be able to improve classification performance by combining the shape and the content in the salient region. Finally, the temporal slices of our tuboids are of circular shape. By allowing a more flexible parametrization in Equation 5, it might be possible to include more information into the tuboid shape descriptors. We are currently working on these issues and the results of these studies will be reported in due course. Acknowledgments. Research was supported by the U.S. Office of Naval Research under contract N00014-05-1-0764.
References 1. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008) 2. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: ICCV, pp. 1395–1402 (2005) 3. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Ser. B 39 (1977) 4. Dollár, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: VS-PETS (October 2005) 5. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of The Fourth Alvey Vision Conference, pp. 147–151 (1988) 6. Kadir, T., Brady, M.: Scale saliency: a novel approach to salient feature and scale selection. In: VIE, pp. 25–28 (2003) 7. Kläser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: British Machine Vision Conference, September 2008, pp. 995–1004 (2008) 8. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV, Nice, France (October 2003) 9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004) 10. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors. IJCV 60(1), 63–86 (2004) 11. Oikonomopoulos, A., Patras, I., Pantic, M.: Human action recognition with spatiotemporal salient points. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 36(3), 710–719 (2006)
Level Set Gait Analysis for Synthesis and Reconstruction Muayed S. Al-Huseiny, Sasan Mahmoodi, and Mark S. Nixon ISIS, School of Electronics and Computer Science, University of Southampton, UK {mssah07r,sm3,msn}@ecs.soton.ac.uk
Abstract. We describe a new technique to extract the boundary of a walking subject, with the ability to predict movement in missing frames. This paper uses a level set representation of the training shapes and an interpolating cubic spline to model the eigenmodes of the implicit shapes. Our contribution is to use a continuous representation of the feature space variation with time. The experimental results demonstrate that this level set-based technique can be used reliably in reconstructing the training shapes, estimating in-between frames to help in synchronizing multiple cameras, compensating for missing training sample frames, and the recognition of subjects based on their gait.
1 Introduction For almost all computer vision problems, segmentation plays an important role in higher-level algorithm development. The fact that real-world images are mostly complex, noisy and occluded makes the achievement of robust segmentation a serious challenge. Some of these difficulties can be tackled via the introduction of prior knowledge [1], due to its capacity to compensate for missing or misleading image information caused by noise, clutter or occlusion [2, 3]. Accordingly, a robust gait prior shape should enable improved segmentation of walking subjects. By achieving this, this paper clears the way to solving other related problems such as synchronizing multiple cameras [4] and estimating the temporally correlated prior shapes. One of the earliest uses of prior knowledge in image segmentation was in the work by Cootes et al [5, 6], in which a Gaussian model was used to learn a set of training shapes represented by a set of corresponding points. The image segmentation approach of Leventon et al [7] uses geodesic active contours [8] guided by a statistical prior shape captured by PCA. By using the Chan and Vese [9] segmentation model, Tsai et al [10] modified the Leventon et al [7] approach to develop prior shape segmentation for objects with linear deformations, such as human organs in medical images. This approach has shown success for a wide range of applications, specifically image segmentation tasks. Human gait segmentation is inherently more challenging, because the deformation of shapes is non-Gaussian and because gait is self-occluding. It is also periodic, and as such temporally coherent in terms of the shapes, which are not equally likely at all times [2]. Two main directions under the title of statistical shape priors have been suggested and employed so far in order to deal with these issues. Some authors suggest the use of kernel density estimation [11] to decompose a shape's deformation modes. The problem however with this approach is that the kernel is chosen regardless of how the data is distributed in feature space [12].
An alternative is to use traditional linear PCA, accompanied by some mechanism to synthesize new shapes. In other words, the PCA is used to reduce the dimensionality, while the temporally dependent deformation is handled by this mechanism. In one early approach, Cremers [2] developed a gait segmentation model based on autoregressive (AR) systems as a mechanism to synthesize new shapes. Motivated by the above, we describe a new approach to shape reconstruction which can reconstruct moving shapes in an image sequence: interpolation with a cubic spline, which we then apply to estimate accurate shapes in human gait sequences. In the rest of the paper, Section 2 explains the statistical shape model, Section 3 deals with the interpolating cubic spline, the results and discussions are presented in Section 4, and finally the paper concludes in Section 5.
2 Statistical Shape Model This section describes the process of learning the training set shapes and then reconstructing the learned shapes. Principal Components Analysis (PCA) decomposition is used to capture the main modes of variation that the shape undergoes over time, and to reduce dimensionality. Following Leventon [7], the boundaries of the n training shapes, each with N pixels, are embedded as the zero level sets of n signed distance functions (SDFs) using Fast Marching [13] (see Fig. 1 for illustration). These functions then form the set \Phi = \{\phi_1, \phi_2, \ldots, \phi_n\}. The goal then is to build a statistical model from this set of shapes.
Fig. 1. Formation of SDFs of four training shapes out of 38 using Fast Marching. Also projected beneath them are the corresponding contours that represent their zero level sets.
The mean shape \mu is computed by averaging the elements over the set \Phi, i.e., \mu = \frac{1}{n}\sum_{i=1}^{n}\phi_i. Then the shapes' variability is calculated by PCA. First, each shape is centralized by subtracting the mean \mu from \phi_i to create the mean-offset maps \tilde{\phi}_i. Next, n column vectors, u_i, are formed by vertically stacking the columns of the i-th offset map to generate n lexicographical column vectors. These vectors collectively define the shape-variability (N \times n) matrix S:

S = [u_1, u_2, \ldots, u_n].    (1)

Eigenvalue decomposition is then employed to decompose the N \times N matrix \frac{1}{n} S S^T into its eigenvectors and eigenvalues as:

\frac{1}{n} S S^T = U \Lambda U^T,    (2)

where U is an N \times N matrix whose column vectors represent the modes of variation (principal components) and \Lambda is the diagonal matrix of the eigenvalues. In order to reduce the computational burden, the eigenvectors of the n \times n (n << N) matrix \frac{1}{n} S^T S are computed instead:

\frac{1}{n} S^T S = D \Lambda D^T,    (3)

then the set of vectors

e_i = S d_i,    (4)

where d_i are the columns of D, must be normalized to become of unit length,

\hat{e}_i = \frac{e_i}{\| e_i \|},    (5)

which in turn represent the eigenvectors of the matrix \frac{1}{n} S S^T [14], i.e. the columns of U.

Fig. 2. The first three coefficient vectors (eigenmodes) of 38 training SDFs alongside their counterparts synthesized by AR and cubic spline

A set of coefficients \alpha_i (see Fig. 2) is computed to quantify the contribution of each eigenmode to the i-th shape; these coefficients are periodic and they are computed as:

\alpha_i = U_k^T u_i,    (6)

where U_k is the matrix of the first k eigenvectors. \alpha, the weighting coefficients' vector, represents the shape. Accordingly, an estimated valid shape similar to those of the training set, \hat{\phi}, can be reconstructed using k (k < n) coefficients as:

\hat{\phi} = U_k \alpha + \mu.    (7)
The accuracy of the shape estimate obviously depends on k; there is a tradeoff between accuracy and computational cost, and this issue is further clarified in the experimental results. On the whole, the first few components are quite enough for most applications, and the rest may be regarded as redundant. The backbone of the ability of this model to capture and reconstruct the pattern in the shapes of the training set is the set of coefficients \alpha, and the main contribution of this paper is to propose a method to synthesize new shapes in a gait sequence based on the training set. The above brief description represents the statistical shape model, usually used to constrain the state space during the segmentation process to the learned class of shapes. In analyzing a walking subject over time, the sampled silhouettes follow some pattern. To make best use of this observation, an AR system was suggested [2] to estimate the set of coefficients \alpha. Taking into account that gait is a smooth continuous motion, this paper suggests using a cubic spline to model the behavior of the elements of \alpha. Although there are other interpolation schemes, numerical experiments demonstrate satisfactory results when employing a cubic spline interpolation method. We also note that the optimality of a particular spline needs to be further investigated in a different context.
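The construction of Equations 1-7, including the small n×n eigen-decomposition used to avoid the N×N problem, can be sketched in NumPy as follows (our own rendering; the symbol names and the toy data are assumptions, and the real model operates on aligned SDFs rather than random arrays):

```python
import numpy as np

def train_shape_model(sdfs, k):
    """PCA shape model from n signed-distance maps (Eqs. 1-7).

    sdfs : array (n, H, W) of training SDFs; k : number of retained eigenmodes.
    Returns (mu, Uk, alphas): mean map, N x k eigenmode matrix, n x k coefficients.
    """
    n = sdfs.shape[0]
    X = sdfs.reshape(n, -1).T                    # N x n, columns are lexicographic shape vectors
    mu = X.mean(axis=1, keepdims=True)
    S = X - mu                                   # shape-variability matrix (Eq. 1)
    W = (S.T @ S) / n                            # small n x n matrix instead of N x N (Eq. 3)
    evals, D = np.linalg.eigh(W)
    order = np.argsort(evals)[::-1]              # largest eigenvalues first
    E = S @ D[:, order]                          # Eq. 4
    U = E / np.linalg.norm(E, axis=0)            # unit-length eigenmodes (Eq. 5)
    Uk = U[:, :k]
    alphas = (Uk.T @ S).T                        # coefficients of each training shape (Eq. 6)
    return mu.ravel(), Uk, alphas

def reconstruct(mu, Uk, alpha, shape):
    """Estimated SDF from k coefficients: phi_hat = Uk @ alpha + mu (Eq. 7)."""
    return (Uk @ alpha + mu).reshape(shape)

# Toy usage on random stand-in "SDFs"
sdfs = np.random.rand(38, 60, 40)
mu, Uk, alphas = train_shape_model(sdfs, k=5)
phi_hat = reconstruct(mu, Uk, alphas[0], (60, 40))
print(Uk.shape, alphas.shape, phi_hat.shape)
```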
3 Interpolating Cubic Spline This section gives a brief description of the interpolating cubic spline suggested to synthesize new shapes. The cubic spline [15] is a piecewise continuous curve, passing through each of the values y_j of a tabulated function y_j = y(x_j), j = 1, \ldots, n, that it is supposed to model. There is a separate cubic spline polynomial for each interval, each with its own coefficients. For a single interval between x_j and x_{j+1}, the cubic spline polynomial is:

y = A y_j + B y_{j+1} + C y''_j + D y''_{j+1},    (8)

where A = \frac{x_{j+1} - x}{x_{j+1} - x_j}, B = 1 - A, C = \frac{1}{6}(A^3 - A)(x_{j+1} - x_j)^2, and D = \frac{1}{6}(B^3 - B)(x_{j+1} - x_j)^2. Substituting for x and y, the second derivatives y''_j are found by solving the linear equations defined by:

\frac{x_j - x_{j-1}}{6} y''_{j-1} + \frac{x_{j+1} - x_{j-1}}{3} y''_j + \frac{x_{j+1} - x_j}{6} y''_{j+1} = \frac{y_{j+1} - y_j}{x_{j+1} - x_j} - \frac{y_j - y_{j-1}}{x_j - x_{j-1}},    (9)

following the assumptions in [15], and solving for y''_{j-1}, y''_j, and y''_{j+1}. These polynomials collectively constitute the piecewise smooth cubic spline.
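Rather than implementing Equations 8 and 9 directly, a practical sketch can rely on SciPy's CubicSpline, fitting one spline per eigenmode coefficient over the walking cycle and evaluating it at arbitrary (including in-between or missing) frame positions. The periodic boundary condition reflects the observation in Section 2 that the coefficients are periodic, and is our assumption about how the endpoints are handled.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def fit_coefficient_splines(alphas, periodic=True):
    """One cubic spline per eigenmode, fitted to its coefficient over the cycle.

    alphas : array (n_frames, k) of PCA coefficients for one walking cycle.
    """
    t = np.arange(alphas.shape[0])
    bc = 'periodic' if periodic else 'not-a-knot'
    if periodic:                                   # periodic splines need matching endpoints
        t = np.append(t, alphas.shape[0])
        alphas = np.vstack([alphas, alphas[:1]])
    return [CubicSpline(t, alphas[:, j], bc_type=bc) for j in range(alphas.shape[1])]

def coefficients_at(splines, times):
    """Evaluate all eigenmode coefficients at (possibly fractional) frame times."""
    return np.stack([s(times) for s in splines], axis=-1)

# Toy usage: synthetic periodic coefficients for a 38-frame cycle, 5 eigenmodes
t = np.arange(38)
alphas = np.stack([np.sin(2 * np.pi * (j + 1) * t / 38) for j in range(5)], axis=1)
splines = fit_coefficient_splines(alphas)
half_frames = coefficients_at(splines, np.arange(0, 37.5, 0.5))   # up-sampled (in-between) frames
print(half_frames.shape)                                          # (75, 5)
```

The same splines serve the up-sampling and leave-one-out experiments of Section 4 by simply choosing the evaluation times accordingly.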
4 Experiments and Discussion We apply the cubic spline to model and reconstruct the statistical shape model coefficients. The experimental results are exhibited in two parts: The first is a comparative
application of the AR system and the cubic spline to the shape model. This is aimed at testing the performance of the proposed approach in comparison with AR, and hence its suitability for use as a gait segmentation prior shape model; this is quantified by the error function \mathcal{E}_r described below. The second part includes further tests to verify the robustness of the proposed approach in perceiving the evolution of the implicit shapes, namely by examining its ability to estimate the in-between shapes, by testing its adaptability to the lack of part of the training data, and finally by using the generated model parameters in the recognition of subjects.

• Error Function \mathcal{E}_r: This measure is introduced to assess the accuracy of the estimated shape; the total number of erroneous pixels in the estimated shape is computed as follows:

\mathcal{E}_r = \sum_{x} | s(x) - \hat{s}(x) |,    (10)

where s is the original shape defined by the training set and \hat{s} is its estimated counterpart. Initially, the training set silhouettes are aligned with respect to their centre of gravity and the corresponding embedding SDFs are computed by applying the Fast Marching technique [13] (see Fig. 1).
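A compact sketch of the error count of Eq. 10, assuming silhouettes are recovered from the SDFs by thresholding at the zero level (the thresholding convention is our assumption):

```python
import numpy as np

def error_pixels(original, estimated, level=0.0):
    """E_r of Eq. 10: number of pixels where the estimated silhouette differs
    from the original one (silhouettes taken as the SDF interior, phi <= level)."""
    return int(np.sum((np.asarray(original) <= level) != (np.asarray(estimated) <= level)))

# Toy usage with two random stand-in SDFs on a 300x400 grid (12x10^4 pixels)
a, b = np.random.randn(300, 400), np.random.randn(300, 400)
print(error_pixels(a, a), error_pixels(a, b))   # 0, and roughly half the pixels
```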
Fig. 3. Error function (\mathcal{E}_r) for the shape silhouettes
In the first experiment, a set representing a person's single walking cycle sequence of 38 shapes is used. The statistical shape model is applied to calculate the coefficient vectors (Eq. 6). The AR system and the cubic spline are then both applied to model these coefficient vectors (see Fig. 2-b and c). Next, both approaches are used to reconstruct the estimated walking sequence shapes (Eq. 7). Fig. 3 shows \mathcal{E}_r for both approaches (a logarithmic scale is used in this figure to accommodate both plots in the same figure). It is noted that because of the logarithmic scale used in figure 3, the error term may appear smoother as its value increases (e.g. for the case of the AR method). This measure shows that for the cubic spline there are on average 11.1 erroneous pixels per image of 12×10^4 pixels (0.009%), compared to 1802.3 erroneous pixels per the same image (1.5%) for AR.
Fig. 4 contrasts these results by showing some of the training sequence shapes alongside their reconstructed counterparts. In the bottom row it is very easy to observe a filtering effect imposed by the AR system on the time-series data \alpha. This may be due to the fact that an AR system acts as a recursive discrete-time filter [16], and therefore affects the original data due to its intrinsic filtering properties, which appears on the reconstructed shapes in the form of smoothing or even erasing the thin parts of the shapes, like the hands and the feet, and in some cases results in invalid shapes such as twisted hands or feet. The middle row, which shows the shapes reconstructed by the cubic spline, looks much more similar to the training set, and this is consistent with the outcome of the error measure \mathcal{E}_r.
Fig. 4. The estimation of the training shapes. Top row is a sample of the training sequence shapes of order (right to left): 1, 5, 10, 13, 21, 26, 28, 33 and 38. Middle row is the same shapes reconstructed by cubic spline. Bottom row is the shapes estimated by AR.
One final point worth mentioning is that AR is not self-starting by nature, i.e. it depends on initial condition data that must be provided prior to the start of the reconstruction operation, and when such data is not available or is inaccurate, the subsequent reconstruction is poor. Putting this in the context of segmentation, this means that the segmentation of the first few frames must rely on an alternative technique, and that the overall segmentation depends on the accuracy of that alternative segmentation technique, in addition to the genuine shortcoming of AR demonstrated by the results above. Comparing this with the cubic spline, which demonstrated excellent reconstruction results and is a self-starting data sequence reconstruction technique, the argument may be very strongly held towards using the approach proposed here in the gait prior shape segmentation model. In the second experiment, the approach proposed here is tested for its capability to estimate the transitional shapes in between the successive training shapes, i.e. to up-sample the training data set. This has application in synchronizing multiple cameras. This is quite important because a large number of cameras placed in different locations, working with different sampling rates, produce unsynchronized footage. Up-sampling can numerically synchronize the captured frames by reconstructing in-between frames. Fig. 5 exhibits the subtle movements estimated by this approach using two different sampling rates.
Fig. 5. The up-sampling: bottom rows show for each sampling step size three sample couples of consecutive training sequence silhouettes. The top rows are the estimated shapes' contours corresponding to these silhouettes in addition to the shapes in-between.
In the third experiment, the proposed approach is tested for its capability to compensate for missing data in the training set by applying the leave-one-out test, where each time one of the training set sample shapes is removed and the remaining shapes are used for interpolation; the model is then used to estimate the whole walking sequence including the missing shape. This is repeated over all of the sample space.
Fig. 6. Leave-one-out test: The bottom rows show the two training sequence silhouettes before and after the removed one. The top rows are the reconstructed shapes' contours including the missing one.
Fig. 6 shows examples of this test, in which the missing frames have successfully been reproduced. The error function (Eq. 10), which is the total number of erroneous pixels, is used to quantify this test. From Fig. 7-a, it can be seen that on average there are 11.5 erroneous pixels per image of 12×10^4 pixels (0.01%). The autocorrelation function (ACF) is also derived for \mathcal{E}_r, shown in Fig. 7-b, which confirms that errors in shape estimation are random in nature.
Fig. 7. Error Analysis for the estimated 38 shapes
The impact of the number of eigenmodes used in the reconstruction and in estimating missing shapes is assessed by performing the leave-one-out test to reconstruct missing frames while decreasing the number of eigenmodes iteratively; Fig. 8 shows the results of this experiment. As shown in this figure, increasing the number of eigenmodes decreases the reconstruction error.
Fig. 8. The error function (\mathcal{E}_r) for estimating the 28th shape in the sequence with the number of eigenmodes used in the estimation ranging from 1 to 37
In the fifth experiment, the proposed approach is used for recognition based on the assumption that different subjects have different deformability coefficients, i.e. \alpha. Hence, the gait cycles of four different subjects are used to produce four different sets of \alpha; for one of those subjects, another test gait cycle different from the one used earlier is also used and the corresponding \alpha is produced. Next, a distance measure is defined as

D(\alpha^a, \alpha^b) = \sum_{j} \sum_{t=1}^{T} | \alpha_j^a(t) - \alpha_j^b(t) |,

where j is the index over the coefficients, T is a single period of walking cycle shapes, \alpha^a and \alpha^b are the coefficient vectors for which the distance is computed, and t is the time variable. This distance is employed and has shown good results. It is used to choose, as the recognized subject, the one whose gait cycle coefficients have the least distance from those of the test gait cycle. Table 1 shows that the correct subject (the fourth) has been identified, with a ratio of 0.003% representing the distance for the correct subject relative to the average distance for the incorrect ones.

Table 1. The distance D, calculated between the test subject's set and the four subjects' sets; the correct subject is highlighted

Subject      | Distance (D)
1st subject  | 28992
2nd subject  | 25158
3rd subject  | 26722
4th subject  | 7868
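The recognition rule of this experiment can be sketched as follows (an illustrative snippet assuming aligned cycles of equal length; the subject names and data are placeholders):

```python
import numpy as np

def gait_distance(alpha_a, alpha_b):
    """Sum of absolute differences between two coefficient trajectories over one
    walking cycle: D = sum_j sum_t |alpha_a[t, j] - alpha_b[t, j]|."""
    return float(np.abs(np.asarray(alpha_a) - np.asarray(alpha_b)).sum())

def recognise_subject(test_alpha, gallery):
    """Return the gallery subject whose coefficient set is closest to the test cycle."""
    return min(gallery, key=lambda name: gait_distance(test_alpha, gallery[name]))

# Toy usage: 4 subjects, 38-frame cycles, 5 eigenmode coefficients each
rng = np.random.default_rng(1)
gallery = {f'subject {i+1}': rng.normal(i, 1.0, (38, 5)) for i in range(4)}
test = gallery['subject 4'] + rng.normal(0.0, 0.1, (38, 5))
print(recognise_subject(test, gallery))          # expected: 'subject 4'
```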
5 Conclusions This paper has introduced an interpolating cubic spline to better model walking subjects by modeling the time variation of the coefficients of the eigenvectors over one
walking cycle. This demonstrated better performance and accuracy than the autoregressive system used in the literature for the same purpose. The technique proposed here succeeded in capturing the key variability modes, which led to success in the reconstruction of walking cycle shapes identical to the training set. The method presented here was also used successfully in reconstructing the in-between frames which did not exist in the initial training set, and hence cleared the way for numerically synchronizing multiple cameras, an application which is increasingly vital with the rise in the use of monitoring cameras. The method proposed here was also tested for its tolerance to missing parts of the training set, for which the technique proved robust. Furthermore, the present technique was employed in recognition by using the variability coefficients, which can be further improved by using orthogonal basis functions such as Chebyshev moments as future work. The technique proposed here demonstrates promising results and we therefore expect it to be a very useful tool in other gait problems such as action identification and modeling/recognition for curved-path walking cycles in the gait challenge. The technique presented here can be easily applied in other applications, such as modeling the movement of moving objects, including animals.
References 1. Nixon, M.S., Aguado, A.: Feature Extraction & Image Processing, 2nd edn. Academic Press, London (2008) 2. Cremers, D.: Dynamical statistical shape priors for level set-based tracking. IEEE Trans. on PAMI 28(8), 1262–1273 (2006) 3. Mowbray, S.D., Nixon, M.S.: Extraction and recognition of periodically deforming objects by continuous, spatio-temporal shape description. In: Proc. IEEE Conf. on CVPR, vol. 2, pp. 895–901 (2004) 4. Prismall, S.P., Nixon, M.S., Carter, J.N.: Novel Temporal Views of Moving Objects for Gait Biometrics. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, Springer, Heidelberg (2003) 5. Cremers, D.: Statistical shape priors for level set segmentation. PAMM 7(1), 1041903–1041904 (2007) 6. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active Shape Models-Their Training and Application. Computer Vision and Image Understanding 61(1), 38–59 (1995) 7. Leventon, M.E., Grimson, W.E.L., Faugeras, O.: Statistical shape influence in geodesic active contours. In: Proc. IEEE Conf. on CVPR, pp. 316–323 (2000) 8. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic Active Contours. International Journal of Computer Vision 22(1), 61–79 (1997) 9. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. on Image Processing 10(2), 266–277 (2001) 10. Tsai, A., Yezzi Jr., A., Wells, W., Tempany, C., Tucker, D., Fan, A., Grimson, W.E., Willsky, A.: A shape-based approach to the segmentation of medical imagery using level sets. IEEE Trans. on Medical Imaging 22(2), 137–154 (2003) 11. Cremers, D., Osher, S., Soatto, S.: Kernel Density Estimation and Intrinsic Alignment for Shape Priors in Level Set Segmentation. International Journal of Computer Vision 69(3), 335–351 (2006)
12. Dambreville, S., Rathi, Y., Tannenbaum, A.: A Framework for Image Segmentation Using Shape Models and Kernel Space Shape Priors. IEEE Trans. on PAMI 30(8), 1385–1399 (2008) 13. Sethian, J.A.: A fast marching level set method for monotonically advancing fronts. Proc. of the National Academy of Sciences of the United States of America 93(4), 1591–1595 (1996) 14. Leventon, M.E.: Statistical models in medical image analysis, PhD Thesis, MIT (2000) 15. Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes with Source Code CD-ROM, The Art of Scientific Computing, 3rd edn. Cambridge University Press, Cambridge (2007) 16. Oppenheim, A., Schafer, R., Buck, J.: Discrete-Time Signal Processing, 2nd edn. Prentice Hall, Englewood Cliffs (1999)
Real-Time Hand Detection and Gesture Tracking with GMM and Model Adaptation Gabriel Yoder and Lijun Yin Department of Computer Science, State University of New York, Binghamton
Abstract. Hand gestures are an efficient means of human-computer interaction (HCI). They can also be used for the development of a non-intrusive biometrics system. In this paper, we address the issues of hand detection and gesture tracking using a single camera. A simple yet effective approach is proposed for applications with complex backgrounds and minimal constraints on the subject. A hand detection approach is presented using a Bayesian classifier based on Gaussian Mixture Models (GMM) for identifying pixels of skin color. A connected-component based region-growing algorithm is included for forming areas of skin pixels into areas of likely hand candidates. Given the detected hand region, we further detect the hand features using a deformable model for hand gesture estimation. We propose a novel method, a 3D physics-based dynamic mesh adaptation approach, to estimate and track hand shape and finger directions. The physics-based hand model adaptation algorithm allows us to model hand shape and orientation at the same time, thereby improving the robustness and speed of hand gesture tracking and regeneration.
1 Introduction
The ability to detect hands and their various gestures has many potential applications in the fields of biometrics, HCI, etc. For example, it can be used as a pre-detector for a hand biometrics system, a generalized pointing detector, a relatively simple virtual mouse, or even a gesture-based system for command execution. All of these applications require an accurate way to determine hand location and posture, yet for them to be truly useful for HCI and biometrics applications, the restrictions on the user must be minimized. The goal of this work is to develop a real-time non-intrusive hand tracking system using a regular camera. A number of approaches have been used for optically collecting information pertaining to a hand's location and gesture, with varying degrees of accuracy and constraint on the user [1][2][3]. For motion capture purposes, good results have been achieved with the use of multiple cameras combined with special markers placed on key positions for easy detection [4]. Such systems are also generally used in an environment with a simple background that may easily be removed. For HCI or biometrics applications, it is desirable to remove as many
of these restrictions as possible while still maintaining sufficiently high accuracy. In many environments, it is impractical to use more than a single camera. This limits the ability for 3-dimensional positioning; however, for many applications 2-dimensional positioning is sufficient. There have been a variety of trade-offs concerning the use of markers and background restrictions. In some systems, the hand remains free of markers, but the background is constrained to a single color which contrasts well with skin tone [5][6]. In other systems, the background is unconstrained, but either markers are used or the hand is expected to be found in a predetermined location on initial start of the system [7]. With marker-based systems, there exists a trade-off between the number of markers and the level of detail in hand posture information. For example, systems that are only concerned with a pointing gesture, such as [8], may simply use a single colored thimble to mark the pointing finger. An ideal optical system for hand detection would be fast, accurate, and would require no special restrictions on the user or the environment. In this paper, we explore the initial development of components intended for use in a system which aims to provide accurate, marker-free detection of hand position with minimal background restrictions. The basis for hand detection is a Gaussian Mixture Model (GMM) based classifier for distinguishing skin-colored pixels from background pixels. This is combined with a connected component analysis system in order to locate blocks of the image which contain significant amounts of skin-colored pixels. We will further address the issue of hand gesture tracking with a single camera. Hand motion analysis and hand gesture tracking have attracted intensive research in recent years [9][10][11][5][4]. The great variety and adaptability of hand movement and the indistinguishable hand features of the joint parts have made it difficult to capture features robustly and accurately. Some previous work utilized mechanical glove-based devices [12][5] to track hand features, which are device-constrained and do not allow free-form motions. There are some alternative methods to date which utilize so-called non-intrusive techniques to detect the hand features. For example, researchers make use of magnetic trackers and video-based tracking systems [5] for motion tracking and feature detection [13][14][15][16]. Thalmann et al. [17] applied a video-based tracking system to the modeling of different body parts and to real-time animation in a single environment. Lee et al. [18] used a thimble with a color blob for fingertip tracking. Some systems relied on stereo cameras to implement hand tracking and pointing [19][20][21]. In short, non-intrusive methods still face a big challenge in hand feature extraction in terms of accuracy and robustness. In view of the problems in hand feature detection and tracking, and motivated by the recent advances in feature detection in computer vision [9][10][1][2][3], we propose to develop a novel technique based on our 3D hand model for robust and accurate feature extraction from the moving hand with variable postures. We envision that using the 3D model-based approach allows the model to fit and track the hand motion simultaneously with an accurate estimate of hand shape and orientation.
2 Hand Detection

2.1 Skin-Tone Pixel Segmentation
The complex nature of the range of colors which constitute skin tones makes it difficult to establish a firm set of values for image segmentation, and many types of methods for handling skin detection have been proposed [22]. Variations in lighting further complicate the process of distinguishing skin from background colors. For these reasons, it is desirable to use machine learning approaches to establish a classifier which could accurately classify known examples of each class as well as provide reasonable guesses for colors not found in the training data. A Bayesian classifier provides a simple, robust classification with a built-in classification confidence and can be used if a probability distribution function (PDF) can be established for each class. Since the actual PDF for the skin and background classes is not known, it is necessary to design an approximation of the PDF based on known data. A Gaussian Mixture Model (GMM) provides such an approximation. In a GMM, a single PDF (1) is composed of the weighted sum of the PDFs for a series of multivariate Gaussian distributions. Since the PDF for a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ is known to be (2) (where D is the dimensionality of feature vector x), the real work behind determining the PDF for a GMM is to establish the number of component Gaussians C, the mean and covariance of the Gaussians, as well as the relative weights α_c of the Gaussians.

p(x \mid \alpha_1, \mu_1, \Sigma_1, \ldots, \alpha_C, \mu_C, \Sigma_C) = \sum_{c=1}^{C} \alpha_c \, p(x \mid \mu_c, \Sigma_c)   (1)

p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}   (2)
A number of methods exist for fitting the specifications of a GMM to a set of data. One such method is to establish a fixed number of Gaussians for the GMM, and then fit each of the Gaussians using the Expectation-Maximization (EM) algorithm (which is thoroughly covered in other sources such as [23] and [24]). In brief, the component weights, means, and covariances are repeatedly calculated using (3), (4), (5), and (6) until the change from one iteration i to the next is sufficiently small. In order to calculate w_{n,c}^{0}, each \alpha_c^{0} is initialized to 1/C, each mean vector is filled with random values in the range (0.0, 1.0], and each covariance matrix contains values in (0.0, 1.0] on the diagonal and 0 elsewhere. If a poor fit results from the training, the number of Gaussians C is adjusted and the process is simply run again. The GMM for each class is trained separately using known examples from the class.

w_{n,c}^{i} = \frac{\alpha_c^{i} \, p(x_n \mid \mu_c^{i}, \Sigma_c^{i})}{\sum_{j=1}^{C} \alpha_j^{i} \, p(x_n \mid \mu_j^{i}, \Sigma_j^{i})}   (3)

\alpha_c^{i+1} = \frac{1}{N} \sum_{n=1}^{N} w_{n,c}^{i}   (4)

\mu_c^{i+1} = \frac{\sum_{n=1}^{N} x_n \, w_{n,c}^{i}}{\sum_{n=1}^{N} w_{n,c}^{i}}   (5)

\Sigma_c^{i+1} = \frac{\sum_{n=1}^{N} w_{n,c}^{i} \, (x_n - \mu_c^{i+1})(x_n - \mu_c^{i+1})^T}{\sum_{n=1}^{N} w_{n,c}^{i}}   (6)
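As an illustration only, the following NumPy sketch performs one EM pass implementing Equations (3)–(6). The function and variable names (X for the N×D training matrix, alphas/mus/Sigmas for the mixture parameters) are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate Gaussian density of Eq. (2), evaluated for each row of X."""
    D = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt(((2 * np.pi) ** D) * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def em_step(X, alphas, mus, Sigmas):
    """One EM iteration implementing Eqs. (3)-(6)."""
    N = X.shape[0]
    C = len(alphas)
    # E-step: responsibilities w_{n,c}  (Eq. 3)
    w = np.column_stack([alphas[c] * gaussian_pdf(X, mus[c], Sigmas[c]) for c in range(C)])
    w /= w.sum(axis=1, keepdims=True)
    # M-step: weights, means, covariances  (Eqs. 4-6)
    Nc = w.sum(axis=0)
    alphas = Nc / N                                                 # Eq. (4)
    mus = [(w[:, c] @ X) / Nc[c] for c in range(C)]                 # Eq. (5)
    Sigmas = [(w[:, c, None] * (X - mus[c])).T @ (X - mus[c]) / Nc[c]  # Eq. (6)
              for c in range(C)]
    return alphas, mus, Sigmas
```

In practice the step would be repeated until the parameter change between iterations falls below a small tolerance, as described above.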
Our skin tone pixel segmentation uses a single Bayesian classifier based on the GMMs determined from a static set of training data. The classifier is fully computed ahead of time by a training program which processes a set of example images to extract the user-specified color components and train the GMMs accordingly. Once the training program has completed, the classifier remains fixed, and may be used by a separate hand detection application. The hand detection application examines each pixel independently, and assigns the pixel to a given class based solely on the output of the classifier for that pixel.
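A minimal sketch of such a fixed two-class Bayesian classifier is given below. It fits one GMM per class and assigns a pixel feature vector to the class with the larger prior-weighted likelihood; the use of scikit-learn, the number of components, and the equal class priors are assumptions for the example rather than the authors' training program.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# skin_features, bg_features: arrays of shape (num_pixels, D) holding the
# chosen color components of the training pixels for each class.
def train_classifier(skin_features, bg_features, n_components=4):
    gmm_skin = GaussianMixture(n_components=n_components, covariance_type='full').fit(skin_features)
    gmm_bg = GaussianMixture(n_components=n_components, covariance_type='full').fit(bg_features)
    return gmm_skin, gmm_bg

def classify_skin(features, gmm_skin, gmm_bg, prior_skin=0.5):
    """Return True where the Bayesian decision favors the skin class."""
    log_skin = gmm_skin.score_samples(features) + np.log(prior_skin)
    log_bg = gmm_bg.score_samples(features) + np.log(1.0 - prior_skin)
    return log_skin > log_bg
```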
2.2 Hand Region Detection
In order to locate potential hand regions in images, our system first uses a Bayesian classifier to identify skin tone areas. In initial implementations, every pixel of the image was processed by the classifier; however, this proved to be too time-consuming for real-time processing. In its current incarnation, the hand detector divides the image into a grid of blocks of 8×8 pixels. The feature values of the pixels within each block are averaged, and these mean values are fed into the classifier. The resulting classification is then applied to all pixels within the block. The size of the block was empirically selected based on the criteria that it sufficiently reduces processing time and evenly divides common image sizes. Once the Bayesian classifier has identified each block as either skin or background (essentially producing a down-scaled classification image), the results are scanned for connected regions consisting of blocks with a skin confidence of at least 50%. We applied a skin-pixel-based region-growing approach to detect the connected components of skin regions. Connected regions whose width or height is below an empirically derived threshold are assumed to be false positives and are discarded. Any remaining connected regions are presumed to be hands.
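The block-based detection step could be sketched as follows. Here scipy.ndimage's connected-component labelling stands in for the region-growing approach described above, and the block size, size threshold, and function names are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def detect_hand_regions(feature_img, classify_block, block=8, min_size=3):
    """feature_img: (H, W, D) per-pixel color features; classify_block: callable
    returning a skin probability for a (D,) mean feature vector."""
    H, W, _ = feature_img.shape
    gh, gw = H // block, W // block
    skin = np.zeros((gh, gw), dtype=bool)
    for i in range(gh):
        for j in range(gw):
            cell = feature_img[i*block:(i+1)*block, j*block:(j+1)*block]
            mean_feat = cell.reshape(-1, cell.shape[-1]).mean(axis=0)
            skin[i, j] = classify_block(mean_feat) >= 0.5
    # connected skin blocks; regions smaller than the threshold are discarded
    labels, _ = ndimage.label(skin)
    boxes = []
    for sl in ndimage.find_objects(labels):
        h, w = sl[0].stop - sl[0].start, sl[1].stop - sl[1].start
        if h >= min_size and w >= min_size:
            boxes.append(sl)   # presumed hand region, in block coordinates
    return boxes
```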
3 Hand Gesture Tracking
The motion tracking of the hand gesture attempts to follow the hand trajectory using an appearance model to match the hand region. The tracking not only works on the hand palm region, but also the individual finger regions with various gestures. In general, one hand gesture or motion involves dynamic and/or static configuration of the hand. Each posture frame in a video sequence must be analyzed in
order to obtain the 3D position and orientation of the hand. However the noisy data and occlusion of the fingers make it a very difficult task. In this paper, we propose a novel feature detection method, Deformable Hand-Model based Feature Adaptation and Estimation algorithm, to capture the hand features as well as the hand shapes. Our feature detection method is composed of two steps: initial global estimation and fine adaptation. In the first stage, we apply the principal component analysis method to compute the major axis of a hand within the detected hand region. This stage is to determine the approximate orientation of the hand so that our hand model can be roughly adjusted to fit to the hand using an affine transformation. The deformable 3D hand model with different views is shown in Figure 1.
Fig. 1. Generic 3D hand models in different views with wire-frame meshes and shaded meshes
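For the first (global) stage, the major axis of the detected hand region can be estimated by PCA over the skin-pixel coordinates; the returned angle would then drive the rough affine alignment of the hand model. The following formulation is an assumption consistent with the description above, not the authors' exact code.

```python
import numpy as np

def major_axis_angle(skin_mask):
    """skin_mask: boolean (H, W) map of detected skin pixels.
    Returns the orientation (radians) of the principal axis of the region."""
    ys, xs = np.nonzero(skin_mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)                  # center the point cloud
    cov = np.cov(pts, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    major = eigvecs[:, np.argmax(eigvals)]   # eigenvector of the largest eigenvalue
    return np.arctan2(major[1], major[0])
```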
In the second stage, the fine adjustment of our model is applied to adapt to the hand image. Our deformable hand model adaptation method is an extension of the approach for adaptive sampling and modeling of images containing non-rigid objects with physical deformation found in [25]. The nodes of a hand model mesh are mobile observers or sampling sites, and they distribute themselves over the image data so as to represent the hand (including fingers and palm) with sufficient accuracy. Clearly, it is beneficial to concentrate those nodes of interest where they will do the most good: in highly curved areas of the hand surface, especially in subtle regions such as small convex and concave surfaces and small curved fingers. The hand model mesh can be assembled from nodal points connected by adjustable springs. The fundamental equation is a second-order differential motion equation with a real 3-D force to be applied for simulating boundaries of 3D objects:

m_i \frac{d^2 x_i}{dt^2} + \gamma_i \frac{d x_i}{dt} + g_i = f_i, \quad i = 1, \ldots, N   (7)

where \gamma_i is the damping coefficient dissipating kinetic energy in the mesh through friction, which eventually brings the mesh to rest, f_i is the external force acting on node i, and m_i is the node mass. Here the internal force is determined by the deformation of the springs M_i which connect to node i: g_i = \sum_{j \in M_i} ((k_2 - k_1) I_i + k_1) d_j, where k_1 and k_2 specify the range of spring stiffness from minimum to maximum. The stiffness is C_k = (k_2 - k_1) I_i + k_1, I_i is the image observation linked to node i, and d_j is the displacement of the jth spring. The external forces of the nodes are used to link the dynamic mesh to the observed image data. In our physics-based spring meshes, the bigger the stiffness of the springs is, the stronger the contraction force will be. So it makes sense to increase the stiffness
of the springs in the area of “interesting features” to make the mesh denser in that area (it “steals” nodes from other surface areas). For the hand model adaptation, our dynamic mesh is deformed by two external forces: (1) a horizontal force from the image plane, which is exerted by the gradient values and the curvature values of the position in the active areas (e.g., finger tips, finger boundaries, and the palm boundary), and (2) a vertical force in the direction perpendicular to the image plane. The vertical force deforms the mesh vertically to make the hand shape consistent with the image intensity surface. The stiffness C_k is adjustable according to the hand image curvature. C_k is increased in the areas of finger tips and finger webs (joint positions between two fingers), where large curvatures are exhibited. The curvatures of the joint parts change dynamically as the fingers roll and unfold. As a result, the hand model is deformed to each finger area after the mesh energy is minimized. Since the adapted 3D mesh model represents the hand shape at the current posture, the corresponding shape and finger orientation can be determined. Figure 2 illustrates two examples of the adapted hand models with various gestures and finger pointing directions.
Fig. 2. Example of adapted hand gesture models with various gestures and finger pointing directions
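A very rough numerical sketch of integrating the node dynamics of Equation (7) is shown below. The time step, mass, damping, stiffness range, and the sign conventions for the internal spring force g_i are all assumptions; the intent is only to illustrate how a node could be advanced under internal and external image forces, not to reproduce the authors' adaptation scheme.

```python
import numpy as np

def step_node(x, v, neighbors, rest, I_i, f_ext,
              m=1.0, gamma=0.5, k1=0.1, k2=1.0, dt=0.05):
    """One semi-implicit Euler step of Eq. (7) for a single mesh node.
    x, v: node position/velocity (3,); neighbors: (M, 3) positions of connected
    nodes; rest: (M,) rest lengths; I_i: image observation in [0, 1];
    f_ext: external image force (3,)."""
    stiffness = (k2 - k1) * I_i + k1          # C_k, larger in high-curvature areas
    g = np.zeros(3)
    for p, r in zip(neighbors, rest):
        d = p - x
        length = np.linalg.norm(d) + 1e-9
        g += stiffness * (length - r) * (d / length)   # spring-displacement term
    # m x'' + gamma x' + g = f  =>  x'' = (f - gamma v - g) / m
    a = (f_ext - gamma * v - g) / m
    v = v + dt * a
    x = x + dt * v
    return x, v
```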
4 Experimental Results
In order to train the GMM for the skin class, 812 images from the Massey Hand Gesture Database were used [26]. These images consist of hands in various postures in front of a dark background. Some images have natural lighting and others have artificial lighting. In order to extract only the skin pixels out of these images, each pixel was examined by the training program and was automatically discarded if the maximum value of the RGB components was less than or equal to 55 (this threshold was based on observation of the background colors from the image database). Training data for the background or non-skin class consisted of 8 images of a home-office taken at a resolution of 4 megapixels. Unlike the skin images, there was no need to discard any of the pixels since they were all relevant to the class. Due to the inclusion of all pixels and the higher resolution of the office photos, there were in fact slightly more training pixels for the background pixels in spite of the skin images vastly outnumbering the background images. Some example training images from both classes are shown in Fig. 3. The features that were used by the classifier were taken from the YIQ color space. Like the HSV color space, YIQ consists of one component (Y) which
Fig. 3. Left: example skin training images; Right: example non-skin training images
corresponds to overall brightness, while the other two components (IQ) provide the color. By discarding the brightness component from the YIQ data, a certain degree of lighting invariance was gained. This color space has been used for hand detection by others, such as [27], which specifies the conversion process from RGB to YIQ. (1) Test on static images: Initial tests of the classifier were performed by processing the same images that were used for the training in addition to a handful of personal photos which included people. Since the majority of these images were used for training purposes, it was expected that results would be good and would serve as an upper bound of what might be expected for a more generalized test. Overall, the results are positive. Among more than 800 test hand images, the hand regions were correctly detected approximately 90% of the time. Among the images from the Massey Hand Gesture Database, there were few background pixels being mistaken for skin; however, a number of more heavily shaded skin pixels were mistaken for background. Among the personal photos, it was observed that a fair amount of wood tones and brassy colors were being mistaken for skin. Unsurprisingly, a small number of reddish tones were included in the false skin classification. (2) Test on video sequences: For the actual hand detection tests, we used our video capture system to capture hand videos under various imaging conditions in two indoor environments (one with a simple background, the other with a complex background). A common Logitech webcam with automatic brightness and color adjustment was selected. The tests were performed on a PC with a 2.5 GHz AMD Phenom processor. The processing speed is 10 frames per second at an image resolution of 640×480 pixels. In the simple background environment, the captured video consists of the torso and arm of a person standing in front of a wall. Our experiment shows that 96% of the time, the hand regions are detected correctly with a good bounding box for the hand. Example frames are shown in Figure 4. Although the majority of the hand was detected in most of the frames, a very small number of bad frames had some background registered as a false positive. These bad frames appear to be a side effect of the camera’s automatic color adjustment producing some odd initial results. The second environment consisted of the home-office used to collect the background training data. The camera took longer to properly adjust the color for this video and consequently most of the background is included for the first 3 seconds. The color adjustment produced less than optimal results for this video. A portion of the upper section of the wall contained a reflection from the
Fig. 4. Top: Video frames with hand detected in a simple background; Middle: Skin portion of the top row; Bottom: Video frames with hands detected in a variable lighting/complex background.
ceiling light, and this reflection was adjusted to a yellowish orange shade that frequently yielded false positives from the classifier. Additionally, the hand that was farther from the camera had a purple cast, and consequently was missed or only partially located in a few frames. However, aside from these issues, the hands were successfully located in the remaining frames. The correct detection rate for hand regions is approximately 90%. Some sample frames are illustrated in Figure 4. We have also tested hand sequences with model adaptation using a high-resolution video capture system with a SONY active video camera at a frame rate of 10 frames per second and a resolution of 720×640 pixels. Out of 1,000 frames from ten subjects’ hand videos, 85% of the hand frames are correctly adapted. Figure 5 shows sample frames of hand model adaptation across video sequences with both finger gestures and pointing gestures.
Fig. 5. Adapted hand gesture model sequence (upper block) and pointing model sequence (lower block). In each block, Top two rows: original videos; Bottom two rows: adapted models.
5 Conclusion and Future Work
In this paper, we presented a color-based GMM approach for hand detection and a model-based gesture tracking approach to address the issues of hand and gesture analysis. The results are promising in terms of an effective implementation of real-time hand detection and gesture tracking. There are still some limitations in terms of the color information versus various imaging conditions. Hand occlusion is also a challenging issue for model-based gesture tracking. In our future work, we will develop a second post-detection algorithm for hand patch estimation based on the detected skin pixels. This is intended to separate hand regions from other skin regions. An approach like local binary patterns with AdaBoosting (similar to the face detection approach) will be investigated in order to improve the performance of hand detection and classification. We will further collect a greater variety of hand gesture images under various hand postures for realistic hand modeling and hand feature description.
Acknowledgement

We would like to thank the Air Force Research Lab at Rome, NY for supporting this work.
References

1. Cinque, L., et al.: Fast viewpoint-invariant articulated hand detection combining curver and graph matching. In: FGR 2008 (2008)
2. Chik, D., et al.: Using an Adaptive VAR Model for Motion Prediction in 3D Hand Tracking. In: IEEE FGR 2008 (2008)
3. Suk, H., Sin, B., Lee, S.: Recognizing Hand Gestures using Dynamic Bayesian Network. In: IEEE FGR 2008 (2008)
4. Moeslund, T.B., Granum, E.: A survey of computer vision-based human motion capture. Computer Vision and Image Understanding 81(3), 231–268 (2001)
5. Pavlovic, V., Huang, T., et al.: Visual interpretation of hand gestures for HCI: a review. IEEE Trans. PAMI (1997)
6. Kakumanu, P., Bourbakis, N., et al.: A survey of skin-color modeling and detection methods. In: Pattern Recognition (2006)
7. Kurata, T., Okuma, T., Kourogi, M., Sakaue, K.: The hand mouse: GMM hand-color classification and mean shift tracking. In: IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pp. 119–124 (2001)
8. Lee, M., Weinshall, D., Cohen-Solal, E., Colmenarez, A., Lyons, D.: A Computer Vision System for On-Screen Item Selection by Finger Pointing. In: IEEE CVPR 1999, vol. 1, p. 1999 (2001)
9. Manders, C., Farbiz, F., Chong, J., Tang, K., Chua, G., Loke, M., Yuan, M.: Robust hand tracking using a skin tone and depth joint probability model. In: FGR 2008 (2008)
10. Park, C., Roh, M., Lee, S.: Real-Time 3D Pointing Gesture Recognition in Mobile Space. In: FGR 2008 (2008)
11. Nguyen, T., Binh, N., Bischof, H.: An active boosting-based learning framework for real-time hand detection. In: FGR 2008 (2008)
12. CyberGlove, http://www.immersion.com/3d/products/cyber_glove.php
13. Kolsch, M., Turk, M.: Robust hand detection. In: IEEE FGR 2004 (2004)
14. Oka, K., Sato, Y., Koike, H.: Real-time fingertip tracking and gesture recognition. IEEE CG&A (2002)
15. Argyros, A.A., Lourakis, M.I.A.: Real-time tracking of multiple skin-colored objects with a possibly moving camera. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3023, pp. 368–379. Springer, Heidelberg (2004)
16. Wu, Y., Huang, T.: Vision-based gesture recognition: A review. In: The 3rd Gesture Workshop (1999)
17. Kalra, P., Magnenat-Thalmann, N., Mossozet, L., Sannier, G., Aubel, A., Thalmann, D.: Real-time animation of realistic virtual humans. IEEE Computer Graphics and Applications, 42–56 (1998)
18. Lee, M., Lyons, D., et al.: A computer vision system for on-screen item selection by finger pointing. In: CVPR 2001 (2001)
19. Jojic, N., et al.: Detection and estimation of pointing gestures in real-time stereo sequences. In: IEEE Automatic Face and Gesture Recognition, FGR 2000 (2000)
20. Yamamoto, Y., et al.: Arm-pointing gesture interface using surrounded stereo cameras system. In: IEEE International Conference on Pattern Recognition (2004)
21. Colombo, C., et al.: Visual capture and understanding of hand pointing actions in a 3-D environment. IEEE Trans. on SMC-B, 677–686 (2003)
22. Vezhnevets, V., Sazonov, V., Andreeva, A.: A survey on pixel-based skin color detection techniques. In: Proc. Graphicon, pp. 85–92 (2003)
23. Bilmes, J.A.: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, International Computer Science Institute and Computer Science Division, U.C. Berkeley, Berkeley CA (1998)
24. Paalanen, P.: Bayesian classification using Gaussian mixture model and EM estimation: implementation and comparisons. Information Technology Project (2004)
25. Terzopoulos, D., Vasilescu, M.: Sampling and reconstruction with adaptive meshes. In: IEEE CVPR 1991, pp. 70–75 (1991)
26. Dadgostar, F., Barczak, A., Sarrafzadeh, A.: Massey hand gesture database, http://www.massey.ac.nz/fdadgost/xview.php?page= hand image database/default
27. Soh, J., Yoon, H.-S., Wang, M., Min, B.-W.: Locating hands in complex images using color analysis. In: IEEE International Conference on Systems, Man, and Cybernetics, pp. 2142–2146 (1997)
Design of Searchable Commemorative Coins Image Library

Radoslav Fasuga, Petr Kašpar, and Martin Surkovský

Department of Computer Science, VŠB – Technical University of Ostrava, 17. listopadu, Ostrava - Poruba, Czech Republic
{radoslav.fasuga,petr.kaspar.st4,martin.surkovsky.st1}@vsb.cz
Abstract. This paper describes the process of design and implementation of a searchable digital image library of commemorative and circular coins. The authors discuss some basic problems related to the coin recognition process based on color, texture, scale, rotation, contour and shape descriptors. The article describes how to build a digital coin library based on coin images and text descriptors. The article places emphasis on the process of preparation of indexed pre-calculated templates, which can be used for search based on averse and reverse coin images. The issues of image quality and information relevancy for a specific coin, based on an individual relevancy mask, are also discussed. The article contains information about both well-known and specialized newly designed algorithms and descriptors used for the preparation, indexing and recognition processes. Finally, we present our results and the time efficiency of the algorithms used, as well as a comparison with existing techniques and an outlook on future work. Next steps for the final noncommercial concept of the searchable Commemorative Coins Image Library are introduced.
1 Introduction
Coins are one of the most frequently used legal tender in the world. First coins were used thousands of years before Christ. Creation of a specialized content searchable library is a common task of computer science. There are databases describing coins as a table of records containing several textual attributes and classification structures. The typical representatives of the large coin libraries are Coin Archives¹ and Numis Master². We are able to store information about material (which is most usually metal), diameter, weight, and rim. We can identify currency, country and year of issue, etc. based on coin averse and reverse labels and pictograms. We can also find out details for a particular coin, such as mint, designer, mintage, etc. using additional information sources. But this theory must be tempered with reality.

¹ www.CoinArchives.com - a repository of coins previously featured in major numismatic auctions.
² www.NUMISMASTER.com - a searchable coin library and marketplace powered by Krause Publications.
2 Current Practical Solutions
“How can the human brain solve this task?” Identification begins with observing a coin’s shape, labels, and colors (corresponding to the materials used), and continues with finding additional information in printed books. Before being able to identify a coin, we need a lot of learning. Problems start when we are not able to read the labels on a coin, or the object in question has been damaged. Typical examples are coins with Arabic, Chinese, Japanese, or Cyrillic labels, old Roman coins, etc. In this case, it is difficult to define the correct orientation (the exact angle for rotation normalization) for reading and comparison with others. Sooner or later we need to turn to a coin collector for help. The main idea of this research activity is to design an open web-based system that will be able to add new coins and identify them based on averse and reverse coin images.
3 Technical Aspect of Coin Identification Problem
The problem has to do with the area of image retrieval and computer vision, fast indexing, and content description based on expert-predefined rules (masks). Speed and accuracy are important because the response to a request has to be fast and accurate; that is why it is better to take additional computing time for precise preprocessing and indexing to gain the future benefit of a faster searching process. We need to pre-calculate pictures into uniform size, scale, and orientation. The source image will be transformed into numeric vectors that can be used for a fast comparison process [1]. There already exist many methods for image retrieval oriented to fingerprints and iris pictures, identification using common image classification, MPEG-7 descriptors [2], the SIFT (Scale-Invariant Feature Transform) and SURF (Speeded-Up Robust Features) descriptor approaches [3], etc. Another possibility is the use of an eigenspace [4]. The iris and fingerprint recognition process includes anchor nodes for normalization, orientation and comparison. Rules for identifying important properties for a limited set of pictures with certain information were created based on previous research. It is difficult to define exact rules for describing coin content for coin recognition. Our current research is oriented to content descriptors (Fig. 1). Different coins from different countries have different shapes, pictograms, values, and text orientation (linear, circular). A commemorative coin is a small plastic piece of art, where designers materialize their own ideas about content and historical situation using artistic techniques. In the end, what we are speaking about is a set of pictures without exact rules for comparison [5]. The recognition process based on colors is problematic because we get different colors from different capture sources. The most useful information is received by analyzing shapes. These methods can be divided into contour-based shapes and region-based shapes. For our purpose it is better to use contour-based shapes. In our research activity we tried to use the GFD (Generic Fourier Descriptor) [6], but the obtained results were not suitable for a large scale of images.
Fig. 1. Available image descriptors are text, content based descriptor, coin semantic ontology
Then we tried to use Gabor filters [3] for texture description. Again, the results did not capture the detailed information that can be stored in a coin picture.
4 Coin Images Preprocessing
One fundamental part of the coin image preprocessing is finding the correct angle for image rotation normalization. Before completing this part, it is important to take several steps for coin image normalization. Images can come from different sources such as a camera, a photograph, a drawn image, a scan or a model. Another relevant factor is image quality, colorization and level of sharpening; here it is important to obtain a uniform format for all coin pictures [7]. Scale normalization is important. By using images with small resolution we lose information; large resolution involves additional noise in the image and requires large disk capacity, and some source images are not available in high resolution. We have carried out experiments using squares with resolutions of 100 × 100 px (pixels), 256 × 256 px, 512 × 512 px and 1000 × 1000 px. Finally, the resolution 256 × 256 px has been chosen, in view of future preprocessing techniques and the balance between information relevancy and noise level. Color normalization: colors are important information, they help us to identify the source material, but for the comparison process they have some issues. Old coins have oxidation or additional noise, and different source images have different color scales. A typical example is shown in Fig. 2. Both images represent the same coin, but the polished mirror plate in the scanner is represented with a black (dark) color, while in the camera image the same plate is flashed and has a lighter color. Finally, we can only store a small signature about colors to classify materials (yellow hues for gold, gray scale for silver and aluminum, etc.), but for further content normalization we use grayscale pictures.
Fig. 2. Different sources for the same coin - camera with flashlight (left) and scanner (right) and their thresholding matrix
Color brightness is the next step in picture normalization. Differences in brightness are caused by different lighting conditions, and it is important for us to receive normalized coins. This problem is eliminated by a histogram equalization algorithm that equalizes the values in the grayscale histogram. Polar transformation (1) is a useful technique for the rotation normalization process, Fig. 3 (left). This operation simplifies the rotation problem (a rotation of the coin becomes a shift along the angular axis) and gives a vertical interpretation of the original radial image, Fig. 3 (right).

x = r · cos φ, y = r · sin φ   (1)
Fig. 3. Polar transformation process (left), result transformation in polar coordinate system (right)
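A minimal NumPy sketch of the polar resampling of Equation (1) is given below; the output size and the nearest-neighbour sampling are illustrative choices, not the authors' implementation.

```python
import numpy as np

def to_polar(img, n_radii=128, n_angles=256):
    """Resample a square coin image into polar coordinates.
    Columns index the angle and rows the radius, so a rotation of the coin
    becomes a horizontal shift (x = r*cos(phi), y = r*sin(phi))."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_r = min(cx, cy)
    polar = np.zeros((n_radii, n_angles), dtype=img.dtype)
    for a in range(n_angles):
        phi = 2.0 * np.pi * a / n_angles
        for r in range(n_radii):
            rad = max_r * r / (n_radii - 1)
            x = int(round(cx + rad * np.cos(phi)))
            y = int(round(cy + rad * np.sin(phi)))
            polar[r, a] = img[y, x]
    return polar
```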
Edge detection: as shown in Fig. 2, the same coin from different sources has different visual representations (different brightness levels). When we use thresholding algorithms we receive different results, and for future segmentation we have two different templates to compare, Fig. 2 (third and fourth). What is needed is a uniform descriptor for both sources. Edge detection algorithms are used as a relevant descriptor. Two relevant algorithms for our purpose are the Canny edge detector [8] and adaptive thresholding, Fig. 4 (left, right). The results are similar but not the same. We are currently working with both descriptors, but only one of them will have to be chosen in the end.
Fig. 4. Edge detectors Canny algorithm (left), Canny with thresholding (center), and Adaptive thresholding (right)
Rotation standardization process. All of the previously described preprocessing operations make the rotation standardization (normalization) process easier. Big problems occur when we try to calculate the picture centroid or investigate the rotation based on edge detectors applied to pictures in Cartesian axes: the same source picture with different rotations brings different results, meaning that different edges are highlighted. This problem can be solved by using the polar transform. First we transform the grayscale picture into polar coordinates and then apply the edge detector. In this way we obtain the same working source for all pictures with different rotations [9]. Using the Canny detector (with settings: kernel size 9, theta 0.45, lower threshold 6, upper threshold 20, scale 3.0, offset 0) we find edges in polar coordinates and then threshold out the un-useful (dark) edges, Fig. 4 (center). The final result is a binary matrix where the value 1 (brightness ≥ 60) represents an edge and 0 all other (brightness < 60) areas, Fig. 5 (bottom). As a next step, we calculate a histogram of the count of the value 1 in all columns, Fig. 5 (top). Based on this histogram we find the N maxima, and from them we select the main maximum as the one with the largest set of nearby maxima, Fig. 5 (top).

Fig. 5. Finding maxima in the histogram (top); calculating the main maximum column R and the α distance for rotation normalization (bottom)
Based on this main maximum R and the center of the image S we calculate the α distance, which is equal to the rotation angle in Cartesian coordinates. We could move the R column into the central S point of the image x-axis and recalculate it back from image polar coordinates into Cartesian coordinates; however, some relevant information about edges would be lost. A better solution is to use the source grayscale image, apply the α rotation to it, and then use the edge detector to receive a uniform resulting template, independent of the source rotation, for the indexing and searching process.
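As a sketch of this step, the following function counts edge pixels per angular column of the thresholded polar edge image, selects the dominant column R as the candidate whose neighbourhood contains the most of the N largest maxima, and converts it to the rotation angle α. The values of N and the neighbourhood window are assumptions for illustration.

```python
import numpy as np

def rotation_angle(polar_edges, n_max=10, window=15):
    """polar_edges: binary (n_radii, n_angles) matrix from the polar edge image.
    Returns the normalization angle alpha in degrees."""
    hist = polar_edges.sum(axis=0)                  # edge count per angle column
    n_angles = hist.shape[0]
    maxima = np.argsort(hist)[-n_max:]              # columns of the N largest counts
    # main maximum: the candidate with the most other maxima nearby (circular distance)
    best, best_support = maxima[-1], -1
    for c in maxima:
        d = np.minimum(np.abs(maxima - c), n_angles - np.abs(maxima - c))
        support = np.sum(d <= window)
        if support > best_support:
            best, best_support = c, support
    return 360.0 * best / n_angles                  # column index -> rotation angle alpha
```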
5 Storing Image Templates into Database
Rotation normalization paradox: a typical situation is when one picture with different rotations is given different angles for normalization. This typically occurs in pictures with symmetric portraits and dominant elements like a cross, triangle, etc. In this case, it is important to store all relevant rotations for one template. This situation can be solved by using the source (indexed grayscale) images and rotating them in 3-degree steps. Finally, we obtain 120 rotations for one image. Then we perform the image preprocessing described in Section 4 for all of those rotations and receive 120 vectors. Based on clustering algorithms we reduce the number of relevant vectors for the source image. The result is in the range of 1 to 5 rotations for each source image.
Fig. 6. Image segmentation process based on squared 4 × 4 matrix (left), circular symmetric matrix (middle) 63 segments, and circular asymmetric matrix (right) 39 segments
Summed segment information (the number of edges, i.e., values of 1 in the matrix) can be stored as one single value. Based on this value it is possible to eliminate a large set of templates with a different edge cardinality; however, this single value alone is not sufficient for the searching process. A matrix with resolution 4 × 4 (Fig. 6) or 8 × 8 is more accurate; based on them we obtain vectors with 16 or 64 values. The advantage of square matrices is speed. However, our coins are commonly presented in the form of circles (non-circular coins are imprinted into circular borders). For a circular matrix, two masks need to be prepared (Fig. 6) that divide a circle into segments. A symmetric matrix (rings have the same diameter and all the ring parts from the lower level, i.e., from the center, are divided into two symmetric parts at the higher level) with 63 values can be used, but more relevant is an asymmetric matrix (rings have different diameters based on experiments focused on relevant information positions) with 39 segments [10]. The library is based on the DBMS (database management system) MySQL, optimized for data retrieval time efficiency. The database contains the catalog attributes of the coins (country of origin, material, etc.) as well as the computed vectors and masks for the purpose of search. The obtained vectors and relevant (= different) rotations can be stored in the database using this template: Coin Template(coin id, rotation angle, segment 1, ..., segment n). A 256 × 256 px image with a 4 × 4 template contains 16 segments, with possible values of edge pixels in the interval ⟨0, 256⟩. See Table 1.

Table 1. Example of pre-calculated template vectors

Coin id        Rot   Vector values
czk 47 1995    30°   12,23,54,57,23,67,87,34,23,4,24,76,43,23
czk 47 1996    66°   10,27,60,44,26,62,33,32,24,8,19,87,22,18
czk 47 1996    81°   16,22,73,52,21,64,81,56,25,5,24,75,42,12
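A sketch of building such a stored template vector from the square grid mask is shown below: the binary edge image is divided into a grid and the number of edge pixels (value 1) in each segment becomes one vector component. The grid size is an illustrative parameter; the circular masks would only change how the segments are delineated.

```python
import numpy as np

def template_vector(edge_img, grid=4):
    """edge_img: binary (H, W) edge matrix of a normalized coin image.
    Returns a vector of edge-pixel counts, one value per grid segment."""
    H, W = edge_img.shape
    sh, sw = H // grid, W // grid
    vec = []
    for i in range(grid):
        for j in range(grid):
            seg = edge_img[i*sh:(i+1)*sh, j*sw:(j+1)*sw]
            vec.append(int(seg.sum()))      # number of edge pixels in the segment
    return np.array(vec)
```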
The question is why we store all (more than one) rotations for one template. For our approach it is better to pre-calculate those templates for relevant coins: we store more data, but the searching process for a requested coin is faster.
6 Coin Image Significance Mask
A typical problem in a large set of coins is similarity. For example, all circular coins from the same decade with the same denomination have the same design, and the only difference between them is the year of coinage. Based on this, we deduce that some segments in the templates are more important than others (Fig. 7). Primarily we can simply use a binary mask, where the value 1 represents higher significance than the value 0. Detailed masks can build a significance hierarchy, for example: 4 primary (central) motif, 3 year of issue, 2 secondary motif, 1 additional text, 0 other irrelevant parts (Table 2). In practice a human numismatic expert marks the significant segments for a particular coin. This can be done by drawing a polygon around the important area. Subsequently, the polygon is projected into the segmentation mask and the relevant (drawn-over) segments are marked. Storing a specific template and a mask of relevancy increases the complexity of the searching process, but on the other hand it brings better accuracy in the detailed searching process.
Fig. 7. Similar commemorative coins issued by the Czech National Bank in different years

Table 2. Example of a template vector with simple and detailed masks

Coin id        Rot   Vector values
czk 47 1995    30°   12,23,54,57,23,67,87,34,23,4,24,76,43,23
Simple mask          0,0,0,0,1,1,0,0,0,0,1,1,0,0
Detail mask          0,0,1,2,3,3,4,4,0,0,3,3,1,0

7 Comparing a Given Coin Image with Templates and Masks
The user of our system uploads a coin image (at most two images, averse and reverse), selects the relevant image areas, which cuts out all unnecessary pixels, and normalizes it (Section 4). Finally we receive one single template vector for the uploaded image. Now we can compare our single image vector S = (seg_1, ..., seg_n) with the template vectors T = (tseg_1, ..., tseg_n) stored and indexed in the database. We compare the segments by using two algorithms based on the Minkowski distance, namely the City Block distance (2) and the Euclidean distance (3). Results with the lowest distance are the most similar.

CBD = |seg_1 - tseg_1| + |seg_2 - tseg_2| + \ldots + |seg_n - tseg_n|   (2)

Euclid = \sqrt{(seg_1 - tseg_1)^2 + (seg_2 - tseg_2)^2 + \ldots + (seg_n - tseg_n)^2}   (3)
Template masks with important areas can be used in two ways. Two-step mask evaluation: in the first step we search without the template mask and receive a reduced set of relevant coins; in the second step we compare just the significant segments and based on them refine the result.
Single-step mask evaluation: we need a new mask vector M = (m_1, ..., m_n) and an increased mask M+ = (msk_1, ..., msk_n), where msk_k = m_k + 1 for k = 1, ..., n (this eliminates zero values in the mask). There are several possible evaluations (here based on the City Block distance). Decreasing important information for reducing differences between masked values: version one with division (4) and version two with the root operation (5) (which is slower); version (6) increases the mask importance against the other segments.

Single mask dec 1 = \sum_{k=1}^{n} \frac{|seg_k - tseg_k|}{msk_k}   (4)

Single mask dec 2 = \sum_{k=1}^{n} \sqrt[msk_k]{|seg_k - tseg_k|}   (5)

Single mask inc = \sum_{k=1}^{n} |seg_k - tseg_k| \cdot msk_k   (6)
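The distance measures of Equations (2)–(6) in a short NumPy sketch; S and T are the query and stored template vectors and M the significance mask. The reading of (5) as the msk_k-th root of each difference follows the reconstructed formula above and should be treated as an assumption.

```python
import numpy as np

def city_block(S, T):
    return np.sum(np.abs(S - T))                    # Eq. (2)

def euclid(S, T):
    return np.sqrt(np.sum((S - T) ** 2))            # Eq. (3)

def masked_distances(S, T, M):
    msk = M + 1                                     # increased mask, no zero weights
    diff = np.abs(S - T).astype(float)
    dec1 = np.sum(diff / msk)                       # Eq. (4): division decreases differences
    dec2 = np.sum(diff ** (1.0 / msk))              # Eq. (5): root operation (slower)
    inc = np.sum(diff * msk)                        # Eq. (6): increases mask importance
    return dec1, dec2, inc
```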
8 Experiment Results
Results of the searching process with rotation normalization are shown in Fig. 8. The first image is the requested source; the next four images represent the most relevant results (method without masking). The indexed database contains 85,000 coins. These results of rotation normalization are for the same type of coin captured from different sources. There are about five stored templates for all rotations of a coin, and after getting new images of the coin from a new source, new templates with another rotation angle may be stored. One of the main goals of our system's future development is to decrease the number of stored templates for the same type of coin.
Fig. 8. Requested coin (left one) with results
9 Conclusion
In this article we have described our research activity in building a library of commemorative and circular coins. The proposed solution is focused on pictures which represent coins, but this method can be generalized to other content as well. Currently, we are working with a large set of coins. We are preparing massive parallel preprocessing using a multi-core architecture with different masks and vector lengths. Our future plan is to increase the intelligence of our system by using ontologies and faster searchable structures. We would like to cooperate with specialists in the areas of computer vision and image retrieval, or apply our solution to other promising and practical areas of human interest.
References

1. Aslandogan, Y.A., Yu, C.T.: Techniques and systems for image and video retrieval. IEEE Trans. on Knowledge and Data Engineering 11, 56–63 (1999)
2. Yong, M.R., Munchurl, K., Ho, K.K., Manjunath, B.S., Jinwood, K.: MPEG-7 homogeneous texture descriptor. ETRI 23 (2001)
3. Kampel, M., Huber-Mork, R., Zaharieva, M.: Image-based retrieval and identification of ancient coins. IEEE Intelligent Systems 24, 26–34 (2009)
4. Huber, R., Ramoser, H., Mayer, K., Penz, H., Rubik, M.: Classification of coins using an eigenspace approach. Pattern Recogn. Lett. 26(1), 61–75 (2005)
5. Feder, J.: Towards image content-based retrieval for the world-wide web. Advanced Imaging 11, 26–29 (1996)
6. Zhang, D., Lu, G.: Generic Fourier Descriptor for Shape-based Image Retrieval. Monash University, Faculty of Information Technology (2002)
7. Žára, J., Beneš, B., Sochor, J., Felkel, P.: Modern computer graphics, 2nd edn. Computer Press, Brno (2004)
8. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Analysis and Machine Intelligence 8, 679–714 (1986)
9. Zhang, D.: Image Retrieval Based on Shape. Monash University, Faculty of Information Technology, Australia (2002)
10. Berretti, S., Bimbo, A.D., Pala, P.: Retrieval by shape similarity with perceptual distance and effective indexing. IEEE Transactions on Multimedia 2, 225–239 (2000)
Visual Intention Detection for Wheelchair Motion

T. Luhandjula 1,2, E. Monacelli 3, Y. Hamam 1, B.J. van Wyk 1, and Q. Williams 2

1 French South African Technical Institute in Electronics at the Tshwane University of Technology, Pretoria, RSA
2 Meraka Institute at the Council for Scientific and Industrial Research, Pretoria, RSA
3 LISV Laboratory – Université de Versailles St-Quentin-en-Yvelines, France
[email protected],
[email protected],
[email protected]
Abstract. This paper describes a visual interface that recognizes the command request of a person by inferring the intention to travel in a desired direction at a certain speed from the person’s head movements. A rotation and a vertical motion indicate the intent to change direction and speed respectively. The context for which this solution is intended is that of wheelchair bound individuals. This paper describes work in progress that provides a proof of concept tested on static images. Results show that the symmetry property of the head can be used to detect a change in its position and can therefore serve as a visual intent indicator. The solution described in this paper, focusing on the specific task of head pose estimation, intends to provide a contribution to the realisation of an enabled environment allowing people with severe disabilities and the elderly to be more independent and active in society. Keywords: Head pose estimation, intention, intention detection, visual interface, enabled environment, disabilities, intention curve.
1 Introduction

One of the challenges facing the task of realizing an enabled environment, where people with disabilities and the aged are independent and can therefore be active by contributing in society, is to develop systems that can assist them in performing the tasks they wish to carry out without other people's assistance. Good performance in a team environment is heavily conditioned by the awareness of people's intention within the team [2], and therefore a human-machine team, where the machine has a support role, requires that the intention of the user be well understood by the machine. The need for this intention awareness capability requires that much attention be given to this problem as an important aspect of the more general area of the enabled environment. In addition to contributing to the task of robust face recognition for multiview analysis, which is still a difficult task under pose variation [3], pose estimation can be considered as a sub-problem of the general area of intention detection because it is useful for the inference of nonverbal signals related to attention and intention. Estimating the head pose is crucial since it usually coincides with the gaze direction. Furthermore, head pose estimation is also essential for analyzing complex meaningful gestures [4].
Existing head pose estimation methods can be categorized into appearance-based and model-based methods. Appearance-based techniques use the whole sub-image containing the face, while model-based approaches use a geometric model [4]. Another set of approaches includes the application of eigenspace techniques to directly recognize the pose of a specific user [5]. According to [6], the principal advantage of global approaches is that only the face needs to be located and that no facial landmarks or face model are required, making them appropriate for very low resolution images of the face. This is useful for video surveillance, intelligent environments and human interaction modelling. Template matching is a popular method to estimate head pose, where the best template can be found via a nearest-neighbour algorithm and where the pose associated with this template is selected as the best pose. Advanced template matching can be performed using Gabor wavelets and Principal Component Analysis (PCA) or Support Vector Machines, but these approaches tend to be sensitive to alignment and are dependent on the identity of the person [6]. This paper provides an alternative solution for a visual system that infers people's intentions through head pose estimation. The context for which this solution is intended is that of wheelchair-bound individuals whose intention of interest is the direction they wish the wheelchair to follow and its speed. The proposed head pose estimation scheme can be classified as belonging to the appearance-based category, making use of the symmetry property of the head as a whole, and therefore has the advantages of accommodating low resolution images and not being dependent on the identity of the person. The aim of this study is to make a contribution to the general problem of intention detection applied to assistive living for the support of the elderly and people with disabilities.
2 Methods

As mentioned earlier, the type of data used is visual: a sequence of images is captured from a CCD camera and no visual markers are added to the images. The estimated head pose over time is used as the intention indicator. The motivation behind this choice is the availability and flexibility of the head for a wide range of disabilities. The head performs two types of motions: rotation to indicate a direction intention (centre, left and right) and a vertical motion to indicate speed variation (constant, decrease and increase). This work assumes that the head is already detected and does not deal with the pre-processing step of detecting and tracking the head. The approach in this paper extracts high level information (referred to in this work as the intention curve) from the acquired images, resulting in intent recognition. A symmetry-based approach [1] is used to extract symmetry curves associated with frontal views of the head. Note that in this paper, the use of the symmetry-based approach differs from that of [1] in that the latter uses the method for face detection while the same method is used in this work for head pose estimation. The assumption is that different positions of the head give different symmetry curves. The classification of the position of the head in different positions is done by using the centres of gravity (COG) of the symmetry curves and the y-intercepts of the lines approximating the symmetry curves, and it is implemented using two different methods, namely a 'difference of means' approach and a 'mean and standard deviation of a Gaussian distribution' approach. Intention detection in the context of this work entails the
classification of the sequence of these COGs or y-intercepts representing intention curves using the two classification methods mentioned above.

2.1 Symmetry-Based Approach

The underlying assumption is that human heads viewed from the front are symmetric, and when moved from their initial (centred) position the symmetry they display breaks down, giving the indication of a motion from the initial centred position to a new position (Right, Left, Up or Down). The indication of a new direction to the right or to the left is given by the head moving to the right or the left respectively, and the indication of an increase or decrease in speed is given by the motion of the head going down or up respectively. The symmetry curve, based on the work of [1], is given by

C(x) = \sum_{\omega=1}^{k} \sum_{y=1}^{Y} | I(x - \omega, y) - I(x + \omega, y) |, \quad \forall x \in [k+1, X-k]   (1)
The symmetry-value C(x) of each pixel-column in the image is evaluated by taking the sum of the differences of two pixels at a variable distance ω = [1, k] from it on both sides, making the pixel-column the centre of symmetry. This process is done for each row, and the resulting symmetry-value is the summation of these differences. Note that Equation 1 evaluates the symmetry horizontally (appropriate for the direction recognition task, because the symmetry curves display considerable changes when the head rotates). However, for the vertical motion, the symmetry is calculated vertically as shown in Equation 2. Note that, as shown in Fig. 1, the symmetry displayed by the face is better horizontally than vertically, where the symmetry curves are less descriptive. However, the position of the centre of gravity of the curves still gives sufficient information for classification.

C(y) = \sum_{\omega=1}^{k} \sum_{x=1}^{X} | I(x, y - \omega) - I(x, y + \omega) |, \quad \forall y \in [k+1, Y-k]   (2)
Fig. 1 shows the symmetry curves calculated for different positions of a head in rotation and in vertical motion.
Fig. 1. Symmetry curves of five different positions of the head in Rotation and five different positions of the head in Vertical motion. The vertical line indicates the position of the COG.
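A direct NumPy transcription of Equations (1) and (2) is shown below; the window size k and the image orientation (rows indexed by y, columns by x) are assumptions made for the example.

```python
import numpy as np

def horizontal_symmetry_curve(img, k=20):
    """C(x) of Eq. (1): img is a grayscale array of shape (Y, X)."""
    img = img.astype(float)
    Y, X = img.shape
    C = np.zeros(X)
    for x in range(k, X - k):               # valid columns [k+1, X-k] in 1-based terms
        for w in range(1, k + 1):
            C[x] += np.sum(np.abs(img[:, x - w] - img[:, x + w]))
    return C

def vertical_symmetry_curve(img, k=20):
    """C(y) of Eq. (2): the same symmetry measure evaluated across rows."""
    return horizontal_symmetry_curve(img.T, k)
```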
2.2 Classification for Individual Positions of the Head

Method 1: Centre of Gravity (COG) of the Symmetry Curve
The context in which the COG is to be understood is that of a point in the curve at which all the values of the curve are considered to be centred. The symmetry curves shown in Fig. 1 all differ and will therefore give different COGs. It is calculated as

C = \frac{x_1 f(x_1) + x_2 f(x_2) + \ldots + x_n f(x_n)}{f(x_1) + f(x_2) + \ldots + f(x_n)}   (3)
where the symmetry curve is defined by the function f : x → f(x), with f(x) given by the expressions in Equations 1 and 2 for direction recognition and speed variation detection, respectively. The position of the symmetry curve's COG gives an indication of the position of the head, and its position on the symmetry curve is shown by the vertical lines in Fig. 1 for different positions of the head for both motions. The approaches used to classify these different positions into three classes (centre, right and left for direction classification and centre, up and down for speed variation detection) are given below:

Difference of means: The mean of the COGs is calculated for each training set. The difference between the COG to be classified and the mean for each class is calculated, and the class corresponding to the mean where the difference is the smallest is chosen.

Mean and standard deviation of a Gaussian distribution: The mean and the standard deviation of the COGs are calculated for each training set. They are then associated with a Gaussian distribution along with the given COG to be classified. The resulting highest probability among the three cases corresponds to the class the given COG belongs to.
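A sketch of the COG of Equation (3) together with the two classifiers just described is given below; the class statistics are assumed to have been computed from training COGs, and the dictionary-based interface is an illustrative choice.

```python
import numpy as np

def centre_of_gravity(curve):
    """Eq. (3): weighted mean position of the symmetry curve."""
    x = np.arange(len(curve))
    return np.sum(x * curve) / np.sum(curve)

def classify_by_mean(value, class_means):
    """Pick the class whose training mean is closest to the given COG."""
    diffs = {c: abs(value - m) for c, m in class_means.items()}
    return min(diffs, key=diffs.get)

def classify_by_gaussian(value, class_stats):
    """class_stats: {class: (mean, std)}; pick the class with the highest
    Gaussian likelihood of the given COG."""
    def pdf(v, mu, sigma):
        return np.exp(-(v - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    probs = {c: pdf(value, mu, sd) for c, (mu, sd) in class_stats.items()}
    return max(probs, key=probs.get)
```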
Method 2: Linear Regression on the Symmetry Curve

Given the left part of Fig. 1, lines can be found that approximate the symmetry curves and that differ from each other in the five given positions. These lines can therefore be used to distinguish between the different positions of the head for direction recognition. However, this method cannot be used for less descriptive symmetry curves such as those shown in the right part of Fig. 1 for speed variation recognition. Given a curve y = f(x), the goal of linear regression is to find the line that best predicts y from x, where x is the independent variable and y the dependent one. Linear regression does this by finding the line that minimizes the sum of the squares of the vertical distances of the points from the line. Let f : X → Y = f(x) be a function describing a symmetry curve; linear regression is a form of regression analysis in which the relationship between y and x is modelled by a least squares function called the linear regression equation:

Y = X\beta + \varepsilon   (4)

where Y = [y_1 \; y_2 \; \ldots \; y_N]^T and X = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_N \end{bmatrix}^T.
The least squares estimate is given by

\beta = (X' X)^{-1} X' Y   (5)
where β gives the values of the y-intercept and the angle of the line with respect to the x-axis. By empirical study it has been established that the y-intercepts are more discriminative than the angles. Fig. 2 shows the resulting lines.
Fig. 2. Lines approximating the symmetry curve for head (in rotation) images
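A short least-squares sketch of Equations (4)–(5): the y-intercept of the line fitted to a symmetry curve is the first component of β. np.linalg.lstsq is used here for numerical stability rather than the explicit normal-equation form; this is an illustrative choice.

```python
import numpy as np

def line_y_intercept(curve):
    """Fit y = b0 + b1*x to a symmetry curve and return the y-intercept b0."""
    x = np.arange(len(curve), dtype=float)
    X = np.column_stack([np.ones_like(x), x])   # design matrix of Eq. (4)
    beta, *_ = np.linalg.lstsq(X, curve, rcond=None)
    return beta[0]                              # y-intercept; beta[1] is the slope
```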
2.3 Intention Detection: Classification of Intention Curves
The task of intent recognition in the context of this work involves the detection of the direction the subject intends to take and the speed variation he wishes to perform, by analysing the motion of the head. The problem of monitoring the time sequence of individual positions of the head is addressed. The two motions of the head that are indicative of an intention are the rotation (for direction) and vertical motion (for speed variation).

Intention Detection for Direction

This problem is addressed by looking at the sequence of the symmetry curves' COGs and the sequence of y-intercepts obtained from the linear regression on the symmetry curves. Let E be the set {I(i) : I(i) is the ith frame in a sequence of N = 15 frames}:
Centre of gravity sequence: For each image frame in E the symmetry curve and the COG of the symmetry curve are obtained using Equations 1 and 3 respectively. Fig. 3 (left part) shows the resulting intention curves for the three different types of motion. It shows that these three types of motion each exhibit a different pattern and can therefore be classified.

Y-intercept sequence: For each image frame in E the symmetry curve is obtained using Equation 1 and the line approximating the symmetry curve is obtained using Equation 5, where β gives the y-intercept. The y-intercepts in a time sequence, as shown in Fig. 3 (right part), form the intention curves used for intent classification and
each exhibit a different pattern. Note that since the y-intercepts and the COG values increase in opposite directions given a particular motion, the intention curves made of y-intercepts and those made of COG exhibit opposite patterns and can therefore be classified.
Fig. 3. Intention curves obtained from COGs of symmetry curves and y-intercepts of lines approximating symmetry curves for heads in Rotation
Intention Detection for Speed Variation

For speed variation recognition a centred head indicates constant speed, a decrease is detected when it is going up, and an increase is recognized when the head is going down. Let E be the set {I(i) : I(i) is the ith frame in a sequence of N = 15 frames}. For each image frame in E the symmetry curve and its COG are obtained using Equations 2 and 3 respectively. Fig. 4 shows how these different scenarios each exhibit a different pattern and can therefore be classified.
Fig. 4. Intention curves obtained from COGs of symmetry curves for heads in Vertical motion
Classification: A simple decision rule can be used for classification. Let C be the intention curve of the COG sequence or the sequence of y-intercepts:
Table 1. Decision rule for classification of intention curves for intention detection

Initialisation
A = 0; B = 0 (A accumulates decreases of C and B accumulates increases)
∀ i ∈ {x : x ≥ 1 and x ≤ length(C) − 1}, D = C(i) − C(i+1)
  If D > 0: A = A + |C(i) − C(i+1)| (noting a decrease in the value of C, by adding the extent of the decrease to A)
  If D < 0: B = B + |C(i) − C(i+1)| (noting an increase in the value of C, by adding the extent of the increase to B)

Classification
Let μ_class, σ_class be the statistics (means and standard deviations) of the difference between A and B in a training set for each class: class = {Centre, Right/Up, Left/Down}; and n = {1, 2, 3}.

Difference of means
d_n = |(A − B) − μ_class|, d = min([d_1 d_2 d_3])
  If (A > B and d = d_1) or (A < B and d = d_1): Intention = Going Straight
  If A > B and d = d_2: Intention = Going Right
  If A < B and d = d_3: Intention = Going Left

Statistics (mean and standard deviation) in Gaussian
Calculate P_n = \frac{1}{\sqrt{2\pi}\,\sigma_{class}} \exp\left\{-\frac{((A - B) - \mu_{class})^2}{2\sigma_{class}^2}\right\}
P = max([P_1 P_2 P_3])
  If (A > B and P = P_1) or (A < B and P = P_1): Intention = Going Straight
  If A > B and P = P_2: Intention = Going Right
  If A < B and P = P_3: Intention = Going Left
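A runnable sketch of the difference-of-means branch of Table 1 is given below, assuming the per-class training means of A − B are available; the dictionary keys and the fallback behaviour when the directional check disagrees are illustrative choices.

```python
def classify_intention(C, class_means):
    """C: intention curve (sequence of COGs or y-intercepts).
    class_means: {'Centre': mu1, 'Right': mu2, 'Left': mu3} -- training means of A - B."""
    A = B = 0.0
    for i in range(len(C) - 1):
        d = C[i] - C[i + 1]
        if d > 0:
            A += abs(d)          # accumulated decrease of the curve
        elif d < 0:
            B += abs(d)          # accumulated increase of the curve
    d_centre = abs((A - B) - class_means['Centre'])
    d_right = abs((A - B) - class_means['Right'])
    d_left = abs((A - B) - class_means['Left'])
    d_min = min(d_centre, d_right, d_left)
    if d_min == d_centre:
        return 'Going Straight'
    if A > B and d_min == d_right:
        return 'Going Right'
    if A < B and d_min == d_left:
        return 'Going Left'
    return 'Going Straight'      # fallback when the directional check disagrees
```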
From Fig. 3 and given the initialisation procedure in Table 1, it can be visually established that in the case of a centred face, A and B should be approximately the same; in the case of a 'going left/down' scenario, B should be much higher than A; and in the case of a 'going right/up' scenario, A should be much higher than B. The situation is the other way around for the y-intercept.
3 Results
The experimental results have been obtained by collecting video sequences of five different subjects with three sequences each, for both types of motion. For classification of individual positions, 100 frames are used for training and 325 testing frames are used for validation. For intention curve classification, the training sets are made of 150 examples each obtained using 15 frames and 350 intention curves are used for validation. For each subject, one sequence is used for training and the two others for validation. Two sets of results are given below: the classification rates for head positions and for intention curves. For direction classification, it can be observed that COGs are better indicators than y-intercepts, and that the Difference of means using COG of Symmetry curves yields the best overall performance. In the case of y-intercepts, though the performance is satisfactory for each class, there is a performance disproportion between the
Table 2. Results for direction classification (individual positions and intention curves)

Method columns: (I) Difference of means using COG of symmetry curve; (II) Statistics of symmetry curve's COG in Gaussian distribution; (III) Difference of means for y-intercepts (of lines approximating a symmetry curve); (IV) Statistics of y-intercepts (of lines approximating a symmetry curve) in Gaussian distribution.

Individual positions          (I)        (II)       (III)      (IV)
  Centre:                     98.4615%   98.4615%   96%        94.1538%
  Right:                      88%        90.7692%   81.8462%   81.8462%
  Left:                       100%       90.7692%   99.3846%   100%
  Overall performance:        95.4872%   93.3333%   92.4103%   92%

Intention curves              (I)        (II)       (III)      (IV)
  Centre:                     100%       98.5714%   100%       94.5714%
  Right:                      84%        98.2857%   84.8571%   88.8571%
  Left:                       97.1429%   97.1429%   97.1429%   100%
  Overall performance:        93.7143%   98%        94%        94.4762%
Table 3. Results for speed variation classification (individual positions and intention curves)

Method columns: (I) Difference of means using COG of symmetry curve; (II) Statistics of COG of symmetry curve in Gaussian distribution.

Individual positions classification    (I)        (II)
  Centre:                              92.6154%   95.3846%
  Up:                                  98.1538%   96.9231%
  Down:                                81.8462%   83.3846%
  Overall performance:                 90.8718%   91.8974%

Intention curves classification        (I)        (II)
  Centre:                              100%       96.2857%
  Up:                                  94.2857%   94.4%
  Down:                                90%        90%
  Overall performance:                 94.7619%   93.5619%
left and the right positions, yielding 99.3846% (using the Difference of means for y-intercepts method) and 100% (using the Statistics of y-intercepts in Gaussian distribution approach) for the Left position, and 81.8462% for the Right position using both methods (refer to Table 2). This is due to the fact that, although the symmetry is broken for both Left and Right as Figure 1 (left side) and Figure 2 show, the y-intercept of the lines approximating the symmetry curve for the Right position is closer to that of the lines approximating the symmetry curve for the centred position than to that of the Left position (refer to Figure 2). It is not the symmetry curves that are classified but rather the y-intercepts of the lines approximating the symmetry curves. Another observation worth mentioning is the performance disproportion between Left and Right for intention curve classification (refer to Table 2). This disproportion
is exhibited for both difference of means approaches, with COG and y-intercept (84% and 84.8571%, respectively), and is less pronounced when using the mean and standard deviation of the Gaussian distribution approach (98.2857% and 88.8571%). This can be explained by the added information provided in the second set of methods by the standard deviation of the COGs and y-intercepts. In some cases this added information is redundant, for example in position classification for both direction and speed variation classification (refer to Tables 2 and 3). This is because the standard deviations associated with the classes are close to each other. But in this particular case it adds relevant information for classification because the difference between the standard deviations associated with each class is significant. Table 3 also shows that the position classification for speed variation exhibits the worst results for the ‘Down’ class because, as shown in Fig. 1, the COGs of symmetry curves associated with ‘Down’ faces are closer to those associated with centred faces.
4 Conclusion
This paper describes a visual scheme that detects a requested command by inferring a subject's intention. The context for which this solution is developed is that of wheelchair-bound individuals whose intention of interest is the direction of the motion they wish the wheelchair to follow as well as its speed variation. The motion of the head, viewed from the front, is used as an indicator for the intent recognition task. The main contribution of this work is the use of a symmetry-based approach combined with different classification schemes to recognize head motions (rotation and vertical motion) indicative of the intent of the subject. It has been shown that the symmetry property of a person's face can be used to detect any change in its position and can therefore be a visual intent indicator. This paper is a work in progress that provides a proof of concept tested on static images of heads segmented manually, where this symmetry property yields very good results. The next crucial step is to implement this solution in real time and with quantified direction and speed variation, after which it will be tested on wheelchair-bound individuals. Another application on which the proposed solution will be used is that of an input mechanism as an alternative to a pointing device for people with disabilities for whom the motion of the head is the only or preferred option. The comparison of this head pose estimation solution with existing ones found in the literature is part of our ongoing investigation. Given the quality of these results, this solution shows promise in making a contribution to the general problem of intent recognition applied to assistive living (for support of the elderly and people with disabilities) as well as the specific task of head pose estimation.
References 1. Luhandjula, K.T., van Wyk, B.J., Kith, K., van Wyk, M.A.: Eye detection for fatigue assessment. In: Proceeding of the Seventeenth International Symposium of the Pattern Recognition Society of South Africa, Parys, South Africa (2006) 2. Kanno, T., Nakata, K., Furuta, K.: Method for team intention inference. Human-Computer Studies 58, 393–413 (2003)
3. Ma, B., Zhang, W., Shan, S., Chen, X., Gao, W.: Robust Head Pose Estimation using LGBP. In: Proceedings of the 18th International Conference on Pattern Recognition, vol. 2, pp. 512–515 (2006) 4. Vatahska, T., Bennewitz, M., Behnke, S.: Feature-based Head Pose Estimation from Images. In: Proceedings of the IEEE-RAS 7th International Conference on Humanoid Robots (Humanoids), Pittsburgh, USA (2007) 5. Fitzpatrick, P.: Head pose estimation without manual initialization. Term Paper for MIT Course. MIT, Cambridge (2001) 6. Gourier, N., Maisonnasse, J., Hall, D., Crowley, J.L.: Head Pose Estimation on Low Resolution Images. Perception, recognition and integration for interactive environments (2006) 7. Tu, J., Fu, Y., Hu, Y., Huang, T.: Evaluation of Head Pose Estimation for Studio Data. In: Stiefelhagen, R., Garofolo, J.S. (eds.) CLEAR 2006. LNCS, vol. 4122, pp. 281–290. Springer, Heidelberg (2007) 8. Christensen, H.V., Garcia, J.C.: Infrared Non-Contact Head Sensor, for Control of Wheelchair Movements. In: Pruski, A., Knops, H. (eds.) Assistive Technology: From Virtuality to Reality, pp. 336–340. IOS Press, Amsterdam (2003) 9. Kuno, Y., Shimada, N., Shirai, Y.: Look where you are going [robotic wheelchair]. IEEE Robotics & Automation Magazine 10(1), 26–34 (2003) 10. Matsumoto, Y., Ino, T., Ogasawara, T.: Development of intelligent wheelchair system with face and gaze based interface. In: Proceedings of the 10th IEEE International Workshop on Robot and Human Interactive Communication, Paris, France, pp. 262–267 (2001)
An Evaluation of Affine Invariant-Based Classification for Image Matching Daniel Fleck and Zoran Duric Department of Computer Science, George Mason University, Fairfax VA 22030, USA {dfleck,zduric}@cs.gmu.edu
Abstract. This paper presents a detailed evaluation of a new approach that uses affine invariants for wide baseline image matching. Previously published work presented a new approach to classify tentative feature matches as inliers or outliers during wide baseline image matching. After typical feature matching algorithms are run and tentative matches are created, the approach is used to classify matches as inliers or outliers to a transformation model. The approach uses the affine invariant property that ratios of areas of shapes are constant under an affine transformation. Thus, by randomly sampling corresponding shapes in the image pair a histogram of ratios of areas can be generated. The matches that contribute to the maximum histogram value are then candidate inliers. This paper evaluates the robustness of the approach under varying degrees of incorrect matches, localization error and perspective rotation often encountered during wide baseline matching. The evaluation shows the affine invariant approach provides similar accuracy as RANSAC under a wide range of conditions while maintaining an order of magnitude increase in efficiency.
1 Introduction
Image matching is a fundamental problem in computer vision. The goal of image matching is to determine if all or part of one image matches all or part of another image. In many applications, after determining a match is present, a registration step is used to align the images so the matching parts overlap precisely. Matching and registration are used in many computer vision applications, including location recognition, facial recognition, object recognition, motion understanding, and change detection, among others. A recent review of image matching algorithms was conducted by Mikolajczyk and Schmid [1]. The reviewed algorithms for matching image pairs typically have four phases. In the first phase features are detected in both images. Usually, features correspond to locations in the image that are invariant under some image transformations. This means the features will have a similar appearance in different images. In the second phase feature descriptors are computed as the “signatures” of the features. In the third phase the features of the first image are compared to the features of the second image. The comparison is performed using a suitable distance measure on the descriptors, and the tentative matches are
ordered by similarity. The matches generated in the third phase are considered tentative due to the high percentage of incorrect matches produced at this stage. Thus, a fourth phase is required to remove incorrect matches. The fourth phase in a typical algorithm attempts to fit an affine or perspective transformation model to the top tentative matches. The model is then used to classify each tentative match as an inlier or outlier. Using only the inliers a simple least-squares fit can be used to determine a very accurate transformation model between the images. Our previous work [2] described a novel approach to the fourth phase using affine invariants to classify tentative matches as either inliers or outliers. This work demonstrated that the affine invariant approach compares favorably to RANSAC [3] in accuracy under a small sample set while providing an order of magnitude increase in efficiency. In this work we build on the previous work by evaluating the algorithm and RANSAC under various amounts of localization error, incorrect matches, and perspective transformations. The remainder of this paper is organized as follows. Section 2 presents previous work in matching and registration and describes the affine invariant approach. Section 3 describes the tests performed and presents the results. Section 4 concludes the paper.
2 2.1
Related Research Matching and Registration
In its simplest form matching computes a distance function to find the feature descriptors in one image that are the minimum distance from a feature descriptor in another image and labels these as a tentative match. Matching performance is very dependent on the dimensionality and discriminative power of the dimensions in the feature descriptor. Computing the pairwise distance between feature descriptors is frequently done using simple Euclidean distance, cross-correlation, Mahalanobis distance, or sum of squared differences between the vectors [4]. A subset of these matches (typically the strongest above some threshold) is chosen for a second matching step. The second matching step, “model fitting”, evaluates geometric constraints of groups of tentative matches to determine if there is an image transformation model that can predict many of the matches. If such a model (or models) can be found, it is used to generate a transform equation. The transformation equation describes how one image can be registered to the other image (and vice versa). Finding this transformation between images is the final goal of the matching step. Initial model fitting approaches attempted to fit all tentative matches to the model by minimizing some overall distance metric between the model and all the points. These approaches only work when the percentage of outliers (incorrect tentative matches) is very small compared to the percentage of inliers (correct tentative matches). In the case where the average is minimally affected by the outliers, the technique may succeed. However, in many cases the percentage of outliers may be high. In those cases, the Random Sample Consensus (RANSAC) algorithm from Fischler and Bolles [3] has been used with great success.
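As an illustration of the distance-based tentative matching described at the start of this subsection (not code from the paper), a minimal Python sketch of nearest-neighbour descriptor matching; the ratio-test threshold is an illustrative assumption.

```python
import numpy as np

def tentative_matches(desc1, desc2, ratio=0.8):
    """Match each descriptor in desc1 (N1 x D) to its nearest neighbour in desc2 (N2 x D).

    A match is kept only if the nearest distance is clearly smaller than the
    second-nearest one; the 0.8 threshold is illustrative, not from the paper.
    Requires at least two descriptors in desc2.
    """
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)   # Euclidean distances to all descriptors
        j1, j2 = np.argsort(dists)[:2]              # nearest and second nearest
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, j1, dists[j1]))
    # tentative matches ordered by similarity (smallest distance first)
    return sorted(matches, key=lambda m: m[2])
```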
2.2 RANSAC
RANSAC starts by assuming some transformation model (typically affine or perspective). In the affine case three tentative matches are the minimum set needed to uniquely specify the model’s six degrees of freedom. From the three sample matches a model is created and all other samples (denoted as N ) are evaluated using that model. Reprojection error is calculated as described in Section 2.4. Matches predicted with a small reprojection error are considered inliers and all others are considered outliers. The model with the highest number of inliers is then chosen as the correct image transformation model. By building the consensus set from a minimal set RANSAC can tolerate a large percentage of outliers and still determine a correct model [3]. Due to RANSAC’s success, many researchers have worked to improve the efficiency [5, 6, 7] and the accuracy [8, 9, 10] of the original approach. In the remainder of this section we will discuss these improvements. A common way to improve efficiency in RANSAC is to reduce either the number of model iterations to test or to determine as early as possible that the current model being tested is not correct thereby reducing the number of sample matches to test. In [5] Chum, et. al. use a randomized per-evaluation Td,d test to determine if the current model being evaluated is likely to be a correct model. Using this early exit from the testing process they report an efficiency improvement of an order of magnitude. In [6] Nister performs a shallow breadth first evaluation of model parameters to determine likely inlier models and then completes the depth first evaluation of only the models with the highest probability of being correct. By reducing the number of models to be fully evaluated Nister also reports an efficiency increase. Both of these improvements still require the same number of initial hypothesis as the standard RANSAC algorithm. The efficiency improvements also perform essentially the same steps as the original RANSAC algorithm, but reduce the number of data points to evaluate through an early evaluation process. Other RANSAC improvements attempt to increase accuracy. The standard RANSAC algorithm uses a set threshold T to classify a specific data point as an inlier or outlier. If this threshold is incorrectly set, the results will be inaccurate. In [8], Torr and Zisserman introduce the maximum likelihood estimation by sampling consensus (MLESAC) approach. MLESAC evaluates the probability that a given model is the correct model Mh using a mixture of Gaussians to represent the residual errors Rh . They randomly sample the parameter space and choose the parameters that maximize the posterior probability p(Mh |Rh ). This method improved the scoring accuracy of the approach. MLESAC was further refined in [9] to assign weights to the tentative matches rather than using a purely random sampling. This weighting enables their guided sampling MLESAC to find a solution faster than the standard version. While enhancing the efficiency and accuracy of the approach, these improvements share the same basic algorithm of fitting and testing model hypotheses as the original RANSAC algorithm [11].
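As an illustration of the basic RANSAC loop described at the start of this subsection (a sketch, not the paper's implementation), the following Python code fits an affine model from minimal samples of three correspondences; the iteration count and inlier threshold are illustrative parameters.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix A so that dst ~ A @ [x, y, 1]^T."""
    X = np.hstack([src, np.ones((len(src), 1))])      # n x 3
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)       # 3 x 2
    return A.T                                        # 2 x 3

def ransac_affine(src, dst, iters=1000, thresh=3.0, rng=None):
    rng = rng or np.random.default_rng(0)
    ones = np.ones((len(src), 1))
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)   # minimal sample
        A = fit_affine(src[idx], dst[idx])
        proj = np.hstack([src, ones]) @ A.T                  # model prediction
        err = np.linalg.norm(proj - dst, axis=1)             # reprojection error
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():               # largest consensus set wins
            best_inliers = inliers
    # refine with a least-squares fit on the consensus set only
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers
```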
2.3 Classifying Outliers Using Affine Invariants
Inlier and outlier classification can be done very differently by exploiting properties of affine invariance among image pairs. When two images are related through an affine transformation, certain invariant properties hold. These include:
– parallelism of corresponding lines
– the ratio of the lengths of corresponding parallel lines
– the ratio of areas of corresponding shapes [12].
The affine transformation equation shown in (1) maps a set of image locations in one image to another image. The transformation has six degrees of freedom, including translation in the X and Y directions, rotation, non-isotropic scaling and shear.

\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}   (1)

Our previous work in [2] described how to exploit the invariance of the ratio of areas of shapes under affine transformations to classify matches. This is done by selecting corresponding triangles in both images and creating a histogram of the ratio of the areas of those triangles. Because the affine transform maintains this invariant, all triangles that are composed of correctly matching features will generate the same ratio of areas. Incorrect correspondences will generate an incorrect ratio, but these ratios will be spread throughout the histogram assuming a random distribution of errors. Thus, the final histogram will have a large value in one bin (Bmax) corresponding to the correct ratio of areas, and all other bins will contain a level of noise. To determine inliers, the frequency of occurrence of each feature in Bmax is measured and the feature is classified as an inlier if it is in the larger of the top 10% or top 10 features contributing to Bmax. This is important because all bins (including Bmax) have invalid matches contributing to their noise level. The noise level must be removed to ensure only true inliers contribute to the final model. Using only these model inliers, a simple least-squares fit can create a transformation model between the images. To compute the ratios of areas, the algorithm randomly samples three matches from the images. The triangle formed by the three points is first checked to determine whether it has any angle < 5°. This is done because skinny triangles can have large variations in area caused by small localization errors in any of their vertices. Thus, to avoid these inaccuracies, skinny triangles are rejected. The ratios of areas of the remaining triangles are computed using Cramer's rule by computing a matrix of points T (2) and then applying (3).
T = \begin{bmatrix} X_1 & Y_1 & 1 \\ X_2 & Y_2 & 1 \\ X_3 & Y_3 & 1 \end{bmatrix}   (2)

Area_T = \frac{1}{2}\,|det(T)|   (3)

Ratio = (numBins/4) \cdot \frac{Area_{T1}}{Area_{T2}}   (4)
The ratio of areas is then given by (4). Multiplying the ratio by the number of bins effectively spreads the ratios of areas across all bins. Dividing the number of bins by 4 increases the granularity of each individual bin, while limiting the possible scale difference (highest ratio) allowed by the algorithm to 4. If the ratio is less than the number of bins it is added to the histogram. Additionally, a list of matches contributing to the bin is updated with each match contributing to the ratio. This is used later to determine the frequency of each match in the bin. This approach has several advantages over typical model fitting. Areas are much simpler to compute and therefore more computationally efficient than affine or perspective models. Additionally, because the inliers are taken after one pass through the data the approach is much more efficient than typical model fitting approaches that must test all points for agreement with many possible models. In this paper we will demonstrate under what image conditions the affine invariant approach provides similar accuracy as RANSAC while maintaining a large efficiency improvement.
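A Python sketch of the ratio-of-areas voting just described, under the same conventions (200 bins, ratios scaled by numBins/4, skinny triangles with an angle below 5° rejected); the final inlier selection from Bmax is simplified to counting how often each match contributes to the peak bin, and all names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def triangle_area(p):
    """Area via Cramer's rule, Eqs. (2)-(3); p is a 3x2 array of vertices."""
    T = np.hstack([p, np.ones((3, 1))])
    return 0.5 * abs(np.linalg.det(T))

def min_angle(p):
    """Smallest interior angle of triangle p (radians), used to reject skinny triangles."""
    angles = []
    for i in range(3):
        a, b, c = p[i], p[(i + 1) % 3], p[(i + 2) % 3]
        v1, v2 = b - a, c - a
        cosang = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
        angles.append(np.arccos(np.clip(cosang, -1.0, 1.0)))
    return min(angles)

def area_ratio_votes(pts1, pts2, n_samples=2000, num_bins=200, rng=None):
    """Vote ratios of areas of corresponding triangles into a histogram, Eq. (4)."""
    rng = rng or np.random.default_rng(0)
    hist = np.zeros(num_bins)
    contributors = [[] for _ in range(num_bins)]
    for _ in range(n_samples):
        idx = rng.choice(len(pts1), size=3, replace=False)
        t1, t2 = pts1[idx], pts2[idx]
        if min_angle(t1) < np.deg2rad(5) or min_angle(t2) < np.deg2rad(5):
            continue                                     # reject skinny triangles
        ratio = (num_bins / 4) * triangle_area(t1) / (triangle_area(t2) + 1e-12)
        b = int(ratio)
        if b < num_bins:
            hist[b] += 1
            contributors[b].extend(idx.tolist())
    b_max = int(np.argmax(hist))
    counts = np.bincount(np.array(contributors[b_max], dtype=int), minlength=len(pts1))
    return hist, counts    # matches with high counts in the peak bin are candidate inliers
```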
2.4 Reprojection Error
Reprojection error is used to compare the models generated by RANSAC and the affine invariant approach. Reprojection error is the distance between the feature location predicted by the model and the actual corresponding feature found during initial matching. To compute reprojection error we assume \tilde{x}^j is the projected location of feature j using the model and x^j is the detected location of feature j given by the feature detector. Reprojection error (R) is then the distance between \tilde{x}^j and x^j, summed over all features, as given by Equation 5 [13]:

R = \sum_j \sqrt{(\tilde{x}^j_1 - x^j_1)^2 + (\tilde{x}^j_2 - x^j_2)^2}   (5)
Once computed, we determine how many of the tentative feature matches are within a certain reprojection error of the model prediction. Features below the threshold are considered inliers and features above the threshold are considered outliers. Typically the threshold is chosen experimentally based on the feature detector used and the image resolution. A large percentage of inliers indicates an accurate model and thus a matching image pair. A higher number of inliers also indicates a better model in many instances.
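A short sketch of this reprojection-error test for a 2x3 affine model A (illustrative code, with an assumed threshold value):

```python
import numpy as np

def reprojection_errors(A, src, dst):
    """Per-match distance between the model prediction A @ [x, y, 1] and the detected location."""
    proj = np.hstack([src, np.ones((len(src), 1))]) @ A.T
    return np.linalg.norm(proj - dst, axis=1)

def inlier_fraction(A, src, dst, thresh=3.0):
    """Fraction of tentative matches below the reprojection-error threshold (Eq. (5) per feature)."""
    err = reprojection_errors(A, src, dst)
    return float(np.mean(err < thresh))
```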
3 Comparison between Affine Approach and RANSAC
There are multiple sources of errors during wide baseline image matching. In the original feature detection, a feature is given a location in the 2D image. Feature localization error models how well the feature location in the image corresponds to the true 3D location [14]. In multiple images of the same object taken from different locations localization errors cause features to be detected at slightly different places. Model fitting approaches must tolerate these errors and classify points as inliers when appropriate. Another source of errors is the incorrect matches present after initial matching. These incorrect matches typically will not fit the correct model. Thus, model fitting approaches must accurately label these as outliers to enable accurate model generation using only inliers. Lastly, a potential source of error for the affine invariant approach is perspective rotation between image pairs. As described in Section 2.3 the approach assumes the transformation between the matching images can be approximated by an affine transformation. The degree to which this does not hold causes inaccuracies in the computed histogram resulting in an incorrect set of inliers. In this paper we quantify the effects of these errors when using the affine invariant approach. Additionally, we quantify the same effects using RANSAC for comparison.

3.1 Experimental Setup
In all tests a single parametrized experimental setup is used. This setup ensures an unbiased test for both algorithms being evaluated and provides a framework for future evaluations. The test process shown in Algorithm 1 first detects features in real images. Some example images from [15] are shown in Figure 1. In our experiments we used the Scale Invariant Feature Transform (SIFT) detector [16]; however, any feature detector could be used. Using real images and features ensures no bias will be present in the experiment from synthetically generated features or distributions of features. Using the detected features the algorithm then computes a set of matches consistent with the parameters of the test. These parameters set the amount of rotation present in the second image, the percentage of correct and incorrect matches present and the maximum amount of localization error. Each parameter is explained in more detail in its respective section below. For each parameter set the tests were run multiple times for all images under test to validate the consistency of the algorithms.

3.2 Tolerance of Approach to Outliers
A critical metric for any inlier/outlier detection approach is the percentage of inliers required for the approach to succeed. Previous work in [2] computed this empirically as follows. Using w as the percentage of inliers in the data, the percentage of successful shapes chosen is calculated as

inlierPercentage = w^s,   (6)
Fig. 1. Initial images used in experiments. These images were randomly selected from the Zurich Building Database [15].
Input: percentage of inliers to test (p), max localization error per feature in pixels (maxLoc), amount of rotation (rot), image to use (image)
Output: generated models and associated reprojection error

// Create test matching set
numMatches ← 500
numGoodMatches ← numMatches * p
numIncorrectMatches ← numMatches * (1 − p)
features ← SIFT(image)                       // detect features
H ← generateHomography(rot)                  // create known homography
for i ← 1 to numGoodMatches do
    f ← randomlyChooseFeature(features)
    ft ← transformFeature(f, H)
    tentativeMatches ← append(f, ft)
end
for i ← 1 to numIncorrectMatches do
    f ← randomlyChooseFeature(features)
    ft ← randomlyChooseFeature(features, f)   // choose a non-matching feature
    tentativeMatches ← append(f, ft)
end
tentativeMatches ← addLocalizationError(tentativeMatches, maxLoc)

// Generate and evaluate models
hAffine ← computeModelAffine(tentativeMatches)
hRANSAC ← computeModelRansac(tentativeMatches)
reprojectionErrAffine ← computeReprojectionError(hAffine, tentativeMatches)
reprojectionErrRansac ← computeReprojectionError(hRANSAC, tentativeMatches)
Algorithm 1. Perform Matching Test and Evaluation
where s is the number of samples required to determine a ratio. In the case of triangles s = 3. The percentage of outliers is then outlierPercentage = 1 − inlierPercentage. Under the assumption of a uniform distribution of data, the outliers are spread evenly across the bins in the ratio histogram. The noise level of the histogram can then be calculated as

noiseLevel = outlierPercentage / numBins   (7)
In our implementation of the approach we set the number of bins in the histogram to 200 (numBins = 200). Using (4) the algorithm can tolerate a ratio of areas up to 4. This means one triangle can be four times the other. It was experimentally determined that detection of a true spike over the noise requires the spike to be at least 10 times greater than the noise level. Thus, using (7) our system will support data with < 63% outliers (≥ 37% inliers).
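As a quick check of the figure quoted above (our own worked derivation, not text from the paper): a sampled triangle is all-inlier with probability $w^3$, and with numBins = 200 the noise level per bin is $(1 - w^3)/200$. Requiring the peak to be at least 10 times the noise gives

$$ w^3 \;\ge\; 10 \cdot \frac{1 - w^3}{200} \;\Rightarrow\; 20\,w^3 \ge 1 - w^3 \;\Rightarrow\; w^3 \ge \tfrac{1}{21} \approx 0.048 \;\Rightarrow\; w \gtrsim 0.36, $$

i.e. roughly 37% inliers (about 63% outliers), consistent with Equations (6)–(7).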
Our current work validates these numbers experimentally using Algorithm 1. The percentage of good matches was varied from 10% through 90% over five images. To isolate the impact of incorrect matches, no rotation or localization error was applied during this test. Figure 2 shows the results for RANSAC and the affine invariant approach. As seen, the approach does not do as well as RANSAC below 30% inliers; however, the affine invariant approach does equally well at 40% or above in this test. This agrees with the previous empirical results and demonstrates the robustness of the affine invariant approach. The results of the two algorithms are very consistent, showing that each algorithm is able to find the correct model when 40% inliers are present.
Fig. 2. Tolerance to incorrect matches (total number of matches found vs. percentage of correct matches possible, out of 500). Left: RANSAC. Right: Affine approach.
3.3 Tolerance of Affine Invariant Approach to Localization Error
The goal of model fitting is to find a model between images that can predict correct matches. To do this tentative matches that fit the model are labeled as inliers and those that do not are labeled as outliers. However, most tentative matches do not fit the model exactly. Frequently this is due to localization error. As features are detected in different images the same 3D location in the real world is mapped to different 2D coordinates in the image based on choices made in the feature detectors. The amount of that difference is localization error as defined in [13]. Thus, some features differ from the model prediction by a small distance. A model fitting approach must correctly label features that exhibit localization error as inliers and features that are true mismatches as outliers. We validated the localization error tolerance of the affine invariant approach by modeling localization error as a uniform distribution between zero and a maximum value. Each feature was then modified with a different localization error (in both X and Y directions) and then the test algorithm was executed. By varying the maximum value of the error we could determine how well the algorithm performed under differing levels of localization error.
Figure 3 shows the results of trials on multiple images for the affine invariant approach and RANSAC. As shown, the affine invariant approach performs very similarly to RANSAC in the presence of localization errors. As the magnitude of the errors increases (to 9 pixels in our tests) both algorithms still find a model, but many inliers are removed. Changing tolerances can better account for localization error, but results in many false positive inliers. Thus, for all tests in this paper the same tolerances were used to ensure no tuning was done specifically for any type of test performed.
Fig. 3. Tolerance of the algorithms to localization error (total number of matches found vs. max localization error in pixels). Left: RANSAC. Right: Affine approach.
3.4 Tolerance to Image Rotation
Image rotation is a common occurrence when matching wide-baseline images. As the viewer changes position the resulting image can rotate around the optical axis (planar), around the X or Y axes (perspective) or both. The affine invariant approach assumes a transformation between images that can be approximated by an affine transform. Planar rotation can be modeled exactly using an affine transformation. Thus, planar rotation should be well-supported by the approach until features are no longer initially able to be matched. Perspective rotation cannot be modeled exactly using an affine transformation. Two additional degrees of freedom must be used to model a perspective rotation. However, research has shown that perspective changes can be approximated by affine invariants [17,4]. We validate these claims using Algorithm 1 by rotating an image both in-plane and through perspective changes. After each rotation we applied the affine invariant approach and RANSAC and compared the results. As seen in Figure 4 planar rotation does affect the overall outcome, however both model fitting approaches provide similar results. Figure 4 shows perspective rotation causes more variance in the results of the affine invariant approach than under RANSAC. A higher
number of statistical outliers are present. This is due to the explicit assumption in the affine invariant approach that the transformation between images can be approximated by an affine transform. As perspective distortion increases, the error in the assumption increases, and the accuracy of the affine invariant algorithm degrades. As in the previous experiments, these results only consider rotation. Other tests documented in Section 3.5 show performance of the algorithms when all factors are combined.
Fig. 4. Tolerance to rotation. a. RANSAC tolerance to planar rotation. b. Affine tolerance to planar rotation. c. RANSAC tolerance to perspective rotation. d. Affine tolerance to perspective rotation.
3.5 Combined Tolerance
Real image pairs have many sources of error when matching. These include the three we have investigated separately (i.e. localization error, incorrect matches, and rotation). Understanding how error affects the algorithm separately is important to enable tuning and modifying the algorithm for specific error conditions. However, to fully understand the accuracy of the algorithm for real image pairs all three must also be evaluated together. We performed the same experiments using Algorithm 1 with all combinations of localization error, incorrect matches and perspective rotation used in the previous experiments. This generated 63,000 individual data points. Figure 5 was generated to visualize the results showing the boundaries where the algorithms find a correct model. The figure shows combinations of parameters where each algorithm found over 75% of the available matches. The output clearly shows that all types of error have an impact. Additionally, the impact of the combined parameters is seen earlier (for lower values) when using the affine invariant approach than when using RANSAC. The results show the limitations of the approach under different conditions. For example, the approach is accurate at 25◦ of perspective rotation if 75% of matches are inliers. Alternatively, the approach is accurate with only 40% inliers at 10◦ of perspective rotation. In addition to the accuracy the efficiency improvement is maintained as seen in Figure 6. When the images are very similar, RANSAC performs at its best case efficiency which is seen in test 1 of the figure. However, in all other tests
Fig. 5. Locations in the parameter space (perspective rotation in degrees, percentage of good matches, localization error in pixels) where ≥ 75% of matches were found. RANSAC designated by a (+), affine invariant approach designated by an (o).
Fig. 6. Left: Time for RANSAC. Right: Time for affine approach.
RANSAC performance degrades significantly, while the affine invariant approach maintains its original efficiency. This is due to the fundamental difference in the approaches described in Section 2.3.
4 Conclusion
In this work we evaluated a new affine invariant approach to matching. We have shown that under varying degrees of localization error the affine invariant approach provides accuracy similar to that of RANSAC. With incorrect matches present, we have shown that the affine invariant approach performs well with at least 40% inliers while RANSAC performs well with at least 20% inliers. As expected, the performance of the affine invariant approach under perspective rotation is increasingly unstable as the rotation increases. These results set a baseline for the affine invariant approach and will help guide future work in improving the accuracy under these conditions. Additionally, we tested the effects with multiple sources of error present. The tests show the accuracy of the algorithm is very dependent on both perspective rotation and the percentage of inliers. This paper has demonstrated that the affine invariant approach has utility under a range of conditions of localization error, incorrect matches and perspective rotation. Under these conditions the affine invariant approach will provide a significant speed improvement over RANSAC. Additionally, by understanding how the accuracy of the algorithm changes under different conditions, our future work will be targeted to improve the tolerance to these conditions or to detect conditions where the algorithm does not perform well and enable a fall-back to RANSAC. These improvements will result in an algorithm that is as accurate as RANSAC in all cases, while being more efficient in many cases with a worst-case RANSAC efficiency. The affine invariant approach is a significant departure from traditional RANSAC-based algorithms that iteratively develop and test hypotheses. Using affine invariants enables a single-pass algorithm to perform accurately and efficiently. Future work will continue to increase the efficiency and accuracy of the algorithm guided by the results presented in this paper. Additional enhancements similar to techniques used to improve traditional RANSAC can be applied to further improve the affine invariant approach.
Acknowledgment The authors would like to thank DARPA and Ascend Intelligence LLC for their support of this work under contract W15P7T-07-C-P219.
References 1. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005) 2. Fleck, D., Duric, Z.: Affine invariant-based classification of inliers and outliers for image matching. In: Image Analysis and Recognition, International Conference ICIAR 2009. Springer, London (2009) 3. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981) 4. Trucco, E., Verri, A.: Introductory Techniques for 3-D Computer Vision. Prentice Hall PTR, Upper Saddle River (1998) 5. Chum, O., Matas, J.: Randomized ransac with td,d test. In: Proceedings of the 13th British Machine Vision Conference (BMVC), pp. 448–457 (2002)
6. Nister, D.: Preemptive ransac for live structure and motion estimation. MVA 16(5), 321–329 (2005) 7. Chum, O., Matas, J., Kittler, J.: Locally optimized ransac. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, pp. 236–243. Springer, Heidelberg (2003) 8. Torr, P.H.S., Zisserman, A.: Mlesac: a new robust estimator with application to estimating image geometry. Comput. Vis. Image Underst. 78(1), 138–156 (2000) 9. Tordoff, B., Murray, D.W.: Guided sampling and consensus for motion estimation. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 82–96. Springer, Heidelberg (2002) 10. Wang, H., Suter, D.: Robust adaptive-scale parametric model estimation for computer vision. IEEE Trans. Pattern Anal. Mach. Intell. 26(11), 1459–1474 (2004) 11. Zhang, W., Kosecka, J.: Ensemble method for robust motion estimation. In: 25 Years of RANSAC, Workshop in conjunction with CVPR (2006) 12. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2004) 13. Ma, Y., Soatto, S., Kosecka, J., Sastry, S.S.: An Invitation to 3-D Vision: From Images to Geometric Models. Springer, Heidelberg (2003) 14. Morris, D.D., Kanade, T.: Feature localization error in 3d vision. In: Winkler, J., Niranjan, M. (eds.) Uncertainty in geometric computations, July 2001, pp. 107–117. Kluwer Academic Publishers, Dordrecht (2001) 15. Griesser, A.: Zurich building database, http://www.vision.ee.ethz.ch/showroom/zubud/index.en.html 16. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004) 17. Cheng, Y.: Analysis of affine invariants as approximate perspective invariants. Computer Vision and Image Understanding 63(2), 197–207 (1996)
Asbestos Detection Method with Frequency Analysis for Microscope Images Hikaru Kumagai1, Soichiro Morishita2 , Kuniaki Kawabata3 , Hajime Asama2 , and Taketoshi Mishima1 1
Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama, Japan 2 The University of Tokyo, 5-1-5, Kashiwanoha, Kashiwa-shi, Chiba, Japan 3 RIKEN, 2-1, Hirosawa, Wako-shi, Saitama, Japan
Abstract. In this paper, we propose an asbestos detection method focusing on frequency distribution of microscopic images. In building construction, asbestos has been used for molding plates and heat insulation materials. However, increased injury caused by asbestos has become a problem in Japan. Removal of asbestos from building materials and rendering it harmless are common means of alleviating asbestos hazards. Nevertheless, those processes necessitate a judgment of whether asbestos is included in building materials. According to the JIS standards, it is necessary to count 3000 particles in microscopic images. We consider the asbestos shape, and define a new feature obtained through frequency analysis. The proposed method intensifies the low-brightness asbestos using its feature, so it can detect not only high-brightness particles and asbestos but also low-brightness asbestos. We underscore the effectiveness of the method by comparing its results with results counted by an expert.
1 Introduction
In building construction, asbestos has been used for mold plates and heat insulation materials. However, health hazards such as malignant mesothelioma from exposure to asbestos have become a problem recently in Japan. As a measure against growing asbestos problems, asbestos analyses to check asbestos contents in building materials are performed widely. Dispersion staining [1] is a qualitative analytical method to assess whether building materials contain asbestos. The related JIS standard practices [2] are as follows.
1. Make nine samples from building materials.
2. Three kinds of asbestos (chrysotile, amosite, and crocidolite) are generally used; choose three samples that have been stained with immersion in liquid.
3. Count 3000 particles using microscopic observation from the chosen three samples.
4. If four or more asbestos particles are present among those 3000 particles, then the sample is judged as harmful.
Part of one such photomicrograph is presented in Fig. 1(a). Counting particles and asbestos is performed visually (Fig. 1(b)), which requires enormous amounts of time and effort. This process particularly makes asbestos analysis onerous and costly. Therefore,
Fig. 1. An example of the microscopic appearance of particles and an expert counting them visually: (a) an example of the microscopic appearance of asbestos; (b) an expert counting Fig. 1(a).
Fig. 2. The automated microscopic system for supporting asbestos qualitative analysis
an automated microscopic system for supporting asbestos qualitative analysis [3] has been proposed. This system's structure is presented in Fig. 2. The system detects small and low-brightness particles, for which it has very high detectability. However, building materials contain vast numbers of small, low-brightness particles; when all of them are detected, the proportion of true particles and asbestos among the detections becomes small. On the other hand, if the detection of low-brightness particles is suppressed, detection of low-brightness asbestos particles is also affected. Both situations lead to erroneous judgments and constitute important problems that must be solved. This study is intended to establish a method that detects not only high-brightness particles and asbestos but also low-brightness asbestos. We propose an asbestos detection method that particularly addresses the frequency distribution of microscopic images.
2 Related Research
Automated asbestos analysis systems such as Magiscan [4],[5] and AFACS [6] have been developed. However, these systems are used for detecting atmospheric airborne asbestos. Their samples include fewer particles whose size can be recognized except
asbestos. Moreover, methods for detecting particles [7] and bubbles [8],[9] in images have been proposed. The sizes and shapes of those detection targets are similar to each other. In this study, however, the targets are particles and asbestos in building materials, whose sizes and shapes vary widely. Therefore, it is necessary to propose a method that detects them adequately.
3 Proposed Method
In this study, we consider that asbestos has an elongated shape. We propose an asbestos detection method particularly addressing the frequency distribution. The proposed method procedure and its flow chart (Fig. 3) are as follows:
1. For enhancing asbestos, we examine a method compressing a microscopic image from three-dimensional to one-dimensional.
2. Analyzing the frequency by two-dimensional Fourier transform, we analyze the frequency distribution of asbestos.
3. We calculate the asbestos direction from the frequency distribution of asbestos.
4. We perform edge enhancement based on the asbestos direction.
Fig. 3. A flowchart of the proposed method
3.1 Dimensional Compression for Enhancing Asbestos
Low-brightness asbestos in microscopic images has colors such as yellow (Fig. 4(a)) and blue (Fig. 4(b)). Therefore, it is difficult to judge whether asbestos is included (Fig. 4(c)). The related color histograms are presented in Figs. 5 and 6. We therefore choose a dimensional compression method for enhancing asbestos by checking the RGB correlation of each sample. To do so, we perform a principal component analysis (PCA) on Fig. 4. The calculated factor loadings and the contribution of the first principal component (PC1) are presented in Table 1. PCA is a multivariate data analysis method that uses correlations among variates from multivariate measurement and describes the original measurement properties by lower-dimensional variates. In Table 1, it is apparent that the correlation of R and G is strong and that of B is weak
Fig. 4. An example of low-brightness asbestos and no asbestos: (a) yellow, (b) blue, (c) no asbestos.
Fig. 5. The color histogram of Fig. 4(a)

Table 1. PC1 factor loadings and the contribution for Fig. 4

                         Fig. 4(a)   Fig. 4(b)   Fig. 4(c)
Red (R)                  0.934       0.751       0.73
Green (G)                0.958       0.71        0.86
Blue (B)                 0.282       0.929       0.4
Cum. contribution (%)    62.35       64.51       48.38
in the Fig. 4(a) and 4(c) samples. On the other hand, the correlation of B is strong in the Fig. 4(b) sample. As described above, although Fig. 4(a) and 4(c) show a similar trend, the factor loading of B in the detection target Fig. 4(b) is large. Since the RGB weights cannot be set uniquely, we use the simple average method of Eq. (1) in this study:

V = ω_r R + ω_g G + ω_b B,    ω_r = ω_g = ω_b = 1/3    (1)
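A small illustrative sketch (not from the paper) of the compression in Eq. (1) and of the PC1 factor-loading check behind Table 1, assuming the input is an RGB image stored as an HxWx3 array; note that the sign of a loading depends on the eigenvector orientation, so absolute values are what should be compared with the table.

```python
import numpy as np

def compress_to_gray(img_rgb):
    """Equal-weight compression of Eq. (1): V = (R + G + B) / 3."""
    return img_rgb.astype(float).mean(axis=2)

def pc1_factor_loadings(img_rgb):
    """Correlation of each RGB channel with the first principal component (PC1)."""
    X = img_rgb.reshape(-1, 3).astype(float)
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    pc1 = Xc @ vecs[:, -1]                        # scores on the largest component
    loadings = np.array([np.corrcoef(Xc[:, c], pc1)[0, 1] for c in range(3)])
    contribution = vals[-1] / vals.sum()          # PC1 contribution ratio
    return loadings, contribution
```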
Using V, we can obtain a gray-scale image in which the asbestos targets are enhanced.

3.2 Frequency Analysis for Asbestos Feature Extraction
To find a feature of low-brightness asbestos, which is difficult to detect, we analyze the frequency distribution. However, variously colored and shaped particles are present in
Fig. 6. The color histogram of Fig. 4(b)
images. For that reason, it is difficult to analyze only the frequency of low-brightness asbestos. Therefore, we perform a frequency analysis after dividing an image into small areas including targets only. Then we use a two-dimensional Fourier transform, as presented in Eq. (2), as a means of analyzing the frequency:

F(u, v) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \exp\{-2\pi i (ux + vy)\}\, dx\, dy,    (2)
where F(u, v) is the two-dimensional frequency spectrum of f(x, y), f(x, y) is the brightness at an arbitrary point (x, y) of the image, u is the frequency component of x, and v is the frequency component of y. A small area (ROI) including the target only is presented in Fig. 7(a). Furthermore, the power spectrum obtained by performing the Fourier transform of Fig. 7(a) is presented in Fig. 7(b). The components near the center are low-frequency components and components far from the center are high-frequency components. In the frequency distribution of Fig. 7, the low-brightness, elongated asbestos becomes visible.
Fig. 7. Two-dimensional Fourier transform: (a) region of interest (ROI); (b) power spectrum of Fig. 7(a).
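A minimal sketch of this frequency analysis on one small region (illustrative code, not from the paper), assuming a grayscale ROI array; np.fft.fftshift moves the low-frequency components to the centre as in Fig. 7(b), and the 32x32 tiling in the usage comment matches the experimental setting reported later.

```python
import numpy as np

def power_spectrum(roi):
    """Centred power spectrum of a grayscale ROI (discrete counterpart of Eq. (2))."""
    F = np.fft.fft2(roi.astype(float))
    F = np.fft.fftshift(F)            # put low frequencies at the centre
    return np.abs(F) ** 2

# usage sketch: split the image into 32x32 small areas and analyse each one
# spectra = [power_spectrum(gray[y:y + 32, x:x + 32])
#            for y in range(0, gray.shape[0], 32)
#            for x in range(0, gray.shape[1], 32)]
```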
3.3 Asbestos Direction Detection for Edge Enhancement
We calculate a covariance matrix from the frequency distribution of a small area including asbestos, and from it calculate the asbestos direction. Edge enhancement based on this direction can enhance the brightness of the elongated asbestos. The covariance matrix is represented as

\mu = \frac{1}{\sum_y \sum_x f(x, y)} \sum_y \sum_x f(x, y)\, w, \qquad w = \begin{pmatrix} x \\ y \end{pmatrix},

\Sigma = \frac{1}{\sum_y \sum_x f(x, y)} \sum_y \sum_x f(x, y)\,(w - \mu)(w - \mu)^T,    (3)
where Σ is the covariance matrix, μ signifies the brightness-weighted average of the coordinates, f(x, y) denotes the brightness at an arbitrary point (x, y) of the image, and w stands for the coordinate vector. We calculate the eigenvalues λ1 and λ2 (λ1 ≥ λ2) of the covariance matrix calculated in Eq. (3). Because asbestos is elongated, the aspect ratio of an approximating rectangle becomes large. Therefore, we can judge whether asbestos is present in a small area by roughly estimating the aspect ratio from λ1 and λ2. The calculated first eigenvector v1 is presented in Fig. 8(a). Additionally, we designate the angle θ, which represents the asbestos direction, as shown in Fig. 8(b).
Fig. 8. Eigenvalue decomposition of the covariance matrix: (a) primary eigenvector; (b) angle of Fig. 8(a).
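A sketch of Eq. (3) and the eigen-decomposition step (illustrative, not the paper's code), treating the power spectrum of a small area as the weighting f(x, y); the aspect-ratio threshold used to flag an elongated target is an assumption.

```python
import numpy as np

def orientation_from_spectrum(P, min_aspect=2.0):
    """Weighted covariance of the spectrum P (Eq. (3)) and its principal direction.

    Returns (theta, is_elongated): theta is the angle of the first eigenvector,
    and is_elongated compares the lambda1/lambda2 aspect ratio with a heuristic threshold.
    """
    h, w = P.shape
    ys, xs = np.mgrid[0:h, 0:w]
    wts = P / (P.sum() + 1e-12)                      # normalised weights f(x, y)
    mu_x, mu_y = (wts * xs).sum(), (wts * ys).sum()  # weighted mean coordinates
    dx, dy = xs - mu_x, ys - mu_y
    cov = np.array([[(wts * dx * dx).sum(), (wts * dx * dy).sum()],
                    [(wts * dx * dy).sum(), (wts * dy * dy).sum()]])
    vals, vecs = np.linalg.eigh(cov)                 # eigenvalues in ascending order
    lam2, lam1 = vals                                # lambda1 >= lambda2
    v1 = vecs[:, 1]                                  # first (largest) eigenvector
    theta = np.arctan2(v1[1], v1[0])
    return theta, (lam1 / max(lam2, 1e-12)) >= min_aspect
```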
3.4 Edge Enhancement for Asbestos Detection
We use edge enhancement to enhance asbestos. Many edge detectors such as the Laplacian [10], Canny [11] and Marr–Hildreth [12] have been proposed. In this study, we use a Sobel filter as the edge-enhancement method. A Sobel filter is a first-derivative filter that, once a direction has been decided, enhances edges only along that direction. In this study, we use not a 3×3 but a 5×5 mask size to perform edge enhancement based on the asbestos angle. The horizontal Sobel filter is Dx and the vertical Sobel filter
Fig. 9. Sobel filters for edge detection (5 × 5): (a) horizontal enhancement Dx; (b) vertical enhancement Dy.
is Dy. Both filters are presented in Fig. 9. Using the filters in Fig. 9, we construct our own directional Sobel filter based on the estimated angle. We denote this filter as D; its coefficients are given by

D = cos θ · Dx + sin θ · Dy.    (4)
The result of applying this directional Sobel filter to Fig. 7(a) is presented in Fig. 10(a). By enhancing the low-brightness asbestos in this way, it becomes possible to detect the particles, as shown in Fig. 10(b).
Fig. 10. Sobel edge detection and binarization: (a) Sobel edge enhancement of Fig. 7(a); (b) binarization of Fig. 10(a).
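A sketch of the direction-dependent filtering of Eq. (4) (illustrative code, not from the paper); the exact 5×5 coefficients are only shown in Fig. 9, so the kernels below use a commonly cited 5×5 Sobel extension and are an assumption, as is the binarization threshold.

```python
import numpy as np
from scipy.signal import convolve2d

# assumed 5x5 horizontal Sobel kernel; the paper's Fig. 9 defines its own coefficients
DX = np.array([[-1, -2, 0, 2, 1],
               [-2, -3, 0, 3, 2],
               [-3, -5, 0, 5, 3],
               [-2, -3, 0, 3, 2],
               [-1, -2, 0, 2, 1]], dtype=float)
DY = DX.T                                            # vertical kernel

def oriented_sobel(gray, theta):
    """Edge enhancement along the estimated asbestos direction, Eq. (4)."""
    D = np.cos(theta) * DX + np.sin(theta) * DY
    return convolve2d(gray.astype(float), D, mode="same", boundary="symm")

def binarize(response, thresh):
    """Simple threshold on the filter response magnitude, as in Fig. 10(b)."""
    return np.abs(response) > thresh
```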
4 Experiment
We conducted the experiment using the proposed method on 21 microscopic images (640 × 480 pixels) obtained from the Japan Association for Working Environment Measurement. For the target images, we empirically determined that each image should be divided into small areas of 32 × 32 pixels. The procedure of the experiment is as follows:
1. We binarize with a threshold that was decided empirically, and detect particles and asbestos in each image.
2. We perform labeling processing for the obtained binary images, and count particles and asbestos in each image.
3. We compare the obtained result with the result counted by an expert (Fig. 11); then we evaluate the false positives and negatives.
4. We conduct experiments 1–3 using a preprocess with background subtraction [13] and the proposed method; then we compare the results obtained using the preprocess with background subtraction and the proposed method.
Here, to conform to the expert counting basis, we judge low-brightness and small particles as noise (false positives).

4.1 Result
For example, we use the image in Fig. 12(a) as input. Moreover, the image with its dynamic range adjusted is presented in Fig. 12(b) to show the particles emphatically. The resultant output with background subtraction is presented in Fig. 13(a), and the resultant output with the proposed method is presented in Fig. 13(b). By comparing Fig. 13(a) with Fig. 13(b), we show the detection of the low-brightness asbestos enclosed in the circle and the reduction of false positives. The results obtained using background subtraction and the proposed method for each evaluation target are presented in Table 2.
Fig. 11. An example of an image counted by an expert
Fig. 12. An example of an input image: (a) an input image; (b) Fig. 12(a) changed in the dynamic range.
Fig. 13. Outputs: (a) with background subtraction; (b) with the proposed method.
4.2 Discussion
As presented in Table 2, we not only detected the same number of asbestos using background subtraction and the proposed method, but also enormously reduced false
Table 2. Results

(a) Particle result
                          Particles   False positives   False negatives
Answer (goal)             349         0                 0
Background subtraction    1841        1496              0
Proposed method           760         415               0

(b) Asbestos result
                          Asbestos    False positives   False negatives
Answer (goal)             41          0                 0
Background subtraction    37          0                 4
Proposed method           37          0                 4
positives using the proposed method. This is because the enhancement of low-brightness asbestos in the proposed method allows false positives to be removed by threshold processing. Moreover, in Table 2(b), four false negatives of asbestos exist because only a single asbestos direction is calculated when several asbestos particles are included in one small area, as shown in Fig. 14.
Fig. 14. ROI including two asbestos
5 Conclusion
This study was undertaken to establish a method to detect not only high-brightness particles and asbestos particles, but also low-brightness asbestos. We considered that the asbestos has an elongated shape, and analyzed the frequency using a two-dimensional Fourier transform. We proposed a method that calculates the asbestos direction by analyzing the frequency distribution of asbestos. It enhances the asbestos edges based on the directional analysis of the frequency distribution of asbestos. Results of experiments underscore the effectiveness of the method: we not only detected low-brightness asbestos, but also sharply reduced false positives using the proposed method. Future investigations will examine a method that can analyze the frequency of several asbestos particles in a small area, and will study a method to determine a suitable resolution of the small area.
References 1. Walter, C.: Detection and identification of asbestos by microscopical dispersion staining. Environmental Health Perspectives 9, 57–61 (1974) 2. JIS (Japanese Industrial Standard) A 1481: Determination of asbestos in building material products (2006) 3. Kawabata, K., Morishita, S., Takemura, H., Hotta, K., Mishima, T., Asama, H., Mizoguchi, H., Takahashi, H.: Development of an automated microscope for supporting qualitative asbestos analysis by dispersion staining. Journal of Robotics and Mechatronics 21, 186–192 (2009) 4. Baron, P.A., Shulman, S.A.: Evaluation of the magiscan image analyzer for asbestos fiber counting. American Industrial Hygiene Association Journal 48, 39–46 (1987) 5. Kenny, L.C.: Asbestos fibre counting by image analysis - the performance of the manchester asbestos program on magiscan. Annals of Occupational Hygiene 28, 401–415 (1984) 6. Inoue, Y., Kaga, A., Yamaguchi, K.: Development of an automatic system for counting asbestos fibers using image processing. Particulate Science and Technology 16, 263–279 (1998) 7. Shen, L., Song, X., Iguchi, M., Yamamoto, F.: A method for recognizing particles in overlapped particle images. Pattern Recognition Letters 21, 21–30 (2000) 8. Stokes, M.D., Deane, G.B.: A new optical instrument for the study of bubbles at high voidfractions within breaking waves. IEEE Journal of Oceanic Engineering 24, 300–311 (1999) 9. Duraiswami, R., Prabhukumar, S., Chahine, G.L.: Bubble counting using an inverse acoustic scattering method. The Journal of the Acoustical Society of America 104, 2699–2717 (1998) 10. Berzins, V.: Accuracy of laplacian edge detectors. Computer Vision, Graphics, and Image Processing 27, 195–210 (1984) 11. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 679–698 (1986) 12. Marr, D., Hildredth, E.: Theory or edge detection. Proceedings of the Royal Society London 207, 187–217 (1980) 13. Kumagai, H., Morishita, S., Kawabata, K., Asama, H., Mishima, T.: Accuracy improvement of counting asbestos in particles using a noise redacted background subtraction. In: IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pp. 74–79 (2008)
Shadows Removal by Edges Matching P. Spagnolo, P.L. Mazzeo, M. Leo, and T. D’Orazio Institute of Intelligent Systems for Automation, via Amendola 122/D, Bari, Italy
[email protected]
Abstract. Motion detection algorithms usually detect moving regions in a rough way; in some application contexts it can be mandatory to obtain the exact shape of such objects by removing cast shadows as well as ghosts and reflections due to variations in light conditions. To address this problem we propose an approach based on edge matching. The basic idea is that edges extracted in shadow (or ghost) regions of the current image closely match the edges extracted in the corresponding regions of the background image. On the contrary, edges extracted on foreground objects have no corresponding edges in the background image. In order to remove whole shadow regions instead of only shadow points, we first segment the foreground image into subregions according to the uniformity of the photometric gain between adjacent points. The algorithm has been tested in many different real contexts, both in indoor and outdoor environments.
1 Introduction
In recent years the attention paid to surveillance applications has greatly increased, due to a general need for safety and security. Extracting information from video sequences is a challenge for researchers, and the possible applications range from security control to multimedia analysis for behavior understanding. The starting point of every surveillance application is foreground object detection. It is usually performed by background subtraction or temporal differencing. However, these algorithms usually detect shadows as part of the foreground objects, because they move together. Shadow characteristics change radically according to the context: in an outdoor scene the orientation, intensity, and size of shadows vary with the light conditions, the time of day, the presence or absence of clouds, and so on. In indoor contexts, instead, the presence of multiple light sources, as well as reflective surfaces, greatly changes the shadow appearance. For these reasons shadow detection is a challenging task. Shadow removal algorithms can be divided into two main categories: spectral approaches, which use only pixel color information, and texture approaches, which use the texture content of regions/points to remove shadows. Spectral approaches can be further divided into subcategories, according to the color space they use. In [1] the authors explored the HSV color space for shadow
detection, classifying as shadows those points having approximately the same hue and saturation values as in the corresponding background image. In [2] the authors utilized the normalized RGB color space, including a lightness measure, to detect cast shadows. In [3] the authors presented a method able to work both with still images and with video sequences. It starts from an initial set of hypotheses based on RGB differences between each frame and the reference image, followed by a validation procedure that exploits photometric and geometric properties of shadows. The YUV color space is the basis of the approach proposed in [4], and more recently, in [5], the authors presented an exhaustive analysis of shadow models in other color spaces. The color model based on the RGB color space proposed in [6] separated the brightness from the chromaticity component to deal with both shadows and illumination changes. In [7] the authors presented a shadow detection algorithm based on training with shadow samples in the RGB color space using a support vector domain description. The manual segmentation of shadows in the training phase, as well as the absence of an updating procedure able to handle dynamic illumination changes, are the main limitations of this approach. Gray intensity and spatial coordinates are the input of the approach proposed in [8]. Texture-based approaches work in the direction of detecting shadows by analyzing the texture content of small regions. In [9] the authors assumed that shadows often appear around the foreground objects, so they tried to detect shadows by extracting moving edges. In [10] the intensity ratio between the current and background images was the starting point of the removal algorithm. In a similar way, in [11], the authors exploited ratio edges, modeled as a chi-squared distribution, for shadow identification. Ratios between the background and current images were also the core of the approach proposed in [12]. An analysis of the morphological gradient of foreground regions was the starting point of the approach proposed in [13]. In [14] the authors proposed an algorithm that compares patches of new frames with the respective patches of the reference image, using Gabor functions to check whether the texture content remains the same. In [15] the authors used a local normalized cross-correlation metric to detect shadows, and a texture similarity metric to detect illumination changes. The analysis of the standard deviation of pixel ratios in small neighborhoods was also the main topic of the approach proposed in [16]. Other approaches adopt a hybrid solution, integrating intensity information with texture content. In [17] shadow segmentation was performed by a dynamic conditional random field based on intensity and gradient features. [18] relied on brightness edge and shading information to detect moving cast shadows against a well-textured background. In [19] the authors employed both the pixel and the edge information of each channel of the normalized RGB color space to detect shadowed pixels. Recently, good works have been presented in [20], [21] and [22]. A good review of the topic can be found in [23].
In this work we propose a two-step algorithm able to detect and remove shadows and artifacts due to sudden or gradual variations in light conditions. In the first step, foreground objects are segmented into subregions according to the uniformity of the pixels' spectral content. For this purpose the photometric gain is evaluated for each foreground point, and then an unsupervised clustering procedure is carried out in order to agglomerate pixels with similar spectral content and geometric position. In this way the problem of approaches that evaluate pixel characteristics individually, which suffer when removing shadow boundaries, is overcome. After this, texture is evaluated both on the current image and on the reference one. For this purpose edges are extracted and compared between these images. Subregions with a high percentage of matching edges are considered as shadow (or ghost) and removed. The proposed algorithm is independent of the foreground segmentation procedure; the only requirement is the presence of a reference image. It can therefore be considered a stand-alone module that can easily be added to any motion detection system, with the goal of improving the quality of the segmentation output. In the following, the segmentation procedure is presented, together with the clustering approach utilized for this purpose, and the edge extraction and matching algorithm is explained. Finally, experimental results obtained both in indoor and outdoor contexts are presented.
2 Shadow Removing
A shadow is an abnormal illumination of a part of an image relative to the background, due to the interposition of an opaque object between the scene and a bright, point-like illumination source. From this assumption, we can note that shadows move with their own objects, but, unlike foreground objects, they do not have a fixed texture: they are half-transparent regions which retain the pattern of the underlying background surface. Therefore, our aim is to detect the parts of the image that have been detected as moving regions by the previous motion segmentation step but whose structure is substantially unchanged with respect to the corresponding background. To do this, we first apply a segmentation procedure to recover large regions characterized by a constant photometric gain; then a feature matching against the background model is carried out to select the actual shadow regions.
2.1 Foreground Segmentation
As explained before, the proposed algorithm can be considered as an additional module that works after a conventional background subtraction algorithm. So, in the following, we consider a segmented foreground object as the input of our approach. The segmented image, as previously illustrated, needs to be further divided into subregions, in order to allow the removal of all shadow parts, not only single
points. For this purpose we first evaluate the photometric gain exhibited by all foreground points:

G(x, y) = B(x, y) / I(x, y)   (1)

where B(x, y) is the grey level of pixel (x, y) in the background reference image and I(x, y) is the corresponding value in the current image. Each point can then be represented by a feature vector p = (x, y, G). These vectors are the input of an unsupervised clustering procedure whose goal is to segment foreground objects into subregions according to the uniformity of photometric gain and spatial position; for this purpose we have utilized the BSAS (Basic Sequential Algorithmic Scheme) algorithm. Independently of the specific clustering algorithm, we first need to define a metric for estimating the "proximity" of two items according to the selected features. The selected proximity measure is the Manhattan distance:

d(p_1, p_2) = Σ_{i=1}^{n} w_i |p_{1i} − p_{2i}|   (2)
where p_1 and p_2 are two image point vectors with components p_{1i} and p_{2i}, n is the vector size, and w_i are the weight coefficients. In this way it is possible to give a different relevance to each vector component, according to the desired results. In our experiments we have chosen to give more relevance to the similarity of photometric gain than to the spatial coordinates (i.e., w_3 is slightly higher than w_1 and w_2). This proximity measure has the advantage of being simple and fast with respect to other, more complex distances (such as the Euclidean one). Notice the use of the photometric gain instead of simple spectral pixel information (like RGB color space values): in this way we segment the image while maintaining some textural information. For example, if a shadow is projected onto two different surfaces (e.g., a wall and a table), a spectral-based segmentation divides it into two subregions, while the proposed approach considers them as a single segment. After defining the implemented proximity measure, it is possible to segment foreground objects into several subregions by means of a clustering procedure.
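A minimal sketch of this feature construction and proximity measure is given below; the helper names and the weight values are illustrative assumptions (the paper only states that the gain component is weighted slightly more than the spatial coordinates).

```python
import numpy as np

def foreground_features(background, current, fg_mask, eps=1e-6):
    """Build one feature vector p = (x, y, G) per foreground pixel,
    where G is the photometric gain B(x, y) / I(x, y) of Eq. (1)."""
    ys, xs = np.nonzero(fg_mask)
    gain = background[ys, xs].astype(float) / (current[ys, xs].astype(float) + eps)
    return np.column_stack([xs, ys, gain])

def manhattan_distance(p1, p2, weights=(1.0, 1.0, 1.5)):
    """Weighted Manhattan distance of Eq. (2); the gain component is
    weighted slightly more than the spatial coordinates."""
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * np.abs(np.asarray(p1) - np.asarray(p2))))
```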
2.2 Basic Sequential Algorithmic Scheme
BSAS is a very basic and simple clustering algorithm; vectors are presented only once and the number of clusters is not known a priori. What is needed is a similarity measure d(p, C) and a similarity threshold th. Each class can be represented by a feature vector (usually called the representative): this simplifies the distance evaluations between vectors and clusters. The idea is to assign every newly presented vector to an existing cluster or to create a new cluster for this sample, depending on the distance to the already defined clusters. Further details about the algorithm can be found in [24]. The main drawback of this algorithm is that different choices of the distance function lead to different results and, unfortunately, the order in which the samples are presented can also have a great impact on the final result. What is
also very important is the value of the threshold th. This value has a direct effect on the number of resulting clusters. If th is too small, unnecessary clusters are created, while an incorrect segmentation may result if it is too large. In our implementation we have slightly modified this classic version of the algorithm. In particular, in order to reduce the dependence on the order in which the samples are presented, a merge procedure is carried out on the output clusters, using th as a merge threshold: if the distance between two clusters (considering the mean values of the vector elements) is less than th, then they are merged. Figure 1 shows an example of the segmentation procedure in an outdoor context. As is evident, after the segmentation procedure the shadow boundaries are grouped into a different cluster (1(b)), while after the merge procedure they correctly form a single cluster together with the shadow (1(c)).
Fig. 1. The output of segmentation and merge procedure
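The sequential clustering and merge steps described above can be sketched as follows; the use of cluster means as representatives and the restart-after-merge strategy are assumptions of this illustration.

```python
import numpy as np

def bsas_with_merge(points, dist, th):
    """Basic Sequential Algorithmic Scheme followed by a merge pass.
    `points` is an (N, d) array, `dist` a distance function between two
    vectors, and `th` the similarity threshold."""
    reps, members = [], []                  # cluster representatives (means) and member index lists
    for i, p in enumerate(points):
        if not reps:
            reps.append(p.astype(float)); members.append([i]); continue
        d = [dist(p, r) for r in reps]
        k = int(np.argmin(d))
        if d[k] <= th:                      # assign to the closest cluster and update its mean
            members[k].append(i)
            reps[k] = points[members[k]].mean(axis=0)
        else:                               # otherwise open a new cluster
            reps.append(p.astype(float)); members.append([i])
    # Merge pass: fuse clusters whose representatives are closer than th.
    merged = True
    while merged:
        merged = False
        for a in range(len(reps)):
            for b in range(a + 1, len(reps)):
                if dist(reps[a], reps[b]) <= th:
                    members[a] += members[b]
                    reps[a] = points[members[a]].mean(axis=0)
                    del reps[b], members[b]
                    merged = True
                    break
            if merged:
                break
    return members
```

Used with the feature vectors and weighted Manhattan distance of the previous subsection, the returned member lists are the subregions that serve as processing masks in the next step.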
2.3 Shadow Detection
After segmenting the foreground image into subregions, the next step is to distinguish shadow regions from actual foreground objects. Starting from the assumption that a shadow is a half-transparent region, we solve this problem by means of an edge matching procedure. We first extract edge points both in the background image and in the current one. Then, using the previously detected subregions as processing masks, we evaluate the percentage of matching edges, and regions with a high matching confidence are considered as shadow regions and removed. First, an edge operator is applied to the background image to find its edges. The same operator is then applied to the current image (an optimization of this algorithm could limit this operation to the regions around the previously detected foreground objects). Formally, if we denote by F_i the detected foreground region:

E_B(x, y) = 1 if (x, y) is an edge and (x, y) ∈ F_i, 0 otherwise   (3)

E_F(x, y) = 1 if (x, y) is an edge in the current image, 0 otherwise   (4)
where E_B(x, y) and E_F(x, y) denote the binary images obtained by applying an edge operator to the background image and to the current one, respectively. To perform edge detection we have used the SUSAN algorithm [25], which is very fast and performs well. Let N be the total number of edge points of the region F_i, that is:

N = Σ_{(x,y)} E_B(x, y)   (5)

Now the two images containing the edges are matched and a similarity measure needs to be evaluated. Because edge pixels may not match exactly between the two images, this matching procedure is generalized by using small windows around each examined point. The metric we have implemented counts the number of edge points in the segmented image that have a corresponding edge point in the current grey-level image; moreover, a search around those points is necessary to avoid mistakes due to noise or small segmentation flaws. Let C_B(x, y) ⊂ E_B(x, y) and C_F(x, y) ⊂ E_F(x, y) be two windows of size n around the point (x, y) in the background and current edge images, respectively. The matching measure M_SF is:

M_SF = (1/N) Σ_{(x,y) | (x,y) ∈ E_B} M_SF(x, y)   (6)

where

M_SF(x, y) = 1 if E_B(x, y) = E_F(x, y) = 1; δ if (E_B(x, y) = E_F(x, y) = 0 ∨ E_B(x, y) ≠ E_F(x, y)) ∧ C_F(x, y) ≠ ∅; 0 if C_F(x, y) = ∅   (7)

δ is a coefficient that varies in [0, 1]: it is equal to 1 if the region C_F around the point (x, y) contains a number of edge points greater than (or equal to) that of the region C_B (we are sure that this last value is always different from 0 because E_B(x, y) = 1 from Eq. 3). If the region C_F is empty, then δ is equal to 0. Formally:

δ = min( Σ_{C_F(x,y)} E_F(x, y) / Σ_{C_B(x,y)} E_B(x, y), 1 )   (8)
After this procedure, the value of M_SF is thresholded: if it is greater than an experimentally selected value, it means that the edges of the region extracted from the background image have corresponding edge points in the current grey-level image, so we can mark it as a removable region, i.e., it can be a shadow region, a reflection, or a ghost.
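A sketch of the matching measure of Eqs. (5)-(8) is given below. It takes precomputed binary edge maps as input (the paper uses the SUSAN detector; any edge detector can be substituted here), and the window size is an illustrative parameter.

```python
import numpy as np

def edge_matching_score(edges_bg, edges_cur, region_mask, win=2):
    """Compute M_SF for one candidate region: the fraction of background
    edge points inside the region that find edge support in the current
    image, with partial credit delta for near-misses (Eqs. 5-8)."""
    eb = (edges_bg & region_mask).astype(np.uint8)
    ef = edges_cur.astype(np.uint8)
    ys, xs = np.nonzero(eb)
    n_total = len(ys)                       # N of Eq. (5)
    if n_total == 0:
        return 0.0
    h, w = eb.shape
    score = 0.0
    for y, x in zip(ys, xs):
        y0, y1 = max(0, y - win), min(h, y + win + 1)
        x0, x1 = max(0, x - win), min(w, x + win + 1)
        cb = int(eb[y0:y1, x0:x1].sum())    # background edges in the window (>= 1)
        cf = int(ef[y0:y1, x0:x1].sum())    # current-image edges in the window
        if ef[y, x]:                        # exact match
            score += 1.0
        elif cf > 0:                        # partial match: delta of Eq. (8)
            score += min(cf / cb, 1.0)
        # an empty window in the current image contributes 0
    return score / n_total                  # M_SF of Eq. (6)

# Regions whose score exceeds an experimentally chosen threshold are
# marked as shadows/ghosts and removed from the foreground mask.
```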
3 Experimental Results
The proposed method was applied to two image sequences: an office sequence (1152 frames) and an outdoor sequence (1434 frames), each with a person moving in the scene. Outdoor images are characterized by sharp shadows, while in indoor images they are lighter; on the other hand, in the latter context the presence of several artifacts due to reflective surfaces makes improving the results an interesting challenge. We manually segmented some images from the sequences (34 from the first one and 41 from the second one); to do so, starting from the output of the motion detection algorithm proposed in [21], we labeled two different regions, Foreground and Shadows, including in the latter category also reflections and ghosts. Fig. 2 shows the results obtained in the two contexts (the first two rows refer to the indoor sequence, while the last two refer to the outdoor results). First, for each sequence, the background and current images are illustrated; then the output of the motion detection algorithm and its segmented version, obtained according to the procedure described in 2.1, are shown. It can be noted that the shadow has been correctly clustered into a single region. After this, the edges of the background and current images are plotted and, finally, the last images report the results after the elimination of the regions with a high percentage of matching edges. As is evident, the algorithm correctly removed shadow and reflection regions. In order to give a quantitative evaluation of the results, Table 1 reports a summary of the achieved results. For this evaluation, we considered three different classes of results, according to the capability of the algorithm both to correctly remove shadows and to not remove foreground objects. We manually verified the output images, distinguishing the following categories:
CORRECT: shadow regions have been removed for at least 80% of their total size, and erroneously deleted foreground points are less than 10%;
PARTIALLY CORRECT: shadow regions have been removed for at least 40% of their total size, and erroneously deleted foreground points are less than 30%;
INCORRECT: shadow regions have been removed for less than 40% of their total size, or deleted foreground points are more than 30%.
By analyzing the results reported in Table 1, it can be seen that in outdoor contexts, with well-contrasted shadows, the percentage of errors is very low, even if the number of partially correct results is greater than desired. On the contrary, in indoor contexts, in the presence of diffuse shadows and several reflections, the percentage of errors slightly increases, while the partially correct percentage is very low: in this context the separation between a good result and a poor segmentation is not so evident, and the threshold selection becomes a challenging task. This happens because the background is often homogeneous, without an evident texture, which leads to an incorrect segmentation of regions according to the procedure described in 2.1. Finally, in Table 2 we evaluated the reliability of the matching edges procedure. In particular, we manually segmented some images
Fig. 2. Results obtained in indoor and outdoor contexts
Table 1. The results obtained in two different contexts

CONTEXT   CORRECT   PARTIALLY CORRECT   INCORRECT
Outdoor   80.55%    14.23%              5.22%
Indoor    82.89%    7.34%               9.77%
Table 2. Mean number of matching edges

Context   Matching Edges in Shadow Regions (%)   Matching Edges in Foreground Regions (%)
Outdoor   87.43%                                 31.26%
Indoor    84.21%                                 28.41%
for each test sequence, and for those images we calculated the mean number of matching edges. As expected, in correspondence with shadow regions the matching percentage is very high, while it is very low in correspondence with foreground objects. This demonstrates that the threshold on matching edges can be easily selected, and it does not influence the overall reliability of the system.
4 Conclusions and Future Works
In this paper we presented an approach for shadow removal that can easily be added to a generic motion segmentation algorithm. First, the output of a motion detection algorithm is divided into subregions according to the similarity of the photometric gain exhibited by each point. The goal of this procedure is to separate actual foreground objects from their shadows and reflections. Then, an edge matching procedure is carried out with the goal of detecting (and removing) the shadow regions. To do this, edge points are evaluated both in the background image and in the current one, and regions that maintain the same edge structure are considered as shadow. Experimental results confirm the effectiveness of the proposed approach. As future work, we are going to carry out more experiments in different contexts. Moreover, a detailed analysis of the errors is in progress, with the goal of distinguishing those due to the segmentation procedure from those due to the edge matching approach.
References 1. Cucchiara, R., Grana, C., Piccardi, M., Prati, A.: Detecting moving objects, ghosts, and shadows in video streams. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1337–1342 (2003) 2. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.: Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE 90, 1151–1163 (2002) 3. Salvador, E., Cavallaro, A., Ebrahimi, T.: Cast shadow segmentation using invariant color features. Computer Vision and Image Understanding 95, 238–259 (2004) 4. Martel-Brisson, N., Zaccarin, A.: Moving cast shadow detection from a gaussian mixture shadow model. In: Proc. of IEEE Conf. on CVPR, pp. 643–648 (2005) 5. Martel-Brisson, N., Zaccarin, A.: Learning and removing cast shadows through a multidistribution approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1133–1146 (2007) 6. Horprasert, T., Harwood, D., Davis, L.: A statistical approach for real-time robust background subtraction and shadow detection. In: Proceedings of ICCV FrameRate Workshop (1999) 7. Siala, K., Chakchouk, M., Besbes, O., Chaieb, F.: Moving shadow detection with support vector domain description in the color ratios space. In: Proceedings of International Conference Pattern Recognition, pp. 384–387 (2004) 8. Hsieh, J., Hu, W., Chang, C., Chen, Y.: Shadow elimination for effective moving object detection by gaussian shadow modeling. Image and Vision Computing 21, 505–516 (2003) 9. Xu, D., Liu, J., Li, X., Liu, Z., Tang, X.: Insignificant shadow detection for video segmentation. IEEE Transactions on Circuits Systems for Video Technology 15, 1058–1064 (2005) 10. Rosin, P., Ellis, T.: Image difference threshold strategies and shadow detection. In: Proceedings of British Machine Vision Conference, pp. 347–356 (1995) 11. Zhang, W., Fang, X., Yang, X.: Moving cast shadows detection using ratio edge. IEEE Transactions on Multimedia 9, 1202–1214 (2007)
12. Toth, D., Stuke, I., Wagner, A., Aach, T.: Detection of moving shadows using mean shift clustering and a significance test. In: Proc. of ICPR, pp. 260–263 (2004) 13. Chien, S.Y., Ma, S.Y., Chen, L.G.: Efficient moving object segmentation algorithm using background registration technique. IEEE Transactions on Circuits Systems for Video Technology 12, 577–586 (2002) 14. Leone, A., Distante, C.: Shadow detection for moving objects based on texture analysis. Pattern Recognition 40, 1222–1233 (2007) 15. Tian, Y., Lu, M., Hampapur, A.: Robust and efficient foreground analysis for realtime video surveillance. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 1182–1187 (2005) 16. Jacques Jr., J., Jung, C.R., Musse, S.: A background subtraction model adapted to illumination changes. In: Proc. of IEEE ICIP, pp. 1817–1820 (2006) 17. Wang, Y., Tan, T., Loe, K., Wu, J.: A probabilistic approach for foreground and shadow segmentation in monocular image sequences. Pattern Recognition 38, 1937–1946 (2005) 18. Stauder, J., Ostermann, R.: Detection of moving cast shadows for object segmentation. IEEE Transactions on Multimedia 1, 65–76 (1999) 19. McKenna, S., Jabri, S., Duric, Z., Rosenfeld, A., Wechsler, H.: Tracking groups of people. Computer Vision and Image Understanding 80, 42–56 (2000) 20. Joshi, A., Papanikolopoulos, N.: Learning to detect moving shadows in dynamic environments. IEEE Trans. on Patt. Anal. and Mach. Int. 30, 2055–2063 (2008) 21. Rosito, C.J.: Efficient background subtraction and shadow removal for monochromatic video sequences. IEEE Transactions on Multimedia 11, 571–577 (2009) 22. Yang, M.T., Lo, K.H., Chiang, C.C., Tai, W.K.: Moving cast shadow detection by exploiting multiple cues. IET Image Processing 2, 95–104 (2008) 23. Prati, A., Mikic, I., Trivedi, M., Cucchiara, R.: Detecting moving shadows: Algorithms and evaluation. IEEE Trans. on Patt. Anal. and Mach. Intell. 25, 918–923 (2003) 24. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Academic Press, London. ISBN 0-12-686140-4 25. Smith., S.M.: A new class of corner finder. In: Proc. of 3rd British Machine Vision Conference, pp. 139–148 (1992)
Online Video Textures Generation Wentao Fan and Nizar Bouguila Institute for Information Systems Engineering University of Concordia Montreal, Canada {wenta_fa,bouguila}@ciise.concordia.ca
Abstract. In this paper, we propose two different online approaches for synthesizing video textures. The first approach is based on incremental Isomap and an autoregressive (AR) process. It can generate good quality results; nonetheless, it does not extend well to videos that are more sparse, such as cartoons. Our second online video texture generation approach exploits incremental spatio-temporal Isomap (IST-Isomap) and an AR process. This second approach provides more efficient and better quality results for sparse videos than the first approach. IST-Isomap, which we propose here, is an extension of spatio-temporal Isomap (ST-Isomap) and incremental Isomap. It preserves the spatio-temporal coherence of the data set and can be applied in an incremental mode. Compared with other video texture generation approaches, both of our online approaches are able to synthesize new video textures incrementally, which in turn offers advantages (e.g., speed and efficiency) in applications where data are obtained sequentially. Keywords: Video texture, computer vision, dimensionality reduction, Isomap, incremental Isomap, autoregressive process.
1 Introduction and Related Works
Video textures, introduced by Schödl et al. [1], are a new type of medium that can generate a continuous, infinitely varying stream of images from a recorded video. They are very useful in the movie and game industries since they make it possible to create new content by reusing existing resources. Further applications related to video textures can be found in [2] [3] [4] [5]. Fitzgibbon [6] introduced a new method for creating video textures by applying principal component analysis (PCA) and an autoregressive (AR) process [7]. As a result, all frames in the generated video are new and consistent with the motions in the original video, without any 'dead ends'.1 Later, Campbell et al. [8] extended this approach by applying a spline and a combined appearance model in order to work with strongly non-linear sequences.
1 In the original video texture algorithm, a 'dead end' is a scenario where no good transition point can be found, so that for some part of the video there is no graceful exit.
To the best of our knowledge, all video texture generation methods so far work in a batch mode; that is, all the frame data have to be collected before the generation process starts. The drawback of batch video texture methods is that, in situations where the training images become available sequentially, such as surveillance, the computations become prohibitive. Another problem is that we have to recompute the low-dimensional representation of the whole data set in order to update the lower-dimensional subspace of frame 'signatures' whenever a new image is observed. In this paper, we propose two different approaches that allow video textures to be generated incrementally. We are inspired by Fitzgibbon's work [6], where a dimensionality reduction technique and an AR process were used to synthesize new video textures. In our case, we extend Fitzgibbon's method by applying incremental Isomap (recently proposed in [9]) and an AR process to establish an incremental version of the video texture generation process. This approach produces good results for most videos except for comparatively sparse video data such as cartoons. The reason is that cartoon data are inherently more sparse and also contain exaggerated deformations between frames. Spatio-temporal Isomap (ST-Isomap) [11] has been applied by Juan et al. [10] to generate better results for cartoon textures than the original Isomap, since it maintains the temporal coherence of the data set while finding the low-dimensional embedding. Thus, ST-Isomap is appropriate for dealing with cartoon data. Unfortunately, it still operates in a batch mode. Based on the idea of incremental Isomap, we propose an extension of ST-Isomap named incremental spatio-temporal Isomap (IST-Isomap). It preserves the spatio-temporal coherence of the data set and can be applied incrementally. Our second approach applies IST-Isomap and an AR process, which produces better quality results for sparse videos (e.g., cartoons). Based on our experiments, both of our methods can process the image frame data sequentially and more efficiently than batch methods in terms of computational cost and runtime. The rest of this paper is organized as follows: the original Isomap and incremental Isomap are briefly introduced in Section 2. Section 3 explains the IST-Isomap algorithm. Experimental results for generating online video textures are presented in Section 4. Finally, conclusions and future work are described in Section 5.
2 Isomap and Incremental Isomap
2.1 Isomap
Isometric feature mapping (Isomap) [12] is a global nonlinear dimensionality reduction technique, which reduces the dimensionality by searching for a low-dimensional manifold hidden in the observation space. The Isomap algorithm comprises three major steps. The first step is to determine the neighborhood relations of all data points by constructing a weighted graph. The next step is to compute the geodesic distances between all pairs of data points on the manifold, which can be approximated by the shortest path between the corresponding
vertices in the neighborhood graph. The last step is to apply classical multidimensional scaling (MDS) [13] to recover the coordinates of the input data in a lower-dimensional space.
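These three steps translate into a short sketch using standard NumPy/SciPy routines; the neighbourhood size and the assumption of a connected neighbourhood graph are illustrative choices, and this is not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=7, n_components=2):
    """Batch Isomap: knn graph -> geodesic distances -> classical MDS.
    Assumes the resulting neighbourhood graph is connected."""
    d = squareform(pdist(X))                          # pairwise Euclidean distances
    n = d.shape[0]
    graph = np.full((n, n), np.inf)
    idx = np.argsort(d, axis=1)[:, 1:n_neighbors + 1] # k nearest neighbours of each point
    for i in range(n):
        graph[i, idx[i]] = d[i, idx[i]]
    graph = np.minimum(graph, graph.T)                # symmetrise the neighbourhood graph
    geo = shortest_path(graph, method='D')            # geodesic distances (Dijkstra)
    # Classical MDS on the geodesic distance matrix.
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (geo ** 2) @ H
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```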
2.2 Incremental Isomap
Incremental Isomap, proposed by Law et al. [9], is a modified version of the original Isomap. Assume that we have already obtained the low-dimensional coordinates y_i of x_i for the first n data points. Later, when a new data point x_{n+1} is observed in the data sequence, we need to update the existing set of low-dimensional coordinates y_i and estimate y_{n+1}. Similarly to the original Isomap algorithm, there are three steps in incremental Isomap. The first step is to update the neighborhood graph and then update the geodesic distances g_{ij} in view of the change of the neighborhood graph when the new data point x_{n+1} is observed. The neighborhood graph is defined by the k-nearest neighbors (knn) algorithm. Each vertex in the graph represents a data point. The Euclidean distance between two vertices v_i and v_j is denoted Δ_{ij}. If v_{n+1} is in v_i's knn neighborhood, an edge between vertex v_i and the new vertex v_{n+1} is added; meanwhile, the edge between v_i and its previous kth nearest neighbor is deleted. Then, the geodesic distance between v_{n+1} and the other vertices is updated as

g_{n+1,i} = g_{i,n+1} = min_j (g_{ij} + Δ_{j,n+1})   (1)
where j ranges over the indices of the vertices that have an added edge to v_{n+1}. The second step is to use the geodesic distances between the new point and the existing points to estimate the low-dimensional coordinate y_{n+1} of x_{n+1}. We can find y_{n+1} by matching its inner products with the y_i to the values obtained from the geodesic distances. The last step is to update the whole existing set of low-dimensional coordinates y_i after obtaining the new geodesic distances g_{ij}. This can be treated as an incremental eigenvalue problem, which can be solved by eigen-decomposition. The detailed algorithm of incremental Isomap is described in [9].
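The geodesic update of Eq. (1) for a newly observed point can be sketched as below; this is a simplified illustration of the first step only (the full algorithm in [9] also revises geodesics affected by deleted edges and then updates the embedding).

```python
import numpy as np

def update_geodesics(geo, new_edges):
    """Append one row/column of geodesic distances for a new point x_{n+1}.
    `geo` is the (n, n) geodesic matrix of the existing points and
    `new_edges` maps neighbour index j -> Euclidean distance Delta(j, n+1);
    it is assumed to contain at least one neighbour."""
    n = geo.shape[0]
    g_new = np.full(n + 1, np.inf)
    for i in range(n):
        # Eq. (1): route through any neighbour j of the new vertex.
        g_new[i] = min(geo[i, j] + d for j, d in new_edges.items())
    g_new[n] = 0.0
    out = np.full((n + 1, n + 1), np.inf)
    out[:n, :n] = geo
    out[n, :] = out[:, n] = g_new
    return out
```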
3 Incremental Spatio-Temporal Isomap
Incremental Isomap works well for uncovering the low-dimensional intrinsic data structure when data are collected sequentially. However, it is incapable of considering the temporal coherence of the data set. Spatio-temporal Isomap (ST-Isomap) [11], an extension of the original Isomap framework, includes temporal relationships in the data set. Nevertheless, it works in a batch mode and is not appropriate when data are observed sequentially. Therefore, we propose a new method, named incremental spatio-temporal Isomap (IST-Isomap), which is able to account for temporal coherence and to operate in an incremental manner. IST-Isomap inherits the framework of ST-Isomap and augments it with the ability to deal with sequentially collected data efficiently.
3.1 Spatio-Temporal Isomap
Similarly to the framework of the original Isomap, the algorithm for ST-Isomap [11] comprises five steps. The first step is the windowing of the input data into temporal blocks in order to give a temporal history to each individual data point. This windowing step results in a sequentially ordered set of data points. The second step is to identify the neighborhood relationships and build the neighborhood graph. The neighborhood graph in ST-Isomap is constructed based on common temporal neighbor (CTN) relationships. CTNs are determined by the K-nearest nontrivial neighbors (KNTN) using the Euclidean distance. A vertex v_a is considered to be a nontrivial match of v_b if a ≠ b + 1, a ≠ b, and Δ_{b,a} ≤ Δ_{b,τ_b}, where τ_b is the index of the kth most similar nontrivial neighbor of v_b. We use D to denote the Euclidean distance matrix of the data set, with each element Δ_{ij} representing the distance between two points i and j. In the third step, the distance matrix D between data points is reduced according to the spatio-temporal correspondences. Besides CTN, another temporal relationship between data points is defined as adjacent temporal neighbors (ATN). ATNs are the adjacent points in the sequential order of the data set. The reduced distance is calculated by dividing by c_CTN and c_ATN, constant factors that control how strong the temporal relationships between data pairs are. As c_CTN or c_ATN increases, the distance between data pairs with spatio-temporal coherence decreases and their similarity increases. In the fourth step, the geodesic distances are calculated by applying Dijkstra's algorithm. Finally, classical MDS is applied to construct an embedding of the data in a low-dimensional space, as in the original Isomap.
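A sketch of this spatio-temporal reduction of the distance matrix is given below; how the CTN pairs are supplied and the constant values are assumptions for illustration.

```python
import numpy as np

def reduce_distances(D, ctn_pairs, c_ctn=10.0, c_atn=5.0):
    """Divide the pairwise distance of spatio-temporally related points:
    common temporal neighbours (CTN) by c_ctn, adjacent temporal
    neighbours (ATN, consecutive frames) by c_atn."""
    Dr = D.astype(float).copy()
    n = Dr.shape[0]
    for i in range(n - 1):                  # ATN: consecutive points in the sequence
        Dr[i, i + 1] /= c_atn
        Dr[i + 1, i] /= c_atn
    for i, j in ctn_pairs:                  # CTN pairs found by the KNTN search
        Dr[i, j] /= c_ctn
        Dr[j, i] /= c_ctn
    return Dr
```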
3.2 Incremental Version of Spatio-Temporal Isomap
IST-Isomap inherits the framework of incremental Isomap and ST-Isomap. Suppose that we have already obtained the low-dimensional coordinates y_i of x_i for the first n data points. The low-dimensional coordinates of the existing data set need to be updated when a new data point is observed. In IST-Isomap, since we assume that data are collected sequentially, it is unnecessary to window the input data as in the ST-Isomap algorithm. Therefore, each data point is considered as a single block, as in ST-Isomap. The algorithm for IST-Isomap includes four steps. First, update the neighborhood graph. In IST-Isomap, once a new point x_{n+1} is observed, the existing neighborhood graph has to be updated with respect to the new vertex v_{n+1}. The neighborhood graph is defined by the KNTN algorithm, and each vertex in the graph represents a data point. Let τ_i be the index of the kth most similar nontrivial neighbor of v_i. It is straightforward to add a new edge from v_i to the new vertex v_{n+1} if v_{n+1} belongs to the CTN of v_i. The edge from v_i to v_{τ_i} is deleted if Δ_{i,τ_i} > Δ_{i,n+1} and Δ_{τ_i,i} > Δ_{τ_i,ι_i}, where ι_i is the index of the kth most similar nontrivial neighbor of v_{τ_i} after inserting v_{n+1}. A new distance matrix D representing the neighborhood graph is obtained after this step.
Second, reduce the distance matrix D between data points with spatio-temporal coherence. This step is similar to the corresponding step of ST-Isomap, except that we only need to reduce the distances between vertex pairs whose edges have been deleted or added. Third, update the geodesic distance matrix. Since the deleted and added edges may affect the geodesic distances of some vertices, it is important to update those vertices with the new distances and then to calculate the geodesic distances between v_{n+1} and the other vertices using (1). Finally, the same algorithm as in incremental Isomap [9] is applied: the low-dimensional coordinates y_{n+1} of the new data point are calculated, and then the low-dimensional coordinates of the whole existing data set are updated with respect to the new geodesic distances.
4 Experimental Results
In our experiments, we implemented incremental Isomap and IST-Isomap on several input videos in order to produce online video textures.2 Our goal is to compare the quality and efficiency of the batch and incremental methods for generating video textures. For instance, one of the input videos we used is a clip of a person moving a pen (Fig. 1). It is 15 seconds long and contains 450 frames with a size of 160 × 128 (20480) pixels per frame. Each frame thus becomes a vector with 20480 dimensions, and the total image matrix for the 450 frames is a 450 × 20480 matrix, with images in the rows and dimensions in the columns.
Fig. 1. The original input video
4.1 Experimental Results by Incremental Isomap
In the preprocessing (initialization) step, we are given the first 50 frames of an input video and extract the signatures of these frames by applying the original Isomap. In our experiments, a dimensionality of 30 to 40 is good enough for representing each individual frame. This is done with the batch version of Isomap, with a knn neighborhood of size 7. After this step, the rest of the frames in the input video are observed one by one
2 Some of our input test movies were obtained from the "Video Textures" web site: http://www.cc.gatech.edu/cpl/projects/videotexture
sequentially. Thus, incremental Isomap is applied for the online extraction of the signatures of the upcoming frames. After the preprocessing step, the rest of the process for generating new video textures consists of three major steps. The first step is to apply incremental Isomap to extract the signature of each new frame coming from the input video. The next step is to apply an AR process to predict new frame signatures based on the signatures obtained in the previous step. Here, we have a time series consisting of a sequence of frame signatures {y_n}, where n = 1, ..., N and N is the current total number of frames. A zero-mean AR process of order p for a series of frame signatures in a d-dimensional space may be modelled by

y_n = Σ_{k=1}^{p} A_k y_{n−k} + w   (2)
where the d × d matrices A ≡ (A_1, ..., A_p) are the coefficient matrices of the AR process model, and w is a d-dimensional random vector drawn from a zero-mean Gaussian white noise distribution. In our case, the order p is set to 2, which means that the next frame signature is generated from the previous two signatures. Normally, in our experiments, we use the previously predicted frame signature and the new frame signature to predict the next frame. By combining the predicted signature with the new upcoming signature, the level of noise in the result is reduced.
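A possible sketch of this AR(2) modelling of frame signatures is shown below; the least-squares estimator and the noise handling are illustrative assumptions, since the paper does not detail the fitting procedure.

```python
import numpy as np

def fit_ar2(Y):
    """Fit a zero-mean AR(2) model y_n = A1 y_{n-1} + A2 y_{n-2} + w to a
    sequence of d-dimensional frame signatures Y (shape (N, d)) by least squares."""
    d = Y.shape[1]
    past = np.hstack([Y[1:-1], Y[:-2]])          # rows [y_{n-1}, y_{n-2}], shape (N-2, 2d)
    target = Y[2:]                               # rows y_n
    coef, *_ = np.linalg.lstsq(past, target, rcond=None)
    A1, A2 = coef[:d].T, coef[d:].T
    noise_std = (target - past @ coef).std(axis=0)
    return A1, A2, noise_std

def predict_next(A1, A2, y_prev, y_prev2, noise_std, rng=np.random.default_rng()):
    """Generate the next frame signature from the two most recent ones."""
    return A1 @ y_prev + A2 @ y_prev2 + rng.normal(0.0, noise_std)
```

The predicted signatures are then projected back to image space (the inverse of the dimensionality reduction step) to form the new frames.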
Table 1. Runtime (seconds) for generating video textures by the original batch Isomap and incremental Isomap

n     Dist.   Batch   Incremental   Speed-improved
50    0.19    10.45   3.78          63.83%
100   0.56    16.15   7.41          54.12%
150   0.83    21.54   11.35         47.31%
200   1.36    28.78   16.84         41.49%
250   1.88    36.12   20.09         44.38%
Table 1 compares the runtime (in seconds) for generating video textures with batch Isomap and with incremental Isomap. Here, n is the number of new frames synthesized as the new video texture, and "Dist." is the time spent on distance computation for the n frames. The last step is to project the new predicted frame signatures back to the image space and to compose them together as a video texture. All of our experiments were carried out in Matlab on a Windows platform. Fig. 2 shows the first three synthesized frames of the results obtained by applying incremental Isomap to generate online video textures. Fig. 3 shows the frame signatures produced by incremental Isomap. We also tested our approach on other video clips, such as a fountain. The result can be viewed in Fig. 4, and Table 2 reports the corresponding runtime comparison.
Fig. 2. The first three frames synthesized by incremental Isomap
Fig. 3. The frame signature synthesized by incremental Isomap
Fig. 4. The first three frames synthesized by incremental Isomap

Table 2. Runtime (seconds) for generating video textures by the original batch Isomap and incremental Isomap

n     Dist.   Batch   Incremental   Speed-improved
50    0.17    11.59   4.50          61.17%
100   0.40    15.32   6.93          54.77%
150   0.79    18.77   13.38         28.72%
200   1.22    25.04   17.67         29.43%
250   1.63    31.28   21.10         32.54%
From these experimental results, we can conclude that applying incremental Isomap to generate video textures is much faster than applying the original Isomap when the input data become available sequentially. Unfortunately,
noise still exists, which in turn causes the frames to become blurred after some period of time; in the short run, however, this provides very good results.
4.2 Experimental Results by IST-Isomap
In this section, we apply IST-Isomap to generate online video textures for cartoon data. In our experiments, we set the KNTN parameter to 3, and the values of the constant factors c_CTN and c_ATN to 10 and 5, respectively (in our case, this provides the best results). The result is compared with the one generated by incremental Isomap, as shown in Fig. 5. The noise is clearly noticeable in the 40th frame for incremental Isomap but not visible for IST-Isomap. Table 3 compares batch Isomap, incremental Isomap, and IST-Isomap in terms of runtime (seconds) for generating video textures. According to the results, IST-Isomap is more robust to noise and more efficient than incremental Isomap for producing online video textures from cartoon data, since such data are inherently sparse and include exaggerated deformations between frames.
Fig. 5. Comparison of the online video texture results generated by IST-Isomap and incremental Isomap for a cartoon animation: (a) and (b) are the 30th and 40th frames synthesized by IST-Isomap; (c) and (d) are the 30th and 40th frames synthesized by incremental Isomap
Table 3. Runtime (seconds) for generating video textures by the original batch Isomap, incremental Isomap, and IST-Isomap

n     Dist.   Batch   Incremental   Incremental ST
50    0.24    13.89   6.19          7.62
100   0.68    18.24   10.22         10.03
150   1.05    24.17   14.09         13.79
200   1.51    34.51   20.78         18.47
250   1.97    40.56   26.42         24.61
5 Conclusion and Future Works
In this paper, we have proposed two incremental ways of generating online video textures by applying incremental Isomap and IST-Isomap. Both approaches can produce good online video texture results. In particular, IST-Isomap is more suitable for sparse video data (e.g., cartoons). Generating online video textures is extremely useful in scenarios where data points become available sequentially (e.g., surveillance). Based on our experimental results, the runtime for creating new video textures is much lower with incremental Isomap and IST-Isomap than with the original Isomap in batch mode. The results are not only visually appealing, with motions similar to the original video, but also contain frames that have never appeared before. Unfortunately, noise may still emerge in the long run and blur the synthesized frames. Thus, one direction of future work is to build a more robust statistical model in order to generate video textures with less noise, still in an incremental fashion. Acknowledgments. The completion of this research was made possible thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC) and a NATEQ Nouveaux Chercheurs Grant.
References 1. Schödl, A., Szeliski, R., Salesin, D., Essa, I.: Video Textures. In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 489–498. ACM Press/Addison-Wesley Publishing Co., New York (2000) 2. Schödl, A., Essa, I.: Machine Learning for Video-based Rendering. In: Advances in Neural Information Processing Systems, vol. 13, pp. 1002–1008 (2001) 3. Schödl, A., Essa, I.: Controlled Animation of Video Sprites. In: Proceedings of the 2002 ACM SIGGRAPH/Eurographics symposium on Computer animation, pp. 121–127. ACM, New York (2002) 4. Agarwala, A., Zheng, C., Pal, C., Agrawala, M., Cohen, M., Curless, B., Salesin, D., Szeliski, R.: Panoramic Video Textures. In: ACM Transactions on Graphics, pp. 821–827. ACM, New York (2005) 5. Li, Y., Wang, T., Shum, H.: Motion Texture: A Two-Level Statistical Model for Character Motion Synthesis. In: ACM Transactions on Graphics, pp. 456–472. ACM, New York (2002) 6. Fitzgibbon, A.W.: Stochastic Rigidity: Image Registration for Nowhere-static Scenes. In: Proceedings of International Conference on Computer Vision (ICCV) 2001, vol. 1, pp. 662–669 (2001) 7. Pandit, S.M., Wu, S.M.: Time Series and System Analysis with Applications. John Wiley & Sons Inc., Chichester (1983) 8. Campbell, N., Dalton, C., Gibson, D., Thomas, B.: Practical Generation of Video Textures using the Auto-regressive Process. Image and Vision Computing 22, 819–827 (2004) 9. Law, H.C., Jain, A.K.: Incremental Nonlinear Dimensionality Reduction by Manifold Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 377–391 (2006)
10. Juan, C., Bodenheimer, B.: Cartoon Textures. In: Proceedings of the 2004 ACM SIGGRAPH/Eurographics symposium on Computer animation, pp. 267–276 (2004) 11. Jenkins, O.C., Mataric, M.J.: A Spatio-temporal Extension to Isomap Nonlinear Dimension Reduction. In: Proceedings of the twenty-first international conference on Machine learning, pp. 441–448. ACM, New York (2004) 12. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 2319–2323 (2000) 13. Cox, T.F., Cox, M.A.A.: Multidimensional Scaling. Chapman and Hall, Boca Raton (2000)
Deformable 2D Shape Matching Based on Shape Contexts and Dynamic Programming Iasonas Oikonomidis and Antonis A. Argyros Institute of Computer Science, Forth and Computer Science Department, University of Crete {oikonom,argyros}@ics.forth.gr http://www.ics.forth.gr/cvrl/
Abstract. This paper presents a method for matching closed, 2D shapes (2D object silhouettes) that are represented as an ordered collection of shape contexts [1]. Matching is performed using a recent method that computes the optimal alignment of two cyclic strings in sub-cubic runtime. Thus, the proposed method is suitable for efficient, near real-time matching of closed shapes. The method is qualitatively and quantitatively evaluated using several datasets. An application of the method for joint detection in human figures is also presented.
1 Introduction
Shape matching is an important problem of computer vision and pattern recognition, which can be defined as the establishment of a similarity measure between shapes and its use for shape comparison. A byproduct of this task may also be a set of point correspondences between shapes. The problem has significant theoretical interest. Shape matching that is intuitively correct for humans is a demanding problem that remains unsolved in its full generality. Applications of shape matching include, but are not limited to, object detection and recognition, content-based retrieval of images, and image registration. Many research efforts have been devoted to solving the shape matching problem. Felzenszwalb et al. [2] propose the representation of each shape as a tree, with each level representing a different spatial scale of description. They also propose an iterative matching scheme that can be efficiently solved using Dynamic Programming. Ebrahim et al. [3] present a method that represents a shape based on the occurrence of shape points on a Hilbert curve. This 1D signal is then smoothed by keeping the largest coefficients of a wavelet transform, and the resulting profiles are matched by comparing selected key regions. Belongie et al. [1] approach the problem of shape matching by introducing the shape context, a local shape descriptor that samples selected edge points of a figure in log-polar space. The resulting histograms are compared using the χ2 statistic. Matches between corresponding points are established by optimizing the sum of matching costs using weighted Bipartite Matching (BM). Finally, a Thin Plate Spline (TPS) transformation is estimated that warps the points of the first shape to
the second, based on the identified correspondences. This process is repeated for a fixed number of iterations, using the resulting deformed shape of each step as input for the next one. A very interesting work that utilizes shape contexts is presented in [4]. The goal of that work is to exploit the articulated nature of many common shapes to improve shape matching. The authors suggest that the distances and angles to be sampled should be measured only inside the closed contour of a figure. In this work we are interested in the particular problem of matching deformable object silhouettes. The proposed method is based on shape contexts and the work of Belongie [1]. It is assumed that a 2D shape can be represented as a single closed contour. This is very often the case when, for example, shapes are derived from binary foreground masks resulting from a background subtraction process or from some region-based segmentation process. In this context, shape matching can benefit from the knowledge of the ordering of silhouette points, a constraint that is not exploited by the approach of Belongie [1]. More specifically, in that case, two silhouettes can be matched in sub-cubic runtime using a recently published algorithm [5] that performs cyclic string matching employing dynamic programming. The representational power of shape contexts, combined with the capability of the matching algorithm to exploit the order in which points appear on a contour, results in an effective and efficient shape matching method. Several experiments have been carried out to assess the effectiveness and the performance of the proposed method on benchmark datasets. The method is quantitatively assessed through the bull's-eye test applied to the MPEG7 CE-shape-1 part B dataset. More shape retrieval experiments have been carried out on the "gestures" and "marine" datasets. Additionally, the proposed shape matching method has been employed to detect the articulation points (joints) of a human figure in monocular image sequences. Specifically, 25 human postures have been annotated with human articulation points. Shape matching between a segmented figure and the prototype postures results in point correspondences between the human figure and its best matching prototype. Then TPS transfers known points from the model to the observed figure. Overall, the experimental results demonstrate that the proposed method performs very satisfactorily in diverse shape matching applications and that the performance of shape matching can be improved when the order of points on a contour is exploited. Additionally, its low computational complexity makes it a good candidate for shape matching applications requiring real-time performance. The rest of the paper is organized as follows. The proposed method is presented in Sec. 2. Experimental results are presented in Sec. 3. Finally, Sec. 4 summarizes the main conclusions of this work.
2 The Proposed Shape Matching Method
The proposed method utilizes shape contexts to describe selected points on a given shape. A fixed number of n points are sampled equidistantly on the contour
of each shape. For each of these points, a shape context descriptor is computed. To compare two shapes, each descriptor of the first shape is compared, using the χ2 statistic, to all the descriptors of the second, giving rise to pairwise matching costs. These costs form the input to the cyclic string matching, and correspondences between the shapes are established. These correspondences are used to calculate a Thin Plate Splines based alignment of the two shapes. A weighted sum of the cyclic matching cost and the TPS transformation energy forms the final distance measure between the two shapes. The rest of this section describes the above algorithmic steps in more detail.
2.1 Scale Estimation and Point Order
The first step of the method is to perform a rough scale estimation of the input shape. As in [1], the mean distance between all point pairs is evaluated and the shape is scaled accordingly. Denoting the ith input point as pt_i, the scale a is estimated as

a = (2 / (n(n−1))) Σ_{i=1}^{n} Σ_{j=i+1}^{n} ||pt_i − pt_j||   (1)

Then, every input point is scaled by 1/a. The order (clockwise/counterclockwise) in which silhouette points are visited may affect the process of shape matching. Therefore, we adopt the convention that all shapes are represented using a counter-clockwise order of points. To achieve this, the signed area of the polygon is calculated as

A = (1/2) Σ_{i=1}^{n} (x_i y_{i+1} − x_{i+1} y_i)   (2)

with x_{n+1} = x_1 and y_{n+1} = y_1. If A is negative, the order of the input points is reversed.
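Eqs. (1) and (2) translate directly into a few lines of code; the sketch below assumes the sampled contour is given as an (n, 2) array of (x, y) points.

```python
import numpy as np
from scipy.spatial.distance import pdist

def normalize_contour(pts):
    """Scale a closed contour by the mean pairwise distance (Eq. 1) and
    enforce counter-clockwise point order via the signed area (Eq. 2)."""
    pts = np.asarray(pts, dtype=float)
    a = pdist(pts).mean()                   # mean distance over all point pairs
    pts = pts / a
    x, y = pts[:, 0], pts[:, 1]
    signed_area = 0.5 * np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y)
    if signed_area < 0:                     # clockwise -> reverse to counter-clockwise
        pts = pts[::-1]
    return pts
```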
2.2 Rotation Invariant Shape Contexts
For the purposes of this work, rotation invariance is a desirable property of shape matching. As mentioned in [1], since each shape context histogram is calculated in a log-polar space, rotation invariance can be achieved by adjusting the angular reference frame to an appropriately selected direction. A direction one can use for imposing rotation invariance in shape contexts, is the local tangent of the contour. In this work this direction is estimated using cubic spline interpolation. First, the 2D curve is fitted by a cubic spline model. Cubic splines inherently interpolate functions of the form f : IR → IR. It is easy to extend this to interpolate parametric curves on the plane (functions of the form γ : IR → IR2 ), by concatenating two such models. The next step is to compute the derivatives of the two cubic spline models at each point of interest. For each such pair of derivatives, the local tangent is computed by taking the generalized arc tangent function with two arguments. This method has the advantage that the computed
angles are consistently aligned not only with a good estimate of the local derivative, but also with a consistent direction. The estimated local contour orientation is then used as the reference direction of the shape contexts, achieving descriptions that are rotationally invariant.
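A sketch of this tangent estimation with periodic cubic splines is given below; parameterizing the closed contour by the sample index is an assumption of this illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def contour_tangents(pts):
    """Estimate the local tangent direction (radians) at each sampled
    contour point by fitting periodic cubic splines to x(t) and y(t)."""
    pts = np.asarray(pts, dtype=float)
    n = len(pts)
    t = np.arange(n + 1)
    closed = np.vstack([pts, pts[:1]])      # close the curve for the periodic boundary
    sx = CubicSpline(t, closed[:, 0], bc_type='periodic')
    sy = CubicSpline(t, closed[:, 1], bc_type='periodic')
    dx = sx(np.arange(n), 1)                # first derivatives at the sample points
    dy = sy(np.arange(n), 1)
    return np.arctan2(dy, dx)               # two-argument arc tangent gives a signed direction
```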
2.3 Cyclic Matching
The comparison of a pair of shape contexts can be performed with a number of different histogram comparison methods. In this work, the χ2 statistic is selected, as in [1]:

χ2(h_1, h_2) = (1/2) Σ_{k=1}^{K} [h_1(k) − h_2(k)]² / (h_1(k) + h_2(k))   (3)
where h_1 and h_2 are the compared histograms, each having K bins. The comparison of two shapes is performed by considering a 2D matrix C. The element (i, j) of this matrix is the χ2 statistic between the ith shape context of the first shape and the jth shape context of the second shape. Any such pair is a potential correspondence. Belongie et al. [1] use Bipartite Matching to establish a set of 1-to-1 point correspondences between the shapes. However, by exploiting the order that is naturally imposed by the contour, the search space can be significantly reduced. For the purpose of matching, we adopt the method presented in [5]. The matrix C of χ2 shape context comparisons forms the matching cost matrix needed for the cyclic matching. Along with the matching pairs, a matching cost c_m is calculated as the sum of the costs of all the aligning operations that were used. Thus, c_m can be used as a measure of the distance between the two shapes.
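The cost-matrix construction of Eq. (3) and a simple dynamic-programming alignment are sketched below. The brute-force loop over cyclic shifts shown here is only an O(n^3) stand-in for illustration; the method actually relies on the sub-cubic cyclic string matching algorithm of [5]. The gap penalty is the insertion/deletion cost quoted in Sec. 3.

```python
import numpy as np

def chi2_costs(H1, H2):
    """Pairwise chi-square costs (Eq. 3) between two sets of shape contexts,
    given as (n, K) histogram arrays."""
    a = np.asarray(H1, dtype=float)[:, None, :]
    b = np.asarray(H2, dtype=float)[None, :, :]
    den = a + b
    den[den == 0] = 1.0                     # empty bin pairs contribute nothing
    return 0.5 * np.sum((a - b) ** 2 / den, axis=2)

def string_align(C, gap=0.75):
    """Edit-distance style DP alignment of two point sequences with
    substitution costs C and insertion/deletion penalty `gap`."""
    n, m = C.shape
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1) * gap
    D[0, :] = np.arange(m + 1) * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j - 1] + C[i - 1, j - 1],
                          D[i - 1, j] + gap,
                          D[i, j - 1] + gap)
    return D[n, m]

def cyclic_match_cost(C, gap=0.75):
    """Brute-force cyclic matching: try every cyclic shift of the second
    shape and keep the cheapest alignment (illustration only)."""
    return min(string_align(np.roll(C, -s, axis=1), gap) for s in range(C.shape[1]))
```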
2.4 Thin Plate Spline Computation
The final step of the presented shape matching method is the computation of the planar deformation that aligns two shapes. The alignment is performed using Thin Plate Splines. The input to this stage is the result of the previous step, i.e. a set of pairs of correspondences between two 2D shapes. The output is a deformation of the plane, as well as a deformation cost. This cost can be properly weighted along with the cost of the previous step to form the final matching cost or distance between the shapes. The regularized version of the TPS model is used, with a parameter λ that acts as a smoothness factor. The model tolerates higher noise levels for higher values of λ and vice versa. Since the scale of all shapes is roughly estimated at the first step of the method, the value of λ can be uniformly set to compensate for a fixed amount of noise. For all experiments, λ was fixed to 1, as in [1]. Besides the warping between the compared shapes, a total matching cost D is computed as

D = l_1 c_m + l_2 c_b .   (4)

D is a weighted sum of the cyclic matching cost c_m and the TPS bending cost c_b. While c_b has the potential to contribute information not already captured by c_m,
in practice it proved sufficient to ignore the c_b cost and use only the c_m cost as the distance D between shapes (i.e. l_1 = 1 and l_2 = 0). This convention is kept in all that follows. It should also be noted that the TPS might be needed for the alignment of matched shapes, regardless of whether the c_b cost contributes to the matching cost D. Such a situation arises in the joint detection application described in Sec. 3.2.
3 Experimental Results
Several experiments have been carried out to evaluate the proposed method. The qualitative and quantitative assessment of the proposed method was based on well-established benchmark datasets. An application of the method to the localization of joints in human figures is also presented. Throughout all experiments, n = 100 points were used to equidistantly sample each shape. For the MPEG7 experiment (see Sec. 3.1) this results in an average subsampling rate of 13 contour pixels with a standard deviation of 828 pixels. This large deviation is due to the long right tail of the distribution of shape lengths. Shape contexts were defined having 12 bins in the angular and 5 bins in the radial dimension. Their small and large radii were equal to 0.125 and 2, respectively (after scale normalization). The TPS regularization parameter λ was set equal to 1 and the insertion/deletion cost for the cyclic matching to 0.75 (the χ² statistic yields values between 0 and 1).
3.1 Benchmark Datasets
The proposed shape matching method has been evaluated on the “SQUID” [6] and the “gestures” [7] datasets. In all the experiments related to these datasets, each shape was used as a query shape and the proposed method was employed to rank all the other images of the dataset in order of increasing cost D. Figures 1(a) and 1(b) show matching results for the “SQUID” and the “gestures” datasets, respectively. In each of these figures, the first column depicts the query shape. The rest of each row shows the first twenty matching results in order of increasing cost D. The retrieved shapes are, in most cases, very similar to the query. The quantitative assessment of the proposed method was performed by running the bull's-eye test on the MPEG7 CE-shape-1 part B dataset [8]. This dataset consists of 70 shape classes with 20 shapes each, resulting in a total of 1400 shapes. There are many types of shapes, including faces, household objects, other human-made objects, animals, and some more abstract shapes. Given a query shape, the bull's-eye score is the number of correct shape retrievals among the top 40 shapes, as ranked by the matching algorithm, divided by the theoretical maximum of correct retrievals, which for this dataset is equal to 20. The bull's-eye score of the proposed method on the MPEG7 dataset is 72.35%. The presented method does not natively handle mirroring, so the minimum of the costs to the original and mirrored shape is used in shape similarity comparisons. By post-processing the results using the graph transduction method [9]
Fig. 1. Matching results for (a) the “SQUID” and (b) the “gestures” datasets
with the parameter values suggested therein, the score is increased to 75.42%. For comparison, the state-of-the-art scores reported on this dataset are 88.3% for the Hilbert curve method [3] and 87.7% for the hierarchical matching method [2] (for more details, see Table 2 in [3]). An extended investigation of the results of the bull's-eye test is graphically illustrated in Fig. 2(a). This graph essentially turns the rather arbitrary choice of the forty best results into a variable. The horizontal axis of the graph is this recall depth variable, and the vertical axis is the percentage of correct results among the examined ones. The experimental results demonstrate that the cyclic string matching performs better than Bipartite Matching. Additionally, graph transduction improves both methods but does not affect the superiority of the cyclic matching compared to Bipartite Matching. The essential advantage of cyclic matching over Bipartite Matching is the reduction of the search space: while Bipartite Matching searches among all possible permutations between two shapes, cyclic matching only considers the matchings that obey the ordering restrictions imposed by both shape contours. This effectively speeds up the matching process while yielding intuitive results. Sample shape retrieval results on the MPEG7 dataset are shown in Fig. 2(b); the full set of results for the reported experiments is available online at http://www.ics.forth.gr/∼argyros/research/shapematching.htm.
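For reference, the bull's-eye style evaluation can be sketched as follows (illustrative NumPy code assuming a precomputed distance matrix and the 70-class, 20-shapes-per-class layout of MPEG7 CE-shape-1 part B; not code from the paper):

```python
import numpy as np

def bulls_eye_score(dist, labels, depth=40):
    """dist: (N, N) matrix of shape distances D; labels: (N,) integer class ids.

    Returns the average percentage of same-class shapes retrieved among the top
    `depth` results, normalized by the class size (20 for MPEG7)."""
    class_size = np.bincount(labels).max()
    scores = []
    for i in range(len(labels)):
        order = np.argsort(dist[i])[:depth]        # the query itself is retrieved and counts
        hits = np.sum(labels[order] == labels[i])
        scores.append(hits / class_size)
    return 100.0 * np.mean(scores)
```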
3.2 Detecting Joints in Human Figures
Due to its robustness and computational efficiency, the proposed method has been used for the recovery of the joints of a human figure. For this purpose, a set of synthetic human model figures was generated. Two model parameters control the shoulder and elbow of each arm. Several points (joints and other points of interest) are automatically generated on each model figure.
Fig. 2. Results on the MPEG7 dataset. (a) The bull's-eye test scores (%) as a function of the recall depth for cyclic matching (CM), Bipartite Matching (BM) and their graph-transduction variants (CM+GT, BM+GT). (b) Sample shape retrieval results.
Fig. 3. The five configurations for the right arm. The contour of each figure is used as the shape model; marked points are the labeled joints.
Fig. 4. Characteristic snapshots from the joint detection experiment
Figure 3 shows five such model figures for various postures of the right arm. A total of 25 models were created, depicting all possible combinations of articulations of the right (as shown in Fig. 3) and the left arm. In the reported experiments, the background subtraction method of [10] has been employed to detect foreground figures. Connected components of the resulting foreground mask image are then considered. If more than one
connected component exists in the foreground image, only the one with the largest area is maintained for further processing. Its silhouette is then extracted and a fixed number n of roughly equidistant points is selected on it. This list of points constitutes the actual input to the proposed matching method. Each figure is compared to all model figures. The model with the lowest cost D is picked as the one corresponding to the input. The TPS transformation between the model and the input is subsequently used to warp the labeled points of interest onto the input image. Figure 4 shows characteristic snapshots from an extensive experiment in which a human moves in a room in front of a camera while taking several different postures. The input image sequence contains approximately 1200 frames acquired at 20 fps. Having identified the joints, a skeleton model of each figure is obtained. Interestingly, the method performs well even under the considerable scale and perspective distortions introduced by the human motion, which result in considerable differences between the actual foreground silhouettes and the considered prototypes. The results presented in Fig. 4 have been obtained without any exploitation of temporal continuity. Exploiting temporal continuity may improve the results, since the estimate of the human configuration in the previous frame is a good starting point for the approximation in the current frame. To exploit this idea, at each moment in time a synthetic figure like the ones shown in Fig. 3 is custom rendered using the joint angles of the estimated skeleton. Thus, the result of the previous frame is used as a single model figure for estimating the human body configuration in the current frame. If the estimated distance between the synthetic model and the observed figure exceeds a specified threshold, the system is re-initialized by comparing the observed figure with the 25 prototype figures, as in the previous experiment. The exploitation of temporal continuity significantly improves the performance of the method.
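This temporal-continuity strategy can be summarized by the following Python-style sketch; the helpers `extract_silhouette_points`, `shape_distance` (returning a cost and a TPS warp) and `render_model_from_skeleton` are hypothetical placeholders for the steps described above, not functions from the paper:

```python
def best_match(models, shape):
    """Return (model, cost, warp) for the model with the lowest matching cost D."""
    return min(((m,) + shape_distance(m, shape) for m in models), key=lambda t: t[1])

def track_joints(frames, prototypes, threshold):
    """Per-frame joint localization with temporal continuity and re-initialization."""
    previous_model, skeletons = None, []
    for frame in frames:
        shape = extract_silhouette_points(frame)      # n roughly equidistant contour points
        if previous_model is None:
            model, cost, warp = best_match(prototypes, shape)
        else:
            model, cost, warp = best_match([previous_model], shape)
            if cost > threshold:                      # tracking lost: fall back to the 25 prototypes
                model, cost, warp = best_match(prototypes, shape)
        skeleton = warp(model.joints)                 # TPS-warp the labeled joints onto the input
        skeletons.append(skeleton)
        previous_model = render_model_from_skeleton(skeleton)
    return skeletons
```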
4 Discussion
This paper proposed a rotation, translation and scale invariant method for matching 2D shapes that can be represented as single, closed contours. Affine transformations can be tolerated since the shape contexts are robust (but not strictly invariant) descriptors under this type of distortion. The performance of the method deteriorates gradually as the amount of noise increases. In this context, noise refers either to shape deformations due to errors in the observation process (e.g. foreground/background segmentation errors, sampling artifacts, etc.) or to natural shape deformations (e.g. articulations, perspective distortions, etc.). The time complexity of the method is O(n² log n) for n input points, an improvement over the respective complexity of [1], which is O(n³). In the application of Sec. 3.2, the employed unoptimized implementation performs 25 shape comparisons per second, including all computations except background subtraction. By exploiting temporal continuity, most of the time the method needs to compare the current shape with a single prototype, leading to real-time
performance. Overall, the experimental results demonstrate qualitatively and quantitatively that the proposed method is competent in matching deformable shapes and that the exploitation of the order of contour points, besides improving computational performance, also improves shape matching quality.
Acknowledgments This work was partially supported by the IST-FP7-IP-215821 project GRASP. The contributions of Michel Damien and Thomas Sarmis (members of the CVRL laboratory of FORTH) to the implementation and testing of the proposed method are gratefully acknowledged.
References 1. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on PAMI 24, 509–522 (2002) 2. Felzenszwalb, P., Schwartz, J.: Hierarchical matching of deformable shapes. In: CVPR 2007, pp. 1–8 (2007) 3. Ebrahim, Y., Ahmed, M., Abdelsalam, W., Chau, S.C.: Shape representation and description using the hilbert curve. Pat. Rec. Let. 30, 348–358 (2009) 4. Ling, H., Jacobs, D.: Shape classification using the inner-distance. IEEE Transactions on PAMI 29, 286–299 (2007) 5. Schmidt, F., Farin, D., Cremers, D.: Fast matching of planar shapes in sub-cubic runtime. In: ICCV 2007, pp. 1–6 (2007) 6. Mokhtarian, F., Abbasi, S., Kittler, J.: Robust and efficient shape indexing through curvature scale space. In: BMVC 1996, pp. 53–62 (1996) 7. Petrakis, E.: Shape Datasets and Evaluation of Shape Matching Methods for Image Retrieval (2009), http://www.intelligence.tuc.gr/petrakis/ 8. Jeannin, S., Bober, M.: Description of core experiments for mpeg-7 motion/shape (1999) 9. Yang, X., Bai, X., Latecki, L.J., Tu, Z.: Improving shape retrieval by learning graph transduction. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 788–801. Springer, Heidelberg (2008) 10. Zivkovic, Z.: Improved adaptive gaussian mixture model for background subtraction. In: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., vol. 2, pp. 28–31 (2004)
3D Model Reconstruction from Turntable Sequence with Multiple-View Triangulation
Jian Zhang, Fei Mai, Y.S. Hung, and G. Chesi
The University of Hong Kong, Hong Kong
Abstract. This paper presents a new algorithm for 3D shape recovery from an image sequence captured under circular motion. The algorithm recovers the 3D shape by reconstructing a set of 3D rim curves, where a 3D rim curve is defined by the two frontier points arising from two views. The idea consists of estimating the position of each point of the 3D rim curve by using three views. Specifically, two of these views are chosen close to each other in order to guarantee a good image point matching, while the third view is chosen far from these two views in order to compensate for the error introduced in the triangulation scheme by the short baseline of the two close views. Image point matching among all views is performed by a new method which suitably combines epipolar geometry and cross-correlation. The algorithm is illustrated through experiments with synthetic and real data, which show satisfactory and promising results.
1 Introduction
In computer vision, 3D modeling of real objects is a key issue and has numerous applications such as augmented reality, digital entertainment, and pose estimation. There are three main groups of methods: point-based methods, silhouette-based methods, and combined methods. In a point-based method, feature points are used to compute the dense depth map [1], [2] of a 3D object by stereo matching techniques [3], [4], [5], and all points on the 3D object are fused [6]. A good and realistic 3D model can be obtained, but the computational burden for dense map estimation and fusion is quite high. In a silhouette-based method, only silhouettes are used to compute the visual hull [7], [8] of the 3D object. A 3D model can be reconstructed with good efficiency, but fine details and concave surfaces of the object are missing. To achieve high-quality reconstruction results with low computational complexity, there has been considerable interest in combining point and silhouette information together. Most of the methods in this group reconstruct the visual hull as an initial solution and then refine the model using the feature point information [9], [10], [11]. In [12], object silhouettes in two views are exploited to establish a 3D rim curve, which is defined with respect to the two frontier points arising from the two views. To reconstruct this 3D rim curve, its images in the two views are matched using the traditional cross-correlation technique, and then the 3D rim curve is reconstructed using the triangulation method over the two images. The method can reconstruct concave object surfaces with
fast surface extraction. However, the two views used in this approach are consecutive views in the turntable sequence and close to each other. As a result, it is easier to establish correspondences of points on the rim curves, but depth precision is sacrificed due to the short baseline. In order to improve the depth precision of rim reconstruction, point matching over a wide baseline is desired. In this paper, we propose a new matching method to match images of a 3D rim over three views accurately, where one view lies relatively far away from the other two views. Thus the 3D rims can be reconstructed more reliably using this triple of views. The paper is organized as follows. Section 2 describes the proposed multiple view point matching method. Section 3 presents the object reconstruction approach. Experimental results with real and synthetic data are given in Section 4, and lastly the conclusion is given in Section 5.
2 Multiple View Point Matching

2.1 Determining Matching Area
In this section, we present the point matching method within three views. Suppose a 2D point p in one view, called Va, has been defined. The problem is to establish the correspondences of this point in the other two views, which are referred to as Vj and Vk. The view Va is chosen to have a wide baseline with respect to views Vj and Vk.
Fig. 1. Matching area: (a) point p in Va; (b) matching area in Vj; (c) matching area in Vk. The two lines in Vj and Vk are the epipolar lines of the point p in view Va. The stars show the bounding points of each line. The matching points are searched only inside the two line segments CD and FG.
In order to improve the efficiency and accuracy of the matching process, we exploit epipolar geometry [14] and the silhouette sequence as explained in [15], which allow us to limit the image area to be analyzed for finding the sought matching points. As shown in Fig. 1, for the point p in image Va, there is an epipolar line [14] in views Vj and Vk, named laj and lak respectively, which is limited by the corresponding bounding points. Let us consider the back-projected line Rp from the point p (Fig. 1(a)), which intersects the object visual hull at the 3D points P1 and P2 (we assume that P1 is closer than P2 to the camera center of view Va). As is known, the matching points in views Vj and Vk should lie on the epipolar lines laj and lak corresponding to the point p, and in particular they should lie on the sections delimited by the projections of the points P1 and P2. For example, considering
view Vk, we can get the epipolar line of point p with epipolar geometry, which can be segmented by the 2D projection points of P1 and P2 in view Vk, namely bk,1 and bk,2. The line segment bk,1bk,2 is thus obtained. Since the point p is on the front surface in 3D space, we can remove the back part of the line interval by restricting our attention to the portion delimited by bk,1 and (1 − λk)bk,1 + λk bk,2, where λk ∈ [0, 1], namely the line segment CD (Fig. 1(b)). Applying a similar argument to view Vj, the search area is limited by bj,1 and (1 − λj)bj,1 + λj bj,2, where λj ∈ [0, 1], namely the line segment FG (Fig. 1(c)). In the following matching process, we will only search along these portions of the epipolar lines.
Fig. 2. Matching process. (a) Getting pj,i in Vj: laj is the epipolar line of the source point p in Vj, limited to the line segment CD by the bounding points (stars); pj,1 (dot) is one of the sample points along CD. (b) Getting p̂k,i in Vk with the cross-correlation technique: qk,i are the sample pixels along FG; computing the cross-correlation score between pj,1 and each qk,i yields a sequence of scores ri, and the pixel qk,i giving the highest score ri with pj,1 indicates the matching point p̂k,1 (solid square). (c) Getting pk,i in Vk with epipolar geometry: pj,1 has a corresponding epipolar line ljk,1 (dashed line) in Vk, which intersects lak at pk,1 (circle). (d) Getting Di = ||p̂k,i − pk,i||, the distance in Vk between the two matching points p̂k,i and pk,i obtained from the same sample point pj,i in Vj. The index I minimizing Di defines the matching points pj,I and pk,I.
2.2 Matching Method
To find the points corresponding to the point p in the other two views, epipolar geometry and cross-correlation techniques are employed together. The idea is as follows. First, we sample the line segment CD, which is the epipolar line laj in image Vj limited by the corresponding bounding points as stated previously. This yields a sequence of points pj,i. As shown in Fig. 2(a), pj,1 is one sample point in this sequence.
Second, we find pj,i's matching point p̂k,i in image Vk with the cross-correlation technique, which is based on the similarity between two image patches. Specifically, suppose Ik(uk, vk) is the intensity of image Ik at pixel xk = [uk, vk, 1]T. By using a correlation mask of size (2n + 1) × (2m + 1) pixels, we can get the normalized correlation score between points x1 and x2 with the standard cross-correlation function. Since the correlation matching score between pj,i and the pixels along FG can be computed, the pixel along FG with the largest cross-correlation score with pj,i gives p̂k,i, which is pj,i's matching point in Vk. For example, in Fig. 2(b), p̂k,1 is pj,1's matching point obtained with the cross-correlation method. Then, we find pj,i's matching point pk,i in image Vk with epipolar geometry. Let ljk,i be the epipolar line in view Vk corresponding to pj,i. The point pk,i should be the intersection of lak and ljk,i. In Fig. 2(c), the dashed line is the epipolar line ljk,1 corresponding to pj,1, and the circle is the matching point pk,1 determined by the intersection. Up to now, for one sample point pj,i in image Vj, there are two matching points p̂k,i and pk,i in image Vk defined by different methods. Lastly, we measure the distance between the two matching points p̂k,i and pk,i (Fig. 2(d)). The pair (pj,I, pk,I) with the index I minimizing the Euclidean distance between p̂k,i and pk,i is selected as the matching point pair corresponding to point p. Together with the source point p, there are three points p, pj,I and pk,I in images Va, Vj and Vk respectively, which match each other. Repeating these steps, we may compute the matching points for any object point in view Va.
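The per-point matching step can be sketched in Python as follows (illustrative only; `ncc_score`, `epipolar_line` and `intersect` are hypothetical helpers for normalized cross-correlation, the fundamental-matrix-based epipolar line and line intersection, and the real implementation operates on the bounded segments CD and FG described above):

```python
import numpy as np

def match_point(samples_cd, samples_fg, l_ak, img_j, img_k, F_jk):
    """Return the matching pair (p_jI, p_kI) for one source point p.

    samples_cd: candidate points p_{j,i} along CD in Vj;
    samples_fg: candidate pixels q_{k,i} along FG in Vk;
    l_ak: epipolar line of p in Vk (homogeneous form)."""
    best_dist, best_pair = np.inf, None
    for p_ji in samples_cd:
        # p_hat_{k,i}: pixel along FG with the highest normalized cross-correlation score
        scores = [ncc_score(img_j, p_ji, img_k, q_ki) for q_ki in samples_fg]
        p_hat_ki = np.asarray(samples_fg[int(np.argmax(scores))])
        # p_{k,i}: intersection of the epipolar line l_{jk,i} of p_{j,i} with l_ak
        p_ki = intersect(epipolar_line(F_jk, p_ji), l_ak)
        d_i = np.linalg.norm(p_hat_ki - p_ki)          # D_i in Fig. 2(d)
        if d_i < best_dist:
            best_dist, best_pair = d_i, (p_ji, p_ki)
    return best_pair
```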
2.3 Discussion on View Selection
As previously stated, the views Vj and Vk should be close to each other in order to keep the distortion of the object small and to obtain an accurate image point matching. Hence, the views Vj and Vk can be chosen as two consecutive views of the turntable sequence. Let us now consider view Va. As previously explained, this view should be far from the views Vj and Vk in order to compensate for the short baseline between these two views, which may lead to unsatisfactory results in the triangulation. However, as the rotation angle between view Va and views Vj and Vk increases, object points visible in view Va may become invisible in views Vj and Vk due to occlusion. Therefore, the position of view Va with respect to the views Vj and Vk has to be selected according to a trade-off between these two requirements.
3 Object Reconstruction

3.1 Overview
Fig. 3 shows the main steps of our 3D reconstruction process. Specifically, an object of interest is observed from a camera in N locations uniformly distributed
along a circumference. A turntable sequence of images of the object is hence obtained. Then, a particular section of the object called “rim curve” is defined (Fig. 3(ii)). Image point matching is performed among three views as described in Section 2, by matching in each image the “rim curve” (Fig. 3(iii)). The 3D points of the rim curves are hence estimated based on multiple view triangulation (Fig. 3(iv)). Lastly, the sought 3D model of the object is obtained by merging the estimated 3D points (Fig. 3(v)).
Fig. 3. Main steps of the 3D reconstruction process: (i) turntable sequence capture; (ii) defining a rim curve; (iii) multiple view rim matching; (iv) multiple view rim reconstruction; (v) 3D model
Fig. 4 shows the camera in three locations, Ca, Cj and Ck, from which the three views Va, Vj and Vk are obtained. Observe that views Vj and Vk are close to each other in order to allow for a good cross-correlation matching, and relatively far from Va in order to compensate for the short baseline between them, which may affect the accuracy of the triangulation scheme. Va is placed approximately 90 degrees away from Vj; this value can be adjusted according to the object shape.
Fig. 4. View selection. Vj and Vk are close to each other and relatively far from Va, according to the camera center locations Ca, Cj and Ck.
3.2 Rim Matching and Triangulation
In [12], a 3D rim curve is defined with respect to a calibrated image pair. Given the image pair Va and Vj, we find the two frontier points [13] (X1 and X2). A frontier
point is the intersection of two contours in 3D space, and it is visible in both silhouettes. They define a plane together with the camera center Ca. This plane cuts the model at a planar curve on the object surface. This curve is segmented by the two frontier points into two segments. The one closer to the chosen camera center Ca is defined as a 3D rim curve. It projects to view Va as a straight line, and to view Vj as a general 2D rim curve. The straight line in view Va is sampled, obtaining a sequence of image points called source points. Once the 3D rim curves and the source points in Va have been defined, the method proposed in Section 2 can be applied to obtain the matching points in the other two views Vj and Vk. In order to obtain the matching points for all 3D rim curves, the proposed point matching process is repeated for all view groups. Once the image points corresponding to the same 3D point of the rim have been found in the three views as described previously, we estimate the coordinates of the 3D point as follows. Let X be an unknown 3D point we want to estimate, and xa, xj, xk be the image points corresponding to X in views Va, Vj, Vk. Moreover, let Pa, Pj and Pk be the projection matrices relative to views Va, Vj, Vk. In the absence of image noise and calibration errors one has that

M X = 0 ,   (1)

where

M = \begin{pmatrix} Q(x_a) \\ Q(x_j) \\ Q(x_k) \end{pmatrix} , \quad
Q(x_i) = \begin{pmatrix} 1 & 0 & -x_{i,1} \\ 0 & 1 & -x_{i,2} \end{pmatrix} P_i .

Therefore, X can be estimated via an SVD of the matrix M. In order to weigh differently the contributions given by xa, xj, and xk, we use the weighted matrix

M_w = D M ,   (2)

where D = diag(γ, γ, 1, 1, 1, 1) and γ is a positive real number, which is less than one in order to reduce the error introduced by the distant image Va. Hence

[U, S, V] = svd(M_w) ,   (3)

and the sought estimate of X is given by

X = v^* / v_4^* ,   (4)

where v^* is the last column of V and v_4^* is its fourth component.
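A minimal NumPy sketch of this weighted three-view triangulation (illustrative only; the default value of γ is an arbitrary assumption, since the paper only requires γ < 1, and the image points are assumed to be given as inhomogeneous pixel coordinates):

```python
import numpy as np

def q_matrix(x, P):
    """Build the 2x4 block Q(x) = [[1, 0, -x1], [0, 1, -x2]] P for one view."""
    A = np.array([[1.0, 0.0, -x[0]],
                  [0.0, 1.0, -x[1]]])
    return A @ P

def triangulate_weighted(x_a, x_j, x_k, P_a, P_j, P_k, gamma=0.5):
    """Estimate a 3D point X from three views, down-weighting the distant view Va."""
    M = np.vstack([q_matrix(x_a, P_a), q_matrix(x_j, P_j), q_matrix(x_k, P_k)])  # 6x4 matrix
    D = np.diag([gamma, gamma, 1.0, 1.0, 1.0, 1.0])
    _, _, Vt = np.linalg.svd(D @ M)        # SVD of the weighted matrix Mw = D M
    v = Vt[-1]                             # right singular vector of the smallest singular value
    return v / v[3]                        # homogeneous 3D point normalized so that X4 = 1
```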
Fig. 5. Matching results for the clayman sequence: (a) source 2D rim points in Va; (b) matched 2D rim points in Vj; (c) matched 2D rim points in Vk. For each point along the 3D rim curve in 3D space, the corresponding 2D points in views Va, Vj and Vk have been found with the proposed point matching method.
4 Experimental Results
In this section we present some results obtained by applying the proposed approach to a real turntable sequence (the clayman) and a synthetic sequence (the Buddha). We have chosen sequences with N = 36 images, i.e. a rotation angle equal to 10 degrees between any two consecutive images of the turntable sequence. Views Va, Vj and Vk have been selected according to the following rule: a = i, j = mod(i + 3, 36) + 1, k = mod(i + 4, 36) + 1, for all i = 1, ..., 36. Fig. 5 shows some matching results obtained for this choice for the clayman sequence. For comparison, in Fig. 6, we show the matching results in image Vk using the matching algorithm in [12] and our proposed multiple view point matching method, respectively. They have the same source points in Va as shown in Fig. 5(a). The superiority of the proposed method is obvious. Based on the good matching results among views with a long baseline, a good reconstruction can be obtained. Fig. 7(a) shows the 3D rim curves estimated with the proposed approach, while Fig. 7(b) shows the triangulated mesh model based on the 3D rim curves. We have also applied our method to a synthetic model of a Buddha to measure the 3D error. The evaluation on 3D error e3D is performed by finding the
Fig. 6. Comparison of the matching results in Vk between (a) the proposed method and (b) the method in [12]. The source points for (a) and (b) are the same in image Va as those in Fig. 5(a). The matching points inside the circles in (b) are quite disordered and incorrect.
Fig. 7. Reconstructed results: (a) 3D rim curves; (b) 3D mesh model
collineation H between the reconstructed 3D points X̂ and their corresponding ground truth 3D points X [17], [18]. The 3D error can be defined as

e_{3D} = \min_H \frac{1}{n} \sum_{i=1}^{n} \left\| X_i - \alpha_i H \hat{X}_i \right\|^2 ,   (5)
where αi is a non-zero scaling factor for normalizing the fourth component of H X̂i to 1. For the synthetic object with 25003 given 3D points, the 3D error for our proposed method is 1.8343, compared to 2.0868 for the method in [12].
5 Conclusion
This paper has addressed the problem of reconstructing the 3D model of an object from a turntable image sequence. In particular, we have proposed an algorithm in which each 3D point is estimated via a multiple-view triangulation scheme, in order to obtain satisfactory image point matching while coping with the short baseline problem. An additional advantage of the approach is that it provides an ordered set of 3D rim curves, which can be readily organized as a meshed object surface for further 3D modeling processes such as rendering.
Acknowledgement The authors would like to thank the Reviewers for their useful comments. This work was supported in part by the Research Grants Council of Hong Kong Special Administrative Region under Grants HKU711208E and HKU712808E.
References 1. Bolles, R.C., Baker, H.H., Marimont, D.H.: Epipolar-plane image analysis: An approach to determining structure from motion. International Journal of Computer Vision 1, 7–55 (1987) 2. Okutomi, M., Kanade, T.: A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 15, 353–363 (1993) 3. Cox, I.J., Hingorani, S.L., Rao, S.B.: A maximum likelihood stereo algorithm. Computer Vision Image Understanding 63, 542–567 (1996) 4. Sun, J., Zheng, N.N., Shum, H.Y.: Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 787–800 (2003) 5. Van Meerbergen, G., Vergauwen, M., Pollefeys, M., Van Gool, L.: A Hierarchical Symmetric Stereo Algorithm Using Dynamic Programming. International Journal of Computer Vision 47, 275–285 (2002) 6. Koch, R., Pollefeys, M., Gool, L.J.V.: Multi Viewpoint Stereo from Uncalibrated Video Sequences. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, p. 55. Springer, Heidelberg (1998) 7. Baumgart, B.G.: Geometric modeling for computer vision. Ph.D Thesis, Stanford University (1974) 8. Laurentini, A.: The Visual Hull Concept for Silhouette-Based Image Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 150–162 (1994) 9. Furukawa, Y., Ponce, J.: Carved Visual Hulls for Image-Based Modeling. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 564–577. Springer, Heidelberg (2006)
10. Esteban, C.H., Schmitt, F.: Silhouette and stereo fusion for 3D object modeling. Computer Vision and Image Understanding 96, 367–392 (2004) 11. Isidoro, J., Sclaroff, S.: Stochastic Refinement of the Visual Hull to Satisfy Photometric and Silhouette Consistency Constraints. In: The Ninth IEEE International Conference on Computer Vision, Beijing, China (2003) 12. Huang, Z., Wing Sun, L., Wui Fung, S., Yeung Sam, H.: Shape recovery from turntable sequence using rim reconstruction. Pattern Recognition 41, 295–301 (2008) 13. Cipolla, R., Giblin, P.J.: Visual Motion of Curves and Surfaces. Cambridge University Press, Cambridge (1999) 14. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 15. Boyer, E., Franco, J.S.: A hybrid approach for computing visual hulls of complex objects. In: Computer Vision and Pattern Recongnition, Madison, Wisconsin (2003) 16. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms. In: Computer Vision and Pattern Recognition, New York, USA (2006) 17. Malis, E., Chesi, G., Cipolla, R.: 2 1/2 D visual servoing with respect to planar contours having complex and unknown shapes. International Journal of Robotics Research 22(10), 841–853 (2003) 18. Csurka, G., Demirdjian, D., Horaud, R.: Finding the Collineation between Two Projective Reconstructions. Computer Vision and Image Understanding 75, 260–268 (1999)
Recognition of Semantic Basketball Events Based on Optical Flow Patterns
Li Li1, Ying Chen2, Weiming Hu1, Wanqing Li3, and Xiaoqin Zhang1
1 Institute of Automation, Chinese Academy of Sciences, Beijing, China {lli,wmhu,xqzhang}@nlpr.ia.ac.cn
2 Department of Basic Sciences, Beijing Electronic Science and Technology Institute, Beijing, China [email protected]
3 University of Wollongong, Sydney, Australia [email protected]
Abstract. This paper presents a set of novel features for classifying basketball video clips into semantic events and a simple way to use prior temporal context information to improve the accuracy of classification. Specifically, the feature set consists of a motion descriptor, a motion histogram, the entropy of the histogram and texture. The motion descriptor is defined based on a set of primitive motion patterns which are derived from the optical flow field. The event recognition is achieved by using kernel SVMs and a temporal contextual model. Experimental results have verified the effectiveness of the proposed method.
1 Introduction
Event detection is a key task in video analysis. It allows us to effectively and semantically summarize, annotate and retrieve the video content. In sports videos, an event is often captured by a segment or clip of video, and detecting the event involves locating the clip in the video sequence and deciding which event the clip belongs to. Various algorithms have been proposed in the past to segment videos into shots. Since each shot can consist of different clips that contain different semantic events, this paper focuses on the problem of classifying sports video clips into a set of predefined semantic events. We assume that video sequences are pre-segmented into clips and there is only one event in each clip. Generally, event recognition requires a discriminative description of the events and an effective classifier built upon the description. For the description, visual features are widely used. Ekin et al. [1] adopted dominant color regions and proposed a heuristic approach to classify soccer video shots into far, medium and close-up views and further to annotate the shot as “in play” or “break”. Instead of using low-level visual features, mid-level features, such as camera motion patterns, action regions and field shape properties, were exploited in [2] for semantic shot classification in sports videos. Wang et al. [3] proposed an expanded relative motion histogram of Bag-of-Visual-Words (ERMH-BoW) which captured
Fig. 1. Predefined basketball events: close-up view, offense from left to right, offense from right to left, offense and defense at left court, and offense and defense at right court
both what and how aspects of an event. Recently, the fusion of visual features with audio and textual features has been explored. Xu et al. [4] used web-casting text to combine low-level features and high-level semantics in sports video analysis. For the classifier, SVMs have been widely employed [2,3,4,5], not only because of their solid theoretical basis but also because of their success in various applications. However, unlike conventional Hidden Markov Models (HMM) [6] and Conditional Random Fields (CRF) [7], SVMs are not able to make use of the temporal context that exists in a sequence of sports events. In order to exploit the temporal correlations of concepts in a video sequence, Qi et al. [8] proposed a correlative multi-label approach which simultaneously classifies concepts and models their correlations. In this paper, we focus on classifying basketball video clips into five basic events (as shown in Figure 1): 1) close-up view; 2) offense from left to right; 3) offense from right to left; 4) offense and defense at left court; 5) offense and defense at right court. Notice that the penalty view belongs to offense and defense at the right or left court, so it is not specifically considered as a separate event. The close-up view includes stillness, lay-up in close-up view, tracking a player in close-up view and foul shot in close-up view. A goal view is likely to happen during offense and defense at the left or right court and is followed by close-up views for highlighting. The major contributions of this paper include the introduction of a set of new features to effectively describe the events and the use of prior temporal context information to improve the classification. Specifically, the feature set consists of a motion descriptor, a motion histogram, the entropy of the histogram and texture. The motion descriptor is defined based on a set of primitive (i.e. pixel level) motion patterns, observed from the measured optical flow. The event classification is achieved by using kernel SVMs and a temporal contextual model. The rest of the paper is organized as follows. Section 2 describes the proposed features and their extraction from a video clip. Section 3 presents the classifiers adopted in this paper, which consist of kernel SVMs and a contextual model. Experimental results are given in Section 4, followed by conclusions in Section 5.
Fig. 2. Six optical flow patterns (a)-(f)
2 Feature Extraction
The proposed features consist of a motion descriptor, a motion histogram, the entropy of the histogram and texture. To extract the motion-related information, Lucas and Kanade's method [9] was employed to compute a dense optical flow field for each frame. The optical flow fields are post-processed through blurring with a Gaussian filter and thresholding in order to reduce the influence of noise.
2.1 Motion Descriptor
Since the camera motion plays an important role in identifying the events, it is necessary to extract the dominant motion patterns. For this purpose, we utilize the power of patterns, created by the type of camera motion, which are more naturally observed from the measured optical flow [10]. In order to characterize the motion of the events, we introduce six primitive optical flow patterns: a) horizontal pan or rotation around a vertical axis; b) vertical pan or rotation around a horizontal axis; c) camera moving forward or backward, or zooming; d) rotation around the optical axis of the camera; e) and f) complex hyperbolic flows. Figure 2 shows the six flow patterns. Considering the velocity vector vp of the pixel located at p and its nearby pixel p0, and applying a first-order expansion, vp can be represented as

v_p = v_{p_0} + A (p - p_0) ,   (1)
where v_{p_0} = (a, b)^T is the velocity vector at pixel p_0, and

A = \begin{pmatrix} \alpha & \gamma \\ \beta & \delta \end{pmatrix} .

Then, by defining the following four matrices

D = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} , \quad R = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} , \quad H_1 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} , \quad H_2 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} ,

A can be decomposed into

A = \frac{1}{2} \left( d D + r R + h_1 H_1 + h_2 H_2 \right) ,   (2)

where d = α + δ refers to a divergent optical flow, r = β − γ refers to a rotating optical flow, h_1 = α − δ refers to one type of hyperbolic optical flow, and h_2 = β + γ refers to another type of hyperbolic optical flow. Together with a and b, any velocity field can be approximately characterized by the six parameters (a_{xy}, b_{xy}, d_{xy}, r_{xy}, h_1^{xy}, h_2^{xy}). Given an optical flow field, for each pixel p(x, y) we consider its neighboring pixels (i.e. a 3×3 window) to estimate the parameter vector (a_{xy}, b_{xy}, d_{xy}, r_{xy}, h_1^{xy}, h_2^{xy}) using Eq. (1) and the least-squared-error method. The parameter a_{xy} is associated with camera panning right and left, b_{xy} is associated with camera tilting up and down, d_{xy} is associated with camera zooming in and out, and the last three parameters indicate hybrid motion. In basketball videos, either the ball or the players are constantly followed by the camera. For the defined basketball events, we define the following three motion features to measure the horizontal motion component (HMC), the vertical motion component (VMC) and the divergent motion component (DMC), respectively:

HMC = \frac{1}{N} \sum_{i=1}^{N} \sum_{x=0}^{X} \sum_{y=0}^{Y} a^i_{xy} ,   (3)

VMC = \frac{1}{N} \sum_{i=1}^{N} \sum_{x=0}^{X} \sum_{y=0}^{Y} b^i_{xy} ,   (4)

DMC = \frac{1}{N} \sum_{i=1}^{N} \sum_{x=0}^{X} \sum_{y=0}^{Y} d^i_{xy} ,   (5)

where X, Y denote the width and height of the optical flow field, and N is the number of frames in the video clip. These three features reflect the joint camera motion and the motion of the players and the ball. In addition, different events demonstrate different overall motion intensity. For example, the motion intensity in the events of close-up and penalty is lower
Fig. 3. An event sequence and the associated AMF values
than that of other events. The motion feature TM is utilized as a measure of the overall motion intensity:

TM = w_H \cdot HMC + w_V \cdot VMC + w_D \cdot DMC ,   (6)

where w_H, w_V and w_D are the weights of the three motion components and are subject to

w_H + w_V + w_D = 1 .   (7)

According to domain knowledge, the direction and duration of pan and tilt are also important for discriminating the events. For example, the event "offense from right to left" is likely to be associated with a camera pan to the left, and "offense and defense at left court" probably has a dominant motion toward the left. The close-up view usually indicates a divergent/convergent optical flow field due to the camera's fast following-up or zooming in/out actions. We therefore define an accumulative motion feature AMF [11] as follows:

AMF = \begin{cases} N (HMC - VMC) \cdot e^{-DMC}, & \text{if } HMC \cdot VMC > 0 \\ N (HMC + VMC) \cdot e^{-DMC}, & \text{if } HMC \cdot VMC \le 0 \end{cases}

Figure 3 shows the AMF values and the corresponding events of one video clip. From the figure we can see that different events have distinct signs and magnitudes of AMF. The AMF values for "offense from right to left" and "offense and defense at right court" are usually positive, but the former has a larger magnitude than the latter. The AMF values of "offense from left to right" and "offense and defense at left court" are usually negative, with the former having a larger value.
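These clip-level motion features can be sketched as follows, assuming the per-pixel parameter fields a, b and d have already been estimated for every frame (illustrative NumPy code; the equal weights are an assumption for the example, not values from the paper):

```python
import numpy as np

def clip_motion_features(a_fields, b_fields, d_fields, w=(1/3, 1/3, 1/3)):
    """a_fields, b_fields, d_fields: lists of per-frame 2D arrays of a_xy, b_xy, d_xy."""
    N = len(a_fields)
    hmc = sum(f.sum() for f in a_fields) / N        # Eq. (3)
    vmc = sum(f.sum() for f in b_fields) / N        # Eq. (4)
    dmc = sum(f.sum() for f in d_fields) / N        # Eq. (5)
    tm = w[0] * hmc + w[1] * vmc + w[2] * dmc       # Eq. (6), overall motion intensity
    if hmc * vmc > 0:                               # accumulative motion feature AMF
        amf = N * (hmc - vmc) * np.exp(-dmc)
    else:
        amf = N * (hmc + vmc) * np.exp(-dmc)
    return hmc, vmc, dmc, tm, amf
```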
2.2 Motion Histogram and Entropy
Also included in the feature set to characterize the overall motion of the events is the motion histogram (MOH) with respect to orientation θ and magnitude ρ. MOH is computed by setting 2 bins for the magnitude and 8 equally spaced bins for the orientation, θ ∈ {0, π/4, π/2, 3π/4, π, 5π/4, 3π/2, 2π}. Moreover, entropy measures the uncertainty associated with a random variable. The close-up view usually features a higher entropy than other events [2] due to the diversity of the motion vectors' directions and magnitudes. However, the uniform motion field in the "offense from left to right / right to left" and "left / right court" views leads to a low entropy. We calculate the entropy from the motion histogram:

H = - \sum_i p_i \log p_i , \qquad p_i = \frac{h(i)}{\sum_k h(k)} ,   (8)

where h is the motion histogram and i is the index of the histogram bins.
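A small sketch of the MOH and entropy computation from a dense flow field (illustrative; the magnitude bin boundary `rho_split` is an assumed parameter, since its value is not given in the text):

```python
import numpy as np

def motion_histogram_and_entropy(flow, rho_split=1.0, eps=1e-12):
    """flow: (H, W, 2) optical flow field. Returns the 2x8 MOH and its entropy (Eq. 8)."""
    u, v = flow[..., 0].ravel(), flow[..., 1].ravel()
    rho = np.hypot(u, v)
    theta = np.mod(np.arctan2(v, u), 2 * np.pi)             # orientation in [0, 2*pi)
    moh, _, _ = np.histogram2d(rho, theta,
                               bins=[np.array([0.0, rho_split, np.inf]),
                                     np.linspace(0.0, 2 * np.pi, 9)])
    p = moh.ravel() / (moh.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return moh, entropy
```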
2.3 Texture
In addition, the texture feature (T) is extracted for each frame in order to discriminate court views from non-court views. The texture features include the mean gray level, the standard deviation, the smoothness of the variance, the third moment, a measure of uniformity and the entropy, extracted from the image gray-scale co-occurrence matrix. Finally, the combined features ⟨TM⟩, ⟨AMF⟩, ⟨H⟩, ⟨MOH⟩ and ⟨T⟩ form the feature vector that describes the defined events and serves as the input to our classifiers, where ⟨·⟩ denotes the average over the frames of a clip.
3 Classifier
Our classifier consists of two components: kernel SVMs and a temporal contextual model. The kernel SVMs are employed in a "one-against-one" manner to estimate the likelihood that a given feature vector extracted from a video clip belongs to an event. The contextual model updates this likelihood by exploiting the temporal relationships between the events. For example, the event "offense and defense at left court" usually happens after "offense from right to left"; similarly, "offense from left to right" often follows "offense and defense at right court", and the event "offense and defense at left or right court" is often followed by the close-up view. Moreover, some events cannot happen one after the other. For instance, the event "offense from right to left" cannot be immediately followed by "offense and defense at right court".
Table 1. Update algorithm

Input: the transition matrix P(c); the confidence scores P(c_t|s_t)
Output: P(c = i|s_t)
Begin:
1. for each c_t = i
   /* Calculate the influence from c_{t-1} to c_t based on the transition probability matrix */
2.   Calculate the context influence factor based on the equation ψ_i = e^{(λ_i − 1)/σ}
3. endfor
/* Update confidence scores */
4. P(c = i|s_t) = ψ_i · P(c_t|s_t)
End.
Let c_t denote the event type of a video clip s_t at time t. Given s_t, first, the probability P(c_t|s_t) of s_t belonging to each class c_t is estimated from the outputs of the SVMs. Then, the context-based concept fusion strategy [12] is employed to update the probability P(c_t|s_t) by using the prior temporal contextual information of the events. We construct a transition probability matrix P(c) from the training data to quantify the dependence of c_t on c_{t−1}. Assume c_t obeys the Markov property, that is, P(c_t|c_1, c_2, ..., c_{t−1}) = P(c_t|c_{t−1}), and

\lambda_i = P(c = i|s_t) = \sum_j P(c = i, c_{t-1} = j|s_t) = \sum_j P(c_{t-1} = j|s_t)\, P(c = i|c_{t-1} = j) .

The algorithm for updating P(c_t|s_t) is listed in Table 1. Finally, s_t is classified as the event

c_t = \arg\max_c P(c|s_t) .   (9)
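The contextual update of Table 1 can be sketched as follows (illustrative NumPy code; the SVM probability estimates and the transition matrix are assumed to be given, and the previous clip's class probabilities are used in place of P(c_{t−1} = j | s_t)):

```python
import numpy as np

def contextual_update(svm_probs, prev_probs, transition, sigma=1.0):
    """Update the per-class confidences of the current clip using temporal context.

    svm_probs:  P(c_t|s_t) from the SVMs, shape (K,);
    prev_probs: class probabilities of the previous clip, shape (K,);
    transition: transition[j, i] = P(c = i | c_{t-1} = j), shape (K, K)."""
    lam = prev_probs @ transition            # lambda_i = sum_j P(c_{t-1}=j) P(c=i|c_{t-1}=j)
    psi = np.exp((lam - 1.0) / sigma)        # context influence factor psi_i
    updated = psi * svm_probs                # P(c=i|s_t) = psi_i * P(c_t|s_t)
    return int(np.argmax(updated)), updated  # Eq. (9): predicted event and updated scores
```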
4 Experimental Results
Two NBA video sequences of 40 minutes and 20 minutes, respectively, were used to test the proposed features and classifier. The videos were manually segmented into clips and each video clip contains one event. The total number of video clips for each event is listed in Table 2. LibSVM [13] is utilized to accomplish C-SVC learning. We train all datasets using the radial basis function (RBF) kernel:

K(x_i, x_j) = e^{-\gamma \| x_i - x_j \|^2}
Table 2. Experimental results

Event Class         Total   SVM (recall / precision / F1)   SVM+TC (recall / precision / F1)
Close-up             63     0.94 / 0.94 / 0.94              0.95 / 0.92 / 0.94
Left to right        24     0.75 / 0.86 / 0.80              0.75 / 0.86 / 0.80
Right to left        15     0.87 / 0.62 / 0.72              0.87 / 0.72 / 0.79
Left court           26     0.78 / 0.80 / 0.78              0.86 / 0.82 / 0.83
Right court          27     0.85 / 0.82 / 0.84              0.85 / 0.85 / 0.85
Average Precision   158     0.85                            0.88
Different values of the kernel parameter γ and the penalty C were evaluated in order to set the parameters: C ∈ {2^{-5}, 2^{-4}, ..., 2^{15}}, γ ∈ {2^{-15}, 2^{-14}, ..., 2^{3}}. In the experiments, half of the data set was used as training samples and the remainder was used for testing. To evaluate the performance, the classification accuracy was calculated. Table 2 summarizes the accuracies for each type of event when C = 2^7 and γ = 2^{-10}, with the best cross-validation rate of 82.89%, and compares the results of using SVMs alone (SVM) with the results of using SVMs and temporal contextual information (SVM+TC), where σ was set to a value between 0.2 and 1.5. Both methods achieve satisfactory results because of the effective features. The best accuracy achieved by SVM alone is 94%. On average, the accuracy reaches 85% for SVMs alone and is improved to 88% using the contextual information. In particular, the recall of the close-up view is comparatively high, reaching 95% when contextual information is employed.
5 Conclusions
This paper has presented a set of new motion features to describe basketball events and a simple way to incorporate temporal contextual information to improve the event classification. The proposed primitive motion patterns provide an effective and efficient bridge between the low-level motion features and middle-level motion features that can be tailored to specific types of events. Our future work includes an extensive evaluation of the features on large datasets and the further development of the proposed classification method into a robust event detection algorithm.
Acknowledgment This work is partly supported by NSFC (Grant No. 60825204 and 60672040), the National 863 High-Tech R&D Program of China (Grant No. 2006AA01Z453) and Foundation of Beijing Electronic Science and Technology Institute Key laboratory of Information Security and Privacy(Grant No. YZDJ0808).
References 1. Ekin, A., Tekalp, A., Mehrotra, R.: Automatic soccer video analysis and summarization. IEEE Transactions on Image Processing 12, 796–807 (2003) 2. Duan, L., Xu, M., Chua, T., Tian, Q., Xu, C.: A unified framework for semantic shot classification in sports video. IEEE Transaction on Multimedia 7, 1066–1083 (2005) 3. Wang, F., Jiang, Y., Ngo, C.: Video event detection using motion relativity and visual relatedness. In: ACM Multimediaa, pp. 239–248 (2008) 4. Xu, C., Wang, J., Lu, H., Zhang, Y.: A novel framework for semantic annotation and personalized retrieval of sports video. IEEE Transaction on Multimedia 10, 421–435 (2008) 5. Xu, D., Chang, S.: Video event recognition using kernel methods with multilevel temporal alignment. IEEE Transaction on Pattern Analysis and Machine Intelligence 30, 1–13 (2008) 6. Xu, G., Ma, Y., Zhang, H., Yang, S.: An hmm-based framework for video semantic analysis. In: IEEE Transaction on Circuits and Systems for video technology, pp. 1422–1431 (2005) 7. Wang, T., Li, J., Diao, Q., Hu, W., Zhang, Y.: Semantic event detection using conditional random fields. In: Computer Vision and Pattern Recognition Workshop, vol. 109, pp. 17–22 (2006) 8. Qi, G., Hua, X., Rui, Y.: Correlative multi-label video annotation. In: ACM Multimedia, pp. 17–26 (2007) 9. Lucas, B., Kanada, T.: An iterative image registration technique with an application to stereo vision. In: DARPA Image Understanding Workshop, pp. 121–130 (1981) 10. Sudhir, G., John, C., Lee, M.: Video annotation by motion interpretation using optical flow streams. Visual Commun Image Represent 7, 354–368 (1996) 11. Liu, S., Yi, H., Chia, L., Rajan, D., Chan, S.: Multi-modal semantic analysis and annotation for basketball video. Special Issue on Information Mining from Multimedia Databases of EURASIP Journal on Applied Signal Processing, 1–13 (2006) 12. Wu, Y., Tseng, B., Smith, J.: Ontology-based multi-classification learning for video concept dedetection. In: Proceeding of IEEE International Conferences on Multimedia and Expo, pp. 1003–1006 (2004) 13. Chang, C., Lin, C.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/cjlin/libsvm
Action Recognition Based on Non-parametric Probability Density Function Estimation
Yuta Mimura, Kazuhiro Hotta, and Haruhisa Takahashi
The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, Japan
{mimura,hotta,takahashi}@ice.uec.ac.jp
Abstract. In recent years, many surveillance cameras have been set up for security purposes. If an automatic system for detecting and recognizing crimes and accidents is developed, we can prevent them. Researchers are actively working to realize such an automatic system. The Cubic Higher-order Local Auto-Correlation (CHLAC) feature has a shift-invariant property which is effective for surveillance. Thus, we use it to realize action recognition without detecting the target. The recognition of a sequence x = (x1, . . . , xT) can be defined as the problem of estimating its posterior probability. If we assume that the feature at a certain time is independent of the other features in the sequence, the posterior probability can be estimated by the simple product of the conditional probabilities of each time step. However, the estimation of the conditional probability is not an easy task. Thus, we estimate the conditional probability with a non-parametric model. This approach is simple and does not require the training of a model. We evaluate our method using the KTH dataset and confirm that the proposed method outperforms conventional methods.
1 Introduction
In recent years, many surveillance cameras have been set up for security purposes. However, conventional surveillance systems have mainly been used as evidence of crimes and accidents. If an automatic system for detecting and recognizing crimes and accidents is developed, we can prevent them in advance. Research on action recognition is actively carried out to realize such an automatic system. However, since conventional action recognition methods crop characteristic local regions and use them in recognition, their accuracy depends on the selected local regions [4-6]. The CHLAC feature [2] has been proposed in recent years. It is robust to position changes of objects in an image. The CHLAC feature is suitable for extracting the continuous changes of human action because it computes correlations in various directions in spatio-temporal space. In this paper, we propose an action recognition method using the CHLAC feature. Since the CHLAC feature is shift invariant, we do not need to crop the target or characteristic local regions. In the recognition of a sequence x = (x1, . . . , xT), the posterior probability p(C | x1, . . . , xT) of the sequence is important. If the feature xt given at time t is
independent of the other features in the sequence, the posterior probability can be computed as \prod_{i=1}^{T} p(x_i | C_k)\, p(C_k), where p(C_k) is the prior probability and p(x_t | C_k) is the conditional probability at each time. However, the estimation of the conditional probability is not easy. When we use a parametric model to estimate the conditional probability, the accuracy depends on the parameters, and it is not guaranteed that we can select appropriate parameters. Therefore, we estimate the conditional probability using a non-parametric model. In this approach, training is not required because the probability density of each frame is estimated directly from the training samples. In the experiments, we use the KTH human motion dataset [1], which is widely used in recent research on action recognition [4-7]. This dataset contains 599 videos of 6 actions captured from 25 subjects. The 6 actions are boxing, hand-clapping, hand-waving, jogging, running and walking; the hand-clapping category contains only 99 videos, while the other categories have 100 videos each. In this paper, we try two evaluation protocols. In the first evaluation, we use leave-one-subject-out cross-validation, in which the videos of 24 subjects are used for training and the videos of the remaining subject are used for testing. This evaluation is used in [4]. In the second evaluation, the videos of 8 subjects are used for training and the videos of 9 subjects are used for testing. This evaluation is used in [5-7]. We demonstrate that the proposed method outperforms the conventional methods. In Section 2, the CHLAC feature is explained. Section 3 explains the action recognition method using the non-parametric model. Section 4 describes the KTH dataset used in the experiments. Section 5 shows experimental results. Finally, conclusions and future work are described in Section 6.
2 Cubic Higher-Order Local Auto-Correlation Feature
In conventional action recognition methods, characteristic local regions are cropped and used in recognition [4-6]. However, the accuracy of those methods depends on the selected local regions. Therefore, we use the CHLAC feature, which is independent of the position of objects. The CHLAC feature is an extended version of the Higher-order Local Auto-Correlation (HLAC) feature [3], which is effective for still-image recognition. The CHLAC feature is extracted from the spatio-temporal space, whereas the HLAC feature is extracted from the spatial domain only. The CHLAC feature is defined as

x_N(a_1, a_2, \ldots, a_N) = \int f(r) f(r + a_1) \cdots f(r + a_N)\, dr ,   (1)

where x_N is the CHLAC feature, a_i (i = 1, . . . , N) is a displacement from the position r in the spatio-temporal space, and f(r) is the intensity value at r. The integration range depends on how far the correlation is taken in the spatio-temporal space. The number of features depends on the displacements. In this paper, we extract CHLAC features of order N = 0, 1, 2 within a local 3×3×3 region. Patterns that are identical up to translation within the 3×3×3 region are duplicates; when they are eliminated, the total number of mask patterns is 251.
Action Recognition Based on Non-parametric Probability Density
491
Fig. 1. The mask patterns for extracting HLAC feature
Figure 1 shows the mask patterns for extracting HLAC feature when we use N=0,1,2 within 3×3 region. The 25 mask patterns consists of 1 pattern for N=0, 4 patterns of N=1 and 20 patterns of N=2. However, since CHLAC feature is extracted from spatio-temporal space, the number of mask patterns is larger than that of HLAC feature. We use N=0,1,2 within 3×3×3 region. Namely, we consider the combinations of 3 points in local 3×3 region from t-1 to t+1. The total number of combination is 251 which consists of 1 pattern of N=0, 13 patterns of N=1 and 237 patterns of N=2. The important properties of CHLAC feature are additivity and shift invariance. The additivity is induced by the summation of local correlation in entire image as equation (1). The part of mask patterns for extracting CHLAC feature is shown in Figure 2. The mask pattern of N=0 is shown in top left, the examples of mask patterns of N=1 are shown in top right and the examples of N=2 are shown in bottom left. All 251 patterns are used as 3×3×3 filters. CHALC feature is extracted by applying the filters to continuous 3 frames in a sequence. Since 1 feature is obtained by 1 mask pattern, we obtain CHLAC feature with 251 dimensions from continuous 3 frames by 251 mask patterns. We explain how to use the mask patterns. When we use a mask pattern of N=1 shown in top right of Figure 2, the mask pattern is applied to the certain position in continuous 3 frames. Then the product of intensity of top left position at t-1 frame and intensity of center-position at t frame is computed. The mask pattern is slid pixel-by-pixel over the continuous 3 frames and the product is computed. Finally, the sum of all product values is computed. This value is 1 element of CHLAC feature vector. In this paper, we compute difference images of continuous frames to emphasize the motion, and 251 mask patterns are applied to them.
492
Y. Mimura, K. Hotta, and H. Takahashi
Fig. 2. The mask patterns for extracting CHLAC feature
3
Proposed Method
In recognition of a sequence x = (x1 , . . . , xT ), posterior probability p(C | x1 , . . . , xT ) of the sequence is important. If the feature xt given at time t is independent of the T other features in the sequence, the posterior probability can be computed as i p(xi | Ck )p(Ck ) where p(Ck ) is the prior probability and p(xt | C) is the conditional probability at each time. When parametric model is used to estimate of the conditional probability, the accuracy depends on the parameters. In addition, it is not guaranteed that we can select the appropriate parameters. Therefore, we estimate conditional probability by using non-parametric model. In section 3.1, posterior probability of a sequence is explained. Section 3.2 explains how to estimate the conditional probability by non-parametric model. Section 3.3 shows the example of the conditional probability estimation. Effectiveness of normalizing the norm of feature vector is explained in section 3.4.
Action Recognition Based on Non-parametric Probability Density
3.1
493
Posterior Probability of a Sequence
We should classify a test sequence x = (x1 , x2 , . . . , xT ) to the class which gives the maximum posterior probability p(Ck | x1 , x2 , . . . , xT ). When Bayes theorem is used, the posterior probability of the sequence is defined as
p(Ck | x1 , · · · , xT ) =
p(xT | Ck , x1 , · · · , xT −1 )p(Ck | x1 , · · · , xT −1 ) . p(xT | x1 , · · · , xT −1 )
(2)
If the feature of each frame x1 , x2 , . . . , xT is independent of each other, p(xT | Ck , x1 , . . . , xT −1 ) = p(xt | Ck ). Since p(xT | x1 , x2 , . . . , xT −1 ) is constant, we pay attention to the numerator of equation (2). When we suppose p(Ck | x0 ) = p(Ck ), the posterior probability can be computed as p(Ck | x1 , · · · , xT ) ∼ = p(xT | Ck )p(xT −1 | Ck )p(xT −2 | Ck ) . . . p(Ck | x0 ), ∼ p(xT | Ck )p(xT −1 | Ck ) . . . p(Ck ), = T ∼ = Πi p(xi | Ck )p(Ck ).
(3) (4) (5)
In this paper, the prior probability is set to p(Ck ) = N1C where NC is the number of class. We must estimate p(xi | Ck ) to classify the sequence. However, the estimation of the conditional probability is not easy. Although the conditional probability can be estimated by using parametric method, the accuracy depends on parameters. Therefore, we estimate the conditional probability directly from samples by non-parametric method. The benefit of this approach is that advance training is not required and it is easy to implement. 3.2
Estimation of Probability Density by Non-parametric Method
The estimation of probability density by non-parametric method is based the consideration in which the neighboring patterns in feature space have similar property. The estimation of probability density is based on the distance of the K-th nearest sample from a test sample x. If V(x) is hyperspherical volume which just included the K-th KTH nearest sample from x and N is the number of training samples, the probability density function is defined as p˜(x) =
K . N V (x)
(6)
Next, how to compute the hyperspherical volume is explained. When r is the distance of the K-th nearest sample from the test sample and n is the dimension of hypersphere, the hyperspherical volume is defined as ⎧ n ⎨ rn (2π) 2 (when n is even number) n!! V (x) = n−1 (7) ⎩ rn (2π) 2 (when n is odd number) n!!
494
Y. Mimura, K. Hotta, and H. Takahashi
Fig. 3. Conditional probability of class CA Fig. 4. Conditional probability of class CB
where n!! = n(n − 2)(n − 4) . . . 2 for even n and n!! = n(n − 2)(n − 4) . . . 1 for odd n. Equation (7) shows that V(x) depends on r. In this paper, the K-th nearest sample is selected for each class independently. Thus, the conditional probability density of the class Ck is defined as p(x | Ck ) =
K . N V (x)
(8)
When the conditional probability p(xi | Ck ) is estimated by this approach, we can estimate the posterior probability p(Ck | x1 , . . . , xT ). 3.3
Example of Estimation of Probability Density
We show the example of estimation of probability density using Figure 3 and 4. The circles within light blue in Figure 3 and 4 are test samples. Triangles are training samples of the class CA , squares are training samples of the class CB . The radius of hypersphere of K=1 or K=5 in Figure 3 is just the distance till the 1-st or 5-th nearest training sample. The radius of hypersphere in Figure 4 is the same. The volume of hypersphere in Figure 4 of K=1 is larger than that in Figure 3 because the nearest training sample of the class CB is further than one of class CA . As an example, we calculate the conditional probability of K=5 in Figure 3 by using the equation (6). When a test sample x is given, the conditional probability of the class CA is computed as where p˜(x | CA ) = NA V5A (x) = 12V5A (x) . NA is the number of training samples of the class CA and VA (x) is the hyperspherical volume of the K-th nearest sample from x . The class CA has 12 training samples as shown in Figure 3. Thus, NA = 12. When RA is the hyperspherical radius in n dimentional space, the hyperspherical volume VA is computed as
Action Recognition Based on Non-parametric Probability Density
⎧ n 5 (2π) 2 ⎪ ⎨ (RA ) n!! VA (x) =
⎪ ⎩
495
(when n is even number) (9)
n−1
2 (RA )5 (2π)n!!
(when n is odd number).
Therefore, the conditional probability is defined as
p˜(x | CA ) =
⎧ 5 ⎪ n ⎪ 2 ⎪ ⎨ 12(RA )5 (2π) n!!
(when n is even number)
⎪ ⎪ ⎪ ⎩
(when n is odd number).
(10) 5
n−1 2
12(RA )5 (2π)n!!
In the following experiments, the conditional probability is computed with the 1-st nearest training sample (K=1) from a test sample. We classify the action videos using the posterior probability estimated by this approach. 3.4
Normalization of Norm of Feature
In this paper, the posterior probability is based on the radius of the nearest sample in CHLAC feature space with 251 dimensions. Therefore, the distance metrics is important. Here we use the Euclidean distance after normalizing the norm. The feature vector after normalizing the norm of x is denoted as x . Then the distance DN between two samples is defined as Dn = x − y = xT x + y T y − 2xT y . 2
(11)
Since the norm of y and x is 1, Dn = 2(1−xT y ). When we substitute x = x x y and y = y into equation (12), Dn = 2(1 −
xT y ). xy
(12)
The correlation coefficient between x and y (the cosine of composed angle θ) is defined as cosθ =
xt y . xy
(13)
xt y Since the range of cosθ is defined as −1 ≤ cosθ = xy ≤ 1, the range of Dn is defined as 0 ≤ Dn ≤ 4. This means that the upper bound of distance Dn is defined by normalizing the norm. In addition, the Euclidean distance of normalized feature depends on the correlation which is used in image matching. Thus, this metrics is more efficient for the classification than the normal Euclidean distance.
496
4
Y. Mimura, K. Hotta, and H. Takahashi
Fig. 5. Normal
Fig. 6. Scale changes
Fig. 7. Different clothes
Fig. 8. Illumination changes
KTH Dataset
We use the KTH human motion dataset [1] which is used in recent researches of action recognition [4-7]. This dataset includes the 6 kinds of actions; boxing, hand-clapping, handwaving, jogging, running and walking. Each action is captured from 25 subjects under 4 different conditions. Examples of 4 variations of a subject are shown in from Figure 5 to Figure 8. Figure 5 shows a boxing video captured a fixing camera. Figure 6 is a boxing video is under scale changes. Figure 7 shows a boxing video with different clothes from that in Figure 5 and 6. Figure 8 is a boxing video under illumination changes. Each class has 100 videos (= 4 conditions × 25 subjects) though the handclapping category have only 99 videos. Each video consists of the gray-scale images of 160 × 120 pixels.
5
Experimental Results
In this paper, we use the KTH dataset which contains action videos of 25 subjects. Conventional methods used the 2 kinds of evaluations. In the first one, videos of 24 subjects are used in trainig and video of 1 subject is used in test. We evaluated 25 times while replacing the subject, and the average classification rate is used in evaluation. This evaluation method is used in the research [4]. Table 1 shows the result. We understand that the proposed method outperforms Niebles’s method. We also evaluated the mutual subspace method [8] which works well in sequential recognition. The subspace of each video is constructed by principle component analysis of CHLAC feature. The similarity between subspaces is computed, and the input video is classified to the class given maximam similarity. The number of dimention of each subspace is determined by a cumulative contribution rate. We obtain the highest accuracy at the contribution
Action Recognition Based on Non-parametric Probability Density
497
Table 1. Comparison with conventional methods 1
Table 2. Comparison with conventional methods 2
rate of 99 percents. The result is also shown in Table 1. The proposed method gives much higher accuracy than the mutual subspace method though the same feature is used. This shows the effectiveness of the proposed method. Next, we use the same evalution method as Dollar [5], Schuldt [6] and Ke [7]. Since they required the samples for parameter setting, only the videos of 8 subjects are used in training and 9 subjects are used as test. The proposed method does not need to set parameters, all samples can be used in training and test. This is the benefit of the proposed method. However, for fair comparison, the videos of 8 subjects and 9 subjects are selected randomly, the accuracy is evaluated. We evaluate 3 times while changing the initial seeds of a random function, and the average classification rate of 3 runs is used in evaluation. The result is shown in Table 2. The proposed method outperforms the methods of Schuldt and Ke. It gives comparable accuracy with the Dollar’s method. Although the proposed method is very simple in which the conditional probability is estimated from only the training samples, it outperformes the conventional methods. This demonstrates the effectiveness of our method.
6
Conclusion
In this paper, we proposed an action recognition method using the estimation the conditional probability by non-parametric model. We confirmed that the proposed method outperforms mutual subspace method used in the recognition of sequential data. The proposed method also gave higher accuracy than that of conventional methods using the KTH dataset. To improve the accuracy further, we will change the mask patterns for CHLAC feature. For example, the mask patterns in local 5×5×3 region can be used though mask patterns in 3×3×3 region are used in this paper. Those masks extract more rough motion feature. If we use fine and rough features simulately, the accuracy may be improved [9]. In addition, we can change the interval of time space. In this paper, CHLAC feature
498
Y. Mimura, K. Hotta, and H. Takahashi
is extracted from the continuous 3 frames in a sequence, we can the extract feature from the 3 frames such as t-2, t, t+2. This reduces the computional time and may improve the accuracy because motion is emphasized. The computational cost of the proposed method is high because it must measure the distance between all samples and an input. However, the reaserch to speed-up of K-nearest neighbor is worked actively [10]. They will be useful. The proposed method is independent of recognition task, and it is applicable to the other sequential recognition without any changes. We will apply it to the video ratrieval and detection of the abnormal action.
References 1. http://www.nada.kth.se/cvap/actions/ 2. Kobayashi, T., Otsu, N.: Action and Simultaneous Multiple-Person Identification Using Cubic Higher Order Loeal AutoCorrelation. In: International Conference on Pattern Recognition, pp. 741–744 (2004) 3. Otsu, N., Kurita, T.: A new scheme for practical flexible and intelligent vision systems. In: Proc. IAPR Workshop on Computer Vision, pp. 431–435 (1988) 4. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. International Journal of Computer Vision 79(3), 299–318 (2008) 5. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: Proc. IEEE Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72 (2005) 6. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions:A local svm approach. In: Proc. International Conference on Pattern Recognition, pp. 32–36 (2004) 7. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumetric features. In: Proc. International Conference on Computer Vision, pp. 166–173 (2005) 8. Yamaguchi, O., Fukui, K., Maeda, K.: Face recognition using temporal image sequence. In: Proc. IEEE Third International Conference on Automatic Face and Gesture Recognition, pp. 318–323 (1998) 9. Hotta, K.: View independent face detection based on horizontal rectangular features and accuracy improvement using combination kernel of various sizes. Pattern Recognition 42(3), 437–444 (2009) 10. Matsushita, Y., Wada, T.: Principal Component Hashing: An Accelerated Approximate Nearest Neighbor Search. In: Wada, T., Huang, F., Lin, S. (eds.) PSIVT 2009. LNCS, vol. 5414, pp. 374–385. Springer, Heidelberg (2009)
Asymmetry-Based Quality Assessment of Face Images Guangpeng Zhang and Yunhong Wang School of Computer Science and Engineering, Beihang University
[email protected],
[email protected]
Abstract. Quality assessment plays an important role in biometrics field. Unlike the popularity of fingerprint and iris quality assessment, the evaluation of face quality is just started. To solve the incapability for performance prediction and remove the requirement for scale normalization of existing methods, three face quality measures are proposed in this paper. SIFT is utilized to extract scale insensitive feature points on face images, and three asymmetry-based quality measures are calculated by applying different constraints. Systematical experiments validate the efficacy of the proposed quality measures.
1
Introduction
Quality assessment is receiving more and more attentions in biometrics field. Because of different characteristics of subjects, devices and environments, the samples’ quality varies from time to time. Low quality samples increase the enrollment failure rate, and decrease the system performance, therefore automatic quality assessment is necessary in both the enrollment and verification phases. With the help of quality assessment, the bad quality samples can be automatically removed during enrollment or rejected during verification. Qualities of samples can also be involved in the classification algorithm to increase the system performance [1]. For traditional image quality assessment, fidelity is an important issue, which measures the image’s consistency with the object’s appearance. But for biometrics quality assessment, fidelity is no longer the only issue to be considered, instead the correlation with the system performance and the ability to predict the performance becomes more important. During the last several years, Researchers have done much work on quality assessment for fingerprint and iris modalities [2], [3], [4], and roughly the proposed methods can be divided into three categories [4]: local feature based, global feature based, and classification based. Unlike fingerprint and iris cases, not so much work has been done on face quality assessment. Most previous face image quality assessment methods work on general image properties, such as contrast, sharpness and illumination intensity etc [5], [6], [7], [8], [9], [10], [11], [12]. As most existing face recognition algorithms are more or less insensitive to these normal image quality variations, quality assessment based on these measures can not predict the system performance very well. Gao et al [13] proposed to use G. Bebis et al. (Eds.): ISVC 2009, Part II, LNCS 5876, pp. 499–508, 2009. c Springer-Verlag Berlin Heidelberg 2009
500
G. Zhang and Y. Wang
asymmetry of LBP features as a measure of face image quality. However, Gao’s method requires the image size to be identical, face images are first normalized to the same size, which will reduce the quality, therefore the measure does not reflect the original quality. The requirement for scale normalization also exists in most of the previous methods. To solve this problem, in Ref. [5], the author scaled the face image into 5 discrete levels and trained a model for each of them. It can partially deal with scale problem, but is very laborious. In this paper we propose a face quality assessment method that can perform on face images without scale normalization. Bilateral asymmetry of SIFT feature points on face images is utilized to assess the quality. Three quality measures are proposed, and systematical experiments are conducted to demonstrate the efficiency of the proposed quality measures. The paper is organized as below: section 2 presents the three quality measures in detail; experiments and results are shown in section 3; section 4 concludes the paper.
2
Asymmetry Based Quality Measures
Illumination and pose variations normally cause the face images to be asymmetric, and significantly influence most of the state-of-the-art face recognition algorithms. In order to predict the recognition performance, the quality measure must take the asymmetry into consideration. Many facial asymmetry measures are proposed for different purposes [14][15], however, most of them require the face images to be normalized first so that image sizes are the same. Inevitably the scale normalization will reduce the image quality. To remove the requirement for scale normalization, we resort to scale invariant local features, and measure the asymmetric distribution of local feature points. Due to its interesting characteristics, SIFT (Scale Invariant Feature Transform) is chosen among various scale invariant features to calculate face image quality. SIFT was proposed by Lowe [16] for object recognition, for more details please refer to [17]. SIFT is a robust feature extraction method against image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. Using SIFT for quality assessment has two advantages: first, SIFT is sale invariant, and intrinsically it has the potential to solve the scale sensitivity problem; second, different from general image properties, SIFT is insensitive to many factors, and is more suitable to predict the performance of the state-of-the-art face recognition algorithms. Given a face image, SIFT is performed on it, resulting in a set of detected SIFT points S = {s1 , s2 , · · · , sN }, and each point si = (loci (xi , yi ), f eati ), where loci means the location of the point, and f eati means the feature vector of it. A vertical separation line x = t equally divides the image into left and right parts, which also divides the SIFT point set S into two subsets SL and SR respectively. If the width of the face image is W , then X ∈ [1, W ], and t = (1 + W )/2. Here we ignore those points lie on the separation line, therefore |S| = |SL| + |SR|, where || means the cardinality of the set. Set N L = |SL| and N R = |SR|. For a high quality face image, in which face pose is strictly frontal, illumination is evenly distributed, and no large expression is posed, the bilateral symmetry
Asymmetry-Based Quality Assessment of Face Images
501
Fig. 1. Asymmetric distribution of SIFT points detected on three face images
is guaranteed, and the number of SIFT points detected on both sides of the face image should be almost the same, while for those low quality face images, which is highly asymmetric due to various factors, the number of SIFT points on the two sides should be highly different, as shown in Fig. 1. Based on this observation, the first and most intuitive quality measure Q1 is defined as: Q1 =
min(N L, N R) max(N L, N R)
(1)
As we can see, Q1 only measures the inequality of the number of SIFT points on the two sides of face images. It can be explained as below: The side with the maximum number of SIFT points is chosen as the preferable side, and the corresponding SIFT point set is denoted by SM ax, and respectively the other subset is denoted by SM in, we assume there is an injective mapping f from SM in to SM ax, that for each SIFT point in SM in, there exist a unique matching point in SM ax. But Q1 does not impose any constraint on the match. By adding constraints on the mapping between SL and SR, another two quality measures Q2 and Q3 are proposed. Q2 is defined by adding location constraint on the mapping. Q2 = (
|f (SL, SR)| |f (SR, SL)| + )/2 NL NR
(2)
The function f is the bilateral symmetry matching function described in Table 1. From Table 1 we can see that, only those points having matching points with location constraint on the other side are considered by quality measure Q2. Besides the location constraint, feature constraint can further be applied to the match, which results in the third quality measure Q3. As the descriptor of SIFT point is not bilateral symmetric, the image is flipped left to right, and SIFT is performed on the flipped image, resulting in two subsets of SIFT point just as previously described, namely SL and SR , then Q3 is defined as: |f (SL, SL )| |f (SR, SR )| + )/2 NL NR and the matching condition in f becomes: (xj , yj ) − (xi , yi ) < r and f eatj − f eati < g Q3 = (
(3)
502
G. Zhang and Y. Wang Table 1. Bilateral symmetry matching function – Input: source point set S1, destination point set S2. – Output: M which is a largest subset of S1 that have matching points in S2. – Process: 1. set M = φ. 2. For each si = (loci (xi , yi ), f eati ) ∈ S1, i = 1, . . . , |S1| if ∃sj = (locj (xj , yj ), f eatj ) ∈ S2 that (xj , yj ) − (2 ∗ t − xi , yi ) < r then M = M ∪ si . – Return: M .
To fix the value r and g, the averages md, mf and standard deviations sd, sf of the shortest location distance and feature distance between each SIFT point with its neighbors are calculated, and set r = md − sd, and g = mf − sf .
3
Experiments and Results
To demonstrate the efficacy of the proposed quality measures, systematical experiments are conducted. CAS-PEAL database [18] is used in the experiments, which is a large scale face database of Mongolian, containing 30863 images of 1040 individuals (595 males and 445 females). There are “Frontal” variations of expression, lighting, accessory, background, time and distance, each one of them corresponds to a folder of images and are used as probe set (6992 images), and there is another “normal” folder which is used as gallery set (1040 images). These diverse variations also result in various face image qualities, therefore CAS-PEAL database is a suitable database for testing the performance of quality measures. Because most of the state-of-the-art face recognition algorithms deal with frontal images much better than multi-pose images, pose variations are not considered in the following experiments. However, that does not mean that the proposed quality measures can not deal with pose variations, actually many of our experiments show their suitableness to pose variations. The eyes coordinates of all the face images are provided with the database, and with these coordinates the face region are cropped out using a square bounding box considering the regular position of the eyes on face. The size of the cropped face images varies from 92 × 92 to 268 × 268, and Fig. 2 shows the histogram of the width of face images. The three quality measures are calculated on the cropped face images without scale normalization. For performance evaluation in the second and third experiments, the face images are further normalized to the size of 80 × 80 to calculate the matching distance.
Asymmetry-Based Quality Assessment of Face Images
503
1800 1600 1400 1200 1000 800 600 400 200 0 50
100
150
200
250
300
Fig. 2. Histogram of the width of cropped face images in CAS-PEAL database
3.1
Fidelity Assessment
Although fidelity is no longer the only issue to be considered for biometrics quality assessment, a good quality measure often has good fidelity as well. Experiment is conducted to test the fidelity first. Among different variations introduced above, lighting variations affect image quality most significantly, and the effect is easy to evaluate, therefore the quality measures’ fidelity with lighting variations is tested in this experiment. Totally there are 31 different lighting conditions, including 3 kinds of lighting sources (ambient, fluorescent, incandescent), 3 elevations (−45◦ ,0◦ , and +45◦) and 5 azimuths (−90◦ , −45◦ , 0◦ , +45◦ , and +90◦ ). For ambient lighting, only +45◦ elevation and 0◦ azimuth is captured, while for the other two kinds of lighting sources, all the 3 elevations and 5 azimuths are captured, resulting in 31 (1 + 2 × 3 × 5) lighting conditions. With 3283 images from “lighting” and “normal” folders, the average qualities under 31 lighting conditions are calculated, and plotted in Fig. 3. The X axis labels are named as Ixx ± nn: the initial character “I” indicates the kind of lighting source; the first x (E, F, L) indicates the kind of lighting source, E for ambient, F for fluorescent, and L for incandescent; the second x (U, M, D) indicates the elevation of lighting source, U for +45◦ , M for 0◦ , and D for −45◦ ; the ±nn indicates the azimuth of lighting source. From Fig. 3 clearly we can see that under the same lighting source and the same elevation, with the increasing of azimuth magnitude, the quality decreases. Although lie in different ranges, the three quality measures Q1, Q2, and Q3 show the same trend. Fig. 4 shows the qualities of three exemplar face images shown in Fig. 1 under different illumination conditions. The results demonstrate the fidelity of the proposed quality measures.
504
G. Zhang and Y. Wang
1 Q1 Q2 Q3
0.9 0.8 0.7
Quality
0.6 0.5 0.4 0.3 0.2 0.1 IEU IFD+00 IFD+00 IFD+45 IFD+90 IFD−45 −9 IFM IFM+00 + IFM 40 IFM+95 IFM−40 IFU−950 IFU+00 IFU+45 IFU+90 IFU−45 ILD−90 ILD+00 ILD+45 ILD+90 ILD−45 −9 ILM ILM+000 ILM+45 ILM+90 ILM−45 ILU−90 ILU+00 ILU+45 ILU+90 ILU−45 −9 0
0
Fig. 3. Fidelity assessment with lighting variations
3.2
Correlation Analysis
As described in the introduction section, to predict system performance is the most important task of biometrics quality assessment. To evaluate this ability, the correlation between quality measure and matching score is analyzed first. Given two samples x1 and x2 , with the corresponding qualities q1 and q2 , then the quality of the pair of samples q is defined as min(q1 , q2 ). To calculate the matching distance d, the standard Eigenface [19] method and LBP (Local Binary Pattern) [20] method are performed, both of which are well known in the face recognition community. In the LBP method, basically we followed the scheme proposed in [21], but the face image is divided into 4 × 4 windows, and equal weight is assigned to each window when computing the χ2 statistic. The Eigenface method is a representative global face recognition algorithm, while LBP method combines local features into a global description, using these two global methods to evaluate the local SIFT-based quality assessment is more convincing. In order to eliminate the influence of inter-subject variations, only the genuine matching distances are considered. Given the two random variables quality Q and matching distance D, the correlation coefficient ρQ,D between them is defined as ρQ,D =
E(QD) − E(Q)E(D) . E(Q2 ) − E 2 (Q) E(D2 ) − E 2 (D)
(4)
Asymmetry-Based Quality Assessment of Face Images
505
Fig. 4. Qualities of the three exemplar face images under different illumination conditions Table 2. The correlation coefficients between the three quality measures and the matching distances obtained by Eigenface and LBP Q1 Q2 Q3 Eigenface -0.4334 -0.1132 -0.1954 LBP -0.2943 -0.0988 -0.2253
Table 2 shows the correlation coefficients between the three quality measures and the matching distances obtained by Eigenface and LBP. Consistently we can see that the simplest Q1 has the higheset correlation with the matching distance, and Q2 has the lowest correlation. According to the correlation coefficients, the simplest quality measure Q1 gives the best results, and Q3 performs better than Q2. 3.3
ROC Curve Analysis
To further validate the proposed quality assessment measures, we calculate the quality of each sample, and sort the samples according to their qualities. Then we divide the samples equally into three parts, namely “low quality”, “medium quality” and “high quality” parts. Eigenface and LBP are performed to calculate the matching distances between samples and the ROC curves for each part. Fig. 5 shows the ROC curves of Q1, Q2, and Q3, from which we can see that for all the three quality measures, the quality affects the verification performance in the same way: high quality part gives better performance than low quality part. This clearly validates the ability of the proposed quality measures to predict system performance. But the ROC curves have different gaps among the three quality measures. Consistently we can see that Q1 curves are much more separable between “high quality” and “low quality” than between “high quality” and “medium quality”, Q2 curves are much more separable between “high quality” and “low quality” than between “medium quality” and “low quality”, and Q3 curves are more evenly spaced. According to the above three experiments, the proposed three quality measures Q1, Q2, and Q3 lie in different ranges, but they have the same trend in fidelity assessment, and in performance prediction the three quality measures show consistent behavior for the results obtained by the two matching methods
506
G. Zhang and Y. Wang
Fig. 5. ROC curve analysis. The first row shows the ROC curves of Eigenface method, and the second row shows that of LBP method. (a,d) for Q1, (b,e) for Q2, (c,f) for Q3. level 1 for low quality, and level 3 for high quality.
Eigenface and LBP. There is no certain rule to decide which one is the best, our suggestion is to try all three of them and select the most suitable one for your application.
4
Conclusions
Quality assessment is becoming more and more important in biometrics field. Three asymmetry based face quality measures are proposed in this paper, all of which are based on local SIFT features, and can be calculate without regard to the image size. Extensive experiments demonstrate the fidelity and ability to predict system performance of the proposed quality measures.
Acknowledgements This work was supported by Program of New Century Excellent Talents in University, National Natural Science Foundation of China (No. 60575003, 60332010, 60873158), Joint Project supported by National Science Foundation of China and Royal Society of UK (60710059), Hi-Tech Research and Development Program of China (2006AA01Z133), and the opening funding of the State Key Laboratory of Virtual Reality Technology and Systems (Beihang University).
Asymmetry-Based Quality Assessment of Face Images
507
References 1. Nandakumar, K., Chen, Y., Dass, S.C., Jain, A.K.: Likelihood ratio-based biometric score fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 342–347 (2008) 2. Chen, Y., Dass, S., Jain, A.: Fingerprint quality indices for predicting authentication performance. In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 160–170. Springer, Heidelberg (2005) 3. Chen, Y., Dass, S., Jain, A.: Localized iris image quality using 2-d wavelets. In: Zhang, D., Jain, A.K. (eds.) ICB 2005. LNCS, vol. 3832, pp. 373–381. Springer, Heidelberg (2006) 4. Alonso-Fernandez, F., Fierrez, J., Ortega-Garcia, J., Gonzalez-Rodriguez, J., Fronthaler, H., Kollreider, K., Bigun, J.: A comparative study of fingerprint imagequality estimation methods. IEEE Transaction on Information Forensics and Security 2, 734–743 (2007) 5. Luo, H.: A training-based no-reference image quality assessment algorithm. In: Proc. of International Conference on image Processing, pp. 2973–2976 (2004) 6. Subasic, M., Loncaric, S., Petkovic, T., Bogunovic, H., Krivec, V.: Face image validation system. In: Proc. of the 4th International Symposium on Image and Signal Processing and Analysis, pp. 30–33 (2005) 7. Fronthaler, H., Kollreider, K., Bigun, J.: Automatic image quality assessment with application in biometrics. In: Proc. of Conference on Computer Vision and Pattern Recognition Workshop, p. 30 (2006) 8. Werner, M., Brauckmann, M.: Quality values for face recognition. In: NIST Biometric Quality Workshop (2006) 9. Weber, F.: Some quality measures for face images and their relationship to recognition performance. In: NIST Biometric Quality Workshop (2006) 10. Hsu, R.L.V., Shah, J., Martin, B.: Quality assessment of facial images. In: Biometrics Symposium, pp. 1–6 (2006) 11. Fourney, A., Laganiere, R.: Constructing face image logs that are both complete and concise. In: Proc. of the fourth Canadian Conference on Computer and Robot Vision, pp. 488–494 (2007) 12. Nasrollahi, K., Moeslund, T.B.: Face quality assessment system in video sequences. In: Schouten, B., Juul, N.C., Drygajlo, A., Tistarelli, M. (eds.) BIOID 2008. LNCS, vol. 5372, pp. 10–18. Springer, Heidelberg (2008) 13. Gao, X., Li, S.Z., Liu, R., Zhang, P.: Standardization of face image sample quality. In: Lee, S.-W., Li, S.Z. (eds.) ICB 2007. LNCS, vol. 4642, pp. 242–251. Springer, Heidelberg (2007) 14. Liu, Y., Schmidt, K., Cohn, J., Mitra, S.: Facial asymmetry quantification for expression invariant human identification. Computer Vision and Image Understanding 91, 138–159 (2003) 15. Mitra, S., Liu, Y.: Local facial asymmetry for expression classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 889–894 (2004) 16. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proc. of the International Conference on Computer Vision, pp. 1150–1157 (1999) 17. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004) 18. Gao, W., Cao, B., Shan, S., Chen, X., Zhou, D., Zhang, X., Zhao, D.: The cas-peal large-scale chinese face database and baseline evaluations. IEEE Trans. on System Man, and Cybernetics (Part A) 38, 149–161 (2008)
508
G. Zhang and Y. Wang
19. Turk, M., Pentland, A.: Face recognition using eigenfaces. In: Proceedings the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–591 (1991) 20. Ojala, T., Pietikainen, M., Harwood, D.: A comparative study of texture measures with classification based on feature distributions. Pattern Recognition 29, 51–59 (1996) 21. Ahonen, T., Hadid, A., Pietikainen, M.: Face recognition with local binary patterns. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481. Springer, Heidelberg (2004)
Scale Analysis of Several Filter Banks for Color Texture Classification Olga Rajadell, Pedro Garc´ıa-Sevilla, and Filiberto Pla Depto. Lenguajes y Sistemas Inform´aticos Universitat Jaume I, Campus Riu Sec s/n 12071 Castell´on, Spain {orajadel,pgarcia,pla}@lsi.uji.es http://www.vision.uji.es
Abstract. We present a study of the contribution of the different scales used by several feature extraction methods based on filter banks for color texture classification. Filter banks used for textural characterization purposes are usually designed using different scales and orientations in order to cover all the frequential domain. In this paper, two feature extraction methods are taken into account: Gabor filters over complex planes and color opponent features. Both techniques consider simultaneously the spatial and inter-channel interactions in order to improve the characterization based on individual channel analysis. The experimental results obtained show that Gabor filters over complex planes provide similar results to the ones obtained using color opponent features but using a reduced number of features. On the other hand, the scale analysis shows that some scales could be ignored in the feature extraction process without distorting the characterization obtained.
1 Introduction Texture analysis has been tackled from different points of view in the literature. Literature survey provides us with a wide variety of well known texture analysis methods (co-occurrence matrices [5], wavelets [7], Gabor filters [3], local binary patterns [8], etc.) which have been mainly developed for grey level images. Although the supremacy of filter-bank based methods for texture analysis have been challenged by several authors [12] [8] they are still one of the most frequently used methods for tecture characterization. One goal of this paper is to analyze the influence of the scale parameter in several filter banks for texture analysis and study the information provided by each filter in order to reduce the characterization data required. Reducing the number of features used may make the feature extraction process easier. It is well known that, when dealing with microtextures, the most discriminant information falls in medium and high frequencies [2] [9]. Therefore, it may be convenient to consider the influence of each frequency band separately in order to identify where the textural information could be localized. Color texture analysis in multi-channel images has been generally faced as a multidimensional extension of techniques designed for mono-channel images. In this way, color images are decomposed into three separated channels and the same feature extraction process is performed over each channel. This definitely fails capturing the interchannel properties of a multi-channel image. G. Bebis et al. (Eds.): ISVC 2009, Part II, LNCS 5876, pp. 509–518, 2009. c Springer-Verlag Berlin Heidelberg 2009
510
O. Rajadell, P. Garc´ıa-Sevilla, and F. Pla
On the other hand, in order to study these inter-channel interactions, color opponent features were proposed [6] which combine spatial information across spectral bands at different scales. Furthermore, we propose the use of similar features obtained using Gabor filters over complex planes which also try to describe the inter-channel properties of color textures, but using a smaller number of features. The paper is organized as follows: first, the use of Gabor filters over complex channels and color opponent features are described in section 2. Section 3 describes the experiments performed and section 4 comments on the experimental results obtained. The conclusions are shown in section 5.
2 Feature Extraction Let I i (x, y) be the ith channel of an image and f (x, y) a filter in the filter bank. The response of an image channel to the filter applied is given by: hi (x, y) = I i (x, y) ∗ f (x, y)
(1)
The response of a filter over an image channel may be represented by its total energy: μi = h2i (x, y) (2) x,y
If a filter bank is applied, an image can be characterized by means of all the responses generated by all filters. It is possible to apply a filter in the space domain by a convolution or in the frequency domain by a product. In both cases, the feature obtained is the corresponding energy of the chosen group of pixels which responds to the filter applied [4]. When using filter banks, they are generally designed considering a dyadic tessellation of the frequency domain, that is, each frequency band (scale) considered is double the size of the previous one. It should not be ignored that this tessellation of the frequency domain thoroughly analyzes low frequencies, given less importance to medium and higher frequencies. Because the purpose of this work is to localize the texture information for color microtexture classification tasks, an alternative constant tessellation (giving the same width to all frequency bands) is proposed in order to ensure an equal analysis of all frequencies [10]. 2.1 Gabor Filter Bank Gabor filters consist essentially of sine and cosine functions modulated by a Gaussian envelope that achieve optimal joint localization in space and frequency [3]. They can be defined by eq. (3) and (4) where m is the index for the scale, n for the orientation and um is the central frequency of the scale. 2 1 x + y2 real fmn (x, y) = exp − × cos(2π(um x cos θn + um y sin θn )) (3) 2 2 2πσm 2σm 2 1 x + y2 imag fmn (x, y) = exp − × sin(2π(um x cos θn + um y sin θn )) (4) 2 2 2πσm 2σm
Scale Analysis of Several Filter Banks for Color Texture Classification
511
If symmetrical filters in the frequency domain are considered, only the real part of the filters in the space domain must be taken into account for convolution. 2.2 Complex Bands If we filter each individual image channel we will lose all inter-channel information in the image. Hence, in order to take advantage of the inter-channel data, complex bands will be used instead. In this way, two real image channels are merged into one complex channel, one as the real part and the other one as the imaginary part. In this way we involve inter-channel information in each characterization process (similarly as the opponent features do, see next section). Since complex channels are no longer real, their corresponding FT is neither symmetrical. In this case, we suggest the usage of complex filters (non-symmetrical filters in the frequency domain). As a result, for a cluster of image channels, we will consider all possible complex channels (pairs of channels). The Gabor filter bank will be applied over all complex channels as shown in eq. (5), where I i (x, y) is the ith image channel and fm,n (x, y) the filter corresponding to the scale m and the orientation n in the filter bank previously defined. i j hij (5) mn (x, y) = (I (x, y) + I (x, y)i) ∗ fmn (x, y) The feature vector for each filter applied over the image is composed of the energy response to all filters in the filter bank, that is: ψx,y = {μij =j,∀m,n mn (x, y)}∀i,j/i
(6)
As we are working with color images, the number of bands that compose the image is fixed to three. Even though, the size of the feature vector varies with the number of orientations and scales. For each complex channel, one feature is obtained for each filter applied what means that there will be as many features as filters for each complex band. So the total number of features is given by eq. (7) where M stands for the number of scales and N for the number of orientations. size(ψx,y ) = M × N × 3
(7)
As inter-channel information is introduced in complex channels, it would be interesting to use some sort of decorrelation method (e.g. PCA) to minimize the correlation of RGB data in order to guarantee that merged information do introduce relevant information. Therefore, in the experiments carried out we will show results applying the filter bank directly over the RGB channels, and also over the PCA-RGB channels. 2.3 Opponent Features Opponent features combine spatial information across image channels at different scales and are related to processes in human vision [6]. They are obtained from Gabor filters, computing firstly the difference of the outputs of two different filters. These differences among filters are needed for all pairs of image channels i, j with i = j and for all scales such that |m − m | ≤ 1: j i dij mm n (x, y) = hmn (x, y) − hm n (x, y)
(8)
512
O. Rajadell, P. Garc´ıa-Sevilla, and F. Pla
Then, the opponent features can be obtained as the energies of the computed differences: ij ρij (dmm n (x, y))2 (9) mm n = x,y
In this way, opponent features use inter-channel information and minimize the correlation of channels which are expected to be highly correlated. The feature vector for an image is the set of all opponent features for all image channels: ϕx,y = {ρij =j,∀m,m /|m−m |≤1,∀n mm n (x, y)}∀i,j/i
(10)
Hence, the size of the opponent feature vector also depends on the number of scales M , and orientations N/2, being the number of bands B = 3 for color images. Note that the number of orientations used in this case is half the number of orientations used before. This is because now each filter is applied over a single image channel (that is, a real image) and, therefore, the other half filters will provide symmetrical responses. size(ϕx,y ) =
size(ψx,y ) N + B × (B − 1) × (M − 1) × 2 2
(11)
Note that, for usual values of M and N , the number of features is considerably increased in this case.
3 Experimental Setup Several experiments have been conducted on texture classification in order to investigate the characterization properties of the filter banks described in previous sections. Also the effects of the different scales used to create the filter banks will be studied. Seventeen different color textures have been taken from the VisTex database [13] which are shown in Fig. 1. All of them are 512 × 512 sized images that have been divided into sixty-four non-overlapping patches of 64 × 64 pixels, which makes a total of 1088 samples for seventeen balanced classes. The experiments were held using two different tessellations of the frequency domain. For the first one, five dyadic scales (the maximum starting from width one and covering all the image) and eight orientations were used. For the second one, eight constant-width frequency bands and eight orientations were considered. It has been introduced certain degree of overlapping between filters as recommended in [1]. Gaussian distributions are designed to overlap each other when achieving a value of 0.5. The three kind of features previously described have been tested: Gabor features using complex channels over RGB images, Gabor features over complex PCA-RGB channels, and color opponent features. As stated in previous section, only four orientations were considered for color opponent feature due to symmetry. For each of the scales considered a classification experiment was held using only the features provided for that scale. In addition, an analysis of the combination of adjacent scales have been performed. In order to study the importance of low frequencies an ascendent joining was performed, characterizing patch with the data provided by joined
Scale Analysis of Several Filter Banks for Color Texture Classification
water - 1
water - 2
water - 3
water - 4
fabric - 1
fabric - 2
fabric - 3
stone - 1
stone - 2
metal - 1
metal - 2
metal -3
metal - 4
metal - 5
metal -6
sand
bark
Fig. 1. Color textures used in the experimental campaign
513
514
O. Rajadell, P. Garc´ıa-Sevilla, and F. Pla
ascendent scales. Similarly, the study of the high frequencies was carried out by a descendant joining. Also for medium frequencies, central scales are considered initially and adjacent lower and higher scales are joined gradually. All texture patches characterized are later divided into sixteen separate sets keeping the a priori probability of each texture class. Therefore, no redundancies are introduced and each set is a representative set of the bigger original one. Eight classification attempts were carried out for each experiment with the k-nearest neighbor algorithm with k = 3 and the mean of the correct classification rates of these attempts was taken as the final performance of the experiment. Each classification attempt uses one of these sets for training and another one as test set. Therefore, each set was never used twice in the same experiment.
4 Experimental Results Figure 2 shows the percentages of correct sample classification obtained for the experiments that used the dyadic filter banks whereas Figure 3 shows similar experimentation when the constant width filter banks were used instead. As it can be observed in both figures, the filter bank using the constant tessellation outperforms the dyadic one being in general more consistent. Briefly, the more detail is obtained from medium and high frequencies the best the texture is characterized. Note that a constant tessellation (Fig. 3) thoroughly analyzes medium and high scales which are claimed to contain discriminant information whereas dyadic does not. It can be observed in the graphs that, in general, the features derived from low scales do not help the characterization processes as the classification rates mainly decreases when they are considered. By analyzing scales individually, Fig.2.(a-b) and 3.(a-b), the lower scale can never outperform the classification rates achieved by medium and high scales which, in same cases, achieve up to 75% by themselves. Regarding the dyadic tessellation, although scales 2-4 independently do not outperform the characterization using all scales together (Fig. 2a-b), their join performance does, Fig. 2g-h. This is because the scales by themselves do not cover the whole area containing outstanding information but their joining cover it all and consequently its performance reaches the maximum classification rates. It was expected that last scale outperformed the rest since it covers a larger frequential area. The ascendant joining presented in Fig. 2c-d shows a very poor performance for low frequencies and higher performances are not reached until medium frequencies take part of the characterization. Likewise, Fig. 2e-f enforce this conclusion showing high performances when taking medium frequencies into account. In a nutshell, when all (five) scales are used, the classification rates are better than the ones obtained using the medium scales independently. However, it is similar to the results obtained joining this three scales although having a more reduced number of features which proves that medium frequencies include the main discriminant textural information. Note that graphs in Fig. 3 outperform those commented before. This is because medium and high frequencies are better analyzed in this case and this leads to a better texture characterization, improving the performance for all sort of joinings.
Scale Analysis of Several Filter Banks for Color Texture Classification
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
515
Fig. 2. Pixel classification rates using the filter bank with dyadic tessellation. (a,c,e,g) Gabor features over complex planes (b,d,f,h) Opponent features (a,b) Individual scales (c,d) Ascendent join (e,f) Descendent join (g,h) Central join.
516
O. Rajadell, P. Garc´ıa-Sevilla, and F. Pla
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
Fig. 3. Pixel classification rates using the filter bank with constant tessellation. (a,c,e,g) Gabor features over complex planes (b,d,f,h) Opponent features (a,b) Individual scales (c,d) Ascendent join (e,f) Descendent join (g,h) Central join.
Scale Analysis of Several Filter Banks for Color Texture Classification
517
Fig. 3g-h shows that no increase of the features may improve the characterization output as performance stays in the same values obtained using medium frequencies. Last but not least, the comparison between the feature extraction methods suggests that opponent features perform slightly better than Gabor filters over complex bands using RGB channels. It seems that opponent features provides an efficient method of including inter-channel information while decreasing correlation among these channels. This points out that inter-channel interaction is also very important for characterization and color images should not be treated as a simple dimensional extension. For this reason, when PCA is considered before the application of Gabor filters over complex channels, their results outperform not only the classification rates obtained using the original RGB channels, but also the rates obtained using the color opponent features. It is important to bear in mind that, in this case, the number of features used to characterize each texture patch is significantly smaller than the number of color opponent features.
5 Conclusions An analysis of the contribution of each scale to the characterization of color texture images has been performed. As it is known in the texture analysis field, medium and high frequencies play an essential role in texture characterization. Consequently, as has been shown, a constant tessellation of the frecuency domain outperforms the traditional dyadic tessellation for microtexture characterization. For three different feature extraction methods, a thoroughly analysis of the contribution of each independent scale and the groups composed by low, medium or high frequencies has been carried out. Besides, a few scales could be considered in the feature extraction process providing by themselves very high classification rates with a reduced number of features. The experiments carried out have shown that the usage of PCA in RGB images before applying the Gabor filters over complex channels enhance the texture characterization significantly. Furthermore, these features outperformed the color opponent features even using a smaller number of features.
Acknowledgment This work has been partly supported by grant FPI PREDOC/2007/20 and project P1-1B2007-48 from Fundaci´o Caixa Castell´o-Bancaixa and projects CSD2007-00018 (Consolider Ingenio 2010) and AYA2008-05965-C04-04 from the Spanish Ministry of Science and Innovation.
References 1. Bianconi, F., Fern´andez, A.: Evaluation of the effects of Gabor filter parametres on texture classification. Patt. Recogn. 40, 3325–3335 (2007) 2. Chang, T., Kuo, C.C.J.: Texture analysis and classification with tree-structured wavelet transform. IEEE Trans. on Geoscience & Remote Sensing. 441, 429–441 (1993) 3. Fogel, I., Sagi, D.: Gabor filters as texture discrimination. Biological Cybernetics 61, 103–113 (1989)
518
O. Rajadell, P. Garc´ıa-Sevilla, and F. Pla
4. Grigorescu, S.E., Petkov, N., Kruizinga, P.: Comparison of Texture Features Based on Gabor Filters. IEEE Trans. Image Processing 11(10), 1160–1167 (2002) 5. Haralick, R.M., Shanmugam, K., Dinstein, I.: Texture Features for Image Classification. IEEE Trans. Systems, Man, and Cybernetics 3(6), 610–621 (1973) 6. Jaim, A., Healey, G.: A multiscale representation including oppponent color features for texture recognition. IEEE Trans. Image Process. 7, 124–128 (1998) 7. Mallat, S.G.: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on PAMI 11, 674–693 (1989) 8. Ojala, T., Pietikainen, M., Maaenpaa, T.: Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Trans. Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002) 9. Petrou, M., Garc´ıa-Sevilla, P.: Image Processing: Dealing with Texture. John-Wiley and Sons, Dordrecht (2006) 10. Rajadell, O., Garc´ıa-Sevilla, P.: Influence of color spaces over texture characterization. Research in Computing Science 38, 273–281 (2008) 11. Randen, T., Hakon Huosy, J.: Filtering for Texture Classification: A Comparative Study. IEEE Trans. Pattern Analysis and Machine Intelligence 21(4), 291–310 (1999) 12. Varma, M., Zisserman, A.: Texture classification: Are filter banks necessary? In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 691–698 (2003) 13. VisTex Texture Database, MIT Media Lab (1995), http://vismod.media.mit.edu/
A Novel 3D Segmentation of Vertebral Bones from Volumetric CT Images Using Graph Cuts
Melih S. Aslan¹, Asem Ali¹, Ham Rara¹, Ben Arnold², Aly A. Farag¹, Rachid Fahmi¹, and Ping Xiang²
¹ Computer Vision and Image Processing Laboratory (CVIP Lab), University of Louisville, Louisville, KY 40292, {melih,farag}@cvip.uofl.edu
² Image Analysis, Inc., 1380 Burkesville St., Columbia, KY 42728, USA
www.cvip.uofl.edu
Abstract. Bone mineral density (BMD) measurements and fracture analysis of the spine bones are restricted to the vertebral bodies (VBs). In this paper, we present a novel and fast 3D segmentation framework for VBs in clinical CT images using the graph cuts method. The Matched filter is employed to detect the VB region automatically. In the graph cuts method, a VB (object) and the surrounding organs (background) are represented using gray level distribution models which are approximated by a linear combination of Gaussians (LCG) to better specify the region borders between the two classes (object and background). The initial segmentation based on the LCG models is then iteratively refined using a Markov–Gibbs random field (MGRF) with analytically estimated potentials. In this step, graph cuts are used as a global optimization algorithm to find the segmentation that minimizes an energy function which integrates the LCG model and the MGRF model. Validity was analyzed using ground truths of the data sets (expert segmentation) and the European Spine Phantom (ESP) as a known reference. Experiments on the data sets show that the proposed segmentation approach is more accurate than other known alternatives.
1 Introduction
The spine bone consists of the VB and spinal processes. In this paper, we are primarily interested in volumetric computed tomography (CT) images of the vertebral bones of the spine column, with a particular focus on the lumbar spine. The primary goal of the proposed work is in the field of spine densitometry, where bone mineral density (BMD) measurements are restricted to the vertebral bodies (see Fig. 1 for the regions of a spine bone). Various approaches have been introduced to tackle the segmentation of skeletal structures in general, and of vertebral bodies in particular, for the anatomical definition of a VB. For instance, Kang et al. [1] proposed a 3D segmentation
Fig. 1. Anatomy of a human vertebra (the image is adapted from [4])
method for skeletal structures from CT data. Their method is a multi-step method that starts with a three-dimensional region growing step using local adaptive thresholds, followed by a closing of boundary discontinuities and then an anatomically oriented boundary adjustment. Applications of this method to various anatomical bony structures were presented, and the segmentation accuracy was determined using the European Spine Phantom (ESP) [2]. Later, Mastmeyer et al. [3] presented a hierarchical segmentation approach for the lumbar spine in order to measure bone mineral density. This approach starts with separating the vertebrae from each other. Then, a two-step segmentation using a deformable mesh followed by adaptive volume growing operations is employed. The authors conducted a performance analysis using two phantoms: a digital phantom based on an expert manual segmentation and the ESP. They also reported that their algorithm can analyze three vertebrae in less than 10 min. This timing is far from the real time required for clinical applications, but it is a huge improvement compared to the timing of 1–2 h reported in [5]. Recently, in the context of evaluating Ankylosing Spondylitis, Tan et al. [6,7] presented a technique to segment whole vertebrae with their syndesmophytes using a 3D multi-scale cascade of successive level sets. The seed placement was done manually and the results were validated using synthetic and real data. Other techniques have been developed to segment skeletal structures and can be found, for instance, in [9,10] and the references therein. The main objective of our algorithm is to segment the VB. In this paper, we propose a novel automatic VB segmentation approach that uses, subsequently: i) the Matched filter for the automatic determination of the VB region; ii) the LCG method to approximate the gray level distributions of the VB (object) and the surrounding organs (background); and iii) graph cuts to obtain the optimal segmentation. First, we use the Matched filter to determine the VB region in each CT slice. In this method, no user interaction is needed. This step also helps the LCG method to initialize the gray level distributions more accurately. After the LCG method initializes the labels, the graph cuts segmentation
method is employed. The VB and the surrounding organs have very close gray level information, and there are no strong edges in some CT images. Therefore, we rely on both the volume gray level information and the spatial relationships of voxels in order to overcome the region inhomogeneity existing in CT images, as shown in Fig. 2. In this study, b-spline based interpolation and statistical level set methods using various post-processing steps are tested and compared with the proposed algorithm. The main advantages of the proposed framework are as follows: i) automatic detection of the VB using the Matched filter eliminates the user interaction and improves the initialization of the proposed method; ii) the complete framework is based on modeling the region of the VB, the intensity distribution, and the spatial interaction between the voxels in order to preserve details. Section 2 discusses the background of the Matched filter and the graph cuts method. Section 3 describes the alternative methods, explains the experiments, and compares the results.
Fig. 2. Typical challenges for vertebrae segmentation: (a) inner boundaries, (b) osteophytes, (c) bone degenerative disease, (d) double boundary.
2 Proposed Framework
This section briefly reviews the mathematical background of the techniques used in this paper.
2.1 Matched Filter
In the first step, the Matched filter [12] is employed to detect the VB automatically. This procedure eliminates the user interaction and improves the segmentation accuracy. Let f(x, y) and g(x, y) be the reference and test images, respectively. To compare the two images for various possible shifts τ_x and τ_y, one can compute the cross-correlation c(τ_x, τ_y) as

c(τ_x, τ_y) = ∬ g(x, y) f(x − τ_x, y − τ_y) dx dy,   (1)

where the limits of integration depend on g(x, y). Eq. (1) can also be written as

c(τ_x, τ_y) = ∬ G(f_x, f_y) F*(f_x, f_y) exp(j2π(f_x τ_x + f_y τ_y)) df_x df_y = FT⁻¹(G(f_x, f_y) F*(f_x, f_y)),   (2)
where G(f_x, f_y) and F(f_x, f_y) are the 2-D FTs of g(x, y) and f(x, y), respectively, with f_x and f_y denoting the spatial frequencies. The test image g(x, y) is filtered by H(f_x, f_y) = F*(f_x, f_y) to produce the output c(τ_x, τ_y). Hence, H(f_x, f_y) is the correlation filter, which is the complex conjugate of the 2-D FT of the reference image f(x, y). Figure 3(a) shows the reference image used in the Matched filter. Some examples of the VB detection are shown in Figs. 3(b)-(d). We tested the Matched filter using 3000 clinical CT images. The detection accuracy for the VB region is 97.6%.
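For illustration, a minimal sketch of this correlation-based detection step is given below, assuming OpenCV is available; cv::matchTemplate evaluates the cross-correlation of eq. (1) directly, and the function name, template, and output handling are placeholders rather than the actual implementation.

```cpp
#include <opencv2/imgproc.hpp>

// Locate the VB region in a CT slice by correlating it with a reference
// template (matched filtering): the peak of the correlation output gives the
// shift (tau_x, tau_y) at which the template best matches the slice.
cv::Rect detectVBRegion(const cv::Mat& slice, const cv::Mat& vbTemplate)
{
    cv::Mat response;
    // Normalized cross-correlation; equivalent to filtering the test image
    // with the conjugate spectrum of the template, as in eq. (2).
    cv::matchTemplate(slice, vbTemplate, response, cv::TM_CCORR_NORMED);

    double maxVal;
    cv::Point maxLoc;
    cv::minMaxLoc(response, nullptr, &maxVal, nullptr, &maxLoc);

    // Return the detected VB region as a rectangle of the template size
    // anchored at the correlation peak.
    return cv::Rect(maxLoc, vbTemplate.size());
}
```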
Fig. 3. (a) The template used for the Matched filter, (b-d) a few examples of automatic VB detection. The green line shows the detected VB region.
2.2 Graph Cuts Segmentation Framework
In the graph cuts method, a VB (object) and the surrounding organs (background) are represented using gray level distribution models which are approximated by a linear combination of Gaussians (LCG) to better specify the region borders between the two classes (object and background). The initial segmentation based on the LCG models is then iteratively refined using an MGRF with analytically estimated potentials. In this step, graph cuts are used as a global optimization algorithm to find the segmentation that minimizes an energy function which integrates the LCG model and the MGRF model. To segment a VB, we initially label the volume based on its gray level probabilistic model. Then we create a weighted undirected graph with vertices corresponding to the set of volume voxels P, and a set of edges connecting these vertices. Each edge is assigned a nonnegative weight. The graph also contains two special terminal vertices s (source), "VB", and t (sink), "background". Consider a
neighborhood system in P, which is represented by a set N of all unordered pairs {p, q} of neighboring voxels in P. Let L = {"0", "1"} be the set of labels corresponding to the VB and background regions, respectively. A labeling is a mapping from P to L, and we denote a labeling by f = {f_1, ..., f_p, ..., f_|P|}. In other words, the label f_p, which is assigned to the voxel p ∈ P, assigns it to the VB or background region. Now our goal is to find the optimal segmentation, the best labeling f, by minimizing the following energy function:

E(f) = Σ_{p∈P} D_p(f_p) + Σ_{{p,q}∈N} V(f_p, f_q),   (3)
where D_p(f_p) measures how much assigning a label f_p to voxel p disagrees with the voxel intensity I_p. D_p(f_p) = −ln P(I_p | f_p) is formulated to represent the regional properties of segments. The second term is the pairwise interaction model which represents the penalty for the discontinuity between voxels p and q. To initially label the VB volume and to compute the data penalty term D_p(f_p), we use the modified EM [13] to approximate the gray level marginal density of each class f_p, VB and background region, using an LCG with C⁺_{f_p} positive and C⁻_{f_p} negative components as follows:

P(I_p | f_p) = Σ_{r=1}^{C⁺_{f_p}} w⁺_{f_p,r} ϕ(I_p | θ⁺_{f_p,r}) − Σ_{l=1}^{C⁻_{f_p}} w⁻_{f_p,l} ϕ(I_p | θ⁻_{f_p,l}),   (4)
where ϕ(·|θ) is a Gaussian density with parameter θ ≡ (μ, σ²), i.e., with mean μ and variance σ². w⁺_{f_p,r} denotes the r-th positive weight in class f_p and w⁻_{f_p,l} denotes the l-th negative weight in class f_p. These weights satisfy the restriction Σ_{r=1}^{C⁺_{f_p}} w⁺_{f_p,r} − Σ_{l=1}^{C⁻_{f_p}} w⁻_{f_p,l} = 1. The simplest model of spatial interaction is the Markov–Gibbs random field (MGRF) with the nearest 6-neighborhood. Therefore, for this specific model the Gibbs potential can be obtained analytically using our maximum likelihood estimator (MLE) for a generic MGRF in [11,14]. The resulting approximate MLE of γ is:

γ* = K − (K² / (K − 1)) f_neq(f),   (5)

where K = 2 is the number of classes in the volume and f_neq(f) denotes the relative frequency of unequal labels in neighboring voxel pairs. To segment a VB volume, we use a 3D graph where each vertex represents a voxel in the VB volume. Then we define the weight of each edge as shown in the table below. After that, we get the optimal segmentation surface between the VB and its background by finding the minimum cost cut on this graph. The minimum cost cut is computed exactly in polynomial time for two-terminal graph cuts with positive edge weights via the s/t Min-Cut/Max-Flow algorithm [15].
Edge     Weight                 for
{p, q}   γ                      f_p ≠ f_q
{p, q}   0                      f_p = f_q
{s, p}   −ln[P(I_p | "1")]      p ∈ P
{p, t}   −ln[P(I_p | "0")]      p ∈ P
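For illustration, a compact sketch of this graph construction is given below, assuming the publicly available Boykov–Kolmogorov maxflow library (a Graph class with add_node, add_tweights, add_edge, maxflow, and what_segment); the LCG densities of eq. (4) are assumed to be evaluated beforehand, and all variable names are placeholders rather than the authors' code.

```cpp
#include "graph.h"   // Boykov-Kolmogorov maxflow library (assumed available)
#include <cmath>
#include <vector>

typedef Graph<double, double, double> GraphType;

// probVB / probBkg hold P(I_p | "0") and P(I_p | "1") evaluated from the LCG
// models of eq. (4); gamma is the Potts potential of eq. (5).
std::vector<int> segmentVolume(const std::vector<double>& probVB,
                               const std::vector<double>& probBkg,
                               int nx, int ny, int nz, double gamma)
{
    const int n = nx * ny * nz;
    GraphType g(n, 3 * n);                   // rough upper bound on n-links
    g.add_node(n);

    auto idx = [&](int x, int y, int z) { return (z * ny + y) * nx + x; };
    const double eps = 1e-10;                // guard against log(0)

    // t-links follow the table above: {s,p} gets -ln P(I_p | background),
    // {p,t} gets -ln P(I_p | VB), so a voxel kept on the source (VB) side
    // pays exactly its data penalty D_p(VB) when its sink link is cut.
    for (int p = 0; p < n; ++p)
        g.add_tweights(p, -std::log(probBkg[p] + eps), -std::log(probVB[p] + eps));

    // n-links: 6-neighborhood Potts interaction with weight gamma.
    for (int z = 0; z < nz; ++z)
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                int p = idx(x, y, z);
                if (x + 1 < nx) g.add_edge(p, idx(x + 1, y, z), gamma, gamma);
                if (y + 1 < ny) g.add_edge(p, idx(x, y + 1, z), gamma, gamma);
                if (z + 1 < nz) g.add_edge(p, idx(x, y, z + 1), gamma, gamma);
            }

    g.maxflow();                             // exact two-terminal min-cut

    std::vector<int> label(n);
    for (int p = 0; p < n; ++p)              // "0" = VB (source side), "1" = background
        label[p] = (g.what_segment(p) == GraphType::SOURCE) ? 0 : 1;
    return label;
}
```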
Fig. 4. Steps of the proposed algorithm: (a) the clinical CT data set, (b) the Matched filter determines the VB region, (c) LCG initialization, and (d) the final result using graph cuts.
3 Experiments and Discussion
To assess the accuracy and robustness of our proposed framework, we tested it using clinical data sets as well as the phantom (ESP). The real data sets were scanned at 120 kV and 2.5 mm slice thickness. The ESP was scanned at 120 kV and 0.75 mm slice thickness. All algorithms were run on a PC with a 3 GHz AMD Athlon 64 X2 dual-core processor and 3 GB RAM. All implementations are in C++. To compare the proposed method with other alternatives, the VBs were also segmented using b-spline interpolation and a statistical level sets method including some post-processing steps. Finally, segmentation accuracy is measured for each method using the ground truths (expert segmentation). M1 represents the proposed algorithm. The alternative methods used in the experiments are represented as M2 (spline-based interpolation), M3 (level sets with morphological closing post-processing), M4 (level sets without any post-processing), and M5 (level sets with interpolation post-processing).
To evaluate the results we calculate the percentage segmentation error as follows:

error% = 100 × (number of misclassified voxels) / (total number of VB voxels).   (6)

Table 1. Accuracy and time performance of our segmentation on 10 data sets. Average volume: 512×512×14. Average elapsed time for the 10 3D data sets is 7.5 s.

                Min. error, %   Max. error, %   Mean error, %   Stand. dev., %   Average time, s
M1 (Proposed)   2.1             12.6            5.6             4.3              7.5
M2              3.5             8.6             6.3             2.4              34.5
M3              7.3             34.3            13.7            11.5             12.5
M4              8.2             41.4            15.5            14.5             6.9
M5              7.2             37.2            14.5            12.8             43.6
Preliminary results are very encouraging. The tests were carried out on 10 data sets and the ESP, for which ground truths (expert segmentations) are available. The statistical analysis of our method is shown in Table 1, together with the results of the four alternative methods. The motivation behind our segmentation of the VB is to exclude errors as far as possible. The average error of the VB segmentation on the 10 clinical 3D image sets is 5.6% for the proposed method. It is worth mentioning that the segmentation step is extremely fast thanks to the automatic detection of the VOI using the Matched filter. The segmentation time is much lower than those reported in [3,5]. The spline-based interpolation method, represented as M2, has the closest segmentation accuracy for the clinical data sets, as shown in Table 1. An example of the 3D segmentation results for a clinical data set is shown in Fig. 5, which shows the results of M1-M5. Red color shows the misclassified voxels. More 3D results of the proposed method are shown in Fig. 6.
4 Validation
To assess the robustness of the proposed approach, the European Spine Phantom (ESP), which is an accepted standard for quality control in bone densitometry [2], was used to validate the segmentation algorithms. Figure 7 shows some CT images of the ESP used in our experiment. Because clinical CT images have gray level inhomogeneity, noise, and weak edges in some slices, the ESP was scanned under the same conditions to validate the robustness of the method. The VB segmentation error on the ESP is 3.0% for the proposed method. The level set method without any post-processing has the closest (but not lower) segmentation error, which is 9.9%. Fig. 8 shows the 3D segmentation results of the ESP for M1 (the proposed method) and M4. Because the proposed algorithm depends on both gray level information and the spatial interaction between voxels, it is superior to the other alternatives.
Fig. 5. 3D results of one clinical data set using different methods: (M1) the result of the proposed method, (M2-M5) the results of the alternative methods. Red color shows the segmentation errors.
Fig. 6. Some 3D results of the proposed framework
Fig. 7. CT images from the ESP data set
Fig. 8. 3D results for the ESP: (a) the result of the proposed algorithm, (b) the result of M4, which has the closest result to the proposed algorithm. Red color shows the misclassified area.
5 Conclusion
In this paper, we have presented a novel and fast 3D segmentation framework for VBs in clinical CT images using the graph cuts method. User interaction is completely eliminated using the Matched filter, which detects the VB region automatically. This step eliminates the manual initialization and improves the segmentation accuracy. Validity was analyzed using ground truths of the data sets and the European Spine Phantom (ESP) as a known reference. Preliminary results are very encouraging for the proposed method. The average segmentation errors on the 10 clinical 3D data sets and the ESP were 5.6% and 3.0%, respectively. Experiments on the data sets show that the proposed segmentation approach is more accurate and robust than other known alternatives.
References
1. Kang, Y., Engelke, K., Kalender, W.A.: A new accurate and precise 3D segmentation method for skeletal structures in volumetric CT data. IEEE Transactions on Medical Imaging (TMI) 22(5), 586–598 (2003)
2. Kalender, W.A., Felsenberg, D., Genant, H., Fischer, M., Dequeker, J., Reeve, J.: The European Spine Phantom - a tool for standardization and quality control in spinal bone measurements by DXA and QCT. European J. Radiology 20, 83–92 (1995)
3. Mastmeyer, A., Engelke, K., Fuchs, C., Kalender, W.A.: A hierarchical 3D segmentation method and the definition of vertebral body coordinate systems for QCT of the lumbar spine. Medical Image Analysis 10(4), 560–577 (2006)
4. http://www.back.com
5. Kaminsky, J., Klinge, P., Bokemeyer, M., Luedemann, W., Samii, M.: Specially adapted interactive tools for an improved 3D-segmentation of the spine. Computerized Medical Imaging and Graphics 28(3), 119–127 (2004)
6. Tan, S., Yao, J., Ward, M.M., Yao, L., Summers, R.M.: Computer aided evaluation of Ankylosing Spondylitis. In: IEEE International Symposium on Biomedical Imaging (ISBI 2006), pp. 339–342 (2006)
7. Tan, S., Yao, J., Ward, M.M., Yao, L., Summers, R.M.: Computer aided evaluation of Ankylosing Spondylitis using high-resolution CT. IEEE Transactions on Medical Imaging (TMI) 27(9), 1252–1267 (2008)
8. Ibanez, L., Schroeder, W., Cates, J.: The ITK Software Guide. Kitware Inc. (2003)
9. Sebastian, T.B., Tek, H., Crisco, J.J., Kimia, B.B.: Segmentation of carpal bones from CT images using skeletally coupled deformable models. Medical Image Analysis 7(1), 21–45 (2003)
10. Zoroofi, R.A., Sato, Y., Sasama, T., Nishii, T., Sugano, N., Yonenobu, K., Yoshikawa, H., Ochi, T., Tamura, S.: Automated segmentation of acetabulum and femoral head from 3-D CT images. IEEE Trans. Inf. Technol. Biomed. 7(4), 329–343 (2003)
11. Ali, A.M., Farag, A.A.: Automatic lung segmentation of volumetric low-dose CT scans using graph cuts. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part I. LNCS, vol. 5358, pp. 258–267. Springer, Heidelberg (2008)
12. Kumar, B.V.K.V., Savvides, M., Xie, C.: Correlation pattern recognition for face recognition. Proceedings of the IEEE 94(11), 1963–1976 (2006)
13. Farag, A.A., El-Baz, A., Gimelfarb, G.: Density estimation using modified expectation maximization for a linear combination of Gaussians. In: Proc. of International Conference on Image Processing (ICIP 2004), vol. 3, pp. 1871–1874 (2004)
14. Aslan, M.S., Ali, A.M., Arnold, B., Fahmi, R., Farag, A.A., Xiang, P.: Segmentation of trabecular bones from vertebral bodies in volumetric CT spine images. In: ICIP 2009 (accepted to appear, 2009)
15. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 1124–1137 (2004)
Robust 3D Marker Localization Using Multi-spectrum Sequences Pengcheng Li, Jun Cheng, Ruifeng Yuan, and Wenchuang Zhao Shenzhen Institute of Advanced Integration Technology, Chinese Academy of Sciences/The Chinese University of Hong Kong, 1068 Xueyuan Ave., Shenzhen Univ. Town, Shenzhen, 518055, China {pc.li,jun.cheng,rf.yuan}@siat.ac.cn,
[email protected]
Abstract. Robust 3D marker localization under different conditions is an open, challenging problem for stereovision systems. Over the years, many algorithms, using monocular or multiple views and based on visible or infrared sequences, have been proposed to solve this problem, but they all have limitations. In this paper, we propose a novel algorithm for robust 3D marker localization in different conditions, using synchronous visible and infrared (IR) spectrum sequences captured by a binocular camera. The main difficulty of the proposed algorithm is how to accurately match the corresponding marked objects in multi-spectrum views. We propose to solve the matching problem by considering geometry constraints, context-based features of specially designed markers, 3D physical spatial constraints, etc. Experimental results demonstrate the feasibility of the proposed algorithm.
1 Introduction
Three dimensional marker localization is a typical issue in stereovision systems. The main challenge is to acquire robust 3D localizations in various conditions. Many algorithms have been proposed, using monocular or multiple views and based on visible or IR sequences. Some of them have been successfully applied in many areas, including medicine [1], athletics [2], human-computer interfaces [3], etc. Three dimensional localization systems can be classified into marker-based and marker-free systems. Marker-free systems often use conventional, cheaper equipment and are very convenient for users. However, their commercial usage is markedly limited by the expensive computation involved [4], [5]. Marker-based localization systems are inexpensive and easy to use, which makes them more feasible. Many marker-based commercial systems have been successfully applied to record 3D human motions [1]. Marker-based systems can also be classified into active and passive systems. Systems based on wired active markers are more accurate and helpful for motion analysis and other research applications, but they are not convenient for household users [6]. Passive systems with wireless retro-reflective markers are more convenient and feasible for household applications [7], [8], [9].
In a passive binocular IR system, the passive markers appear as distinct high-brightness areas in the images under constant infrared illumination. Therefore, it is very easy and fast to locate a passive marker. But the IR image lacks texture and color information, so it is difficult to distinguish the targets from other disturbances with similar shapes and to match the targets between the left and right cameras. A binocular visible camera can capture much more texture and color information, which helps in distinguishing and matching targets, but it needs expensive computation to process the large amount of pictorial information [5]. It is difficult for these two kinds of binocular systems to be both accurate and real-time in various conditions. In this paper, we propose a fast and accurate algorithm for robust 3D marker localization based on synchronous visible and IR spectrum sequences. Our binocular system consists of one visible camera and one IR camera capturing images synchronously. In the IR image, it is easy to detect the targets because they have high intensity. In the visible-light image, we can distinguish the targets with their texture and color information. By combining the advantages of the visible and IR spectrum sequences, the proposed 3D localization algorithm can improve the performance in different conditions. The proposed algorithm consists of four main modules: marker detection, matching, feature-based recognition, and 3D localization with post-processing. The main problem is to accurately match the corresponding markers in the multi-spectrum images. We attempt to match the markers through the combination of the geometry constraint, context-based features (including shape, hue, and textures), and 3D physical spatial constraints. The hue and textures in the visible image are ultimately used to recognize the target markers. The remainder of the paper is organized as follows: Section 2 presents the system overview. The 3D marker localization algorithm is proposed in Section 3. Section 4 shows experimental results and conclusions are given in Section 5.
2 System Overview
Our stereovision system recognizes human motion for human-computer interfaces based on synchronous visible and IR sequences. The system, as shown in Fig. 1, includes the following steps: capturing, detection and localization, tracking, and recognition. The markers, fixed on the body, are specially designed to improve the system's performance. Fig. 2 shows the visible and IR sequences and the calibrated epipolar line. In this paper, we focus on the 3D marker localization algorithm.
3 3D Marker Localization
In this paper, we present a fast and accurate algorithm for robust 3D single marker localization based on visible and IR sequences (Fig. 2). As shown in Fig. 3, the proposed algorithm consists of three main steps: (1) marker detection and geometry-based matching; (2) shape-based classification and neighborhood recognition; (3) 3D localization.
Fig. 1. Overview of the proposed synchronous visible and IR system
Fig. 2. Multi-spectrum images: (a) the highlight object is a head marker; (b) a person with the head marker is playing a computer game. The white curve is the epipolar line for the head marker, which has been distorted by the camera's wide-angle lens.
3.1 Marker Detection and Geometry Based Matching
The marker detection step attempts to extract all the candidate foreground blobs in the visible and IR images. Because the passive markers have a high, constant, and distinct brightness, it is reliable to segment the candidate marker blobs with a simple threshold method. The disadvantage is that both the markers and some highlight objects are detected as candidates. For each candidate in the IR image, we can find the corresponding blobs in the visible-light image with the epipolar constraint (Fig. 2(b)). Thus, the processing time for the visible-light image is reduced significantly. Additionally, in marker detection, the fixed highlight objects in the IR sequences are statistically classified as background. This step helps reduce the number of candidates in complicated conditions with many fixed highlight disturbances.
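As a rough sketch of this thresholding and epipolar filtering (assuming OpenCV, undistorted images, and a fundamental matrix F from the IR to the visible camera obtained by calibration; the threshold values and names are illustrative, not the values used in the actual system):

```cpp
#include <opencv2/imgproc.hpp>
#include <opencv2/calib3d.hpp>
#include <cmath>
#include <vector>

// Threshold the (grayscale) IR frame to get candidate marker blobs, then keep
// only the visible-image blob centers that lie close to the epipolar line of
// some IR candidate.
std::vector<cv::Point2f> candidatesNearEpipolarLine(
    const cv::Mat& irFrame, const cv::Mat& F,
    const std::vector<cv::Point2f>& visibleBlobCenters,
    double intensityThresh = 200.0, double maxLineDist = 3.0)
{
    // 1. Candidate IR blobs: bright regions under constant IR illumination.
    cv::Mat mask;
    cv::threshold(irFrame, mask, intensityThresh, 255, cv::THRESH_BINARY);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Point2f> irCenters;
    for (const auto& c : contours) {
        cv::Moments m = cv::moments(c);
        if (m.m00 > 0) irCenters.emplace_back(m.m10 / m.m00, m.m01 / m.m00);
    }

    // 2. Epipolar constraint: compute the epipolar line of each IR candidate
    //    in the visible image and keep visible blobs near any of those lines.
    std::vector<cv::Point2f> matched;
    if (irCenters.empty()) return matched;

    std::vector<cv::Vec3f> lines;                        // a*x + b*y + c = 0
    cv::computeCorrespondEpilines(irCenters, 1, F, lines);

    for (const auto& p : visibleBlobCenters)
        for (const auto& l : lines)
            if (std::abs(l[0] * p.x + l[1] * p.y + l[2]) < maxLineDist) {
                matched.push_back(p);
                break;
            }
    return matched;
}
```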
Fig. 3. Algorithm overview: single marker localization in synchronous visible and IR spectrum image sequences
3.2 Features Based Classification and Recognition
The features used in classification and recognition include the shape, texture, and hue of the neighborhood area. Since IR images only contain shape information, we adopt shape features such as perimeter, area, length, and width to classify candidates into groups with different shapes. Since visible-light images have abundant color, texture, and shape information, we can use all these features to distinguish targets. In the shape-based classification, an SVM with an RBF kernel [10] is applied to classify the right candidate markers with a 2D shape model. Let E_n be the n-th marker's vector of shape parameters:

E_n = [ l, w, (a·b)/S, C²/S ]^T,   (1)
where l, w denote the lengths of the long and short axes of the fitted ellipse, respectively; a, b denote the length and width of the fitted rectangle, respectively; and S, C denote the area and perimeter of the marker region, respectively. Define the shape similarity of the m-th and n-th markers as:

S_mn = 1 − ‖E_m − E_n‖₂ / ‖E_m + E_n‖₂,   (2)
where E_m, E_n denote the vectors of shape parameters of the m-th and n-th markers in the IR or visible images, respectively. The shape similarity S_mn is used to validate the correctness of the matched markers in the multiple views. To recognize the marker, the neighborhood of the marker is detected in the proposed algorithm. For instance, the head marker can be recognized through the face neighborhood. The features of the face, including LBP textures [11] and hue, can be detected accurately and quickly. This allows the target marker to be located even in complicated conditions with multiple similar marker disturbances.
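For illustration, a compact sketch of eqs. (1)-(2); the geometric measurements (fitted ellipse axes, bounding rectangle, area, and perimeter) are assumed to be computed elsewhere, and the names are placeholders:

```cpp
#include <array>
#include <cmath>

// Shape parameter vector of eq. (1): [l, w, a*b/S, C*C/S]^T, built from the
// fitted ellipse axes (l, w), bounding rectangle (a, b), area S and perimeter C.
using ShapeVec = std::array<double, 4>;

ShapeVec shapeVector(double l, double w, double a, double b, double S, double C)
{
    return {l, w, (a * b) / S, (C * C) / S};
}

// Shape similarity of eq. (2): S_mn = 1 - ||E_m - E_n||_2 / ||E_m + E_n||_2.
// Values close to 1 indicate that two blobs have very similar shapes.
double shapeSimilarity(const ShapeVec& Em, const ShapeVec& En)
{
    double diff = 0.0, sum = 0.0;
    for (std::size_t i = 0; i < Em.size(); ++i) {
        diff += (Em[i] - En[i]) * (Em[i] - En[i]);
        sum  += (Em[i] + En[i]) * (Em[i] + En[i]);
    }
    return 1.0 - std::sqrt(diff) / std::sqrt(sum);
}
```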
3.3 3D Localization and Post Processing
Since the corresponding markers have been matched, the 3D marker localization result can be obtained through the calibrated stereo system. It is also possible that more than one 3D candidate localization is obtained; then the only correct, or most probable, marker localization needs to be recognized. The post-processing method in this section solves this problem. The spatial position and speed of the target marker are tracked to recognize incorrect candidates. If a candidate localization is incorrect, its speed may exceed the physical limits. Therefore we can add constraints on the marker's speed as follows. Let P_kn = (x_kn, y_kn, z_kn)^T be the world coordinates of the n-th candidate marker in the k-th frame. Note:

P_kn = (x_kn, y_kn, z_kn)^T = P_{k−1} + a_n i_k + b_n j_k + c_n k_k,   (3)

where P_{k−1} and {i_k, j_k, k_k} respectively denote the marker's 3D localization result in the (k−1)-th frame and the orthonormal basis in the k-th frame, and {a_n, b_n, c_n} denotes the n-th candidate's 3D motion vector from the (k−1)-th to the k-th frame. The marker above the head should be limited by the maximal physical speed of head movement:

√(a_n² + b_n² + c_n²) < V / frame rate,   (4)

where V is the maximal head moving velocity, which can be known beforehand.
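A minimal sketch of this check (the structure and names are placeholders; V and the frame rate are assumed to be known beforehand, as stated above):

```cpp
#include <cmath>

struct Point3D { double x, y, z; };

// Physical speed constraint of eq. (4): reject a 3D candidate whose implied
// displacement since the previous frame exceeds the maximal head velocity V.
// V is in world units per second and frameRate in frames per second.
bool satisfiesSpeedConstraint(const Point3D& prev, const Point3D& candidate,
                              double V, double frameRate)
{
    const double dx = candidate.x - prev.x;
    const double dy = candidate.y - prev.y;
    const double dz = candidate.z - prev.z;
    const double displacement = std::sqrt(dx * dx + dy * dy + dz * dz);
    return displacement < V / frameRate;
}
```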
4 Experimental Results
The algorithm is implemented in C++ on a calibrated synchronous 640 × 480 visible and IR image capturing platform (Fig. 4) and a Pentium Dual 1.6 GHz PC. In the platform, the cameras' directions are approximately parallel, so the corresponding markers' motion vectors in the 2D images can be regarded as parallel and equidistant. This constraint functions effectively in our experiments. A synchronous control system is specially designed to keep the cameras capturing at the same frame rate. The image processing algorithm runs in parallel with the image capturing thread. The proposed algorithm thread can run at about 58 fps in normal conditions. To test the robustness and effectiveness, some highlight objects are fixed on the background as disturbances in our experiments. In the experiments, we test the accuracy rate of the proposed algorithm in various indoor conditions (Seq. 0-9) and compare it with a marker localization algorithm using binocular infrared sequences (Bi-IR Seq. 0-2, Fig. 6). Despite its similarity to the proposed algorithm in framework, the latter cannot recognize the target marker accurately due to the lack of visible texture and hue information. The experimental results are listed in Table 1. The proposed algorithm has an
Fig. 4. Synchronous imaging platform: Webcam 0 is an IR spectrum camera. Webcam 1 is a common visible-light camera. The surrounding LED light sources are used to provide constant infrared illumination, so that the markers in the images have strong brightness. A user is playing a computer game.
Fig. 5. Experiments on disturbances of similar markers (a-c) and a moving person (d-f). The red circle in the images denotes the correct matching result. The results are shown in Table 1.
average superior accuracy rate (98.97%) over the Bi-IR algorithm (94.3%). In simple conditions with few disturbances, such as Seq. 0 and Bi-IR Seq. 0, both algorithms perform quite well, reaching a high accuracy rate of about 99.8%. But in complicated conditions with similar marker disturbances, such as Seq. 1-6 and Bi-IR Seq. 1-2, the proposed algorithm outperforms the Bi-IR algorithm significantly. Due to the increase of disturbances with a shape similar to the target marker, the accuracy of the Bi-IR algorithm drops from 98.6% to 84%, while the proposed algorithm maintains a steady performance (from 99.6% to 97.7%). In addition, since the IR camera is only sensitive to the markers and highlight objects, the Bi-IR algorithm does not need to handle the crowd, hues, and illumination, which cost much computation in the proposed algorithm. So the average running speed of the Bi-IR algorithm (70.0 fps) is faster than that of the proposed algorithm (57.6 fps). The visible information is effectively used to
Table 1. Experiment statistics: accuracy rate of head marker localization

Methods (Conditions)     Sequences (Frame number)   Accuracy Rate (%)
Proposed (Simple)        Seq. 0 (1873)              99.8
Proposed (Complicated)   Seq. 1 (600)               99.2
                         Seq. 2 (2563)              98.6
                         Seq. 3 (2265)              98.2
                         Seq. 4 (3019)              99.8
                         Seq. 5 (2253)              99.6
                         Seq. 6 (2889)              99.2
                         Seq. 7 (2180)              99.0
                         Seq. 8 (2546)              98.6
                         Seq. 9 (2754)              97.7
Proposed (Average)       Seq. 0-9                   98.97
Bi-IR (Simple)           Bi-IR Seq. 0 (2400)        99.8
Bi-IR (Complicated)      Bi-IR Seq. 1 (2566)        98.6
Bi-IR (Complicated)      Bi-IR Seq. 2 (2700)        84.5
Bi-IR (Average)          Bi-IR Seq. 0-2             94.3
Fig. 6. Comparison experiments on binocular infrared (Bi-IR) sequences: (a) simple conditions, (b) slightly complicated conditions, (c) complicated conditions. The red circle in the images denotes the correct matching result. (d) shows mismatching cases.
improve the accuracy rate in complicated conditions, such as Seq. 1-6 as shown in Table 1, but it also makes the proposed algorithm sensitive to visible disturbances. We also carried out experiments on various disturbances. Fig. 7 shows the experiments on various motions, including rotation, leaning, fast moving, etc. Fig. 8 shows some mismatching cases caused by marker disturbances or drastic motion disturbances. Fig. 9 shows the experiments on high indoor illumination and background hues. The highlight wall regions may multiply the number of candidate markers and disturb the detection and recognition of the target marker. Background hues that are similar to skin may also
Fig. 7. Experiments on various motion cases: (a) rotation, (b) leaning, (c) motion blur or fast moving. The red circle in the images denotes the correct matching result.
Fig. 8. Experiments: some mismatching cases caused by background or motion disturbances.
Fig. 9. Experiments on the influence of high illumination and wall pictures. The red circle in the images denotes the mismatching result in (a-d, f), and the correct result in (e). The green rectangles in the visible images denote highlight disturbances (a-c) or disturbances with a hue similar to the target (d-f). The white curve is the epipolar line. The results are shown in Table 1.
be mistakenly recognized as a face. The accuracy rate of Seq. 9 is a little lower than that of Seq. 7-8 because the target marker in Seq. 9 is at a location that is much more easily disturbed by the wall pictures or highlight walls. Although the proposed algorithm is affected by visible disturbances, its accuracy and running speed remain at a steady, high level, as shown in Table 1.
5 Conclusion and Future Works
In this paper, we propose a fast and accurate algorithm for robust 3D single marker localization in different indoor conditions, based on visible and infrared image sequences. The key problem, marker matching, is solved by the
combination of the geometry constraint, context-based features, and physical body constraints. The validity of the proposed algorithm has been demonstrated through experiments in various conditions. In the future, we will investigate the localization of multiple markers.
Acknowledgement The work described in this paper is supported by the National Natural Science Foundation of China (grant: 60806050).
References
1. Zhou, H., Hu, H.: Human motion tracking for rehabilitation - a survey. Biomedical Signal Processing and Control 3, 1–18 (2008)
2. Multon, F., Hoyet, L., Komura, T., Kulpa, R.: Interactive control of physically-valid aerial motion: Application to VR training system for gymnasts. In: Proceedings of the 2007 ACM Symposium on Virtual Reality Software and Technology, pp. 77–80 (2007)
3. Freeman, W., Tanaka, K., Ohta, J., Kyuma, K.: Computer vision for computer games. In: Automatic Face and Gesture Recognition, pp. 100–105 (1995)
4. Michoud, B., Guillou, E., Briceno, H., Bouakaz, S.: Real-time and marker-free 3D motion capture for home entertainment oriented applications. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part I. LNCS, vol. 4843, pp. 678–687. Springer, Heidelberg (2007)
5. Tao, Y., Hu, H.: Building a visual tracking system for home-based rehabilitation. In: Proceedings of the 9th Chinese Automation and Computing Society Conference in the UK, England, pp. 343–348 (2003)
6. http://www.codamotion.com/
7. Ferrigno, G., Pedotti, A.: ELITE: A digital dedicated hardware system for movement analysis via real-time TV signal processing. IEEE Transactions on Biomedical Engineering 32, 943–949 (1985)
8. Sementille, A., Lourenco, L., Brega, J., Rodello, I.: A motion capture system using passive markers. In: Proceedings of the 2004 ACM SIGGRAPH International Conference on Virtual Reality Continuum and its Applications in Industry, pp. 440–447 (2004)
9. Chung, J., Kim, J., Shim, K.: Vision based motion tracking system for interactive entertainment applications. In: TENCON 2005, pp. 1–6 (2005)
10. Gunn, S.: Support vector machines for classification and regression. Technical Report, Image Speech and Intelligent Systems Research Group, University of Southampton (1997)
11. Trinh, P., Ngoc, P., Jo, K.: Color-based face detection using combination of modified local binary patterns and embedded hidden Markov models. In: SICE-ICASE International Joint Conference, pp. 18–21 (2006)
Measurement of Pedestrian Groups Using Subtraction Stereo Kenji Terabayashi, Yuki Hashimoto, and Kazunori Umeda Chuo University / CREST, JST, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan
[email protected]
Abstract. In this paper, detection of pedestrian groups and counting of the number of pedestrians in each group using “subtraction stereo” are discussed. Subtraction stereo is a stereo vision method that focuses on the movement of objects to make a stereo camera robust and produces range images for moving regions. Pedestrian groups are detected with a standard labeling, and three dimensional (3D) features of pedestrian groups are measured from range images obtained by subtraction stereo. Then a method to count the number of pedestrians in a group is proposed. The basic algorithm of the subtraction stereo is implemented on a commercially available stereo camera, and the effectiveness of the method to count the number of pedestrians is verified by experiments using the stereo camera.
1 Introduction
A huge number of studies on stereo vision have been carried out to date [1,2,3,4]. These days, several practical stereo vision systems have been reported. Some studies realize real-time acquisition of range images using personal computers (PCs) because CPUs and Graphics Processing Units (GPUs) are fast enough [5,6]. In some studies, a Field Programmable Gate Array (FPGA) is used instead of a PC to acquire range images [7]. Some stereo cameras that connect to a PC are commercially available [8] and widely used. There are also stereo cameras that are practically used in cars [9]. We are aiming at developing a practical stereo camera for applications such as surveillance, in which detection of anomalies or measurement of moving people is required. Several systems have been proposed for this kind of surveillance using a single camera [10,11]. However, a single camera may not be sufficient, since the size of a target cannot be obtained without restrictions such as the camera pose being known and the target being on the ground. Stereo cameras are more appropriate, for size information can be directly obtained and scalable use for several scenes becomes possible. It becomes possible to distinguish whether the target is human-sized or larger/smaller, or to know whether the target person is an adult or a child. For example, it is possible to avoid falsely detecting a bird flying near the camera. As explained above, we can say that stereo cameras have already reached a practical level. However, stereo vision has the weakness that stereo matching is
inevitably not robust for weak textures or recurrent patterns, which is called the correspondence problem. There have been many studies that aim to solve this problem. Multi-baseline stereo [12], which uses multiple cameras to make the stereo matching more robust, is well known. Another approach is to project a pattern such as a random dot pattern to give texture to a scene, though this approach does not work when targets are far away. We focus on the movement of objects to make a stereo camera robust and have proposed "subtraction stereo" [13]. In this paper, we address the problem of detecting and measuring pedestrian groups, and apply subtraction stereo to it. This paper is organized as follows. In Section 2, we show the outline of subtraction stereo. In Section 3, we discuss the detection and measurement of pedestrian groups. Then in Section 4, we propose a method to estimate the number of pedestrians in a group. Experimental results of measuring pedestrians using the stereo camera with the subtraction stereo algorithm are given in Section 5. This paper is concluded in Section 6.
2 Basic Algorithm of Subtraction Stereo
Fig. 1 shows the basic algorithm of subtraction stereo. In standard stereo vision, the two images captured with the right and left cameras are matched and disparities are obtained at each pixel. Subtraction stereo adds a step to extract moving regions in the images of each camera, and then applies the stereo matching to the extracted moving regions. The extraction of moving regions is realized by a subtraction method, the simplest of which is background subtraction. Although the pixels for which a disparity is obtained are restricted to moving regions, subtraction stereo makes the stereo matching more robust by restricting the search space for matching and by using motion information as well as the original image itself. Fig. 2 shows an example of the result of subtraction stereo. Fig. 2(b) is the disparity image obtained by subtraction stereo for the scene in Fig. 2(a). Color represents the disparity; bluer color indicates larger disparity, i.e., smaller distance. In contrast to the disparity image obtained by standard stereo matching in Fig. 2(c), appropriate disparity images are obtained for moving objects only.
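To make the flow concrete, the following sketch combines background subtraction and block matching with OpenCV; as a simplification it masks the disparity output with the moving-region mask of the left camera rather than matching the subtraction images themselves, and the chosen subtractor (MOG2), matcher (StereoBM), and parameters are illustrative assumptions, not the implementation described in Sect. 5.

```cpp
#include <opencv2/video.hpp>
#include <opencv2/calib3d.hpp>

// One step of (simplified) subtraction stereo: extract moving regions in both
// rectified views with background subtraction, then keep disparities only
// where the left image is labeled as moving.
class SubtractionStereo {
public:
    SubtractionStereo()
        : bgLeft_(cv::createBackgroundSubtractorMOG2()),
          bgRight_(cv::createBackgroundSubtractorMOG2()),
          matcher_(cv::StereoBM::create(64 /*numDisparities*/, 15 /*blockSize*/)) {}

    // left/right: rectified 8-bit grayscale frames. Returns a 16-bit
    // fixed-point disparity map zeroed outside the detected moving regions.
    cv::Mat compute(const cv::Mat& left, const cv::Mat& right)
    {
        cv::Mat maskLeft, maskRight;
        bgLeft_->apply(left, maskLeft);      // moving-region masks (per camera)
        bgRight_->apply(right, maskRight);

        cv::Mat disparity;
        matcher_->compute(left, right, disparity);

        cv::Mat result = cv::Mat::zeros(disparity.size(), disparity.type());
        disparity.copyTo(result, maskLeft);  // restrict output to moving regions
        return result;
    }

private:
    cv::Ptr<cv::BackgroundSubtractor> bgLeft_, bgRight_;
    cv::Ptr<cv::StereoBM> matcher_;
};
```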
3 Detection and 3D Measurement of Pedestrian Group Regions
In this section, we explain the measurement of features of pedestrians with subtraction stereo. We assume a fixed parallel stereo camera in this paper.
3.1 Labeling
In the subtraction stereo, the obtained disparity image is originally restricted to moving regions. Therefore, pedestrian group regions can be obtained with the standard labeling technique.
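As a possible sketch of this labeling step (assuming OpenCV's connected-components routine; the region structure, names, and the minimum-area threshold are illustrative, not part of the implementation described in Sect. 5):

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

struct GroupRegion {
    cv::Rect bbox;          // bounding box of the pedestrian group region
    int area;               // number of pixels S of the region
    cv::Point2d centroid;   // image coordinates (u_c, v_c) of the centroid
};

// Label the (already moving-region-restricted) disparity image and collect one
// region per connected component larger than a minimum size.
std::vector<GroupRegion> labelGroups(const cv::Mat& disparity, int minArea = 100)
{
    cv::Mat mask = disparity > 0;            // valid (moving-region) disparities only
    cv::Mat labels, stats, centroids;
    int n = cv::connectedComponentsWithStats(mask, labels, stats, centroids, 8);

    std::vector<GroupRegion> groups;
    for (int i = 1; i < n; ++i) {            // label 0 is the background
        int area = stats.at<int>(i, cv::CC_STAT_AREA);
        if (area < minArea) continue;
        GroupRegion g;
        g.bbox = cv::Rect(stats.at<int>(i, cv::CC_STAT_LEFT),
                          stats.at<int>(i, cv::CC_STAT_TOP),
                          stats.at<int>(i, cv::CC_STAT_WIDTH),
                          stats.at<int>(i, cv::CC_STAT_HEIGHT));
        g.area = area;
        g.centroid = cv::Point2d(centroids.at<double>(i, 0),
                                 centroids.at<double>(i, 1));
        groups.push_back(g);
    }
    return groups;
}
```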
Fig. 1. Flow of subtraction stereo
3.2 Measurement of 3D Position
When the disparity of a point is given, the corresponding distance along the optical axis and the 3D position of the point can be calculated. Hereafter, we use "distance" to mean the distance along the optical axis. Let the disparity be k pixels and the distance be z m. The distance z is calculated by

z = (b · f) / (k · p),   (1)
where b is the baseline length, f is the focal length of the lens, and p is the width of each pixel of the image. As is well known, the distance is inversely proportional to the disparity. Furthermore, the 3D position of a measured point is obtained from the distance z and the image coordinates (u, v) of the point in the image (see Fig. 3). Assuming that there is no skew and the aspect ratio of each pixel is 1, the 3D position x is given as

x = z · [ (p/f)(u − u₀), (p/f)(v − v₀), 1 ]^T,   (2)

where (u₀, v₀) are the image coordinates of the image center.
3.3 Coordinate Transformation
When the position and orientation of the camera are given, the 3D position x_w of the point in the world coordinate system is calculated as

x_w = R x + t,   (3)
where t and R are camera’s position and orientation respectively. t is a 3D vector and R is a 3 × 3 rotation matrix.
Fig. 2. An example of subtraction stereo for an indoor scene: (a) scene, (b) subtraction stereo, (c) standard stereo
Fig. 3. Definition of the image coordinate system
As 3D information of the scene is obtained with a stereo camera, its position and orientation parameters can be measured by observing a scene. When a plane (the ground) is observed, its height is measured as a position parameter, and its tilt and roll angles are measured as orientation parameters. It is also possible to obtain the positions of pedestrians and use those points instead of the ground points to calculate the position and orientation parameters of the camera.
3.4 Features of Pedestrian Group Region
It is possible to calculate the 3D position of every point in a pedestrian group region using the procedure above. Considering the rather large measurement errors of stereo vision (which are proportional to the square of the distance), we average the distance to each point of the region and use this value zc as the distance to the region. By substituting zc and the image coordinates (uc, vc) of the centroid of the region into eq. (2), the 3D position xc that represents the region is obtained. At the same time, we adopt the area of the region (i.e., the number of pixels) S as a feature of the region.
4 Estimation of the Number of Pedestrians
In this section, we introduce the method to estimate the number of pedestrians in a detected pedestrian group region.
4.1 Method to Estimate the Number of Pedestrians
The area S and the distance z theoretically satisfy

c = S · z²,   (4)

where c is a constant. Let c1 be the value of c for one person obtained by eq. (4). If there is no overlap between persons, the number of pedestrians n can be estimated with

n = c / c1.   (5)

c1 can be estimated by calculation, given the size of a person, or by measuring a person.
4.2 Compensation of the Camera Orientation
The area S is affected by the camera orientation. When the optical axis of the camera is parallel to the ground, the area becomes maximum. A camera is often fixed at a high position with a downward tilt so that the pedestrians are looked down on, as illustrated in Fig. 4. In such a case, the area S becomes smaller. When the tilt angle is θ, the area S should be rectified to

S′ = S / cos θ.   (6)
Fig. 4. The relation of the camera and pedestrians
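Putting eqs. (4)-(6) together, a minimal counting sketch could look as follows; the function and parameter names are placeholders, and c1 is the single-person constant, estimated empirically in Sect. 5 as about 140,000 for that particular setup.

```cpp
#include <cmath>

// Estimate the number of pedestrians in a detected group region from its pixel
// area S and mean distance z: compensate the area for the camera tilt (eq. 6),
// compute c = S' * z^2 (eq. 4), and divide by the single-person constant c1 (eq. 5).
int estimatePedestrianCount(double areaPixels, double meanDistance,
                            double tiltAngleRad, double c1)
{
    const double rectifiedArea = areaPixels / std::cos(tiltAngleRad);   // eq. (6)
    const double c = rectifiedArea * meanDistance * meanDistance;       // eq. (4)
    return static_cast<int>(std::lround(c / c1));                       // eq. (5)
}
```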
5 Experiments
In this section, we show experimental results to evaluate the proposed method to measure pedestrian groups.
5.1 Implementation of Basic Algorithm Using a Commercially Available Stereo Camera
We implemented the basic algorithm of subtraction stereo on a commercially available stereo camera, the Point Grey Research Bumblebee2 (color, f = 3.8 mm). We set the size of the image to 320 × 240 pixels. In this implementation, simple background subtraction is employed to extract moving objects. The stereo matching procedure of the Bumblebee2 library was applied to the subtraction images of the right and left cameras to obtain a disparity image. The rate of obtaining disparity images is about 37 fps on a PC with a Core 2 Duo T9300 (2.50 GHz).
5.2 Experiment to Evaluate the Accuracy of the Method to Estimate the Number of Pedestrians
We evaluated the accuracy of the method to estimate the number of pedestrians. The stereo camera was installed so that the pedestrians were looked down on from a building as shown in Fig. 5. Fig. 6 shows the experimental setup. The camera was set at the height of 8.3 m with 50 deg downward tilt.
Fig. 5. Experimental scene
Fig. 6. Experimental setup of the experiments
Estimation of the constant. To estimate the constant c1, we investigated the change of c when a person with a height of 180 cm walked from the forefront to the back of the scene, almost along the camera's viewing direction. Fig. 7 shows the result. It shows that the constant c stays around 140,000 regardless of distance; therefore, we set the constant c1 to 140,000. If we obtain the distance z and the width p per pixel, the number
of pixels of a person's region in the image coordinate system can be obtained from eq. (2). This means that when the area S is obtained, the theoretical value of the constant c is obtained. The pixel width p of the stereo camera is 7.4 μm. When the theoretical value of the constant c is set to 140,000, it corresponds to an area S of 0.52 m². This area corresponds to an average width of 29 cm for a height of 180 cm.
Fig. 7. Time-series change of c for one person
Examples of estimating the number of pedestrians. We carried out experiments for scenes with different numbers of pedestrians. Figs. 8, 9, and 10 show the results. In each figure, (a) shows the color image of the scene captured with the stereo camera, and (b) shows the disparity image obtained with subtraction stereo and the results of detecting pedestrian groups and measuring the number of pedestrians in each group from the disparity image. The red rectangles surround detected pedestrian regions, and the numbers indicate the estimated number of pedestrians.
Fig. 8. An example of detecting pedestrian groups (several groups): (a) color image, (b) result
Fig. 9. An example of detecting pedestrian groups (12 persons): (a) color image, (b) result
Fig. 10. An example of detecting pedestrian groups (25 persons): (a) color image, (b) result
Fig. 11. Disparity images of the experimental scenes: (a) experimental scene 1 (2 persons), (b) experimental scene 2 (10 persons)
The results show that the proposed method works well for several conditions with small (minimum with 1 person) or rather large (maximum with 25 persons) groups; the pedestrian groups are detected appropriately, and the
Fig. 12. Experimental results of counting the number of pedestrians: (a) results for experimental scene 1, (b) results for experimental scene 2, (c) results for experimental scene 3
estimated numbers are almost accurate. A video of detecting the pedestrian groups and estimating the number of pedestrians is attached. As the video shows, the process is realized appropriately in real time. Evaluation of the accuracy of estimating the number of pedestrians. We carried out experiments to evaluate the accuracy of estimating the number of pedestrians for different numbers of pedestrians. We applied the proposed method to scenes in which 2, 10, and 25 pedestrians walk from right to left, and estimated the number of pedestrians. Fig. 11(a) and (b) show the disparity images for 2 and 10 pedestrians. Again, the red rectangles surround the detected pedestrian regions.
A disparity image of 25 pedestrians is given in Fig. 10(b). Fig. 12(a), (b), and (c) show the results of estimating the number of pedestrians for these scenes with 2, 10, and 25 persons, respectively. A pedestrian group appeared from the right edge of the image and disappeared at the left edge of the image. The range between the two red lines shows the period during which every pedestrian was captured by the camera. From the experimental results, we can see that a certain level of accuracy is achieved in estimating the number of pedestrians by the proposed method. Specifically, the errors of the estimated numbers while every pedestrian was captured are as follows: from -51 to 13% in Fig. 12(a), from -11 to 26% in Fig. 12(b), and from -8 to 18% in Fig. 12(c). Outside the two red lines in Fig. 12, the estimated numbers change mostly linearly, which is an appropriate result.
6 Conclusions
In this paper, we discussed the measurement of pedestrians using subtraction stereo. A method to detect pedestrian groups and estimate the number of pedestrians in each group was proposed. The basic algorithm of subtraction stereo was implemented on a commercially available stereo camera, and the effectiveness of the proposed method was verified by experiments using the stereo camera. When the overlapping regions of pedestrians seen from the camera increase, the area becomes smaller and consequently the estimated number of pedestrians becomes less than the actual value. Additionally, as the proposed method uses the area, the estimated numbers are affected by shadows. We proposed a simple technique to remove shadows in [14], but it is only applicable to separated persons and not to a pedestrian group. Our future work includes dealing with these issues. In particular, the development of a method to remove shadows in real time is important. We also plan to construct a stereo camera with the subtraction stereo function and apply it to surveillance applications.
References 1. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge Univ. Press, Cambridge (2000) 2. Hebert, M.: Active and passive range sensing for robotics. In: Proc. of ICRA 2000, vol. 1, pp. 102–110 (2000) 3. Brown, M., Burschka, D., Hager, G.: Advances in computational stereo. IEEE Trans. Pattern Analysis and Machine Intelligence 25, 993–1008 (2003) 4. Seitz, S., et al.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: Proc. of CVPR 2006, pp. 519–528 (2006) 5. Kagami, S., Okada, K., Inaba, M., Inoue, H.: Realtime 3d depth flow generation and its application to track to walking human being. In: Proc. of 15th International Conference on Pattern Recognition (ICPR 2000), vol. 4, pp. 197–200 (2000)
6. Ueshiba, T.: An efficient implementation technique of bidirectional matching for real-time trinocular stereo vision. In: Proc. of 18th International Conference on Pattern Recognition (ICPR2006), vol. 1, pp. 1076–1079 (2006) 7. Hariyama, M., Kobayashi, Y., Sasaki, H., Kameyama, M.: Fpga implementation of a stereo matching processor based on window-parallel-and-pixel-parallel architecture. IEICE Trans. Fundamentals, 3516–3522 (2005) 8. Point Grey Research, http://www.ptgrey.com/ 9. Hanawa, K., Sogawa, Y.: Development of stereo image recognition system for ada. In: Proc. IEEE Intelligent Vehicle Symposium 2001 (2001) 10. Collins, R., et al.: A system for video surveillance and monitoring. Tech. report CMU-RI-TR-00-12 (2000) 11. Haga, T., Sumi, K., Yagi, Y.: Human detection in outdoor scene using spatiotemporal motion analysis. In: Proc. of 17th International Conference on Pattern Recognition (ICPR 2004), vol. 4, pp. 331–334 (2004) 12. Okutomi, M., Kanade, T.: A multiple-baseline stereo. IEEE Trans. Pattern Analysis and Machine Intelligence 15, 353–363 (1993) 13. Umeda, K., Nakanishi, T., Hashimoto, Y., Irie, K., Terabayashi, K.: Subtraction stereo - a stereo camera system that focuses on moving regions. In: Proc. of SPIE 3D Imaging Metrology, vol. 7239 (2009) 14. Hashimoto, Y., Matsuki, Y., Nakanishi, T., Umeda, K., Suzuki, K., Takashio, K.: Detection of pedestrians using subtraction stereo. In: Proc. of 2nd International Workshop on SensorWebs, Databases and Mining in Networked Sensing Systems (SWDMNSS 2008), pp. 165–168 (2008)
Vision-Based Obstacle Avoidance Using SIFT Features
Aaron Chavez and David Gustafson
Department of Computer Science, Kansas State University, Manhattan, KS 66506
{mchav,dag}@ksu.edu
Abstract. This paper presents a vision-based collision detection algorithm. Our approach is similar to optic flow-based approaches, except that we are working at a feature level instead of a pixel level. The algorithm analyzes a pair of images taken from a moving camera at different times. Then, it recognizes imminent collisions by analyzing the change in scale and location of SIFT features in the pair of images. We have evaluated the performance of this algorithm and present our experimental results. Keywords: Local navigation; obstacle avoidance; vision-based navigation; SIFT.
1 Introduction
Mobile robot navigation requires the detection and avoidance of obstacles. Traditional algorithms tend to rely on sonar or laser range data, but strategies that rely on visual data are gradually becoming successful. There are multiple benefits to using visual sensors instead of strictly range-finding sensors. The first such benefit is the density of information presented by a camera, as compared to even an array of sonars and lasers. Objects that span a small region of pixels could theoretically be detected in an image, but would almost surely be missed by laser or sonar. Of course, depending on the application, avoiding such small objects may not be necessary, or even desirable.
From a design perspective, there may be benefits to vision-based obstacle avoidance. If the robot must perform other vision-related tasks, such as object recognition, it may be advantageous for the navigation to be vision-based as well. Then, both goals can be intimately coupled without conflicting behaviors. This simplifies the overall control scheme. From a theoretical perspective, an effective vision-based algorithm might emulate human behavior and provide insight into the natural human approach to the problem.
This paper presents a vision-based collision detection algorithm. The algorithm analyzes a pair of images taken from a moving camera at different times. It finds and matches the corresponding SIFT features in both images. Then, the algorithm uses the difference in the location and scale of the features to determine if there are any objects moving toward the camera, and if a collision is imminent.
2 Related Work
Once obstacles are identified, the process of navigating around them is generally straightforward. Potential fields [1] have been successfully used for this task in many applications. Recently, laser and vision have become increasingly more prevalent than sonar, and newer techniques capitalize on the more precise information provided by these sensors [2].
One common approach to vision-based obstacle detection is to segment out the ground or floor region of an image. Once this task is completed, obstacles can be identified simply as occlusions of the ground plane. This approach does depend on the assumption that the ground remains consistently identifiable, using color(s), texture or some other method. Regardless, there have been many successful implementations of this approach when confined to such a domain [3, 4, 5, 6, 7, 8]. If the floor is inconsistent over the entire environment, it may be possible to continually adapt the robot's model of the floor [9].
There is another category of obstacle detection strategies, which rely on depth perception. If a robot can calculate the distance from the camera to an object, it is straightforward to determine which objects pose imminent threats. While it is theoretically impossible to discern precise depth from a single image, some monocular heuristics do exist that provide reasonable estimates. The depth from focus method requires that the point spread function of the imaging system is known, and that objects are of uniform brightness and simple in shape. If such preconditions exist, the technique can estimate depth by performing successive blurring on an image and observing its effect on the edges of objects [10]. Another monocular approach relies on texture cues. Texture energies and texture gradients are both incorporated into a histogram. This composite feature can be trained with reinforcement learning to predict approximate depths [11].
While monocular strategies exist, most approaches to depth perception rely on processing two or more images. Two similar yet distinct approaches are stereo vision and optic flow. Stereo vision techniques [12, 13, 14] attempt to establish correspondences between images from two or more cameras. If such correspondences are accurately determined, depth information can be extracted. Alternately, if only one camera is available, correspondences might be established between two images taken by the same camera at different times. Optic flow [15] has been used in applications similar to those of stereo vision [16, 17].
2.1 SIFT
While correspondences between images can be found at the pixel level, an alternative strategy is to find correspondences between features in the images. Features are salient regions comprised of many pixels. For a feature-based approach, we require that a feature be robust with respect to small visual changes (illumination, rotation, pose, and scale). The SIFT feature descriptor satisfies these requirements [18]. Lowe approximated the Laplacian-of-Gaussian using a difference-of-Gaussian function. This function performs two consecutive smoothings using a Gaussian, and finds the difference in the resulting images. It is efficient to compute since the smoothing is already necessary to build multiple scales in the image pyramid [18].
The SIFT descriptor consists of image gradient histograms. The image gradient at every point in the region is calculated. Then, the region is divided into 16 sub-regions (4x4). For each sub-region, the gradients are reduced to eight directions and combined to form a histogram. The resulting 128 values (8 directional values for 16 regions) are the SIFT descriptor. Also note that interpolation is used to reduce the boundary effect; thus, values in the center of a sub-region are weighted higher than values on the edge. This makes the descriptor robust to small deformations in varying viewpoints [18]. SIFT features have successfully been used to derive depth values for objects. This technique has previously been applied to robot localization and mapping [19].
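As a concrete illustration of the 4x4x8 layout described above, the following sketch builds a simplified 128-value descriptor from a 16x16 patch of precomputed gradient magnitudes and orientations. It deliberately omits the interpolation and centre weighting that make the real descriptor robust to small deformations, and the patch size and final normalisation step are our own illustrative assumptions rather than details taken from [18].

```python
import numpy as np

def sift_like_descriptor(mag, ang):
    """Simplified SIFT-style descriptor: split a 16x16 gradient patch into a
    4x4 grid of sub-regions and accumulate an 8-bin orientation histogram in
    each sub-region, giving 16 * 8 = 128 values (no interpolation or Gaussian
    weighting, unlike the full SIFT descriptor)."""
    assert mag.shape == ang.shape == (16, 16)
    # Quantise each orientation (radians) into one of 8 directions.
    bins = np.floor((ang % (2 * np.pi)) / (2 * np.pi) * 8).astype(int) % 8
    desc = np.zeros((4, 4, 8))
    for i in range(16):
        for j in range(16):
            desc[i // 4, j // 4, bins[i, j]] += mag[i, j]
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc   # crude illumination normalisation
```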
3 Methodology
Our goal is to approximate real-world locations and trajectories of objects based on their varying location in a series of images. Our approach is very similar to optic flow-based approaches [16, 17], except that we are working at a feature level instead of a pixel level.
We will make some simplifying assumptions. First, we assume that the camera only moves directly forward; that is, orthogonal to the viewing plane. This is a reasonable assumption except when the camera (or the entity on which the camera is mounted) is turning. Our algorithm would simply not be applied during the times that the camera is turning. Future research will investigate algorithms that can be applied during camera rotation. Second, we assume that all images are taken at constant time intervals, and that the velocities of the camera and all objects are also constant (possibly zero). It is straightforward to fix the time interval between images and the velocity of the camera. Because we have no control over the velocities of other objects in a real-world scenario, it seems that the most logical prediction is to assume that they will continue to move at their currently observed velocity. Future research will explore algorithms where images need not be taken at constant intervals, and objects may not move at constant rates.
Our approach does not require any particular unit of measurement, nor does it need to know the exact velocity of the camera. We arbitrarily define the fixed time interval between frames to be one timestep. The distances between the camera and each object can be based off of this unit of time.
To determine if a collision is imminent, we need to estimate the velocities of the camera, and of all the objects in a series of images. We impose a 3-dimensional coordinate system to simplify this problem. The x-axis will be aligned with the horizontal axis of camera images, while the y-axis aligns with the vertical axis. The z-axis, then, is orthogonal to the viewing plane of the camera. Note that for this system to remain consistent from image to image, the camera can only move directly along the z-axis, but we have made this simplifying assumption.
The primary variable used in our implementation is the change in scale of the SIFT features. Note that in [19] the location of an object is determined by its disparity in position between multiple cameras. The scale information is ignored when calculating location, because it should theoretically be redundant. In our implementation, we use a single camera. A disparity in location between two images may be the result of the object's motion. Thus, only scale can potentially determine the object's precise location.
Fig. 1. In (I), an object is represented by the solid line, and the camera moves directly toward it. The object encompasses half of the viewing angle at point (A). At (B), it encompasses the entirety of the view. We can conclude that the camera is half as far away at (B). In (II), the camera begins at point (A) moving towards (C), and there is an object at (D). We can derive the ratio of (AD) to (BD) simply by observing the object's apparent scale at (A) and (B). However, to find the proportional length of (AC) or (CD), we first need to compute the various angles.
The scale of the object inversely corresponds to the distance from the camera. For example, an object that doubles in size from one frame to the next is half as far away (if the camera is moving directly toward it). If the camera and an object are moving directly towards each other on the z-axis (one might be stationary, only the relative motion is important), then this rate of change in scale is the only value necessary to compute the time to collision.
The relationship becomes slightly more complicated if the camera and object are not moving directly towards each other (relatively). An object that doubles in size from one frame to the next will still be half as far away from the camera. However, the distance between the camera and the object may not be changing at a constant rate. Consider the example in Figure 1 (II) to observe this relationship. Any collision between camera and object must occur when the z values are approximately equal. We can approximate the time until a collision on the z axis using scales and angles extracted from two consecutive images, and basic trigonometry. The same criteria must be met for the x and y values, but the x and y positions and velocities follow from similar calculations to those of z.
3.1 Algorithm
The input to the algorithm is a series of images from a moving camera taken at constant intervals. For the first image, there will be no previous images for comparison. The only action taken on the first image is to extract the SIFT features. For subsequent images, we again extract the SIFT features, but we also match these features to the previous image. Matching is performed using the canonical SIFT matching algorithm [18]. Certain thresholds must be specified for this matching
algorithm, including the maximum allowable scale difference. We set a maximum allowable scale difference of 2, implying that a feature can be twice as large in the second image and still match the first. The quality of matches appears to degrade if the scale difference is allowed to be larger.
When feature points are identified that have been persistent for 2 or more frames, we can interpolate their trajectories as explained in the previous section. Once a feature point's trajectory is calculated, it is straightforward to determine if it is imminently dangerous. First we observe the relative velocity of the camera and the feature point along the z-axis. From this, we find the time to collision on the z-axis. If this value is below a threshold, a collision may be imminent. We set this threshold at 2.0, meaning that a collision is imminent if it is going to occur within the next 2 timesteps. Then, for a collision to occur, the point and camera must have approximately the same x and y value at the time of the z collision. It is easy to determine if this is the case, because we know the x and y location as well as the x and y velocity of the feature point.
One feature point, however, is not sufficient to accurately report a collision. There are many feature points in every image, and there is potential for error in the matching phase. Furthermore, the scale and location of the feature point may have been calculated imperfectly. Thus, the final step of the algorithm is to check each dangerous feature point and see if it is corroborated by other nearby dangerous feature points. If a given dangerous feature point overlaps 3 other dangerous feature points, an imminent collision is reported. This criterion was found to work well empirically.
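A minimal sketch of this decision rule is given below, under the assumption that distance is inversely proportional to feature scale, so a feature growing from scale s1 to s2 between consecutive frames has a time to z-collision of s1 / (s2 - s1) timesteps. The Match container, the image-centre/radius stand-in for the x-y check, and the 50-pixel "overlap" test are illustrative simplifications of our own; the paper's actual x-y test uses the viewing angles and trigonometry described above.

```python
from dataclasses import dataclass

@dataclass
class Match:
    x1: float   # feature position and scale in frame t-1
    y1: float
    s1: float
    x2: float   # matched position and scale in frame t
    y2: float
    s2: float

def time_to_z_collision(m, eps=1e-6):
    """Timesteps until the feature reaches the camera plane, assuming the
    camera moves straight ahead at constant speed and d ~ 1/scale, which
    gives TTC = s1 / (s2 - s1)."""
    if m.s2 - m.s1 < eps:          # not growing -> not approaching
        return float("inf")
    return m.s1 / (m.s2 - m.s1)

def dangerous(m, ttc_threshold=2.0, image_center=(320, 240), radius=200):
    """Dangerous if the z collision is within the threshold and the linearly
    extrapolated image position stays near the camera axis (a crude stand-in
    for the x/y criterion in the text)."""
    t = time_to_z_collision(m)
    if t > ttc_threshold:
        return False
    x = m.x2 + t * (m.x2 - m.x1)
    y = m.y2 + t * (m.y2 - m.y1)
    return (x - image_center[0]) ** 2 + (y - image_center[1]) ** 2 < radius ** 2

def collision_imminent(matches, min_corroboration=3):
    """Report a collision only if some dangerous feature is corroborated by
    at least `min_corroboration` other dangerous features ('overlap' is
    approximated here by a fixed 50-pixel distance)."""
    danger = [m for m in matches if dangerous(m)]
    for a in danger:
        near = sum(1 for b in danger if b is not a and
                   (a.x2 - b.x2) ** 2 + (a.y2 - b.y2) ** 2 < 50 ** 2)
        if near >= min_corroboration:
            return True
    return False
```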
4 Results
To test the algorithm, several series of images were taken with a digital camera, simulating collisions or near-collisions with an ActivMedia Pioneer 2 robot. The camera and/or the robot were moved at constant intervals of one foot between each image to simulate constant motion over a constant timestep. Each series consisted of eight images culminating in an imminent collision, or with the robot leaving the camera's field of view (no collision). Various angles of approach ranging from 120º to 180º were tested.
Because feature point matches are required by the algorithm to determine dangerous points, only seven of the eight images in each series generate a testable hypothesis of collision imminent or no collision imminent. This hypothesis was confirmed or rejected by the ground truth, which is easily determined through actual physical measurements.
Table 1 shows the results of the 17 series of images. There were a total of 10 false positives (a collision was reported but not actually imminent). Recall, however, that a collision was defined in Section 3.1 to be imminent only if it will occur within the next 2 frames. Of these 10 false positives, 7 were reports of a collision that would occur in 3 or 4 frames. There were a total of 12 false negatives (no report, but a collision is imminent). Of these 12 false negatives, 7 could not report a collision because the final image in the series was too close to the camera. The feature point matching algorithm has a maximum scale increase of 2; thus, feature points that grow by a factor of more than 2 cannot be matched. Increasing the maximum scale difference, however, would cause the quality of the matches to degrade.
Illustration 1. The first example is a true positive. As the camera and robot move toward each other from (top-left) to (top-right), many overlapping dangerous features (black circles) are found and an imminent collision is reported. The second example (bottom-left to bottom-right) is a true negative. There is a large dangerous feature in (bottom-right), but not a sufficient density of dangerous features to report a collision, which is correct.
Table 1. Experimental results over 17 series of 8 images. A Correct report indicates that an imminent collision was reported at some point during a "Collision" series, or that no collision report was made during a "No collision" series. The true positives, true negatives, false positives, and false negatives are tallied from each individual image in every series. In the "Collision" scenarios, the final two images depict imminent collisions. In the "No Collision" scenarios, there are no imminent collisions and thus no true positives or false negatives.
              Tests   Correct reports   True positives   True negatives   False positives   False negatives   % correct
Collision       14          13                16               63                7                12            81%
No Collision     3           1                 -               18                3                 -            86%
Total           17          14                16               81               10                12            82%
There were 14 series of images in which a collision would in fact occur. The algorithm recognized an imminent collision in 13 of these. There were 3 series in which a collision would not occur. The algorithm correctly reported no imminent
collision in 1 of those series. It reported an imminent collision in the other 2; however, in those series a collision is only narrowly avoided. In none of the series did the algorithm report an imminent collision when the camera and robot were at least 7 feet apart.
5 Conclusion
We have presented a vision-based collision detection algorithm. The algorithm is capable of recognizing imminent collisions by analyzing the change in scale and location of SIFT features in a pair of images. The algorithm can consistently recognize true imminent collisions and can sometimes discern when an object will narrowly avoid a collision.
Future work has already been proposed to eliminate some of the simplifying assumptions of this algorithm. Modifications to our algorithm may make it usable while the camera is rotating or accelerating. Other such modifications may allow for images to be taken at varying time intervals. Other future work could focus on making adjustments to the algorithm to detect collisions from greater distances. Another possibility would be to experiment with a multi-camera approach, which may be able to place objects at more precisely defined locations (using the location in the images rather than scale). The algorithm could be tested in environments with greater variance in both background and potential obstacles. It could also be directly incorporated into the control scheme of a functional robot for a "real-world" performance evaluation.
References
1. Khatib, O.: Real-Time Obstacle Avoidance for Manipulators and Mobile Robots. The International Journal of Robotics Research 5(1) (1986)
2. Huang, W.H., Fajen, B.R., Fink, J.R., Warren, W.H.: Visual navigation and obstacle avoidance using a steering potential function. Robotics and Autonomous Systems 54, 288–299 (2006)
3. Lenser, S., Veloso, M.: Visual Sonar: Fast Obstacle Avoidance Using Monocular Vision. In: Proceedings of IROS 2003 (2003)
4. Hoffmann, J., Jüngel, M., Lötzsch, M.: A Vision Based System for Goal-Directed Obstacle Avoidance used in the RC 2003 Obstacle Avoidance Challenge. In: Nardi, D., Riedmiller, M., Sammut, C., Santos-Victor, J. (eds.) RoboCup 2004. LNCS (LNAI), vol. 3276, pp. 418–425. Springer, Heidelberg (2005)
5. Lorigo, L., Brooks, R., Grimson, W.: Visually-Guided Obstacle Avoidance in Unstructured Environments. In: IROS 1997, vol. 1, pp. 373–379 (1997)
6. Sekimori, D., Usui, T., Masutani, Y., Miyazaki, F.: High-speed Obstacle Avoidance and Self-Localization for Mobile Robots Based on Omnidirectional Imaging of the Floor Region. In: Birk, A., Coradeschi, S., Tadokoro, S. (eds.) RoboCup 2001. LNCS (LNAI), vol. 2377, p. 204. Springer, Heidelberg (2001)
7. Gini, G., Marchi, A.: Indoor Robot Navigation with Single Camera Vision. In: Proc. Pattern Recognition in Information Systems, PRIS, Spain (2002)
8. Pears, N., Liang, B.: Ground Plane Segmentation for Mobile Robot Visual Navigation. In: IROS 2001, vol. 3, pp. 1513–1518 (2001)
9. Ulrich, I., Nourbakhsh, I.: Appearance-Based Obstacle Detection With Monocular Color Vision. In: Proceedings of AAAI National Conference on Artificial Intelligence, Austin, TX, USA (2000)
10. Jahne, B., Geissler, P.: Depth From Focus with One Image. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR, pp. 713–717 (1994)
11. Michels, J., Saxena, A., Ng, A.Y.: High Speed Obstacle Avoidance Using Monocular Vision and Reinforcement Learning. In: ICML (2005)
12. Elinas, P., Hoey, J., Lahey, D., Montgomery, J.D., Murray, D., Se, S., Little, J.J.: Waiting with Jose, a Vision-Based Mobile Robot. In: ICRA 2002, vol. 4, pp. 3698–3705 (2002)
13. Scharstein, D., Szeliski, R.: A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. Int'l Journal of Computer Vision 47, 7–42 (2002)
14. Burschka, D., Lee, S., Hager, G.: Stereo-Based Obstacle Avoidance in Indoor Environments with Active Sensor Re-calibration. In: ICRA 2002, vol. 2, pp. 2066–2072 (2002)
15. Barron, J., Fleet, D., Beauchemin, S.: Performance of Optical Flow Techniques. Int'l Journal of Computer Vision 12, 43–77 (1994)
16. Camus, T., Coombs, D., Herman, M., Hong, T.-H.: Real-time Single Workstation Obstacle Avoidance Using Only Wide-Field Flow Divergence. In: International Conference on Pattern Recognition, vol. 3, pp. 323–330 (1996)
17. Souhila, K., Karim, A.: Optical Flow Based Robot Obstacle Avoidance. International Journal of Advanced Robotic Systems 4(1) (2007)
18. Lowe, D.G.: Object Recognition from Local Scale-Invariant Features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 1, pp. 1150–1157 (1999)
19. Se, S., Lowe, D.G., Little, J.: Vision-Based Mobile Robot Localization and Mapping Using Scale-Invariant Features. In: International Conference on Robotics and Automation, Seoul, Korea, pp. 2051–2058 (2001)
Segmentation of Chinese Postal Envelope Images for Address Block Location
Xinghui Dong, Junyu Dong, and Shengke Wang
Department of Computer Science and Technology, Ocean University of China, Qingdao, Shandong, 266100, China
[email protected], [email protected]
Abstract. In this paper, we propose a simple segmentation approach for camera-captured Chinese envelope images. We first apply a moving-window thresholding algorithm, which is less curvature-biased and less sensitive to noise than other local thresholding methods, to generate binary images. Then the skew images are corrected by using a skew detection and correction algorithm. In the following stage rectangular frames on the envelopes containing postcode are removed by using opening operators in mathematical morphology. Finally, a post-processing procedure is used to remove remaining thin lines. In this stage, connected components are labeled. We test 800 camera-captured envelope images in our experiments, including handwritten and machine-printed envelopes. For almost all of these images, the proposed approach can accurately separate the address block, stamp and postmark from the background.
1 Introduction
Automated machine processing and sorting of mail has become an important part of mail delivery systems. This work aims at developing an efficient method for processing Chinese postal envelopes by employing state-of-the-art techniques in image segmentation and text identification. Since segmentation of envelopes is a preliminary step before finding and recognizing the address block, research in the relevant fields has received attention in many countries. In [1-5], several segmentation algorithms for processing postal envelopes were introduced. Compared with these methods, thresholding is a simpler and faster approach for binarization of document images. In machine vision applications, binary images can be obtained by thresholding the original image in real time.
Most thresholding (binarization) methods [6-11] can be classified into two categories: global and local algorithms. If the subjects concerned are envelope images captured by a camera and the illumination is uneven, global thresholding techniques will have problems producing good results [12]. The methods proposed in [6, 7] also suffer from similar issues. In this case, local thresholding approaches [9,10,11] should be employed. In [13,14], O. Trier and T. Taxt point out that Niblack's method [9] has the best performance/speed ratio in terms of the error and reject rate. However, as discussed by Trier and Taxt, using Niblack's method alone cannot produce ideal binarization results, and a post-processing stage [11] is always required.
Fig. 1. The flow chart of the segmentation algorithm proposed in this paper: the camera-captured gray-scale image is binarized by calculating local thresholds and local contrasts with the improved moving-window RATS; the binary image is skew-detected and corrected; the outer rectangular frames enclosing the postcode are obtained with the opening operator and subtracted from the binary image; connected components are labeled and the maximum stroke width of each component is thresholded to remove thin lines, giving the final result containing the address block.
Unfortunately, it is difficult for the post-processing procedure to suit a wide range of images; it also increases the computational complexity. Sauvola et al. improved Niblack's method in [10] and discarded the post-processing procedure [11]. However, the text obtained by this method is thicker than the original. In particular, the strokes might become conjoined when the characters are very small, which decreases the accuracy of the OCR stage. Gatos et al. proposed another adaptive binarization approach [16]. Nevertheless, these three methods [9,10,16] need a priori knowledge related to the type of the studied objects. Since the objects investigated in this study are various kinds of envelopes, we need a universal method to fulfill the binarization task.
In [8, 15], Kittler et al. proposed a robust automatic threshold selection (RATS) algorithm. Inspired by these works, we propose an efficient approach based on RATS and Sobel operators to preprocess Chinese envelope images captured by off-the-shelf cameras. Although color images carry more visual information, in this study we only use gray-level images because they can be efficiently processed and effectively used in low-cost vision systems. We do not directly apply the moving-window RATS proposed in [15] in the binarization stage, as a post-processing step [11] would be required. Instead, the moving-window RATS method is improved and then employed to perform binarization so as to avoid post-processing. Furthermore, we also introduce an additional procedure to remove the rectangular frames enclosing the postcode, since envelopes used in mainland China are different from those used in other countries and previous approaches cannot meet this requirement. Experiments show that our method can produce results comparable to those produced by the two methods introduced in [9,10]. Fig. 1 shows the flow chart of the proposed approach.
2 The Improved Moving-Window RATS Method
Robust Automatic Threshold Selection (RATS) was originally introduced in [8]. RATS is a method for bilevel thresholding of grey scale images and was extended by using the Sobel operator in [15] to process vein images. It is based on a simple image statistic, namely the average of grey levels weighted by the gradient at each pixel position. Ideally, we want to compute the threshold in an isotropic area surrounding
every pixel. This can be done using a moving-window version of RATS, which can be written as the ratio of two convolutions
\[
T_h(x,y) = \frac{(\Pi_h * (e \cdot p))(x,y)}{(\Pi_h * e)(x,y)} \tag{1}
\]
where $*$ denotes convolution, $h$ is the width of the window, and $\Pi_h(x,y)$ is given by
\[
\Pi_h(x,y) = \begin{cases} 1 & \text{if } |x| \le h,\ |y| \le h \\ 0 & \text{otherwise} \end{cases} \tag{2}
\]
$p(x,y)$ is the grey level at $(x,y)$ and the gradient $e(x,y)$ is given by
\[
e(x,y) = \sqrt{\Delta_{x,\mathrm{Sobel}}^2(x,y) + \Delta_{y,\mathrm{Sobel}}^2(x,y)}, \tag{3}
\]
with
\[
\Delta_{x,\mathrm{Sobel}}(x,y) = \frac{\Delta_x(x,y-1) + 2\Delta_x(x,y) + \Delta_x(x,y+1)}{4}, \qquad
\Delta_{y,\mathrm{Sobel}}(x,y) = \frac{\Delta_y(x-1,y) + 2\Delta_y(x,y) + \Delta_y(x+1,y)}{4}. \tag{4}
\]
Equation (3) shows no curvature bias, is rotation invariant, and has reduced noise [17]. The Sobel operator can also reduce variance effectively, as indicated in [17]. Fig. 3 (a) presents a result obtained by using the moving-window RATS algorithm with respect to the degraded image in Fig. 2. There are large quantities of "ghost" objects in the binary image, so a post-processing step [11] will also be required. Fig. 3 (b) shows a result image obtained by using the moving-window RATS with a post-processing procedure [11]. Although satisfactory results can be obtained for a certain class of images with similar gray-level distributions, it remains difficult to find a good post-processing method for a wide range of images, which is common in automated mail processing systems. In the post-processing procedure [11], the important parameter $T_p$ was normally selected manually by experiments and analysis of errors. However, it is almost impossible to determine a fixed value for a set of blind test images. Following [18], we also select $T_p$ by estimating the distribution of local contrast, which can be seen as a descriptor of the input image. The local contrast image is obtained by calculating the difference between the local minimum and maximum [18]:
\[
C = \{c(x,y)\} = P_{\max}(x,y) - P_{\min}(x,y). \tag{5}
\]
A local window is defined as consisting of only one class if the contrast is lower than a threshold t = Otsu (C ) , where Otsu(C) represents Otsu’s method [7]. In the case of envelope image processing, we can always assume the characters are printed or
written in a small area. Thus, if a region has a contrast lower than the threshold $t$, we treat it as background. Then we can generate a binary image $B = \{b(x,y)\}$, in which $b(x,y) = 1$ represents foreground pixels and $b(x,y) = 0$ represents background pixels, by using the following equation:
\[
B = \{b(x,y)\} = \begin{cases} 1 & \text{if } p(x,y) < T_h(x,y) \text{ and } c(x,y) > t \\ 0 & \text{otherwise.} \end{cases} \tag{6}
\]
Fig. 3 (c) presents a result image obtained by our improved moving-window RATS.
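The sketch below gives a rough rendering of Eqs. (1)-(6) in a few lines of Python; it is not the authors' code. A box filter stands in for the window Π_h, SciPy's Sobel filters stand in for Eq. (4) (the scale factor cancels in the ratio of Eq. (1)), scikit-image's Otsu threshold is used as a stand-in for [7], and the half-window size h = 15 is an illustrative value.

```python
import numpy as np
from scipy import ndimage
from skimage.filters import threshold_otsu

def improved_moving_window_rats(gray, h=15):
    """Binarize a gray-scale envelope image following Eqs. (1)-(6):
    a moving-window RATS threshold (gradient-weighted local mean of grey
    levels) combined with a local-contrast test whose threshold t is chosen
    by Otsu's method on the contrast image."""
    p = gray.astype(float)
    # Sobel gradient magnitude, Eq. (3)
    gx = ndimage.sobel(p, axis=1)
    gy = ndimage.sobel(p, axis=0)
    e = np.hypot(gx, gy)
    # Ratio of two box-filter convolutions, Eq. (1)
    size = 2 * h + 1
    num = ndimage.uniform_filter(e * p, size=size)
    den = ndimage.uniform_filter(e, size=size) + 1e-9
    th = num / den
    # Local contrast = local maximum minus local minimum, Eq. (5)
    c = ndimage.maximum_filter(p, size=size) - ndimage.minimum_filter(p, size=size)
    t = threshold_otsu(c)
    # Eq. (6): foreground where darker than the local threshold AND contrasted
    return (p < th) & (c > t)
```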
Fig. 2. A degraded gray-level Chinese envelope image
Fig. 3. (a): The result image obtained by using moving-window RATS. (b): The result image obtained by means of moving-window RATS with post-processing [11]. (c): The result image obtained by using improved moving-window RATS. (d): The corrected image produced by our skew detection and correction algorithm [19].
3 Skew Detection and Correction for Binary Envelope Images
Because all images used in our experiments are camera-captured, the images might be skewed when the camera or envelope is placed slantwise. This affects the accuracy of the process of removing the outer rectangular frames that contain the postcode and of the following procedure for locating the address block on the envelope images. Thus, we should detect and correct the skew of the envelope images before removing the outer rectangular frames. We use an efficient and fast skew detection and correction algorithm introduced in [19] to complete this task. Fig. 3 (d) shows the corrected image obtained by our skew detection and correction method; the original image is shown in Fig. 3 (c). As shown in Fig. 3 (d), this algorithm might produce slight deformation of the original characters or other foreground objects. However, experiments show that this has little effect on the final recognition results for Chinese documents [19].
4 Removal of Outer Rectangular Frames Containing Postcode
Envelopes used in mainland China are different from those used in other countries. A notable difference is that the position of the postcode is fixed: it lies in the top-left region of the envelope. In addition, there are rectangular frames in which the postcode is filled in. These frames need to be removed so that the address on the envelope can be accurately located in a later stage.
Mathematical morphology [20] has been widely used in solving image processing problems that in some circumstances are difficult to solve by simply applying linear filters. The union or intersection of a series of linear openings or closings can be used to extract long, thin features within an image. Consequently, in order to remove the outer rectangular frames containing the postcode, we first extract the horizontal and vertical lines. Because the postcode always lies in the top-left region of the envelope, we only need to apply the opening operator on a sub-image of the envelope. Based on our study of a large number of Chinese envelopes, we selected an optimal sub-region ratio which is suitable for almost any envelope used in mainland China. By using this ratio instead of a concrete width or height, our method can also be applied to various image resolutions. In order to extract horizontal lines we use a horizontal linear SE (structuring element) $o_h$. Suppose $s$ is the sub-image of the binary envelope image.
\[
o_h = [\,1\;1\;1\;\dots\;1\,]_{l} \tag{7}
\]
where $l$ is the size of the SE $o_h$, which represents the smallest length of horizontal line that will be extracted. Since the distance between the camera and the envelopes is fixed in the process of taking pictures, the value of $l$ can be set approximately for a given resolution. First we apply the opening operator $o_h$ on $s$, and the result is $s_h$:
\[
s_h = s \circ o_h = (s \ominus o_h) \oplus o_h \tag{8}
\]
Then we obtain the vertical lines with the same process as for the horizontal lines; the difference is that we use a vertical SE $o_v$:
\[
o_v = [\,1\;1\;1\;\dots\;1\,]_{m}^{T} \tag{9}
\]
\[
s_v = s \circ o_v = (s \ominus o_v) \oplus o_v \tag{10}
\]
\[
s_f = s_h + s_v \tag{11}
\]
We merge the extracted horizontal and vertical lines and obtain the rectangular frames $s_f$. Finally, we can remove the rectangular frames from the raw binary image $I$ using the subtraction operator (see Equation 12). Fig. 4 shows the rectangular frames (left) extracted from the original image and the result (right) after removing those frames.
\[
I' = I - s_f \tag{12}
\]
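A compact sketch of Eqs. (7)-(12) with SciPy's morphology routines is shown below; the sub-region ratio and the SE lengths l and m are placeholder values chosen for illustration, since the paper determines them from its own camera setup rather than stating them here.

```python
import numpy as np
from scipy import ndimage

def remove_postcode_frames(binary, sub_ratio=(0.35, 0.45), l=40, m=40):
    """Remove the rectangular postcode frames following Eqs. (7)-(12).
    The frames are searched only in a top-left sub-image whose size is a
    fixed ratio of the envelope; ratio and SE lengths are illustrative."""
    img = binary.astype(bool)
    rows = int(img.shape[0] * sub_ratio[0])
    cols = int(img.shape[1] * sub_ratio[1])
    s = img[:rows, :cols]
    # Opening with a horizontal SE keeps only long horizontal runs (Eq. 8)
    s_h = ndimage.binary_opening(s, structure=np.ones((1, l)))
    # Opening with a vertical SE keeps only long vertical runs (Eq. 10)
    s_v = ndimage.binary_opening(s, structure=np.ones((m, 1)))
    s_f = s_h | s_v                    # merged frame lines (Eq. 11)
    out = img.copy()
    out[:rows, :cols] &= ~s_f          # subtract the frames (Eq. 12)
    return out
```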
However, when the postcode overlaps with the rectangular frames, the digits might be broken after removing frames. In this case, those broken digits in each possible orientation can be restored by applying the mathematical morphology with dynamic kernels [21] within the regions of rectangular frames obtained in the previous step.
Fig. 4. Left: The rectangular frames extracted from binary image. Right: The produced result after removing the rectangular frames.
Fig. 5. Left: An example of thin lines that cannot be removed by setting thresholds on the width, height and height-width ratio of the components. Right: The result obtained by setting threshold on the maximum stroke width.
5 Thin Line Elimination
However, some thin lines are still left in the binary image after removing the rectangular frames. In this stage, connected component labeling [22] is first performed. An
Fig. 6. A connected component with four labeled points
Fig. 7. (a): The final result obtained by our method. (b): The result obtained by Niblack’s method [9]. (c): The result obtained by using Niblack’s method with post-processing [11]. (d): The result produced by the method proposed by Sauvola, in [10]. (e): The result produced by the improved Bernsern’s method in [18]. (f): The result produced by Otsu’s method [7].
obvious choice is to use thresholds on the width, height and height-width ratio of the components. However, the results are not as good as desired. Fig. 5 (left) shows an example where these methods fail. Based on an extensive study of real envelope images, we propose to use the maximum stroke width of each component as the reference for thresholding. The procedure is as follows: (1) starting from each labeled point of a component (as shown in Fig. 6), we traverse in the direction of the arrow next to the current labeled point until a pixel whose gray value is 1 (foreground) is encountered; (2) from this point, we begin to count and continue traversing in that direction until a point whose gray value is 0 (background) appears. In this way, we acquire four count values, and the maximum of them is taken as the maximum stroke width. We treat those components whose maximum stroke width is less than 2 pixels and greater than a half of the height (width) of the component as thin lines. Fig. 7 (a) shows the final result image after removing the thin lines.
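The stroke-width test can be prototyped as below. Since the exact four labeled points and arrow directions of Fig. 6 are only shown graphically, we assume here that they are the leftmost, rightmost, topmost and bottommost pixels of the component, with each traversal running inwards from the corresponding image border; this is an assumption, not a detail stated in the text.

```python
import numpy as np

def max_stroke_width(component):
    """Approximate the maximum stroke width of one connected component
    (a boolean mask containing only that component): from four scan lines
    through its extreme points, walk inwards until the first foreground run
    and count its length; return the maximum of the four counts."""
    ys, xs = np.nonzero(component)
    scans = [component[ys[np.argmin(xs)], :],      # row of leftmost point, walk right
             component[ys[np.argmax(xs)], ::-1],   # row of rightmost point, walk left
             component[:, xs[np.argmin(ys)]],      # column of topmost point, walk down
             component[::-1, xs[np.argmax(ys)]]]   # column of bottommost point, walk up
    widths = []
    for line in scans:
        run, started = 0, False
        for v in line:
            if v:                 # inside the first foreground run
                run += 1
                started = True
            elif started:         # run has ended
                break
        widths.append(run)
    return max(widths)
```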
6 Experimental Results
We tested 800 Chinese envelope images in our experiments, which contain machine-printed and handwritten text, a variety of layouts including sparse and dense textual regions, mixed fonts with different sizes and orientations, and text on different shadings or watermarks within one image. For most of these envelope images, the proposed
Fig. 8. Images in the left column are original gray envelope images; those in the right column are binary envelope images obtained by our method
approach can accurately separate the address block, stamp and postmark from the background. In Fig. 8, the images displayed in the left column are original envelope images; those in the right column are the binary envelope images obtained by our method.
We also tested several other well-known methods from the literature on the same collection of images for comparison, including Niblack's method with and without post-processing [11], Sauvola's method [10], the improved Bernsen's method [18], and Otsu's method [7]. These methods were chosen because they have either been successfully used to threshold document images or were designed to extract textual information in their applications. Fig. 7 (b)-(f) shows the result images obtained by these methods. Obviously, Niblack's method introduced too much noise. Even when Niblack's method is used with post-processing [11], some "ghost" objects still remain. The text obtained from the method proposed by Sauvola [10] is thicker than the original; meanwhile, excessive noise exists in the binary image. The text obtained by the improved Bernsen's method is lighter than it really is. As for the traditional method by Otsu [7], it failed to pick up many objects. Furthermore, the rectangular frames cannot be removed by any of these methods. The visual evaluation of the experimental results confirms that our algorithm performs better than these methods.
7 Conclusions and Future Work
In this paper, we introduced a simple approach for the segmentation of Chinese envelope images. The proposed method can accurately separate the address block, stamp and postmark from the background. Experimental results show our method is superior to the methods introduced in [7,9,10,13,14,15,18]. We believe that this benefits from the low curvature bias and low noise sensitivity of RATS. However, the proposed method still has limitations that can be improved in the future. For example, when the length of a digit exceeds the height of the rectangular frame, the digit might be mistakenly removed as a vertical line when we remove the frames. Although this might not dramatically affect the text recognition accuracy, a re-examination and correction process is necessary.
Acknowledgements
The project (No. 60702014) is supported by the National Natural Science Foundation of China.
References
1. Lu, Y., Tan, C.L., Shi, P., Zhang, K.: Segmentation of Handwritten Chinese Characters from Destination Addresses of Mail Pieces. International Journal of Pattern Recognition and Artificial Intelligence 16, 85–96 (2002)
2. Menoti, D., Borges, D.L., Facon, J., de Souza Britto Jr., A.: Segmentation of Postal Envelopes for Address Block Location: an approach based on feature selection in wavelet space. In: Proceedings of Seventh International Conference on Document Analysis and Recognition, pp. 699–703 (2003)
3. Yonekura, E.A., Facon, J.: Postal Envelope Segmentation by 2-D Histogram Clustering through Watershed Transform. In: Proceedings of the Seventh International Conference on Document Analysis and Recognition, vol. 1, pp. 338–342 (2003)
4. Menoti, D., Borges, D.L., et al.: Salient Features and Hypothesis Testing: evaluating a novel approach for segmentation and address block location. In: International Conference on Computer Vision and Pattern Recognition Workshop, vol. 3, pp. 26–33 (2003)
5. Legal-Ayala, H.A., Facon, J., Barán, B.: Postal Envelope Segmentation using Learning-Based Approach. CLEI Electron. J. 11 (2008)
6. Wu, S., Amin, A.: Automatic thresholding of gray-level using multistage approach. In: Proceedings of Seventh International Conference on Document Analysis and Recognition, vol. 1, pp. 493–497 (2003)
7. Otsu, N.: A threshold selection method from gray-level histogram. IEEE Transactions on Systems, Man and Cybernetics 9, 62–66 (1979)
8. Kittler, J., Illingworth, J., Foglein, J.: Threshold selection based on a simple image statistic. Computer Vision, Graphics and Image Processing, 125–147 (1985)
9. Niblack, W.: An Introduction to Digital Image Processing, pp. 115–116. Prentice Hall, Englewood Cliffs (1986)
10. Sauvola, J., Pietikainen, M.: Adaptive Document Image Binarization. Pattern Recognition 33, 225–236 (2000)
11. Yanowitz, S.D., Bruckstein, A.M.: A new method for image segmentation. In: Proceedings of 9th International Conference on Pattern Recognition, pp. 270–275 (1988)
12. Liang, J., Doermann, D., Li, H.: Camera-based analysis of text and documents: a survey. In: IJDAR, vol. 7, pp. 84–104 (2005)
13. Trier, O., Jain, A.: Goal Directed Evaluation of Binarization Methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1191–1201 (1995)
14. Trier, O., Taxt, T.: Evaluation of binarization methods for document images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 312–315 (1995)
15. Wilkinson, H.F.M., et al.: Blood vessel segmentation using moving-window robust automatic threshold selection. In: International Conference on Image Processing, Barcelona, vol. 2, pp. 1093–1096 (2003)
16. Gatos, B., Pratikakis, I., Perantonis, S.J.: An Adaptive Binarization Technique for Low Quality Historical Documents. In: Marinai, S., Dengel, A.R. (eds.) DAS 2004. LNCS, vol. 3163, pp. 102–113. Springer, Heidelberg (2004)
17. Wilkinson, M.H.F.: Optimizing edge detectors for robust automatic threshold selection: coping with edge curvature and noise. In: GMIP, pp. 385–401 (1998)
18. Ye, X., Cheriet, M., Suen, C.Y., Liu, K.: Extraction of bankcheck items by mathematical morphology. IJDAR 2, 53–66 (1999)
19. Yu, Z., Dong, J., Wei, Z., Shen, J.: A Fast Image Rotation Algorithm for Optical Character Recognition of Chinese Documents. In: Proceedings of the 4th International Conference on Communications, Circuits and Systems, pp. 485–489 (2006)
20. Serra, J.: Image Analysis and Mathematical Morphology. Academic Press, London (1982)
21. Said, J.N.: Automatic Processing of Documents and Bank Cheques. PhD thesis, Concordia University (1998)
22. Chang, F., Chen, C.-J.: A component-labeling algorithm using contour tracing technique. In: Proceedings of the Seventh International Conference on Document Analysis and Recognition, vol. 2, pp. 741–745 (2003)
Recognizability of Polyhexes by Tiling and Wang Systems
H. Geetha 1, D.G. Thomas 2, T. Kalyani 1, and T. Robinson 2
1 Department of Mathematics, St. Joseph's College of Engineering, Chennai - 600119
[email protected]
2 Department of Mathematics, Madras Christian College, Chennai - 600059
[email protected]
Abstract. The polyhexes are the hexagonal polyominoes on the honeycomb lattice. In this paper we define various types of polyhexes and show that various classes of column (row) polyhexes can be naturally represented as two dimensional words of tiling recognizable languages. We prove the recognizability of polyhexes through Wang tiles and obtain the result that the classes of polyhexes recognized by hexagonal tiling and hexagonal Wang tiles are equivalent. Keywords: Polyominoes, hexagonal picture languages, recognizable hexagonal picture languages, Wang tiles.
1 Introduction
In this paper we consider the problem of representing two dimensional hexagonal picture languages by various classes of polyhexes. The picture languages generated by grammars or recognized by automata have been advocated for solving problems arising in the framework of pattern recognition and image analysis. We use the tiling system recognizability as introduced in [9]. Hexagonal arrays and hexagonal patterns are found in the literature in picture processing and image analysis. Hexagonal kolam arrays were introduced by Siromoney and Siromoney. Dersanambika et al. introduced hexagonal Wang tiles and two interesting classes of hexagonal picture languages, namely the class of local hexagonal picture languages and the class of recognizable hexagonal picture languages generated by hexagonal tiling systems [7]. The hexagonal Wang tiles were used to introduce the hexagonal Wang system, a formalism to recognize hexagonal picture languages. In biomedical image processing, in a chromosome analysis program, the circumscribing polygons associated with each image turn out to be hexagons. These hexagons are in fact equiangular, and in each equiangular hexagon opposite sides are parallel. A polyomino is a finite connected union of cells having no cut points, where a cell is a unit square in the plane Z × Z. A row (column) of a polyomino is the intersection between the polyomino and an infinite strip of cells whose centers lie on a horizontal (vertical) line. Polyominoes are figures formed by congruent squares placed so that squares share a side. Golomb in 1954 defined
polyominoes. The various classes of polyominoes can be naturally represented as two dimensional words of tiling recognizable languages [5]. The polyominoes are well known combinatorial objects which are related to many different problems, such as tiling [1], games and enumeration [2]. These objects are not only interesting for computer scientists, but also remarkable in the study of lattice models in physics and chemistry. In the literature, various kinds of problems, including those on different classes of polyominoes, have been studied by means of a coding of the class in terms of a string language [6]. An analogous representation of polyominoes, in terms of two-dimensional languages, turns out to be more powerful than string languages. At the same time, such a coding gives us some interesting information about the combinatorial properties of two-dimensional languages, in particular concerning the nature of their generating functions. The generating series of convex hexagonal polyominoes were studied in [8]. In this paper we define polyhexes and various classes of polyhexes. We show that these classes of polyhexes can naturally be represented as two dimensional words of tiling recognizable hexagonal languages in section 2. In sections 3 and 4, we establish the recognizability of the polyhexes by means of tiling systems and labelled Wang tiles, respectively. The procedure to transform the tiling system to labelled Wang tiles is also formulated. By the bijective correspondence between the tiling system and labelled Wang tiles we show the recognizability of the polyhexes. The results can be extended for row convex polyhexes. In section 5 we prove an equivalence theorem concerning the classes of languages recognized by polyhexes tiling systems and those recognized by polyhexes Wang systems.
2 Polyhexes
We refer to [7] for the basic definitions of hexagonal picture, hexagonal tiling system, hexagonal local picture language, tiling recognizability of hexagonal picture languages, hexagonal labelled Wang tile and hexagonal Wang systems. A polyhexes is a finite connected union of hexagonal cells having no cut point. Polyhexes are defined up to translation. A polyhexes is generated by adjoining unit hexagons along their faces. A dohexes is formed by two hexagons; similarly a triohexes is formed by three hexagons, and so on. A column (row) polyhexes is the intersection between the polyhexes and an infinite strip of cells whose centers lie on a vertical (horizontal) line. A right-up (left-up) or left-down (right-down) polyhexes is the intersection between the polyhexes and an infinite strip of cells whose centers lie on the right-up (left-up) or the left-down (right-down) direction line. If the polyhexes is column (row) convex then it is called a column (row) convex honeycomb polyhexes. The area of a polyhexes is the number of unit hexagons that cover the polyhexes. The perimeter of a polyhexes is the distance around it. A polyhexes is said to be column-convex (row-convex) when its intersection with any vertical (horizontal) line is convex. A polyhexes cannot be both column
and row convex. The set of all polyhexes over Σ is denoted by Σ∗∗p. A polyhexes picture language L over Σ is a subset of Σ∗∗p. Based on the definition of the Ferrers diagram [10], we define a Ferrers polyhexes as a convex lattice polygon which contains both the upper corners and the lower left corner of its smallest bounding hexagon. A polyhexes is said to be directed when every cell of p can be reached from a distinguished cell (usually from the lower northwest to the leftmost ordinate), by a path which is contained in p and only uses northeast and northwest unit steps. Each of the polyhexes in Fig. 1, 2, 3, 4 touches a minimum of two vertices of the minimal bounding hexagon of the polyhexes. Let us consider some important classifications of polyhexes, namely Ferrers polyhexes, parallelogram polyhexes, stack polyhexes and directed polyhexes, as in Fig. 1, 2, 3 and 4.
Fig. 1. Ferrers polyhexes
Fig. 2. Parallelogram polyhexes
Fig. 3. Stack polyhexes
Fig. 4. Directed polyhexes

3 Polyhexes Tiling Systems
In this section, we define the polyhexes tiling system and prove that the classes of column (row) convex polyhexes can be encoded as words of tiling recognizable two dimensional hexagonal picture languages. This is achieved by providing a set of tiles for each of these languages and proving that convexity constraints can be formulated by means of local properties on the boundary of the polyhexes.
Definition 1. A polyhexes tiling system PT is a 4-tuple (Σ, Γ, π, θ) where Σ and Γ are two finite sets of symbols, π : Γ → Σ is a projection and θ is the set of hexagonal tiles over the alphabet Γ ∪ {#}. A polyhexes picture language L ⊆ Σ∗∗p is polyhexes tiling recognizable if there exists a polyhexes tiling system PT = (Σ, Γ, π, θ) such that L = π(L(θ)), where L(θ) is the polyhexes local picture language.
The set of polyhexes picture languages recognizable by the polyhexes tiling system is denoted by L(PTS). Let p be a column convex polyhexes and R(p) be its minimal bounding hexagon; the six disjoint (possibly empty) sets of unit cells in R(p)\p are easily individuated, each of them located at one of the six vertices of R(p). Let us call these sets A, B, C, D, E and F (see Fig. 5).
Proposition 1. P is a column (row) convex polyhexes if and only if for each cell (i, j, k) of R(P) it holds:
– if (i, j, k) ∈ A then both (i + 1, j − 1, k) ∈ A and (i, j − 1, k + 1) ∈ A;
– if (i, j, k) ∈ B then both (i, j, k + 1) ∈ B and (i, j − 1, k) ∈ B;
– if (i, j, k) ∈ C then both (i, j − 1, k + 1) ∈ C and (i + 1, j, k + 1) ∈ C;
– if (i, j, k) ∈ D then both (i, j, k + 1) ∈ D and (i + 1, j, k) ∈ D;
– if (i, j, k) ∈ E then both (i + 1, j − 1, k) ∈ E and (i + 1, j, k + 1) ∈ E;
– if (i, j, k) ∈ F then both (i, j − 1, k) ∈ F and (i + 1, j, k) ∈ F.
Fig. 5. A column-convex polyhexes p individuates six disjoint sets of cells in R(p)\p
Fig. 6. Representation of p as a word of LC
To each column convex polyhexes we associate a picture obtained by representing with a 1 every cell belonging to the polyhexes and with a symbol a (respectively b, c, d, e and f ) every cell in A (respectively B, C, D, E and F ) as depicted in Fig. 6. Let LC be the language of these polyhexes over the alphabet {1, a, b, c, d, e, f }. Let us consider the following set of tiles
θA = { the hexagonal tiles over the symbols #, a and 1 that enforce the condition of Proposition 1 on the cells of region A; the tiles are given pictorially }
Similarly we construct the set of tiles for the regions B, C, D, E, F and R as θB , θC , θD , θE , θF and θR respectively. It is easy to prove that the sets θA , θB , θC , θD , θE and θF satisfy the conditions of proposition 1 with respect to the cells A, B, C, D, E and F respectively and so together with θR which characterizes the internal part of p, we have the following theorem.
Proposition 2. LC is a local language over the alphabet ΣC = {1, a, b, c, d, e, f } and LC = L(θR ∪ θA ∪ θB ∪ θC ∪ θD ∪ θE ∪ θF ).
We also prove that some classes of polyhexes are represented as polyhexes tiling recognizable languages. Let us consider the following two dimensional polyhexes picture languages over the alphabet {0, 1}: C (where C is either the language of Ferrers polyhexes (F), or the language of Stack polyhexes (S), or the language of parallelogram polyhexes (P) or the language of directed polyhexes (D)). First let us prove that F, S, P and D are polyhexes tiling recognizable languages.
Theorem 1. The polyhexes picture languages of Ferrers, parallelogram, stack and directed polyhexes are polyhexes tiling recognizable.
Proof. Let us consider the following local languages LF = L(θR ∪ θA ∪ θC ∪ θD ∪ θF ) over ΣF = {1, a, c, d, f }, LP = L(θR ∪ θA ∪ θC ∪ θD ∪ θF ) over ΣP = {1, a, c, d, f }, LS = L(θR ∪ θB ∪ θD ∪ θF ) over ΣS = {1, b, d, f }, LD = L(θR ∪ θB ∪ θC ∪ θD ∪ θF ) over ΣD = {1, b, c, d, f } and the projections πF : ΣF → {0, 1} such that πF (a) = πF (c) = πF (d) = πF (f ) = 0, πF (1) = 1; πP : ΣP → {0, 1} such that πP (a) = πP (d) = πP (c) = πP (f ) = 0, πP (1) = 1; πS : ΣS → {0, 1} such that πS (b) = πS (d) = πS (f ) = 0, πS (1) = 1; πD : ΣD → {0, 1} such that πD (c) = πD (f ) = πD (b) = πD (d) = 0, πD (1) = 1. Finally we have πF (LF ) = F. Thus F is polyhexes tiling recognizable. Again πP (LP ) = P, πS (LS ) = S and πD (LD ) = D. Hence P, S and D are polyhexes tiling recognizable.
4 Polyhexes Wang System
In this section we prove the recognizability of polyhexes through Wang tiles. We are able to transform the tiles of the tiling system for convex (row/column) polyhexes into labelled Wang tiles. We give a procedure to transform the tiles of a tiling system into labelled Wang tiles. The recognizability of the various classes of polyhexes studied in Section 3 is also examined by means of Wang tiles.
Definition 2. A polyhexes labelled Wang system is a triple PW = (Σ, Q, W) where Σ is a finite alphabet, Q is a finite set of colors, and W ⊆ Q^6 × Σ is the set of labelled Wang tiles. The set of polyhexes picture languages recognizable by the polyhexes labelled Wang tile system is L(PWS).
For a given set of Wang tiles, a valid tiling requires all shared edges between tiles to have matching colors. Labelled Wang tiles were used in the recognizability of picture languages. More recently Wang tiles have been used for image generation [3]. By the bijective correspondence between tiling systems and labelled Wang tiles, we try to establish the recognizability of polyhexes as in [4].
Let PT = (Σ, Γ, θ, π) be a polyhexes tiling system. We consider over the set of tiles θ thirteen subsets of tiles: θN, θNE, θNW, the tiles of the north, northeast, northwest respectively; θS, θSE, θSW, the tiles of the south, southeast, southwest
respectively. The six corner tiles θCN , θCN E , θCN W , θCS , θCSE , θCSW or north, northeast, northwest, south, southeast, southwest respectively and θI the set of the remaining tiles of the interior. Before giving the procedure we need to observe the following facts: 1. We start from a picture of a tiling system to labelled Wang tiles, the dimensions of the picture must be the same in both the representation. 2. The projection mapping π : Γ → Σ, maps the alphabet Σ in Γ in such a way that a word p ∈ L(θ) is mapped into p ∈ L(PT ), more precisely the symbol in position (i, j, k) in p is the image by π of the symbol in the position (i, j, k) in p. Thus we insert a label in the labelled Wang tile at position (i, j, k), which is the image through π of the symbol in the position (i, j, k) in p. 4.1
4.1 Construction of a Hexagonal Labelled Wang Tile from Hexagonal Tiles
We first consider the set θN of tiles of the northern border; for each of them we construct the corresponding labelled Wang tile (the hexagonal tile and Wang tile diagrams are omitted here), where with BN we mean that we are placed on the northern border, with #c we mean one symbol, and the same holds for ce, ed, da and a#. Similarly, we translate the sets θNE , θNW , θS , θSE , θSW of tiles to their corresponding labelled Wang tiles. Secondly, we consider the set of (N, NW) corner tiles and translate each of them to its labelled Wang tile (diagrams omitted). Similarly, we translate the sets of (N, NE), (NE, SE), (SE, S), (S, SW), (NE, SW) corner tiles to their corresponding labelled Wang tiles. Thirdly, each tile of the interior set θI is translated to its labelled Wang tile (diagrams omitted).
Here BNE , BNW , BS , BSE , BSW are the labels of the northeast, northwest, south, southeast, southwest borders respectively. By this translation we have obtained all the labelled Wang tiles necessary to represent the local language L(θ) recognized by the polyhexes tiling system PT . Finally we take care of the projection π, which maps the alphabet of L(θ) (and hence of the labels of the labelled Wang tiles in W ) onto Σ. We then replace the labels of the labelled Wang tiles with their respective images through π, and the translation is complete.
4.2 Column Convex Polyhexes Constructed on Labelled Wang Tiles
We proved that many classes of column convex polyhexes can be encoded as words of polyhexes tiling recognizable two-dimensional languages. The polyhexes tiling system TC = (ΣC , {0, 1}, θC , π) recognizes the two-dimensional language LC of convex polyhexes. We recall that ΣC = {1, a, b, c, d, e, f } and the projection π for this language is π(a) = π(b) = π(c) = π(d) = π(e) = π(f ) = 0, π(1) = 1. We are able to encode TC using the procedure given in Section 4.1, and the set of polyhexes labelled Wang tiles that represents the language LC is given as follows: [explicit listing of the tile set WA omitted]
Using the same procedure we construct the sets of labelled Wang tiles WB , WC , WD , WE , WF and WR for the regions B, C, D, E, F and R respectively. Thus column convex polyhexes are generated on the honeycomb lattice using the set of polyhexes labelled Wang tiles Wconv = WR ∪ WA ∪ WB ∪ WC ∪ WD ∪ WE ∪ WF , where WA controls the northwest side of the exterior of the polyhexes, WB controls the northeast side of the exterior, WC controls the corner of the northeast and southeast exterior, WF controls the corner of the northwest and southwest exterior, and WD and WE control the southeast and southwest corners of the exterior respectively.
4.3 Construction of Parallelogram Polyhexes Using Labelled Wang Tiles
We construct parallelogram polyhexes using labelled Wang tiles, by using the procedure given in Section 4.1. This type of encoding can be used to simulate planar signals and to synchronize the signals in order to trigger one or many actions. We recall that the class of parallelogram polyhexes, as defined earlier, is a class of column convex polyhexes with four minimal bounding hexagons (see Fig. 2). As for the class of column convex polyhexes, it is possible to encode the class of parallelogram polyhexes by means of a two-dimensional language. So, let us indicate with LP the recognizable two-dimensional language that represents
parallelogram polyhexes. Then by applying the algorithm, we can translate LP into a set WP of polyhexes labelled Wang tiles. More explicitly, this set of Wang tiles is given by WP = WA ∪ WC ∪ WD ∪ WF ∪ WR . In Figs. 7 and 8 we show a parallelogram polyhex, the two-dimensional word coming from LP that represents it, its projection by π (where π(a) = π(c) = π(d) = π(f ) = 0 and π(1) = 1), and its encoding by means of labelled Wang tiles.
Fig. 7. The two-dimensional word that represents the polyhexes in Fig. 2 (on the left) and its projection by π (on the right)
Fig. 8. The encoding of the polyhexes H in fig. 2 by means of labelled Wang tiles
We note that in Fig. 8, B stands for the border label corresponding to the second-level boundary symbol #. Similarly, using labelled Wang tiles, we can construct the following families of column convex polyhexes.
1. Directed column convex polyhexes, using the set WD = WB ∪ WC ∪ WD ∪ WF ∪ WR
2. Ferrers diagrams, using the set WF = WA ∪ WC ∪ WD ∪ WF ∪ WR
3. Stack polyhexes, using the set WS = WB ∪ WD ∪ WF ∪ WR
5 Equivalence Theorem
In this section we prove that the classes of polyhexes picture languages L(PT S) and L(PW S) coincide.

Proposition 3. L(PT S) ⊆ L(PW S).

Proof. Let L ∈ L(PT S). Then L = L(PT ) for some PT = (Σ, Γ, θ, π). We find a polyhexes labelled Wang system PW such that L(θ) = L(PW ). Because L(PW S) is closed under projection, if L(θ) ∈ L(PW S) then L = π(L(θ)) ∈ L(PW S). We set PW = (Γ, Q, W ) where Q = (Γ ∪ {#})2 and W = WA ∪ WB ∪ WC ∪ WD ∪ WE ∪ WF ∪ WR . Clearly L(θ) = L(PW ).

Proposition 4. L(PW S) ⊆ L(PT S).

Proof. Let L ∈ L(PW S). Then L = L(PW ) for some PW = (Σ, Q, W ). We find a polyhexes tiling system PT = (Σ, Γ, θ, π) such that L = L(PT ). We set Γ = W , θ = θA ∪ θB ∪ θC ∪ θD ∪ θE ∪ θF ∪ θR , and π : W → Σ the projection that maps each labelled polyhexes Wang tile to the symbol of Σ determined by its label (the displayed tile equation is omitted here). We prove that L(PW ) = π(L(θ)).
1. L(PW ) ⊆ π(L(θ)). If w ∈ L(PW ), then there exists a tiling Ĥ labelled with w. By construction, all tiles of Ĥ are in θ, hence B2,2,2 (Ĥ) ⊆ θ. We have that Ĥ ∈ L(θ) and w = π(Ĥ) ∈ π(L(θ)), which implies w ∈ L(PT ).
2. π(L(θ)) ⊆ L(PW ). If w ∈ π(L(θ)) then there exists w′ ∈ L(θ) such that w = π(w′). By construction of θ, w′ = Ĥ, where Ĥ is a tiling labelled by w; hence w ∈ L(PW ).

Propositions 3 and 4 yield the following theorem.

Theorem 2. The classes of polyhexes picture languages recognizable by polyhexes tiling systems and by polyhexes labelled Wang systems are equal, that is, L(PW S) = L(PT S).
6 Conclusion
In this paper we have proved the recognizability of polyhexes by polyhexes tiling systems and polyhexes Wang systems. Our future work is to study the recognizability of polyhexes by means of domino systems and online tessellation automata.
References 1. Beauquier, D., Nivat, M.: Tiling the plane with one tile. In: Proc. of the 6th Annual Symposium on Computational Geometry (SGC 1990), Berkeley, pp. 128–138. ACM Press, New York (1990) 2. Bousquet-Melou, M.: A method for the enumeration of various classes of column convex polygons. Discrete Math. 154, 1–25 (1996) 3. Choen, M.F., Shade, J., Hiller, S., Deussen, O.: Wang tiles for image and texture generation. ACM Transaction on Graphics (2003) 4. De Carli, F., Frosini, A., Rinaldi, S., Vuillon, L.: How to construct convex polyominoes on DNA Wang tiles? LAMA report, Lama. Univ. Savoie.fr (2009) 5. De Carli, F., Frosini, A., Rinaldi, S., Vuillon, L.: On the tiling system recognizability of various classes of convex polyominoes. Annals of Combinatorics (2009) (to appear) 6. Delest, M., Viennot, X.: Algebraic languages and polyominoes enumeration. Theor. Comp. Science 34, 169–206 (1984) 7. Dersanambika, K.S., Krithivasan, K., Martin-Vide, C., Subramanian, K.G.: Local and recognizable hexagonal picture languages. IJPRAI 19(7), 853–871 (2005) 8. Gouyou-Beauchamps, D.: Enumeration of symmetry classes of convex polyominoes on the honeycomb lattice, LRI, CNRS, France and Pierre Leroux, LACIM, UCAM, Canada, March 9 (2004) 9. Giammarresi, D., Restivo, A., Seibert, S., Thomas, W.: Monadic second order logic over rectangular pictures and recognizability by tiling system. Infor. Comp. 125, 32–45 (1996) 10. Schwerdtfeger, U.: Volume laws for boxed plane partitions and area laws for Ferrers diagram, Bielefeld, Postfach 10031, 33501, Bielefeld, Germany, January 27 (2009)
Unsupervised Video Analysis for Counting of Wood in River during Floods
Imtiaz Ali and Laure Tougne
Université de Lyon, CNRS, Université Lyon 2, LIRIS, UMR5205, F-69676, France
Abstract. This paper presents a framework for counting the fallen trees, bushes and debris passing in a river by monocular vision. Automatic segmentation and recognition of wood in a river is a relatively new field of research. An unsupervised segmentation of the wooden objects moving in the river has been developed, together with a novel method for the separation of wood from water waves. The counting of fallen trees in the river is realized by tracking them across consecutive frames. The algorithm is tested on multiple videos of floods and the results are evaluated both qualitatively and quantitatively.
1 Introduction
Automatic video surveillance addresses the challenge of performing real-time analysis and constant monitoring of activity [1]. This automation helps improve the safety of our surroundings. Remote surveillance of unattended environments is often done in places like airports, highways, railway infrastructures, parking lots and roads. In most cases the surveillance systems detect potentially threatening incidents. The monitoring of rivers using cameras has been done for many years. During floods, large numbers of fallen trees, debris, branches and roots of trees are carried by the water. These fallen trees and bushes block the flow of water in the mountains. Moreover, they threaten bridges and dams, as fallen trees accumulate there over the course of a flood. The monitoring systems installed over rivers are usually manually supervised. Automatic detection of these trees will help to take preventive measures during floods. Statistics on the fallen trees carried by floods will help in finding the maximum amount of wood passing in the river every year and the time of the year during which one could expect flooding. Counting the fallen trees and wood in the river requires image segmentation and motion tracking of the fallen trees inside the water. The detection of wood inside the river is an example of detecting object motion within a moving background. The videos we study in this paper are from a camera installed on the river Ain (France). The first row of Figure 2 gives some examples of images extracted from such videos. Complex natural environments often impose many constraints. Such constraints can be classified in two groups: the constraints for detection and
recognition, and the constraints of tracking the moving wood in the river. The detection and recognition of wood depend on the luminosity difference between wood and water. The flow of water in rivers contains turbulence and waves that are more prominent in case of floods. In addition, cloud movement in the sky causes changes in the brightness over the surface of the river. The difference in luminosity between the waves and the wood is not very large. Moreover, the shadows of surrounding trees and buildings make correct foreground/background extraction more difficult. Image segmentation is not easy in the presence of moving tree branches in front of the surveillance camera. The bridges in the monitored scene also produce strong shadows over the water surface. Consequently, in the moving background the objects can only be detected by virtue of their existence in multiple consecutive frames. Furthermore, counting the fallen trees that pass through strategically important places during a flood requires that the waves present in the river be separated from the fallen trees or wood. The tracking of the foreground objects in this case has some constraints too. Water waves and wood that move with the same speed are difficult to distinguish from one another. The motions of wood and water waves inside the river are not linear. For good tracking of the moving objects it is necessary that the objects be present in multiple consecutive frames. The water waves during floods are so large that they submerge the fallen tree branches, and the size of the objects does not remain the same in consecutive frames. In the case of small wood pieces or debris, the water waves totally submerge them, so they appear in one frame and remain submerged in the next two or three frames. Finally, due to the remote location of the monitoring scene and limitations of the transfer rate of the data network, the frame rate of the video is very low (∼4 fps). Consequently, the object motion between consecutive frames is larger. This paper is organized as follows. Section 2 reviews relevant works in similar situations and highlights the constraints and technical difficulties in our case. In Section 3, the proposed methodology for detection is described. In Section 4 the experimental results and a comparison with statistical data obtained manually are presented.
2 Related Works
Automatic segmentation and recognition of wood in a river is a relatively new field of research, and there are not many articles in the literature on this type of application. In this section we present previous works on the detection of foreground objects in non-stationary backgrounds. For foreground detection, an adaptive background model has been proposed for non-stationary backgrounds [2]. The background model plays the role of the reference image in background subtraction techniques; it is constructed by adapting to the changes observed during a training period. The construction of the background model is based on different image features (spectral, spatial and temporal features). For the construction of a background model based on
spectral characteristics, the Gaussian Mixture Model (GMM) method is used by most researchers [3,4,5], where one or more Gaussians are used to represent the spectral features at each background pixel. All these methods are used in situations with very small dynamic background movements. The GMM method leads to misclassification when the background scene is complex [6], [7]. In [1] a method of spatio-temporal filtering is proposed to compensate for the limitations of region-based blocks of images; this method is applied to the detection of swimmers in a swimming pool. The spatial features are extracted by gradient analysis, which gives information about movements in the images. A mixture of spatial and spectral features extracted from the image is used for foreground extraction in [8]. Our method is inspired by these, but gradient analysis alone is not sufficient in our case because the water waves and the wood both have strong gradients. For moving object detection in video, temporal characteristics are very important. The optical flow technique proposed in [9] is widely used for this purpose. [10,11] estimated the consistency of optical flows over a short duration of time, but the consistency of local optical flows requires small displacements from one frame to another. In our case, the videos have a very low frame rate (∼4 fps), due to which there is a large displacement of the wood from one frame to the next and the motion of the wood is not linear. Hence for object segmentation and recognition, spectral and spatial features must be combined with temporal features. Notice that due to the dynamic nature of our application we cannot construct a background model; the background is dynamic, with water waves and wood moving with the same speed. A framework is proposed in the next section, which uses the spectral and spatial features for detection and segmentation, and temporal features for tracking the objects in the video in order to count the fallen trees, branches or stems depending on their appearance in the videos.
3 Proposed Methodology
The detection of wood in the river consists of two steps: image segmentation and recognition of the wood. Outdoor environments have the constraint of sudden appearance/disappearance of sunshine. This is illustrated in Figure 2, whose first row shows original images from a flood video. The presence of a bridge (top left corner of the images), moving tree branches in front of the camera (right middle portions of the images) and the shadows of surrounding trees over the river are evident from these images. The proposed methodology for the detection of wood in the river is composed of two major parts: 1) detection and recognition of wood in the river, 2) separation of wood and water waves by tracking them in a consecutive sequence of images, with the architecture presented in Figure 1. The following two subsections describe the proposed methodology in detail.
Fig. 1. Outline of proposed methodology for detection and segmentation of wood in the river water
3.1 Detection and Recognition of Wood
The automatic detection of wood begins with automatic segmentation of the image. The flow chart in Figure 1 shows that each frame undergoes two segmentation processes. One is named the intensity mask (MI) and the other the gradient mask (MG); they are the results of image segmentation based on intensity histogram thresholding and on an edge-based gradient technique respectively. Intensity Mask (MI). Gray-level histograms of image intensity are calculated for every incoming frame. Histogram thresholding is among the most popular techniques for segmenting gray-level images and several strategies have been proposed to implement it [12], [13]. In fact, peaks and valleys of the 1D brightness histogram can easily be identified, respectively, with objects and backgrounds of gray-level images. In the absence of sunshine the water in the river and the wood have different intensity levels, but the intensities of water waves and wood resemble one another both in gray level and in RGB color values. This fact is shown in Figure 2. The Fisher linear discriminant technique is used for histogram thresholding. This technique produces very good segmentation of images in the absence of sunshine. In Figure 2, the first two images in the second row are the results of our algorithm in the presence of sunlight; the last two images in the second row are the results of intensity-based segmentation in the absence of sunshine, which shows the efficiency of this technique (Figure 2). Gradient Mask (MG). The spectral analysis, as described above, works well in the absence of sunshine. In the presence of sunshine the shadows of surrounding trees and buildings over the river make segmentation based on histogram thresholding very difficult. Therefore it is necessary to integrate spatial features of the image with spectral features to obtain meaningful
Fig. 2. The representation of various steps involved in the segmentation, images in first row represent original images of moving wood in water, the images in the second row are intensity masks, the images in the third row show gradient mask of corresponding images and resulting combinations of all segmentations are shown in the last row
segmentation, by which we mean that wood must be separated from the water. Branches and debris moving under the shadows of the surrounding trees cannot be separated from each other in this way. For this reason, segmentation by detecting the edges among regions is applied in addition to intensity histogram thresholding. This approach has been extensively investigated for gray-level images [12], [13], and algorithms have also been proposed for the detection of discontinuities within color images [14]. This technique gives an image segmentation based on spatial features. The resulting image is named the gradient mask (MG) in Figure 1, and the resulting images of this method are shown in the third row of Figure 2. Temporal Difference (df). The image segmentation is thus done by two different methods. The histogram thresholding technique based on spectral analysis separates the wood from the water in the absence of sunshine, but fails to detect wood under the shadows of surrounding trees in the presence of sunshine. The gradient analysis separates the objects in motion from the rest of the scene; since in our case both water waves and moving wood have strong gradients, the resulting image contains both of them. The advantage of using the gradient mask is that it detects the objects under the shadows of surrounding trees and buildings. The wood and water waves can only be separated from one another by virtue of their
existence in consecutive frames of the video. The majority of water waves, which disperse between two consecutive frames, are automatically suppressed by taking such inter-frame differences. The Resulting Combination. The spectral-segmentation-based intensity mask, the edge-based gradient mask and the temporal inter-frame difference are combined by intersection to give a resulting image. This is a binary image that represents the detected wood along with some water waves. The combination images are shown in the last row of Figure 2.
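A minimal sketch of this intersection over binary masks, assuming all three masks share the same resolution, is given below (the types and names are ours, not the authors' implementation).

```cpp
#include <cstdint>
#include <vector>

// Binary masks stored as flat vectors of 0/1 values, all of the same size.
using Mask = std::vector<std::uint8_t>;

// Combine the intensity mask (MI), the gradient mask (MG) and the temporal
// inter-frame difference (df) by intersection, as in Figure 1: a pixel is kept
// in the resulting image only if all three cues respond at that pixel.
Mask combineMasks(const Mask& intensity, const Mask& gradient, const Mask& temporalDiff) {
    Mask result(intensity.size(), 0);
    for (std::size_t i = 0; i < result.size(); ++i)
        result[i] = (intensity[i] && gradient[i] && temporalDiff[i]) ? 1 : 0;
    return result;
}
```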
Fig. 3. The moving fallen tree in the original video and the combination image showing the detected contours of the tree
3.2 The Separation of Wood and Water Waves
Here the main goal is to detect and count the fallen trees and debris that pass through the river during the flood. In the absence of sunshine, the water waves in the flood resemble the wood, so the counting of fallen trees and debris cannot be decided from a single image segmentation: the contours formed by water waves and wood must be tracked in consecutive frames of the video. The first constraint of tracking the wood is that the floating fallen trees do not have the same length from one frame of the video to another, since the water waves in the flood often submerge the wood. Secondly, the movements of fallen trees are not linear, and the water waves can exist for a long duration in the videos and are therefore sometimes detected as wood in many consecutive frames. So to avoid losing counts of the wood it is important to find some mechanism that minimizes the false detections. The method of counting the fallen trees is explained in this section. Figure 3 shows a fallen tree in the river with the corresponding combination image. The Barycentre of Mass Centers. The fallen trees in the river have many branches and appear in the video as different closed contours, as shown in Figure 3. To cope with the first constraint, the multiple contours of the same object must be grouped together to avoid false detections. Every resulting contour has an area and a center of mass, and the centers of mass of the contours are grouped on the basis of their closeness in the image to give barycenters of the mass centers. These barycenters of mass centers are stored in the summary image.
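A possible grouping step is sketched below; the paper does not spell out the exact grouping rule, so the greedy strategy and the distance radius are assumptions of this illustration.

```cpp
#include <cmath>
#include <vector>

struct Point { double x, y; };

// Group contour mass centers whose distance to a seed center is below 'radius'
// and return one barycenter per group (the mean of the grouped centers).
// The greedy seeding and the radius value are assumptions of this sketch.
std::vector<Point> groupIntoBarycenters(const std::vector<Point>& centers, double radius) {
    std::vector<bool> used(centers.size(), false);
    std::vector<Point> barycenters;
    for (std::size_t i = 0; i < centers.size(); ++i) {
        if (used[i]) continue;
        double sx = centers[i].x, sy = centers[i].y;
        int count = 1;
        used[i] = true;
        for (std::size_t j = i + 1; j < centers.size(); ++j) {
            if (used[j]) continue;
            double dx = centers[i].x - centers[j].x;
            double dy = centers[i].y - centers[j].y;
            if (std::sqrt(dx * dx + dy * dy) <= radius) {
                sx += centers[j].x; sy += centers[j].y; ++count;
                used[j] = true;
            }
        }
        barycenters.push_back({sx / count, sy / count});
    }
    return barycenters;
}
```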
Counting the Number of Wood Pieces. In order to count the fallen trees, bushes, stems of trees, roots and debris that pass through the river, we propose to represent the presence of the barycenters in a “summary image”. The barycenter of an object (wave or wood) that is present in two consecutive frames makes a pair of barycenters in the summary image, and a trace is formed on the summary image. If the object is wood then these centers of mass must be present continuously from the left to the right of the screen (as the motion of the river water is from left to right). This means that, if the object is not totally submerged in the water, it will be present in more than four continuous frames. So the wood is detected and counted on this basis (see an example of such an image in Figure 4).
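The counting rule can be sketched as follows, assuming that the frame-to-frame association of barycenters into traces has already been performed; the five-frame requirement follows the description above, while everything else is our own illustrative choice.

```cpp
#include <vector>

// Count wood pieces from per-frame presence flags of tracked barycenter
// traces: a trace is counted as wood once its barycenter has been matched in
// at least 'minRun' consecutive frames (five in the description above). Each
// trace is represented here simply as a sequence of booleans, one per frame.
int countWood(const std::vector<std::vector<bool>>& traces, int minRun = 5) {
    int wood = 0;
    for (const auto& trace : traces) {
        int run = 0;
        bool counted = false;
        for (bool present : trace) {
            run = present ? run + 1 : 0;
            if (run >= minRun && !counted) { ++wood; counted = true; }
        }
    }
    return wood;
}
```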
Fig. 4. Example of a summary image
4 Experimental Results
A monitoring system has been set up on the river Ain (France), and videos of floods during recent years have been recorded. The fallen trees, bushes, branches and roots of trees were counted manually by geographers. The results are qualitatively evaluated by visual inspection. The quantitative evaluation is computed as the true positives, false negatives and false positives of the wood detected in the videos. Figure 5 shows a glimpse of some difficult situations. The first scenario presents two very small wood pieces moving at the same time; these two pieces are segmented and counted as two different objects. The second one shows that the detection works even in the presence of shadow. In addition to the qualitative evaluation, Table 1 shows the quantitative evaluation in terms of wood pieces actually present and counted as wood, the number of wood pieces that are not counted by our algorithm, and the number of waves that are detected as wood pieces. The separation of wood and water waves depends on the presence of the wood in consecutive frames. The parameters were tested for different types of situations and different lengths of wooden objects. To count the wood pieces present in the videos, the number of continuous frames is optimized to five. Geographers obtained the ground truth through visual inspection; they manually went through 5400 frames to derive the reported detection rate.
Fig. 5. The segmentation of wood on sample frames captured from different challenging scenarios at different time intervals in the absence and presence of sunlight. Odd rows: sample frames captured. Even rows: corresponding segmentation results.
The algorithm was applied to seven flood videos with a total duration of thirty-six minutes. The percentage of detected wood is clearly higher than the percentage of false detections of waves as wood. The brightness of the waves is very close to that of the wood pieces and the waves have strong gradients; moreover, the water waves last for more than five frames in some cases. If the water waves are continuously present in five frames, a false detection occurs, but the percentage of such false detections is small. Moreover, the wood pieces sometimes appear in a number of consecutive frames and disappear for one or two frames; such wood pieces cannot be detected. The detection rate is nearly 98% while the successful counting rate is 90%. The number of detected wood pieces (Nd), the number of non-detected wood pieces (Npd) and the number of water waves detected as wood (Nw) for the seven videos of total duration thirty-six minutes are summarized in Table 1. The non-detected wood pieces (Npd) are those wooden objects that appeared in the videos for fewer than five and more than two consecutive frames. The results are shown in Figure 6, which clearly indicates that the algorithm counts the wooden objects in difficult scenarios with a high success rate.
Table 1. Quantitative evaluation of proposed algorithm in terms of number of true detection Nd, number of non-detected wood Npd and number of waves detected as wood Nw
          Total frames  Duration (min)  Nd (%)  Npd (%)  Nw (%)
Video 1        650          4'00          95       5        6
Video 2        900          5'23          91       9       13
Video 3        860          5'36          81      19       19
Video 4        750          5'11          90      10        7
Video 5        550          4'02          76      24        2
Video 6        800          5'52          93       7       14
Video 7        880          6'05          91       9       19
Total         5390         36'05          90      10       15
Fig. 6. Results of counting the number of wood pieces in the videos: white bars represent the number of true wood pieces, black bars the number of non-detected wood pieces, and grey bars the number of waves detected as wood
5 Concluding Remarks
In this paper, the problem of automated monitoring based on video surveillance in the highly dynamic environment of a river has been discussed. The nature of the problem is such that a background model cannot be created, so an algorithm is needed that detects the wood by using different features of the images. In particular this paper has addressed two fundamental issues: 1) unsupervised segmentation of wood in the river, and 2) a method to count the wooden material in the river during floods. The first issue has been addressed by using the spectral features of the images together with spatial features; the two types of features help a great deal in the unsupervised segmentation of wood and water waves from the rest of the water. As the water waves and the wooden objects are both present in the segmented image, the separation of wood from water waves requires tracking the wooden objects in consecutive frames. A fallen tree or bush can only be detected if some part of it remains above the water level in the river. If the wood is submerged in some frames and appears in the next frame, then such wooden
objects cannot be detected. Moreover, under heavily clouded conditions the water waves resemble the wood in color, and water waves that persist for a long time during the flood produce false detections as wood. The experimental results indicate that the proposed algorithm detects and counts the wood with a reasonably good success rate.
References 1. Eng, H.L., Wang, J., Wah, A.H.K., Yau, W.: Robust human detection within a highly dynamic aquatic environment in real time. IEEE Tran. on Image Processing 15, 1583–1600 (2006) 2. Li, L., Huang, W.M., Gu, I.H., Tian, Q.: Statistical modeling of complex background for foreground object detection. IEEE Trans. Image Process. 13, 1459–1472 (2004) 3. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: realtime tracking of the human body. IEEE Trans. Pattern Anal Machine Intell. 19, 780–785 (1997) 4. Vacavant, A., Chateau, T.: Realtime head and hands tracking by monocular vision. In: IEEE International Conference on Image Processing 2005, ICIP 2005 (2005) 5. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Machine Intell. 22, 747–757 (2000) 6. Boult, T.: Frame-rate multi-body tracking for surveillance. In: DARPA Image Understanding Workshop (1998) 7. Gao, X., Boult, T., Coetzee, F., Ramesh, V.: Error analysis of background adoption. In: IEEE Conf. Computer Vision and Pattern Recognition, June 2000, pp. 503–510 (2000) 8. Mittal, A., Paragios, N.: Motion-based background subtraction using adaptive kernel density estimation. In: CVPR, pp. 302–309 (2004) 9. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981) 10. Iketani, A., Nagai, A., Kuno, Y., Shirai, Y.: Deteching persons on changing background. In: Int. Conf. Pattern Recognition, vol. 1, pp. 74–76 (1998) 11. Wixson, L.: Detecting salient motion by accumulating directionary-consistent flow. IEEE Tran. Pattern Anal. Machine Intell. 774–780(22) (August 2000) 12. Fu, K., Mui, J.: A survey on image segmentation. Pattern Recognition 13, 3–16 (1981) 13. Rosenfeld, A., Kak, A.: Digital picture processing, 2nd edn., vol. 2. Academic Press, New York (1982) 14. Zhao, A.: Robust histogram-based object tracking in image sequences. Digital Image Computing Techniques and Applications, 45–52 (2008)
Robust Facial Feature Detection and Tracking for Head Pose Estimation in a Novel Multimodal Interface for Social Skills Learning
Jingying Chen 1,2 and Oliver Lemon 1
1 School of Informatics, University of Edinburgh, UK
2 Engineering and Research Centre for Information Technology on Education, Huazhong Normal University, Wuhan, P.R. China
Abstract. A robust and efficient facial feature detection and tracking approach for head pose estimation is presented in this paper. Six facial feature points (inner eye corners, nostrils and mouth corners) are detected and tracked using multiple cues including facial feature intensity and its probability distribution based on a novel histogram entropy analysis, geometric characteristics and motion information. The head pose is estimated from tracked points and a 3D facial feature model using POSIT and RANSAC algorithms. The proposed method demonstrates its capability in gaze tracking in a new multimodal technology enhanced learning (TEL) environment supporting learning of social communication skills.
1 Introduction Facial feature detection and tracking is important in vision-related applications [1] such as head pose estimation (HPE), which is crucial in a new multimodal technology-enhanced learning (TEL) environment supporting learning of social communication skills. Head orientation is related to a person's direction of attention; it can give us useful information about what he or she is paying attention to. Furthermore, head pose estimation is also crucial for analyzing meaningful gestures, i.e. head nodding and shaking. In this paper, we present a robust and efficient facial feature detection and tracking approach for HPE from monocular images to support learning of social communication skills in a new multimodal TEL environment. 1.1 Prior Work HPE approaches can be divided into two classes, global and local approaches. Global methods analyze the entire head image for pose classification. The range of head orientation is divided into a limited number of classes and classifiers for each class are trained. They attempt to recover the relationship between head pose and face image by statistical learning algorithms; however, they need a large amount of training data to accommodate possible variations of head poses under different conditions [2, 3, 4]. On the other hand, local methods detect and track facial feature points to calculate the actual head orientation [5, 6, 7, 8]. Theoretically, the local methods should provide more precise results than the global methods. However, their
performance depends on the successful detection of facial features. In practice, due to the variety of facial motions and expressions, facial feature detection and tracking is very challenging. Hence, we propose a robust and efficient facial feature detection and tracking method for HPE. The facial feature detection and tracking literature includes image-based approaches [9, 10], template-based approaches [11, 12, 6], appearance-based approaches [13, 14, 15] and motion-based approaches [16]. Each of these approaches has its own strengths and limitations. Image-based approaches use color information, properties of facial features and their geometric relationships to locate facial features. Yang and Stiefelhagen [9] presented a technique for tracking based on human skin color. This approach is in general very fast, however, color alone does not provide enough reliable information to track facial features. Stiefelhagen et al. [10] used color information and certain geometric constraints on the face to detect and track six facial feature points (pupils, nostrils and mouth corners) in real time for lip reading. This method works properly under good lighting conditions, however the mouth corners may drift away when the illumination changes. Template based approaches are usually applied to intensity images where a predefined template of facial feature is matched against image blocks. Tian et al. [11] used multiple state templates to track the facial features. Feature point tracking together with masked edge filtering is used to track the upper facial features. The system requires that templates be manually initialized in the first frame of the sequence, which prevents it from being automatic. Kapoor and Picard [12] used eyebrow and eye templates to locate upper facial features in a real time system. However, specialized hardware (an infrared sensitive camera equipped with infrared LEDs) is needed to produce the red eye effect in order to track the pupils. Matsumoto and Zelinsky [6] detected the facial features using an eye and mouth template matching method to estimate head pose, which was implemented using the IP5000 image processing board. Appearance-based approaches use facial models derived from a large amount of training data. These methods are designed to accommodate possible variations of human faces under difference conditions. Cootes et al. [13] proposed active appearance models (AAM) and Matthews and Baker [14] improved the performance of the original AAM. However, these methods need large amounts of delineated training data and involve relatively expensive computations. Also, the AAM fitting requires expensive computations which make the real time tracking difficult. Cristinacce and Cootes [15] proposed Constrained Local Model (CLM) for feature detection and tracking, they used a joint shape and texture appearance model to generate a set of region template detectors. The model is fitted to an unseen image in an iterative manner by generating templates using the joint model and the current parameter estimates, correlating the templates with the target image to generate response images and optimising the shape parameters so as to maximise the sum of response. In their method, Viola and Jones’s [17] features are used to detect face. Within the detected face region they applied smaller Viola and Jones’s feature detectors constrained using the Pictorial Structure Matching (PSM) method [18], to detect initial feature points. 
They claimed their proposed method is more robust and accurate than the original AAM. He et al. [16] proposed a motion based facial feature point tracking system. Their method takes a Kanade-Lucas-Tomasi (KLT) optical flow as basis and corrects the prediction by prior statistical facial feature restriction.
The facial feature detection and tracking approach using a single cue about the image sequence is insufficient for reliable performance. A robust tracking system should use as much knowledge about the image sequence as possible to handle all sources of variability in the environment. Hence, we propose to use the multi-cue of Haar-like features, intensity and its probability distribution, geometry constraints, motion and a simple 3D facial feature model to build a robust facial feature tracking system. The six features (i.e. inner eye corners, nostrils and mouth corners) are chosen because they have obvious characteristics, e.g. inner eye corners are very stable features across different facial expressions and are independent of gaze direction, nostrils are dark regions, mouth corners are extremities of dark region and they satisfy certain constraints inherent in facial geometry. Then, the head pose is estimated from the tracked features and 3D facial feature model based on RANSAC and POSIT algorithms. The outline of the paper is as follows. The proposed facial feature detection is presented in Section 2. Section 3 describes the feature tracking and head pose estimation. Section 4 presents the experimental results while Section 5 demonstrates an application. Section 6 gives the conclusions.
2 Facial Feature Detection In the proposed approach, a face is first detected, which relies on a boosting algorithm and a set of Haar-like features. Then the eyes are searched for inside the face area based on their Haar-like features. Next, the inner eye corners are detected by their intensity probability distribution based on a novel histogram entropy analysis and edge characteristics. Finally, the mouth corners and nostrils are located using their intensity probability distribution based on the proposed histogram entropy analysis and geometric constraints. The detail of the detection procedure is given below. 2.1 Face Detection Viola and Jones’s [17] face detection algorithm, based on Haar-like features is used to detect a face. Haar-like features encode the existence of oriented contrast between regions in the image. A set of these features can be used to encode the contrast exhibited by a human face and their special relationships. In Viola and Jones’s method, a classifier (i.e. a cascade of boosted classifier working with Haar-like features) is trained with a few hundreds of sample views of face and non-face examples, they are scaled to the same size, i.e.24x24. After the classifier is trained, it can be applied to a region of interest in an input image. To search for the face, one can move the search window across the image and check every location using the classifier. 2.2 Eye Corners Detection Similar to face detection, eyes are found using a cascade of boosted tree classifiers with Haar-like features. A statistical model of the eyes is trained in this work. The cascade is trained on 3000 eye and 8000 non-eye samples of size 18x12. The training set contains different facial expressions and head rotations. The 18x12 window moves across the eye region and each sub-region is classified as eye or non-eye. Within the detected eye regions, the inner eye corners are determined using their intensity and
edge characteristics; the corners should have low intensities and high edge responses. First, a novel approach based on entropy analysis of the partial intensity histogram within the eye regions is proposed to segment the eyes. In this approach, the entropy Ej is iteratively calculated according to different parts of the normalized histogram, until its value is greater than a threshold Eth which is found from the training data:

Ej = − Σ_{i=0}^{j} H(i) log H(i),   j = 0, …, n,  n ∈ (1, 255)
where i and H(i) are the histogram index and value respectively. When Ej > Eth, j is used as a threshold to segment the eyes. Pupils and eyelids generally have lower intensities than neighbouring pixels and contain a relatively fixed proportion of the information within the eye regions, hence it is reasonable to segment them based on a threshold chosen from the partial intensity histogram entropy, which is insensitive to illumination variations. Also, eye corners have high responses in the edge map; hence, we combine the edge information obtained with the Sobel edge detector [19] with intensity to segment the eyes. Then, morphological processing is applied to the segmented images. Finally, the right-most or left-most extremities of the largest connected region of the segmented eyes are searched for as inner eye corners satisfying the anatomical constraints. For example, in Figure 1 one can see that the inner corners of the left eyes can be detected correctly under different illumination conditions and with/without glasses.
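A minimal sketch of this threshold selection, assuming the histogram has already been normalized (the function name and the fallback behaviour are ours):

```cpp
#include <cmath>
#include <vector>

// Choose a gray-level threshold from the partial entropy of a normalized
// intensity histogram: accumulate -H(i)*log(H(i)) for i = 0..j and stop as
// soon as the accumulated entropy exceeds Eth (learned from training data in
// the paper; here it is simply a parameter).
int entropyThreshold(const std::vector<double>& normalizedHist, double Eth) {
    double entropy = 0.0;
    for (int j = 0; j < static_cast<int>(normalizedHist.size()); ++j) {
        double h = normalizedHist[j];
        if (h > 0.0) entropy += -h * std::log(h);
        if (entropy > Eth) return j;  // j is then used as the segmentation threshold
    }
    return static_cast<int>(normalizedHist.size()) - 1;  // fallback: entropy never exceeded Eth
}
```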
Fig. 1. Inner eye corner detection under different conditions: (a) detected eye regions; (b) segmented eyes based on entropy analysis; (c) eye edge maps; (d) the largest connected regions of the combination of (b) and (c); and (e) detected inner corners of the left eyes
2.3 Mouth Corners Detection An estimated mouth region can be obtained using the eye positions and face size. Similar to eye segmentation, a mouth is segmented based on the threshold chosen from partial histogram entropy. Figure 2 shows two mouth images segmented correctly under different illumination conditions, because the segmentation does not depend on the absolute intensity. Then, the mouth corner positions are estimated based on the largest connected region of the segmented image (see Figure 2(b)). Extremities of the bright areas around the left and right parts are searched for as mouth corners.
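The "largest connected region plus extremities" step used for both the eyes and the mouth can be sketched as follows; the choice of 4-connectivity and the flat image layout are assumptions of this illustration, not details given in the paper.

```cpp
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Find the left-most and right-most pixels of the largest 4-connected
// foreground region in a binary image (width*height, row-major, values 0/1).
std::pair<std::pair<int,int>, std::pair<int,int>>
largestRegionExtremities(const std::vector<std::uint8_t>& img, int width, int height) {
    std::vector<int> label(img.size(), -1);
    int bestLabel = -1, bestSize = 0, current = 0;
    for (int start = 0; start < width * height; ++start) {
        if (!img[start] || label[start] != -1) continue;
        int size = 0;
        std::queue<int> q;
        q.push(start); label[start] = current;
        while (!q.empty()) {
            int p = q.front(); q.pop(); ++size;
            int x = p % width, y = p / width;
            const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
            for (int k = 0; k < 4; ++k) {
                int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                int np = ny * width + nx;
                if (img[np] && label[np] == -1) { label[np] = current; q.push(np); }
            }
        }
        if (size > bestSize) { bestSize = size; bestLabel = current; }
        ++current;
    }
    std::pair<int,int> leftmost{width, 0}, rightmost{-1, 0};
    for (int p = 0; p < width * height; ++p) {
        if (label[p] != bestLabel) continue;
        int x = p % width, y = p / width;
        if (x < leftmost.first)  leftmost  = {x, y};
        if (x > rightmost.first) rightmost = {x, y};
    }
    return {leftmost, rightmost};  // degenerate values if no foreground pixel exists
}
```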
Fig. 2. Mouth corner detection in the segmented image, (a) and (c) mouth intensity images, (b) and (d) segmented images
2.4 Nostrils Detection The nostrils appear dark relative to the surrounding area of the face under a wide range of lighting conditions. As long as the nostrils are visible, they can be found by searching for two dark regions, which satisfy certain geometric constraints. Here, the search region is restricted to an area below the eyes and above the mouth, and similar to eye and mouth segmentation, nostrils are segmented based on entropy analysis. Then, the centers of the dark regions are computed as the centers of nostrils (see Figure 3).
Fig. 3. The illustration of nostrils detection
3 Facial Feature Tracking and Head Pose Estimation Once the feature points have been detected, the Lucas-Kanade (LK) algorithm [20] is used to track the inner eye corners and nostrils. The algorithm detects the motion through the use of optical flow. Since the mouth has relatively higher variability (i.e. closed or open mouth, with/without visible teeth) compared to the eyes and nose, the mouth corners are tracked based on the segmented mouth image: extremities of the bright areas are searched for around the previously found left and right corners. After the positions of the tracking points have been updated, the head pose can be estimated using the POSIT algorithm [21]. Given the 3D facial feature model (i.e. the 3D locations of the six feature points) and their 2D locations in the camera image, the head pose (rotation and translation) with respect to the camera can be computed using the POSIT algorithm. It estimates the pose by first approximating the perspective projection as a scaled orthographic projection, and then iteratively refining the estimate until the distance between the projected points and the ones obtained with the estimated pose falls below a threshold. Instead of using all the found feature points, a minimal subset of feature points is used to estimate the pose. So long as one subset of good, accurate measurements exists, the rest of the feature points can be ignored and gross errors will have no effect on the tracking performance. The selection of a good subset can be done within the RANSAC regression paradigm [22]. Once the best subset of the features is found, the true position of an outlier can be predicted by
projecting its model point onto the image, using the computed pose. This prediction allows the system to recover from tracking errors and leads to a more robust feature point tracking. 3.1 Tracking Failure Detection and Recovery The failure detection includes two steps. First, all the found feature points are checked to see if they lie within the face region and satisfy certain constraints inherent in facial geometry. If not, the model points are projected back onto the image using the computed pose. In the case of mild occlusion, the lost feature points can be recovered. Second, if the average distance between the back-projected model points and the actual found points exceeds a certain threshold, it is considered a failure. Once the tracking failure has been detected, the feature points have to be searched for again. The failure recovery can be solved using the previously found pose just before the failure occurs. If an eye corner is lost during the tracking process, a search window is computed. Its center and size are chosen based on the previously found pose. The search window center is the previous position of the eye corner and its size is proportional to the Euclidean distance between the two last known eye positions. This can scale the search window automatically when the person gets closer or further from the camera. Then, the eye corner detection described in the previous section is applied within the search window. The nostril recovery is based on the detection of a dark region within a search window, the center of the dark region is computed as the recovered nostril center. The mouth corners are recovered based on the mouth corners detection given in section 2.3.
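A sketch of the RANSAC-style subset selection is given below. The POSIT solver and the reprojection error are passed in as callables because their internals are not reproduced here; the subset size, iteration count and inlier threshold are illustrative values, not the authors' settings.

```cpp
#include <cstdlib>
#include <functional>
#include <vector>

struct Pose { double rotation[3]; double translation[3]; };

// RANSAC-style selection of a good minimal subset of the six feature points.
// 'solvePose' stands in for a pose solver such as POSIT applied to the chosen
// subset; 'reprojError' returns the reprojection error of one point under a
// candidate pose.
Pose ransacPose(int numPoints,
                const std::function<Pose(const std::vector<int>&)>& solvePose,
                const std::function<double(const Pose&, int)>& reprojError,
                int subsetSize = 4, int iterations = 100, double inlierThresh = 3.0) {
    Pose best{};
    int bestInliers = -1;
    for (int it = 0; it < iterations; ++it) {
        std::vector<int> subset;
        while (static_cast<int>(subset.size()) < subsetSize) {
            int idx = std::rand() % numPoints;
            bool dup = false;
            for (int s : subset) if (s == idx) dup = true;
            if (!dup) subset.push_back(idx);
        }
        Pose candidate = solvePose(subset);
        int inliers = 0;
        for (int p = 0; p < numPoints; ++p)
            if (reprojError(candidate, p) < inlierThresh) ++inliers;
        if (inliers > bestInliers) { bestInliers = inliers; best = candidate; }
    }
    return best;  // outliers can then be re-predicted by projecting the model with 'best'
}
```

The benefit of this scheme is exactly the one described above: a single accurate subset is enough to obtain a usable pose, and the remaining, possibly erroneous, points can afterwards be corrected by back-projection.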
4 Experimental Results 4.1 Detection Results The proposed method has been implemented in C++ under an MS Windows environment with an Intel Core 2.53 GHz processor and tested with both static face databases and live video sequences. For measuring the performance of the proposed feature detection method, we have tested it on two publicly available face image databases: JAFFE (http://www.kasrl.org/jaffe.html), which contains 213 images representing 7 different facial expressions by 10 Japanese female models, and the BioID Face Database (http://www.bioid.com), which consists of 1521 images of frontal faces under cluttered backgrounds and various illuminations. The detection rates (i.e. images with successfully detected inner eye corners, nostrils and mouth corners relative to the whole set of database images) of the inner eye corners, nostrils and mouth corners are 94.8%, 96.1% and 95.4%. The method in [23], which also uses facial feature intensity and anatomical constraints to detect facial features and needs only inexpensive computation, provides 92.8%, 95.2% and 93.3% detection rates for the inner eye corners, nostrils and mouth corners on the same databases; the proposed method thus provides better results and is computationally efficient. Examples of the detected facial features using the proposed method are shown in Figure 4. The white crosses represent the detected features.
Fig. 4. The results of detected inner eye corners, nostrils and mouth corners from JAFFE and BIOID face image databases
4.2 Tracking Results The tracking experiments have been performed using both live data capture and prerecorded video sequences. Examples of the tracking results are given below (see Figure 5 and 6). Sequences were captured using an inexpensive web camera with a resolution of 320 x 240 at 25 frames per second. The bright dots represent the tracked points.
Fig. 5. Tracking results for sequence 1
Fig. 6. Tracking results for sequence 2
From these results, one can see that the proposed approach can track the six feature points accurately when the person is moving or rotating her head during the tracking process. Figure 7 gives an estimate of the tracking accuracy of the proposed method. The measurements were taken using 400 frames of different subjects. Manually corrected positions of the feature points were used as reference for measuring displacement error. The error was computed as the Euclidean distance (in pixels) between the reference points and the points obtained by the proposed method. The average and standard deviations of the distances were computed across all eye corners, nostrils and mouth corners during the tracking processing (see Figure 7). The performance of the proposed approach is very good, and it can cope with large angle head rotation, different facial expressions, and various illuminations. The approach is fully automatic and computationally efficient. These methods can be easily implemented in a real time system. On the other hand, only a simple facial feature model is used to compute the head pose, which improves tracking robustness and its simplicity needs only inexpensive computation. Hence, the proposed tracking system is efficient, robust, and suitable for putting into practice.
Fig.7. The average and standard deviation of the distances (in pixels) between the tracked feature points and manually corrected feature points of the eye corners, nostrils and mouth corners
5 Application In order to test the applicability of the proposed feature detection and tracking approach to human-computer interaction, we apply the proposed method to estimate the head pose related to the person's gaze, to support learning of social communication skills in a new multimodal TEL environment. The distance from the user to the web camera is about 60 cm. The system detects and tracks the facial feature points automatically when a human face appears in front of the camera. We divide the computer screen into a 3x3 grid. The head pose is estimated from the tracked points and a facial model using the POSIT and RANSAC algorithms. Examples of head pose related gaze tracking are given in Figure 8. For example, when a person looks at the left, top or centre of the screen, the corresponding screen part turns green.
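The mapping from the estimated rotation angles to one of the nine screen regions can be sketched as follows; the ±10 degree limits are purely illustrative and are not taken from the paper.

```cpp
// Map the estimated head rotation angles to one of the 3x3 screen regions.
// The +/-10 degree limits separating "left/centre/right" and
// "top/centre/bottom" are assumed values for this sketch.
struct GridCell { int row; int column; };  // each in 0..2

GridCell anglesToCell(double rxDegrees, double ryDegrees) {
    auto bucket = [](double angle) {
        if (angle < -10.0) return 0;
        if (angle >  10.0) return 2;
        return 1;
    };
    return { bucket(rxDegrees), bucket(ryDegrees) };
}
```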
Fig. 8. Examples of head pose related gaze tracking
Figure 9 gives the evaluation results of head pose parameters, the rotation angle around axis X ( Rx ) and the rotation angle around axis Y ( R y ) for one example sequence of 200 frames. The solid lines indicate the reference rotation angles obtained with manually corrected feature points while the dash lines show the rotation angles obtained using the proposed method. Due to the quality of camera calibration and feature point localization, one cannot expect the solid lines match the dash lines exactly. From these results, one can see that the proposed facial feature detection and tracking approach provides good performance for head pose estimation.
Fig. 9. Estimated rotation angles around axis X (Rx) and around axis Y (Ry): manually corrected feature points (solid lines) and automatically tracked feature points using the proposed method (dashed lines) for 200 frames
6 Conclusions A robust real time facial feature detection and tracking approach is proposed to estimate head pose in this paper. The system detects a human face using a boosting algorithm and a set of Haar-like features, locates the eyes based on Haar-like features, detects the inner eye corners using a novel histogram entropy analysis approach and edge detection, finds the mouth corners and nostrils based on the proposed histogram entropy analysis, then tracks the detected facial points using optical flow based tracking. The head pose is estimated from tracked points and a facial feature model using POSIT and RANSAC algorithms. The system is able to detect tracking failure using constraints derived from a facial feature model and recover from it by searching for one or more features using the feature detection algorithms. The system demonstrates its capability in head pose related gaze tracking in a new multimodal TEL environment supporting learning of social communication skills (where appropriate gaze behaviour is a crucial part of social communication). The results obtained suggest that the method has strong potential as alternative method for building a feature-based head pose tracking system. In the future we will experiment with additional features in the tracking method.
References 1. Weidenbacher, U., Layher, G., Bayerl, P., Neumann, H.: Detection of Head Pose and Gaze Direction for Human-Computer Interaction. In: André, E., Dybkjær, L., Minker, W., Neumann, H., Weber, M. (eds.) PIT 2006. LNCS (LNAI), vol. 4021, pp. 9–19. Springer, Heidelberg (2006) 2. Stiefelhagen, R., Yang, J., Waibel, A.: Simultaneous Tracking of Head Poses in a Panoramic View. In: Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, Spain, September 2000, vol. 3, pp. 722–725 (2000) 3. Gourier, N., Maisonnasse, J., Hall, D., Crowley, J.L.: Head Pose Estimation on Low Resolution Images. In: Stiefelhagen, R., Garofolo, J.S. (eds.) CLEAR 2006. LNCS, vol. 4122, pp. 270–280. Springer, Heidelberg (2007) 4. Rajwade, A., Levine, M.: Facial Pose from 3D Data. Image and Vision Computing 24(8), 849–856 (2006) 5. Hu, Y., Chen, L., Zhou, Y., Zhang, H.: Estimating Face Pose by Facial Asymmetry and Geometry. In: Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, May 2004, pp. 651–656 (2004)
6. Matsumoto, Y., Zelinsky, A.: An Algorithm for Real Time Stereo Vision Implementation of Head Pose and Gaze Direction Measurement. In: Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition, France, pp. 499–505 (2000) 7. Ko, J., Kim, K., Choi, S., Kim, J., Kim, K., Kim, J.: Facial Feature Tracking and Head Orientation-based Gaze Tracking. In: International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC) 2000, Pusan, Korea (July 2000) 8. Ji, Q., Yang, X.: Real Time 3D Face Pose Discrimination Based On Active IR Illumination. In: Proceedings of the 16th International Conference on Pattern Recognition (ICPR), vol. 4, pp. 40310–40313 (2002) 9. Yang, J., Stiefelhagen, R., Meier, U., Waibel, A.: Real Time Face and Facial Feature Tracking and Applications. In: Proceedings of the International Conference on AuditoryVisual Speech Processing AVSP 1998, pp. 207–212 (1998) 10. Stiefelhagen, R., Meier, U., Yang, J.: Real-time Lip-tracking for Lip Reading. In: Proceedings of the Eurospeech 1997, 5th European Conference on Speech Communication and Technology, Rhodos, Greece (1997) 11. Tian, Y., Kanade, T., Cohn, J.F.: Recognizing Upper Face Action Unit for Facial Expression Analysis. In: Proceedings of the International Conference on Computer Vision and Pattern recognition, South Caroline, USA, June 2000, pp. 294–301 (2000) 12. Kapoor, A., Picard, R.W.: Real-Time, Fully Automatic Upper Facial Feature Tracking. In: Proceedings of the 5th International Conference on Automatic Face and Gesture Recognition, Washington DC, USA, May 2002, pp. 10–15 (2002) 13. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active Appearance Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001) 14. Matthews, I., Baker, S.: Active Appearance Models Revisited, Technical report: CMU-RITR-03-02, the Robotics Institute Carnegie Mellon University (2002) 15. Cristinacce, D., Cootes, T.F.: Feature Detection and Tracking with Constrained Local Models. In: Proceedings of British Machine Vision Conference, UK, pp. 929–938 (2006) 16. He, K., Wang, G., Yang, Y.: Optical Flow-based Facial Feature Tracking using Prior Measurement. In: Proceedings of the 7th International Conference on Cognitive Informatics, August 2008, pp. 324–331 (2008) 17. Viola, P., Jones, M.: Robust Real Time Object Detection. In: Proceedings of the 2nd International Workshop on Statistical and Computational Theories of Vision-Modeling, Learning, Computing and Sampling, Vancouver, Canada (July 2001) 18. Felzenszwalb, P., Huttenlocher, D.: Pictorial Structures for Object Recognition. International Journal of Computer Vision 61, 55–79 (2005) 19. Sobel, I., Feldman, G.: A 3x3 Isotropic Gradient Operator for Image Processing. Presented at a talk at the Stanford Artificial Project in 1968, unpublished but often cited, orig. in Pattern Classification and Scene Analysis (1973) 20. Lucas, B., Kanade, T.: An Interactive Image Registration Technique with an Application in Stereovision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence, pp. 674–679 (1981) 21. DeMenthon, D.F., Davis, L.S.: Model Based Object Pose in 25 Lines of Code. In: Proceedings of 2nd European Conference on Computer Vision, Santa Margherita Ligure, May 1992, pp. 335–343 (1992) 22. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Comm. of the ACM 24, 381–395 (1981) 23. 
Sohail, A.M., Bhattacharya, P.: Detection of Facial Feature Points Using Anthropometric Face Model. In: Proceedings of IEEE International Conference on Signal-Image Technology and Internet-Based Systems, Hammamet, Tunisia, pp. 656–665 (2006)
High Performance Implementation of License Plate Recognition in Image Sequences
Andreas Zweng and Martin Kampel
Institute of Computer Aided Automation, Vienna University of Technology, Pattern Recognition and Image Processing Group, Favoritenstr. 9/1832, A-1040 Vienna, Austria
{zweng,kampel}@prip.tuwien.ac.at
Abstract. License plate recognition is usually performed on single images: the plate is analyzed in three steps, namely the localization of the plate, the segmentation of the characters and the classification of the characters. Temporally redundant information has already been used to improve the recognition rate; to exploit it, fast algorithms are needed so that as many classifications of a moving car as possible can be collected. In this paper we present a fast implementation for single classifications of license plates and performance-increasing algorithms for the statistical analysis of image sequences that go beyond a simple majority voting. The motivation for using the redundant information in image sequences, and therefore classifying one car multiple times, is to obtain a more robust, converging classification in which wrong single classifications can be suppressed.
1 Introduction
Automatic license plate recognition is used for traffic monitoring, parking garage monitoring, detection of stolen cars or any other application where license plates have to be identified. Usually the classification of the license plate is done once, where the images are grabbed from a digital network camera or an analog camera. Temporally redundant information has already been used to improve the recognition rate by a simple majority voting [1]. Typically, the localization of the license plate is the critical problem in the detection process. For the localization, candidate finding is used in order to find the region most likely to be a license plate region. If the license plate is not localized correctly for any reason (e.g., traffic signs are located in the background), there is no further chance to detect the license plate when taking only one frame into consideration. In this paper we propose a new approach to automatic license plate recognition in video streams using redundant information. The idea is to use standard methods for the classification part and to use the classifications of as many frames as possible in order to exclude classifications where the wrong region instead of the license plate region was found, or where the classification of the characters failed due to heavily illuminated license plates, partly occluded regions or polluted plates.
The algorithm has to perform at a rate of at least 10 frames per second to retrieve as many classifications as possible for the statistical analysis; therefore, we developed high-performance implementations of existing algorithms. The remainder of this paper is organized as follows: in Section 2 related work is presented, in Section 3 the methodology is reviewed (plate localization, character classification and the classification of the plate), in Section 4 experimental results are shown, and Section 5 concludes the paper. Our approach has the following achievements:
– High recognition rate
– High potential for further improvement
– Converging progress of classification
– Fast implementation
2 Related Work
Typically, license plate recognition starts by finding the area of the plate. Owing to the rich corner structure of license plates, Qin et al. use corner information to detect the location of the plate [2]. Drawbacks of this approach are that the accuracy of the corner detection determines the accuracy of the plate localization and that the detection depends on manually set parameters. Chen et al. use edge projection to localize the plate, which has drawbacks when the background is complex and contains rich edges [3]. For character segmentation, Chen et al. make use of vertical edges to detect the separations between characters in the license plate [3], whereas Yang et al. use region growing to detect blobs from a given set of seed points [4]. The main problem of this approach is that the seed points are chosen by taking the maximum gray values of the image, which can lead to skipped characters when no pixel in a character's blob has the maximum value. For character classification, neural networks are used in [5]. Decision trees can also be used to classify characters, where each decision stage splits the character set into subsets; in the case of a binary decision tree, each character set is divided into two subsets. Temporally redundant information of license plate classifications has already been used in [1]. A simple majority voting is used to improve the final classification, which leads to better classification results as long as character classification and license plate localization errors do not cause too many different classifications compared to the number of temporally redundant classifications. In fact, the classifications are mostly defective at one or two positions, in terms of character classification errors or segmentation errors. With majority voting, the correctly classified characters are not taken into account if the actual license plate is not the majority-voted plate. For a more robust recognition these data should be included in the final classification. Two different algorithms are proposed in this paper to optimally use the redundant license plate information and to further improve the recognition rate with the help of redundant data. In contrast
to [6], we use fast algorithms in order to classify more license plates in a given time and thus obtain more redundant data, which helps to improve the final classification.
3 Methodology
The methodology is divided into three parts: the license plate localization, the character segmentation and the license plate classification. The license plate classification is separated into three subparts, namely the single classification, which is done in each frame, the temporal classification, which uses the single classifications and is also done in each frame, and the final classification, which uses the temporal classifications and is done once per license plate at a triggered moment (e.g., when the plate is not visible anymore). The algorithms in this paper are summarized in the graph shown in Fig. 1.
Fig. 1. Summary of algorithms
3.1 License Plate Localization
In video streams license plates are not present in all frames, so the first part of the localization is to detect whether a license plate is present or not. Edge projection is used to find plate candidates. To detect whether a plate is located in the region of interest of the image, the highest edge projection value is stored for the past frames; in our approach the past 1500 frames are taken to compute the mean edge projection value. If the highest edge projection value in the current frame is 10% higher than this mean value, further license plate detection is carried out, otherwise the frame is rejected. License plates can be localized by horizontal and vertical edge projection [7]. Edge projection is affected by noise, so horizontal and vertical average filtering is done before edge detection. To speed up the filtering we use a horizontal filter of size 16 by 1 pixels and a vertical filter of size 1 by 16 pixels, so that the normalization of the sum of the 16 pixel values can be done by shifting the integer value right by 4 bits instead of dividing it by 16, which yields the same result. The sum for the average filtering is fully calculated only for the first pixel; for the rest of the image columns the sum is updated by subtracting the leftmost pixel value and adding the next pixel value to the right. Similar calculations can be done for the vertical averaging filter. The size of 16 bytes also has the advantage that SIMD (Single Instruction Multiple Data) registers (e.g., SSE2) can hold 128-bit values, which are exactly sixteen 8-bit values.
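The running-sum filtering described above can be sketched as follows; this is a minimal illustration (not the authors' implementation), assuming an 8-bit grayscale image stored in row-major order. The vertical 1 × 16 filter works analogously with rows and columns exchanged.

#include <cstdint>

// Hypothetical sketch of the 16x1 horizontal mean filter: the window sum is
// maintained incrementally and normalized by a 4-bit right shift (divide by 16).
void horizontalMean16(const uint8_t* src, uint8_t* dst, int width, int height) {
    const int w = 16;                                  // window width (power of two)
    for (int y = 0; y < height; ++y) {
        const uint8_t* row = src + y * width;
        uint8_t* out = dst + y * width;
        int sum = 0;
        for (int x = 0; x < w; ++x) sum += row[x];     // full sum only once per row
        for (int x = 0; x + w <= width; ++x) {
            out[x] = static_cast<uint8_t>(sum >> 4);   // shift instead of division
            if (x + w < width)
                sum += row[x + w] - row[x];            // slide the window one pixel
        }
    }
}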
The use of SIMD instructions has the disadvantage that the values first have to be loaded into the registers before the sum can be calculated; therefore, the first approach mentioned above was used. In order to speed up the edge projection in the x and y directions for clipping (see [7]), we use the Roberts filter and compute the projection histograms within the convolution. The memory access can be optimized by using a separate pointer variable for each row of the filter kernel (see the pseudocode):

program ConvolveRobertsX
  for each row
    pointerVar1 = GetPixel(0, rowNr)      // Roberts X kernel taps at
    pointerVar2 = GetPixel(1, rowNr + 1)  // (x, y) and (x+1, y+1)
    for each column
      HistoX(rowNr) += pointerVar2 - pointerVar1
      pointerVar1++
      pointerVar2++
    endfor
  endfor

License Plate Candidates. The candidate region found by taking the highest peak of the edge projection is not necessarily the region of the real license plate. The lights of a car and the car's grille also produce high edge projection values, so these candidates have to be rejected and other peaks have to be analyzed. Our experience is that three candidates for the horizontal projection and three candidates for the vertical projection (for each of the three horizontal projections) are enough to find the real license plate. In Fig. 2, the edge projection at the position of the lights is higher than the projection of the plate, which is why the regions at the lights are detected as candidates (red rectangles). After analyzing these regions, no evidence of a license plate could be found, so the three best candidates are rejected and the plate is found at the second highest peak of the horizontal edge histogram. The characters are then segmented by binarizing the image region with a threshold value using the algorithms described in [9] or [10].
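The candidate selection itself can be sketched as a simple peak search on the projection histogram. The helper below is a hypothetical illustration (the function name, the suppression radius and the use of non-maximum suppression are our assumptions, not details given in the paper): it returns the k highest peaks while suppressing the neighbourhood of each accepted peak so that the same band is not reported twice.

#include <algorithm>
#include <vector>

// Hypothetical helper: pick the k highest peaks of a projection histogram as
// band candidates; rows within +/- radius of an accepted peak are zeroed out.
std::vector<int> topPeaks(std::vector<double> histo, int k, int radius) {
    std::vector<int> peaks;
    while (static_cast<int>(peaks.size()) < k) {
        auto it = std::max_element(histo.begin(), histo.end());
        if (it == histo.end() || *it <= 0.0) break;
        int best = static_cast<int>(it - histo.begin());
        peaks.push_back(best);
        int lo = std::max(0, best - radius);
        int hi = std::min(static_cast<int>(histo.size()) - 1, best + radius);
        for (int i = lo; i <= hi; ++i) histo[i] = 0.0;   // non-maximum suppression
    }
    return peaks;
}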
Fig. 2. Candidate finding and rejection
3.2 Feature Extraction and Character Classification
In our approach, character classification is done using a decision tree. In each stage, characters are separated by a comparison of features until only one character remains in each leaf. The advantage of this approach is that not all features have to be calculated every time, but only the features needed to step to the next subtree, which increases CPU performance. For each character several features are extracted [11][12]. The features are computed for a number of zones per character. Before we are able to classify characters, we first have to train the tree by computing all features for the whole training set (about 250,000 characters are used) and by determining the feature-to-feature relations which separate the characters best. Comparisons with simple thresholds are not used because most features vary with rotation and scale. Instead, comparisons between local features are used for decisions such as "the lower left zone average gray level is higher than the top right zone gray level". The result of a decision stage in the training phase could be that 99.5% of all samples of character1 are classified to the left subtree, 99.7% of all samples of character2 are classified to the right subtree, and 70% of all samples of character3 are also classified to the right subtree. In such cases character3 is kept in both subtrees for a more precise decision when other features are used for classification further down the tree.
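As a rough illustration of this scheme (not the authors' implementation; the node layout and names are our assumptions), a tree node can simply store the indices of the two zone features it compares, so that only the features touched along the root-to-leaf path need to be evaluated:

#include <functional>

// Hypothetical feature-comparison decision tree node. Inner nodes compare two
// zone features of the character image; leaves carry the character label.
struct Node {
    int featA = -1, featB = -1;   // indices of the two zone features compared
    Node* left = nullptr;         // taken when feature A <= feature B
    Node* right = nullptr;        // taken when feature A >  feature B
    char label = 0;               // valid at leaves only
};

// Only the zone features compared along the traversed path are evaluated.
char classify(const Node* n, const std::function<double(int)>& feature) {
    while (n->left && n->right)
        n = (feature(n->featA) > feature(n->featB)) ? n->right : n->left;
    return n->label;
}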
3.3 The Use of Temporal Information
A license plate of a car may be present for 30 frames, which means that 30 classifications are stored for this plate. The redundancy of these data has been used in [1] for better recognition rates, but only with a majority voting, which is not very robust against single character classification errors. Our classifier stores the data of each character for all frames. In each frame all characters are classified and the number of characters is stored. In the current frame the classification includes all previous classifications of that license plate. Consider the following example, in which the license plate is classified four times: (128A, 28A, 12BA, 128H). We call these classifications single classifications. The second classification did not recognize the first character, so all characters are shifted to the left, which causes errors in the classification. For that reason the first step is to compute the median of the number of characters over the classifications; in this case the median is four characters. For further analysis, only classifications with four characters are used, to prevent classifications like the second one in the example from affecting the result. After this step the following license plates are left: (128A, 12BA, 128H). The second step is the computation of the median character for each position, which results in the classification "128A". This process is done in each frame and leads to a converging classification, which we call the temporal classification because it is not yet the final decision on the license plate. The car corresponding to the license plate currently being detected may drive out of the camera's view. In that case the license plate is partly outside of the image and the classification recognizes fewer characters than before for these frames.
This can lead to misclassification, and the temporal classification may converge to a wrong license plate. To suppress this problem, the final decision is computed by taking the maximum occurrence of the temporal classifications. Table 1 illustrates the classification steps with an example (the real license plate is 128A).

Table 1. Example of a classification process

Frame  Single classification  Temporal classification
1      128A                   128A
2      28A                    128A
3      128H                   128A
4      128A                   128A
5      12B                    128A
6      12B                    128A
7      12B                    12B
The temporal classification of the last frame in which the license plate is visible should be the best classification because of the converging process. In this example the last three frames did not recognize the rightmost character, due to the position of the car, which may be partly outside the view. Because of this, the median of the numbers of characters is three instead of the correct value of four, so the single classifications with three characters are used for the temporal classification. The final classification in this example would be "12B", which is wrong. In that case all temporal classifications are used to compute the classification of maximum occurrence: "128A" has the maximum occurrence in this example, although the single classifications are correct in only two of the seven frames. Because later temporal classifications include more single classifications, the temporal classifications should be weighted to support this approach; each temporal classification is weighted by the number of single classifications it includes. The temporal classification in the first frame contains only one single classification, whereas the temporal classification in frame 7 contains seven single classifications. The final result of the example from Table 1 is "128A" with a score of 21 (the plate with the highest score is taken as the final result). The scores from Table 1 are calculated as follows:
Score of "128A" = 1 + 2 + 3 + 4 + 5 + 6 = 21
Score of "12B" = 7
Statistical Analysis of Classifications. We have developed and tested three different approaches for the statistical analysis of the classified license plates.
– Most frequent license plate ("MFP"): This approach calculates candidates from the available single classifications and chooses the most frequent candidate (the approach from [1]).
– Single, Temporal and Final Classification ("STF"): This approach uses the single classifications to calculate temporal classifications and a final classification (see the example in Table 1, Section 3.3).
– Best matching position ("BMP"): This approach also uses single, temporal and final classifications, but with all single classifications. Each classification is matched by calculating the best matching position with respect to the previous temporal classification and is then used together with the other previous single classifications for the next temporal classification (see the example in Table 2). The median value is not used in this approach because every license plate contributes to the temporal classification as long as at least 25% of its characters match.

Table 2. Example of best matching positions

Frame  Single classification  Temporal classification  Best matching position
1      128A                   128A                     0-4
2      28A                    128A                     1-4
In the first frame we have one sample for each character, in the second frame we have (due to the best matching position) one sample for the first character (“1”) and two samples for the other characters (“28A”) and so on.
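The following is a compact sketch of the "STF" scheme described above (function and variable names are ours; ties and degenerate inputs are ignored for brevity). temporalClassification() keeps only single classifications of median length and takes the most frequent character at each position; finalClassification() weights each temporal classification by the number of single classifications it contains, which reproduces the score of 21 for "128A" in the example of Table 1.

#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Temporal classification: median plate length, then a per-position vote over
// all single classifications of that length.
std::string temporalClassification(const std::vector<std::string>& singles) {
    std::vector<size_t> lens;
    for (const auto& s : singles) lens.push_back(s.size());
    std::nth_element(lens.begin(), lens.begin() + lens.size() / 2, lens.end());
    size_t median = lens[lens.size() / 2];

    std::string result(median, '?');
    for (size_t pos = 0; pos < median; ++pos) {
        std::map<char, int> votes;
        for (const auto& s : singles)
            if (s.size() == median) ++votes[s[pos]];
        result[pos] = std::max_element(votes.begin(), votes.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; })->first;
    }
    return result;
}

// Final classification: each temporal classification is weighted by the number
// of single classifications it already includes (frame index + 1).
std::string finalClassification(const std::vector<std::string>& temporals) {
    std::map<std::string, int> score;
    for (size_t i = 0; i < temporals.size(); ++i)
        score[temporals[i]] += static_cast<int>(i + 1);
    return std::max_element(score.begin(), score.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; })->first;
}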
4 Experimental Results
The huge number of classifications for each license plate requires good computational performance of the algorithm. Our implementation was tested on an Intel Pentium M 1.8 GHz with input images of 640 by 480 pixels. The vertical edge detection was only computed on the region of the horizontal band. The classification rate is divided into three categories: the detection of whether a license plate is present in the current frame, the classification of each character, and the final decision of the license plate classification. The results can be seen in Table 3, where about 350,000 characters are classified and 1760 license plates are found and classified. Compared to [6] and [1] the results are almost equal, at about 98%, but our algorithm mostly fails in the license plate localization stage, which can be improved without losing the ability to classify as many frames as the camera can grab (e.g., 25 frames per second). The approach of [1], which is equal to the "MFP" approach (see Fig. 3 and Fig. 4), is the worst of our three tested approaches. The algorithm in [6] is more robust to rotations and other variations since the characters are trained for those particular variations. However, their localization requires a high-performance CPU and the classification is not fast enough to apply our statistical analysis approach to our test sequences. Compared to our running time of 2.857 ms to 12.5 ms (80 fps to 350 fps) on a mobile 1.8 GHz CPU, depending on the number of candidates found, their performance
was 0.91 fps (1.1 seconds) for an image of size 640 × 480 pixels on a 2.5 GHz CPU, and the running time of the algorithm in [1] on an unknown CPU is 81 ms (12.35 fps) for images of size 352 × 288 pixels.

Table 3. Recognition performances of the different types of detections

Character classification  License plate detection  License plate classification
98.5%                     100%                     97.95% (1724 of 1760 cars)
To compare the three different statistical approaches (MFP, STF, BMP), they are tested on automatically generated random license plates (character strings), where 50 samples (single classifications) per license plate and 50 different license plates are generated. The license plates are first shortened with a certain probability to simulate incorrectly localized license plates, and the remaining characters are then corrupted with a certain probability to simulate incorrectly classified characters. Figures 3 and 4 show results on generated license plates with 80% and 60% correct plate length. Our "BMP" and "STF" approaches clearly outperform the approach from [1] in this evaluation.
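For reference, the synthetic single classifications used in this comparison can be generated along the following lines. This is a hypothetical sketch of the procedure described above (the character alphabet, the truncation side and the random number generator are our assumptions), not the exact generator used in the experiments.

#include <random>
#include <string>
#include <vector>

// Generate n corrupted single classifications of one plate: the full length is
// kept with probability pLen (otherwise the plate is truncated), and each
// remaining character is replaced by a random one with probability pChar.
std::vector<std::string> simulateSingles(const std::string& plate, int n,
                                         double pLen, double pChar,
                                         std::mt19937& rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::uniform_int_distribution<int> ch('A', 'Z');
    std::vector<std::string> out;
    for (int i = 0; i < n; ++i) {
        std::string s = plate;
        if (u(rng) > pLen && s.size() > 1)
            s.erase(0, 1);                               // simulate a lost character
        for (char& c : s)
            if (u(rng) < pChar) c = static_cast<char>(ch(rng));
        out.push_back(s);
    }
    return out;
}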
Fig. 3. Evaluation of statistical analysis of license plate classifications with 80% correct license plate length
The characters and the 1760 license plates for the evaluation were extracted from a 12-hour video sequence and manually annotated for evaluation and training. The conditions for the license plate classification varied over time from dawn until dusk. The video sequence was recorded with a static camera at a parking garage facing the incoming cars, as illustrated in Fig. 5a to 5c. The detection of whether a license plate is present in a frame was correct in every case. In Fig. 5a the license plate is partly polluted ("9" and "R") so that the classification fails. Fig. 5b illustrates a correct temporal classification due to the weight of the previous classifications, although the light may be a cause of segmentation errors.
Fig. 4. Evaluation of statistical analysis of license plate classifications with 60% correct license plate length
Fig. 5. License plate detection results (a)(b)(c)
In Fig. 5c the license plate was not correctly localized. Due to the number of single classifications, the temporal classification is not affected.
5 Conclusion and Future Work
In this paper a new approach to license plate recognition is presented in which the license plate is recognized not only in one frame but in several consecutive frames. For this purpose, statistical approaches are presented that improve the classification result. The single classification approach can be exchanged, but for a real-time system it should be executable several times per second. The single classification used in our work achieves a classification rate of 43%, which our statistical analysis improves to 97.95%. The main problem of the single classification used here is the localization of the license plate (about 57% correct localizations). Our approach extends existing approaches by analyzing the classifications in each frame with the help of the information from image sequences. This extension leads to a better classification result in cases where single characters are misclassified. For future work, the recognition should be made independent of the country of the license plate. To this end, a decision tree for each country's license plate characters and one decision tree for a "country decision" is built. In the first step the characters
are analyzed to determine which country they belong to, and in the second step the decision tree corresponding to that country is chosen to classify the characters.
Acknowledgment
This work was partly supported by CogVis1 Ltd. However, this paper reflects only the authors' views; CogVis Ltd. is not liable for any use that may be made of the information contained herein.
References 1. Donoser, M., Arth, C., Bischof, H.: Detecting, tracking and recognizing license plates. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part II. LNCS, vol. 4844, pp. 447–456. Springer, Heidelberg (2007) 2. Qin, Z., Shi, S., Xu, J., Fu, H.: Method of license plate location based on corner feature. In: The Sixth World Congress on Intelligent Control and Automation, 2006. WCICA 2006, vol. 2, pp. 8645–8649 (2006) 3. Chen, X.F., Pan, B.C., Zheng, S.L.: A license plate localization method based on region narrowing. In: 2008 International Conference on Machine Learning and Cybernetics, vol. 5, pp. 2700–2705 (2008) 4. Yang, F., Ma, Z., Xie, M.: A novel approach for license plate character segmentation. In: 2006 1st IEEE Conference on Industrial Electronics and Applications, pp. 1–6 (2006) 5. Anagnostopoulos, C., Anagnostopoulos, I., Loumos, V., Kayafas, E.: A license plate-recognition algorithm for intelligent transportation system applications. IEEE Transactions on Intelligent Transportation Systems 7, 377–392 (2006) 6. Matas, J., Zimmermann, K.: Unconstrained licence plate detection. In: Pfliegl, R. (ed.) 8th International IEEE Conference on Intelligent Transportation Systems, Medison, US, pp. 572–577. IEEE Inteligent Transportation Systems Society (2005) 7. Martinsky, O.: Algorithmic and mathematical principles of automatic number plate recognition systems. B.SC Thesis, Brno (2007) 8. Zhang, Y., Zhang, C.: A new algorithm for character segmentation of license plate. In: Proceedings of Intelligent Vehicles Symposium, 2003, pp. 106–109. IEEE, Los Alamitos (2003) 9. Ridler, T.W., Calvard, S.: Picture thresholding using an iterative selection method. IEEE Transactions on Systems, Man and Cybernetics 8, 630–632 (1978) 10. Lee, B.R.: An active contour model for image segmentation: a variational perspective. In: Proc. of IEEE International Conference on Acoustics Speech and Signal Processing, Mimeo (2002) 11. Abdullah, S.N.H.S., Khalid, M., Yusof, R., Omar, K.: Comparison of feature extractors in license plate recognition. In: First Asia International Conference on Modelling and Simulation, 2007. AMS 2007, pp. 502–506 (2007) 12. Peura, M., Iivarinen, J.: Efficiency of simple shape descriptors. In: Aspects of Visual Form, pp. 443–451. World Scientific, Singapore (1997)
1 http://www.cogvis.at
TOCSAC: TOpology Constraint SAmple Consensus for Fast and Reliable Feature Correspondence
Zhoucan He, Qing Wang, and Heng Yang
School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an 710072, P.R. China
[email protected]
Abstract. This paper addresses outlier screening for feature correspondence in image matching. A novel robust matching method, called topology constraint sample consensus (TOCSAC), is proposed to speed up the matching process while keeping the matching accuracy. The TOCSAC method comprises two parts: the first is a constraint on point order, which is invariant to scale, rotation and viewpoint change; the second is a constraint based on an affine-invariant vector, which remains valid under similarity and affine transforms. Compared to classical algorithms such as RANSAC (random sample consensus) and PROSAC (progressive sample consensus), the proposed TOCSAC significantly reduces the time cost and improves the performance of wide-baseline image correspondence.
1 Introduction
Finding reliable corresponding features in two or more views of the same scene is a fundamental step in image-based 3D scene reconstruction [11], panorama stitching [10,12] and other computer vision applications. No matter how robust the feature descriptor is, it is generally accepted that incorrect matches cannot be avoided in the first stage of the matching process, where only local image descriptors are compared [2]. Thus, a great number of robust estimation algorithms, such as LMS (least median of squares) [7], RANSAC (random sample consensus) [3,13], improved RANSAC [14], adaptive real-time RANSAC [15], MLESAC (maximum likelihood estimation sample consensus) [4,5], PROSAC (progressive sample consensus) [2], the outlier model [6] and so on, have been proposed in the literature to remove mismatches (outliers) caused by phenomena such as repetitive patterns, occlusions and noise. The LMS estimator [7] selects the model with the least median of squares, but it fails when the outliers exceed 50% [1]. The well-known RANSAC and its improved versions follow a hypothesize-and-verify scheme [2], and these methods can deal with images with a very low inlier ratio. Generally, RANSAC iteratively estimates the model from randomly sampled minimal sets of tentative corresponding matches, chooses the model with the largest number of supporting inliers as the true solution, and finally computes an optimized estimate from those inliers. Unlike RANSAC, MLESAC takes the log likelihood of the estimate as support instead of the number of inliers under a certain error threshold, so that MLESAC slightly improves the performance, but at a greater computational cost [4].
To tackle the low efficiency of MLESAC, a guided sampling strategy [5] was proposed that estimates the correctness probability of each individual correspondence, and the results are effective and efficient. On the other hand, PROSAC provides another guided method which takes the match similarity into account: by sampling from small sets in which the correspondences have the highest similarity, the PROSAC algorithm saves much time. Unlike the traditional methods, Hasler et al. [6] build a complex outlier model based on the content of the two views, by assuming that the error in pixel intensity generated by an outlier is similar to the error generated by comparing two random regions in the scene. However, it is also very time consuming. In this paper, we propose a novel topology constraint sample consensus algorithm for outlier screening with much higher efficiency than state-of-the-art algorithms such as RANSAC and PROSAC, while keeping the same performance. Our idea is to propose a new, effective guidance which not only saves much of the time cost but also preserves robustness. In PROSAC only a mild assumption is made and no theoretical justification is given; here we address and discuss the problem from both practical and theoretical aspects, and in doing so recommend that random sampling be replaced by the TOCSAC guided search, especially in time-critical applications such as large-scale scene reconstruction and content-based image retrieval in huge databases. Before outlier rejection, we first carry out feature detection and extract SIFT (scale invariant feature transform) [9] features from the images, and adopt the recently proposed DBH (dichotomy based hash) [8] search algorithm, which performs well and is especially suited to SIFT descriptors, to obtain coarse feature matches. Based on the coarse matching results, outlier screening is then carried out to obtain fine matches. This paper is organized as follows: in Section 2 we give a detailed description and theoretical analysis of the proposed TOCSAC algorithm; in Section 3 experimental results are shown, demonstrating that the time cost is significantly reduced while high matching accuracy is kept; finally, we conclude in Section 4.
2 TOCSAC Algorithm
Generally speaking, to use robust estimators such as RANSAC, a specific geometric constraint model, such as epipolar geometry or a planar homography, has to be adopted. In this paper, we adopt homography geometry to model the outlier screening process, and at least four pairs of tentative matches are used to calculate the homography [1].
2.1 Key Idea of TOCSAC
As mentioned previously for guided-MLESAC [5] and PROSAC [2], an effective guidance contributes much to reducing the time cost in addition to preserving robustness. For example, PROSAC considers the matches with high similarity (for SIFT, the distance ratio is used); however, there is no theoretical guarantee that the matches with lower distance ratios are the most outlier-prone ones. In fact, for truly matched pairs, not only the epipolar or homography geometry
holds; some unexplored topology constraints also hold, and they can help guide the estimators. As shown in Figure 1, no matter what the transformation is, the points' relative topological order after the transformation should be the same as the initial one. That is to say, if four sampled tentative matched pairs have different topological orders in the two views, the sample is definitely a bad one containing mismatches. However, the point order constraint alone is not enough to assure inliers, since it is hard to avoid the coincidence that some mismatched pairs have the same topological order. As an enhancement, another constraint, based on an affine-invariant vector, is derived from the order-constrained samples; it provides a tighter and more precise constraint that guarantees with high probability that the samples are inliers. Usually, the changes between wide-baseline image pairs are scale, rotation, affine transforms and luminance, and the affine-invariant vector constraint is valid for all of these changes. We call the two guiding strategies mentioned above topology constraints, and we present a detailed description in the following subsections.
Fig. 1. The order relationships of the truly matched points under different transforms: (a) scale; (b) scale and rotation; (c) affine transform. Taking point A and a as the starting reference points, the point order is kept as A(a)→B(b)→C(c)→D(d) (anti-clockwise) under the different transforms
2.2 The Approaches
The Constraint of Points Order. For a random sample S with four pairs of corresponding image features, X = {a, b, c, d} denotes the four points in the first view and Y = {A, B, C, D} the tentatively matched points in the second view. Without loss of generality, we assume that any three points of X or Y are non-collinear. To obtain the point order in X, we set up an oriented line through two of the points such that the other two points lie on different sides of this line. For example, suppose we link points a and c to obtain the directed line a × c (with the points in homogeneous coordinates, a × c is the line through a and c). If (a × c) · b < 0 and (a × c) · d > 0, the point order is determined as a→b→c→d;
otherwise, if (a × c) · b > 0 and (a × c) · d < 0, the point order is a→d→c→b. Thus, an order constraint function is built for X:

g(X) = sign((r × s) · t) + 2 · sign((r × s) · p)    (1)

where r, s, t, p ∈ X and sign((r × s) · t) + sign((r × s) · p) = 0. In a similar way, the order constraint function for Y is defined as

f(Y) = sign((r′ × s′) · t′) + 2 · sign((r′ × s′) · p′)    (2)

where r′, s′, t′, p′ ∈ Y are the tentatively matched points corresponding to r, s, t, p in X. Consequently, the final constraint function for the sample S is
H(S) = g(X) − f(Y)    (3)
If H(S) = 0, the sample S is said to satisfy the order constraint and is further processed using the affine-invariant vector constraint in the next step.
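A minimal sketch of this check for 2D points is given below (struct and function names are ours). side() evaluates the same signed quantity as (r × s) · t does for points written in homogeneous coordinates, so comparing the codes obtained in the two views corresponds to testing H(S) = 0; the paper additionally chooses (r, s) such that t and p lie on opposite sides of the line.

#include <array>

struct Pt { double x, y; };

// Signed test: positive when c lies on the left of the directed line a -> b;
// equals (a x b) . c for homogeneous points.
double side(const Pt& a, const Pt& b, const Pt& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Order code g(.) for a sample {r, s, t, p} = {q[0], q[1], q[2], q[3]}.
int orderCode(const std::array<Pt, 4>& q) {
    int st = side(q[0], q[1], q[2]) > 0 ? 1 : -1;
    int sp = side(q[0], q[1], q[3]) > 0 ? 1 : -1;
    return st + 2 * sp;
}

// Points-order constraint: the codes of the two views must agree (H(S) = 0).
bool orderConsistent(const std::array<Pt, 4>& X, const std::array<Pt, 4>& Y) {
    return orderCode(X) == orderCode(Y);
}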
Fig. 2. Samples of (a) consistent and (b) inconsistent order between two views
Fig. 3. An erroneous sample with consistent point order, although B is not a true match for b
The Constraint of the Affine Invariant Vector. In fact, using only the point order constraint to verify the samples is not sufficient. As shown in Fig. 3, there can still be mismatches even when the point order is consistent. In this section a stricter constraint is proposed. Generally, the transformation between two images is affine or a similarity, so an affine-invariant description of the four pairs of tentative matches should be valid in both situations. As is well known, the ratio of triangle areas is invariant under affine transforms, and it is easy to compute the areas of the triangles formed by the tentatively matched points. We therefore obtain the area ratios from the samples as follows. Since any three pairs of tentative matches form a pair of corresponding triangles, the four matched pairs in one sample form four pairs of triangles, and the ratios between these triangle areas and the sum of the four areas compose an affine-invariant vector of length 4:
P = [S1/Ssum, S2/Ssum, S3/Ssum, S4/Ssum]^T    (4)

Q = [S1′/S′sum, S2′/S′sum, S3′/S′sum, S4′/S′sum]^T    (5)
where Si (i = 1, …, 4) and Ssum = Σ_{i=1}^{4} Si denote the triangle areas and their sum in the first view, respectively, and Si′ (i = 1, …, 4) and S′sum are the corresponding quantities in the second view. If the tentatively matched points in the sample are true correspondences, the affine vectors P and Q must be identical or similar to each other under a suitable measurement. In this paper, the Euclidean distance is used to represent the similarity of
the two affine-invariant vectors P and Q. Before computing the distance, the vectors P and Q are normalized to unit length. In order to reject bad samples with high probability, a very strict threshold of 0.05 is used in practice.
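The affine-invariant vector test can be sketched as follows; it reuses the Pt struct from the previous sketch, and the 0.05 threshold is the value quoted above (all other names are ours).

#include <array>
#include <cmath>

// Area of the triangle spanned by three points (Pt as in the previous sketch).
double triArea(const Pt& a, const Pt& b, const Pt& c) {
    return std::fabs((b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x)) * 0.5;
}

// Affine-invariant vector of Eq. (4): the four triangle areas of the sample,
// each divided by their sum.
std::array<double, 4> affineVector(const std::array<Pt, 4>& q) {
    std::array<double, 4> S = { triArea(q[1], q[2], q[3]), triArea(q[0], q[2], q[3]),
                                triArea(q[0], q[1], q[3]), triArea(q[0], q[1], q[2]) };
    double sum = S[0] + S[1] + S[2] + S[3];
    for (double& s : S) s /= sum;
    return S;
}

// Accept the sample when the unit-normalized vectors of the two views differ
// by less than 0.05 in Euclidean distance.
bool affineConsistent(const std::array<Pt, 4>& X, const std::array<Pt, 4>& Y) {
    auto P = affineVector(X), Q = affineVector(Y);
    double nP = 0.0, nQ = 0.0, d = 0.0;
    for (int i = 0; i < 4; ++i) { nP += P[i] * P[i]; nQ += Q[i] * Q[i]; }
    nP = std::sqrt(nP); nQ = std::sqrt(nQ);
    for (int i = 0; i < 4; ++i) { double e = P[i] / nP - Q[i] / nQ; d += e * e; }
    return std::sqrt(d) < 0.05;
}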
2.3 Theoretical Analysis of TOCSAC
In this subsection we give a formal theoretical analysis of the TOCSAC algorithm. To make the discussion clear, the notation and its meaning are listed in Table 1.

Table 1. Notation used in the following analysis

N: the number of tentative matched points between the two views.
S: a sample of 4 pairs of matches randomly selected from the N tentative matches.
n_in: the number of inliers inferred from the N tentative matches.
p_in: the probability that a match is a true match (inlier).
p_out: the probability that a match is a mismatch (outlier).
p(r_i): the probability that a sample has i (i = 0, …, 4) correct matches.
p(r_pos | out): the conditional probability that a mismatch is located in a certain specified area.
p(top): the probability that a sample obeys the topology constraints.
p(top | r_i): the conditional probability that a sample obeys the topology constraints given that it has i (i = 0, …, 4) pairs of truly matched points.
p(r_4 | top): the conditional probability that a sample is a good one without mismatches given that it satisfies the topology constraints.
The validity of checking whether a sample S is an inlier or an outlier is determined by the conditional probability p(r_4 | top); the larger p(r_4 | top), the more effective TOCSAC is. Before discussing the computation of p(r_4 | top), we put forward an assumption.
Assumption: If a match is not a true one, the mismatched point in the second view can be at any position except the correct one. That is, the mismatched points are uniformly distributed in the second view, and the probability of a mismatched point lying in a specified area depends on the size of that area. The probability of a match being a true one (inlier) is

p_in = n_in / N    (6)

Otherwise, the probability of a match being a mismatch (outlier) is

p_out = 1 − p_in    (7)
According to Bayes' rule, we have

p(r_4 | top) = p(top | r_4) p(r_4) / p(top) = p(top | r_4) p(r_4) / Σ_{i=0}^{4} p(top | r_i) p(r_i)    (8)
To calculate p(r_4 | top), p(r_i) is first obtained from

p(r_i) = C(4, i) · p_in^i · p_out^(4−i),  i = 0, …, 4    (9)
The problem now turns into obtaining the conditional probabilities p(top | r_i), i = 0, …, 4. To obtain them, five different cases have to be considered, as shown in Figure 4.
Case 1: When i = 4 (see Figure 4(b)), all 4 matches in S are correct, so clearly p(top | r_4) = 1.
Case 2: When i = 3 (see Figure 4(c)), the position of the single mismatched point is in fact confined to a certain area determined by the other points under the topology constraints. Thus, p(top | r_3) = p(r_pos | out).
Case 3: When i = 2 (see Figure 4(d)), suppose without loss of generality that the first two matches are inliers. The third matched point C at least needs to stay on the right side of the directed line AB, as c stays on the right side of ab, under the point order constraint; since its position cannot be determined more precisely, the probability is conservatively set to 0.5 under the order constraint alone. Once C is fixed, the position of D is determined by the other points under the topology constraints, just as in Case 2, so p(top | r_2) = 0.5 · p(r_pos | out).
Case 4: When i = 1 (see Figure 4(e)), suppose that the first match is an inlier; the second matched point B may lie anywhere in the second image. Once B is fixed, the situation is similar to Case 3, leading to p(top | r_1) = 0.5 · p(r_pos | out).
Case 5: When i = 0 (see Figure 4(f)), the first two matched points may lie anywhere in the second image. Once their positions are fixed, the situation is again similar to Case 3, and p(top | r_0) = 0.5 · p(r_pos | out).
Fig. 4. Five different cases under the topology constraints. (a) The reference points in the first view. From (b) to (f) there are 0, 1, 2, 3 and 4 false matches in the second view, respectively, where false matches are shown in red
As stated in the assumption above, the mismatched points are assumed to be uniformly distributed, so if a mismatch's position is confined to a certain area, the corresponding probability can be computed from the size of that area.
For an image of x × y pixels and an acceptable area of u × v pixels, the above probability is p(r_pos | out) = (u × v)/(x × y). Typically, in order to obtain a more reliable result, a pessimistic value of 0.05 is used for p(r_pos | out). Thus, the Bayesian probability p(r_4 | top) can be computed immediately.
For example, if p_in is set to 0.5, the probability of a sample being a good one without any false match is 0.52, which is much higher than the 0.0625 of a purely random sample.
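The Bayesian estimate can be reproduced numerically along the following lines. This is a hypothetical helper (names are ours) that plugs the case probabilities above into Eqs. (8) and (9); the value of q = p(r_pos | out) is left as a parameter, since the resulting probability depends strongly on how it is chosen.

#include <cmath>

// p(r4 | top) from Eqs. (8)-(9), with p(top|r4)=1, p(top|r3)=q and
// p(top|r_i)=0.5*q for i <= 2, as listed in the five cases above.
double pGoodGivenTopology(double pIn, double q) {
    const double pOut = 1.0 - pIn;
    const double pTopGivenR[5] = { 0.5 * q, 0.5 * q, 0.5 * q, q, 1.0 };
    const double binom[5] = { 1, 4, 6, 4, 1 };          // C(4, i)
    double num = 0.0, den = 0.0;
    for (int i = 0; i <= 4; ++i) {
        double pRi = binom[i] * std::pow(pIn, i) * std::pow(pOut, 4 - i);   // Eq. (9)
        den += pTopGivenR[i] * pRi;                                         // p(top)
        if (i == 4) num = pTopGivenR[i] * pRi;
    }
    return num / den;                                   // Eq. (8)
}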
2.4 Termination Criteria and the Restrictions of TOCSAC
Suppose that n_in inliers have been extracted from the N tentative matches (the initial value of n_in is 0 and it changes during the iterative process). The termination conditions for TOCSAC are summarized as follows.
Sample times T_s. For a sample set S, α is the probability of S being a bad sample, so α = 1 − p_in^4. After T_s samplings, the probability of missing a good sample is β = α^(T_s). In order to guarantee that at least one sample is a good one without mismatches, a typical value of 0.01 [1] is set as the upper bound for β. As a result, when the number of samplings T_s makes β drop below 0.01, the iterative process is terminated.
Calculated times T_c. In our TOCSAC algorithm, a random sample is only evaluated further if it satisfies the topology constraints. As derived in Section 2.3, the Bayesian probability γ = p(r_4 | top) of a sample being a good one under the topology
constraints can be computed. After T_c such evaluations, the probability of missing a good sample is η = (1 − γ)^(T_c), and a very low value of 10^(−6) is set as the upper bound for η; the iteration is therefore terminated once η falls below 10^(−6).
Least inlier number n_min. In fact, a match may support a wrong solution. To prevent TOCSAC from selecting a solution supported by outliers that happen to be consistent with it, another constraint is checked, as mentioned in [9]. The distribution of the cardinalities of sets of random 'inliers' is binomial [2]:
p_NR(i) = C(N−4, i−4) · θ^(i−4) · (1 − θ)^(N−i+4),  i > 4    (10)
where θ is the probability that a match supports a false model. For a set of N tentative matches, n_min = min{ m : Σ_{i=m}^{N} p_NR(i) < μ }, where μ is typically set to 0.01
and θ is pessimistically set to 0.05. Consequently, the third termination criterion is that, regardless of which constraints are adopted, the minimum number of true inliers must be greater than n_min.
The restrictions. Since our algorithm is based on homography geometry and is effective under affine changes, it is not suited to general projective transformations.
It also becomes overly strict in scenes with large depth discontinuities in wide-baseline matching, because some correct matches that are not consistent with the point order constraint may be wrongly discarded. However, despite discarding a few correct matches, TOCSAC remains valid for general wide-baseline image matching, with the exception of projective transformations.
3 Experimental Results and Analysis
In this section we demonstrate the efficiency of TOCSAC with two groups of experiments. The experimental environment is a PC with an Intel Celeron 2.66 GHz CPU and 512 MB of memory. We implemented the RANSAC, PROSAC and TOCSAC estimators, and several pairs of images with strong rotation, scale and viewpoint changes are used to evaluate the three algorithms. For an impartial comparison of the three estimators, we consistently use the same DBH [8] method to obtain the coarse matches as the common input. To avoid random effects, all experiments are run 100 times, and we record the average time cost, number of iterations and number of inliers. The inliers are the final output of the estimators.
Challenging images with strong affine transformations. As shown in Figure 5, a set of four pairs of images with strong viewpoint changes, many repetitive structures and occlusions is used. Performance comparisons among the TOCSAC, PROSAC and adaptive RANSAC algorithms, including the number of iterations, the number of inliers and the total time cost, are given in Table 2. We explained above that the topology constraints are invariant to affine transformations, and the experimental results confirm that the new algorithm screens outliers effectively and much faster. The computational efficiency of TOCSAC is nearly 10 times that of RANSAC and PROSAC.
Fig. 5. Four pairs of challenging images: (a) Eave, with a strong affine transform and repetitive structures; (b) Bell tower, with a big viewpoint change and repetitive structures; (c) Shopping mall, with affine change and occlusions; (d) Graffiti, with rotation and a big viewpoint change
Table 2. Time cost (ms), iteration times and the number of inliers for the challenging images

         Eave                Bell tower          Shopping mall       Graffiti
         Time  #iter  nin    Time  #iter  nin    Time  #iter  nin    Time  #iter  nin
RANSAC   422   432    284    235   932    44     1510  5556   34     334   1625   25
PROSAC   406   395    285    212   846    45     1138  4167   35     296   1464   26
TOCSAC   59    42     280    22    39     44     107   90     33     36    17     25

Fig. 6. Four pairs of generic images: (a) Bell tower (rotation); (b) Bell tower (vast scale transform); (c) Car (luminance); (d) Boat (scale and rotation)

Table 3. Time cost (ms), iteration times and the number of inliers for the generic images

         Bell tower          Bell tower          Car                 Boat
         Time  #iter  nin    Time  #iter  nin    Time  #iter  nin    Time  #iter  nin
RANSAC   64    168    129    17    41     118    7     36     46     10    42     60
PROSAC   47    117    135    14    39     118    7     36     46     9     42     60
TOCSAC   12    16     128    6     6      118    2     5      46     4     6      60

Fig. 7. Correspondence results after outlier screening by the TOCSAC algorithm
Generic images. The following four pairs of images (shown in Figure 6) contain general transformations such as scale, rotation and luminance changes. The TOCSAC algorithm reliably finds the true feature correspondences, and the computational cost, number of iterations and number of inliers compared to RANSAC and PROSAC are listed in Table 3, from which we can see that TOCSAC again performs much better than the adaptive RANSAC and PROSAC algorithms. Furthermore, the correspondence results after outlier screening for the challenging and generic image pairs are illustrated in Figure 7.
4 Conclusions
A novel robust outlier screening method, TOCSAC (TOpology Constraint SAmple Consensus), is proposed in this paper. It takes topology constraints into account, namely point order and an affine-invariant vector, to deal with the wide-baseline correspondence problem. The distinguished performance on challenging data validates that the proposed TOCSAC algorithm achieves better results than state-of-the-art approaches such as PROSAC and adaptive RANSAC. As a result, we highly recommend the TOCSAC algorithm for future large-scale scene reconstruction from unordered wide-baseline images and for content-based image retrieval in huge image databases.
Acknowledgments This work is supported by National Natural Science Fund (60873085), National Hi-Tech Development Programs under grant No. 2007AA01Z314, P. R. China, and graduate starting seed fund of Northwestern Polytechnical University (z200963).
References 1. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision, 2nd edn. Cambridge University, Cambridge (2003) 2. Chum, O., Matas, J.: Matching with PROSAC – Progressive Sample Consensus. In: CVPR (2005) 3. Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. CACM 6(24), 381–395 (1981) 4. Torr, P.H.S., Zisserman, A.: MLESAC: A new robust estimator with application to estimating image geometry. In: CVIU, pp. 138–156 (2000) 5. Tordoff, B., Murray, D.: Guided sampling and consensus for motion estimation. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 82–96. Springer, Heidelberg (2002) 6. Hasler, D., Sbaiz, L., Süsstrunk, S., Vetterli, M.: Outlier Modeling in Image Matching. IEEE Tran. on PAMI 25(3), 301–315 (2003) 7. Rousseeuw, P.J., Leroy, A.M.: Robust regression and outlier detection. Wiley, New York (1987)
8. He, Z., Wang, Q.: A Fast and Effective Dichotomy Based Hash Algorithm for Image Matching. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part I. LNCS, vol. 5358, pp. 328–337. Springer, Heidelberg (2008) 9. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. Journal of Computer Vision 60(2), 91–110 (2004) 10. Brown, M., Lowe, D.: Recognising panoramas. In: Proc. ICCV, pp. 1218–1225 (2003) 11. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3D. ACM Trans. Graph. 25(3), 835–846 (2006) 12. Hays, J., Efros, A.A.: Scene completion using millions of photographs. ACM Transactions on Graphics (SIGGRAPH) 26(3) (2007) 13. Chum, O., Matas, J., Obdržálek, S.: Enhancing RANSAC by generalized model optimization. In: Proc. of the ACCV, vol. 2, pp. 812–817 (2004) 14. Márquez-Neila, P., García, J., Baumela, L., Buenaposada, J.M.: Improving RANSAC for Fast Landmark Recognition. In: Workshop on Visual Localization for Mobile Platforms (in conjunction with CVPR 2008), Anchorage, Alaska, USA (2008) 15. Raguram, R., Frahm, J.M., Pollefeys, M.: A Comparative Analysis of RANSAC Techniques Leading to Adaptive Real-Time Random Sample Consensus. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 500–513. Springer, Heidelberg (2008)
Multimedia Mining on Manycore Architectures: The Case for GPUs
Mamadou Diao and Jongman Kim
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30318, United States
{mamadou,jkim}@ece.gatech.edu
Abstract. Media mining, the extraction of meaningful knowledge from multimedia content, poses significant computational challenges on today's platforms, particularly in real-time scenarios. In this paper, we show how Graphics Processing Units (GPUs) can be leveraged for compute-intensive media mining applications. Furthermore, we propose a parallel implementation of color visual descriptors (color correlograms and color histograms) commonly used in multimedia content analysis on a CUDA (Compute Unified Device Architecture) enabled GPU (the Nvidia GeForce GTX280). Through the use of shared memory as a software-managed cache and efficient data partitioning, we reach computation throughputs of over 1.2 Gigapixels/s for HSV color histograms and over 100 Megapixels/s for HSV color correlograms. We show that we can achieve better-than-real-time performance, major speedups compared to high-end multicore CPUs, and performance comparable to known implementations on the Cell B.E. We also study different trade-offs in the size and complexity of the features and their effect on performance.
Introduction
The proliferation of image and video recording devices, coupled with the increase in connectivity and digital storage capacity, has led to a rapid growth in the amount of multimedia content available to users. To handle this large amount of multimedia data, there is a need to extract meaningful knowledge from it in order to analyze, search, organize, index and browse this content [1][2]. Media mining applications attempt to tackle this problem. Many video mining workloads require significant computational resources due to the large volumes of data involved and the complexity of the processing required. In a number of applications, those computations have to be performed in real time: live streaming video analysis for real-time meta-data extraction, content monitoring and filtering, and scene analysis for immersive communications. Feature extraction, a component of most media mining systems, is among the most computationally intensive tasks. When indexing multimedia repositories, the extraction of features is usually done off-line [3][4]. However, in real-time media mining scenarios, feature extraction has to be performed on the fly [5].
The shift toward parallel computing as the primary way to scale performance with increasing transistor density presents a major opportunity for media mining applications [6][7]. To make such applications more accessible to end users and across different platforms, it is crucial to develop systems that can harness the increasing raw computational power available in today's personal computers. GPUs, largely underexplored in current personal computers, are one example of massively parallel co-processors with tremendous raw computational capability [8]. That power is becoming more accessible through more suitable programming models such as CUDA [9]. Successfully mapping media mining workloads onto GPUs will have a positive impact on the adoption of such applications by the general public because of the pervasiveness of graphics cards. In this paper, we examine how GPUs can be leveraged for media mining workloads. We map the extraction of visual descriptors (HSV color correlograms and HSV color histograms) onto GPUs and demonstrate how such co-processors can be used for real-time content analysis. This provides insights into discovering fine-grained parallelism in multimedia workloads and hence allows us to implement media mining applications on massively parallel architectures (hundreds of cores). The remainder of the paper is organized as follows: Section 1 demonstrates that GPUs can significantly help in compute-intensive media mining workloads. Section 2 presents the related work. Section 3 is an overview of the GPU architecture and the CUDA programming model. Section 4.1 describes the implemented features. Section 4.2 describes our implementation on the GPU and the techniques used to increase performance through proper memory management and data partitioning. Section 4.3 discusses the performance of the proposed implementation; we compare our results with implementations on other multicore platforms (general-purpose multicore CPUs and the Cell processor) and analyze the different trade-offs in the size and complexity of the features as well as their effect on performance. Finally, Section 5 concludes our work.
1 The Case for GPUs in Media Mining Applications
This section discusses the use of GPUs in the context of media mining applications. An overview of two typical media mining applications is given in Figure 1: a Content Based Multimedia Indexing and Retrieval (CBMIR) engine (1a) and a real-time streaming video mining application (1b). CBMIR is a technique for indexing unstructured multimedia data from its content. Systems used to index archived multimedia content are often composed of two components: 1) a back-end part that extracts low-level features and generates higher-level semantic information on the multimedia database (annotations and semantic concepts), and 2) a query engine that processes search queries and returns results based on the similarity between the query and the indexed data. For real-time content analysis, however, the multimedia data is not sitting in a repository: the media content is not stored and is processed on the
Fig. 1. Overview of two typical media mining applications: (a) a Content Based Video Indexing and Retrieval (CBVIR) engine with a query engine; (b) real-time video mining
fly as it is being generated or streamed. Several applications fall within this category. Due to the real-time constraints, such applications often require more computing resources. Out of the different processing modules present in Figure 1, we identify the following tasks that account for most of the execution time [4][10] and see how they map to GPU architectures: Feature Extraction: This involves various low-level image processing algorithms. Commonly used features for video mining are color, texture, shape and motion features as well as scale-invariant local descriptors. Many feature extraction tasks display fine-grained data- and thread-level parallelism [4][10][11]. In the next sections, we will implement some color features on a GPU and demonstrate significant speedups. In [12], a CUDA implementation of Horn and Schunck's optical flow method is presented. Indexing: Ding et al. [13] proposed a framework for high-performance IR query processing on GPUs. They offloaded subtasks such as index decompression, inverted list traversal and intersection, and top-k scoring to the GPU. They showed that using a CPU-GPU combination achieves significantly higher query processing performance. Wu et al. [14] used GPUs to cluster large datasets using k-means. They reported performance gains of an order of magnitude over an optimized CPU-only version running on 8 cores. Model Training and Semantic Meta-Data Generation: These modules attempt to bridge the semantic gap. From low-level features, machine learning techniques are often used to associate the features with higher-level semantic information. This is the case for concept-based video retrieval [15].
Support Vector Machines (SVMs) are very popular for concept detection. As an example of how GPUs can be used for machine learning algorithms, Catanzaro et al. [16] accelerate SVM classification and training on an Nvidia 8800GTX and achieve speedups of around 80x for classification and 9x for training over CPU-based implementations. User Browsing: In CBMIR, much of the end users' satisfaction comes from the richness of their browsing experience, in other words, their ability to interact with the media content in a fast and user-friendly way. That is the problem that Strong et al. tackled [17] by proposing to use the GPU to build a browsing engine based on feature similarities for large collections of images. They reported 15 to 19 times speedups over their CPU implementation. When dealing with multimedia content, one often has to deal with all information modalities (audio, speech, video and text). Speech recognition is often used to transform the audio data into text, which has a higher semantic content and is easier to handle. Chong et al. [18] parallelize and port the HMM-based Viterbi search algorithm on an NVidia G80 GPU for Large Vocabulary Continuous Speech Recognition (LVCSR). They achieve a 9x speedup compared to a sequential CPU implementation, a step closer to real-time LVCSR. The previously mentioned examples all illustrate how major components - in terms of computational requirements - of media mining workloads are susceptible to gain from the computing power available in GPUs. This is not to say that GPUs could and/or should replace CPUs for such workloads. That would be unrealistic, as there is a significant number of tasks that map poorly on GPUs compared to multicore CPUs (algorithms with very little parallelism). Our intent is rather to demonstrate that GPUs can be very effective massively parallel coprocessors to significantly help solve the computational challenges certain media mining workloads pose. Another argument for the use of GPUs in media mining applications stems from the fact that many existing platforms, personal desktops in particular, come with programmable GPUs that are often untapped potential in the machine. As a consequence, leveraging that power often comes at little to no marginal cost to end users.
2 Related Work
Li et al. [4] proposed an accelerator made of 64 lightweight general-purpose cores connected through a bi-directional ring network and optimized three media mining applications (sports-video analysis, video-cast indexing, and home video editing) on the proposed architecture. Although they demonstrated that significant speedups can be achieved, the proposed architecture was simulated, as opposed to an existing chip where design assumptions are validated against the reality of semiconductor technology as well as manufacturing and market forces. Zhang et al. [3] and Chen et al. [11] addressed parallelization and performance analysis of low-level feature extraction programs on Intel multi-core CPU architectures. However, they only considered a maximum of 8 cores and did not address how those parallel implementations scale to a larger number of cores.
Some work has been conducted on implementing CBMIR tasks on the Cell B.E. Liu et al. [10] ported a complete content-based digital media indexing application (MARVEL) to an STI Cell Broadband Engine Processor. They optimized the feature extraction by vectorizing the algorithms to take advantage of the SIMD architecture of the Cell B.E. SPEs. Our GPU implementation results will be compared to those of Liu et al. [10] and Zhang et al. [3].
3 GPU Architecture and Programming Model
Graphics Processing Units (GPUs) are being widely used for applications other than 3D graphics [8]. This trend has been fueled by the increasing raw computational power as well as the improved programmability that accompanied the evolution of graphics hardware in recent years. Several classes of applications have been mapped to GPUs with substantial speedups [8]. Such applications have large computational requirements, display massive parallelism and often put a larger emphasis on throughput than on latency. Until recently, the raw computational power available in GPUs was not easily accessible to programmers due to the complex programming model that constrained the programmer to map their problem to the graphics pipeline and use graphics APIs (DirectX, OpenGL) to access the programmable functionalities of graphics hardware. The use of programming models such as CUDA [9] has exposed the GPU as a powerful massively parallel co-processor. In this work, we use the NVIDIA GTX280 GPU and choose CUDA as the programming model. A description of a G80 NVIDIA GPU architecture is provided in Figure 2. It consists of several SIMD stream Multiprocessors (SM) with
Fig. 2. Nvidia G80 GPU architecture
8 Scalar Processors (SP) in each SM, which allows the GPU to have a large number of hardware threads simultaneously running on the multiprocessor. On a G80, each SM can have up to 768 thread contexts simultaneously active in hardware [19]. Switching between threads is handled in hardware and is performed very rapidly. Each SM has 16KB of memory shared by all processors in the multiprocessor. A larger, but much slower, global memory is accessible to all processors in all the SMs. In the CUDA [9] programming model, an application is a combination of serial programs that are executed on the CPU and parallel kernel programs that execute on the GPU. A kernel is made of several threads organized into thread blocks. All threads in a block are executed on the same SM and can cooperate through shared memory. However, threads from different blocks do not have a safe and fast way to cooperate without terminating the kernel call. As a result, implementing parallel algorithms with data dependencies between threads is sometimes challenging on GPUs. Threads in a block are grouped into warps. A warp is the scheduling unit, and all threads within a warp execute the same instruction on different data. Divergence between threads in a warp can be handled by the SMs but it incurs additional overhead and should be avoided if possible. Another determining factor for GPU performance is global memory latency. Global memory is not cached; as a consequence, access to it has a very high latency, but it is optimized for high throughput when accessed properly. Shared memory can be used as a software-managed cache when reusing data and particularly when accessing global memory in a non-coalesced way. Texture memories are read-only cached memories and are faster when accessed with 2D spatial locality.
4 Color Visual Descriptors on CUDA GPUs
4.1 Feature Extraction
A wide range of low-level visual descriptors has been proposed [2] in media mining applications. They include color descriptors (color histogram, color correlogram, ...), texture descriptors (co-occurrence matrix, wavelet coefficients, ...), shape descriptors, local descriptors (SIFT) and motion descriptors (optical flow). In this work, we focus on color features and choose the HSV color histogram and HSV color correlogram visual descriptors as a case study for low-level feature extraction on a GPU. Both features are commonly used in media mining applications. Their implementation reveals some of the challenges associated with mapping certain multimedia algorithms onto massively parallel architectures, such as data dependencies and the need for concurrent memory updates. Color histograms measure the distribution of pixel colors in an image, whereas color correlograms capture the spatial distribution of pairs of colors. Although color histograms are faster to compute, they lack spatial correlation information.

Let I be an n x n image (we assume a square image). For a pixel p = (x, y), let c = I(p) be its color. The colors are quantized into m colors c_1, c_2, ..., c_m. We use the L-infinity norm for the distance between two pixels, |p_1 - p_2| = \max(|x_1 - x_2|, |y_1 - y_2|), and let k \in \{1, 2, ..., d\}, where d is the maximum pixel distance. The correlogram of image I is defined as

\gamma^{(k)}_{c_i, c_j}(I) = \Pr_{p_1 \in I_{c_i},\, p_2 \in I} \big[\, p_2 \in I_{c_j} \;\big|\; |p_1 - p_2| = k \,\big]   (1)

For a given pixel of color c_i, \gamma^{(k)}_{c_i, c_j} is the probability that a pixel at distance k from it has color c_j. The size of the correlogram feature is m^2 d and can be quite large. As a consequence, the autocorrelogram is used in most CBMIR applications. The autocorrelogram of image I captures the spatial relationship between identical colors and is defined by

\alpha^{(k)}_{c}(I) = \gamma^{(k)}_{c, c}(I)   (2)

To compute the correlogram, we need the count

\Gamma^{(k)}_{c_i, c_j}(I) = \big|\, \{ (p_1, p_2) : p_1 \in I_{c_i},\, p_2 \in I_{c_j},\, |p_1 - p_2| = k \} \,\big|   (3)

\Gamma^{(k)}_{c_i, c_j}(I) counts all the pixel pairs of colors c_i and c_j within distance k of each other. The final correlogram is obtained by

\gamma^{(k)}_{c_i, c_j}(I) = \Gamma^{(k)}_{c_i, c_j}(I) \,/\, \big( h_{c_i}(I) \cdot 8k \big)   (4)

where h_{c_i}(I) is the histogram of image I for color c_i and the factor 8k arises from using the L-infinity norm to measure the distance between pixel locations.
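To make equations (1)-(4) concrete, the following is a minimal NumPy reference implementation of the quantized histogram and autocorrelogram. It is a CPU-only sketch for illustration, not the CUDA kernels described later; the 8x2x2 quantization and the function names are assumptions made for the example.

import numpy as np

def quantize_hsv(hsv, bins=(8, 2, 2)):
    # linearly quantize an HSV image (values in [0, 1]) into bins[0]*bins[1]*bins[2] colors
    h = np.minimum((hsv[..., 0] * bins[0]).astype(int), bins[0] - 1)
    s = np.minimum((hsv[..., 1] * bins[1]).astype(int), bins[1] - 1)
    v = np.minimum((hsv[..., 2] * bins[2]).astype(int), bins[2] - 1)
    return (h * bins[1] + s) * bins[2] + v          # one color index per pixel

def histogram(q, m):
    # h_c(I): number of pixels of each quantized color c
    return np.bincount(q.ravel(), minlength=m)

def autocorrelogram(q, m, d):
    # alpha_c^(k)(I) for k = 1..d using the L-infinity distance (Eqs. (2)-(4))
    n_rows, n_cols = q.shape
    h = histogram(q, m).astype(float)
    alpha = np.zeros((d, m))
    ys, xs = np.mgrid[0:n_rows, 0:n_cols]
    for k in range(1, d + 1):
        count = np.zeros(m)
        # all offsets (dy, dx) with max(|dy|, |dx|) == k: the ring of 8k neighbours
        offsets = [(dy, dx) for dy in range(-k, k + 1) for dx in range(-k, k + 1)
                   if max(abs(dy), abs(dx)) == k]
        for dy, dx in offsets:
            y2, x2 = ys + dy, xs + dx
            valid = (y2 >= 0) & (y2 < n_rows) & (x2 >= 0) & (x2 < n_cols)
            same = valid & (q == q[np.clip(y2, 0, n_rows - 1), np.clip(x2, 0, n_cols - 1)])
            count += np.bincount(q[same], minlength=m)
        # Gamma_{c,c}^(k) / (h_c * 8k), guarding against empty color bins
        alpha[k - 1] = count / np.maximum(h * 8 * k, 1e-12)
    return alpha

# usage: hsv = np.random.rand(240, 352, 3); q = quantize_hsv(hsv)
# hist = histogram(q, 32); corr = autocorrelogram(q, 32, d=4)

For a 352x240 frame with d = 4 this brute-force loop over the 8k-pixel rings is slow on a CPU, which is precisely the cost the GPU implementation amortizes.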
Fig. 3. HSV color correlogram and histogram computation block diagram
In our implementation, all pixels are quantized in the HSV color space. Each of the three components of the color space is linearly quantized. The block diagram of the feature computations is shown in Figure 3.
4.2 Implementation
The computation of the correlogram feature on the GPU is illustrated in Figure 4. In a straightforward implementation, for both the histogram and the correlogram, all threads need to update bin counts at a memory location accessible to all threads. The only location all threads can write to would be in global memory. This raises significant write contention. As pointed out and empirically verified by Zhang et al. [3] on an Intel multi-core CPU, the contention increases
Fig. 4. Correlogram computation
significantly as the number of threads increases. Another problem associated with the straightforward approach is that random writes to global memory incur large latencies because of the lack of caching mechanisms for writes to global memory. To address these problems, we create temporary local features in shared memory. Each thread updates a local bin count in shared memory, and only after all threads in a block have done so does the block write its local counts to global memory and update the final feature. The color histogram and the color correlogram computation present slightly different challenges. In the histogram computation there is no data reuse: each pixel is read only once from texture memory. When computing the color correlogram, we need to explore the neighborhood of each pixel and accumulate the co-occurrence counts in shared memory. Given the large number of memory accesses needed to compute the correlogram, we load the quantized pixels of each block into shared memory. Each thread block loads its corresponding tile into shared memory. Since we explore the pixels surrounding each pixel in the tile, we also need access to pixels belonging to neighboring tiles. As a consequence, each thread block has to load all the pixels within distance d of the associated tile into shared memory. The size of the image tiles is limited by the size of the shared memory (16KB). For each feature, we implemented 3 kernels in CUDA:
1. Kernel 0: no use of shared memory is made. The data is directly fetched from texture memory and all threads update the same shared space in global memory. This is the most straightforward implementation but it is extremely slow.
2. Kernel 1: we use the shared memory for caching the HSV-quantized image tiles in the case of the correlogram and for storing partial features that are shared by all threads in a block. To minimize lock contention, we assign an address space to each warp. The partial features are merged after all threads are done updating the co-occurrence count.
3. Kernel 2: all threads in the block share the same address space in shared memory. This implementation allows larger features to be computed, since we do not need to keep as many temporary feature copies as there are warps in the block. A CPU analogue of this shared-memory privatization idea is sketched below.
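The effect of the privatization used in Kernels 1 and 2 can be mimicked on the CPU: each tile (standing in for a thread block) accumulates into its own private histogram, and the partial results are merged once at the end, so no two tiles ever contend for the same bins. The NumPy sketch below is only an analogue of that reduction pattern, not the CUDA code itself; the tile size is an illustrative assumption.

import numpy as np

def tiled_histogram(q, m, tile=16):
    # accumulate one private histogram per image tile, then merge once (Kernel 1/2 idea)
    n_rows, n_cols = q.shape
    partials = []
    for y0 in range(0, n_rows, tile):
        for x0 in range(0, n_cols, tile):
            block = q[y0:y0 + tile, x0:x0 + tile]                     # the tile a thread block would load
            partials.append(np.bincount(block.ravel(), minlength=m))  # private (shared-memory) bins
    return np.sum(partials, axis=0)                                   # single merge into the global feature

# usage: q = np.random.randint(0, 32, size=(480, 640)); h = tiled_histogram(q, 32)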
4.3 Experiments and Results
We performed our evaluation on an NVIDIA GeForce GTX 280(1) GPU running in a PC with an Intel Core2 Quad Q6850 at 3 GHz and 3 GB of RAM, using CUDA 2.0. For a fair comparison with previous work [10][3], we used the same image datasets (TRECVID 2005 frames) and the same feature parameters (number of bins and maximum distance d). We compare the execution times to compute the features. We did not include the time to transfer the image from CPU to GPU because it does not account for much and CUDA allows asynchronous data transfers. For the rest of our analysis, we measure the performance of our implementation by the throughput (Megapixels/sec), which is independent of the image size. Our experiments have shown that the throughput does not vary much with respect to the image size in our case (0.25 Mpixel to 4 Mpixel range).

Table 1. Performance comparison with multicore Intel CPUs [3]. Conroe is a 2.4 GHz dual-core; the Xeon 7130M server has 8 cores clocked at 3.2 GHz with 8 GB of shared main memory. (*) Execution times calculated from reported optimized serial code times and the speedups achieved by the parallel implementations. 352x240 images from TRECVID 2005 with d = 4 and 32 color bins.

        GPU       Conroe (*)   Xeon 7130M (*)   Our CPU
Corr    1.8 ms    23 ms        5.7 ms           24 ms
Hist    0.18 ms   --           --               9.8 ms
Table 2. Performance comparison with the Cell processor [10]. 352x240 images from TRECVID 2005 with d = 8 and 128 color bins.

        GPU       Cell SPE   Our CPU
Corr    7.69 ms   7.89 ms    67 ms
Hist    0.18 ms   0.7 ms     10.8 ms
A summary of the comparison of our GPU implementation with those on a Cell B.E. [10] and on Intel multicore CPUs (2 cores and 8 cores) is provided in Table 1 and Table 2. Both compared implementations use architecture-specific optimizations on their target platforms. Our CPU numbers should not be taken as comparison numbers, as we did not optimize the code and did not parallelize it to take advantage of the 4 cores of the Q6850.
(1) 30 SMs, 1 GB of off-chip RAM, 1296 MHz processor clock speed.
Table 3. Feature computation throughput (in MPixels/sec) on different images. 1024x768 pixel images with d = 5 and 64 color bins.

           Histogram   Correlogram
Constant   601         24
Random     1622        75
Real       1079        33
We achieve significant speedups compared to multicore CPUs for both features, as well as for the color histogram feature compared to the Cell B.E. implementation. Our correlogram implementation is comparable to the one on the Cell B.E. Our experiments have also shown that the number of color bins has a small effect on the performance of both features (Figure 5a). The effect of the maximum distance d on the correlogram computation throughput is shown in Figure 5b. As d increases, the throughput decreases rapidly. Furthermore, as expected, the complexity of the correlogram algorithm increases rapidly with d.
Fig. 5. (a) GPU HSV color histogram computation throughput with respect to the number of bins; (b) GPU color correlogram computation throughput with 64 color bins, with respect to the maximum distance d
The results in Table 3 show how our implementation performs on the 3 selected images (Random, Constant, Real). The images were chosen to evaluate how the performance is affected by data variations. The degenerate case is the constant image where only one bin is updated by all threads. The random image leads to the best performance because of the minimum contention in accessing shared memory space.
5 Conclusion and Future Work
We have shown that GPUs can be very effective massively parallel co-processors to significantly help solve the computational challenges certain media mining workloads face. We have presented a parallel implementation of the color correlogram and histogram feature extraction on a massively parallel SIMD architecture
(GeForce GTX 280). We compared our performance to state-of-the-art multicore CPUs as well as the Cell B.E. We achieve significant speedups compared to multicore CPUs and comparable performance with the Cell B.E. on the correlogram feature computation. This is an interesting step toward the deployment of real-time multimedia content analysis applications on end-user personal and mobile platforms.
References 1. Sebe, N., Tian, Q.: Personalized multimedia retrieval: the new trend? In: MIR 2007: Proceedings of the international workshop on Workshop on multimedia information retrieval, pp. 299–306. ACM, New York (2007) 2. Lew, M.S., Sebe, N., Djeraba, C., Jain, R.: Content-based multimedia information retrieval: State of the art and challenges. ACM Trans. Multimedia Comput. Commun. Appl. 2, 1–19 (2006) 3. Zhang, Q., Chen, Y., Li, J., Zhang, Y., Xu, Y.: Parallelization and performance analysis of video feature extractions on multi-core based systems. In: ICPP 2007: Proceedings of the 2007 International Conference on Parallel Processing, Washington, DC, USA. IEEE Computer Society, Los Alamitos (2007) 4. Li, E., Li, W., Tong, X., Li, J., Chen, Y., Wang, T., Wang, P., Hu, W., Du, Y., Zhang, Y., Chen, Y.K.: Accelerating video-mining applications using many small, general-purpose cores. IEEE Micro 28, 8–21 (2008) 5. Glasberg, R., Tas, C., Sikora, T.: Recognizing commercials in real-time using three visual descriptors and a decision-tree. In: 2006 IEEE International Conference on Multimedia and Expo., pp. 1481–1484 (2006) 6. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The landscape of parallel computing research: A view from berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley (2006) 7. Mccool, M.D.: Scalable programming models for massively multicore processors. Proceedings of the IEEE 96, 816–831 (2008) 8. Owens, J.D., Houston, M., Luebke, D., Green, S., Stone, J.E., Phillips, J.C.: Gpu computing. Proceedings of the IEEE 96, 879–899 (2008) 9. Corporation, N.: NVIDIA CUDA Programming Guide, version 2.0 (2008) 10. Liu, L.-K., Liu, Q., Natsev, A., Ross, K.A., Smith, J.R., Varbanescu, A.L.: Digital media indexing on the cell processor. In: 2007 IEEE International Conference on Multimedia and Expo., pp. 1866–1869 (2007) 11. Chen, Y., Li, E., Li, J., Zhang, Y.: Accelerating video feature extractions in cbvir on multi-core systems. Intel Technology Journal 11 (2007) 12. Mizukami, Y., Tadamura, K.: Optical flow computation on compute unified device architecture. In: 14th International Conference on Image Analysis and Processing, 2007. ICIAP 2007, pp. 179–184 (2007) 13. Ding, S., He, J., Yan, H., Suel, T.: Using graphics processors for high performance ir query processing. In: WWW 2009: Proceedings of the 18th international conference on World wide web, pp. 421–430. ACM, New York (2009) 14. Wu, R., Zhang, B., Hsu, M.: Clustering billions of data points using gpus. In: UCHPC-MAW 2009: Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, pp. 1–6. ACM, New York (2009)
15. Hauptmann, A.G., Christel, M.G., Yan, R.: Video retrieval based on semantic concepts. Proceedings of the IEEE 96, 602–622 (2008) 16. Catanzaro, B., Sundaram, N., Keutzer, K.: Fast support vector machine training and classification on graphics processors. In: ICML 2008: Proceedings of the 25th international conference on Machine learning, pp. 104–111. ACM, New York (2008) 17. Strong, G., Gong, M.: Browsing a large collection of community photos based on similarity on gpu. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part II. LNCS, vol. 5359, pp. 390–399. Springer, Heidelberg (2008) 18. Chong, J., Yi, Y., Faria, A., Satish, N., Keutzer, K.: Data-parallel large vocabulary continuous speech recognition on graphics processors. In: Proceedings of the 1st Annual Workshop on Emerging Applications and Many Core Architecture (EAMA), pp. 23–35 (2008) 19. Blythe, D.: Rise of the graphics processor. Proceedings of the IEEE 96, 761–778 (2008)
Human Activity Recognition Based on R Transform and Fourier Mellin Transform Pengfei Zhu, Weiming Hu, Li Li, and Qingdi Wei Institute of Automation, Chinese Academy of Sciences, Beijing, China {pfzhu,wmhu,lli,qdwei}@nlpr.ia.ac.cn
Abstract. Human activity recognition is attracting a lot of attention in the computer vision domain. In this paper we present a novel human activity recognition method based on the R transform and the Fourier Mellin Transform (FMT). First, we convert the original image sequence to the Radon domain and obtain curves by the R transform. Then we extract Rotation-Scaling-Translation (RST) invariant features by the FMT and reduce their dimensionality by PCA. At the recognition stage, the Earth Mover's Distance (EMD) is used. In the experiments, we compare our method to other methods; the experimental results show the effectiveness of our method.
1 Introduction
Human activity recognition is an attractive direction of research in computer vision, with wide applications such as intelligent surveillance, analysis of people's physical condition, and care of the elderly [1]. Human activity recognition includes tracking, action feature extraction and representation, action model learning, and high-level semantic understanding. Feature representation is a key step in activity recognition. However, because the video data vary in scale, viewing angle and location relative to the camera, feature extraction is a hard problem. Therefore, the extraction of view-invariant features has attracted more and more researchers. Rao et al. [2] present a computational representation of human action to capture dramatic changes using the spatio-temporal curvature of 2-D trajectories. This representation is compact, view-invariant, and capable of explaining an action in terms of meaningful action units called dynamic instants and intervals. Ogale et al. [3] represent human actions as short sequences of atomic body poses. Actions and their constituent atomic poses are extracted from a set of multiview multiperson video sequences by an automatic keyframe selection process, and are used to automatically construct a probabilistic context-free grammar (PCFG). Parameswaran and Chellappa [4] exploit a wealth of techniques in 2D invariance that can be used to advantage in 3D to 2D projection, and model actions in terms of view-invariant canonical body poses and trajectories in 2D invariance space, leading to a simple and effective way to represent and recognize human actions from a general viewpoint. Weinland et al. [5] introduce Motion History Volumes (MHV) as a free-viewpoint representation for human actions in the case of multiple calibrated and background-subtracted video cameras. They present algorithms for computing, aligning and comparing MHVs of
different actions performed by different people in a variety of viewpoints. Weinland et al. [6] propose a new framework where they model actions using three-dimensional occupancy grids, built from multiple viewpoints, in an exemplar-based HMM. The novelty is that a 3D reconstruction is not required during the recognition phase; instead, learned 3D exemplars are used to produce 2D image information that is compared to the observations. Parameters that describe image projections are added as latent variables in the recognition process. Li and Fukui [7] propose a novel view-invariant human action recognition method based on non-rigid factorization and Hidden Markov Models. Shen and Foroosh [8] show that fundamental ratios are invariant to camera parameters, and hence can be used to identify similar plane motions from varying viewpoints. For action recognition, they decompose a body posture into a set of point triplets (planes). The similarity between two actions is then determined by the motion of the point triplets and hence by their associated fundamental ratios, thus providing view-invariant recognition of actions. Natarajan and Nevatia [9] present an approach to simultaneously track and recognize known actions that is robust to such variation, starting from a person detection in the standing pose. To tackle activity recognition, Gilbert et al. [10] propose learning compound features that are assembled from simple 2D corners in both space and time. In this paper, we present a novel human activity recognition method based on the R transform and the Fourier Mellin Transform (FMT). Figure 1 shows the framework of our method.
Fig. 1. Overview of our approach
The rest of this paper is organized as follows. Section 2 describes the Radon transform and the R transform. The Fourier Mellin Transform is introduced in Section 3. The experiments that evaluate our method are presented in Section 4, and Section 5 concludes the paper, followed by the references.
2 Radon Transform and R Transform
In mathematics, the two-dimensional Radon transform consists of the integral of a function over the set of lines in all directions, which is roughly equivalent to finding the projection of a shape onto any given line. For a discrete binary image, each image is projected to the Radon domain. Let f(x, y) be an image; its Radon transform is defined as [11][12]:

T_{Rf}(\rho, \theta) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, \delta(x \cos\theta + y \sin\theta - \rho)\, dx\, dy = \mathrm{Radon}\{ f(x, y) \}   (1)

where \theta \in [0, \pi], \rho \in (-\infty, \infty) and \delta(\cdot) is the Dirac delta function,

\delta(x) = \begin{cases} 1 & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases}   (2)

For geometric transformations such as scaling, translation and rotation, the Radon transform has the following properties. For a scaling factor \alpha,

\mathrm{Radon}\{ f(x/\alpha, y/\alpha) \} = \frac{1}{\alpha} T_{Rf}(\alpha\rho, \theta)   (3)

For a translation of (x_0, y_0),

\mathrm{Radon}\{ f(x - x_0, y - y_0) \} = T_{Rf}(\rho - x_0 \cos\theta - y_0 \sin\theta, \theta)   (4)

For a rotation of \theta_0,

\mathrm{Radon}\{ f_{\theta_0}(x, y) \} = T_{Rf}(\rho, \theta + \theta_0)   (5)

From equations (3)-(5), we can see that the Radon transform is variant with respect to scaling, translation and rotation. An improved representation based on the Radon transform, the R transform, is therefore introduced [13][12]:

\Re_f(\theta) = \int_{-\infty}^{\infty} T_{Rf}^2(\rho, \theta)\, d\rho   (6)

For a scaling factor \alpha,

\frac{1}{\alpha^2} \int_{-\infty}^{\infty} T_{Rf}^2(\alpha\rho, \theta)\, d\rho = \frac{1}{\alpha^3} \int_{-\infty}^{\infty} T_{Rf}^2(\nu, \theta)\, d\nu = \frac{1}{\alpha^3} \Re_f(\theta)   (7)

For a translation of (x_0, y_0),

\int_{-\infty}^{\infty} T_{Rf}^2(\rho - x_0 \cos\theta - y_0 \sin\theta, \theta)\, d\rho = \int_{-\infty}^{\infty} T_{Rf}^2(\nu, \theta)\, d\nu = \Re_f(\theta)   (8)

For a rotation of \theta_0,

\int_{-\infty}^{\infty} T_{Rf}^2(\rho, \theta + \theta_0)\, d\rho = \Re_f(\theta + \theta_0)   (9)
Fig. 2. The Radon transform and the R transform of the example images
From equations (7)-(9), we can see that the R transform is invariant to translation, that scaling results only in an amplitude scaling, and that rotation results in a phase shift. In the experiments, we normalize the R transform curve to obtain scale invariance by equation (10):

\Re'(\theta) = \frac{\Re(\theta)}{\max_{\theta}(\Re(\theta))}   (10)

Figure 2 shows the Radon transform and the R transform of the example images.
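For illustration, a minimal NumPy/SciPy sketch of the discrete Radon transform and the normalized R transform curve of equations (6) and (10) could look as follows; the rotation-based projection and the toy 64 x 64 silhouette are assumptions for the example rather than the authors' implementation.

import numpy as np
from scipy.ndimage import rotate

def radon_transform(img, angles):
    # T_R f(rho, theta): line-integral projections of a 2-D image for each angle (degrees);
    # rotating by -theta and summing along the rows gives the projection onto the theta direction
    return np.stack([rotate(img, -a, reshape=False, order=1).sum(axis=0) for a in angles], axis=1)

def r_transform(img, angles):
    # R_f(theta) = integral over rho of T_R f(rho, theta)^2, normalised to a peak of 1 (Eqs. (6), (10))
    sinogram = radon_transform(img, angles)            # shape: (rho, theta)
    r = (sinogram ** 2).sum(axis=0)                    # squared projections integrated over rho
    return r / r.max()

# usage (hypothetical silhouette): sil = np.zeros((64, 64)); sil[20:50, 25:40] = 1.0
# curve = r_transform(sil, np.arange(0, 180))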
3 Fourier Mellin Transform
The use of the Fourier Mellin Transform for rigid image registration was proposed in [14]; it matches images that are translated, rotated and scaled with respect to one another. Let F_1(\xi, \eta) and F_2(\xi, \eta) be the Fourier transforms of images f_1(x, y) and f_2(x, y), respectively. If f_2 differs from f_1 only by a displacement (x_0, y_0), then

f_2(x, y) = f_1(x - x_0, y - y_0),   (11)

or, in the frequency domain, using the Fourier shift theorem,

F_2(\xi, \eta) = e^{-j2\pi(\xi x_0 + \eta y_0)} F_1(\xi, \eta).   (12)

The cross-power spectrum is then defined as

C(\xi, \eta) = \frac{F_1(\xi, \eta) F_2^{*}(\xi, \eta)}{|F_1(\xi, \eta) F_2(\xi, \eta)|} = e^{j2\pi(\xi x_0 + \eta y_0)},   (13)

where F^{*} is the complex conjugate of F. The Fourier shift theorem guarantees that the phase of the cross-power spectrum is equivalent to the phase difference between the images. The inverse of (13) results in

c(x, y) = \delta(x - x_0, y - y_0),   (14)

which is approximately zero everywhere except at the optimal registration point. If f_1 and f_2 are related by a translation (x_0, y_0) and a rotation \theta_0, then

f_2(x, y) = f_1(x \cos\theta_0 + y \sin\theta_0 - x_0,\; -x \sin\theta_0 + y \cos\theta_0 - y_0).   (15)

Using the Fourier translation and rotation properties, we have

F_2(\xi, \eta) = e^{-j2\pi(\xi x_0 + \eta y_0)} F_1(\xi \cos\theta_0 + \eta \sin\theta_0,\; -\xi \sin\theta_0 + \eta \cos\theta_0).   (16)

Let M_1 and M_2 be the magnitudes of F_1 and F_2, respectively. They are related by

M_2(\xi, \eta) = M_1(\xi \cos\theta_0 + \eta \sin\theta_0,\; -\xi \sin\theta_0 + \eta \cos\theta_0).   (17)

To recover the rotation, the Fourier magnitude spectra are transformed to a polar representation,

M_1(\rho, \theta) = M_2(\rho, \theta - \theta_0),   (18)

where \rho and \theta are the radius and angle in the polar coordinate system, respectively. Then, (13) can be applied to find \theta_0. If f_1 is a translated, rotated and scaled version of f_2, the Fourier magnitude spectra are transformed to log-polar representations and are related by

M_2(\rho, \theta) = M_1(\rho / s, \theta - \theta_0),   (19)

i.e.,

M_2(\log\rho, \theta) = M_1(\log\rho - \log s, \theta - \theta_0),   (20)

M_2(\xi, \theta) = M_1(\xi - d, \theta - \theta_0),   (21)

where s is the scaling factor, \xi = \log\rho and d = \log s.
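The phase-correlation step of equations (12)-(14) can be sketched in a few lines of NumPy; applying the same correlation to the log-polar magnitude spectra of equation (21) then yields the scale d = log s and the rotation theta_0. The conjugation order below is chosen so that the correlation peak lands at the shift of f2 relative to f1 under NumPy's FFT convention; it is an illustrative sketch, not the authors' code.

import numpy as np

def phase_correlation(f1, f2):
    # recover the integer displacement (y0, x0) of f2 relative to f1 via the cross-power spectrum
    F1, F2 = np.fft.fft2(f1), np.fft.fft2(f2)
    cross = F2 * np.conj(F1)
    c = np.fft.ifft2(cross / (np.abs(cross) + 1e-12))   # inverse of the normalised cross-power spectrum
    peak = np.unravel_index(np.argmax(np.abs(c)), c.shape)
    # wrap peak coordinates to signed shifts
    return tuple(int(p - s) if p > s // 2 else int(p) for p, s in zip(peak, c.shape))

# usage: f1 = np.random.rand(64, 64); f2 = np.roll(np.roll(f1, 3, axis=0), -5, axis=1)
# phase_correlation(f1, f2)   ->  (3, -5)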
4 Experiments
In our experiments, we use the Weizmann dataset to evaluate our method, with 93 videos of 9 actors and 10 actions (bend, jack, jump, pjump, run, side, skip, walk, wave1, wave2); sample images are shown in Figure 3.
Fig. 3. Example images from the Weizmann dataset
In our experiments, each silhouette image is normalized to a 64 x 64 resolution. First, we convert each image to the Radon domain and obtain a curve by the R transform. Before extracting the invariant features by the Fourier Mellin Transform, we convert the curve to a 2D transform image. To obtain more compact features, PCA is used. Since the periods of the activities are not uniform, comparing sequences is not straightforward: in the case of human activities, the same activity can be performed at different speeds, causing the sequence to be expanded or shrunk in time. In order to eliminate such effects of different speeds and to perform a robust comparison, the Earth Mover's Distance (EMD) [15] is used in our experiments. The Earth Mover's Distance has been shown to have promising performance in image retrieval and visual tracking because it can find the optimal signature alignment and thereby measure similarity accurately. For two arbitrary activity sequences P and Q, P = \{(p_i, w_{p_i}), 1 \le i \le m\} and Q = \{(q_j, w_{q_j}), 1 \le j \le n\}, where m and n are the numbers of clusters in P and Q, respectively, the EMD between P and Q is computed by

D(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}   (22)

where d_{ij} is the Euclidean distance between p_i and q_j, and f_{ij} is the optimal flow between the two signatures P and Q, which can be computed by solving the following linear programming problem:
\min \; \mathrm{WORK}(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}

subject to

f_{ij} \ge 0, \quad 1 \le i \le m, \; 1 \le j \le n

\sum_{j=1}^{n} f_{ij} \le w_{p_i}, \quad 1 \le i \le m

\sum_{i=1}^{m} f_{ij} \le w_{q_j}, \quad 1 \le j \le n

\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left( \sum_{i=1}^{m} w_{p_i}, \; \sum_{j=1}^{n} w_{q_j} \right)
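For reference, the transportation problem above can be solved directly with an off-the-shelf LP solver. The following SciPy sketch implements equation (22) for small signatures; the example signatures in the usage comment are made up for illustration.

import numpy as np
from scipy.optimize import linprog

def emd(p_feats, p_weights, q_feats, q_weights):
    # Earth Mover's Distance between signatures P and Q via the LP above
    m, n = len(p_weights), len(q_weights)
    d = np.linalg.norm(p_feats[:, None, :] - q_feats[None, :, :], axis=2)  # d_ij, Euclidean ground distance
    c = d.ravel()                                        # objective: sum_ij d_ij f_ij
    # row constraints: sum_j f_ij <= w_p_i ; column constraints: sum_i f_ij <= w_q_j
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([p_weights, q_weights])
    A_eq = np.ones((1, m * n))                           # total flow equals the smaller total weight
    b_eq = [min(p_weights.sum(), q_weights.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    flow = res.x
    return float(c @ flow) / float(flow.sum())           # D(P, Q) of Eq. (22)

# usage with made-up 1-D signatures:
# p = np.array([[0.0], [1.0]]); wp = np.array([0.6, 0.4])
# q = np.array([[0.5]]);        wq = np.array([1.0])
# emd(p, wp, q, wq)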
4.1 Experiment 1
In this experiment, we evaluate our method with respect to rotation, translation and scaling separately. Figures 4 and 5 show the correct recognition rates when the activity sequences are rotated or scaled. From the results, we can see that the correct recognition rates for rotated activities reach up to 90%, and those for scaled activities reach up to 80%. For the translated sequences, the correct recognition rate is 100%.
Fig. 4. The correct recognition rates of rotated activity sequences
Fig. 5. The correct recognition rates of scaled activity sequences
Fig. 6. The example images of our dataset
4.2 Experiment 2
In this experiment, we build a dataset that includes the original Weizmann sequences, sequences rotated by angles chosen randomly between -30 and 30 degrees, translated sequences, and scaled image sequences of the Weizmann dataset. Example images are shown in Figure 6. We compare our method to other methods, namely Zernike Moments, the R transform, and the Fourier Mellin Transform. Figure 7 shows the correct recognition rates. From the figure we can see that our method outperforms the other three methods. Our RST-invariant features based on the R transform and the Fourier Mellin Transform are effective and can be used in human activity recognition.
Fig. 7. The correct recognition rates
5 Conclusion
In this paper we presented a novel human activity recognition method based on the R transform and the Fourier Mellin Transform (FMT). Our feature extraction method is Rotation-Scaling-Translation invariant and can be used in human activity recognition, especially when the camera is unstable. The experimental results show the effectiveness of our method.
Acknowledgment This work is partly supported by NSFC (Grant No. 60825204 and 60672040) and the National 863 High-Tech R&D Program of China (Grant No. 2006AA01Z453).
References 1. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behavior. IEEE Trans. on Systems, Man and Cybernetics, Part C: Applications and Reviews 37, 334–352 (2004) 2. Rao, C., Yilmaz, A., Shah, M.: View-invariant representation and recognition of actions. International Journal of Computer Vision 50, 203–226 (2002) 3. Ogale, A., Karapurkar, A., Aloimonos, Y.: View-invariant modeling and recognition of human actions using grammars. In: Workshop on Dynamical Vision at ICCV, vol. 5 (2005) 4. Parameswaran, V., Chellappa, R.: View invariance for human action recognition. International Journal of Computer Vision 66, 83–101 (2006) 5. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding 104, 249–257 (2006) 6. Weinland, D., Boyer, E., Ronfard, R.: Action recognition from arbitrary views using 3d exemplars. In: Proceedings of the International Conference on Computer Vision, pp. 1–7 (2007) 7. Li, X., Fukui, K.: View-invariant human action recognition based on factorization and hmms. EICE Transactions on Information and Systems, 1848–1854 (2008)
8. Shen, Y., Foroosh, H.: View-invariant action recognition using fundamental ratios. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7 (2008) 9. Natarajan, P., Nevatia, R.: View and scale invariant action recognition using multiview shapeflow models. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008) 10. Gilbert, A., Illingworth, J., Bowden, R.: Scale invariant action recognition using compound features mined from dense spatio-temporal corners. In: European Conference on Computer Vision, pp. 222–233 (2008) 11. Deans, S.: Application of the radon transform. Wiley Interscience Publications, New York (1983) 12. Wang, Y., Huang, K., Tan, T.: Human activity recognition based on r transform. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 13. Tabbone, S., Wendling, L., Salmon, J.: A new shape descriptor defined on the radon transform. Computer Vision and Image Understanding 102, 42–51 (2006) 14. Reddy, B., Chatterji, B.: An fft-based technique for translation, rotation, and scale-invariant image registration. IEEE Trans. Image Processing 8, 1266–1271 (1996) 15. Rubner, Y., Tomasi, C., Guibas, L.: The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision 40, 99–121 (2000)
Reconstruction of Facial Shape from Freehand Multi-viewpoint Snapshots
Seiji Suzuki 1, Hideo Saito 1, and Masaaki Mochimaru 2
1 Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa, Japan {suzuki,saito}@hvrl.ics.keio.ac.jp
2 Advanced Industrial Science and Technology, 2-41-6 Aomi, Koto-ku, Tokyo, Japan [email protected]
Abstract. We propose a method that can reconstruct both a 3D facial shape and camera poses from freehand multi-viewpoint snapshots. This method is based on Active Shape Model (ASM) using a facial shape database. Most ASM methods require an image in which the camera pose is known, but our method does not require this information. First, we choose an initial shape by selecting the model from the database which is most suitable to input images. Then, we improve the model by morphing it to fit the input images. Next, we estimate the camera poses using the morphed model. Finally we repeat the process, improving both the facial shape and the camera poses until the error between the input images and the computed result is minimized. Through experimentation, we show that our method reconstructs the facial shape within 3.5 mm of the ground truth.
1 Introduction
3D shape reconstruction is one of the research issues that has been extensively studied for over 20 years in computer vision. Hardware devices such as a range scanner help us to measure a 3D shape accurately [1]. A video projector can also help us to reconstruct a 3D shape by projecting particular patterns onto an object [2,3]. However, these hardware devices are expensive and not easy to use. Therefore, many image-based techniques that do not require any hardware devices except a camera have been presented. For example, Shape from Shading and Shape from Texture can reconstruct a 3D shape from a single-viewpoint image. These methods, however, have so many constraints on reflectance properties and illumination conditions that they cannot be used easily. Stereo Vision, which requires multi-viewpoint images, is also hard to use because the user has to calibrate the cameras accurately. There is a technique named Structure from Motion [4] that reconstructs a 3D shape from sequential images such as video. This technique uses Optical Flow in order to find corresponding points between subsequent frames. However, Optical Flow cannot be computed accurately on faces because cheeks have uniform texture. To solve these problems, methods based on the Active Shape Model (ASM) have been developed. These methods use a database of 3D facial shapes measured
by a range scanner. A 3D facial shape can be reconstructed accurately using Principal Component Analysis (PCA) of the database. Some traditional methods [5,6,7], however, use only one input image, so 3D geometric information is ignored. Even though the appearance of their results is plausible, the reconstructed shape may lack geometric compliance. In contrast, other ASM methods [8,9,10] consider the geometric information by using multi-viewpoint images. However, the camera poses are assumed to be already known, which requires the user to calibrate the cameras. Recently, a method in which the user need not calibrate any cameras was proposed [11]: a 3D facial shape can be reconstructed from uncalibrated multi-viewpoint images. However, this method reconstructs the shape without estimating the camera poses, so the reconstructed result is not accurate. In this paper, we propose a method that can reconstruct a 3D facial shape from uncalibrated multi-viewpoint snapshots. Our method does require some manual inputs, such as clicking facial feature points. However, the user does not have to prepare any special hardware devices or perform any calibration, so this method is easy to use even at home.
2 Method
We aim to recover a facial shape from uncalibrated multi-viewpoint snapshots, which capture a face from various unknown poses. This means that we need to estimate both a facial shape and camera poses from the input images. However, the shape optimization requires camera poses, while the pose estimation requires a facial shape. Even though an initial shape is quite different from the real shape, it is the only information that can be used for the pose estimation. That is why the poses and the shape cannot be accurately estimated simultaneously in one computation. Therefore, in the proposed method, we designed an iterative algorithm for reducing the estimation error. Fig. 1 shows a flowchart of this method. We roughly estimate a camera pose with respect to the target human face in each input image using an interim shape. A projected image of the interim shape can then be rendered at each estimated pose. The error between the input images and the projected images is computed by error functions, and the interim shape is optimized to minimize this error. The interim shape is quite different from the real target shape at the beginning of this process. As the interim shape is fitted to the real shape, the poses can be computed more accurately, and the more accurate poses in turn make the interim shape more accurate. By repeating this process, we obtain the reconstructed shape as the optimized interim shape. Our method can be regarded as an energy minimization: we want to find the shape x which minimizes the error y in the equation y = f(x), where f is the error function. It is preferable that the argument vector x be low dimensional in this situation. We use PCA of a facial shape database to make x a lower-dimensional vector.
Fig. 1. Flowchart of this method (the input images and an interim shape feed pose estimation, projection, error evaluation, and shape optimization in a loop until convergence)
We describe the database in Sect. 2.1 and PCA in Sect. 2.2. The initial shape generation, the camera pose estimation and the facial shape optimization are described in Sects. 2.3, 2.4 and 2.5, respectively. The error functions are defined in Sect. 2.6.
2.1 Database
Our database is composed of human head shape data. Each entry was scanned by a range scanner, and each scanned model has around 200 thousand vertices. An anatomist extracted one hundred anatomically important vertices, and another 330 vertices were interpolated. Fig. 2(a) shows the orbitomeatal plane coordinate system, in which all the head shapes are defined. We use only the facial part of the 430 head vertices; this facial part has the 260 vertices shown in Fig. 2(b). We have two databases, containing 52 male and 52 female facial shapes, respectively.
2.2 Principal Component Analysis
The facial shape, which is represented as a multidimensional vector, should be made lower dimensional for the optimization. PCA can compress the multidimensional vector into a lower-dimensional one. Our database has m persons' shape vectors, denoted x_1, x_2, ..., x_m. Each vector is defined as x = [x_1, y_1, z_1, x_2, y_2, z_2, ..., x_n, y_n, z_n]^T, where n is the number of vertices. In our database, n = 260 and m = 52.
Fig. 2. Database definition: (a) the orbitomeatal plane coordinate system with facial landmarks (ectocanthion, pronasale, cheilion); (b) the facial part model (260 vertices, 482 patches)
PCA calculates eigenvectors p_1, p_2, ..., p_k from the shape vectors, where k = \min(3n, m) and p_i \in R^{3n} (1 \le i \le k). An arbitrary facial shape x can be represented by the eigenvectors and the average shape \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i:

x = \bar{x} + \sum_{i=1}^{k} s_i p_i.   (1)

The vector s = [s_1, s_2, ..., s_k]^T is in one-to-one correspondence with the vector x; both x and s represent the facial shape uniquely. Choosing the first l (1 \le l < k) elements of s, we get a vector \hat{s} = [s_1, s_2, ..., s_l]^T. If the shape is represented by \hat{s}, a predicted shape \hat{x} can be computed:

\hat{x} = \bar{x} + \sum_{i=1}^{l} s_i p_i.   (2)

Thus we can convert between \hat{x} and \hat{s} at any time. The facial shape, originally represented by the 3n-dimensional vector x, is now compressed to the l-dimensional vector \hat{s} by PCA, which makes it practical to optimize the facial shape.
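A compact NumPy sketch of this shape model — building the eigenvectors, projecting a shape to the coefficient vector s-hat, and reconstructing x-hat with equations (1) and (2) — is given below; the random training matrix in the usage comment is purely illustrative, not the actual database.

import numpy as np

def build_shape_model(X):
    # PCA of m shape vectors (rows of X, each of length 3n): mean shape and eigenvectors (Eq. (1))
    x_bar = X.mean(axis=0)
    _, svals, Vt = np.linalg.svd(X - x_bar, full_matrices=False)  # SVD of the centred data
    eigvecs = Vt                                   # p_1 ... p_k as rows, k = min(3n, m)
    eigvals = svals ** 2 / (X.shape[0] - 1)        # lambda_i, the per-component variances
    return x_bar, eigvecs, eigvals

def reconstruct(x_bar, eigvecs, s_hat):
    # x_hat = x_bar + sum_i s_i p_i using only the first l components (Eq. (2))
    return x_bar + s_hat @ eigvecs[:len(s_hat)]

def project(x_bar, eigvecs, x, l):
    # s_hat: coordinates of a shape x in the first l principal directions
    return (x - x_bar) @ eigvecs[:l].T

# usage: X = np.random.rand(52, 780)   # 52 shapes, 260 vertices * 3 (illustrative random data)
# x_bar, P, lam = build_shape_model(X); s = project(x_bar, P, X[0], l=20); x_hat = reconstruct(x_bar, P, s)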
2.3 Initial Shape
We have to choose an initial shape from our database. To choose the most suitable shape, we first project all shapes x_1, x_2, ..., x_m into the eigenspace. Next, we estimate a camera pose using each projected shape \hat{s}_i (1 \le i \le m) in each input image, and then render the projected image of each shape. Finally, we compute the evaluated values y_i = f(\hat{s}_i) and select as the initial shape \hat{s}_{init} the shape \hat{s}_i with the minimum error y_i.
Fig. 3. Relationships among the coordinate systems: the image coordinates u = [u, v]^T, the camera coordinates X_C = [x_C, y_C, z_C]^T, and the world coordinates X_W = [x_W, y_W, z_W]^T, related by the intrinsic matrix K and the extrinsic matrix M = [R|t]
2.4 Pose Estimation
Fig. 3 shows the image coordinate system u = [u, v]^T, the camera coordinate system X_C = [x_C, y_C, z_C]^T, and the world coordinate system X_W = [x_W, y_W, z_W]^T. They are related to each other by

\tilde{u} \simeq K \tilde{X}_C   (3)

\tilde{X}_C \simeq M \tilde{X}_W,   (4)

where K is the intrinsic camera parameter matrix and M = [R|t] is the extrinsic camera parameter matrix. They are given by the following elements:

K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \quad M = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}.   (5)

The world coordinate system equals the orbitomeatal plane coordinate system in this setting. We assume that the intrinsic parameter K is already known, and the 2D feature points u are also known because they are clicked by the user. The 3D facial feature points X_W are given by the interim shape \hat{s}. In this method, we compute the camera pose, that is, the extrinsic parameter M, from the pairs of the five feature points u and X_W (see Fig. 2) using Zhang's method [12].
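Equations (3)-(5) amount to the standard pinhole projection u ~ K[R|t]X_W, which can be written directly as follows; the intrinsic values in the usage comment are illustrative, not the calibration used in the paper. In practice, the pose M could also be estimated from the five 2D-3D correspondences with a generic PnP solver (e.g., OpenCV's solvePnP), although that is only an analogue of the reference-plane-based method of [12] used by the authors.

import numpy as np

def project_points(K, R, t, X_w):
    # project world points X_w (N x 3) to pixel coordinates u (N x 2) via u ~ K [R|t] X_w
    X_c = X_w @ R.T + t                      # world -> camera coordinates (Eq. (4))
    uvw = X_c @ K.T                          # camera -> homogeneous image coordinates (Eq. (3))
    return uvw[:, :2] / uvw[:, 2:3]          # perspective division

# illustrative parameters (not the calibration used in the paper):
# K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
# R, t = np.eye(3), np.array([0.0, 0.0, 500.0])
# project_points(K, R, t, np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]))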
Shape Optimization
Eq. (3) (4) leads to a following equation: ˜W ˜ KM X u ˜W , PX
(6)
where P is so-called projection matrix. We can project the interim shape sˆ to an image at the same pose as the input camera pose P , so the projected
646
S. Suzuki, H. Saito, and M. Mochimaru
image depends on the projection matrix P and the interim shape sˆ. In fact, the projection matrix is computed from the interim shape. Therefore, as a result, the projected image relies on only the interim shape. We can compute the error between the input image and the projected image. The shape is otimized to minimize it. We use Levenberg-Marquardt method for the optimization. This algorithm generates the optimized shape sˆopt = arg minsˆ f (ˆ s), where f denotes the error function. In fact, the error function is composed of the four functions described at 2.6: f (ˆ s) =
4
2
{αi fi (ˆ s)} ,
(7)
i=1
where αi (1 ≤ i ≤ 4) is a weight coefficient of each function determined empirically. The optimized shape sˆopt is the final output of this method. 2.6
Error Functions
There are four error functions. They require manual inputs such as five feature points uinput, a silhouette Sinput, and an outline Linput on each input image Iinput . The silhouette is a facial part of the input image. The outline is a part of the contour of the silhouette. A border between a face and a background is the outline. In contrast, a border between a face and hair is not the outline. Facial Likelihood Error Function. This function computes a correlation value between an argument shape sˆ and the set of the database shapes: l 1 s2i f1 (ˆ s) = 1 − exp − , (8) 2 i=1 λi where λi (1 ≤ i ≤ l) denotes the eigenvalue, that is a variance of the learning data set. We suppose that the human facial shapes are on the Gaussian distribution from the average shape. If the argument shape sˆ is far from the data set, the error value should be high. This function prevent the interim shape from being morphed too much and being far from humanity. Facial Contour Error Function. This function evaluates the difference in terms of the contour between the argument shape and the input image: The definition is:
(u,v) dproj dudv f2 (ˆ s) = A
, (9) w A dudv where
(u,v) A = (u, v) |Linput = white
(10)
Reconstruction of Facial Shape from Freehand Multi-viewpoint Snapshots
(u,v) (u,v) dproj = D (Sproj)(u,v) + D S˜proj
w=
fx2 + fy2 t3
.
647
(11)
(12)
S˜ denotes a negative image of a binary image S. D (S) means the euclidean distance transformation of S. Note that the silhouette Sproj is the projection of the argument shape sˆ, so it depends on sˆ. The main part of this function is
(u,v) A dproj dudv. This part means the sum of absolute distance from the outline to the projected silhouette. The denominator
is a normalization factor, where w represents the projected facial area size and A dudv is the length of Linput . Feature Points Error Function. This function evaluates a 2D distance of the feature points between the input image and the projected image. Let uinput, uproj denote the feature points on each image: f3 (ˆ s) =
5 1 (i) (i) uinput − uproj , 5w i=1
(13)
where u(i) (1 ≤ i ≤ 5) represents the coordinate of each feature point. Note that the projected points uproj depends on sˆ. The denominator is a normalization factor, where w is defined by Eq. (12). Texture Error Function. This function can evaluate the detail of the face. We use the most frontal facial input image for a texture, and then render a texture mapped projected image Iproj onto the other input image Iinput . We compute the error usihng following equation:
(u,v) (u,v) − I I input proj dudv B
f4 (ˆ s) = , (14) dudv B where
(u,v) (u,v) B = (u, v) |Sinput = Tproj = white .
(15)
Tproj is a mask image that presents a texture mapped area. The numerator is the sum of absolute difference of each appearance in a region of comparable area. The denominator is a normalization term, that is a size of the comparable area.
3
Experiments
First, We reconstruct a facial shape from multi-viewpoint snapshots. The result is show in 3.1. Next, in order to discuss the accuracy, we reconstruct five persons’ shapes whose real shapes are measured by a range scaner in advance. The accuracy is shown in 3.2 and discussed in 3.3.
648
3.1
S. Suzuki, H. Saito, and M. Mochimaru
Reconstruction
The number of input images is four in 640 × 480 resolution. The input images are taken by a digital still camera and not calibrated extrinsically. Through the experimentation, we use l = 20 principal components, which enable the cumulative contribution ratio 90%. The parameters in Eq. (7) is determined empirically as α1 = 1.0, α2 = 6.0, α3 = 3.0, α4 = 0.1. We decide αi (1 ≤ i ≤ 4), where the evaluated values αi fi (ˆ s) are nearly same to each other.
(a) Iinput
(b) Sinput
(c) Linput
(d) Mesh overlaid.
(e) Proj.
(f) Sproj
(g) Iproj
(h) Tproj
Fig. 4. Input image and reconstructed result
Fig. 4 shows the results. Though the number of the input images is four, we show one typical image in the figure. Fig. 4(a) is the input image with cliked feature points. Fig. 4(b) and Fig. 4(c) are the silhouette and the outline of the input image respectively. We reconstruct a shape from these inputs. The result is shown in Fig. 4(e). Fig. 4(f) shows the silhouette image of it. Fig. 4(g) is the projected image with a texture, and Fig. 4(h) is the mask image that means the texture mapped area. Fig. 4(d) shows an image with a reconstructed mesh. To compare Fig. 4(a) with Fig. 4(g), the PSNR (Peak Signal-to-Noise Ratio) of the appearance is 24.4 dB. The computation time is around 10 minutes in the condition of Windows XP SP3, Intel Core 2 Duo 6700 (2.66GHz), 3.5GB RAM. 3.2
Comparison with Range Data
We reconstruct five persons’ shapes whose real shapes are already known by a range scanner. They consist of three males and two females, and we use a respective database. The experimental condition is the same as 3.1. We computed 3D reconstruction errors from real shapes. The error means the average of the euclidean distance between the true verteces and their
Reconstruction of Facial Shape from Freehand Multi-viewpoint Snapshots
649
Table 1. Shape evaluation. Each column corresponds with the each person. The top row shows the error between the real shape and the reconstructed shape. The middle row means the error between the real shape and the most similar shape in the database. The bottom row is the average value of errors between the real shape and the respective shapes in the database (mm). Person ID Reconstructed Min DB Avg DB
Male 1 3.1 3.5 4.9
Male 2 3.3 3.2 5.0
Male 3 3.2 2.9 4.3
Female 1 Female 2 3.9 3.9 3.0 2.7 4.6 4.7
corresponding reconstructed vertex positions. Table 1 shows the fact that the reconstructed shape is around 3.5 mm different from the real shape. Compared with the middle row and the bottom row, the reconstructed results look suitable. 3.3
Error Distribution
Fig. 5 shows reconstruction error maps. Each column corresponds with the each person. Fig. 5(a) is the real shapes which is measured by a range scanner. Fig. 5(b) is the reconstructed shape, and Fig. 5(c) is the error maps.
(a) Real shapes
(b) Reconstructed results
0
6mm
(c) Error maps Fig. 5. Reconstructed results and error maps
The center of the face has a small error. In contrast, the edge has a big error. This occurs because the feature points are concentrated to the center of the face. That is why the edge part can not be computed accurately.
650
4
S. Suzuki, H. Saito, and M. Mochimaru
Conclusion
We proposed a method that can reconstruct both a facial shape and camera poses from freehand multi-viewpoint snapshots. This method does not use any special hardware device. The most of conventional methods require a calibrated multi camera system, but our method does not require it because we estimate both of them simultaneously. The reconstruction error is around 3.5 mm. However, our method needs manual inputs such as facial feature points, a silhouette and an outline. It is better to decrease these manual inputs. It will be a future task.
References 1. Brunsman, M.A., Daanen, H.A.M., Robinette, K.M.: Optimal postures and positioning for human body scanning. In: Proc. of Int’l Conf. on Recent Advances in 3-D Digital Imaging and Modeling, pp. 266–273 (1997) 2. Zhang, L., Snavely, N., Curless, B., Seitz, S.M.: Spacetime faces: High resolution capture for modeling and animation. ACM Trans. on Graphics 23, 548–558 (2004) 3. Siebert, J.P., Marshall, S.J.: Human body 3d imaging by speckle texture projection photogrammetry. Sensor Review 20, 218–226 (2000) 4. Chowdhury, A.K.R., Chellappa, R.: Face reconstruction from monocular video using uncertainty analysis and a generic model. Computer Vision and Image Understanding 91, 188–213 (2003) 5. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: Proc. of the 26th Annual Conf. on Computer Graphics and Interactive Techniques, pp. 187–194 (1999) 6. Romdhani, S., Blanz, V., Vetter, T.: Face identification by fitting a 3d morphable model using linear shape and texture error functions. In: Proc. of the Seventh European Conf. on Computer Vision, vol. 4, pp. 3–19 (2002) 7. Blanz, V., Vetter, T.: Face recognition based on fitting a 3d morphable model. IEEE Trans. on Pattern Analysis and Machine Intelligence 25 (2003) 8. Hu, Y., Jiang, D., Yan, S., Zhang, L., Zhang, H.: Automatic 3d reconstruction for face recognition. In: Proc. of the Sixth IEEE Int’l Conf. on Automatic Face and Gesture Recognition, pp. 843–848 (2004) 9. Jiang, D., Hu, Y., Yan, S., Zhang, L., Zhang, H., Gao, W.: Efficient 3d reconstruction for face recognition. Pattern Recognition 38 (2005) 10. Amberg, B., Blake, A., Fitzgibbon, A., Romdhani, S., Vetter, T.: Reconstructing high quality face-surfaces using model based stereo. In: Proc. of the Eleventh IEEE Int’l Conf. on Computer Vision (2007) 11. Takeuchi, T., Saito, H., Mochimaru, M.: 3d-face model reconstruction utilizing facial shape database from multiple uncalibrated cameras. In: Proc. of the 16th Int’l Conf. in Central Europe on Computer Graphics, Visualization and Computer Vision (2008) 12. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. on Pattern Analysis and Machine Intelligence 22, 1330–1334 (2000)
Multiple-view Video Coding Using Depth Map in Projective Space Nina Yorozu, Yuko Uematsu, and Hideo Saito Keio University 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa, 223-8522, Japan {yorozu,yu-ko,saito}@hvrl.ics.keio.ac.jp
Abstract. In this paper a new video coding method using multiple uncalibrated cameras is proposed. We consider the redundancy between the cameras' viewpoints and compress efficiently based on a depth map. Since our target videos are taken with uncalibrated cameras, our depth map is computed not in the real world but in the Projective Space, a virtual space defined by projective reconstruction of two still images. This means that a position in this space corresponds to a depth value, so we do not require full calibration of the cameras. Generating the depth map requires finding the correspondences between the cameras, for which we use a “plane sweep” algorithm. Our method needs only a depth map in addition to the original base image and the camera parameters, which contributes to effective compression.
1 Introduction
With the development of digital image processing techniques, multiple-view video captured with multiple cameras is in high demand for many new media applications: broadcasting, live sport events, cinema production, and so on. The “EyeVision” system [1] used in the live broadcasting of American football is a famous and landmark example of multiple-view video research. In the field of cinema production, “Matrix” [2] employed a novel technique for generating computer graphics based on real videos and created the scene where a virtual camera pans around the actor for a moment. Moreover, for applications such as 3DTV and free-viewpoint TV (FTV), there are many related works [3][4][5] that generate free-viewpoint images from real images taken by multiple cameras. With free-viewpoint images, viewers can interactively change their viewpoint without being concerned about the real camera positions. As noted by Tanimoto [6] and Smolic et al. [7], these types of videos have advantages in many fields. On the other hand, streaming distribution of movies is already provided on the Internet. It is expected that streaming distribution of multiple-view videos will also start in the future, and multiple-view video coding (MVC) will become very important. In the case of MVC, the redundancy between viewpoints should also be taken into consideration besides spatial or temporal redundancy. Many techniques have been proposed, and they can be classified into several approaches.
Object-based coding is applied in MPEG-4. In this coding, the scene is constructed by synthesizing each object, which requires the objects in the scene to be separated in advance; it is therefore difficult to apply this coding to natural images. Disparity compensation is the most popular approach, and the technique is an extension of single-view video coding. The images taken from other viewpoints are treated as the encoding targets and are used just as reference images for coding. Therefore, disparity information such as motion vectors and residual signals, i.e., prediction errors, are encoded and transmitted to the decoder side. Though it is effective for temporal redundancy, the benefit is minor for distantly positioned cameras. View synthesis and view interpolation use techniques from the field of image-based rendering to predict the coding target images. These approaches exploit the fact that the disparity of an object between two views depends on the geometric relation between their cameras: objects closer to the camera move much more than objects far from the camera when the viewpoint changes. Therefore, one of the most popular and general algorithms is to use depth instead of disparity vectors, as proposed by Martinian et al. [8], Shimizu et al. [9], Tsung et al. [10] and Ozkalayci et al. [11]. Many of them focus on the color matching between the input images to generate a higher-accuracy depth map. These approaches require full calibration of the multiple cameras to get a depth map; however, such a calibration of many cameras is a very time-consuming task. This is one of the difficulties for practical use of multiple-view shooting. Moreover, easy segmentation of the images is also necessary to generate a depth map. In this paper, a new video coding method based on a depth map is proposed. The targets are multiple-view videos taken by multiple cameras. We focus on the redundancy between the cameras to compress large-volume videos. In this method, we use only an original image and a depth map of a base camera, which is one of the multiple cameras, and predict the images taken by the other cameras. In contrast with conventional depth-map-based methods, our method does not require full calibration to generate a depth map, because the map is computed in the Projective Space, a virtual space defined by projective reconstruction of two images. The depth value of our depth map thus corresponds to a position in the Projective Space, not in the real world, and therefore we do not require full calibration of the cameras. To obtain correspondences between the cameras for generating the depth map, we apply the Plane Sweep algorithm [12]. We assume that two virtual planes are defined in the Projective Space so that every target object lies between the planes. The space between the two planes is divided by multiple parallel planes. By projecting each pixel of each plane onto the input images and matching the colors found in the images, pixel-to-pixel correspondences between the cameras are obtained. Many related works have applied the Joint Multiview Video Model (JMVM) [13], the reference software for MVC, to evaluate their methods. However, in this paper, we examine the effectiveness of compression by using
multiple (still) images captured at the same time, and we apply entropy coding as the multiple-view video coding method. If this experiment on multiple still images achieves good results, our method should also perform well for multiple videos. In Section 2, we explain our method in detail. In Section 3, we apply our method to multiple-view video coding and demonstrate the effectiveness of the compression. Conclusions are given in Section 4.
2 Method
The outline of our method is shown in Fig. 1. In our method, we use three uncalibrated cameras: a base camera, a reference camera and an input camera. The base camera, which is one of the cameras, is used as the basis for coding, and the “base image” is taken with the base camera. The images taken with the other two cameras are called the “reference image” and the “input image”, respectively. The target images of the coding are the reference image and the input image; those two images are therefore predicted by using the base image.
Fig. 1. Outline of Proposed Method
Our method is divided into a preprocessing step and two main processes, “Depth Map Generation” and “Prediction & Coding”. In the preprocessing, the F-matrix that represents the epipolar geometry between every pair of images is obtained. By finding eight or more pairs of corresponding points, the F-matrix is computed; this is usually called weak calibration. In the “Depth Map Generation”, a depth map of the base image is generated by constructing a Projective Space, which is a virtual 3D space. For constructing the Projective Space, the base and reference images are used with projective reconstruction. In the “Prediction & Coding”, the reference and input images are predicted by using the computed depth map of the base image, and then the differences between the original and predicted reference and input images are encoded. The details are described in the following sections.
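As an illustration of the weak-calibration step, the sketch below estimates the F-matrix from eight or more manually selected correspondences. It is an assumed implementation using OpenCV and NumPy (not the authors' code); pts_a and pts_b are hypothetical N x 2 arrays of corresponding pixel coordinates in two of the views.

import numpy as np
import cv2

def weak_calibration(pts_a, pts_b):
    # 8-point algorithm: at least eight correspondences are required.
    pts_a = np.asarray(pts_a, dtype=np.float64)
    pts_b = np.asarray(pts_b, dtype=np.float64)
    F, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_8POINT)
    return F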
2.1 Generation of Depth Map
Our method does not require full calibration and generates a depth map in the Projective Space, which is a virtual 3D space. Using the F-matrix obtained in the preprocessing, the Projective Space is constructed from the base image and the reference image. Each depth value of our depth map then corresponds to a position in the Projective Space, not in the real world. To obtain correspondences between the cameras for generating the depth map, we apply the Plane Sweep algorithm [12]. A detailed flow is shown in Fig. 2.
Fig. 2. Flow of Generation of Depth Map
Construct Projective Space. The Projective Space is constructed from two images, the base image and the reference image, taken by two cameras as shown in Fig. 3. Since this technique is based on projective reconstruction, the parallelism of the axes is not preserved. When the epipolar geometry between the two images (cameras) is established, the relationship between the Projective Space and the two images is, respectively,

P_A = [ I | 0 ],    P_B = [ -([e_B]_x F_AB) / ||e_B||^2  |  e_B ]        (1)

Fig. 3. Projective Space
where P_A and P_B are the projection matrices onto the base image and the reference image, F_AB is the F-matrix from the base image to the reference image, and e_B is the epipole on the reference image. Considering X_P(P, Q, R) as a point in the Projective Space, with x_A(u_A, v_A) its projection on the base image and x_B(u_B, v_B) its projection on the reference image, we can write

M X~_P = 0        (2)

      [ p^1_A - u_A p^3_A ]
M =   [ p^2_A - v_A p^3_A ]        (3)
      [ p^1_B - u_B p^3_B ]
      [ p^2_B - v_B p^3_B ]
Here p^i_A and p^i_B are the i-th row vectors of P_A and P_B, respectively. We then obtain X~_P(P, Q, R, 1) from the singular value decomposition of M. If more than six corresponding points are detected among the three images (the base, reference and input images), the projection matrix P_C from the Projective Space to the input image can also be obtained. The projection matrices P_A, P_B and P_C are used for the pixel matching.
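The following sketch (an assumed implementation, not the authors' code) makes this construction concrete: the epipole e_B is taken as the null vector of F_AB^T, P_A and P_B are built as in Eq. (1), and a point X_P = (P, Q, R, 1) is triangulated from one pixel correspondence through the SVD of M from Eq. (3).

import numpy as np

def skew(e):
    # Cross-product matrix [e]_x
    return np.array([[0.0, -e[2], e[1]],
                     [e[2], 0.0, -e[0]],
                     [-e[1], e[0], 0.0]])

def projection_matrices(F_AB):
    # Epipole on the reference image: F_AB^T e_B = 0.
    _, _, Vt = np.linalg.svd(F_AB.T)
    e_B = Vt[-1]
    P_A = np.hstack([np.eye(3), np.zeros((3, 1))])
    P_B = np.hstack([-skew(e_B) @ F_AB / (e_B @ e_B), e_B.reshape(3, 1)])
    return P_A, P_B

def triangulate(P_A, P_B, xA, xB):
    (uA, vA), (uB, vB) = xA, xB
    M = np.vstack([P_A[0] - uA * P_A[2],
                   P_A[1] - vA * P_A[2],
                   P_B[0] - uB * P_B[2],
                   P_B[1] - vB * P_B[2]])
    _, _, Vt = np.linalg.svd(M)      # M X_P = 0: take the smallest singular vector
    X = Vt[-1]
    return X / X[3]                  # homogeneous point (P, Q, R, 1)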
Pixel Matching. All pixels in the base image are matched to the reference image and the input image, and their 3D coordinates in the Projective Space are computed. These 3D coordinates provide the depth values of our depth map. To obtain the correspondences between the images, we employ the Plane Sweep algorithm, as shown in Fig. 4.

Fig. 4. Plane Sweep
The space is divided by multiple parallel planes, and we assume that every target object lies between the two outermost planes. By projecting each pixel of each plane onto the three images (base, reference, input) using P_A, P_B and P_C and matching the colors found in the images, pixel-to-pixel correspondences between the images are obtained.
The 3D coordinate X_P(P, Q, R) in the Projective Space can then be computed from each matched pair of pixels, and we use R as the depth value in the depth map.
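A minimal per-pixel plane-sweep sketch is given below, assuming P_A = [I | 0] so that a base pixel (u_A, v_A) at candidate depth R corresponds to X_P = (u_A R, v_A R, R, 1) (the relation stated as Eq. (4) in the next section). It is not the authors' code: variable names are hypothetical and the color score is a plain sum of absolute differences rather than their exact matching criterion.

import numpy as np

def sweep_depth(uA, vA, img_a, img_b, img_c, P_B, P_C, r_values):
    # For one base-image pixel, test the candidate planes (depth values R) and
    # keep the one with the best color agreement across the three views.
    best_r, best_cost = None, np.inf
    ref_color = img_a[vA, uA].astype(float)
    for R in r_values:
        X = np.array([uA * R, vA * R, R, 1.0])
        cost = 0.0
        for P, img in ((P_B, img_b), (P_C, img_c)):
            x = P @ X
            u, v = int(round(x[0] / x[2])), int(round(x[1] / x[2]))
            if 0 <= v < img.shape[0] and 0 <= u < img.shape[1]:
                cost += np.abs(img[v, u].astype(float) - ref_color).sum()
            else:
                cost = np.inf
                break
        if cost < best_cost:
            best_cost, best_r = cost, R
    return best_r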
2.2 Prediction and Coding
In our method, the basis of coding is the base image, and the target of coding is the reference image and the input image. We predict the reference image and the input image only by using the information of the base image such as the depth map and the intensities. For coding, we make the subtraction images by subtracting the original image from the predicted image of the reference image and the input image. A detailed flow is shown in Fig. 5.
Fig. 5. Flow of Prediction and Coding
Prediction of Other Images. By the definition of the Projective Space, we can consider X_P(P, Q, R) as a point in the Projective Space and x_A(u_A, v_A) as its projection on the base image. Therefore, the relationship between the 2D coordinate on the base image and the 3D coordinate in the Projective Space is described as follows:

u_A = P/R,    v_A = Q/R        (4)
When the depth map of the base camera is obtained, the 3D coordinates of all pixels can be computed, because x_A(u_A, v_A) and R are known. Then, by projecting every point onto the reference image and the input image, each image is predicted. Image Coding. After obtaining the subtraction images of the reference image and the input image, they are encoded together with the original image and the depth map of the base image. As described above, if the predicted images are accurate, the subtraction images should be close to 0; more accurate prediction therefore yields more efficient coding. In our method, entropy coding is applied as the multiple-view video coding method. The entropy E of an image is given by
E = - Sum_i S_i log S_i        (5)
where S_i is the probability of the color value i (0 <= i <= 255). The entropy is computed in the same way for each image: the base image, the depth map, and the subtraction images of the reference and input images.
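For reference, the entropy of Eq. (5) can be computed for an 8-bit image as in the short sketch below (an illustrative snippet, not the authors' encoder); for color images the same measure would be applied per channel.

import numpy as np

def image_entropy(img):
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    s = hist / hist.sum()              # S_i: probability of color value i
    s = s[s > 0]                       # empty bins contribute 0
    return -(s * np.log2(s)).sum()     # bits per pixel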
3 Experimentation and Discussion
We applied our method to the following two test color sequences. In both cases, we used three uncalibrated cameras, set up as shown in Fig. 6. In this experiment, the corresponding points were manually selected for the weak calibration.
Fig. 6. Experimental Scene
Fig. 7. Images for test sequence “on the desk”: (a) input image, (b) base image, (c) reference image

Fig. 8. Images for test sequence “volleyball”: (a) input image, (b) base image, (c) reference image
– “on the desk”: a scene in which some paper crafts are placed on a desk (320 × 240 resolution, as shown in Fig. 7)
– “volleyball”: a scene of a volleyball game (480 × 270 resolution, as shown in Fig. 8)

3.1 Depth Map
The generated depth maps are shown in Figs. 9 and 10. Our depth map is represented in the Projective Space, not in the real world. Since the axis of R may not be perpendicular to the image plane, as described in Sec. 2.1, the depth map looks visually different from a general depth map. The area of the depth map is the common area of the three images (base, reference, input), because the map is generated by finding correspondences among them.
Fig. 9. Depth map of the base camera, “on the desk”: (a) original, (b) depth map

Fig. 10. Depth map of the base camera, “volleyball”: (a) original, (b) depth map
3.2 Prediction and Subtraction
As described in Section 2.2, the results of the prediction and subtraction are shown in Fig. 11 to Fig. 14. The area that is not commonly captured by the three cameras is interpolated with neighboring colors, so that the whole image is predicted. As shown in the subtraction images (c), the difference values are quite small in every image, because our predicted images have high accuracy. As described in Sec. 2.2, accurate prediction increases the coding efficiency, because the subtraction image becomes almost 0. The quantitative evaluation of the coding is described in the next section.
Fig. 11. Reference image of “on the desk”: (a) original, (b) predicted, (c) subtraction

Fig. 12. Input image of “on the desk”: (a) original, (b) predicted, (c) subtraction

Fig. 13. Reference image of “volleyball”: (a) original, (b) predicted, (c) subtraction

Fig. 14. Input image of “volleyball”: (a) original, (b) predicted, (c) subtraction
3.3 Comparison of Entropy
The calculated entropies are shown in Table 1. Even though our method needs only a depth map in addition to the original base image and the camera parameters, it compresses the data effectively. We used three cameras in this experiment; however, the method can achieve even higher compression efficiency if more cameras are used. This is because our method employs the Plane Sweep algorithm, which uses color matching of every pixel across all cameras: the more cameras are utilized, the higher the accuracy of the pixel matching.

Table 1. Comparison of entropy [bit/pixel]

                           Base Image   Depth Map   Reference Image        Input Image            Total
                                                    (Subtraction Image)    (Subtraction Image)
“on the desk”   original   22.1         -           22.1                   22.2                   66.4
                proposed   22.1         6.5         (18.2)                 (19.1)                 65.9
“volleyball”    original   20.8         -           21.0                   21.2                   63.0
                proposed   20.8         7.0         (17.6)                 (17.4)                 62.8
4 Conclusion
In this paper, a new video coding method based on a depth map was proposed. The targets of our method are multiple-view images taken with multiple cameras. We consider the redundancy between the viewpoints of the cameras to efficiently compress large-volume image data. Using only a single original image and a depth map, our method can predict the images taken with the other cameras. Applying our method to multiple-view video coding, we demonstrated the effectiveness of the compression: even though our method needs only the depth map and the original image, it achieved more effective compression than using the raw images. One advantage of our method is that, in contrast with conventional depth-map-based methods, we do not require full calibration of the cameras. In our method, the depth map is generated in the Projective Space, a virtual 3D space defined by projective reconstruction of two images. Therefore, we need only weak calibration, which represents the epipolar geometry of the cameras. This is a big advantage, because any uncalibrated videos (images) can easily be handled by our method. In future work, we plan to use more cameras to build a more effective depth map and to apply our method to full video sequences. By applying a general feature detection technique to obtain the corresponding points between the cameras when computing the F-matrix, we can easily extend our method to a fully automatic system.
Acknowledgments This research was supported by National Institute of Information and Communications Technology (NICT).
References
1. Eye Vision, http://www.pvi-inc.com/eyevision/
2. Manex Entertainment Inc.: Matrix, http://www.mvfx.com
3. Chen, S.E., Williams, L.: View interpolation for image synthesis. IEEE Trans. on Pattern Analysis and Machine Intelligence 20, 218–226 (1998)
4. Inamoto, N., Saito, H.: Fly through view video generation of soccer scene. In: IWEC Workshop Note, May 2002, pp. 94–101 (2002)
5. Nozick, V., Saito, H.: On-line free-viewpoint video: From single to multiple view rendering. International Journal of Automation and Computing 5, 257–265 (2008)
6. Tanimoto, M.: Overview of free viewpoint television. Signal Processing: Image Communication 21, 454–461 (2006)
7. Smolic, A.: 3d video and free viewpoint video - technologies, applications and mpeg standards. In: Proc. ICME 2006, July 2006, pp. 2161–2164 (2006)
8. Martinian, E., et al.: View synthesis for multiview video compression. In: Proc. PSC 2006, April 2006, pp. SS3–4 (2006)
9. Shimizu, S., et al.: View scalable multiview video coding using 3-d warping with depth map. IEEE Trans. Circuits Syst. Video Technol. 17, 1485–1495 (2007)
10. Tsung, P.K., Lin, C.Y., Chen, W.Y., Ding, L.F., Chen, L.G.: Multiview video hybrid coding system with texture-depth synthesis. In: IEEE International Conference on Multimedia and Expo, April 2008, vol. 26, pp. 1581–1584 (2008)
11. Ozkalayci, B., Serdar Gedik, O., Aydin Alatan, A.: Multi-view video coding via dense depth estimation. In: 3DTV Conference, May 2007, pp. 1–4 (2007)
12. Collins, R.: A space-sweep approach to true multi-image matching. In: Proceedings of IEEE Computer Society Conference on CVPR, pp. 358–363 (1996)
13. MPEG-4 Video Group: Joint multiview video model (JMVM) 1.0
Using Subspace Multiple Linear Regression for 3D Face Shape Prediction from a Single Image Mario Castelán1, Gustavo A. Puerto-Souza2, and Johan Van Horebeek2
1 Centro de Investigación y de Estudios Avanzados del I.P.N., Grupo de Robótica y Manufactura Avanzada, Ramos Arizpe, Coah. 25900, México
[email protected]
2 Centro de Investigación en Matemáticas, Guanajuato, Gto. 36240, México
Abstract. In this paper, we compare four different Subspace Multiple Linear Regression methods for 3D face shape prediction from a single 2D intensity image. This problem is situated within the low observation-to-variable ratio context, where the sample covariance matrix is likely to be singular. Lately, efforts have been directed towards latent-variable based methods to estimate a regression operator while maximizing specific criteria between 2D and 3D face subspaces. Regularization methods, on the other hand, impose a regularizing term on the covariance matrix in order to ensure numerical stability and to improve the out-of-training error. We compare the performance of three latent-variable based and one regularization approach, namely, Principal Component Regression, Partial Least Squares, Canonical Correlation Analysis and Ridge Regression. We analyze the influence of the different latent variables as well as the regularizing parameters in the regression process. Similarly, we identify the strengths and weaknesses of both regularization and latent-variable approaches for the task of 3D face prediction.
1 Introduction

Due to its potential applications in surveillance and computer graphics, 3D face reconstruction is an active research topic in computer vision, more specifically in the area of shape analysis. Classic shape-from-shading (SFS) algorithms [12] may provide a way to approach this task, but their usability remains limited since facial reflectance departs from Lambert's law. As a consequence, restrictions have been imposed within the SFS framework in order to make the problem approachable. For example, some a priori knowledge about the facial surface may be assumed. The most recent work following this idea is the face molding approach [13]. Here, a surface is recovered in accordance with an input intensity image, a reference surface and a reference albedo. Unfortunately, reference data resembling the input image must be available in order to obtain good results. During the last decade, the idea of learning 3D shape variations from a training set of facial surfaces has attracted considerable attention in the computer vision community. The first attempts were initially inspired by the work of Kirby and Sirovich [14] for the
characterization of 2D intensity images of faces into a low-dimensional linear subspace using Principal Component Analysis (PCA). Atick et al. [2] were the first to incorporate the SFS irradiance constraint within a 3D statistical model. Their approach planted a seed for future research in the field of statistical 3D face shape recovery from a single image. Following Atick's work, Blanz and Vetter proposed separate linear subspaces for 3D shape and texture [3]. With the help of a sophisticated optimization procedure, they showed that facial shape and texture can be recovered across pose and illumination variations. The so-called morphable model is considered state-of-the-art in the field, and has also been used for face recognition purposes [4]. More recently, Smith and Hancock [18] transformed surface normal directions into Cartesian points using the azimuthal equidistant projection. PCA was applied to these points and the model was iteratively fitted under geometric SFS constraints [20]. They also developed a statistical model of surface normals using principal geodesic analysis. With a robust statistical model, this deformable model can be fitted to facial images with self-shadowing and can also be used for the purposes of face recognition [19]. The methods mentioned above have a common principle. First, either a single 3D shape model or two separate models of 3D shape and texture are constructed using PCA. Then, the models are fitted to an input intensity image using specific algorithms and different optimization criteria. Unfortunately, common information shared between the 3D shape and 2D intensity subspaces is not considered by these approaches. The first efforts to tackle this problem are reported by Castelán et al. [5]. Here, separate models of 3D shape and intensity are first constructed. A single statistical model coupling variations of 3D shape and intensity is then built by linking coefficients in both subspaces into a coupled coefficient vector. The projection of a novel intensity example onto the intensity subspace is used to guide an optimization procedure that calculates the coupled model coefficients and finally the 3D shape. This work was inspired by the active appearance model of Cootes et al. [7], where coupled variations of 2D facial shape and texture were condensed into a single statistical model. Ahmed and Farag have recently described in [1] how the coupled model of [5] can be extended to deal with changes in illumination using a spherical harmonics model. In order to simplify the classic statistical face shape recovery scheme, several schemes [17,6,21,15,16] have borrowed ideas from particular applications of Multiple Linear Regression (MLR) in chemometrics. MLR aims to explain a set of responses Y as a linear combination of a set of predictors X. In the context of this paper, the responses are 3D facial shapes and the predictors are 2D intensity images. The idea underlying MLR is appealing since the 3D shape can be calculated directly from new intensity examples, avoiding the use of optimization methods for parameter fitting. In this context, Reiter et al. proposed a method based on Canonical Correlation Analysis (CCA) to predict 3D shape from frontal-view color face images [17].
The basic idea of their approach was to model the relationship between depth and appearance with a small number of latent variables, that is, correlated image features in the space of depth images and colour images. Similarly, a method resembling Canonical Variate Analysis (CVA) has been explored in [16]. The idea here is to learn projections of 2D and 3D spaces based on maximum correlation criteria while
optimizing the linear transform between the subspaces. The method is rather similar to CCA, as both maximize correlation. The Partial Least Squares (PLS) approach has also been applied to the problem of face shape prediction [6]. The approach seeks to maximize the covariance between projections of responses and predictors. This idea has proved to be the most useful in MLR with the low observation-to-variable ratio problem [8]. While the aim of the above approaches is to directly recover 3D shape from an input intensity image, other authors have applied both CCA and PLS for the subspace characterization of alternative 3D representations. For example, Lei et al. [15] have explored the mapping between tensor spaces of near infrared (NIR) images and 3D face shapes using CCA. Given an NIR face image, the depth map is computed directly using the learned mapping with the help of tensor models. Also, in [21], intensity images were defined as Local Binary Pattern (LBP) vectors. The LBP data was later used to explain variations in 3D shapes using PLS. The motivation underlying this paper is to provide research directions on the suitability of different subspace MLR approaches for solving the problem of 3D face shape prediction from a single image. To this end, we perform a comparative analysis of CCA and PLS, which have recently been used for 3D face prediction. Additionally, we propose the use of Principal Component Regression (PCR) and Ridge Regression (RR). These regression paradigms are well known in the MLR literature; however, they have not been explored in the context of 3D face prediction. In order to keep the calculations computationally efficient, the four subspace MLR approaches covered in this paper are implemented using (linear) kernels, leading to KPCR, KRR, KPLS and KCCA. The paper is organized as follows: Section 2 introduces the mathematical concepts related to the different kernel MLR approaches, Section 3 presents the experimental evaluation, and conclusions are given in Section 4.
2 Subspace Multiple Linear Regression Methods

The goal in regression analysis is to model the predictive relationship of a set of p predictor variables x = [x_1, x_2, ..., x_p]^T on q response variables y = [y_1, y_2, ..., y_q]^T given a set of n training observations. For the particular problem of face shape recovery, p = q is the number of pixels in the images. For MLR, matrices of centered data X_{n x p} = [x_1, x_2, ..., x_n]^T and Y_{n x p} = [y_1, y_2, ..., y_n]^T are built and the regression matrix B is sought that minimizes

trace((XB - Y)(XB - Y)^T).        (1)
The solution is given by B = (X^T X)^{-1} X^T Y, in case the inverse of X^T X exists. The latter is, up to a constant, the sample covariance matrix. Unfortunately, for face shape recovery purposes, the dimensionality of X^T X, which depends on the number of pixels, is much higher than the number of observations; this makes the problem computationally intractable. For this reason, one often resorts to subspace regression methods, where the solution is sought in a much smaller search space. Two popular approaches to define this subspace, implicitly or explicitly, are the use of regularization and of latent variables.
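The rank deficiency can be seen directly in a toy example (synthetic numbers, not the authors' data): with n = 50 observations and thousands of pixels, X^T X cannot be inverted.

import numpy as np

n, p = 50, 2000                        # e.g. 50 faces, 2000 pixels
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))        # centered intensities (one face per row)
Y = rng.standard_normal((n, p))        # centered height maps
print(np.linalg.matrix_rank(X))        # at most n = 50, so the p x p matrix X^T X is singular
# B = inv(X.T @ X) @ X.T @ Y is therefore unavailable; subspace methods are used instead.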
2.1 Ridge Regression

In ridge regression one imposes an additional smoothness restriction trace(B^T B) <= alpha in (1). This is equivalent to minimizing, for a given lambda,

trace((XB - Y)(XB - Y)^T) + lambda trace(B^T B).        (2)
The additional term implies a bias but it often improves the out-of-training prediction error. The solution always exists and is given by B = (X^T X + lambda I_{p x p})^{-1} X^T Y. As (X^T X + lambda I_{p x p})^{-1} X^T = X^T (XX^T + lambda I_{n x n})^{-1} and p >> n, it is computationally more convenient to calculate

B = X^T (K + lambda I_{n x n})^{-1} Y,  with K = XX^T.        (3)
This forms the basis of kernel ridge regression.
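A minimal sketch of this estimator, written exactly as Eq. (3) (matrix names follow the text; the value of lam is a placeholder):

import numpy as np

def kernel_ridge(X, Y, lam):
    n = X.shape[0]
    K = X @ X.T                                           # n x n linear kernel
    return X.T @ np.linalg.solve(K + lam * np.eye(n), Y)  # B = X^T (K + lam I)^-1 Y

# For a new centered intensity image x_new (row vector): y_hat = x_new @ B.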
2.2 Latent Variables Approach

In this approach one supposes the existence of a small number of unobserved variables that capture the most relevant information from x and/or y. This leads to a low-rank factorization of the predictor and/or response matrices:

X_{n x p} ~ R_{n x k} U_{k x p},    Y_{n x p} ~ S_{n x k} V_{k x p}.
The columns of R and S are considered as the latent variables (scores). Once the latent variables are found, one fits an MLR using R as predictors for Y.

Principal Component Regression (PCR). PCR aims to explain the responses using a reduced number of principal components of the predictors. In a first step, the first k eigenvectors (eigenfaces) of the empirical covariance matrix X^T X are calculated and stored in U. The matrix R contains the projections of the predictors on these components; because of orthogonality, XU^T = R. Next, these projections are used in an MLR as the new predictors for Y, and one minimizes trace((RB - Y)(RB - Y)^T). In practice the eigenvectors u of X^T X are calculated by means of the eigenvectors u* of XX^T, making use of the fact that u = X^T u*, as is done in Kernel PCA.
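A sketch of this kernel trick is shown below (an assumed implementation, not the authors' code): the small n x n eigenproblem is solved, the eigenvectors are lifted back with u = X^T u*, and Y is regressed on the first k scores; k must not exceed the rank of X.

import numpy as np

def kpcr_fit(X, Y, k):
    evals, U_star = np.linalg.eigh(X @ X.T)        # eigenvectors u* of X X^T (n x n)
    order = np.argsort(evals)[::-1][:k]            # k largest eigenvalues
    U_star, evals = U_star[:, order], evals[order]
    U = X.T @ U_star / np.sqrt(evals)              # u = X^T u*, normalized to unit length
    R = X @ U                                      # scores (projections of the predictors)
    B = np.linalg.lstsq(R, Y, rcond=None)[0]       # MLR of Y on the scores
    return U, B                                    # prediction: (x_new @ U) @ B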
A drawback of the above method is that the projection directions are determined independently of the response variables. As shown in [6] this can lead to suboptimal solutions. For the problem concerning this paper, sometimes the intensities of the pixels give rise to only small variations in X, and if the height values (3D shape) vary a lot, then the latent variables found by PCR may not be particularly good at describing Y. In the worst case, important information may be hidden in directions of the space that PCR interprets as noise, and therefore leaves out. In the following subsections two methods are introduced that intend to avoid this suboptimality.

Partial Least Squares (PLS). PLS aims to find simultaneously interesting projections of the predictor and response variables. It consists of the following iteration, starting with l = 1:

1. Look for projection directions w and c of x and y which maximize the covariance. Using the sample covariance matrix this leads to:

max_{w,c} <Xw, Yc> = max_{w,c} w^T X^T Y c,  with ||w|| = ||c|| = 1.        (4)
2. Define the l-th column of R as R_{.,l} = Xw and similarly S_{.,l} = Yc. The larger the covariance in (4), the stronger will be the linear relation between these two columns.
3. Deflate X and Y in these directions; if the resulting matrices are not null matrices, increase l by 1 and go to step 1.

The solution of (4) satisfies:

X^T Y Y^T X w = lambda w,    Y^T X X^T Y c = lambda c,  with ||w|| = ||c|| = 1.        (5)
Using the same (kernel) trick as mentioned for PCR, since n << p it is more convenient to solve:

Y Y^T X X^T w* = lambda w*    and    X X^T Y Y^T c* = lambda c*.        (6)

We will refer to this as KPLS. This generalized eigenproblem is often solved by a power-method based algorithm such as NIPALS. We refer to [6] for a detailed implementation.

Canonical Correlation Analysis (CCA). While PLS maximizes the covariance between latent variables, Canonical Correlation Analysis finds the directions of maximal correlation between latent variables. Using again the sample covariance matrices, it can be shown that the counterpart of (4) is:

max_{w,c} <Xw, Yc> = max_{w,c} w^T X^T Y c,  with w^T X^T X w = c^T Y^T Y c = 1,        (7)
and the solution satisfies [10]

X^T Y w = lambda X^T X c,    Y^T X c = lambda Y^T Y w.        (8)
Similar methods to the ones for PLS can be used to solve this problem. Using the correlation instead of the covariance seems at first glance a minor difference. Especially for the applications we consider in this paper it has major consequences: because of the normalization in the correlation, non-informative directions with small variance tend to be inflated.
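For experimentation, the latent-variable methods of this section can be tried through scikit-learn as a stand-in for the kernel implementations used in the text (a sketch with placeholder component counts and regularization value, not the authors' code):

from sklearn.cross_decomposition import PLSRegression, CCA
from sklearn.linear_model import Ridge

def fit_predict(model, X_train, Y_train, X_test):
    model.fit(X_train, Y_train)
    return model.predict(X_test)

# X_* hold centered 2D intensity vectors, Y_* the corresponding height vectors:
# pred_pls = fit_predict(PLSRegression(n_components=20), X_tr, Y_tr, X_te)
# pred_cca = fit_predict(CCA(n_components=4), X_tr, Y_tr, X_te)
# pred_rr  = fit_predict(Ridge(alpha=90.0), X_tr, Y_tr, X_te)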
Fig. 1. Alignment. The figure presents three pairs containing original and aligned data. An intensity example, its corresponding 3D shape and the mean 3D shape are shown from left to right. Note that 3D shapes are depicted as frontal Lambertian re-illuminations (albedo-free). The red lines around the eyes, nose and mouth area of the left-most image are the sixteen control points used for database alignment.
3 Experimental Evaluation

The face database used for training was provided by the Max-Planck Institute for Biological Cybernetics in Tuebingen, Germany [3]. This database was constructed using laser scans of heads of young adults, and provides head structure data in a cylindrical representation. For constructing the 3D-based models, we converted the cylindrical coordinates to Cartesian coordinates and solved for height values. We also had at our disposal the intensity maps for each 3D face. We used 50 training examples, while 43 out-of-training samples were set aside for shape recovery tests. A pre-processing alignment step was performed for each image in the databases, using a Thin Plate Spline (TPS) warping operation. Sixteen manually assigned control points were used around the eyes, nose and mouth area. Examples of the alignment process are shown in Figure 1. The figure presents three image pairs containing original and aligned data. An intensity example, its corresponding 3D shape and the mean 3D shape are shown from left to right. Note that 3D shapes are depicted as frontal Lambertian re-illuminations (albedo-free). The red lines around the eyes, nose and mouth area of the left-most image are the sixteen control points used to align the database. The outcome of the alignment procedure becomes more evident by a visual inspection of the mean 3D shapes, since the contour surrounding facial features appears more defined for the aligned example. Once the 3D shape of an out-of-training example is recovered, the inverse TPS warping operation is performed. All the experiments in this section present results corresponding to un-aligned recovered data.

For each out-of-training example, we calculated the quadratic error (1/n) Sum_{i=1}^{n} (h_gt(i) - h_rec(i))^2, where h_gt(i) and h_rec(i) are the ground truth and recovered height values at the i-th pixel and n is the total number of pixels in the image. Figure 2 presents error boxplot diagrams for the different methods. For KPCR, KPLS and KCCA, the error is shown as a function of the number of latent variables used. For KRR, the error is shown as a function of the value of the regularization parameter lambda. The width of the boxes indicates the degree of dispersion and skewness in the data (excluding outliers). Note how the first 20 latent variables suffice for KPLS to reach a minimal error with a small box width, while KPCR needs at least 40 latent variables to achieve comparable results.
Fig. 2. Relative error boxplots. KPCR, KPLS and KCCA are shown as a function of the number of latent variables. KRR is shown as a function of the regularization parameter λ.
As expected, KPLS and KPCR obtain the same predictions when using all the latent variables. The figure also reveals that overfitting is not significant for these two methods. On the contrary, KCCA is clearly affected by overfitting, i.e., after a small number of latent variables is used, the error starts to increase noticeably. Another feature to note about KCCA is its sensitivity to outliers. For this reason, unlike for KPCR and KPLS, the full rank of latent variables is not shown in the KCCA boxplot. As far as KRR is concerned, the error is rather insensitive to the choice of λ; a value of λ = 90 seems to be optimal. Individual out-of-training results are shown in Figure 3. The aim of this figure is to bring the global results of Figure 2 down to the individual level. Profile plots for two different subjects are presented in the rows of the figure. Subject one was chosen to be close in shape to the mean shape, while subject two is an outlier. The ground truth is plotted with a thick solid line. The regressions for KPCR (49 latent variables), KPLS (20 latent variables), KCCA (4 latent variables) and KRR (λ = 90) are plotted with a thin solid line. For all the latent variable methods, the reconstructions using 1 latent variable are plotted with a dotted line. KRR with λ = 0.5 is also shown with a dotted line. A dashed line is used for KPCR (20 latent variables), KPLS (10 latent variables), KCCA (10 latent variables) and KRR (λ = 700).
Fig. 3. Individual examples analysis. Two different out of training surface recovery cases are shown in the rows of the figure. Profile lines are used for the purposes of shape comparison. The ground truth is plotted with a thick solid line. The regressions for KPCR (49 latent variables), KPLS (20 latent variables), KCCA (4 latent variables) and KRR (λ = 90) are plotted with a thin solid line. For all the latent variable methods, the reconstructions using 1 latent variable are plotted with a dotted line. KRR with λ = 0.5 is also shown with a dotted line. A dashed line is used for KPCR (20 latent variables), KPLS (10 latent variables), KCCA (10 latent variables) and KRR (λ = 700).
The first feature to note from the figure is the gradual contribution of the latent variables in KPCR as opposed to the focused contribution in KPLS, i.e., the first twenty latent variables in KPLS seem to approximate the facial shape with similar accuracy as KPCR using the full rank of latent variables. This is due to the latent variable estimation procedure in KPLS, which looks for maximal covariance between projections in the shape and intensity subspaces. This suggests that the hidden relation between these subspaces may be modeled by identifying axes of coupled energy. Note that using the full rank of latent variables for KPLS and KPCR leads to exactly the same predictions. Although overfitting is not significant for these methods, using all of the latent variables does not seem to help in approximating regions such as the forehead and the chin. Another feature to note from the figure is that KCCA encounters difficulties in estimating shape, especially for the region around the mouth, where fine surface details (small variabilities) are likely to occur. Instead, it focuses on approximating the shape of the forehead. Also, the dashed line in the KCCA diagrams confirms the undesired effect of increasing the number of latent variables. Note how the three approaches present a clear difference in the way their latent variables explain the relationship between the intensity and shape subspaces.
Fig. 4. Surface recovery comparison. The figure shows results for KPLS (20 latent variables), KCCA (4 latent variables) and KRR (λ = 90). A red dashed line corresponding to the ground truth profile is also shown along each surface recovery for the purposes of visual comparison.
Note how the results obtained with KRR are similar to those obtained using the full rank of latent variables in KPCR and KPLS. Also, the line attached to λ = 90 seems to be located between those of λ = 0.5 and λ = 700, especially for subject 1. The value λ = 0.5 seems to approximate the shape of subject 1 with the best accuracy, while a value of λ = 90 seems to favor shape prediction for subject 2. Finally, it is worth commenting that the shape prediction for subject 2 slightly benefits from a value of λ = 90, which provides some evidence of an improvement of the out-of-training prediction error by KRR. A different perspective for visual comparison is provided in Figure 4, where profile views of the predicted surfaces are shown for subjects 1 and 2. The aim of the figure is to offer a visual idea of the appearance of the surface. The figure shows the ground truth along with the predicted surfaces for KPLS (20 latent variables), KCCA (4 latent variables) and KRR (λ = 90). A red dashed line corresponding to the ground truth profile is also provided with each predicted surface for the purposes of visual comparison. Although KCCA seems to be in agreement with the ground truth profile line, a visual inspection of the appearance of the surface reveals clear departures from the ground truth, especially for subject two. On the contrary, a visual inspection of the 3D surfaces delivered by KPLS and KRR shows similarity with the ground truth. Interestingly, although the difference between the KPLS and KRR regressions is hardly noticeable, the proximity to the red dashed line (ground truth) reveals subtle differences, i.e., the KRR regression appears to be slightly more adapted to the ground truth profile.
Fig. 5. Lambertian appearance and alignment analysis. The figure shows three pairs of images. All surfaces here are shown as frontal Lambertian re-illuminations (i.e. albedo free). Input intensity images and ground truth, aligned and non aligned results obtained with KPLS and KCCA are shown through the different pairs.
Fig. 6. Lambertian appearance for additional out-of-training examples. The figure shows several input images together with their corresponding Lambertian appearance for the ground truth and for the results obtained through KPLS, KRR and KCCA.
In Figure 5, the Lambertian appearance (frontal albedo-free re-illuminations) of the recovered surfaces for subjects 1 and 2 is shown. The figure also presents results for non-aligned out-of-training cases. Although the difference between the KRR and KPLS predictions is noticeable after a detailed visual analysis at the surface level, the difference in Lambertian appearance is hardly perceptible. For this reason, the regressions obtained from KRR are not included in this figure. Let us now focus on the visual
comparison of the results obtained from aligned data. Supporting the results of Figure 4, the visual agreement between the ground truth and the KPLS regression is plausible. Note how KPLS seems to approximate a smooth version of the ground truth. The Lambertian appearance of the surfaces predicted by KCCA, on the other hand, reveals a less plausible similarity. For surface prediction from non-aligned data, both KPLS and KCCA clearly benefit from the alignment step. Nonetheless, KCCA appears more sensitive to alignment errors. To extend this analysis, additional Lambertian-appearance out-of-training recovery examples are shown in Figure 6, using KPLS, KCCA and KRR; KPLS and KRR are not shown together for the different subjects for the reasons explained above. A visual inspection of the figure reveals a good approximation using both KPLS and KRR, i.e., the variability among the ground truth faces is preserved in these predictions. For the predictions obtained through KCCA, again, the face shape similarity is not as strong as for the other methods.
4 Conclusions

We have presented an experimental evaluation of four subspace MLR methods for the problem of face shape prediction from a single image. Two novel approaches have been included in this analysis: KPCR and KRR. Among all the methods, KCCA appears to be the least suitable approach for modeling the problem. KPLS shows advantages over KPCR, as a considerably smaller number of latent variables is required to achieve comparable accuracy. The maximum-covariance criterion imposed by KPLS appears to be the most convenient for latent variable construction. The predictions obtained by KRR, on the other hand, are comparable in accuracy to those of KPLS. This is interesting considering that each approach attempts to tackle a different issue, i.e., KRR focuses on handling collinearity while KPLS focuses on dimensionality reduction. This suggests that either criterion suffices for obtaining good approximations, provided that the calibration data have been previously aligned. KPLS, nonetheless, offers the possibility to look inside the "black box" of the linear predictor. This may be used to understand the contributions of the different latent variables in the process of face shape recovery. Alternatively, the latent variables may be used for the purposes of 3D face recognition and classification using information provided by intensity images. As far as KRR is concerned, it has proved to offer a simple and efficient way to approximate facial shape without the need to construct latent variables between subspaces. A good idea would be to explore the outcome of robust regularization for these purposes. A disadvantage of both kinds of methods is the need to find an optimal number of latent variables or an optimal regularization parameter.
Acknowledgements This work has been supported by Consejo Nacional de Ciencia y Tecnología under Project Conacyt Ciencia Básica 61593.
References
1. Ahmed, A., Farag, A.: A New Statistical Model Combining Shape and Spherical Harmonics Illumination for Face Reconstruction. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Paragios, N., Tanveer, S.-M., Ju, T., Liu, Z., Coquillart, S., Cruz-Neira, C., Müller, T., Malzbender, T. (eds.) ISVC 2007, Part I. LNCS, vol. 4841, pp. 531–541. Springer, Heidelberg (2007)
2. Atick, J., Griffin, P., Redlich, N.: Statistical approach to shape from shading: Reconstruction of three-dimensional face surfaces from single two-dimensional images. Neural Computation 8, 1321–1340 (1996)
3. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: Proc. SIGGRAPH, pp. 187–194 (1999)
4. Blanz, V., Vetter, T.: Face recognition based on fitting a 3d morphable model. IEEE Trans. Pattern Anal. Mach. Intell. 25(9), 1063–1074 (2003)
5. Castelán, M., Smith, W., Hancock, E.: A coupled statistical model for face shape recovery from brightness images. IEEE Transactions on Image Processing 16(4), 1139–1151 (2007)
6. Castelán, M., Van Horebeek, J.: 3D face shape approximation from intensities using Partial Least Squares. In: Proc. IEEE CVPRW, pp. 1–6 (2008)
7. Cootes, T., Edwards, G., Taylor, C.: Active appearance models. In: Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1407, pp. 484–498. Springer, Heidelberg (1998)
8. Frank, I., Friedman, J.: A statistical view of some chemometrics regression tools. Technometrics 25(2), 109–135 (1993)
9. Geladi, P., Kowalski, B.: Partial least squares regression: a tutorial. Anal. Chim. Acta 185, 1–17 (1986)
10. Hoegaerts, L., Suykens, J.A.K., Vandewalle, J., De Moor, B.: Kernel PLS variants for regression. In: Proc. of the 11th European Symposium on Artificial Neural Networks, pp. 203–208 (2003)
11. Hotelling, H.: Relations between two sets of variates. Biometrika 8, 321–377 (1936)
12. Horn, B., Brooks, M.: Shape from Shading. MIT Press, Cambridge (1989)
13. Kemelmacher, I., Basri, R.: Molding Face Shapes by Example. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 277–288. Springer, Heidelberg (2006)
14. Sirovich, L., Kirby, M.: Low-dimensional Procedure for the Characterization of Human Faces. Journal of the Optical Society of America 4, 519–524 (1987)
15. Lei, Z., Bai, Q., He, R., Li, S.Z.: Face Shape Recovery from a Single Image Using CCA Mapping between Tensor Spaces. In: Proc. IEEE CVPR, pp. 1–7 (2008)
16. Li, A., Shan, S., Chen, X., Chai, X., Gao, W.: Recovering 3D facial shape via coupled 2D/3D space learning. In: Proc. IEEE International Conference on Automatic Face and Gesture Recognition, pp. 1–6 (2008)
17. Reiter, M., Donner, R., Langs, G., Bischof, H.: 3d and Infrared Face Reconstruction from RGB Data Using Canonical Correlation Analysis. In: Proc. IEEE ICPR (2006)
18. Smith, W.A.P., Hancock, E.R.: Recovering Facial Shape Using a Statistical Model of Surface Normal Direction. IEEE Trans. Pattern Anal. Mach. Intell. 28(12), 1914–1930 (2006)
19. Smith, W.A.P., Hancock, E.R.: Facial shape-from-shading and recognition using principal geodesic analysis and robust statistics. International Journal of Computer Vision 76(1), 71–91 (2008)
20. Worthington, P.L., Hancock, E.R.: New constraints on data-closeness and needle map consistency for shape-from-shading. IEEE Trans. on Pattern Analysis and Machine Intelligence 21(12), 1250–1267 (1999)
21. Zheng, Y., Wang, Z.: Robust depth estimation for efficient 3D face reconstruction. In: Proc. IEEE ICIP, pp. 1516–1519 (2008)
PedVed: Pseudo Euclidian Distances for Video Events Detection Md. Haidar Sharif and Chabane Djeraba University of Sciences and Technologies of Lille (USTL), France {md-haidar.sharif,chabane.djeraba}@lifl.fr
Abstract. This paper presents a new method that automatically generates pseudo Euclidian distances (PED) from trigonometric treatments of motion history blobs (MHB), obtained from motion history images (MHI), to extract efficient image features pertinent to video events detection (VED). Given a point with its direction of motion, where the point coincides with the center of a circle, how far can the point virtually travel inside the circle along that direction? That virtual distance is called the pseudo Euclidian distance. PED, which could potentially be used in a wide variety of computer vision applications, remains the main contribution of this paper. To show the interest of PED, we propose a PED-based methodology for VED and demonstrate detection results for some events of TRECVID'08 1 on real videos.
1 Introduction
Event detection in video surveillance is an important task for both private and public places. As the huge amount of video surveillance data makes it exhausting for people to keep watching and finding anomalous events, an automatic surveillance system is strongly needed for detecting suspicious events. A video event is defined as an observable action or change of state in a video stream that would be important for security management. Events may vary greatly in duration, from two frames to longer events that can exceed the bounds of the excerpt. In crowded environments, e.g., airports, malls, etc., objects merge and occlude each other very frequently; as a result, conventional background subtraction methods do not work as effectively. Many single-frame detection algorithms based on transfer cascades [1,2] or recognition [3,4,5] have demonstrated a high degree of promise for pedestrian detection in real-world busy scenes with occlusion. To detect pedestrians, a histogram of gradients was used in [3], while the authors in [4,5] used biologically inspired models for recognizing different classes including pedestrians. However, most of these pedestrian detection algorithms are too slow for real-time applications. For example, the authors in [6] noted that state-of-the-art algorithms for pedestrian detection, e.g., [3], take around 0.5 seconds for recognition of a 128×64 image frame, [4] takes 2 seconds/frame, and [5] takes about 80 seconds/frame. A target detection and tracking algorithm based
Surveillance Event Detection Pilot :: http://www-nlpir.nist.gov/projects/trecvid/
on the measurements of a stereo audio and cycloptic vision sensor has been presented in [7]. Using a supervised Support Vector Machine method, the authors in [8] proposed an approach which makes a step toward generic and automatic detection of unusual events in terms of velocity and acceleration. There are some works [9,10,11,12] which estimate crowd density. The applied methods are based on textures and motion-area ratios and provide an interesting analysis for crowd surveillance, but do not detect events explicitly. To detect events in TRECVID'08, many algorithms have been proposed, e.g., based on: change detection [13], trajectory analysis [14], trajectory and domain knowledge [15], spatio-temporal video cubes [16], Haar-based pedestrian detection and histogram matching [6], optical flow concepts [17,18,19], etc. Yet, the vast diversity of a single event viewed from different view angles, different scales, different degrees of partial occlusion, etc., challenges the performance of event detectors; hence it is necessary to greatly improve their effectiveness by further investigation. We propose a methodology based on pseudo Euclidian distances (PED) for video events detection (VED). To extract image features, optical flow estimation would be a good choice for crowd scenes, but it is too sensitive to small noise because of the broadness of the camera view. If there are many people in the videos, the presence of small motion noise becomes extremely negative and unreliable, so estimating the movement of objects by optical flow is difficult. Considering this fact, we rely on motion history images (MHI), motion history blobs (MHB), and trigonometric treatments of the MHB, which generate the PED used to extract efficient image features pertinent to the events of interest. The MHI is a representation of the history of pixel-wise changes, yet remains a computationally inexpensive method for analyzing object motions, since effectively only the previous frame needs to be stored. We segment the MHI to grasp the essential sequence of motion components, the objects of interest (OoI) or MHB, which are then tracked using PED. The generation and usage of PED are the unique contribution of our current investigation. There are several state-of-the-art algorithms for tracking OoI, e.g., particle filtering [20], hybrid strategies [21], etc. Since occlusions happen frequently in a limited camera scope, particle filtering may achieve a commendable performance. But particle filtering is a time-consuming process, especially when the tracked object is large, and it is difficult to complete the test on the evaluation data within the limited time. Hence, we take up PED for MHB tracking, with the final aim of detecting different kinds of video events. The rest of this paper is organized as follows: Section 2 delineates the steps of the pseudo Euclidian distances (PED) calculation; Section 3 summarizes the specific video events detection (VED) methodology; Section 4 reports the experimental results; and Section 5 concludes the work with a few clues for further investigation.
2 Steps of Pseudo Euclidian Distances (PED) Calculation

2.1 Extraction of Motion History Blobs (MHB)
The strength of motion history images (MHI) is that, although they represent the history of pixel-wise changes, only the previous frame needs to be
stored. It is easy to implement and adds little computational cost to a real-time system. In a motion history image, H_τ(x, y, t), pixel intensity is a function of the temporal history of motion at that point. The earlier formulation of the MHI described in [22] was based on frames rather than time; currently, a simple replacement and duration operator based on time-stamping is used [23]:

$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } \psi(x, y, t) = 1 \\ \max\big(0,\; H_\tau(x, y, t-1) - \delta\big) & \text{otherwise} \end{cases} \qquad (1)$$

where x, y, and t denote position and time; τ is the current time-stamp; δ is the maximum time duration constant (e.g., a few seconds) associated with the template; and ψ(x, y, t) = 1 signals object presence or motion in the current video image. ψ(x, y, t) can be computed from background subtraction, frame differencing, optical flow, edges, etc. The use of time-stamps allows for a more consistent port of the system between platforms whose speeds may differ: system time is consistent during processing even where the frame rate is not, so time is explicitly encoded in the motion template. Eq. 1 indicates that MHI pixels where motion occurs are set to the current time-stamp τ, while pixels where motion happened long ago are cleared. The update function is called each time a new image is received and the corresponding silhouette image is formed. The result is a scalar-valued image in which more recently moving pixels are brighter; we only deal with those brighter parts (regions of motion components), which we call motion history blobs (MHB). We compute the absolute difference between two frames, threshold it, and apply the update function of Eq. 1 to the thresholded frame to obtain the MHB.
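As a concrete illustration of Eq. 1, the following minimal Python/NumPy sketch (not the authors' implementation) maintains a time-stamped MHI from simple frame differencing; the frame size, the differencing threshold, and the τ/δ values are assumptions chosen only for the example.

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau, delta):
    """Time-stamped MHI update, Eq. (1): pixels where motion occurs are set to the
    current timestamp tau; elsewhere the stored value decays by delta (floored at 0)."""
    decayed = np.maximum(0.0, mhi - delta)
    return np.where(motion_mask, tau, decayed)

def motion_mask_from_frames(prev_gray, curr_gray, diff_threshold=30):
    """psi(x, y, t) via simple frame differencing; the threshold is an assumed example value."""
    return np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16)) > diff_threshold

# Example with synthetic 480x640 frames; timestamps are in seconds.
mhi = np.zeros((480, 640), dtype=np.float32)
prev = np.zeros((480, 640), dtype=np.uint8)
curr = prev.copy()
curr[200:260, 300:340] = 255                       # a moving blob
mhi = update_mhi(mhi, motion_mask_from_frames(prev, curr), tau=1.0, delta=0.5)
mhb_mask = mhi >= (1.0 - 0.5)                      # brighter (recent) pixels form the MHB
```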
Fig. 1. (a): camera view; (b): blue regions are the current silhouettes (motion masks) or motion history blobs (MHB); (c): view after suppression of the small MHB from (b); red arrows point towards the global motion orientations of the remaining motion components
Since the motion history image encodes in a single image the temporal nature of the motion over some time interval, motion segmentation should be easier than in methods that attempt to segment and propagate motion between frames. Figure 1 (a) and (b) depict a snapshot of the original image and the current motion silhouettes or MHB, respectively. To obtain the sequence of motion components, it is important to segment the motion regions produced by the movement of parts or the
whole of the object of interest. For each motion component we dynamically select a rectangular region of interest (RoI) and take the center of the rectangle. We then count the number of silhouette points within the RoI so that components with little motion can be detected and neglected under an empirically chosen threshold Th (say 50 points). After this filtering, we enclose each remaining blue motion region by a circle of fixed unit radius (the green circles in Fig. 1 (c)) centered at the respective rectangle. The global motion orientation Φ of each motion component (the red arrows in Fig. 1 (c)) is calculated as described in the following subsection. Finally, each motion component or motion history blob has an explicit center (e.g., P(x0, y0) in Fig. 2) and a global motion orientation Φ. After the trigonometric treatment of the circle of each motion history blob, the position and angle information (P(x0, y0), Φ) is used to generate the pseudo Euclidian distances (PED). In this work, PED are measured in pixel lengths, which play the key role in tracking the motion components (MHB) across video frames.
2.2 Global Motion Orientation Φ Estimation
After suppressing insignificant (small) motion components from the MHI, we calculate the global motion orientation Φ of each remaining component as in [24]:

$$\Phi = 2\pi - \Phi_{ref} - \frac{\sum_{x,y} angDiff\big(\Phi_{con}(x,y), \Phi_{ref}\big) \times N\big(\tau, \delta, H_\tau(x,y,t)\big)}{\sum_{x,y} N\big(\tau, \delta, H_\tau(x,y,t)\big)} \qquad (2)$$

where 2π accommodates the adjustment for images with a top-left origin; Φ_ref is the base reference angle (the peak value in the histogram of orientations); Φ_con(x, y) is the motion orientation map obtained from gradient convolutions (e.g., standard 3×3 Sobel gradient masks); N(τ, δ, H_τ(x, y, t)) is a normalized motion history image value (the motion history image linearly normalized to 0-1 using the current time-stamp τ and duration δ); and angDiff(Φ_con(x, y), Φ_ref) is the minimum signed angular difference of an orientation from the reference angle.
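The orientation estimate of Eq. 2 can be sketched as follows; this is only an illustrative approximation in which np.gradient stands in for the 3×3 Sobel convolutions, and the 36-bin histogram used to pick Φ_ref is an assumed design choice, not the authors' exact procedure.

```python
import numpy as np

def ang_diff(a, b):
    """Minimum signed angular difference a - b, wrapped to (-pi, pi]."""
    return (a - b + np.pi) % (2 * np.pi) - np.pi

def global_orientation(mhi, tau, delta):
    """Rough sketch of Eq. (2) for one motion component (assumed parameterization)."""
    gy, gx = np.gradient(mhi.astype(np.float64))           # stand-in for 3x3 Sobel masks
    phi_con = np.arctan2(gy, gx) % (2 * np.pi)             # motion orientation map
    weight = np.clip((mhi - (tau - delta)) / delta, 0, 1)  # normalized MHI, N(tau, delta, H)
    valid = weight > 0
    if not np.any(valid):
        return 0.0
    hist, edges = np.histogram(phi_con[valid], bins=36, range=(0, 2 * np.pi))
    phi_ref = edges[np.argmax(hist)]                       # base reference angle (histogram peak)
    num = np.sum(ang_diff(phi_con[valid], phi_ref) * weight[valid])
    den = np.sum(weight[valid])
    return (2 * np.pi - (phi_ref + num / den)) % (2 * np.pi)
```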
2.3 Calculation of Pseudo Euclidian Distances (PED)
We take into account four motion directions relative to the previous motion direction, namely forward (+), backward (−), upward (around +π/2), and downward (around −π/2), as shown in Figure 2, which illustrates a situation where the center of a motion history blob P(x0, y0) (the center of the circle/ellipse) moves in the forward direction with respect to its previous motion, e.g., from point P(x0, y0) towards R(x, y), i.e., $\vec{PR} = \vec{PQ} + \vec{QR}$. It is then evident that:

$$\Phi = \tan^{-1}\frac{y - y_0}{x - x_0} \;\Rightarrow\; y = y_0 + (x - x_0)\tan\Phi. \qquad (3)$$
Although we deal only with circles, the global motion orientation Φ is illustrated inside an ellipse for better presentation, since a circle is a special case of an ellipse. An ellipse is the locus of points for which the sum of the distances to two fixed points, called the foci, is constant.
Fig. 2. Global motion orientation Φ of a motion history blob shown inside an ellipse. A circle is a special case of an ellipse in which the two foci coincide at the center of the ellipse.
Fig. 3. A simple example of PED calculation for the movement of the center of the circle of the motion history blob of a person from (150,100) to (640,100), with global motion orientation variations of about 15° and a constant velocity of 10 pixels per frame. λ− exhibits concave-up (convex-cup) behaviour; λ+ exhibits concave-down (convex-cap) behaviour.
Assuming that the ellipse is centered at (0,0) and the foci are located at (±c, 0), its standard equation is:

$$\frac{x^2}{a^2} + \frac{y^2}{a^2 - c^2} = 1 \qquad (4)$$
where a and √(a² − c²) are the semi-major and semi-minor axes, respectively. The foci always lie on the semi-major axis, spaced equally on each side of the center of the ellipse. If the lengths of the semi-major axis a and the semi-minor axis √(a² − c²) are identical, i.e., c = 0, then both foci coincide at the center of the ellipse and Eq. 4 becomes the equation of a circle, x² + y² = a², where a is the radius of the circle. The area enclosed by the circle is π times the radius squared, a². Since we consider unit area, i.e., πa² = 1, Eq. 4 can be rewritten as x² + y² = 1/π. Substituting y from Eq. 3 into this circle equation yields the quadratic equation:

$$(1 + \tan^2\Phi)\,x^2 + 2\tan\Phi\,(y_0 - x_0\tan\Phi)\,x + (y_0 - x_0\tan\Phi)^2 - \frac{1}{\pi} = 0. \qquad (5)$$
On solving Eq. 5, we get two solutions or roots, x⁺ and x⁻:

$$x^{+} = \frac{-\tan\Phi\,(y_0 - x_0\tan\Phi) + \sqrt{\big(\tan\Phi\,(y_0 - x_0\tan\Phi)\big)^2 - (1 + \tan^2\Phi)\big((y_0 - x_0\tan\Phi)^2 - \frac{1}{\pi}\big)}}{1 + \tan^2\Phi} \qquad (6)$$

$$x^{-} = \frac{-\tan\Phi\,(y_0 - x_0\tan\Phi) - \sqrt{\big(\tan\Phi\,(y_0 - x_0\tan\Phi)\big)^2 - (1 + \tan^2\Phi)\big((y_0 - x_0\tan\Phi)^2 - \frac{1}{\pi}\big)}}{1 + \tan^2\Phi} \qquad (7)$$
and their corresponding y components y⁺ and y⁻ are as follows:

$$y^{+} = y_0 + (x^{+} - x_0)\tan\Phi, \qquad y^{-} = y_0 + (x^{-} - x_0)\tan\Phi. \qquad (8)$$
There are three variables, x0, y0, and Φ, in Eqs. 6-8, of which Φ can be calculated directly using Eq. 2. The values x0 and y0 are derived from the position of the moving component; since we work at unit scale, using the pixel coordinates directly would introduce severe errors, so they are normalized. Assuming that the frame size is f_x × f_y pixels (e.g., 640 × 480), workable values of x0 and y0 are obtained as:

$$N_{xy} = \frac{\frac{x}{f_x} + \frac{y}{f_y}}{2}, \qquad x_0 = \frac{2N_{xy} - 1}{\sqrt{\pi}}, \qquad y_0 = \sqrt{\frac{1}{\pi} - x_0^2}\,\big(1 - 2N_{xy}\big) \qquad (9)$$

where N_xy is a pseudo number between 0 and 1 generated from the x and y coordinates of any point on the frame. Taking into account the position (x0, y0) and the two points (x⁺, y⁺) and (x⁻, y⁻), it is easy to compute their respective pseudo Euclidian distances (PED), denoted λ⁺ and λ⁻:

$$\lambda^{+} = \sqrt{(x_0 - x^{+})^2 + (y_0 - y^{+})^2}, \qquad \lambda^{-} = \sqrt{(x_0 - x^{-})^2 + (y_0 - y^{-})^2}. \qquad (10)$$
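Under the reconstruction of Eqs. 5-10 above, a small helper can compute λ⁺ and λ⁻ from a blob center and its orientation. This is only a sketch: the quadratic is solved with the standard root formula, which is algebraically equivalent to Eqs. 6-7, and the frame size used in the example is an assumption.

```python
import numpy as np

def pseudo_euclidian_distances(x, y, phi, fx=640, fy=480):
    """PED (Eqs. 5-10) for a blob centered at pixel (x, y) with global orientation phi."""
    n_xy = (x / fx + y / fy) / 2.0                       # Eq. (9): pseudo number in [0, 1]
    x0 = (2.0 * n_xy - 1.0) / np.sqrt(np.pi)
    y0 = np.sqrt(max(1.0 / np.pi - x0 ** 2, 0.0)) * (1.0 - 2.0 * n_xy)
    t = np.tan(phi)
    a = 1.0 + t ** 2                                     # quadratic of Eq. (5)
    b = 2.0 * t * (y0 - x0 * t)
    c = (y0 - x0 * t) ** 2 - 1.0 / np.pi
    disc = np.sqrt(max(b ** 2 - 4.0 * a * c, 0.0))
    x_plus, x_minus = (-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)   # Eqs. (6)-(7)
    y_plus, y_minus = y0 + (x_plus - x0) * t, y0 + (x_minus - x0) * t    # Eq. (8)
    lam_plus = np.hypot(x0 - x_plus, y0 - y_plus)        # Eq. (10)
    lam_minus = np.hypot(x0 - x_minus, y0 - y_minus)
    return lam_plus, lam_minus

# Running example inspired by Fig. 3: a blob center at (150, 100) with phi of about 15 degrees.
print(pseudo_euclidian_distances(150, 100, np.deg2rad(15)))
```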
Let us give a simple example of a PED calculation. Figure 3 depicts the PED values obtained when the center of the motion history blob of a person moves from position (150,100) to (640,100) with global motion direction variations of about 15° and a constant velocity of 10 pixels per frame. Once the PED are available, many routines can use this raw information for analysis or recognition; for instance, video events can be detected using PED together with a few simple routines, as described in the following subsections. We expect that a future, more detailed investigation of PED would place it alongside estimation and prediction algorithms such as local and/or global optical flow techniques, Kalman filters, and particle filters.
3 Video Events Detection (VED)
We wish to use PED for VED. For this aim, we need explicit information about the motion history blobs, which is gained by tracking the objects of interest; from this we obtain further information that is used for specific VED, e.g., PersonRuns, OpposingFlow, PeopleMeet, Embrace, PeopleSplitUp, and ObjectPut.
3.1 Motion History Blobs (MHB) Tracking
Assume that the radius a is of unit length and consists of S pixels. The PED in Eq. 10 can then be expressed in terms of pixel lengths λ⁺_pixel and λ⁻_pixel, respectively:

$$\lambda^{+}_{pixel} = \text{number of pixels to pass on } \lambda^{+} = S \cdot \lambda^{+} \qquad (11)$$

$$\lambda^{-}_{pixel} = \text{number of pixels to pass on } \lambda^{-} = S \cdot \lambda^{-} \qquad (12)$$
which serve as the judgement index for tracking MHB in the following algorithm.

Algorithm [M: total number of circles in any frame f; N: total number of circles in frame f+1; m: circle counter in frame f; n: circle counter in frame f+1]

1. begin
2. if N = 0 then exit
3. initialization: m = 1, n = 1
4. if m ≤ M
   4.1 then
      4.1.1 if n ≤ N
         (i) Calculate the Euclidean distance d(C_f^m, C_{f+1}^n) between the two circle centers C_f^m and C_{f+1}^n in frames f and f+1 and store it
         (ii) Increase n by 1
         (iii) Repeat step 4.1.1
   4.2 else
      4.2.1 Select the minimum distance d_min, caused by the two centers minC_f^m and minC_{f+1}^n with angles Φ_f and Φ_{f+1} respectively, and estimate its normalized pixel value T_pixel:

         $$d_{min} = d\big(minC_f^m,\, minC_{f+1}^n\big) = \min_{k=1\ldots N} d\big(C_f^m, C_{f+1}^k\big) \qquad (13)$$

         $$T_{pixel} = S \cdot \frac{1}{2}\left(1 + \frac{2}{\sqrt{\pi}}\sum_{k=0}^{\infty}\frac{(-1)^k \{d_{min}\}^{2k+1}}{k!\,(2k+1)}\right) \qquad (14)$$

      4.2.2 Select λ⁺_pixel_f or λ⁻_pixel_f with respect to the previous direction of movement: if the direction is the same, use λ⁺_pixel_f, otherwise use λ⁻_pixel_f, and save the current motion direction
      4.2.3 The area of the circle with radius (λ⁺_pixel_f + T_pixel) (leading edge of the convex cap) or (λ⁺_pixel_{f+1} + T_pixel) (falling edge of the convex cap) will be greater than or equal to that caused by (λ⁺_pixel_{f+1} − T_pixel) or (λ⁺_pixel_f − T_pixel), explicitly:

         $$\pi\big(\lambda^{+}_{pixel_f} + T_{pixel}\big)^2 \ge \pi\big(\lambda^{+}_{pixel_{f+1}} - T_{pixel}\big)^2 \quad \text{or} \quad \pi\big(\lambda^{+}_{pixel_{f+1}} + T_{pixel}\big)^2 \ge \pi\big(\lambda^{+}_{pixel_f} - T_{pixel}\big)^2.$$

         If there exists:

         $$T_{pixel} \ge \frac{\lambda^{+}_{pixel_{f+1}} - \lambda^{+}_{pixel_f}}{2} \;\;\&\;\; T_{pixel} \ge \frac{\lambda^{+}_{pixel_f} - \lambda^{+}_{pixel_{f+1}}}{2} \qquad (15)$$

         [in 4.2.3, λ⁻_pixel_f has not been considered, for simplicity]

         4.2.3.1 then
            (i) A new motion of the motion history blob has been detected
            (ii) Assign minC_f^m as completely converged to minC_{f+1}^n
            (iii) If there is an occlusion, i.e., (T_pixel < 3) & (λ⁺_pixel_f = λ⁺_pixel_{f+1}), then choose a reasonable range of the same orientation for each motion history blob and, after the occlusion, compare its new orientation with the previous orientations
         4.2.3.2 else
            The motion history blob has insignificant motion or is out of the frame
      4.2.4 Disregard the circle centers minC_f^m and minC_{f+1}^n
      4.2.5 Decrease both M and N by 1
      4.2.6 Increase m by 1 and set n = 1
      4.2.7 Repeat step 4
5. end
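The nearest-center matching at the heart of steps 4-4.2.1 can be sketched as below. This is a simplified greedy pairing plus the closed form of Eq. 14 (the series is the error function); it deliberately omits the direction and occlusion logic of steps 4.2.2-4.2.3, and the example coordinates are made up.

```python
import numpy as np
from math import erf

def t_pixel(d_min, s):
    """Eq. (14) in closed form: S * (1 + erf(d_min)) / 2."""
    return s * 0.5 * (1.0 + erf(d_min))

def match_blobs(centers_f, centers_f1):
    """Greedy sketch of the tracking loop: pair each circle center in frame f with its
    nearest unmatched center in frame f+1 (distances as in Eq. (13))."""
    centers_f1 = list(centers_f1)
    pairs = []
    for cf in centers_f:
        if not centers_f1:
            break
        dists = [np.hypot(cf[0] - c[0], cf[1] - c[1]) for c in centers_f1]
        k = int(np.argmin(dists))
        pairs.append((cf, centers_f1.pop(k), dists[k]))
    return pairs

print(match_blobs([(10, 10), (50, 60)], [(12, 11), (48, 58), (200, 200)]))
```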
When this explicit information about the motion history blobs is available, the algorithm can easily be adapted to different kinds of video event detection, e.g., PersonRuns, OpposingFlow, PeopleMeet, Embrace, PeopleSplitUp, and ObjectPut.
3.2 PersonRuns (PR)
We set three empirically chosen T values as thresholds in Eq. 15. With a single T value, one problem is that people near the camera generate large motion whereas people far from the camera cannot, even when they move very quickly (e.g., when running). To obtain an acceptable distribution of the motion flow pattern, people near and far from the camera should be treated fairly. We therefore use three T values: T1 for the region adjacent to the camera (d1), T2 for the middle region (d2), and T3 for the region far from the camera (d3), with T1 > T2 > T3. If the depth of the observed region is d, then d is divided into three empirically chosen distances d1, d2, and d3 with d1 > d2 > d3; if the camera is fixed, this division is easy to establish. If the direction variation between two circles satisfies |Φ_f − Φ_{f+1}| ≤ π/4 and Eq. 15 holds each time, the event is judged to be PR. A minimal sketch of this region-dependent threshold choice is given below.
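In the sketch, the three T values and the use of the image row as a stand-in for distance from the camera are assumptions chosen only for illustration; the paper does not specify these numbers.

```python
def threshold_for_row(y, frame_height, t_near=40.0, t_mid=25.0, t_far=12.0):
    """Pick T1 > T2 > T3 depending on how close to the (fixed) camera a blob lies;
    the image row y stands in for depth and the T values are made-up examples."""
    if y > 2 * frame_height / 3:      # lower third of the image: region d1, closest to the camera
        return t_near
    if y > frame_height / 3:          # middle third: region d2
        return t_mid
    return t_far                      # upper third: region d3, farthest from the camera
```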
Fig. 4. Output of the PersonRuns event detector: true positives (all images of the first and second rows); false positives (left two images of the third row), which occur when a PR event is reported although in truth there is none; false negatives or failures (the remaining images)
3.3 OpposingFlow (OF)
The algorithm can easily be adapted to detect a person opposing the general flow of the scene, even without a predefined opposing-flow direction. The general direction of the scene is calculated by considering the forward or backward motion of the objects of interest over some period in a defined region (e.g., a door entry/exit). Once the scene direction is defined, any forward or backward motion opposing it constitutes an OF event.
3.4 PeopleMeet (PM)
We assume that people are apart from each other before meeting and keep a minimum distance dm between them while meeting. Two situations may then occur: crossing or meeting. The relative distance dr between two persons is larger than dm when they first appear, decreases towards dm, and eventually falls below dm while their relative orientations stay in a reasonable range. Under these conditions, if one or both persons stop (little or no motion) within dm, a PM event occurs; otherwise a crossing occurs, which causes false positives.
3.5 Embrace (Em)
This event is close to PM, so we assume that an Em event happens immediately after PM. After detecting PM, the meeting region is enclosed by a circle
with an approximate radius of dm, and dr is computed again within this region. An Em event is detected when dr and the relative orientation are below the given thresholds.
3.6 PeopleSplitUp (PS)
We consider that a PS event happens some time after a PM event has been detected, when one or more persons separate from a group (leave the circle). Each crowd center is computed and updated in consecutive frames to detect whether a person decides to leave the corresponding crowd circle. If the relative distance between the person and the crowd center becomes larger than dm, a PS event is said to have occurred. The vast majority of false positives are caused by cluttered backgrounds, occlusions between people, and complex interactions among the people involved.
3.7 ObjectPut (OP)
The OP event is characterized by downward motion over several frames. The downward motion, which is stored over a period of frames, may have a variable direction between −5π/12 and −7π/12 across those frames. The approach does not consider as a positive detection any event that differs from downward motion (e.g., throwing a bottle into a dustbin). Because it relies only on downward motion, it can recognize someone sitting down as a false positive OP.
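A one-function sketch of the downward-direction test used above; the orientation convention (radians, wrapped to (−π, π]) is an assumption.

```python
import numpy as np

def is_downward(phi):
    """True when a motion orientation phi (radians) lies in the downward band
    between -5*pi/12 and -7*pi/12 used for the ObjectPut detector."""
    phi = np.arctan2(np.sin(phi), np.cos(phi))   # wrap to (-pi, pi]
    return -7 * np.pi / 12 <= phi <= -5 * np.pi / 12
```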
4 Experimental Results
A wide variety in the appearance of the event types makes the event detection task in the surveillance video selected for TRECVID'08 extremely difficult. The TRECVID'08 source data comprise about 100 hours (10 days × 2 hours per day × 5 cameras) of surveillance video from Gatwick Airport, and a number of events were defined for this task. All the videos are taken from surveillance cameras, so the camera positions are fixed and cannot be changed. It was not practical for us to analyze all 100 hours of video, so we processed only some hours of it. The results obtained with our methodology, together with the ground-truth events of those videos, are reported in Table 1. The detection of PR and OF events was quite reliable, albeit with false positives, and is likely somewhat better than the result of [19], where 45% of PR events were successfully detected. Fig. 4 depicts some output of the PR event detector. Its false positives were produced when luggage wagons passed through the camera's active regions. The PR detector was also unable to detect some of the events in Fig. 4, mainly because they took place far from the camera, so the amount of motion components was insufficient to exceed the threshold Th. The PM and PS detectors achieved average acceptance, whereas the Em and OP detectors performed well below expectations. The challenges that limit the performance of the event detectors include the wide variety in the appearance of event types under different view angles, different scales, and different degrees of partial occlusion.
Table 1. Achievement appraisal of the output of the detectors

Metrics                                   |  PR  |  OF  |  PM  |  Em  |  PS  |  OP
Number of ground truth events (gt)        |  55  |   8  |  35  |  14  |  35  |  39
Number of false negative events (fn)      |  23  |   2  |  19  |  11  |  20  |  26
Number of false positive events (fp)      |  17  |   3  |  12  |   5  |  13  |  12
Number of true positive events (tp)       |  32  |   6  |  16  |   3  |  15  |  13
Sensitivity = tp/(tp + fn) = tp/gt        | 0.58 | 0.75 | 0.45 | 0.21 | 0.42 | 0.33
Precision rate = tp/(tp + fp)             | 0.65 | 0.66 | 0.57 | 0.37 | 0.53 | 0.52
5 Conclusions
We have presented a new method that automatically generates pseudo Euclidian distances (PED) from the trigonometric treatment of motion history blobs (MHB), aimed at the detection of different kinds of video events (VED). A pseudo Euclidian distance is defined as the distance virtually traveled by a moving point inside a circle, along its direction of motion, when it coincides with the center of the circle. PED is the main contribution of this paper and could be used in a wide variety of computer vision applications. To show the interest of PED, we proposed a PED-based methodology for VED. Results on the detection of several TRECVID'08 events in real videos have been demonstrated: some show the robustness of the methodology, while the rest reflect the magnitude of the difficulty of the problem at hand. The TRECVID'08 surveillance event detection task is a demanding test of the applicability of such methodologies in a real-world setting. Nevertheless, we believe we have gained much insight into the practical problems, and future PED-based development of more effective VED methodologies has the potential to produce better results.
Acknowledgements Thanks to the MIAUCE project, EU Research Programme (IST-2005-5-033715).
References 1. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR (2001) 2. Lienhart, R., Maydt, J.: An extended set of haar-like features for rapid object detection. In: ICIP (2002) 3. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
4. Serre, T., Wolf, L., Poggio, T.: Object recognition with features inspired by visual cortex. In: CVPR (2005) 5. Bileschi, S., Wolf, L.: Image representations beyond histograms of gradients: The role of gestalt descriptors. In: CVPR (2007) 6. Yarlagadda, P., Demirkus, M., Garg, K., Guler, S.: Intuvision event detection system for trecvid 2008. In: Intuvision at TRECVID (2008) 7. Zhou, H., Taj, M., Cavallaro, A.: Target detection and tracking with heterogeneous sensors. IEEE Journal on Selected Topics in Signal Processing 2, 503–513 (2008) 8. Ivanov, I., Dufaux, F., Ha, T.M., Ebrahimi, T.: Towards generic detection of unusual events in video surveillance. In: IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS (2009) 9. Marana, A., Velastin, S., Costa, L., Lotufo, R.: Estimation of crowd density using image processing. Image Processing for Security Applications (Digest No.: 1997/074), IEE Colloquium, 11/1–11/8 (1997) 10. Rahmalan, H., Nixon, M.S., Carter, J.N.: On crowd density estimation for surveillance. In: International Conference on Crime Detection and Prevention (2006) 11. Lin, S.F., Chen, J.Y., Chao, H.X.: Estimation of number of people in crowded scenes using perspective transformation. IEEE Transactions Systems on Man and Cybernetics, Part A 31, 645–654 (2001) 12. Ma, R., Li, L., Huang, W., Tian, Q.: On pixel count based crowd density estimation for visual surveillance. In: IEEE Conference on Cybernetics and Intelligent Systems, vol. 1, pp. 170–173 (2004) 13. Yokoi, K., Nakai, H., Sato, T.: Surveillance event detection task. In: Toshiba at TRECVID (2008) 14. Lee, S.C., Huang, C., Nevatia, R.: Definition, detection, and evaluation of meeting events in airport surveillance videos. In: USC at TRECVID (2008) 15. Guo, J., Liu, A., Song, Y., Chen, Z., Pang, L., Xie, H., Zhang, L.: Trecvid 2008 event detection. In: MCG-ICT-CAS at TRECVID (2008) 16. Hauptmann, A., Baron, R.V., Chen, M.Y., Christel, M., Lin, W.H., Mummert, L., Schlosser, S., Sun, X., Valdes, V., Yang, J.: Informedia @ trecvid 2008: Exploring new frontiers. In: CMU at TRECVID (2008) 17. Hao, S., Yoshizawa, Y., Yamasaki, K., Shinoda, K., Furui, S.: Tokyo Tech at TRECVID (2008) 18. Kawai, Y., Takahashi, M., Sano, M., Fujii, M.: High-level feature extraction and surveillance event detection. In: NHK STRL at TRECVID (2008) 19. Orhan, O.B., Hochreiter, J., Poock, J., Chen, Q., Chabra, A., Shah, M.: Content based copy detection and surveillance event detection. In: UCF at TRECVID (2008) 20. Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking. IEEE Transactions on Signal Processing 50, 174–188 (2002) 21. Cavallaro, A., Steiger, O., Ebrahimi, T.: Tracking video objects in cluttered background. IEEE Transactions on Circuits and Systems for Video Technology 15, 575–584 (2005) 22. Davis, J.W., Bobick, A.F.: The representation and recognition of human movement using temporal templates. In: CVPR, pp. 928–934 (1997) 23. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. TPAMI 23, 257–267 (2001) 24. Davis, J., Bradski, G.: Real-time motion template gradients using intel cvlib. In: IEEE ICCV Workshop on Framerate Vision (1999)
Two Algorithms for Measuring Human Breathing Rate Automatically Tomas Lampo, Javier Sierra, and Carolina Chang Grupo de Inteligencia Artificial, Universidad Simón Bolívar, Venezuela {tomas,javier,cchang}@gia.usb.ve
Abstract. This paper presents two new algorithms for measuring human breathing rate automatically: a Binary Algorithm and a Histogram Cost Algorithm. These algorithms analyze frames from a thermal video of a person breathing and then estimate the person’s breathing rate. Our Binary Algorithm reduces grayscale images into pure black and white (binary) images. Our Histogram Cost Algorithm enhances the differences on normalized histograms by assigning a larger cost to darker pixels. We tested our algorithms on 26 human subjects and results show that the Binary Algorithm’s total percentage error is 19.50%, while the Histogram Cost Algorithm’s total percentage error is 4.88%. These algorithms work in real time, presenting constantly updated measurements of the breathing rate. They are also resistant to small movements and work under several environment conditions, which makes them suitable for measuring victims’ breathing rate in Urban Search and Rescue Situations, as well as patients in Medical Situations.
1 Introduction
Urban Search and Rescue (USAR) is the response to the collapse of human made structures. In these situations, rescuers are trying to save as many lives as possible, by entering a collapsed structure through the debris and looking for injured victims inside. This task is very dangerous, and many rescuers have lost their lives while doing so. According to Dr. Robin Murphy [1], a victim's mortality rate exponentially increases after 48 hours, so rescuers need to prioritize victims that are going to be rescued according to their general health status, in order to save their lives and protect their own. This status is checked by measuring some of the vital signs: body temperature, pulse (or heart rate), blood pressure and respiratory rate, among others. While pulse is the most accurate and commonly used vital sign, it is difficult to measure without touching the victim. A touchless measurement of this vital sign has been implemented before [2], but it needs to have a clear vision of the carotid artery with a sensitive thermal camera. In USAR situations, the victim may be found in a position that does not facilitate finding the carotid artery without changing positions, or the available thermal camera (as was our case) may not be sensitive enough to easily identify the carotid artery.
A person’s respiratory rate, also known as breathing rate, is equal to the number of breathing cycles drawn per minute. One breathing cycle corresponds to inhaling and exhaling once. A healthy adult’s breathing rate is between 12 to 20 cycles per minute. A study conducted by McFadden, Price, Eastwood, and Briggs [3] shows that the respiratory rate in elderly patients is a valuable physical sign, for it allows doctors to diagnose illnesses and infections when a patient’s breathing rate is not in a given range. Doctors measure the breathing rate on their patients with a stethoscope, by listening to their breathing and counting the number of breaths drawn per minute. This measurement is inaccurate and tends to make patients uncomfortable. Also, if the doctor makes a mistake calculating the breathing rate, it may lead to a wrong diagnosis. Rescuers, instead, tend to measure the breathing rate without touching the victim by watching the chest and counting the number of times it moves per minute (thoracic movement), but this measurement is also inaccurate and if incorrect, it may lead to wrong decisions, which may in turn lead to the death of the victim. For these important reasons, we introduce the possibility of measuring a person’s breathing rate automatically with a computer, presenting rescuers and doctors with a constantly updated value that should help them make the right decision. For this paper, we have studied the works conducted by Fei and Pavlidis[4,2] and Murthy[5] for detecting and measuring human breathing. These studies, however, need to guarantee a certain set of conditions in order to work. These conditions include a climate controlled room and healthy subjects sitting in a comfortable chair, facing towards the camera, exactly two meters away from it. In USAR, victims may not be found in comfortable positions and may have collapsed lungs or other breathing difficulties. Carlson and Murphy [6] state that USAR operations tend to be in non–structured environments, so the room temperature cannot be controlled. Also, since the equipment must be quickly deployed, cameras may be positioned at any distance in front of the subject. In this study we emphasize the need to create algorithms that are not biased by any other condition than the mere existence of a thermal camera connected to a computer running the software for detecting and measuring breathing rate.
2 Detecting and Measuring Human Breathing Rate with a Thermal Camera
This study was conducted using an uncooled Indigo FLIR camera. An uncooled FLIR camera works at room temperature. This camera measures the heat profile of bodies and displays a grayscale image to represent it, by displaying the highest temperature (warm bodies) with white pixels and the lowest temperature (cold bodies) with black pixels. In order to start measuring a person’s breathing rate, the thermal camera must be placed in a way that the nostrils can be clearly recognized in the grayscale video. The process for recognizing a breathing cycle is quite simple: when the person inhales, the air taken in is cold (at room temperature) and when this
(a) Inhaling
(b) Exhaling
Fig. 1. Frames from video of subject inhaling and exhaling. Notice the dark colored pixels in the nostrils when the subject is inhaling. The nostrils are filled with light colored pixels (almost matching the skin) when the subject is exhaling.
person exhales, the air taken out is warm (at body temperature). Therefore, on the grayscale thermal video the nostrils would be filled with dark colored pixels when the victim is inhaling and then they would be filled with light colored pixels when the victim is exhaling. This can be seen clearly in Figure 1. The subject in the figure is inhaling in subfigure 1(a) and exhaling in subfigure 1(b). Since these algorithms could be used in USAR operations and every victim’s life could be in danger, we cannot rely on Artificial Intelligence or any other automatic tracking method for finding and following the nostrils. These methods could fail when trying to locate the nostrils and instead select another feature from the image, which may lead to an incorrect measurement and a misguided decision that could cost the person’s life, so a human operator must instead select a rectangle (or area of interest) on the video over the victim’s nostrils and then start the measuring process on the software, to make sure that the software is working with the right input. The software should then estimate the person’s breathing rate and present a constantly updated value in real time. There are no methods to safeguard that the area of interest being measured is actually a nose, so if the operator decides to study a random area with heat fluctuation, it may be detected as a breathing pattern by the system.
3 Proposed Algorithms
All the algorithms we have designed analyze frames from the rectangle drawn on the thermal video (Figure 2 shows 6 cropped rectangles from the subject shown before on Figure 1. These rectangles will be used as a running example throughout this paper, for comparing our proposed algorithms.) and try to determine if the victim is inhaling or exhaling in that frame. With this information, we proceed to count the number of breathing cycles per minute and present a constantly updated value. The main idea in these methods of measurement is to be able to differentiate inhaling from exhaling in real time. This is done by taking advantage of the fluctuation on the intensity of the pixels around the nostrils. However, due to the nature of USAR situations, some extra measures have to be taken to deal with difficulties such as movement, background alteration and camera recalibrations, among others.
Fig. 2. Cropped rectangles from 3 breathing cycles. The subject is inhaling in frames 5, 30 and 55 and exhaling in frames 15, 40 and 70.
3.1 Binary Algorithm (BA)
As was stated before, the thermal video is in grayscale. According to Shapiro [7] the most recommended technique when analyzing grayscale images with computer vision, would be working with the image’s histogram. With this recommendation, we then proceeded to design an algorithm that would reduce grayscale images into binary (pure black and white) images using Otsu’s Method[8], allowing it to clearly separate inhaling from exhaling and calculate the breathing rate. When transforming a grayscale image into a binary image, we need to find a threshold that is sensitive enough to separate inhaling from exhaling. Therefore, this algorithm proceeds to compute the Otsu threshold of every frame for 10 seconds and then select the minimum computed value, which will then be used to binarize the remaining frames. Then the number of black pixels is computed for every binary image. When plotted, the number of black pixels per frame in time would look like a wave, where peaks appear when the person inhales and valleys appear when the person exhales. Once the peaks and valleys have been differentiated, we can calculate the number of breathing cycles per minute. It is important to understand that once the threshold is defined it can’t be altered. Any modifications to the threshold once the measurement has started will lead to incorrect results, since the black pixel count will be altered for all the remaining frames. In order for the algorithm to be able to calculate the peaks and valleys correctly it is necessary to divide the process in two main phases: Calibration Phase and Measurement Phase. Calibration Phase. The first few seconds of measurement are used to determine a proper threshold for the binarization process. In this work the Otsu method [8] was used on frames captured in this interval and the lowest value returned by these calculations is set to be the global threshold. After the calibration time is over, the threshold is set and remains constant until the next measurement process. The threshold is set to the lowest value since this is the most restrictive condition we have to separate the near black pixels from the nostrils from all the other pixels in the frame. In many cases there is noise in the frames such as facial hair or cold patches of skin, and the selection of a low threshold helps filter these features on most frames.
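A minimal sketch of this calibration step, assuming OpenCV is available for the Otsu threshold; the synthetic frames and the 30-frame calibration window only stand in for the cropped nostril region over the first seconds of video.

```python
import cv2
import numpy as np

def calibrate_threshold(gray_frames):
    """Calibration phase of the Binary Algorithm: the global threshold is the minimum
    Otsu threshold observed over the calibration frames."""
    thresholds = []
    for frame in gray_frames:
        t, _ = cv2.threshold(frame, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        thresholds.append(t)
    return min(thresholds)

# Example with random 8-bit frames standing in for the cropped nostril region.
frames = [np.random.randint(0, 256, (60, 80), dtype=np.uint8) for _ in range(30)]
T = calibrate_threshold(frames)
```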
Fig. 3. Binary images for running example of 3 breathing cycles. The subject is inhaling in frames 5, 30 and 55 and exhaling in frames 15, 40 and 70.
Fig. 4. Generated wave with BA for running example of 3 breathing cycles. Peaks appear when the subject inhales, while valleys when the subject exhales.
Measurement Phase. Once the threshold is defined, frames are transformed into their binary form. These forms tend to be completely white when the subject is exhaling and show some black spots when the subject is inhaling. However, some black spots may appear when the subject is exhaling, but these spots should be smaller in size than the inhaling markers. The definition of the binarization is expressed in the equation 1, where I is the pixel matrix of the original image and T is the newly defined threshold. 0 if I(i, j) < T N ewImage(i, j) = . (1) 255 if I(i, j) ≥ T Once the image is in its binary form, as shown in Figure 3, the black pixels are counted and a wave is generated (see Figure 4). The average value from all the black pixel counts is used to differentiate if the subject is inhaling (the black pixel count is above the average) or exhaling (the black pixel count is below the average). Using an average to differentiate between the peaks and valleys allows the algorithm to withstand some small movements and recalibrations from the camera. Whenever a recalibration makes an entire image darker, the selected threshold
should be able to deal with most of the changes, while the averaging of the values would help differentiate the peaks and valleys on the wave with little error.

Strengths and Weaknesses. This algorithm fails to measure the breathing rhythm correctly when the subject presents sudden motions; however this doesn't represent much of a shortcoming in most cases, since it is fairly safe to state that a moving victim in USAR situations is probably alive. This algorithm works well with subjects who are relatively still. This comes as an advantage for USAR situations since we would expect victims to be still even while they are conscious. Patients, on the other hand, could be asked to keep still while they are being measured. The algorithm also manages automatic recalibrations made by the thermal camera and is quite fast. However, this method is inaccurate when measuring the breathing rate in subjects with a particular facial anatomy or with facial hair. In order to deal with these subjects, the Histogram Cost Algorithm was designed and implemented.
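The measurement phase of the BA can be summarized in the following sketch; the binarization follows Eq. 1 of this section, while the simple rising-edge cycle counter and the example counts are assumptions for illustration (the rate in cycles per minute is then the cycle count divided by the elapsed minutes).

```python
import numpy as np

def binarize(frame, T):
    """Eq. (1) of this section: pixels below the calibrated threshold T become black (0)."""
    return np.where(frame < T, 0, 255).astype(np.uint8)

def count_breathing_cycles(black_counts):
    """Classify each frame as inhaling (count above the running mean) or exhaling
    (below it) and count one cycle per exhale-to-inhale transition."""
    counts = np.asarray(black_counts, dtype=float)
    inhaling = counts > counts.mean()
    return int(np.sum(inhaling[1:] & ~inhaling[:-1]))

# black_counts would normally be [np.count_nonzero(binarize(f, T) == 0) for f in frames]
print(count_breathing_cycles([50, 800, 700, 40, 820, 60, 790, 30]))   # -> 3 cycles
```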
3.2 Histogram Cost Algorithm (HCA)
After designing the Binary Algorithm it became clear that the measurement of darker pixels is a viable strategy for working with breathing patterns. Taking this into account, it became necessary to design an algorithm that could consider the variations of intensity in the frame without actually having to detect or separate every single changing pixel. The HCA is based on the application of a cost function on the histogram of a frame. This function assigns different weights to the values of the histogram, giving a bigger contribution to the values closer to black and minimizing the effect of the values near white. By doing this an inhalation frame should have a much higher cost than an exhalation frame.

Histogram Cost Method. The cost function 2 was designed to enhance the cost of darker pixels, giving these a much higher weight:

$$cost(I) = \sum_{i \in H} \frac{256}{i + 1} \times H(i). \qquad (2)$$
Just like in the binary method, the plotting of the cost of a sequence of frames would look like a wave, and the average value of all the measurements can be used to determine if a subject is inhaling or exhaling. This method is more robust than the binary approach because it considers the contribution of every single pixel in the frame and not only the contribution of the pixels that fall below a given threshold. Nonetheless, this method tends to fail because the function needs the image to have the values distributed along the whole domain of the histogram. In order to achieve a better distribution of the values on the histogram it is necessary to normalize it. However, the normalization process cannot be done using the standard methods, because it would yield incorrect results. This
normalization process needs to keep track of the lowest and highest pixel value registered among all processed frames in the Calibration Phase. The final cost function applied to the histogram of each frame during the Measurement Phase is defined by Equation 3, which computes the cost of a normalized histogram between the values 0 and 255, where Nmin and Nmax are the minimum and maximum values registered in the set of histograms that have already been analyzed in the Calibration Phase. This equation yields a cost that clearly differentiates frames with several dark pixels from those that do not have them.
$$cost(H) = \sum_{i \in H} \frac{256}{\frac{N_{max} - N_{min}}{255}\,(i + 1) + N_{min}} \times H(i). \qquad (3)$$

When the histogram is normalized, we can guarantee that there will be an important difference between the cost of an inhalation frame and an exhalation frame, diminishing in this way the chances of failure of the algorithm. Just like the Binary Algorithm, this method requires going through a calibration phase and a measurement phase.

Calibration Phase. During this phase, the algorithm sets the parameters for the normalization of the histograms. The calculated parameters are the minimum and maximum values registered in the analyzed histograms. Once these parameters have been set, the average of the costs of some frames is calculated to determine an initial comparison point, which we will later use to determine if the subject is inhaling or exhaling. These values will remain constant during the measurement phase, until a new measuring process is started.

Measurement Phase. Equation 3 is used to calculate the cost of each captured frame, where Nmax and Nmin are the parameters calculated during the calibration phase. Just like the BA, the plotting of the costs calculated for each frame should look like a wave (see Figure 5), where peaks represent inhalations and valleys correspond to exhalations. The differentiation of the peaks and valleys is done through a comparison of the cost of the frame against the average value of all the previously calculated costs.

Strengths and Weaknesses. The HCA does seem to have a better overall behavior when compared to our BA, and is able to measure the breathing rate correctly on most subjects regardless of their facial anatomy and the presence of facial hair. However, the main weakness this method presents is its response to camera recalibrations. Whenever an automatic calibration is made by the equipment being used, there is a global change that, when summed up pixel by pixel, contributes a great deal to the cost of a frame. Since the differentiation of inhalations and exhalations is done through the comparison of the cost of a frame and the average of all the previous frames, it is easy to understand that a change on the whole frame will lead to incorrect measurements. However, this shortcoming of the algorithm can be corrected by using proper (non-autocalibrating) equipment.
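A sketch of the normalized cost function of Eqs. 2-3; the function and variable names are ours, and the random frames merely stand in for the cropped nostril region.

```python
import numpy as np

def calibrate_normalization(gray_frames):
    """Calibration phase: track the lowest and highest pixel values seen so far."""
    return min(int(f.min()) for f in gray_frames), max(int(f.max()) for f in gray_frames)

def histogram_cost(gray_frame, n_min, n_max):
    """Eq. (3): weight each 8-bit histogram bin so that darker bins contribute more;
    n_min and n_max come from the calibration phase."""
    hist, _ = np.histogram(gray_frame, bins=256, range=(0, 256))
    i = np.arange(256)
    denom = (n_max - n_min) / 255.0 * (i + 1) + n_min
    return float(np.sum(256.0 / denom * hist))

frames = [np.random.randint(20, 230, (60, 80), dtype=np.uint8) for _ in range(30)]
n_min, n_max = calibrate_normalization(frames)
costs = [histogram_cost(f, n_min, n_max) for f in frames]   # peaks ~ inhalation, valleys ~ exhalation
```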
Fig. 5. Wave generated by our HCA for 3 breathing cycles. Peaks appear when the subject inhales, while valleys when the subject exhales.
3.3 Algorithms' Output Comparison
Table 1 displays the partial values calculated with each of the two methods proposed for each frame shown in figure 2. It is evident that the calculated values describe a wave pattern that can be clearly divided by the average cost of all the frames.

Table 1. Costs for the frames shown for the running example in figure 2, using our BA and our HCA. The cost of inhaling frames is greater than the cost of exhaling frames, describing a wave pattern.

Frame Number  | HCA Cost  | BA Cost
5 (inhaling)  | 21407.58  | 673
15 (exhaling) | 17811.17  | 46
30 (inhaling) | 22230.09  | 797
40 (exhaling) | 17833.99  | 68
55 (inhaling) | 24999.83  | 817
70 (exhaling) | 17828.55  | 31

Fig. 6. Subset of subjects measured in experiments: (a) S1, (b) S4, (c) S11, (d) S18, (e) S24, (f) S25. Notice that these subjects appear in different positions and at different distances from the thermal camera. Some of them even have cold patches of skin and facial hair.
Table 1 shows that the difference between the cost of an inhalation and an exhalation is more pronounced when it is calculated with our HCA, than when it is computed with our BA.
4 Experiments
Our two algorithms were tested on videos of 26 volunteers with different facial anatomy, positioned at any distance from the camera and in different positions. The subjects were asked to keep still, but a few presented sudden movements during the measurement. Some of these subjects presented patches of cold skin on their faces, as well as facial hair. For these experiments, we made sure we had as many different situations as possible, in order to determine which algorithm is robust and precise enough to be used by rescuers and doctors.

Table 2. Experiments conducted on 26 subjects with the BA and HCA. vr corresponds to the real breathing rate while vm corresponds to the value returned by the algorithm. δ corresponds to the algorithm's percentage error.

Subject | vr | BA vm | BA δ    | HCA vm | HCA δ
S1      |  8 |   8   |  0.00 % |   8    |  0.00 %
S2      | 15 |  25   | 66.67 % |  17    | 13.33 %
S3      | 20 |  20   |  0.00 % |  20    |  0.00 %
S4      | 29 |  28   |  3.45 % |  28    |  3.45 %
S5      | 21 |  21   |  0.00 % |  21    |  0.00 %
S6      | 13 |  24   | 84.62 % |  13    |  0.00 %
S7      | 15 |  16   |  6.67 % |  15    |  0.00 %
S8      | 13 |  13   |  0.00 % |  13    |  0.00 %
S9      | 23 |  23   |  0.00 % |  21    |  8.70 %
S10     | 23 |  23   |  0.00 % |  23    |  0.00 %
S11     | 20 |  20   |  0.00 % |  20    |  0.00 %
S12     | 16 |  16   |  0.00 % |  16    |  0.00 %
S13     | 13 |  13   |  0.00 % |  15    | 15.38 %
S14     | 17 |  14   | 17.65 % |  15    | 11.76 %
S15     | 20 |  30   | 50.00 % |  20    |  0.00 %
S16     | 15 |  14   |  6.67 % |  16    |  6.67 %
S17     | 19 |  19   |  0.00 % |  20    |  5.26 %
S18     | 17 |  17   |  0.00 % |  17    |  0.00 %
S19     | 14 |  17   | 21.43 % |  14    |  0.00 %
S20     | 28 |  30   |  7.14 % |  28    |  0.00 %
S21     | 20 |  20   |  0.00 % |  20    |  0.00 %
S22     | 17 |  16   |  5.88 % |  17    |  0.00 %
S23     | 35 |  38   |  8.57 % |  35    |  0.00 %
S24     | 19 |  18   |  5.26 % |  19    |  0.00 %
S25     | 11 |   8   | 27.27 % |  11    |  0.00 %
S26     | 17 |  18   |  5.88 % |  19    | 11.76 %

(Comments recorded for individual subjects include difficult camera angles, cold patches of skin, facial hair, sudden motions, and cases that were difficult to measure visually.)
Table 2 compares each algorithm's calculated breathing rate to the victim's real breathing rate, counted visually. For each subject Si, we compare the real value and each algorithm's computed value, as well as each algorithm's percentage error δ, computed using Eq. 4:

$$\delta = \frac{|v_m - v_r|}{|v_r|} \times 100\% \qquad (4)$$

where vm corresponds to the measured value and vr corresponds to the real breath rate, counted visually. A frame from some of the videos used is shown in Figure 6. Since this tool for automatic breathing rate measurement plays an important role when it comes to making decisions about a victim's situation and calculating rescuing priorities, as well as diagnosing patients, measurements made by the software must be very precise or at least must calculate a number that is very close to the real breathing rate. The HCA calculates the correct breathing rate for 18 subjects, while the BA only calculates the correct breathing rate for 12 subjects. In terms of exactness the HCA behaves better than the BA, which leads us to believe that the HCA may be more suited for USAR and medical situations than the BA. In order to determine which method is the most suited for USAR and medical situations, we have calculated their total percentage error Δ using Equation 5:

$$\Delta = \sqrt{\frac{\sum_{i=0}^{n}(v_m - v_r)^2}{\sum_{i=0}^{n} v_r^2}} \times 100\% \qquad (5)$$
where vm corresponds to the measured values and vr corresponds to the real breath rates, counted visually. According to the values presented in Table 3, we can tell that the Histogram Cost Algorithm (HCA) seems to be the best suited for USAR and medical situations, because it has the least total percentage error and is the most exact of the two algorithms. Additionally, as can be seen in Table 2, the Binary Algorithm (BA) has a percentage error higher than 25% in some measurements, which is very inaccurate and unacceptable when dealing with life-or-death situations. This leads us to think that the BA may not be suited for USAR and medical situations. These experiments also show that the algorithms can measure a person's breathing rate correctly, even if this person presents sudden motion, has cold patches of skin or facial hair, or even if the camera was placed at any distance from their nostrils or in a difficult angle.

Table 3. Total percentage error for each method

  BA     | HCA
  19.50% | 4.88%
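Eqs. 4-5 are straightforward to compute; the sketch below uses the vr and HCA vm columns of Table 2 and reproduces the 4.88% figure reported in Table 3.

```python
import numpy as np

def percentage_error(vm, vr):
    """Eq. (4): per-subject error of a measured rate vm against the visually counted rate vr."""
    return abs(vm - vr) / abs(vr) * 100.0

def total_percentage_error(vm, vr):
    """Eq. (5): total percentage error over all subjects."""
    vm, vr = np.asarray(vm, dtype=float), np.asarray(vr, dtype=float)
    return np.sqrt(np.sum((vm - vr) ** 2) / np.sum(vr ** 2)) * 100.0

vr = [8, 15, 20, 29, 21, 13, 15, 13, 23, 23, 20, 16, 13, 17, 20, 15, 19, 17, 14, 28, 20, 17, 35, 19, 11, 17]
vm_hca = [8, 17, 20, 28, 21, 13, 15, 13, 21, 23, 20, 16, 15, 15, 20, 16, 20, 17, 14, 28, 20, 17, 35, 19, 11, 19]
print(round(total_percentage_error(vm_hca, vr), 2))   # ~4.88, as reported in Table 3
```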
5 Conclusions and Future Work
This work allowed us to design two new algorithms for automatically measuring a person’s breathing rate using a thermal camera and a computer. We presented two alternatives (a Binary Algorithm and a Histogram Cost Algorithm) and then proceeded to compare them. Experiments revealed that the Histogram Cost Algorithm seems to adapt better to USAR and medical situations, calculating a person’s breathing rate with the least total percentage error. If this algorithm was used in USAR situations, rescuers would have a tool that could help them make important decisions according to the victim’s general health status and then more lives could be saved. This tool could also help doctors diagnose patients according to their breathing rate [3]. Works conducted by Fei and Pavlidis [4,2] and Murthy [5] for detecting and measuring human breathing only work under very controlled environments and don’t seem to focus on subjects with cold patches of skin and facial hair. The algorithms we have designed are not biased by any condition. This allows subjects to have facial hair, cold patches of skin and even be in different positions. It also makes possible to measure a person’s breathing rate from any distance, as long as the nostrils show on the video. However, we have determined that our algorithms have some trouble dealing with repeated movement and sudden temperature changes in the video (a sudden temperature change, like the introduction of fire or ice in the image causes the camera to adjust its measuring range, which in turn drives our algorithms to fail). These algorithms have been designed to detect and measure breathing on still patients and unconscious USAR victims. If a victim is moving on a USAR situation, rescuers can easily conclude that the victim is alive. We encourage future works based in this paper to improve our algorithms so they can deal with these flaws. We also believe that using a calibrated thermal camera would improve our algorithms’ performance. Another possible improvement to their performance would be a robust feature detector that ensures framing the subject’s nostrils and is capable of tracking them, providing our algorithms with the correct input even if the subject presents sudden motions. During our research, we found that some subjects had patches of cold skin on their faces. These subjects, to the best of our knowledge, were healthy and had not been exposed to cold temperatures. This is an interesting phenomenon that could become the subject of future medical studies. When using these algorithms in USAR situations, we recommend rescuers combining software measurements with other methods for breathing rate calculation like thoracic movement and compare values, in case the software fails to approximate the real value. These algorithms should always be a complementary tool for doctors and rescuers, instead of replacing the existing methods for measuring breathing rates.
Acknowledgments We would like to thank Doctor Robin Murphy and CRASAR for lending us the equipment needed to conduct our experiments. We would also like to thank the Institute for Safety Security Technology and the National Science Foundation Grant EIA-022440, R4: Rescue Robots for Research and Response, for their support.
References 1. Murphy, R.R.: Human-robot interaction in rescue robotics. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 34, 138–153 (2004) 2. Sun, N., Pavlidis, I., Garbey, M., Fei, J.: Harvesting the thermal cardiac pulse signal. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4191, pp. 569–576. Springer, Heidelberg (2006) 3. McFadden, J.P., Price, R.C., Eastwood, H.D., Briggs, R.S.: Raised respiratory rate in elderly patients: a valuable physical sign. British Medical Journal (Clin. Res. Ed.) 284, 626–627 (1982) 4. Fei, J., Pavlidis, I.: Virtual thermistor. In: 29th Annual International Conference of the IEEE, Engineering in Medicine and Biology Society, 2007. EMBS 2007, pp. 250–253 (2007) 5. Murthy, R., Pavlidis, I., Tsiamyrtzis, P.: Touchless monitoring of breathing function. In: 26th Annual International Conference of the IEEE, Engineering in Medicine and Biology Society, 2004. IEMBS 2004, vol. 1, pp. 1196–1199 (2004) 6. Carlson, J., Murphy, R.R.: Reliability analysis of mobile robots. In: IEEE International Conference on Robotics and Automation, vol. 1, pp. 274–281 (2003) 7. Shapiro, L., Stockman, G.: Computer Vision. Prentice-Hall, Englewood Cliffs (2001) 8. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics 9, 62–66 (1979)
Biometric Recognition: When Is Evidence Fusion Advantageous? Hugo Proença Department of Computer Science IT - Instituto de Telecomunicações SOCIA - Soft Computing and Image Analysis Group University of Beira Interior, 6200-Covilhã, Portugal
[email protected]
Abstract. Having assessed the performance gains due to evidence fusion, previous works reported contradictory conclusions. For some, a consistent improvement is achieved, while others state that the fusion of a stronger and a weaker biometric expert tends to produce worst results than if the best expert was used individually. The main contribution of this paper is to assess when improvements in performance are actually achieved, regarding the individual performance of each expert. Starting from readily satisfied assumptions about the score distributions generated by a biometric system, we predict the performance of each of the individual experts and of the fused system. Then, we conclude about the performance gains in fusing evidence from multiple sources. Also, we parameterize an empirically obtained relationship between the individual performance of the fused experts that contributes to decide whether evidence fusion techniques are advantageous or not.
1 Introduction

Private and governmental entities are paying growing attention to biometrics and nationwide systems are starting to be deployed. Pattern recognition (PR) systems have never dealt with such sensitive information at these high scales, which motivated significant efforts to increase accuracy, comfort, scale and performance. Currently deployed systems achieve remarkably low error rates (e.g., Daugman's iris recognition system [1]) at the expense of constrained data acquisition setups and protocols, which is a major constraint regarding their dissemination. Most biometric systems use a single trait for recognition (e.g., fingerprints, face, voice, iris, retina, ear or palm-print) and are called unimodal. These systems have high probability of being affected by noisy data, non-universality, lack of distinctiveness and spoof attacks [2]. Multimodal systems make use of more than one source to perform recognition and are an attempt to alleviate these problems. Used sources may be different recognition strategies from the same data or from different sensors, and from a unique or multiple traits. Here, fusion can occur at any stage of the PR process: (1) at the data acquisition level; (2) at the match score level, if the scores generated by each feature comparison strategy are used; (3) at the decision level, if the output of each PR system is used to generate the final response. Fusion at the early stages is believed to
be more effective [3], essentially due to the amount of available information. However, it is more difficult to achieve in practice, due to usual incompatibilities between feature sets. At the other extreme, fusion at the decision level is considered too rigid, due to the limited amount of available information. Fusion at the match score level is seen as a trade-off: it is relatively easy to perform and appropriately combines the scores generated by different modalities. The idea of fusing scores to perform biometric recognition is widely described in the literature. Ross and Jain [4] reported a significant improvement in performance when using the sum rule. Wang et al. [5] used the similarity scores of a face and an iris recognition module to generate 2D feature vectors that are redirected as inputs of a neural network classifier. Duca et al. [6] framed the problem according to Bayes theory and estimated the biases of individual expert opinions, which were used to calibrate and fuse scores into a final decision. Brunelli and Falavigna [7] used the face and voice for identification, and Hong and Jain [8] associated different confidence measures with the individual matchers when integrating the face and fingerprint traits. These works reported significant improvements in performance due to evidence fusion and did not point out any constraint about the individual performance of each fused expert. However, as stated by Daugman [9], fusing different scores may not be good for all situations. Although the combination of tests enables decisions based on more information, if a stronger test is combined with a weaker one, the resulting decision environment is in some sense averaged, and the combined performance will lie somewhere between that of the two tests conducted individually. Accordingly, Poh and Bengio [10] analyzed four typical scenarios encountered in biometric recognition, mainly concerned with the issues of score correlation and variance, having concluded that fusing is not always beneficial. The main purpose of this study is to contribute to the decision about when evidence fusion should actually be used, by predicting the performance of the fused classifier and comparing it with the corresponding value of the best expert used in fusion. To do so, we simulate the scores generated by each expert, assuming that they are unimodal, independent and identically-distributed (i.i.d.) for the intra-class and inter-class comparisons and can be approximated by normal distributions. These assumptions may be readily satisfied and are a reasonable practice in multimodal biometrics research. Everywhere in this paper, the term "performance" refers to the accuracy performance of the biometric experts. We will restrict our study to the fusion of two biometric experts, although the results could be extended to the fusion of multiple biometrics by induction, as pointed out by Hong et al. [11]. The remainder of this paper is organized as follows: Section 2 briefly summarizes the most usual information fusion techniques. Section 3 describes our empirical framework and presents and discusses the results. Finally, Section 4 concludes.
2 Evidence Fusion
Kittler et al. [17] developed a theoretical framework for combining multiple experts and derived the most usual classifier combination schemes, such as the product, sum, min, max and median rules.
Let R be the number of biometric experts B_i operating in the environment, such that i = {1, . . . , R}. Let Z be an input pattern that is to be assigned to one of m classes w_1, . . . , w_m. For our purposes, the value of m was set to 2, which corresponds to a trivial verification system (w_0 = intra-class comparison, w_1 = inter-class comparison). Let x_i be the biometric signature (encoded from Z) that is presented to the i-th biometric expert and generates a corresponding dissimilarity score d_i for each enrolled template. Before fusing scores it is necessary to perform normalization, so that none of the base experts dominates the decision. Without any assumption about the prior probabilities, the approximation of the posterior probability that x_i belongs to class w_j is given by:

P(w_j \mid x_i) = \frac{P(x_i \mid w_j)}{\sum_{s=1}^{R} P(x_i \mid w_s)}    (1)
where P(x_i | w_j) denotes the probability density function of the j-th class, estimated from the d_i values observed in a training set.

Product Rule: Assuming statistical independence of the x_i values, the input pattern is assigned to class w_0 iff:

\prod_{i=1}^{R} P(w_0 \mid x_i) > \prod_{i=1}^{R} P(w_1 \mid x_i)
Sum Rule: Apart from the assumption of statistical independence of the x_i values, this rule also assumes that the posterior probabilities delivered by each expert do not deviate much from the corresponding prior probabilities. The input pattern is assigned to class w_0 iff:

\sum_{i=1}^{R} P(w_0 \mid x_i) > \sum_{i=1}^{R} P(w_1 \mid x_i)
Max Rule: This rule approximates the sum of the posterior probabilities by the maximum value. As in the previous rules, statistical independence of the x_i values is assumed. The input pattern is assigned to class w_0 iff:

\max_i P(w_0 \mid x_i) > \max_i P(w_1 \mid x_i)
Min Rule: Similarly to the previous rule, statistical independence of the x_i values is assumed. The input pattern is assigned to class w_0 iff:

\min_i P(w_0 \mid x_i) > \min_i P(w_1 \mid x_i)
Expert Weighting: As proposed by Snelick et al. [18], an intuitive idea is to assign different weights to each individual expert, hoping to increase the role played by the stronger ones. In this work, the weight e_i associated with each expert was assigned according to its d-prime value d'_i, proportionally to the values of the remaining experts. The input pattern is assigned to class w_0 iff:

\sum_{i=1}^{R} e_i\, P(w_0 \mid x_i) > \sum_{i=1}^{R} e_i\, P(w_1 \mid x_i), \quad \text{s.t.} \quad e_i = \frac{d'_i}{\sum_{j=1}^{R} d'_j}
Dempster-Shafer Theory: This approach is based on belief functions [19] and combines different pieces of evidence into a single value that approximates the probability of an event. Let X denote our frame of discernment, composed uniquely of two states: the assignment of the input pattern to class w_0 or w_1. The power set P(X) contains all possible subsets of X: {∅, {w_0}, {w_1}, {w_0, w_1}}. Assigning null beliefs to the ∅ and {w_0, w_1} states, the mass of w_0 is given by m(w_0) = max{0, 1 − F_0 − F_1}, where F_0 and F_1 are the cumulative distribution functions of classes w_0 and w_1, and m(w_1) = 1 − m(w_0), so that \sum_{A \in P(X)} m(A) = 1. The combination of two masses is given by:

m_{1,2}(A) = \frac{\sum_{B \cap C = A} m_1(B)\, m_2(C)}{1 - \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)}    (2)
where m1 and m2 are the masses of individual experts.
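To make these combination schemes concrete, the following minimal Python sketch (an illustration under our own naming, not the author's implementation) normalizes each expert's two class-conditional densities into posteriors as in Eq. (1) for the two-class case and applies the product, sum, max, min and weighted rules; the Dempster-Shafer combination is omitted for brevity. The density values and d-prime weights in the example are hypothetical.

import numpy as np

def posteriors(likelihoods):
    # likelihoods: array of shape (R, 2) holding P(x_i | w_0), P(x_i | w_1) for each expert
    return likelihoods / likelihoods.sum(axis=1, keepdims=True)

def fuse(likelihoods, d_primes):
    p = posteriors(np.asarray(likelihoods, dtype=float))   # posteriors, shape (R, 2)
    w = np.asarray(d_primes, dtype=float)
    w = w / w.sum()                                         # weights e_i = d'_i / sum_j d'_j
    supports = {
        'product':  p.prod(axis=0),
        'sum':      p.sum(axis=0),
        'max':      p.max(axis=0),
        'min':      p.min(axis=0),
        'weighted': (w[:, None] * p).sum(axis=0),
    }
    # each rule picks the class with the larger fused support (0 = w_0, 1 = w_1)
    return {rule: int(np.argmax(s)) for rule, s in supports.items()}

# hypothetical example: two experts, class-conditional densities evaluated at their scores
print(fuse([[0.80, 0.10], [0.30, 0.25]], d_primes=[7.60, 1.75]))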
3 Experiments and Discussion
Our empirical framework comprises v virtual subjects. Let these be denoted by P = {p_1, . . . , p_v}. For simplicity, we assume that (1) all subjects appear with identical frequency; (2) no other subjects attempt to be recognized; and (3) data is properly acquired by all the biometric devices operating in the environment. Operating in the identification mode, each of the R samples acquired in a recognition attempt is matched against all the enrolled templates, performing a total of 1 intra-class and v − 1 inter-class comparisons for each expert. Thus, a recognition attempts give a total of a intra-class and (v − 1) × a inter-class dissimilarity scores for each expert. We denote these sets respectively by X = {X_1, . . . , X_a} and Y = {Y_1, . . . , Y_{(v−1)×a}}. We consider that X and Y are drawn from populations with distributions F_I and F_E, such that F_I(x) = Prob(X ≤ x) and F_E(x) = Prob(Y ≤ x). Also, multiple comparisons provide unimodal i.i.d. dissimilarity scores that follow the normal distribution. An estimate F̂_I(x) of F_I(x) at some x > 0 is given by:

\hat{F}_I(x) = \frac{1}{a} \sum_{i=1}^{a} I_{\{X_i \le x\}}    (3)
where I_{\{\cdot\}} denotes the characteristic function. As suggested by Bolle et al. [20], the law of large numbers guarantees that F̂_I(x) is distributed according to a normal distribution N(F̂_I(x), σ(x)). An estimate of the standard deviation is given by \hat{\sigma}(x) = \sqrt{\hat{F}_I(x)\,(1 - \hat{F}_I(x))/a}, and confidence intervals can be found with percentiles of the normal distribution. For all our results, 99% confidence intervals were chosen, given by \hat{F}_I(x) \pm 2.326\,\hat{\sigma}(x). The procedure is similar for F̂_E(x). In our experiments, we used v = 10 000 and a = 20 000. The dissimilarity scores generated by each biometric expert were simulated through a pseudo-random generator of normally distributed numbers (Ziggurat method [21]), according to the corresponding parameters of the expert and type of comparison (intra-class and inter-class). Using 32-bit integers to store data, this method guarantees a period for the overall generator of about 2^{64}, which is more than enough for the purposes of this work. Also, we simulated different levels of correlation between the scores generated by experts and analyzed the corresponding effect on fusion. The Pearson product moment correlation ρ(X, Y) measures the linear dependence between variables, yielding a value between 1 (maximal correlation) and 0 (independence):

\rho(X, Y) = \frac{1}{n} \sum_{k=1}^{n} \left( \frac{X_k - \mu_X}{\sigma_X} \right) \left( \frac{Y_k - \mu_Y}{\sigma_Y} \right)    (4)
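As an illustration of the estimation step above, the short Python sketch below draws normally distributed intra-class scores (numpy's default generator stands in for the Ziggurat sampler), evaluates the empirical distribution function of Eq. (3) on a grid, and attaches the 99% normal-approximation confidence band; the use of the sample size a in the denominator of the standard-deviation estimate is our assumption, and all parameter values are illustrative.

import numpy as np

def empirical_cdf_with_ci(scores, x, z=2.326):
    # F_hat(x) = (1/a) * number of scores X_i <= x, as in Eq. (3)
    a = scores.size
    f_hat = np.mean(scores[:, None] <= x[None, :], axis=0)
    # assumption: sigma_hat(x) = sqrt(F_hat (1 - F_hat) / a)
    sigma_hat = np.sqrt(f_hat * (1.0 - f_hat) / a)
    return f_hat, f_hat - z * sigma_hat, f_hat + z * sigma_hat

rng = np.random.default_rng(0)
intra = rng.normal(0.25, 0.08, size=20000)      # intra-class scores of a "Medium"-like expert
grid = np.linspace(0.0, 0.6, 7)
f, lo, hi = empirical_cdf_with_ci(intra, grid)
print(np.round(f, 3), np.round(lo, 3), np.round(hi, 3))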
As suggested by Daugman [12], the d-prime value (d′) appropriately quantifies the decidability of a biometric system, informing about the typical separation between the dissimilarity scores generated for intra-class and inter-class comparisons:

d' = \frac{|\mu_E - \mu_I|}{\sqrt{\frac{1}{2}\,(\sigma_I^2 + \sigma_E^2)}}    (5)
where μ_I and σ_I are the mean and standard deviation of the intra-class comparisons and μ_E and σ_E are the corresponding values for the inter-class comparisons. As figure 1 illustrates, d′ has an inverse correspondence with the overlap area of the two distributions and, hence, acts as a measure of the error expected for the biometric system. Figure 2 illustrates the typical distributions of the P(w_0 | x_i) values for each evidence fusion variant, having as base experts the ones illustrated in figures 1b and 1c.
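Eq. (5) transcribes directly into code; the small sketch below uses the distribution parameters of the "Medium" expert of figure 1b as an illustration and recovers its quoted decidability.

import numpy as np

def d_prime(intra, inter):
    # d' = |mu_E - mu_I| / sqrt((sigma_I^2 + sigma_E^2) / 2), Eq. (5)
    mu_i, mu_e = np.mean(intra), np.mean(inter)
    s_i, s_e = np.std(intra), np.std(inter)
    return abs(mu_e - mu_i) / np.sqrt(0.5 * (s_i ** 2 + s_e ** 2))

rng = np.random.default_rng(1)
intra = rng.normal(0.25, 0.08, size=20000)    # intra-class dissimilarities
inter = rng.normal(0.49, 0.06, size=180000)   # inter-class dissimilarities
print(round(d_prime(intra, inter), 2))        # approximately 3.39 for the "Medium" expert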
Fig. 1. Typical intra-class (continuous lines) and inter-class (dashed lines) distributions of the dissimilarity scores generated by biometric experts with heterogeneous performance: (a) "Strong" biometric expert (μ_I = 0.11, σ_I = 0.065, μ_E = 0.49, σ_E = 0.031, d′ = 7.60), referring to iris recognition in constrained imaging setups [1]; (b) "Medium" biometric expert (μ_I = 0.25, σ_I = 0.08, μ_E = 0.49, σ_E = 0.06, d′ = 3.39), based on results of ear [13], face [14] and palm-print [15] recognition; (c) "Weak" biometric expert (μ_I = 0.25, σ_I = 0.07, μ_E = 0.37, σ_E = 0.07, d′ = 1.75), based on results of a gait classifier [16]
Fig. 2. Intra-class (dark bars) and inter-class (white bars) distributions of the posterior probabilities for class w_0 given a dissimilarity score s_i, for the biometric experts of figures 1b and 1c and for the different evidence fusion variants: (a) "Medium" biometric expert; (b) "Weak" biometric expert; (c) min rule; (d) max rule; (e) sum rule; (f) product rule; (g) Dempster-Shafer rule; (h) weighted product rule
3.1 Scenario 1: Fusing Independent Scores In the first scenario we assume that the base experts analyze different traits and, thus, the scores generated by them are statistically independent. Figure 3 illustrates the independence between similarity scores generated by a “Strong” and a “Weak” biometric expert.
Fig. 3. Independence between scores (ρ_I(X, Y) ≈ 0, ρ_E(X, Y) ≈ 0) generated by a "Strong" and a "Weak" biometric expert. Cross points denote the intra-class and circular points the inter-class comparisons.
Table 1 compares the results obtained by evidence fusion, where S, M and W stand for the "Strong", "Medium" and "Weak" biometric experts illustrated in figure 1. Note that the "Individual" column gives the values obtained individually by the best biometric expert; d′ and EER denote the decidability and the approximate equal error rate. It is interesting to note that the weighted rule outperformed all the other combination rules most of the time.
Table 1. Comparison between the performance obtained by evidence fusion techniques, according to the individual performance of independent experts. S, M and W denote the "Strong", "Medium" and "Weak" biometric experts of figure 1. {A, B} denotes the fusion of the A and B experts.

          |                        d'                           |                       EER
Experts   | Individual   Min    Max    Prod.   Sum    DS   Weig. | Individual   Min    Max   Prod.   Sum    DS   Weig.
{S, S}    |   360       8690    250    8591    499   18.9   499  |   0.01        0     0.01    0     0.01   0.04   0.01
{S, M}    |   360       6.98    8.69   6.98   10.89  7.23  18.12 |   0.01       0.04   0.03  0.04    0.03   0.04   0.03
{S, W}    |   360       4.41    4.13   4.41    6.03  2.75  20.11 |   0.01       0.05   0.04  0.05    0.04   0.05   0.04
{M, M}    |   5.26      4.97    6.20   4.98    7.44  7.73   7.44 |   4.15       4.60   3.37  4.47    3.61   3.5    3.61
{M, W}    |   5.29      3.41    3.48   3.47    4.46  3.98   5.31 |   4.15       6.80   6.74  6.61    5.74   6.07   4.12
{W, W}    |   1.88      2.22    2.18   2.26    2.60  1.87   2.60 |  18.82      14.21  11.19 14.03   14.01  16.12  14.81
In order to assess in more detail when evidence fusion techniques do indeed improve the recognition performance, we compared the results obtained by the evidence fusion variant that was observed to be the best (for each case) with the performance of the best individual expert. We carried out all combinations between experts with individual d′ in [1, 7.75], using 0.25 steps. Figure 4 exhibits the results. The vertical axis gives the proportion between the results of the best individual expert and the ones obtained by evidence fusion. The remaining axes give the individual d′ values of the fused experts. It can be observed that the higher improvements are achieved when the base experts have close performance. If their performance is markedly different, evidence fusion techniques lead to a deterioration of the performance, when compared to the best individual expert. Also, a diagonal structure can be seen in the plot, which suggests that improvements tend to be in direct correspondence with the individual performance of the base experts. The figure at the upper-right corner shows the intersection of the 3D performance surface with the plane z = 1, that is, it reveals the region where evidence fusion was observed to be advantageous (Adv). Here, a power relationship between the stronger (d'_s) and the weaker (d'_w) expert appears to be evident, as the R² value of the fitted function (given in (6)) confirms.

3.2 Scenario 2: Fusing Correlated Scores
It was also found pertinent to evaluate the improvements in performance when the scores generated by the fused biometric experts are correlated, perhaps because they resulted from the same data or from different data extracted from the same trait. In our experiments, the Pearson product moment correlation ρ of the generated scores was approximately 0.75, both for the intra-class (ρ_I(X, Y)) and inter-class (ρ_E(X, Y)) comparisons, as illustrated in figure 5. As in the previous scenario, we varied the d′ values of the fused experts and compared the results obtained by fusion and individually by the best expert. Table 2 lists the obtained results. It is evident that the gains break down, suggesting that data correlation strongly constrains the improvements due to fusion. This was confirmed for all types of fused experts ("Strong", "Medium" or "Weak") and for all evidence fusion variants, which is in agreement with a previous conclusion made by Kittler et al. [17].
Fig. 4. Comparison between the performance obtained by evidence fusion and individually by the best expert, according to the d’ values of the fused experts and assuming the independence between scores. A power function was fitted to the intersection of the 3D structure with the plane Z = 1, revealing the region (Adv) where evidence fusion is actually advantageous.
Fig. 5. Correlated scores (ρI (X, Y ), ρE (X, Y ) ≈ 0.75) generated by a “Strong” and a “Weak” biometric expert. Cross points denote the intra-class and circular points the inter-class comparisons.
As in the previous scenario, figure 6 compares the results obtained by the best individual expert with the ones obtained by the evidence fusion variant that was observed to be the best. We confirmed that the higher improvements occur when the fused experts have close individual performance. Again, a power function defines the region where evidence fusion is advantageous (Adv), as illustrated in the 2D plot at the upper-right corner. The above results suggest that evidence fusion improves the relative performance mostly when the fused experts have similar performance. On the contrary, when one of the fused biometrics is considerably weaker, the overall performance of the system tends to decrease, confirming the essence of Daugman's note: "A strong biometric is better used alone than in combination with a weaker one" [9]. Also, it should be noted that data correlation is an important factor for the overall improvements. This can be confirmed in (6), which defines the regions where evidence fusion should
Table 2. Improvements in performance obtained by evidence fusion techniques, according to the performance of correlated experts. S, M and W denote the "Strong", "Medium" and "Weak" biometric experts of figure 1. {A, B} denotes the fusion of the A and B experts.

          |                        d'                           |                       EER
Classifs. | Individual   Min    Max    Prod.   Sum    DS   Weig. | Individual   Min    Max   Prod.   Sum    DS   Weig.
{S, S}    |   360        970     29     909     61    3.1    61  |   0.01       0.01   0.01    0     0.01   0.04   0.01
{S, M}    |   360       1.98    2.02   2.13    2.00   1.61  4.00 |   0.01       0.05   0.04  0.05    0.04   0.05   0.04
{S, W}    |   360       1.39    1.15   1.40    1.61   1.12  5.14 |   0.01       0.07   0.05  0.06    0.04   0.06   0.05
{M, M}    |   5.26      3.61    4.03   3.91    4.55   4.69  4.55 |   4.15       4.80   4.01  4.03    3.99   4.52   3.99
{M, W}    |   5.29      2.28    2.32   2.27    2.86   2.26  2.79 |   4.15       7.09   8.10  7.14    7.00   9.07   7.03
{W, W}    |   1.88      1.95    1.93   1.99    2.02   1.80  2.02 |  18.82      16.09  13.53 15.66   15.69  18.12  15.69
Fig. 6. Comparison between the performance obtained by the best evidence fusion variant and by the best individual expert used in fusion, according to the decidability (d’) of the fused experts (left) and assuming scores correlation (ρ(Xi,j , Yi,j ) ≈ 0.75). A fitted power function approximates the relationship between the performance of the fused experts in order to improve performance by fusion (Adv region of the figure at the upper-right corner).
actually improve performance, either when the scores generated by the experts are independent (ρ(X_{i,j}, Y_{i,j}) ≈ 0) or not. This equation relates the decidability value of the weaker biometric expert d'_w with the corresponding value of the stronger one (d'_s) in order to improve results by fusion:

d'_w > 1.114\,{d'_s}^{0.7884} - 0.4133, \quad \rho(X_{i,j}, Y_{i,j}) \approx 0
d'_w > 1.254\,{d'_s}^{-1.056} + 0.002, \quad \text{otherwise}    (6)
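The fitted boundaries in (6) can be applied directly as a decision aid. The small sketch below takes the two fitted curves at face value; the tolerance used to decide that the correlation is "approximately zero" is our own assumption, not a value from the paper.

def fusion_expected_advantageous(d_strong, d_weak, rho, rho_tol=0.1):
    # boundaries fitted in Eq. (6); rho_tol is a hypothetical tolerance for "rho ~ 0"
    if abs(rho) <= rho_tol:
        bound = 1.114 * d_strong ** 0.7884 - 0.4133
    else:
        bound = 1.254 * d_strong ** -1.056 + 0.002
    return d_weak > bound

print(fusion_expected_advantageous(7.60, 3.39, rho=0.0))   # strong + medium experts, independent scores
print(fusion_expected_advantageous(7.60, 1.75, rho=0.0))   # strong + weak experts, independent scores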
4 Conclusions
Previous research works reported substantial improvements in biometric performance by fusing the evidence from multiple sources, with emphasis on fusion at the score
level. However, other authors claim that these strategies are not particularly useful and even tend to deteriorate the recognition performance. Starting from readily satisfied assumptions about the dissimilarity scores generated by each biometric expert (i.i.d. unimodal values for the intra-class and inter-class comparisons that can be modeled by normal distributions), we simulated the outputs generated by different biometric experts and analyzed the performance gains obtained by the most usual evidence fusion techniques. We concluded that effectiveness is maximized when the fused biometrics have similar performance. Conversely, if their performance is markedly different, the overall performance tends to decrease when compared to the best expert. Also, we confirmed that independence between the fused similarity scores is an important requirement for the effectiveness of score fusion techniques. If the fused data is strongly correlated, the performance achieved by evidence fusion is likely to be worse than the one obtained individually by the strongest expert. Finally, we fitted two boundary decision curves by power functions that define the regions where evidence fusion should actually be advantageous, whether the fused scores are independent or not. This was done according to the individual performance of each fused expert.
Acknowledgments
We acknowledge the financial support given by "FCT - Fundação para a Ciência e Tecnologia" and "FEDER" in the scope of the PTDC/EIA/69106/2006 research project "BIOREC: Non-Cooperative Biometric Recognition".
References
1. Daugman, J.G.: Probing the uniqueness and randomness of iriscodes: Results from 200 billion iris pair comparisons. Proceedings of the IEEE 94(11), 1927–1935 (2006)
2. Jain, A.K., Nandakumar, K., Ross, A.: Score normalization in multimodal biometric systems. Pattern Recognition 38, 2270–2285 (2005)
3. Ross, A., Jain, A.K.: Multimodal biometrics: An overview. In: Proceedings of the 12th European Signal Processing Conference (EUSIPCO), pp. 1221–1224 (2004)
4. Ross, A., Jain, A.K.: Information fusion in biometrics. Pattern Recognition Letters 24, 2115–2125 (2003)
5. Wang, Y., Tan, T., Jain, A.K.: Combining face and iris biometrics for identity verification. In: Kittler, J., Nixon, M.S. (eds.) AVBPA 2003. LNCS, vol. 2688, pp. 805–813. Springer, Heidelberg (2003)
6. Duc, B., Bigün, E.S., Bigün, J., Maitre, G., Fischer, G.: Fusion of audio and video information for multi modal person authentication. Pattern Recognition Letters 18(9), 835–843 (1997)
7. Brunelli, R., Falavigna, D.: Person identification using multiple cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 955–966 (1995)
8. Hong, L., Jain, A.K.: Integrating faces and fingerprints for personal identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1295–1307 (1998)
9. Daugman, J.G.: Biometric decision landscapes (2000), http://www.cl.cam.ac.uk/TechReports/
10. Poh, N., Bengio, S.: How Do Correlation and Variance of Base-Experts Affect Fusion in Biometric Authentication Tasks? IEEE Transactions on Signal Processing 53(11), 4384–4396 (2005)
11. Hong, L., Jain, A.K., Pankanti, S.: Can multibiometrics improve performance? In: Proceedings of the IEEE Workshop on Automatic Identification Advanced Technologies, pp. 59–64 (1999)
12. Daugman, J.G., Williams, G.O.: A proposed standard for biometric decidability. In: Proceedings of the CardTech/SecureTech Conference, pp. 223–234 (1996)
13. Arbab-Zavar, B., Nixon, M., Hurley, D.: On Model-Based Analysis of Ear Biometrics. In: IEEE Conference on Biometrics: Theory, Applications and Systems, Washington, pp. 1–5 (2007)
14. Chen, X., Flynn, P., Bowyer, K.: IR and visible light face recognition. Computer Vision and Image Understanding 99(3), 332–358 (2005)
15. Han, Y., Sun, Z., Tan, T.: Combine hierarchical appearance statistics for accurate palmprint recognition. In: 19th International Conference on Pattern Recognition, pp. 1–4 (2008)
16. Boyd, J., Little, J.: Biometric Gait Recognition. In: Tistarelli, M., Bigun, J., Grosso, E. (eds.) Advanced Studies in Biometrics. LNCS, vol. 3161, pp. 19–42. Springer, Heidelberg (2005)
17. Kittler, J., Hatef, M., Duin, R., Matas, J.: On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
18. Snelick, R., Uludag, U., Mink, A., Indovina, M., Jain, A.K.: Large scale evaluation of multimodal biometric authentication using state-of-the-art systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(3), 450–455 (2005)
19. Yager, R.R., Liu, L.: Classic works of the Dempster-Shafer theory of belief functions. In: Studies in Fuzziness and Soft Computing, vol. 219. Springer, Heidelberg (2008)
20. Bolle, R., Pankanti, S., Ratha, N.: Evaluation techniques for biometrics-based authentication systems (FRR). In: Proceedings of the 15th International Conference on Pattern Recognition, June 2000, vol. 2, pp. 831–837 (2000)
21. Marsaglia, G., Tsang, W.W.: The ziggurat method for generating random variables. Journal of Statistical Software 5(8) (2000)
Interactive Image Inpainting Using DCT Based Exemplar Matching
Tsz-Ho Kwok and Charlie C.L. Wang
Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong
[email protected]
Abstract. We present a novel algorithm for exemplar-based image inpainting which can achieve an interactive response and generate results of good quality. In this paper, we modify the exemplar-based method by using the Discrete Cosine Transformation (DCT) as the strategy for exemplar matching. We decompose exemplars by DCT and evaluate the matching score with fewer coefficients, which is unprecedented in image inpainting. The reason why using fewer coefficients is so important is that the efficiency of Approximate Nearest Neighbor (ANN) search drops significantly in high dimensions. We have also developed a local gradient-based filling algorithm to complete the image blocks with unknown pixels, so that the ANN search can be adopted to speed up the matching while preserving the continuity of the image. Experimental results demonstrate the advantages of the proposed method.
1 Introduction
Image inpainting, also known as image completion, is a process to complete the missing parts of images, which is nowadays widely used in applications such as retouching a photo containing unexpected objects or recovering damage on a valuable picture. More specifically, given an input image I with a missing or unknown region Ω, the task here is to propagate structure and texture information from the known region I \ Ω into Ω. In the literature, many techniques have been developed, which can be roughly classified into Partial Differential Equation (PDE) based, exemplar-based and statistical-based approaches. Some of these algorithms can work well with small regions but fail in large and highly textured regions, whereas other algorithms which work well in large regions take a long time to complete an image. The exemplar-based method [1] showed a great improvement in speed, reducing the computation time from hours to tens of seconds. However, by our implementation and tests, it still cannot reach interactive speed (i.e., a few seconds or even within a second) when running on a consumer-level PC. Therefore, a new algorithm is presented in this paper to speed up exemplar-based image inpainting. The PDE-based method (ref. [2,3]) diffuses the known pixels into the missing regions; it smoothly propagates image information from the surrounding areas
Corresponding author.
along the isophote direction. The results were sharp and without too many color artifacts. However, the major defect of the algorithm is that it only works well with small missing regions. Ghost effects were produced in large regions, and unrealistically blurred results were generated when processing highly textured regions. In addition, the algorithm takes a long time to complete an image. The Photoshop Healing Brush [4] is a variant of the PDE-based method, but it runs at an interactive speed. However, it also inherits the major drawback of PDE-based inpainting, namely that it cannot process highly textured or large regions successfully. Figure 1 shows such an example.
Fig. 1. Limitation of Photoshop Healing Brush: (left) the given image with the target region in green, (right) the result generated by Photoshop
The statistical-based method [5] uses the information of the rest of the whole image. The algorithm adopts the strategy of statistical learning. First, it builds an exponential family distribution which is based on the histograms of local features over images. The frequency of gradient magnitude and angle occurrences is recorded. Then, it employs the image-specific distribution to retouch the missing regions by finding the most probable completion given the boundary and distribution. Finally, loopy belief propagation is utilized to achieve optimal results. However, the algorithm also fails on highly textured photographs and takes a long time to compute the results. Exemplar-based methods are a combination of texture synthesis and inpainting. The first approach [1] computed the priorities of patches and performed the synthesis through a best-first greedy strategy that depends on the priority assigned to each patch on the filling front. This algorithm works well in large missing regions and textured regions. Our proposed algorithm fills the missing
region in a similar way but with a more efficient matching technique. Several variants were proposed thereafter. Priority-BP (BP stands for belief propagation) was posed in the form of a discrete global optimization problem in [6]. The priority-BP was introduced to avoid visually inconsistent results; however, it took longer computing time than [1] and needed user guidance. The structural propagation approach in [7] filled the regions along user-specified curves so that the important structures can be recovered. The filling order of patches was determined by dynamic programming, which is also very time-consuming when working on large images. Retouching an image using other resources from a database or the Internet is a new strategy which has been researched starting from [8]. Recent approaches include the usage of large-displacement views in completion [9] and an image database in re-coloring [10]. However, when there is a large number of candidate patches, the processing time becomes even longer. In this paper, we focus on the speed-up problem of inpainting, which has not been discussed by these works.

1.1 Exemplar-Based Image Inpainting
To better explain our algorithm, the procedure of the exemplar-based algorithm [1] is briefed here. After extracting the manually selected initial front ∂Ω^0, the algorithm repeats the following steps until all pixels in Ω have been filled.
– Identify the filling front ∂Ω^t, and exit if ∂Ω^t = ∅.
– Compute (or update) the priorities for every pixel p on the filling front ∂Ω^t by P(p) = C(p)D(p), with C(p) being the confidence term and D(p) being the data term (a small illustrative sketch of this computation is given after this list). They are defined as

C(p) = \frac{\sum_{q \in \Psi_p \cap (I \setminus \Omega)} C(q)}{|\Psi_p|}, \qquad D(p) = \frac{|\nabla I_p^{\perp} \cdot n_p|}{\alpha}    (1)

where |Ψ_p| is the area of the image block Ψ_p centered at p, α is the normalization factor (e.g., α = 255 for a typical grey-level image), n_p is the unit normal vector that is orthogonal to the front ∂Ω at p, and ∇I_p^⊥ is the isophote at p. During initialization, C(p) = 0 (∀p ∈ Ω) and C(p) = 1 (∀p ∈ I \ Ω). Therefore, a pixel that is surrounded by more confident (known) pixels and is more likely to let the isophote flow in will have a higher priority.
– Find the patch Ψp̂ which has the maximum priority.
– Find the exemplar patch Ψq̂ in the filled region that minimizes the Sum of Squared Differences (SSD) between Ψp̂ and Ψq̂ (i.e., d(Ψp̂, Ψq̂)), defined on those already filled pixels.
– Copy image data from Ψq̂ to Ψp̂ (∀p ∈ Ψp̂ ∩ Ω).
– Update C(p) = C(p̂) (∀p ∈ Ψp̂ ∩ Ω).
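The following minimal Python sketch evaluates the priority of Eq. (1) on every front pixel; it is an illustration under our own variable names (not the authors' implementation), and it assumes the isophote magnitudes |∇I_p^⊥ · n_p| have already been estimated.

import numpy as np

def priorities(confidence, mask, isophote_mag, alpha=255.0, half=4):
    # P(p) = C(p) * D(p) for every pixel p on the filling front, as in Eq. (1)
    # confidence: current C values (1 in I\Omega, 0 in Omega at initialization)
    # mask: True for still-unfilled pixels (Omega); isophote_mag: |grad I_perp . n_p| per pixel
    rows, cols = mask.shape
    P = np.zeros_like(confidence, dtype=float)
    for i in range(rows):
        for j in range(cols):
            if not mask[i, j]:
                continue
            # front pixel: unfilled, but with at least one filled 4-neighbour
            neigh = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            if not any(0 <= u < rows and 0 <= v < cols and not mask[u, v] for u, v in neigh):
                continue
            i0, i1 = max(0, i - half), min(rows, i + half + 1)
            j0, j1 = max(0, j - half), min(cols, j + half + 1)
            window_c = confidence[i0:i1, j0:j1]
            window_m = mask[i0:i1, j0:j1]
            C = window_c[~window_m].sum() / ((2 * half + 1) ** 2)   # confidence term, |Psi_p| = window area
            D = isophote_mag[i, j] / alpha                          # data term
            P[i, j] = C * D
    return P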
By default, a window size of 9 × 9 pixels is used for Ψ_p but, in practice, the user is required to set it to be slightly larger than the largest distinguishable
texture element. d(Ψp̂, Ψq̂) is evaluated in the CIE Lab color space because of its property of perceptual uniformity. In the routine of exemplar-based image inpainting, the most time-consuming step is finding the best matched patch Ψq̂ = arg min_{Ψq ⊂ I\Ω} d(Ψp̂, Ψq), and this is where our approach contributes to the speed-up of the algorithm.

1.2 Compression Technique
There are quite a number of methods that can use fewer coefficients to represent an image block (exemplars in our approach), such as the Haar wavelet [11], FFT [12] and PCA [13]. However, for those fast decomposition methods (e.g., the Haar wavelet), some important details are lost during the coefficient selection, and the structure is destroyed. Besides, it is not that flexible to select coefficients after PCA. In image processing, DCT is more efficient than FFT. Therefore, we choose DCT as our mathematical tool to select coefficients in exemplar matching. The formula of DCT can be partially pre-computed, which can further speed up the process of decomposition. Moreover, the locations of DCT coefficients have a very clear physical meaning—i.e., the top-left corner coefficients correspond to the low frequency components, which are very important to visual perception.

1.3 Our Contribution
We develop two enhancements for exemplar-based image inpainting.
– Firstly, for the evaluation of the matching score, we decompose exemplars into the frequency domain using DCT and determine the best matched patches more efficiently using fewer coefficients, with the help of ANN search.
– Secondly, for DCT on patches with unknown pixels, the unknown pixel values are determined by a local gradient-based filling, which better preserves the continuity of image information.
2 Using DCT in Exemplar-Based Image Inpainting
Using a heuristic search to determine the best matched patch that minimizes the SSD in the above algorithm can hardly reach the speed reported in [1], even though we have already employed a maximum-heap data structure to obtain the patch Ψp̂ with the maximum priority. Therefore, we tried to employ the efficient Approximate Nearest Neighbor (ANN) library [14], which is implemented with a KD-tree, to search for the best matched patch Ψq̂ for a given patch Ψp̂ at the filling front ∂Ω, where every image block is considered as a high-dimensional point in the feature space and the distance in this feature space is equal to the SSD. However, the unknown pixels in Ψp̂ change from time to time, so the dimension cannot be fixed. Although no description of this is given in [1], we used the average color to fill the unknown pixels, which therefore fixes the dimension of the KD-tree. Surprisingly, the same results are obtained in similar lengths of time – we tested our implementation in this way on Figs. 9, 13 and 15 in [1].
Fig. 2. Example I: (top-left) the given image (206 × 308 pixels), (top-right) original [1] (11.97s), (bottom-left) DCT (0.328s), and (bottom-right) DCT+GB (0.344s)
Fig. 3. Example II: (top-left) the given image (628 × 316 pixels), (top-right) original [1] (44.36s), (bottom-left) DCT (1.406s), and (bottom-right) DCT+GB (1.344s)
Fig. 4. Example III: (top-left) the given image (700 × 438 pixels), (top-right) original [1] (76.34s), (bottom-left) DCT (1.922s), and (bottom-right) DCT+GB (1.875s)
Further study shows that when increasing the dimension of the KD-tree from 20 to more than 200 (e.g., a window of 9×9 pixels is a point in a space of dimension 9 × 9 × 3 = 243), the performance of the KD-tree drops significantly. Thus, in order to improve the efficiency, we need a method to approximate the image block using fewer coefficients.
Fig. 5. Example IV: (top-left) the given image (438 × 279 pixels), (top-right) original [1] (48.60s), (bottom-left) DCT (0.985s), and (bottom-right) DCT+GB (1.062s)
Fig. 6. Example V: (top-left) the given image (700 × 438 pixels), (top-right) original [1] (124.5s), (bottom-left) DCT (2.468s), and (bottom-right) DCT+GB (2.469s)
Fig. 7. Example VI: (left) the given image (362 × 482 pixels), (middle-left) original [1] (6.359s), (middle-right) DCT (0.406s), and (right) DCT+GB (0.437s)
The Discrete Cosine Transformation (DCT) [15] provides such an ability. As has been investigated in the JPEG standard for still images, an image block can be well reconstructed even if we ignore many high frequency components of the resultant DCT array. Specifically, using only about 10% of the DCT coefficients for matching already gives acceptable results in our tests. The two indices of a DCT coefficient correspond to the vertical and horizontal frequencies that it contributes to. Therefore, to make the matching conducted on DCT coefficients symmetric, we keep symmetric DCT coefficients as the feature components of an image block. The DCT coefficients of image blocks fully occupied by known pixels can be pre-computed and stored in the KD-tree as exemplars for ANN search before going to the matching steps, which speeds up the computation.
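A minimal Python sketch of this idea follows, using SciPy's DCT and a k-d tree as a stand-in for the ANN library. The 5×5 low-frequency corner per channel (75 of 243 dimensions) is an illustrative choice rather than the authors' exact coefficient-selection scheme, and the target patch is assumed to have its unknown pixels pre-filled (e.g., by the gradient-based filling of the next section).

import numpy as np
from scipy.fft import dctn
from scipy.spatial import cKDTree

def dct_feature(patch, keep=5):
    # keep only the keep-by-keep low-frequency corner of the 2-D DCT, per colour channel
    coeffs = [dctn(patch[:, :, c], norm='ortho')[:keep, :keep].ravel() for c in range(patch.shape[2])]
    return np.concatenate(coeffs)

def build_exemplar_index(image, mask, size=9, keep=5):
    # index every fully-known size-by-size block of the image as an exemplar
    feats, tops = [], []
    for i in range(image.shape[0] - size + 1):
        for j in range(image.shape[1] - size + 1):
            if mask[i:i + size, j:j + size].any():      # skip blocks touching the unknown region
                continue
            feats.append(dct_feature(image[i:i + size, j:j + size], keep))
            tops.append((i, j))
    return cKDTree(np.array(feats)), tops

def best_exemplar(tree, tops, target_patch, keep=5):
    # nearest neighbour in the truncated DCT space approximates the minimum-SSD exemplar
    _, idx = tree.query(dct_feature(target_patch, keep))
    return tops[idx]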
3 Improvement by Gradient-Based Filling
For computing DCT coefficients on image blocks with unknown pixels, if the unknown pixels are filled with the average color of the known pixels, the DCT coefficients do not reflect the texture or the structural information in the block very well. For example, for a block with a progressive color change from left to right, if the missing region Ω is located at the right part, filling Ω with the average color is not a good approximation. For a smooth image, the gradient at each pixel is approximately equal to zero. Based on this observation, we developed a gradient-based filling method to determine the unknown pixels before computing the DCT. In detail, for each unknown pixel p, letting the discrete gradient at this pixel be zero leads to linear equations relating p to its left/right and top/bottom neighbors. Therefore, for n unknown pixels, we have m linear equations with m > n, which is an over-determined linear system. The optimal values of the unknown pixels can then be computed by the least-squares solution that minimizes the norm of the gradients at the unknown pixels. This gives better inpainting results.
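One way to realize this zero-gradient least-squares fill is sketched below for a single colour channel; dense matrices are used for brevity (a sparse solver would be preferable for large regions), and the function names are ours, not the authors'.

import numpy as np

def gradient_fill(channel, unknown):
    # fill unknown pixels so that the discrete gradients at them are as close to zero as possible
    idx = {p: k for k, p in enumerate(zip(*np.nonzero(unknown)))}   # unknown pixel -> column index
    A, b = [], []
    for (i, j), k in idx.items():
        for u, v in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if not (0 <= u < channel.shape[0] and 0 <= v < channel.shape[1]):
                continue
            row = np.zeros(len(idx))
            row[k] = 1.0
            if (u, v) in idx:                     # neighbour also unknown: I(p) - I(q) = 0
                row[idx[(u, v)]] = -1.0
                b.append(0.0)
            else:                                 # neighbour known: I(p) = I(q)
                b.append(float(channel[u, v]))
            A.append(row)
    x, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)   # over-determined least squares
    filled = channel.astype(float).copy()
    for (i, j), k in idx.items():
        filled[i, j] = x[k]
    return filled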
Fig. 8. Example XI (top-row) and X (bottom-row): (left) the given images, (middleleft) results of [1], (middle-right) DCT results, and (right) DCT+GB results
4 Experimental Results and Discussion
We have implemented the proposed algorithm using Visual C++, and tested it on various examples on a PC with Intel Core2 6600 CPU at 2.40GHz plus 2.0GB RAM. Basically, the proposed algorithm can recover photographs more effectively and efficiently than the greedy exemplar-based approach [1]. Figures 2-8 show the results, where ‘original’ stands for the method of [1] but with ANN search (filling unknown pixels with average color), ‘DCT’ for only using Discrete Cosine Transformation (filling unknown pixels with average color), and ‘DCT+GB’ denotes the DCT-based method with unknown pixels filled with zero gradient optimization (i.e., the method in section 3). The major limitation of the DCT-based method is that it does not work very well on sharp features as the high frequency components are neglected in the search of best-matched blocks (e.g., the top of Fig. 8). Improvement on this will be our work in the near future.
References
1. Criminisi, A., Pérez, P., Toyama, K.: Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing 13, 1200–1212 (2004)
2. Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: ACM SIGGRAPH 2000, pp. 417–424. ACM, New York (2000)
3. Bertalmio, M., Bertozzi, A.L., Sapiro, G.: Navier-Stokes, fluid dynamics, and image and video inpainting. In: IEEE CVPR 2001, vol. I, pp. 355–362. IEEE, Los Alamitos (2001)
4. Georgiev, T.: Photoshop healing brush: a tool for seamless cloning. In: Workshop on Applications of Computer Vision (ECCV 2004), pp. 1–8 (2004)
5. Levin, A., Zomet, A., Weiss, Y.: Learning how to inpaint from global image statistics. In: IEEE ICCV 2003, pp. 305–312. IEEE, Los Alamitos (2003)
6. Komodakis, N., Tziritas, G.: Image completion using global optimization. In: IEEE CVPR 2006, vol. I, pp. 442–452. IEEE, Los Alamitos (2006)
7. Sun, J., Yuan, L., Jia, J., Shum, H.-Y.: Image completion with structure propagation. ACM Transactions on Graphics (SIGGRAPH 2005) 24, 861–868 (2005)
8. Hays, J., Efros, A.: Scene completion using millions of photographs. ACM Transactions on Graphics (SIGGRAPH 2007) 26, Article 4 (2007)
9. Liu, C., Guo, Y., Pan, L., Peng, Q., Zhang, F.: Image completion based on views of large displacement. The Visual Computer 23, 833–841 (2007)
10. Liu, X., Wan, L., Qu, Y., Wong, T., Lin, S., Leung, C., Heng, P.: Intrinsic colorization. ACM Transactions on Graphics (SIGGRAPH Asia 2008) 27, Article 152 (2008)
11. Bede, B., Nobuhara, H., Schwab, E.D.: Multichannel image decomposition by using pseudo-linear Haar wavelets. In: Image Processing, 2007. ICIP 2007, vol. 6, pp. VI-17 – VI-20 (2007)
12. Kumar, S., Biswas, M., Belongie, S.J., Nguyen, T.Q.: Spatio-temporal texture synthesis and image inpainting for video applications. In: Image Processing, 2005. ICIP 2005, vol. 2, pp. II-85–88 (2005)
13. Lefebvre, S., Hoppe, H.: Parallel controllable texture synthesis. ACM Trans. Graph. 24, 777–786 (2005)
14. Mount, D., Arya, S.: ANN: A Library for Approximate Nearest Neighbor Searching (2006), http://www.cs.umd.edu/~mount/ANN/
15. Li, Z.N., Drew, M.: Fundamentals of Multimedia. Pearson Prentice Hall, London (2004)
Noise-Residue Filtering Based on Unsupervised Clustering for Phase Unwrapping
Jun Jiang 1,2, Jun Cheng 1, and Xinglin Chen 2
1 Shenzhen Institute of Advanced Integration Technology, Chinese Academy of Sciences/The Chinese University of Hong Kong
2 Department of Control Science and Engineering, Harbin Institute of Technology
[email protected]
Abstract. This paper analyzes the characteristics of noise-induced phase inconsistencies, or residues, in the wrapped phase map. Because residues are a potential source of phase-error propagation, it is essential to eliminate them before phase unwrapping. We propose an unsupervised-clustering-driven noise-residue filter and apply it as a preprocessing procedure for phase unwrapping. The filter is based on the fact that most residues occur as adjacent pairs caused by noisy phases. These noisy phases differ numerically from the other, correct phases within the adjacent residues, so it is possible to group correct phases and noisy ones into different clusters. Besides, this paper also proposes a method to cope with independent residues. The tests performed on simulated and real interferometric data confirm the validity of our approach.
1 Introduction
The phase shifting technique has become the method of choice for most problems of full 3-D sensing by noncontact optical methods. The wrapped phase map calculated from N fringe images consists of phases wrapped within the interval [−π, π). With high quality interferograms, it is usually sufficient to scan the wrapped phase in search of phase steps that are greater than π. Constants of 2nπ are then added to the phase values in the wrapped regions to obtain a continuous phase function with no phase steps that are larger than π. Problems, however, occur when the wrapped phases contain errors. Stetson [1] points out that it is possible for the errors at two neighboring pixels to combine with the slope of the error-free phase function to create a phase step greater than π that does not denote a transition between wrapped regions. Therefore, the problem is to distinguish those steps greater than π. Goldstein [2] proposes that the inconsistency must be found and marked before phase unwrapping (abbreviated to PU). A general method is to utilize the irrotational property to check phase inconsistency [3]. Cusack and Huntley [4] propose that branch cuts, between positive residues and negative residues, act as barriers to prevent a further PU from passing through the cut lines. Costantini [3] equates the problem of PU to that of finding the minimum cost flow on a network to improve the robustness of PU. In order to
make the PU procedure fast and accurate, feasible algorithms must focus on noise-residue elimination. Lee [5] presents a sigma filter as an adaptive extension of the averaging filter. It averages only selected phase values within the k×k neighborhood. This filter, however, results in an undesired phase smoothing that causes a loss of information. Nico [6] confines himself to the description of configurations of residues and designs a special filter. This method is based on the observation that the noisy phases induce adjacent phase inconsistencies. A combinatorial analysis, however, shows that the designed filter is very complex owing to the various configurations of adjacent residues. Volkov [7] presents a Fourier-based exact solution for deterministic PU from experimental maps of wrapped phase in the presence of noise and phase vortices. Recently, Fornaro [8] addresses a maximum likelihood PU algorithm based on the phase difference, in which the noise is post-processed in the unwrapped phase map using a median filter. Bioucas-Dias [9] introduces a new energy minimization framework to provide an approximate solution. However, no new method can remove the residues thoroughly within a minimum operating time. In this paper, we analyze configurations of adjacent residues and self-consistent residues. We introduce the technique of pattern classification into noise-residue elimination, and design a feasible filter to remove noise-induced inconsistencies without the need for the insertion of branch cuts. The remainder of the paper is organized as follows: Section 2 analyzes the configurations of residues. Section 3 gives the filtering algorithm based on unsupervised clustering. Section 4 illustrates the validity of the proposed filtering algorithm. Section 5 gives the conclusion and future works.
2 Noise-Residue Analysis
If the technique of phase shifting is utilized to evaluate a 3-D complex object, residues in the wrapped phase map must be located before the PU procedure. We assume that the original images are sufficiently sampled, in that the Nyquist Sampling Theorem is satisfied. A residue value r(i, j) is detected by evaluating the sum of the phase gradients clockwise or counterclockwise around each set of four adjacent points:

r(i, j) = -\frac{1}{2\pi}\left[ \Psi_1(i, j+1) - \Psi_1(i, j) - \Psi_2(i+1, j) + \Psi_2(i, j) \right]    (1)

\Psi_1(i, j) = \Delta_1 \Psi(i, j) + 2\pi n_1(i, j)    (2)

\Psi_2(i, j) = \Delta_2 \Psi(i, j) + 2\pi n_2(i, j)    (3)

where Δ₁Ψ(i, j) and Δ₂Ψ(i, j) are the differences of neighboring wrapped phases along the column and row directions respectively, and n₁(i, j) and n₂(i, j) are integers selected such that Ψ₁(i, j) ∈ [−π, π) and Ψ₂(i, j) ∈ [−π, π) respectively. The value r(i, j) in Eq. (1) will always be −1, 0, or +1. For a given closed loop involving a 2×2 pixel window, if r(i, j) is different from zero, the 2×2 area is called a residue in this paper. Only triads satisfying the following rules are observed: (1) two consecutive phase gradients in Eq. (1) with modulus greater than π must be of opposite sign; (2) two nonconsecutive phase gradients with modulus greater than
π must share the same sign, and the third one must have the opposite sign. As Nico depicts [6], most residues have only one phase gradient with modulus greater than π. The most probable configurations of noise-induced residues, shown in Fig. 1, consist of four categories: (1) a couple of adjacent opposite-sign residues in the form of left-right or up-down; (2) a couple of adjacent opposite-sign residues in the form of a diagonal; (3) a couple of disjoint opposite-sign residues; (4) more than two residues joined together.
Fig. 1. (a) a couple of adjacent opposite-sign residues in the form of up-down; (b) a couple of adjacent opposite-sign residues in the form of diagonal; (c) a couple of disjoint opposite-sign residues; (d) two couples of adjacent opposite-sign residues joined together
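The residue test of Eq. (1) is straightforward to vectorize. The Python sketch below is an illustration assuming the convention that Ψ₁ and Ψ₂ are the wrapped first differences along the column and row directions; the exact sign and orientation convention of the original implementation may differ.

import numpy as np

def wrap(x):
    # wrap phase differences into [-pi, pi)
    return (x + np.pi) % (2.0 * np.pi) - np.pi

def residue_map(psi):
    # evaluate r(i, j) of Eq. (1) for every 2x2 loop of the wrapped phase map psi
    d_col = wrap(psi[1:, :] - psi[:-1, :])   # Psi_1: wrapped differences along the column direction
    d_row = wrap(psi[:, 1:] - psi[:, :-1])   # Psi_2: wrapped differences along the row direction
    loop = d_col[:, 1:] - d_col[:, :-1] - d_row[1:, :] + d_row[:-1, :]
    return np.rint(-loop / (2.0 * np.pi)).astype(int)   # each entry is -1, 0 or +1

# usage: r = residue_map(wrapped_phase); positive residues at np.argwhere(r > 0), negative at np.argwhere(r < 0)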
3 Residue Removal Using Single Linkage
3.1 Why Choose Unsupervised Clustering?
The operating speed of the PU algorithm largely determines the running time of the 3-D sensing system, and the selection strategy of the integrating path largely determines the efficiency of a path-following PU algorithm. PU, however, is sensitive to noise-induced residues when a simple integrating path is used, which may give rise to global errors. For a sophisticated integrating path, global errors can be avoided, with only local errors occurring. Nevertheless, the computing burden is heavy, and the system efficiency is reduced significantly. From the perspective of rapidity, a simple integrating path is the first choice. Consequently, a feasible noise-residue removal algorithm must be found to avoid error propagation. In-depth configuration analysis of adjacent residues shows that the noisy phases causing the residues
differ from the others, making it possible to classify correct phases and noisy ones into separate clusters. We consider the well known k-means and single-linkage algorithms as candidates. For one-dimensional data, Berkhin [10] shows that the computational complexity of the single-linkage algorithm is O(N²) with sample size N, and the complexity of the k-means algorithm is O(NCT) with the number of categories C and the number of iterations T. The above two algorithms have similar complexity. However, a proper number of classes is required for k-means, so single-linkage is selected as the basis of the proposed noise-residue filtering algorithm.

3.2 The Proposed Residue Removal Algorithm
We classify noisy phases into low-modulation ones and those inducing residues in the wrapped phase map. A fringe modulation function is defined as in Eq. (4):

M(x, y) = \left[ \left( \sum_{n=1}^{N} I_n(x, y) \sin(2\pi n/N) \right)^2 + \left( \sum_{n=1}^{N} I_n(x, y) \cos(2\pi n/N) \right)^2 \right]^{1/2}    (4)

where N is the number of phase shifts and the I_n are the captured fringe images. The filtering algorithm based on unsupervised clustering can now be stated as follows:

Step 1: Remove the low-modulation pixels. The modulation map is computed from the N frames of successively phase-shifted fringe images using Eq. (4). If the modulation value of the current pixel is lower than a preset threshold, it is replaced by the highest-modulation pixel among its 4-neighbors. The preset threshold τ_m can be determined statistically from the histogram of the modulation [11]. This step is performed only once.

Step 2: Locate the residues in the wrapped phase map. Residues are tested using Eq. (1), and a flag matrix, the same size as the corresponding wrapped phase map, is introduced to mark the locations of the detected residues.

Step 3: Determine the locations of the noisy phases inducing residues. The single-linkage algorithm, described in [10], is used to classify the noisy and correct phases in an adjacent residue into different categories (a sketch of Steps 3 and 4 is given after this list). It is a remarkable fact that the noisy phases in a separate residue cannot be determined directly; this case is dealt with in Step 4.

Step 4: Remove the noisy phases inducing residues. The category containing the least data is considered the set of noisy phases, which accords with the practical situation, and the noisy phases are replaced by the median of the category containing the most data. A separate residue can be replaced by the median of the category of its 5×5 neighbors containing the most data.

Step 5: Determine whether there is still any residue. If any residue remains, go to Step 2, until all the residues are removed.

After all the residues have been removed, Δ₁Φ(i, j) and Δ₂Φ(i, j) can be estimated reliably by Ψ₁(i, j) and Ψ₂(i, j). Then, PU can be performed on the filtered wrapped phase map using Eq. (5) to acquire a consistent unwrapped phase map.
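The following sketch illustrates Steps 3 and 4 for an adjacent residue pair, grouping the covered phase values by single-linkage clustering (SciPy) into three clusters, as in the example of Fig. 2, and replacing the smallest cluster by the median of the largest one. The function name, cluster count and example values are ours, not the authors' code.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def repair_adjacent_residue(phases, n_clusters=3):
    # phases: 1-D array of the wrapped phase values covered by a pair of adjacent residues
    z = linkage(phases.reshape(-1, 1), method='single')
    labels = fcluster(z, t=n_clusters, criterion='maxclust')
    sizes = np.bincount(labels)[1:]                      # cluster labels start at 1
    noisy, biggest = np.argmin(sizes) + 1, np.argmax(sizes) + 1
    repaired = phases.copy()
    # the smallest cluster is taken as the noisy phases and replaced by the median of the largest cluster
    repaired[labels == noisy] = np.median(phases[labels == biggest])
    return repaired

# hypothetical example: one outlying phase among six
print(repair_adjacent_residue(np.array([0.10, 0.12, 0.11, 2.90, 0.13, 0.09])))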
\Phi(i, j) = \Phi(1, 1) + \sum_{i'=1}^{i-1} \Delta_1 \Phi(i', 1) + \sum_{j'=1}^{j-1} \Delta_2 \Phi(i, j')    (5)

where Φ(i, j) is the unwrapped phase and Δ₁Φ(i, j), Δ₂Φ(i, j) are the discrete partial derivatives of the unwrapped phase, with Δ₁Φ(i, j) = Ψ₁(i, j) and Δ₂Φ(i, j) = Ψ₂(i, j), respectively.
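Once the map is residue-free, the integration of Eq. (5) can follow an arbitrary path; the sketch below integrates down the first column and then along each row, under the same wrapping convention as the earlier residue-detection sketch.

import numpy as np

def wrap(x):
    return (x + np.pi) % (2.0 * np.pi) - np.pi

def unwrap_by_integration(psi):
    # Eq. (5): Phi(i,j) = Phi(1,1) + sum of Delta_1 Phi down the first column + sum of Delta_2 Phi along row i
    d_col = wrap(psi[1:, :] - psi[:-1, :])   # Delta_1 Phi(i,1) estimated by Psi_1
    d_row = wrap(psi[:, 1:] - psi[:, :-1])   # Delta_2 Phi(i,j) estimated by Psi_2
    phi = np.empty(psi.shape, dtype=float)
    phi[0, 0] = psi[0, 0]
    phi[1:, 0] = psi[0, 0] + np.cumsum(d_col[:, 0])
    phi[:, 1:] = phi[:, [0]] + np.cumsum(d_row, axis=1)
    return phi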
4 Experiments
In this section, the filter is used as a preprocessing step before PU to remove noise-induced residues, and the filtering performance is evaluated by applying the filter to synthetic and real phase signals. We demonstrate the capability of the proposed filtering algorithm in three areas: 1) dealing effectively with high noise levels in the wrapped phase map; 2) making it possible to apply a path-following PU algorithm stably; 3) acquiring an almost correct unwrapped phase map, not just a consistent one. We simulate four 256×256 projected phase-shifted fringe images as the object to be reconstructed. 5% random noise is added to the synthetic images. Figure 2 illustrates the process of residue removal. In Fig. 2(a) there is a couple of adjacent opposite-sign residues in diagonal form, located across the edge of two consecutive fringes and shown with bold data. During the process of local data clustering based on the single-linkage algorithm, the bold data forming the couple of residues is classified into three groups, and the datum in the gray area is the noise causing this couple of residues. Fig. 2(b) illustrates that the median of the biggest cluster is substituted for the noisy phase, since the smallest group is considered the set of noisy phases inducing this couple of residues, as described in Section 3. Fig. 3(b) and Fig. 3(c) show the results of reconstruction using the proposed algorithm and the branch-cut unwrapping method described in [4], respectively. There is no error propagation in either result, and comparisons of running time and accuracy are shown in Table 1. We also study the filter performance on real phase signals. Our system is composed of three modules: (1) a fringe projection subsystem; (2) an image capture subsystem; (3) a high-precision motion table subsystem, as Fig. 4 shows. An LCD projector projects four specified sinusoidal fringes on the observed object, and the corresponding deformed fringes are captured by a CCD camera. We project four phase-shifted fringe images on the object to be reconstructed, and capture four 1024 × 768 images sequentially. Fig. 5 shows the captured real images of a metal part. The phase step of the four consecutive projected fringes is 90°. 5% random noise is added to the real images. The number of residues reduces from 139 to 0 within 8 iterations. The PU result using our filter is shown in Fig. 6(a). Obviously, no error propagation occurs, and the time consumed is only 9 s. However, there is error propagation using the algorithm based on the branch-cut method [4], and the time consumed is much longer, as shown in Table 1. Another real experiment reconstructs the partial surface shape of a container without artificial noise.
Fig. 2. The process of residue removal
Fig. 3. (a) is the synthetic original 3-D surface; (b) is the reconstructed surface using the proposed algorithm; (c) is the one using the algorithm based on branch cuts
Fig. 4. The framework diagram of the experimental setup
Fig. 5. Images of digital grating projection on the metal part
Fig. 6. The reconstructed surface using the proposed filtering
Fig. 7. (a) is the reconstructed surface of a container using proposed algorithm; (b) is the one using branch-cut algorithm
Table 1. The comparisons of rapidity and accuracy between the proposed algorithm and the branch-cut algorithm

                                      Proposed algorithm   Branch-cut algorithm
MSE of simulation (arc²)                   0.0569                0.2361
Consuming time of simulation (s)           0.6                   21
Error propagation of metal part            No                    Some areas
Consuming time of metal part (s)           9                     266
Error propagation of container             No                    No
Consuming time of container (s)            1.9                   230
Table 2. The comparison of time-complexity between the proposed algorithm and the branch-cut algorithm

                        Proposed algorithm   Branch-cut algorithm
Modulation                   O(N)                  —
Residue detection            O(n_i N)              O(N)
Building branch cut          —                     O(N)
Clustering                   O(n_i n²)             —
Unwrapping                   O(N)                  O(N²)
The results are shown in Fig. 7, and the corresponding comparisons are also listed in Table 1. Table 2 compares the time complexity of the PU algorithms based on the proposed filter and on the branch-cut method. The integrating path is arbitrary after residue removal for the proposed algorithm. For the latter algorithm, however, the integrating path must be estimated at each point and a flood-fill algorithm must be applied, which is a heavily time-consuming procedure. In Table 2, N is the number of pixels in an image; n_i is the number of iterations of residue removal, n_i ≪ N; and n is the number of points within an adjacent residue group, n ≪ N.
5 Conclusion and Future Work
Based on the data characteristics of residues, this paper proposes a novel filtering algorithm based on unsupervised clustering. It classifies the noise inducing residues and the other, correct data into different groups, and the noise is replaced properly. The filter is designed to be applied as a preprocessing step before phase unwrapping. Its performance has been tested on simulated and real phase data. Problems such as oversmoothing and resolution reduction are prevented. Besides, it makes it possible to apply a path-following phase unwrapping algorithm stably, and to acquire an almost correct unwrapped phase map, not just a consistent one. Although we have illustrated the efficiency of the proposed filtering algorithm, some problems are not considered in this paper. Future works include: (1) paying
attention to projected fringe images containing holes with low modulation; (2) studying other unsupervised classification algorithms to see whether there is a more suitable filtering algorithm for residue removal.
Acknowledgement
The work described in this paper is supported by the National Natural Science Foundation of China (grant: 60806050), and a grant from the Ministry of Science and Technology, the People's Republic of China (International Science and Technology Cooperation Project 2006DFB73360).
References
1. Stetson, K., Wahid, J., Gauthier, P.: Noise-immune phase unwrapping by use of calculated wrap regions. Appl. Opt. 36, 4830–4838 (1997)
2. Goldstein, R., Zebker, H., Werner, C.: Satellite radar interferometry: Two-dimensional phase unwrapping. Radio Sci. 23, 713–720 (1988)
3. Costantini, M.: A novel phase unwrapping method based on network programming. IEEE Trans. Geosci. Remote Sens. 36, 813–821 (1998)
4. Cusack, R., Huntley, J., Goldrein, H.: Improved noise-immune phase-unwrapping algorithm. Appl. Opt. 34, 781–789 (1995)
5. Lee, J., Papathanassiou, K., Ainsworth, T.L.: A new technique for noise filtering of SAR interferometric phase images. IEEE Trans. Geosci. Remote Sens. 36, 1456–1465 (1998)
6. Nico, G.: Noise-residue filtering of interferometric phase images. J. Opt. Soc. Am. A 17, 1962–1974 (2000)
7. Volkov, V.V., Zhu, Y.M.: Deterministic phase unwrapping in the presence of noise. Opt. Lett. 28, 2156–2158 (2003)
8. Fornaro, G., Pauciullo, A., Sansosti, E.: Phase difference-based multichannel phase unwrapping. IEEE Trans. Image Process. 14, 960–972 (2005)
9. Bioucas-Dias, J., Valadao, G.: Phase unwrapping via graph cuts. IEEE Trans. Image Process. 16, 698–709 (2007)
10. Berkhin, P.: Survey Of Clustering Data Mining Techniques. Springer, Heidelberg (2002)
11. Su, X., Bally, G., Vukicevic, D.: Phase-stepping grating profilometry: Utilization of intensity modulation analysis in complex objects evaluation. Opt. Commun. 98, 141–150 (1993)
Adaptive Digital Makeup
Abhinav Dhall, Gaurav Sharma, Rajen Bhatt, and Ghulam Mohiuddin Khan
Samsung Delhi R&D, D5 Sec. 59, Noida
{abhinav.d,rajen.bhatt,mohish.khan}@samsung.com,
[email protected]
Abstract. A gender and skin-color-ethnicity based automatic digital makeup system is presented. The system applies example-based digital makeup according to skin ethnicity color and gender type. One major advantage of the system is that the makeup is adapted to the skin color and gender type, which is necessary for effective makeup. Another strong advantage is that it applies makeup automatically, without requiring any user input.
1 Introduction
Good looks can be deceiving. Improving looks has always been of special interest to humans. Imagine a business video conferencing scenario; shabby looks can leave a bad impression on the viewer. The user is well served if the system can automatically apply digital makeup and enhance the caller's face. Different makeup suits different people, as all faces are different and have varied makeup demands. For example, a lipstick shade which may suit a lady from one place may not look good on a lady from another place having a different skin tone. Therefore the type of makeup must be adapted to the type of face. Not just female but also male faces require some touching up to remove small spots, pimples, etc., and perhaps to improve the appearance of smokers' lips. Digital makeup, or artificial makeup, has strong applications in various scenarios such as video conferencing, digital cameras, digital albums, etc. In this work we present a new framework for skin-tone (ethnicity) and gender adaptive digital makeup for human faces. Instead of applying makeup straightforwardly as described in some earlier works, our system assesses the type of makeup required based on gender and on skin tone type derived from skin ethnicity. The system first performs skin segmentation using fuzzy rules and locates the face using Active Shape Models and a facial feature detector on probable face candidate areas. Before applying ASM, scalable-boosting-based gender classification is performed on the face. Random sampling is performed on the detected face skin pixels and SVM-based skin tone ethnicity classification is performed. Now, with the available gender and skin tone type information and the facial features, the system applies case-specific digital makeup. A database of k images in each category was taken and makeup statistics in terms of HSV and alpha values were mined from them. These data were taken as reference and operations were performed for applying the makeup.
The paper is organized as follows. First we review the related work in Sec. 2. Then we describe our method in Sec. 3. Finally we give experimental results in Sec. 4.
2
Related Work
Robust automatic digital makeup systems are still in a nascent stage. There is no single system which can be termed a complete facial makeup system. In 2006 Microsoft was granted a patent [8] which outlines a general digital makeup system for video conferencing solutions. The system locates the face and the facial features and then applies various enhancement filters to the face, such as contrast balancing, histogram equalization, and eye dark circle removal. It also proposes blurring the background so that the caller/person looks clearer with respect to the background. In another approach, the blood haemoglobin and melanin values of a face were changed on the basis of a physical model generated from haemoglobin and melanin values [14]. In another work, Ojima et al. [12] used "before" and "after" makeup examples together with a foundation makeup procedure. Face retouching software has also been developed, such as MyPerfectPicture [11], which asks the user to manually define the key facial points and then select various parameters for retouching. Leyvand et al. [9] suggested a facial attractiveness engine which is trained with the help of human raters. It extracts the distances between various facial feature points and maps them to a so-called face space. This space is then searched for a higher predicted attractiveness rating and the points are then edited using a 2D warp. This basically alters the shape of the face and maps it to a better looking face, as rated by the users earlier.
3
Proposed Framework
We present a system which first categorizes the face and then applies the digital makeup. Several steps are performed in order to extract and categorize the face. To identify the facial features, the first step is fuzzy skin color segmentation; then Haar feature based face, eye and mouth detectors are used to extract rough areas. Next, the skin pixels are classified into their ethnicity class using support vector machines, and a scalable boosting approach is used to categorize the face as male or female. Active shape models are then used to extract the lips and eyes from the regions detected earlier by the Haar feature detectors. The system then matches the color moment of the face with the subset images from the database. The whole pipeline starts by applying a preprocessing Gaussian smoothing filter. The whole process is depicted in Figure 1.
3.1
Fuzzy Learning Based Skin Segmentation
Skin segmentation is based on low complexity fuzzy decision tree constructed over B, G and R color space. Skin and non-skin data was generated from various
Fig. 1. Block diagram of the system
facial images constituting various skin tones, ages and genders. Our skin segmentation system generates a sparse number of rules and is therefore very fast. Fuzzy decision trees are a powerful, top-down, hierarchical search methodology to extract easily interpretable classification rules [2] [3]. Fuzzy decision trees are composed of a set of internal nodes representing variables used in the solution of a classification problem, a set of branches representing fuzzy sets of the corresponding node variables, and a set of leaf nodes representing the degree of certainty with which each class has been approximated. We have used our own implementation of the fuzzy ID3 algorithm [2] [3] for learning a fuzzy classifier on the training data. Fuzzy ID3 utilizes the fuzzy classification entropy of a possibility distribution for decision tree generation. The overall skin/non-skin detection rate comes out to be 94%. Fig. 2 shows the fuzzy decision tree obtained with the fuzzy ID3 algorithm for the skin/non-skin classification problem. A Skin Binary Map Image (SBI) is generated which contains the skin and non-skin information.
Fig. 2. Fuzzy decision tree
Next, a connected component analysis is performed on the skin binary map I_SBI, and then face detection is applied on the skin-segmented blobs.
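As an illustration of this stage, the sketch below produces a skin binary map and extracts the skin-segmented blobs. It is only a minimal sketch: the membership test is a hand-written rule with assumed thresholds, standing in for the fuzzy ID3 rules actually learned from training data, and min_area is an assumed parameter.

```python
import numpy as np
import cv2  # OpenCV, used here for the connected component analysis

def skin_binary_map(bgr_image):
    """Binary skin map (SBI) from a BGR image.

    The rule below is an illustrative hand-written membership test,
    not the fuzzy ID3 rules learned by the authors.
    """
    b = bgr_image[..., 0].astype(np.int32)
    g = bgr_image[..., 1].astype(np.int32)
    r = bgr_image[..., 2].astype(np.int32)
    skin = (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b) & ((r - np.minimum(g, b)) > 15)
    return skin.astype(np.uint8)

def skin_blobs(bgr_image, min_area=500):
    """Connected component analysis on the skin binary map; returns face candidate masks."""
    sbi = skin_binary_map(bgr_image)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(sbi, connectivity=8)
    return [labels == i for i in range(1, num) if stats[i, cv2.CC_STAT_AREA] >= min_area]
```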
Fig. 3. Output of the fuzzy based skin segmentation
3.2
Haar Based Facial Feature Extraction
Once the probable face area is deduced using the skin detection, the Haar feature based Viola–Jones face detector [15] is used to detect the face on the reduced search space. Then, in the lower half of the face, a similar detector for the mouth is applied, which gives the approximate mouth area. Similarly, in the upper half of the face a Haar based eye detector is applied. The detected eye rectangles are referred to as the EyeRectL and EyeRectR maps and the mouth area is referred to as the LipRect map. This was done using the Intel OpenCV library [13]. Figure 4 depicts all the steps involved in facial feature extraction.
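The detector cascade can be driven through OpenCV roughly as follows. This is a sketch rather than the authors' exact pipeline: the face and eye cascade file names are the standard ones shipped with OpenCV, a mouth cascade (applied to the lower half of the face, as described above) is only available in some OpenCV distributions, and the detection parameters are assumptions.

```python
import cv2

# Standard OpenCV Haar cascades (bundled with opencv-python).
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_face_features(gray, skin_roi):
    """Detect a face inside a skin-probable rectangle, then eyes in its upper half.

    skin_roi is an (x, y, w, h) rectangle from the skin segmentation step,
    so the detectors only run on the reduced search space.
    """
    x, y, w, h = skin_roi
    roi = gray[y:y + h, x:x + w]
    faces = face_cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (fx, fy, fw, fh) in faces:
        upper_half = roi[fy:fy + fh // 2, fx:fx + fw]
        eyes = eye_cascade.detectMultiScale(upper_half)  # EyeRectL / EyeRectR candidates
        results.append({"face": (x + fx, y + fy, fw, fh), "eyes": eyes})
    return results
```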
Fig. 4. Output sequence of the face and facial feature detection steps. (a) Original image. (b) Face extracted using the Haar feature face detector. (c) Mouth region detected using the Haar feature based mouth detector. (d) The detected eye regions.
3.3
Skin Ethnicity Classification
The next step in the method is classification into ethnicities. Random skin pixels are picked from the face area and radial basis function (RBF) kernel based SVM classification is performed. We perform a one-versus-one classification. This gives the skin tone class, which provides the ethnicity information. Introduced by Vapnik and colleagues [4], the support vector machine is a set of related supervised learning
methods used for classification and regression. An SVM constructs a hyperplane and maximizes the margin between two sets of vectors in n dimensions. Two parallel hyperplanes are constructed and the margin is maximized by pushing them towards the data vectors; the hyperplane which achieves the maximum distance from the data points of both classes gets the maximum margin, and generally the larger the margin, the lower the generalization error of the classifier. Three classes based on ethnicity, European, Asian and African, were defined. Training data was constructed for these individual classes and a radial basis function kernel was trained. Training and testing were performed using the libsvm library [5].
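A minimal sketch of this classification step is given below. It uses scikit-learn's SVC (which, like libsvm, handles multi-class problems with a one-versus-one scheme) instead of a direct libsvm call; the feature representation of a skin pixel, the training file names and the C/gamma values are assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed training data: rows are sampled skin-pixel color features,
# labels are 0 = European, 1 = Asian, 2 = African.
X_train = np.load("skin_pixel_features.npy")   # placeholder file name
y_train = np.load("skin_pixel_labels.npy")     # placeholder file name

# RBF-kernel SVM; multi-class classification is one-versus-one internally.
clf = SVC(kernel="rbf", C=10.0, gamma="scale")
clf.fit(X_train, y_train)

def face_ethnicity(face_skin_pixels, n_samples=200):
    """Classify a face by majority vote over randomly sampled skin pixels."""
    n = len(face_skin_pixels)
    idx = np.random.choice(n, size=min(n_samples, n), replace=False)
    votes = clf.predict(face_skin_pixels[idx])
    return int(np.bincount(votes.astype(int)).argmax())
```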
3.4
Gender Classification
Gender classification is necessary for efficient automatic digital makeup since male and female faces have different makeup requirements. Gender classification is done on the faces detected by the Haar feature detector in the skin-bound regions bearing the ethnicity information. Each face is classified as male or female using scalable boosting classifier models learned during training, as defined by Baluja [1]. The author used this approach for image orientation detection; we modified it for gender detection. The classifier model chosen for the gender classification of a face depends on its predicted ethnicity. Scalable boosting uses simple pixel comparison features for gender classification. Five types of pixel comparison operators are used: pi > pj, pi within 5 units (out of 255) of pj, pi within 10 units (out of 255) of pj, pi within 25 units (out of 255) of pj, and pi within 50 units (out of 255) of pj. There exists a weak classifier for each pair of different pixels in the normalized face image and each comparison operator. For each pair of pixels i and j of the face image, the feature is chosen which gives the best gender classification results on the dataset. The chosen feature acts as a weak classifier yielding a binary result, 1 or 0, depending on whether the comparison is true or false; the output corresponds to male if it is true and to female if false. For 48 × 48 pixel images, this means there are 2 · 5 · 2304 · 2303 distinct weak classifiers. This yields an extremely large number of classifiers to consider. Thus we need to minimize the number of features that have to be compared when a face image is given at run time, while still achieving high identification rates. This is done using the Adaboost algorithm [7]. The accuracies for the European, Asian and African classes were 93.3 %, 91.7 % and 90.2 % for 500 classifiers.
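The weak classifier family can be written down directly from this description; the sketch below shows only the feature family that boosting selects from, with the AdaBoost selection loop itself left as a comment. Pixel indices refer to a flattened 48 × 48 grayscale face.

```python
import numpy as np

def weak_classify(face, i, j, op):
    """One pixel-comparison weak classifier on a flattened 48x48 face (values 0..255).

    op = 0: p_i > p_j; op = 1..4: p_i within 5, 10, 25, 50 units of p_j.
    Returns 1 (male) if the comparison holds, 0 (female) otherwise.
    """
    pi, pj = int(face[i]), int(face[j])
    if op == 0:
        result = pi > pj
    else:
        result = abs(pi - pj) <= {1: 5, 2: 10, 3: 25, 4: 50}[op]
    return 1 if result else 0

# AdaBoost would repeatedly (i) evaluate candidate (i, j, op) triples on the
# weighted training faces, (ii) keep the triple with the lowest weighted error,
# and (iii) re-weight the faces, until the desired number of weak classifiers
# (500 in the text) has been selected.  The selection loop is omitted here.
```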
3.5
Facial Feature Extraction
The Active Shape Models (ASM) by Cootes et al. [6] are then applied individually to the three maps generated above. This is done because an ASM fitted to the whole face may not fit properly, and we want the exact eye and lip regions. Active shape models are shape based statistical models of objects which iteratively fit to the object in a new scenario. In an ASM the shapes are constrained by the point distribution statistical shape model to vary only in ways seen in a training set of labeled examples. The shape of an object is represented by a set of points (controlled by
Fig. 5. The subset of category-specific face images from the face database that were used for training in gender detection. Rows one and two are from the European skin category, the third is African, and the fourth and fifth are Asian.
the shape model). The ASM algorithm aims to match the model to a new image. The ASM library by Stephen Milborrow [10] was used for experimentation. Two ASM models were trained, one for the eyes and one for the lips. Region filling is done on the control points obtained from the ASM.
Fig. 6. (a) The output using ASM eye model and (b) The output using ASM mouth model
3.6
Reference Database
A database of example sample images was created. For each skin tone ethnicity type, the Hue and Saturation color information of k example images each of males and females was stored. For example, in the European skin tone category there are k male and k female images which have been chosen as representing different types of European skin tones. Information on the type of makeup that should be applied to these images is stored along with the Hue–Saturation color moment of the image skin color. The information contains the after-makeup skin tone Hue–Saturation values, the lipstick color and alpha values in the case of women, and the eye makeup color and its alpha values.
3.7
Digital Makeup
The first step in digital makeup is applying Gaussian smoothing and morphological dilation to the input image in order to remove small marks, pigmentations and moles. This RGB skin color image is then converted into the HSV color space. The color moment CM of the skin color is computed over the new color space skin color image I_HSV. This color moment is then compared to the pre-stored color moments of the skin color of the subset images in the database. This subset is derived on the basis of the skin ethnicity type and gender type calculated earlier on the input face, for example Asian male or African female. The image with the minimum difference is chosen as the reference image and referred to as I_REF. The after-makeup
Hue and Saturation values stored with I_REF are used as the target values for I_HSV. The Hue and Saturation values are balanced on the basis of these pre-stored database values. Figure 7 shows these outputs.
Fig. 7. (a) Input skin sample (b) Output skin sample after Hue and Saturation balancing
The next step is lip shading. In the case of female faces, lipstick is applied to the lip area. Lipstick is applied as a raster operation (ROP). The target lip color LIP_RGB and the alpha value ALPHA are taken from the data stored with the closest image matched earlier. The RGB mouth image is extracted using the MouthMap and the control points computed earlier. For each pixel on the lip a new color value is calculated as NewRGB = OldRGB · (MAX − ALPHA) + LIP_RGB · ALPHA. This preserves the texture of the lip. Figure 8 demonstrates the lipstick application. In the case of men, separate color values are used which do not give a lipstick look to the lips but make their texture smooth, in tandem with the skin color. This is especially useful in improving the look of a smoker's lips, as depicted in Figure 8(b).
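The per-pixel blend above is ordinary alpha compositing and can be sketched as follows; here MAX is taken to be 1.0 with alpha in [0, 1], and the lip mask and reference color/alpha are assumed to come from the ASM step and the matched reference image.

```python
import numpy as np

def apply_lip_color(image_rgb, lip_mask, lip_rgb, alpha):
    """NewRGB = OldRGB * (MAX - ALPHA) + LIP_RGB * ALPHA, with MAX = 1.0.

    image_rgb: HxWx3 uint8 image; lip_mask: HxW boolean mask of the lip region;
    lip_rgb: target lipstick color; alpha: blend strength in [0, 1].
    """
    out = image_rgb.astype(np.float32)
    lip = np.asarray(lip_rgb, dtype=np.float32)
    out[lip_mask] = out[lip_mask] * (1.0 - alpha) + lip * alpha
    return np.clip(out, 0, 255).astype(np.uint8)
```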
Fig. 8. (a) The input and output of a female Asian lip and (b) the input and output of an Asian male with smoker's lips
4
Experimental Results
Figure 9 shows the outputs for four different cases. Note in Figure 9(a) that the lip color has changed while the texture has been maintained considerably. As this is an Asian female, the skin tone now looks more red. The dark regions below the eyes have been considerably suppressed due to the smoothing and skin color balancing. In Figure 9(b) the skin color has been improved and the lips have been given a similar color so as to smooth them and remove the smoker's-lips effect. In 9(c) the output has been taken in an office environment, which shows the effectiveness of performing skin color segmentation and Haar feature detection as the initial steps. In Figure 9(d) the subject is an African female; the skin color is now lighter, improved and smoothed, and a light
Fig. 9. (a) Input and output image of an Asian female after applying digital makeup (b) Input and Output image after applying digital makeup of a dark Asian man. (c) Input and output image after applying digital makeup. (d) Input and output image after applying digital makeup on an African female.
colored lipstick suiting the skin tone has been applied by the system. The oily-skin effect has also been removed.
5
Conclusion and Future Work
We presented a system for digital makeup. The system is based on two important factors: (a) the dependency of the skin tone on ethnicity and (b) the gender of the subject. The system customizes the makeup based on these two factors and retrieves the makeup values via color matching from a predefined data set. The reference image is chosen on the basis of the two parameters and the color information. We employed robust machine learning techniques: (a) fuzzy decision tree based skin color segmentation, (b) Haar feature based face, eye and lip detection, (c) SVM based skin pixel color ethnicity categorization, and (d) ASM based lip and eye extraction. Then we used fundamental image processing techniques for improving the appearance of the skin region and enhancing the lips. The system is fast and we are currently exploring optimizations in order to implement it on an embedded platform with real-time performance. Eyes are relatively more complicated to manipulate with fast and simple operations; we are currently pursuing this as future work.
References 1. Baluja, S.: Automated image-orientation detection: a scalable boosting approach. Pattern Anal. Appl. 10(3), 247–263 (2007) 2. Bhatt, R.B., Gopal, M.: Erratum: "neuro-fuzzy decision trees". Int. J. Neural Syst. 16(4), 319 (2006) 3. Bhatt, R.B., Gopal, M.: Frct: fuzzy-rough classification trees. Pattern Anal. Appl. 11(1), 73–88 (2008) 4. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: COLT, pp. 144–152 (1992) 5. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm 6. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models-their training and application. Computer Vision and Image Understanding 61(1), 38–59 (1995) 7. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997) 8. Microsoft Corporation: System and method for applying digital make-up in video conferencing. US Patent Application US 20060268101 (2006) 9. Leyvand, T., Cohen-Or, D., Dror, G., Lischinski, D.: Data-driven enhancement of facial attractiveness. ACM Trans. Graph. 27(3) (2008) 10. Milborrow, S., Nicolls, F.: Locating facial features with an extended active shape model. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 504–513. Springer, Heidelberg (2008) 11. MyPerfectPicture. Myperfectpicture (2009), http://www.myperfectpicture.com/ 12. Ojima, N., Yoshida, K., Osanai, O., Akasaki, S.: Image synthesis of cosmetic applied skin based on optical properties of foundation layers. In: International Congress of Imaging Science, pp. 467–468 (1999) 13. Pisarevsky, V., et al.: OpenCV, the open computer vision library (2008), http://mloss.org/software/view/68/ 14. Tsumura, N., Ojima, N., Sato, K., Shiraishi, M., Shimizu, H., Nabeshima, H., Akazaki, S., Hori, K., Miyake, Y.: Image-based skin color and texture analysis/synthesis by extracting hemoglobin and melanin information in the skin. ACM Trans. Graph. 22(3), 770–779 (2003) 15. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: CVPR (1), pp. 511–518. IEEE Computer Society, Los Alamitos (2001)
An Instability Problem of Region Growing Segmentation Algorithms and Its Set Median Solution Lucas Franek and Xiaoyi Jiang Department of Mathematics and Computer Science University of Münster, Germany {lucas.franek,xjiang}@uni-muenster.de
Abstract. The region growing paradigm is a well known technique for image segmentation. In the first part of this work, the robustness of region growing algorithms is studied. It is shown that within a small parameter range, which leads to good segmentation results in the majority of cases, bad segmentation results may occur. Furthermore the influence of noise on segmentation results is studied. In fact, instability is a problem of region growing methods and reasons for its occurrence are discussed. In the second part of the work, a solution for this problem based on the set median concept is proposed. The set median is adopted to combine image ensembles and stability is achieved. Experimental results illustrate the performance of our approach.
1
Introduction
The region growing paradigm is one of the most widely used techniques for image segmentation because of its competitive segmentation performance and high computational efficiency. These algorithms often have parameters, which need to be chosen for each image and each algorithm individually to obtain good segmentation results. The exact parameter setting, which yields the best segmentation result, is usually unknown. Therefore finding the optimal parameter setting is a difficult and important problem. In this work we assume to know a small reasonable range of parameters which leads to good segmentation results. Our purpose is to examine how stable region growing algorithms are within this small parameter range. For this work we have several motivations. We first show that even within a range of parameters yielding good segmentation results in the majority of cases, often parameters are observed which yield comparatively bad results. It is very unexpected that within a sequence of consecutive parameters yielding good segmentation results there may be some parameters yielding bad segmentation results. We study the stability of two state-of-the-art image segmentation algorithms based on the region growing paradigm [1,2]. Our empirical study will show that instability is in fact a substantial problem of these algorithms and we discuss the frequency of such instabilities. Robustness of region growing algorithms was examined by Wan and Higgins [3] in the context of the selection
of initial growing points (seeds). The authors developed theoretical criteria for a subclass of region growing algorithms (symmetric region growing algorithm) and proved robustness for this subclass. In contrast, we do not study the seed selection problem explicitly, but rather the general parameter selection problem. As a second aspect of the instability problem we study the influence of noise on region growing algorithms. For this reason we fix parameters of the segmentation algorithms and investigate the segmentation performance on noisy images. Thirdly we propose to solve this stability problem by computing the set median of an ensemble of segmentations of an image. Unlike the approach in [3], we do not need to modify the segmentation algorithm itself to receive robust segmentation results. The paper is organized as follows. In section 2 we will study in detail the robustness of region growing algorithms. Theoretical reasons for instability are given in section 3. In section 4 the concept of set median is proposed for solving the instability problem and experimental results are shown in section 5. We conclude in section 6.
2
Motivation
In this section we analyse two region growing algorithms extensively. It is shown that among a set of parameters which yields good segmentation results, there may be some parameters which yield remarkably bad segmentation results. In the next step we perturb the input images with Gaussian noise and study how segmentations are influenced by noise. 2.1
Instability Caused by Variation of Parameters
We first explore the parameter space for each segmentation algorithm and 300 images of the Berkeley segmentation dataset [4]. To evaluate segmentation performance the F-measure [5] or the normalized mutual information (NMI) [6] may be used. For consistency of our presentation throughout this work we decide to use NMI as performance measure and as distance measure in our segmentation optimization method. Let Sa and Sb denote two labellings, each representing one segmentation. Furthermore |Sa| and |Sb| denote the number of groups within Sa and Sb. Then NMI is formally defined by

\mathrm{NMI}(S_a, S_b) = \frac{\sum_{h=1}^{|S_a|} \sum_{l=1}^{|S_b|} |R_{h,l}| \log \frac{n \cdot |R_{h,l}|}{|R_h| \cdot |R_l|}}{\sqrt{\left(\sum_{h=1}^{|S_a|} |R_h| \log \frac{|R_h|}{n}\right) \left(\sum_{l=1}^{|S_b|} |R_l| \log \frac{|R_l|}{n}\right)}}    (1)
where Rh and Rl are regions of the same labels from Sa and Sb , respectively. Rh,l denotes the common part of Rh and Rl , and n is the image size. The NMI
Fig. 1. Exploring the parameter space for FH and JSEG. (a) Whole parameter space. (b) Detail view. (c) Best segmentation within the detailed view: NMI = 0.70, k = 460, σ = 1.85. (d) Worst segmentation within the detailed view: NMI = 0.26, k = 450, σ = 1.7. (e) Parameter space for image 1g. (f) Parameter space for image 1i. (g) Best segmentation for σ ∈ (1.6, 1.8): NMI = 0.62, σ = 1.72. (h) Worst segmentation for σ ∈ (1.6, 1.8): NMI = 0.39, σ = 1.74. (i) Best segmentation for σ ∈ (1.2, 1.4): NMI = 0.76, σ = 1.34. (j) Worst segmentation for σ ∈ (1.2, 1.4): NMI = 0.61, σ = 1.32. (Plots show NMI versus σ.)
measures the correspondence between the two labellings and its value domain is [0, 1]. An NMI value of 1 means that the two input labellings are equal, whereas a low NMI value indicates low correspondence of the labellings. Therefore a high (low) NMI value indicates a good (bad) segmentation. Human segmentations from the Berkeley dataset are used as ground truth images.
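For reference, Eq. (1) can be computed directly from two label images; the sketch below follows the square-root-normalized form reconstructed above and is only an illustrative transcription, not the authors' code.

```python
import numpy as np

def nmi(seg_a, seg_b):
    """Normalized mutual information, Eq. (1), between two label images."""
    a = seg_a.ravel()
    b = seg_b.ravel()
    n = a.size
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    # Joint region sizes |R_{h,l}| and marginal region sizes |R_h|, |R_l|.
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1)
    rh = joint.sum(axis=1)
    rl = joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(n * joint[nz] / np.outer(rh, rl)[nz]))
    denom = np.sqrt(np.sum(rh * np.log(rh / n)) * np.sum(rl * np.log(rl / n)))
    return mi / denom
```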
We use the graph-based image segmentation algorithm (FH) [2] which has three parameters: a smoothing parameter (σ), a threshold function (k) and a minimum component size (min size). We choose a dense parameter grid with a total of 2500 parameter sets: σ = 0.2, 0.25, . . . , 2.65 and k = 100, 110, . . . , 590. We fix min size because our empirical tests show that the segmentation results are not sensitive to changes of this parameter. Fig. 1a shows the resulting parameter space for the image in Fig. 1c obtained by the FH algorithm. Furthermore, Fig. 1b shows a detail of the parameter space. The best and worst results within the detailed parameter space are displayed in Fig. 1c and 1d, respectively. In this paper our purpose is to extract such parameter regions from the whole parameter space, where the majority of segmentations are good ones and some bad segmentations are observed. In the next section we will empirically study the occurrence and frequency of such parameter regions. For purposes of comparison we also use the JSEG algorithm [1] which combines a color quantization approach with the region growing paradigm. A Gaussian filter with the parameter σ is used for preprocessing. The other parameters of JSEG are set to default values, because our tests show that the segmentation results are not very sensitive to changes of these parameters. Therefore in this case we explore a dense one-dimensional parameter space consisting of a total of 100 parameters: σ = 0.0, 0.02, . . . , 1.98. Some resulting parameter spaces computed by the JSEG algorithm can be seen in Fig. 1e, 1f. Furthermore, Fig. 1g – 1j demonstrates that the difference between the best and worst segmentation result within a small parameter range (σ ∈ (1.6, 1.8) resp. σ ∈ (1.2, 1.4)) is significant. In both cases (FH and JSEG) the differences in segmentation performance are remarkable although the changes in the parameters are only small. Often a small change in the smoothing parameter (Δσ = 0.02) suffices to lead to remarkable differences in segmentation quality. It must be emphasized that the peaks associated with bad results are often unexpected, as can be seen from the results. In such situations we refer to these bad results, in analogy to salt and pepper noise, as “outliers”. We conclude that region growing algorithms are very sensitive to Gaussian smoothing whereas the sensitivity to the other parameters (e.g. k) is not very significant. On the other hand Gaussian filtering is often reasonable in the case of noisy images or to avoid small segments in the segmentation result. Note that image smoothing is often part of segmentation algorithms and can enhance segmentation results significantly, even if the images are not noisy. For this reason, in principle, smoothing should not be avoided.
2.2
Instability Caused by Noise
Now for every image of the Berkeley dataset noisy images are generated by adding Gaussian noise with zero mean and standard deviation 10^{-3}. If images are scaled to [0, 255], this standard deviation corresponds to a deviation of about one grey level. For the image in Fig. 2a 1000 noisy images were computed. The noisy images were segmented and NMI histograms were plotted. The NMI values form a Gaussian distribution with a standard deviation of 0.01 (for FH) and 0.03 (for JSEG), respectively. Similar results are obtained for other images. This result is a hint
Fig. 2. NMI histogram: the image (a) was perturbed 1000 times by Gaussian noise (zero mean and standard deviation 10^{-3}); (b) and (c) show the histograms of the segmentation results for FH and JSEG, respectively.
that region growing algorithms are not stable if Gaussian noise is added. If the segmentation algorithm is stable the quality of segmentation results should not differ much. A high standard deviation of NMI-histogram indicates an unstable algorithm. Suppose a couple of perturbed images are given. In this situation it is desirable to avoid the worst segmentation results, and to match at least the mean segmentation result. In section 5 we propose a method to compute an approximation of the mean segmentation result without knowing ground truth.
3
Discussion and Reasons for Instability
In this section we will empirically analyse the frequency of outliers. It will also be explained why small changes in the input image suffice to provoke instability.
3.1
Frequency of Outliers in Parameter Space
As discussed in the last section, our purpose is to extract from the whole parameter space small parameter regions with good segmentations and some outliers. Therefore, for every image of the Berkeley dataset [4], we examine image ensembles consisting of segmentations of an image which belong to neighbouring parameter sets in the dense parameter space. For the FH algorithm we examine image ensembles consisting of 5 × 5 = 25 images, for JSEG we examine image ensembles of 10 images. We distinguish between strong and weak outliers. Strong (weak) outliers are segmentation results with an NMI lower than 15% (10%) of the maximum NMI of the current image ensemble. These bounds are chosen by experience with a view to distinguishing between qualitatively good and bad segmentations. For every image of the Berkeley dataset the number of regions with outliers is counted and the results for both algorithms are shown in Fig. 3a. In this figure the images were sorted by the number of regions with outliers. Per image, 100 parameter regions are analysed (each consisting of 25 parameters). We conclude that in about 33% of all images (the images between 200 and 300) more than
Fig. 3. Frequency of parameter regions with outliers (strong and weak). (a) FH – total of 100 parameter regions per image. (b) JSEG – total of 10 parameter regions per image. (Plots show the number of regions with outliers per image.)
10 % (3 %) of all regions within the whole parameter space are good ones affected by weak (strong) outliers. Similar results are obtained for JSEG (Fig. 3b). In this case, per image 10 parameter regions are explored. Half of all images are affected by weak outliers and about 30 % are affected by strong outliers.
3.2
Reasons for Instability
It is well known that region growing methods suffer from the chaining problem [7,3]: Pixels of different intensity values can be joined into one region when there exists a chain of pairwise similar pixels which connects them. Furthermore the direction, in which one region grows, is dependent on the order that pixels are examined. In each iteration region growing algorithms search the unlabelled pixel with the lowest intensity difference between the pixel and its neighbouring region [8,2]. Additionally, the features of each region are adaptively updated as the region growing proceeds. Suppose the input image changes a little, like in the case of image smoothing or noise. This change could cause a different sequence in the region growing and therefore slightly different input images may lead to different regions with different features.
4
Combining Image Ensembles with the Concept of Set Median
In this section we first introduce the general concept of set median and then we show how it can be adopted to combine image ensembles. Let X = {x_1, . . . , x_N} be a set of observations and d : X × X → R_0^+ a distance function defined on this set. Then the set median x̂ is defined by

\hat{x} = \arg\min_{x \in X} \sum_{i=1}^{N} d(x, x_i).    (2)

The sum in (2) is called the sum of distances (SOD). Note that the minimizer x̂ has to be within the set X.
The concept of generalized median was applied in [9] to combine image ensembles. In contrast to set median, the generalized median does not have to be within the set X and the computation is more complex. In this work we adopt the concept of set median for the combination of segmentation results by choosing NMI as distance function for labelled images. Because we used NMI for performance evaluation in section 2, using another distance measure could falsify results. The set median for each image ensemble is determined by computing the SOD for each image of the ensemble. It is the segmentation which leads to the lowest SOD. Let n denote the number of pixels for one image, then computation of set median of N images has the complexity O(|Sa | |Sb | n N ).
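The set median computation is a straightforward minimization of the SOD over the ensemble; a minimal sketch is given below. Since NMI is a similarity, the sketch converts it into a dissimilarity via 1 − NMI, which is an assumption on our part rather than the authors' exact choice.

```python
import numpy as np

def set_median(segmentations, similarity):
    """Return the ensemble member with the lowest sum of distances (SOD).

    segmentations: list of label images; similarity: e.g. the nmi() function,
    converted here to a dissimilarity 1 - similarity.
    """
    n = len(segmentations)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = 1.0 - similarity(segmentations[i], segmentations[j])
    sod = d.sum(axis=1)
    return segmentations[int(np.argmin(sod))], sod
```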
5
Experimental Results and Discussion
We have applied the concept of set median to 300 natural images of size 481 × 321 from the Berkeley segmentation dataset [4], each having multiple manual segmentations. For simplicity, for every image one manual segmentation is randomly chosen as the reference image. Because it is chosen randomly, it does not affect the overall results. The performance of the segmentation results is quantitatively evaluated by computing the NMI between the segmentation and the reference image.
5.1
Computing Set Median of Image Ensembles in Parameter Space
The same parameter settings are used as in section 2 and set median is computed for every small parameter region with outliers. The examined extracted parameter regions are the same as in section 3 (FH: image ensembles of 25 images, JSEG: image ensembles of 10 images). Fig. 4 shows results for the FH algorithm and for the JSEG algorithm. For lack of space only the best, the worst segmentation and set median are shown. For JSEG set median is computed not only for the extracted parameter regions, but for the whole one-dimensional parameter space. For this purpose the parameter space, consisting of 100 samples, is divided into 10 equidistant parameter ranges each consisting of 10 parameter samples. Then set median is computed for each parameter range. Some images and the corresponding statistics are shown in Fig. 5. The continuous graph represents the segmentation results computed by JSEG, whereas the dashed graph illustrates the set median. For the skyscraper the best, worst segmentation and set median within the parameter range σ ∈ [1.8, 1.98] are shown in the last row of Fig. 4. Note that in most cases the worst segmentations are avoided and our intention, to eliminate peaks and achieve stability, is satisfied very well. The statistic in table 1 demonstrates the performance of our approach. For every image ensemble with outliers the NMI of the best segmentation, the worst NMI, the average NMI and the set median are compared. The average NMI is determined by computing the mean NMI of the images within the current image ensemble. The statistic presents the average deviation from the best segmentation within an image ensemble with outliers.
Fig. 4. Comparison of segmentations within one image ensemble. First column: best. Second column: worst. Third column: set median. Rows 1–3: FH. Row 4: JSEG. NMI values (best / worst / set median): row 1: 0.77 / 0.63 / 0.75; row 2: 0.78 / 0.65 / 0.78; row 3: 0.94 / 0.77 / 0.94; row 4: 0.76 / 0.68 / 0.76.
For example, for FH and strong outliers, the average deviation of the NMI of the worst segmentation from the best segmentation is 22 %, whereas the deviation of the average segmentation is 6 % and the deviation of the set median is 2 %. Furthermore, in 93 % of all cases the NMI of the set median is better than the average NMI. For JSEG the following conclusions can be stated: for strong (weak) outliers, in 85 % (80 %) of all cases the set median is greater than the average NMI. We conclude that applying the concept of set median to region growing algorithms is a reasonable way to achieve stability. In the majority of cases the computation of the set median avoids outliers and achieves robustness. Furthermore, it is worth mentioning that it is a general approach which works on a “meta-level”, because it is applied to the resulting image ensembles and not to the segmentation algorithms. Thus the set median concept can be applied to every segmentation algorithm.
Fig. 5. Set median applied to JSEG. Images with the corresponding statistics below them. Continuous graph: NMI evaluation of the segmentation results obtained by JSEG. Dashed graph: set median computed for every 10 neighbouring image ensembles in parameter space.

Table 1. Average deviation in percent from the best results within one image ensemble

Algorithm                      |          FH               |          JSEG
Outlier                        | Weak: 10 % | Strong: 15 % | Weak: 10 % | Strong: 15 %
Worst segmentation             |     11     |      22      |     12     |      25
Average segmentation           |      4     |       6      |      4     |       9
Set median                     |      2     |       2      |      3     |       4
Set median better than average |     84     |      93      |     80     |      85
5.2
Computing Set Median of Noisy Images
For every image in the Berkeley dataset 100 noisy images are generated as described in section 2. As mentioned in section 2, in the situation of a number of noisy images it is desirable to avoid the worst segmentation and to match at least the segmentation with the average NMI, without knowing the ground truth. This is accomplished by computing the set median of the segmentations of all 100 noisy images. To analyse the performance of our method we extract those situations where the worst segmentation is significantly worse than the average NMI. The barrier for the classification of remarkably bad segmentations was chosen by experience (average NMI − 0.1). For FH (resp. JSEG) this is the case in 18% (32%) of all images, as can be seen in Table 2. Table 2 shows that the set median is close to the average NMI or even better. For example, for FH only in 0.6% of all cases is the set median lower than (average NMI − 0.05) – this barrier was chosen to illustrate how close the set median is to the average NMI.
Table 2. Instability of noisy images: the set median is close to the average segmentation or better

Algorithm                      | FH     | JSEG
Worst < (average − 0.1)        | 18 %   | 32 %
Set median < (average − 0.1)   | 0 %    | 0.7 %
Set median < (average − 0.05)  | 0.6 %  | 2 %
Set median > average           | 74 %   | 67 %

6
Conclusion
In this work we studied the instability of region growing segmentation algorithms, which is a substantial problem of such algorithms. The frequency of instabilities caused by varying the smoothing parameter σ was empirically studied. A method which works on a meta-level, and thus can be applied to any segmentation algorithm, was proposed to avoid such instabilities. The intention of our approach was to eliminate the peaks associated with bad segmentation results. As a second application scenario we computed the set median of noisy images to avoid the worst segmentation results. Experimental results demonstrated the performance of our approach and showed that the proposed approach satisfies our intention very well. Future work will analyse the generalized median used in [9] to achieve stability. Other factors have to be examined which can cause small changes in the input image and thus can have a significant influence on the segmentations.
References 1. Deng, Y., Manjunath, B.S.: Unsupervised segmentation of color-texture regions in images and video. IEEE Trans. Pattern Analysis and Machine Intelligence 23, 800– 810 (2001) 2. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Computer Vision 59, 167–181 (2004) 3. Wan, S.Y., Higgins, W.E.: Symmetric region growing. IEEE Trans. Image Process 12, 1007–1015 (2003) 4. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. ICCV, vol. 2, pp. 416–423 (2001) 5. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. Pattern Anal. Machine Intell 26, 530–539 (2004) 6. Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J. on Machine Learning Research 3, 583–617 (2002) 7. Priese, L., Rehrmann, V.: A fast hybrid color segmentation method. In: Proceedings Mustererkennung, DAGM Symposium 1993, pp. 297–304. Springer, Heidelberg (1993) 8. Adams, R., Bischof, L.: Seeded region growing. IEEE Trans. Pattern Anal. Mach. Intell. 16(6), 641–647 (1994) 9. Wattuya, P., Jiang, X.: Ensemble combination for solving the parameter selection problem in image segmentation. In: da Vitoria Lobo, N., Kasparis, T., Roli, F., Kwok, J.T., Georgiopoulos, M., Anagnostopoulos, G.C., Loog, M. (eds.) S+SSPR 2008. LNCS, vol. 5342, pp. 392–401. Springer, Heidelberg (2008)
Distance Learning Based on Convex Clustering Xingwei Yang1 , Longin Jan Latecki1 , and Ari Gross2 1
Dept. of Computer and Information Sciences Temple University, Philadelphia {xingwei,latecki}@temple.edu 2 Computer Science Dept. Queens College, CUNY, New York
[email protected]
Abstract. Clustering has been among the most active research topics in machine learning and pattern recognition. Though recent approaches have delivered impressive results in a number of challenging clustering tasks, most of them do not solve two problems. First, most approaches need prior knowledge about the number of clusters, which is not practical in many applications. Second, non-linear and elongated clusters cannot be clustered correctly. In this paper, a general framework is proposed to solve both problems by convex clustering based on a learned distance. In the proposed framework, the data is transformed from elongated structures into compact ones by a novel distance learning algorithm. Then, a convex clustering algorithm is used to cluster the transformed data. The presented experimental results demonstrate successful solutions to both problems. In particular, the proposed approach is very suitable for superpixel generation; superpixels are a common basis for recent high-level image segmentation algorithms.
1
Introduction
Clustering aims at finding hidden structure in a data set and is an important topic in machine learning and pattern recognition. Several methods, such as K-means [1], have been developed to solve the clustering problem for datasets which have a compact shape. However, they fail to handle data with complex cluster shapes, i.e., data that is not in the shape of point clouds, but instead forms curved and elongated shapes. Recently there have been some advances. Spectral Clustering [2] can handle this type of data very well and path-based clustering [3] also demonstrates excellent performance on some clustering tasks involving highly non-linear and elongated clusters in addition to compact clusters. However, these algorithms must have prior knowledge of the number of clusters, which is not practical in many applications. In this paper, a new distance learning approach is proposed to transform elongated structures into compact ones. It is interleaved with a convex clustering method, which is used to find a globally optimal solution for clustering. Apart from the Gaussian kernel parameter, it is a completely parameter-free clustering principle. Learned distances are based on the convex clustering, and these distances in turn are
used for convex clustering, and so on. We have a natural stopping criterion: we stop when the cluster membership remains unchanged. In addition to applying the proposed method to some toy examples, we demonstrate its excellent performance on image segmentation. The problem of segmenting an image into regions remains a great challenge for computer vision, and clustering has been crucial in attempting to solve this problem [3,4]. We also show that the proposed approach is very suitable for over-segmenting images into so-called superpixels, where superpixels are small connected regions in the image. Commonly, image segmentation algorithms like [4,5] are used to generate superpixels, and recently many algorithms have been proposed for high-level grouping of the superpixels, e.g., [6]. The rest of this paper is organized as follows. Some related work is briefly introduced in Section 2. In Sections 3 and 4, the proposed distance learning and convex clustering approach is introduced in detail. Experimental results are given in Section 5, followed by the conclusion and discussion in Section 6.
2
Related Work
There are several approaches to distance learning, such as supervised distance metric learning and unsupervised distance metric learning. Most of the unsupervised distance metric learning methods are embedding methods, either linear or non-linear. Well known algorithms for nonlinear unsupervised dimensionality reduction are ISOMAP [7], Locally Linear Embedding (LLE) [8], and Laplacian Eigenmap (LE) [9]. ISOMAP seeks the subspace that best preserves the geodesic distances between any two data points, while LLE and LE focus on the preservation of the local neighbour structure. Among the linear methods, Principal Component Analysis (PCA) [10] finds the subspace that best preserves the variance of the data; Multidimensional Scaling (MDS) [11] finds the low-rank projection that best preserves the inter-point distances given by the pairwise distance matrix. In addition to the embedding methods, the path-based distance [3] transforms an original distance into a path-based distance, defined as the minimum over all paths of the maximum edge weight along the path. It can transform elongated data into compact data, but this approach relies on a greedy algorithm, which often leads to unsatisfactory solutions. Outstanding results have been reported for the exemplar-based affinity propagation [12], which can automatically decide the number of clusters, but it cannot guarantee convergence, and therefore it must be stopped manually. Lashkari and Golland [16] have proposed a framework for constraining the search space of general mixture models to achieve global optimality of the solution, but this framework can deal only with data shaped as circular point clouds. We propose to first transform the data by learning new distances hierarchically so that the clusters under the new distance are more compact. This allows us to apply any classical clustering algorithm, yet for our applications we selected the algorithm by Lashkari and Golland, since it is guaranteed to yield an optimal clustering and it automatically determines the number of clusters.
Fig. 1. Clustering results of levels 1, 2, 3, and 4. The corresponding learned distances are shown in the second row.
3
General Framework for Distance Learning
Given a set of points X = {x_1, . . . , x_n} and a distance function D : X × X → R^+, where R^+ is the set of nonnegative real numbers, we show that it is possible to learn a new distance function d : X × X → R^+ for any hierarchical clustering algorithm. Any kind of clustering method can be used to cluster the data points into k_l clusters {C_l^1, . . . , C_l^{k_l}} at level l, where C_0^i = {x_i} for i = 1, . . . , n, and the number of clusters cannot increase, i.e., k_l ≥ k_{l+1}. The distance at the beginning level, level 0, is the original distance among the input data points in X. The distance between any data point a in C_l^i and any data point b in C_l^j at level l, for i ≠ j, is

d_{l+1}(a, b) = \min_{p \in C_l^i,\, q \in C_l^j} D(p, q).    (1)
The distance d_{l+1}(a, b) gives the learned distance of all data points in C_l^i to all data points in C_l^j. If a, b ∈ X belong to the same cluster at level l, then d_{l+1}(a, b) = d_l(a, b), i.e., the distances between points in the same cluster do not change. The hierarchical clustering process automatically stops when the clusters at level l + 1 are the same as at level l. The new learned distance is then defined as

d(a, b) = d_l(a, b).    (2)

For brevity, we denote d(x_i, x_j) = d_ij in the remainder of this paper. Our intuition is that the learned distance d can adequately represent the manifold structure of the data point set. Since data points which have a smaller distance will be clustered into one cluster rather than two, this is consistent with the intuition that data points should have a small intra-cluster distance and a large inter-cluster distance. Fig. 1 illustrates the intuition behind the hierarchical distance learning algorithm. It shows the clustering results at each level of the approach together with the obtained new distances in the second row. If two points are in the same manifold, they will ultimately be merged into one cluster and will have a small distance.
We summarize the clustering based on the learned distance as: For a given input symmetric (n×n) matrix D of nonnegative pairwise dissimilarities between n objects, with zero diagonal elements, find clusters based on the learned distance d (instead of the original distances). Any clustering method could be used, but we use the clustering algorithm presented in the next section. In fact, we use this algorithm twice, first to learn new distances, and then to cluster the data based on the new distances.
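A minimal sketch of this interleaved scheme is given below; cluster_fn stands for the clustering step of the next section, and the stop test assumes it returns consistent label ids between levels (a full implementation would compare the partitions themselves).

```python
import numpy as np

def learn_distance(D, cluster_fn):
    """Hierarchical distance learning on an (n x n) matrix of original distances D.

    cluster_fn maps a distance matrix to a label vector; in the paper this is
    the convex clustering step described in the next section.
    """
    d = D.copy()
    labels = np.arange(D.shape[0])          # level 0: every point is its own cluster
    while True:
        new_labels = cluster_fn(d)
        if np.array_equal(new_labels, labels):
            return d                        # clusters unchanged: stop, Eq. (2)
        # Eq. (1): the distance between two different clusters is the minimum
        # ORIGINAL distance over all cross-cluster point pairs; intra-cluster
        # distances are left unchanged.
        for a in np.unique(new_labels):
            for b in np.unique(new_labels):
                if a == b:
                    continue
                ia = np.where(new_labels == a)[0]
                ib = np.where(new_labels == b)[0]
                d[np.ix_(ia, ib)] = D[np.ix_(ia, ib)].min()
        labels = new_labels
```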
4
From Clustering to Distance Learning
We begin with an overview of the convex clustering approach in [16]. This approach can automatically determine the number of clusters for each level, and [16] proved that the approach obtains a globally optimal solution for a given Gaussian kernel. At each level of the convex clustering approach, q_{li} represents the probability that data point i at level l is a cluster center and s_{lij} = exp(−β d_{lij}) represents the similarity between the two points i and j. According to the approach in [16], at each level the following two steps are iterated:

z_{li}^{(t)} = \sum_{j=1}^{n} s_{lij} q_{lj}^{(t)}    (3)

q_{lj}^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \frac{s_{lij} q_{lj}^{(t)}}{z_{li}^{(t)}}    (4)
From the view of point i, Eq. (3) represents how all the other points influence it. Eq. (4) represents, for a fixed point j, how much point j influences all the other points. In particular, for a pair of points j and i we have the ratio

\frac{s_{lij} q_{lj}^{(t)}}{z_{li}^{(t)}} = \frac{s_{lij} q_{lj}^{(t)}}{\sum_{j=1}^{n} s_{lij} q_{lj}^{(t)}}    (5)
The term q_{lj}^{(t)} represents the probability or strength of point j as a cluster center; the numerator s_{lij} q_{lj}^{(t)} represents how strongly point i belongs to point j through the relation strength s_{lij} between them; and the denominator represents the total probability that point i belongs to all the other points. Therefore, the ratio represents the normalized probability that point i belongs to point j: the higher the ratio, the more probable it is that point i belongs to point j. As the procedure finds the optimal solution [16], the iteration stops when the value \sum_{j=1}^{n} |q_{lj}^{(t+1)} − q_{lj}^{(t)}| is sufficiently small. Then, the soft assignment of point x_{li} to cluster center j, r_{ij}^{l} = P(l_j | x = x_{li}), represents a probability distribution over the point indices l_j : j = 1, . . . , n and is computed as

r_{ij}^{l} = \frac{s_{lij} q_{lj}^{(t)}}{z_{li}^{(t)}}    (6)

The point x_{li} is then assigned to a cluster based on

\mathrm{assignment}(x_{li}) = \arg\max_{j} r_{ij}^{l} = \arg\max_{j} \frac{s_{lij} q_{lj}^{(t)}}{z_{li}^{(t)}}    (7)
The soft assignment in this paper is different from the method in [16]. In [16], all q_j that are below a certain threshold are set to zero and the entire distribution is renormalized over the remaining indices. This effectively excludes the corresponding points as possible exemplars and reduces the cost of the following iterations. However, in practice, the choice of the threshold is very difficult and crucial for the results. For real applications, it is nearly impossible to choose a proper threshold. The proposed approach uses the hierarchical framework and therefore can use (7) directly to assign points to the cluster centers, since points which are incorrectly assigned will be corrected in the next level's clustering. According to the assignment of (7), the algorithm automatically finds the number of clusters, which is denoted k_l. Based on (1), the distance between different clusters is then updated, and the clustering is repeated with the new distances. The process is repeated until the new clusters are the same as the clusters from the previous iteration.
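One level of this procedure can be sketched as follows; the iteration count and tolerance are assumed values, and the function returns the hard assignment of Eq. (7), from which the number of clusters k_l is simply the number of distinct exemplars.

```python
import numpy as np

def convex_clustering_level(d, beta, max_iter=200, tol=1e-6):
    """One level of the convex clustering step, Eqs. (3)-(7).

    d: current (learned) distance matrix; returns for every point the index
    of the exemplar (cluster center) it is assigned to.
    """
    s = np.exp(-beta * d)                            # similarities s_ij
    n = d.shape[0]
    q = np.full(n, 1.0 / n)                          # uniform initial exemplar weights
    for _ in range(max_iter):
        z = s @ q                                    # Eq. (3)
        q_new = q * (s / z[:, None]).mean(axis=0)    # Eq. (4)
        if np.abs(q_new - q).sum() < tol:            # stopping criterion from the text
            q = q_new
            break
        q = q_new
    r = s * q[None, :] / (s @ q)[:, None]            # soft assignment, Eq. (6)
    return r.argmax(axis=1)                          # hard assignment, Eq. (7)
```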
5
Experimental Results
In order to show the advantages of the framework of learning distance and the globally optimal clustering algorithm, two types of experiments were performed. The first type involves toy examples that illustrate the intuition behind our approach. The second type is the real-world application of image segmentation. 5.1
Toy Examples
The results in Fig. 2 show the advantage of the proposed distance learning algorithm on clusters with more complex shapes than the two half moons in Fig. 1. There are 688 data points, each of which belongs to one of three classes. Again the algorithm in [16], which works with the original Euclidean distances, is not able to yield correct clusters for any parameter setting. In order to show the effect of the proposed distance learning approach, Fig. 3 shows the distance matrix before and after the learning process, which represents all of the pairwise distances between points. The above two toy examples show the features and the effectiveness of the proposed approach. In order to show that the proposed approach can be used in a real application, we apply the algorithm to image segmentation, which is still an unsolved problem in computer vision.
5.2
Image Segmentation
Image segmentation is an integral part of image processing applications. Although in recent years image segmentation based on clustering algorithms has
Fig. 2. (a) Clustering result of the proposed approach. (b) Clustering result of [16] with the same β as (a). (c) Clustering result of [16] with adjusted β.
Fig. 3. (a) The distance matrix before the proposed distance learning approach. (b) The distance matrix after the proposed distance learning approach.
seen great success, the process is not fully automatic and the results are not as good as the vision community would like. However, by using the proposed distance learning algorithm and the modified clustering algorithm of [16], we can automatically find the segmentation results without a predefined number of clusters. In our experiments, we use several gray-scale images from the Weizmann dataset consisting of 328 horse images [14]. Each pixel of an image is viewed as a data point and the distance between two pixels combines the difference of their gray values and the Euclidean distance between their coordinates. Therefore, the s_{lij} in the above formulas can be obtained by s_{lij} = exp(−β1 d1_{lij}) · exp(−β2 d2_{lij}), where d1_{lij} is the difference of the gray values between two points at level l and d2_{lij} is the Euclidean distance between two points at level l. If the image size is m × n, the distance matrix will be of size (m · n)^2, which is too large (reducing the storage requirement will be addressed in future work). Therefore, in our experiments the test images are relatively small. However, they still demonstrate the advantage of the proposed distance learning algorithm and the proposed clustering algorithm. Fig. 4 shows the segmentation results of our method compared to [16] on our first test image. It is obvious that the proposed approach, as shown in Fig. 4(c), finds
Fig. 4. (a) The original image. (b) Segmentation result of [16]. (c) Segmentation result of the proposed approach.
the optimal segmentation results automatically, whereas [16], as shown in Fig. 4(b), cannot find any informative segmentation results for this image. We also compared the image segmentation results of the proposed approach with the path-based distance algorithm [15]. For both approaches, we first used the corresponding algorithms to obtain new distances, and then used our modified version of the clustering algorithm in [16] to segment the images. As can be seen in Fig. 5, the new distances learned by the proposed method perform significantly better than the path-based distances of [15]. From the original image, it can be seen that the gray values of the foreground and background are not uniformly distributed, which makes the clustering very difficult. However, the proposed approach can still find the optimal segmentation results, which shows the robustness of our approach.
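The pixel-level similarity used for segmentation combines the two distances described above; a minimal sketch for a small grayscale image is shown below. The explicit (m·n) × (m·n) matrices mirror the storage limitation mentioned in the text, and β1, β2 are free parameters.

```python
import numpy as np

def pixel_similarity(gray, beta1, beta2):
    """s_ij = exp(-beta1 * d1_ij) * exp(-beta2 * d2_ij) for all pixel pairs.

    d1 is the gray-value difference, d2 the Euclidean distance between pixel
    coordinates; only feasible for small images since the matrices are (m*n)^2.
    """
    h, w = gray.shape
    vals = gray.reshape(-1, 1).astype(np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float32)
    d1 = np.abs(vals - vals.T)                                            # gray-value difference
    d2 = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)  # spatial distance
    return np.exp(-beta1 * d1) * np.exp(-beta2 * d2)
```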
Fig. 5. (a) The original image from [16]. (b) Segmentation result based on distances learned according to [15]. (c) Segmentation result of the proposed approach.
Fig. 6 shows a few more examples of image segmentation with the proposed method on the Weizmann dataset [14]. The upper row shows the original images and the lower row shows the segmentation results obtained after 4, 4, and 3 iterations, correspondingly. Morphological image postprocessing has been used to remove isolated segmentation pixels. The results are correct, even though only very elementary image information has been used: gray level differences and pixel
Fig. 6. Examples of image segmentation with the proposed method (bottom row) obtained after 4, 4, and 3 iterations, correspondingly. We used only very elementary image information: gray level differences and pixel distances. The middle row shows superpixels obtained by an intermediate stage of our algorithm after 3, 3, and 2 iterations, correspondingly.
distances. The middle row demonstrates the potential of the proposed method to generate superpixels; from left to right, we have 5, 4, and 12 superpixels marked in different colors. These results are obtained after 3, 3, and 2 iterations, correspondingly. Besides the Weizmann dataset, Fig. 7 shows the performance on document images. Fig. 7(b) is the third iteration of the proposed approach, which can already distinguish the characters from the background, but it still contains many small components. The segmentation result of the final fourth iteration is shown in (c). The image contains three clusters represented by black, white and gray colors. As the gray component is caused by the blur on the edges of the characters, it is always located on the edges of letters. Morphological postprocessing was used to remove isolated points. Similarly, Fig. 7(e)-(f) shows the segmentation process for another document image (d). The proposed approach can also be used in cell image segmentation. In Fig. 8, each cell image is segmented into 3 clusters. As the core of a cell is often different from the other parts, it is reasonable for the algorithm to treat it differently. For cell and document segmentation, the foreground objects are always separated and sparse. Though the proposed approach works well on different kinds of images, its main drawback is that the results are sensitive to the size of the Gaussian kernel. This is the trade-off for not having prior knowledge about the number of clusters.
Fig. 7. Segmentation results on document images (a,d). (b) and (e) are the intermediate steps of our segmentation. The final segmentation results are shown in (c) and (f).
Fig. 8. (b) and (d) are the corresponding segmentation results of the cell images (a) and (c)
6 Conclusion
In this paper, we present a general framework to learn new distances by hierarchical clustering. At each level, a clustering algorithm is used to cluster the data points. According to the cluster assignment, the distance is updated to extract the manifold structure of the data. A convex clustering approach is used as the algorithm for learning the distance. Based on the learned distance, the convex clustering algorithm can automatically find the globally optimal solution even for non-linearly distributed data. Our method does not involve any parameters except the Gaussian kernel parameter. In particular, the number of clusters is determined automatically, which is an important advantage over most existing clustering algorithms.
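For readers who wish to experiment with the general idea, the following Python sketch illustrates the cluster-then-update-distances loop summarized above. It is only a toy skeleton under stated assumptions: the convex clustering step of the paper, which determines the number of clusters automatically, is replaced here by a fixed-size average-linkage step, and the distance update (shrinking within-cluster distances) is an illustrative placeholder rather than the authors' actual rule.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def learn_distances(X, levels=3, n_clusters=4, shrink=0.5):
    """Toy skeleton of the hierarchical cluster-then-update-distance loop.
    Both the clustering step and the update rule are illustrative stand-ins."""
    # initial pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for _ in range(levels):
        # cluster with the current distances (average-linkage used as a placeholder)
        Z = linkage(squareform(D, checks=False), method="average")
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        # hypothetical update: pull same-cluster points closer together
        same = labels[:, None] == labels[None, :]
        D = np.where(same, shrink * D, D)
        np.fill_diagonal(D, 0.0)
    return D
```

The only point of the skeleton is the control flow: cluster with the current distances, update the distances from the cluster assignment, and repeat.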
References 1. Han, J., Kamber, M.: Data mining: Concepts and techniques. Morgan Kaufmann, San Francisco (2000) 2. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems, vol. 14, pp. 849–856 (2002)
3. Fischer, B., Buhmann, J.M.: Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Trans. PAMI 25, 513–518 (2003) 4. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. International Journal of Computer Vision 59, 167–181 (2004) 5. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. PAMI (2000) 6. Moore, A., Prince, S., Warrell, J., Mohammed, U., Jones, G.: Superpixel lattices. In: CVPR (2008) 7. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290 (2000) 8. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science (2000) 9. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15 (2003) 10. Gonzalez, R., Woods, R.: Digital image processing. Addison-Wesley, Reading (1992) 11. Cox, T., Cox, M.: Multidimensional scaling. Chapman and Hall, Boca Raton (1994) 12. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315, 972–976 (2007) 13. Lashkari, D., Golland, P.: Convex clustering with exemplar-based models. In: Advances in Neural Information Processing Systems (2007) 14. Borenstein, E., Sharon, E., Ullman, S.: Combining top-down and bottom-up segmentation. In: Proc. IEEE Workshop on Perceptual Organization in Computer Vision (2004) 15. Fischer, B., Roth, V., Buhmann, J.M.: Clustering with the connectivity kernel. In: Advances in Neural Information Processing Systems (2004) 16. Chen, Q., Sun, Q., Heng, P.A., Xia, D.: A double-threshold image binarization method based on edge detector. Pattern Recognition 41, 1254–1267 (2008)
Group Action Recognition Using Space-Time Interest Points Qingdi Wei1 , Xiaoqin Zhang1 , Yu Kong2 , Weiming Hu1 , and Haibin Ling3 1
National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing 100190, P.R. China 2 Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing 100081, P.R. China 3 Center for Information Science and Technology, Computer and Information Science Department, Temple University, Philadelphia, PA, USA {qdwei,xqzhang,wmhu}@nlpr.ia.ac.cn,
[email protected],
[email protected]
Abstract. Group action recognition is a challenging task in computer vision due to the large complexity induced by multiple motion patterns. This paper aims at analyzing group actions in video clips containing several activities. We combine the probability summation framework with the space-time (ST) interest points for this task. First, ST interest points are extracted from video clips to form the feature space. Then we use k-means for feature clustering and build a compact representation, which is then used for group action classification. The proposed approach has been applied to classification tasks including four classes: badminton, tennis, basketball, and soccer videos. The experimental results demonstrate the advantages of the proposed approach.
1 Introduction
Understanding group activities is an important problem in computer vision with many applications, such as video surveillance and monitoring, object-level video summarization, human-computer interaction, video indexing and browsing, and digital library organization. Despite many research efforts on human activity analysis, group action classification remains a challenging task and has not been widely studied, for the following reasons: (i) It is difficult to find an effective descriptor for group human action, because there are usually many people performing different actions individually. (ii) When using local features to describe an individual action, there are too many objects to track. In addition, the environmental background for group action is often highly cluttered. (iii) Nuisance factors, such as the number of people in the group action, mutual occlusion and self-occlusion, and irregularity of camera parameters, also cause additional difficulties. In this paper, we combine the probability summation framework with space-time (ST) interest points [1] for group activity recognition. First, space-time interest point features are extracted to describe the group action, as shown in Fig. 1.
Fig. 1. Local space-time features detected for sports: (a) basketball; (b) badminton; (c) soccer; (d) tennis
Then, the k-means clustering algorithm is employed to cluster action features into the action codebook. Finally, a testing video is classified by computing the probability summation, which ignores the number of people in the group action. Our purpose is to distinguish group actions, such as basketball and soccer. Finer-grained analysis, such as distinguishing attack and defense, is left for future work. The highlights of our work include: 1) the ST features provide an effective description of the group action in a video clip without tracking or action segmentation; 2) the probabilistic framework provides an effective way of integrating the global information of the features, which makes our method robust to image noise and invariant to translation, rotation and scaling; and 3) the probabilistic framework is computationally very efficient. In fact, the system has linear time complexity, given previously extracted features. The rest of the paper is organized as follows: In Section 2 we discuss the related work. In Section 3 we deal with the main problem; the action features and the action recognition method are presented in Section 3.1 and Section 3.2, respectively. Experimental results are shown in Section 4, and Section 5 concludes the paper.
2 Related Work
Generally, there are two major parts in a human action recognition system: human action representation and the recognition strategy. Laptev [1] proposes space-time interest points for a compact representation of video data, and explores the advantage of using space-time interest points to describe human action. The ST feature does not require any segmentation or tracking of the individual performing the action. With this property, space-time features have recently shown considerable success in action recognition [2], [3], [4], [5], [6]. Niebles, Wang and Fei-Fei [3] extract space-time interest points to represent a video sequence as a collection of space-time words, and use a probabilistic Latent Semantic Analysis (pLSA) model to recognize human actions. Schuldt [2] and Laptev [6] also use the ST feature, but they prefer a codebook and bag-of-words representation. However, most of the existing methods for action recognition mentioned above focus on individual actions. Efros et al. [7] take soccer video as their experimental data but also merely recognize a single person's actions. Kong et al. [8] use optical flow as action features to recognize group
Fig. 2. The training phase
action in soccer videos. That work is limited to soccer videos and can only handle three categories of group actions. In [9], Ali and Shah track individual targets in dense crowds using a scene structure based force model. The recognition strategy is another important part of an action recognition system. In the literature, there is a large body of work on this problem, and many impressive results have been obtained over the past several years, such as Hidden Markov Models (HMMs), Autoregressive Moving Average (ARMA) [10], Conditional Random Fields (CRFs) [11], Finite State Machines (FSM) [12], [13] and their variations [14], semi-Markov models [15], 1-Nearest Neighbor with Metric Learning [16], ActionNets [17], and LDCRF [18]. Wang and Suter [19] presented the use of FCRF in the vision community, and demonstrated its superiority to both HMM and general CRF. Ning, Xu, Gong and Huang [20] improved the performance of the standard CRF model and its variations on continuous action recognition, by replacing the traditional random fields model with a latent pose estimator. Boiman and Irani [21] proposed a graphical Bayesian model which describes the motion data using hidden variables that correspond to hidden ensembles in a database of spatio-temporal patches. Vitaladevuni, Kellokumpu and Davis [22] presented a Bayesian framework for action recognition through ballistic dynamics. Compared with the previous work, we target group human actions that are more general and complex. As shown in Section 4, we work on four different group actions.
Fig. 3. (a) Cylindrical histogram (b) Spheriform histogram
3 Recognition Method
We first define the symbols used in this work. Denote V as the input video, and let $ST = \{st_i\}_{i=1}^{m}$ denote the action feature, where each $st_i$ is the cuboid of a space-time interest point in video V. The recognition task is usually formulated as a classification problem. Specifically, our purpose is to find a classifier f: f(ST) = c that classifies a given sequence into a predefined group action c ∈ C = {1, ..., n}, where C is the set of group actions that we are interested in. Our algorithm consists of two stages, the training phase and the testing phase. As shown in Fig. 2, the training phase contains the following steps:

1. Preprocess the color video clips into gray ones and reduce the resolution to 320 × 240.
2. Extract space-time interest points from each video clip, and retain the cuboids at the interest points with significant feature strengths.
3. Cluster the ST features into feature classes, forming a sample database (a sketch of this clustering step is given after the testing steps below).
4. Estimate the prior probability of each feature class from the frequency of the action classes.

The testing phase contains similar steps:

1. Preprocess the color video clips into gray ones and reduce the resolution to 320 × 240.
2. Extract space-time interest points from each video clip, and compute cuboids at the interest points that have significant relative feature strengths.
3. Classify the ST features to obtain their feature class labels. Here, each feature class label corresponds to a probability distribution over action classes.
4. Sum the probabilities and take the maximum; this completes group action recognition.
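The clustering step of the training phase (step 3) can be sketched as follows. This is an illustrative Python fragment, not the authors' Matlab implementation; scikit-learn's KMeans is assumed as the k-means routine, and `descriptors` is assumed to be the matrix of cuboid descriptors pooled over all training clips (one row per space-time interest point).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=200, random_state=0):
    """Cluster cuboid descriptors into k feature classes (the action codebook)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
    feature_labels = km.fit_predict(descriptors)   # feature class of each training cuboid
    return km, feature_labels

# At test time, each cuboid of a new clip is mapped to its nearest feature class:
# test_labels = km.predict(test_descriptors)
```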
3.1 Action Feature
The space-time interest point is used for a compact representation of video data and is robust to occlusions, background clutter, significant scale changes, and high action irregularities. Such points have been successfully used as action features in action recognition tasks [2], [3], [4]. In our work, we detect space-time interest points using a periodic detector. Then a cuboid [4] is extracted at each interest point,
which contains the spatio-temporally windowed pixel values. The Euclidean distance metric is adopted to evaluate the similarity of two cuboids. A cuboid $st_i$ of a space-time interest point is a description of the information contained in the neighboring pixels, such as location, scale, histogram of gradient (HOG) and the like. In our system, we choose to concatenate all pixels in the cuboid for the description. The video is then represented by a discrete set $ST = \{st_i\}_{i=1}^{m}$ containing m cuboids sampled from the video. PCA is applied to the descriptor for dimension reduction. We also experiment with the 3D-shape context, which is similar to [23]. Huang and Trivedi [23] use a multilayered cylindrical histogram to describe the human body voxels, while we employ a spheriform histogram in our experiment. Each $st_i$ is an 8 × 8 × 8 matrix in the process of computing the 3D-shape context at the interest points.
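A minimal sketch of this descriptor construction is given below. It assumes the cuboids have already been extracted; the target dimension of 100 is an illustrative choice (the paper does not fix it here), and PCA is computed directly with an SVD rather than with the toolbox used by the authors.

```python
import numpy as np

def cuboid_descriptors(cuboids, n_components=100):
    """Concatenate the pixels of each cuboid into one vector and reduce the
    dimension with PCA, computed via an SVD of the centered data matrix."""
    X = np.stack([c.ravel() for c in cuboids]).astype(np.float64)  # shape (m, n_voxels)
    mean = X.mean(axis=0)
    Xc = X - mean
    # rows of Vt are the principal directions; keep the leading n_components
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, mean, components   # projected descriptors + PCA model
```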
3.2 Recognition
Using the k-means algorithm, we cluster a large number of cuboids extracted from the training data into k clusters in the training stage. We call these clusters feature classes. Since a feature class rarely consists of cuboid features belonging to a single action class, it is reasonable to estimate the transformation probability $T_i$ from the frequencies of each action class appearing in the corresponding feature class. The transformation probability $T_i$ measures the probability of each action class occurring when feature class i appears:

$$T_i:\ \text{feature class } i \;\longrightarrow\; \begin{cases} \text{action class } 1 & p_{i1} \\ \quad\vdots & \vdots \\ \text{action class } n & p_{in} \end{cases} \qquad (1)$$

where

$$\sum_{i=1}^{k} p_{ij} = 1. \qquad (2)$$

Eq. (2) could be replaced by $\sum_{j=1}^{n} p_{ij} = 1$, but that alternative is sensitive to the number of samples of each action class. For instance, if one action class seldom appears, it will never appear in the experimental results. In this sense, Eq. (2) works better. Moreover, the number of feature classes k has an important influence on the results, which we discuss in Section 4.3.

In the testing stage, we first classify each cuboid in the test video to obtain a set of feature classes. Then the transformation is applied to this set: T(feature class). Each feature class from the test video corresponds to a 1×n vector (Fig. 3), so after the transformation we obtain a k×n matrix, the action class probability table. In the recognition process, the probabilities of the action classes are summed up by Eq. (3), and the class attaining max(ap_j) is taken as the action class of the video. We call this method the probability summation:

$$ap_j = \sum_{i=1}^{m} p_{ij}. \qquad (3)$$
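Eqs. (1)-(3) translate directly into a few lines of code. The sketch below assumes integer-coded feature-class and action-class labels and is an illustration of the formulas rather than the authors' implementation.

```python
import numpy as np

def estimate_transformation(feature_labels, action_labels, k, n_actions, eps=1e-12):
    """Estimate T (k x n): frequency of each action class within each feature
    class, normalized over feature classes as in Eq. (2)."""
    T = np.zeros((k, n_actions))
    for f, a in zip(feature_labels, action_labels):
        T[f, a] += 1.0
    T /= (T.sum(axis=0, keepdims=True) + eps)      # Eq. (2): each column sums to 1
    return T

def classify_by_probability_summation(test_feature_labels, T):
    """Eq. (3): sum the rows of T selected by the test cuboids' feature classes
    and return the action class with the largest accumulated probability."""
    ap = T[np.asarray(test_feature_labels)].sum(axis=0)
    return int(np.argmax(ap)), ap
```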
In the experiment we also use another method, template matching, for comparison. In template matching, we first learn the action class templates in the training stage. Each template, which is a histogram of feature classes, corresponds to an action class. When a test video comes, we compute its histogram of feature classes and use the Euclidean distance to evaluate the similarity between the test video and each template.
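A corresponding sketch of the template matching baseline is shown below. It assumes each training clip is represented by the feature-class labels of its cuboids and that templates are L1-normalized mean histograms; the normalization details are an assumption, since the paper does not spell them out.

```python
import numpy as np

def action_templates(labels_per_clip, action_per_clip, k, n_actions):
    """One template per action class: mean normalized histogram of feature classes."""
    templates = np.zeros((n_actions, k))
    counts = np.zeros(n_actions)
    for labels, a in zip(labels_per_clip, action_per_clip):
        h = np.bincount(np.asarray(labels), minlength=k).astype(float)
        templates[a] += h / max(h.sum(), 1.0)
        counts[a] += 1
    return templates / np.maximum(counts[:, None], 1.0)

def classify_by_template(test_labels, templates, k):
    """Assign the test clip to the template with the smallest Euclidean distance."""
    h = np.bincount(np.asarray(test_labels), minlength=k).astype(float)
    h /= max(h.sum(), 1.0)
    return int(np.argmin(np.linalg.norm(templates - h, axis=1)))
```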
4 Experiments
4.1 Group Human Action Video Database
For the experiments, we build a video database containing four types of group human actions (tennis, badminton, soccer, and basketball; see Fig. 4) from several real sport videos collected from the Internet. The reason why we choose these four kinds of sports is as follows. Soccer and basketball both involve a number of people with a great deal of movement. Tennis and badminton have two or four players in the game and share a similar movement frequency. Furthermore, the three sports other than basketball all have green as their main color, so we cannot distinguish them simply through color. As is well known, the movement information and the color information change dramatically within one sport video due to long shots and close-ups. Therefore, classifying them correctly is not a simple task even in the ideal case, let alone when the original videos are recorded against complex backgrounds with a moving camera. The videos have a frame rate of 25 fps and various resolutions, such as 1280 × 720 and 480 × 372. When creating the experimental video database, we downsample all videos to a resolution of 320 × 240; they have a length of one hundred frames on average. Our database contains 1193 videos with their ground-truth labels. To the best of our knowledge, this is the first video database of group human actions.
4.2 Experimental Method
We divide the video database into two parts, the training part (599 videos) and the testing part (594 videos). Space-time interest points are extracted from all videos (Fig. 4). A large number of ST points are detected in most videos; for each video, we only choose the 50 ST points with the largest feature responses, for efficiency. In fact, we find in our experiments that 50 points contain sufficient information for group action recognition. Meanwhile, those videos which fail to contain 50 interest points can also be recognized correctly. The code for detecting space-time interest points and extracting cuboids is provided by Piotr Dollár's Activity Recognition Toolbox1. The experiments are performed in Matlab on a 2.4 GHz personal computer with 4 GB of memory. It takes 40 seconds to process a 4-second video. Most of this time is spent on PCA, around 36 seconds.
http://vision.ucsd.edu/∼pdollar/research/cuboids doc/index.html
Fig. 4. Space-time interest point in the sport video
Furthermore, we obtain 29k cuboids for training with the k-means algorithm. The number of clusters is set to k = 200, so the transformation probability T is a 200 × 4 matrix. We normalize each cluster in T by Eq. (2), as the numbers of samples of the group actions are uneven.
4.3 Results
In the recognition process, we evaluate two approaches by comparing their recognition rates: one is template matching; the other is probability summation. Table 1 shows the confusion matrix of template matching, while Table 2 shows that of probability summation. We can see that probability summation performs better than template matching.

Table 1. Confusion matrix of template matching
              badminton  tennis  soccer  basketball
badminton        .87       .08     .05      .00
tennis           .09       .90     .01      .00
soccer           .05       .05     .81      .09
basketball       .14       .04     .00      .82
Table 2. Confusion matrix of probability summation

              badminton  tennis  soccer  basketball
badminton        .92       .02     .06      .00
tennis           .06       .94     .00      .00
soccer           .05       .01     .88      .06
basketball       .15       .06     .24      .75
Table 3. Recognition rate (%) with various parameters k

    k    3D-SC (only distance)  3D-SC PCA50  3D-SC PCA90  3D-SC PCA134  Original-cuboids PCA111
  200          53.97               64.48        64.19         67.10            85.57
  400          56.06               67.39        71.76         71.18            86.15
  600          58.15               73.65        72.78         72.63            90.76
  800          60.03               74.24        74.67         73.51            90.33
 1000          59.88               77.44        75.98         78.17            91.49
 1200          63.06               79.62        81.37         77.87            92.50
 2000          65.66               82.10        81.51         83.70            93.51
 4000          67.82               85.74        85.88         85.74            95.53
 6000          68.47               87.05        87.05         87.34            95.38
10000          69.12               88.94        88.94         88.29            96.10
20000          70.27               90.25        90.83         90.83            95.67
The reason is that each sport comprises various motions, for example serving and spiking. With probability summation, we can correctly classify a sport even when only serving or only spiking is performed. In comparison, template matching requires that both serving and spiking be included in the video. The average recognition rate using probability summation is 88.48%. This rate is very promising, especially considering the limited training data and the peculiarities of the actions.
4.4 Discussion
We also experiment with the 3D-shape context; the results are shown in Table 3. The first column, 3D-SC (only distance), means the 3D-shape context feature is binned only in the distance domain. We use PCA to compress the 3D-shape context, which is a 512-dimensional vector; columns 2-4 correspond to the 50-dim, 90-dim, and 134-dim versions, respectively. The 134-dim version retains more than 90% of the information. The last column is the original cuboids feature, which works well especially when k is small. In our method, k, the number of clusters, is an important factor affecting the experimental results. As k increases, the recognition rate also rises, though with diminishing gains. As mentioned in Section 4.2, we obtain 29k features for training. When k = 20000, this means that one cluster contains few samples
which is possibly only one or two. Together with probability summation, our method then behaves much like a simple boosting algorithm, in which each feature acts as a weak classifier.
5 Conclusion
In this paper, our contribution focuses on two points. First, we build a group human action database with more than a thousand videos, and it keeps growing. Second, we make an attempt to analyze group actions and achieve satisfactory experimental results. Our method has the potential to analyze group human actions in detail, because the ST feature can specifically describe action information. At present, our database only has four group actions. We will extend our method to more categories of group human action and enrich our database as well. As future work, we would like to consider group/global features to enhance our work.
Acknowledgment This work is supported by NSFC (Grant Nos. 60825204 and 60672040) and the National 863 High-Tech R&D Program of China (Grant No. 2006AA01Z453).
References 1. Laptev, I.: On space-time interest points. International Journal of Computer Vision 64, 107–123 (2005) 2. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local svm approach. In: Proceedings of the 17th International Conference on Pattern Recognition, Washington, DC, USA, vol. 3, pp. 32–36. IEEE Computer Society, Los Alamitos (2004) 3. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision 79, 299–318 (2008) 4. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72 (2005) 5. Gilbert, A., Illingworth, J., Bowden, R.: Scale invariant action recognition using compound features mined from dense spatio-temporal corners. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 222–233. Springer, Heidelberg (2008) 6. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition, 1–8 (2008) 7. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proceedings of the Ninth IEEE International Conference on Computer Vision, Washington, DC, USA, p. 726. IEEE Computer Society, Los Alamitos (2003)
8. Kong, Y., Zhang, X., Wei, Q., Hu, W., Jia, Y.: Group action recognition in soccer videos. In: 19th International Conference on Pattern Recognition, pp. 1–4 (2008) 9. Ali, S., Shah, M.: Floor fields for tracking in high density crowd scenes. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 1–14. Springer, Heidelberg (2008) 10. Roy, A.V., Chowdhury, A., Chellappa, R.: Matching shape sequences in video with applications in human movement analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1896–1909 (2005) 11. Sminchisescu, C., Kanaujia, A., Metaxas, D.: Conditional models for contextual human motion recognition. In: 10th IEEE International Conference on Computer Vision, vol. 104, pp. 210–220 (2006) 12. Natarajan, P., Nevatia, R.: View and scale invariant action recognition using multiview shape-flow models. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008) 13. Zhao, T., Nevatia, R.: 3d tracking of human locomotion: A tracking as recognition approach. In: Proceedings of the 16th International Conference on Pattern Recognition, Washington, DC, USA, p. 10546. IEEE Computer Society, Los Alamitos (2002) 14. Sminchisescu, C., Kanaujia, A., Li, Z., Metaxas, D.: Conditional models for contextual human motion recognition. In: 10th IEEE International Conference on Computer Vision, vol. 2, pp. 1808–1815 (2005) 15. Shi, Q., Wang, L., Cheng, L., Smola, A.: Discriminative human action segmentation and recognition using semi-markov model. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008) 16. Tran, D., Sorokin, A.: Human activity recognition with metric learning. In: Proceedings of the 10th European Conference on Computer Vision, pp. 548–561. Springer, Heidelberg (2008) 17. Lv, F., Nevatia, R.: Single view human action recognition using key pose matching and viterbi path searching. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 18. Morency, L.P., Quattoni, A., Darrell, T.: Latent-dynamic discriminative models for continuous gesture recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, 1–8 (2007) 19. Wang, L., Suter, D.: Recognizing human activities from silhouettes: Motion subspace and factorial discriminative graphical model. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 20. Ning, H., Xu, W., Gong, Y., Huang, T.: Latent pose estimator for continuous action recognition. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 419–433. Springer, Heidelberg (2008) 21. Boiman, O., Irani, M.: Detecting irregularities in images and in video. In: 10th IEEE International Conference on Computer Vision, vol. 1, pp. 462–469 (2005) 22. Vitaladevuni, S., Kellokumpu, V., Davis, L.: Action recognition using ballistic dynamics. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008) 23. Huang, K.S., Trivedi, M.M.: 3d shape context based gesture analysis integrated with tracking using omni video array. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, Washington, DC, USA, p. 80. IEEE Computer Society, Los Alamitos (2005)
Adaptive Deblurring for Camera-Based Document Image Processing Yibin Tian and Wei Ming Konica Minolta Systems Lab, 301 Velocity Way, Foster City, CA, USA 94404
[email protected]
Abstract. With increasing resolution of cameras on mobile devices and their computing capacity, camera-based document processing becomes more attractive. However, there are several unique challenges, one of which is defocus. It is common that a camera-captured image is blurred by variable amount of location-dependent defocus. To improve image quality, we developed a novel method to adaptively deblur camera-based document images. In this method, sub-images of interest are first extracted from the captured image, and a pointspread function is derived for each sub-image by analyzing the gradient information along edges. Then the sub-image is deblurred by its local point-spread function. Preliminary experimental results indicate that the proposed adaptive deblurring method significantly improves focusing quality as evaluated by both human observers and objective focus measures compared with single-PSF deblurring.
1 Introduction With rapid advances in consumer electronics, many multifunctional mobile devices have emerged in the past few years. The combination of digital cameras and cellular phones is particularly popular [1]. Nokia, the world's largest manufacturer of mobile devices, is now the largest manufacturer of cameras. In addition to the dramatically increased availability, the resolution of phone cameras has been increasing steadily. Phone cameras with 8 megapixel sensors are now widely available, for example, Nokia N86, Sony-Ericsson Aino, LG GC900 and Samsung i8000. With such cameras, it is possible to obtain document images at resolutions of about 300 dpi for A4 papers without image mosaicking, which makes camera-based document image processing (CBDIP) much more attractive. In this report, document image processing or analysis is broadly defined as analyzing images containing text information. The advantages of CBDIP are obvious when compared with the traditional scanner-based approach. Cameras on mobile devices, particularly phone cameras, are non-contact, connected to wireless networks, and more widely available and portable. All these factors offer potentially wider and more efficient applications for CBDIP than the scanner-based approach. For example, CBDIP systems can be used as a text recognizer and reader for the visually impaired [2], a handheld foreign language sign translator [3], and a cargo container label reader [4]. Optical Character Recognition (OCR) is one of the most common document processing tasks,
and it has been shown that camera-based OCR is more productive than scanner-based OCR for processing newspaper text [5]. The flexibilities of CBDIP come at the cost of a number of challenges, such as non-uniform illumination, perspective distortion, zooming and focusing, moving objects and limited computing power [6]. This report addresses the focusing aspect of CBDIP. When an imaging target is positioned with significant depth variations from the camera due to certain physical constraints, the captured image is blurred by a variable amount of location-dependent defocus. The problem is particularly severe when the imaging targets are very close to the camera or the camera's depth of focus is small, which is frequently encountered in CBDIP due to magnification and field of view requirements. In the simplest case of a two-depth scene consisting of two targets of interest, the difference between the ideal image depths is
$$\delta = d_{1i} - d_{2i} = \frac{f^2\,(d_{1o} - d_{2o})}{(d_{1o} - f)(d_{2o} - f)} \qquad (1)$$

where $d_{1i}$ and $d_{2i}$ are the image distances of the two targets, $d_{1o}$ and $d_{2o}$ their respective object distances, and $f$ the focal length of the camera lens [7]. Both targets will be out of focus if they are located within the focus window, and the exact defocus amount depends on their relative sizes [7, 8]. To obtain higher quality images, deconvolution can be utilized to reduce the blur in the images. As the amount of defocus depends on location, a single point spread function (PSF) is not a good representation of the imaging system. In this report, we describe an adaptive deblurring method to improve the image quality locally. We take advantage of the fact that there is rich edge information in document images and derive local PSFs from the gradient variations around edges, so that deblurring can be carried out locally on sub-images. Preliminary experimental results show significant image quality improvement of the proposed method over the traditional approach using a single PSF.
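As a quick numerical illustration of Eq. (1) (the numbers below are illustrative, not from the paper):

```python
def image_depth_difference(d1o, d2o, f):
    """Eq. (1): difference between the ideal image distances of two targets."""
    return f ** 2 * (d1o - d2o) / ((d1o - f) * (d2o - f))

# Example with assumed values: a 6 mm lens, targets at 300 mm and 400 mm.
# image_depth_difference(300.0, 400.0, 6.0) is approximately -0.031 mm,
# i.e. the two ideal image planes lie about 31 micrometers apart.
```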
2 Methods Optical blurring can be well modeled as low pass filtering. As a result, the impact of defocus on images is most significant at edges. By analyzing the intensity variations around edges, we can infer the edge response of the camera if we assume the edges in the targets are sharp, as is the case in many documents, where characters and background often form sharp transitions in intensity [9]. Optical blurring can be caused by defocus, non-defocus aberrations, light scatter, and the combination of the three [10, 11]. Asymmetric PSFs may arise due to a significant amount of asymmetric aberrations such as astigmatism and coma, but for the small amounts of non-defocus aberrations found in cameras, their impact on blur is negligible [11]. The impact of light scatter in cameras is usually rotationally symmetric. For simplicity, in this report we only consider optical blurring that is near Gaussian, which allows reconstructing two-dimensional PSFs from one-dimensional edge responses.
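The following sketch anticipates the procedure of Section 2.1 and shows how a 2D PSF can be assembled from two 1D edge responses under the near-Gaussian assumption. The width estimation is a simplified stand-in for the gradient-averaging and fitting described there; names and parameter values are illustrative.

```python
import numpy as np

def gaussian_profile(sigma, radius):
    """Unit-sum 1D Gaussian sampled on [-radius, radius]."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-0.5 * (x / max(sigma, 1e-6)) ** 2)
    return g / g.sum()

def sigma_from_gradient_profile(profile, pixel_pitch=1.0):
    """Rough width of an averaged gradient profile across an edge
    (its standard deviation), as a simplified stand-in for a Gaussian fit."""
    p = np.clip(np.asarray(profile, float), 0, None)
    x = np.arange(p.size) * pixel_pitch
    w = p / max(p.sum(), 1e-12)
    mu = (w * x).sum()
    return float(np.sqrt((w * (x - mu) ** 2).sum()))

def psf_from_edge_responses(sigma_h, sigma_v, radius=7):
    """Under the near-Gaussian assumption, the 2D PSF is the outer product of
    the vertical and horizontal 1D responses (an MxN kernel from Mx1 and 1xN)."""
    psf = np.outer(gaussian_profile(sigma_v, radius), gaussian_profile(sigma_h, radius))
    return psf / psf.sum()
```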
2.1 Edge Responses and Point Spread Functions Images are converted to grayscale if they are colored. Edges in document images are abundant and easy to detect with common edge detectors, such as the Canny edge detector [12], even when the images are blurred. Edges that form straight lines of at least a certain number of pixels (for example, more than 5 pixels) are classified as "well-defined". Then the intensity variations around multiple well-defined horizontal and vertical edges are analyzed. The intensity gradients in the direction perpendicular to edges of the same orientation are averaged to give the horizontal and vertical edge response functions. To reduce the impact of noise and local background, gradients at multiple locations should be utilized. A previous study reported that Cauchy functions were better approximations to the edge response functions than Gaussian functions in flat-bed scanners [9]. We found no significant difference between the two approximations in our testing and assume the PSFs are Gaussian. Thus the 2D PSF (an M×N matrix) is simply the product of the vertical and horizontal edge response functions (M×1 and 1×N vectors, respectively). It should be noted that edges in document images are closely spaced, and the spacing of strokes is a critical limiting factor in recovering edge responses and point spread functions. The spacing of strokes depends on the font, size and spacing of characters and lines. In general, if two parallel neighboring strokes (more specifically, their neighboring edges) used in PSF recovery are K pixels apart, the width of the PSF to be correctly recovered should not exceed K pixels. Otherwise, the tails of the PSF would be contaminated due to crossover from neighboring strokes. So edges of large characters should be chosen over those of small ones. Another important factor in recovering edge responses is the local contrast near the edges; if the contrast is too low, the recoverable information is limited. In practice, if the local contrast near an edge is less than 20%, the edge should be excluded from PSF derivation. 2.2 Adaptive Deblurring When a document is positioned with significant depth variations from the camera due to certain physical constraints, the captured image is blurred by a variable amount of location-dependent defocus. In such a case, a single point spread function (PSF) is not a good representation of the camera-document configuration. The whole document image can be divided into a number of sub-images, and for each sub-image a local PSF is derived as described above. Then deblurring is carried out on individual sub-images using their corresponding local PSFs, each of which is derived from the respective sub-image using the method described in Section 2.1. To ensure smooth transitions at the boundaries, the sub-images should have some small overlaps; when the whole deblurred image is recombined from the deblurred sub-images, the overlapped regions should be reconstructed by interpolating the corresponding sections of neighboring sub-images. In certain applications, such as the cargo container label reader, sub-images of interest can be extracted as separate text blocks such that no image reconstruction is needed, and the processing after deblurring, such as binarization and OCR, is done on individual sub-images. It is well known that deconvolution is very sensitive to noise and prone to artifacts. To reduce the impact of noise, the deconvolution filter should be suppressed at very high frequencies.
For simplicity, our current implementation discards noisy edges as described in Section 2.1. A more sophisticated approach is to use an attenuating
power window to the deconvolution filter in the spatial domain, which essentially results in a band-pass deconvolution filter. The power window can adaptively change according to estimated energy of signal and noise in the image [13]. Iterative search and regularized deconvolution algorithms have been developed to reduce artifacts from deconvolution. Lucy-Richardson iterative algorithm was used in our experiment with no more than 10 iterations [14]; more than 10 iterations resulted in significant artifacts in our testing. 2.3 Image Quality Evaluation Image quality can be evaluated with both subjective and objective criteria. In our subjective evaluation, original and deblurred gray images are printed out. For each test one small original and the corresponding deblurred sub-images are presented to one naïve subject, who is asked to choose the one of best quality (i.e., N-alternative forced choice psychophysical procedure). The subjects are asked to mainly focus on the sharpness and smoothness of the characters. All images are also binarized using Otsu’s method and the same tests are repeated with the same subjects. As the main goal of deblurring is to get better focused images, we use focus measures to objectively evaluate image focusing quality. There are many focus measures in the literature, and previous comparison studies showed that Energy of Image Gradient (EIG) is a good focus measure with intermediate defocus/noise sensitivity and effective range [11, 15, 16]. So we use EIG as our image quality objective metric. To reduce the impact of noise and artifacts in the deblurred images, we apply EIG only to regions within a number of pixels of edges. The focus measure value of each sub-image is normalized by that of the respective original image
$$FM(I) = \frac{\sum_{x}\sum_{y}\left([I_x(x,y)]^2 + [I_y(x,y)]^2\right)}{\sum_{x}\sum_{y}\left([I^o_x(x,y)]^2 + [I^o_y(x,y)]^2\right)} \qquad (2)$$

where $I_x(x,y)$ and $I_y(x,y)$ are the gradients of image $I$ at pixel $(x,y)$ in the horizontal and vertical directions, and $I^o$ is the corresponding original sub-image of $I$.
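Eq. (2) can be computed directly from image gradients. The sketch below uses NumPy's finite-difference gradients and an optional edge mask, reflecting the restriction to regions near edges described above; it is an illustration, not the authors' code.

```python
import numpy as np

def eig_focus_measure(sub_image, original_sub_image, edge_mask=None):
    """Eq. (2): energy of image gradient of a deblurred sub-image, normalized
    by that of the original sub-image; optionally restricted to an edge mask."""
    def energy(img):
        gy, gx = np.gradient(img.astype(np.float64))
        e = gx ** 2 + gy ** 2
        return e[edge_mask].sum() if edge_mask is not None else e.sum()
    return energy(sub_image) / max(energy(original_sub_image), 1e-12)
```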
3 Experiment and Results For illustration purposes, we set up a simplified scene where three distinctly different depths exist in the document image (top, middle and bottom regions in Fig. 1). The middle portion was best focused when the image was captured by a Canon SD1000 digital camera with a resolution of 1200x1600 pixels. Although the top and bottom portions were at about equal depths away, the bottom portion was blurred more than the top portion as it was closer to the camera. The original image was in RGB color space, and it was converted to grayscale between 0 and 255. The image was segmented into three sub-images in such a way that each sub-image contains mostly content from one depth, and from these three local PSFs were derived, respectively. Each sub-image was deblurred using the three local PSFs, and the deblurred images were evaluated with objective and subjective criteria.
Fig. 1. A blurred document image captured by a camera. The resolution is 1200x1600 pixels. The image was originally captured in color and converted to 8-bit grayscale.
3.1 Local Point-Spread Functions One PSF was derived from each of the three sub-images described above (referred to as PSFa, PSFb and PSFc respectively from top to bottom, Fig. 2). As we would expect based on the focus and depth setup, the PSF from the middle portion (PSFb) is the most compact, while the one from the bottom portion (PSFc) the least compact. The PSFs in the horizontal and vertical directions may be significantly different in some cases (see Section 4 for discussion). In the current setup, the horizontal and vertical PSFs are very similar, so only the horizontal PSFs are shown in Fig. 2. The corresponding 2D PSFs can be visualized by rotating the displayed 1D PSFs by 360 degrees due to the similarity in the horizontal and vertical PSFs.
Fig. 2. Three horizontal point-spread functions (PSF) derived from edge responses in three different sub-images of Fig. 1. Each sub-image is at a distinct depth, thus each PSF represents the blur at the specific depth in the camera-document scene. The corresponding vertical PSFs are very similar.
3.2 Deblurred Images Each sub-image was deblurred with its local 2D PSF and combined together after deblurring. The sub-images have overlap of 20 pixels in one direction, and for the reconstructed final image (Fig. 3), each overlapped section is the interpolation/ blending of the same overlapped section in the two deblurred neighboring sub-images.
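A rough sketch of this per-sub-image deblurring and blending is given below. It assumes the sub-images are horizontal bands specified by row bounds, implements a plain Richardson-Lucy iteration [14] rather than a particular toolbox routine, and uses a simple linear ramp for the overlap blending; the paper does not specify its interpolation at this level of detail, so the blending scheme is an assumption.

```python
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(img, psf, n_iter=10):
    """Plain Richardson-Lucy deconvolution; 10 iterations as used in the text."""
    img = img.astype(np.float64) + 1e-6
    est = np.full_like(img, img.mean())
    psf_flip = psf[::-1, ::-1]
    for _ in range(n_iter):
        denom = fftconvolve(est, psf, mode="same") + 1e-6
        est *= fftconvolve(img / denom, psf_flip, mode="same")
    return est

def deblur_tiles(gray, psfs, row_bounds, overlap=20):
    """Deblur each horizontal band with its own PSF and blend the overlapping
    strips linearly (bands are assumed to cover all rows and be taller than 2*overlap)."""
    acc = np.zeros_like(gray, dtype=np.float64)
    weight = np.zeros_like(acc)
    for (r0, r1), psf in zip(row_bounds, psfs):
        tile = richardson_lucy(gray[r0:r1], psf)
        ramp = np.ones(r1 - r0)
        ramp[:overlap] = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
        ramp[-overlap:] = np.linspace(1.0, 0.0, overlap + 2)[1:-1]
        acc[r0:r1] += tile * ramp[:, None]
        weight[r0:r1] += ramp[:, None]
    return acc / np.maximum(weight, 1e-6)

# Hypothetical usage for three bands of a 1600-row image with 20-pixel overlaps:
# result = deblur_tiles(gray, [psf_a, psf_b, psf_c], [(0, 560), (540, 1100), (1080, 1600)])
```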
Fig. 3. The result of adaptive deblurring of the image in Fig.1. The resolution is 1200x1600 pixels.
Fig. 4. Three windows of the original document image (1st column), and deblurring results with three different point-spread functions. Columns 2nd-4th were deblurred using PSFa, PSFb and PSFc respectively, for example, (a-a) is image (a) deblurred by PSFa, (a-b) image (a) deblurred by PSFb, and (a-c) image (a) deblurred by PSFc.
To better illustrate the adaptive deblurring effects, one small window was chosen in each of the three sub-images, and deblurring was carried out on each window using the three local PSFs (Fig. 4). Image quality improvement can be seen in most images. However, as expected, deblurring all the windows using the same PSF would not give the best quality for all images. 3.3 Image Quality To better quantify the image quality improvement of adaptive deblurring, the images in Fig. 4 and their binarized versions (Fig. 5) were separately evaluated by naïve human observers. Subjective evaluation results are summarized in Table 1.

Table 1. Subjective image quality evaluation results by 6 subjects using a 4-alternative forced choice psychophysical procedure. The value for each image is the proportion of subjects who chose the corresponding image as the best. Gray and binarized (Bi) images were evaluated separately.
Image     Original      Deblurred by PSF-a   Deblurred by PSF-b   Deblurred by PSF-c
window    Gray   Bi     Gray   Bi            Gray   Bi            Gray   Bi
a         0      0      5/6    5/6           1/6    1/6           0      0
b         1/6    1/6    1/6    0             4/6    5/6           0      0
c         0      0      1/6    0             0      0             5/6    6/6
Fig. 5. Binarized images of the three windows of the original document image (1st column) and binarized deblurring results with three different point-spread functions. Columns 2nd-4th were deblurred using PSFa, PSFb and PSFc respectively, for example, (a-a) is image (a) deblurred by PSFa, (a-b) image (a) deblurred by PSFb, and (a-c) image (a) deblurred by PSFc.
In addition, the images in Fig. 4 and their binarized versions (Fig. 5) were also separately evaluated using the objective focus measure EIG, and the results are summarized in Table 2. Based on the data shown in Tables 1 and 2, for both grayscale and binarized images, both subjective and objective criteria favor the images deblurred by their corresponding local PSFs (Fig. 4 and Fig. 5 (a-a), (b-b) and (c-c)), except for the objective
evaluation of grayscale images of window b, where the deblurred image using PSFa (Fig. 4 (b-a)) was ranked better than that using PSFb (Fig. 4 (b-b)), due to its higher contrast. The probable reason that Fig. 4 (b-b) was favored over Fig. 4 (b-a) in the subjective evaluation is that human observers were bothered by the noise and the artifacts around the characters in Fig. 4 (b-a), and the smoothness of Fig. 4 (b-b) perceptually outweighed its slightly lower contrast. In the traditional single-PSF deblurring approach, only one of the three depths can achieve the best possible image quality (the results would be either column 2, 3 or 4 in Figs. 4 and 5), and the other two depths would have poorer image quality than what can be achieved with adaptive deblurring (results (a-a), (b-b) and (c-c) in Figs. 4 and 5).

Table 2. Objective image quality evaluation by Energy of Image Gradient (EIG). All focus measure values were normalized as shown in Eq. 2. The normalizations were carried out on grayscale (Gray) and binary (Bi) images separately. Greater focus measure values indicate better image quality; focus measure values smaller than 1 indicate that deblurring actually reduced the image quality.
Image     Original      Deblurred by PSF-a   Deblurred by PSF-b   Deblurred by PSF-c
window    Gray   Bi     Gray     Bi          Gray     Bi          Gray     Bi
a         1      1      2.435    1.990       2.252    1.014       2.047    0.987
b         1      1      1.142    1.112       1.122    1.109       0.929    1.222
c         1      1      1.006    1.105       1.861    1.136       1.800    1.203
It should be noted that objective focus measures cannot distinguish noise and deconvolution artifacts from real details. For example, the EIG value from Fig. 4 (a-c) is twice of that from Fig. 4 (a), while the former is no better than the latter in terms of image quality. This is reflected in the binarized versions of the two images; image quality of Fig. 5 (a-c) is worse than that of Fig. 5 (a). This indicates that using inappropriate PSFs in deblurring may worsen the image quality, as will happen for some image regions in single-PSF deblurring.
Fig. 6. (a) A camera-based document image; (b) the target-camera configuration from which the image in (a) was taken; (c) the zoom-in sub-image from the marked rectangle in (a)
4 Discussion The proposed adaptive deblurring using local PSFs derived from camera-based document images can significantly improve the overall image quality when the document is not at a fixed depth. For illustrative purpose, in the experiment there were three distinct depths, the illumination was almost uniform across the image, and the characters were in either the horizontal or vertical direction. In real applications, these conditions may not be satisfied. The example of the document image shown in Fig. 6(a) is intended to show a few common challenges that frequently arise in CBDIP: non-uniform illumination, perspective distortion and continuous depth variations. We briefly describe how to deal with these issues in the context of adaptive deblurring. 4.1 Non-uniform Illumination Non-uniform illumination can arise from using camera flash or ambient light in the field [17]. Significant non-uniform illumination may adversely affect the derivation of local PSFs. Background removal and/or contrast stretching may be utilized to reduce the impact of non-uniform illumination [18]. 4.2 Perspective Distortions Perspective distortions have multiple manifestations in document images, for example, parallel edges in a document appear to intersect in the image (Fig. 6(a)), magnifications are different in the horizontal and vertical directions due to the difference in lateral and longitudinal magnifications, thus the strokes are blurred more in the vertical direction than in the horizontal direction (Fig. 6(c)), which lead to asymmetric PSFs. Perspective correction can be accomplished by estimating the horizontal and vertical vanishing points in the image [19]. And skew can be detected by applying Hough transform to the centroids [20] or extreme points of characters. After the image is properly adjusted, we can use the same procedure described in the report. We can also track two perpendicular directions of gradient around oblique edges to obtain the local PSFs. 4.3 Continuous Depth Variations It is difficult to segment an image based on depth variations if the depth varies continuously. In applications such as finding road signs, it is desirable to extract grouped blocks of characters using text classification methods [2]. A local PSF can be derived for each character block. However, in cases as shown in Fig.6 (a), segmentation of the image into character blocks needs arbitrarily defined boundaries. In such cases, discrete depth variations are used to approximate the continuous depth variations. The smaller the sizes of the sub-images are, the better the approximation of depth. Unfortunately, it is better to have larger sub-image for deconvolution due to its boundary effects. The size of the sub-images should be significantly greater than that of the local PSFs.
It is also possible that some sub-images contain no edges or they are too blurred to estimate the local PSFs. Such sub-images may be ignored for deblurring purpose as they probably contain no useful information, or deblurring is not capable of recovering the useful information. Alternatively, their PSFs can be replaced by predictions or interpolations using the PSFs in the neighboring sub-images. This may be useful when perspective and depth can be obtained from the image or a priori knowledge. In summary, the proposed adaptive deblurring method can be utilized in CBDIP to improve image quality compared with traditional single-PSF deblurring. To adopt it in practical applications, correction and/or compensation methods can be used to reduce the impact of non-uniform illumination, perspective distortions and continuous depth variations. More sophisticated PSF derivation and deconvolution methods may help further reduce the impact of noise and achieve better image quality improvement.
References 1. Gye, L.: Picture This: the Impact of Mobile Camera Phones on Personal Photographic Practices. Journal of Media and Cultural Studies, 279–288 (2007) 2. Shen, H., Coughlan, J.: Grouping Using Factor Graphs: an Approach for Finding Text with a Camera Phone. In: Escolano, F., Vento, M. (eds.) GbRPR. LNCS, vol. 4538, pp. 394–403. Springer, Heidelberg (2007) 3. Yang, J., Gao, J., Zhang, Y., Waibel, A.: Towards Automatic Sign Translation. In: Proceedings of Human Language Technology, pp. 269–274 (2001) 4. Lee, C.M., Kankanhalli, A.: Automatic Extraction of Characters in Complex Scene Images. International Journal of Pattern Recognition and Artificial Intelligence, 67–82 (1995) 5. Newman, W., Dance, C., Taylor, A., Taylor, S., Taylor, M., Aldhous, T.: CamWorks: A Video-based Tool for Efficient Capture from Paper Source Documents. In: Proceedings of IEEE International Conference on Multimedia Computing and Systems, pp. 647–653 (1999) 6. Doermann, D., Liang, J., Li, H.: Progress in Camera-Based Document Image Analysis. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 606–616 (2003) 7. Tian, Y., Feng, H., Xu, Z., Huang, J.: Dynamic Focus Window Selection Strategy for Digital Cameras. In: Proceedings of SPIE, vol. 5678, pp. 219–229 (2005) 8. Tian, Y.: Dynamic Focus Window Selection Using a Statistical Color Model. In: Proceedings of SPIE, vol. 6069, pp. 98–106 (2006) 9. Smith, E.H.B.: PSF Estimation by Gradient Descent Fit to the ESF. In: Proceedings of SPIE, vol. 6059, pp. 129–137 (2006) 10. Tian, Y., Arnoldussen, M., Tuan, A., Logan, B., Wildsoet, C.F.: Evaluation of Retinal Image Degradation by Higher-order Aberrations and Light Scatter in Chick Eyes after PhotoRefractive Keratectomy (PRK). Journal of Modern Optics, 805–818 (2008) 11. Tian, Y., Shieh, K., Wildsoet, C.F.: Performance of Focus Measures in the Presence of Non-defocus Aberrations. Journal of the Optical Society of America A, 165–173 (2007) 12. Canny, J.: A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 679–698 (1986) 13. Young, S., Driggers, R.G., Teaney, B.P., Jacobs, E.L.: Adaptive Deblurring of Noisy Images. Applied Optics, 744–752 (2007) 14. Richardson, W.H.: Bayesian-based Iterative Method of Image Restoration. Journal of the Optical Society of America, 55–59 (1972)
15. Tian, Y.: Monte Carlo Evaluations of Ten Focus Measures. In: Proceedings of SPIE, vol. 6502, p. 65020C (2007) 16. Subbarao, M., Choi, T., Nikzad, A.: Focusing Techniques. Optical Engineering, 2824–2836 (1993) 17. Fisher, F.: Digital Camera for Document Acquisition. In: Proceedings of Symposium on Document Image Understanding Technology, pp. 75–83 (2001) 18. Kuo, S., Ranganath, M.V.: Real Time Image Enhancement for both Text and Color Photo Images. In: Proceedings of International Conference on Image Processing, pp. 159–162 (1995) 19. Clark, P., Mirmehdi, M.: Recognising Text in Real Scenes. International Journal on Document Analysis and Recognition, 243–257 (2002) 20. Yu, B., Jain, A.K.: A Robust and Fast Skew Detection Algorithm for Generic Documents. Pattern Recognition, 1599–1629 (1996)
A Probabilistic Model of Visual Attention and Perceptual Organization for Constructive Object Recognition Masayasu Atsumi Dept. of Information Systems Science, Faculty of Eng., Soka University 1-236 Tangi-cho, Hachioji-shi, Tokyo 192-8577, Japan
[email protected]
Abstract. This paper proposes a new probabilistic model of visual attention, figure-ground segmentation and perceptual organization. In this model, spatially parallel preattentive points on a saliency map are organized into sequential selective attended segments through figure-ground segmentation on dynamically-formed Markov random fields, and perceptual organization among attended segments is performed in visual working memory for constructive object recognition. Selective attention to segments is controlled based on their saliency, closedness and attention bias. Attended segments in visual working memory are perceptually organized according to a law of proximity. Experiments were conducted using images of plural categories in an image database, and it was shown that selective attention was frequently directed to objects of those categories and that part segments of objects or salient context of objects were perceptually organized.
1 Introduction
Visual attention, segmentation and perceptual organization are indispensable perceptual processes for object and scene recognition. An attention process can be divided into two stages, a preattentive process and a focal attentional process [1]. In the preattentive process, local features are detected in parallel over the entire visual field. In the focal attentional process, they are successively integrated, and attention works in two distinct and complementary modes, a space-based mode and an object-based mode [2], in which the former selects locations where fine segmentation is promoted and the latter selects organized segments of objects; the two operate in concert to influence the allocation of attention. Figure-ground segmentation proceeds through border-ownership assignment and subsequent surface filling-in, and it is a largely autonomous process that precedes and guides the allocation of attention [3,4]. An organized percept of segments tends to attract attention automatically [5]. Attention may influence the transfer of percepts into visual working memory [6], and organized segments of objects tend to be stored together even when attention is directed to parts of them [7]. In the visual working memory model of Logie [8], visual working memory
is subdivided into two components: the passive visual cache and the active rehearsal component. In this paper, on the basis of these findings, we propose a new probabilistic model of visual attention, figure-ground segmentation and perceptual organization. In this model, spatially parallel preattentive points on a saliency map [9,10] are organized into sequential selective attended segments through figure-ground segmentation on dynamically-formed Markov random fields [11] around these points, and attended segments are perceptually organized in visual working memory for constructive object recognition. This model has the following characteristics: 1) preattentive points are used to assign border-ownership to figure segments, and multiple Markov random fields are dynamically formed and merged to perform filling-in of segment surfaces; 2) selective attention to segments is controlled based on their saliency, closedness and attention bias; and 3) segments are maintained in visual working memory in an active or passive mode of attention, and perceptual organization is performed among active segments according to a law of proximity. The performance of the model is evaluated using images of plural categories in an image database. This paper is organized as follows. Section 2 presents an overview of the model. Section 3 describes details of figure-ground segmentation, the degree of attention and perceptual organization. Experimental results are shown in Section 4, and we conclude our work in Section 5.
2 Overview of the Model
The model consists of a saliency map for preattention, a collection of dynamicallyformed Markov random fields for figure-ground segmentation, visual working memory for maintaining and perceptually organizing segments and an attention system on a saliency map and visual working memory. A saliency map and Markov random fields are generated on a feature space of an image [10] which consists of brightness, saturation, hue, contrast and orientation. Brightness contrast, saturation contrast and hue contrast are respectively computed by convolving brightness, saturation and hue with a LoG(Laplacian of a Gaussian) kernel. Brightness orientation, saturation orientation and hue orientation of 0◦ , 45◦ , 90◦ and 135◦ are respectively computed by convolving brightness, saturation and hue with Gabor kernels of those orientation. The maximum value of three orientations of brightness, saturation and hue defines a value of orientation for each point. A saliency map is obtained by using brightness contrast, saturation contrast, hue contrast and orientation [10]. Each feature is composed of plural dimensions to represent conspicuity competitively. Brightness contrast is composed of two dimensions of on-center off-surround response and on-surround off-center response. Saturation contrast is also composed of two dimensions of on-center off-surround response and on-surround off-center response. Hue contrast has six dimensions which correspond to red, yellow, green, cyan, blue and magenta. Orientation has four dimensions which correspond to orientations of 0◦ , 45◦ , 90◦
and 135°. The conspicuity of each feature is obtained according to a rareness criterion among dimensions, which ensures that the less frequently a feature dimension appears in an image, the more conspicuous the regions of that dimension are in the image. Saliency S(p) is obtained from brightness conspicuity S_{BC}(p), saturation conspicuity S_{SC}(p), hue conspicuity S_{HC}(p) and orientation conspicuity S_{OR}(p) by

S(p) = \sum_{i=BC,SC,HC,OR} w_i \times S_i(p)   (1)
for each point p in an image, where the w_i (i = BC, SC, HC, OR), whose sum is 1, are weights for the combination. According to the degree of saliency, several preattentive points are stochastically sensed on the saliency map, and primitive percepts [12], which we call proto-segments, are generated around them. A proto-segment is a small region whose brightness and hue are similar to those of a preattentive point and whose size is less than a certain specified size. This preattentive process causes border-ownership assignment as follows. On-center off-surround contrast and on-surround off-center contrast generate salient points on both sides of a border. In general, when a border forms a convex region, saliency is larger inside the convex region, that is, in the figure-side segment. As a result, preattention is sensed on either side of a border, and border-ownership is assigned to the figure side with a higher probability. Figure-ground segmentation is performed by figure-ground labeling on dynamically formed Markov random fields of brightness and hue around proto-segments. When several figure segments satisfy a certain merging condition, they are merged into one segment. This process corresponds to filling-in of figure segments. The degree of attention of a segment is defined by using its saliency, closedness and attention bias. Here, attention bias represents a constraint such as an attentional tendency toward a face-like region. Segments are maintained in visual working memory in an active or passive mode of attention. A certain number of segments are stochastically activated in visual working memory according to their degrees of attention, and selective attention is directed to one or a few segments whose degrees of attention are larger than the others among those active segments. The active mode means that segments are active memories in rehearsal, and the passive mode means that they are passive memories in a visual cache [8]. Passive segments disappear from visual working memory after a certain period of time. Thus, in this model, attention and segmentation are performed by repeating the following steps.
Step 1: Preattentive points or segments are stochastically selected to become active from the saliency map or visual working memory according to their degrees of attention. For preattentive points, proto-segments are generated around them.
Step 2: Figure-ground labeling is iterated by gradually expanding Markov random fields around proto-segments or active segments by a certain margin
until the figure segments converge or a specified number of iterations is reached. Figure-ground labeling is similarly applied to passive segments in visual working memory. If several figure segments satisfy a certain merging condition, they are merged into one segment.
Step 3: The degrees of attention of segments in visual working memory are updated. Here, active segments are the segments expanded and merged in Step 2 from proto-segments and the segments activated in Step 1.
Step 4: From these active segments, one or a few segments whose degrees of attention are larger than the others are selected as selectively attended segments.
Step 5: Each selectively attended segment and its adjacent active segments are recorded in visual working memory as a tree whose root is the selectively attended segment and whose leaves are the adjacent active segments. This tree is called a co-occurrence segment tree.
Perceptual organization arises from grouping adjacent segments which are simultaneously active in visual working memory. The grouping tendency is represented by a co-occurrence probability structure which is obtained from a set of co-occurrence trees. This makes it possible to perceptually organize part segments of an object or salient contexts of an object. In this model, proximity is treated as the factor of grouping, but the model can be extended to grouping laws based on other factors.
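As a point of reference for this overview, the following minimal sketch shows the weighted saliency combination of Eq. (1) and the stochastic sensing of preattentive points in proportion to saliency. It is an illustration only: the array names, default weights and the sampling routine are assumptions, not the paper's implementation.

```python
import numpy as np

def saliency_map(s_bc, s_sc, s_hc, s_or, w=(0.4, 0.1, 0.25, 0.25)):
    """Eq. (1): weighted sum of the four conspicuity maps."""
    return sum(w_i * s_i for w_i, s_i in zip(w, (s_bc, s_sc, s_hc, s_or)))

def sample_preattentive_points(saliency, n_points=10, rng=None):
    """Stochastically sense preattentive points, favouring salient locations."""
    rng = rng or np.random.default_rng()
    p = saliency.ravel() / saliency.sum()
    idx = rng.choice(saliency.size, size=n_points, replace=False, p=p)
    return np.column_stack(np.unravel_index(idx, saliency.shape))
```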
3 Segmentation, Attention and Perceptual Organization
3.1 Figure-Ground Segmentation
In figure-ground segmentation, each figure segment is extracted by figure-ground labeling in a Markov random field which is iteratively expanded around a proto-segment or an existing segment. In figure-ground labeling, two-dimensional features of brightness and hue are used, and a label of figure or ground is assigned to each point in the Markov random field. Let z = (b, h) be an observation of features, where b is brightness and h is hue. The set of segment labels is represented by L = {1, -1}, where "1" represents a figure label and "-1" represents a ground label. Let R be the domain of a Markov random field and let S = \{S_r\}_{r \in R} be a collection of random variables defined on R, one for each point, with realizations of segment labels s = \{s_r\}_{r \in R}, s_r \in L. Segmentation is performed by estimating the collection of segment labels s = \{s_r\}_{r \in R} that maximizes the following posterior probability for a given observed feature z = \{z_r\}_{r \in R}:

p(s|z) \sim p(z|s)\,p(s) = \frac{\exp\left(-\sum_r \left(-\log p(z_r|s_r) + U(s_r)\right)\right)}{W}   (2)

where W is the normalization factor, called the partition function, p(z_r|s_r) is the feature distribution for an assigned segment label, which is assumed to be a multivariate Gaussian distribution, and U(s_r) is given by

U(s_r) = \sum_{r' \in N_r} V(s_r, s_{r'}) = -\frac{\beta_0}{8} \sum_{r' \in N_r} (s_r \times s_{r'})   (3)

where V is the potential of a pair-site clique, N_r is the 8-neighborhood system and \beta_0 is the interaction coefficient. The problem of estimating the segment labels s = \{s_r\}_{r \in R} and a set of parameters \Phi can be solved by using the EM algorithm. Here, \Phi is the parameter set that defines the feature distributions p(z|s, \Phi), since the interaction coefficient \beta_0 is preset in this study. Concretely, \Phi consists of the means and variances of the multivariate Gaussian distributions of the figure and ground features. Mean field theory is used for the calculation in the E-step [13]. The mean field local energy function under the mean field approximation is defined by

U^{mf}(s_r|z_r, \Phi^{(l)}) = -\log p(z_r|s_r, \Phi^{(l)}) + U^{mf}(s_r)   (4)

U^{mf}(s_r) = \sum_{r' \in N_r} V(s_r, \langle s_{r'} \rangle) = -\frac{\beta_0}{8} \sum_{r' \in N_r} (s_r \times \langle s_{r'} \rangle)   (5)

where \langle s_{r'} \rangle is the expectation of a segment label in the neighborhood and l is the EM iteration number. Then, the posterior probability of a segment label is given by

p^{mf}(s_r|z_r, \Phi^{(l)}) = \frac{\exp\left(-U^{mf}(s_r|z_r, \Phi^{(l)})\right)}{\tilde{W}_r^{mf}}   (6)

where \tilde{W}_r^{mf} is the partition function, and the expectation of a segment label is obtained as

\langle s_r|z_r \rangle = \sum_{s_r} s_r \times p^{mf}(s_r|z_r, \Phi^{(l)}).   (7)
In the E-step, for each point in the domain of a Markov random field, the expectation of the segment label \langle s_r|z_r \rangle is repeatedly calculated until all the expectations of segment labels converge. Usually, only a few iterations are required for convergence. A segment label is estimated as "1" if \langle s_r|z_r \rangle > 0 and "-1" otherwise. In the M-step, the means and variances of the multivariate Gaussian distributions of the figure and ground features are updated by using the results of the E-step. The number of EM iterations is determined by a convergence criterion for the mean and variance of the figure feature or by an upper bound on the number of iterations. The merging of segments is performed as follows. Let f_0 and f be a pair of segments. They are merged if they spatially overlap and the Mahalanobis generalized distance for brightness and hue between them is not greater than a certain threshold. The Mahalanobis generalized distance D_{bh}(f_0, f) for brightness and hue between f_0 and f is defined by

D_{bh}(f_0, f) = D_b^2(f_0, f) + D_h^2(f_0, f)   (8)

D_i^2(f_0, f) = \frac{(m_{f_0,i} - m_{f,i})^2}{\frac{n_{f_0}}{n}\,\sigma_{f_0,i}^2 + \frac{n_f}{n}\,\sigma_{f,i}^2}, \quad (i = b, h)

where, for f \in \{f_0, f\}, m_{f,b} and m_{f,h} are the means of brightness and hue respectively, and \sigma_{f,b}^2 and \sigma_{f,h}^2 are the variances of brightness and hue respectively. n_{f_0} and n_f are the numbers of points of f_0 and f, and n = n_{f_0} + n_f.
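The segment-merging test of Eq. (8) can be sketched as follows; the dictionary-style segment representation (per-feature means and variances plus a point count) and the overlap flag are assumptions made for this illustration.

```python
def mahalanobis_bh(seg0, seg1):
    """Generalized distance over brightness ('b') and hue ('h'), Eq. (8)."""
    n = seg0["n"] + seg1["n"]
    d2 = 0.0
    for i in ("b", "h"):
        pooled_var = (seg0["n"] / n) * seg0["var"][i] + (seg1["n"] / n) * seg1["var"][i]
        d2 += (seg0["mean"][i] - seg1["mean"][i]) ** 2 / pooled_var
    return d2

def should_merge(seg0, seg1, overlap, threshold=1.0):
    """Merge two figure segments if they overlap and D_bh is within threshold."""
    return overlap and mahalanobis_bh(seg0, seg1) <= threshold
```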
3.2 The Degree of Attention of a Segment
The degree of attention of a segment is calculated from its saliency, closedness and attention bias. The saliency of a segment is defined by both the degree to which the surface of the segment stands out against its surrounding region and the degree to which a spot in the segment stands out by itself. The former is called the degree of surface attention and the latter the degree of spot attention. The degree of surface attention is defined by the distance between the two mean features, where a feature consists of brightness and hue, of a figure segment and its surrounding ground segment generated by figure-ground segmentation. The degree of spot attention is defined by the maximum saliency value over the points in a segment. The closedness of a segment is judged by whether it is closed within the image, that is, whether it touches the border of the image. A segment is defined as closed if it does not intersect the border of the image at more than a specified number of points. Attention bias represents an a priori or experientially acquired attentional tendency toward a region with a particular feature. In this model, an attention bias toward a face-like region is introduced, in which a segment is judged as a face simply by using its hue and aspect ratio. Then, the degree of attention A(f) of a segment f is defined by

A(f) = \delta(f, \gamma) \times \left(\eta \times G(f) + \kappa \times P(f) + \lambda \times B(f)\right)   (9)
where G(f) is the degree of surface attention, P(f) is the degree of spot attention, B(f) is the attention bias, and \eta, \kappa, \lambda (\eta + \kappa = 1, \lambda \ge 0) are their respective weighting coefficients. The function \delta(f, \gamma) takes the value 1 if the segment f is closed and \gamma otherwise, where \gamma (0 \le \gamma < 1) is the rate by which attention decreases when the segment is not closed.
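A direct transcription of Eq. (9) is given below; the default coefficients mirror the values used in the experiments of Section 4 (\eta = \kappa = 0.5, \lambda = 1.0, \gamma = 0.2), and the argument names are illustrative.

```python
def degree_of_attention(surface, spot, bias, closed,
                        eta=0.5, kappa=0.5, lam=1.0, gamma=0.2):
    """A(f) = delta(f, gamma) * (eta*G(f) + kappa*P(f) + lambda*B(f))."""
    delta = 1.0 if closed else gamma
    return delta * (eta * surface + kappa * spot + lam * bias)
```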
3.3 Perceptual Organization Among Attended Segments
The tendency of perceptual organization among segments is represented by a co-occurrence probability structure which is obtained from a set of co-occurrence trees. Let T be a set of co-occurrence segment trees. For any tree tr_i \in T, let c_i be the set of segments which are nodes of tr_i, and let \{H(c_i)\}_{c_i \in C} be an occurrence histogram for the family of segment sets C = \{c_i\} in T. Also, for any segment f_i which is a node of a tree in T, let \{H(f_i)\}_{f_i \in F} be an occurrence histogram for the set of segments F = \{f_i\} in T. Then, the occurrence rate of c_i is given by P(c_i) = H(c_i) / \sum_{c_j \in C} H(c_j) and the occurrence rate of f_i is given by P(f_i) = H(f_i) / \sum_{f_j \in F} H(f_j). The co-occurrence probability of a segment f_j for a segment f_i is defined as Co(f_j|f_i) = \sum_{c_k \in C \,\mathrm{s.t.}\, \{f_i, f_j\} \subseteq c_k} P(c_k) / P(f_i) if there exists a tree tr \in T in which f_i is the root and f_j is a leaf, and otherwise Co(f_j|f_i) = 0. Then, the degree of co-occurrence for a pair of segments f_i and f_j is expressed by the pair Co(\{f_j, f_i\}) = \langle Co(f_j|f_i), Co(f_i|f_j) \rangle. The degree of co-occurrence for a pair of segments takes a positive value only if they are adjacent. For a segment f_i, Co(f_i) = \{(f_j, Co(\{f_i, f_j\})) \mid f_j is adjacent to f_i\} represents the co-occurrence probability structure around f_i. Let O = \{f_i\} be a set of part segments
of an object. Then, Po(O) = \bigcup_{f_i \in O} Co(f_i) defines a co-occurrence probability structure among the part segments of the object and between the object and its context for perceptual organization.
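The co-occurrence statistics defined above can be sketched as follows. Each co-occurrence segment tree is represented here simply as a root segment id plus the set of its leaf ids; this representation and the use of plain ids are assumptions for the example, not the paper's data structure.

```python
from collections import Counter

def cooccurrence_probabilities(trees):
    """trees: iterable of (root, leaf_id_set).  Returns Co(fj|fi) for every
    (fj, fi) pair in which fi appears as a root and fj as one of its leaves."""
    node_sets = [frozenset({root}) | frozenset(leaves) for root, leaves in trees]
    set_hist = Counter(node_sets)                        # H(c_i)
    seg_hist = Counter(f for c in node_sets for f in c)  # H(f_i)
    total_sets = sum(set_hist.values())
    total_segs = sum(seg_hist.values())

    co = {}
    for root, leaves in trees:
        p_fi = seg_hist[root] / total_segs               # P(f_i)
        for fj in leaves:
            p_joint = sum(set_hist[c] for c in set_hist if {root, fj} <= c) / total_sets
            co[(fj, root)] = p_joint / p_fi              # Co(f_j | f_i)
    return co
```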
4 Experiments
4.1 Experimental Framework
To evaluate the performance of visual attention, segmentation and perceptual organization in the model, experiments were conducted using images of several categories from the Caltech image database [14]. Scene images in which objects of the specified categories appear against everyday backgrounds were chosen and used to evaluate selective attention to, and perceptual organization of, object segments of those categories. Fig. 1 shows the categories used in the experiments; two images were used for each of the ten categories. The main parameters were set as follows for all experiments. In each repetition of the attention and segmentation steps of Section 2, the upper bound on the number of active segments is 10 and the upper bound on the number of selected attended segments is 3. The upper bound on the number of iterations of figure-ground labeling in Step 2 is 20 for a proto-segment and 5 for a segment in visual working memory. The interaction coefficient \beta_0 in expression (3) is 1 and the upper bound on the number of EM iterations is 10. The threshold distance between segments for segment merging in expression (8) is 1.0. As for attention, the combination weights of saliency in expression (1) are w_{IC} = 0.4, w_{SC} = 0.1, w_{HC} = 0.25, w_{OR} = 0.25, and the coefficients for the degree of attention of a segment in expression (9) are \eta = 0.5, \kappa = 0.5, \lambda = 1.0, \gamma = 0.2.
Fig. 1. Images of 10 categories (“butterfly”, “chimp”, “elk”, “grasshopper”, “hibiscus”, “horse”, “iris”, “people”, “school-bus” and “telephone-box”) used in experiments
4.2 Experimental Results
Fig. 2 shows the segments of categorical objects extracted from the images shown in Fig. 1. We can observe that the segments of each categorical object were extracted almost accurately in many cases. However, in a few cases segments with similar features were joined into one segment, such as the body part in a "horse" image or the flower clusters in an "iris" image. Over 15 repetitions of consecutive selective attention, the number of times a segment of an object of each category was selected as an attended segment was
13.2 times on average over all the images used in the experiments. On the other hand, the number of times segments other than objects of the specified categories were selected as attended segments was 7.7 times on average over the 15 repetitions of consecutive selective attention. The number of times a face segment was selected as an attended segment in each image of the "people" category was 8 times on average over the 15 repetitions. This was comparatively higher than the number of times each of the other part segments of a person was selected as an attended segment, which confirms the effectiveness of the attention bias toward face-like regions. These results confirm that the attention system of this model frequently turned selective attention to object segments to which human subjects had paid attention in the process of categorizing images.
Fig. 2. Segments of categorical objects extracted from images shown in Fig. 1
Fig. 3 shows co-occurrence segment trees and co-occurrence probability structures for objects in some of the categorical images. As shown in the co-occurrence segment trees of the "people", "school-bus", "butterfly" and "horse" categories, co-occurrence was observed among part segments of categorical objects. Co-occurrence was also observed between segments of categorical objects and their surrounding segments, such as a butterfly segment and a foliage segment in a "butterfly" image, a flower segment and a foliage segment in an "iris" image, and a horse segment and a grass segment in a "horse" image. The former type of co-occurrence may induce perceptual organization for constructive recognition of the whole-part relationships of objects, and the latter type may induce perceptual organization
for contextual recognition of objects. Each diagram of a co-occurrence probability structure in Fig. 3 represents Po(O) for a set of part segments O of a categorical object. The degrees of co-occurrence among object segments, and between object segments and their surrounding segments, are shown on the arrows of the diagram. We can observe that a tendency of perceptual organization appears among object segments and between object segments and their surrounding segments. For all Po(O)s computed from all the images used in the experiments, the mean degree of co-occurrence among object segments was 0.711. If the maximum of Co(f_i|f_j) and Co(f_j|f_i) for each pair of adjacent object segments f_i and f_j is adopted as the value representing the strength of constructive perception between them, the mean of those values among object segments was 0.955. On the other hand, the mean degree of co-occurrence from object segments to their surrounding segments was 0.697. These results demonstrate that the proposed method of perceptual organization through attention may work effectively for constructive and contextual perception of objects.
Fig. 3. (a) Co-occurrence segment trees. (b) Co-occurrence probability structures.
5 Conclusions
We have proposed a new probabilistic model of visual attention, figure-ground segmentation and perceptual organization. In this model, spatially parallel preattentive points on a saliency map are organized into sequential, selectively attended segments through figure-ground segmentation on dynamically formed Markov random fields around these points, and the attended segments are perceptually organized in visual working memory for constructive object recognition. The characteristics of this model are summarized as figure-ground segmentation on multiple dynamically formed Markov random fields, selective attention to segments based on their saliency, closedness and attention bias, and perceptual organization around attended segments according to a law of proximity. Through experiments using images of several categories from the Caltech image database, it was confirmed that selective attention was frequently turned to
objects of those categories and that the proposed method of perceptual organization through attention was useful for constructive and contextual recognition of objects. The main future work is to incorporate constructive or contextual knowledge into the attention process and to control the degree of attention of segments so as to promote organized perception of objects and scenes.
References 1. Neisser, U.: Cognitive Psychology. Prentice Hall, Englewood Cliffs (1967) 2. Mozer, M.C., Vecera, S.P.: Space- and object-based attention. In: Itti, L., Rees, G., Tsotsos, J.K. (eds.) Neurobiology of Attention, pp. 130–134. Elsevier Academic Press, Amsterdam (2005) 3. Craft, E., Schutze, H., Nieber, E., Von der Heydt, R.: A neural model of figureground organization. Journal of Neurophysiology 97, 4310–4326 (2007) 4. Komatsu, H.: The neural mechanisms of perceptual filling-in. Nature Reviews Neuroscience 7, 220–231 (2006) 5. Kimchi, R., Yeshurun, Y., Cohen-Savransky, A.: Automatic, stimulus-driven attentional capture by objecthood. Psychonomic Bulletin & Review 14, 166–172 (2007) 6. Schmidt, B.K., Vogel, E.K., Woodman, G.F., Luck, S.J.: Voluntary and automatic attentional control of visual working memory. Perception & Psychophysics 64, 754–763 (2002) 7. Woodman, G.F., Vecera, S.P., Luck, S.J.: Perceptual organization influences visual working memory. Psychonomic Bulletin & Review 10, 80–87 (2003) 8. Logie, R.H.: Visuo-spatial Working Memory. Psychology Press, San Diego (1995) 9. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 20, 1254–1259 (1998) 10. Atsumi, M.: Stochastic attentional selection and shift on the visual attention pyramid. In: Proc. of the 5th International Conference on Computer Vision Systems, CD-ROM, 10 p. (2007) 11. Atsumi, M.: Attention-based segmentation on an image pyramid sequence. In: Blanc-Talon, J., Bourennane, S., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2008. LNCS, vol. 5259, pp. 625–636. Springer, Heidelberg (2008) 12. Rensink, R.A.: The dynamic representation of scenes. Visual Cognition 7, 17–42 (2000) 13. Zhang, J.: The mean field theory in EM procedures for Markov random fields. IEEE Trans. on Signal Processing 40, 2570–2583 (1992) 14. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology (2007)
Gloss and Normal Map Acquisition of Mesostructures Using Gray Codes
Yannick Francken, Tom Cuypers, Tom Mertens, and Philippe Bekaert
Hasselt University - tUL - IBBT, Expertise Centre for Digital Media
Wetenschapspark 2, 3590 Diepenbeek, Belgium
{firstname.lastname}@uhasselt.be
Abstract. We propose a technique for gloss and normal map acquisition of fine-scale specular surface details, or mesostructure. Our main goal is to provide an efficient, easily applicable, but sufficiently accurate method to acquire mesostructures. We therefore employ a setup consisting of inexpensive and accessible components, including a regular computer screen and a digital still camera. We extend the Gray code based normal map acquisition approach of Francken et al. [1] which utilizes a similar setup. The quality of the original method is retained and without requiring any extra input data we are able to extract per pixel glossiness information. In the paper we show the theoretical background of the method as well as results on real-world specular mesostructures.
1 Introduction
During the last few decades, computers have become increasingly important for performing a wide variety of tasks. One of these tasks is generating images of virtual scenes. Nowadays, convincing rendering techniques are applied in many applications such as computer games. Even photo-realistic images can be generated offline, for example to be used in movies. To this end, fast and/or accurate rendering techniques have been developed, approximating or accurately simulating the light transport within the virtual world. However, even if light interaction could be simulated in a physically correct manner, scene data still has to be provided in the form of a 3D model. If the input scene data does not contain small-scale surface details such as scratches and imperfections, the scene will probably be judged as unrealistic. Manually modeling the world at such a level of detail can be a tedious task, which suggests the use of automatic 3D scanning methods. Throughout the years, many techniques have been proposed to digitize the world around us. These techniques typically capture (a) the light in the scene, (b) the geometry, or (c) the reflectance properties, or any combination thereof. In this paper we mainly focus on capturing the reflectance properties, although we extend a fine-scale geometry acquisition system. Even though several techniques already exist for scanning reflectance properties as well as fine-scale geometry, users tend to stick to their manual approach. One of the reasons for this is the complexity of currently available methods. Many approaches
require special-purpose setups containing exotic hardware components, time-consuming calibration procedures, difficult implementation and scanning procedures, etc. In this paper the goal is to make small-scale appearance acquisition available to the public, bridging the gap between current research and practical usage. This is achieved by presenting an efficient, easy-to-implement approach employing solely off-the-shelf hardware components: a regular still camera and a computer screen that functions as a planar illuminant.
2 Related Work
In this section we distinguish between fine-scale geometry acquisition and reflectance acquisition.
2.1 Fine-Scale Geometry
Several techniques have been introduced specifically for recovering small-scale surface details, in the form of relief (height) maps or normal maps, assuming various types of materials [2,3,4,5]. The majority of these methods require a specialized hardware setup [6,7,8,9,10], have long acquisition/processing times [11,12] or are not able to scan specular surfaces [2,3]. In our work we use a slightly adapted version of the Gray code based approach of Francken et al. [1], employing a screen-camera setup as the acquisition setup. Because of the use of a planar illuminant and Gray codes, fine-scale specular surface geometry can be measured accurately using only up to 40 input images.
2.2 Reflectance
Acquiring spatially varying reflectance usually requires a complicated hardware setup, which measures the Bidirectional Reflectance Distribution Function (BRDF) [13] at each spatial location. This is a four-dimensional function describing the surface's response given the incident (light) and exitant (observation) directions. Our method is much simpler and cheaper. Even though we assume a simplified BRDF model, our technique is able to reproduce the mesostructure's appearance faithfully. Numerous representations exist for storing either modeled or captured BRDFs [14,15,16,17]. As storing individual data samples of densely sampled BRDFs is memory inefficient, approximating models are often fitted to the large data collection. This is achieved either by fitting an analytical model [18,19,20,21,22], or by projecting the data onto polynomial [23,6], spherical harmonic [24,25,26] or wavelet bases [27,28]. For the sake of simplicity as well as compatibility with known tools, in our work we employ a simple analytical Phong model [29] in which the glossiness is represented by a single exponent parameter. Previous methods tend to focus mainly on improving BRDF quality, and less on acquisition speed and practical usability for a large class of users. Often very specialized setups or long and tedious procedures are required. As we focus on increasing wide applicability rather than improving the quality of recent BRDF methods, an approximate glossiness acquisition suffices for our purposes.
Fig. 1. Setup. (a) The digital still camera captures light emitted by the screen and reflected off a specular/glossy mesostructure. (b) A screen modeled as a rectangular window on a virtual surrounding hemispherical light source.
The most closely related approach was presented by Ghosh et al. [30]. They estimate roughness as well as anisotropy from second-order spherical gradient illumination, but their approach relies on a specialized hardware setup. In our work we take an alternative route, as we want to avoid non-trivial hardware setups. We achieve this by using a screen-camera setup consisting of off-the-shelf and omnipresent hardware components. We start from an existing Gray code based mesostructure acquisition system [1] and show that glossiness information can easily be extracted from the data already available for shape reconstruction. The original method only has to be slightly modified, replacing the polarization-based specular-diffuse separation with the use of pattern complements. No extra data is required, and in addition to LCD screens, non-polarization-based illuminants such as CRT screens can now be employed.
3 Setup
The proposed setup consists of a digital still camera that serves as the light sensor and a computer screen that serves as a planar illuminant (Fig. 1). Current digital still cameras are relatively inexpensive and are able to accurately measure light reflections. LCD as well as CRT computer screens are also inexpensive and omnipresent, making them ideal controllable light sources. In order to turn a screen-camera setup into a mesostructure acquisition setup, a geometric calibration step is required to relate 2D screen pixels to 3D locations with respect to the camera. As the screen is not directly visible to the camera, a spherical mirror is employed for the geometric calibration [31]. To find the internal camera parameters and the mesostructure's supporting plane, we use a standard calibration toolbox which makes use of a checkerboard pattern [32]. Radiometric calibration, which relates emitted and captured light intensities, is not essential, as we are using binary (Gray code) illumination patterns.
4 Acquiring Surface Normals and Gloss
Acquiring local surface orientation and glossiness is achieved by placing the target object in front of a CRT or LCD monitor which acts as a light source, and recording the corresponding images using a camera. As in the normal map acquisition technique [1], we display stepwise refining vertical and horizontal Gray code patterns. We also display each pattern's complement in order to robustly separate diffuse from specular reflection. The specular reflections then efficiently encode discrete spatial screen coordinates in a bit-wise fashion. In a geometrically calibrated setup, this allows for estimating the ideal reflection direction for each pixel. This enables us to estimate the surface normal n as the halfway vector between the reflection vector r and the viewing vector v, as depicted in Fig. 1 (a). In this section, we extend this system by performing an additional glossiness analysis step.
4.1 Overview
In order to extract glossiness information from the recorded mesostructure taken under Gray code illumination, we require some additional illumination patterns. More specifically, complements of the original Gray code patterns are introduced. Fortunately, these render the use of polarization-based separation redundant, so the number of required patterns does not increase. This is due to the fact that specular highlights are considered much stronger than diffuse reflections [33,11], and hence a binary decision (white or black reflection) can robustly be made by comparing the pixels illuminated by the pattern and by the pattern's complement. As indicated by the grey area in Fig. 2 (b), after a certain number of pattern refinements no extra information is gained, as the intensity differences between reflected patterns and their complements converge to zero. We analyze this convergence process to obtain glossiness information. Without requiring additional input images, we are thus able to obtain a per-pixel shininess coefficient as well as a surface normal. The more pattern refinements that can be discerned, the more specular the material is, and vice versa. This is the case because glossy reflections blur the reflected incoming light pattern. More precisely, the reflected pattern is convolved with a BRDF kernel around the ideal reflection direction [34]. The number of discernible refinements is thus proportional to the shininess of the material. The size (or narrowness) and shape of the kernel are defined by the specular lobe of the BRDF. For the sake of simplicity as well as compatibility with known tools, we assume a Phong reflection model. This symmetric lobe is then described by a single exponent value n, which is stored in the gloss map.
4.2 Theory
We now formalize the concept proposed in the previous section. To this end, a model is built that describes the captured radiance L of an imaged surface point, observed from a direction v and illuminated by a given light pattern P. The equation is given by

L(v) = \int_\Omega P(\omega) \left[ R_d(\omega, n) + R_s(r, \omega, n) \right] d\omega   (1)
Fig. 2. Acquisition pipeline. (a) mesostructure, (b) intensity differences depending on the pattern refinement level, (c) detected normal codes and Phong kernels.
The following assumptions are made before applying this equation to determine the gloss level:
Specular + diffuse: The imaged surface is assumed to be a combination of a specular component R_s(r, \omega, n) and a diffuse component R_d(\omega, n), where \omega is the incoming light direction, n the surface normal and r the specular reflection vector depending on the observation direction v.
Distant hemispherical illumination: The mesostructure is assumed to be a point in front of the center of the screen, illuminated by a rectangular part of the hemisphere \Omega = [\frac{\pi}{2} - \sigma_v, \frac{\pi}{2} + \sigma_v] \times [\frac{\pi}{2} - \sigma_h, \frac{\pi}{2} + \sigma_h] (Fig. 1 (b)).
Inter-reflections and occlusions: Both inter-reflections and occlusions are ignored for reasons of simplicity.
Under uniform illumination u, where P_u(\omega) = 1 for each incoming light direction \omega, the equation can be simplified:

L_u(v) = \int_\Omega R_d(\omega, n) \, d\omega + \int_\Omega R_s(r, \omega, n) \, d\omega   (2)
       = L_d + L_s(v)   (3)

As we use Gray code patterns, we define the patterns P_i in terms of the pattern refinement level i. For each incoming light direction \omega \in \Omega, the pattern P_i(\omega) is either 0 or 1. The precise pattern definitions for vertical patterns P_i^v and horizontal patterns P_i^h are given in equations (4) and (5), where (\theta, \phi) \in \Omega. Notice that the Gray code patterns are basically modeled as a phase-shifted (by 1/4 of the period) square wave in the vertical or horizontal interval [\frac{\pi}{2} - \sigma, \frac{\pi}{2} + \sigma]. With each pattern refinement from i to i+1 the frequency of the wave doubles, as 2^{(i+1)-2} / 2^{i-2} = 2.

P_i^v(\omega) = \frac{1}{2} \Psi\!\left( \frac{2^{i-2}(\theta - \frac{\pi}{2} + \sigma_v)}{2\sigma_v} + \frac{1}{4} \right) + \frac{1}{2}   (4)

P_i^h(\omega) = \frac{1}{2} \Psi\!\left( \frac{2^{i-2}(\phi - \frac{\pi}{2} + \sigma_h)}{2\sigma_h} + \frac{1}{4} \right) + \frac{1}{2}   (5)

The integer function \Psi is defined as

\Psi(x) = \begin{cases} +1 & \text{if } x - \lfloor x \rfloor \in [0, 0.5) \\ -1 & \text{if } x - \lfloor x \rfloor \in [0.5, 1) \end{cases}   (6)

The complements of the patterns also need to be defined. They are referred to as P_i^{c,v} and P_i^{c,h}:

P_i^{c,v}(\omega) = 1 - P_i^v(\omega)   (7)
P_i^{c,h}(\omega) = 1 - P_i^h(\omega)   (8)

The captured radiance can now be modeled by applying the previous definitions. The remainder of this section focuses on the use of horizontal Gray code patterns only; an analogous derivation can be done for vertical patterns.

L_i(v) = \frac{1}{2}\left[L_d + L_s(v)\right] + \frac{1}{2}\int_\Omega \Psi\!\left( \frac{2^{i-2}(\phi - \frac{\pi}{2} + \sigma_h)}{2\sigma_h} + \frac{1}{4} \right) R_d(\omega, n) \, d\omega + \frac{1}{2}\int_\Omega \Psi\!\left( \frac{2^{i-2}(\phi - \frac{\pi}{2} + \sigma_h)}{2\sigma_h} + \frac{1}{4} \right) R_s(r, \omega, n) \, d\omega   (9)

If the frequency of the pattern i is sufficiently large, the Lambertian term is approximately zero, as shown by Lamond et al. [35]. The underlying reason for this is that the Lambertian reflection can be seen as a low-frequency convolution filter blurring away the high-frequency pattern. Hence the following form is obtained:

L_i(v) = \frac{1}{2}\left[L_d + L_s(v)\right] + \frac{1}{2}\int_\Omega \Psi\!\left( \frac{2^{i-2}(\phi - \frac{\pi}{2} + \sigma_h)}{2\sigma_h} + \frac{1}{4} \right) R_s(r, \omega, n) \, d\omega   (10)

When the pattern frequency is high compared to the size of the specular lobe, the same holds for the specular term, meaning that it also converges to zero. Hence, the wider the specular lobe, the faster this term converges to zero. As the same reasoning applies to the pattern complement P_i^c, the difference between the radiance of a scene illuminated by P_i and by P_i^c converges to zero after a certain pattern refinement level i:

|L_i(v) - L_i^c(v)| = \frac{1}{2}\left| L_d + L_s(v) - L_d^c - L_s^c(v) \right| = 0   (11)

Concretely, the smallest pattern number i has to be found such that for all subsequent patterns j \ge i the intensity differences |L_j(v) - L_j^c(v)| drop below a given threshold
Fig. 3. Relation between pattern refinement level i and the gloss level n
(Fig. 2 (b)). When i is found, it is converted into a corresponding Phong kernel K(\omega) = \cos^n(\omega) (Fig. 2 (c)). To this end we propose a simple heuristic which takes into account the following constraint: the surface area \int_0^{\pi/2} K(\omega) \, d\omega under the kernel K has to halve each time the assigned i increases by one level. We have empirically established that this relation can be well approximated by a simple exponential function:

n = 4^{(i-1)}   (12)

This relation is illustrated in Fig. 3. Note that we assign no n value when i = 0, as the material is then considered perfectly diffuse. Notice that this kernel fitting is approximate because of the limited number of input images we require and the reflection model we employ. However, as we focus more on efficiency and easy applicability than on pure accuracy, it yields sufficiently precise results, as can be seen in the next section.
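The per-pixel gloss estimation described above can be sketched as follows: count how many refinement levels remain discernible before the pattern/complement differences stay below a threshold, then apply Eq. (12). The array layout (levels × height × width) and the threshold value are assumptions made for this example.

```python
import numpy as np

def discernible_levels(diffs, threshold=0.05):
    """Number of refinement levels whose pattern/complement difference is
    still above the threshold before the differences converge to zero."""
    below = diffs < threshold                               # levels x H x W
    # converged[k] is True iff level k and all finer levels are below threshold
    converged = np.flip(np.cumprod(np.flip(below, 0), 0), 0).astype(bool)
    return (~converged).sum(axis=0)

def gloss_map(diffs, threshold=0.05):
    """Per-pixel Phong exponent via Eq. (12); 0 marks a (near) diffuse pixel."""
    i = discernible_levels(diffs, threshold)
    return np.where(i > 0, 4.0 ** (i - 1), 0.0)
```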
5 Results and Discussion
We have created a proof-of-concept implementation of the described procedure. The setup we employed consists of a 19-inch LCD monitor and a Canon EOS 400D camera. Experiments were done on different specular materials including plastics, leather, metals, glass and polished marble. For all our results 40 input images were recorded: 10 for each direction, plus their complements. Results on real-world examples are illustrated in Fig. 4. Column (a) shows the acquired normal maps stored in the red, green and blue color channels. Column (b) contains the gloss maps. The gloss values range from black to white. Black values indicate
Fig. 4. Results. (a) normal maps obtained from detected codes, (b) gloss maps containing Phong exponents, (c) renderings.
diffuse reflections, white values represent highly specular reflections, and intermediate grey values represent glossy reflections. The results show, for example, that different metal coatings yield different gloss values (top row), the glass of the watch is more specular than the plastics (middle row), and scratches on the wallet's hasp make it locally less specular (bottom row). However, also notice in this image that self-shadowed regions in the pores and grooves of the leather are mistakenly classified as non-specular, since the occlusion assumptions did not hold. Column (c) shows virtual renderings of the scanned surfaces under point light illumination, taking into account the displayed normal and gloss maps as well as regular texture maps. Besides inherent problems of the relief acquisition, such as light occlusions, the main limitation is the small number of available per-pixel samples due to the efficient binary encoding. Only p possible gloss levels can be distinguished using our technique, where p is the number of patterns used (in our case 10). In addition, as the kernel width is directly dependent on the pattern's exponentially decreasing stripe width, only a few
possible kernels can be assigned to diffuse materials. A more uniform kernel size distribution may be desirable in this case.
6 Conclusions
In this paper we showed how a straightforward extension of Gray code based normal scanning can provide a very simple BRDF approximation in the form of a single Phong exponent. However, taking these approximate gloss maps into account in addition to traditional texture and normal maps tends to considerably improve re-renderings of heterogeneous materials.
7 Future Work
Improvements are possible regarding the convolution kernel approximations. We are currently looking into recovering more general BRDFs by adding extra, more optimal patterns to allow for a more precise kernel fitting. Furthermore, we believe this work can serve as the basis for an integrated normal map acquisition system, where the type of pattern/method depends on which mesostructure regions are processed.
Acknowledgements
Part of the research at EDM is funded by the ERDF (European Regional Development Fund) and the Flemish government. Furthermore we would like to thank our colleagues for their help and inspiration.
References 1. Francken, Y., Cuypers, T., Mertens, T., Gielis, J., Bekaert, P.: High quality mesostructure acquisition using specularities. In: Proceedings of Conference on Computer Vision and Pattern Recognition, pp. 1–7. IEEE Computer Society Press, Los Alamitos (2008) 2. Rushmeier, H., Taubin, G., Gu´eziec, A.: Applying Shape from Lighting Variation to Bump Map Capture. IBM TJ Watson Research Center (1997) 3. Hern´andez, C., Vogiatzis, G., Brostow, G.J., Stenger, B., Cipolla, R.: Non-rigid photometric stereo with colored lights. In: Proceedings of International Conference on Computer Vision (2007) 4. Morris, N.J.W., Kutulakos, K.N.: Reconstructing the surface of inhomogeneous transparent scenes by scatter-trace photography. In: Proceedings of International Conference on Computer Vision (2007) 5. Chung, H.S., Jia, J.: Efficient photometric stereo on glossy surfaces with wide specular lobes. In: Proceedings of Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, Los Alamitos (2008) 6. Malzbender, T., Gelb, D., Wolters, H.J.: Polynomial texture maps. In: SIGGRAPH, pp. 519–528 (2001) 7. Han, J.Y., Perlin, K.: Measuring bidirectional texture reflectance with a kaleidoscope. In: SIGGRAPH, pp. 741–748. ACM Press, New York (2003)
8. Neubeck, A., Zalesny, A., Gool, L.V.: 3d texture reconstruction from extensive btf data. In: Texture 2005 Workshop in conjunction with ICCV 2005, pp. 13–19 (2005) 9. Wang, J., Dana, K.J.: Relief texture from specularities. Transactions on Pattern Analysis and Machine Intelligence 28, 446–457 (2006) 10. Ma, W.C., Hawkins, T., Peers, P., Chabert, C.F., Weiss, M., Debevec, P.: Rapid acquisition of specular and diffuse normal maps from polarized spherical gradient illumination. In: Proceedings of Eurographics Symposium on Rendering (2007) 11. Chen, T., Goesele, M., Seidel, H.P.: Mesostructure from specularity. In: Proceedings of Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1825–1832 (2006) 12. Holroyd, M., Lawrence, J., Humphreys, G., Zickler, T.: A photometric approach for estimating normals and tangents. SIGGRAPH Asia 27, 133 (2008) 13. Nicodemus, F., Richmond, J., Hsia, J., Ginsberg, I., Limperis, T.: Geometrical considerations and nomenclature for reflectance. NBS monograph 160, 201–231 (1977) 14. He, X.D., Torrance, K.E., Sillion, F.X., Greenberg, D.P.: A comprehensive physical model for light reflection. In: SIGGRAPH, New York, NY, USA, pp. 175–186. ACM Press, New York (1991) 15. Ashikhmin, M., Shirley, P.: An anisotropic phong brdf model. Journal of Graphics Tools 5, 25–32 (2000) 16. Ashikmin, M., Premoˇze, S., Shirley, P.: A microfacet-based brdf generator. In: SIGGRAPH, New York, NY, USA, pp. 65–74. ACM Press, New York (2000) 17. He, X.D., Heynen, P.O., Phillips, R.L., Torrance, K.E., Salesin, D.H., Greenberg, D.P.: A fast and accurate light reflection model. SIGGRAPH 26, 253–254 (1992) 18. Ward, G.J.: Measuring and modeling anisotropic reflection. SIGGRAPH 26, 265–272 (1992) 19. Lafortune, E.P.F., Foo, S.C., Torrance, K.E., Greenberg, D.P.: Non-linear approximation of reflectance functions. In: SIGGRAPH, New York, NY, USA, pp. 117–126. ACM Press/Addison-Wesley Publishing Co (1997) 20. Lensch, H.P.A., Goesele, M., Kautz, J., Heidrich, W., Seidel, H.P.: Image-based reconstruction of spatially varying materials. In: Theoharis, T. (ed.) Algorithms for Parallel Polygon Rendering. LNCS, vol. 373, pp. 103–114. Springer, Heidelberg (1989) 21. Gardner, A., Tchou, C., Hawkins, T., Debevec, P.: Linear light source reflectometry. In: SIGGRAPH, pp. 749–758. ACM Press, New York (2003) 22. Ngan, A., Durand, F., Matusik, W.: Experimental analysis of brdf models. In: Proceedings of Eurographics Symposium on Rendering, Eurographics Association, pp. 117–226 (2005) 23. Koenderink, J.J., Doorn, A.J.v., Stavridi, M.: Bidirectional reflection distribution function expressed in terms of surface scattering modes. In: Proceedings of European Conference on Computer Vision, London, UK, pp. 28–39. Springer, Heidelberg (1996) 24. Westin, S.H., Arvo, J.R., Torrance, K.E.: Predicting reflectance functions from complex surfaces. In: SIGGRAPH, pp. 255–264. ACM Press, New York (1992) 25. Ramamoorthi, R., Hanrahan, P.: Frequency space environment map rendering. In: SIGGRAPH, pp. 517–526. ACM Press, New York (2002) 26. Basri, R., Jacobs, D.W.: Lambertian reflectance and linear subspaces. Transactions on Pattern Analysis and Machine Intelligence 25, 218–233 (2003) 27. Lalonde, P., Fournier, A.: A wavelet representation of reflectance functions. Transactions on Visualization and Computer Graphics 3, 329–336 (1997) 28. Ng, R., Ramamoorthi, R., Hanrahan, P.: All-frequency shadows using non-linear wavelet lighting approximation. In: SIGGRAPH, pp. 376–381. ACM Press, New York (2003) 29. 
Blinn, J.F.: Models of light reflection for computer synthesized pictures. In: SIGGRAPH, pp. 192–198. ACM Press, New York (1977) 30. Ghosh, A., Chen, T., Peers, P., Wilson, C.A., Debevec, P.: Estimating specular roughness and anisotropy from second order spherical gradient illumination. In: Theoharis, T. (ed.) Algorithms for Parallel Polygon Rendering. LNCS, vol. 373, Springer, Heidelberg (1989)
31. Francken, Y., Hermans, C., Bekaert, P.: Screen-camera calibration using gray codes. In: Proceedings of Canadian Conference on Computer and Robot Vision. IEEE Computer Society, Los Alamitos (2009) 32. Bouguet, J.Y.: Camera calibration toolbox for matlab (2006) 33. Umeyama, S., Godin, G.: Separation of diffuse and specular components of surface reflection by use of polarization and statistical analysis of images. Transactions on Pattern Analysis and Machine Intelligence 26, 639–647 (2004) 34. Ramamoorthi, R., Hanrahan, P.: A signal-processing framework for inverse rendering. In: SIGGRAPH, pp. 117–128. ACM Press, New York (2001) 35. Lamond, B., Peers, P., Ghosh, A., Debevec, P.: Image-based separation of diffuse and specular reflections using environmental structured illumination. In: Proceedings of International Conference on Computational Photography (2009)
Video Super-Resolution by Adaptive Kernel Regression
Mohammad Moinul Islam, Vijayan K. Asari, Mohammed Nazrul Islam, and Mohammad A. Karim
Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA
{misla001,vasari,mislam,mkarim}@odu.edu
Abstract. A novel approach for super-resolution using kernel regression technique is presented in this paper. The new algorithm uses several low resolution video frames to estimate unknown pixels in a high resolution frame using kernel regression employing adaptive Gaussian kernel. Experiments conducted on several video streams to evaluate the effect of the proposed algorithm showed improved performance when compared with other state of the art techniques. This resolution enhancement technique is simple and easy to implement and it can be used as a software alternative to obtain high quality and high resolution video streams from low resolution versions.
1 Introduction
Super-resolution is a process of image resolution enhancement by which low quality, low resolution (LR) images are used to generate a high quality, high resolution (HR) image [1]. There are numerous applications of super-resolution in the areas of image processing and computer vision, such as long range target detection, recognition and tracking. It also has many applications in consumer products such as cell phones, webcams, high-definition television (HDTV), closed circuit television (CCTV), etc. Several techniques for image super-resolution have been developed and presented in the literature [2-7]. These can be classified into two categories: single image super-resolution and super-resolution from several frames. In the first case, there is no additional information available to enhance the resolution, so the algorithms are based on either image smoothing or interpolation. Interpolation techniques give better performance than smoothing techniques, but both methods smooth edges and cause blurring problems [2]. Single image super-resolution based on machine learning techniques uses patches from training images, and a tree-based approximate nearest neighbor search is performed to find the desired HR patch from the LR patch [3]. Li et al. [4] used support vector regression (SVR) to find the mapping function between a low resolution patch and the central pixel of an HR patch. These methods are computationally expensive and require huge memory for handling the training data. The classical super-resolution restoration from several noisy images was proposed by Elad and Feuer [5], where the mathematical model for obtaining a super-resolution image from several LR images is described as:

Y_k = D_k C_k F_k X + E_k \quad \text{for } 1 \le k \le N   (1)
In the above equation, \{Y_k\}_{k=1}^N represents N images of size M \times M, each of which is a different representation of the desired HR image X. The matrix F_k represents the geometric warp performed on the image X, C_k is a linear space-variant blur matrix, and D_k is a matrix representing the decimation operator in Y_k. E_k is the additive zero-mean Gaussian noise in the k-th LR frame. Another approach to super-resolution is iterative back projection (IBP), similar to the projections in computer aided tomography (CAT) [6]. This method starts with an initial guess of the HR image, from which a set of LR images is simulated and compared with the observed images to update the HR image. A modified IBP algorithm based on an elliptical weighted area (EWA) filter for modeling the spatially-variant point spread functions (PSF) was presented in [7]. The rest of the paper is organized as follows. Section 2 presents an overview of the kernel regression technique. Section 3 describes the proposed algorithm. Simulation results and performance analysis are discussed in Section 4. Finally, concluding remarks are presented in Section 5.
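For illustration, the observation model of Eq. (1) can be simulated as below; the warp is reduced to a translation and the blur to a Gaussian, which are simplifying assumptions rather than the exact operators of [5].

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def simulate_lr_frame(hr, dx=0.0, dy=0.0, blur_sigma=1.0, decimation=2,
                      noise_sigma=2.0, rng=None):
    """Y_k = D_k C_k F_k X + E_k with a translational warp and Gaussian blur."""
    rng = rng or np.random.default_rng()
    warped = shift(hr.astype(float), (dy, dx), order=1)               # F_k X
    blurred = gaussian_filter(warped, blur_sigma)                     # C_k
    decimated = blurred[::decimation, ::decimation]                   # D_k
    return decimated + rng.normal(0.0, noise_sigma, decimated.shape)  # + E_k
```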
2 Kernel Regression Technique
Kernel regression analysis is a nonparametric regression method for estimating the value of an unknown function f(x) at any given point based on the observations. For the two-dimensional case, the regression model is

Y_i = f(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots, N, \quad x_i = [x_{1i}, x_{2i}]^T   (2)

where \{x_i, i = 1, 2, \ldots, N\} are the design points (pixel positions), \{Y_i, i = 1, 2, \ldots, N\} are observations (pixel values) of the response variable Y, f is a regression function, \{\varepsilon_i, i = 1, 2, \ldots, N\} are independent identically distributed (i.i.d.) random errors, and N is the number of samples (number of frames). The generalized kernel estimate \hat{f}(x) is given by the following minimization problem [8]:

\min_{p,\, q_1, \ldots, q_l} \sum_{i=1}^{N} \left[ Y_i - \left\{ p + q_1(x_i - x) + \ldots + q_l(x_i - x)^l \right\} \right]^2 K\!\left( \frac{x_i - x}{h} \right)   (3)
where K(\cdot) is the kernel function with bandwidth h and l is a positive integer which determines the order of the kernel estimator. The above equation is solved to determine the unknown regression coefficients p, q_1, q_2, \ldots, q_l. In this paper, we use a Gaussian kernel in the above regression equation. Solving equation (3) for l = 0 gives the zero-order local constant kernel estimator, also known as the Nadaraya-Watson kernel estimator, given as [8]:

\hat{f}_{NW}(x) = \frac{\sum_{i=1}^{N} Y_i \, K\!\left( \frac{x_i - x}{h} \right)}{\sum_{i=1}^{N} K\!\left( \frac{x_i - x}{h} \right)}   (4)
Higher order kernel estimators can also be applied, but with increasing complexity. One major problem of the ordinary kernel is that it does not depend on the image characteristics. We therefore use a data-dependent kernel which depends not only on the pixel positions but also on the intensity values of the samples. A general form of adaptive kernel is the bilateral kernel; for the zero-order case, the above estimator is modified as [9]

\hat{f}_{NW}(x) = \frac{\sum_{i=1}^{N} Y_i \, K\!\left( \frac{x_i - x}{h} \right) K\!\left( \frac{Y_i - Y}{h_r} \right)}{\sum_{i=1}^{N} K\!\left( \frac{x_i - x}{h} \right) K\!\left( \frac{Y_i - Y}{h_r} \right)}   (5)

where h_r is the intensity-dependent smoothing scalar.
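A minimal sketch of the adaptive (bilateral) Nadaraya-Watson estimator of Eq. (5) with Gaussian kernels is given below; the sample arrays and bandwidth values are illustrative assumptions.

```python
import numpy as np

def bilateral_nw(x, y_ref, xs, ys, h=1.0, hr=10.0):
    """Eq. (5): zero-order estimate at position x with reference intensity y_ref.
    xs: (N, 2) sample positions, ys: (N,) sample intensities."""
    spatial = np.exp(-0.5 * np.sum(((xs - x) / h) ** 2, axis=1))
    range_w = np.exp(-0.5 * ((ys - y_ref) / hr) ** 2)
    w = spatial * range_w
    return np.sum(w * ys) / np.sum(w)
```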
Fig. 1. Block diagram of kernel regression based video super-resolution (LR frames → buffer → kernel regression → smoothing filter → HR image)
3 Proposed Algorithm
Figure 1 shows the basic block diagram of the proposed technique. As shown in the figure, low resolution frames are input to a buffer which stores N LR frames. Once the N-th frame is received, the kernel regression method is applied to construct an HR image. Given N low resolution video frames of size m×n, the problem of super-resolution is to estimate a high resolution frame of size rm×rn, where r is the resolution enhancement factor. For simplicity, we assume r = 2. This means a 2×2 LR grid is zoomed into a 4×4 HR grid; hence, for each 2×2 LR block, 12 additional pixel values need to be determined. In order to obtain a super-resolution image with acceptable registration error, we use four LR frames. Each LR frame can be considered a downsampled, degraded version of the desired HR image. These LR frames contribute new information for interpolating sub-pixel values if there are relative motions from frame to frame [1]. The regression output is then smoothed with a smoothing filter. In Figure 2, four 2×2 LR data blocks are mapped into a single 4×4 HR grid. Subscripts in the LR frames indicate frame numbers, whereas in the HR frame they indicate reconstructed pixels. Different symbols correspond to information from different 2×2 data blocks. As there are motions between successive blocks, pixels are geometrically shifted in the HR blocks, which contributes sub-pixel information for constructing the HR image. The following assumptions are made regarding the video sequences: 1) frame rates are not too low, otherwise some motion information may be lost; 2) the distance between the objects and the camera is such that a small change in camera motion does not cause significant displacement of the object regions.
Fig. 2. (a) Four consecutive LR frames of size 2×2, (b) A 4×4 HR frame is constructed from the four 2×2 LR frames
Figure 3 illustrates the proposed algorithm for resolution enhancement by a factor of 2. For each pixel in the LR frame, four pixels are generated by placing the kernel function at the desired LR pixel position and taking samples from four neighborhood positions. A small perturbation of the camera is allowed in order to estimate sub-pixel values. Thus, pixel a11 is found by applying the kernel at pixel location a1 with four observations in the temporal direction from the LR frames, a1, a2, a3 and a4. Similarly, a12 is obtained by applying the kernel at pixel location a1 but with the four observations b1, b2, b3 and b4 of the LR frames, and so on.
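The construction of one 2×2 HR block from a single LR pixel, as described above, can be sketched as follows. Because the four temporal observations of a given neighbour share the same spatial position, the spatial kernel reduces to a constant and only the intensity (range) kernel is shown; the assumption of registered frames and the choice of a1 as the reference intensity are simplifications made for this illustration.

```python
import numpy as np

def estimate_hr_block(frames, y, x, hr_sigma=10.0):
    """Return the 2x2 HR block generated from LR pixel (y, x).
    frames: four consecutive, registered LR frames (H x W arrays)."""
    H, W = frames[0].shape
    ref = float(frames[0][y, x])                    # kernel centred at a1
    block = np.zeros((2, 2))
    for dy in (0, 1):
        for dx in (0, 1):
            ny, nx = min(y + dy, H - 1), min(x + dx, W - 1)
            obs = np.array([float(f[ny, nx]) for f in frames])
            w = np.exp(-0.5 * ((obs - ref) / hr_sigma) ** 2)
            block[dy, dx] = np.sum(w * obs) / np.sum(w)
    return block
```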
Fig. 3. Detailed diagram showing the construction of the above HR frame as shown in Fig. 2. For each position of the kernel, four pixels are generated taking four samples from the LR frames.
4 Simulation Results
The proposed algorithm is tested with several LR video sequences and their PSNR is measured as the performance index. Four consecutive LR frames are used to construct
an HR frame with 2×2 scaling using our kernel regression method. Figure 4 shows the results of super-resolution reconstruction of grayscale sequences by bi-cubic interpolation, iterative back projection (IBP), robust super-resolution and the proposed kernel regression method. Simulation results for IBP and the robust super-resolution method were obtained using the software available at [10]. It can be seen that the results obtained by IBP and the robust super-resolution method are noisy, the interpolated images are blurry, and the high resolution image created by our method is clear. Figure 5 shows the super-resolution reconstruction of a high resolution image from another set of grayscale images containing more texture. Again, the proposed kernel regression method performs better than the other methods. Figure 6 shows the reconstruction of a color image.
Fig. 4. (a) Four consecutive LR frames from Emily video [11], (b) HR image with a magnification factor 2 using bi-cubic interpolation, (c) Iterative back projection (IBP) (d) Robust superresolution, and (e) Proposed kernel regression method
In order to measure the performance of the proposed technique, image frames are down-sampled and then reconstructed using the proposed algorithm to the original size. Since the original HR image is available, the restoration quality is measured by computing the PSNR as in [12].
PSNR = 10 \log_{10} \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} 255^2}{\sum_{i=1}^{M} \sum_{j=1}^{N} \left( f(i,j) - \hat{f}(i,j) \right)^2}   (6)
where f is the original HR image and \hat{f} is the reconstructed image. An image with a higher PSNR means a better reconstruction, but PSNR does not always represent the true quality of the image [12]. The PSNR results for the test videos are shown in Table 1. In all cases, the kernel regression method shows a higher PSNR value, which supports the effectiveness of the algorithm. The algorithm performs well if the above-mentioned assumptions hold, i.e., there is only small motion from frame to frame. The main advantage of this algorithm is that it is simple to implement and hence computationally efficient.
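A direct implementation of the PSNR measure in Eq. (6) for 8-bit images is shown below.

```python
import numpy as np

def psnr(f, f_hat):
    """Eq. (6): 10*log10 of the 255^2 signal sum over the squared-error sum."""
    err = np.sum((f.astype(float) - f_hat.astype(float)) ** 2)
    return 10.0 * np.log10((255.0 ** 2) * f.size / err)
```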
Fig. 5. (a) Four consecutive LR frames from Alpaca video [11], (b) HR image with a magnification factor 2 using bi-cubic interpolation, (c) Iterative back projection (IBP), (d) Robust super-resolution, and (e) Proposed kernel regression method
Fig. 6. (a) Four consecutive LR frames from news video [13], (b) HR image with a magnification factor 2 using bi-cubic interpolation, (c) Iterative back projection (IBP), (d) Robust super-resolution, and (e) Proposed kernel regression method

Table 1. Performance comparison of the proposed super-resolution method (PSNR)

Image    Bi-cubic interpolation    IBP      Robust super-resolution    Proposed method
Emily    16.51                     16.23    16.67                      16.70
Alpaca   18.07                     18.19    19.30                      19.64
News     22.93                     22.65    22.80                      23.05
5 Conclusions

A new algorithm for image super-resolution has been presented in this paper. It uses a kernel regression technique on video frames to construct a high resolution frame. The main advantage of this method is that it does not need exhaustive motion estimation and hence is computationally simple. However, this method is not suitable if there are significant motions between successive frames. Simulation results showed that the proposed algorithm performs comparably to other state-of-the-art techniques.
References
[1] Nguyen, N., Milanfar, P.: A Computationally Efficient Superresolution Image Reconstruction Algorithm. IEEE Transactions on Image Processing 10, 573–583 (2001)
[2] Chang, H., Yeung, D.Y., Xiong, Y.: Super-resolution through Neighbor Embedding. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 275–282 (2004)
[3] Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based Superresolution. IEEE Computer Graphics and Applications 22, 56–65 (2002)
[4] Li, D., Simske, S., Mersereau, R.M.: Single Image Super-resolution Based on Support Vector Regression. In: IEEE Proceedings of IJCNN, pp. 2898–2901 (2007)
[5] Elad, M., Feuer, A.: Restoration of a Single Superresolution Image from Several Blurred, Noisy and Undersampled Measured Images. IEEE Transactions on Image Processing 6, 1646–1658 (1997)
[6] Irani, M., Peleg, S.: Improving Resolution by Image Registration. CVGIP: Graphical Models and Image Processing 53, 231–239 (1991)
[7] Jiang, Z., Wong, T.T., Bao, H.: Practical Super-resolution from Dynamic Video Sequences. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 549–554 (2003)
[8] Qiu, P.: Image Processing and Jump Regression Analysis. Wiley Series in Probability and Statistics. Wiley, New Jersey (2005)
[9] Takeda, H., Farsiu, S., Milanfar, P.: Kernel Regression for Image Processing and Reconstruction. IEEE Transactions on Image Processing 16, 349–366 (2007)
[10] Super-resolution software: http://lcavwww.epfl.ch/software/superresolution/index.html
[11] Super-resolution dataset: http://www.soe.ucsc.edu/~milanfar/software/sr-datasets.html
[12] Li, D., Mersereau, R.M., Simske, S.: Blind Image Deconvolution through Support Vector Regression. IEEE Transactions on Neural Networks 18, 931–935 (2007)
[13] Super-resolution dataset: http://media.xiph.org/video/derf/
Unification of Multichannel Motion Feature Using Boolean Polynomial

Naoya Ohnishi¹, Atsushi Imiya², and Tomoya Sakai²

¹ School of Science and Technology, Chiba University
² Institute of Media and Information Technology, Chiba University
Yayoicho 1-33, Inage-ku, Chiba, 263-8522, Japan
Abstract. We develop an algorithm to unify features extracted from a multichannel image using the Boolean function. Colour optical flow enables the detection of illumination-robust motion features in the environment. Our algorithm robustly detects free space for robot navigation from a colour video sequence. We experimentally show that colour-optical-flow-based free-space detection is stable against illumination change in an image sequence.
1 Introduction
This paper aims to introduce a unification method for multichannel image features using the Boolean function. Colour decomposition provides a powerful strategy for segmentation of a multichannel image, since a multichannel image captures all the photonic information that forms an image [1]. The idea is widely used as part of the recognition system in mobile robots [2], since segmentation separates the free space for navigation from the obstacles with which collision must be avoided. Photometric invariant features [8] of images and image sequences are important concepts in computer vision for autonomous robots and in car technology, since autonomous robots and unmanned vehicles are required to drive in various illumination environments in both indoor and outdoor workspaces. Optical flow provides fundamental features for motion analysis and motion understanding. The colour optical flow method computes optical flow from a multichannel image sequence under the multichannel optical flow constraint, which assumes that, over a short time duration, the illumination of the image in each channel is locally constant [6]. This assumption is an extension of the classical optical flow constraint to the multichannel case, and it leads to a multichannel version of the KLT tracker [10]. The colour optical flow constraint yields an overdetermined or redundant system of linear equations [6], whereas the usual optical flow constraint for a single-channel image yields a singular linear equation. Therefore, the colour optical flow constraint provides a simple method to compute optical flow without using either regularisation [3] or multiresolution analysis [4]. The other way to use a multichannel image is to unify the features computed on each channel. Barron and Klette [9] experimentally examined combinations of channels for the accurate computation of optical flow using the Golland and Bruckstein method
[6]. Barron and Klette concluded that the Y-channel image has advantages for accurate and robust computation. Mileva et al. [13] showed that the H-channel image has advantages for robust optical flow computation in illumination-changing environments. Andrews and Lovell [7] examined combinations of colour models and optical-flow computation algorithms. In ref. [8], van de Weijer and Gevers examined photometric invariants for optical flow computation. These references were devoted to accurate and robust optical flow computation from multichannel images. For the computation of photometric invariant optical flow, the performance of both single-channel and multichannel methods was compared [9,8]. They experimentally showed that the brightness component and the colour components affect the detection of optical flow vectors differently. Since image analysis methodologies have mostly been studied for single-channel gray-value images, there are many established methods for the analysis of a gray-value image. If we use these classical methods for gray-valued images, the unification of the features extracted in each channel becomes a fundamental task. We propose a unification method for multichannel image features using a Boolean polynomial. Binary mathematical morphology allows us to describe binary image processing operations as set-theoretic operations. These operations are computationally achieved by pointwise logic operations on the imaging plane. Such a pointwise logic operation is expressed by a Boolean polynomial, which is an algebraic expression of the disjunctive normal form of a logic operation. This algebraic nature of the Boolean operations on a plane provides a method to unify the features extracted from the image in the individual channels. As a Boolean feature of an image, we deal with the dominant-plane-based free space for robot navigation. In Section 2, we introduce the Boolean polynomial and feature unification by a Boolean polynomial. In Section 3, the dominant plane as the free space for navigation is briefly summarised. Then, numerical examples verify the validity of the feature unification method based on the Boolean operations.
2 Boolean Unification of Multichannel Features
Setting x, y, and z to be Boolean values, the Boolean function of three arguments is described as a polynomial in x, y and z, such that
$$f_3(x, y, z) = a_{100}x + a_{010}y + a_{001}z + a_{110}xy + a_{011}yz + a_{101}zx + a_{111}xyz \qquad (1)$$
for $a_{ijk} \in \{0, 1\}$, where + and · are the Boolean summation and multiplication. In eq. (1), the elements of the set {x, y, z, xy, yz, zx, xyz} are called the Boolean monomials. As combinations of the monomials, we can construct appropriate Boolean functions for the unification of information. Using these properties of the Boolean function of three arguments, we unify the features separately extracted from three images. A colour image is expressed as a triplet of gray-valued images as
$$\boldsymbol{f} = (f^{\alpha}, f^{\beta}, f^{\gamma})^{\top}, \qquad (2)$$
where α, β, and γ are the indices identifying the channels. Setting $D^{*}(\boldsymbol{x})$, $* \in \{\alpha, \beta, \gamma\}$, to be the Boolean functions such that
$$D^{*}(\boldsymbol{x}) = \begin{cases} 1, & \text{if } \boldsymbol{x} \text{ is a point with a feature extracted by } f^{*},\ * \in \{\alpha, \beta, \gamma\} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
we define the Boolean function from $\mathbf{R}^{2} \times \mathbf{R}^{2} \times \mathbf{R}^{2}$ to $\mathbf{R}^{2}$ as
$$D(\boldsymbol{x}) = f(D^{\alpha}(\boldsymbol{x}), D^{\beta}(\boldsymbol{x}), D^{\gamma}(\boldsymbol{x})), \qquad (4)$$
where the Boolean operations are computed pointwise on $\mathbf{R}^{2}$. For instance, if only $a_{100}$, $a_{010}$ and $a_{001}$ are one, we have
$$D(\boldsymbol{x}) = D^{\alpha}(\boldsymbol{x}) \cup D^{\beta}(\boldsymbol{x}) \cup D^{\gamma}(\boldsymbol{x}). \qquad (5)$$
Furthermore, if only $a_{111}$ is one, we have
$$D(\boldsymbol{x}) = D^{\alpha}(\boldsymbol{x}) \cap D^{\beta}(\boldsymbol{x}) \cap D^{\gamma}(\boldsymbol{x}). \qquad (6)$$
The functions defined by eqs. (5) and (6) give, respectively, the maximum and the common area of the dominant planes computed from f^α, f^β, and f^γ.
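A pointwise evaluation of the Boolean polynomial (1) on binary feature maps can be sketched as follows; the coefficient encoding used here is an illustrative choice, not the authors' implementation.

```python
import numpy as np

def boolean_unify(d_alpha, d_beta, d_gamma, coeffs):
    """Evaluate the three-argument Boolean polynomial of eq. (1) pointwise on
    three boolean arrays of equal shape.

    coeffs maps the monomials 'x','y','z','xy','yz','zx','xyz' to 0/1
    (a100, a010, a001, a110, a011, a101, a111 in the text)."""
    x, y, z = d_alpha, d_beta, d_gamma
    monomials = {
        'x': x, 'y': y, 'z': z,
        'xy': x & y, 'yz': y & z, 'zx': z & x,
        'xyz': x & y & z,
    }
    out = np.zeros_like(x, dtype=bool)
    for name, term in monomials.items():
        if coeffs.get(name, 0):
            out |= term            # Boolean summation = pointwise OR
    return out

# a100 = a010 = a001 = 1 yields eq. (5), i.e. the union of the three maps:
# union = boolean_unify(Da, Db, Dg, {'x': 1, 'y': 1, 'z': 1})
```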
3 Dominant Planes on Channels
For a colour image $\boldsymbol{f}$, the colour optical flow constraint is
$$\frac{d}{dt}\boldsymbol{f} = \begin{pmatrix} \frac{d}{dt} f^{\alpha} \\ \frac{d}{dt} f^{\beta} \\ \frac{d}{dt} f^{\gamma} \end{pmatrix} = J\boldsymbol{u} + \frac{\partial}{\partial t}\boldsymbol{f} = 0, \qquad J = \begin{pmatrix} f_x^{\alpha} & f_y^{\alpha} \\ f_x^{\beta} & f_y^{\beta} \\ f_x^{\gamma} & f_y^{\gamma} \end{pmatrix}, \qquad (7)$$
for $\boldsymbol{u} = (\dot{x}, \dot{y})^{\top}$ and the Jacobian $J$. Therefore, on each channel, we have the relations
$$f_x^{*}\dot{x}^{*} + f_y^{*}\dot{y}^{*} + f_t^{*} = 0, \qquad * \in \{\alpha, \beta, \gamma\}. \qquad (8)$$
We compute optical flow for each channel, using the Lucas-Kanade method [5] with pyramids for each channel image [4]. In ref. [12], Ohnishi and Imiya developed a featureless robot navigation method based on a planar area and an optical flow field computed from a pair of successive images. This method also yields local obstacle maps by detecting the dominant plane in images. We accept the following four assumptions:
1. The ground plane is the planar area.
2. The camera mounted on the mobile robot is downward-looking.
3. The robot observes the world using the camera mounted on itself for navigation.
4. The camera on the robot captures a sequence of images while the robot is moving.
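Returning to eq. (7): at each pixel it is an overdetermined 3×2 linear system in the two flow components. The least-squares sketch below is only an illustration of that constraint (the authors compute the flow per channel with the pyramidal Lucas-Kanade method summarised in the Appendix); the gradient values in the example are made up.

```python
import numpy as np

def colour_flow_at_pixel(grads):
    """Least-squares solution of the colour optical flow constraint (7) at one
    pixel.  grads is a list of (fx, fy, ft) triples, one per channel."""
    J = np.array([[fx, fy] for fx, fy, _ in grads], dtype=float)   # 3 x 2 Jacobian
    b = -np.array([ft for _, _, ft in grads], dtype=float)         # right-hand side
    u, *_ = np.linalg.lstsq(J, b, rcond=None)
    return u   # (x_dot, y_dot)

# Hypothetical per-channel gradients (alpha, beta, gamma):
print(colour_flow_at_pixel([(0.8, 0.1, -0.5), (0.7, 0.2, -0.45), (0.9, 0.05, -0.55)]))
```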
Fig. 1. Dominant plane and visual navigation. Using the dominant plane, the robot detects the free space for navigation.
Figure 1 shows the robot navigation strategy using the dominant plane. The robot decides the navigation direction using the dominant plane detected from a sequence of images captured by the visual system mounted on the robot. Assuming that the camera displacement is small, the corresponding points $\boldsymbol{x} = (x, y)^{\top}$ and $\boldsymbol{x}' = (x', y')^{\top}$ on the same plane between a pair of successive images are connected by an affine transform expressed as $\boldsymbol{x}' = A^{*}\boldsymbol{x} + \boldsymbol{b}^{*}$, where $A^{*}$ and $\boldsymbol{b}^{*}$ are a 2 × 2 affine-coefficient matrix and a 2-dimensional vector for $* \in \{\alpha, \beta, \gamma\}$. We can estimate the affine coefficients using the RANSAC-based algorithm [11]. Using the estimated affine coefficients, we can estimate the optical flow on a plane, $\hat{\boldsymbol{x}}^{*} = (\hat{x}^{*}, \hat{y}^{*})^{\top}$, $\hat{\boldsymbol{x}}^{*} = A^{*}\boldsymbol{x} + \boldsymbol{b}^{*} - \boldsymbol{x}$, for all points $\boldsymbol{x}$ in the image. We call $\hat{\boldsymbol{x}}^{*}$ the planar flow, and $\hat{\boldsymbol{x}}^{*}(x, y, t)$, which is the set of planar flow vectors computed for all pixels in an image, the planar flow field at time t.
If there are no obstacles around the robot, the ground plane corresponds to the dominant plane in the image observed through the camera mounted on the mobile robot. If an obstacle exists in front of the robot, the planar flow on the image plane differs from the optical flow on the image plane. Since the planar flow vector $\hat{\boldsymbol{x}}^{*}$ is equal to the optical flow vector $\dot{\boldsymbol{x}}^{*}$ on a plane, we use the difference between these two flows to detect the dominant plane. We set ε to be the tolerance of the difference between the optical flow vector and the planar flow vector. Therefore, if the inequality
$$|\dot{\boldsymbol{x}}^{*} - \hat{\boldsymbol{x}}^{*}| < \varepsilon, \quad \text{for } \hat{\boldsymbol{x}}^{*} = A^{*}\boldsymbol{x} + \boldsymbol{b}^{*} - \boldsymbol{x}, \quad \nabla f^{*\top}\dot{\boldsymbol{x}}^{*} + \partial_t f^{*} = 0 \qquad (9)$$
is satisfied, we accept the point $\boldsymbol{x}$ as a point on the dominant plane [12] for each channel. Our algorithm is summarised as follows:
1. Compute the following procedure for $* \in \{\alpha, \beta, \gamma\}$:
(a) Compute the optical flow field $\boldsymbol{u}^{*}(x, y, t)$ from two successive images.
(b) Compute the affine coefficients of the transform $A^{*}\boldsymbol{x} + \boldsymbol{b}^{*}$ by random selection of three points.
(c) Estimate the planar flow field $\hat{\boldsymbol{u}}^{*}(x, y, t)$ from the affine coefficients.
(d) Match the computed optical flow field $\boldsymbol{u}^{*}(x, y, t)$ and the estimated planar flow field $\hat{\boldsymbol{u}}^{*}(x, y, t)$ using eq. (9).
(e) Assign the points with $|\dot{\boldsymbol{x}}^{*} - \hat{\boldsymbol{x}}^{*}| < \varepsilon$ to the dominant plane. If the dominant plane occupies less than half the image, then return to step (b).
(f) Output the dominant plane $d^{*}(x, y, t)$ as a binary image.
2. Unify $D^{\alpha}(\boldsymbol{x})$, $D^{\beta}(\boldsymbol{x})$ and $D^{\gamma}(\boldsymbol{x})$ using a Boolean polynomial.
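An illustrative re-implementation of steps (b)-(e) for one channel, taking a dense flow field as input, might look like the following; the tolerance, the iteration count and the dense-flow input are assumptions, since the authors track features with the Lucas-Kanade method and use the RANSAC formulation of [11].

```python
import numpy as np

def dominant_plane_mask(flow, eps=0.5, n_iters=200, rng=None):
    """RANSAC-style sketch of steps (b)-(e): fit an affine motion model to a
    dense flow field (H x W x 2, ordered as (u_x, u_y)) and mark pixels whose
    flow agrees with the fitted planar flow within eps."""
    rng = rng or np.random.default_rng(0)
    H, W, _ = flow.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)   # (x, y) per pixel
    disp = flow.reshape(-1, 2)                                       # observed flow
    dst = pts + disp                                                 # x' = x + u(x)
    A_design = np.hstack([pts, np.ones((pts.shape[0], 1))])          # rows [x y 1]
    best_mask, best_count = np.zeros(pts.shape[0], dtype=bool), -1
    for _ in range(n_iters):
        idx = rng.choice(pts.shape[0], size=3, replace=False)
        # Solve [x y 1] M = x' for the 3x2 affine parameter matrix M.
        M, *_ = np.linalg.lstsq(A_design[idx], dst[idx], rcond=None)
        planar = A_design @ M - pts                                  # planar flow x_hat
        err = np.linalg.norm(disp - planar, axis=1)
        mask = err < eps
        if mask.sum() > best_count:
            best_count, best_mask = mask.sum(), mask
    return best_mask.reshape(H, W)                                   # dominant plane d*
```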
4 Computational Results
We computed the following dominant planes:
1. The dominant plane D of the grey-valued image f.
2. The dominant planes D*, where * ∈ {H, S, V}.
3. The monomials DH, DS and DV.
4. Set-theoretic combinations of DH, DS and DV.
Fig. 2. Colour image of backyard. Image f^RGB = (f^R, f^G, f^B)^T, which is captured by an RGB camera. The image is from the Middlebury optical flow test sequence.

Table 1. Hamming distance between the dominant plane D of the grey-valued image and the dominant plane D* of the colour image. The dominant plane of the colour image D* is computed as D* = D^i α f_2(D^j, D^k), where the operation α ∈ {∩, ∪} and f_2(x, y) is an appropriate bi-argument Boolean function.
E [%]   DH ⊕ D: 14.5                    DS ⊕ D: 14.4                    DV ⊕ D: 5.1
E [%]   (DH ∪ DS) ⊕ D: 10.2             (DS ∪ DV) ⊕ D: 6.4              (DV ∪ DH) ⊕ D: 5.8
E [%]   (DH ∩ DS) ⊕ D: 18.7             (DS ∩ DV) ⊕ D: 13.1             (DV ∩ DH) ⊕ D: 13.8
E [%]   ((DH ∪ DS) ∩ DV) ⊕ D: 8.7       ((DS ∪ DV) ∩ DH) ⊕ D: 14.2      ((DV ∪ DH) ∩ DS) ⊕ D: 13.6
E [%]   ((DH ∩ DS) ∪ DV) ⊕ D: 5.5       ((DS ∩ DV) ∪ DH) ⊕ D: 9.4       ((DV ∩ DH) ∪ DS) ⊕ D: 10.0
E [%]   (DH ∪ DS ∪ DV) ⊕ D: 6.7         (DH ∩ DS ∩ DV) ⊕ D: 18.2
Fig. 3. Dominant planes of an image. From top to bottom, an image, optical flow fields and dominant planes for the grey-valued image, H-image, S-image, and V-image, respectively.
[Fig. 4 panels: (a) DV; (b) DS ∪ DV; (c) DV ∪ DH; (d) DS ⊕ D; (e) (DS ∪ DV) ⊕ D; (f) (DV ∪ DH) ⊕ D; (g) DH ∪ DS ∪ DV; (h) (DH ∩ DS) ∪ DV; (i) (DH ∪ DS) ∩ DV; (j) (DH ∪ DS ∪ DV) ⊕ D; (k) ((DH ∩ DS) ∪ DV) ⊕ D; (l) ((DH ∪ DS) ∩ DV) ⊕ D]
Fig. 4. Combination of three channels. These six combinations of the three HSV channels yield dominant planes with small differences from the dominant plane detected from the grey-valued image. For each combination, D* and D* ⊕ D are presented.
For the evaluation of the performance, we compute
$$E = \frac{|D^{*} \oplus D|}{|D|}, \qquad A \oplus B = (A \cap \bar{B}) \cup (\bar{A} \cap B), \qquad (10)$$
where |A| is the area of the point set A on the plane $\mathbf{R}^{2}$. The operation $A \oplus B$ for point sets A and B in the Euclidean space $\mathbf{R}^{n}$ is called the symmetric difference, and it computes the Hamming distance between the two point sets. Figure 2 shows a colour image from the sequence "Backyard" of the Middlebury optical flow test set. Figure 3 shows an image in the sequence, the optical flow fields, and the dominant planes, respectively. In the dominant planes, white and black pixels stand for points on the dominant plane and obstacles, respectively. Table 1 shows the results of E for various combinations of dominant planes. From these results, we have the next assertion.
Assertion 1. For DV, (DS ∪ DV), (DV ∪ DH), (DH ∪ DS ∪ DV), ((DH ∪ DS) ∩ DV), and ((DH ∩ DS) ∪ DV), the error measure E is small.
Figure 4 shows the dominant planes for these combinations. Our results, achieved with a real image sequence examined for the HSV colour model, show that the V-channel image is essential for the robust detection of dominant planes from multichannel images. Since the HSV colour model separates an image into the brightness channel V and the colour channels H and S, the dominant planes DH ∩ DV and DS ∩ DV are computed from both brightness and colour information. Therefore, we can adopt
$$D = (D^{H} \cap D^{V}) \cup (D^{S} \cap D^{V}) = D^{V} \cap (D^{H} \cup D^{S}). \qquad (11)$$
Furthermore, we can adopt the summation (union) of the brightness and the two colour components H and S,
$$D = (D^{H} \cup D^{V}) \cup (D^{S} \cup D^{V}) = D^{H} \cup D^{S} \cup D^{V} = D^{V} \cup (D^{H} \cup D^{S}), \qquad (12)$$
as the dominant plane of colour image sequences. Moreover, based on [13], we also adopt the common region DH ∩ DS of the two colour components H and S, and then combine the colour channels and the brightness channel as
$$D = (D^{S} \cap D^{H}) \cup D^{V}. \qquad (13)$$
In all cases, colour information of the H-channel and the S-channel is merged to extract the dominant plane.
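The error measure of eq. (10) is straightforward to compute on binary masks; a minimal NumPy sketch, reporting E in percent as in Table 1, is:

```python
import numpy as np

def hamming_error(d_star, d):
    """E of eq. (10): area of the symmetric difference between a combined
    dominant plane d_star and the grey-value dominant plane d, normalised by
    the area of d.  Inputs are boolean masks of equal shape."""
    sym_diff = np.logical_xor(d_star, d)          # A xor B == symmetric difference
    return 100.0 * sym_diff.sum() / max(d.sum(), 1)
```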
5 Conclusions
We introduced a method to unify geometric features extracted from the gray-valued images obtained by colour channel decomposition, using the Boolean operations. We applied this unification procedure to detect the dominant plane from a colour image sequence. There are many colour models; therefore, as future work, the combination of colour models is an important problem for robust dominant-plane detection in real environments.
References
1. Angulo, J., Serra, J.: Color segmentation by ordered mergings. In: Proc. ICIP 2003, vol. 2, pp. 125–128 (2003)
2. Batavia, P.H., Singh, S.: Obstacle detection using adaptive color segmentation and color stereo homography. In: ICRA 2001, pp. 705–710 (2001)
3. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–204 (1981)
4. Bouguet, J.-Y.: Pyramidal implementation of the Lucas Kanade feature tracker: description of the algorithm. Intel Corporation, Microprocessor Research Labs, OpenCV Documents (1999)
5. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artificial Intelligence, pp. 674–679 (1981)
6. Golland, P., Bruckstein, A.M.: Motion from color. CVIU 68, 346–362 (1997)
7. Andrews, R.J., Lovell, B.C.: Color optical flow. In: Proc. Workshop on Digital Image Computing, pp. 135–139 (2003)
8. van de Weijer, J., Gevers, Th.: Robust optical flow from photometric invariants. In: Proc. ICIP, pp. 1835–1838 (2004)
9. Barron, J.L., Klette, R.: Quantitative color optical flow. In: Proceedings of 16th ICPR, vol. 4, pp. 251–255 (2002)
10. Heigl, B., Paulus, D., Niemann, H.: Tracking points in sequences of color images. In: Proceedings 5th German-Russian Workshop on Pattern Analysis, pp. 70–77 (1998)
11. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM 24, 381–395 (1981)
12. Ohnishi, N., Imiya, A.: Featureless robot navigation using optical flow. Connection Science 17, 23–46 (2005)
13. Mileva, Y., Bruhn, A., Weickert, J.: Illumination-robust variational optical flow with photometric invariants. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 152–162. Springer, Heidelberg (2007)
Appendix
Single Channel LKP Optical Flow Computation
Setting $f^{0}(x, y, t) = f(x, y, t)$ to be the original image and $f^{l}(x, y, t)$ the pyramid transformation of the image $f(x, y, t)$ on the $l$th level, the pyramid representation is expressed as
$$f^{l+1}(x, y, t) = \sum_{i,j=-1}^{1} a_{ij}\, f^{l}(2x - i,\, 2y - j,\, t),$$
where $a_{\pm i \pm j} = \frac{1}{4}\left(1 - \frac{1}{2}|i|\right)\left(1 - \frac{1}{2}|j|\right)$ for $|i| \le 1$, $|j| \le 1$ and $l \ge 0$. The optical flow equation $f_x u + f_y v + f_t = 0$ is solved by assuming that the optical flow vector is constant in the neighbourhood of each pixel. We set the window size to be 5 × 5. Then, we have a system of linear equations,
$$f^{l}_{x(\alpha\beta)} u^{l} + f^{l}_{y(\alpha\beta)} v^{l} + f^{l}_{t(\alpha\beta)} = 0, \qquad |\alpha| \le 2,\ |\beta| \le 2,$$
where $f^{l}_{x(\alpha\beta)} = f^{l}_x(x + \alpha, y + \beta, t)$, $f^{l}_{y(\alpha\beta)} = f^{l}_y(x + \alpha, y + \beta, t)$, and $f^{l}_{t(\alpha\beta)} = f^{l}_t(x + \alpha, y + \beta, t)$. Therefore, if the vectors
$$A^{l}_x = (f^{l}_{x(-2\,-2)}, f^{l}_{x(-2\,-1)}, \cdots, f^{l}_{x(2\,2)})^{\top}, \qquad A^{l}_y = (f^{l}_{y(-2\,-2)}, f^{l}_{y(-2\,-1)}, \cdots, f^{l}_{y(2\,2)})^{\top}$$
are independent, we have a unique solution $\boldsymbol{u}^{l}(x, y, t) = (u^{l}(x, y), v^{l}(x, y))^{\top}$ for the centre point $(i, j)$ of the 5 × 5 window on each layer as
$$\boldsymbol{u}^{l}(i, j, t) = ((A^{l})^{\top} A^{l})^{-1} (A^{l})^{\top} \boldsymbol{h}^{l}(i, j, t)$$
for $A^{l} = (A^{l}_x, A^{l}_y)$ and
$$\boldsymbol{h}^{l}(i, j, t) = (-f^{l}_{t(-2\,-2)}, -f^{l}_{t(-2\,-1)}, \cdots, -f^{l}_{t(2\,2)})^{\top}.$$
Using these mathematical definitions, Algorithm 1 [4] computes the optical flow.

Algorithm 1. Optical Flow Computation by the Lucas-Kanade Method Using Pyramids
Data: f(t)^l, f(t+1)^l, 0 ≤ l ≤ the maximum of the layers
Result: u_t^l
l := the maximum of the layers;
while l ≠ 0 do
    u_t^l := u(f(t)^l, f(t+1)^l);
    f_{t+1}^{l-1} := w(f(t+1)^{l-1}, u_t^l);
    l := l − 1;
end
In this algorithm, $f^{l}_{t}$ stands for the pyramidal representation on the $l$th level of the image $f(x, y, t)$ at time $t$. We call $\boldsymbol{u}(x, y, t)$, which is the set of optical flow vectors $\boldsymbol{u}$ computed for all pixels in an image, the optical flow field at time $t$. Furthermore, the warping is computed as $w(f, \boldsymbol{u}) = f(\boldsymbol{x} + \boldsymbol{u})$, and the upsampling-and-interpolation operation $\boldsymbol{u}^{l} = (E(u^{l+1}), E(v^{l+1}))^{\top}$, which derives the function defined on the $l$th layer from the $(l+1)$th layer in the pyramid-transform-based hierarchical structure, is computed for each element as
$$E(u^{l}) = 4 \sum_{\alpha}\sum_{\beta} a_{\alpha\beta}\, u^{l+1}_{\frac{m-\alpha}{2}\,\frac{n-\beta}{2}}, \qquad E(v^{l}) = 4 \sum_{\alpha}\sum_{\beta} a_{\alpha\beta}\, v^{l+1}_{\frac{m-\alpha}{2}\,\frac{n-\beta}{2}},$$
where the summations are computed only if both $\frac{m-\alpha}{2}$ and $\frac{n-\beta}{2}$ are integers.
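As a practical counterpart to Algorithm 1, the sketch below uses OpenCV's pyramidal Lucas-Kanade tracker (which implements the scheme of [4]) instead of re-coding the layer-by-layer recursion; the feature-detection parameters are illustrative choices.

```python
import cv2
import numpy as np

def track_features_pyrlk(prev_gray, next_gray, max_corners=400):
    """Coarse-to-fine Lucas-Kanade tracking with a 5x5 window, as in the
    appendix, via OpenCV's pyramidal implementation."""
    pts = cv2.goodFeaturesToTrack(prev_gray, max_corners, 0.01, 7)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, pts, None,
        winSize=(5, 5), maxLevel=3,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    ok = status.ravel() == 1
    return pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)
```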
Rooftop Detection and 3D Building Modeling from Aerial Images

Fanhuai Shi¹,², Yongjian Xi¹, Xiaoling Li¹, and Ye Duan¹

¹ Computer Science Department, University of Missouri-Columbia, USA
² Welding Engineering Institute, Shanghai Jiao Tong University, China
Abstract. This paper presents a new procedure for rooftop detection and 3D building modeling from aerial images. After an over-segmentation of the aerial image, the rooftop regions are coarsely detected by employing multi-scale SIFT-like features and visual object recognition. In order to refine the detected result and remove the non-rooftop regions, we further exploit the 3D information of the rooftop obtained by 3D reconstruction. We employ a hierarchical strategy to obtain the corner correspondences between images based on an asymmetry correlation corner matching. We determine whether a candidate region is a rooftop or not according to its height relative to the ground plane. Finally, the 3D building model with texture mapping based on one of the images is given. Experimental results are shown on real aerial scenes.
1 Introduction

During the past two decades, automatic approaches for creating 3D geometric models of buildings from aerial images have received a lot of attention in the photogrammetry and computer vision community (see [11] for a survey of earlier work). Most of the existing approaches either compute a dense Digital Elevation Map (DEM) from high-resolution images [8], or directly use optical and range images [11]. In order to generate a hypothesis on the presence of a building rooftop in the scene, existing techniques generally start by first detecting low-level image primitives, e.g., edges, lines or junctions, and then grouping these primitives using either geometric model based heuristics or a statistical model [11]. Recently, Baillard et al. [1] developed a method for automatically computing a piecewise planar reconstruction based on matched lines and then generating near-complete roof reconstructions from multiple images. However, the algorithm of [1] depends strongly on accurate 3D reconstruction and thus places very high requirements on the image resolution and the imaging equipment. In a nutshell, existing aerial image based building extraction algorithms differ in their detection, segmentation and reconstruction approaches. Among these steps, rooftop detection is a key and very difficult one, as rooftops in aerial images have a wide range of variable shape features, which is significantly different from ground-based images. Along with the recent advancement in visual object recognition, more and more machine learning technologies have been explored to improve building detection or rooftop detection in aerial images [10][14], etc. These works fully utilize the 2D information of the aerial images and greatly improve the robustness of building detection.
In this paper, we combine both 2D visual object recognition and 3D visual computing for rooftop detection and 3D building modeling. The contribution of this work is twofold: one is a novel rooftop region detection algorithm based on visual object recognition. The other is a method of rooftop refinement and building modeling by 3D visual computing, wherein an asymmetry correlation corner matching is used.
2 Rooftop Detection by Visual Object Recognition

In this section, we propose an original approach for rooftop detection from aerial images. The rooftop of a building is treated as an independent object in the image. Our main contribution is to propose original features that characterize the rooftop of a building. As a first step of our algorithm, one starts with an initial over-segmentation by partitioning the aerial image into multiple homogeneous regions. Here we choose to employ Mean Shift [3] as the image segmentation algorithm; see Fig. 1 for an example. It is worth mentioning that our model is not tied to a specific segmentation algorithm. Any method that produces a reasonable over-segmentation of the images would suit our needs.

2.1 Feature Extraction

When experimenting with each segmented region, we extract the features of the corresponding circum-rectangular image block and use color descriptors by first transforming the (R,G,B) image into the normalized (r,g,b) space [4], where r=R/(R+G+B), g=G/(R+G+B), b=B/(R+G+B). Using the normalized chromaticity values in object detection has the advantage of being more insensitive to small changes in illumination that are due to shadows or highlights. Each rectangular image block is then resized to a fixed-size (96 pixels in this paper) square block by bi-cubic interpolation, without preserving its aspect ratio. This partially eliminates the influence of the rooftop shape and size on recognition. Similar to the SIFT descriptor [9], multi-scale orientation histogram features are extracted independently from the r and g channels of the normalized image block and concatenated into one 256*n dimensional descriptor, where n represents the number of scales. That is, we only extract a SIFT-like descriptor from the image block at each scale, with the whole image block as a frame and without dominant direction assignment. This partially eliminates the influence of different rotations of the rooftop on object recognition. When computing the descriptor at each scale, the σ of the difference-of-Gaussian function is set to half the size of the block.

2.2 Classification

To test the utility of the SIFT-like feature for performing the rooftop recognition task, we conducted a standard cross-validated classification procedure on the high-dimensional feature model.
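The colour preprocessing of Sect. 2.1, which produces the blocks that this classification stage operates on, can be sketched as follows; this is an illustrative re-implementation (the RGB channel-order assumption and the resize strategy are ours), not the authors' code.

```python
import cv2
import numpy as np

def normalised_rg_block(rgb_block, size=96):
    """Convert a circum-rectangular region block to normalised (r, g)
    chromaticity and resize it to a fixed 96 x 96 square with bi-cubic
    interpolation, ignoring the aspect ratio (Sec. 2.1).  Assumes the block
    is stored in RGB channel order."""
    rgb = rgb_block.astype(np.float32)
    s = rgb.sum(axis=2) + 1e-6                 # R + G + B, avoid division by zero
    r = rgb[..., 0] / s
    g = rgb[..., 1] / s
    r_rs = cv2.resize(r, (size, size), interpolation=cv2.INTER_CUBIC)
    g_rs = cv2.resize(g, (size, size), interpolation=cv2.INTER_CUBIC)
    # The SIFT-like multi-scale orientation histograms would then be computed
    # on these two channels (not shown here).
    return np.dstack([r_rs, g_rs])
```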
Fig. 1. Sample of an aerial image. (a) original image; (b) a segmented image region; (c) circum-rectangular image of the segmented region (b).
To speed up the computation and improve classification performance, we reduce the dimensionality of the feature output prior to classification. In this paper, we adopt principal components analysis (PCA) to reduce the dimensionality of the feature to 256. Training and test images were carefully separated to ensure proper cross-validation. Groups of 30 training example image blocks and 30 testing example image blocks of each category were drawn from the full image set. In order to eliminate the influence of rooftop-like regions (such as parking lots) on the training model, we remove them from the training set. Feature vectors and PCA eigenvectors were computed from the training images, and the dimensionality-reduced training data were used to train a multi-class support vector machine (SVM) using the Statistical Pattern Recognition Toolbox for Matlab [6]. In this paper, we only need to classify the segmented image regions into rooftop regions and non-rooftop regions. Following training, absolutely no changes to the feature representation or classifier were made. Each feature vector of the test image was transformed using the PCA projection matrix determined from the training images, and the trained SVM was used to report the predicted category of the test image. We detail the experimental results in Section 4.
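The authors use the Matlab STPR toolbox; an equivalent sketch with scikit-learn (an assumption about tooling, not the original implementation) could look like this.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_rooftop_classifier(train_features, train_labels, n_components=256):
    """PCA dimensionality reduction followed by an SVM, mirroring Sec. 2.2.
    Labels: 1 = rooftop region, 0 = non-rooftop region."""
    X = np.asarray(train_features, dtype=float)
    # PCA cannot keep more components than samples or original dimensions.
    dim = min(n_components, X.shape[0], X.shape[1])
    clf = make_pipeline(PCA(n_components=dim), SVC(kernel='rbf', C=1.0))
    clf.fit(X, train_labels)
    return clf

# predicted = train_rooftop_classifier(X_train, y_train).predict(X_test)
```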
3 Rooftop Refinement and 3D Building Modeling

In this section, we utilize the information from 3D reconstruction to remove the non-rooftop regions and construct the 3D building model from image pairs.
Before 3D modeling of the buildings, we need to calibrate the intrinsic/extrinsic parameters of the camera and the ground plane through simple human-computer interaction. The intrinsic parameters were obtained by the popular Camera Calibration Toolbox for Matlab. The camera pose and ground-plane estimation between image pairs were conducted semi-automatically. We manually chose some salient corner correspondences in the image pairs and refined them by finding the sub-pixel Harris corners as in [2], and then estimated the parameters of the camera pose and ground plane by the robust five-point algorithm [12] coupled with random sample consensus (RANSAC) [5]. The flow chart of rooftop refinement and 3D building modeling is illustrated in Fig. 2.
Fig. 2. Flow chart of the rooftop refinement and 3D building modeling
3.1 Dominant Line Contour and Corner Extraction of the Candidate Rooftop Region

As we know, the appearance of a rooftop is usually uniform. Accordingly, the peripheral edges of the rooftop usually lie on or near the boundary of the segmented region; refer to Fig. 1. In order to obtain the main peripheral corner points of the rooftop, we extract the dominant line contour of the segmented region as a reference, wherein the binary segmented area is used as the input image. In this paper, an edge linking and line
segment fitting algorithm [13] is conducted on the smoothed contour of the region. When we obtain all the line edges along the contour, we can select the dominant line contour among them. Finally, the main corner points of the boundary can be calculated by intersecting the neighboring line segments. Fig. 3 illustrates an example of the pipeline. The output corners in Fig. 3(c), denoted as C1, are the coarse extraction of corners from the boundary of the candidate rooftop. Note that C1 are estimated from the boundary of the segmented region, not the real image corners of the candidate rooftop. We can then use C1 as a reference to detect the real image corners of the candidate rooftop. Firstly, for each line segment that connects two neighboring points in C1, we take the neighborhood image. Secondly, we obtain the binary edge image by any popular algorithm and use an edge linking and line segment fitting algorithm [13] to produce a line edge set L. Thirdly, we find the right line segment among L, which should be long and close to the reference line segment; there is a trade-off between 'long' and 'close' that can be tuned by different weighting strategies. Finally, we calculate the real corner points by intersecting the neighboring line segments. See Fig. 3(d) for an illustration of dominant line segments and corner points. In addition, any region for which the dominant contour and corners fail to be extracted is discarded automatically, that is, it is considered a non-rooftop region.

3.2 Corner Matching of the Rooftop by Asymmetry Correlation

In order to compute the 3D information of the rooftop, we need to find the corner correspondences in the corresponding images once we obtain the corner points of the candidate rooftop as in Section 3.1. For aerial images, since the imaging platform (such as an aircraft) generally flies in straight lines and at a relatively stable flying height, the variation of object rotation and scaling between adjacent images is small. Hence we can conduct image matching between image pairs that contain the corresponding rectangular candidate rooftop blocks by normalized cross-correlation [15]. In the case of matching two similar image blocks, the epipolar geometry can be used to reject false matches [7]. By this step, we can accurately obtain the corresponding image area of the candidate rooftop in the other image. Once the rooftop region correspondences are found, we can proceed to obtain the corner correspondences by corner matching based on the region correspondence. Traditional correlation based image matching [15] works well when the neighboring area of the corner varies smoothly between image pairs. However, in real applications there are often sharp variations of image content near the object boundary, which is also called the "covered/uncovered" phenomenon, as in the upper-right image corners illustrated in Fig. 4. The traditional methods fail in this case. In order to solve the "covered/uncovered" problem in corner matching, we propose an asymmetry correlation strategy in this paper. The basic idea of the proposed approach is to use the stable area near the corner for matching, that is, the corner point does not lie at the symmetric center of the matching block. Fig. 4 gives an illustration of this method. In this way, the image window for correlation matching contains as many informative features as possible, while preventing the covered/uncovered area from interfering with the image matching.
Fig. 3. Dominant line contour and main corner extraction of the candidate rooftop region. (a) segmented region; (b) boundary contour of the segmented region; (c) dominant line contour of the segmented region; (d) dominant line edges and corners of the rooftop from real image edges.
Fig. 4. An illustration of the asymmetry correlation strategy for corner matching. (a) is the source image and (b) is the object image. The blue "x" represents a corner point and the red square box around it denotes the corresponding matching window.
For simplicity and efficiency, we define only four types of "L" corner in this paper; see Fig. 5. Accordingly, four types of matching window are designed, respectively. Given an "L" corner point c1 in a candidate rooftop region r1 in image 1, we use a corresponding correlation matching window of size (2n+1)×(2n+1) centered with some eccentricity with respect to this point; see the right column in Fig. 5. We then select a rectangular search area of size (2m+1)×(2m+1) (m>n) around this point in the region correspondence r2 in the second image, and perform a correlation operation on the given window between point c1 in the first image and all possible points c2 lying within the search area in the second image. Note that the coordinates of the correlation matching result should be translated to obtain the real location of the object corner, according to the eccentricity of the matching window. In the case of the corner not belonging
to any one of the four types, a correlation matching window with the same size and centered at this point is used instead. In our test, n=8 for the correlation window, and a translation of 1 or 2 pixels is applied to the corner point away from the closest edge. For the search window, m is set to be about 3 or 4 times n, which is large enough for the matching task.
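A rough sketch of the asymmetry correlation matching follows; the per-type window offsets and the border handling are assumptions, and the search is a brute-force normalized cross-correlation scan rather than an optimized implementation.

```python
import numpy as np

def match_corner_asymmetric(src, dst, corner, corner_type, n=8, m=28):
    """Match an "L" corner from src to dst with an eccentric NCC window.

    corner      : (row, col) integer corner position in src.
    corner_type : one of 'I', 'II', 'III', 'IV' (Fig. 5), or None.
    Assumes the corner lies at least m + n pixels away from the image borders."""
    # Window-centre offsets per corner type (assumed values for illustration).
    offsets = {'I': (n // 2, n // 2), 'II': (n // 2, -n // 2),
               'III': (-n // 2, n // 2), 'IV': (-n // 2, -n // 2)}
    dy, dx = offsets.get(corner_type, (0, 0))
    cy, cx = corner[0] + dy, corner[1] + dx            # eccentric window centre
    tmpl = src[cy - n:cy + n + 1, cx - n:cx + n + 1].astype(float)
    tmpl = (tmpl - tmpl.mean()) / (tmpl.std() + 1e-9)

    best, best_pos = -np.inf, None
    for y in range(corner[0] - m, corner[0] + m + 1):
        for x in range(corner[1] - m, corner[1] + m + 1):
            win = dst[y + dy - n:y + dy + n + 1, x + dx - n:x + dx + n + 1].astype(float)
            if win.shape != tmpl.shape:
                continue                               # window falls outside the image
            win = (win - win.mean()) / (win.std() + 1e-9)
            score = float((tmpl * win).mean())         # normalized cross-correlation
            if score > best:
                best, best_pos = score, (y, x)         # candidate corner position in dst
    return best_pos, best
```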
Fig. 5. Four types of “L” corner and some variation forms. The corresponding matching windows are listed in the right column, where the red “+” denotes the location of the corner.
3.3 3D Reconstruction of the Rooftop and Building

After we obtain the corner correspondences of the candidate rooftop region from an image pair, the 3D coordinates of the image corners in the camera coordinate system can be calculated by the triangulation method [7]. We determine whether a candidate region is a rooftop or not according to the height of the corners relative to the ground plane. Any region that is considered to be a rooftop should satisfy both of the following conditions: (1) the average height of the corners is greater than a threshold value; (2) at least half of the corners are higher than a specific threshold value. Regions that do not satisfy the above two conditions are treated as non-rooftop regions and are eliminated from further consideration. As to the model rendering, because there is not enough information about the building's side planes in the aerial images, we only apply texture mapping to the rooftop of the buildings. To reduce noise, it is recommended to first smooth the height information of the rooftop by averaging. Since our aim is to construct the model of the buildings, the image itself can be used as the texture map of the ground plane.
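The two height conditions translate directly into code; the threshold value below is hypothetical, since the paper does not state it in this excerpt.

```python
import numpy as np

def is_rooftop(corner_heights, h_min=3.0):
    """Decision rule of Sec. 3.3 applied to the triangulated corner heights
    (above the ground plane).  h_min is an assumed threshold."""
    h = np.asarray(corner_heights, dtype=float)
    cond_avg = h.mean() > h_min                              # condition (1)
    cond_half = np.count_nonzero(h > h_min) >= 0.5 * h.size  # condition (2)
    return bool(cond_avg and cond_half)

# Example: is_rooftop([4.2, 5.0, 3.8, 1.5]) -> True
```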
Fig. 6. A brief 3D model of buildings constructed from the scene in Figure 1(a) without texture mapping. The quadrangle composed of points 1-2-3-4 is the ground plane.
Fig. 7. The 3D model of the full scene in Fig. 1(a), with texture mapping based on the second image
4 Experimental Results

We have tested our method on different aerial images taken over the city of Columbia, Missouri, USA from a hot air balloon. Here we report experiments performed on an aerial image set collected at about 20 cm resolution over urban and suburban areas.
In the test of rooftop detection, we conduct experiments on ten groups of training and test samples. The average recognition rate of rooftop regions is 88.8889%, whereas the average recognition rate of non-rooftop regions is 72.8571%. The relatively low recognition rate of non-rooftop regions might be due to the fact that regions such as parking lots may have a very similar appearance or structure to rooftops. Fortunately, these non-rooftop regions can be removed by the subsequent refinement procedure. The proposed method can identify most of the relatively high buildings and model them as a combination of many simpler polygonal buildings. When the camera pose and ground plane are calibrated, we are able to detect and build the complete model from an image of size 1939×1296, like Fig. 1(a), in about 5 minutes on a low-end PC with no user interaction at all. Figure 6 briefly demonstrates the 3D model of the buildings constructed from the scene in Fig. 1(a) without texture mapping. In order to make a comparison, the calibrated ground plane is also drawn in this figure as the quadrangle composed of points 1-2-3-4. Fig. 7 shows the 3D modeling of the full scene of Fig. 1(a), with texture mapping based on the second image. The black strip around the 3D building model is to help make the building more salient in the scene. Note that there is some distortion at the bottom of the ground texture, because this area does not overlap with the first image.
5 Conclusion

In this paper, a complete procedure for automatic rooftop detection and 3D building modeling from aerial images was presented. It combines different components that gradually retrieve the 2D and 3D information necessary to construct a visual building model from aerial images. This procedure has the potential to be developed into a system taking aerial video of urban scenes as input.
Acknowledgments

Research was sponsored by the Leonard Wood Institute in cooperation with the U.S. Army Research Laboratory and was accomplished under Cooperative Agreement # LWI-61031. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Leonard Wood Institute, the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. The first author is also supported in part by the National Natural Science Foundation of China (No. 60805018).
References
[1] Baillard, C., Schmid, C., Zisserman, A., Fitzgibbon, A.: Automatic line matching and 3D reconstruction of buildings from multiple views. In: Proc. of ISPRS Conf. on Automatic Extraction of GIS Objects from Digital Imagery, IAPRS, vol. 32, Part 3-2W5, pp. 69–80
[2] Bouguet, J.: Camera Calibration Toolbox for Matlab (2008), http://www.vision.caltech.edu/bouguetj/calib_doc/
[3] Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach toward Feature Space Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002)
[4] Elgammal, A., Harwood, D., Davis, L.: Non-parametric model for background subtraction. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 751–767. Springer, Heidelberg (2000)
[5] Fischler, M., Bolles, R.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Comm. of the ACM 24(6), 381–395 (1981)
[6] http://cmp.felk.cvut.cz/cmp/software/stprtool/
[7] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
[8] Jaynes, C., Riseman, E., Hanson, A.: Recognition and reconstruction of buildings from multiple aerial images. Computer Vision and Image Understanding 90(1), 68–98 (2003)
[9] Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 2(60), 91–110 (2004)
[10] Maloof, M., Langley, P., Binford, T., Nevatia, R., Sage, S.: Improved Rooftop Detection in Aerial Images with Machine Learning. Machine Learning 53 (2003)
[11] Mayer, H.: Automatic object extraction from aerial imagery-a survey focusing on buildings. Computer Vision and Image Understanding 74(2), 138–149 (1999)
[12] Nistér, D.: An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell. 26(6), 756–770 (2004)
[13] http://www.csse.uwa.edu.au/~pk/Research/MatlabFns/
[14] Porway, J., Wang, K., Yao, B., Zhu, S.: A Hierarchical and Contextual Model for Aerial Image Understanding. In: Proc. of CVPR 2008, Anchorage, Alaska (June 2008)
[15] Zhang, Z., Deriche, R., Faugeras, O., Luong, Q.: A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence 78(1-2), 87–119 (1995)
An Image Registration Approach for Accurate Satellite Attitude Estimation

Alessandro Bevilacqua, Ludovico Carozza, and Alessandro Gherardi

ARCES - Advanced Research Centre on Electronic Systems - University of Bologna
via Toffano 2/2 - 40125 Bologna, Italy
{abevilacqua,lcarozza,agherardi}@arces.unibo.it
Abstract. Satellites are controlled by an autonomous guidance system that corrects their attitude in real time according to information coming from an ensemble of sensors and star trackers. The latter estimate the attitude by continuously comparing acquired images of the sky with a star atlas stored on board. Besides being expensive, star trackers suffer from the problem of Sun and Moon blinding, thus requiring them to work jointly with other sensors. The novel vision based system we are investigating is stand alone and based on an Earth image registration approach, where the attitude is computed by recovering the geometric relation between couples of subsequent frames. This results in a very effective stand alone attitude estimation system. Also, the experiments carried out on images sampled from a satellite image database prove the high accuracy of the image registration approach for attitude estimation, consistent with the application requirements.
1 Introduction
Vision systems technology in the last decades has been broadening its applications to different fields, like industrial inspection, robotics, terrestrial and aerial navigation, only to cite a few. Its employment in navigation and motion control has been increasing for the advantages that this approach can bring. The idea of processing the visual structure of the surrounding environment in order to recover the motion parameters of an object, besides being attractive, permits to collect and exploit in one shot a large amount of information. For these reasons, visual odometry [1] can represent an alternative to typical Inertial Navigation Systems made of ensembles of different sensors (GPS, accelerometers, gyros, etc.), which require subsequent data fusion and are subject to several drawbacks (drift, poor accuracy, etc.). On the other hand, the complexity of the data collected with a vision sensor represents the main challenge of this kind of approach, often requiring the supervision of human operators. Robustness and accuracy are two basic features of autonomous navigation systems, whose priority depends on the type of application. Mission critical applications, like autonomous satellite navigation for remote sensing purposes, require the pose of the object to be estimated with a very high accuracy, to provide effective control of the attitude and guarantee the fulfillment of the tasks for
which it has been designed. Since in such conditions human intervention cannot be employed, robustness is also an important aspect. At the same time, strict constraints must be satisfied with regard to the available power budget and the computational payload (and its compatibility with the hardware on board), just to cite a few of them. At present, star trackers represent the only example of a vision-based approach for satellite autonomous navigation [2]. Despite their increasing importance in the field of attitude and orbit control, due to their accuracy, some concerns are still open, such as the problem of Sun and Moon blinding, which is a limiting factor for small-satellite, low-Earth orbit missions. Besides, the vision sensor takes pictures of part of the celestial sphere that are then matched using a reference database stored on board. In this work, the effectiveness of a novel vision-based approach for recovering the attitude of a mini-satellite spacecraft is investigated. An on board camera integral with the satellite acquires images of the Earth, instead of the stars, during its orbit. A dedicated registration algorithm is proposed to estimate the geometrical transformations between subsequent views and to retrieve the satellite 3-D orientation with a high accuracy. This paper is outlined as follows. In Section 2, other approaches, or systems, utilized to deal with the problem are analyzed. The registration framework is illustrated in detail in Section 3. In particular, the choices made for the registration algorithm are thoroughly discussed, in relation to the application requirements and the geometrical model adopted. Section 4 discusses the results of the experiments performed sampling images from a geo-referenced database. Finally, Section 5 draws conclusions and proposes future developments.
2 Previous Work
Since the early 1980s visual odometry has been an attractive approach to deal with autonomous navigation and guidance, as it can provide sufficient accuracy when other sensors may be temporarily unavailable or even fail [3]. Stemming from these initial studies, nowadays attitude estimation through vision based methods is employed in many fields, from robotics to aerial navigation, using either monocular or stereo cameras [1,4]. In space navigation, star trackers relying on vision-based sensors have been employed to improve attitude estimation accuracy otherwise not achievable through common auxiliary sensors [2]. Feature based trackers are widely employed to recover ego-motion. In particular, the KLT feature tracker [5] is widely used due to both its simplicity and effectiveness. This method requires that selected feature points extracted from an image are located in subsequent partially overlapping images using spatial correlation. After finding the reliable set of tracked features, usually using a robust estimator (e.g. RANSAC [6]), the rotation matrix relating the views can be computed. Due to their low descriptive power, feature points are suitable just to recover relative motion parameters between views and require the starting point to be known in order to provide the absolute orientation. In [7], KLT is used effectively to develop a visual odometer for Unmanned Aerial Vehicle (UAV)
navigation which can run at 20-30 fps. Here, the visual odometry sensor is combined with other sensors and geo-referenced aerial imagery to correct position drift errors. However, the accuracy is not suitable for what is required in satellite applications. Moreover, using a database of geo-referenced images of the ground is very computationally intensive and, in terms of storage requirements, prohibitive for satellite systems. A different approach for vision based attitude estimation can be followed by using feature descriptor matching. Here salient features are computed and matched against subsequent images based on their descriptors, and no correlation is needed [8,9]. Through this approach, no initial motion estimation is required. In fact, these features permit to use initially a database of descriptors with known absolute locations, so as to allow computing the initial attitude and position estimation, without the need of more sensors' data. Nevertheless, feature matching approaches require stable descriptors and may suffer from occlusion problems. In satellite applications, where the image Field of View (FOV) of the Earth can cover a few square kilometers, the presence of wide sea areas or cloud covered regions makes a matching approach based on feature descriptors useless, whereas feature tracking methods could exploit even clouds to work properly. In [10], a combined approach of monocular visual odometry and Simultaneous Localization and Mapping (SLAM) is used to reduce the error drift in position estimation for medium and high altitude flying UAVs. As for the estimated attitude, the authors obtain an accuracy within a few degrees for roll measurements and tenths of a degree for pitch and yaw angles. At present, star trackers are the only vision based attitude sensors employed in satellite missions, where attitude estimation accuracy must fulfill strict requirements (in the order of arcsecs). They recover the satellite attitude by matching visible stars using an on board star field database. In [11], the testing results of an autonomous star tracker in real working conditions have been recently reported. The 3-sigma accuracies along the three axes are less than or equal to [7 7 70] arcsec, respectively, either in nominal operations or in spin mode. In addition, high costs, energy consumption and processing power (which are limited resources) together with Sun and Moon blindness represent the main open issues, especially for mini/micro satellite missions.
3 The Registration Framework
The goal of our registration algorithm is to provide an autonomous visual navigation system for the three-axis stabilization of mini-satellites. In remote sensing applications, imaging sensors are mounted on board and nadir-pointing, that is, always looking at the Earth during the orbit. The Earth images viewed at different attitudes can be exploited to recover the spacecraft orientation with respect to a local orbital reference frame. An accuracy of the order of 1 arcsec is required in order to guarantee the correct stabilization of the satellite during its functioning. By registering two terrestrial images acquired at different epochs along the orbit, our system aims to estimate
the current absolute satellite orientation, once the pose (i.e., the position along the orbit and the attitude) at the previous epoch is given.

3.1 The Model
From a general point of view, a calibrated fully perspective camera, with intrinsic parameter matrix K, free to follow rigid motions, maps corresponding points of the scene onto the image reference frames at two different epochs t and t' according to the known relation (Eq. 1, [12]):
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = K R' R^{-1} K^{-1} \left( \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} - \frac{K R (C' - C)}{Z_{cam}} \right) \qquad (1)$$
In this relation, $[x'\ y'\ 1]^{T}$ and $[x\ y\ 1]^{T}$ represent the homogeneous coordinates of the corresponding points at the two different epochs, respectively. The symbols C and C' take into account the 3-D translation of the camera optical centre along the orbital trajectory. The 3-D structure of the scene is taken into account by the term $Z_{cam}$, that is, the distance of each scene point from the image plane, referred to the first epoch. This is the same as computing the component along the principal axis of each of the projection rays from the scene points to the image plane. The Earth is modeled as a sphere of known radius ($R_E$ = 6371 km), so that given the satellite position and orientation it is possible to compute $Z_{cam}$ for every pixel of the camera sensor's plane. R and R' are the rotation matrices corresponding to the two different spatial orientations of the satellite in the two epochs, respectively. The homogeneous Eq. 2:
$$H = \lambda\, K R' R^{-1} K^{-1}, \qquad \lambda \neq 0 \qquad (2)$$
links the plane homography H and the attitude variation $R' R^{-1}$ between the two epochs. Once corresponding points on the two image frames are matched, the matrix H can be computed and the relative attitude variation estimated. λ is a proportionality factor that can be computed once H has been estimated from the point correspondences, since it can be easily proved that $\det(H) = \lambda^{3}$. In a preliminary study [13], this geometric model has been proved to be consistent, in terms of accuracy, with the physical model adopted to describe the problem.
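Given an estimate of H and the calibration matrix K, eq. (2) can be inverted for the attitude variation; the SVD re-orthogonalization step in the sketch below is an added safeguard against estimation noise, not something stated in the paper.

```python
import numpy as np

def relative_rotation_from_homography(H, K):
    """Recover the attitude variation R' R^{-1} from eq. (2):
    H = lambda * K R' R^{-1} K^{-1}, with lambda^3 = det(H)."""
    lam = np.cbrt(np.linalg.det(H))
    R_rel = np.linalg.inv(K) @ (H / lam) @ K
    # Project onto the closest rotation matrix (Frobenius sense).
    U, _, Vt = np.linalg.svd(R_rel)
    R = U @ Vt
    if np.linalg.det(R) < 0:
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R
```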
3.2 The Registration Algorithm
We tailored the image registration algorithm in order to find accurate and robust correspondences on satellite imagery while preserving scene generality, because of the variety assumed by the terrestrial landscape. The presence of native structured patterns (like mountains, roads, etc.) cannot be exploited, since they are not always present in the scene, besides undergoing the effects outlined in Section 2. Since the attitude perturbations between consecutive epochs involved in realistic orbits are not too severe, strong invariance to perspective distortions is
not a requirement. Accordingly, robust local descriptors like SURF [9], which privilege robustness in "distorted" scenes over accuracy, have not been considered. Moreover, such region descriptors require a further layer of "refinement" to provide a punctual sub-pixel location. On the other hand, sparse punctual features can be extracted directly from the natural textured patterns in Earth imagery with a low computational payload. In addition, it is possible to use high gradient points, which generally present good robustness to variations in image geometry and illumination, since they do not work directly on image intensity. Accordingly, due to both computational reasons and robustness to occlusions, sparse motion field measurements have been preferred. We have chosen the Shi and Tomasi feature point extractor [14], jointly with their best tracker (i.e. the KLT tracker) designed on purpose. In detail, corner points [15] are extracted from the first image of each couple. To reach the desired sub-pixel accuracy, the registration algorithm we conceived works by following a coarse-to-fine strategy. The global 2-D translational components of the image motion field (Δx, Δy) are mostly due to the movement of the satellite along the orbit, and they are estimated with pixel accuracy using the phase correlation algorithm [16], over decimated images to speed up the process. This estimation is used as a bootstrapping phase to feed the Lucas-Kanade tracker in its pyramidal implementation [17], in order to measure the residual local motion field vector at sub-pixel accuracy for each feature. In this way, even "large" displacements can be handled with sub-pixel accuracy. After the correction with the parallax term (see Eq. 1), using the DLT algorithm [12] jointly with the RANSAC [6] outlier rejection method, a robust estimation of H is achieved. It is worth remarking that this registration algorithm does not need any model of terrestrial landmarks to be learned in advance, and it can cope with general scene patterns.
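A compact sketch of this coarse-to-fine chain using OpenCV primitives (phase correlation, pyramidal KLT, RANSAC homography) is given below; the decimation factor, window sizes and thresholds are assumptions, the sign convention of the phase-correlation shift should be verified, and the parallax correction of Eq. 1 is omitted.

```python
import cv2
import numpy as np

def estimate_homography(gray1, gray2, n_features=500):
    """Coarse-to-fine registration sketch between two 8-bit grayscale views."""
    # Coarse global translation from phase correlation on decimated images.
    s1 = cv2.resize(gray1, None, fx=0.25, fy=0.25).astype(np.float32)
    s2 = cv2.resize(gray2, None, fx=0.25, fy=0.25).astype(np.float32)
    (dx, dy), _ = cv2.phaseCorrelate(s1, s2)
    shift = np.float32([4.0 * dx, 4.0 * dy])

    # Sub-pixel refinement with the pyramidal KLT tracker, bootstrapped by the shift.
    pts = cv2.goodFeaturesToTrack(gray1, n_features, 0.01, 10)
    guess = (pts + shift.reshape(1, 1, 2)).astype(np.float32)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(
        gray1, gray2, pts, guess, flags=cv2.OPTFLOW_USE_INITIAL_FLOW,
        winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1

    # Robust homography from the tracked correspondences (needs >= 4 inliers).
    H, _ = cv2.findHomography(pts[ok], nxt[ok], cv2.RANSAC, 1.0)
    return H
```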
3.3 Generation of Test Images
Our evaluation method requires our registration algorithm to work with geo-referenced views of the Earth acquired at different attitudes along the orbit. In this procedure, both the image and the object domains are discrete, since these views are sampled from a geo-referenced satellite image database by intersecting the FOV of the camera, computed at a known position and spatial orientation, with the corresponding geographic area in the database. Realistic position and attitude (i.e., state) ground truth data along the orbit are provided by an orbital simulator. The state of the satellite imaging sensor is then geo-referenced at each epoch, that is, the corresponding maps of latitude and longitude on Earth are computed for each pixel of the sensor, taking into account the chosen camera model (projective) and Earth model (spherical, rotating at a known velocity). In this way, for each point in the current image reference frame, the corresponding latitude and longitude viewed on Earth by each sensor location can be obtained. To generate the corresponding terrestrial image, a magnitude value (i.e., gray level) must be assigned to each sensor pixel according to the values of
the image database tile that covers the sensor geographic maps just built (at least, the camera geo-referenced FOV). Since the image model provided by the satellite image database is discrete by definition, image interpolation is used to sample the geo-referenced tile at the sensor geographic maps. The interpolation technique chosen can affect the image generation process and, accordingly, the accuracy of the tracking algorithm, due to the artifacts introduced (aliasing, edge halos, etc.). Further considerations are discussed in Section 4.1.
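As an illustration of this sampling step, the sketch below resamples a geo-referenced tile at the sensor's per-pixel latitude/longitude maps with SciPy. It is a simplified stand-in for the actual image generation pipeline: the north-up tile layout, the uniform degree grid and all names are assumptions, and the interpolation order is the parameter whose artifacts are discussed above.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def render_sensor_image(tile, tile_lat0, tile_lon0, deg_per_px,
                        lat_map, lon_map, order=1):
    """Assign a gray level to each sensor pixel by sampling the database
    tile at the geo-referenced maps. `tile` is a north-up image whose
    upper-left corner is at (tile_lat0, tile_lon0); `lat_map`/`lon_map`
    have the sensor resolution (e.g. 240 x 320). order=1 is bilinear,
    order=3 bicubic; the choice affects aliasing and halo artifacts."""
    rows = (tile_lat0 - lat_map) / deg_per_px   # latitude decreases with row index
    cols = (lon_map - tile_lon0) / deg_per_px
    return map_coordinates(tile, [rows, cols], order=order, mode='nearest')
```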
4
Experimental Results
Several experiments have been performed on sequences of sampled images, extracted from the geo-referenced satellite image database WMS Landsat7 ETM+. The sensor's parameters are reported in Table 1.

Table 1. Optical parameters of the sensor model
Sensor Dimensions   320 × 240 pixel
Pixel Size          8 µm
Focal Length        336.7 mm

For this database, the ground
resolution is 15 m/pix, yielding an instantaneous FOV of about 4 arcsec. As discussed in [13], this setup meets the quality requirements of our project (accuracy of 1 arcsec in attitude estimation), provided that a sub-pixel tracking accuracy of the order of 10^{-2} can be reached, which yields attitude errors of about 10^{-2} arcsec for roll and pitch and 10^0 arcsec for yaw. In particular, the WMS Global Mosaic is made of individual Landsat7 scene images, each one having a proper geo-location tag and covering 4 × 4 arcdeg of the Earth, with a resolution on Earth of 0.5 arcsec/pixel in the geographic reference frame. For each ground truth data sequence along an orbit, the corresponding geo-referenced image sequence is generated and processed by our registration algorithm.
4.1 Results
Two groups of experiments, each with a different purpose, have been performed. In the first one (S1 hereafter), the experiments have been performed over 325 frames of a near-polar, slightly perturbed orbit at an altitude of 650 km, covering an area within the geographical coordinates [44N, 8E] − [48N, 12E]. The Earth is assumed to rotate with the angular velocity ω = 7.272 · 10^{-5} rad/s. The algorithm works in real time (10 fps, and it is being further optimized) on a consumer PC (AMD 2000+, 1.66 GHz, 1 GB RAM). As an example, Fig. 1 shows a couple of consecutive frames, highlighting the features tracked by our algorithm. The four big crosses on the right image represent the best 4-point subset chosen by the RANSAC algorithm, whose corresponding inliers are used to compute the homography. The estimated attitude, expressed in roll-pitch-yaw angles with respect to the nominal local orbital reference frame, is then compared with the ground truth data and the corresponding errors are assessed.
Fig. 1. A couple of consecutive frames with the points chosen for the homography
Fig. 2. Histograms of F2F angular errors for S1 with our image registration algorithm
Fig. 2 reports the distribution of the frame-to-frame (F2F) errors, i.e., the errors made in registering all the pairs of consecutive images. As can be noticed, the accuracy reached in attitude estimation is of the order of 10^{-1} arcsec for roll and pitch, and tens of arcsec for yaw. That is, the required accuracy is widely fulfilled for roll and pitch, and only the yaw error exceeds the requirement. However, it should not be forgotten that the KLT is working on sampled images: for instance, given a "real" point in the first image, the corresponding one in the second frame could come from interpolation. We now try to separate the effects of sampling from the tracker capabilities. In a preliminary study [13], computed matchings have been used to evaluate the numerical consistency of the registration model. For each pair of consecutive epochs, given a set of points in the first frame, the expected matching set can be computed on the second frame from the ground truth state data, by projecting the visual rays of the first set onto the Earth and then back-projecting the resulting points until they intersect the second image plane. In such a way, computed matchings represent "exact" correspondences and, accordingly, the best set of matches the tracker could find. Fig. 3 reports the histograms of the three angular F2F errors resulting from the same sequence as S1, but obtained using the computed matchings of the extracted corners. As can be seen, the accuracy strongly improves and the bias disappears. Given the magnitude of the worsening observed when using the KLT, we can suppose that the KLT performance also suffers from the effects of sampling. In working dynamical conditions, the estimated attitude is then propagated to the next epoch, so the error bias (evident for roll and pitch) accumulates into drift effects. We have deeply analyzed the sources of error according to the model described in Section 3.1.
Fig. 3. Histograms of F2F angular errors for S1 with computed matchings
Basically, the absolute attitude estimation is affected by the previous estimate and by the estimation of the current attitude variation (or relative attitude). The estimation of the previous orientation also affects the parallax term (see Eq. 1), which is due to the change of the satellite position along the orbit. Since we want to isolate the effect of our image registration algorithm, we simulate a perturbation with the satellite in a fixed position, so as to annul the parallax effects due to the shift in position. For this purpose, a 100-frame sequence (S2 hereafter) has been generated by varying only the roll angle (i.e., pan in the camera reference frame) with steps of 1 arcsec, while keeping the other two angles unchanged. Fig. 4 reports the corresponding histograms of the F2F errors: the bias in roll estimation is evident in S2.
Fig. 4. Histograms of F2F angular errors for S2 with our image registration algorithm
The expected image motion field should be directed purely along the x direction, with null components in the y direction. Considering a generic couple of frames in S2, the difference vectors between computed and measured matchings can yield useful hints to understand the results in Fig. 4. In Fig. 5, the histograms of these differences in the x and y components, for a fixed couple of frames and a set of 150 features, are shown in pixels. As expected, the y component is normally distributed around a zero mean value, while the x component of this offset is biased. Extending these results to all the other couples of frames (which present the bias in the same direction) yields a biased histogram, confirming that the bias of Fig. 4 originates from a "systematic" error in tracking. This again confirms that the interpolation involved in image generation (as previously explained in Section 3.3) affects the outcome of the tracker, also recalling that our feature extractor computes image gradients on small patches to extract corner points.
Fig. 5. The error components in the motion field estimation for one F2F registration
5
Conclusion and Future Works
This research work proposes a novel vision-based approach to recover the satellite attitude with high accuracy, exploiting the geometrical relation between couples of subsequent frames. The system assumes a camera, integral with the satellite, looking at the Earth. In order to obtain a system that is as fast and robust as possible, while achieving the accuracy required in satellite applications, we have considered corner points and the KLT tracker. This study analyzes the potential of the registration approach, the performance of the KLT on the image database, as well as the process of generating the images used to assess the approach itself. Accurate experiments separate the possible sources of error and show how the use of trackers like the KLT permits the required accuracy to be achieved in this application. Accurate simulations have shown that the interpolation used in generating the test images could be mainly responsible for drift errors. Future work will be directed at assessing the performance of our approach using images acquired in real time.
Acknowledgment This research was partly granted by the University of Bologna through the joint DIEM/DEIS “STARS” project, started in 2005. We thank the DIEM team led by Prof. P. Tortora (C. Bianchi, N. Melega and D. Modenini) for providing us with the data of their orbital/attitude simulator developed in the context of the STARS project.
References 1. Nister, D., Naroditsky, O., Bergen, J.: Visual odometry for ground vehicle applications. Journal of Field Robotics 23, 3–20 (2006) 2. Stanton, R.H., Alexander, J.W., Dennison, E.: Ccd star tracker experience: key results from astro 1 flight. Space Guidance, Control, and Tracking, 138–148 (1993) 3. Moravec, H.: Obstacle avoidance and navigation in the real world by a seeing robot rover. In: tech. report CMU-RI-TR-80-03, Robotics Institute, Carnegie Mellon University & doctoral dissertation, Stanford University. Number CMU-RI-TR-80-03 (1980)
4. Johnson, A.E., Goldberg, S.B., Cheng, Y., Matthies, L.H.: Robust and efficient stereo feature tracking for visual odometry. In: IEEE International Conference on Robotics and Automation (ICRA 2008), pp. 39–46 (2008) 5. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artificial Intelligence, pp. 674–679 (1981) 6. Fischler, M.A., Bolles, R.C.: Random sample and consensus: A paradigm for model fitting with application to image analysis and automated cartography. Comm. of the ACM 24, 381–395 (1981) 7. Conte, G., Doherty, P.: An integrated uav navigation system based on aerial image matching. In: Proceedings of the IEEE Aerospace Conference, pp. 1–10 (2008) 8. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 91–110 (2004) 9. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Computer Vision and Image Understanding archive 110, 346–359 (2008) 10. Caballero, F., Merino, L., Ferruz, J., Ollero, A.: Vision-based odometry and SLAM for medium and high altitude flying UAVs. Journal of Intelligent and Robotic Systems 54, 137–161 (2009) 11. Rogers, G.D., Schwinger, M.R., Kaidy, J.T., Strikwerda, T.E., Casini, R., Landi, A., Bettarini, R., Lorenzini, S.: Autonomous star tracker performance. Acta Astronautica 65, 61–74 (2009) 12. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge Academic Press, London (2003) 13. Bevilacqua, A., Carozza, L., Gherardi, A.: A novel vision based approach for autonomous space navigation systems. In: International Symposium on Visual Computing (2009) 14. Shi, J., Tomasi, C.: Good features to track. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 593–600 (1994) 15. Zitova, B., Flusser, J.: Image registration methods: a survey. Image and Vision Computing 21, 977–1000 (2003) 16. Foroosh, H., Zerubia, J.B., Berthod, M.: Extension of phase correlation to subpixel registration. IEEE Transactions on Image Processing 14, 12–22 (2002) 17. Bouguet, J.Y.: Pyramidal implementation of the Lukas Kanade feature tracker: Description of the algorithm. In: Intel Research Laboratory, Technical Report (1999)
A Novel Vision-Based Approach for Autonomous Space Navigation Systems Alessandro Bevilacqua, Alessandro Gherardi, and Ludovico Carozza ARCES - Advanced Research Centre on Electronic Systems - University of Bologna via Toffano 2/2 - 40125 Bologna, Italy {abevilacqua,agherardi,lcarozza}@arces.unibo.it
Abstract. Determining the attitude of a satellite is a crucial problem in autonomous space navigation. Current systems exploit ensembles of sensors, also relying on star trackers, which recover the attitude of a satellite by matching part of the sky map with a star atlas stored on board. These are very complex systems that also suffer from Sun and Moon blinding. In this work we assess the feasibility of using a novel stand-alone system relying on a camera looking at the Earth. The satellite attitude is recovered through the parameters extracted from the registration of the image pairs acquired while orbiting. The experiments confirm that the accuracy of our approach is fully compliant with the application requirements.
1
Introduction
Vision-based systems are spreading more and more in diverse application fields, aiming at replacing, or at least "aiding", systems made of other sensors, owing to the advantages they can bring. The automatic guidance and control (AGC) of remote sensing devices is one of these fields. The employment of a CCD camera together with a processing unit can often replace an ensemble of different sensors, which represents a more complex system and also requires an integration of the different sources of data. As an example, the determination of the attitude in satellite applications is usually achieved by coupling different sensors, and this requires a subsequent data integration. Vision-based sensors and systems have been studied since the early Eighties. In 1990, the first autonomous star tracker (ASTROS) was employed in a NASA space mission [1] using a CCD camera. Nowadays, star trackers are more and more important in the field of attitude and orbit control, although some concerns are still open. For instance, the high cost of these sensors and some drawbacks due to Sun and Moon blinding are limiting factors for small-satellite, low-Earth orbit missions. In addition, they rely on a reference database on which a matching is performed. Our group is studying a new vision-based approach able to estimate the satellite attitude with high accuracy. Differently from star trackers, our method employs the Earth as a native target to be tracked in order to recover the attitude, still relying on an on-board monocular camera integral with the satellite. In this work, we describe the preliminary research we have carried
out to assess the feasibility of our approach. It relies on the application of image registration techniques to satellite terrestrial images. The dynamical model is derived taking into account the requirements and constraints of a typical small-satellite application. The attitude parameters should be recovered from the geometrical transformation that maps the views of the Earth acquired at different epochs along the orbit. Since attitude estimation represents a mission-critical step in satellite applications, it must offer high robustness and precision while working in real time and, at the same time, requiring a low computational payload. Thorough experiments on simulated data prove that the vision-based method we have conceived is compliant with the application requirements. This work is organized as follows. Section 2 discusses previous approaches used to face the problem. The requirements and constraints of the problem are analyzed in Section 3, while in Section 4 the geometric model and the image registration approach used to relate different views are described. The motivation behind the choice of the Earth imagery dataset is described in Section 5. In Section 6 some results concerning experiments with sequences of simulated data are shown. Finally, Section 7 draws conclusions and proposes future developments.
2
Previous Work
Attitude estimation for the automatic guidance of remote unmanned systems has been addressed by using specific devices and sensors in terrestrial or aerial applications and space missions. Most of the currently available navigation systems based on attitude estimation follow two different approaches. In the first approach, a combination of sensors is employed. In terrestrial and aerial applications, GPS, accelerometers and gyroscopes are widely employed. For satellite missions, where estimation accuracy is a key requirement, orbital gyrocompasses and horizon sensors are used to estimate the pitch, roll and yaw attitude angles. However, accuracy is limited to tenths of a degree [2]. Furthermore, in Low Earth Orbit (LEO) systems, the Earth's horizon does not appear perfectly circular, and infrared radiation deflected by the atmosphere can lead to false detections that worsen accuracy to as low as a few degrees [2]. The second approach exploits vision-based methods. In particular, the Simultaneous Localization and Mapping (SLAM) approach has been conceived to cope with dead-reckoning effects in long looping image sequences by reducing drift errors. The research presented in [3] well summarizes the state of the art; there, a combined approach of SLAM and visual odometry is proposed to reduce the impact of cumulative errors. Although attitude estimation is also provided, it shows a very limited accuracy (a few degrees). In [4], sequences of monocular aerial images acquired in real time are compared with a geo-referenced Earth image database to estimate the position and velocity of the aircraft. This also helps to reduce drift errors in position estimation. However, no results are reported for the attitude, although the authors state that they recover it. In addition, processing large database images could be not compliant with the limited resources available on
board of a satellite. As regards satellite systems, star trackers have been employed to improve attitude estimation accuracy. In particular, the authors in [5] report a fully autonomous star tracker capable of achieving an angular estimation accuracy below 1 arcsec. The system is able to operate in different modes depending on the environmental conditions. However, only indoor testing has been reported and the accuracy strongly depends on the quality and number of matched stars. Also in [6], sub-pixel centroid estimation is achieved by matching the camera Point Spread Function (PSF) with the star's pixel values. The overall accuracy is influenced by the centroid estimation as well as by the probability of correct star identification. The authors report an accuracy for attitude estimation of about 2 arcsec for both synthetic and real (urban scene) images. Nevertheless, star trackers are not immune to problems, as reported in [7], where the radiation impact is analyzed. The testing of star trackers in real working conditions has been recently reported in [8]. The 3-sigma accuracy along the three axes is less than, or equal to, [7 7 70] arcsec, respectively. There, the effects of the satellite motion on performance, stray light and direct Sun blinding are also discussed, analyzing in which conditions the recognition will fail.
3
Problem Statement
In this work we investigate the feasibility of employing a vision-based attitude sensor for standalone three-axis stabilized mini-satellites in LEO orbit. They are usually equipped with remote sensing devices, like imaging sensors, and they are oriented in a nadir-pointing attitude, that is, one axis always looks at the Earth during the orbit. What follows is a brief summary of the main constraints related to our target application:
– LEO satellites have a typical orbit altitude of 650 km and cover about 15 circular non-geostationary orbits per day, with a period of 94 min/orbit, corresponding to a nominal ground velocity of 6.9 km/s;
– limited computational resources are available, usually due to old-generation processors and system architectures already tested for space missions;
– a limited power budget, so that the design of such a sensor must pay attention to computational requirements;
– a limited payload and reduced dimensions available, which have direct consequences on the choice of the optics layout.
The motion velocity has to be taken into account since it can affect the image quality (e.g., because of blurring). The working frame rate is also important for the feedback frequency in the spacecraft control, since iterative algorithms for refining the estimation can be too "heavy" to be supported by the hardware available on mini-satellites. Therefore, the computational complexity of the method should be kept under control. The vision-based attitude estimation system must basically comply with the two following requirements:
– attitude estimation accuracy (3 angles) ≤ 1 arcsec
– attitude sensor dimension ≤ 150 × 150 × 300 mm
4
The Method
The approach we conceived aims to estimate the satellite attitude by analyzing the transformation between subsequent pairs of views. In particular, the satellite motion, the object being viewed (the Earth) and the camera are responsible for the image generation and warping processes. Accordingly, it is essential to analyze thoroughly the choices made to describe the geometrical relations linking corresponding points in views taken at different attitudes, as described in Section 4.1. In order to assess that our model complies with the requirements, correspondences are generated as described in Section 4.2.
4.1 The Geometrical Model
We adopt the realistic model of a full projective camera whose intrinsic parameters are supposed to be known. Two different views at two consecutive epochs are linked through the well-known Eq. 1 [9]:

[x′ y′ 1]^T = K R′ R^{-1} K^{-1} ( [x y 1]^T − K R (C′ − C) / Z_{cam} )    (1)

where [x y 1]^T and [x′ y′ 1]^T represent the homogeneous coordinates of the corresponding points in the previous epoch t and in the current epoch t′, respectively. K is the matrix of the intrinsic parameters of the camera. The rotation matrices corresponding to the two different spatial orientations of the satellite in the two epochs are represented by R and R′, respectively. C and C′ stand for the 3-D coordinates of the camera's optical centre in the first and in the second epoch, respectively, and depend on the satellite orbital trajectory. Z_{cam} is the distance of each scene point from the image plane, referred to the first epoch, and it takes into account the 3-D structure of the scene. The Earth is modeled as a sphere of known radius (R_E = 6371 km), so that, given the satellite position and orientation, it is possible to compute Z_{cam} for every point of the camera sensor's plane. The plane homography H retains the attitude variation R′R^{-1}, according to the homogeneous Eq. 2:

H = λ K R′ R^{-1} K^{-1},   λ ≠ 0    (2)

and relates the two image reference frames, up to a ratio that can be computed since the orbital trajectory and the Earth model are known (see Eq. 1). λ is a proportionality factor that can be computed once H has been estimated from point correspondences, since det(H) = λ^3. Satellite attitudes can be conveniently represented using unit quaternions q. Since they yield more numerically stable solutions with respect to the composition of rotation matrices, they are usually adopted in satellite attitude control [10]. In particular, by using the Direction Cosine Matrix DCM(q) and taking into account the Earth rotation velocity ω = 7.272 · 10^{-5} rad/s, after some passages Eq. 3 can be written:

H = λ K DCM(q(t′)) R_3(ω(t′ − t))^T DCM(q(t))^T K^{-1}    (3)
The relative orientation quaternion Δq can be computed starting from the estimated homography H (Eq. 4):

R(Δq) (= DCM(Δq)^T) = (1/λ) (K^{-1} H K)^T    (4)
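As a numerical illustration of Eqs. 3 and 4, the sketch below builds the attitude-only homography from two ground-truth quaternions and then recovers Δq from it. This is our own sketch, not the authors' simulator: SciPy's x-y-z-w quaternion order, the DCM convention and the choice of the z axis as the Earth's spin axis are assumptions, and the parallax term of Eq. 1 is omitted.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def homography_from_attitudes(K, q_t, q_t1, omega, dt, lam=1.0):
    """Attitude-only homography of Eq. 3 from ground-truth quaternions at
    epochs t and t' and the Earth rotation during dt (parallax neglected)."""
    dcm_t = Rotation.from_quat(q_t).as_matrix()
    dcm_t1 = Rotation.from_quat(q_t1).as_matrix()
    r3 = Rotation.from_euler('z', omega * dt).as_matrix()  # Earth spin, assumed about z
    return lam * K @ dcm_t1 @ r3.T @ dcm_t.T @ np.linalg.inv(K)

def relative_quaternion(H, K):
    """Recover the relative orientation quaternion from H via Eq. 4."""
    lam = np.cbrt(np.linalg.det(H))
    R_dq = (np.linalg.inv(K) @ H @ K).T / lam   # R(dq) = DCM(dq)^T
    return Rotation.from_matrix(R_dq).as_quat()
```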
4.2 The Registration Approach
We have to find out whether a registration algorithm can exist that is compliant with the problem requirements stated in Section 3. Frame-to-frame (F2F) registration consists of three steps: feature extraction, feature matching (or tracking, if images have a temporal reference), and H computation. Actually, we have considered the image domain to be either continuous, representing the ideal condition, or discrete, representing real-world conditions. Several kinds of features can be chosen to be tracked and, accordingly, as many trackers. The presence of native structured patterns (like mountains, roads, etc.) could be exploited, but the descriptors would suffer from occlusion problems. On the other hand, as a general principle, we could state that one pixel is the smallest feature that can be extracted: it does not suffer from sampling errors (differently from, e.g., an oblique segment) and it is the fastest to be tracked. In particular, from a general point of view, tracking single pixels instead of complex descriptors requires a lower computational complexity, which is crucial in our application. In addition, points are coordinates that can be directly used to compute H, according to Eq. 1. As a matter of fact, in practical applications a small patch around a pixel is usually considered. Here, we do not address the problem of the robustness of the feature descriptor; we just want to assess whether tracking single points is compliant with the accuracy required by the problem, at least assuming a perfect tracking. Therefore, the homography H is estimated over a set of computed matchings between pairs of point maps in the sensor image plane. The procedure to compute this set can be summarized as follows:
1. A sparse set S(t) of n points p uniformly distributed is chosen in the first image plane (in the case of discrete domains, they represent pixels and have integer coordinates);
2. The geo-referenced projections of the image planes on Earth are found, taking into account the chosen camera and Earth models. Practically speaking, the points p are projected onto the spherical terrestrial surface through visual rays passing through the camera centre, so as to find on the Earth the set S_{E1}(t), expressed in latitude and longitude;
3. Starting from each point of the set S_{E1}(t), the matching set S(t′) in the second frame is found following a different method according to the image domain:
– continuous: the visual rays are simply back-projected from the set S_{E1}(t) to the plane of the second image, so as to find the matching coordinates directly;
– discrete: the image plane grids define as many grids on Earth, where each component (say, a "unitary square") is considered as being planar.
The projection of the pixels of the first image yields a set of target coordinates (Lat_{E1}, Lon_{E1}) expressed in latitude and longitude. In the shared region, we first find the coordinates (c_x, c_y) in the second image plane of the pixel c whose projected values (Lat_{E2}, Lon_{E2}) are the nearest ones to (Lat_{E1}, Lon_{E1}). Secondly, the planar perspective transformation P that maps the longitude and latitude values of the "four nearest neighbor" pixels of c into the unitary square is computed. In this way, the searched values (Lat_{E1}, Lon_{E1}) are "assigned" to the image coordinate P(Lat_{E1}, Lon_{E1}), hence at sub-pixel resolution. The set S(t′) is the collection of the points assigned according to this perspective interpolation. In the continuous domain, the errors in computing the homography are just numerical ones; a sketch of the continuous-domain projection and back-projection is given below. On the other side, in the discrete domain the method also undergoes the discretization error, although the perspective interpolation yields sub-pixel accuracy. The deviation of the discrete matching from the ideal one represents the lower bound of the registration accuracy.
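The continuous-domain projection and back-projection can be sketched as follows. This is our own simplified illustration: the camera model x ∝ KR(X − C) is the one implied by Eq. 1, while the Earth rotation between the two epochs and the discrete-domain perspective interpolation are ignored here.

```python
import numpy as np

R_EARTH = 6371e3  # m

def project_to_earth(p, K, R, C):
    """Intersect the viewing ray of pixel p = [u, v, 1] with the spherical
    Earth and return the 3-D surface point (nearest intersection)."""
    d = R.T @ np.linalg.inv(K) @ p          # ray direction in the world frame
    a, b, c = d @ d, 2.0 * (C @ d), C @ C - R_EARTH ** 2
    s = (-b - np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return C + s * d

def back_project(X, K, R, C):
    """Project the Earth point X into the second image plane, returning
    sub-pixel image coordinates."""
    x = K @ R @ (X - C)
    return x[:2] / x[2]

def computed_matchings(points, K, R1, C1, R2, C2):
    """'Exact' correspondences: first-frame pixels projected onto the Earth
    and back-projected into the second frame."""
    return [back_project(project_to_earth(np.append(p, 1.0), K, R1, C1),
                         K, R2, C2) for p in points]
```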
5
Earth Imagery
In order to test the feasibility of an image-based attitude control, the availability of a suitable database of images has to be investigated. The relations between sensor size (and ground area, accordingly), overlapping area between subsequent frames (needed for feature tracking), frame rate and ground velocity have to be analyzed. Let us consider a CCD camera commonly used in these applications, with a pixel size of 8 μm and a focal length of 336.7 mm. Table 1 shows how the sensor resolution (in pixels) and the acquisition frame rate affect the overlapping region.

Table 1. Relation among resolution, acquisition frame rate and overlapping region

Acquisition frame rate [fps]                                  1      2      5     10
Ground velocity [pixel/s]                                   460    230     92     46
Sensor size [pixel]   Ground area [km2]    Overlapping area (%)
320 × 240             4.8 × 3.6                               0   28.1   71.3   85.6
640 × 480             9.6 × 7.2                            28.1   64.1   85.6   92.8

If using feature points for image registration, the overlapping area between two consecutive frames should be reasonably kept at least at about 25%−30%, in order not to face an ill-conditioned problem and to attain a successful feature matching. Therefore, we can see that for a small image size (320 × 240 pixels) the minimum frame rate can be 2 fps only, while when doubling the resolution the image processing algorithm can run at 1 fps. However, low frame rates have to be avoided since the attitude control feedback would not be responsive and would
expose the satellite to longer perturbations. As for the tracking accuracy, it can be computed given the altitude of the satellite and the sensor optical parameters. For a typical LEO orbit with the realistic sensor parameters reported above, in order to detect perturbations of 1 arcsec the features must be tracked with an accuracy of about 0.2 pixel for the roll and pitch angles and with an accuracy of about 10^{-3} pixel for the yaw angle. In order to test the registration algorithm with typical satellite Earth images, different publicly available databases have been examined, also considering that the resolution on the ground should be compatible with the performance of the tracker algorithm in terms of accuracy. The choice has fallen on Landsat7 ETM+, since it best fits the required resolution. In particular, Landsat7 images have a proper geolocation tag and cover 4 × 4 arcdeg of the Earth, with a spatial resolution of 15 m/pixel, compatible with the sensor's parameters reported above. Just as a reference, Fig. 1 (left) shows an image from NASA Blue Marble (250 m/pixel) and the corresponding highlighted image from the NASA Landsat7 database (right).
Fig. 1. Image from NASA Blue Marble (left) and Landsat7 database (right)
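As a back-of-the-envelope check of these figures (our own arithmetic, not part of the original study; motion along the 320-pixel image side is assumed):

```python
import numpy as np

pixel_size, focal = 8e-6, 0.3367          # m, sensor of Table 1
ifov = pixel_size / focal                 # rad/pixel (~ 5 arcsec)
one_arcsec = np.radians(1.0 / 3600.0)

# Image displacement caused by a 1 arcsec attitude perturbation:
shift_roll_pitch = one_arcsec / ifov      # ~ 0.2 pixel (whole-image shift)
corner_radius = np.hypot(320, 240) / 2    # ~ 200 pixels from the optical axis
shift_yaw = corner_radius * one_arcsec    # ~ 1e-3 pixel at the image corner

# Overlap between consecutive 320-pixel-wide frames (cf. Table 1):
ground_velocity = 6.9e3 / 15.0            # ~ 460 pixel/s
for fps in (1, 2, 5, 10):
    overlap = max(0.0, 1.0 - ground_velocity / fps / 320.0) * 100.0
    print(fps, overlap)                   # ~ 0, 28, 71, 86 %
```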
6
Experimental Results
Simulations with computed matchings generated according to Sections 4.1 and 4.2 have been extensively performed, and the accuracy of the matchings in different views has been analyzed, since it affects the estimation of the homography H. An orbital simulator provides us with the ground truth used to evaluate the sensitivity of our image registration approach to the satellite navigation. The sensor parameters have been chosen according to what is described in Section 5, with a sensor size of 320 × 240 pixels, compatible with the devices commonly employed in the field of satellite imaging. Besides, they provide the same ground resolution as the image database, which is expected to be used in the next research stage. 1000 frames have been extracted from a slightly perturbed circular orbit, with roll, pitch and yaw perturbations generated by the orbital simulator in a range of [1000, 8, 60] arcsec, respectively, at the working frequency of 10 Hz
(10 fps), and considered for testing. The estimation errors with respect to the ground-truth data are reported in arcsec for roll, pitch and yaw, after converting the quaternion error into Euler angles. As for the numerical method used to convert a rotation matrix to a quaternion, here we report only the results obtained using the algorithm in [11]. This method works better than the canonical method for numerical simulation, where the matrix DCM(Δq) is already close, according to the Frobenius matrix norm, to an orthogonal matrix. On the contrary, the canonical algorithm works better in the presence of matrices farther from orthogonality, as could happen, for instance, for registration with synthetic images extracted from a geo-referenced database. The F2F error concerns the registration of each couple of consecutive frames, starting from the ground truth attitude of the first frame. In Fig. 2, the F2F error statistics for the continuous (top) and discrete (bottom) image domains are reported.
Fig. 2. F2F error statistics with continuous (up) and discrete (bottom) image domain
They show that the errors in the two domains are comparable, except for the pitch error. This means that pure discretization only slightly affects the accuracy. In addition, as one can see, the histograms for the discrete domain are biased, probably due to the interpolations in the discrete object plane (see Section 4.2). In fact, in the continuous domain the bias is not present. In working dynamical conditions, the estimated attitude is then propagated to the next epoch. Therefore, the bias in the F2F error may accumulate, thus resulting in drift effects. In Fig. 3, the temporal trend of the three angular errors, obtained by propagating the estimated attitude to the next F2F registration, is illustrated for the continuous domain. As expected, there is no error drift for any of the three angles, since the continuous model of the sensor has no bias. The analyses reported above cope only with the discretization error, since they are grounded on the perspective tracker with an "unlimited" sub-pixel resolution.
Fig. 3. F2F trend error with continuous image domain
The last studies we have carried out have yielded an estimate of the attitude errors based on the reliable hypothesis that a tracker could have a sub-pixel resolution of 10^{-2}. Practically speaking, this has been assessed by limiting the precision of the coordinates P(Lat_{E1}, Lon_{E1}) to 10^{-2}. The results are still biased and the errors on the three angles are of the order of [10^{-2}, 10^{-2}, 10^0] arcsec. Although worse than those of Fig. 2 (top), they are still compliant with the problem's requirements. A significant difference regards the error on the yaw angle, which is two orders of magnitude worse than that of the other angles. This is due to a yaw perturbation yielding a less appreciable variation in the image motion field compared with the variations produced by equal roll and pitch perturbations, thus better reflecting the real-world conditions. As for the computational complexity, the choice of a feature tracker based on single points rather than complex descriptors, together with the choices on image resolution and acquisition frame rate, fulfills the application requirements.
7
Conclusion and Future Works
We have presented the preliminary results of the research we have been carrying out to develop a stand-alone vision-based system to estimate the attitude of a satellite. The proposed approach is based on frame-to-frame registration and exploits the tracking of feature points to compute the homography and recover the attitude parameters. A study is presented aimed at assessing the feasibility of the method in the presence of the clear constraints of the problem. Experiments proved that a tracking method capable of working with feature points at sub-pixel resolution could comply with the problem's accuracy requirements. The outcome of the experimental results also constitutes the input data to the next stage, when simulations with synthetic images extracted from the chosen database and, possibly, real-time acquisitions have to be performed. In addition, a tracking algorithm capable of working with feature points at sub-pixel resolution in the domain of the target image will be implemented and tested on sampled and real-time sequences, also considering computational requirements. It is worth noticing that this approach could be useful in other domains, such as video surveillance applications, especially when PTZ cameras are used to precisely recover the angle of view.
Acknowledgment This research was partly granted by the University of Bologna through the joint DIEM/DEIS “STARS” project, started in 2005. We thank the DIEM team led by Prof. P. Tortora (C. Bianchi, N. Melega and D. Modenini) for providing us with the data of their orbital/attitude simulator developed in the context of the STARS project.
References 1. Stanton, R.H., Alexander, J.W., Dennison, E.: Ccd star tracker experience: key results from astro 1 flight. Space Guidance, Control, and Tracking, 138–148 (1993) 2. Pisacane, M.: Fundamentals of Space Systems, ch. 5, 2nd edn. Oxford University Press, Oxford (2005) 3. Caballero, F., Merino, L., Ferruz, J., Ollero, A.: Vision-based odometry and SLAM for medium and high altitude flying UAVs. Journal of Intelligent and Robotic Systems 54, 137–161 (2009) 4. Conte, G., Doherty, P.: An integrated uav navigation system based on aerial image matching. In: Proceedings of the IEEE Aerospace Conference, pp. 1–10 (2008) 5. Ruocchio, C., Accardo, D., Rufino, G., Mattei, S., Moccia, A.: Development and testing of a fully autonomous star tracker. In: 2nd IAA Symposium on Small Satellites for Earth Observation, Berlin, Germany, April 12-16 (1999) 6. Kolomenkin, M., Pollak, S., Shimshoni, I., Lindenbaum, M.: Geometric voting algorithm for star trackers. IEEE Transactions on Aerospace and Electronic Systems 44, 441–456 (2008) 7. Jørgensen, J.L., Thuesen, G.G., Betto, M., Riis, T.: Radiation impacts on startracker performance and vision systems in space. Acta Astronautica 46, 415–422 (2000) 8. Rogers, G.D., Schwinger, M.R., Kaidy, J.T., Strikwerda, T.E., Casini, R., Landi, A., Bettarini, R., Lorenzini, S.: Autonomous star tracker performance. Acta Astronautica 65, 61–74 (2009) 9. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge Academic Press, London (2003) 10. Kristiansen, R., Nicklasson, P.J., Gravdahl, J.T.: Satellite attitude control by quaternion-based backstepping. IEEE Transactions on Control System Technology 17, 227–232 (2009) 11. Bar-Itzhack, I.Y.: New method for extracting the quaternion from a rotation matrix. Journal of guidance, control, and dynamics 23, 1085–1087 (2000)
An Adaptive Cutaway with Volume Context Preservation S. Grau1 and A. Puig2 1
Polytechnic University of Catalonia, Spain 2 University of Barcelona, Spain
Abstract. Knowledge expressiveness of scientific data is one of the most important visualization goals. However, current volume visualization systems require a lot of expertise from the final user. In this paper, we present a GPU-based ray casting interactive framework that computes two complementary initial camera locations and allows the focus to be selected interactively on interesting structures, keeping the volume's context information with an adaptive cutaway technique. The adaptive cutaway surrounds the focused structure while preserving a deep immersive impression of the data set. Finally, we present a new brush widget to interactively edit the opening of the cutaway and to graduate the context in the final image.
1
Introduction
Knowledge expressiveness of scientific data is one of the most important visualization goals. The abstraction process that the final user should carry out in order to convey the desired information from the underlying data is normally difficult and tedious. Several methods have been published to gather the visual information contained in the data. However, volume renderings often include a barrage of complex 3D structures that can overwhelm the user. Over the centuries, traditional illustration techniques for visual abstraction have enhanced the most important structures within a context using different painting techniques (Figure 1A). Several approaches provide interactive focus selection and volume visualization, such as cutaways [1] and close-ups [2]. Some methods are based on NPR techniques and ghosting shading to simulate the illustrator's tools. All of these methods belong to the field of illustrative visualization [3], where the main goal is to develop applications that can integrate illustrations into the expert's ordinary data analysis in order to get more semantics from the data. Some interaction metaphors have been provided to help users in data navigation between focus and context (importance-driven, VolumeShop, exoVis, LiveSync++, ClearView). Specifically, ClearView proposes a simple point-and-click interface that enables the user to show particular areas of the focused object while keeping the surface context information (Figure 1B). In some applications, the volume information of the context region is especially important, as we can see in Figure 1C, where the different strata around the eye, which is the focus region, must be visualized. Strata representing layers of the context may help to locate the focused structure [5].
Fig. 1. Illustration examples of the underlying ideas of the adaptive cutaway visualization: (A) An illustration example, (B) visualization based on ClearView method ([4]), (C) dual-camera illustration (image from http://www.keithtuckerart.com/Illustration.html) and (D) Context preservation -ear- although it occludes the focused region -vessels- (image from the medical dictionary Allen’s Anatomy)
Moreover, in some cases, the context which occludes focused structures should be preserved (see Figure 1D). For this reason, an interactive tool to preserve the context is useful to obtain the desired final image. In this paper, we propose an enhancement of the ClearView paradigm where the context's volume data is adaptively clipped around the focus region. To show the layers of the context around the focus, we use volume rendering. We use ghosting techniques to preserve the context without occlusion of the focus. Moreover, we propose a new brushing widget to interactively edit the ghosting effect and to graduate the contribution of the context in the final image. On the other hand, a complementary parameter is the initial location of the camera, from which the user can start a free exploration of the data set. Selecting a good starting viewpoint is sometimes a tedious task for non-experienced users. Some traditional illustrations use two different views of the region of interest to enhance the perception and construction of the mental image (see Figure 1C). Actually, in the clinical routine, prefixed views are used, based on the sagittal, axial and coronal views, even though they are not always the best ones. In the bibliography, we can find techniques that automatically locate the viewpoint according to the importance of the structure to be rendered. Most of them are computed in a preprocessing stage. We propose an efficient GPU-based computation of two correlated initial views that show maximal information of the focused structure. In the case of Figure 1C, the focused structure is the eye and it is shown from two dual viewpoints, one oblique and one in front of the eye. Also, the user can interactively adjust any of the suggested cameras, and our system efficiently computes the new dual view.
2
Related Work
Comprehension of the meaningful structures in an image is a low-level cognitive process. Single visual events are processed in an intuitive way. However, multiple visual events involve a more complex cognitive process. Many works have addressed this problem. In the following, we first review the use of cut-away
views and focus-and-context techniques that render the contextual information using NPR shading and ghosting techniques. Secondly, we review the previous approaches to optimal camera location estimation. Cutaways: Usually, traditional artists render layered anatomical structures using clipping planes or curved clipping surfaces. Following this idea, several approaches have been proposed for polygonal rendering [6] and for volume rendering [2]. Interactive cutaways allow users to specify the cuts to be performed: by planes oriented along the principal axes, by user-defined interactive sculpting tools [7] and by user-manipulated deformable meshes [8]. Users can explore regions with peel-away [9] and exploded views [10]. 3D automatic cuttings design the appropriate cut of a focused internal structure based on the feature specification. Viola et al. [1] use the object importance to avoid unwanted occlusions. Zhou et al. [11] use the distance to the features to emphasize different regions. Bruckner et al. [2] present cuts completely based on the shape of the important regions. Krüger et al. [4] present in ClearView a region-focal based interaction that preserves the focused structure. Many of these approaches keep contextual information to give a better impression of the spatial location of the focused structure. Usually, the surface context information can be kept in different ways: with high transparency, low resolution, or different shading styles [1]. These algorithms are designed primarily to expose perfectly layered structures in the context, but they cannot show the intertwined volume structures of the context often found in 3D models. In order to preserve the contextual volume information, we present an automatic feature-based cutaway approach. We propose a view-dependent cutaway opening that guarantees the visibility of the selected inner structure for any viewpoint location. Optimal viewpoint selection: Setting the camera so that it focuses on the relevant structures of the model in direct volume rendering has been addressed by many authors [12,13,14,15,16,17], extending the ideas used in surface-based scenes. Visual information is the measure used to estimate the visibility between a viewpoint and the structure of interest of a volume data set. The analysis of the best camera position is then carried out with heuristic functions [17,15,13] and information-theoretic approaches (such as the viewpoint entropy [12] and the mutual information entropy [14,16]). These metrics are not universal and, as [18] conclude, no single descriptor does a perfect job. However, since heuristic approaches define a weighted energy function of view descriptors, they can easily be accelerated using blending operations between view descriptors. Some approaches study correlations between cameras varying in time, measuring the stability between the views based on the Jensen-Shannon divergence metric [12]. Also, in path-view searching, a Normalized Compression Distance is used [16]. Our goal is to obtain the most representative dual projection according to the position of the best view. Then, we restrict the search space of the dual-camera position to a set of positions related to the best one. Thus, we guarantee the distance between the correlated cameras and, moreover, we reduce our search space.
Fig. 2. The dual-camera placement method computes two camera locations from the sampled bounding sphere that show complementary information of the focused regions (in orange). The method takes into account the adaptive cut-away strategy that shows always the focused regions.
3
Overview
The main goal of our system is to provide an interactive exploration of volume data sets for inexperienced users. First of all, the focused structures should be selected with the aid of a transfer function classification. With presegmented labeled data sets, the user can directly use a value-based function widget. Once the focused regions are selected, the user can explore the data set using an automatic adaptive cutaway strategy that always shows the focused regions but also keeps the relevant contextual volume information. Moreover, if the user is not entirely satisfied with the cutaway results, a simple point-and-click editing brush tool helps to adjust the final image, adding and removing contextual volume appearance. Before this process, the camera position should be fixed. Two different camera locations, (C1, C2), are computed as a preprocess of the adaptive cutaway (see Figure 2). C1 represents the best view to visualize the focused regions according to user-defined criteria. C2 represents the correlated dual camera that preserves the maximum information integrated over both views. This step is called dual-camera placement, and it suggests two views that can be adjusted manually by the user to obtain the final image. The system can efficiently compute the correlated dual camera of any user-defined best view.
4
Adaptive Cutaway with Context-Volume Preservation
The automatic feature-based cutaway opening is generated in two stages. In the first stage, we set up the parameters to optimize the raycasting. In the second stage, we compute the context and focus image layers using the raycasting algorithm and we blend both image layers using an auxiliary distance map.

Raycasting Parameters Setup
The first step finds, for each sampled ray p_{i,j}, two hit points against the volume data set in front-to-back (FTB) order: the first intersection with the context, I_{context}(p_{i,j}), and the first intersection with the focus, I_{focus}(p_{i,j}) (see Figure 3). Rays with no intersection are skipped in the next steps.
Fig. 3. Computation of the ray focus and context intersections. The slice view shows the three possible ray intersection cases: no focus and no context intersection (orange), only context intersection (blue), and both focus and context intersections (green).
If we simply want to always show the focus region, the sampled rays may begin at the starting point I_{focus}(p_{i,j}) if it exists, and at I_{context}(p_{i,j}) otherwise. However, the depth perception between focus and context is then not clear (see Figure 4A). A deeper immersive impression can be obtained when different strata of the context are shown. Thus, our proposal consists of gradually opening the clipped area between the selected feature and the context (see Figure 4B), in object space as well as in image space. We adaptively open the cutaway as a function of the depth of the sample and of the Euclidean distance DM(p_{i,j}) of the pixel p_{i,j} to the nearest pixel belonging to the focused region projection (Figure 3, right, shows the focused projection in green). When the pixel belongs to the projection, DM(p_{i,j}) is zero. The first cast of the ray for the pixel p_{i,j} is done between I_{context}(p_{i,j}) and I_{focus}(p_{i,j}). For those pixels that do not have I_{focus}(p_{i,j}), it is estimated using a pyramidal method for the interpolation of scattered pixel data [19]: we extrapolate I_{focus} by averaging only the known values in an analysis process and by filling the unknown ones in a synthesis process. At the end of this process, all the I_{focus}(p_{i,j}) and I_{context}(p_{i,j}) are known and stored in a 2D texture. We apply a progressive cutaway for those rays whose distance map values are less than or equal to a user-defined opening width (w_{open}), i.e., for the pixels p_{i,j} that fulfill DM(p_{i,j}) ≤ w_{open}. For these rays, the starting point of the ray is interpolated between I_{focus}(p_{i,j}) and I_{context}(p_{i,j}), weighted linearly by DM(p_{i,j}).

Raycasting
A common volume raycasting with early termination is used to create the image of Figure 4A or 4B. The cast of the ray begins at the sample computed between I_{focus}(p_{i,j}) and I_{context}(p_{i,j}). This image is stored in a texture called the focus image layer (L_{focus}). Moreover, to give a better impression of the spatial location of the focused structure in the context area, we employ context-preserving strategies to keep context edges on the focus projection. This is done by computing the curvature, illumination and value at the I_{context}(p_{i,j}) sample point for all the pixels inside the focus projection. These values are stored in the context image layer (L_{context}).
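The scattered-data extrapolation of I_{focus} can be illustrated with a simplified pull-push scheme. This is a CPU sketch of the general pyramidal technique cited as [19], not the authors' GPU implementation; power-of-two image sizes are assumed so that every level halves cleanly.

```python
import numpy as np

def pull_push_fill(values, known):
    """Fill pixels where `known` is False: averages of known samples are
    pulled into coarser levels (analysis) and pushed back down to fill the
    gaps (synthesis). `values` and `known` are 2-D arrays of the same,
    power-of-two, shape."""
    v = np.where(known, values, 0.0).astype(np.float64)
    w = known.astype(np.float64)
    levels = [(v, w)]
    # Analysis (pull): accumulate known contributions into coarser levels.
    while min(levels[-1][0].shape) > 1:
        v, w = levels[-1]
        v2 = v[0::2, 0::2] + v[1::2, 0::2] + v[0::2, 1::2] + v[1::2, 1::2]
        w2 = w[0::2, 0::2] + w[1::2, 0::2] + w[0::2, 1::2] + w[1::2, 1::2]
        levels.append((v2, w2))
    # Synthesis (push): known pixels keep their own value, unknown pixels
    # take the estimate propagated from the coarser level.
    cv, cw = levels[-1]
    est = np.divide(cv, cw, out=np.zeros_like(cv), where=cw > 0)
    for v, w in reversed(levels[:-1]):
        up = np.repeat(np.repeat(est, 2, axis=0), 2, axis=1)
        est = np.divide(v, w, out=up, where=w > 0)
    return est
```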
Fig. 4. Cutaway example using a phantom data set. It shows the adaptive cutaway (A), the opening cutaway (B), and the context preservation (C).
Fig. 5. Cutaway openings at two viewpoint positions without (A and B) and with (C) context preservation, or using the border effect (D)
A simple blending process combines the context and focus image layers, properly obtaining the final image. Figure 5C shows the enhancement given by the context-preserving idea. An additional border effect can be applied as a function of the computed distance DM(p_{i,j}) (see Figure 5D): the pixels in an area surrounding the projection's boundary can easily be detected by their values and rendered as a silhouette. In summary, with user-defined opening and border widths (w_{open} and w_{border}), during the composition stage the final color Col_{i,j} at each pixel p_{i,j} is computed as:

Col_{i,j} = blend(L_{focus}(p_{i,j}), L_{context}(p_{i,j}))   if 0 ≤ DM(p_{i,j}) ≤ w_{open}
          = border color                                      if w_{open} < DM(p_{i,j}) ≤ w_{open} + w_{border}
          = L_{focus}(p_{i,j})                                 otherwise
where Col is the final pixel color and blend is a blending function.

Interactive Brush Widget
In order to allow users to interactively modify the starting cast of the ray between I_{focus}(p_{i,j}) and I_{context}(p_{i,j}), we define a 2.5D brushing tool. Once the cutaway is visualized, the user can paint over the final image using a circular depth-brush that changes the ray's starting point for all the pixels inside the
circle. The depth of the starting point is increased or decreased according to the depth of the current brushing tool. Thus, only the affected pixels of the focus layer, L_{focus}, are re-sampled.
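The per-pixel opening and composition rule can be sketched on the CPU as follows. This is our own illustration, not the GPU shader: single-channel image layers and a simple alpha blend are assumed, and the distance map DM is obtained with a Euclidean distance transform of the focus projection mask.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def cutaway_compose(L_focus, L_context, focus_mask, I_focus, I_context,
                    w_open=24.0, w_border=2.0, border_color=0.0, alpha=0.5):
    """Adaptive cutaway sketch: DM is the distance to the focus projection,
    the ray start is interpolated between I_focus and I_context inside the
    opening, and the final color follows the piecewise rule given above."""
    DM = distance_transform_edt(~focus_mask)        # 0 on the focus projection

    # Ray starting depth: progressive opening around the focused structure
    # (in the real pipeline L_focus would be re-cast from these depths).
    t = np.clip(DM / w_open, 0.0, 1.0)
    ray_start = (1.0 - t) * I_focus + t * I_context

    # Per-pixel composition of the two image layers.
    out = L_focus.copy()
    opening = DM <= w_open
    border = (DM > w_open) & (DM <= w_open + w_border)
    out[opening] = alpha * L_focus[opening] + (1.0 - alpha) * L_context[opening]
    out[border] = border_color
    return out, ray_start
```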
5
Dual-Camera Placement
The camera’s searching space’s continuity and smoothness is essential to guarantee the stability of the view quality. Our solution space S is the bounding sphere of the complete model that contains the set of all possible camera locations. The space of N camera locations is defined as a discrete and finite set of samples over S that are iso-distributed along the surface of the sphere (we use the HealPix package [20]). We assume that the up vector can be arbitrarily chosen at each location due to the camera rolls not having an effect on the visibility. We propose a heuristic method with the following image-based descriptors: visibility, coverage and goodness of the location of the focus on the viewport. The visibility descriptor (vd1 ) evaluates how far is the focused regions to the camera location using the average depth, and vd1 is bigger for higher average depth. The quality of the view can be enhanced with the coverage of the focused regions (vd2 ) that arises with the size of the focus in the final projection. Finally, the goodness of the location (vd3 ) measures the centroid of the projected area. The focus should be centered in the viewpoint area. Only the focused regions are taken into account in the camera quality estimation. For this reason, the method computes first, for each camera location, an image that stores for each pixel the first intersection with the focused regions and the corresponding depth, when this intersection exists. This image is next used to compute the image-based descriptors. All these view descriptors are computed on the GPU. The number of pixels, the bounding box and the average depths of all pixels are computed following a GPU-based hierarchical strategy [19]. The final view quality estimation of a camera location c is computed by the user-defined weights of the different viewing descriptors, vdi . To find the best camera placement (C1 ), we search the maximum view quality estimation using a GPU-based hierarchical method. Its efficiency depends directly on the size of the texture that stores the N viewpoint values. For small sizes, we search the maximum value on the CPU directly. Illustration techniques that enhance the perception of the model with dualcamera inspire us to define two correlated views (see Figure 1C). Classical illustrations shown quasi-orthogonal projections of a region. In order to find out the second camera, C2 , we simply reduce the searching samples of the solution space to orthogonal regions to the best camera location, C1 . We have tested orthogonal regions as streams and semi-spheres (see Figure 6).
6 Simulations
We have evaluated the performance of the proposed methods on a Pentium Dual Core 3.2 GHz with 3 GB of memory, equipped with an NVIDIA GeForce 8800 GTX GPU with 768 MB of memory. The viewport size is 700 × 650. To test the methods, different synthetic and real data sets have been used. In all the tests, spheres of 12, 108, 300, and 588 viewpoints have been used, and two different orthogonal regions, stream band and semi-sphere, have been fixed by the user. We have used several data sets of different sizes. For each data set, different features can be selected. The thorax data set represents a segmented phantom human body. VMHead is a CT data set of a human head obtained from the Visible Human Project. Foot and Hand are non-segmented CT scans of a human foot and a human hand, respectively.

Fig. 6. Dual-camera illustrations: (left) the reference illustration (image from the medical dictionary Allen's Anatomy); (center) our results: best camera, semi-sphere dual camera and stream dual camera, respectively; (right) the orthogonal regions (green) of the best camera location (blue): (top) streams and (bottom) semi-spheres

Fig. 7. Stability of the camera computation when varying the number of sampled cameras with the hand data set

Fig. 8. Different cutaways of the thorax (top) and VMHead (bottom) data sets. The first row shows visualizations with classical ray casting (left) and with different wopen and wborder values. The second row shows, from left to right, wopen = 0, wopen = 24 without context preservation, and wopen = 24 with context preservation. The last image is edited with the brush tool to enhance context regions such as the ear and nose.

First of all, we evaluated the minimum number of cameras that must be sampled in order to converge to a stable solution. We tested the different data sets while increasing the number of samples on the sphere, N, from 12 to 1200 (see Figure 7). In general, the computed camera positions become stable between 108 and 184 samples. In these cases, the initial camera location step takes between 0.75 s and 1 s on average, measured with different data sets and selecting various structures. Changes to the camera descriptor weights and searches for the dual cameras do not reduce the frame rates significantly. In the tested data sets, the stability of the dual-camera location depends directly on the variation of the number of samples on the bounding sphere as well as on the search space used.

Figure 8 shows different cutaway widths, with context preservation, border highlights, and the editing brush tool, for the different data sets. We obtain interactive rates, between 50 and 60 FPS, for all tested visualizations.
7 Conclusions and Future Work
In this paper, we have presented a system for adaptive context volume visualization using automatic cutaways and dual-camera placement, based on GPU strategies. Our system provides new ways of interacting with a focused structure to be analyzed within a volume context environment, and it helps users to better understand the relationships between the focused structure and the context. Our approach enhances and extends the ClearView metaphor by providing volumetric information in the cutaway context opening. The proposed adaptive cutaway exhibits the volume information contained in the context data as stratified information surrounding the focused region. We have also proposed a new brushing widget to interactively adjust the contribution of the context to the final image. In addition, an efficient GPU algorithm is implemented to find the best two correlated viewpoints on the focused structure's bounding sphere.

We will continue this work in several directions. In this initial work we concentrated our efforts on a single sampling level, although HEALPix offers the possibility of refining the sampling space hierarchically. In the future, we will explore integrating this refinement process into our GPU-based system. We will also attempt to obtain the best dual-camera view by searching for the two views simultaneously.

Acknowledgements. This work has been partially funded by project CICYT TIN2008-02903, by the research centers CREB of the UPC and the IBEC, and by grant SGR-2009-362 of the Generalitat de Catalunya.
References

1. Viola, I., Kanitsar, A., Gröller, E.: Importance-driven feature enhancement in volume visualization. IEEE Trans. on Visualization and Computer Graphics 11, 408–418 (2005)
2. Bruckner, S., Gröller, E.: VolumeShop: An interactive system for direct volume illustration. In: IEEE Visualization 2005, pp. 671–678 (2005)
3. Rautek, P., Bruckner, S., Gröller, M.E., Viola, I.: Illustrative visualization: New technology or useless tautology? SIGGRAPH Comput. Graph. 42 (2008)
4. Krüger, J., Schneider, J., Westermann, R.: ClearView: An interactive context preserving hotspot visualization technique. IEEE Trans. on Visualization and Computer Graphics (Proc. Visualization/Information Visualization 2006) 12 (2006)
5. Patel, D., Giertsen, C., Thurmond, J., Gröller, M.: Illustrative rendering of seismic data. In: Proceedings of Vision, Modeling, and Visualization, pp. 13–22 (2007)
6. Burns, M., Finkelstein, A.: Adaptive cutaways for comprehensible rendering of polygonal scenes. ACM Trans. Graph. 27, 1–7 (2008)
7. Wang, S., Kaufman, A.: Volume sculpting. In: ACM Symp. on Interactive 3D Graphics, pp. 151–156 (1995)
8. Konrad-Verse, O., Preim, B., Littmann, A.: Virtual resection with a deformable cutting plane. In: Proc. of Simulation und Visualisierung, pp. 203–214 (2004)
9. Correa, C., Silver, D., Chen, M.: Feature aligned volume manipulation for illustration and visualization. IEEE Transactions on Visualization and Computer Graphics 12, 1069–1076 (2006)
10. Bruckner, S., Gröller, M.E.: Exploded views for volume data. IEEE Trans. on Visualization and Computer Graphics 12, 1077–1084 (2006)
11. Zhou, J., Hinz, M., Tönnies, K.: Focal region-guided feature-based volume rendering. In: Symp. on 3D Data Processing, Visualization and Transmission, p. 87 (2002)
12. Bordoloi, U., Shen, H.: View selection for volume rendering. In: IEEE Visualization 2005, p. 62 (2005)
13. Takahashi, S., Fujishiro, I., Takeshima, Y., Nishita, T.: A feature-driven approach to locating optimal viewpoints for volume visualization. In: IEEE Visualization 2005, p. 63 (2005)
14. Viola, I., Feixas, M., Sbert, M., Gröller, E.: Importance-driven focus of attention. IEEE Trans. on Visualization and Computer Graphics 12, 933–940 (2006)
15. Mühler, K., Neugebauer, M., Tietjen, C., Preim, B.: Viewpoint selection for intervention planning. In: IEEE Symp. on Visualization 2007, pp. 267–274 (2007)
16. Vázquez, P.P., Monclús, E., Navazo, I.: Representative views and paths for volume models. In: Smart Graphics, pp. 106–117 (2008)
17. Kohlmann, P., Bruckner, S., Kanitsar, A., Gröller, E.: LiveSync++: Enhancements of an interaction metaphor. In: Graphics Interface 2008, pp. 81–88 (2008)
18. Polonsky, O., Patane, G., Biasotti, S., Gotsman, C., Spagnuolo, M.: What's in an image? The Visual Computer 21, 840–847 (2005)
19. Strengert, M., Kraus, M., Ertl, T.: Pyramid methods in GPU-based image processing. In: VMV 2006, pp. 169–176 (2006)
20. Gorski, K., Hivon, E., Banday, A., Wandelt, B., Hansen, F., Reinecke, M., Bartelmann, M.: HEALPix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal 622, 759 (2005)
A 3D Visualisation to Enhance Cognition in Software Product Line Engineering Ciarán Cawley1, Goetz Botterweck1, Patrick Healy1, Saad Bin Abid1, and Steffen Thiel2 1 Lero, University of Limerick, Limerick, Ireland {ciaran.cawley,goetz.botterweck,patrick.healy, saad.binabid}@lero.ie http://www.lero.ie 2 Furtwangen University of Applied Sciences, Furtwangen, Germany
[email protected] http://www.hs-furtwangen.de/
Abstract. Software Product Line (SPL) Engineering is a development paradigm where core artefacts are developed and subsequently configured into different software products depending on a particular customer's requirements. In industrial product lines, configuration (variability management) can become extremely complex and very difficult to manage at scale. Visualisation is widely used in software engineering and has proven useful to amplify cognition in data-intensive applications. Adopting this approach within software product line engineering can support stakeholders in essential work tasks by enhancing their understanding of large and complex product lines. In this paper we present our research into the application of visualisation techniques and cognitive theory to address SPL complexity and to enhance cognition in support of the SPL engineering processes. Specifically, we present a 3D visualisation approach to enhance stakeholder cognition and thus support variability management and decision making during feature configuration.
1 Introduction

Software product line engineering has rapidly emerged as an important software development paradigm during the last few years. SPL engineering promises benefits such as "order-of-magnitude improvements in time to market, cost, productivity, quality, and other business drivers" [1]. The primary principle underpinning SPL engineering is the development of core assets through a domain engineering process and the subsequent configuration of those assets through an application engineering phase. These core assets comprise the product line and contain variation points that support their configuration. Configuration of variation points allows the same asset to implement different requirements/features within different final software products. This configuration stage is a core part of application engineering. Many of the expected benefits rely on the assumption that the additional up-front effort in domain engineering, which is necessary to establish the product line, provides
a long-term benefit, as deriving products from a product line during application engineering is (expected to be) more efficient than traditional single system development. However, to benefit from these productivity gains we have to ensure that application engineering processes are performed as efficiently as possible. This has proven extremely challenging [2, 3] with industrial-sized product lines containing thousands of variation points, each of which can be involved in many dependency relationships with various other parts of the product line. One way of addressing this is to support the SPL engineering activities by providing interactive tools that use, at their core, visualisation theory and techniques that are suited for comprehension of large data sets and inter-relationships. Adopting visualisation techniques in software product line engineering can aid stakeholders by supporting essential work tasks and enhancing their understanding of large and complex product lines. This paper introduces software product lines and presents our visualisation approach to enhance stakeholder cognition of the large and complex data sets that require understanding and management during the application engineering phase of the SPL process. We build on our previous work [4, 5], which elaborated on our initial ideas, and we focus here on exemplifying, describing and discussing a working implementation. The rest of this paper is structured as follows. In Section 2 we introduce software product lines and discuss the inherent data complexity challenges. In Section 3 we discuss related work. In Section 4 we present our visualisation approach from a conceptual viewpoint. In Sections 5 and 6 we provide a concrete implementation of the visualisation approach and discuss its benefits and limitations. The paper finishes with an overview of future work and conclusions.
2 Software Product Lines

2.1 The Process and Challenges

Two areas within software product line engineering that can cause particular difficulties for practitioners are the management of variability and the process of product derivation. Variability refers to the ability of a software product line development artefact/asset to be configured, customized, extended, or changed for use in a specific context [6]. It thus provides the required flexibility for product differentiation and diversification within the product line. Product derivation is the process whereby the product line variability is manipulated and managed in order to produce a single final software product (variant). Empirical work by Deelstra et al. [2], later expanded on by Hotz et al. [3], identified two fundamental issues at the root of most other problems:

- The complexity of the product line in terms of variation points, variants and dependencies;
- The large number of implicit properties or dependencies associated with variation points and variants. These tend to be undocumented or only known to experts.
Part of our ongoing work targets variability management directly by providing a considered visualisation approach based on a meta-model that describes the software product line in a supportive way.

2.2 Modelling and Visualisation Approaches

Describing a software product line in terms of a feature model [7] is a prevalent mechanism employed to address variability management. A feature describes a capability of the product line from a stakeholder's point of view and is represented as a node, with relationships between features as links (or edges). For example, a Seatbelt Reminder feature of a car restraint system requires the Seatbelt Detection Sensor feature. A feature diagram is typically represented as a tree where primitive features are leaves and compound features are interior nodes. The meta-model that we have developed and use as the basis for our visualisation approach consists of three separate but integrated meta-models. These describe features, decisions (which provide a high-level abstracted view on features and are essentially a combination of features that satisfy a particular need), and components (e.g., Java classes) which implement features. The details of this meta-model are out of scope for this paper and the interested reader is referred to a previous publication [8] for further information.
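As an illustration of the feature-model idea (not of the authors' tool), the fragment below encodes the seatbelt example as a tiny tree of feature nodes with a requires link; the class and attribute names are our own.

```python
class Feature:
    """A node in a feature tree; compound features have children, leaves are primitive."""
    def __init__(self, name):
        self.name = name
        self.children = []   # parent/child (decomposition) edges
        self.requires = []   # cross-tree 'requires' dependencies
        self.excludes = []   # cross-tree 'excludes' dependencies

    def add_child(self, child):
        self.children.append(child)
        return child

# The seatbelt example from the text: the reminder feature requires the sensor.
restraint = Feature("Restraint System")
reminder = restraint.add_child(Feature("Seatbelt Reminder"))
sensor = restraint.add_child(Feature("Seatbelt Detection Sensor"))
reminder.requires.append(sensor)
```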
3 Related Work

A traditional approach to visualising feature models is to render features as nodes in a node-link diagram and to represent their relationships with each other through edges linking those nodes. Where multiple models are involved, with additional relationships existing between those models, the same approach can be taken, making the graphs grow ever more complex. Tools such as pure::variants [9] and Gears [10] primarily use items such as lists and hierarchical tree views. These, though familiar, lack evidence of their effectiveness with large-scale product lines. The approach presented here addresses the issues from a non-traditional, relationship-centric perspective. The DOPLER [11] tool, although again employing lists and hierarchical trees, allows more sophisticated graph layouts to be visualised; these, however, follow the node-link diagram approaches mentioned above. 3D software visualisation tools such as VISMOOS [12] and MUDRIK [13] make interesting use of 3D approaches to support cognition; however, they do not support SPL engineering and focus on comprehension while omitting process support. Work by Robertson et al. [14] and Risden et al. [15], which compares task performance using 3D versus 2D techniques, provides interesting evidence that 3D techniques can be effective in certain circumstances.
4 Visualisation Approach

There are a number of visualisation techniques that can be applied to the model visualisations described above; however, the approach presented here addresses the issues from a different perspective.
During feature configuration of a large SPL, one of the primary difficulties is understanding and managing the relationships that exist between and within different models. For example, during feature configuration, a stakeholder concerned with adding a specific feature is particularly interested in the effect that selecting that feature has on the rest of the system. Its selection may cause multiple other features to be selected and/or eliminated from the configuration, which again can have consequences for other features. We also need to consider the effects on elements in other models, e.g., components. Hence, understanding and managing these relationships is key to an efficient configuration process. With this in mind, the approach taken here aims to focus the visualisation on the relationships that exist between model elements and not on the elements themselves, and in this way makes the relationships the primary visual element.

4.1 Visualising Sets of Relationships

As an example case of interrelated SPL models we use a DFC model, which describes a product line in terms of Decisions, Features and Components: A decision model captures a small number of high-level questions and provides an abstract, simplifying view onto features. A feature model describes available configuration options in terms of "prominent or distinctive user visible aspects, qualities, or characteristics" [7]. A component model describes the implementation of features by software or hardware components. These three models are interrelated. For instance, making a decision might cause several features to become selected, which in turn require a number of components to be implemented.

Fig. 1 shows a traditional approach [8] to visualising such inter-model and intra-model relationships. Using a tree layout, this example visualises a DFC model that describes automotive REStraint Control Units (RESCU). The product line described by this model contains features of electronic control units (ECUs) for automotive restraint systems such as airbags and seatbelt tensioners. The example uses a details-on-demand approach to visualising the relationships pertaining to a specific element or elements of interest. In the example, the central tree graph represents the feature model, the left tree graph represents the decision model and the right tree graph represents the component model. The nodes in each graph represent the elements of the particular model, the straight edges represent parent/child relationships, and the curved edges represent other inter-model and intra-model relationships such as implements (e.g. a feature implements a decision), excludes (e.g. a feature excludes a feature) and requires (e.g. a feature requires a feature).

Fig. 1. A DFC Model Tree View Visualisation

4.2 Visualising a DFC Model Using 3 Dimensions

Taking an example from Fig. 1, we can see (marked using ovals) that the BladderMat feature (partly) implements the Hardware B decision and also that it itself is (partly) implemented by the BladderMatSDriver component. While these model elements and the two relationships that exist between them comprise just a subset of the relationships that the BladderMat feature is involved in, we confine our discussion in this subsection to them in order to provide an initial understanding of our approach. Sections 5 and 6 will elaborate on the details.
Fig. 2. Visualising Inter-Model Relationships
Fig. 2 presents a three-dimensional space which provides the container for our DFC model visualisation. It primarily consists of three graph axes. The decision model is mapped to the Y-axis, the feature model to the X-axis and the component model to the Z-axis. The mapping is currently a simple sequential listing of the model elements along an axis. For illustration purposes we show the example that we have just identified above. The Hardware B decision is highlighted on the Y-axis, the BladderMat feature on the X-axis and the BladderMatSDriver component on the Z-axis. The blue sphere rendered within the coordinate space is the point where these three separate model elements "intersect". A sphere rendered at that specific point indicates that those three model elements are associated with each other. Hence, one visual element (the sphere) represents the three model elements and the inter-model relationships that exist between them (feature implements decision and component implements feature). It is therefore also referred to as a relation set identifier. By using a colour encoding scheme, additional relationships can be identified. One such use is colouring the relation set (sphere) green to indicate that the feature involved in this relation set is a required feature due to the selection of another feature (all relation set identifiers involve a feature). This exemplifies the encoding of an intra-model relationship. This three-dimensional space provides the environment that allows a stakeholder to visualise, interact with, and analyse the relationships that exist between and within the three models.
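The mapping from inter-model relationships to points in the 3D space can be sketched as follows. This is our own reading of the approach: the models are reduced to ordered name lists whose indices serve as axis coordinates, the implements links are given as pairs, and the colour rule is simplified to the blue/green cases described in the text.

```python
def relation_sets(decisions, features, components,
                  feature_implements_decision, component_implements_feature,
                  required_features=frozenset()):
    """Return (x, y, z, colour) tuples: one sphere per decision/feature/component
    triple connected by two 'implements' relationships.

    decisions, features, components : ordered lists of element names; the list
    index is the element's coordinate on its axis (Y, X and Z respectively).
    """
    sets = []
    for feature, decision in feature_implements_decision:
        for component, implemented_feature in component_implements_feature:
            if implemented_feature != feature:
                continue
            x = features.index(feature)       # feature axis (X)
            y = decisions.index(decision)     # decision axis (Y)
            z = components.index(component)   # component axis (Z)
            colour = "green" if feature in required_features else "blue"
            sets.append((x, y, z, colour))
    return sets

# Example from Fig. 1: BladderMat (partly) implements Hardware B and is
# (partly) implemented by BladderMatSDriver, giving a single blue sphere.
spheres = relation_sets(
    decisions=["Hardware B"],
    features=["BladderMat"],
    components=["BladderMatSDriver"],
    feature_implements_decision=[("BladderMat", "Hardware B")],
    component_implements_feature=[("BladderMatSDriver", "BladderMat")],
)
```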
This visualisation is concerned with representing three appropriate models at any given time but is not intended to be limited to only three models. Although we intend to extend our approach towards support for additional models, in this paper we focus on visualising our three integrated DFC models. The next section describes our implementation and provides further illustration. We use a specific scenario to highlight the main visualisation and its interactivity. Subsequently in Section 6 we discuss the reasoning and argue benefits and limitations.
5 Implementation

Throughout our description of the implementation we will use our example "RESCU" DFC model introduced in Section 4.1. The model consists of eighteen requirement decisions, each of which is implemented through one or more features, each of which in turn is implemented through one or more components.
Fig. 3. A 3D Visualisation as an Eclipse Plugin
Fig. 3 presents a screenshot of the visualisation. The implementation consists of an Eclipse plug-in [16] which, when installed within the Eclipse IDE, provides a set of views aimed at supporting SPL feature configurations. The centre view provides the 3D implementation under discussion. The decision tree view to its left and the textual view to its right are synchronised supporting views but are not required by it.
5.1 User Interface

The main interface comprises a three-dimensional co-ordinate space. Decisions are listed vertically along the Y-axis, features along the X-axis and components along the Z-axis. As described in Section 4, the relation sets (spheres) rendered within the coordinate space identify where sets of relationships exist between the three models (axes).

5.1.1 Basic Interactivity

As the mouse is moved over the labels along each axis, labels are magnified to provide readability and to identify features, decisions or components. If a label is clicked on a particular axis and a set of relationships exists that relates that model element to the other two models, then a relation set will be displayed at the corresponding 3D co-ordinate. Also, the corresponding labels on the other axes (identifying associated model elements) will be highlighted. By moving the mouse over a relation set, the labels of the three model elements involved in that relation set are further magnified to distinguish them and aid legibility (also see Fig. 2).

The visualisation as a whole can be flexibly manipulated by the stakeholder. It can be rotated in any direction by 360 degrees; it can be panned horizontally and vertically, and it can be zoomed towards and away from the user. This supports navigation of the visualisation and allows preferred viewing depending on the particular information of interest.

5.1.2 Visualising Additional Relationships

Up to this point, we have mainly described how the visualisation represents the relationships that exist between elements in different models. However, one of the main purposes of this visualisation is to additionally identify relationships that exist between different elements of the same model. A relation set that does not represent any additional relationships other than inter-model ones is coloured blue. In that case, the relation set shows that a particular decision, feature and component are related to each other through two implements relationships. Let us now consider that the feature identified in that set of two relationships requires another feature. By default, all relation sets within the co-ordinate space that involve that required feature will be displayed and coloured green. The green visually indicates that the feature represented within its relation set is a required feature (given the user's current selection). Similarly, a red relation set indicates that the feature represented within its relation set has an excludes relationship with another feature related to the user's current selection.

5.2 Transitive Relationship Complexity

Consider that a user has selected a decision and that the visualisation has rendered all relation sets in which that decision is involved (e.g., if only one feature implemented the decision and that feature was implemented by only one component, then only one relation set would exist that directly involved that decision). If that one feature either required or excluded another feature then, as described in the previous subsection, the visualisation, by default, will also render any relation sets that the required/excluded
feature was involved with. These transitive relationships introduce additional complexity on a number of levels, namely features excluding or requiring other features, decisions requiring other decisions, and components requiring other components. To manage this complexity the stakeholder has access to a number of filtering options (see Fig. 3). A detailed explanation of these filters is out of scope here; suffice it to say that the stakeholder can choose to manage different aspects of the complexity incrementally. Additionally, any filtered information can be brought more or less into view dynamically by the user using the slider at the top right, allowing a context to remain while filtering out less relevant information.

5.3 Example Scenario

Using Fig. 3 as an example we can highlight some of the attributes of the visualisation. In this example the stakeholder has selected the decision "High End Occupant Protection?" by moving the mouse over its label on the Y-axis and clicking on it. The stakeholder has also selected the "Primary" setting for the "Show Linked Features" filter, which is part of the "Filter Decision Selections" filter panel. This filter pattern will filter out any transitive relation sets to an extent specified by the filter slider setting, which increases/decreases their transparency. The selection of the "High End Occupant Protection" requirement decision results in 26 encoded relationships involving 24 distinct model elements across 3 separate models, visualised using 10 colour-encoded spheres (relation set identifiers). Three blue and five green relation sets immediately indicate three implementing and five required features, respectively, in relation to the selected decision. Two transparent relation sets indicate that additional transitive relationships exist, one of which is red, indicating a mutually exclusive feature. By hovering the mouse over or clicking any of the relation sets, the relevant decision, feature and component are clearly highlighted.
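The colour and transparency encoding described in Sections 5.1.2 and 5.3 can be summarised as a small rule, sketched below under our own naming; the mapping from the filter slider to opacity is an assumption, and the actual plug-in may apply its filters differently.

```python
def relation_set_style(is_required, is_excluded, is_transitive, filter_slider=1.0):
    """Return (colour, opacity) for one relation-set sphere.

    is_required  : the feature in this set is required by the current selection.
    is_excluded  : the feature is mutually exclusive with the current selection.
    is_transitive: the set is only reachable through transitive relationships.
    filter_slider: 0..1, user-controlled fading of filtered (transitive) sets.
    """
    if is_excluded:
        colour = "red"
    elif is_required:
        colour = "green"
    else:
        colour = "blue"   # only inter-model 'implements' relationships
    opacity = (1.0 - filter_slider) if is_transitive else 1.0
    return colour, opacity
```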
6 Discussion

The overarching motivation for this approach stems from the immense complexity that can be present in large-scale SPLs. Traditional approaches to managing such complexity can lead to problems such as "mapshock" (a phenomenon where someone perceiving an overly complex diagram has an audible reaction to information overload). The DFC meta-model provides a basis on which SPL data can be described in more manageable entities, using decisions to provide a high-level mapping of features, which are implemented by components. This basis provides a useful platform on which to apply cognitive theory and interactive visualisation techniques to address the management of this complexity.

6.1 Cognitive Benefits

The approach presented here uses, as part of its basis, the theory of augmented thinking using visual queries on visualisations: cognitively, constructing a visual query entails identifying a visual pattern that will be used by a mental search strategy over a graphical visualisation [17]. Below are three of the most salient points of this theory.
1. Data should be presented in such a way that informative patterns are easy to perceive.
2. The cognitive impact of the interface should be minimised so that thinking is about the problem and not the interface.
3. The interface should be optimised for low-cost, rapid information seeking.

In our approach, the main visual pattern of importance is a colour-encoded sphere. The interface itself is, for the most part, a spatial container for those visual patterns, where additional information is retrieved on demand. In brief, through these core ideas we attempt to emulate the above key points. Next we discuss some of the specific techniques employed.

Focus+Context describes the ability to work at a focussed level while maintaining the overall context within which you work. The 3D container in which the relation sets are rendered provides a mapping of each of the models on its axes, and we argue that this provides the perception of the SPL as a whole while working with individual elements and relationships. Distortion techniques (transparency) and filtering allow exploration of relevant data while keeping complexity in the background; the transparent relation sets also act as pull cues to draw the user's attention to this additional complexity. Details on demand, dynamic queries and animation are techniques implemented through mouse interactions with both the axis labels and the relation sets. Colour encoding, guided by work such as that carried out by Kerbs [18], aims to provide preferred aesthetics. The 3D nature of the visualisation supports the world-in-hand metaphor (which inherently employs kinetic depth cues and motion parallax), allowing the user to manipulate the visualisation through rotation, panning and zooming for appropriate viewing.

6.2 Benefits to Feature Configuration

We argue that, by providing a visualisation based on enhancing cognition through the use of visualisation techniques and cognitive theory, a stakeholder can be supported in their task of making configuration decisions while deriving a new variant from a large-scale SPL. By supporting the stakeholder in this way, we argue that the feature configuration process becomes less complicated and hence less error-prone and more efficient. With this approach, the complexity inherent in a large SPL is broken down into more manageable blocks. Within the context of the visualisation, which perceptually contains all three models as a whole, the stakeholder can work with individual model elements and their relationships while keeping that context. Using this approach, the stakeholder can explore and understand the complex relationships that exist in an incremental fashion, allowing informed judgement of the possibilities and effects of a particular configuration step. For example, using transparency, the user can keep transitive complexity in the background until desired. Importantly, high-risk or possibly problematic instances can be easily identified while evaluating decision selections. For example, the appearance of a number of red relation sets would indicate that the current selection warrants further investigation as to the impact of, and alternatives to, those eliminated features.
6.3 Limitations

Many of the limitations discussed here are a result of additional implementation work that still needs to be carried out. This additional work is currently being undertaken and is also discussed in Section 7. The magnification/fisheye implementation on the axes (particularly the Z-axis) is presently rudimentary and will be enhanced to increase its effectiveness. Having multiple relation sets representing the same feature could be considered redundant if the user is only interested in the features themselves at a given time. One possibility to address this is to allow removing or combining "redundant" relation sets on demand where appropriate. A traditional feature model view is not available; a tree view layer that can be displayed on demand for a variety of purposes, including showing a partial feature model, is planned.
7 Conclusion and Future Work

The elicitation of expert opinion is deemed of great importance as part of the next steps to evaluate and guide the future direction of our relationship visualisation approaches. The modelling of a large commercial system based on our meta-model is currently in progress, with a planned evaluation to follow. In addition, planned work for the immediate future will be aimed at addressing, through further implementation, the main limitations that exist. This work will mostly be concerned with providing additional supporting task-based information using dynamic queries and details-on-demand techniques. Work to support ease of use and perception is also planned.

In conclusion, this paper builds on previous work in employing visualisation theory and techniques to address complexity issues in SPL feature configuration. Specifically, it reports on a visualisation implementation based on previously published ideas and discussions. We argue that such an implementation can enhance stakeholder cognition during feature configuration, providing the basis for a more efficient and less error-prone process. The approach focuses on representing the relationships that exist between and within three separate but integrated models as the primary visual elements in a 3D visualisation. We discuss the benefits and limitations of the approach using an illustrated example.
References

1. SEI: Software Product Lines, http://www.sei.cmu.edu/productlines/
2. Deelstra, S., Sinnema, M., Bosch, J.: Product Derivation in Software Product Families: A Case Study. Journal of Systems and Software 74, 173–194 (2005)
3. Hotz, L., Wolter, K., Krebs, T., Nijhuis, J., Deelstra, S., Sinnema, M., MacGregor, J.: Configuration in Industrial Product Families - The ConIPF Methodology. IOS Press, Amsterdam (2006)
4. Cawley, C., Thiel, S., Botterweck, G., Nestor, D.: Visualising Inter-Model Relationships in Software Product Lines. In: Proceedings of the 3rd International Workshop on Variability Modeling of Software-Intensive Systems (VAMOS), Seville, Spain (2009)
5. Cawley, C., Thiel, S., Healy, P.: Visualising Variability Relationships in Software Product Lines. In: 2nd International Workshop on Visualisation in Software Product Line Engineering (ViSPLE), Limerick, Ireland (2008)
6. Van Gurp, J., Bosch, J., Svahnberg, M.: On the notion of variability in software product lines. In: WICSA Proceedings, pp. 45–54. IEEE Computer Society, Los Alamitos (2001)
7. Kang, K., Cohen, S., Hess, J., Novak, W., Peterson, S.: Feature-oriented domain analysis (FODA) feasibility study. Technical Report CMU/SEI-90-TR-21. Software Engineering Institute, Carnegie Mellon University (1990)
8. Botterweck, G., Thiel, S., Nestor, D., Abid, S.B., Cawley, C.: Visual Tool Support for Configuring and Understanding Software Product Lines. In: The 12th International Software Product Line Conference (SPLC 2008), Limerick, Ireland (2008)
9. Pure-systems GmbH: Variant Management with pure::variants (2003-2004), http://www.pure-systems.com
10. BigLever Software: Gears, http://www.biglever.com
11. Rabiser, R., Dhungana, D., Grünbacher, P.: Tool Support for Product Derivation in Large-Scale Product Lines: A Wizard-based Approach. In: 1st International Workshop on Visualisation in Software Product Line Engineering (ViSPLE 2007), Tokyo, Japan (2007)
12. Rohr, O.: VisMOOS (Visualization Methods for Object Oriented Software Systems). University of Dortmund (2004), http://ls10-www.cs.uni-dortmund.de/vise3d/prototypes.html
13. Ali, J.: Cognitive support through visualization and focus specification for understanding large class libraries. Journal of Visual Language and Computing (2008)
14. Robertson, G., Cameron, K., Czerwinski, M., Robbins, D.: Polyarchy Visualization: Visualizing Multiple Intersecting Hierarchies. In: Conference on Human Factors in Computing Systems. ACM, Minneapolis (2002)
15. Risden, K., Czerwinski, M.P., Munzner, T., Cook, D.B.: An initial examination of ease of use for 2D and 3D information visualizations of web content. Int. J. Human-Computer Studies, 695–714 (2000)
16. Eclipse IDE, http://www.eclipse.org
17. Ware, C.: Information Visualization: Perception for Design. Morgan Kaufmann Series in Interactive Technologies, pp. 370–383. Morgan Kaufmann, San Francisco (2004)
18. Kerbs, R.W.: An Empirical Comparison of User Color Preferences in Electronic Interface Design. In: 19th International Symposium on Human Factors in Telecommunication, Berlin, Germany (2003)
A Visual Data Exploration Framework for Complex Problem Solving Based on Extended Cognitive Fit Theory Ying Zhu1, Xiaoyuan Suo2, and G. Scott Owen1 1
Department of Computer Science Georgia State University Atlanta, Georgia, USA 2 Mathematics and Computer Science Department Webster University St. Louis, Missouri, USA
Abstract. In this paper, we present a visual data exploration framework for complex problem solving. This framework consists of two major components: an enhanced task flow diagram and a data visualization window. Users express their problem solving process and strategies using the enhanced task flow diagram, while multiple frames of visualizations are automatically constructed in the data visualization window and are organized as a tree map. This framework is based on an extended Cognitive Fit Theory, which states that a data visualization should be constructed as a cognitive fit for specific tasks and a set of data variables. It also states that the structure of multiple data visualizations should match the structure of the corresponding tasks. Therefore, in our framework, data is presented in either visual or non-visual format based on the cognitive characteristics of the corresponding task. As users explore various problem solving strategies by editing the task flow diagram, the corresponding data visualizations are automatically updated for the best cognitive fit. This visual data exploration framework is particularly beneficial for users who need to conduct specific and complex tasks with large amounts of data. As a case study, we present a computer security data visualization prototype.

Keywords: information visualization, task, Cognitive Fit.
1 Introduction

The aim of this research is to use the Cognitive Fit Theory to guide the design and construction of effective data visualizations for complex problem solving. To achieve this we extend the Cognitive Fit Theory [1] in two areas. First, we propose the concept of a new three-way, task-data-visualization cognitive fit, as opposed to a two-way task-visualization cognitive fit. Second, we argue that it is important to maintain the cognitive fit between the task structure and the visualization structure. We call this an Extended Cognitive Fit Theory. The Extended Cognitive Fit Theory leads to a new visual data exploration framework for complex problem solving. This framework consists of two main components: an enhanced task flow diagram and a data visualization tree map. The complex
problem solving process is captured in the task flow diagram. Each data visualization is constructed based on the cognitive fit principles for each individual task, while all the data visualizations are organized in a tree map to match the structure of the task flow. The proposed system alleviates the cognitive load in problem solving by providing the best cognitive fit based on our current knowledge of visualization. This framework can be applied to a wide variety of domain areas, and is suitable for experienced users with specific tasks in mind. It is particularly suitable for collaborative problem solving. Later in this paper, we demonstrate an example of this framework in computer security visualization.
2 Background and Related Work

2.1 Cognitive Fit Theory

The Cognitive Fit Theory [1] is an investigation of the fit of technology to task, the user's view of the fit between technology and task, and the relative importance of each to problem-solving or decision-making performance. A main theme of this theory is that performance improves markedly when the problem representation matches the task. Although the Cognitive Fit Theory has been used successfully to analyze the task performance of graphs and tables as well as software comprehension, it has its limitations. First, the tasks in the study are simple information read-off tasks; the theory does not account for complex tasks that involve sub-tasks in a hierarchical structure. Second, the theory only considers the cognitive fit between problem representation and task, not the cognitive fit between problem representation and data. In Section 3, we address these issues by extending the Cognitive Fit Theory.

2.2 Task-Based Visualization Design

An implication of the Cognitive Fit Theory is that a data visualization is effective when it is a cognitive fit for a task. The data exploration framework described in this paper is for task-centered visualization design. It is particularly suitable for the complex problem solving process in which users have specific tasks and task structures in mind. There are a number of differences between our visual data exploration framework and other task-centered visualization methods [2-6]. In our system, users are required to specify their tasks in the form of an enhanced task flow diagram; in other systems, the tasks are not explicitly specified. The benefits of the enhanced task flow diagram are twofold. First, it helps the computer program to automatically identify the tasks and data variables, and then select visualizations based on cognitive fit principles. Second, the enhanced task flow diagram can facilitate collaborative problem solving, which is not addressed by previous task-centered visualizations.
3 Extending the Cognitive Fit Theory

In this section we describe our efforts to extend the Cognitive Fit Theory.
3.1 Cognitive Fit among Visualizations, Tasks, and Data

The Cognitive Fit Theory establishes the importance of the cognitive fit between different data representations and different tasks. Here we argue that the cognitive fit between visualization and data is equally important. Previous studies in data visualization and cartography have shown that certain types of data are better visualized by certain types of visualization. For example, Bertin [7] points out that particular visual variables are perceptually suitable or unsuitable for certain types of data; for instance, color is neither quantitative nor ordered. In addition, some visualization techniques are designed for specific types of data: the parallel coordinates technique is designed for high-dimensional data sets, and the tree map technique is designed for hierarchical data sets. Therefore both the cognitive fit between visualization and data and the cognitive fit between visualization and task should be considered when constructing data visualizations. In fact, these two cognitive fits address two different aspects of visualization design, accuracy and efficiency: a visualization design is a good cognitive fit for data when it accurately represents the data, and it is a good cognitive fit for a specific task when it improves task performance over non-visual representations.

3.2 Extending the Cognitive Fit Theory for Complex Tasks

One limitation of the original Cognitive Fit Theory is that it is concerned mainly with simple information read-off tasks. Our goal is to extend it to more complex tasks. In this study, we define complex tasks as having the following characteristics:

1. the task can be divided into multiple sub-tasks;
2. the overall task involves at least four data variables;
3. the sub-tasks have either a linear or a hierarchical structure.

This definition is based in part on a taxonomy proposed by Quesada et al. [8]. We extend the Cognitive Fit Theory by introducing an additional principle: data visualizations should be structured to match the task structure. The original Cognitive Fit Theory deals with the cognitive fit between visualization and task; here we introduce the cognitive fit between visualization structure and task structure. These two principles guide our development of the framework for visualizing task structure, which is discussed in Section 4.
4 A Visual Data Exploration Framework

We envision a visual data exploration environment with two main components. The first component is an enhanced task flow diagram to help manage and visualize tasks and task structure. The second is a data visualization window with multiple visualization frames. The contents of these two windows are dynamically linked according to the extended Cognitive Fit Theory.

4.1 Enhanced Task Flow Diagram

In our framework, tasks are represented in a task flow diagram. The task structure is assumed to be a tree, which is a common data structure for organizing and storing problem solving activities [9].
Each node on the task tree represents an individual task. A task may be hierarchically divided into sub-tasks. For each task, users need to specify two things: task keywords and data variables. Task keywords are used to describe the action and purpose of the task, while data variables are the parameters needed to perform the specific task. In practice, the computer program should allow users to open data files, select parameters, and attach them to a task. A data variable can be aggregated from multiple data sources through a backend computing engine. Our task flow diagram differs from the traditional task flow diagram in that we require users to explicitly specify data variables for each task. This is important for analyzing each task and selecting data visualizations that are an appropriate cognitive fit. Thus we call it an enhanced task flow diagram, an example of which is shown in Figure 1. To solve a problem, users can add, delete, relocate, or edit tasks in the task flow diagram. They can change the data variables associated with each task, organize the task hierarchy by dividing a task into sub-tasks, split a task in two, or merge multiple tasks into one. The task flow diagram is then parsed by the computer program. For each task, task keywords and data variable types are extracted to select or construct a data visualization that is a cognitive fit, based on the method described in Section 4.2.
Fig. 1. An example of enhanced task flow diagram
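A minimal data structure for such an enhanced task flow diagram might look like the sketch below: each task node carries its keywords and typed data variables, and a parse step flattens the tree into one (keywords, data types) query per leaf task. The names and the example tasks are illustrative, not the authors' API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TaskNode:
    keywords: List[str]                                               # action/purpose of the task
    data_vars: List[Tuple[str, str]] = field(default_factory=list)    # (variable name, data type)
    subtasks: List["TaskNode"] = field(default_factory=list)

def parse(node):
    """Flatten the task tree into one (keywords, data types) query per leaf task."""
    if not node.subtasks:
        return [(node.keywords, [dtype for _, dtype in node.data_vars])]
    queries = []
    for sub in node.subtasks:
        queries.extend(parse(sub))
    return queries

root = TaskNode(["review"], subtasks=[
    TaskNode(["detect", "unusual", "behavior"],
             [("user id", "id"), ("failed login attempts", "count"), ("time", "time")]),
    TaskNode(["correlate"],
             [("source IP", "ip"), ("destination IP", "ip"),
              ("network connection duration", "duration")]),
])
queries = parse(root)
```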
The task flow diagram is dynamically synchronized with the data visualizations. That is, when users add, delete, relocate, or edit a task node, the corresponding data visualization is automatically updated so that the visualization structure always matches the task structure. The requirement for users to explicitly specify a task structure and its related data variables is a unique feature of our framework. Some may question the benefit of the task flow diagram as it may be tedious to create. Indeed, users who only want to explore the data without more complex tasks in mind may not benefit from constructing an enhanced task flow diagram. But for users who conduct complex tasks, the enhanced task flow diagram has a number of benefits. First, an enhanced task flow diagram allows the computer program to automatically select data visualizations based on extended cognitive fit principles. Second, the task flow diagram can be very useful for collaborative problem solving.
Table 1. An example of a cognitive fit table designed for computer security visualization

| Task keywords (verbs) | Task keywords (adj.) | Task keywords (nouns) | Data variables | Visualization types |
|---|---|---|---|---|
| Detect; monitor; examine; inspect; investigate; look for; review; identify | Unusual, unexpected, suspicious, acceptable, compromised, unauthorized | behavior, characteristics, change, intrusion, signs | User ID; time; successful login attempts; failed login attempts | Bar chart; line chart; scatter plot |
| | | | User ID; time; failed attempts to access password files | Bar chart; scatter plot |
| | | | User ID; changes in user identity | Bar chart |
| | | | Time; number of files added; number of files deleted | Bar chart; line chart; scatter plot |
| | | | Time; number of file system warnings | Bar chart; line chart; scatter plot |
| | | | Time; number of system processes at any given time; number of user processes at any given time | Bar chart; line chart |
| | | | Source IP; destination IP; network connection duration | Pixel-oriented visualization; parallel coordinates; graph |
| Correlate | | | Network sockets currently open | Pixel-oriented visualization |
| | | | Time; number of network probes and scans for a given time period | Bar chart; line chart; scatter plot |
| | | | Successful connections: protocol; port; source IP; destination IP | Pixel-oriented visualization; parallel coordinates |
| | | | User ID; login time; logout time; source IP; destination IP; network connection duration | Parallel coordinates; multiple views |
It is essentially a visual language for describing a specific problem solving strategy and expertise [10], which can be shared or reused. Psychological studies have shown that a shared visualization increases the efficiency of the collaboration, the product of the collaboration, and the enjoyability of the collaboration [11]. Part of the reason is that
these visualizations are external representations of one's thoughts, and can be shared with others and reasoned on collectively [12]. In addition, a team can divide the problem solving process into multiple sub-tasks, let each individual build a task flow diagram for each sub-task, and later integrate the multiple diagrams into a single one.

4.2 Data Visualization

In this section, we discuss our method to select data visualizations that are cognitive fits for specific tasks and data. After the user edits the task flow diagram, data visualizations are automatically constructed based on the Extended Cognitive Fit Theory. The key is to codify the theory in a format that can be read by a computer program. Our solution is to build a cognitive fit table. An example of such a table for computer security visualization is shown in Table 1. A cognitive fit table has three columns: task keywords, data types, and visualization types. Task keywords are obtained from domain-specific hierarchical task analysis [13]. The basic technique of hierarchical task analysis is task decomposition, in which high-level tasks are broken down into subtasks. The outcome of this analysis is a list of task keywords (e.g., Table 1). Developers also need to come up with a data type classification for the domain. Each data variable in the task flow diagram has two components: name and type. The data types are either automatically detected from the name of the variable (e.g., time, price, etc.) or specified by users. The reason for specifying data types in the enhanced task flow diagram is to help select a visualization that is a cognitive fit for both the task and the data.

Each row of the table represents a particular "cognitive fit" between (task, data) and visualization. The challenge is to collect and identify the available "cognitive fits" between (task, data) and visualization in a particular domain. Ideally these cognitive fits should be based on rigorous user studies, but unfortunately such studies are rare. We therefore propose a heuristic method to find the cognitive fits:

1. The developers first come up with an initial set of cognitive fit pairs between (task, data) and data visualization, based on anecdotal evidence from the existing data visualization literature or on their own experience. Such anecdotal evidence about a good fit between a particular data visualization and data or task exists in many publications, and many visualization techniques are designed for a specific data type or task.

2. There will be ambiguity or uncertainty in the initial cognitive fit table. For example, a (task, data) combination may have multiple visualization types as a potential cognitive fit. Our solution is to gradually build up a ranking system for these visualization types through usage analysis. Specifically, at the beginning, multiple visualizations are displayed for each (task, data type) pair. Each (task, data, data visualization) triple has an initial ranking number. Each time a user uses and interacts with a data visualization, its ranking number is increased; on the other hand, if a user closes a data visualization without interacting with it, its ranking number is decreased. Over time, with a sufficiently large number of uses, these ranking numbers codify the cognitive fit among task, data type, and data visualization.

3. The benefit of this method is that it is possible to create personalized cognitive fit data for an individual user or a group of users, as the ranking numbers are dynamically adjusted while users continue to use the program. For example, as users become more experienced, they may feel more comfortable using complex visualization types, which will then be ranked higher. Initially, the selection of data visualizations is based on a relatively simple match between task, data type, and data visualization. As more usage data are collected and a sufficiently stable cognitive fit ranking system is established for a particular domain, it will become possible to develop more sophisticated and intelligent selection algorithms using techniques such as artificial neural networks or fuzzy logic.
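The table lookup and the usage-based ranking can be prototyped as in the following sketch; the matching score (number of shared keywords plus data types) and the plus/minus-one rank update are our own simplifications of the heuristic described above, with illustrative names throughout.

```python
class CognitiveFitTable:
    def __init__(self):
        # Each row: [task keywords, data types, visualization type, usage rank].
        self.rows = []

    def add(self, keywords, data_types, vis_type, rank=0):
        self.rows.append([set(keywords), set(data_types), vis_type, rank])

    def select(self, task_keywords, task_data_types, top_n=3):
        """Rank rows by matched keywords + data types; break ties by usage rank."""
        def score(row):
            kw, dt, _, rank = row
            return (len(kw & set(task_keywords)) + len(dt & set(task_data_types)), rank)
        matched = sorted(self.rows, key=score, reverse=True)
        return [row[2] for row in matched[:top_n]]

    def feedback(self, vis_type, interacted):
        """Usage analysis: raise the rank of used visualizations, lower ignored ones."""
        for row in self.rows:
            if row[2] == vis_type:
                row[3] += 1 if interacted else -1

table = CognitiveFitTable()
table.add(["detect", "unusual", "behavior"], ["id", "time", "count"], "bar chart")
table.add(["correlate"], ["ip", "duration"], "parallel coordinates")
print(table.select(["detect", "behavior"], ["time", "count"]))
```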
Based on the extended Cognitive Fit Theory, the visualization structure and the task structure should match. While task structure is about the spatial relationship among tasks, visualization structure is about the spatial relationship among visualizations. In our framework, data visualizations are organized as a tree map because it is the most efficient way to use the display space. The hierarchical relationship among data visualizations in the tree map reflects the hierarchical relationship among the task nodes in the task flow diagram. Each branch in the task flow diagram occupies a rectangular region in the tree map. If that branch has multiple sub-branches, that region is further divided into smaller regions until a leaf task node is reached. The task keywords and data information are then extracted from that leaf task node and are used to construct a data visualization, using the method described earlier in this section. The data visualization is then displayed in the corresponding region of the tree map.

4.3 The Problem Solving and Data Exploration Process

In this section, we describe the problem solving and data exploration process in our proposed framework from a user's perspective. First, a user creates an enhanced task flow diagram, using task keywords to describe each task, and then attaches data and data types to each task. After the user creates the task flow diagram, the computer program parses the diagram, extracting keywords and data types, which are used to search the cognitive fit table and look for a cognitive fit. A cognitive fit is selected by matching the target keywords and data types with the ones stored in the table. The row with the highest number of matched task keywords and data types is selected, and the data visualization types stored in this row are then selected for display. The data variables are mapped to visual variables based on data types and pre-determined rules, but users can change the mapping between data variables and visual variables (e.g., shape, color, etc.). If there are multiple visualizations in a row, then multiple visualizations are displayed. As described in Section 4.2, each data visualization will be dynamically ranked based on usage analysis; eventually the selection of data visualizations will be based on the cognitive fit ranks. If no appropriate data visualization is found in the cognitive fit table, the data is displayed in symbolic format (e.g., a table). As users delete, add, edit, merge, split, or relocate tasks in the task flow diagram, the data visualizations are automatically constructed and updated.
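A slice-and-dice layout is one simple way to make the visualization structure mirror the task tree: split the display rectangle recursively, alternating the split direction per level, and return one region per leaf task. The sketch below is our own minimal version (reusing the hypothetical TaskNode structure shown earlier), not the authors' layout code, and it ignores weighting of regions by importance or data size.

```python
def treemap(node, rect, horizontal=True):
    """Assign a rectangle (x, y, w, h) to every leaf task of the task tree.

    node : a TaskNode-like object with a 'subtasks' list.
    rect : the rectangle available to this branch of the task tree.
    """
    if not node.subtasks:
        return {id(node): rect}
    x, y, w, h = rect
    regions = {}
    n = len(node.subtasks)
    for i, sub in enumerate(node.subtasks):
        if horizontal:                       # slice the region left to right
            sub_rect = (x + i * w / n, y, w / n, h)
        else:                                # slice the region top to bottom
            sub_rect = (x, y + i * h / n, w, h / n)
        regions.update(treemap(sub, sub_rect, not horizontal))
    return regions

# regions = treemap(root, rect=(0, 0, 1024, 768))
```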
5 A Case Study in Computer Security Visualization

In this section, we briefly discuss a prototype that we have developed for computer security visualization.

5.1 Task Analysis and Data Types

We divide computer security tasks into several categories: problem detection, problem identification and diagnosis, problem projection, and problem response. For example, a user first detects some anomalous behavior through network security visualization, and then works to find out what the problem is and its possible causes. If it is a threat, the user will try to assess the projected impact of this threat on the organization. Then a solution is developed in response to the threat. Currently, computer security visualization is mainly concerned with problem detection. Therefore, most of the task keywords we have collected fall into this category. These keywords were collected from the computer security literature (e.g., [14]).
Fig. 2. An example of our prototype computer security visualization
For data types, we adopt the comprehensive list of computer security data categories and types found in [14], Table 5.2. Typical network security data include log data, alerts, network activities, system activities, access records, user information, etc. Part of the cognitive fit table is shown in Table 1, which contains a partial list of task keywords, data types, and visualization types. We are not able to show the entire table due to space limitations.

5.2 Implementation and Preliminary Results

Our prototype is implemented in Java. Figure 2 shows a screen shot of our security visualization system. The task flow diagram shows three tasks – check network connections, check networking activities, and review user information. The corresponding data visualizations are constructed based on the extended Cognitive Fit Theory and organized in a tree map to match the task structure. In this example, the data file is the log file generated by the open source intrusion detection software SNORT (http://www.snort.org).
6 Conclusion and Future Work

The research presented in this paper makes two major contributions to the field of information visualization. First, we improve the Cognitive Fit Theory by extending the concept of visualization-task cognitive fit to a more comprehensive visualization-task-data cognitive fit. We also introduce the principle that the visualization structure should be a cognitive fit of the task structure. Second, based on the extended Cognitive Fit Theory, we developed a visual data exploration framework for complex problem solving. Unlike previous task-centered visualization design methods, our visual data exploration framework is designed for complex problem solving and collaborative problem solving.
References
1. Vessey, I.: Cognitive Fit: A Theory-Based Analysis of the Graphs Versus Tables Literature. Decision Sciences 22, 219–240 (1991)
2. Bautista, J., Carenini, G.: An Integrated Task-Based Framework for the Design and Evaluation of Visualizations to Support Preferential Choice. In: Proceedings of the Working Conference on Advanced Visual Interfaces, AVI (2006)
3. Hibino, S.L.: A Task-Oriented View of Information Visualization. In: Proceedings of CHI: Late-Breaking Results (1999)
4. Ignatius, E., Senay, H., Favre, J.: An Intelligent System for Task-specific Visualization Assistance. Journal of Visual Languages and Computing 3, 321–338 (1994)
5. Treinish, L.A.: Task-Specific Visualization Design. IEEE Computer Graphics & Applications, 72–77 (September/October 1999)
6. Casner, S.M.: A task-analytic approach to the automated design of graphic presentation. ACM Transactions on Graphics 10, 111–151 (1991)
7. Bertin, J.: Semiology of Graphics. University of Wisconsin Press (1983)
8. Quesada, J., Kintsch, W., Gomez, E.: Complex problem-solving: a field in search of a definition? Theoretical Issues in Ergonomics Science 6, 5–33 (2005)
9. Simmons, R., Apfelbaum, D.: A Task Description Language for Robot Control. In: Proceedings of the 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems, Victoria, B.C., Canada (1998)
10. Casner, S., Bonar, J.: Using the expert's diagram as a specification of expertise. In: Proceedings of IEEE Symposium on Visual Languages (1988)
11. Heiser, J., Tversky, B., Silverman, M.: Sketches for and from collaboration. In: Gero, J.S., Tversky, B., Knight, T. (eds.) Visual and Spatial Reasoning in Design III, pp. 69–78. Key Center for Design Research, Sydney (2004)
12. Suwa, M., Tversky, B.: External Representations Contribute to the Dynamic Construction of Ideas. In: Hegarty, M., Meyer, B., Narayanan, N.H. (eds.) Diagrams 2002. LNCS (LNAI), vol. 2317, p. 341. Springer, Heidelberg (2002)
13. Crandall, B., Klein, G., Hoffman, R.R.: Working Minds: A Practitioner's Guide to Cognitive Task Analysis. MIT Press, Cambridge (2006)
14. Allen, J.H.: The CERT(R) Guide to System and Network Security Practices. Addison-Wesley Professional, Reading (2001)
Energetic Path Finding across Massive Terrain Data Andrew Tsui and Zoë Wood California Polytechnic State University, USA
Abstract. Throughout history, the primary means of transportation for humans has been on foot. We present a software tool which can help visualize and predict where historical trails might lie through the use of a human-centered cost metric, with an emphasis on the ability to generate paths which traverse several thousand kilometers. To accomplish this, various graph simplification and path approximation algorithms are explored. We show that it is possible to restrict the search space for a path finding algorithm while not sacrificing accuracy. Combined with a multi-threaded variant of Dijkstra's shortest path algorithm, we present a tool capable of computing a path of least caloric cost across the contiguous US, a dataset containing over 19 billion data points, in under three hours on a 2.5 GHz dual-core processor. The potential archaeological and historical applications are demonstrated on several examples.
1 Introduction
Anthropologists, archaeologists, and historians have spent a great deal of time uncovering the routes taken by ancient travelers on trade routes with long-distance neighbors, or as they foraged for food around their camp [1]. As a general rule of thumb, humans are often quite proficient in finding the most efficient path of travel, including those paths that traverse great distances. We present an algorithm for computing and visualizing human-centered paths across large datasets, for example the contiguous US. To this end, this work explores graph simplification and various path approximation algorithms in order to create solutions for massive out-of-core data. This work provides an efficient means of generating paths by restricting the search to a subset of the original data. In addition, visualization techniques to compare potential paths, foraging grounds, and alternate destinations are presented. The application is interactive and allows the user to select start and end locations, as well as one of several path computation algorithms. Satellite imagery is utilized to provide a 93,600 by 212,400 elevation and landcover data grid covering the contiguous US. The available path finding algorithms are Dijkstra's, Fast Dijkstra's (a multi-threaded variant of Dijkstra's introduced in this work), A∗, and Single-Query Single Direction PRM (a Probabilistic Road Map algorithm). Most computations are divided into a global and a detailed search phase. The global search phase identifies a rough path using a simplified dataset. This
rough path is then used to significantly reduce the total memory and computational time required for the detailed search phase by ensuring that only relevant areas of the terrain are searched. Our application is written in C++ using the OpenGL graphics API and Berkeley Database. Our results show that path computation over massive out-of-core datasets is possible. We conclude that using our Fast Dijkstra variant provides the best results in terms of accuracy. The contributions of this work include:
– Tools for managing and performing energetic analysis on massive out-of-core datasets. In particular, a restrictive tiling scheme is constructed which significantly reduces the search space without reducing accuracy.
– A multi-threaded bidirectional version of Dijkstra's shortest path algorithm that does not suffer any accuracy loss.
– A comparison between multiple path computation algorithms in terms of runtime, memory usage, and accuracy.
– Visualizations of the terrain from a human traveler's perspective.
2 Previous Work
This work utilizes some of the significant work in the area of path finding algorithms, namely, Dijkstra's [2], A∗ [3], and Probabilistic Road Maps [4]. Additionally, our application builds on two previous projects involving human-centered paths across terrain data: Energetic Analyst [5] and Continuous Energetically Optimal Paths [6]. Brian Wood's Energetic Analyst tool demonstrated the importance of using a human-centered, as opposed to distance-centered, metric for determining the routes of travel for archeology applications. Due to algorithmic constraints, this work could not be applied to large terrain datasets. The work of Jason Rickwald, Continuous Energetically Optimal Paths (CEP), built on Energetic Analyst while utilizing the Fast Marching Algorithm (FM) [7] for path computations. FM has the benefit of allowing paths to cross a grid face, rather than being constrained to the grid edges. In addition, Rickwald introduced a multi-threaded variant of Fast Marching, which significantly reduced the runtime but introduced a small error in the energetic path computation. Rickwald addressed the memory limitations encountered in Energetic Analyst by providing a mechanism for swapping terrain data between memory and disk. However, algorithmic and data format issues hindered the capability of analyzing very large datasets. For example, a path computation across a dataset covering most of the state of Oregon required almost a full day to compute.
3 Algorithms
A goal of this work is to visualize the terrain and optimal path from one point on the terrain to another. As the dataset for the contiguous United States comprises over 19 billion data points, it was necessary to develop tools to provide a highly simplified, yet acceptably accurate, representation of the dataset for real-time
interaction. This simplification was accomplished by first breaking the dataset down into 434 × 1,000 non-overlapping rectangular clusters. These dimensions were chosen as they maintain the approximate latitude-to-longitude ratio across the US. As a pre-process, the tool individually loads each cluster and performs a simplification on those data points to obtain a single representative data point, using either the average or the median of all the points in the cluster. Path computation occurs when the user selects two arbitrary points on the displayed simplified terrain. The latitude and longitude of these two points are then passed to a variant of Dijkstra's shortest path algorithm that is optimized for the particular graph structure. The resulting path serves as an approximation to the actual path and is visually overlaid onto the simplified terrain. See Figure 1a. To address the large search space needed when analyzing large datasets, we present a restrictive tiling scheme in which the search space is drastically reduced. First, the full dataset is divided into several hundred tiles. Note that these tiles are different from the simplification clusters, as they contain many more data points. Next, the tiles crossed by the approximate path generated on the simplified dataset are identified. Finally, the algorithm searches within these tiles to construct the detailed path. For robustness, a configurable buffer can be set so that if the path falls near the boundary of a tile, the neighboring tiles are also searched. This method is successful in limiting the search space without compromising the accuracy of the resulting path. This is a very important feature of our implementation as it prevents the large amount of data swapping that would occur if the path computation were allowed to run unrestricted on the full dataset. See Figure 1a.
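A minimal sketch of the tile-restriction step is given below, assuming tiles indexed by integer row/column and a coarse path given as grid points. Applying the buffer uniformly around every path point (rather than only near tile boundaries) and the absence of boundary clamping are simplifications for illustration.

```cpp
#include <set>
#include <utility>
#include <vector>

struct GridPoint { int row, col; };          // indices into the full dataset
using TileId = std::pair<int, int>;

// Collect the tiles crossed by the coarse path plus a buffer of neighbors.
// Only these tiles are then loaded during the detailed search phase.
std::set<TileId> tilesToSearch(const std::vector<GridPoint>& coarsePath,
                               int tileRows, int tileCols, int buffer) {
    std::set<TileId> tiles;
    for (const GridPoint& p : coarsePath) {
        int tr = p.row / tileRows;
        int tc = p.col / tileCols;
        for (int dr = -buffer; dr <= buffer; ++dr)
            for (int dc = -buffer; dc <= buffer; ++dc)
                tiles.insert({tr + dr, tc + dc});
    }
    return tiles;
}
```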
(a) Path and tiles
(b) Fast Dijkstra
Fig. 1. [Algorithm Visualizations] (a) Display of an in progress path computation on the full dataset. The light squares indicate tiles that are within the search space while grey tiles are currently loaded in memory. The simplified path (black) used to determine which tiles to search is also shown. (b) Shows the areas searched by each thread during a run of the Fast Dijkstra algorithm.
In order to compute a human-centered optimal path, we must choose graph weights that correspond to the amount of energy it would take a human to traverse that type of terrain. The equations used in this work to determine the caloric cost of travel between two points are the same as for Energetic Analyst
and CEP. These equations were determined and verified under various conditions [8,9]. First, the metabolic rate for traveling between two points is calculated based on the physical parameters of the subject and the slope (grade) of travel. There are two equations: one computes the metabolic rate when traveling on a positive grade (uphill), while the other is used for a negative grade (downhill); see [10] for the exact equations used. For this work, the average velocity when traveling across open ground is 1.34112 m/s, or approximately 4.8 km/hr. When traversing water, it has been experimentally determined that swimming at 0.7 m/s is roughly equivalent to running at 3.3 m/s [11]. Unfortunately, these equations can potentially under-predict the caloric cost when traveling downhill at certain velocities. Thus, the computed metabolic rate is compared to the metabolic rate while standing [12], with the larger of the two being used to complete the calculation. The metabolic rate gives the amount of energy expended over time, thus it is necessary to obtain the approximate time required to travel between the two points. Finally, the caloric cost is calculated and converted to kilocalories.
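The exact equations are given in [10]. Purely as an illustration of the edge-weight computation, the sketch below uses the load-carriage equation of Pandolf et al. [9] for the metabolic rate together with the standing-rate guard described above; the separate downhill correction equation is omitted, and all parameter names are illustrative assumptions.

```cpp
#include <algorithm>
#include <cmath>

// Illustration only (not the tool's exact formulation). Pandolf et al. [9]:
// W = body mass (kg), L = load (kg), V = velocity (m/s), G = grade (%),
// eta = terrain factor; result in watts.
double metabolicRateWatts(double W, double L, double V, double G, double eta) {
    return 1.5 * W + 2.0 * (W + L) * std::pow(L / W, 2.0) +
           eta * (W + L) * (1.5 * V * V + 0.35 * V * G);
}

// Caloric cost (kcal) of one graph edge of length 'distance' meters.
double edgeCostKcal(double distance, double rise, double W, double L,
                    double eta, double standingRateWatts) {
    const double V = 1.34112;                    // walking speed used above (m/s)
    double grade = 100.0 * rise / distance;      // percent grade
    double rate = metabolicRateWatts(W, L, V, grade, eta);
    // Guard against under-prediction on downhill grades, as described above.
    rate = std::max(rate, standingRateWatts);
    double seconds = distance / V;
    return rate * seconds / 4184.0;              // joules -> kilocalories
}
```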
3.1 Path Finding Algorithms
In order to find a path between the source and destination, a number of possible path finding algorithms are provided. The available algorithms are Dijkstra's, Fast Dijkstra's (a multi-threaded variant of Dijkstra's introduced in this work), A∗, and Single-Query Single Direction PRM (a Probabilistic Road Map algorithm). Dijkstra's shortest path algorithm [2] is a well-known and extremely pervasive algorithm for determining paths of least cost between two points on a graph. A∗ [3] is a slight modification of Dijkstra's algorithm which uses a heuristic to decrease the computation time. A Probabilistic Road Map (PRM) [4] is defined as a discrete representation of a continuous configuration space generated by randomly sampling the free configurations of a search space and connecting those points in a graph. PRM algorithms are designed for speed at the cost of accuracy; however, due to the probabilistic nature of the algorithms, it is possible to randomly produce an optimal path in a fraction of the time of other algorithms. PRM: The specific PRM algorithm used in this work is the Single-Query Single Directional PRM (SQPRM) [4]. The idea is to grow a tree-type path in random directions until the destination is found. However, since SQPRMs expand in a random fashion, it may require a large amount of time to randomly select and connect the destination node to the graph. Thus, the algorithm terminates when a node is examined that is sufficiently close to the destination node. The detailed pseudo-code for the algorithm used in this work is given in [10]. The PRM class of algorithms is designed to quickly construct a traversable path in a large search space, but is not concerned with the actual efficiency of the path. Thus, if a potential edge is not accepted, it can be assumed that edge will never be added to the graph. However, in the context of energetic paths, it is possible that a previously rejected edge may become viable at a later iteration. The approach taken in this work is to allow a node to be selected and
expanded multiple times, with the cost to reach directly connected (as opposed to all related) nodes being updated when appropriate. Thus, any change to a node's cost may be slowly propagated as the algorithm progresses. While this does not completely eliminate outdated information, it does provide a means to reevaluate certain edges and allows the algorithm to include path efficiency as a metric. However, the number of times a node is allowed to be updated can significantly impact the efficiency of the algorithm, which is explained in detail in [10]. Fast Dijkstra's: There has been significant recent work on developing parallelized path computation algorithms [6,13] to utilize the increasing number of cores within standard processors. However, the general problem with these algorithms is that they provide approximations of optimal cost paths, thus sacrificing accuracy for speed. This work presents a bidirectional implementation of Dijkstra's shortest path algorithm which uses two threads to capitalize on modern multi-core processors. This algorithm does not disrupt the optimality properties of Dijkstra's, thus providing optimal paths with a minimal amount of memory overhead. In essence, Dijkstra's algorithm is run separately in two threads, with one thread calculating the cost from the start node to the destination node while the second thread simultaneously calculates the cost from the destination to the start. The two threads meet roughly half-way to their respective goals, where one thread is given priority and is responsible for combining the results of the two threads. Care is taken to account for bidirectional graphs, which is important in this work as the caloric cost of traveling uphill differs from that of traveling downhill, as discussed above. Figure 1b illustrates the merging point of the two threads' fronts at the termination of the algorithm. For more detailed pseudo-code for the Fast Dijkstra algorithm, see [10].
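The following single-threaded sketch illustrates the bidirectional idea behind Fast Dijkstra; the tool described above runs the two searches in separate threads and operates on the tile-restricted, out-of-core graph, which is omitted here. The graph representation (forward and reverse adjacency lists, needed because uphill and downhill costs differ) is an assumption for illustration.

```cpp
#include <algorithm>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int to; double cost; };
using Adj = std::vector<std::vector<Edge>>;

// Returns the least-cost value from s to t (infinity if unreachable).
double bidirectionalDijkstra(const Adj& fwd, const Adj& rev, int s, int t) {
    const double INF = std::numeric_limits<double>::infinity();
    int n = static_cast<int>(fwd.size());
    std::vector<double> df(n, INF), db(n, INF);
    std::vector<char> setF(n, 0), setB(n, 0);
    using Item = std::pair<double, int>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> qf, qb;
    df[s] = 0; qf.push({0, s});
    db[t] = 0; qb.push({0, t});
    double best = (s == t) ? 0.0 : INF;

    auto step = [&](const Adj& g, std::vector<double>& d, std::vector<char>& done,
                    const std::vector<double>& dOther, auto& q) {
        auto [dist, u] = q.top(); q.pop();
        if (done[u]) return;
        done[u] = 1;
        for (const Edge& e : g[u]) {
            if (dist + e.cost < d[e.to]) { d[e.to] = dist + e.cost; q.push({d[e.to], e.to}); }
            if (dOther[e.to] < INF)                 // fronts meet: candidate path
                best = std::min(best, dist + e.cost + dOther[e.to]);
        }
    };

    while (!qf.empty() && !qb.empty()) {
        if (qf.top().first + qb.top().first >= best) break;   // optimal cost found
        if (qf.top().first <= qb.top().first) step(fwd, df, setF, db, qf);
        else                                  step(rev, db, setB, df, qb);
    }
    return best;
}
```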
4 Results
To demonstrate the tool's effectiveness, a number of paths were constructed on massive terrain datasets. All results were obtained on a 2.5 GHz Intel Core 2 Duo MacBook Pro running OS X 10.5.6 with 4 GB of 667 MHz DDR2 SDRAM and a 5400 RPM hard drive. California Indian Trails: To demonstrate the potential application to the fields of archaeology and anthropology, a trail was plotted between two Native American Indian tribes, one located within a valley between two mountain ranges, and the other located near the coast. California was chosen as a test site as historical records show evidence of healthy trade relations among many of the California Indian tribes. For the exact latitudes and longitudes used for the start and stop locations of this example and all others, see [10]. The distance between these two tribes necessitates searching most of southern California. Figure 2a displays the energetic path found by the tool, while Table 1a shows the runtime required for the different algorithms. As shown in the figure, the computed path closely follows the route used and documented by James Davis [14].
As can be seen in Table 1a, using the simplified dataset to get an approximate path and restricting the search space (with a small buffer) on the detailed dataset can still produce a perfectly accurate path. Notice that the required time is drastically reduced for both the Dijkstra and Fast Dijkstra variants when using restrictive tiling, but with no error in the path. In addition, A∗ provides a path with very little error while requiring even less time than the Fast Dijkstra algorithm. However, PRM took a substantially longer amount of time to complete and provided a highly inaccurate path. This does not conclusively determine the inappropriateness of PRM as the probabilistic nature of it means results may vary between runs.

Table 1. Path computations: Note that 'Dijk → DijkFast' indicates that Dijkstra's was used on the full dataset to determine which tiles to search using the Fast Dijkstra algorithm. For each result, the error percentage is based on the difference in cost between the indicated method and the cost obtained from running Dijkstra's unrestricted on the full dataset (marked with a *). Nodes indicates the number of data points that were analyzed. All costs are in kilocalories.
Method            Nodes        Runtime     Memory   Cost      Error
*Dijk             265,024,564  1h 21m 3s   2.36 GB  42,455.5  0.00%
Dijk → Dijk       92,548,510   23m 37s     1.35 GB  42,455.5  0.00%
Dijk → DijkFast   100,371,403  14m 26s     1.51 GB  42,455.2  0.00%
Dijk → A∗         39,576,328   11m 31s     1.35 GB  42,880.5  1.00%
Dijk → PRM        77,423,841   42m 3s      1.19 GB  54,417.4  28.17%
(a) California Indian Trail

Method            Nodes        Runtime     Memory   Cost      Error
*Dijk             505,896,251  2h 41m 50s  2.34 GB  74,653.2  0.00%
Dijk → Dijk       184,985,532  46m 5s      1.80 GB  74,653.2  0.00%
Dijk → DijkFast   181,351,914  26m 9s      1.95 GB  74,655.7  0.00%
Dijk → A∗         164,865,746  47m 54s     1.95 GB  75,776.5  1.50%
Dijk → PRM        164,464,736  2h 22m 29s  1.94 GB  97,402    30.47%
(b) Between Old Fort Boise, ID and Oregon City, OR

Method            Nodes        Runtime     Memory   Cost      Error
*Dijk             735,795,927  3h 18m 39s  2.36 GB  107,723   0.00%
Dijk → Dijk       252,449,787  1h 14m 59s  1.94 GB  107,723   0.00%
Dijk → DijkFast   285,154,704  37m 15s     1.95 GB  107,726   0.00%
Dijk → A∗         230,861,134  1h 12m 18s  1.91 GB  112,149   4.11%
Dijk → PRM        185,307,75   2h 24m 31s  1.94 GB  123,435   14.59%
(c) Between Moundville, AL and Hopewell, OH
(a) California Indian Trail
(b) Oregon Trail - Detailed View
Fig. 2. (a) An energetic path, possibly corresponding to the trails mapped by James Davis [14] shown in the upper right. (b) An energetic path overlaid with an estimate of the historic Oregon Trail [15].
Oregon Trail: To demonstrate a comparison against previous work while demonstrating a historical application, a path was computed following the Oregon Trail, a well-known trail taken by settlers journeying to the western United States in the mid-1800s. This example begins at Old Fort Boise in Idaho and ends at Oregon City in Oregon. Figure 2b shows the paths generated on the full dataset, while Table 1b presents the results. As clearly demonstrated in Figure 2b, the computed path closely follows the historical trail. Constructing a path across Oregon using the previous work, CEP, required most of a day. By contrast, the restrictive tiling scheme presented in this work significantly reduces the required time to typically less than an hour while still producing accurate paths. The results are comparable to those of the California Indian Trail in that both the Dijkstra and Fast Dijkstra methods produced accurate paths. Moundbuilders: To demonstrate a path computation over a significant distance, an energetic path is computed that may have been used between two prominent Moundbuilder villages: Moundville, Alabama and Hopewell, Ohio. Ancient burial and ceremonial mounds constructed by the Moundbuilders have been found all along the central and eastern United States. Excavations from these mounds have revealed flint from the Rocky Mountains, shells from the Gulf of Mexico, and other artifacts of distant origin [16]. This indicates that the
inhabitants engaged in extremely long-distance trade, although the actual trade routes remain somewhat of a mystery. Figure 3a visualizes the least-cost caloric path between the two sites and the results are presented in Table 1c. This longer path required increased runtimes and a larger number of nodes to be searched for each method. Notice that, compared to the results for the previous paths, PRM is able to construct a relatively accurate path in a small fraction of the time required by Dijkstra when run unrestricted on the full dataset. However, Fast Dijkstra is still able to produce a far more accurate path in similar time while consuming considerably less memory.
(a) Moundbuilders – Moundville to Hopewell
(b) CA-SLO-9
Fig. 3. (a) A possible trade route utilized by the Moundbuilders. While the paths from the simplified (dark grey) and full (light grey) datasets differ significantly, the simplified path is still sufficient to determine which full dataset tiles to search. (b) California archaeological sites with the caloric radius visualization. The likely boundary of the foraging region is shown as banded black and white regions. The dark body of water seen at the top of the image is Morro Bay in California.
California Archaeological Sites: In the archaeological community, a traditional means of determining a tribe's hunting and foraging grounds is to draw a circle on a map, centered on the tribe's camp, with a radius of several kilometers. However, this method does not account for the terrain type. For example, a hunting party can travel much further on flat terrain (thus extending the radius) than on very rocky or sloped terrain. Thus, a better metric for determining a tribe's hunting ground may be a caloric measure. Figure 3b shows archaeological sites in central California, designated CA-SLO-9. The caloric radius visualization shows the area in which the tribes located at these sites may have searched for food. Notice that when the elevation remains level, such as along the coast or through valleys, a much greater distance can be reached, while the converse is true for hilly regions.
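A minimal sketch of the caloric-radius idea follows, assuming the terrain is already represented as a graph with per-edge caloric costs; the function and parameter names are illustrative and the marked node set would then be rendered as the banded region of Fig. 3b.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int to; double kcal; };

// Dijkstra search from the camp site that stops expanding once the accumulated
// caloric cost exceeds a budget; the reached nodes approximate the foraging region.
std::vector<char> caloricRadius(const std::vector<std::vector<Edge>>& graph,
                                int camp, double budgetKcal) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> cost(graph.size(), INF);
    std::vector<char> reachable(graph.size(), 0);
    using Item = std::pair<double, int>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> q;
    cost[camp] = 0; q.push({0, camp});
    while (!q.empty()) {
        auto [c, u] = q.top(); q.pop();
        if (c > cost[u]) continue;            // stale queue entry
        reachable[u] = 1;
        for (const Edge& e : graph[u]) {
            double nc = c + e.kcal;
            if (nc <= budgetKcal && nc < cost[e.to]) { cost[e.to] = nc; q.push({nc, e.to}); }
        }
    }
    return reachable;                          // nodes within the caloric budget
}
```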
Visualizations: In addition to the specific result examples, the system allows for visualizations for further analysis of the paths and the terrain from the perspective of a walking human. Figure 4 demonstrates potential paths that end near the original destination and neighboring areas on a path that could be traversed without adding significant cost to the original path. These types of visualization are intended to help visualize other potential regions of interest when searching for a historical path.
(a) Nearby Paths: the black path indicates an alternate path. (b) Path Regions: reachable areas (shown as light regions near the path) are drastically decreased through valleys.
Fig. 4. Nearby path visualizations
Notes on the Results: Note that the results occasionally show a slight difference in cost between Dijkstra’s shortest path algorithm and the Fast Dijkstra variant. This is caused by floating point rounding differences during the caloric cost calculations, not an actual difference in paths produced. Also note that the results provided for the PRM algorithms are for a specific run. Due to the probabilistic nature of the algorithms, each run will likely produce a different result. In addition, A∗ generally utilizes an admissible heuristic function to estimate the cost from the current node to the destination. Admissible indicates that the heuristic must not overestimate the cost to the destination. For this work, the heuristic uses the Euclidean distance from the current node to the destination, combined with the metabolic rate associated with the corresponding grade, to estimate the total caloric cost. Unfortunately, this heuristic is not admissible in certain rare cases, which introduces a small error into some path computations.
5 Conclusions and Future Work
We present a set of tools that can be used to analyze massive out-of-core terrain datasets. We have demonstrated the efficiency and possible historical applications of our tools by performing several experiments to construct energetic paths across large distances using a variety of algorithms. Using our multi-threaded Dijkstra variant combined with restrictive tiling, we are able to construct an accurate energetic path across the United States in under three hours – a large
improvement over previous work in which a path across the state of Oregon required most of a day. Avenues for future work include exploring alternate datasets from around the world. Additionally, more advanced simplification methods may provide more accurate paths, allowing for the use of smaller tiles to further reduce the search space. Also, PRM optimizations could be explored, such as utilizing the fast runtime by running PRM multiple times and presenting the best result or a bidirectional approach for further speed increases.
References
1. Jones, T.: Personal Correspondence (2009)
2. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numerische Mathematik 1, 269–271 (1959)
3. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics 4, 100–107 (1968)
4. Hsu, D., Latombe, J.-C., Motwani, R.: Path planning in expansive configuration spaces. International Journal of Computational Geometry and Applications, 2719–2726 (1997)
5. Wood, B.M., Wood, Z.J.: Energetically optimal travel across terrain: visualizations and a new metric of geographic distance with anthropological applications. In: SPIE, vol. 6060, p. 60600 f. (2006)
6. Rickwald, J.: Continuous energetically optimal paths across large digital elevation data sets. Master's thesis, California Polytechnic State University, San Luis Obispo (2007)
7. Kimmel, R., Sethian, J.A.: Computing geodesic paths on manifolds. Proc. Natl. Acad. Sci. USA, 8431–8435 (1998)
8. Duggan, A., Haisman, M.F.: Prediction of the metabolic cost of walking with and without loads. Ergonomics 35, 417–426 (1992)
9. Pandolf, K.B., Givoni, B., Goldman, R.F.: Predicting energy expenditure with loads while standing or walking very slowly. Journal of Applied Physiology 43(4), 577–581 (1977)
10. Tsui, A.N.: Energetic path finding across massive terrain data. Technical Report CPSLO-CSC-09-02, Department of Computer Science, California Polytechnic State University, San Luis Obispo, California (2009)
11. Prampero, P.E., Pendergast, D.R., Wilson, D.W., Rennie, D.W.: Energetics of swimming in man. Journal of Applied Physiology 37, 1–5 (1974)
12. Harris, J.A., Benedict, F.G.: A biometric study of basal metabolism in man. Cornell University, Mann Library, Ithaca, New York (1919)
13. Weber, O., Devir, Y.S., Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Parallel algorithms for approximation of distance maps on parametric surfaces. ACM Trans. Graph. 27, 1–16 (2008)
14. Davis, J.T.: Trade Routes and Economic Exchange Among the Indians of California. University of California Archaeological Survey, CA (1961)
15. Historic Oregon City: End of the Oregon Trail Interpretive Center (2009), http://www.historicoregoncity.org/HOC/index.php?view=article&id=57
16. Fagan, B.M.: From Black Land to Fifth Sun: The Science of Sacred Sites. Addison-Wesley, Reading (1998)
The Impact of Image Choices on the Usability and Security of Click Based Graphical Passwords Xiaoyuan Suo1, Ying Zhu2, and G. Scott Owen2 1
Mathematics and Computer Science Department Webster University St. Louis, Missouri, USA 2 Department of Computer Science Georgia State University Atlanta, Georgia, USA
Abstract. Click-based graphical password systems, such as PassPoint [1], have received much attention in recent years. In this paper we describe our recent user studies on click-based graphical passwords. Results from the user study showed a relationship among usability, security, and the choice of image for the graphical password. We further conducted an analysis of attacking methods for click-based graphical passwords. The study highlights the vulnerability of click-based graphical passwords and enhances our understanding of the usability and security of graphical passwords. We also discuss a number of techniques to improve the usability and security of click-based graphical passwords. Keywords: graphical password, user studies, usability.
1 Introduction

Graphical passwords have been proposed as a possible alternative to text-based schemes, motivated partially by psychological studies [2] which show that humans can remember pictures better than text. In particular, click-based graphical passwords have received much attention in recent years [3]. However, relatively few usability studies have been done for graphical passwords. In addition, as Cranor et al. [4] noted, little work has been done to study the security of graphical passwords and possible attacking methods. In fact, some recent studies have shown that there are patterns in user-created graphical passwords. In reality, the security and usability [5] of graphical passwords are often at odds with each other, but both factors are critical to all authentication systems. In this paper, we present a brief user study to understand the impact of background picture choices on usability. The study highlights the vulnerability of click-based graphical passwords. As a result, we discuss several graphical password attacking methods based on the choice of pictures. This study aims to achieve a better balance between the security and usability of click-based graphical passwords through better choices of background images.
2 A Brief User Study on the Relationship of Password Usability and Background Image Selection

During this study, a group of undergraduate students participated; they were first asked to create a series of graphical passwords on three different images using the click-based graphical password scheme [6]. We selected three different photos, each representing a different theme. The three images were also analyzed using our edge-finding algorithm.
Fig. 1. Left: University Library Bridge, from the university website. Each of the objects in the scene is familiar to our study subjects. Right: Analyzing the image. We predict that the areas surrounded by white edges are the potential hot spots for graphical password.
Fig. 2. A photo taken from the NASA website, containing a large number of complex objects
2.1 User Tasks

A group of users were gathered to perform this brief study; they were either first- or second-year undergraduate students from various majors. First stage: the concept of the graphical password was introduced, and for security reasons we suggested that users choose a graphical password of 5 to 9 clicks. 20 voluntary submissions were received. Second stage: users were asked to recall their passwords after two days. Success rates were recorded.
Last stage: users were asked to submit answers to two questions to complete this brief study: (a) What is your strategy (if any) for selecting your click-based password? (b) Would you like to use this type of graphical password in a real application?
Fig. 3. Edges of the NASA image. A large number of edges are found, which reflects the complexity of the image. We predict that this image has the lowest memorability for users but the highest difficulty for attackers.
Fig. 4. Left: a random image with various meaningless shapes and colors; there is no easy visual target at first sight [7]. Right: edges of the random art image; the edges are very clear in this case.
2.2 Results

In the figures above, we combined all passwords created on each image. The first image demonstrated certain “hot spots” [8] on which users were most likely to click, for example the street light. Many people reported that some of the graphical passwords were hard to remember. Among the 20 users, 19 were able to remember their password from the first image; approximately 7 were able to remember their passwords from the second and third images. We later asked each user for the reasons behind the low memorability.
Fig. 5. Red dots represent user clicks. (a) Passwords users created on image 1. (b) Passwords users created on image 2. (c) Passwords users created on image 3.
The majority of the people reported that the graphical password scheme was still unfamiliar; some promised that, given more trials, they would be able to remember. Some others reported that the chosen images contain too many details.

2.3 Analysis of the User Clicks

For image 1, a large number of the user clicks fall into the encircled regions found by our algorithm (street lamp, windows), which in turn showed us that objects were one of the main reasons users selected their graphical passwords. Other types of clicks also exist, such as the five equally spaced clicks on the left edge of the image; the user who created this password later reported that location was the easiest approach. Even though image 2 was a rather complicated image with vague edges, user clicks largely fall on objects, for example the ladder, human eyes, and wall texture. From the distribution graph, we can see that in the bottom right corner, as the number of objects decreases, the number of clicks decreases. For image 3, user clicks largely distribute on the corners of the image or on the shapes. Some users reported there was no easy target at first sight, so location and color were the main reasons for such selections.
Fig. 6. (a) Analyzing the user clicks in image1 in edge only format. (b) Distribution of user clicks.
Fig. 7. Analyzing the user clicks in image2 in edge only format
Fig. 8. Distribution of user clicks
Fig. 9. (a) Analyzing the user clicks in image3 in edge only format. (b) Distribution of user clicks.
2.4 Questionnaire

Among the 20 users who submitted the questionnaire and graphical passwords, 9 reported that they prefer graphical passwords to the original text-based passwords. 8 users hesitated; some questioned the security of graphical passwords, while others were concerned that the password may create extra cognitive load. 3 people reported that they prefer the traditional text-based password, simply because text-based passwords are more familiar.
Fig. 10. (a) User preference of graphical password. (b) Password creation strategies.
Users also reported their password creation strategies; they were allowed to report multiple strategies that helped them in their selection. 6 people reported that shapes were the primary target during their selection. 9 users reported that objects were their first choice, for example the street light in image 1 and the ladders in image 2. 6 people also preferred location, such as the corners of the images.
One person reported that his favorite color was the main reason for his password selections, and one other person reported that he selected his password based on intuition.
3 Attacking a Graphical Password

3.1 Complexity of an Image

The complexity of the background image directly affects the usability of the graphical password. Some of the factors that define this complexity are listed below. Colors: We cannot always provide the user with meaningful pictures; especially when the graphical password is generated in a semi-automatic fashion, color can play a critical role. Objects: Objects in the image are another type of hot spot for graphical passwords. Face recognition [9] is one type of graphical password that uses objects as its main theme. Depending on the size of the objects and the proportion of the image they occupy, users may only be able to focus on one or a very limited number of objects at a time. Location and Shapes: There can be two types of shapes in a graphical password image: the shapes of the objects in the image, or the shape formed by the clicks. Based on our user study, the complexity of a background image does not have a linear relationship with the complexity of the graphical password.
allfactors = color + objects + location + shape + others

3.2 Image Choices vs. Usability of a Graphical Password
Fig. 11. (a) 3-D representation of the relationship among usability, all factors and security, (b) relationship between all factors and usability
Security and usability do not always go together; in many cases, security confines the growth of usability [5]. The relationship between all factors and the usability of a graphical password is parabolic-like. Initially, all factors are essential ingredients of a usable and secure graphical password. At a certain interval, usability reaches its highest point and continues that way until the factors become disturbing instead of helping. Security decreases along with usability when all factors become overwhelmingly large. The quantitative relationship between security and usability is still not clear to us. In fact, usability and security should be case-dependent and user-dependent.

3.3 Patterns in Clicks

In the brief user study we conducted, even though the majority of the users found the picture of the library bridge familiar, a few others still clicked their passwords in patterns. When a background image becomes too complex, the usability of the graphical password decreases significantly. In the NASA image shown in the previous section, because the picture contains so many objects, user clicks fall into the corners of the image in a very easy-to-predict pattern. Users tend to click in patterns when they feel the complexity of the image exceeds their cognitive capability. Recent work by Chiasson et al. [10] also showed that click patterns exist for click-based graphical passwords. Some of the patterns are shown in the figure below; examples include lines, V-shapes, and W-shapes.
Fig. 12. A recent study by Chiasson et al. [10] showed the significant patterns formed by user clicks
3.4 Attacking a Graphical Password

Human-assisted attacking methods should be a more appropriate approach. In this section, we discuss a few possible approaches to attacking click-based passwords. All-factors: The four factors we have discussed above are the fundamental methods of creating graphical passwords. Applying an edge-finding algorithm would automatically speed up the process of identifying the objects/colors/shapes in the image. However, not all automatically discovered objects/colors/shapes are equally likely to be clicked by the user. This process requires attackers to manually discard the factors that are less likely to be clicked on.
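The paper does not specify the edge-finding algorithm it uses; purely as an illustration of this step, the sketch below marks high-gradient pixels of a grayscale image with a Sobel operator. The marked regions are candidate objects and shapes that a human attacker would then filter manually.

```cpp
#include <cmath>
#include <vector>

// Returns a boolean mask of candidate click regions (strong edges). The
// threshold and image representation are illustrative assumptions.
std::vector<std::vector<bool>> edgeCandidates(
        const std::vector<std::vector<double>>& gray, double threshold) {
    int h = static_cast<int>(gray.size());
    int w = static_cast<int>(gray[0].size());
    std::vector<std::vector<bool>> edge(h, std::vector<bool>(w, false));
    for (int y = 1; y + 1 < h; ++y) {
        for (int x = 1; x + 1 < w; ++x) {
            // Horizontal and vertical Sobel responses.
            double gx = -gray[y-1][x-1] - 2*gray[y][x-1] - gray[y+1][x-1]
                        + gray[y-1][x+1] + 2*gray[y][x+1] + gray[y+1][x+1];
            double gy = -gray[y-1][x-1] - 2*gray[y-1][x] - gray[y-1][x+1]
                        + gray[y+1][x-1] + 2*gray[y+1][x] + gray[y+1][x+1];
            edge[y][x] = std::hypot(gx, gy) > threshold;
        }
    }
    return edge;
}
```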
Click-patterns: When the selection of all-factors fails, checking click patterns is essential. Previous work [10] proposed several types of click patterns based on their user studies; many other factors are still unclear, such as the size of the pattern, the location of the pattern, the orientation of the pattern, etc. Automated methods could focus on corners, edges of the image, and junctions of objects by applying established patterns at various sizes and orientations. Familiarity: Familiarity can also be called an educated guess. In the image of the library bridge, since this is a rather familiar scene to all students in the university, it is obvious that the university flag is very likely to be clicked on. The success rate of familiarity guesses should be less than that of the other factors, since familiarity is largely based on subjective reasons. Shoulder surfing: Graphical passwords are prone to shoulder surfing, especially click-based PassPoint techniques. Mouse-tracking algorithms could be employed to assist such attacks.
4 Background and Related Work

Chiasson et al. [10] found that click-based passwords follow distinct patterns, and that patterns occur independently of the background image. Our brief user study showed that user passwords do not always fall into patterns; click patterns actually occur only when the complexity of an image is higher than the tolerance. In fact, the user studies showed that patterns occur much less frequently than the other factors we mentioned above. Some other work [8, 11-13] suggested that hot spots occur in click-based passwords. The hot-spot analysis work by Julie Thorpe [8] suggested that, using an entirely automated attack based on image processing techniques, 36% of user passwords can be broken within 2^31 guesses (or 12% within 2^16 guesses) in one instance, and 20% within 2^33 guesses (or 10% within 2^18 guesses) in a second instance. We believe semi-automatic methods with help from humans will enhance the attack.
5 Conclusion and Future Work

In this paper, we discussed a brief user study conducted using three different images. Results from the user study suggested that the created graphical passwords are largely image-dependent. We also analyzed factors that may be involved in attacking a graphical password, and a few semi-automatic methods were further proposed. We did not attempt to break the passwords from the user studies, although the intriguing user study results suggested such potential. In fact, a complete set of attacking methods is our next step of research. The methods will be applied to the user studies, and the results will be compared with the actually created passwords.
References
1. Wiedenbeck, S., et al.: Authentication using graphical passwords: Basic results. In: Human-Computer Interaction International (HCII 2005), Las Vegas, NV (2005)
2. Shepard, R.N.: Recognition memory for words, sentences, and pictures. Journal of Verbal Learning and Verbal Behavior 6, 156–163 (1967)
3. Wiedenbeck, S., et al.: Authentication using graphical passwords: Effects of tolerance and image choice. In: Symposium on Usable Privacy and Security (SOUPS), Carnegie-Mellon University, Pittsburgh (2005)
4. Cranor, L.F., Garfinkel, S.: Secure or Usable? IEEE Security & Privacy (September/October 2004), pp. 16–18 (2004)
5. Suo, X., Zhu, Y., Owen, G.S.: Graphical Passwords: A Survey. In: Proceedings of the Annual Computer Security Applications Conference (ACSAC). IEEE, Tucson (2005)
6. Jermyn, I., Mayer, A., Monrose, F., Reiter, M.K., Rubin, A.D.: The Design and Analysis of Graphical Passwords. In: 8th USENIX Security Symposium, Washington DC (1999)
7. Wright, R.D.: Visual Attention. Oxford University Press, US (1998)
8. Thorpe, J., van Oorschot, P.C.: Human-Seeded Attacks and Exploiting Hot-Spots in Graphical Passwords. In: 16th USENIX Security Symposium, Boston, MA (2007)
9. Davis, D., Monrose, F., Reiter, M.K.: On user choice in graphical password schemes. In: 13th USENIX Security Symposium (2004)
10. Chiasson, S., et al.: User interface design affects security: Patterns in click-based graphical passwords (2008)
11. Chiasson, S., et al.: A Second Look at the Usability of Click-based Graphical Passwords. In: SOUPS (2007)
12. Dirik, A.E., Memon, N., Birget, J.C.: Modeling user choice in the PassPoints graphical password scheme. In: SOUPS. ACM, New York (2007)
13. Gołofit, K.: Click Passwords Under Investigation. In: Biskup, J., López, J. (eds.) ESORICS 2007. LNCS, vol. 4734, pp. 343–358. Springer, Heidelberg (2007)
Visual Computing for Scattered Electromagnetic Fields Shyh-Kuang Ueng and Fu-Sheng Yang Department of Computer Science, National Taiwan Ocean University, Keelung City, Taiwan 202
[email protected],
[email protected]
Abstract. In this paper, an innovative procedure is presented to simulate and visualize the scattered electric field from a target illuminated by radar waves. The propagation of radar waves is computed by using a ray-tracing method. Then the scattered field is computed at receivers which are located on concentric spherical surfaces surrounding the target. The scattered electric field is converted into RCS values and illustrated by using volume rendering and other graphical techniques. Compared with conventional radar simulation methods, our procedure produces more information but consumes less computational cost. The simulation results are better explored by using improved graphics techniques. Hence the scattered field is more comprehensible.
1 Introduction
As the electromagnetic waves emanating from a radar hit a target, the radar waves are reflected by the target and create a scattered field. If the scattered field is sensed by a radar antenna (a receiver), the target is detected. The scattered field is a 3D complex vector field which is hard to evaluate and understand. To simplify the analysis, the magnitude of the scattered field is converted into a real scalar field, called the Radar Cross Section (RCS). The RCS field is influenced by the target's shape, the direction and frequency of the radar waves, and the position of the receiver. By analyzing the RCS field, engineers can tune the radar such that it has a higher probability of detecting the target. On the other hand, engineers can also modify the target's shape to make the target stealthier to the radar.
1.1 Related Work
Many numerical methods have been proposed to simulate the propagation of radar waves. Among them, the Shooting and Bouncing Ray (SBR) method is comparatively accurate and easier to implement. In an SBR simulation, the radar is assumed to be at a far distance from the target. As the radar waves arrive at the target, they become planar waves and are replaced with parallel rays. Hence ray paths can be traced by using ray-tracing techniques [1]. Besides the ray paths, the
electric field along the rays is computed by using the Law of Geometrical Optics, and the scattered field is calculated based on the Law of Physical Optics [2,3]. SBR-based methods have been applied in many applications to solve electromagnetic problems. An SBR simulation for predicting the scattering effects of a radar covered by a dome is presented in [4]. Radar signatures are key features for target recognition. SBR methods are modified to compute radar signatures of airplanes, tanks, and ships in [2]. Besides radar simulations, SBR techniques are also prevalent in predicting radio wave propagation. In [5], a ray-tracing method is utilized for predicting indoor radio wave propagation. The distribution of radio signals inside a tunnel is solved by using an SBR algorithm in [6]. Similar research is presented in [7] to measure the propagation of microwaves in a room furnished with metal furniture. Other researchers apply SBR techniques to model wireless communications in urban environments [8]. Some graphical systems have been designed to display SBR simulation results. In [9], a graphical system is proposed to visualize RCS values. In their method, radar wave paths are rendered as lines such that wave propagation can be understood. A system called GRECO is proposed in [10] for performing a similar task; however, graphics hardware is utilized to speed up the rendering process. A powerful system, called XPATCH, has been built to compute radar signatures. This system employs the SBR method to compute scattered electric fields and uses graphical sub-systems to display target models, ray paths, and RCS patterns [11]. A graphical system designed using VTK libraries for post-processing RCS values is introduced in [12]. This program generates iso-surfaces, cut-planes, and volume rendering images to display RCS values, but the visualization quality can be improved. Another visualization method for RCS values is presented in [13]. The authors map an RCS field onto a spherical surface and distort the surface according to the RCS values. Hence the radiation pattern is revealed by the resulting surface.
1.2 Overview of the New SBR Method
Conventional SBR simulations focus on the computation of RCS values. Other important data are not necessarily extracted, for example, the paths and phases of radar waves and the hot spots on the target causing strong radiation. Furthermore, SBR simulations produce only numerical data, which are difficult to analyze. Though some graphical tools have been dedicated to RCS visualization, their functionalities are still limited. In this paper, a pipeline is presented for efficient simulation and visualization of the scattered electric field created by a target illuminated by radar waves. We re-arrange the conventional SBR procedure such that more essential information is extracted in one simulation, at a lower computational cost. Our pipeline is also equipped with improved graphical modules. Therefore, key features of the scattered field are better illustrated, and engineers can gain more insight into the scattered field. A radar system has two antennas. The first antenna, the transmitter, sends waves toward the target. The second antenna, the receiver, detects the scattered
waves. If the two antennas are located at the same position, the operation is in a mono-static mode. Otherwise, it is in a bi-static mode. The two radar modes are illustrated in part (a) of Fig. 1. The SBR method employs only one pair of transmitter and receiver. These two antennas are assumed to be far from the target, and only their directions are of concern in the simulation. Thus the resulting RCS field is a function of the designated directions of receiver and transmitter. Users have to repeat the computation numerous times to acquire the whole radiation pattern, which results in high computational overheads. We adjust the SBR model such that more features are extracted while consuming less computational cost. In our simulation environment, the target is enclosed by multiple concentric spherical surfaces. These spherical surfaces are called the ray buffers in this article. Receivers are uniformly distributed on the ray buffers such that radiation from the target will be detected in all directions and at various positions. The full radiation pattern of the target is captured in just one simulation. The conceptual structure of our simulation environment is shown in part (b) of Fig. 1. In this example, an airplane serves as the target, which is surrounded by three ray buffers. Radar rays are shot in parallel through a rectangle toward the target. This rectangle is called the ray window in this article. Besides changing the simulation environment, we add new data structures, numerical modules, and graphical sub-routines into the SBR simulation, and re-organize the entire process into a pipeline of three stages. At the first stage, a ray-tracing method is adopted to compute the geometrical optics (GO) data of the rays. The GO data include the ray paths, the reflection points (impact points), and the electric field at the impact points. The physical optics (PO) of the scattered field are computed at the second stage and converted into the RCS field. Then, at the third stage, the simulation results are passed to the visualization module to explore the radiation pattern, the ray paths, ray phases, and other key information.
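As an illustration of the simulation environment just described, the sketch below lays out receivers on the concentric ray buffers by sampling zenith and azimuth angles. The sampling scheme and the struct fields are assumptions; the paper only states that receivers are uniformly distributed over the spherical surfaces.

```cpp
#include <cmath>
#include <vector>

struct Receiver { double x, y, z, r, theta, phi; };

// Place receivers on each ray-buffer sphere at regularly sampled angles.
std::vector<Receiver> buildRayBuffers(const std::vector<double>& radii,
                                      int nTheta, int nPhi) {
    const double pi = 3.14159265358979323846;
    std::vector<Receiver> receivers;
    for (double r : radii) {
        for (int i = 0; i < nTheta; ++i) {
            double theta = pi * (i + 0.5) / nTheta;          // zenith angle
            for (int k = 0; k < nPhi; ++k) {
                double phi = 2.0 * pi * k / nPhi;            // azimuth angle
                receivers.push_back({r * std::sin(theta) * std::cos(phi),
                                     r * std::sin(theta) * std::sin(phi),
                                     r * std::cos(theta), r, theta, phi});
            }
        }
    }
    return receivers;
}
```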
Fig. 1. Radar wave simulation. (a) The conventional mono-static and bi-static radar modes. (b) The conceptual simulation environment of our method: the target is enclosed by ray buffers populated with receivers, and incident waves (rays) are shot through the ray window.
The details of these stages are described in the following sections. In Section 2, the computation of the GO data is formulated. Then, the method of computing the PO data is described in Section 3. Our visualization strategies for SBR simulations are explained in Section 4. The conclusion of this article is presented in the last section.
2 Computation of Geometrical Optics
At the first stage of our SBR method, the GO data are computed by using a ray-tracing algorithm. First, the target model is projected onto a plane orthogonal to the wave direction. The bounding box of the projection forms the ray window, which is divided into cells by using a uniform grid. Then, for each cell, a ray parallel to the wave direction is shot through the cell center toward the target. When hitting the target, the ray is reflected and bounced between the target surfaces. Along the ray path, the electric field is also computed at all impact points. The electric field, the ray paths, and the impact points comprise the GO data set. When a ray hits a metal surface at the impact point P_i, its direction is reflected. The reflected ray is computed by r_i = r_{i-1} − 2<n, r_{i-1}> n, where r_{i-1} and r_i are the directional vectors of the incident and reflected rays and n is the normal vector at P_i, as shown in part (a) of Fig. 2. When a ray is initiated, an initial electric field E_0 and an orthogonal coordinate system are associated with the ray. The three axes of the coordinate system are the initial direction of the ray r_0 and two unit vectors u_0 and v_0. The vectors u_0 and v_0 are the directions of the transmitting electrical (TE) and transmitting magnetic (TM) components of the initial electric field E_0. At each reflection point P_i, these vectors are updated by:

u_{i-1} = (n × r_{i-1}) / ||n × r_{i-1}||,   (1)
v_{i-1} = r_{i-1} × u_{i-1},   (2)
u_i = −u_{i-1},   (3)
v_i = u_i × r_i,   (4)
where r_{i-1}, u_{i-1}, and v_{i-1} are the three axes at the previous impact point P_{i-1}. An example of the variation of these vectors is shown in the part (a) of Fig. 2. The electric field E_i at the impact point is calculated by:

$$E_i = \left[ R_{iu} \langle E_{i-1}, u_{i-1} \rangle u_i + R_{iv} \langle E_{i-1}, v_{i-1} \rangle v_i \right] \exp(-jkd_i), \qquad (5)$$
$$k = \frac{2\pi}{\lambda}, \qquad (6)$$
$$j = \sqrt{-1}, \qquad (7)$$
$$d_i = \| P_i - P_{i-1} \|, \qquad (8)$$
where R_{iu} and R_{iv} are the reflection coefficients of the TE and TM components, representing the decay of the electric field after the reflection, d_i is the distance between the (i-1)-th and i-th impact points, λ is the wavelength, and k is the wave number, which represents the variation of phase in a unit distance.

Fig. 2. (a) As a ray is reflected, the electric field at the impact point has to be updated. (b) Only the GO of the last impact point is used to compute the PO.
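To make the bookkeeping of Eqs. (1)-(8) concrete, the following NumPy sketch performs one GO bounce: it reflects the ray, updates the TE/TM axes, and attenuates and phase-shifts the electric field. It is an illustration under our reading of the formulas rather than the authors' code; the reflection coefficients R_u and R_v are supplied by the caller as placeholders.

import numpy as np

def reflect_and_update(E_prev, r_prev, n, p_prev, p_i, R_u, R_v, wavelength):
    # One GO bounce following Eqs. (1)-(8). All vectors are length-3 NumPy
    # arrays; E_prev may be complex. R_u, R_v are the TE/TM reflection
    # coefficients (placeholders). Assumes non-normal incidence, so
    # n x r_prev is not the zero vector.
    k = 2.0 * np.pi / wavelength                  # wave number, Eq. (6)
    r_i = r_prev - 2.0 * np.dot(n, r_prev) * n    # reflected direction
    u_prev = np.cross(n, r_prev)
    u_prev = u_prev / np.linalg.norm(u_prev)      # Eq. (1)
    v_prev = np.cross(r_prev, u_prev)             # Eq. (2)
    u_i = -u_prev                                 # Eq. (3)
    v_i = np.cross(u_i, r_i)                      # Eq. (4)
    d_i = np.linalg.norm(p_i - p_prev)            # Eq. (8): length of this hop
    E_i = (R_u * np.dot(E_prev, u_prev) * u_i +
           R_v * np.dot(E_prev, v_prev) * v_i) * np.exp(-1j * k * d_i)  # Eq. (5)
    return E_i, r_i, u_i, v_i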
3 The Scattered Field
Once the GO data are calculated, the scattered field can be computed. In the SBR simulation, it is assumed that the cross-section of a ray is small and the distance between two consecutive impact points is short. The ray does not diverge as it bounces between the target's surfaces, and all its energy is concentrated in the direction of reflection or transmission. Based on this assumption, only the GO of the last impact point is used to compute the scattered field created by the ray. Let us denote the last impact point as P_i. The direction of the incident ray at P_i is r_{i-1} and the directions of the TE and TM components are u_{i-1} and v_{i-1}, as shown in the part (b) of Fig. 2. Then assume that the polar coordinates of the receiver are [r, θ, φ]^T, as shown in the same figure. The PO components, B_θ and B_φ, contributed by the ray are computed by:

$$B_\theta = \frac{k}{2\pi j}\left( \langle \phi \times u_{i-1}, n \rangle + \langle v_{i-1} \times \phi, n \rangle \right) SF, \qquad (9)$$
$$B_\phi = \frac{k}{2\pi j}\left( \langle \theta \times u_{i-1}, n \rangle + \langle v_{i-1} \times \theta, n \rangle \right) SF, \qquad (10)$$
where θ and φ are unit vectors in the zenith and azimuth directions at the receiver's position, as shown in the part (b) of Fig. 2. The scalar SF is the shape function of the ray's cross-section and can be computed by:

$$SF = \int_{s} \exp\left( jk \langle (r_{i-1} + r), \bar{s} \rangle \right) ds, \qquad (11)$$
where s represents all points within the cross section of the ray, s̄ is the positional vector of s, and r is the vector from the origin to the receiver. Then the scattered field is calculated by:

$$E_s = \frac{e^{-jkr}}{r}\left( B_\theta\, \theta + B_\phi\, \phi \right), \qquad (12)$$
where r is the distance from the origin to the receiver. To compute the total scattered field sensed by the receiver, we have to sum up the effects contributed by all rays. Once this process is done, the RCS sensed by the receiver is defined as [14]:

$$\delta = 4\pi r^{2}\, \frac{\| E_s \|^{2}}{\| E_{inc} \|^{2}}, \qquad (13)$$
where E_inc is the sum of the initial electric fields of all the rays hitting the target. The RCS is the ratio between the strength of the scattered electric field sensed by the receiver and that of the incident electric field. It tells us the portion of power reflected from the target and perceived by the receiver.
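The per-ray PO contribution and the final RCS of Eqs. (9)-(13) can be assembled as in the following NumPy sketch. It assumes the shape-function integral SF of Eq. (11) has already been evaluated and is passed in as a complex scalar, and it illustrates the formulas rather than the authors' implementation.

import numpy as np

def ray_po_contribution(u_prev, v_prev, n, theta_hat, phi_hat, SF, k):
    # PO contribution of one ray at its last impact point, Eqs. (9)-(10).
    # theta_hat, phi_hat: zenith/azimuth unit vectors at the receiver.
    c = k / (2.0 * np.pi * 1j)
    B_theta = c * (np.dot(np.cross(phi_hat, u_prev), n) +
                   np.dot(np.cross(v_prev, phi_hat), n)) * SF     # Eq. (9)
    B_phi   = c * (np.dot(np.cross(theta_hat, u_prev), n) +
                   np.dot(np.cross(v_prev, theta_hat), n)) * SF   # Eq. (10)
    return B_theta, B_phi

def rcs_at_receiver(contribs, theta_hat, phi_hat, r, k, E_inc):
    # Sum the per-ray (B_theta, B_phi) pairs, form the scattered field of
    # Eq. (12) and the RCS of Eq. (13). E_inc is the summed incident field.
    B_theta = sum(b for b, _ in contribs)
    B_phi = sum(b for _, b in contribs)
    E_s = np.exp(-1j * k * r) / r * (B_theta * theta_hat + B_phi * phi_hat)
    delta = 4.0 * np.pi * r**2 * (np.linalg.norm(E_s)**2 /
                                  np.linalg.norm(E_inc)**2)
    return E_s, delta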
4 Visualization Strategies
Compared with traditional RCS software, our system offers improved visualization functionality. Three types of information are extracted from the simulation and graphically displayed: the radiation patterns, the GO data set, and the RCS field. The RCS field is visualized by using a volume rendering approach and sped up by using 3D textures. The radiation patterns are shown by using a surface rendering scheme. To explore the GO data set, various methods are adopted to show ray paths, ray phases, and hot spots of the target surfaces.
4.1 Volume Visualization for the RCS Field
The RCS field naturally comes in a polar coordinate system, and it is hard to volume-render it directly. Therefore the RCS field is re-sampled before being rendered. Since the distance between the 1st ray buffer and the target is large while the gap between the ray buffers is small, a straightforward re-sampling would create a hole in the 3D grid. To overcome this problem, we shrink the distance between the target and the 1st ray buffer so that it is equal to the gap between ray buffers. Then the RCS values are re-sampled on a 3D regular grid and converted into a 3D texture map. The RCS volume data are rendered by using a slicing volume rendering method. One resulting image is shown in the part (a) of Figure 3. In the image, the target is a box reflector, which is commonly used to disturb radar waves. The incident radar waves are shot toward the box aperture. The radar waves are
parallel to the x-axis. Ten ray buffers are used to record the RCS field. The nearest and farthest ray buffers are 100 and 109 meters away from the reflector. The spatial gap between two ray buffers is 1 meter. In the image, white color means high RCS values, red color represents medium radiation, and yellow and green colors are used to show lower scattering. The areas with no radiation are shaded with black color. The results show that the back scattering is strong. The radiation spreads out over a wide range. The RCS field varies periodically in space. This is because the waves travel different distances before arriving at the receivers. If their phases are opposite, interference occurs such that the scattered field is weakened.
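A minimal NumPy sketch of this re-sampling step is given below: the RCS values stored on the (already shrunk) spherical ray buffers are looked up, per voxel, by a nearest-bin rule and written into a regular grid that can be uploaded as a 3D texture. The array shapes and the nearest-bin lookup are our assumptions, not the authors' exact scheme.

import numpy as np

def resample_rcs_to_grid(rcs, radii, res=64, extent=None):
    # rcs: array of shape (n_shells, n_theta, n_phi) on spherical ray buffers;
    # radii: ascending shell radii (after shrinking the first-buffer distance).
    n_shells, n_theta, n_phi = rcs.shape
    if extent is None:
        extent = radii[-1]
    lin = np.linspace(-extent, extent, res)
    x, y, z = np.meshgrid(lin, lin, lin, indexing='ij')
    r = np.sqrt(x**2 + y**2 + z**2)
    theta = np.arccos(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))   # zenith angle
    phi = np.mod(np.arctan2(y, x), 2.0 * np.pi)                       # azimuth angle
    # shell index and angular bin for every voxel
    i_r = np.clip(np.searchsorted(radii, r), 0, n_shells - 1)
    i_t = np.clip((theta / np.pi * n_theta).astype(int), 0, n_theta - 1)
    i_p = np.clip((phi / (2.0 * np.pi) * n_phi).astype(int), 0, n_phi - 1)
    volume = rcs[i_r, i_t, i_p]
    volume[(r < radii[0]) | (r > radii[-1])] = 0.0   # outside the ray buffers
    return volume   # e.g. uploaded as a 3D texture for slice-based rendering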
Fig. 3. (a) A 3D volume rendering image of the radiation pattern of a box reflector. (b) The radiation pattern of the box reflector, perceived by a ray buffer; the image is rotated by 90 degrees.
4.2 Radiation Pattern Visualization
To reveal the radiation pattern of the target, we scale the distances between the receivers of the 1st ray buffer and the target according to the RCS values measured at the receivers. If a receiver detects weak radiation, it is moved toward the target. Otherwise, it is moved away from the target. Then the receivers are treated as vertices and connected to form a surface mesh. The gradients of the RCS field are utilized as the surface normals, and vertex colors are decided by using the RCS values of the vertices. As the surface is rendered, the radiation pattern is revealed. A sample image is shown in the part (b) of Fig. 3. As this image shows, the strongest radiation occurs in the back scattering direction. However, the directions around a conic surface, whose cut-off angle is about 45 degrees, also receive high scattering. The area in-between has weaker scattering.
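The following sketch shows one way to build such a radiation-pattern surface: receiver directions on the 1st ray buffer are displaced radially according to their normalized RCS values and connected into quads. The scaling factor and the angular parameterization are illustrative choices, not taken from the paper.

import numpy as np

def radiation_pattern_mesh(rcs, base_radius=1.0, scale=0.5):
    # rcs: per-receiver RCS values on an (n_theta x n_phi) angular grid.
    # Low-RCS receivers are pulled toward the target, high-RCS ones pushed away.
    n_theta, n_phi = rcs.shape
    t = np.linspace(0.0, np.pi, n_theta)
    p = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    theta, phi = np.meshgrid(t, p, indexing='ij')
    lo, hi = rcs.min(), rcs.max()
    rcs_n = (rcs - lo) / max(hi - lo, 1e-12)                 # normalize to [0, 1]
    radius = base_radius * (1.0 + scale * (rcs_n - 0.5))     # radial displacement
    verts = np.stack([radius * np.sin(theta) * np.cos(phi),
                      radius * np.sin(theta) * np.sin(phi),
                      radius * np.cos(theta)], axis=-1).reshape(-1, 3)
    colors = rcs_n.reshape(-1)                               # vertex colors from RCS
    quads = []                                               # connect neighbouring receivers
    for i in range(n_theta - 1):
        for j in range(n_phi):
            jn = (j + 1) % n_phi                             # wrap around in phi
            a, b = i * n_phi + j, i * n_phi + jn
            c, d = (i + 1) * n_phi + jn, (i + 1) * n_phi + j
            quads.append((a, b, c, d))
    return verts, colors, np.array(quads)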
The most commonly used RCS visualization method in the EE community is to portray the RCS field in the 2D polar coordinate space spanned by the azimuth and zenith angles. Our system also supports this conventional visualization method. In another test, an airplane model is employed as the target. Rays are cast toward the airplane from the front end. The resulting RCS field is shown in the part (a) of Fig. 4. The horizontal axis represents the azimuth angle while the vertical axis represents the zenith angle. The back scattering direction is enclosed in a dotted rectangle. Based on the image, we conclude that the airplane generates significant scattering in the zenith directions, but the back scattering is weak.
Fig. 4. (a) The RCS field of the airplane model perceived over all polar angles; the back-scattering direction is enclosed in the box. (b) Hot spots on the airplane surfaces; the vertical tail edges produce strong back-scattering.
4.3 Hot-Spot and Ray Path Visualization
The GO data set helps users understand the propagation of the rays and the variation of the electric fields. In our GO visualization process, users are allowed to pick an area in a ray buffer and display the paths of the rays contributing RCS to this region. Besides the ray paths, the TE and TM components along the rays can also be displayed. To reveal ray phases, we divide rays into segments and encode the ray segments with colors according to their phases. The target surfaces consist of many small facets. Identifying the facets which are hit most by the rays is important, because these hot spots usually produce high radiation. During the RCS simulation, the number of impact points in each facet is counted. Then the impact point number is divided by the facet area to obtain the impact point density. Facets are given different colors according to their impact point densities. When the target's surface is rendered, the hot spots of the target are revealed.
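A small sketch of this hot-spot computation, assuming each recorded impact point carries the index of the facet it hit:

import numpy as np
from collections import Counter

def facet_hot_spots(facet_ids, facet_areas):
    # facet_ids: one facet index per recorded impact point.
    # facet_areas: area of every facet of the target mesh.
    counts = Counter(facet_ids)
    density = np.zeros(len(facet_areas))
    for f, c in counts.items():
        density[f] = c / facet_areas[f]      # impact points per unit area
    # normalize so the hottest facet maps to 1 for color coding
    return density / density.max() if density.max() > 0 else density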
Fig. 5. Paths of rays reflected by a facet of the airplane. Colors are used to display the phases.
We collect the rays producing the back-scattering for the airplane in the previous test and use their impact points to compute the impact point densities of all facets. The impact densities are visualized in the part (b) of Fig. 4. The red facets have high impact point densities, while the blue facets contain fewer impact points. In the airplane, the edges of the vertical tails possess the highest impact point densities. Though the wings are large, their impact densities are low. An image of ray path visualization is shown in Fig. 5. Our system can show all the GO data of the rays. However, if all the information is displayed simultaneously, the image becomes saturated. Therefore, we only reveal the paths and phases of rays hitting a facet of the airplane. In the left image, the paths of the rays are portrayed as lines. Colors are used to represent ray phases. The ray paths are zoomed in and shown in the right image.
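As an illustration of the phase color coding (the paper does not specify the palette used), a segment's accumulated path length can be turned into a hue:

import math, colorsys

def segment_phase_color(path_length, wavelength):
    # Color for one ray segment: the phase k*d (mod 2*pi) is used as the hue
    # of an HSV color, so segments with equal phase share the same color.
    k = 2.0 * math.pi / wavelength
    hue = ((k * path_length) % (2.0 * math.pi)) / (2.0 * math.pi)
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)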
5 Conclusion
A ray-tracing procedure for predicting the interaction between a target and radar waves is presented. We modify the traditional SBR method such that more RCS data are computed in one simulation. Our SBR simulation also identifies the hot spots which cause high scattering and reveals wave paths and ray phases. New visualization strategies are added to display the RCS field and the radiation patterns. Hence essential characteristics of the scattered field can be understood and evaluated.
References 1. Glassner, A.: An Introduction to Ray Tracing. Academic Press, London (1989) 2. Bhalla, R., Ling, H., Moore, J., Andersh, D.J., Lee, S.W., Hughes, J.: 3D Scattering Center Representation of Complex Targets Using the Shooting and Bouncing Ray Technique: A Review. IEEE Antennas and Propagation Magazine 40, 30–38 (1998)
3. Ling, H., Chou, R.-C., Lee, S.-W.: Shooting and Bouncing Rays: Calculating the RCS of an Arbitrarily Shaped Cavity. IEEE Transactions on Antennas and Propagation 37, 194–205 (1989) 4. Kuroda, S., Inasawa, Y., Morita, S., Nishikawa, S., Konishi, Y., Sunahara, Y., Makino, S.: Radar Cross Section Analysis Considering Multi-Reflection Inside a Radome Based on SBR Method. IEICE Transactions on Electron E88-C, 2274–2280 (2005) 5. Yang, C.F., Wu, B.C., Ko, C.J.: A Ray-Tracing Method for Modeling Indoor Wave Propagation and Penetration. IEEE Transactions on Antennas and Propagation 48, 907–919 (1998) 6. Chen, S.H., Jeng, S.K.: SBR Image Approach for Radio Wave Propagation in Tunnels with and without Traffic. IEEE Transactions on Vehicular Technology 45, 907–919 (1996) 7. Chen, S.H., Jeng, S.K.: SBR Image Approach for Radio Wave Propagation in Indoor Environments with Metallic Furniture. IEEE Transactions on Antennas and Propagation 45, 98–106 (1997) 8. Liang, G., Bertoni, H.L.: A New Approach to 3-D Ray Tracing for Propagation Prediction in Cities. IEEE Transactions on Antennas and Propagation 46, 853–863 (1998) 9. Yu, C.L., Lee, S.W.: Radar Cross Section Computation and Visualization by Shooting and Bouncing Ray (SRB) Technique. In: Proceedings of Antennas and Propagation Society International Symposium, pp. 1323–1326 (1992) 10. Rius, J.M., Ferrando, M., Jofre, L.: GRECO: Graphical Electrmagnetic Computing for RCS Prediction in Real Time. IEEE Antennas and Propagation Magazine 35, 7–17 (1993) 11. Andersh, D., Moore, J., Kosanvich, S., Kapp, D., Bhalla, R., Courtney, T., Nolan, A., Germain, F., Cook, J., Hughes, J.: XPATCH 4: The Next Generation in High Frequency Electromagnetic Modeling and Simulation Software. In: Proceedings of IEEE International Radar Conference, pp. 844–849 (2000) 12. Spivack, M., Usher, A., Yang, X., Hayes, M.: Visualization and Grid Applications of Electromagnetic Scattering from Aircraft. In: Proceedings of the 2003 UK e-Science All Hands Meeting (2003) 13. Preiss, B., Tollefson, M., Howard, R.: A 3-D Perspective for Radar Cross Section Visualization. In: Proceedings of IEEE Aerospace Conferences 1997, pp. 95–112 (1997) 14. Inan, U.S., Inan, A.S.: Electromagnetic Waves. Prentice-Hall, Englewood Cliffs (2000)
Visualization of Gene Regulatory Networks Muhieddine El Kaissi1 , Ming Jia2 , Dirk Reiners1 , Julie Dickerson2 , and Eve Wuertele2 1
CACS, University of Louisiana, Lafayette, LA 70503 2 VRAC, Iowa State University, Ames, IA 50011
[email protected],
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. Networks are a useful visualization tool for representing many types of biological data such as gene regulations. In this paper, we present a novel graph drawing technique for visualizing gene regulatory networks. This technique is based on drawing the shortest paths between genes of interest using a breadth-first-search algorithm. By drawing genes of interest and showing their regulatory networks, we were able to clearly understand the interaction between these genes. Visualization of gene regulatory networks considerably reduces the number of displayed nodes and edges. This reduction in nodes and edges depends on the number of genes of interest as well as on the number of interactions between these genes.
1 Introduction
Biological data of physical, genetic and functional interactions are rapidly growing. This fast expansion of data presents a challenge for data visualization and exploration [29]. Gene regulation leads to the creation of different cells which form different organs and ultimately different organisms [26]. There are intricate mechanisms that let the cell regulate the expression of many of its genes. Gene expression can be regulated at the transcriptional, post-transcriptional, or post-translational levels [3]. On the transcriptional level, two types of regulation exist: positive control, in which transcription is enhanced in response to a certain set of conditions; and negative control, in which transcription is repressed. The same types of control exist for other levels of regulation. On the post-transcriptional level, regulation controls the rate at which mRNA is translated. The translation speed is controlled by how fast the 5' end of the mRNA binds to the ribosome. The faster the mRNA can be read, the more protein will be produced [13]. The last type of regulation is post-translational regulation. This regulation modifies amino acids by changing their chemical nature or by making structural changes to them [12]. One of the main functions of post-translational regulation is the control of enzyme activity. This activity plays an important role in accelerating reaction rates, which become millions of times faster than those of comparable un-catalyzed reactions. Another important role of enzyme activity is the determination of the metabolic pathways that occur in a cell [9].
Gene regulatory networks are the combination of gene regulations at different levels, especially at the transcriptional level. In these networks, genes are usually represented as nodes and their interactions as edges. In gene regulatory networks, graphs are directed because regulation flows from one gene to another. These networks are just beginning to be understood, and biologists are trying to analyze the functions of each gene “node” to help them understand the biological system behavior in increasing levels of complexity [21]. In this paper, we will focus on how different genes regulate each other and which pathways they follow. This mechanism is realized by displaying the shortest path between different genes of interest, described in detail in sec. 4. In sec. 2, we present related work in large data visualization, followed by the motivation for our work in sec. 3. Results on visualizing gene regulatory networks, as well as their discussion, are presented in sec. 6. The paper is closed with a conclusion and ideas for future work.
2 Related Work
Different techniques exist for displaying large data sets on a limited screen space. In this section, we will mention the most popular ones. Data visualization tools usually implement one or more of these techniques.
2.1 Hierarchical Clusters and Incremental Build
Hierarchical clusters of nodes or edges present a simplified view of large, complex networks. Users can expand these clusters, in the same or a different window, in order to visualize the underlying cluster members. Clustering can be achieved by merging nodes that satisfy certain criteria, such as merging all proteins of the same subcellular localization into one cluster node [10]. Users can incrementally explore the graph by opening and closing the cluster nodes [2].
2.2 Search, Filter and Neighbor Expansion
In this technique, a user applies queries to the underlying database. Different sets of filters can be used to query the network. As an example, users can find genes by their names or connectivity. A subgraph containing the query results is initially shown to the user. After that, users can choose to expand all interesting node neighbors. Several tools, including VisAnt, Osprey and MINT, use this technique to display large data sets [20,11,6].
2.3 Fish Eye and Hyperbolic View
The purpose of Fish Eye and Hyperbolic View techniques is to focus on nodes of interest while displaying relatively distant neighbors. These techniques would enable users to keep a detailed picture of a part of a graph as well as the global context of that graph.
The Fish Eye technique has the effect of putting a fish-eye lens in front of a graph layout [27]. This effect can be achieved by distorting an existing graph layout by a certain function h(x) around the node of interest. The main disadvantage of this technique is the introduction of unwanted edge crossings near the graph border. This unpleasant effect is due to the distortion being applied only to the nodes. The Hyperbolic View technique is the “integration” of the Fish Eye technique within the graph layout algorithm. Although this technique is slower than the Fish Eye technique, it produces more pleasant graph drawings. One of the well-known hyperbolic view systems is H3Viewer, which primarily uses the Klein model [23].
2.4 Overview Window
The Overview Window technique is the simplest technique to visualize large graphs. Usually located at a corner of the screen, the overview window displays the whole graph and some other visual clues. This technique helps users situate the detailed picture of a part of a graph in the global graph context. The Overview Window technique is usually combined with other techniques described in this section. Several tools like Tulip, ASK-GraphView, Cytoscape and MNV implement this technique [7,5,4]. (MNV is developed by the authors of this article.)
2.5 Animation
Animation is usually realized during the transition between two views. The animation should help the user maintain the mental map of a changing graph, especially for large ones [1]. The animation can be linear, or it can use other smoothing methods such as the slow-in slow-out technique [19] (see Fig. 1). Recent visualization tools combine this technique with the other ones presented previously. As an example, Grouse animates the transition of node positions when a user opens a cluster node [24]. Animation can also be used with zoom and pan techniques as well as filtering (described in sec. 2.2) [28,18].

Fig. 1. Slow-in slow-out technique
3 Motivation
Due to the large number of interactions, analyzing the shortest path between two or more genes may be useful because of the possible relation between these
pathways and the immediacy or breadth of signal response [8]. Studying the structure and evolution of regulatory networks related to certain organisms can be translated into predictions and could be used for engineering the regulatory networks of different organisms. This translation of knowledge is possible due to the conservation of genes and regulatory interactions in closely related organisms [30].
Fig. 2. Shortest path output in VisAnt
To the authors' knowledge, there are no graph visualization tools that build networks from a combination of shortest paths between different genes of interest. A few tools, such as VisAnt, have the option to show the shortest path between proteins in protein-to-protein interaction networks. Fig. 2 shows a text output displaying shortest paths between different proteins. From this point, the user has the option to click on the related shortest path. VisAnt will then highlight these paths in the graph. In our approach, the network is formed only by a combination of shortest paths, as will be described in sec. 4.
4 Visualization of Gene Regulatory Networks
Gene regulatory networks are based on a set of genes that regulate each other at different cellular activity levels, as explained in sec. 1. Biologists usually extract these genes of interest by analyzing Systems Biology data. The analysis of this omics data is assisted by statistical analysis tools such as exploRase and MetaOmGraph as well as graph visualization tools such as Cytoscape [25,22]. From that point, biologists can import the names of the genes of interest into a visualization tool developed by the authors, named Metabolic Network Visualization (or MNV). MNV will then generate the gene regulatory network formed by the shortest paths between these genes. Pathways are analyzed by breadth-first-searching for the shortest paths between genes of interest. From that point, all nodes and edges that belong to the shortest paths between genes of interest are displayed. The combination of these shortest paths forms a relatively small network, which is of great interest for
biologists (see sec. 3). This network helps biologists analyze gene regulatory networks. Biologists can also choose to gradually expand the network of interest by displaying related neighbors, as explained in sec. 5.

Fig. 3. Biologists' work flow for generating gene regulatory networks
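A minimal sketch of this path-union step is shown below; it is not MNV's actual code, and the adjacency-dictionary representation of the directed regulation graph is an assumption.

from collections import deque

def bfs_shortest_path(graph, source, target):
    # Unweighted shortest path in a directed graph given as {node: [successors]}.
    prev, frontier = {source: None}, deque([source])
    while frontier:
        node = frontier.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for succ in graph.get(node, ()):
            if succ not in prev:
                prev[succ] = node
                frontier.append(succ)
    return None   # no regulatory path between the two genes

def regulatory_subnetwork(graph, genes_of_interest):
    # Union of the BFS shortest paths between every ordered pair of genes of
    # interest; only these nodes and edges are displayed.
    nodes, edges = set(), set()
    for g1 in genes_of_interest:
        for g2 in genes_of_interest:
            if g1 == g2:
                continue
            path = bfs_shortest_path(graph, g1, g2)
            if path:
                nodes.update(path)
                edges.update(zip(path, path[1:]))
    return nodes, edges

Because the displayed network is only this union, it is usually far smaller than the full pathway graph, which is where the node and edge reductions reported later come from.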
5 Animation and Neighbor Expansion
Unlike other visualization tools such as VisAnt and Osprey, where users expand all node neighbors, MNV can expand only the neighbors that satisfy certain criteria [20,11]. These criteria can be node location, connectivity, type, etc. Fig. 4 shows the right-click menu that enables biologists to selectively display neighbor nodes. This feature reduces visual complexity by minimizing the number of displayed nodes as well as focusing only on interesting nodes. Animation plays an important role in displaying shortest paths between two genes of interest. It helps biologists follow the path from one gene to another. The animation is realized by changing the color of nodes and edges successively, as sketched below. Other visual attributes can also be applied, such as changing node/edge size or shape.
Fig. 4. Graph layout before expanding node “R”
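The successive-coloring animation mentioned above can be sketched as follows; set_color stands for whatever callback the viewer exposes for changing a node's color, and the fixed delay is only illustrative (a real viewer would drive this from its event loop).

import time

def animate_path(path, set_color, base=(0.7, 0.7, 0.7),
                 highlight=(1.0, 0.0, 0.0), delay=0.15):
    # Successively highlight the nodes of a shortest path to suggest the flow
    # of regulation from one gene to the next.
    for node in path:
        set_color(node, base)
    for node in path:
        set_color(node, highlight)   # light nodes up one after another
        time.sleep(delay)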
6 Results
6.1 System Platform
Visualization of the gene regulatory networks was implemented as part of the Metabolic Network Visualization (or MNV) tool developed at the University of Louisiana [14]. MNV has been written in Python and uses C++ modules for performance-critical operations like graph rendering. The GUI is based on Qt and OpenGL. MNV supports a wide variety of layout algorithms. In this paper, we used the neato algorithm from the GraphViz package [16].
6.2 “Ethylene Signaling (Expert User)” Pathway Test
As a proof of concept, we picked the “ethylene signaling (Expert User)” pathway from MetNetDB as an example [17]. The “ethylene signaling (Expert User)” pathway is a well-known pathway. It contains several positive and negative regulations (or controls) as well as some enzyme reactions. After loading this pathway, we randomly chose three genes of interest: AT2G43790, AT5G25350 and AT1G75750. These three genes form a gene regulatory network. Genes AT2G43790 and AT5G25350 indirectly regulate gene AT1G75750. These two genes repress the construction of protein EIN3, which in turn positively regulates protein ERF1. Finally, protein ERF1 positively regulates the transcriptional activity of AT1G75750 (a gene of interest). Fig. 5 shows the relation between these three genes. The red edge (or link) indicates a negative regulation. The green edge indicates a positive regulation. White links indicate translational activity while the black links indicate transcriptional activity.
Fig. 5. Genes of interest in “ethylene signaling (Expert User)” pathway
The unconnected graph edges denote the presence of neighbor nodes. As explained in sec. 5, biologists can expand the gene regulatory network to display these neighbors. The yellow rectangles in the overview window, located at the lower left corner, indicate the location of these genes of interest in the regulatory networks.
6.3 All “arabidopsis” Pathways Test
In real-life situations, biologists need to study the regulatory system between different sets of genes contained in certain species. We conglomerated all “arabidopsis” pathways, comprising 1006 different pathways from MetNetDB. These pathways have 2332 different genes and 1008 shortest paths. The average shortest path length between two different genes is 27 nodes and the maximum length is 56 nodes. Using MNV, biologists can import or paste genes of interest (see Fig. 6). If the genes of interest exist in the “arabidopsis” pathways, they are displayed together with their regulatory networks. In our case, biologists were interested in 8 genes found in “arabidopsis”. Of these eight genes, only two formed a gene regulatory network. We have found that gene AT4G12430, after a series of enzyme reactions, enhances the transcriptional activity of gene AT3G03830. Fig. 7 shows the regulatory network formed by a shortest path between the AT4G12430 and AT3G03830 genes. This shortest path has a length of 33 nodes. Note that in Fig. 7, genes of interest are marked by a yellow rectangle. Blue edges indicate enzyme reactions.

Fig. 6. Importing genes of interest

Fig. 7. Genes of interest in the conglomerate of all “arabidopsis” pathways

6.4 Discussion
In the “ethylene signaling (Expert User)” pathway test from sec. 6.2, we have demonstrated the ability of the gene regulatory network to display the interaction between different genes. These interactions can be quite complex and difficult to detect. By using the shortest paths between different genes of interest, we
were able to understand the relation between two genes as demonstrated in sec.6.3. We have also found that a gene can regulate multiple other genes using approximately the same shortest path, and multiple genes can share the same shortest path to regulate a certain gene. On the graph side, for the “ethylene signaling (Expert User)” pathway, the original graph contains 100 nodes and 121 edges. The gene regulatory networks only contain 14 nodes (14% of the original) and 33 edges (27% of the original). At the high end, by displaying only genes of interest and their regulatory networks for all “arabidopsis” pathways, we were able to reduce the number of displayed nodes from 15066 nodes to 41 (0.27% of the original). Respectively, the number of displayed edges was reduced from 17993 edges to 163 (0.9% of the original).
7 Conclusion and Future Work
Interactions between different genes can be quite complex. By using the shortest paths between different genes of interest, we were able to clearly understand the relation between these genes. A possible relation between the length of the shortest paths and the immediacy or breadth of signal response may exist. A significant reduction in the number of displayed nodes and edges has been achieved. This reduction is proportional to the number of genes biologists are interested in as well as to the number of interactions between these genes. Biologists can also extend the gene regulatory network to display neighbor nodes that might influence the regulatory network. For future work, we are looking to implement algorithms other than breadth-first search. These algorithms would detect flux between genes of interest undetected by the breadth-first-search algorithm. The authors are also working on integrating gene regulatory networks with the Reaction Centric Layout [15]. This integration would allow viewing the regulatory networks from a different perspective. We are also interested in developing techniques for clustering shortest paths and adapting them to gene regulatory networks.
Acknowledgments The work presented in this paper is supported by the National Science Foundation under contract NSF 0612240.
References 1. Online Animated Graph Drawing for Web Navigation. Springer, London (1997) 2. A Fully Animated Interactive System for Clustering and Navigating Huge Graphs. Springer, London (1998) 3. Expression system - definition. Biology Online. Biology-Online.org., 10 (2005) 4. Integration of biological networks and gene expression data using cytoscape (2007) 5. Abello, J., van Ham, F., Krishnan, N.: Ask-graphview: A large scale graph visualization system. IEEE Transactions on Visualization and Computer Graphics 12(5), 669–676 (2006) 6. Aryamontri, C.A., Ceol, A., Palazzi, L.M., Nardelli, G., Schneider, M.V., Castagnoli, L., Cesareni, G.: Mint: the molecular interaction database. Nucleic Acids Res. 35(Database issue) (January 2007) 7. Auber, D.: Tulip: A huge graph visualisation framework. In: Mutzel, P., Jünger, M. (eds.) Graph Drawing Softwares. Mathematics and Visualization, pp. 105–126. Springer, Heidelberg (2003) 8. Madan Babu, M., Luscombe, N.M., Aravind, L., Gerstein, M., Teichmann, S.A.: Structure and evolution of transcriptional regulatory networks. Current Opinion in Structural Biology 14(3), 283–291 (2004) 9. Bairoch, A.: The ENZYME database in 2000. Nucl. Acids Res. 28(1), 304–305 (2000) 10. Benno, S., Peter, U., Stanley, F.: A network of protein-protein interactions in yeast. Nature Biotechnology 18, 1257–1261 (2000) 11. Breitkreutz, B.-J., Stark, C., Tyers, M.: Osprey: A network visualization system. Genome Biology 3(12);preprint0012.1–preprint0012.6 (2002); This was the first version of this article to be made available publicly. A peer-reviewed and modified version, http://genomebiology.com/2003/4/3/R22 12. Burnett, G., Kennedy, E.P.: The Enzymatic Phosphorylation of Proteins. J. Biol. Chem. 211(2), 969–980 (1954) 13. Cheadle, C., Fan, J., Cho-Chung, Y., Werner, T., Ray, J., Do, L., Gorospe, M., Becker, K.: Control of gene expression during t cell activation: alternate regulation of mrna transcription and mrna stability. BMC Genomics 6(1), 75 (2005) 14. Kaissi, M.E., Dickerson, J., Wuertele, E., Reiners, D.: Mnv: Metabolic twork visualization. 15. El Kaissi, M., Dickerson, J., Wuertele, E., Reiners, D.: Reaction centric layout for metabolic networks. LNCS. Springer, Heidelberg (2009) 16. Ellson, J., Gansner, E., Koutsofios, L., North, S., Woodhull, G.: Graphviz open source graph drawing tools (2002) 17. Wurtele, E.S., Li, L., Berleant, D., Cook, D., Dickerson, J.A., Ding, J., Hofmann, H., Lawrence, M., Lee, E.K., Li, J., Mentzen, W., Miller, L., Nikolau, B.J., Ransom, N., Wang, Y.: Metnet: Software to build and model the biogenetic lattice of arabidopsis. In: Concepts in Plant Metabolomics, pp. 145–158. Springer, Heidelberg (2007) 18. Friedrich, C., Schreiber: Visualisation and navigation methods for typed proteinprotein interaction networks. Applied bioinformatics 2(3 suppl.) (2003) 19. Friedrich, C., Houle, M.: Graph drawing in motion ii., pp. 122–125 (2002) 20. Hu, Z., Mellor, J., Wu, J., Yamada, T., Holloway, D., DeLisi, C.: VisANT: dataintegrating visual framework for biological networks and modules. Nucl. Acids Res. 33(2), 352–357 (2005)
21. Kauffman, S.A.: The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press, Oxford (1993) 22. Lawrence, M., Lee, E.-K., Cook, D., Hofmann, H., Wurtele, E.: explorase: Exploratory data analysis of systems biology data. In: CMV 2006: Proceedings of the Fourth International Conference on Coordinated & Multiple Views in Exploratory Visualization, Washington, DC, USA, pp. 14–20. IEEE Computer Society, Los Alamitos (2006) 23. Munzner, T.: Drawing large graphs with h3Viewer and site manager. In: Whitesides, S.H. (ed.) GD 1998. LNCS, vol. 1547, pp. 384–393. Springer, Heidelberg (1999) 24. Museth, K., Möller, T., Ynnerman, A.: In: Archambault, D., Munzner, T., Auber, D. (eds.) Grouse: Feature-based, steerable graph hierarchy exploration (2007) 25. Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., Ideker, T.: Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Research 13(11), 2498–2504 (2003) 26. Supratim, C.: Gene regulation and molecular toxicology. Toxicology Mechanisms and Methods 15, 1–23 (2004) 27. Taylor, R.W., Sarkar, M., Brown, M.H.: Graphical fisheye views of graphs. In: Proceedings of CHI 1992 Conference on Human Factors in Computing Systems, pp. 83–91. ACM Press, New York (1992) 28. van Wijk, J.J., Nuij, W.A.A.: Smooth and efficient zooming and panning. In: IEEE Symposium on Information Visualization (INFOVIS 2003), October, 2003, pp. 15–23 (2003) 29. Vidal, M.: A biological atlas of functional maps. Cell 104(3), 333–339 (2001) 30. Yu, H., Luscombe, N.M., Lu, H.X., Zhu, X., Xia, Y., Han, J.-D.J., Bertin, N., Chung, S., Vidal, M., Gerstein, M.: Annotation transfer between genomes: Protein protein interologs and protein dna regulogs. Genome Research 14(6), 1107–1118 (2004)
Autonomous Lighting Agents in Photon Mapping A. Herubel, V. Biri, and S. Deverly Laboratoire d’Informatique Gaspard Monge Duran Duboi
Abstract. In computer graphics, global illumination algorithms such as photon mapping require gathering large volumes of data, which can be heavily redundant. We propose a new characterization of useful data and a new optimization method for the photon mapping algorithm using structures borrowed from Artificial Intelligence such as autonomous agents.
1 Introduction
Physically-based global illumination algorithms such as photon mapping [1] have a linear progression between complexity and quality. For a given quality, rendering time scales linearly with computer performance. With Moore's law called into question and an increasing demand for quality, these algorithms need more and more optimisations. Classical optimisations such as irradiance caching [2] or shadow photons [3] are themselves linear in performance gain. They usually consist of adding knowledge about the scene and using it to interpolate previously computed values. Data is gathered in large volumes although, due to heavy redundancy, we observe a low density of useful data (see Fig. 1.B and C).
Fig. 1. A. Our test scene with the classical indirect photon map (B, middle top) and the ALA map (D, middle bottom) and the corresponding local useful data density (respectively C, top right, for the classical photon map and E, bottom right, for the ALA map). In false color, brighter means high useful data density while darker means low.
We propose a new optimisation method for the photon mapping global illumination algorithm which uses Artificial Intelligence structures and concepts to resolve the sparse-data problem in global illumination presented above. More precisely, we create our own autonomous agent structure, called Autonomous Lighting Agents (ALA), which performs an agent-based scene discovery. ALA can efficiently gather and store large amounts of useful data (see Fig. 1.D and E) thanks to various sensors. Moreover, ALA exist within multiple graph representations that depict relations between agents such as neighbourhood or light paths. The resulting structures are networks of agents storing data non-uniformly, with increased density in areas of interest. Then any ALA and the associated networks can be queried during rendering to make decisions regarding, for example, irradiance caching or shadow ray casting. Therefore, fewer photons are cast, reducing computing time during both the photon casting and the rendering phases of the algorithm. In the next section, we present the photon mapping algorithm and give an overview of the main optimisations of that method. Then we present our new criterion to evaluate the usefulness of local data density. To match this particular data density closely, we detail our method, showing how ALA discover the scene and how they are queried for the final rendering. Finally, we present our first results showing an important decrease in memory occupation and slightly shorter rendering times, compared to optimized photon mapping.
2 Global Illumination and Photon Mapping
We call global illumination the simulation of all light scattering phenomena in a virtual scene. Photon mapping, presented in [4,5], is currently one of the most efficient physically based algorithms capable of global illumination rendering. Compared to other global illumination methods like radiosity [6,7] or Metropolis Light Transport [8,9], photon mapping is a robust and consistent two-pass algorithm [10], is able to handle many light effects including caustics, and is modular since it separates illumination into several layers [4]. The first pass traces photons from all lights through the scene and stores them, at hit points, in a so-called photon map. The second pass uses ray tracing to render the image. In the ray tracing pass the photon map is used to estimate the radiance at different locations within the scene. This is done by locating the nearest photons and performing a nearest neighbor density estimation (see Fig. 2). A detailed presentation of this method can be found in [1]. Despite its numerous qualities, photon mapping is still a slow and complex algorithm. Classical optimisations include shadow photons, direct computing and importance sampling of the indirect illumination evaluation. Shadow photons, as shown in [3], are stored in a separate photon map which is used to speed up ray-traced shadow computing. Ward et al. [2] also proposed irradiance caching, used to accelerate the indirect diffuse illumination. The key idea is to interpolate, if possible, irradiance values between several points which are relatively close and share the same orientation. Unfortunately, this optimization is itself costly and not adapted to multithreaded implementations of path tracing.
Fig. 2. The two passes of photon mapping
Advanced optimizations address the drawback of the density of the photon map. Indeed, the photon distribution is driven by lights, and for large models where only a small part is visible from the viewer, like the example scene presented in [8], it may be better to cast photons only where they are needed. Techniques like [11,12] focus on the camera to influence the direction of photons cast by lights. However these techniques, like Importons, may induce high variance in the photon map by generating high-power photons, as explained in [13] (pp. 146). Density control, introduced in [14], consists of limiting the presence of photons in bright areas, allowing more photons in darker areas. Recent algorithms [15,10], inverting the two passes of photon mapping, allow minimizing the bias, but they are very dependent on the view position and still use traditional photon mapping, inducing the same drawbacks in photon density. Nevertheless, the locality of the photon maps in [10] involves a limited density for each iteration of the progressive photon mapping.
3 Local Useful Data Density
In photon mapping, lights are responsible for the distribution of photons, and an inadequate repartition will keep the photon map from playing its role in rendering. The first pass sends a large number of photons with only slight local variations in their own data. Unfortunately, they are required to achieve sufficient visual quality, or noise can appear as shown in [1] (pp. 150). Classical optimizations, such as the density control or inversion techniques presented in Section 2, partially address this issue by restricting the number of photons using local density and visual importance. But they are very view dependent, still send a lot of photons, or assume strong variance in lighting in order to be efficient, especially for the density issue. Therefore, our first goal is to find a means of characterizing the usefulness of data in each area of the scene. Indeed, if we observe Fig. 1.B, we find a high density of photons which are locally very similar. This leads us to introduce the concept of local useful data density (LUDD) in the photon map to represent the local density of useful data for any criterion. First, for a particular criterion k, we define the local density of useful k-data, denoted by δ_k. It is defined as the standard deviation of k computed for N samples {x_kj} included in a local sphere.
$$\delta_k = \frac{\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left( x_{ki} - \frac{1}{N}\sum_{j=1}^{N} x_{kj} \right)^{2}}}{\frac{1}{N}\sum_{j=1}^{N} x_{kj}} \qquad (1)$$

The whole local density of useful data δ for K criteria is simply:

$$\delta = \frac{\sum_{k=1}^{K}\delta_k}{N} \qquad (2)$$
Criteria used to perform this measure include the surface normal, light power or direction. This measure can be extended to any additional data carried by the photon map, such as visual importance. The δ value computed in Fig. 1.C is particularly low, which implies that the added value of each photon is very small. However, in Fig. 1.E, δ is higher, which means our agent map locally has a high useful data density. Our observations show that local useful data density is not evenly distributed in the photon map and that this measure is strictly correlated to local conditions in the scene.
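Under our reading of Eqs. (1)-(2) (the per-criterion deviation normalized by the sample mean, then averaged), the LUDD of a local neighbourhood can be computed as in the following NumPy sketch, where each row of samples holds the N local values of one criterion:

import numpy as np

def local_useful_data_density(samples):
    # samples: array of shape (K, N), K criteria and N samples on a local sphere.
    samples = np.asarray(samples, dtype=float)
    K, N = samples.shape
    mean = samples.mean(axis=1)
    std = samples.std(axis=1, ddof=1)             # 1/(N-1) normalization, Eq. (1)
    delta_k = std / np.where(mean != 0.0, mean, 1.0)
    return delta_k.sum() / N                      # Eq. (2) as written divides by N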
4 Our Method Using Autonomous Lighting Agents
4.1 An Autonomous Agent Approach
To maximize the δ measure, we need to create a new structure dedicated to scene discovery instead of one carrying light flux with an evenly strong δ. We want it to be independent of human interaction, of any tessellation, and of visual importance. The ability to discover local phenomena such as geometry gradients and light variations leads us to the autonomous agent software paradigm [16]. In consequence our structure is not a monolithic entity but a swarm of agents able to perceive their environment through sensors [17] and to make behavioural decisions based on the given criteria. We define our autonomous lighting agents (ALA) as entities with a temporal persistence, coexisting in a scene with other ALA and interacting with them [18]. We use a formal description, as described in the literature [19], to represent our model mathematically. We give each agent an immutable and unique identity i. ALA are very different from photons in that they do not depend on light sources to be cast; therefore they do not carry energy by themselves. Photon behaviour is based on physical principles and stochastic methods, whereas ALA use AI techniques. An ALA has three abilities: it can be cast by an arbitrary object, it can discover the area around its location, and it can answer simple questions related to this area. The model also specifies attributes or sensors owned by agents and denoted by a. Therefore, our new ALA-improved photon mapping is organized as follows:
1. Replace the classical photon casting pass with an ALA casting pass (see 4.2)
2. Make ALA discover their environment (see 4.3)
3. Do the rendering pass using ALA decision making for the different layers (see 4.4)
Table 1. Different sensors used, with their measurement and the median value used

Sensor: Incoming irradiance i
  Measurement: integrated incoming irradiance on a hemisphere above the i sensor
  Median value σ_am: harmonic mean distance between objects [2]
Sensor: Penumbra p
  Measurement: number of intersections with surfaces along the ray between the p sensor location and the light source, divided by the distance to the light source
  Median value σ_am: 1.0
Sensor: Geometry gradient g
  Measurement: $\frac{1}{3}\sum_{k=1}^{3}\frac{1}{d_k^{2}}$, d_k being the distance to the closest intersection from the sensor to any surface in an arbitrary direction
  Median value σ_am: 1/3
Sensor: Proximity n
  Measurement: ALA density around the sensor
  Median value σ_am: number of ALA cast / (number of lights × 100)

4.2 ALA Casting
We choose to cast ALA toward the objects of the scene, uniformly from a surrounding bounding sphere. When hitting a surface, an ALA acts like both a photon and a shadow photon in the photon mapping algorithm, as it is stored and duplicated. Like a shadow photon, the obtained ALA clone passes through the surface. In consequence, ALA are present in occluded zones. Agents are stored using a kd-tree in the same way as photon maps. A visualization of the ALA network is shown in Fig. 1.D. After this stage, agents are evenly distributed in the scene at locations {x_i}.
4.3 ALA Deployment
Local discovery with sensors. Each agent can deploy sensors to gather local data about the scene and the other agents. The photon map only provides data at the discrete position of each photon, whereas our ALA deploy sensors over a local parametric discovery area. A sensor is identified by an identity j and is attached to an agent i. Each sensor also corresponds to an attribute a, so we define the value discovered by a sensor as a_ij. Each agent can deploy more than one sensor for each kind of attribute a. Sensor positions are denoted by x_aij. Finally, for each sensor a, we define a coefficient of interest σ_a, computed as the difference between the value perceived by the sensor and an arbitrarily chosen median value σ_am. The sensors used are the light sensor, penumbra sensor, geometry gradient sensor and proximity sensor. Table 1 shows the different measures and median values used for each sensor.
Gold miner decision algorithm. As we aim to achieve an optimal LUDD repartition across the scene, ALA are entrusted with the decision of choosing the quantity of data they carry. In the classical photon map, LUDD is mainly related to the local photon density. ALA strongly differ on this point, as the amount of data is not tied to the presence of the agent itself. A single agent can deploy dozens of sensors
over large distances, while many agents can choose not to use any. Each agent will choose which types of sensors are deployed, how many of them, and how far they can go. We use a decision algorithm making the agent act like a gold miner (see Alg. 1). A sensor reports its coefficient of interest σ; an interesting result will increase the stimulation factor γ of the agent. On the contrary, a non-pertinent result will decrease this factor. Stimulation factors include: a small number of sensors, a strong irradiance gradient, a high geometry gradient, and a high or low number of neighbours. Discouragement factors include a great number of sensors or the planarity of the surface. Factors are of different natures and some of them imply computing the gradient of two given criteria a_ij. Between each deployment we test whether a stimulation or discouragement factor is triggered. The agent will continue to deploy sensors and gather data until the stimulation factor drops below a certain level.
Computing the proximity network. When all ALA have been cast and all sensors are deployed, we compute the proximity network on the whole agent set. The network is a graph in which each agent is a node related to its closest neighbours in the scene space. The graph is computed quickly using the kd-tree to find the neighbours for each agent.
4.4 Rendering Pass Using ALA
Algorithm 1. Gold miner algorithm
  γ = 0.0, σ = 0.0
  for all sensor types do
    while γ ≥ 0 do
      direction = random()
      distance += step
      σ = deploy_sensor(direction, distance)
      γ += σ - (σ/10)
    end while
  end for

The ALA structure allows an efficient agent-based local discovery of the scene; the data is then used during the rendering pass. The ALA network aims to replace the photon map for all kinds of decision-making optimisations, therefore avoiding the costly generation of direct photons, indirect photons and shadow photons. We define our model as an agent-based decision maker called the oracle algorithm. Decision-making optimisations can be seen as "Should I?" questions such as "Should I trace shadow rays?" or "Should I recompute incoming irradiance?". We formalize those questions in an Artificial Intelligence procedure.
Avoiding shadow rays. The oracle algorithm answers the question: "Should I trace shadow rays?". This is used each time we want to trace a shadow ray from a particular position. Practically, the oracle will locate the closest ALA using
the ALA kd-tree and compute the difference between the sum of its penumbra sensor values p_ij (as defined in Table 1) and the median penumbra sensor value:

$$\sum_{j} p_{ij} - \sigma_{pm} > 0 \;\rightarrow\; \{YES, NO\}$$

Penumbra sensors store the positions of the lights from which they are occluded by objects. This greatly reduces the number of shadow rays in multiple-light environments. To achieve more robustness, it is possible to use the median response of the neighbour agents of i, using the proximity network.
Caching Irradiance. The irradiance sensors compute an approximate diffuse indirect illumination by sampling the hemisphere above the sensor position. Our implementation computes an approximation of the diffuse indirect illumination at a position x_ij. We implement a procedure which computes the mean distance to the closest irradiance sensors g_ij at positions x_gij and compares it to the median value of these sensors:

$$\| x_{gij} - x_{ij} \| - \sigma_{im} > 0 \;\rightarrow\; \{YES, NO\}$$

If the procedure answers negatively, the indirect illumination is computed by interpolating the irradiance values of the closest sensors. In practice, due to the homogeneous repartition of ALA, few indirect diffuse illumination values need to be computed during the rendering phase.
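A sketch of the shadow-ray oracle is given below, using a SciPy kd-tree for the nearest-agent query; the class name, data layout, and the way the per-agent penumbra sums are precomputed are our assumptions, not the authors' interface.

import numpy as np
from scipy.spatial import cKDTree

class ShadowOracle:
    # "Should I trace shadow rays?": find the closest ALA with a kd-tree and
    # compare the sum of its penumbra sensor values with the median value
    # sigma_pm (1.0 in Table 1).
    def __init__(self, agent_positions, penumbra_sums, sigma_pm=1.0):
        self.tree = cKDTree(agent_positions)        # agent_positions: (n, 3)
        self.penumbra_sums = np.asarray(penumbra_sums)
        self.sigma_pm = sigma_pm

    def should_trace_shadow_ray(self, point):
        _, idx = self.tree.query(point)             # closest agent to the shading point
        return self.penumbra_sums[idx] - self.sigma_pm > 0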
5 Results and Discussion
In this section we present results based on our implementation of the ALA structure and algorithms. For comparison we implemented a standard photon mapping algorithm with shadow photons, importance sampling and irradiance caching. Images have been rendered on a 2.67 GHz Intel Core i7 920 using four cores at a resolution of 1600×1200. The test scene shown in Fig. 4 is a common Cornell box with a mesh and a glass ball. The scene is illuminated by a spherical light source on the ceiling.

Table 2. Comparative results with ALA for the first pass, in memory occupation and casting times. 1M photons and 50k ALA are cast and stored; each ALA deploys one to twelve sensors, with 275k sensors in total.
Pass 1 (map, total time in s, memory occupation in mb):
  Photon casting (total time 46 s):
    Direct PhotonMap: 18 mb
    Indirect PhotonMap: 9.4 mb
    Caustic PhotonMap: 0.004 mb
    Shadow PhotonMap: 4.9 mb
  ALA casting (total time 27 s):
    ALA Network: 5.2 mb
    Sensor deploying: 128 mb
Table 3. Comparative results with ALA in rendering times. Images are in Fig. 5.

Pass 2 (number of photons/agents, rendering time in s):
  With Photon mapping and Irradiance caching: 1M, 21570 s
  With ALA: 50000, 16557 s
Fig. 3. Local useful data density for the indirect illumination on the ALA (left) and on the classical photon map (right)
Fig. 4. Rendering of the test scene without indirect diffuse lighting, with direct illumination and shadow photons (left), ALA-based shadows (center), and the difference multiplied by 5 between the two rendered images (right). The two renderings use photon mapping for caustics.
Fig. 5. Rendering of the test scene with standard photon mapping (left), ALA photon mapping (center), and the difference multiplied by 5 between the two rendered images (right).
Fig. 4 shows a rendering of our test scene including only direct lighting, caustics and shadows. We can see that the difference is subtle and due solely to some aliasing near the silhouette. In Fig. 5, 1M photons are cast, leading to a 4.9 Mb shadow photon map. We observe that the ALA method, while achieving faster rendering, avoids the usage of shadow photons and produces nearly identical shadows with only 50000 ALA. The indirect diffuse illumination is computed using importance sampling on the indirect photon map and irradiance caching using the direct and indirect photon maps. The ALA allow avoiding the computation of these two photon maps while providing an accurate but perfectible estimation of the indirect illumination in slightly shorter rendering times, as shown in Table 3. We observe in Fig. 3 that the local useful data density considering the irradiance criterion is uniformly low in the photon maps. This is explained by the local redundancy of the irradiance data. Our structure shows a considerably better density, as the number of sensors closely fits the redundancy of the data. This difference leads to an ALA casting phase shorter than the photon casting phase as well as a significant reduction of memory usage, as shown in Table 2.
6 Conclusions and Future Work
In this paper, we stated that photon density is not necessarily related to useful data for some light estimations. We therefore propose a new estimation of the local density of useful data, which we call LUDD for local useful data density. We then presented a new agent-based model to efficiently discover a scene, in the sense of correctly matching the LUDD. To implement the method we introduced a new structure: Autonomous Lighting Agents (ALA), which are an efficient container to handle data about the scene. We show how the ALA network can be used as an oracle or estimator to lower rendering times and reduce memory overhead in the photon mapping algorithm. The ALA concept and networks could be extended to answer other critical questions related to rendering, such as whether to halt path tracing recursion, or to handle importance sampling at each intersection with respect to the whole light distribution in the scene. The ALA network could also force photon casting in important directions, guiding light paths. We also plan to greatly improve the irradiance estimation with ALA networks to correctly replace irradiance caching. Finally, this approach of concentrating efforts on local areas, where useful data can be found, could be tested on other kinds of visualisation of 3D data or on other global illumination algorithms like Metropolis Light Transport.
Acknowledgements We would like to thank the DuranDuboi R&D team for their support and for providing the rendering infrastructure.
References 1. Jensen, H.W.: Realistic Image Synthesis Using Photon Mapping. A K Peters, Wellesley (2001) 2. Ward, G., Rubinstein, F., Clear, R.: A ray tracing solution for diffuse interreflection. In: Proceedings of the 15th annual conference on Computer graphics and interactive techniques, pp. 85–92 (1988) 3. Jensen, H., Christensen, N.: Efficiently Rendering Shadows using the Photon Map. In: Compugraphics 1995, pp. 285–291 (1995) 4. Jensen, H.: Importance driven path tracing using the photon map. Rendering Techniques 95, 326–335 (1995) 5. Jensen, H.: Global Illumination using Photon Maps. In: Rendering Techniques 1996, pp. 21–30 (1996) 6. Goral, C., Torrance, K., Greenberg, D., Battaile, B.: Modeling the interaction of light between diffuse surfaces. In: Proceedings of SIGGRAPH 1984, Computer Graphics, vol. 18(3), pp. 213–222 (1984) 7. Keller, A.: Instant radiosity. In: SIGGRAPH 1997: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pp. 49–56 (1997) 8. Veach, E., Guibas, L.J.: Metropolis light transport. In: SIGGRAPH 1997: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pp. 65–76 (1997) 9. Segovia, B., Iehl, J.C., P´eroche, B.: Metropolis Instant Radiosity. Computer Graphics Forum 26, 425–434 (2007) 10. Hachisuka, T., Ogaki, S., Jensen, H.W.: Progressive photon mapping. In: SIGGRAPH Asia 2008: ACM SIGGRAPH Asia 2008 papers, pp. 1–8. ACM, New York (2008) 11. Pietrek, I.: Importance driven construction of photon maps. In: Rendering Techniques 1998: Proceedings of the Eurographics Workshop in Vienna, Austria, June 29-July 1, p. 269. Springer, Wien (1998) 12. Fan, S., Chenney, S., Chi Lai, Y.: Metropolis photon sampling with optional user guidance. In: Rendering Techniques 2005 (Proceedings of the 16th Eurographics Symposium on Rendering), pp. 127–138. Eurographics Association (2005) 13. Christensen, P.: Faster photon map global illumination. Graphics Tools: The Jgt Editors’ Choice, 241 (1995) 14. Suykens, F., Willems, Y.: Density control for photon maps. Rendering techniques 2000, 23–34 (2000) 15. Havran, V., Herzog, R., Seidel, H.P.: Fast final gathering via reverse photon mapping. Computer Graphics Forum (Proceedings of Eurographics 2005) 24, 323–333 (2005) 16. Maes, P.: Designing autonomous agents: theory and practice from biology to engineering and back. MIT Press, Cambridge (1990) 17. Maes, P.: Modeling adaptive autonomous agents. Artificial Life 1, 135–162 (1994) 18. Treuil, J.P., Drogoul, A., Zucker, J.D.: Mod´elisation et simulation ` a base d’agents. Dunod (2008) 19. Lerman, K., Galstyan, A., Martinoli, A., Ijspeert, A.: A macroscopic analytical model of collaboration in distributed robotic systems. Artificial Life 7, 375–393 (2001)
Data Vases: 2D and 3D Plots for Visualizing Multiple Time Series Sidharth Thakur and Theresa-Marie Rhyne Renaissance Computing Institute, North Carolina, USA
[email protected],
[email protected]
Abstract. One challenge associated with the visualization of time-dependent data is to develop graphical representations that are effective for exploring multiple time-varying quantities. Many existing solutions are limited either because they are primarily applicable for visualizing nonnegative values or because they sacrifice the display of overall trends in favor of value-based comparisons. We present a two-dimensional representation we call Data Vases that yields a compact pictorial display of a large number of numeric values varying over time. Our method is based on an intuitive and flexible but less widely-used display technique called a “kite diagram.” We show how our interactive two-dimensional method, while not limited to time-dependent problems, effectively uses shape and color for investigating temporal data. In addition, we extended our method to three dimensions for visualizing time-dependent data on cartographic maps.
1
Introduction
In this paper we address challenges associated with the graphical representation and exploration of multiple time-dependent quantities. Our motivation is to support visual analytic tasks of exploring data such as census records that can have a large number of interesting correlations and trends in the temporal domain. Some specific challenges and issues addressed in our work are:
• Displaying several time-dependent or time-varying quantities simultaneously without causing overplotting.
• Developing effective graphical representations that engage a human user's visual and cognitive abilities to detect interesting temporal patterns of changes and quickly get overviews of data having multiple time-varying quantities.
• Census records and many other temporal data often contain a geo-spatial context such as cartographic maps. A challenge is how to expose patterns in the temporal and spatial domains while maintaining the ability to inspect several time-varying quantities.
We present a two-dimensional graphical arrangement for displaying multiple time-varying numeric quantities that avoids the problem of overplotting. Our approach is based on a two-dimensional plot called a "kite diagram." Our method creates what we call Data Vases: interesting and intuitive graphical patterns
of time-varying data. Data vases can be used in many analysis tasks such as quick comparison of global and local time-varying patterns across many data sets, identification of outliers, and exploration of data with multiple levels of temporal granularity. We begin the remainder of this paper with a discussion in Section 2 on background and related work. In Section 3 we describe kite diagrams and their applications. Sections 4 and 5 describe our visualization methods in two and three dimensions. In Section 6 we conclude the paper with a discussion of our approaches and directions for future work.
2
Background and Related Work
Time constitutes an inherent and often a principal independent quantity in many data sets and possesses unique characteristics compared to the other fundamental data entities, namely space and populations [1,2]. Although many effective visualization techniques and interactive methods have been developed for exploring time-dependent data [3,4,5,6,7,8], some important challenges remain. For example, a common problem with many existing methods (e.g., line graphs) is overplotting, as shown in Figure 1(a). Another important limitation of many techniques available for visualizing multiple time series is that few of them can effectively display positive and negative data values. Among the existing methods for visualizing multiple time-varying data is an effective two-dimensional representation called ThemeRiver [9], in which the time-dependent quantities are displayed using smooth area-filled and layered profiles to create aesthetically pleasing "currents" representing the data stream. However, the ThemeRiver metaphor is mostly limited to the visualization of nonnegative data values. On the other hand, a different approach, namely Horizon Graphs [10], can handle negative and positive data values by effectively exploiting layering and color-coding to create dense, space-conserving visualizations. However, the method achieves its efficient spatial layout by sacrificing the ability to compare overall profiles of the time-varying quantities. Another interesting time series visualization is called wiggle traces [11] and employs individual line traces or "wiggles" to plot the profiles of seismic waves obtained during the exploration of sub-surface strata. However, this technique cannot be generalized because it is difficult, in general, to visually compare multiple line graphs that do not share the same set of coordinate axes. We address some of the challenges in visualizing multiple time series using a method that is an evolution of two-dimensional charts called kite diagrams [12]. Kite diagrams are useful for plotting simple statistical data but have not been exploited for displaying more complex data. We apply our methods for visualizing census-related data, which often contain a huge set of individual time series and can have potentially many interesting correlations among the recorded social and economic indicators that may be studied. In our work we have also explored three-dimensional visualizations to investigate potentially interesting spatial relationships in census data that have an inherent geo-spatial context (e.g., census tracts, counties, and states). Our visualizations
Fig. 1. (a) A dense plot of multiple time series. (b) Illustration of the steps involved in the creation of a kite diagram of a single data series shown in step 1.
exploit a standard three-dimensional geo-spatial representation called the space-time cube [13], in which spatial data are plotted in the X − Y "ground" plane and temporal data are plotted along a vertical Z axis. This space-time cube metaphor has been used to generate effective visualizations of spatial-temporal data in some previous works [14,15,16]. However, three-dimensional information visualizations are generally challenging due to the problem of inter-object occlusion. We employ user-driven data filtering to reduce clutter, though some other techniques are also available for overcoming the occlusion-related problem [17]. We begin the discussion on our visualizations of multiple time series by discussing a two-dimensional layout based on kite diagrams.
3
Kite Diagrams
A kite diagram is a two-dimensional data representation technique that employs closed, symmetric glyphs (graphical widgets) to represent simple quantitative data [12]. Figure 1(b) illustrates the construction of a kite diagram of a line graph profile shown in step 1 in the figure. The underlying motivation in using kite diagrams is that for small data sets the differences in the shapes of the kites can reveal differences in the trends and values of one or more data series. An effective application of kite diagrams is shown in Figure 2, where kite diagrams have been used along with a standard tree view for visualizing species diversification during the Mesozoic era [18]. The horizontal temporal axes in the kite diagrams correspond to the different geologic periods of the Mesozoic era, and the thickness of the kites along the vertical axis indicates the estimated populations during the corresponding periods. Colors of the kite shapes pertain to the different mammalian orders or families shown in the phylogenetic tree. Discussion. Kite diagrams are easy-to-create and straightforward representations of simple time-series data sets. However, to visualize complex data using kite diagrams, the following limitations need to be addressed:
• Kite diagrams are limited to the display of positive data values and there is no option to encode negative data values or missing data.
• Kite diagrams are suitable for comparing "gross" values of variables and attributes; however, detailed analysis that involves the comparison of exact values across multiple charts is tedious.
Fig. 2. Species distribution and diversification in Mesozoic Era (courtesy of [18]) shown using (left) phylogenetic tree, and (right) kite diagrams
• Kite diagrams are useful mostly for exposing large differences between and within different data series; using kite charts it is difficult to compare adjacent values that vary only slightly.
We next discuss a two-dimensional approach for visualizing time-dependent data that exploits the useful characteristics of kite diagrams and avoids some of their limitations. Although in this work we consider primarily time-varying data, our methods are also applicable for visualizing other types of multi-variate data that may not involve a temporal domain.
4
Data Vases: Display of Multiple Time Series
We present an approach for visualizing multiple time series that combines kite diagrams and standard visualization techniques to create information-rich and interesting glyph-based representations of the data.¹ At the very least, our approach generates dense, space-filling representations of time-dependent data by plotting kite diagrams corresponding to multiple time series and using markers to highlight unusual data values such as missing data. An example based on our approach is shown in Figure 3, which shows in a single view data vases for a hundred crime-related time series.² Our technique avoids overplotting (compared to standard methods such as line graphs shown in Figure 1(a)) and allows encoding of additional interesting characteristics in a given data set. Our approach exploits many of the salient perceptual organizing principles of graphical representations available in kite diagrams such as bilateral symmetry, closure of shapes, and distinction between figure and ground [20]. Our visualizations are intended to engage an observer's visual perceptual capabilities for detecting and pre-attentively processing interesting patterns in the emergent data vase shapes [21].
¹ The resulting shapes in our approach look like profiles of flower vases; we therefore use the term data vases to refer to our representation of time series data.
² The design choice pertaining to the alignment of the temporal axis is governed primarily by the principle of creating "space-filling" representations of the data [19]. The temporal axes in data vases might easily be swapped to have a horizontal alignment in the case of data with a larger number of time steps relative to other data dimensions.
Fig. 3. Kite diagram-based Data Vases showing the time series of crime rates (numbers) in North Carolina’s (USA) 100 counties during 1980-2005. Individual kite diagrams are arranged alphabetically from left to right based on county names and colored based on geographic regions in North Carolina (see the map in figure 4).
Fig. 4. Data vases corresponding to net migration rates for North Carolina’s 100 counties using (a) interpolated profiles, and (b) discrete or stepped profiles
The data vase chart shown in Figure 3 provides a useful tool for the simultaneous comparison of multiple time series; however, this straightforward representation is not sufficient for performing more complex data analysis tasks. We next present some enhancements to our data vase charts that exploit standard visualization techniques for exploring and representing data.
Interpolated Versus Discrete Profiles of Data Vases. The data vases shown in Figure 3 are constructed using profiles of line graphs, which employ a linear interpolation between consecutive time steps to represent a continuous data series. However, in many data sets, such as census surveys, the quantities may not vary linearly between time steps. We therefore provide an option to use profiles of bar graphs or histograms to generate "discrete" representations of the data. Figure 4 shows the two versions of data vases, based on interpolated and discrete profiles.
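To make the construction concrete, the following is a minimal sketch of how a single data vase with either profile type could be drawn. It is our own illustration, not the authors' implementation; the helper names draw_vase, center and width_scale are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

def draw_vase(ax, values, center=0.0, width_scale=1.0, stepped=False):
    """Mirror |values| symmetrically about a vertical axis at x = center."""
    t = np.arange(len(values), dtype=float)
    half = width_scale * np.abs(values) / 2.0
    if stepped:
        # repeat samples so the profile is piecewise constant per time step
        t = np.repeat(t, 2)[1:]
        half = np.repeat(half, 2)[:-1]
    ax.fill_betweenx(t, center - half, center + half, color="steelblue", alpha=0.8)

values = np.array([3.0, 5.0, 2.0, -4.0, 6.0, 1.0])
fig, ax = plt.subplots()
draw_vase(ax, values)                              # interpolated profile
draw_vase(ax, values, center=4.0, stepped=True)    # discrete (stepped) profile
ax.set_xlabel("mirrored value")
ax.set_ylabel("time step")
plt.show()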
Fig. 5. Data vases showing US oil imports from different countries (1970 and 2007)
Fig. 6. Data filtering with data vases using simple techniques like range selection to show (left) negative data values, and (right) positive data values.
Color Coding. The symmetric shapes of data vases are limited to conveying the absolute values of time-dependent variables. We exploit different color coding schemes to improve differentiation of the data values and to represent additional information such as negative data values. An example is shown in Figure 4, where two different color hues are used for encoding positive values (blue-green scheme) and negative values (orange-red scheme). Figure 5 illustrates a different coloring scheme that is based on a discretized or segmented color palette, which can reveal interesting patterns in data such as time ranges corresponding to major changes and time periods over which different values persisted. Other useful color coding schemes that are sensitive to statistical properties in data are also available [22].
Data Vases and Data Filtering. Data filtering is an indispensable analytic tool in any type of visualization for investigating dense data. Some standard data filtering methods include user-driven dynamic queries that employ straightforward graphical widgets like sliders and range selectors [23]. These and other data filtering tools can be used effectively with data vases for exploring time-varying data and to answer analytic questions such as when certain data values of interest appear in the data. An example is shown in Figure 6, where the data vases corresponding to estimated net migration rates for the counties in North Carolina (USA) have been filtered to show either the negative or the positive migration rates.
Exploration of Different Levels of Granularity. An important characteristic in some time-varying data is the different levels of granularity of the temporal domain [2] (e.g., months and years in census data). In our data vase approach multiple levels of temporal granularity can be displayed by summarizing data values by their averages over the higher granularity levels.
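As a small illustration of this aggregation step (our own sketch, with a hypothetical file name and column names), monthly values can be collapsed to yearly averages before they are shown as the coarser-granularity vases:

import pandas as pd

# hypothetical input: one row per month with an unemployment rate;
# a county column could be added to the grouping for per-county vases
monthly = pd.read_csv("unemployment.csv", parse_dates=["month"]).set_index("month")
yearly = monthly["rate"].groupby(monthly.index.year).mean()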
Fig. 7. Charts showing vases corresponding to different levels of temporal granularity in a data set: (background) unemployment rates averaged for each year, and (foreground) all time series expanded to show monthly unemployment rates
An example is shown in Figure 7, where two levels of granularity in a data set are shown in two different charts. In addition, interactive methods are used in our approach to collapse and expand the different temporal levels, which affords exploration of the data at multiple levels of detail.
Interaction with Data Vases. Interactive tools can be particularly effective for exploring dense data sets using data vases. For example, in a prototype implementation of our methods, region-wise patterns in census data can be explored by rearranging the glyphs on the horizontal axis according to the different regions in the data (Figure 4). The glyphs may also be sorted to highlight data vases with large averages over the entire time series or for particular time steps. Other useful options include switching between data vases based on raw values and per-capita values, and interactively changing the width of the glyphs to reduce overlaps.
5
Display of Multiple Time Series on Maps
Many time-dependent data often also involve an inherent geo-spatial context. For example, census survey data are usually associated with geographic regions like census tracts, districts, counties, and states. We present an approach using three-dimensional versions of data vases for exploring spatial relationships in data with multiple time series. We create the 3D visualizations of data series for different geographic regions by stacking polygonal disks for each time step along a corresponding vertical temporal axis. In this representation, time increases from bottom to top and the value of a variable at each time step is encoded by the width of the corresponding disk. The disks are color coded using the coloring schemes discussed in Section 4 for better discrimination of data values. We employ orthographic projections in our visualizations to avoid the distortion of the 3D shapes and to preserve the relative sizes of the disks in different rotationally-transformed views. Figure 8 shows a 3D visualization of the monthly unemployment rates for North Carolina’s (USA) 100 counties. The visualization employs color coding and interactive methods like rotation and zooming to reveal interesting patterns in space and time. For example, closer inspection of the two right-most 3D vases in figure 8(a) reveals highly fluctuating unemployment rates, which might be due to the changes in the employment rates during different agricultural seasons.
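A minimal sketch of this stacked-disk construction for a single region follows; it is our own illustration (with a made-up series), not the authors' implementation. One disk outline is drawn per time step, with the radius proportional to the value, under an orthographic projection.

import numpy as np
import matplotlib.pyplot as plt

values = np.array([2.0, 3.5, 1.0, 4.0, 2.5])     # one region's time series
theta = np.linspace(0.0, 2.0 * np.pi, 64)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.set_proj_type("ortho")                        # orthographic projection
for t, v in enumerate(values):
    r = abs(v)                                   # disk width encodes the value
    ax.plot(r * np.cos(theta), r * np.sin(theta), zs=float(t),
            color=plt.cm.viridis(v / values.max()))
ax.set_zlabel("time step")
plt.show()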
Fig. 8. 3D data vase diagrams of a dense data set. (a) Vases corresponding to unemployment rates in North Carolina’s 100 counties from January 1999 to December 2008. (b) A filtered view of the data corresponding to a user-selected range (inset).
One problem in our approach is due to occlusion (see figure 8(a)), which makes it difficult to compare the 3D shapes of the data values. Some effective methods for reducing occlusion in 3D visualizations have been suggested in [17]. We improve the readability of the 3D data glyphs by reducing the number of data elements displayed using a data-filtering mechanism (see figure 8(b)). The problem due to occlusion can also be overcome to some extent using interactive camera control (e.g., using rotation, panning, and zooming).
6
Discussion and Conclusion
In this paper we have highlighted a visualization technique for creating engaging and informative displays of multiple time series that is based on an intuitive two-dimensional graphical plot called the "kite diagram." Although we have primarily considered historic data (i.e., recorded census data), our graphical representations can be adapted for visualizing streaming data (e.g., network traffic). An important consideration in the generation of the vase shapes pertains to the distribution of values in a data set: data with high standard deviations often result in vases of highly varying widths. For example, in Figure 5, which shows the USA's oil imports in millions of barrels from different countries, the vases corresponding to small import values appear strongly shrunken. A standard solution to generate more homogeneous shapes might be to rescale the data values using a log scale. Another option in domains like census surveys is to use per-capita values, which are sometimes more meaningful and often also eliminate large differences in data values. Another important issue in our representations is that it can be difficult to compare the profiles of vase shapes that are far apart in the chart. Some possible solutions might be to interactively rearrange the locations of the vases on the chart or to selectively compare up to a few glyphs in a separate window. To discuss the different exploratory tasks supported by our approach we turn to a comprehensive framework in [24] that introduces a systematic and functional description of data and tasks. A distilled description of the framework, particularly its task topology, has been presented in [22]. The first and basic tasks in the task topology are elementary tasks and involve the determination of the values of dependent variables when the values of independent variables have been specified (and vice versa). For example, in the
data vase charts in Figure 5 showing the USA's crude oil imports, an analyst can pose and answer questions like "How much crude oil did the US import from Canada in 2007?," and "When did the highest value of crude oil imports occur?" Other types of elementary tasks involve the investigation of relationships between independent and dependent variables, for example, "Compare import rates of crude oil between OPEC and non-OPEC countries." A different and more complex set of tasks are synoptic tasks, which, unlike elementary tasks, involve exploring relationships between and within the entire sets of dependent and independent quantities in the data. In synoptic tasks, concrete patterns are specified and a goal is to find the sets of values of the dependent and independent variables that exhibit the target patterns. Synoptic tasks are generally considered more important because they can expose the general "behavior" of a phenomenon or a system. Some synoptic tasks can be specified using the data vase approach; for example, in Figure 5 some synoptic tasks are "From 1980 to 2000 how did crude oil imports vary?," or "During what time interval(s) did the crude oil imports change from decreasing to increasing?" Synoptic tasks often require queries that involve complex patterns specified over multiple independent variables. For example, a hypothetical task of moderate complexity pertaining to census data can be: "Find the time interval(s) when poverty rates in Western North Carolina were decreasing and per capita income was increasing." Data vases in their current form are limited to the exploration of analytical queries that combine only up to a few data variables; we therefore need to adapt our approach for exploring complex data sets with multiple variables. We conducted an informal discussion session within our organization to assess our data vase-based approaches using snapshots of our visualizations in a web-based format. As future work, we would like to evaluate our methods using a formal comparison with other standard methods for representing time-varying data. Another interesting direction to pursue is to experiment with non-geographic maps (e.g., tree maps) in combination with our 3D versions of data vases.
Acknowledgment This work was conducted at the Renaissance Computing Institute’s Engagement Facility at North Carolina State University (NCSU). Data vases grew out of a visualization framework that was developed with NCSU’s Institute for Emerging Issues. We thank Steve Chall and Chris Williams for their contributions.
References 1. M¨ uller, W., Schumann, H.: Visualization methods for time-dependent data - an overview. In: Chick, S., Sanchez, P., Ferrin, D., Morrice, D. (eds.) Proc. of Winter Simulation 2003 (2003) 2. Aigner, W., Bertone, A., Miksch, S., Tominski, C., Schumann, H.: Towards a conceptual framework for visual analytics of time and time-oriented data. In: WSC 2007: Proceedings of the 39th conference on Winter simulation, Piscataway, NJ, USA, pp. 721–729. IEEE Press, Los Alamitos (2007)
3. Roddick, J.F., Spiliopoulou, M.: A bibliography of temporal, spatial and spatiotemporal data mining research. SIGKDD Explor. Newsl. 1, 34–38 (1999) 4. Aigner, W., Miksch, S., M¨ uller, W., Schumann, H., Tominski, C.: Visual methods for analyzing time-oriented data. IEEE TVCG 14, 47–60 (2008) 5. Hochheiser, H., Shneiderman, B.: Dynamic query tools for time series data sets: timebox widgets for interactive exploration. Info. Vis. 3, 1–18 (2004) 6. Berry, L., Munzner, T.: Binx: Dynamic exploration of time series datasets across aggregation levels. In: IEEE InfoVIS, Washington, DC, USA. IEEE Computer Society, Los Alamitos (2004) 7. Peng, R.: A method for visualizing multivariate time series data. Journal of Statistical Software, Code Snippets 25, 1–17 (2008) 8. Hao, M.C., Dayal, U., Keim, D.A., Schreck, T.: Multi-resolution techniques for visual exploration of large time-series data. In: EuroVis 2007, pp. 27–34 (2007) 9. Havre, S., Hetzler, B., Nowell, L.: Themeriver (tm). In search of trends, patterns, and relationships (1999) 10. Heer, J., Kong, N., Agrawala, M.: Sizing the horizon: The effects of chart size and layering on the graphical perception of time series visualizations. In: CHI 2009, Boston, MA, USA (2009) 11. Emery, D., Myers, K. (eds.): Sequence Stratigraphy. Blackwell Publishing, Malden (1996) 12. Sheppard, C.R.C.: Species and community changes along environmental and pollution gradients. Marine Pollution Bulletin 30, 504–514 (1995) 13. Kraak, M.: The space-time cube revisited from a geovisualization perspective. In: Proc. 21st Intl. Cartographic Conf., pp. 1988–1995 (2003) 14. Eccles, R., Kapler, T., Harper, R., Wright, W.: Stories in geotime. In: VAST 2007. Visual Analytics Science and Technology, pp. 19–26 (2007) 15. Tominski, C., Schulze-Wollgast, P., Schumann, H.: 3d information visualization for time dependent data on maps. In: IV 2005: Proceedings of the 9th Intl. Conf. on Info. Vis., Washington, DC, USA, pp. 175–181. IEEE Computer Society, Los Alamitos (2005) 16. Dwyer, T., Eades, P.: Visualising a fund manager flow graph with columns and worms. International Conference on Information Visualisation, 147 (2002) 17. Elmqvist, N., Tsigas, P.: A taxonomy of 3d occlusion management for visualization. IEEE Transactions on Visualization and Computer Graphics 14, 1095–1109 (2008) 18. Luo, Z.X.: Transformation and diversification in early mammal evolution. Nature 450, 1011–1019 (2007) 19. Tufte, E.R.: The visual display of quantitative information. Graphics Press, Cheshire (1986) 20. Ware, C.: Information Visualization: Perception for Design. Morgan Kaufmann Publishers Inc., San Francisco (2004) 21. Healey, C.G., Booth, K.S., Enns, J.T.: Visualizing real-time multivariate data using preattentive processing. ACM Trans. Model. Comput. Simul. 5, 190–221 (1995) 22. Tominski, C., Fuchs, G., Schumann, H.: Task-driven color coding. In: Intl. Conf. Info. Vis., Washington, DC, USA, pp. 373–380. IEEE Computer Society, Los Alamitos (2008) 23. Shneiderman, B.: Dynamic queries for visual information seeking. IEEE Software 11, 70–77 (1994) 24. Andrienko, N., Andrienko, G.: Exploratory Analysis of Spatial and Temporal Data: A Systematic Approach. Springer, Heidelberg (2005)
A Morse-Theory Based Method for Segmentation of Triangulated Freeform Surfaces Jun Wang and Zeyun Yu* Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI 53211
[email protected]
Abstract. This paper presents a new algorithm for segmentation of triangulated freeform surfaces using geometric quantities and Morse theory. The method consists of two steps: initial segmentation and refinement. First, the differential geometry quantities are estimated on triangular meshes, with which the meshes are classified into different surface types. The initial segmentation is obtained by grouping the topologically adjacent meshes with the same surface types based on region growing. The critical points of triangular meshes are then extracted with Morse theory and utilized to further determine the boundaries of initial segments. Finally, the region growing process starting from each critical point is performed to achieve a refined segmentation. The experimental results on several 3D models demonstrate the effectiveness and usefulness of this segmentation method.
1 Introduction Characterized by such advantages as simple representation, fast rendering, and accelerated visualization, triangular meshes are widely used for 3D surface modeling in computer vision, computer graphics and geometric modeling, as they provide good approximations of real-world objects. However, since the mesh for a 3D surface model is triangulated as a whole, it does not have explicit higher-level structures that help us understand the semantics of the model. To remedy this problem, a process typically referred to as mesh segmentation is performed to decompose the whole mesh into a union of connected, non-overlapping regions with locally meaningful shapes. Mesh segmentation has become a necessary ingredient in many research problems in geometric modeling and computer graphics and their applications. Examples include mesh editing [1], surface parameterization [2], model reconstruction [3], model simplification [4] and compression [5], skeleton extraction [6], texture mapping [7], and so on. Owing to its rich applications, mesh segmentation has been studied by many researchers, and numerous techniques have been developed in various application contexts. Various criteria and methods of surface segmentation have been summarized in detail by Shamir [8] and Agathos et al. [9]. Srinark and Kambhamettu [10] developed a surface mesh segmentation method by defining four types of segments: i) peak-type, ii) pit-type, iii) minimal surface-type,
* Corresponding author.
and iv) flat type, according to the Gaussian curvature at each vertex of the surface mesh. This method often leads to insufficient segmentation, known as under-segmentation. Mangan and Whitaker [11] generalized the watershed technique to arbitrary meshes by using the Gaussian curvature at each mesh vertex as the height field. With this method, some segment boundaries are jagged and over-segmentation occurs occasionally. Besl and Jain [12] adopted the region growing technique to partition a large class of range images into regions of arbitrary shapes. They initially labeled the data points using the mean and Gaussian curvatures and, using this labeling, constructed seed regions to initiate the region growing. This algorithm is sensitive to data noise and threshold selection. Natarajan et al. [13] designed an inspiring method for segmenting a molecular surface with Morse-Smale complex theory. This algorithm performs well for closed molecular surfaces with many critical points. However, the height function is adopted to extract critical points, which makes it sensitive to the orientation of the surface. To overcome the above problems, such as under- or over-segmentation and inconvenient human intervention, we propose a new algorithm for freeform surface segmentation with triangular meshes. In this algorithm, a hierarchical refinement scheme is adopted. The initial segmentation is performed based on curvature analysis, and then a further step is carried out to obtain the refined segmentation with Morse theory. In the first step, the curvatures for each vertex and each triangular face are estimated, and each triangular face is labeled according to the fundamental surface types. In the second step, critical points are extracted by designing an appropriate scalar function over the triangular mesh and analyzing its local properties, and then the boundaries passing through saddle points are determined using steepest ascent/descent techniques. Finally, the refined decomposition of triangulated freeform surfaces is achieved with the region growing process.
2 Curvature-Based Initial Segmentation The initial segmentation is the basis of our entire algorithm and is realized by performing curvature labeling and region growing. Gaussian curvature and mean curvature are first estimated and utilized to label the surface type of each triangular mesh, and then the region growing process is carried out to group the adjacent meshes with the same surface type into a unique surface patch, which produces the initial segmentation. 2.1 Curvature Labeling Curvature estimation techniques for triangular meshes have been broadly studied in computer graphics, computer vision and geometric modeling applications. We adopt the continuous method to estimate the Gaussian curvature (K) and mean curvature (H), in which the k-nearest neighbors [14] of each vertex in the mesh are found and fitted with an analytical quadric surface, and the curvatures are then estimated from the first and second fundamental forms of the surface [15]. After the Gaussian and mean curvatures are obtained, all the mesh vertices and faces are labeled with the corresponding surface type. In particular, the signs of K and H define the surface type and the values of K and H define the surface sharpness. Besl
and Jain [12] proposed the eight fundamental surface types using the signs of K and H, as shown in Table 1. In our algorithm, three general types are considered: (i) Convex type (H < 0), (ii) Plane type (H = 0), and (iii) Concave type (H > 0). From Table 1, we know that the convex type includes the peak, ridge and saddle ridge types; the plane type consists of the flat and minimal surface types; and the concave type contains the pit, valley and saddle valley types.

Table 1. Eight fundamental surface types

            K > 0    K = 0      K < 0              General Type
  H < 0     Peak     Ridge      Saddle ridge       Convex
  H = 0     N/A      Flat       Minimal surface    Plane
  H > 0     Pit      Valley     Saddle valley      Concave
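The curvature estimation and labeling step can be sketched as follows. This is a minimal illustration under our own assumptions (a local Monge patch z = f(x, y) fitted by least squares in a frame centered at the vertex, and a small tolerance eps for the plane label; the sign of H depends on the chosen normal orientation); it is not the authors' implementation.

import numpy as np

def fit_quadric(points):
    """Least-squares fit of z = a*x^2 + b*x*y + c*y^2 + d*x + e*y + f
    to the k-nearest neighbors, given as (x, y, z) rows in a local frame."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    A = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs  # a, b, c, d, e, f

def curvatures_at_origin(coeffs):
    """Gaussian and mean curvature of z = f(x, y) at the fitted vertex (origin)."""
    a, b, c, d, e, _ = coeffs
    fx, fy = d, e                      # first derivatives at (0, 0)
    fxx, fxy, fyy = 2 * a, b, 2 * c    # second derivatives
    denom = 1.0 + fx * fx + fy * fy
    K = (fxx * fyy - fxy * fxy) / (denom * denom)
    H = ((1 + fy * fy) * fxx - 2 * fx * fy * fxy + (1 + fx * fx) * fyy) / (2.0 * denom ** 1.5)
    return K, H

def general_type(H, eps=1e-6):
    """Label used for the initial segmentation: convex, plane or concave."""
    if H < -eps:
        return "convex"
    if H > eps:
        return "concave"
    return "plane"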
2.2 Region Growing Once the entire triangular mesh has been labeled, the topologically adjacent triangles with the same surface type are grouped into a single segment. The region growing technique is adopted for this purpose. First, a triangle that has not been assigned to any segment is selected as a seed triangle and assigned to the current segment. Then, the surface type of each of the neighboring triangles, if not assigned to any segment, is checked and, if its surface type is the same as that of the seed triangle, the neighboring triangle is added to the current segment and updated as the seed triangle. These two steps are repeated until all triangles on the surface have been assigned to a segment. The pseudocode for the initial segmentation algorithm with the region growing process is given in Fig. 1, in which TriFaceArray stores all triangles on the surface, and Segment(id) stores the triangles of the id-th segment.

InitialSegmentation( )
    Initialize the segment number seg_id of each triangle in TriFaceArray to -1,
        and the current segment number id to 0.
    Label the surface type (convex, plane or concave) of each triangle in TriFaceArray.
    for each triangle tri_i in TriFaceArray:
        if seg_id of tri_i != -1: continue
        /* start of region growing process */
        set seg_id of tri_i to id
        set Segment(id) to NULL, then add tri_i into Segment(id)
        for each triangle tri_j in Segment(id):
            for each triangle tri_k ∈ NT_j, the set of adjacent triangles of tri_j:
                if the surface type st_k of tri_k is the same as that of tri_j
                        and seg_id of tri_k == -1:
                    set seg_id of tri_k to id and add tri_k into Segment(id)
        /* end of region growing process */
        id++

Fig. 1. Pseudocode for the initial segmentation algorithm
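For reference, a runnable counterpart of this pseudocode could look as follows. It is our own sketch with hypothetical data structures (a per-triangle label list and an adjacency list), not the authors' code.

from collections import deque

def initial_segmentation(labels, adjacency):
    """labels: surface-type label per triangle; adjacency: list of lists of
    adjacent triangle indices.  Returns a segment id per triangle."""
    seg_id = [-1] * len(labels)
    current = 0
    for seed in range(len(labels)):
        if seg_id[seed] != -1:
            continue
        seg_id[seed] = current
        queue = deque([seed])
        while queue:
            tri = queue.popleft()
            for nb in adjacency[tri]:
                if seg_id[nb] == -1 and labels[nb] == labels[tri]:
                    seg_id[nb] = current
                    queue.append(nb)
        current += 1
    return seg_id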
Fig. 2 shows the initial segmentation results for two surfaces with triangular meshes. Fig. 2(a) and 2(d) give the input triangular meshes. Fig. 2(b), 2(e) show curvature labeling result for each surface, in which the convex, plane and concave type surfaces are represented in red, green, and blue, respectively. The initial segmentation results are demonstrated in Fig. 2(c), 2(f), where the segments are marked by different colors.
Fig. 2. Initial segmentation of triangulated surfaces: (a, d) triangular meshes; (b, e) curvature labeling; (c, f) region growing
3 Morse Theory-Based Refinement of Segmentation It can be observed that the initial segmentation based on curvature labeling is coarse, i.e., there is more than one surface patch included in a single segment. To refine the segmentation, the critical points, including local minima, local maxima and saddle points, are extracted from the triangulated surface. For each initial segment containing saddle points, the boundaries passing through the saddle points are determined using the steepest ascent/descent strategy. Then, the region growing process is initiated from each extreme point to obtain the refined segmentation. 3.1 Critical Point Extraction In our algorithm, Morse theory is used to extract the critical points of a triangulated surface. In differential topology, Morse theory provides a direct way of analyzing the topology of a manifold by studying differentiable functions on the manifold [16]. Let M2 be a closed 2-manifold surface and f: M2→R a real valued smooth function, a point p is called a critical point of the function f if the gradient at p is zero: f(p) = 0; otherwise it is called regular point. A critical point p of f is called non-degenerate if the Hessian matrix of f at p is non-singular; otherwise it is called degenerate. The examples of non-degenerate critical points are local minimal point, maximal point and saddle point. Banchoff [17] extended the Morse theory to triangular meshes of closed
▽
manifolds. In a triangulated 2-manifold surface, the piecewise linear function f is defined over the vertices of the mesh and calculated by linear interpolation across the edges and triangular patches of the mesh. To illustrate the basic idea, some definitions are given below. The star star(v) of a vertex v comprises all the triangles that share v; the link link(v) of v consists of all the edges in the star of v excluding those containing v; the lower link link−(v) of v is defined as the edges whose vertices have smaller function values than that of v. Similarly, the higher link link+(v) of v contains the edges whose vertices have larger function values than that of v. The mixed link link±(v) of v includes the edges where some vertices have smaller function values than that of v and others have larger function values. A vertex on a closed manifold can be classified as a critical point of a certain type or as a regular point by means of the link properties of the vertex. The corresponding criteria are:
(1) Minima: Cardinality(link−(v)) = 0
(2) Saddle: Cardinality(link±(v)) = 2 + 2m (m > 0)
(3) Maxima: Cardinality(link+(v)) = 0
(4) Regular: Cardinality(link±(v)) = 2
where Cardinality(S) measures the number of elements in the set S, and m is the multiplicity of the saddle point. Morse theory can also be extended to manifolds with boundaries by considering a collection of sub-manifolds [18]. This extension is called Stratified Morse theory, in which the most notable restriction is that only extreme points, i.e. no saddle points, of the Morse function are allowed on the boundaries of manifolds. With this restriction, the type of a boundary point p can be determined with the following criteria:
(1) Minima: f(p) < f(q), q ∈ Nghbr(p)
(2) Maxima: f(p) > f(q), q ∈ Nghbr(p)
(3) Regular: otherwise,
where Nghbr(p) is the set of neighboring vertices of p.

Table 2. The link of critical points of a 2-manifold (illustrations): for a manifold without boundary, the links of maxima, minima and saddles; for a manifold with boundary, the links of maxima and minima
Table 2 illustrates the link of critical points of a 2-manifold with and without boundary. The gray area of the link means that the function values of all neighboring vertices there are less than the value of the blue point, while the green area indicates that the function values of the neighboring vertices are larger than the value of the blue point. The distribution of the gray and green areas of the link implies the type of each point. From these criteria, we can see that the extraction of critical points depends entirely on the definition of the Morse function. Since the definition of the Morse function is often application-oriented, there are many ways to define Morse functions for different application contexts [19, 20]. In our system, we consider two definitions of Morse functions for different applications: the height function, i.e. the z-coordinate of a vertex of the 2-manifold, and the curvature function, i.e. the curvature at a vertex of the 2-manifold. Fig. 3 shows the critical points extracted from triangular meshes with these two functions. The red, green and blue points stand for the minimal, saddle and maximal points, respectively.

Fig. 3. Critical point extraction with the height function for (a), (b) and the curvature function for (c)
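A minimal sketch of this link-based classification (our own illustration, assuming general position so that no two adjacent vertices share the same function value) could look like this:

def classify_vertex(f_v, f_ring):
    """f_v: Morse function value at vertex v; f_ring: values at the one-ring
    neighbors of v, given in cyclic (fan) order around v."""
    signs = [1 if f > f_v else -1 for f in f_ring]    # assumes no exact ties
    changes = sum(1 for i in range(len(signs))
                  if signs[i] != signs[(i + 1) % len(signs)])
    if changes == 0:
        # all neighbors lower -> maximum, all higher -> minimum
        return "maximum" if signs[0] < 0 else "minimum"
    if changes == 2:
        return "regular"
    return "saddle"        # changes == 2 + 2m with m > 0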
3.2 Boundary Determination After the critical points are extracted, the initial segmentation can be further refined by determining the boundaries passing through saddle points. Since saddle points are always located on convex and concave surfaces, we can extract the boundaries for these two types of surfaces.
Fig. 4. (a) The saddle points and their surrounding extreme points; (b) the saddle points and surrounding points for the convex and concave surface
Surrounding a saddle point, there is typically more than one minimal and maximal point, as shown in Fig. 4(a). This implies that there is more than one peak patch and pit patch surrounding the saddle point. Clearly, from a geometric point of view, the peak and pit patches should be separated and considered as different segments. Therefore, we regard these saddle points as boundary points to further decompose the initial segments.
A saddle point can be connected to either its surrounding minimal or maximal points, depending on the surface type. For a convex surface, its saddle points are connected to their surrounding maximal points, while for a concave surface the saddle points are connected to the minimal points, as shown in Fig. 4(b). The yellow arrows illustrate the paths from the associated extreme points to the saddle points. By regarding the saddle point as a boundary point, we can further determine other boundary points, and hence the boundary curves passing through the saddle point, by using the steepest ascent/descent strategy. Let s be the saddle point of a convex surface surrounded by two maximal points, max1, max2, and two minimal points, min1, min2. B1, B2 are the boundaries of the segments at which min1, min2 are located, as shown in Fig. 5(a). The link(s) of s is given in Fig. 5(b), in which p1, p2, …, p6 are its neighboring vertices. The values of the Morse function in link−(s): p2-s-p3-p2 and link−(s): p5-s-p6-p2 with "gray" areas are lower than the function value of s, while the function values in link+(s): p2-s-p6-p1-p2 and link+(s): p3-s-p5-p4-p3 with "green" areas are higher than that of s. In Fig. 5(b), p2, p5 have the lowest function values in the corresponding link−(s): p2-s-p3-p2 and p5-s-p6-p2 and are thereby considered as boundary points. Therefore, the edges s-p2, s-p5 are marked as boundary edges. By replacing the seed points with p2, p5, their next vertices on the boundary can be found as pi, pl, respectively. Repeating this process generates an integrated boundary curve pn-pm-pl-p5-s-p2-pi-pj that passes through the given saddle point.
Fig. 5. Boundary determination for the convex surface with triangular meshes
Similarly, the steepest ascent strategy is used to determine boundaries passing through the saddle points of concave surfaces. The boundary points are extracted from the vertices with the highest value of Morse function in each link+(s) of the seed point s. Fig. 6 shows the extracted boundaries passing through the saddle points on the convex and concave surfaces with yellow curves.
Fig. 6. Boundary extraction for the triangulated convex and concave surfaces
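A minimal sketch of tracing one branch of such a boundary by steepest descent follows. It is our own simplification: it follows the lowest-valued neighbor until a local minimum is reached, whereas the full method traces one branch per connected component of the lower link and also respects the initial segment boundaries.

def trace_descent(start, f, neighbors):
    """start: saddle vertex id; f: dict vertex -> Morse value;
    neighbors: dict vertex -> list of adjacent vertex ids."""
    path = [start]
    current = start
    while True:
        lower = [v for v in neighbors[current] if f[v] < f[current]]
        if not lower:                                # reached a local minimum
            break
        current = min(lower, key=lambda v: f[v])     # steepest descent step
        path.append(current)
    return path   # consecutive pairs of vertices form boundary edges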
After determining the boundaries, the initial segments can be refined into a more accurate segmentation by using the region growing method. The mesh containing an extreme point in each initial segment is selected as a seed mesh and grows by adding the neighboring meshes within the same segment until it reaches the boundaries of the initial segmentation or the boundaries extracted from the saddle points. In particular, for a convex surface only the maximal point is considered as the seed point and, for a concave surface, the minimal point is considered. The detailed region growing method is similar to the one in Sect. 2.2.
4 Results and Discussion All algorithms described have been implemented in Visual C++ and OpenGL, running on a 2.0 GHz Pentium IV PC. Many 3D triangulated surface models have been tested, and a couple of them are demonstrated below. Fig. 7 shows the segmentation of the triangular mesh of a bumpy torus from AIM@Shape (http://www.aimatshape.net/). Fig. 7(a) gives the triangular mesh model. After calculating the curvature, we obtain the curvature labeling as in Fig. 7(b), and the initial segmentation is shown in Fig. 7(c). Fig. 7(d) indicates the critical points extracted by Morse theory and Fig. 7(e) shows the boundaries of the initial segments passing through the saddle points. Finally, with the region growing process starting from the extreme points, we are able to refine the segmentation as seen in Fig. 7(f).

Fig. 7. Segmentation of the bumpy torus with (a) triangular meshes; (b) curvature labeling; (c) initial segmentation; (d) extracted critical points; (e) boundary determination (yellow curves); (f) refined segmentation.
The comparisons between our method and those proposed by Besl [12] and Srinark [10] are presented in Table 3. We can see that the segments extracted with Besl's algorithm contain a large number of superfluous fragments, whereas the results of Srinark's algorithm are under-segmented, similar to our initial segmentation. In contrast, our method generates surface patches that are relatively more accurate from a geometric point of view.
Table 3. Comparisons between different segmentation algorithms: segmentation results of Besl's method, Srinark's method, and our method (images)
5 Conclusion In this research, a hierarchical, region-growing method has been proposed and implemented for the segmentation of triangulated surfaces. The algorithm based on curvature labeling is capable of obtaining an initial segmentation. Explicit and comprehensive criteria are reported for extracting the critical points of discrete 2-manifold surfaces using Morse theory. Steepest ascent/descent techniques are exploited to determine the boundaries so that the initial segments are partitioned into more accurate surface patches. The comparison with the methods by Besl [12] and Srinark [10] demonstrates that our method is effective in segmenting triangulated surface meshes. Our method handles a wide variety of 3D surface models with or without boundaries and requires little user intervention. Because our refined segmentation relies on the critical points of an input surface, it is especially applicable to surface models characterized by salient features, such as terrain models, molecular surfaces, and 3D structures of synthetic objects. In a broader sense, our segmentation method can find applications in many fields where surface decomposition is needed and the models are represented by triangular meshes.
References 1. Rustamov, R.: On Mesh Editing, Manifold Learning, and Diffusion Wavelets. In: IMA Conference on the Mathematics of Surfaces (2009) 2. Zhang, E., Mischaikow, K., Turk, G.: Feature-based surface parameterization and texture mapping. ACM Trans. Graph 24, 1–27 (2005) 3. Benko, P., Varady, T.: Segmentation methods for smooth point regions of conventional engineering objects. Computer-Aided Des. 36, 511–523 (2004) 4. Garland, M., Willmott, A., Heckbert, P.: Hierarchical face clustering on polygonal surfaces. In: Proceedings of ACM Symposium on Interactive 3D Graphics, pp. 49–58 (2001) 5. Karni, Z., Gotsman, C.: Spectral compression of mesh geometry. In: Proceedings of SIGGRAPH, pp. 279–286 (2000) 6. Katz, S., Tal, A.: Hierarchical mesh decomposition using fuzzy clustering and cuts. ACM Trans. Graph 22, 954–961 (2003) 7. Levy, B., Petitjean, S., Ray, N., Maillot, J.: Least squares conformal maps for automatic texture atlas generation. In: Proceedings of SIGGRAPH, pp. 362–371 (2002) 8. Shamir, A.: A survey on mesh segmentation techniques. Computer Graphics Forum 27, 1539–1556 (2008) 9. Agathos, A., Pratikakis, I., Perantonis, S., Sapidis, N., Azariadis, P.: 3D mesh segmentation methodologies for CAD applications. Computer-Aided Design & Applications 4, 827–841 (2007) 10. Srinark, T., Kambhamettu, C.: A novel method for 3D surface mesh segmentation. In: Proceedings of the 6th Intl. Conf. on Computers, Graphics and Imaging, pp. 212–217 (2003) 11. Mangan, A.P., Whitaker, R.T.: Partitioning 3D surface meshes using watershed segmentation. IEEE Transactions on Visualization and Computer Graphics 5, 308–321 (1999) 12. Besl, P.J., Jain, R.: Segmentation through Variable-Order Surface Fitting. IEEE PAMI 10, 167–192 (1988) 13. Natarajan, V., Wang, Y., Bremer, P.T., Pascucci, V., Hamann, B.: Segmenting molecular surfaces. Computer Aided Geometric Design 23, 495–509 (2006) 14. Voronoi, G.: Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Journal für die Reine und Angewandte Mathematik 133, 97–178 (1907) 15. do Carmo, M.: Differential Geometry of Curves and Surfaces. Prentice-Hall, Englewood Cliffs (1976) 16. Milnor, J.: Morse Theory. Princeton Univ. Press, Princeton (1963) 17. Banchoff, T.F.: Critical points and curvature for embedded polyhedral surfaces. Amer. Math. Monthly 77, 475–485 (1970) 18. Goresky, M., MacPherson, R.: Stratified Morse Theory. Springer, Heidelberg (1988) 19. Hilaga, M., Shinagawa, Y., Komura, T., Kunii, T.L.: Topology matching for full automatic similarity estimation of 3d shapes. In: Proceedings of SIGGRAPH, pp. 203–212 (2001) 20. Ni, X., Garland, M., Hart, J.C.: Fair Morse functions for extracting the topological structure of a surface mesh. In: Proceedings of SIGGRAPH, pp. 613–622 (2004)
A Lattice Boltzmann Model for Rotationally Invariant Dithering Kai Hagenburg, Michael Breuß, Oliver Vogel, Joachim Weickert, and Martin Welk Mathematical Image Analysis Group Faculty of Mathematics and Computer Science Saarland University, Saarbrücken, Germany {hagenburg,breuss,vogel,weickert,welk}@mia.uni-saarland.de
Abstract. In this paper, we present a novel algorithm for dithering of gray-scale images. Our algorithm is based on the lattice Boltzmann method, a well-established and powerful concept known from computational physics. We describe the method and show the consistency of the new scheme with a partial differential equation. In contrast to widely-used error diffusion methods, our lattice Boltzmann model is rotationally invariant by construction. In several experiments on real and synthetic images, we show that our algorithm produces clearly superior results to these methods.
1
Introduction
Dithering is the problem of binarising a given gray value image such that its visual appearance remains close to the original image. In this paper, we explore a novel approach to this problem by employing a lattice Boltzmann (LB) framework. LB methods are usually used for the simulation of highly complex fluid dynamics, where a discrete environment, the lattice, is provided to model the propagation of gas or fluid particles [9,12,15]. Previous work. So far, LB methods have not been used extensively in image processing applications. In 1999, Jawerth et al. [7] proposed an LB method to model non-linear diffusion filtering. To our knowledge this is the only published work on LB methods for image processing. Standard algorithms for dithering employ the technique of error diffusion. Choosing a starting point and sweeping direction, pixels are locally thresholded. The occurring L1-error is then distributed to unprocessed pixels in the neighbourhood according to a specified distribution stencil. This results in a dithered image with an additional blurring. Prominent examples of such algorithms are the ones by Floyd and Steinberg [4], Jarvis et al. [6], Shiau and Fan [13], Stucki [14] and Ostromoukhov [11]. All algorithms follow the same principle, as the only variation is a different choice of the distribution stencil. While error diffusion algorithms are very fast, they share undesirable properties. By construction, these algorithms do not respect rotational invariance, which results in visible sweep
directions (see for example Figure 3). Furthermore, error diffusion methods introduce undesired noisy, worm-like artifacts which are prominent to a greater or lesser extent depending on the distribution stencil. One recent paper proposes to use simulated annealing for solving an optimisation problem related to dithering. Though this produces visually convincing results, the method is rather slow, depends on several parameters and needs an already dithered image (e.g. by error diffusion methods) as initialisation [10]. Our contribution. The goal of our paper is to present a novel dithering algorithm that does not suffer from problems with respect to rotational invariance, local blurring and directional bias introduced by distribution stencils. Its favorable visual quality results from its edge-enhancing properties. All this can be achieved by choosing lattice Boltzmann strategies in an appropriate way. Organization of the paper. The paper is organized as follows. In Section 2 we describe our LB framework, followed by the definition of the needed reference state in Section 3. In Section 4 we summarize the algorithm. Experiments are presented in Section 5 and the paper is concluded with a summary in Section 6. In the Appendix, we provide a proof for the consistency of our method.
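For concreteness, a minimal sketch of the error diffusion baseline discussed above, using the Floyd-Steinberg stencil [4], is given here; it is our own illustration, not code from the paper.

import numpy as np

def floyd_steinberg(image):
    """image: 2D float array with values in [0, 255]; returns a 0/255 array."""
    img = image.astype(float).copy()
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 255.0 if old >= 128.0 else 0.0
            out[y, x] = new
            err = old - new
            # distribute the quantisation error to unprocessed neighbours
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h and x > 0:
                img[y + 1, x - 1] += err * 3 / 16
            if y + 1 < h:
                img[y + 1, x] += err * 5 / 16
            if y + 1 < h and x + 1 < w:
                img[y + 1, x + 1] += err * 1 / 16
    return out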
2
Our Lattice Boltzmann Framework
At the heart of the LB method, one distinguishes a macroscopic level and a microscopic level. Within this framework, the underlying idea behind the scheme is quite intuitive from a physical point of view: the state we observe through the gray values in an image is a macroscopic state. The gray value density can be understood as an analogon to the density of a fluid. Knowing that any fluid is naturally a composition of very small molecules, we can explore that analogy by the following idea: if one zoomed close enough into the pixels of an image, one would observe that the gray values are represented by an appropriate amount of white particles. These particles constitute the microscopic state. By movement and collision of the particles, the observable macroscopic state may change. The LB method requires a set of rules for the movement and collision of the microscopic particles. After evaluating the microscopic dynamics, the macroscopic gray values are obtained by summation over the discrete particle distribution. In what follows, we explain the corresponding steps in detail. The microscopic set-up. The LB method relies on a discrete grid, or lattice. Each node of the lattice holds the value of a distribution function u_α, where α is an index that indicates the neighbourhood relation to the center node. The position of neighbours is identified by a lattice vector e_α, where e_0 = (0, 0) points to the center node itself. In this paper we employ a (3 × 3)-stencil, giving the set of possible directions Λ := {−1, 0, 1} × {−1, 0, 1}. For each α in the index set Λ of all possible directions, the corresponding direction vector is e_α = (α₁, α₂).
The distribution function u_α models a microscopic state. The macroscopic state, in our case the gray value at position x = (x₁, x₂) at time t, is described by summation over the local (3 × 3)-patch:

u(x, t) = Σ_{α∈Λ} u_α(x, t).   (1)

As indicated, the LB method encodes particle movement and collisions that take place at the microscopic level. The corresponding fundamental equation reads as

u_α(x + e_α, t + 1) = u_α(x, t) + Ω_α(x, t),   (2)
where Ω_α(x, t) is the so-called collision operator. The proper modelling of this operator is vital for any LB algorithm as it describes a set of collision rules that can be used to simulate arbitrary fluid models. In a first step to address this issue, we employ a BGK model named after Bhatnagar, Gross and Krook [1,12] which has become a standard approach in the LB literature. The specific BGK model we use for Ω_α reads as

Ω_α = u_α^ref − u_α.   (3)
This model allows us to interpret a collision state as the deviation of the current microscopic distribution u_α from a reference distribution u_α^ref instead of explicitly defining collision rules. In a second step, we impose as an additional structural property the conservation of the average gray value of the image via the pointwise condition

Σ_{α∈Λ} Ω_α(x, t) = 0.   (4)

We assume now that the lattice parameters h and τ that denote the spatial and temporal mesh widths are coupled via a relation τ/h² = constant. Employing then a scaling in space and time by the scaling parameter ε, one obtains by (2)-(3) the relation

u_α(x + ε e_α, t + ε²) = u_α^ref(x, t).   (5)
In order to define the LB method, we approximate uref α (x, t) by uref α (x, t) = tα u(x, t) (1 + εγα ) .
(6)
In case of γα = 0, equation (6) would give a LB description for linear diffusion [15]. By setting γα = 0, the reference state can be described as a perturbation of an equilibrium distribution by some function γα which is crucial to achieve the dithering effect. In the following chapters we will directly give the reference state, as the direct description of γα can be derived from that. The tα are normalisation factors depending on the direction [12]: ⎧ eα = (0, 0) , ⎨ 4/9 , tα = 1/9 , eα = (0, ±1), (±1, 0) , (7) ⎩ 1/36 , eα = (±1, ±1) .
As usual for normalisation weights, \sum_{\alpha \in \Lambda} t_\alpha = 1. The crucial point about (6) is that it relates the current macroscopic state u(x, t) to uα^ref(x, t) via the introduced perturbation. The logic behind the scheme definition given in the next step is to model uα^ref in such a way that it yields the desired steady state – i.e. the dithered image – by evolution in time.
Macroscopic limit. The proof of the following assertion is given in the Appendix:
Theorem 1. In the scaling limit ε → 0, the LB scheme obeying the proposed discrete set-up solves the partial differential equation
\frac{\partial}{\partial t} u(x, t) = \frac{1}{6}\Delta u(x, t) - \mathrm{div}\bigl( u(x, t)\gamma(x, t) \bigr),    (8)
where Δ := ∂²/∂x₁² + ∂²/∂x₂² is the Laplace operator, the divergence operator is denoted by div(a₁, a₂) := ∂a₁/∂x₁ + ∂a₂/∂x₂, and the vector-valued function γ(x, t) is given by
\gamma(x, t) := \sum_{\alpha \in \Lambda} e_\alpha t_\alpha \gamma_\alpha.    (9)
The PDE (8) is a diffusion–advection equation. While the diffusion term Δu yields a uniform spreading of the macroscopic variable u, this is balanced by the edge-enhancing advection term div(uγ).
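The following small NumPy check is an illustration added here (it is not part of the original paper): it verifies numerically the lattice properties used above and in the Appendix, namely that the weights (7) sum to one, that the first moment \sum_\alpha e_\alpha t_\alpha vanishes, and that the second moment \sum_\alpha e_{\alpha,i} e_{\alpha,j} t_\alpha equals (1/3)δ_ij.

```python
import numpy as np

alphas = [(a1, a2) for a1 in (-1, 0, 1) for a2 in (-1, 0, 1)]   # the index set Lambda

def t(alpha):
    """Normalisation weights t_alpha from Eq. (7)."""
    if alpha == (0, 0):
        return 4 / 9
    return 1 / 9 if 0 in alpha else 1 / 36

E = np.array(alphas, dtype=float)            # direction vectors e_alpha = (alpha1, alpha2)
T = np.array([t(a) for a in alphas])

print(T.sum())                               # 1.0                -> weights are normalised
print(E.T @ T)                               # [0. 0.]            -> sum_alpha e_alpha t_alpha = 0
print(E.T @ (E * T[:, None]))                # [[1/3, 0],[0, 1/3]] -> second moment is (1/3) * identity
```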
3
Constructing a Reference State for Dithering
We now model the reference state, see especially (6). Our aim is a dithering strategy that preserves structures and enhances edges by the model. Structure enhancement can be achieved by enlarging the gradient between two neighbouring pixels. This is done by transporting particles from darker pixels to brighter pixels. In the following, we consider three possible scenarios.
1. We distinguish two cases. If a pixel in (x, t) has a larger gray value than its neighbour in (x + eα, t), then we do not want to allow particles to dissipate into direction eα. In the opposite case, we allow the neighbouring pixel in (x + eα, t) to take into account – and take away – the amount tα u(x, t) of particles, as long as the neighbour is not already saturated.
2. For the robustness of the implementation, we also define the following rule. If a pixel in (x, t) has a very low gray value, below a minimal threshold ν > 0 close to zero, we always allow a neighbouring pixel in (x + eα, t) to take away an amount tα u(x, t) of particles.
3. If the gray value exceeds 255, we distribute the surplus particles to neighbouring pixels.
Summarising these considerations, we obtain the reference state as
u_\alpha^{\mathrm{ref}}(x + e_\alpha, t) =
\begin{cases}
t_\alpha u(x, t) & \text{if } u(x + e_\alpha, t) > u(x, t) \text{ and } u(x + e_\alpha, t) < 255,\\
0 & \text{if } u(x + e_\alpha, t) < u(x, t),\\
t_\alpha u(x, t) & \text{if } u(x, t) < \nu \text{ and } \alpha \neq (0, 0),\\
0 & \text{if } u(x, t) < \nu \text{ and } \alpha = (0, 0),\\
t_\alpha \bigl( u(x, t) - 255 \bigr) & \text{if } u(x, t) > 255 \text{ and } \alpha \neq (0, 0),\\
255 & \text{if } u(x, t) > 255 \text{ and } \alpha = (0, 0),
\end{cases}    (10)
with tα as in (7). Furthermore, we disallow any flow across the image boundaries. This suffices as boundary conditions for the reference state.
4
The Algorithm
We now show how to code an iterative dithering algorithm making use of the equations (5) and (10). By (5), setting the scaling parameter to match the grid, we obtain u_\alpha(x, t + 1) = u_\alpha^{\mathrm{ref}}(x - e_\alpha, t), and after taking sums:
\sum_{\alpha} u_\alpha(x, t + 1) = \sum_{\alpha} u_\alpha^{\mathrm{ref}}(x - e_\alpha, t).    (11)
By the symmetries incorporated in the directions in Λ and by (1) it follows that
u(x, t + 1) := \sum_{\alpha} u_\alpha^{\mathrm{ref}}(x + e_\alpha, t).    (12)
With this knowledge, we can describe the algorithm.
Summary of the algorithm
Step 1: Compute the reference state according to (10).
Step 2: Compute u(x, t + 1) := \sum_{\alpha} u_\alpha^{\mathrm{ref}}(x + e_\alpha, t).
Step 3: If \|u(\cdot, t + 1) - u(\cdot, t)\|_2 falls below a small stopping threshold, stop; otherwise go back to Step 1.
Implementation details. Without loss of generality we consider images whose grey values sum up to a multiple of 255; otherwise we scale the image such that it fulfils this property. In this setting, our algorithm is grey-value preserving. However, it is possible that at the end of the evolution a few pixels converge to a state that is neither zero nor 255. On these pixels, we perform a gray-value-preserving threshold to obtain the dithered image.
Furthermore, the tα as set in (7) need to be re-normalised if for some α the microscopic state uα is zero. The reason is easily seen by considering an example where all particles are concentrated at α = 0: summing up directly with the weights (7) effectively reduces the local gray value by 5/9. Thus, in the algorithm one defines a number ηα which is zero for uα = 0 and one otherwise. Then we renormalise the weights tα via a factor η such that η \sum_{\alpha} t_\alpha \eta_\alpha = 1; in case the sum is zero we set η to some finite number.
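To make the procedure concrete, the following NumPy sketch implements the main particle-exchange rule (scenario 1 in Section 3) together with the stopping criterion of Step 3. It is an illustrative simplification, not the authors' implementation: the low-gray-value rule, the overflow rule, the weight re-normalisation and the no-flow boundary condition are omitted (np.roll gives periodic boundaries instead), and the final threshold is a plain one rather than the gray-value-preserving variant described above.

```python
import numpy as np

# 3x3 neighbourhood directions (centre excluded) and their weights t_alpha from (7)
DIRS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
T = {d: (1 / 9 if 0 in d else 1 / 36) for d in DIRS}

def lb_dither_step(u):
    """One exchange sweep over a float image u with values in [0, 255]."""
    out = np.zeros_like(u)
    inc = np.zeros_like(u)
    for (dy, dx) in DIRS:
        nb = np.roll(u, shift=(-dy, -dx), axis=(0, 1))        # u(x + e_alpha), periodic boundaries
        give = (nb > u) & (nb < 255.0)                        # scenario 1: donate towards brighter, unsaturated neighbours
        amount = np.where(give, T[(dy, dx)] * u, 0.0)
        out += amount                                         # mass leaving pixel x
        inc += np.roll(amount, shift=(dy, dx), axis=(0, 1))   # the same mass arriving at x + e_alpha
    return u - out + inc

def lb_dither(image, eps=1e-3, max_steps=10000):
    u = image.astype(np.float64)
    for _ in range(max_steps):
        u_new = lb_dither_step(u)
        done = np.linalg.norm(u_new - u) < eps                # Step 3 stopping rule
        u = u_new
        if done:
            break
    return np.where(u >= 127.5, 255, 0).astype(np.uint8)      # simple final threshold
```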
Fig. 1. Results of dithering algorithms. Left: Result of the algorithms on a real-world image of size 600 × 305. Right: close-up of the upper left corner with size 150 × 150. First row: Original image. Second row: Floyd-Steinberg with serpentine implementation. Third row: Ostromoukhov. Fourth row: Lattice Boltzmann dithering.
5
Experiments
In this section we present experiments on both real and synthetic images that show the quality of our approach. In particular, we demonstrate the edge-enhancing and rotationally invariant properties of our algorithm. We compare the visual quality
Fig. 2. Comparison of dithering algorithms on images with low contrast areas. Top: original image – a gray value ramp decreasing gradually, with low-contrast text of constant gray value; size 300 × 50. Middle: Ostromoukhov. Bottom: Lattice Boltzmann dithering.
of the results to the classical standard method of Floyd and Steinberg [4], implemented with serpentine pixel order, as this is the algorithm most commonly identified with error diffusion, as well as to the method of Ostromoukhov [11], which constitutes the state-of-the-art error diffusion algorithm in the field. For Ostromoukhov's method we use the original implementation from the author's web page1.
The Poker chip experiment. In the first experiment we deal with a real-world image with large contrasts, see Figure 1. While error diffusion methods blur important image structures and introduce noisy patterns, the lattice Boltzmann method preserves edges very well. Our method even recovers prominent structures of blurred objects that are out of focus in the original image. Comparing the methods of Floyd-Steinberg and Ostromoukhov, we find no significant visual difference between them. Let us note that the iteration strategy relying on the pixel ordering is the same in both error diffusion methods.
The ramp experiment. We now consider a synthetic image of low contrast, see Figure 2. The image shows a ramp of gray values decreasing from left to right, together with a text of constant gray value. The latter has been chosen in such a way that parts of the text are indistinguishable from the background ramp. The error diffusion results lose some letters during the dithering process, since these methods tend to smooth out image structures with low contrast. In contrast, our method still produces a readable text.
The Gaussian test. In Figure 3 we demonstrate the rotational invariance of our scheme, though some visible directional artifacts remain due to the chosen discretisation. Furthermore, it is observable that the result of an error diffusion method strongly depends on the implementation of the pixel ordering.
1
http://www.iro.umontreal.ca/∼ostrom/varcoeffED/
Fig. 3. Evaluation w.r.t. rotation invariance. Left: Gaussian with size 256 × 256. Middle: Ostromoukhov. Right: Lattice Boltzmann dithering.
Runtimes. While the runtime of the error diffusion algorithms lies in the range of milliseconds, our diffusion-advection motivated algorithm takes a couple of seconds to converge on a standard PC with an implementation in C. The inherent parallelisation potential of lattice Boltzmann methods [3] that would allow for a further speedup has not been exploited yet. In its current state, the algorithm is attractive for offline dithering in high quality.
6
Conclusion and Future Work
We have derived a novel lattice Boltzmann model for dithering images that is by construction rotationally invariant. The adaptation of the lattice Boltzmann framework to this application has been achieved by specifying an appropriate reference state within the collision operator. We have provided an analysis of the model that shows that its macroscopic equation is a diffusion-advection equation. For future work, we plan to perform research on efficient algorithms for our LB method and to exploit its excellent parallelisation properties. We also plan to analyse the PDE (8) more thoroughly and eventually extend our algorithm to colour images. Acknowledgements. The authors gratefully acknowledge the funding given by the Deutsche Forschungsgemeinschaft (DFG).
References
1. Bhatnagar, P.L., Gross, E.P., Krook, M.: A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems. Physical Review 94(3), 511–525 (1954)
2. Chapman, S., Cowling, T.G.: The Mathematical Theory of Non-uniform Gases. Cambridge University Press, Cambridge (1939)
3. Dawson, S.P., Chen, S., Doolen, G.D.: Lattice Boltzmann computations for reaction-diffusion equations. Journal of Chemical Physics 98(2), 1514–1523 (1993)
4. Floyd, R.W., Steinberg, L.: An adaptive algorithm for spatial gray scale. Proceedings of the Society of Information Display 17, 75–77 (1976)
5. Frisch, U., Hasslacher, B., Pomeau, Y.: Lattice-gas automata for the Navier-Stokes equation. Physical Review Letters 56, 1505–1508 (1986)
6. Jarvis, J.F., Judice, C.N., Ninke, W.H.: A survey of techniques for the display of continuous tone pictures on bilevel displays. Computer Graphics and Image Processing 5(1), 13–40 (1976)
7. Jawerth, B., Lin, P., Sinzinger, E.: Lattice Boltzmann models for anisotropic diffusion of images. Journal of Mathematical Imaging and Vision 11, 231–237 (1999)
8. Lutsko, J.F.: Chapman-Enskog expansion about nonequilibrium states with application to the sheared granular fluid. Physical Review E 73, 021302 (2006)
9. McNamara, G.R., Zanetti, G.: Use of the lattice Boltzmann equation to simulate lattice-gas automata. Physical Review Letters 61, 2332–2335 (1988)
10. Pang, W.-M., Qu, Y., Wong, T.-T., Cohen-Or, D., Heng, P.-A.: Structure-Aware Halftoning. ACM Transactions on Graphics (SIGGRAPH 2008 issue) 27(3), 1–89 (2008)
11. Ostromoukhov, V.: A Simple and Efficient Error-Diffusion Algorithm. In: Proceedings of SIGGRAPH 2001, ACM Computer Graphics, pp. 567–572 (2001)
12. Qian, Y.H., D'Humieres, D., Lallemand, P.: Lattice BGK models for Navier-Stokes equation. Europhysics Letters 17(6), 479–484 (1992)
13. Shiau, J.N., Fan, Z.: A set of easily implementable coefficients in error diffusion with reduced worm artifacts. In: Proc. SPIE, vol. 2658, pp. 222–225 (1996)
14. Stucki, P.: MECCA – A Multiple-Error Correction Computation Algorithm for Bi-Level Image Hardcopy Reproduction. Research Report RZ-1060, IBM Research Laboratory, Zurich, Switzerland (1981)
15. Wolf-Gladrow, D.: Lattice Gas Cellular Automata and Lattice Boltzmann Models – An Introduction. Springer, Berlin (2000)
Appendix: Proof of Theorem 1
The proof proceeds in the following way. As we aim at deriving a PDE, we want to obtain expressions in u(x, t). In a first step of the proof, we therefore eliminate all dependencies on shifted variables (x + εeα, t + ε²). In a second step, we eliminate the reference distribution uα^ref from the deduced equations. In the final step, we summarise the microscopic variables appropriately to obtain expressions in the macroscopic variable u(x, t).
We begin by substituting uα(x + εeα, t + ε²) from the left-hand side of equation (5). This is done by means of a Taylor expansion around (x, t):
u_\alpha(x + \varepsilon e_\alpha, t + \varepsilon^2) = u_\alpha(x, t) + \varepsilon \sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} u_\alpha(x, t) + O(\varepsilon^2).    (13)
Substituting this expression in (5) and neglecting the second-order error gives
\varepsilon \sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} u_\alpha(x, t) = u_\alpha^{\mathrm{ref}}(x, t) - u_\alpha(x, t).    (14)
For the second step of the proof, we use the Chapman–Enskog expansion [2]. This works in analogy to the Taylor expansion, and describes uα in terms of fluctuations about the reference state uα^ref that are given by a function Φα:
u_\alpha = u_\alpha^{\mathrm{ref}} + \varepsilon \Phi_\alpha + O(\varepsilon^2).    (15)
The actual choice of the reference state is not crucial, cf. [8], where arbitrary reference states are used. Substituting uα(x, t) in (14) by (15) gives
\Phi_\alpha = -\sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} u_\alpha(x, t) + O(\varepsilon).    (16)
Having thus computed an expression for the fluctuation Φα, we plug it into the Chapman–Enskog expansion (15):
u_\alpha = u_\alpha^{\mathrm{ref}} - \varepsilon \sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} u_\alpha(x, t) + O(\varepsilon^2).    (17)
We proceed by considering the collision rule (4). Using (3) one obtains
\sum_{\alpha\in\Lambda} \bigl[ u_\alpha(x + \varepsilon e_\alpha, t + \varepsilon^2) - u_\alpha(x, t) \bigr] = 0.    (18)
The Taylor approximation of u_\alpha(x + \varepsilon e_\alpha, t + \varepsilon^2) reads as
u_\alpha(x + \varepsilon e_\alpha, t + \varepsilon^2) = u_\alpha(x, t) + \varepsilon^2 \frac{\partial}{\partial t} u_\alpha(x, t) + \sum_{i=1}^{2} \varepsilon e_{\alpha,i} \frac{\partial}{\partial x_i} u_\alpha(x, t) + \frac{\varepsilon^2}{2} \sum_{i,j=1}^{2} e_{\alpha,i} e_{\alpha,j} \frac{\partial^2}{\partial x_i \partial x_j} u_\alpha(x, t) + O(\varepsilon^3).    (19)
Inserting this expression for u_\alpha(x + \varepsilon e_\alpha, t + \varepsilon^2) in (18) gives
0 = \underbrace{\varepsilon^2 \sum_{\alpha\in\Lambda} \frac{\partial}{\partial t} u_\alpha(x, t)}_{=:A} + \underbrace{\varepsilon \sum_{\alpha\in\Lambda} \sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} u_\alpha(x, t)}_{=:B} + \underbrace{\frac{\varepsilon^2}{2} \sum_{\alpha\in\Lambda} \sum_{i,j=1}^{2} e_{\alpha,i} e_{\alpha,j} \frac{\partial^2}{\partial x_i \partial x_j} u_\alpha(x, t)}_{=:C}.    (20)
We now rewrite the terms A, B and C individually.
Term A.
\varepsilon^2 \sum_{\alpha\in\Lambda} \frac{\partial}{\partial t} u_\alpha(x, t) = \varepsilon^2 \frac{\partial}{\partial t} \sum_{\alpha\in\Lambda} u_\alpha(x, t) \overset{(1)}{=} \varepsilon^2 \frac{\partial}{\partial t} u(x, t).    (21)
Term B. Using (17),
\varepsilon \sum_{\alpha\in\Lambda} \sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} u_\alpha(x, t) = \varepsilon \sum_{\alpha\in\Lambda} \sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} u_\alpha^{\mathrm{ref}} - \varepsilon^2 \sum_{\alpha\in\Lambda} \sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} \Bigl[ \sum_{j=1}^{2} e_{\alpha,j} \frac{\partial}{\partial x_j} u_\alpha(x, t) \Bigr].    (22)
We consider the first summand in (22). To replace u_\alpha^{\mathrm{ref}} in this term, we make use of assumption (6), yielding
\varepsilon \sum_{\alpha\in\Lambda} \sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} \bigl[ t_\alpha u(x, t)(1 + \varepsilon\gamma_\alpha) \bigr] = \varepsilon \sum_{i=1}^{2} \frac{\partial}{\partial x_i} \Bigl[ u(x, t) \sum_{\alpha\in\Lambda} e_{\alpha,i} t_\alpha \Bigr] + \varepsilon^2 \sum_{i=1}^{2} \frac{\partial}{\partial x_i} \Bigl[ u(x, t) \sum_{\alpha\in\Lambda} e_{\alpha,i} t_\alpha \gamma_\alpha \Bigr].    (23)
By \sum_{\alpha\in\Lambda} e_{\alpha,i} t_\alpha = 0 and (9), the result is
\varepsilon \sum_{\alpha\in\Lambda} \sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} \bigl[ t_\alpha u(x, t)(1 + \varepsilon\gamma_\alpha) \bigr] = \varepsilon^2 \sum_{i=1}^{2} \frac{\partial}{\partial x_i} \bigl[ u(x, t)\gamma(x, t) \bigr]_i = \varepsilon^2\, \mathrm{div}\bigl( u(x, t)\gamma(x, t) \bigr).    (24)
We now employ (15),
u_\alpha = u_\alpha^{\mathrm{ref}} + O(\varepsilon) \overset{(6)}{=} t_\alpha u(x, t) + O(\varepsilon),    (25)
and plugging it into the second summand of (22) gives
\varepsilon^2 \sum_{\alpha\in\Lambda} \sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} \Bigl[ \sum_{j=1}^{2} e_{\alpha,j} \frac{\partial}{\partial x_j} \bigl( t_\alpha u(x, t) \bigr) \Bigr] = \varepsilon^2 \sum_{i,j=1}^{2} \underbrace{\sum_{\alpha\in\Lambda} e_{\alpha,i} e_{\alpha,j} t_\alpha}_{=\frac{1}{3}\delta_{ij}} \frac{\partial^2}{\partial x_i \partial x_j} u(x, t) = \frac{\varepsilon^2}{3} \Delta u(x, t).    (26)
In summary, Term B results in
\varepsilon \sum_{\alpha\in\Lambda} \sum_{i=1}^{2} e_{\alpha,i} \frac{\partial}{\partial x_i} u_\alpha(x, t) = \varepsilon^2\, \mathrm{div}\bigl( u(x, t)\gamma(x, t) \bigr) - \frac{\varepsilon^2}{3} \Delta u(x, t).    (27)
Term C. In a first step, we substitute uα as in (25), neglecting higher-order terms in ε, which gives a first-order approximation of uα. Using this we obtain
\frac{\varepsilon^2}{2} \sum_{\alpha\in\Lambda} \sum_{i,j=1}^{2} e_{\alpha,i} e_{\alpha,j} \frac{\partial^2}{\partial x_i \partial x_j} \bigl( t_\alpha u(x, t) \bigr) = \frac{\varepsilon^2}{2} \sum_{i,j=1}^{2} \underbrace{\sum_{\alpha\in\Lambda} e_{\alpha,i} e_{\alpha,j} t_\alpha}_{=\frac{1}{3}\delta_{ij}} \frac{\partial^2}{\partial x_i \partial x_j} u(x, t) = \frac{\varepsilon^2}{6} \Delta u(x, t).    (28)
Putting the three terms A, B and C together, dividing by ε² and taking the limit ε → 0 results in the diffusion–advection equation (8), which concludes our proof.
Parallel 3D Image Segmentation of Large Data Sets on a GPU Cluster
Aaron Hagan and Ye Zhao
Kent State University
Abstract. In this paper, we propose an inherent parallel scheme for 3D image segmentation of large volume data on a GPU cluster. This method originates from an extended Lattice Boltzmann Model (LBM), and provides a new numerical solution for solving the level set equation. As a local, explicit and parallel scheme, our method lends itself to several favorable features: (1) Very easy to implement with the core program only requiring a few lines of code; (2) Implicit computation of curvatures; (3) Flexible control of generating smooth segmentation results; (4) Strong amenability to parallel computing, especially on low-cost, powerful graphics hardware (GPU). The parallel computational scheme is well suited for cluster computing, leading to a good solution for segmenting very large data sets.
1
Introduction
Large scale 3D images are becoming very popular in many scientific domains including medical imaging, biology, industry etc. These images are often susceptible to noise during their acquisition. Image segmentation is a post processing technique that can show clearer results for analysis and registration. This topic has been a widely studied area of both 2D and 3D image processing and has been explored with a variety of techniques including (not limited to) region growing, contour evolutions, and image thresholding. The method we propose for performing image segmentation is an inherently parallel scheme based on the level set equation. Solving the level set equation is performed by using an extended lattice Boltzmann model (LBM) which provides an alternative numerical solution for the equation. This method has several advantages in that it is very easy and straightforward to implement, implicitly includes the computation of curvatures, has a unique parameter that controls the smoothness of the results, and finally, is parallel which allows it to be mapped to low-cost graphics hardware in a single GPU or GPU cluster environment. The level set method uses a partial differential equation (PDE) to model and track how fronts evolve in a discrete domain by maintaining and updating a distance field to the fronts. Previous methods based on the level set formulation discretize the PDE with finite difference operators which lead to complex numerical computations. The state-of-the-art narrow-band method applies an adaptive strategy where the level set computation is only performed on a narrow
region around the propagating contour. To expedite the narrow band on graphics hardware, Lefohn et al. [1,2] proposed a successful GPU implementation with narrow band packing and virtual memory management that arranges CPU-GPU data communication. The proposed method uses a GPU cluster environment to perform the segmentation of large datasets. With simple local operations, this method is a tool that is easily implemented on distributed machines, with minimal data management and communication through the network. Furthermore, it has the ability to handle curvature flows with its explicit computation, leading to a controllable noise reduction effect in the segmentation results. In detail, previous methods apply the narrow band with the corresponding priority data structure to adaptively propagate fronts to the target regions. A re-initialization of the narrow band is required to maintain the valid distance field. After this step, the new narrow band is packed and reloaded to the GPU. Our method is different in that we do not use a narrow band approach, and therefore, the distance field is always valid in the whole domain and no reload is needed. As a result there is no CPU-GPU crosstalk during segmentation and the data structures are easy to manage as they reside completely in the graphics hardware. Abandoning the adaptive strategies of this solver may appear unconventional at first, as memory consumption increases and computations are performed globally. There is, however, less management of the data (in terms of narrow band computing) and future work could be expanded to develop an adaptive method for this solver. More importantly, this strategy arises based on the rapidly increasing computational power of GPUs (i.e. speed and memory size). For example, GPU memory is increasing rapidly, current graphics cards are equipped with up to 4 GB of memory on a single unit. Cluster systems that contain many GPUs with large memory capacity are becoming available and being used in many scientific applications. We anticipate that this trend will continue in the future. Further benefit comes from this method’s easy implementation with under 100 lines of CPU and GPU code. A knowledgeable graduate student can implement the program in a short period of time. In summary, our approach shows that large volumetric data sets can be segmented in parallel on multiple GPUs with fast performance and satisfying results. To the best of our knowledge this approach is the first to solve level set segmentation of large 3D images on a GPU cluster.
2
Related Work
The level set equation has been used in a wide variety of image processing operations such as noise removal, object detection, and modeling equations of motion [3]. The level set equation can be used to perform the segmentation by creating an initial contour surface in the target image and having it evolve to regions of interest normally defined as target intensity values or gradients to attract the curve [4]. For GPU acceleration, a handful of work [5,1,6] has successfully applied the level set equation for image segmentation by solving it on the GPU.
The LBM method has been used in natural phenomena modeling in computer graphics and visualization [7]. The LBM-based diffusion has been used in image processing [8], where an anisotropic 2D image denoising is implemented on the CPU. Recently, Zhao [9] showed how the LBM scheme can be derived to solve volume smoothing, image fairing, and image editing applications. Tölke [10] described the parallel nature of LBM and how it can be mapped to the GPU through the CUDA library for modeling computational fluid dynamics. GPU cluster computing has been a rapidly developing research area which is adopted in many scientific computing tasks. Fan et al. [11] used the classic LBM for fluid modeling with their Zippy programming model for GPU clusters.
3
Introduction to LBM
LBM [12] originates from the cellular automata scheme which models fictitious particles on a discrete grid, where each point of the grid contains a particular lattice structure with links to its neighbors. The lattice structures are denoted by D3Q19 and D3Q7, which define the dimension and how many links connect a lattice point to its neighbors. The fictitious particles moving along the links and their averaged behaviors were initially used to simulate traditional fluid dynamics. Using the numerical computing process derived from microscopic statistical physics, this recovers the Navier-Stokes equations governing flow behaviors. The independent variables in the LBM equation consist of the particle distribution functions of each link from a grid point to one of its neighbors. The particle distribution functions model the probability of a packet of particles streaming across one lattice link to its corresponding neighbor. Between two consecutive steps of the streaming computation, the function is modified by performing a local relaxation that models inter-particle collisions. We refer the interested readers to a complete physical description [12], and its usage in visual simulations [7].
LBM Computation. The first step in performing an LBM simulation is to discretize the simulation domain into a grid and generate the lattice structure for each grid point. For LBM simulations each grid point has a variety of different links to its neighbors. During each step of the simulation, collision and streaming computations are performed, which are mathematically described as:
collision ⇒ f_i(x, t^*) = f_i(x, t) - \frac{1}{\tau}\bigl( f_i(x, t) - f_i^{eq}(x, t) \bigr),    (1)
streaming ⇒ f_i(x + e_i, t + 1) = f_i(x, t^*),    (2)
The local equilibrium particle distribution, fieq , models collisions as a statistical redistribution of momentum. At a given time step t, each particle distribution function, fi , along one link vector ei at a lattice point, x, is updated by a relaxation process with respect to fieq . The collision process is controlled by a relaxation parameter τ . τ controls the rate at which the equation approaches the equilibrium state. After collision, the post-collision result is propagated to x + ei . Here, x + ei locates a neighboring lattice point along the link i. This
provides the distribution function value at time step t+1. f_i^{eq} can be defined by the Bhatnagar, Gross, Krook (BGK) model as
f_i^{eq}(\rho, u) = \rho\bigl( A_i + B_i (e_i \cdot u) + C_i (e_i \cdot u)^2 + D_i u^2 \bigr),    (3)
where A_i to D_i are constant coefficients chosen via the geometry of the lattice links and ρ is the fluid density, computed as the accumulation of particle distributions by:
\rho = \sum_i f_i.    (4)
The LBM can be easily extended to incorporate additional micro-physics, such as an external force F. This force affects the local particle distribution functions as follows:
f_i \leftarrow f_i + B_i \frac{(2\tau - 1)}{2\tau} (F \cdot e_i).    (5)
By applying Chapman-Enskog analysis [13], the Navier-Stokes equations can be recovered from the equilibrium equation as:
\nabla \cdot u = 0,    (6)
\frac{\partial u}{\partial t} + u \cdot \nabla u = \nu \Delta u + F.    (7)
Here ∇ denotes the gradient operator (∂/∂x, ∂/∂y, ∂/∂z) and Δ is the Laplacian, Δ = ∇² = ∂²/∂x² + ∂²/∂y² + ∂²/∂z².
Extended LBM. Though initially designed for fluid dynamics, the LBM method can be modified for modeling typical diffusion computations. Equation 3 can be simplified to:
f_i^{eq}(\rho) = A_i \rho,    (8)
which erases the momentum terms and in effect removes the nonlinear advection term of the Navier-Stokes equation, which is not needed for solving diffusion equations. As shown in [9], the parabolic diffusion equation can be recovered by the Chapman-Enskog expansion:
\frac{\partial \rho}{\partial t} = \gamma \nabla \cdot \nabla \rho,    (9)
where γ is a diffusion coefficient defined for a D3Q7 lattice by the relaxation parameter τ as:
\gamma = \frac{1}{6}(2\tau - 1).    (10)
In this case, we can also include the external force in the same way as in Equation 5. Thus, the modified LBM computation recovers the following equation:
\frac{\partial \rho}{\partial t} = \gamma \nabla \cdot \nabla \rho + F.    (11)
Using this equation to compute a distance field (replacing ρ by φ), we can recover the level set equation, where F is used to accommodate the speed function and the first term relates to the curvature flow effects.
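A minimal sketch of one step of this simplified LBM, written in NumPy rather than the GPU code used in the paper. The D3Q7 link vectors follow the description above; the equilibrium coefficients A_i are set uniformly to 1/7 purely for illustration (the paper takes them from the lattice geometry), the force is coupled as a simple per-link source standing in for Equation 5, and np.roll provides periodic boundaries.

```python
import numpy as np

E = np.array([(0, 0, 0), (1, 0, 0), (-1, 0, 0),
              (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)])   # D3Q7 link vectors
A = np.full(len(E), 1.0 / len(E))                              # illustrative equilibrium coefficients (sum to 1)

def lbm_step(f, tau, force):
    """f: distributions of shape (7, nx, ny, nz); force: scalar field F of shape (nx, ny, nz)."""
    rho = f.sum(axis=0)                                        # Eq. (4): macroscopic value (density / distance)
    feq = A[:, None, None, None] * rho                         # Eq. (8): momentum-free equilibrium
    f = f - (f - feq) / tau                                    # Eq. (1): BGK collision
    f = f + A[:, None, None, None] * force * (2 * tau - 1) / (2 * tau)  # force source, simplified from Eq. (5)
    for i, (ex, ey, ez) in enumerate(E):                       # Eq. (2): streaming along each link
        f[i] = np.roll(f[i], shift=(ex, ey, ez), axis=(0, 1, 2))
    return f
```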
4
Solving Level Set Equation
A distance field defines how far all points in a domain are from an existing surface, where the distance is signed to distinguish between inside and outside the surface. In this way, the surface S is defined implicitly by a function φ : R³ → R. For p ∈ R³, the distance function is defined as
\phi(p) = \mathrm{sgn}(p) \cdot \min\{\, |p - q| : q \in S \,\}.    (12)
The surface can be considered as the set of points with a zero distance value. In image segmentation, the zero level set starts from an arbitrary starting shape and evolves by the following level set equation:
\frac{\partial \phi}{\partial t} = |\nabla\phi| \Bigl[ \alpha D(x) + \gamma \nabla \cdot \frac{\nabla\phi}{|\nabla\phi|} \Bigr],    (13)
where φ is the distance, D(x) is the speed function that acts as a driving force to move the evolving level set to target regions, with a user-controlled parameter α (we use 0.01 in the examples). The second term represents curvature flow (smoothing). γ determines the level of curvature-based smoothness in the results. For a regular distance field, |∇φ| = 1, which turns the last term into γ∇ · ∇φ. Note that |∇φ| = 1 in our framework at all steps, since we do not use an adaptive approach and the distance field is valid in the whole domain. From this, Equation 13 is only a variational formula of Equation 11. It shows that the modified LBM computation leads to a new solution to the level set equation, enabling us to use the simple, explicit, parallel computational process for volume segmentation. In this way, our method also has the potential to be applied to other level set based applications. In our implementation, we apply a simple D3Q7 lattice that uses less memory and improves the performance, compared with a traditional fluid solver using a D3Q19 lattice. This is made possible since the level set solution does not need to solve the nonlinear advection term as in the Navier-Stokes equations, and the D3Q7 lattice can provide enough accuracy.
Driving Speed Function. Speed functions are designed to make the evolving front of the zero level set propagate to certain target regions. We use a popular approach [3,14] where the speed function is defined by the difference between a target isovalue and the density value at each grid position:
D(I) = -|I - T|,    (14)
where I is the voxel/pixel value at the grid position, and T represents the target density isovalue that the front should evolve to. As the front moves closer to the target region, the speed will converge to zero. The speed term also carries properties that allow the front to propagate in either direction, based on the sign of the function. The propagating front will expand if I falls in the T - or T + range, otherwise it will contract. The function D(I) is easily applied in our
LBM computation as the body force F in Equation 11. D(I) can also be derived based on the gradient defining the object boundary, or on other user-specified rules.
Level Set Curvature Computation. As mentioned earlier, the LBM scheme inherently contains properties that model curvature during the collision and streaming process. The benefit of the LBM method is that the curvature does not need to be computed explicitly, because it is hidden in the microscopic LBM collision procedure. In the LBM-solved diffusion Equation 9, we substitute the fluid density ρ by the distance value φ. Then, by applying |∇φ| = 1 for a distance field, we get:
\frac{\partial \phi}{\partial t} = \gamma \nabla \cdot \Bigl( \frac{\nabla\phi}{|\nabla\phi|} \Bigr) |\nabla\phi| = \gamma \kappa |\nabla\phi|,    (15)
where κ represents the mean curvature:
\kappa = \nabla \cdot \Bigl( \frac{\nabla\phi}{|\nabla\phi|} \Bigr).    (16)
In summary, the modified LBM can implicitly provide the curvature-based smoothing effects, in contrast to upwinding difference methods that need to explicitly compute curvature components.
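A hedged usage sketch that drives the lbm_step and coefficients A from the sketch in Section 3 as a segmentation loop: the initial zero level set is a sphere at the volume centre (as in the experiments), the speed function is Eq. (14) weighted by α, and τ is obtained from the smoothness parameter γ via Eq. (10). Parameter values mirror those quoted in the figure captions; the seed placement and array layout are assumptions of this example.

```python
import numpy as np

def segment(image, target_iso, gamma=1.5, alpha=0.01, steps=50):
    """image: 3D array of densities; returns the evolved distance field phi."""
    tau = (6.0 * gamma + 1.0) / 2.0                 # Eq. (10) rearranged for tau
    # signed distance to an initial sphere at the volume centre (illustrative seed)
    idx = np.indices(image.shape, dtype=np.float64)
    centre = np.array(image.shape, dtype=np.float64).reshape(3, 1, 1, 1) / 2.0
    phi0 = np.sqrt(((idx - centre) ** 2).sum(axis=0)) - min(image.shape) / 4.0
    f = A[:, None, None, None] * phi0               # start from equilibrium distributions of phi0
    speed = -np.abs(image - target_iso)             # Eq. (14): D(I) = -|I - T|
    for _ in range(steps):
        f = lbm_step(f, tau, alpha * speed)
    return f.sum(axis=0)                            # phi; its zero level set is the segmentation
```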
Fig. 1. (a) Data decomposition to cluster nodes for LBM computations. (b) Ghost layers used to transfer boundary data between neighboring nodes in the network.
5
Cluster Computing
Our method can easily be implemented on a single GPU, but to handle large data sets, we extend the algorithm to multiple GPUs organized in a cluster environment. Our cluster is Linux-operated and consists of seventeen nodes, each having a dual core or quad core AMD Opteron processor and a Nvidia 8800 GTX graphics card with 768 MB memory. The 3D volume data set is divided into 16 blocks and sent to the 16 worker nodes in a 4 × 4 organization of the nodes. One master node is used for managing initial data division, collecting and assembling results, and visualization. The master node assembles separate
Table 1. Performance report: per-step (in seconds) average speed of the LBM computation and of the ghost layer communication, the GPU memory size (in MB) per node, and the total ghost layer data size (in MB) on the GPU cluster with the 4 × 4 configuration.

Model      Volume Size        LBM Speed Per Step  Ghost Layer Transfer Per Step  Total Speed Per Step  GPU Mem. Size Per Node  Ghost Layer Data Size
CT Head    128 × 128 × 128    0.01                0.04                           0.05                  4.5                     5.2
MRI Head   256 × 256 × 256    0.08                0.21                           0.29                  36                      21
Abdomen    512 × 512 × 174    0.22                0.51                           0.73                  97.9                    28.5
Colon      512 × 512 × 442    0.57                1.61                           2.18                  248.6                   72.5
Aneurism   512 × 512 × 512    0.66                1.96                           2.62                  288                     84
Porche     559 × 1023 × 347   0.96                2.57                           3.53                  427                     88
Bonsai     1024 × 1024 × 308  1.57                5.24                           6.81                  693                     101.1
results with correct coordinate transformation and indexing, and uses a Marching Cubes method to render the segmented features of the distance field. Figure 1(a) shows this cluster configuration and data distribution. The data distribution is accelerated by using OpenMP to distribute the data in parallel. Between consecutive LBM steps, it is necessary for the worker nodes to share the LBM data (i.e., fi values) residing on the boundaries between each pair of neighboring blocks. We apply a ghost layer method to handle this problem. Each data block contains an extra layer of data, the ghost layer, to communicate with each of its neighbors, which is shown in Fig. 1(b). For example, one node A performs computation on data layer A1 to An . After each step, data in A1 is transferred to the ghost layer Bn+1 of its neighbor node B. Meanwhile, B’s Bn layer will be transferred to the ghost layer A0 of A. In the next step, A will use A0 to implement streaming operation and B will use Bn+1 as well. The data transfer only involves the boundary layers with a very small amount of data compared with the total data size. With an infiniBand network equipped on our cluster, data can be transferred with a speed at an order of gigabits per second.
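A hedged mpi4py sketch of the ghost-layer exchange just described, simplified to a one-dimensional slab decomposition (the paper uses a 4 × 4 block layout) and to CPU arrays; the GPU read-back/upload around it is omitted. The neighbour ranks left and right, and the use of MPI.PROC_NULL at the domain boundary, are conventions of this example.

```python
from mpi4py import MPI
import numpy as np

def exchange_ghost_layers(f, comm, left, right):
    """f: slab of shape (7, local_nx + 2, ny, nz) with one ghost slice at each end of axis 1.
    left/right: neighbouring ranks, or MPI.PROC_NULL where there is no neighbour."""
    buf = np.empty_like(np.ascontiguousarray(f[:, 1]))
    # send my first interior slice to the left neighbour; receive the right neighbour's
    # first interior slice into my right ghost slice
    comm.Sendrecv(np.ascontiguousarray(f[:, 1]), dest=left, sendtag=0,
                  recvbuf=buf, source=right, recvtag=0)
    if right != MPI.PROC_NULL:
        f[:, -1] = buf
    # send my last interior slice to the right neighbour; receive the left neighbour's
    # last interior slice into my left ghost slice
    comm.Sendrecv(np.ascontiguousarray(f[:, -2]), dest=right, sendtag=1,
                  recvbuf=buf, source=left, recvtag=1)
    if left != MPI.PROC_NULL:
        f[:, 0] = buf
```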
6
Results and Performance
We ran several volumetric data sets with various data sizes on our GPU cluster, where the computation was completely GPU-based. A 3D Aneurism dataset is used in Fig. 2 to show the results. The image sequence demonstrates the process of the level set propagation in different steps. A Marching Cubes method is used to generate a triangle mesh of the zero level set. The total steps used for the level set to reach final results are determined by the position and shape of the initial starting level set (we use a simple sphere in our examples). Fig. 4 shows another example of a CT abdomen data set and Fig. 3 uses a bonsai volume. Table 1 outlines the performance results of several datasets. The average speed per step is composed of two parts: LBM computation and ghost layer handling. We also report the GPU memory consumption on each node, and the total size
Fig. 2. Results of segmenting an Aneurism dataset with a target iso-value of 32. Data size is 512 × 512 × 512. γ = 1.5. (a) Level set propagates after 3 steps; (b) After 10 Steps; (c) After 25 Steps; (d) After 50 Steps.
Fig. 3. Results of segmenting a volumetric Bonsai data with a target iso-value of 20. γ = 5.125. Data size is 1024 × 1024 × 308. (a) Initial level set as a sphere; (b) After 25 Steps; (c) After 45 Steps; (d) Direct rendering of isosurface with density values of 20.
Fig. 4. Results of segmenting a 3D CT abdomen dataset with a target iso-value of 62. γ = 1.125. Data size is 512 × 512 × 174. (a) Segmentation result (distance field) after 25 steps; (b) Direct rendering of isosurface with density values of 62.
of all ghost layers that determines the network traffic speed. It clearly shows that our method achieves very good performance to segment very large data. For the largest Bonsai data, it averages 6.81 seconds per step. The segmentation of the Bonsai completes in 45 steps, leading to a total processing time of around
306 seconds. The segmentation usually takes tens of steps for large data. With an average per-step speed of a few seconds, the whole process can generally be accomplished in tens to hundreds of seconds depending on the data set and the initial distance field. In detail, the LBM level set computation is very fast even for a large data set: it uses 1.57 seconds for the Bonsai data, which runs on one 256 × 256 × 77 volume per node due to our data division scheme. The ghost layer handling is a little slower; it includes (1) data readback from GPUs, (2) network transfer, and (3) data write to GPUs. For the Bonsai data, it costs 5.24 seconds. The total ghost layer data size (on all the nodes) reaches 101.1 MB. Although this data size does not impose a challenge on the InfiniBand network, the GPU readback may consume a little more time than the LBM computation, which is a known bottleneck of GPU computing. We plan to improve performance with further optimization of the ghost layer processing on faster GPUs and a new cluster configuration.
7
Conclusion
Common segmentation techniques such as isovalue thresholding are not adequate for handling complex 3D images generated by medical or other scanning devices. It proves necessary to implement advanced techniques which have the power to give clearer segmentation results by solving level set equations. Popular level set approaches on a single GPU are not easily extended to large volume data sets, which are prevalent in practical applications. We have proposed an inherently parallel method to solve the segmentation problem flexibly and efficiently on single and multiple GPUs. Based on an extended LBM method, our method lends itself as a good segmentation tool with easy implementation, implicit curvature handling, and thus controllable smoothness of the segmented data. With its parallel scheme, only minimal data processing is required for implementing the method on a GPU cluster compared with previous single-GPU approaches. We have reported good performance on multiple data sets on the cluster. In summary, our scheme provides a viable solution for large-scale 3D image segmentation through the adoption of distributed computing technology. It has great potential to be applied in various applications. In the future, we will work on combining parallel visualization techniques with the segmentation to further augment the ability of this method.
Acknowledgement This work is partially supported by NSF grant IIS-0916131 and Kent State Research Council. The cluster is funded by the daytaOhio Wright Center of Innovation by the Ohio Department of Development. Please find the color version of this paper at the author’s homepage.
References
1. Lefohn, A., Cates, J., Whitaker, R.: Interactive, GPU-based level sets for 3D brain tumor segmentation. In: Medical Image Computing and Computer Assisted Intervention (MICCAI), pp. 564–572 (2003)
2. Cates, J.E., Lefohn, A.E., Whitaker, R.T.: Gist: An interactive, GPU-based level-set segmentation tool for 3D medical images. Medical Image Analysis 10, 217–231 (2004)
3. Sethian, J.: Level set methods and fast marching methods: Evolving interfaces in computational geometry, fluid mechanics, computer vision, and materials science (1999)
4. Malladi, R., Sethian, J.A., Vemuri, B.C.: Shape modeling with front propagation: A level set approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 17, 158–175 (1995)
5. Klar, O.: Interactive GPU based segmentation of large medical volume data with level sets. Diploma Thesis, VRVis and University Koblenz-Landau (2006)
6. Rumpf, M., Strzodka, R.: Level set segmentation in graphics hardware. In: Proceedings of IEEE International Conference on Image Processing (ICIP 2001), vol. 3, pp. 1103–1106 (2001)
7. Zhao, Y., Kaufman, A., Mueller, K., Thuerey, N., Rüde, U., Iglberger, K.: Interactive lattice-based flow simulation and visualization. In: Tutorial, IEEE Visualization Conference (2008)
8. Jawerth, B., Lin, P., Sinzinger, E.: Lattice Boltzmann models for anisotropic diffusion of images. Journal of Mathematical Imaging and Vision 11, 231–237 (1999)
9. Zhao, Y.: Lattice Boltzmann based PDE solver on the GPU. Visual Computer, 323–333 (2008)
10. Tölke, J.: Implementation of a lattice Boltzmann kernel using the compute unified device architecture developed by NVIDIA. Computing and Visualization in Science (2008)
11. Fan, Z., Qiu, F., Kaufman, A.E.: Zippy: A framework for computation and visualization on a GPU cluster. Computer Graphics Forum 27(2), 341–350 (2008)
12. Succi, S.: The Lattice Boltzmann Equation for Fluid Dynamics and Beyond. Numerical Mathematics and Scientific Computation. Oxford University Press, Oxford (2001)
13. He, X., Luo, L.: Lattice Boltzmann model for the incompressible Navier-Stokes equation. Journal of Statistical Physics 88(3/4), 927–944 (1997)
14. Lefohn, A.E., Kniss, J.M., Hansen, C.D., Whitaker, R.T.: A streaming narrow-band algorithm: Interactive computation and visualization of level sets. IEEE Transactions on Visualization and Computer Graphics 10(4), 422–433 (2004)
A Practical Guide to Large Tiled Displays
Paul A. Navrátil, Brandt Westing, Gregory P. Johnson, Ashwini Athalye, Jose Carreno, and Freddy Rojas
Texas Advanced Computing Center
The University of Texas, Austin
{pnav,bwesting,gregj,ashwini,jcarreno,rfreddy}@tacc.utexas.edu
Abstract. The drive for greater detail in scientific computing and digital photography is creating demand for ultra-resolution images and visualizations. Such images are best viewed on large displays with enough resolution to show “big picture” relationships concurrently with fine-grained details. Historically, large scale displays have been rare due to the high costs of equipment, space, and maintenance. However, modern tiled displays of commodity LCD monitors offer large aggregate image resolution and can be constructed and maintained at low cost. We present a discussion of the factors to consider in constructing tiled LCD displays and an evaluation of current approaches used to drive them based on our experience constructing displays ranging from 36 Mpixels to 307 Mpixels. We wish to capture current practices to inform the design and construction of future displays at current and larger scales.
1
Introduction
Scientists, like photographers, seek the greatest possible detail in their images. Yet, our ability to generate large images and large image sets has outstripped our ability to view them at full resolution. Gigapixel scientific images are becoming commonplace, from the sub-kilometer resolution satellite images of NASA’s Earth Observatory [1] to nanometer-resolution electron micrographs [2] used in three-dimensional cell reconstruction. Also, analysis of ultrascale supercomputing datasets increasingly requires high-resolution imagery to capture fine detail. Further, scientists using image-intensive processes, such as image alignment in biological microscopy, may track features across tens or hundreds of related images, the combined sizes of which can be much larger than conventional displays. Historically, projection-based systems have been used for large, high-resolution displays, because of both their seamless image and a lack of viable alternatives. However, projection systems are expensive, both to purchase and to maintain. Recent technology advances have reduced both the purchase cost and the cost-per-pixel, but maintenance costs remain high, both for upkeep (projector bulbs, display alignment), and for the lab space needed to accommodate the screens, the projectors, and the necessary throw distance between them. Recently, tiles of commodity LCD monitors have been used to construct displays of over two hundred megapixels [3]. Tiled LCD displays offer low purchase
and maintenance costs, but often the software used to drive these displays requires a custom API [4,5,6,7], a constraint that complicates application implementation and prohibits running third-party applications for which the source code is unavailable. In this paper, we describe our experience constructing tiled LCD displays of various resolutions using only freely-available, open-source software. We show that while custom display software can provide high render rates for special applications, they are not required to drive these displays with good performance, which dramatically reduces development and maintenance costs. Further, we provide hardware and software recommendations to guide the construction of future displays both at current and larger resolutions.
2
The Case for Large High-Resolution Displays
Not everyone sees value in large high resolution displays. Such displays are sometimes labeled as “only good for demos” and, less charitably, as “fleecing rooms” for big donors. However, recent studies in human-computer interaction demonstrate that these large displays offer improved usability and performance for analyzing high-resolution imagery.
2.1 Improved Human Interaction
The human-computer interaction community has documented the benefits of high-resolution displays, and several studies have targeted tiled LCD displays in particular for increasing user perception [8], productivity [9,10,11,12,13], and satisfaction [14]. These benefits appear to scale with increased display size. Further, a large high-resolution display permits physical navigation of the image, where viewers walk about the display to view portions of the image. On geospatial visualization tasks, physical navigation was shown to provide superior task performance to virtual navigation, scrolling and zooming the image through a software interface on a single screen [15]. Of the components of Dourish’s “embodied interaction” concept of interface design [16,17], there is evidence that physical navigation of the image is a primary component of user productivity and satisfaction in large-scale visualization tasks [18,19].
2.2 High Resolution Imagery
Scientific equipment contains increasingly high-resolution sensors that produce high-resolution images. Multiple images are often analyzed together, either by composition into a single enormous image or by comparison of a related image set. In astronomy, composite images from space probes range from 100 Mpixel panoramics of Mars [20] to over three gigapixels for full-Earth coverage at sub-kilometer resolution [1] to five gigapixel infrared scans of the inner Milky Way [21]. In biology, electron micrographs at nanometer resolution can be larger than one gigapixel [2]. Further, three-dimensional reconstruction of electron tomographs relies on proper alignment of the individual images [22], and
the alignment process often requires manual identification of features across images. For both example domains, a large display would aid in detecting features and relationships, either within a composite image or across a large image set.
2.3 Scientific Computing
Scientific computing exists in a feedback loop: the increasing capacity and capability of supercomputers drive increased resolution and precision in scientific simulations, which in turn require larger and more capable systems to effectively display and analyze the simulation results. Science ranging from universal dark matter N-body simulations [23,24,25] to high-resolution hurricane storm surge modeling [26] to turbulent fluid flow models [27] creates ultra-resolution results. To match the result resolution, researchers should create ultra-resolution images of their data, and such images are better analyzed on a large-format, high-resolution display [19].
2.4 Projection or Tiled-LCD?
Though the largest projection display is less than one-fourth the resolution of the largest tiled-LCD displays [28], projection systems remain popular because of their seamless images. With seamlessness, however, comes higher purchase costs, larger space requirements, and maintenance costs for bulbs and projector alignment. The highest-resolution projectors today use an 8.8 Mpixel LCD [29,30], a resolution slightly higher than two 30” LCD tiles. Assuming retail costs of $1000 per LCD tile and $100,000 per 4K projector, a projection display of the same resolution as the largest tiled LCD displays would cost nearly 47 times as much, without considering space and maintenance costs. A single 4K projector bulb costs thousands of dollars, the cost of several LCD display tiles. In addition, projectors must be kept in alignment, either physically or with an automated calibration system [31]. Even with the advent of thirty-five megapixel projectors [32], the price-per-pixel cost still favors tiled LCD technology.
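A back-of-the-envelope version of the cost comparison above. The $1000-per-tile and $100,000-per-projector figures and the 8.8 Mpixel projector resolution come from the text; the 2560 × 1600 resolution assumed for a 30-inch tile is an assumption of this example (it makes two tiles come out just under 8.8 Mpixels, as stated).

```python
tile_cost, projector_cost = 1_000.0, 100_000.0          # US$, figures quoted in the text
tile_mpix = 2560 * 1600 / 1e6                           # assumed 30" tile resolution: ~4.1 Mpixels
projector_mpix = 8.8                                    # 4K projector resolution quoted in the text

cost_per_mpix_tile = tile_cost / tile_mpix              # ~ $244 per Mpixel
cost_per_mpix_proj = projector_cost / projector_mpix    # ~ $11,364 per Mpixel
print(cost_per_mpix_proj / cost_per_mpix_tile)          # ~ 46.5, i.e. "nearly 47 times"
```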
3
Tiled-LCD Display Hardware
The highest-resolution displays currently use either tiled-LCDs or projectors, and all known displays over 100 Mpixels use tiled-LCDs [28]. Other high-resolution display technologies exist [33], but are not currently used for large displays. The remainder of the paper will concern tiled-LCD displays. Tiled-LCD displays have been built entirely from commodity parts, at resolutions of ten to over two hundred megapixels [3,34,35]. Below, we highlight key aspects of the hardware used in Stallion, the 307 Mpixel display at the Texas Advanced Computing Center (TACC), that extend the commodity hardware trend. Specific hardware details can be found at the TACC website [36]. Stallion consists of seventy-five LCD monitors mapped to twenty-three rendering nodes, each with two GPUs, and a head node that acts as the user console. The machine contains a total of 100 processing cores, 108 GB aggregate RAM
and 46 render-node GPUs with 36 GB aggregate graphics RAM. The render nodes of Stallion are Dell XPS “gaming boxes” marketed to home enthusiasts, rather than workstations or rack-mounted machines, and each node contains two NVIDIA GeForce gaming cards, rather than industrial-class Quadro cards. Home-user hardware can provide adequate performance for lower total cost, depending on the intended use of the display. We describe these considerations in Section 5. Of the seventy-five display tiles, fifty-eight share a GPU with another tile and seventeen have a dedicated GPU. These seventeen tiles are centrally located in the display, creating a “hot-spot” with increased rendering performance. Table 1 quantifies the rendering performance of the hot-spot compared to other regions of the display.
4
Display Environment Evaluation
We identify three categories of display environments: windowing environments; OpenGL substitutes that reimplement the OpenGL API; and custom parallel libraries that implement a new rendering library interface. We will discuss the qualities of each category below through the performance of representative software on Stallion. The features of each category are summarized in Table 3.
4.1 Windowing Environments
Windowing environments for tiled displays include Distributed Multihead X (DMX) [37] and the Scalable Adaptive Graphics Environment (SAGE) [38]. DMX acts as an X Windows proxy to multiple X servers running on a tiled display, whereas SAGE hosts a separate windowing environment running within an X server on the cluster’s head node. While DMX can support most X-enabled software, the heavy communication load of the X protocol limits DMX’s scalability beyond sixteen nodes. In addition, to account for the display mullions, each tile must have a separate X display, since Xinerama with DMX does not permit compensation for the mullion gap. Thus, DMX is also effectively limited to sixteen displays. SAGE scales beyond both the node and tile limits of DMX by implementing its own windowing and communication protocols, and it compensates for mullions. Though SAGE does not support a full X environment, it provides native image and video support and an API for “plug-ins” for third-party applications. In addition, it uses dynamic pixel routing to allow runtime movement and scaling of imagery and video across the tiled display. This pixel routing is bandwidth intensive: the image source node must stream pixels over the network to the render nodes where the image will be displayed. Thus, available network bandwidth from the source node is often the bottleneck for SAGE performance. As Figure 1 shows, uncompressed video streaming in SAGE experiences a sharp performance drop past high-definition (1080p) resolution. Test videos were natively encoded at 24 fps, and each video test was placed over the entire 15 × 5 display
Fig. 1. This plot shows that uncompressed video streaming in SAGE is dependent on network bandwidth. Playback for resolutions higher than 1080p (1920 × 1080) is no longer real-time. The dip in bandwidth consumed just past 1080p resolution is due to the sudden drop in frame rate, causing less total data to be streamed.
to eliminate effects from display location and node communication. For a single source node on our SDR InfiniBand fabric, SAGE reached bandwidth saturation at ~230 MB/s. For a single video stream above 1080p (1920 × 1080) resolution, SAGE performance can be improved using compression or faster interconnect hardware. For several video streams with an aggregate resolution above 1080p, performance can be improved by distributing the bandwidth load across the cluster by sourcing videos from separate render nodes. The tested version of SAGE communicates via IP over InfiniBand (IPoIB); an implementation using native InfiniBand primitives would further improve network performance.
4.2 OpenGL Substitutes
Chromium [39] is a widely known OpenGL implementation for parallel and cluster rendering, though other implementations have been made [40,41]. Chromium intercepts application-level OpenGL calls and distributes them across a rendering cluster. By sending rendering information (geometry, textures, transformations) rather than raw pixels, Chromium often consumes less bandwidth at high image resolution than a pixel streaming environment like SAGE. Because it uses the OpenGL API, Chromium allows unmodified OpenGL applications to be run directly on a tiled display. Chromium is ill-suited for some applications. Because Chromium streams OpenGL calls, geometry- and texture-intensive applications can saturate network bandwidth. In addition, Chromium implements only up to the OpenGL 1.5 standard, though any missing OpenGL functionality can be implemented by the user. Finally, Chromium is subject to any resolution limitations built into the render nodes’ native OpenGL stack, so maximum image resolution may be smaller than the resolution of the display.
4.3 Custom Parallel Libraries
Researchers have built custom parallel rendering libraries to overcome Chromium’s limitations and to support specific application functionality. These include IceT [4], VR Juggler [5], CGLX [6], and Equalizer [7]. While custom library implementations can yield significantly better rendering performance over Chromium [42], their API calls must be implemented in source, thereby limiting their use with third-party applications. Since CGLX is in use on many of the largest tiled-LCD displays [3], we chose to explore its performance on Stallion. In CGLX, an instance of the application is opened on each of the rendering nodes, and the head node communicates with the render nodes to synchronize the display, thereby reducing bandwidth requirements compared to pixel or OpenGL streaming. CGLX reimplements certain OpenGL methods, such as glFrustum, to perform correctly and efficiently in a distributed parallel context. Our CGLX evaluation used its native OpenSceneGraph viewer. Our tests used 20K vertex and 840K vertex geometry files from 3Drender.com’s Lighting Challenge [43]. The benchmark results in Table 1 show that CGLX performance is determined by the slowest render node, which is due to the synchronization enforced by the head node. Further, the framerate doubles when the rendering window is displayed only on tiles with a dedicated GPU, demonstrating the increased performance of the “hot-spot” described in Section 3. The bandwidth used was approximately the same for all cases and did not exceed 160 kB/s.

Table 1. This table shows the increase in rendering performance by having fewer displays per node. In addition, it shows that CGLX scales extremely well due to its distributed architecture. All render nodes have two GPUs and most nodes drive four screens. There is a centered 5 × 3 tile “hot-spot” where each GPU drives only one tile. (*) The 15 × 5 configuration includes both two-tile-per-GPU nodes and the one-tile-per-GPU hot-spot nodes. Performance is governed by the two-tile-per-GPU nodes, with a slight performance penalty from the increased display area.

Tile Layout  Render Nodes  Tiles per GPU  FPS @ 20K Verts  FPS @ 840K Verts
5 × 3        8             1              585              115
5 × 4        5             2              250              68
15 × 5       23            2*             248              64
5
Recommendations
Framing and Display Layout. Stallion, along with other tiled-LCD displays [3,34], uses modular metal framing from 80/20. The cost of this framing comes to approximately $100 – $150 per tile, with decreasing marginal cost as total tiles increase. The frame specification can be designed in any 3D modelling tool that can make real-world distance measurements, such as Google SketchUp, and the frame can be reconfigured or expanded easily.
The relative quantities of render nodes, GPUs and display tiles dictate the layout by which tiles should be connected to rendering nodes. There are four cases, which we present in order of increasing complexity:
– A render node contains a single GPU connected to one tile. Applications displayed on the tile receive all CPU and GPU resources.
– A node contains multiple GPUs, each connected to one display tile. Applications displayed on any connected tile must share CPU resources and system memory.
– A node contains one GPU that drives multiple displays. Applications displayed on any connected tile must share GPU resources and graphics memory.
– A node contains multiple GPUs and each GPU drives multiple display tiles. Applications displayed on any connected tile must share both CPU and GPU resources.
When multiple tiles are assigned to a single render node (cases 2–4, above), these tiles should be both contiguous and regularly shaped, either in lines or rectangles, to minimize the number of applications or images sent to each render node. We have found that though “L”- and “S”-shaped layouts can be specified, they are not well-supported by graphics drivers.
Power Efficiency. When determining the power budget for a tiled display, prudent design suggests using the maximum amperage draw for all components, plus a percentage for overage (which can also accommodate future expansion). We recommend this for displays in newly-constructed facilities where power requirements can be specified in advance. In our experience, actual amperage draw is significantly below the manufacturer-quoted maximum, which may allow existing circuitry to be used for a new display in repurposed space or to expand an existing display. Table 2 presents the measured power draw for Stallion hardware; a short worked budget example follows the table.

Table 2. This table shows the power usage of Stallion’s Dell 3007WFP-HC LCD displays and Dell XPS 720 render nodes. Brightness governs the operating draw for the LCD displays; CPU and GPU load governs the operating draw for the render nodes.

                  Draw with Power Off  Observed Operating Draw  Rated Maximum Draw
Dell 3007WFP-HC   0.05 A               0.5–1.15 A               1.6 A
Dell XPS 720      0.36 A               1.36–2.2 A               8.33 A
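A small worked example of the budgeting rule just described, using the rated maxima from Table 2. The 75-display and 23-render-node counts are Stallion's; the 20% overage factor is an arbitrary choice for illustration, and the head node and network switches are ignored.

```python
displays, nodes = 75, 23
display_max_a, node_max_a = 1.6, 8.33        # rated maximum draw from Table 2, in amperes
overage = 0.20                               # illustrative safety margin

budget_a = (displays * display_max_a + nodes * node_max_a) * (1 + overage)
print(round(budget_a, 1))                    # ~ 373.9 A of circuit capacity to provision
```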
Interconnect. We have found that the system interconnect, especially connectivity from the cluster head node to the render nodes, plays a crucial role in overall system performance. At a minimum, the cluster should be interconnected with 1 Gb Ethernet (GbE), and we recommend maintaining a redundant 1 GbE network as a fallback for a higher-bandwidth interconnect. We recommend investing in a high-bandwidth interconnect, such as InfiniBand, especially if the display will be used for video streaming. We caution that, in contrast to Ethernet switches, multiple InfiniBand switches cannot easily be linked together while maintaining peak bandwidth rates. If the render node cluster may be expanded during the system lifetime, we recommend using a blade-based InfiniBand switch so that expansion ports can be added without impacting overall fabric bandwidth.

Render Nodes. We use Dell XPS 720 "gaming boxes" in Stallion with NVIDIA GeForce 8800 GTX GPUs, and we have used Dell Precision 690 workstations in other displays [34]. We chose workstation form factors because rack-mounted nodes with GPUs were not yet available when the machines were designed. Rack-mounted render nodes may better fit space, aesthetic, and HVAC constraints, depending on the location of the display, though they may run louder than workstation machines. We have found the GeForce-class GPUs sufficient to drive both Stallion and smaller displays [34], though video lag among tiles can be seen at very high frame rates. Quadro-class GPUs are capable of hardware-enforced frame-locking, though additional daughter cards are needed for each render node to enable it. Lag may seriously affect high-frame-rate immersive applications, but we have found the actual impact of lag on display usefulness to be negligible, in part because the display mullions reduce the noticeable effects of lag between tiles.

Display Tiles. In addition to the per-pixel cost savings of using commodity LCD displays, the display mullions help reduce assembly and maintenance costs by masking small misalignments between displays that would otherwise be visually objectionable. Informally, we have found that users interpret the mullions as "window panes," and with this idea they see "past" them as if looking out a window. We posit that this phenomenon exists only for mullions of a certain size. If the mullions are sufficiently thin, or removed entirely, a viewer may ignore the tile divisions and interpret the display as a single solid image. If this occurs, any tile misalignment would be visually objectionable. Further, the display would need periodic realignment due to natural shifting of the frame and building. We advise purchasing extra displays at the time of the original order to ensure a supply of replacement tiles of the same form factor and manufacture lot. Small variances in the color temperature of LCD back-lights from different manufacture lots can cause objectionable variance among tiles, though GPU driver settings can provide corrective adjustments. Having replacement displays on hand simplifies maintenance, and these displays can be used on other systems until needed.

Software Selection. Many tiled-LCD displays, including Stallion, use a Unix-based operating system such as Ubuntu, Red Hat, or Mac OS X [3,34,35]. We use a Long Term Support (LTS) release of the Ubuntu Linux distribution on Stallion, since LTS distribution support is guaranteed for two years and updated packages are provided every six months. Display environments should be chosen according to the anticipated uses of the tiled display. We summarize the available display environment options in Table 3.

Table 3. Capabilities of the display environments. Windowing Environments include SAGE and DMX. OpenGL Substitute refers to parallel rendering libraries like Chromium that implement the OpenGL API directly. Custom Parallel Library refers to parallel rendering libraries that use their own API, such as CGLX, VR Juggler, and IceT. (1) DMX supports most X-enabled applications, while SAGE supports a limited range of applications via SAGE plug-ins.

                      | Windowing Environment   | OpenGL Substitute              | Custom Parallel Lib
Application Location  | head / cluster          | head node                      | cluster nodes
Distributed Apps      | ✓                       |                                | ✓
Distributed Rendering |                         | ✓                              | ✓
Distributed Display   | ✓                       | ✓                              | ✓
Must Modify App Code  | *1                      |                                | ✓
Example Use Case      | Image & Video Streaming | Parallel Render 3rd Party Apps | Parallel Render Custom Apps

For distributed parallel applications that require an MPI stack, we have found
that OpenMPI provides a simple and stable MPI environment, especially over gigabit Ethernet. For an InfiniBand-connected cluster, we recommend either OpenMPI or MVAPICH MPI stacks.
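Once an MPI stack is installed, a small sanity check that every render node can be reached is useful before debugging a full parallel application. The program below is a generic MPI example, not code from the paper; the hostfile name and the mpirun invocation in the comment are illustrative.

```cpp
// Build:  mpicxx -o mpi_check mpi_check.cpp
// Run:    mpirun -np 23 --hostfile render_nodes ./mpi_check   (hostfile name is hypothetical)
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(host, &len);

    // Each render node reports in; rank 0 prints a summary afterwards.
    std::printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) std::printf("MPI stack reachable on all %d processes\n", size);

    MPI_Finalize();
    return 0;
}
```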
6 Future Work and Conclusion
In this paper, we capture our lessons learned from constructing large tiled-LCD displays at resolutions ranging from 36 Mpixels to 307 Mpixels. We demonstrate that large tiled-LCD displays can be built using commodity parts and run using open-source software, which helps make them the lowest price-per-pixel technology for high-resolution displays. While custom-built libraries provide the best rendering performance on these displays, windowing environments and parallel OpenGL implementations can provide adequate performance for video and third-party applications. Yet these options could be improved: an efficient distributed parallel implementation for image and video streaming would mitigate the need for a high-cost, high-bandwidth interconnect, and a parallel implementation of the current OpenGL standard would increase the types of software immediately usable on these displays. Also, progress on rack-mountable rendering nodes opens the possibility of mobile high-resolution tiled displays that could be deployed with remote research teams to analyze high-resolution data at the point of generation. We hope that the high-resolution display community continues to embrace open-source, freely available software so that access to these displays may continue to grow.
Acknowledgements Thanks to Hank Childs, Kelly Gaither, Karl Schulz, Byungil Jeong and the anonymous reviewers for their helpful comments. This work was funded in part by generous donations from Dell, Microsoft, and the Office of the Vice-President for Research of the University of Texas at Austin.
References 1. NASA (2009), http://earthobservatory.nasa.gov/features/bluemarble 2. The Electron Microscopy Outreach Program (2008), http://em-outreach.ucsd.edu/ 3. DeFanti, T.A., Leigh, J., Renambot, L., Jeong, B., Verlo, A., Long, L., Brown, M., Sandin, D.J., Vishwanath, V., Liu, Q., Katz, M.J., Papadopoulos, P., Keefe, J.P., Hidley, G.R., Dawe, G.L., Kaufman, I., Glogowski, B., Doerr, K.-U., Singh, R., Girado, J., Schulze, J.P., Kuester, F., Smarr, L.: The OptiPortal, a Scalable Visualization, Storage, and Computing Interface Device for the OptiPuter. Future Generation Computer Systems 25, 114–123 (2009) 4. Moreland, K., Wylie, B., Pavlakos, C.: Sort-Last Parallel Rendering for Viewing Extremely Large Data Sets on Tile Displays. In: IEEE Symposium on Parallel and Large-Data Visualization and Graphics, pp. 85–92 (2001) 5. Bierbaum, A., Hartling, P., Morillo, P., Cruz-Neira, C.: Implementing Immersive Clustering with VR Juggler. In: Gervasi, O., Gavrilova, M.L., Kumar, V., Lagan´ a, A., Lee, H.P., Mun, Y., Taniar, D., Tan, C.J.K. (eds.) ICCSA 2005. LNCS, vol. 3482, pp. 1119–1128. Springer, Heidelberg (2005) 6. Doerr, K.U., Kuester, F. (2009), http://vis.ucsd.edu/~ cglx/ 7. Eilemann, S., Makhinya, M., Pajarola, R.: Equalizer: A Scalable Parallel Rendering Framework. IEEE Transactions on Visualization and Computer Graphics 15, 436–452 (2009) 8. Yost, B., Haciahmetoglu, Y., North, C.: Beyond Visual Acuity: The Perceptual Scalability of Information Visualizations for Large Displays. In: ACM Conference on Human Factors in Computer Systems (CHI), April 2007, pp. 101–110 (2007) 9. Ball, R., North, C.: An Analysis of User Behavior on High-Resolution Tiled Displays. In: Tenth IFIP International Conference on Human-Computer Interaction, pp. 350–364 (2005) 10. Ball, R., Varghese, M., Carstensen, B., Cox, E.D., Fierer, C., Peterson, M., North, C.: Evaluating the Benefits of Tiled Displays for Navigating Maps. In: International Conference on Human-Computer Interaction, pp. 66–71 (2005) 11. Czerwinski, M., Smith, G., Regan, T., Meyers, B., Robertson, G., Starkweather, G.: Toward Characterizing the Productivity Benefits of Very Large Displays. In: Eighth IFIP International Conference on Human-Computer Interaction (2003) 12. Czerwinski, M., Tan, D.S., Robertson, G.G.: Women Take a Wider View. In: ACM Conference on Human Factors in Computing Systems, pp. 195–201 (2002) 13. Shupp, L., Andrews, C., Kurdziolek, M., Yost, B., North, C.: Shaping the Display of the Future: The Effects of Display Size and Curvature on User Performance and Insights. Human-Computer Interaction (to appear) (2008) 14. Shupp, L., Ball, R., Yost, B., Booker, J., North, C.: Evaluation of Viewport Size and Curvature of Large, High-Resolution Displays. In: Graphics Interface (GI), pp. 123–130 (2006) 15. Ball, R., North, C., Bowman, D.A.: Move to Improve: Promoting Physical Navigation to Increase User Performance with Large Displays. In: ACM Conference on Human Factors in Computer Systems, pp. 191–200 (2007) 16. Dourish, P.: Seeking a Foundation for Context-Aware Computing. HumanComputer Interaction 16(2), 229–241 (2001) 17. Dourish, P.: Where the Action Is: The Foundations of Embodied Interaction. MIT Press, Cambridge (2001)
18. Ball, R., North, C.: Realizing Embodied Interaction for Visual Analytics through Large Displays. Computers & Graphics 31(3), 380–400 (2007) 19. Ball, R., North, C.: The Effects of Peripheral Vision and Physical Navigation in Large Scale Visualization. In: Graphics Interface, pp. 9–16 (2008) 20. NASA (2005), http://photojournal.jpl.nasa.gov/catalog/pia04182 21. NASA (2008), http://www.spitzer.caltech.edu/media/releases/ ssc2008-11/ssc2008-11a.shtml 22. Phan, S., Lawrence, A.: Tomography of Large Format Electron Microscope Tilt Series: Image Alginment and Volume Reconstruction. In: 2008 Congress on Image and Signal Processing, pp. 176–182 (2008) 23. Springel, V., White, S.D.M., Jenkins, A., Frenk, C.S., Yoshida, N., Gao, L., Navarro, J., Thacker, R., Croton, D., Helly, J., Peacock, J.A., Cole, S., Thomas, P., Couchman, H., Evrard, A., Colberg, J., Pearce, F.: Simulations of the Formation, Evolution and Clustering of Galaxies and Quasars. Nature 435, 629–636 (2005) 24. Shapiro, P.R., Iliev, I.T., Mellema, G., Pen, U.-L., Merz, H.: The Theory and Simulation of the 21-cm Background from the Epoch of Reionization. In: The Evolution of Galaxies through the Neutral Hydrogen Window (AIP Conf. Proc.) (2008) 25. Kim, J., Park, C., Gott III, J.R., Dubinski, J.: The Horizon Run N-Body Simulation: Baryon Acoustic Oscillations and Topology of Large Scale Structure of the Universe. The Astrophysical Journal (submitted) (2009) 26. Dawson, C., Westerink, J., Kubatko, E., Proft, J., Mirabito, C.: Hurricane Storm Surge Simulation on Petascale Computers. In: TeraGrid 2008 presentation (2008) 27. Donzis, D., Yeung, P., Sreenivasan, K.: Energy Dissipation Rate and Enstrophy in Isotropic Turbulence: Resolution Effects and Scaling in Direct Numerical Simulations. Physics of Fluids 20 (2008) 28. KVM sans V (2009), http://kvmsansv.com/multi-megapixel_displays.html 29. JVC (2009), http://pro.jvc.com/prof/attributes/features.jsp?model_id=mdl101793 30. Sony Electronics Inc. (2009), http://pro.sony.com/bbsc/ssr/cat-projectors/cat-ultrahires/ 31. Jaynes, C., Seales, W., Calvert, K., Fei, Z., Griffioen, J.: The Metaverse: a Networked Colleciton of Inexpensive, Self-Configuring, Immersive Environments. In: Proceedings of the EuroGraphics Workshop on Virtual Environments (EGVE), pp. 115–124 (2003) 32. Davies, C.: JVC create 35-megapixel 8k x 4k projector LCD. SlashGear.com (2008) 33. Ni, T., Schmidt, G.S., Staadt, O.G., Livingston, M.A., Ball, R., May, R.: A Survey of Large High-Resolution Display Technologies, Techniques, and Applications. In: IEEE Virtual Reality 2006, pp. 223–234 (2006) 34. Johnson, G.P., Navr´ atil, P.A., Gignac, D., Schulz, K.W., Minyard, T.: The Colt Visualization Cluster. Technical Report TR-07-03, Texas Advanced Computing Center (2007) 35. Laboratory for Information Visualization and Evaluation, Virginia Tech. (2009), http://infovis.cs.vt.edu/gigapixel/index.html 36. Texas Advanced Computing Center (2009), http://www.tacc.utexas.edu/resources/vislab/ 37. Distributed Multihead X Project (2004), http://dmx.sourceforge.net/
38. Renambot, L., Rao, A., Singh, R., Jeong, B., Krishnaprasad, N., Vishwanath, V., Chandrasekhar, V., Schwarz, N., Spale, A., Zhang, C., Goldman, G., Leigh, J., Johnson, A.: SAGE: the scalable graphics architecture for high resolution displays. In: Proceedings of WACE 2004 (2004) 39. Humphreys, G., Houston, M., Ng, R., Frank, R., Ahern, S., Kirchner, P.D., Klosowski, J.T.: Chromium: a Stream-Processing Framework for Interactive Rendering on Clusters. In: ACM SIGGRAPH, pp. 693–702 (2002) 40. Mitra, T., Chiueh, T.: Implementation and Evaluation of the Parallel Mesa Library. In: International Conference on Parallel and Distributed Systems, pp. 84–91 (1998) 41. Wylie, B., Pavlakos, C., Lewis, V., Moreland, K.: Scalable Rendering on PC Clusters. IEEE Computer Graphics and Applications 21, 62–70 (2001) 42. Staadt, O.G., Walker, J., Nuber, C., Hamann, B.: A Survey and Performance Analysis of Software Platforms for Interactive Cluster-Based Multi-Screen Rendering. In: IPT/EGVE, pp. 261–270 (2003) 43. 3DRender.com (2009), http://www.3drender.com/challenges/
Fast Spherical Mapping for Genus-0 Meshes

Shuhua Lai 1, Fuhua (Frank) Cheng 2, and Fengtao Fan 2

1 Department of Mathematics and Computer Science, Virginia State University, Petersburg, VA 23806
2 Graphics and Geometric Modeling Lab, Department of Computer Science, University of Kentucky, Lexington, Kentucky 40506
Abstract. Parameterizing a genus-0 mesh onto a sphere means assigning a 3D position on the unit sphere to each of the mesh vertices, such that the spherical mapping induced by the mesh connectivity is not too distorted and does not have overlapping areas. Satisfying the non-overlapping requirement is sometimes the most difficult and critical component of many spherical parametrization methods. In this paper we propose a fast spherical mapping approach that can map any closed genus-0 mesh onto a unit sphere without overlapping any part of the given mesh. This new approach does not try to preserve angles or edge lengths of the given mesh in the mapping process; however, test cases show it can obtain meaningful results. The mapping process does not require setting up any linear systems, nor any expensive matrix computation, but is simply done by iteratively moving vertices of the given mesh locally until a desired spherical mapping is reached. Therefore the new spherical mapping approach is fast and consequently can be used for meshes with a large number of vertices. Moreover, the iterative process is guaranteed to be convergent. Our approach can be used for texture mapping, remeshing, and 3D morphing, and its output can serve as input for other more rigorous and expensive spherical parametrization methods to achieve more accurate parametrization results. Some test results obtained using this method are included, and they demonstrate that the new approach can achieve spherical mapping results without any overlapping.
1 Introduction
Surface parameterization refers to the process of bijectively mapping the whole surface or a region of the surface onto a 2D plane, a 2D disk or a 3D sphere. Parameterization is a central issue and plays a fundamental role in computer graphics [16]. It finds a correspondence between a discrete surface patch and an isomorphic planar mesh through a piecewise linear function or mapping. Parameterization of 3D meshes is important for many graphics applications, in particular for texture mapping, surface fitting, remeshing, morphing and many other applications. In practice, parameterization is simply obtained by assigning each mesh vertex a pair of coordinates (u, v) referring to its position on the planar region. Such a one-to-one mesh mapping provides a 2D flat parametric space, allowing one to perform any complex graphics application directly on the 2D flat domain rather than on the 3D curved surface [14].
Perfect parameterization of meshes of arbitrary topology is currently still unavailable [6]. Many parameterization techniques are only suitable for genus-0 meshes because of their simplicity and ubiquity [6]; for example, the meshes of almost all animals are genus-0. Closed manifold genus-0 meshes are topologically equivalent to a sphere, hence a sphere is the natural parameter domain for them. Parameterization of a genus-0 mesh onto the sphere means assigning a 3D position on the unit sphere to each of the mesh vertices, such that the spherical mesh (having the same topology and connectivity as the original given mesh) is not too distorted and does not overlap. There are some techniques that can be used to achieve less mapping distortion, such as harmonic mappings and conformal mappings, which have been intensively studied recently, and many nice methods have been proposed. However, for most spherical mapping methods, satisfying the non-overlapping requirement still cannot be guaranteed, although this requirement is a critical component of any spherical mapping process. Moreover, most parametrization methods are very expensive and take a long time to achieve a desirable mapping [9,15,16,18]. In this paper we describe a new iterative approach for fast spherical mapping of 3D models from a genus-0 mesh to a 3D unit sphere. The new method is easy to understand, easy to implement, and can achieve relatively good mapping results. More importantly, our approach can guarantee that the resulting mapping has no overlapping area on the sphere. The basic idea is to first project a closed genus-0 3D model onto a unit sphere and then use subdivision techniques to smooth out the overlapping areas on the sphere. The projection and smoothing process does not require setting up any linear systems, nor any matrix computation; it is simply done by iteratively moving vertices of the genus-0 mesh locally until a meaningful mapping without any overlapping is reached. Therefore the new iterative method is very fast and consequently can be used for meshes with a large number of vertices. Moreover, the iterative process is guaranteed to be convergent. The capability of the new approach is demonstrated with the test examples shown in the paper.
2 Previous Work
3D Parametrization/spherical mapping: There are many publications [6] on surface mapping or parametrization, because it is a popular topic that has been intensively studied recently, and many nice methods have been developed for genus-0 or arbitrary-topology meshes. Most research work is based on genus-0 meshes, because for arbitrary-genus meshes a well-known approach is to segment the mesh into disk-like patches such that each of the patches is genus-0. The challenge of this approach is to obtain mappings that are smooth across the patch boundaries, and methods like the one presented in [14] indeed suffer from this problem, although recently Gu and Yau proposed a different method to compute parameterizations that are globally smooth, with singularities occurring at only a few extraordinary vertices [15]. For genus-0 meshes, much more research has been done on mesh parameterization due to its usefulness in computer graphics applications [6].
The majority of these methods are based on conformal mapping [6]. A conformal mapping is a mapping that preserves angles between edges on the mesh. Perfect conformal mapping is difficult to achieve, hence most spherical mapping methods attempt to mimic conformal maps, preserving angles as much as possible by optimizing some constraints. There has been a lot of interest in spherical parameterization recently, and here we briefly summarize related recent work on spherical parametrization using conformal mapping. Floater introduced a mesh parameterization technique based on convex combinations [7]: for all vertices, their 1-ring stencils are parameterized using a local parameterization space, and the overall parameterization is obtained by solving a sparse linear system with angle-preservation constraints. Quicken et al. parametrized the surface of a voxel volume onto a sphere; their nonlinear objective functional exploits the uniform quadrilateral structure of the voxel surface [8] and tries to preserve the areas and angles of surface elements. Desbrun et al. presented a method for computing conformal parameterizations by minimizing the Dirichlet energy defined on triangle meshes [9]. Gortler et al. proposed a mesh parameterization approach using discrete 1-forms [11]; their approach provided an interesting result in mesh parameterization, but it fails to control the curvatures on the surface boundaries. Sheffer et al. presented an angle-based flattening (ABF) approach, in which the change of angles after flattening is minimized [12]. Grimm partitions a surface into six pieces, maps these to the faces of a cube, and then to a sphere [10]; an a priori chart of the surface partition is used as a constraint in the spherical parametrization process. Kharevych et al. [13] obtained locally conformal parameterizations by preserving intersection angles among circum-circles, each of which is derived from a triangle on the given mesh.

Subdivision Surfaces: Given a control mesh, a subdivision surface is generated by iteratively subdividing the control mesh to form new and finer control meshes. The refined control meshes converge to a limit surface called a subdivision surface. Subdivision is a smoothing process, and it makes the meshes finer and smoother. The control mesh of a subdivision surface can contain vertices whose valences are different from four; those vertices are called extraordinary vertices. The limit point of a vertex is the point of the subdivision surface that corresponds to the vertex. It is well known that all the limit points can be calculated directly from a given mesh. Subdivision surfaces can model and represent complex shapes of arbitrary topology because there is no limit on the shape and topology of the control mesh of a subdivision surface. Recently, with parametrization [5] and multiresolution representations of subdivision surfaces becoming available, they have been used for many graphics and geometric applications. Subdivision surfaces are used as the surface representation in our spherical mapping technique. Without loss of generality, we assume the subdivision scheme considered in this paper is the Catmull-Clark subdivision scheme [1], but our approach works for other subdivision schemes as well, for instance Loop subdivision [3] or the Doo-Sabin subdivision scheme [2].
Fig. 1. Spherical mapping using projection and smoothing: (a) original model; (b) after projection; (c) after smoothing.
3 Basic Idea
Given a genus-0 mesh M with arbitrary topology, our task is to find another mesh Q such that the new mesh Q has the same topology and connectivity as the given mesh M, while all the vertices of the new mesh Q lie on a unit sphere. The basic idea is to achieve this through projection and smoothing. First, all the vertices of the given mesh M are projected (with respect to the center of the unit sphere) onto a unit sphere. After this step all the vertices are on the unit sphere; however, most likely there will be some overlapping areas on it. Hence we need a second step, which is to adjust the new mesh such that no overlapping areas exist on Q. This second step can be done with mesh smoothing techniques. Figure (1) shows the process of our spherical mapping approach: Figure (1(a)) is the given mesh M, Figure (1(b)) is the mesh after M is projected onto a unit sphere, and Figure (1(c)) is the mesh obtained after smoothing, which can be regarded as a spherical mapping of the original mesh M because there is no overlapping on the sphere anymore. The above process seems very simple and straightforward, but the implementation is tricky. For example, in the first step, how should the unit sphere be positioned so that the result of the projection process makes the subsequent smoothing easier and consequently yields a more meaningful spherical mapping? If the unit sphere is not positioned well, the given mesh could be projected onto only part of the sphere, and wide areas of the unit sphere could be left empty without any vertices on them. For the second step, designing an efficient smoothing approach that removes all the overlapping areas on the sphere is very important, since the chosen smoothing method determines the final quality of the spherical mapping results. In the following sections, we present our approaches to the projection and smoothing processes.
4 Spherical Mapping

4.1 Direct Calculation of the Limit Position of a Vertex
We need to calculate the limit positions of all the vertices in the spherical mapping process, which can be done using existing subdivision techniques.
Fig. 2. Vertex V and its neighboring vertices
In our method, we choose to use the general Catmull-Clark subdivision scheme, but similar formulas can be derived for other subdivision schemes as well. Basically, we need to find a matrix A such that, for a given mesh M, A ∗ M is the mesh whose vertices are the limit points of the vertices of M. The matrix A is not needed explicitly in the implementation; the whole process can be carried out locally according to the entries of A. A can be constructed as follows. For the generalized Catmull-Clark subdivision scheme, new face points and edge points are calculated the same way as in the standard Catmull-Clark scheme, but the new vertex points are calculated differently, using the formula

V' = \frac{n-2}{n} V + \frac{1}{n^2} \sum_{j=1}^{n} \bigl( \alpha V + (1-\alpha) E_j \bigr) + \frac{1}{n^2} \sum_{j=1}^{n} f_j ,

where 0 ≤ α ≤ 1 and the f_j are the new face points after one subdivision [1]. When α = 0, we get the standard Catmull-Clark subdivision scheme. The limit point [4] of a vertex V_i of degree n_i can be calculated as

V_i^{\infty} = \frac{1}{n_i (n_i + 5)} \Bigl( b_{ii} V_i + \sum_j b_{ij} E_j + \sum_j b_{ij} F_j \Bigr),

where

b_{ii} = (n_i - 1) n_i + n_i \alpha + \sum_j \frac{4}{d_{ij}},
b_{ij} = 2 - \alpha + \frac{4}{d_{ij}} + \frac{4}{d_{ji}}, if (V_i, V_j) is an edge,
b_{ij} = \frac{4}{d_{ij}}, if (V_i, V_j) is a diagonal of a face,
b_{ij} = 0, if V_i and V_j do not belong to the same face.
Note that in the above formula the surrounding faces need not be four-sided (see Figure 2). Here d_{ij} is the number of sides of the face of which (V_i, V_j) is an edge or a diagonal. Note that d_{ij} and d_{ji} can have different values, because the two faces adjacent to the edge (V_i, V_j) can have different numbers of sides; if (V_i, V_j) is a diagonal of a face, then d_{ij} = d_{ji}. According to the above definition, we have

A_{ij} = \frac{b_{ij}}{n_i (n_i + 5)}.

It can be proven [17] that all the eigenvalues of A belong to (0, 1]. Therefore A^i ∗ M is guaranteed to converge to a single point as i → ∞.
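The following C++ sketch shows how the limit position of a single vertex can be evaluated directly from its one-ring, using the b coefficients defined above. The Vec3 type and the neighbor bookkeeping (edge neighbors E_j, diagonal neighbors F_j, and the face sizes d_ij, d_ji) are our own illustrative assumptions about how the mesh might be stored; only the weights follow the formulas in this section.

```cpp
#include <utility>
#include <vector>

struct Vec3 {
    double x = 0, y = 0, z = 0;
    Vec3 operator+(const Vec3& o) const { return {x + o.x, y + o.y, z + o.z}; }
    Vec3 operator*(double s)      const { return {x * s, y * s, z * s}; }
};

// One-ring data for vertex V_i of valence n (hypothetical layout).
struct OneRing {
    Vec3 v;                                     // V_i itself
    std::vector<Vec3> edgeNbr;                  // E_j, j = 1..n
    std::vector<Vec3> diagNbr;                  // F_j (diagonal vertices of adjacent faces)
    std::vector<int>  faceSides;                // number of sides d of each adjacent face
    std::vector<std::pair<int,int>> edgeFaces;  // (d_ij, d_ji) for each edge (V_i, E_j)
    std::vector<int>  diagFaceSides;            // d of the face each F_j lies on
};

Vec3 limitPosition(const OneRing& r, double alpha) {
    const double n = static_cast<double>(r.edgeNbr.size());

    // b_ii = (n - 1) n + n * alpha + sum_j 4 / d_ij  (sum over the adjacent faces)
    double bii = (n - 1.0) * n + n * alpha;
    for (int d : r.faceSides) bii += 4.0 / d;

    Vec3 sum = r.v * bii;
    for (size_t j = 0; j < r.edgeNbr.size(); ++j) {       // edge weights
        const double bij = 2.0 - alpha
                         + 4.0 / r.edgeFaces[j].first
                         + 4.0 / r.edgeFaces[j].second;
        sum = sum + r.edgeNbr[j] * bij;
    }
    for (size_t j = 0; j < r.diagNbr.size(); ++j)          // diagonal weights
        sum = sum + r.diagNbr[j] * (4.0 / r.diagFaceSides[j]);

    return sum * (1.0 / (n * (n + 5.0)));                  // 1 / (n_i (n_i + 5))
}
```

For a mesh of all quads and α = 0 the weights reduce to the familiar Catmull-Clark limit mask (n², 4, 1) divided by n(n + 5), which is a useful sanity check for an implementation.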
4.2 Mesh Projection
The first step of our spherical mapping process is to project the given mesh onto a unit sphere. The key to this step is determining the center of the sphere, because the projection is done with respect to that center. The chosen center has to be located inside the given mesh and should be roughly the same distance from all vertices of the mesh, so that the projection introduces as little distortion as possible. There are many choices for positioning the sphere, for example the average of all vertices, the centroid of the mesh, or others. However, none of them guarantees that the chosen point is inside the given mesh, considering that the mesh could be of arbitrary shape; if not chosen properly, this step may fail or make the following steps of the spherical mapping process more difficult. To make sure the chosen center is inside the given mesh and approximately close to all vertices, we first smooth the given mesh M_0 = M. Since subdivision is a smoothing process, we can use subdivision techniques to smooth the mesh M_0. However, we do not want to introduce any new vertices in the process. To achieve this, we simply calculate the limit positions of all the vertices of M_0 repeatedly, using the matrix A given in the previous section:

M_i = A ∗ M_{i-1},

in other words, M_i = A^i ∗ M_0. Because all the eigenvalues of A belong to (0, 1], M_i eventually becomes a single point as i → ∞ [17]. Therefore M_i tends to look more like a sphere as i grows, which helps us better position the projection sphere. The process of calculating the limit points of a given mesh is very fast because it is a local operation, so it does not take much time to obtain a smoother version of the given mesh. To find a good candidate for the sphere center, we take the contribution of each vertex into consideration so that less distortion is introduced in the projection process. In our implementation, instead of using the centroid of the mesh, we use a weighted centroid, which is calculated as follows:
C = \sum_{k=1}^{N} \frac{|V_k - G|}{\sum_{j=1}^{N} |V_j - G|} \, V_k ,
where G is the centroid of the mesh (the average of all the vertices), N is the total number of vertices in the given mesh, and V_k and V_j are the vertices of the mesh M_i (note that this is not the given mesh M). Once we have the sphere center C and M_i (in our implementation we use M_{10}), we can project M_i onto a unit sphere with respect to the center of the sphere.
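A minimal sketch of this projection step is shown below: it computes the ordinary centroid G, the distance-weighted centroid C, and then radially projects every vertex of the smoothed mesh onto the unit sphere around C. The Vec3 helpers and the std::vector mesh representation are illustrative assumptions, not the authors' implementation.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };

static Vec3   sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double len(const Vec3& a)                { return std::sqrt(a.x*a.x + a.y*a.y + a.z*a.z); }

// Project the smoothed mesh Mi onto the unit sphere centered at the
// distance-weighted centroid C, returning the spherical mesh Q0.
std::vector<Vec3> projectToSphere(const std::vector<Vec3>& Mi) {
    const size_t N = Mi.size();

    Vec3 G{};                                       // plain centroid
    for (const Vec3& v : Mi) { G.x += v.x / N; G.y += v.y / N; G.z += v.z / N; }

    double wsum = 0;                                // sum_j |V_j - G|
    for (const Vec3& v : Mi) wsum += len(sub(v, G));

    Vec3 C{};                                       // weighted centroid
    for (const Vec3& v : Mi) {
        const double w = len(sub(v, G)) / wsum;
        C.x += w * v.x; C.y += w * v.y; C.z += w * v.z;
    }

    std::vector<Vec3> Q0(N);
    for (size_t k = 0; k < N; ++k) {                // radial projection onto the unit sphere
        const Vec3 d = sub(Mi[k], C);
        const double r = len(d);
        Q0[k] = {d.x / r, d.y / r, d.z / r};
    }
    return Q0;
}
```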
4.3 Mesh Smoothing
Once we have a desirable projection of the given mesh M onto a unit sphere, the next step is to remove the overlapping areas on the unit sphere. Again, here we use
subdivision techniques to smooth out the overlapping areas of the new mesh Q_0 obtained from the projection process. For all vertices of mesh Q_0, we find their limit positions using the matrix A, map them back to the unit sphere, and then repeat the same procedure again and again:

Q_i = P(A ∗ Q_{i-1}),

where P is the projection operator with respect to the chosen sphere center. Because Q_{i-1} lies on a unit sphere and the matrix A is a smoothing operator, A ∗ Q_{i-1} has less overlapping area than Q_{i-1}, and so does Q_i. Therefore, by repeating the process, Q_i has less and less overlapping, and eventually it has no overlapping areas at all. Because the eigenvalues of A are not bigger than one, the above process is guaranteed to be convergent. However, in order to remove all the overlapping areas in Q_i, i does not have to go to infinity; Q_i is already free of overlapping after some number of iterations of the above smoothing process.

The above process works fine in theory, but it converges very slowly: removing all the overlapping areas takes many iterations. To overcome this, we use the following two methods to speed up the spherical mesh smoothing process. The first is to offset the vertex positions of Q_i after each iteration:

Q_i ← Q_i + u ∗ (Q_i - Q_{i-1}),

where u is an offset factor, which has to be chosen carefully in order to keep the iteration stable; in our implementation we choose u = 0.25. The second is to adjust the vertex positions of Q_i after each iteration such that each vertex is located at the centroid of its first ring of surrounding triangles. The new location of a vertex V can be calculated as

\bar{V} = \frac{\sum_{i=1}^{k} F_i A_i}{\sum_{i=1}^{k} A_i},

where k is the number of surrounding triangles, A_i is the area of the i-th spherical triangle, and F_i is the location of the centroid of the i-th spherical triangle, which is the projection of the average of the triangle's vertex locations. The area of a spherical triangle ABC on a unit sphere can be calculated explicitly as ∠A + ∠B + ∠C - π. These two methods dramatically improve the convergence speed of the smoothing process, because for a vertex with overlapping surrounding triangles they move the vertex to a new location at which its surrounding triangles have much less overlapping area. This is true because all the vertices are on a unit sphere.
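A compact sketch of one smoothing iteration is given below, using the same illustrative vertex-array representation as the previous sketches. The applyA callback stands for the local evaluation of A ∗ Q from Section 4.1 and is passed in rather than shown; the offset factor u = 0.25 follows the text, and the sphere is assumed to be centered at the origin for brevity.

```cpp
#include <cmath>
#include <functional>
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };

static Vec3 normalize(const Vec3& v) {
    const double r = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x / r, v.y / r, v.z / r};
}

// One smoothing iteration: Q_i = P(A * Q_{i-1}) followed by the offset
// Q_i <- Q_i + u * (Q_i - Q_{i-1}), with every vertex re-projected onto the sphere.
std::vector<Vec3> smoothOnce(
        const std::vector<Vec3>& Qprev,
        const std::function<std::vector<Vec3>(const std::vector<Vec3>&)>& applyA,
        double u = 0.25) {
    std::vector<Vec3> Qi = applyA(Qprev);                 // A * Q_{i-1}: limit positions
    for (Vec3& v : Qi) v = normalize(v);                  // P(.): back onto the unit sphere

    for (size_t k = 0; k < Qi.size(); ++k) {              // over-relaxation speed-up
        const Vec3 d{Qi[k].x - Qprev[k].x, Qi[k].y - Qprev[k].y, Qi[k].z - Qprev[k].z};
        Qi[k] = normalize({Qi[k].x + u * d.x, Qi[k].y + u * d.y, Qi[k].z + u * d.z});
    }
    return Qi;                                            // iterate until no triangles overlap
}
```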
5 Test Cases
The proposed approach has been implemented in C++ using OpenGL as the supporting graphics system on the Windows platform.
Fig. 3. Remeshing using spherical mapping (a)-(d), texture mapping using spherical mapping (e)-(h), and morphing of an object to a sphere (i)-(n): (a) original model; (b) spherical mapping; (c) resampling; (d) remeshing; (e) original model; (f) spherical mapping; (g), (h) texture mapping; (i) original mesh; (j)-(m) morphing steps 1-4; (n) final morph.
Quite a few examples have been tested with the method described here, all of which have extraordinary vertices. Some of the tested cases are shown in Figure 3 for remeshing, morphing, and texture mapping applications, respectively. For the remeshing application in Figures 3(a)-(d), we first obtain a spherical mapping using the approach presented in this paper; we then use the method given in [16] to resample the sphere using an octahedron and to map it back to the original model to obtain a new mesh for the same model. The morphing example in Figures 3(i)-(n) is simply a linear animation between the given model and the sphere obtained from the spherical mapping process. For the texture mapping case in Figures 3(e)-(h), once we have the spherical mapping there are many ways to obtain the texture coordinates for each vertex; in this paper the method presented in [16] is used to get the uv texture coordinates. From all the test cases we can see that even though our mapping process does not try to preserve angles or edge lengths, the spherical mapping results are still meaningful and desirable. We believe some of the nice properties of subdivision surfaces, such as the convex hull property, contribute to this. Figure 3 demonstrates the capability of the new iterative method in mesh spherical mapping: it not only smooths out all the overlapping areas, but also maintains most of the geometric properties of the original mesh. The new spherical mapping method can handle meshes with a large number of vertices in a matter of seconds on an ordinary laptop (2.16 GHz CPU, 2 GB of RAM). For example, the model shown in Figures 3(e)-(h) has 35,948 vertices and 69,451 faces, and the model shown in Figure 1 has 23,202 vertices and 46,400 faces; it takes 55 seconds and 31 seconds, respectively, to obtain the spherical mappings shown in Figures 3(b) and 1(c). For smaller meshes, like the models shown in Figures 3(a) and 3(i), the spherical mapping process completes even more quickly. Hence our spherical mapping method is suitable for interactive applications, where simple shapes with small or medium-sized control vertex sets are usually used.
6 Summary and Future Work
A fast spherical mapping approach has been presented that can map any closed genus-0 mesh onto a unit sphere without overlapping any part of the given mesh. The mapping process is straightforward and easy to implement. It is achieved through mesh projection and smoothing and relies heavily on subdivision techniques to locally adjust the locations of vertices such that the mapping has less distortion and no overlapping. The mapping is computed simply by iteratively moving vertices of the given mesh locally until a desired spherical mapping is reached. Therefore it is very fast and consequently can be used for meshes with a large number of vertices. Moreover, the iterative process is guaranteed to be convergent. Some test results obtained using this method are included, and they demonstrate that the new approach can achieve spherical mapping results without any overlapping. The main disadvantage of our method is that it can fail for some cases, and sometimes it takes time to tune some parameters in the implementation to achieve better results. One direction of our future research is to
investigate the reasons and try to improve our method. In addition, in order to get a desirable spherical mapping, the stopping criterion of the iteration in our implementation is currently controlled by user interaction, because so far we do not have a metric for measuring the quality of a spherical mapping. We will work on such a metric to obtain better-quality spherical mappings in the near future.
References
1. Catmull, E., Clark, J.: Recursively generated B-spline surfaces on arbitrary topological meshes. Computer-Aided Design 10(6), 350–355 (1978) 2. Doo, D., Sabin, M.: Behavior of recursive division surfaces near extraordinary points. Computer-Aided Design 10(6), 356–360 (1978) 3. Loop, C.T.: Smooth Subdivision Surfaces Based on Triangles, MS thesis, Department of Mathematics, University of Utah (August 1987) 4. Halstead, M., Kass, M., DeRose, T.: Efficient, fair interpolation using Catmull-Clark surfaces. In: ACM SIGGRAPH, pp. 35–44 (1993) 5. Lai, S., Cheng, F.: Parametrization of General Catmull-Clark Subdivision Surfaces and its Application. Computer Aided Design & Applications 3(1-4), 513–522 (2006) 6. Floater, M.S., Hormann, K.: Surface parameterization: a tutorial and survey. In: Advances in Multiresolution for Geometric Modelling, Mathematics and Visualization, pp. 157–186. Springer, Heidelberg (2005) 7. Floater, M.S.: Parametrization and smooth approximation of surface triangulations. Computer Aided Geometric Design 14(3), 231–250 (1997) 8. Quicken, M., Brechbühler, C., Hug, J., Blattmann, H., Székely, G.: Parametrization of closed surfaces for parametric surface description. In: CVPR 2000, pp. 354–360 (2000) 9. Desbrun, M., Meyer, M., Alliez, P.: Intrinsic parameterizations of surface meshes. Computer Graphics Forum (Proc. Eurographics 2002) 21(3), 209–218 (2002) 10. Grimm, C.M.: Simple Manifolds for Surface Modeling and Parameterization. In: Proceedings of the Shape Modeling International 2002, May 17-22, p. 277 (2002) 11. Gortler, S., Gotsman, C., Thurston, D.: Discrete one-forms on meshes and applications to 3D mesh parameterization. Computer Aided Geometric Design 23(2), 83–112 (2005) 12. Sheffer, A., Levy, B., Mogilnitsky, M., Bogomyakov, A.: ABF++: Fast and robust angle based flattening. ACM Transactions on Graphics 24(2), 311–330 (2005) 13. Kharevych, L., Springborn, B., Schröder, P.: Discrete conformal mappings via circle patterns. ACM Transactions on Graphics (Proc. SIGGRAPH) 25(2) (2006) 14. Praun, E., Sweldens, W., Schröder, P.: Consistent mesh parameterizations. In: Proceedings of SIGGRAPH 2001, pp. 179–184 (2001) 15. Gu, X., Yau, S.-T.: Global conformal surface parameterization. In: Proceedings of the 1st Symposium on Geometry Processing, pp. 127–137 (2003) 16. Praun, E., Hoppe, H.: Spherical parametrization and remeshing. ACM Transactions on Graphics (Proc. SIGGRAPH 2003) 22(3), 340–349 (2003) 17. Cheng, F., Fan, F., Lai, S., et al.: Loop Subdivision Surface based Progressive Interpolation. Journal of Computer Science and Technology 24(1), 39–46 (2009) 18. Gotsman, C., Gu, X., Sheffer, A.: Fundamentals of spherical parameterization for 3D meshes. In: Proceedings of ACM SIGGRAPH 2003, pp. 358–363 (2005)
Rendering Virtual Objects with High Dynamic Range Lighting Extracted Automatically from Unordered Photo Collections

Konrad Kölzer, Frank Nagl, Bastian Birnbach, and Paul Grimm

Erfurt University of Applied Sciences, Germany
[email protected]
http://www.fh-erfurt.de
Abstract. In this paper, we present a novel technique for applying image-based lighting to virtual objects by constructing synthesized environment maps (SEMs) automatically. In contrast to related work, we use unordered photo collections as input instead of a completely measured environmental lighting (a so-called light probe). Each image is interpreted as a spatial piece of a light probe. We therefore accumulate all given pieces into a synthesized environment map using a given weight function. In addition, multiple coverage is used to increase the dynamic range. Finally, we show how this algorithm is applied in an interactive image-based Augmented Reality environment for virtual product presentations.
1 Introduction
Interior furnishing of rooms is a challenge for most people. The task requires a lot of imagination to see how furniture from a catalog will finally look in your room. Questions like "Will the new sofa match the rest of my room?" or "What does an open-plan office look like if there are no dividing walls?" are hard to answer. With our system we want to assist the user in these decisions. The idea is to create a 3D scene of a user's room and place digitally planned furniture into it to provide the user with a photo-realistic impression of the new furniture. Thanks to the ongoing development of digital cameras and storage media, a user can easily take many photos of a real scene. To use our system, he just has to shoot some overlapping photos of his empty room from arbitrary positions. Using Structure-from-Motion (SfM) algorithms for unordered images [1], we can extract the camera parameters of each photo and arrange all photos in 3D space. The arranged photos represent a copy of the real world. After that, we superimpose those images with virtual objects from the office furniture industry. The goal of our research is to embed virtual objects under photo-realistic illumination conditions in such 3D worlds. To achieve a seamless light transition between those virtual objects and the real scene, the virtual objects have to be rendered with the illumination conditions found in the photos. In our system a fully automated process leads from photo collections and some planning data
to a photo-realistically illuminated three-dimensional view of augmented photos. In other office design solutions the user sees the planned furniture only in a purely virtual environment, and connecting real and virtual worlds and registering virtual objects correctly often requires a calibration setup or tracking systems. In this paper, we present an approach to create a photo-realistically superimposed copy of the real world based on unordered photo collections, without any hardware calibration setup; standard consumer cameras can be used to take the photos. The paper is structured as follows: first, we give an overview of work related to ours in Section 2. Next, in Section 3, we present our approach; there, we discuss how to extract the lighting information from the images and how to apply it for real-time rendering. Afterwards, the realization of our approach is presented in Section 4. Finally, we describe the results of our realization and future work in Sections 5 and 6.
2 Related Work
As described in the introduction, we superimpose images of the real world with virtual furniture objects. Since we satisfy the three criteria that define augmented reality (AR) according to Ronald Azuma [2], namely the combination of real and virtual worlds, interaction in real time, and the registration of virtual objects in the real world, we see our work in this field of research. In the past years, more and more AR systems have been developed to support users in planning tasks, whether in planning the correct assembly of cable harnesses at Boeing [3] or in planning whole factory halls such as the robotic workstations at Volkswagen [4]. Even the furniture industry develops AR systems to present their products in real environments, so that customers can plan furniture in videos or photos of their own apartments. The Brazilian furniture warehouse Tok&Stok presents a system [5] to visualize furniture superimposed on a webcam video stream. Similar to that is Click&Design [6], which superimposes single images instead of a video stream. All of them use image-based tracking systems with markers in the images to recognize the 3D space. In [7] a system for augmented image synthesis is presented that renders objects into images captured from HDR stereo cameras using a real-time capable technique called differential photon mapping, which provides realistic illumination. In contrast, we want the user to create a 3D world from photos without placing any marker or other calibration hardware in the scene. We use image-based rendering techniques to create the 3D scene from photos and illuminate overlaid virtual objects photo-realistically using image-based lighting.

Image-based rendering aims at synthesizing novel views from a set of photographs. We want to allow a fast and easy way to build photo scenes in 3D space, which we call a 3D Photo Collection. Consequently, restrictions on shooting the photos have to be as small as possible. In 2006, Snavely et al. [1] presented Photo Tourism, a system to interactively browse large sets of unordered photos using a 3D interface.
Fig. 1. Render steps and results of rendering virtual objects with HDR lighting extracted automatically from unordered photo collections: (a) unordered photos; (b) constructed SEM; (c) extracted lights; (d) 3D Photo Collection with embedded chair; (e) chair without high dynamic range lighting; (f) chair with high dynamic range lighting.
They compute the viewpoint of each photograph and a sparse 3D model of the scene. The option of using unordered photos as input for a 3D environment makes it possible for everybody to generate such 3D environments: as long as the photos contain some overlapping areas, they can be used as input images, and no calibration hardware is needed. Consequently, we use the same structure-from-motion algorithm [8] to calculate camera parameters such as position, orientation, and focal length. These parameters can be used to align the photos in 3D space; Fig. 2 shows how unordered photos are processed and displayed in a viewer. Additionally, Bundler recognizes common points of interest in the photos and reconstructs these so-called key points in 3D space. Together, the key points form a sparse 3D point cloud of the scene. Using the 3D locations of the key points, the photo belonging to each camera is positioned in front of that camera, according to its position and orientation, at a calculated distance. In 3D space, the photo belonging to a camera is called its image plane.
Fig. 2. Unordered photos (a) and matched 3D Photo Collection in our viewer (b) and (c)
For better usability, the 3D Photo Collection viewer provides two navigation modes. In the first mode, the user navigates through the world by picking a photo of the 3D Photo Collection; the viewport then moves to the corresponding camera position. The second mode enables free navigation through the 3D Photo Collection via a common eyeball-in-hand navigation, so the user can move around the scene and watch the virtual objects from every possible view. Because these 3D Photo Collections are superimposed with virtual objects, we call this technique Image-Based Augmented Reality (IBAR) rather than image-based rendering.

Image-based lighting. Paul Debevec uses High Dynamic Range (HDR) images to realistically illuminate virtual objects [9]. From several photographs of a specular sphere under different illumination conditions, an HDR environment map is created. This environment map (called a light probe) represents the light information of the complete surrounding area. Mapped as a cube map, it can be used to render new objects into the scene under correct lighting conditions; every texel of the HDR environment map represents a light source of the real world. For interactive systems such as AR applications that require real-time frame rates, this approach is too computationally expensive. More time-efficient approaches estimate only a few light sources to illuminate virtual objects [10,11], or they pre-compute different maps and use them as lookup tables during rendering [12,13].
3 Approach of Rendering Virtual Objects Using Synthesized Environment Maps
The goal of our work is to embed virtual objects in 3D photo collections. For seamless integration of a virtual object, it is important to illuminate the object with the lighting conditions of the real scene. Therefore, we need to perform the following steps:
1. Extract the lighting information from the photos.
2. Describe the lighting information in a suitable data format.
3. Use this lighting data to render the illuminated object.
For indoor design applications (see Section 1), users will typically want to "try out" many different objects at different positions to get an impression of how these objects will fit into the real indoor environment. Therefore, the IBAR application should react to user interactions in real time to provide a fluent interaction flow. This means the run time of all three steps needs to be reduced as much as possible.
3.1 Architecture Overview
In this subsection we explain how the extraction of lighting information is achieved by constructing synthesized environment maps (SEMs). First, we have to reconstruct the environmental lighting; afterwards, we have to increase its dynamic range.
Fig. 3. Overview of light extraction
After construction of the SEM, we can use existing techniques for applying the lighting from the environment map to the object. Since we are aiming for real-time-capable rendering, we chose to extract directional light sources from the environment map using Debevec's median cut algorithm [10].

Reconstruction of environmental lighting. To apply realistic lighting to embedded objects, this lighting information must first be extracted from the photos. Each photo represents only a small part of the real world, so it provides only a small piece of the lighting information. For easy access, however, all available environmental lighting that reaches the embedded object must be stored and accessible in one place; for this task, we chose environment maps as a suitable format. In contrast to Debevec's light probes (see Section 2), we have no complete information about the environmental lighting. Instead, we have to assemble the environment map by collecting all light that is received from the photos (see Fig. 3). A detailed description of this task is given in Subsection 3.2.

Extend dynamic range of lighting information. The pixel color values of each photo cannot be used directly as lighting intensities, because they express only relative intensity factors (as 24-bit RGB triplets) that are subject to a camera-specific non-linear color mapping as well as a photo-specific light exposure. In order to accumulate light, these color values must be converted to radiance values, which serve as a common unit of light measure that is independent of camera settings. Depending on the quantity of photos, lighting information from multiple photos will be available for the same directions. If redundant photo coverage is available and these photos differ in their light exposure, they are used to extend the dynamic range of the light intensities (see Subsection 3.3). This results in a more precise reproduction of bright photo areas, which are the most important parts for lighting extraction.
3.2 Collect Light for a Position in Space
As specified in Section 3.1, lighting extraction is done by constructing an SEM. This environment map describes the environmental lighting around the center of an embedded object, P_R (see also Fig. 5). In the next steps we show how to extract light by mapping light rays to image pixels.
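The SEM is indexed purely by direction; in the realization (Section 4.1) it is stored as a cube map, so every texel corresponds to one direction seen from P_R. As a small illustration of that correspondence, the sketch below maps a direction to a cube-map face and texel coordinate. It is generic cube-map math following the common OpenGL face conventions, not code from the paper, and assumes a non-zero direction vector.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { double x, y, z; };

// Map a direction (seen from the object center P_R) to a cube-map face
// (0:+X 1:-X 2:+Y 3:-Y 4:+Z 5:-Z) and a texel (u, v) in [0, size).
void directionToTexel(Vec3 d, int size, int& face, int& u, int& v) {
    const double ax = std::fabs(d.x), ay = std::fabs(d.y), az = std::fabs(d.z);
    double ma, sc, tc;                          // major axis and the two in-face coordinates
    if (ax >= ay && ax >= az) { face = d.x > 0 ? 0 : 1; ma = ax; sc = d.x > 0 ? -d.z :  d.z; tc = -d.y; }
    else if (ay >= az)        { face = d.y > 0 ? 2 : 3; ma = ay; sc =  d.x;                 tc = d.y > 0 ? d.z : -d.z; }
    else                      { face = d.z > 0 ? 4 : 5; ma = az; sc = d.z > 0 ?  d.x : -d.x; tc = -d.y; }

    const double s = 0.5 * (sc / ma + 1.0);     // map [-1,1] to [0,1]
    const double t = 0.5 * (tc / ma + 1.0);
    u = std::min(size - 1, static_cast<int>(s * size));
    v = std::min(size - 1, static_cast<int>(t * size));
}
```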
Fig. 4. Mapping image pixels to light rays for each image
Collecting light rays from image pixels. For each camera in the photo collection, the view volume is known, which describes the projection of the photo into 3D space. In fact, each photo is a rectangular slice plane of light rays that were reflected by the real world into the lens of the associated camera (see Fig. 4). This means each pixel of a photo is a valuable piece of information on the light condition for a given position (the camera origin) and direction (the incident direction of the light ray). For each image pixel, a light ray with a certain direction is constructed. The ray direction is calculated by using the relative pixel position in the image plane as the two factors for bilinear interpolation between the edges of the camera view volume. Since most likely no light ray will intersect P_R exactly, a certain tolerance radius r_R around P_R is used: only light rays that come this close are considered to contain significant light information. Depending on the size of r_R, this results in a more or less blurry reconstruction of the environmental lighting.

Considering significant light rays only. The previous step relies on the fact that each camera within radius r_R is considered to have unblocked sight of the other cameras within this radius. If the view between two cameras is blocked, they will provide contradictory information; for example, a light ray from the first camera may reflect the blocking object while an equally directed ray from the other camera reflects the background. Although there may exist cameras outside of r_R with light rays lying as close as those from near cameras, they may not be used without further consideration: it must be checked that a light ray from a far camera even reaches the radius around the reference point, as there might be blocking objects. However, this check is computationally expensive, so we discard far cameras.
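A sketch of the per-pixel ray construction described above is given below: the ray direction is bilinearly interpolated from the four corner directions of the camera's view volume, and only rays passing within r_R of the object center P_R are kept. The Camera structure and corner layout are illustrative assumptions, not the authors' data structures.

```cpp
#include <cmath>

struct Vec3 {
    double x, y, z;
    Vec3 operator-(const Vec3& o) const { return {x - o.x, y - o.y, z - o.z}; }
    Vec3 operator+(const Vec3& o) const { return {x + o.x, y + o.y, z + o.z}; }
    Vec3 operator*(double s)      const { return {x * s, y * s, z * s}; }
    double dot(const Vec3& o)     const { return x*o.x + y*o.y + z*o.z; }
    double length()               const { return std::sqrt(dot(*this)); }
};

// Hypothetical camera: origin plus the directions through the four corners
// of its image plane (bottom-left, bottom-right, top-left, top-right).
struct Camera { Vec3 origin, dirBL, dirBR, dirTL, dirTR; };

// Direction of the ray through relative pixel position (s, t) in [0,1]^2.
Vec3 pixelRayDirection(const Camera& c, double s, double t) {
    Vec3 bottom = c.dirBL * (1.0 - s) + c.dirBR * s;   // bilinear interpolation
    Vec3 top    = c.dirTL * (1.0 - s) + c.dirTR * s;   // between the view-volume edges
    Vec3 d      = bottom * (1.0 - t) + top * t;
    return d * (1.0 / d.length());
}

// Keep only rays that pass within the tolerance radius r_R of P_R.
bool isSignificant(const Camera& c, const Vec3& dir, const Vec3& pR, double rR) {
    Vec3 toCenter = pR - c.origin;
    double along  = toCenter.dot(dir);                 // closest approach along the ray
    Vec3 closest  = c.origin + dir * along;
    return along > 0.0 && (pR - closest).length() <= rR;
}
```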
Extend Dynamic Range of Lighting Information
The result of the previous steps is a set of light rays that meet in the object center PR and have a known direction, however the intensity and color of light they are transporting is yet unknown. Therefore in the next steps the intensity of each light ray is derived from the color of its associated image pixel. As specified in the requirements (see section 3.1) the pixel color needs to be converted to a relative radiance value.
998
K. K¨ olzer et al.
Fig. 5. Top view showing a 3D photo collection featuring one embedded object and 7 cameras within the tolerance radius rR around the object center PR . The left upper area has for many directions three photos, while there is no lighting information for most of the directions in the bottom right corner.
Camera calibration. The non-linear mappings (section 3.1) from pixel values to exposures are described by the camera response curve, which needs to be reconstructed. Robertson et al. [14] presented an automatic camera calibration, which estimates the camera response curve. Their calibration technique requires an array of pixel color values for a combination of multiple scene feature and light exposures. While they retrieve these pixel color samples from a set of photos shot from the same position with different light exposure, we search for key points that were shot in multiple photos with different light exposure. That way we can retrieve the camera response curve in a similar manner. Utilizing redundant image coverage. By inverting the camera response curve, pixel values can be converted back to exposure values to provide light intensity information for the associated light ray. However, since the pixel values are discrete units and their dynamic range is very low (only 256 values), they will not be suitable to represent the complete dynamic range of the light that was present in the real world scene. To avoid these clipping errors we extend the dynamic range by looking out for equally directed light rays coming from photos that were shot with a different light exposure (see also Fig. 5). We combine several pixel values – each from a photo with different light exposure – to get an exposure value with extended range. Therefore, we use the algorithm described in [15], which weights the individual pixel values by their significance. The weighting gives very dark and very bright pixel a lower significance, which will reduce the effect of noise and clipping errors.
4 4.1
Realization Render Setup
We implemented the SEM construction as hardware shader to achieve a short construction time. In our render setup we chose cube maps as environment map format, which we use to accumulate light rays for each cube map face separately. Each cube map face serves as a render target texture with a camera being set
Fig. 6. Monkey model with back light extracted from photos behind: (a) front view, (b) top view
up at the object center facing in the +X, -X, +Y, -Y, +Z and -Z directions to collect light for all possible directions. Each texel of a cube map face represents a unique direction d, for which incoming light is accumulated. This accumulation is done implicitly by rendering all arranged photos as quads directly into the cube map face. For each d we use the following equation from Debevec's approach to accumulate the exposure values E_{d,j} coming from the photos j (from a total number of P photos available for direction d):

\ln \hat{E}_d = \frac{\sum_{j=1}^{P} w(Z_{d,j}) \cdot \ln E_{d,j}}{\sum_{j=1}^{P} w(Z_{d,j})}     (1)

As specified in section 3.3, each exposure value E_{d,j} is weighted by the significance of its original pixel value Z_{d,j}, using the weighting function w(Z), which is defined as

w(Z) = 1 - \left| \frac{Z}{127.5} - 1 \right|     (2)

We solve equation (1) for each texel by using three render passes. In the first pass, the sum in the denominator is evaluated for each texel and the results are stored in a temporary texture T_Den. Analogously, the second pass evaluates the numerator sum in a second temporary texture T_Num. In the last pass the quotient of T_Num and T_Den is calculated texel-wise and written to the cube map face.
Extracting light sources. To use the cube map for lighting, we render it into a single texture map using a cylinder projection (see fig. 2). On this single texture we perform the median cut algorithm presented in [10] to extract 16 directional lights. The embedded object is lit by these extracted light sources using a simple Lambert lighting model and local illumination.
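A CPU-side sketch of the accumulation in Equations (1) and (2), assuming the per-texel pixel values and log exposures have already been gathered into arrays (hypothetical names); it mirrors the three render passes with plain NumPy sums rather than shaders.

```python
import numpy as np

def hat_weight(z):
    """Weighting function (2): 1 at mid-gray, 0 at the extremes."""
    return 1.0 - np.abs(z / 127.5 - 1.0)

def accumulate_exposures(Z, lnE):
    """Combine P differently exposed samples per cube-map texel as in
    Eq. (1).  Z and lnE have shape (P, H, W): pixel values and their
    log exposures.  Mirrors the three render passes: denominator sum,
    numerator sum, then the texel-wise quotient."""
    w = hat_weight(Z.astype(np.float64))
    t_den = np.sum(w, axis=0)                    # pass 1
    t_num = np.sum(w * lnE, axis=0)              # pass 2
    ln_e_hat = t_num / np.maximum(t_den, 1e-8)   # pass 3 (guard against /0)
    return np.exp(ln_e_hat)
```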
4.2 Used Software Systems
The realization is based on MS XNA 3.1, the AForge.NET Framework [16] and Bundler [8]. XNA is used as the 3D application framework; it combines easy handling with prefabricated classes and libraries for typical 3D operations like model loading, 3D mathematics and shader support. For image processing routines the AForge.NET Framework is used. To display a 3D photo collection in our viewer,
the positions and alignments of the photos are precomputed by the structure-from-motion software Bundler. The results of Bundler are used to create the image-based 3D world.
5 Results
As can be seen in fig. 1, the object is illuminated by light reconstructed from unordered photos. This strongly increases the effect of the object “fitting” into the scene. In fig. 6 the object is accurately illuminated by light originating from the back, which gives the impression of proper back lighting by the background.
6 Summary and Future Work
In this paper, we presented a technique for the photo-realistic rendering of virtual objects by automatically extracting lighting from photos and constructing a synthesized environment map (SEM). We introduced our image-based Augmented Reality viewer, in which virtual objects are embedded and illuminated by this technique, and described its architecture and HDR lighting system. In our future work, the fully automated embedding of virtual objects in our 3D photo collection viewer will be implemented. Furthermore, our goal is to increase the photo-realistic impression of the superimposed 3D photo collections. Thus, it is planned to develop mechanisms for adapting to differences in image quality between the photos. Quality differences between the photos and the embedded virtual objects also have to be compensated; for example, automatically estimated image noise will be rendered onto the objects to achieve a more realistic looking result. In the case of foreground objects, like a pillar in a room, a useful occlusion handling for the embedded virtual objects is needed. One approach could be to create a depth map of the scene and extract the foreground objects from the photos. With the help of the depth information, the foreground objects can be translated to their correct depth positions, and embedded virtual objects will be occluded where necessary. In addition, it is planned to enhance the display of 3D photo collections. The photos shall be merged seamlessly, without visible borders. For this, it is necessary to develop suitable adaptation algorithms for displaying neighboring photos.
Acknowledgements This work was funded by the BMBF (Bundesministerium für Bildung und Forschung, project no. 17N0909). Furthermore, special thanks to Marc Alexa (Berlin University of Technology), Stefan Müller (Koblenz University) and Ekkehard Beier (EasternGraphics GmbH) for their technical and scientific support of this project.
References 1. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3d. In: SIGGRAPH Conference Proceedings, pp. 835–846. ACM Press, New York (2006) 2. Azuma, R.T.: A survey of augmented reality. In: Presence: Teleoperators and Virtual Environments, vol. 6 (1997) 3. Curtis, D., Mizell, D., Gruenbaum, P., Janin, A.: Several devils in the details: making an ar application work in the airplane factory. In: IWAR 1998: Proceedings of the international workshop on Augmented reality: placing artificial objects in real scenes, pp. 47–60 (1999) 4. Doil, F., Schreiber, W., Alt, T., Patron, C.: Augmented reality for manufacturing planning. In: EGVE, Proceedings of the workshop on Virtual environments 2003 (2003) 5. Tok, Stok: Planta virtual (2009), http://www.plantavirtual.com.br/website/ 6. Metaio Augmented Solutions: Click and design (2009), http://www.metaio.com/demo/demo/afc-moebelplaner/ 7. Grosch, T.: Augmentierte Bildsynthese. PhD thesis, Universit¨ at Koblenz-Landau (2007) 8. Snavely, N.: Bundler: Structure from motion for unordered image collections (2008), http://phototour.cs.washington.edu/bundler/ 9. Debevec, P.: Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. In: SIGGRAPH 1998 (1998) 10. Debevec, P.: A median cut algorithm for light probe sampling. In: High Dynamic Range Imaging: Acquisition, Display, and Image-Based Lighting (2005) 11. Dachuri, N., Kim, S.M., Lee, K.H.: Estimation of few light sources from environment maps for fast realistic rendering. In: ICAT 2005: Proceedings of the 2005 international conference on Augmented tele-existence (2005) 12. Unger, J., Wrenninge, M., Ollila, M.: Real-time image based lighting in software using hdr panoramas. In: GRAPHITE 2003: Proceedings of the 1st international conference on Computer graphics and interactive techniques in Australasia and South East Asia, p. 263. ACM, New York (2003) 13. Neulander, I.: Image-based diffuse lighting using visibility maps. In: SIGGRAPH 2003: ACM SIGGRAPH 2003 Sketches & Applications, p. 1 (2003) 14. Robertson, M., Borman, S., Stevenson, R.: Dynamic range improvement through multiple exposures. In: Proceedings of 1999 International Conference on Image Processing, ICIP 1999, vol. 3 (1999) 15. Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: SIGGRAPH 1997 (1997) 16. Kirilov, A.: Aforge.net framework 2.0.0 beta (2009), http://code.google.com/p/aforge/
Effective Adaptation to Experience of Different-Sized Hand
Kenji Terabayashi (1), Natsuki Miyata (2), Jun Ota (3), and Kazunori Umeda (1)
(1) Department of Precision Mechanics, Chuo University / CREST, JST, Tokyo, Japan, {terabayashi,umeda}@mech.chuo-u.ac.jp
(2) Digital Human Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan, [email protected]
(3) Department of Precision Engineering, The University of Tokyo, Tokyo, Japan, [email protected]
Abstract. This paper reports the effect of a pre-operation intended to familiarize oneself with the vicarious experience of different-sized hands. To measure this effect, an index called the degree of immersion (DOI) is proposed, which represents whether the observed behavior is appropriate for the presented hand size. The DOI is measured for various hand sizes while changing the type of pre-operation, which is classified based on the relationship between hands and objects. The experimental results show that the pre-operation is effective for familiarization with the presented hand size, especially for larger hands, and that the behavior of touching an object and controlling its position is important for an effective pre-operation.
1 Introduction
There are many hand-held products such as mobile phones, remote controllers, and digital cameras. The usability of these hand-held products is affected by the shapes of the hands that operate them. It is difficult for designers to imagine the users' evaluation of these products in an early assessment for two reasons: (1) the difference in hand shape between a designer and the users, and (2) the intangibility caused by the nonexistence of actual products. For experiencing a part of the functions of products, a number of studies have examined virtual prototyping and augmented prototyping. In these studies, the operating characteristics of a particular user interface [1][2] and the external appearance of products can be controlled [3]. However, few studies have addressed controlling the physical characteristics in the experience. Computational models that assess the usability of products employing various hand shapes have been studied [4][5]. Using these models, designers can get the users' evaluation quantitatively in an early assessment but cannot experience it intuitively. In addition, as with "Through Other Eyes [6]", which is a system for experiencing physical characteristics caused by aging and disabilities, enabling designers to experience product manipulation with hands of various shapes can
help designers to explore the usability for users having various hand shapes. In particular, experiencing the behavior that depends on the size of other people's hands is effective in such help, because behavior usually changes in proportion to hand size. Thus, the authors have proposed a system that provides the experience of hands of various sizes, by presenting hands of various sizes in an environment that appears not to change in size [7][8]. This is achieved in the proposed system by adjusting the optical zoom and the actual object size (Fig. 1). The system provides two functions which are useful as design support. The first is a subjective aspect: the hand size is recognized as different from one's own according to the presented hand size. This gives designers the opportunity to become aware of the difference in hand size between themselves and others. The second is a behavioral aspect: the grasping behavior is appropriate for the presented hand size. This is effective for designers to find behavioral patterns of others. For changing the size of the experienced hand, a pre-operation is performed to familiarize oneself with the presented hand size (Fig. 2). However, there is no knowledge
Fig. 1. Relationship among physical size, optical scaling, and the presented hand [8] (columns: experience of small and large hand; rows: subject's view, real size, optical scaling up/none/down)
Fig. 2. Pre-operation when changing the size of the experienced hand
about what kind of pre-operation is adequate for which presented hand sizes. This matters for design aid because the pre-operation should be as simple and easy as possible. Therefore, the purpose of this paper is to investigate the effect of the pre-operation on the familiarization with a different-sized hand. The effect is examined from the viewpoint of what kind of pre-operation is suitable for design aid. In Section 2, the method for providing the experience of hands of various sizes is described. In Section 3, pre-operations are classified according to the relationship between hands and objects. In Section 4, the index for measuring the familiarization with this experience is defined. In Section 5, an effective pre-operation is explored by comparing several kinds of pre-operations. In Section 6, this study is concluded.
2 Method for Presenting the Experience of Hands of Various Sizes
2.1 Method
We have developed a method for providing the experience of hands of various sizes [8]. This experience involves the sensation of the hand size being different from one's actual hand size, which occurs when hands of various sizes are presented in an environment that appears to be of constant size. A controlled view of hands and objects is presented to subjects through an optical system with the appropriate scaling. Fig. 1 shows the method employed to realize the desired view using an optical system and analogous objects. The top row shows the desired hand size to be presented. The middle row presents the actual sizes of the hands and the objects, and the bottom row shows the method used to scale the optical system. The scaling rate of the optical system is adjusted to present hands of the desired size. The sizes of the objects used in this condition are varied inversely with the magnification of the optical system. In this way, only the hand size, and not the object size, is changed visually. For instance, the right-hand column shows the method used to present smaller hands. The optical system is scaled down, and analogously large objects are used. In contrast, for presenting larger hands, the optical system is scaled up, and analogously small objects are used, as shown in the left column.
2.2 Implementation
According to studies suggesting the possibility of experiencing hands of various sizes [9][10][11], the consistency of senses, especially vision and somatic sensation, is essential for the experience. In the proposed system, employing an optical system and analogous objects achieves the required consistency. The system consists of analogous objects, a camera with a zoom lens, a computer, a display, and two mirrors (Fig. 3). Images of the right hand and an object
captured by the camera are presented to a subject by rendering the images on the display. The camera is a Sony DFW-VL500 (VGA, 30[fps]). The minimal visual delay of the system is 38 [ms], which is due to hardware constraints.
Fig. 3. Configuration of the system for providing the experience of hands of various sizes [8] (the analogous object and hand are captured by the camera with zoom lens via a mirror; the computer renders the images on the display, which the subject views through a second mirror)
2.3 Pre-operation
Before the experience of various sized hands, a pre-operation is performed to familiarize oneself with the presented hand size. In the previous work [8], the pre-operation is a button-pushing task. It involves pushing buttons on an object in hand, corresponding to a four-digit number, five times. Fig. 4 (a) shows the analogous objects used in this task. In the developed system, the computer displays the numbers whose buttons are to be pushed, and controls the visual delay and the optical scaling of the camera. Fig. 4 (b) is an example of the subject's view.
3 Classification of Pre-operations
In terms of relationship between a hand and an object, pre-operations are classified into four categories.
– Category 1: Only hand motion without an object
– Category 2: Touching an object with controlling hand position (e.g. switch a light on by pushing the button)
– Category 3: Controlling position of an object in hand (e.g. hold and swing a tennis racket)
– Category 4: Finger movement to operate an object with controlling its position in hand (e.g. press buttons on a remote controller)
Fig. 4. Pre-operation of the button-pushing task: (a) objects used for the button-pushing task, (b) example of the subject's view. The sizes of the objects in (a) are 83×50×17 mm (left), 100×60×20 mm (middle), and 150×90×30 mm (right).
The first category involves only hand motion, while the remaining categories are defined in relation to an object. Across these categories, the amount of sensory information provided by the pre-operation increases with the category number. This indicates that a pre-operation in a lower-numbered category can be simpler and easier. The pre-operation employed in the previous work belongs to Category 4. It is already known that this pre-operation is sufficiently effective for the experience of various sized hands in both the subjective and the behavioral aspects. However, this does not show that this pre-operation is indispensable for the experience for every presented hand size. From the viewpoint of design aid, the pre-operation should be as simple and easy as possible, because it is an auxiliary operation relative to the product assessment through the vicarious experience. In this paper, the effect of the pre-operation on the experience is investigated based on these categories.
4 Measurement of Familiarization with the Presented Hand
In this section, we define an index for measuring the familiarization with the presented hand. To judge whether the experience of a different-sized hand is successful, a task of grasping equilateral triangular prisms is employed. According to [12], the grasp strategy is determined by the relationship between the hand size and the size of the equilateral triangular prism, and the strategies are classified into four grasp patterns. A grasp strategy appropriate to the presented hand size can be obtained using our system. Based on the previous work [8], the probability of the observed grasp patterns is already known for the case of familiarizing oneself with the presented hand
size sufficiently while briefly experiencing a different-sized hand. Using this information, the probability of experiencing the presented hand size can be calculated. In fact, when the equilateral triangular prism is v_j and the observed grasp pattern is g_k, the probability that the experienced hand size is h_i can be calculated by the following equation:

p(h_i \mid v_j, g_k) = \frac{p(g_k \mid v_j, h_i)\, p(v_j, h_i)}{\sum_l p(g_k \mid v_j, h_l)\, p(v_j, h_l)}     (1)

where p(g_k | v_j, h_i) is already known, as discussed previously, and p(v_j, h_l) is determined by the experimental setup. The present paper labels Eq. (1) as the "Degree Of Immersion (DOI)" and investigates the effect of pre-operation based on this index.
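A minimal sketch of how the DOI of Eq. (1) could be evaluated, assuming the conditional grasp-pattern probabilities and the joint probabilities are available as arrays (the names and array layout are hypothetical):

```python
import numpy as np

def degree_of_immersion(p_g_given_vh, p_vh, i, j, k):
    """Eq. (1): posterior probability that the experienced hand size is
    h_i, given prism v_j and observed grasp pattern g_k.
    p_g_given_vh[k, j, l] = p(g_k | v_j, h_l)  (known from previous work)
    p_vh[j, l]            = p(v_j, h_l)        (from the experimental setup)"""
    num = p_g_given_vh[k, j, i] * p_vh[j, i]
    den = np.sum(p_g_given_vh[k, j, :] * p_vh[j, :])
    return num / den
```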
5 Experiment
In this section, we investigate which types of pre-operations are sufficiently effective for the experience of a different-sized hand. This investigation focuses on the experience when increasing the hand size, because the previous work [13] shows that a pre-operation is required in this case, while it is not needed when the hand size is changed to a smaller one. Several pre-operations based on the classification described in Section 3 are compared, using the DOI index, with the pre-operation known to be sufficiently effective for familiarizing oneself with the experience of different-sized hands.
5.1 Condition and Procedure
DOIs corresponding to the following five pre-operations are measured when the hand size is changed to a larger one. The pre-operations employed in this experiment were chosen according to the classification given in Section 3. Their details are as stated below and shown in Fig. 5.
– C-1: Clasp and unclasp one's hands three times (Fig. 5 (a))
– C-2: Touch the top of a cylinder without grasping (Fig. 5 (b))
– C-3: Grasp and rotate a cylinder (Fig. 5 (c))
– C-4-1: Push buttons on an object in hand, corresponding to a four-digit number, one time (Fig. 5 (d))
– C-4-2: Push buttons on an object in hand, corresponding to a four-digit number, five times (Fig. 5 (d))
The scaling rates of the presented hand sizes are 0.67, 1.00, and 1.20, which are labeled Small, Normal, and Large, respectively. These rates were determined from data on the hand lengths of the actual subjects [14][15]. The transitional conditions of the presented hand size were determined by pairs of these scaling rates:
Fig. 5. Pre-operations: (a) C-1, (b) C-2, (c) C-3, (d) C-4
– from Normal to Large
– from Small to Normal
– from Small to Large
The number of subjects is five. In each experimental condition, the subjects were asked to perform the following tasks:
– Perform the button-pushing task described in Section 2 with the pre-transition hand size.
– Perform this task with the post-transition hand size if the experimental condition includes a pre-operation.
– Grasp equilateral triangular prisms. This grasp was recorded using a video camera to calculate the DOI for this experimental condition.
5.2 Result
Fig. 6 shows the experimental results for the mean DOI under the different pre-operation conditions, with multiple comparisons. In this figure, significant differences are confirmed between C-1 and C-4-2, and between C-2 and C-4-2, respectively. Therefore, controlling the position of an object while grasping it is essential for familiarizing oneself with the experience of different-sized hands.
Fig. 6. Comparison of the mean degree of immersion for the different types of pre-operations (*: p < 0.05)
6 Conclusion
The effect of the pre-operation on the experience of a different-sized hand was investigated. To measure this effect, an index of the degree of immersion in this experience was proposed. The experimental result shows that the behavior of touching an object and controlling its position is important for an effective pre-operation.
Acknowledgment This work was supported by KAKENHI (21700147).
References 1. Tideman, M., van der Voort, M.C., van Houten, F.J.A.M.: Design and evaluation of virtual gearshift application. In: Proceedings of IEEE Intelligent Vehicles Symosium, pp. 465–470 (2004) 2. Bordegoni, M., Colomboa, G., Formentinia, L.: Haptic technologies for the conceptual and validation phases of product design. Computers & Graphics 30, 377–390 (2006) 3. Verlinden, J.C., de Smit, A., Peeters, A.W.J., van Gelderen, M.H.: Development of a flexible augmented prototyping system. Journal of WSCG 11, 496–503 (2003) 4. Kouchi, M., Miyata, N., Mochimaru, M.: An analysis of hand measurements for obtaining representative japanese hand models. In: Proceedings of SAE 2005 Digital Human Modeling for Design and Engineering Conference 2005–01–2734 (2005)
5. Miyata, N., Kouchi, M., Kurihara, T., Mochimaru, M.: Modeling of human hand link structure from optical motion capture data. In: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2129–2135 (2004) 6. Through Other Eyes, http://www.capacitybuilders.ca/training/tow/tow-overview.htm 7. Terabayashi, K., Miyata, N., Kouchi, M., Mochimaru, M., Ota, J.: Experience of variously sized hands: Visual delay effect. In: Proceedings of Human Computer Interaction International 2007 Posters, pp. 1009–1013 (2007) 8. Terabayashi, K., Miyata, N., Ota, J.: Grasp strategy when experiencing hands of various sizes. eMinds: International Journal on Human-Computer Interaction I, 55–74 (2008) 9. Botvinick, M., Cohen, J.: Rubber hands ’feel’ touch that eyes see. Nature 391, 756 (1998) 10. Armel, K.C., Ramachandran, V.S.: Projecting sensations to external objects: evidence from skin conductance response. In: Proceedings of the Royal Society of London-B, vol. 270, pp. 1499–1506 (2003) 11. Iriki, A., Tanaka, M., Iwamura, Y.: Coding of modified body schema during tool use by macaque postcentral neurones. Neuroreport 7, 2325–2330 (1996) 12. Shirai, T., Kaneko, M., Harada, K., Tsuji, T.: Scale-dependent grasps. In: Proceedings of the 3rd International Conference on Advanced Mechatronics (ICAM 1998), pp. 197–202 (1998) 13. Terabayashi, K., Miyata, N., Ota, J., Umeda, K.: Asymmetric familiarization with experience of different sized hand. In: Proceedings of Asia International Symposium on Mechatronics, pp. 414–418 (2008) 14. ADULTDATA The Handbook of Adult Anthropometric and Strength Measurements - Data for Design Safety. Government Consumer Safety Research, Department of Trade and Industry (1998) 15. Japanese body size data 1992-1994. Research Institute of Human Engineering for Quality Life, HQL (1997)
Image Processing Methods Applied in Mapping of Lubrication Parameters
Radek Poliščuk
Institute of Automation and Computer Sciences, Brno University of Technology, Technická 2896/2, Brno, The Czech Republic
Abstract. Simple image processing approaches, like advanced palette fitting and convolution filters, can effectively replace traditional numerical methods in the study of ball bearings. Two innovative experimental methods for the mapping of tribological parameters in elastohydrodynamically lubricated contacts were developed, based on the combination of an optical tribometer (used for a view into a simulated rolling bearing) and software processing of chromatic interferograms. The lubricant shape reconstruction uses optimized colorimetric analysis (Thin Film Colorimetric Interferometry, TFCI), while the fast contact pressure mapping method is based on the Inverse Elasticity Theory (IET) and convolution.
1 Introduction
A multidisciplinary approach can bring surprising solutions to today's complicated technical problems, if knowledge of methods used in a completely different discipline can break the limits of existing techniques. One such area was the experimental study of elastohydrodynamically lubricated (EHD, EHL) contacts. In mechanical elements like gears, cams and rolling bearings, the lubricant performance in the small contact area between adjacent surfaces often affects the reliability of the whole machine [1]. The term "elastohydrodynamic" means that the minimum thickness of the lubricant layer, which carries the whole contact load, is less than or comparable to the elastic deformation of the surfaces (typically less than 1 micrometer [2]); the discipline that explains the behavior of lubricated contacts is called tribology [3]. Many approaches for the prediction and design of lubricated parts have already been developed, typically based on numerical simulation models derived from the Reynolds equation [4]. The increase of contact load and surface speeds, so often demanded in modern applications, implies thinner lubricant films and harder requirements on lubricant performance. Unfortunately, the Newtonian behavior of a liquid lubricant under such challenging contact conditions changes dramatically, and the existing models require parameters based on experimental measurement [5]. Some experimental procedures, like the measurement of the viscosity-pressure coefficient α, can give results from a single point measurement [6], but many other experimental studies, like wear and contact fatigue determination, require a detailed view into the distribution of the lubricant layer and pressure in the rolling contact. Such mapping would be hardly accessible in a real bearing. Therefore, tribological simulators (tribometers) are used, typically based on one or more tracking or rolling mechanical elements. Equipped with various radiation,
capacitive or optical sensors, they allow measurement of the lubricant layer thickness under specific physical conditions. The most popular construction for the simulation of elastohydrodynamically lubricated (EHL) contacts is based on a metal ball rolling on a semi-reflective disc, creating visible Newton fringes on the lubricant film inside. Software interpretation of the interferograms, whether based on spectrophotometry [7] or trichromatic image processing [8,9,10], then provides mapping of the lubricant layer thickness and contact surface deformation, and the consecutive evaluation of contact pressure and other tribological parameters.
Fig. 1. Ball-on-Disc tribometer and typical interferogram of ball on disc in white light
The Fizeau interferometer based tribometer used in our studies (fig. 1) was described elsewhere [11], so only the main parameters are mentioned here: the mechanical part consists of a highly polished steel ball, rolling on a glass disc coated with a CrO2 semi-reflective layer. Using monochromatic or white light illumination, an episcopic microscope and imaging equipment, a view into the simulated bearing is possible. The load, ball and disc revolutions and lubricant temperature are programmable, in order to reach specific slide-to-roll ratios and operating conditions. The crown glass disc diameter is 150 mm and the ball diameter is 25.4 mm. With a typical load of 29 N, the corresponding Hertz pressure peak is 0.51 GPa. The lubricant temperature is stabilized as required, from 20 to 75 degrees Celsius. A Nikon Optiphot episcopic microscope with a D65 xenon illuminator, beam splitter and a 3×10 bit Hitachi HV-F22CL CCD camera mounted on the trinocular was used. Using a medium-configuration Camera Link interface, RGB images (fig. 2) of 1360×1024 pixels were transferred to the PC for real-time (15 fps) profile evaluation and storage.
2 Thickness Mapping Using TFCI
When looking at Newton rings on a thin oil spill, the visible rainbow of colour fringes can be described as a superposition effect of light waves of specific wavelengths, reflected from the top and bottom interfaces of the same optical thickness layer. On simple surfaces illuminated by monochromatic light, the interference extremes represent integer multiples of half-wavelengths (at a device dependent phase offset, fig. 2), but any layer thickness estimation becomes ambiguous on uneven or scratched surfaces.
Fig. 2. Typical interferogram extremes in unloaded ball-disc contact in monochromatic light
In white light, the fringe interference colours represent a wide-band superposition perceived by the human eye, camera film or digital sensor. Because the sensors generally respond to more than a single wavelength, simplified interpretations based on isolated R, G and B channel wavelengths would be physically problematic. Therefore, a different approach for the evaluation of chromatic interferograms was required. The first practical solutions used by Guangteng and Spikes [7] were based on spectral analysis at a selected interferogram point (and later a line), where the peak wavelengths implied the optical thickness within an expected interference order. Another approach, presented by Marklund [8] and later by Molimard [9], was the transformation of the R-G-B image channels into HSV color coordinates, where the "Hue" channel indicated the optical thickness. Both mentioned approaches were complicated by the requirement of an additional phase unwrapping technique (a potential problem at rough surfaces) and by the nonlinear relation between the synthetic "Hue" value and the real optical thickness. In order to avoid expensive spectrophotometry and unwrapping issues during real EHL film thickness mapping, an analysis of typical RGB values was carried out [10]. It revealed that colour-to-thickness identification is unambiguous for ca 2/3 of the pixels, without a phase unwrapping requirement, in the thickness range up to 1000 nanometers. However, the idea of a simple "EHL thickness vs camera RGB" direct reference table proved to be confusing in the real world, because of uneven illumination and imaging noise effects in the trichromatic dimensions. Therefore, a colorimetric approach based on the psychometric colour space CIELAB [12] was tested for the interpretation of the radiometric RGB data. The transformation of NTSC RGB values to uniform L*a*b* coordinates
is relatively simple, as is the apparatus of ΔE color difference formulas [12]. The non-linear CIELAB space also amplifies low saturation differences around the L* ("lightness") axis, naturally enhancing the colour resolution in case of the limited interference contrast typical for lubricant layers in the tribometer. With this knowledge, a comprehensive method for EHL thickness mapping (Thin Film Colorimetric Interferometry, TFCI), based on monochromatic vs white light calibration and standard-sample colorimetric identification, was formulated [10]. The "standard" is here represented by an indexed palette of colours in the CIELAB colour space, typical for each thickness value, device and lubricant (fig. 3):
Fig. 3. Thickness vs L*a*b* colour model (L*, a* and b* values plotted over optical thickness, 0 to 1000 nm)
The identification means a lowest-ΔE index search for every sample interferogram pixel on an Ångström scale, analogous to common palette matching algorithms (fig. 4). Thanks to the CIELAB sensitivity, the typical reliability rises to ca 95% of the contact area, while the remaining false matches are corrected by a radial continuity check.
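A simplified sketch of this palette matching step. It uses scikit-image's rgb2lab as a stand-in for the paper's NTSC-RGB-to-CIELAB transformation and a plain CIE76 ΔE (Euclidean distance in L*a*b*); the palette arrays are hypothetical.

```python
import numpy as np
from skimage.color import rgb2lab   # stand-in RGB -> CIELAB transform

def match_thickness(rgb_image, palette_lab, palette_thickness_nm):
    """Assign to every pixel the thickness whose calibrated palette
    colour has the smallest Euclidean (CIE76 Delta E) distance in
    CIELAB.  palette_lab: (N, 3), palette_thickness_nm: (N,),
    rgb_image: float RGB in [0, 1], shape (H, W, 3)."""
    lab = rgb2lab(rgb_image).reshape(-1, 3)                 # (H*W, 3)
    # pairwise Delta E to all palette entries; for large images this
    # would be done in chunks to limit memory use
    d2 = ((lab[:, None, :] - palette_lab[None, :, :]) ** 2).sum(-1)
    best = np.argmin(d2, axis=1)
    return palette_thickness_nm[best].reshape(rgb_image.shape[:2])
```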
Fig. 4. TFCI identification in CIELAB and the typical result - EHL deformed ball surface
The accuracy of the TFCI measurement depends only on a correct colour model. This model is the result of a two-step calibration process, based on the synthesis of a simple object geometry (interpolated monochromatic interference peaks, see fig. 2) and radial colour profiles from an interferogram of the same object taken in white light (fig. 1).
The estimated thickness resolution of 3×10 bit interferograms is generally better than 1 nanometer, within an operating range of 50 to 1000 nanometers. Range-extending tribometer modifications like the Spacer Layer Technique [14] or dichromatic light interferometry were also tested, with acceptable conformity to fully numerical models.
3 Pressure Mapping Using IET Convolution
Knowledge of the pressure distribution in a rolling contact is generally important for the prediction of wear and contact fatigue in rolling bearings. Theoretical models of pressure and film thickness are usually based on the Reynolds equation and the theory of elastic deformation of semi-infinite bodies [4]. In a lubricated contact, the Reynolds equation describes the lubricant flow and pressure between two elastic bodies. If D and P are finite n×n matrices, representing the deformation and pressure in the contact area with mesh resolution d, then we can declare:
D = T \cdot P     (1)

where the transformation matrix T is constructed using a compliance matrix K, defining the members of the transformation matrix T at coordinates a, b as

T_{a,b} = K_{|d_a - d_b|,\,|r_a - r_b|}     (2)

where d_b and d_a are the results of integer division of the a and b indexes by n, while r_a and r_b are their integer remainders. The compliance matrix itself is then defined by the elastic half-space solution for a uniformly loaded rectangular element of mesh size d (3), with its geometric substitutions (4), and with

\frac{2}{E_R} = \frac{1-\mu_1^2}{E_1} + \frac{1-\mu_2^2}{E_2}     (5)

representing the reduced Young's modulus, calculated from the material properties (μ, E) of the two contact bodies [4, 16]. The calculation of the pressure map from the deformation can then be based on the inverse of the transformation matrix T [16]:

P = T^{-1} \cdot D     (6)

where the deformation map

D = H - h_0 - U     (7)

could be obtained from the experimental thickness map H, the mutual approximation of the contact bodies h_0 (found by iteration, using the force balance condition) and the map of the undeformed ball geometry U (example in fig. 5). In comparison, e.g., to FEM, the pressure mapping by the inverse elasticity algorithm (Eqs. 1-7) is simple and relatively straightforward, but the inverse of the transformation matrix T may present a computing capacity problem. For example, even with a deformation domain of n = 128, T consists of 128^4 floating point numbers (2 GiB).
Fig. 5. Typical relation between pressure and deformation in a ball bearing (thickness and deformation in nm, pressure in MPa, plotted over the contact radius in pixels)
Fortunately, in the case of constant E_R and d parameters (which is usually true for a series of data from a single experiment), the T^{-1} element of Eq. (6) remains constant and requires recalculation only if the material parameters or the imaging resolution change. In an effort to find a further simplification of the T^{-1} matrix calculation, various large T^{-1} matrices (n from 32 to 96) were analyzed in depth, element by element [17]. In every row of T^{-1}, specific narrow peaks were found, always located at the columns relevant to the coordinates of the evaluated matrix element, with the line-wrapped profile shown in fig. 6.
Fig. 6. Typical convolution peak relevant to the central pixel of the deformation matrix (central profile of the convolution matrix and magnified values over the same domain; index of coefficient, domain 64×64 pixels)
This indicated the possibility of replacing the matrix multiplication in Eq. (6) by a simple convolution filter, with a floating window C (fig. 6) applied over the deformation data D:

P = C \ast D     (8)
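A hedged sketch of Eq. (8) with SciPy; the convolution window is assumed to be the precalculated 2-D "hat" taken from a central row of T^{-1}, and the edge padding is only a crude stand-in for the extrapolation discussed below.

```python
import numpy as np
from scipy.signal import fftconvolve

def pressure_from_deformation(deformation, conv_window, pad=None):
    """Eq. (8): approximate the pressure map as a 2-D convolution of the
    deformation map with the precalculated window C.  The deformation map
    is edge-padded first to reduce the border error near the domain edges."""
    if pad is None:
        pad = conv_window.shape[0] // 2
    d = np.pad(deformation, pad, mode="edge")      # crude border extrapolation
    p = fftconvolve(d, conv_window, mode="same")
    return p[pad:-pad, pad:-pad] if pad else p
```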
The last optimization was deduced from equations (3) and (4), where the E_R and d parameters can be completely separated and used for a linear transformation of any known convolution window to new conditions (Eqs. (9) and (10)).
By comparison with a fully numerical solution of the same reference data, it was found that the convolution approach acceptably fits the theoretical profile at the domain center, but with a growing difference towards the edges of the target matrix (fig. 7).
Fig. 7. Comparison of the profile obtained by convolution and the theoretical pressure profile (pressure in MPa over the contact radius in pixels)
Fig. 8. Right part of the central T^{-1} matrix profiles, showing typical trends and negative preferences on the edges of various sized deformation domains (C matrix values for domain sizes 96×96, 64×64, 48×48 and 32×32, over the index in the right half profile)
This behavior is typical for convolution with a large window, because of the discontinuity at the deformation matrix borders. It could be eliminated by an extrapolated deformation matrix and a larger convolution window C [17]. But as the calculation of large convolution windows (n > 128) was explained to be problematic, existing T^{-1} matrices of smaller domain sizes were compared, in order to find an at least partially analytical solution for their extrapolation. The comparison (fig. 8) showed that a radial extrapolation of the precalculated 96×96 convolution "hat" by a fitted cubic-hyperbolic function is possible. This approach was successfully tested on reference domains of 512×512 deformation values (fig. 9).
!" #"$"%#
"'$(
Fig. 9. High resolution experimental TFCI deformation data of smooth EHD contact (ER=124 GPa), processed using extrapolated convolution window with 400×400 elements
Fig. 10. Detail of pressure map (MPa) over area with dent
Figure 9 also shows the importance of smooth deformation data, evaluated from the experimental EHD film thickness. Even with relatively clean experimental data, the numerical method (6) strongly exaggerates any irregularities, including the first and second derivatives of the shape geometry. For smooth contacts with just imaging noise, a proper initial low-pass filter can be applied to make the deformation data clean. For real-world data taken during studies of rough or dented surfaces, the real shape of the undeformed surface U has to be used in Eq. (7), rather than a simplified spherical surface. The image convolution approach was already successfully applied during a study of textured contacts [18] (fig. 10).
4 Conclusion
The application of common image processing and enhancement functions can provide significant benefits even in an area of research as distant as tribology. The colour matching and palette fitting methods were the source for the formulation of a fast thickness mapping method processing white light EHL contact interferograms. TFCI offers a working range of 50 to 1000 nm and better-than-nanometer accuracy. Together with its high reliability and real-time processing capabilities, it was successfully used in numerous tribological studies and industrial lubricant testing. The second tribological technique, intended for pressure mapping of EHL contacts, was simplified step by step down to the definition of a universal convolution window filter. Based on the "Inverse Elasticity Theory" for EHL rolling bearings, it features a reverse approach to the usual "response on load" FEM solvers. The IET enables the reconstruction of the pressure map inside the rolling bearing from a single TFCI-processed interferogram of the EHL contact. Thanks to the complete exclusion of matrix solvers, the convolution approach brings a dramatic increase of speed (seconds or minutes vs hours or days) in comparison to fully numerical pressure map solutions, with the same accuracy. However, sensitivity to imaging noise and scratches limits the practical IET resolution to ca 500×500 deformation samples in the contact area. Both introduced techniques helped to significantly improve processing times in high resolution interferometry used in the tribology of EHL contacts. For practical purposes, a specialized Win32 software platform, AChILES (Automated Chromatic Interferogram Evaluation System), was developed and successfully implemented in various tribological laboratories. It covers all TFCI requirements, including interactive device control, imaging, calibration, EHL thickness mapping and visualization. Features like pressure mapping were added recently, using software plugins. The research of elastohydrodynamic lubrication now concentrates on progress in micro-textured surfaces, where the TFCI and IET convolution methods also take their place.
Acknowledgements Many thanks to my colleagues, professors Martin Hartl and Ivan Křupka and doctors Vaverka and Vrbka for their collaboration and support. Parts of the research were also supported by The Grant Agency of the Czech Republic (grant № 101/06/P225).
References 1. Archard, J.F., Kirk, M.T.: Lubrication at Point Contacts. Proceedings of the Royal Society of London A261, 535–550 (1961) 2. Křupka, I., Hartl, M., Čermák, J., Liška, M.: Elastohydrodynamic Lubricant Film Shape Comparison between Experimental and Theoretical Results. In: Tribology for Energy Conservation (Proceedings of the 24th Leeds-Lyon Symposium on Tribology), pp. 221–232. Elsevier Science B. V., Amsterdam (1998) 3. Kirk, M.T.: Hydrodynamic Lubrication of ’Perspex’. Nature 194, 965–966 (1962) 4. Hamrock, B.J.: Fundamentals of Fluid Film Lubrication. McGraw-Hill, Inc., New York 5. Křupka, I., Hartl, M., Čermák, J., Liška, M.: Elastohydrodynamic Lubricant Film Shape Comparison between Experimental and Theoretical Results. In: Tribology for Energy Conservation (Proceedings of the 24th Leeds-Lyon Symposium on Tribology), pp. 221–232. Elsevier Science B. V., Amsterdam (1998) 6. Hamrock, B.J., Dowson, D.: Isothermal Elastohydrodynamic Lubrication of Point Contacts, Part I – Theoretical Formulation. Transactions of the ASME, Journal of Lubrication Technology 98, 223–229 (1976) 7. Smeeth, M., Spikes, H.A.: Central and Minimum Elastohydrodynamic Film Thickness at High Contact Pressure. Transactions of the ASME (the American Society of Mechanical Engineering). Journal of Tribology 119, 291–296 (1997) 8. Gustafsson, L., Höglund, E., Marklund, O.: Measuring Lubricant Film Thickness with Image Analysis. Proceeding Institution of Mechanical Engineers, Part J: Journal of Engineering Tribology 208, 199–205 (1994) 9. Hartl, M., Molimard, J., Křupka, I., Vergne, P., Querry, M., Poliščuk, R., Liška, M.: Thin Film Lubrication Study by Colorimetric Interferometry. In: Thinning Films and Tribological Interface Conservation (Proceedings of the 26th Leeds-Lyon Symposium on Tribology), pp. 695–704. Elsevier Science B. V., Amsterdam (2000) 10. Hartl, M., Křupka, I., Poliščuk, R., Liška, M.: Computer-Aided Chromatic Interferometry. Computer & Graphics 22, 203–208 (1998) 11. Foord, C.A., Hammann, W.C., Cameron, A.: Evaluation of Lubricants Using Optical Elastohydrodynamics. ASLE Transactions, 31–43 (November 1968) 12. Supplement №2, CIE Publication №15, Colorimetry. Paris, Bureau Central de la CIE (1978) 13. Hartl, M., Křupka, I., Poliščuk, R., Liška, M.: Thin Film Colorimetric Interferometry. Tribology Transactions 44, 270–276 (2001) 14. Cann, P.M., Spikes, H.A., Hutchinson, J.: The Development of a Spacer Layer Imaging Method (SUM) for Mapping Elastohydrodynamic Contacts. Tribology Transactions 39, 915–921 (1996) 15. Foord, C.A., Wedeven, L.D., Westlake, F.J., Cameron, A.: Optical Elasto-hydrodynamics. In: Proceedings of the Institution of Mechanical Engineers, Part 1, vol. 184, pp. 487–505 (1969-1970) 16. Vaverka, M., Vrbka, M., Křupka, I., Hartl, M.: Influence of Dents on Friction Surfaces on Thin Lubrication Films. In: Technische Akademie Esslingen, Ostfildern (2006) 17. Poliščuk, R., Vaverka, M., Vrbka, M., Křupka, I., Hartl, M.: Pressure Distribution Within EHD Point Contacts Based on Measured Film Thickness. In: Proceedings of ASME IMECE 2006, Chicago (2006) 18. Křupka, I., Poliščuk, R., Hartl, M.: Behavior of thin viscous boundary films in lubricated contacts between micro-textured surfaces. Tribology International (2008)
An Active Contour Approach for a Mumford-Shah Model in X-Ray Tomography
Elena Hoetzl and Wolfgang Ring
Institute of Mathematics and Scientific Computing, University of Graz
[email protected], [email protected]
Abstract. This paper presents an active contour approach for the simultaneous inversion and segmentation of X-ray tomography data from its Radon Transform. The optimality system is found as the necessary optimality condition for a Mumford-Shah like functional over the space of piecewise smooth densities, which may be discontinuous across the contour. In our approach the functional variable is eliminated by solving a classical variational problem for each fixed geometry. The solution is then inserted in the Mumford-Shah cost functional leading to a geometrical optimization problem for the singularity set. The resulting shape optimization problem is solved using shape sensitivity calculus and propagation of shape variables in the level-set form. As a special feature of this paper, a new, second order accurate, finite difference method based approach for the solution of the optimality system is introduced and numerical experiments are presented.
1 Introduction
Active contours have been successfully applied to various imaging problems within the last two decades. We mention the pioneers [12], [2], [6]; the introduction of Mumford-Shah like functionals with active contours as the main optimization variables [4], [5] (now commonly denoted as Chan-Vese models); the extensions of the available array of energy functionals (e.g. [16], [11]); and the introduction of level-set based active contours into the realm of inverse problems [18], [13]. Especially the level-set technique [15] has proven to be a valuable tool for the efficient treatment of geometrical variables. We present an active contour approach for the simultaneous inversion of the classical Radon transform appearing in standard X-ray tomography,

g_d(s, \omega) \approx Rf := \int_{\mathbb{R}} f(s\omega + t\omega^{\perp})\, dt,     (1.1)

where (s, ω) ∈ R × S^1, together with a segmentation of the reconstructed image. We formulate a Mumford-Shah energy which is minimized with respect to the segmenting contour as the primary optimization variable. In our approach we aim for reconstructions of the density which are piecewise smooth with respect to the partition introduced by the
active contour. We emphasize that we do not (as is traditionally done) first invert the Radon transform and then segment the obtained image; instead, we use the raw tomography data directly for both reconstruction and segmentation. We generalize this approach, originally proposed in [17], to allow density functions that are not necessarily piecewise constant but piecewise smooth. As a consequence, the chosen energy functional has a more complicated form than the one in [17], and therefore the numerical solution of the optimality system for the density with a fixed geometry poses a specific difficulty and cannot proceed in the way described in [17]. For this purpose, a second-order accurate finite difference approximation is constructed, and the optimality system is solved using an iterative method. For the chosen energy functional we use shape sensitivity calculus to obtain a descent direction, and we apply the level-set methodology to update the active contour. We particularly note the applicability of the approach to more involved medical imaging problems such as the inversion of SPECT data.
2 The Mumford-Shah Functional and the Minimization Algorithm
Suppose g_d : R × S^1 → R are given noisy data which are inaccurate measurements of ideal data g = Rf, where R is the Radon transform of an unknown density f : Ω ⊂ R^2 → R. We assume that the density is piecewise smooth with respect to a partition Ω_i of Ω, i.e. f = f_1 + f_2 + ... + f_n with f_i|_{Ω_i} ∈ H^1(Ω_i) and f_i(x) = 0 if x ∉ Ω_i. The boundary of the individual domains is given by Γ. Note that f can be singular on Γ but is smooth within each Ω_i. The assumption that f is piecewise smooth is reasonable especially in the context of segmentation of medical images. We want to find f and Γ such that the Radon transform of f fits the given data g_d as well as possible. Therefore we consider the Mumford-Shah like functional

J(f, \Gamma) = \|Rf - g_d\|^2_{L^2(\mathbb{R}\times S^1)} + \alpha \int_{\Omega\setminus\Gamma} |\nabla f|^2\, dx + \beta\, |\Gamma|.     (2.1)
In an algorithm for the minimization of the functional (2.1) it is difficult to update both variables, the geometric variable Γ and the functional variable f, independently, since the geometry Γ is an essential part of the definition of the density function f. Therefore we first fix Γ and solve the variational problem for the fixed Γ,

\min_f J(f, \Gamma).     (2.2)
Having solved (2.2), we insert the solution f(Γ) into the Mumford-Shah like functional (2.1) and solve the shape optimization problem

\min_\Gamma J(f(\Gamma), \Gamma).     (2.3)
For more details we refer to [17]. Similar approaches which reduce the primary task to a shape optimization problem by eliminating the functional variable are described in [3], [20], [9].
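The reduced minimization (2.2)-(2.3) can be summarized by the following outline; the three callables are placeholders for the solvers developed in the remaining sections and are not part of the paper.

```python
def minimize_mumford_shah(g_d, phi0, alpha, beta,
                          solve_density, reduced_cost, shape_step,
                          n_outer=50):
    """Alternating scheme: for the current contour (zero level set of
    phi) solve the variational problem (2.2) for the density f, then
    take one shape-descent step (2.3) for the geometry.
    solve_density(g_d, phi, alpha)          -> f      (min_f J(f, Gamma))
    reduced_cost(f, phi, g_d, alpha, beta)  -> float  (J(f(Gamma), Gamma))
    shape_step(f, phi, g_d, alpha, beta)    -> phi    (level-set update)"""
    phi, f, history = phi0, None, []
    for _ in range(n_outer):
        f = solve_density(g_d, phi, alpha)
        history.append(reduced_cost(f, phi, g_d, alpha, beta))
        phi = shape_step(f, phi, g_d, alpha, beta)
    return f, phi, history
```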
3 The Variational Problem for Fixed Γ
3.1 The Optimality System
The necessary optimality condition for the minimizer f of J for fixed Γ is given as

\partial_f J(f(\Gamma), \Gamma)\,\varphi = \langle Rf - g_d, R\varphi \rangle_{L^2(\mathbb{R}\times S^1)} + \alpha \int_{\Omega\setminus\Gamma} \langle \nabla f, \nabla\varphi \rangle\, dx = 0     (3.1)

for all test functions φ ∈ H^1(Ω \ Γ), where ∂_f J denotes the derivative of J with respect to the first variable. Using the integration-by-parts formula for the second term we get

\langle R^*Rf - R^*g_d, \varphi \rangle_{L^2(\Omega\setminus\Gamma)} - \alpha \int_{\Omega\setminus\Gamma} \Delta f\, \varphi\, dx + \alpha \int_{\partial(\Omega\setminus\Gamma)} \frac{\partial f}{\partial n}\, \varphi\, ds = 0     (3.2)

for all φ ∈ H^1(Ω \ Γ), where R^* denotes the adjoint operator to the Radon transform (1.1), given by R^*g(x) = \int_{S^1} g(\langle x, \omega\rangle, \omega)\, d\omega (see [14]). From (3.2) we conclude that the strong form of the optimality system is given as

R^*Rf - \alpha \Delta f = R^*g_d  in  Ω \ Γ,     \frac{\partial f}{\partial n} = 0  on  Γ.     (3.3)
4 Numerical Approximation
4.1 Geometrical Set Up
For the numerical solution of the optimality system, we use a finite difference approach on a regular grid with additional points where the active contour intersects the grid lines. These points are called intersection points. Points on the regular grid which have at least one intersection point as a neighbor are called boundary points. All other points on the regular grid are regular points. The active contour Γ is represented by the level-set function Φ as Γ = {p | Φ(p) = 0}. The contour Γ divides regions where the level-set function Φ has different sign. We call a region where Φ < 0 an inner domain and a region where Φ > 0 an outer domain. Each intersection point is a grid point for both the inner and the outer domain which it separates. Therefore, at each intersection point we have to account for two function values for f (the traces from the inner and outer domains, respectively). The same holds for the derivatives of f, which will occur as unknowns in the discrete optimality system. We use a 9-point stencil approximation with a local numbering of the grid points and intersection points occurring in the stencil; f_k denotes the value of f at a grid point with local index k. An intersection point which lies on a grid line emanating from the center of the stencil p_0 is called a central intersection point. We will call the remaining intersection points in the stencil outer intersection points. In the same sense we distinguish central grid points and additional grid points among the regular grid points.
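A small sketch of this geometrical setup, assuming only that Φ is sampled on the regular grid: it locates intersection points by sign changes of Φ along grid lines and flags the adjacent boundary points (a simplification that ignores grid values exactly equal to zero).

```python
import numpy as np

def classify_grid(phi):
    """Locate intersection points of Gamma = {phi = 0} with the grid
    lines (sign changes between neighbouring grid values, positioned by
    linear interpolation) and flag regular grid points with at least one
    intersection-point neighbour as boundary points."""
    intersections = []
    boundary = np.zeros(phi.shape, dtype=bool)
    for axis in (0, 1):
        a = phi
        b = np.roll(phi, -1, axis=axis)
        change = a * b < 0
        # drop the wrap-around pairs created by np.roll
        if axis == 0:
            change[-1, :] = False
        else:
            change[:, -1] = False
        for i, j in np.argwhere(change):
            t = a[i, j] / (a[i, j] - b[i, j])   # fractional zero position
            intersections.append((i + t, j) if axis == 0 else (i, j + t))
            boundary[i, j] = True
            boundary[(i + 1, j) if axis == 0 else (i, j + 1)] = True
    return intersections, boundary
```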
5 Approximation of Δf at Boundary Grid Points
In our case, motivated by the structure of the terms occurring in the shape derivative of the cost functional (see Section 6), we are interested not only in the values of f at grid points but also in the values of f and its derivatives at intersection points. For all regular points we can use a classical 5-point stencil approximation. In order to get the values of f at intersection points (f_{\dot k}|_Γ) and their derivatives (∇f_{\dot k}|_Γ), we made modifications which are presented in the following. These modifications concern only the construction of an approximation of the Laplace operator at boundary points and the discretization of the Neumann boundary conditions.
5.1 The Approximation Concept
The idea is to use Taylor series expansions to express f_{0,xx} and f_{0,yy} in terms of the available points in the stencil (grid, center and/or outer intersection). Accordingly we will have several types of unknowns:
(a) function values at grid points,
(b) function values at intersection points,
(c) gradient values at intersection points.
So we need (at least) as many equations as we have unknowns. We propose to use the following types of equations:
(I) the integro-differential equation (3.3) at each regular point,
(II) an approximation of the gradient at each intersection point using Taylor series expansions,
(III) homogeneous Neumann boundary conditions at each intersection point.
The goal of our approach is to achieve second order accuracy for the function values and first order accuracy for the derivatives. Therefore function values are approximated with fourth order accuracy and first derivatives are approximated with third order accuracy (for details see [8]).
5.2 A Construction of an Approximation of the Laplace Operator
We can get Taylor series expansions which relate derivatives of f at the center of the stencil (especially f_{0,xx} and f_{0,yy}) to the unknown values f_i and ∇f_i. If we perform the Taylor series expansions up to third order, the coefficients of the expansion include the 9 parameters (f_{0,x}, f_{0,y}, f_{0,xx}, f_{0,xy}, f_{0,yy}, f_{0,xxx}, f_{0,xxy}, f_{0,xyy}, f_{0,yyy}), which are related to the unknowns via linear 9 × 9 systems. The general strategy for assembling the necessary 9 equations is:
(I) 4 equations from Taylor series expansions of the function values of 4 points within a 5-point stencil (grid or center intersection);
(II) 5 more equations from the following choice (ordered by priority):
– 2 additional equations for each chosen center intersection point (Taylor series expansions of the x- and y-derivatives); as many center intersection points as are available and necessary (to fill up the five equations) are selected;
– "free" (not already used) outer intersection points, with Taylor series expansions of both derivatives (if necessary);
– function values of "free" regular points.
According to this strategy we always have 4 equations for x- and y-derivatives at intersection points (center and/or outer) in our systems of 9 equations. After solving the system of 9 equations we get an approximation for the Laplace operator in terms of unknowns of the type (f_0, f_i, f_{i,x}, f_{i,y}). Those expressions will be used to assemble the discretization matrix afterwards.
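The flavor of such a local Taylor system is illustrated by the following simplified sketch. Unlike the paper's mixed value/derivative systems, it uses only function-value equations from nine neighbouring points (which must yield a nonsingular system), and all names are hypothetical.

```python
import numpy as np

def local_laplacian(center, points, values, f0):
    """Fit the nine Taylor parameters (f_x, f_y, f_xx, f_xy, f_yy,
    f_xxx, f_xxy, f_xyy, f_yyy) at `center` from nine neighbouring
    points with known function values, then return f_xx + f_yy.
    This is a value-only simplification of the systems in the paper."""
    rows = []
    for (x, y) in points:
        dx, dy = x - center[0], y - center[1]
        rows.append([dx, dy, dx**2 / 2, dx * dy, dy**2 / 2,
                     dx**3 / 6, dx**2 * dy / 2, dx * dy**2 / 2, dy**3 / 6])
    A = np.array(rows)                    # (9, 9) Taylor coefficient matrix
    rhs = np.asarray(values) - f0         # f(p_i) - f(p_0)
    params = np.linalg.solve(A, rhs)
    return params[2] + params[4]          # f_xx + f_yy at the stencil center
```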
5.3 Construction of the Approximation of the Gradient at Intersection Points
As gradient information at intersection points is used in the approximation of the Laplace operator, we need to establish relations between the gradient at intersection points and other unknowns. The idea is the following: each intersection point p˙ lies either on a horizontal or on a vertical grid-line. We call this direction (horizontal or vertical) the in-line direction of p˙ and the partial derivative in this direction the in-line derivative. The direction perpendicular to the in-line direction is called the cross-line direction and the corresponding derivative the cross-line derivative. For each center intersection point p˙ we use two points: the center of the stencil and its neighbor (grid or intersection point) which lies on the same grid line. Using Taylor series expansion we get a linear (second order accurate) relation between the in-line derivative at p˙ and the function values of the three points. Likewise, we also express the second in-line derivative (fp,xx or fp,yy ) by the same function values. For the cross-line deriva˙ ˙ tives we use again Taylor series expansion, in which either fp,x and fp,xx or fp,y ˙ ˙ ˙ and fp,yy are already known from the calculation of the in-line derivative. So ˙ for three remaining unknowns fp,y , fp,yy or fp,x , fp,xy we need three ˙ , fp,xy ˙ ˙ ˙ , fp,xx ˙ ˙ additional equations, which we choose as follows: (I). 2 equations from Taylor series expansion of function values of 2 neighbors (intersection points) along Γ ; (II). 1 remaining equation from the following choice (ordered by priority): – function value of “free” (not yet used) outer intersection point, – function value of “free” central regular point. After solving the system of 3 equations we can get expressions for fp,y or fp,x ˙ ˙ , which we will use to assemble a discretization matrix for the optimality system. 5.4
5.4 Homogeneous Neumann Boundary Conditions at Intersection Points
To complete the discretization of the optimality system we also need to realize homogeneous Neumann boundary conditions at the intersection points, which is equivalent to
\[ f^+_{\dot p,x} n^x_{\dot p} + f^+_{\dot p,y} n^y_{\dot p} = 0 \quad\text{or}\quad f^-_{\dot p,x} n^x_{\dot p} + f^-_{\dot p,y} n^y_{\dot p} = 0, \qquad (5.1) \]
where n^x_ṗ and n^y_ṗ denote the x and y components of the normal vector at the point ṗ, and f^+ and f^- denote the values of the function f on the outer and inner domains respectively. For the discretization of (5.1) the approximations for f_{ṗ,x} and f_{ṗ,y} derived in Section 5.3 are used. The normal vector
n_ṗ = (n^x_ṗ, n^y_ṗ) is determined using a local biquadratic interpolation model for the level-set function Φ which encodes Γ.
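The interpolation formulas are not spelled out in the text; as an assumption on our part, the following sketch shows one way to obtain the unit normal from a 3 × 3 patch of level-set values by fitting a local biquadratic model exactly and differentiating it at the intersection point. The function name, patch convention and example values are ours.

```python
import numpy as np

def normal_from_levelset(phi_patch, px, py):
    """Approximate unit normal at an intersection point (px, py), given a 3x3
    patch of level-set values Phi around the nearest grid point; px, py are
    relative to the patch centre, in grid units.  A local biquadratic model
    Phi(x, y) = sum c[a, b] x^a y^b (a, b = 0, 1, 2) is fitted exactly to the
    nine samples and differentiated analytically."""
    xs = ys = np.array([-1.0, 0.0, 1.0])
    rows, rhs = [], []
    for j, y in enumerate(ys):
        for i, x in enumerate(xs):
            rows.append([x**a * y**b for a in range(3) for b in range(3)])
            rhs.append(phi_patch[j, i])
    c = np.linalg.solve(np.array(rows), np.array(rhs)).reshape(3, 3)  # c[a, b]
    gx = sum(a * px**(a-1) * py**b * c[a, b] for a in range(1, 3) for b in range(3))
    gy = sum(b * px**a * py**(b-1) * c[a, b] for a in range(3) for b in range(1, 3))
    n = np.array([gx, gy])
    return n / np.linalg.norm(n)

# Example: Phi encodes a vertical interface at x = 0.25 (made-up values).
grid = np.array([[x - 0.25 for x in (-1, 0, 1)]] * 3, dtype=float)
print(normal_from_levelset(grid, 0.25, 0.0))   # approximately (1, 0)
```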
5.5 Numerical Treatment of the Optimality System (3.3)
Using the results of the modified finite difference approximation described above we assemble the discretization matrix S for the Laplace part of (3.3). We obtain the finite-dimensional approximation R*Rf + αSf = R*g. The discretization matrix S is of size N × N and nonsymmetric, and R*R + αS is nonsingular. The dimension N is equal to L² + 6IP, where L is the size of the reconstructed density f and IP is the number of all intersection points. The solution of the optimality system for the fixed geometry (3.3) poses a specific difficulty, since it arises as the discretization of a coupled system of integro-differential equations on variable and irregular domains (i.e., A = R*R + αS). Assembling the whole discretization matrix for the part R*R is much too expensive. To find an approximate solution by any iterative method we have to apply the operators R* and R to the previous approximation x_k in each step of the iterative procedure. This can be achieved in reasonable time using fast rotation and interpolation. We use preconditioned Bi-CGSTAB in our computations. As preconditioner we choose the matrix M = I + αS, where I is the N × N identity matrix.
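A minimal matrix-free sketch of this solution strategy with SciPy's Bi-CGSTAB is given below. The operator apply_RstarR and the sparse matrix S are placeholders standing in for the fast Radon-based operators and the assembled Laplace part, and the preconditioner M = I + αS is applied through a sparse LU factorization; sizes and values are invented for the example.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 500                                   # size of the discretized system
alpha = 0.1

# Placeholder for the sparse Laplace-part matrix S (a generic nonsymmetric example).
S = sp.diags([-0.9, 2.1, -1.1], [-1, 0, 1], shape=(n, n), format="csc")

def apply_RstarR(x):
    # In the real solver this applies R (fast rotation/interpolation based
    # Radon transform) followed by R*; here it is only a stand-in.
    return 0.5 * x

# Matrix-free operator  A x = R*R x + alpha S x .
A = spla.LinearOperator((n, n), matvec=lambda x: apply_RstarR(x) + alpha * (S @ x))

# Preconditioner M = I + alpha S, applied through a sparse LU factorization.
M_lu = spla.splu(sp.identity(n, format="csc") + alpha * S)
M = spla.LinearOperator((n, n), matvec=M_lu.solve)

rhs = np.ones(n)                          # stands in for R*g
x, info = spla.bicgstab(A, rhs, M=M)
print(info)                               # 0 indicates convergence
```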
6 Shape Sensitivity Analysis and Update of the Geometry
In our work we use techniques from the shape sensitivity calculus described e.g. in [19], [7], [1], [9]. Consider the reduced functional
\[ \hat J(\Gamma) = J(f(\Gamma), \Gamma). \qquad (6.1) \]
To find a descent direction for Ĵ, we differentiate the reduced functional Ĵ with respect to the geometry Γ. Using known differentiation rules (for more details we refer to [19], [7], [10]) for domain- or boundary functionals we obtain
\[
\begin{aligned}
d\hat{J}(\Gamma; F) = {} & 4 \sum_k \int_{x\in\Gamma_k} \sum_{p\in d(k)} s_p f_p(x) \Big(\int_{y\in\Omega_j} \frac{f_j(y)}{|x-y|}\,dy\Big) F(x)\,dS(x) \\
& - 2 \sum_k \int_{x\in\Gamma_k} \sum_{p\in d(k)} s_p f_p(x)\, R^* g_d(x)\, F(x)\,dS(x) \\
& + \alpha \sum_k \int_{x\in\Gamma_k} \sum_{p\in d(k)} |\nabla f_p(x)|^2\, F(x)\,dS(x)
  + \beta \sum_k \int_{x\in\Gamma_k} \kappa(x)\, F(x)\,dS(x),
\end{aligned}
\qquad (6.2)
\]
where s(i) = sign(φ(x)) and d(k) = {i(k), j(k)} is the set of the indices of two components Ωi and Ωj separated by Γk . Note that second order accurate
Fig. 1. Density distribution f (left) and reconstructed (exact) density distribution (right)
scheme, proposed in this work, is used for a more accurate evaluation of |∇f_p(x)| at Γ_k in (6.2). A direction F : Γ → R for which the directional derivative (6.2) is negative is called a descent direction of the functional (6.1). Without loss of generality we normalize the descent direction to ‖F‖ = 1, since a different scaling of the descent direction can always be compensated by the step-length choice of the optimization algorithm. A steepest descent direction is a solution to the constrained optimization problem
min_F dĴ(Γ; F)   such that   ‖F‖ = 1.
If we denote Γ(t) = {x : φ(t, x) = 0}, we can interpret the propagation of an interface Γ(t) as a propagation law for a corresponding time-dependent level-set function φ(t, x) (this idea was proposed in [15]). The level-set equation φ_t + F|∇φ| = 0 propagates φ and Γ simultaneously in a direction of decreasing cost functional values if F is equal to a descent direction obtained above. The level-set equation is solved using a WENO scheme.
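For illustration, a simpler first-order upwind discretization of the same propagation law (the paper itself uses a WENO scheme) could look as follows; the grid, speed field and step sizes are invented for the example.

```python
import numpy as np

def levelset_step(phi, F, dt, h=1.0):
    """One explicit time step of  phi_t + F |grad phi| = 0  with first-order
    Godunov upwinding (a simpler stand-in for the WENO scheme).  F is the
    normal speed per grid node, positive where the interface moves outward."""
    # One-sided differences; np.roll wraps periodically at the border, which
    # is adequate for this illustration.
    dxm = (phi - np.roll(phi, 1, axis=1)) / h
    dxp = (np.roll(phi, -1, axis=1) - phi) / h
    dym = (phi - np.roll(phi, 1, axis=0)) / h
    dyp = (np.roll(phi, -1, axis=0) - phi) / h
    grad_plus = np.sqrt(np.maximum(dxm, 0)**2 + np.minimum(dxp, 0)**2 +
                        np.maximum(dym, 0)**2 + np.minimum(dyp, 0)**2)
    grad_minus = np.sqrt(np.minimum(dxm, 0)**2 + np.maximum(dxp, 0)**2 +
                         np.minimum(dym, 0)**2 + np.maximum(dyp, 0)**2)
    return phi - dt * (np.maximum(F, 0) * grad_plus + np.minimum(F, 0) * grad_minus)

# Example: shrink a circle with a constant negative speed (a descent direction).
y, x = np.mgrid[-1:1:201j, -1:1:201j]
phi = np.sqrt(x**2 + y**2) - 0.5          # signed distance to a circle
for _ in range(50):
    phi = levelset_step(phi, F=-0.5 * np.ones_like(phi), dt=0.005, h=0.01)
```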
7 Numerical Results
7.1 A Reconstruction from Artificial Data
The necessary tomography test data g_d(s, ω) were created synthetically using a piecewise smooth density distribution f as input. The image for the density has a size of 201 × 201 pixels and contains three regions (see Figure 1, left). The measurements were simulated over the full circle for 319 angles and 320 offsets. The reconstruction of the density distribution from (numerically) exact data using the technique proposed in this work is shown in Figure 1 (right). One can see no difference between the reconstructed density and the original one. In Figure 2 (top right) the reconstructed contours are presented for the case where the exact data are contaminated with 20% Gaussian noise. One can hardly notice any difference from the noise-free reconstruction (top left). The corresponding density reconstruction in Figure 2 (bottom) is, however, slightly worse, which is reasonable taking the 20% noise into account. If the level of noise grows
Fig. 2. Reconstruction of Γ (α = 0.1, β = 0.1, preconditioner parameter pr = 0.01) from the noise-free data (top left) and from the noisy data (20%) (top right). The gray dashed contour is the initial guess, the black dashed contour is the propagating contour, and the black solid contour is the exact geometry. The corresponding reconstructed density is shown below.
the reconstruction of contours becomes visibly worse. In the above experiments we started with an initial contour which is topologically similar to the exact density, i.e., it contained 3 subdomains. Of course the question of what to start with is very natural in practical applications. In Figure 3 (left) we started with 5 initial subdomains and, as one can see, after 130 iterations we still obtained the correct result. Moreover, we started with 9 initial subdomains (one big and eight small ones). After 153 iterations we arrive at the result shown in Figure 3 (middle).
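The synthetic data generation described at the beginning of this subsection can be reproduced, for illustration, with scikit-image; the sketch below (our own, with an invented phantom that only loosely resembles the one in Figure 1) simulates projections over the full circle and contaminates them with 20% Gaussian noise.

```python
import numpy as np
from skimage.transform import radon

# Piecewise-smooth test density on a 201 x 201 grid: one large smooth blob
# and one small constant region (only loosely modelled on the paper's phantom).
n = 201
y, x = np.mgrid[-1:1:n*1j, -1:1:n*1j]
f = np.zeros((n, n))
f += 3.0 * np.exp(-3 * ((x + 0.1)**2 + y**2)) * ((x + 0.1)**2 + y**2 < 0.4)
f += 6.0 * ((x - 0.55)**2 + (y - 0.4)**2 < 0.04)

# Simulated tomography data g_d(s, omega): projections over the full circle.
theta = np.linspace(0.0, 360.0, 319, endpoint=False)
g_exact = radon(f, theta=theta)

# Contaminate with 20% Gaussian noise (relative to the data's RMS deviation).
rng = np.random.default_rng(0)
g_noisy = g_exact + 0.2 * g_exact.std() * rng.standard_normal(g_exact.shape)
```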
7.2 Parameters α and β
The parameter α corresponds to the regularization term penalizing f in (2.1). This term ensures that the minimization of J in (2.1) with respect to f is well-posed. Numerical experiments show that the larger α is, the smoother the reconstructed density f becomes. However, both a very large α (> 1) and a very small one (< 0.001) start to affect the reconstruction of the contours and of the density; the reconstruction becomes slightly worse. For a comparison see Table 1, where the exact values and averages of the reconstructed values of the density f (for different values of α) are presented. Here Ω2 is the big domain from the previous figures, Ω1 is the small left domain and Ω3 is the small right domain. The parameter β corresponds to the regularization term penalizing Γ in (2.1), and this term controls the length of the boundary Γ. Numerical experiments show that a large β starts to "subtend" (or "span") Γ and therefore has a negative influence on the reconstruction of the contours. In contrast, with a very small β (< 0.001) the length of Γ becomes uncontrollable.
Fig. 3. Reconstruction of Γ started with 5 initial domains (left), reconstruction of Γ started with 9 initial domains (middle) and reconstruction of Γ using the scheme from [17] (right)

Table 1. Average values of the reconstructed density f in subdomains of Ω for a fixed β

            Ω1     Ω2 (min–max)   Ω3
exact       6.00   2.22–3.78      0.00
α = 0.005   6.02   2.18–3.87      0.01
α = 0.01    6.00   2.23–3.75      0.00
α = 0.1     5.91   2.26–3.60      0.01
α = 1       5.88   2.49–3.54      0.07
α = 10      5.85   2.89–3.20      0.08

8 Conclusion
In this work we introduced a new finite-difference-based approach for the determination of a piecewise smooth density function from data of its Radon transform. We introduced a second order accurate finite difference discretization on a regular quadratic grid with additional grid points at the intersection points. Due to the fact that our algorithm allows us to perform a reconstruction and a segmentation simultaneously and directly from the measured data, we achieve a higher degree of accuracy (compared with other available methods), which makes our approach suitable for practical applications. Also, the fact that in our algorithm the density f is not necessarily constant, but piecewise smooth, gives us the possibility to treat more complicated cases. More precisely, if we apply the scheme from [17] to data with a piecewise smooth density, the reconstruction fails completely (Figure 3, right). Experiments with real-world data are still in progress.
References
1. Aubert, G., Barlaud, M., Faugeras, O., Jehan-Besson, S.: Image segmentation using active contours: calculus of variations or shape gradients? SIAM J. Appl. Math. 63(6), 2128–2154 (2003)
2. Caselles, V., Catté, F., Coll, T., Dibos, F.: A geometric model for active contours in image processing. Numer. Math. 66(1), 1–31 (1993)
3. Chan, T.F., Vese, L.A.: A level set algorithm for minimizing the Mumford-Shah functional in image processing. UCLA CAM Report 00-13, University of California, Los Angeles (2000)
4. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. Image Processing 10(2), 266–277 (2001)
5. Chan, T.F., Vese, L.A.: A multiphase level set framework for image segmentation using the Mumford and Shah model. Int. J. Comp. Vision 50(3), 271–293 (2002)
6. Cohen, L.D., Kimmel, R.: Global minimum for active contour models: a minimal path approach. Int. J. of Comp. Vision 24(1), 57–78 (1997)
7. Delfour, M.C., Zolésio, J.-P.: Shapes and Geometries: Analysis, Differential Calculus, and Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2001)
8. Hoetzl, E.: Numerical treatment of a Mumford-Shah model for X-ray tomography. PhD thesis, Karl Franzens University Graz, Institute for Mathematics (2009)
9. Hintermüller, M., Ring, W.: An inexact Newton-CG-type active contour approach for the minimization of the Mumford-Shah functional. J. Math. Imag. Vis. 20(1–2), 19–42 (2004)
10. Hintermüller, M., Ring, W.: A second order shape optimization approach for image segmentation. SIAM J. Appl. Math. 64(2), 442–467 (2003)
11. Jehan-Besson, S., Barlaud, M., Aubert, G.: DREAM2S: Deformable regions driven by an Eulerian accurate minimization method for image and video segmentation (November 2001)
12. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. Int. J. of Computer Vision 1, 321–331 (1987)
13. Litman, A., Lesselier, D., Santosa, F.: Reconstruction of a two-dimensional binary obstacle by controlled evolution of a level-set. Inverse Problems 14, 685–706 (1998)
14. Natterer, F.: The Mathematics of Computerized Tomography. Classics in Applied Mathematics, vol. 32. SIAM, Philadelphia (2001); Reprint of the 1986 original
15. Osher, S., Sethian, J.A.: Fronts propagating with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations. J. Comput. Phys. 79(1), 12–49 (1988)
16. Paragios, N., Deriche, R.: Geodesic active regions: a new paradigm to deal with frame partition problems in computer vision. Int. J. of Vis. Communication and Image Representation (2001) (to appear)
17. Ramlau, R., Ring, W.: A Mumford-Shah level-set approach for the inversion and segmentation of X-ray tomography data. Journal of Computational Physics 221, 539–557 (2007)
18. Santosa, F.: A level-set approach for inverse problems involving obstacles. ESAIM: Control, Optimization and Calculus of Variations 1, 17–33 (1996)
19. Sokolowski, J., Zolésio, J.-P.: Introduction to Shape Optimization: Shape Sensitivity Analysis. Springer, Berlin (1992)
20. Tsai, A., Yezzi, A., Willsky, A.S.: Curve evolution implementation of the Mumford-Shah functional for image segmentation, denoising, interpolation, and magnification. IEEE Transactions on Image Processing 10(8), 1169–1186 (2001)
An Integral Active Contour Model for Convex Hull and Boundary Extraction

Nikolay Metodiev Sirakov1 and Karthik Ushkala2

1 Dept. of Mathematics, 2 Dept. of Computer Science, Texas A&M University-Commerce, Commerce, TX 75429
Tel.: 9038865943, 9032178843, Fax: 9038865945
[email protected], [email protected]
Abstract. This paper presents a new deformable model capable of segmenting images with multiple complex objects and deep concavities. The method integrates a shell algorithm, an active contour model and two active convex hull models. The shell algorithm automatically inscribes every image object into a single convex curve. Every curve is evolved to the boundary's vicinity by the exact solution of a specific form of the heat equation. Further, if re-parametrization is applied at every time step of the evolution, the active contour will converge into deep concavities. But if distance-function minimization or a line equation is used to stop the evolution, the active contour will define the convex hull of the object. A set of experiments is performed to validate the theory. The contributions, advantages and bottlenecks of the model are underlined at the end by a comparison against other methods in the field.
1 Introduction
In recent years new scientific fields have emerged, such as image semantics extraction and event discovery in images and video. But the very first problem to solve, in order to approach the above sophisticated tasks, is the problem of image segmentation. Different kinds of techniques are employed for segmentation, but some of the most successful in terms of connectivity detection and preservation are the so-called active contour models. They can be subdivided into region [2,3] or boundary detection methods [1,4,5,6,7,11,12]. Another classification of the active contour models may be made on the basis of the mathematical method(s) lying behind the algorithm. Thus, there are models based on minimization of functionals [5,12], level sets [2,8,10,11], the approximate solution of partial differential equations [1,8,4,9,17], and algebraic structures [13]. The present paper develops a new Integral Active Contour Model (IACM) composed of the shell algorithm, a new active contour model and two new active convex hull models. The shell algorithm [1,14] is used to segment an image with multiple objects, inscribing every object into a convex closed curve. Further, every curve is evolved toward the vicinity of the boundary. The evolution is guided
by the exact solution of the Active Convex Hull Model (ACHM) [1], which is a specific form of the geometric heat differential equation. From the position of boundary vicinity the IACM may determine the object's boundary, its convex hull, or deep concavities [16]. An arc-length re-parametrization [1,14,15,17] is used to facilitate the contour's evolution to the boundary of deep and complex concavities. A minimization of a distance function and a line equation are applied along with the evolution equation to develop two new active convex hull models. The distance function measures the Euclidean distance between the initial position of every active contour point and its location in the vicinity of the boundary. The rest of the paper is organized as follows: Section 2 develops a new Active Contour model on the Exact Solution (ACES) of the ACHM [1] and its extension to objects with deep concavities. Section 3 presents the two new active convex hull models, whereas Section 4 develops the mathematical description of the shell algorithm in the light of the ACES model. The theory is validated by experiments with synthetic and medical images (the X-ray was provided by the Hunt Regional Community Hospital) given across Sections 2-4. Section 5 presents more experimental results obtained from images containing craft objects. A PC with a Core Duo 2.16 GHz CPU and 2 GB of RAM was used to carry out the experiments. Section 6 provides a discussion of the advantages and disadvantages of the present work. A comparison with existing methods underlines the contributions of the IACM.
2 The Active Contour Model on the Exact Solution
The present section develops a new Active Contour model using the Exact Solution (ACES) of the ACHM [1], presented by the following equation:
\[ \frac{\partial \psi}{\partial t} = P\,k\,\vec{N} - |ds|\,T, \qquad (1) \]
where ψ = ψ(x(s, t), y(s, t)) is a parametric curve. In the last expression s parameterizes a particular curve, whereas t parameterizes the family of curves. T denotes the tangent vector, N the normal one, k is the curvature, and P is a penalty function such that:
\[ P = \begin{cases} 0 & \text{if } \varepsilon_2 > \partial f(x, y)/\partial t > \varepsilon_1, \\ 1 & \text{otherwise,} \end{cases} \]
where ε₁, ε₂ are thresholds and f(x, y) is the image function. Thus, if the exact solution of Eq (1) is sought in the form of a separable function ψ(s, t) = w(t)u(s), then this solution is determined as a parametric curve ψ:
\[ \psi(s,t) = \exp\!\Big(\tfrac{s\,|ds|}{2} - \lambda^2 t\Big)\Big[C_1 \cos\Big(\tfrac{s}{2}\sqrt{4\lambda^2 - |ds|^2}\Big) + C_2 \sin\Big(\tfrac{s}{2}\sqrt{4\lambda^2 - |ds|^2}\Big)\Big], \qquad (2) \]
where w_t/w = −λ² and |ds| is the length of the chord of the arc segment. Substitute λ = |ds| in Eq (2) and rewrite it in vector form:
\[ r(s,t) = x\,i + y\,j = \exp\!\Big(\tfrac{|ds|\,s}{2} - |ds|^2 t\Big)\Big[C_1 \cos\Big(\tfrac{|ds|}{2}\sqrt{3}\,c\,s\Big)\, i + C_2 \sin\Big(\tfrac{|ds|}{2}\sqrt{3}\,c\,s\Big)\, j\Big]. \qquad (3) \]
If c varies from 1 to 1000, the curve described by Eq (3) will change its shape from a spiral to a circle, for s ∈ [0, 2.309π/(c|ds|)]. For such s and t ∈ [0, ∞] the ACES model is defined by Eq (3) and satisfies the following initial and object boundary conditions. The initial condition is:
\[ r(s,t)\big|_{t=0.001,\;|ds|/2=a,\;C_1=C_2=R,\;c=1000} = R\,\exp\!\big(sa - 4a^2\cdot 0.001\big)\,\big[\cos(1000\sqrt{3}\,a s),\; \sin(1000\sqrt{3}\,a s)\big]. \qquad (4) \]
In Eq (4), R = (1/2)√(nr² + nc²), where nr is the number of rows in the image and nc is the number of columns. Recall that f(x, y) denotes the image function, where
\[ x(s,t) = R\,e^{\,sa - 4a^2 t}\cos(1000\sqrt{3}\,a s), \qquad y(s,t) = R\,e^{\,sa - 4a^2 t}\sin(1000\sqrt{3}\,a s), \]
for (x, y) ∈ D_f (the image domain) and (x, y) ∈ r(x, y). Now the following object boundary condition is formulated:
\[ r(s^*, t^*) = r(s^*, t^* + \partial t) \quad \text{if } \varepsilon_2 > \frac{\partial f(s^*, t)}{\partial t} > \varepsilon_1 \text{ for } s = s^* \text{ and } t^* > 0.001, \qquad (5) \]
and r(s*, t*) ≠ r(s*, t* + ∂t) if the above double inequality is not satisfied. The ACES model presented by Eqs (3), (4) and (5) was coded in Java. The thresholds ε₁, ε₂ and Δt (the discrete form of ∂t) are selected by the user. An experiment was performed using a CT section of a brain with a size of 256x256 (see Fig 1(a)). In Fig 1(b) the initial contour inscribes the entire image and evolves toward the object boundary. The software took 0.016 sec to extract the boundary shown in Fig 1(c). The ACES model evolves the active contour toward the object's boundary, but experiences difficulties progressing into deep concavities (Figs 2(a), 3(a)). To
Fig. 1. (a) A CT section of brain; (b) The brain’s boundary along with the curve defined by Eq(4) and evolved by Eq(3), to satisfy the condition given by Eq(5); (c) The extracted boundary alone
resolve this problem the model is "armed" with a re-parametrization approach [1,14,15,17]. To determine the direction of evolution after re-parametrization, the following concavity (convexity) equation is employed:
\[ (x^{**} - x_i)(y^* - y^{**}) > (x^* - x^{**})(y^{**} - y_i). \qquad (6) \]
In Eq (6), P* = (x*(s*, t*), y*(s*, t*)) and P** = (x**(s**, t**), y**(s**, t**)) are active contour points which satisfy condition (5): r(s*, t*) = r(s*, t* + Δt), r(s**, t**) = r(s**, t** + Δt), whereas (x_i, y_i) is a point added to the arc defined by P* and P** during the re-parametrization with Δs. If Eq (6) is satisfied, the contour (the point x_i) evolves toward the right side of the corresponding edge, if a clockwise walk along the contour is considered. See Figs 2(c), 3(c) and 6(b), where the direction of evolution after the re-parametrization is shown with little parallel bars. Using the above concepts an algorithm was coded in Java and incorporated into the ACES module.
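A direct transcription of the test in Eq (6) is given below as a sketch; the argument names and the example coordinates are ours, not from the original Java implementation.

```python
def evolves_right(p_star, p_star2, p_i):
    """Concavity/convexity test of Eq (6): True when the added point (xi, yi)
    should evolve toward the right side of the edge P*P** (assuming a
    clockwise walk along the contour)."""
    (xs, ys), (xss, yss), (xi, yi) = p_star, p_star2, p_i
    return (xss - xi) * (ys - yss) > (xs - xss) * (yss - yi)

# Example: a point added midway between two halted contour points.
print(evolves_right((0.0, 0.0), (2.0, 0.0), (1.0, -0.5)))
```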
Fig. 2. (a) An image of a spiral; (b) The spiral along with the active contour which is re-parameterized and moved into the concavity; (c) The contour evolution into the spiral is shown by the straight segments; (d) spiral’s contour
Observe that the ACES is always the first module to run and is halted in the vicinity of the object's boundary, as shown in Figs 2(b), 3(a) and 6(a). If the active contour is to converge into deep concavities, a re-parametrization is performed with an arc segment Δs given by the user. Then Eq (6) determines the right side (in case of a clockwise traverse of the active contour) of every edge, and the active contour evolves in the direction of this side by Eq (3) under the condition given by Eq (5). The process of re-parametrization and evolution halts if the length of the arcs becomes smaller than Δs (Fig. 2(c)). As mentioned above, the ACES module was tested with the spiral image shown in Fig 2(a). The image is of size 768x776. The software took 5.617 sec to extract the spiral's boundary using Δs = 15. The extracted boundary alone is shown in Fig 2(d).
3 Active Convex Hull Models on the Exact Solution
Paper [1] develops an ACHM applying the approximate solution of Eq (1). Normal and tangent forces were used to stop the contour evolution in that case. The present section instead derives two new ACHMs on the basis of the exact solution of
Eq (1). Instead of using forces, a distance function or a geometric line condition is employed in the present section. Recall that the points on the initial contour are defined by Eq (4) for t = 0.001 and s ∈ [0, 0.001155π/a]. Define the function:
d(s, t) = |(x(s, 0.001), y(s, 0.001)) − (x(s, t), y(s, t))| for t > 0.001
(7)
Theorem 1. Let an object O in an image f(x, y) be given such that the center of the image belongs to O, and let ε₁ and ε₂ be given for Eq. (5). Then the convex hull of O is defined by the active contour points (x(s, t), y(s, t)) which satisfy Eq. (5) and at which the function d(s, t) has a local minimum, such that every triplet of consecutive minima defines a convex-up arc (this is an arc which satisfies the inverse of Eq (6)).
Theorem 1 implies that the exact solution along with the distance minima which define convex-up arcs represent a new ACHM (ACHM-ESDM). Thus, the initial contour defined by Eq (4) is evolved by Eq (3) under condition (5) and is stopped in the proximity of the boundary (see Fig 3(a)). Then d(s, t) is constructed using Eq (7) for the t at which the contour was halted (Fig 3(a)). Next, the local minima of d(s, t) are determined using the first derivative test. Derivatives are approximated by finite differences on two consecutive nodes. To construct the convex hull, Eq (6) is applied to determine the convexity of every triplet of consecutive points at which d(s, t) has a local minimum. Triplets defining concavities are discarded.
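The following sketch (our own reading of the ACHM-ESDM steps, with assumed (N, 2) arrays of contour points matched by the parameter s) illustrates the selection of convex-hull vertex candidates from the local minima of d(s, t):

```python
import numpy as np

def convex_hull_vertices(init_pts, halted_pts):
    """Sketch of the ACHM-ESDM selection step: build d(s, t) from Eq (7),
    keep its local minima (first-derivative sign change approximated with
    finite differences), and discard triplets of consecutive minima that
    satisfy the concavity test of Eq (6).  init_pts and halted_pts are
    (N, 2) arrays of the contour at t = 0.001 and at the halted time."""
    d = np.linalg.norm(init_pts - halted_pts, axis=1)         # Eq (7), per point
    dd = np.diff(d)
    minima = [i + 1 for i in range(len(dd) - 1) if dd[i] < 0 <= dd[i + 1]]

    def concave(a, b, c):                                      # Eq (6)
        (xs, ys), (xss, yss), (xi, yi) = a, c, b
        return (xss - xi) * (ys - yss) > (xs - xss) * (yss - yi)

    keep = [m for k, m in enumerate(minima)
            if not concave(halted_pts[minima[k - 1]], halted_pts[m],
                           halted_pts[minima[(k + 1) % len(minima)]])]
    return halted_pts[keep]
```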
Fig. 3. (a) An X-ray of hand, the ACES in its vicinity; (b) The Convex hull of the hand; (c) The small bars show the direction of evolution after re-parameterizations. The hand’s boundary; (d) The hand with the exact boundary.
The ACHM-ESDM is coded in Java, incorporated into the ACES software, and employed to extract the convex hull of the hand in Fig 3(a). The convex hull along with the hand is given in Fig 3(b). Fig 3(c) presents the direction of evolution and the hand's boundary after re-parameterizations. It follows from Theorem 1 that the boundary points the active contour will meet first are the convex hull vertices. Consider now that for every convex object there is a unique circle inscribing the object. Therefore the active contour points (x_m, y_m), m = i, ..., j between every pair of convex hull vertices P_i(x_i(s_i, t_i), y_i(s_i, t_i)) and P_j(x_j(s_j, t_j), y_j(s_j, t_j)) belong to an arc of a circle.
Fig. 4. (a) A section of a groundwater unit; (b) The section with its Convex Hull; (c) The convex hull alone
Thus, to construct the convex hull, every arc between convex hull vertices must evolve to a straight segment. A condition to stop the arcs' evolution on this segment is:
\[ D_{ij} - (D_{im} + D_{mj}) = \pm\varepsilon_3, \qquad (8) \]
where D_ij, D_im and D_mj are the Euclidean distances between the points P_i, P_j and P_m. The threshold ε₃ is chosen by the user. The ACES model along with Eq (8) represents another convex hull model, called ACHM on the Exact Solution and Line Equation (ACHM-ESLE). The new model is incorporated into the ACES software. Experiments were performed with the ACHM-ESLE module using the images shown in Figs 4(a), 5(a). The convex hull of multiple objects is given in Fig. 5(b), whereas Fig 4(b) shows the convex hull of a single object, defined in 0.047 seconds.
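For illustration, the stop condition of Eq (8) amounts to a near-collinearity test; the threshold value and example coordinates in the sketch below are arbitrary choices of ours.

```python
import math

def on_hull_edge(p_i, p_m, p_j, eps3=0.5):
    """Stop condition of Eq (8): point P_m has reached the straight segment
    between the convex-hull vertices P_i and P_j when the triangle inequality
    is tight up to the user threshold eps3 (here in pixel units)."""
    d = lambda a, b: math.dist(a, b)
    return abs(d(p_i, p_j) - (d(p_i, p_m) + d(p_m, p_j))) <= eps3

print(on_hull_edge((0, 0), (5, 0.1), (10, 0)))   # True: nearly collinear
```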
4 The Shell Algorithm with the ACES Model
The shell algorithm is designed in [1,14,15] to inscribe every object from an image in a circle, such that the center of the circle belongs to the object. The present section describes the shell algorithm in the light of the ACES model. Consider an active contour which satisfies the initial condition formulated by Eq (4) and evolves all the way to the image center, using Eq (3) and boundary condition (5). Then collect the points at which the contour hits or touches an object (finds discontinuities of f), and link every touching point with the center of the image or with another touch point (see Figs 5(c), 7(b)). The lines defined above are called tangent lines. The tangent lines along with the joint points P_j^m between the active contour and the objects form a shell. Two shells are defined in Fig 7(b), eight in Fig 5(c). Three of them contain two objects, whereas the remaining shells contain a single object. In every shell m, for m = 1, 2, ..., l, the mass center is calculated:
\[ x_c^m = \frac{\sum_{j=1}^{k} x_j}{k+1}, \qquad y_c^m = \frac{\sum_{j=1}^{k} y_j}{k+1}. \]
Denote the mass center of shell m by P_c^m = (x_c^m, y_c^m) and calculate
\[ R_m = \max_{1\le j\le k}\Big\{ d(P_c^m, P_j^m),\; d\big(P_c^m, (\tfrac{nc}{2}, \tfrac{nr}{2})\big) \Big\}. \qquad (9) \]
Fig. 5. (a) The original image; (b)The convex hull of the set of objects; (c) The shells of the objects; (d) The objects’ boundaries
In Eq (9), d denotes the Euclidean distance between plane points, whereas (nc/2, nr/2) denotes the center of the image. P_c^m and R_m are used by Eq (4) to define an initial contour which inscribes the objects from shell m. Again, the contour is evolved all the way to the center P_c^m. The process continues until a single object remains in every initial contour defined by Eq (4), and the center of this contour belongs to the object. The stop condition for the process is "No tangent line is built up after an evolution to the center". From a theoretical viewpoint it is possible for the mth initial contour to contain points not only from the mth shell. To discard points cut from other shells, the following approach is developed. Consider a point P_k^m = (x_k^m, y_k^m) which satisfies Eqs (3) and (5) for a certain t. The following expression is derived from the discrete form of Eq (3) for s = i_k and t = j_k:
\[ (x_k^m, y_k^m) = e^{\,i_k a - 4a^2 j_k}\,\big[\cos(1000\sqrt{3}\,a i_k),\; \sin(1000\sqrt{3}\,a i_k)\big]. \]
Denote by l_{m−1} and l_m the right- and left-most tangent lines which form the mth shell. Denote by P_m the point at which l_m is tangent to an object in the shell, and by P_{m−1} the point at which l_{m−1} is tangent to an object in the shell. It follows that both points satisfy the discrete form of Eqs (3) and (5) for s = i_m, t = j_m and for s = i_{m−1}, t = j_{m−1}:
\[ (x_m, y_m) = e^{\,i_m a - 4a^2 j_m}\,\big[\cos(1000\sqrt{3}\,a i_m),\; \sin(1000\sqrt{3}\,a i_m)\big], \]
\[ (x_{m-1}, y_{m-1}) = e^{\,i_{m-1} a - 4a^2 j_{m-1}}\,\big[\cos(1000\sqrt{3}\,a i_{m-1}),\; \sin(1000\sqrt{3}\,a i_{m-1})\big]. \]
Now the following statement holds. The point P_k^m belongs to the mth shell if and only if
\[ i_{m-1} < i_k < i_m. \qquad (10) \]
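A small sketch of the shell bookkeeping of Eqs (9) and (10), with our own helper functions and invented joint points and image size, is given below.

```python
import numpy as np

def shell_parameters(joint_pts, nr, nc):
    """Mass centre and radius of one shell (Eq (9)): joint_pts is the (k, 2)
    array of joint points P_j^m between the halted contour and the objects."""
    k = len(joint_pts)
    centre = joint_pts.sum(axis=0) / (k + 1)                  # x_c^m, y_c^m
    img_centre = np.array([nc / 2.0, nr / 2.0])
    dists = np.linalg.norm(joint_pts - centre, axis=1)
    R = max(dists.max(), np.linalg.norm(centre - img_centre))
    return centre, R

def in_shell(i_k, i_m_minus_1, i_m):
    """Membership test of Eq (10) via the curve parameters of the right- and
    left-most tangent points of the shell."""
    return i_m_minus_1 < i_k < i_m

centre, R = shell_parameters(np.array([[120.0, 80.0], [150.0, 60.0], [140.0, 95.0]]),
                             nr=431, nc=900)
print(centre, R)
```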
The shell algorithm is linked to the ACES, ACHM-ESDM and ACHM-ESLE, and all together they represent the Integral Active Contour Model (IACM). To verify the shell module, an experiment was performed with the image shown in Fig 5(a). The defined shells are given in Fig 5(c), while the objects' boundaries are shown in Fig 5(d).
Fig. 6. (a) An image of Mahi’s statue along with the IACM in the vicinity of the objects; (b) The directions of evolution and the boundary after re-parametrization; (c) The objects along with the extracted boundary
Fig. 7. (a) Image of crafts; (b) The crafts inscribed by two shells; (c) The crafts with the extracted boundary
5 Experimental Results
In the previous sections multiple models were developed and integrated as an IACM capable of extracting the boundaries, the convex hull and the concavities of multiple objects. To validate every module, experiments were performed across the previous sections. Thus, the convex hull in Fig 3(b) was defined in 0.157 sec, whereas the hand boundary, shown in Fig 3(d), was extracted in 1.468 sec. The original image is of size 628x809. The run time for defining the convex hull of the objects in the image of size 512x512 shown in Fig 5(b) is less than 10^-4 sec. The shells in Fig 5(c) were determined in 1.62 sec, while the objects' boundaries shown in Fig 5(d) were extracted in 1.746 sec. In the next experiment an image of Mahi's statue was used. The image is of size 210x200 and is shown in Fig 6(a). During the first stage of the work the ACES module set the contour in the vicinity of the boundary in less than 0.001 sec (Fig 6(a)). Next, a re-parametrization was performed and the directions of evolution were determined as shown in Fig 6(b). The extracted boundary of the objects is presented in Figs 6(b) and 6(c). The total elapsed time for the above work is 0.578 sec. Another experiment was performed to extract the boundary of two crafts in an image of size 900x431, shown in Fig. 7(a). The shell algorithm defined the two shells given in Fig 7(b) in 0.063 sec. Further, the ACES and re-parametrization modules defined the boundary of every object, shown in Fig. 7(c). The total run time for this job was 1.217 sec.
Table 1. Comparison of IACM with VFC (R=N) [5] and GVF (N iterations) [5,12]

Method                   IACM    VFC     GVF
run in sec for 256x256   0.692   ≈ 0.75  ≈ 2
run in sec for 512x512   1.744   ≈ 3     ≈ 16
6 Discussion and Future Work
This paper presents an IACM and software capable of extracting convex hulls, boundaries and deep concavities of multiple objects. The IACM is composed of ACES, ACHM-ESDM, ACHM-ESLE, re-parametrization and the shell algorithm, and possesses the following advantages:
– Automatic work with multiple objects in an image, without user interaction;
– Very large capture range, because Eqs (3) and (4) state that the initial contour may be set as far as the user wants (see Fig 1(b));
– Handles rectangular images, whereas [5,12] report only square ones;
– Does not need a stability convergence condition, while active contours based on differential equations do [1,4,8,9,17];
– Naturally extendable to the 3D case considering the 3D form of Eqs. (3-8);
– High accuracy of boundary and convex hull approximation, because truncation error is generated only by the ACHM-ESDM; if the ACHM-ESLE is used, no truncation error is generated at all. The accuracy also depends on the space step Δs (for s) and the time step Δt (for t); small Δs and Δt generate less error, but increase the run time;
– Fewer operations and better runtime compared to VFC and GVF [5,12] (see Table 1).
The latter statement holds because the complexity of ACES, ACHM-ESDM and ACHM-ESLE depends on the number of pixels on which Eqs (4), (6-8) are calculated. Since the above listed models work with background pixels, whose number is less than N² (N is the longest side of the image), it follows that the complexity of each model included in the IACM is O(N²). The re-parametrization approach increases the number of background pixels used by Eqs (4)-(6), but this number cannot exceed N². Thus the complexity of this approach is O(N²). The shell method splits the initial contour, which inscribes the image and is defined by Eq (4), into multiple contours, each of which inscribes a single object. Therefore the complexity of the shell method does not exceed O(N²) as well. It follows that the complexity of the IACM is O(N²), which is favorable in comparison to VFC with O(N² log N) and GVF with O(N³) [5,12]. An elegant method which also applies re-parametrization, but for the purpose of evolving "topology snakes", is presented in [17]. It uses a manual initialization of the curve and needs the correct choice of six parameters, whereas the present method is automatic. A disadvantage of the IACM is its inability to extract voids' boundaries (see Fig 6). Also, the user has to select the values for ε₁, ε₂, ε₃, Δs and Δt. Large values for Δs and Δt increase the speed but decrease the accuracy.
Future work continues with the elaboration of Eqs (3) and (5) to help the contour work with "low signal to noise ratio" [5] images. Work is also under way to employ shells along with the ACES model to determine voids. The IACM is to be extended to the 3D case.
References 1. Sirakov, N.M.: A New Active Convex Hull Model For Image Regions. Journal of Mathematical Imaging and Vision 26(3), 309–325 (2006) 2. Paragios, N., Deriche, R.: Geodesic, Active Regions and Level Set Methods for Supervised Texture Segmentation. Inter. Journal of Computer Vision, 223–247 (2002) 3. Rousson, M., Deriche, R.A.: Variational Framework for Active and Adaptive Segmentation of Vector Valued Images. In: Proc. IEEE Workshop Motion and Video Comp (December 2002) 4. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic Active Contours. Inter. J of Computer Vision 22(1), 61–79 (1997) 5. Li, B., Acton, S.T.: Active Contour External Force Using Vector Field Convolution for Image Segmentation. IEEE Transactions On Image Processing 16(8), 2096–2106 (2007) 6. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour Models. Int’l J. Computer Vision 1(3), 211–221 (1987) 7. Kanaujia, A., Metaxas, D.N.: Large Scale Learning of Active Shape Models. In: Proc. IEEE ICIP2007, San Antonio, September 2000, p I-265-268 (2007) ISBN: 1-4244-1437-7 8. Chan, T., Shen, J., Vese, L.: Variational PDE Models in Image Processing. Notices American Math Society 50(1), 14–26 (2003) 9. Sapiro, G.: Geometric Partial Differential Equation and Image Processing. Cambridge Univ. Press, Cambridge (2001) 10. Osher, S., Sethian, J.A.: Fronts Propagating with Curvature Dependent Speed: Algorithms Based on Hamilton-Jacobi Formulations. Journal of Computational Physics 79, 12–49 (1988) 11. Sethian, J.A.: A fast marching level set method for monotonically advancing fronts. Applied Mathematics 93(02-96), 1591–1595 (1996) 12. Xu, C., Prince, J.L.: Gradient Vector Flow Deformable Models. In: Bankman, I. (ed.) Handbook of Medical Imaging, pp. 159–169. Academic Press, London (2000) 13. Grenander, U., Miller, M.I.: Pattern Theory: From Representation to Inference. Oxford University Press, Oxford (2007) 14. Sirakov, N.M.: Monotonic Vector Forces and Green’s Theorem For Automatic Area Calculation. In: Proc. IEEE ICIP 2007, San Antonio, September 2007, vol. IV, pp. 297–300 (2007) ISBN: 1-4244-1437-7 15. Sirakov, N.M.: Automatic Concavity’s Area Calculation Using Active Contours and Increasing Flow. In: Proc. of ICIP 2006, Atlanta, October 2006, pp. 225–228 (2006) ISBN: 1-4244-0481-9 16. Sirakov, N.M., Simonelli, I.: A New Automatic Concavities Extraction Model. In: Proc of Southwest Sym. on Image Analysis and Inter., pp. 178–182. IEEE Computer Society, Denver (2006) 17. McInerney, T., Terzopolous, D.: T-Snakes: Topology adaptive snakes, Medical image analysis, vol. 4, pp. 73–91. Elsevier Science B.V., Amsterdam (2000)
Comparison of Segmentation Algorithms for the Zebrafish Heart in Fluorescent Microscopy Images

P. Krämer1, F. Boto1, D. Wald1, F. Bessy1, C. Paloc1, C. Callol2, A. Letamendia2, I. Ibarbia2, O. Holgado2, and J.M. Virto2

1 VICOMTech, Paseo de Mikeletegi 57, 20009 Donostia – San Sebastián, Spain
2 Biobide, Paseo de Mikeletegi 58, 20009 Donostia – San Sebastián, Spain
Abstract. The zebrafish embryo is a common model organism for cardiac development and genetics. However, the current method of analyzing embryo heart images is still mainly manual and visual inspection through the microscope, scoring embryos visually, a very laborious and expensive task for the biologist. We propose to automatically segment the embryo cardiac chambers from fluorescent microscopic video sequences, allowing morphological and functional quantitative features of cardiac activity to be extracted. Several methods are presented and compared on a large range of images, varying in quality, acquisition parameters, and embryo position. Despite such variability in the images, the best method reaches an accuracy of 70%, allowing the biologists' workload to be reduced by automating some of the tedious manual segmentation tasks.
1 Introduction
The zebrafish (danio rerio) is a widely used model for the study of vertebrate development. Due to its prolific reproduction, small size and transparency, the zebrafish is a prime model for genetic and developmental studies, as well as research in toxicology and genomics. While genetically more distant from humans, the vertebrate zebrafish nevertheless has comparable organs and tissues, such as heart, kidney, pancreas, bones, and cartilage. During the last years tremendous advances in imaging systems have been made, allowing the acquisition of high-resolution images of the zebrafish. Nevertheless, the processing of such images is still a challenge [1]. To date, only little work has been presented addressing the analysis of zebrafish images [2,3,4]. For instance, [4] presents a method to acquire, reconstruct and analyze 3D images of the zebrafish heart. The reconstruction of the volume is based on a semi-automatic segmentation procedure and requires the help of the user. The approach of [3] avoids segmentation and instead derives a signal of the heart from the images themselves to quantify heartbeat parameters. Despite such isolated research initiatives, the current method of analyzing embryo heart images in laboratories is still mainly manual and visual processing using commercial software, offering limited and traditional analysis methods.
Some morphological and functional quantitative features of cardiac activity can be extracted from the images. However, the heart itself and the cardiac chambers have to be segmented. Such segmentation is usually done manually, on each image composing a sequence, a very tedious and laborious task for the biologist. In order to provide the biologist with a tool reducing this workload, we propose to attempt segmenting automatically the cardiac chambers of the zebrafish embryo heart from microscopic video sequences. Several methods are described and compared. In our experiment, transgenic embryos expressing fluorescent protein in the myocardium were placed under light microscopy, allowing fluorescent images of the heart to be captured at video rate. In particular, we are interested in segmenting the heart of zebrafish embryos at two days post-fertilization (2 dpf). In early stages of zebrafish development the primitive heart begins as a simple linear tube. This structure gradually forms into two chambers, a ventricle and an atrium. At 2 dpf the heart tube is already partitioned into atrium and ventricle as depicted in Figure 1. They are separated by a constriction which will later form the valve. At this stage the heart is already beating. More information on zebrafish heart anatomy can be found in [5].
Fig. 1. The 2 dpf zebrafish heart already consists of two chambers: the atrium (A) and ventricle (V) (fluorescent image courtesy of Biobide, Spain)
The remainder is organized as follows: In section 2 we present several approaches to segment the zebrafish heart. Then, we present in section 3 two methods to identify the chambers. In section 4 we show some results and compare the segmentation methods respectively for the heart and its chambers. We give a conclusion of our work and outline future research in section 5.
2 Segmentation of the Zebrafish Heart
In this section we outline different approaches to segment the shape of the zebrafish heart. For the methods of subsections 2.3, 2.4 and 2.5 we cast the images to 8-bit gray level images and stretch the gray level range to [0, 255].
2.1 Adaptive Binarization
The Adaptive Binarization method is based on the hypothesis that the image of the heart consists of three brightness levels, as illustrated in Fig. 2: one corresponding to the background and two corresponding to the fluorescent heart, where strongly contracted regions appear brighter due to a higher concentration of fluorescent cells. For preprocessing, we smooth the image using a Gaussian kernel of aperture size 7 × 7 to remove noise. Then, the region of the heart with the highest brightness is segmented by first applying a Contrast-Limited Adaptive Histogram Equalization (CLAHE) [6] using a uniform transfer function and then the automatic threshold method of Otsu [7]. In order to segment the second, less bright region of the heart, we exclude the previously segmented region and apply CLAHE and Otsu again. The final segmentation is obtained by combining both segmentation results. Postprocessing includes the filling of holes which can appear inside the heart.
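A possible OpenCV/SciPy sketch of this two-pass procedure is shown below; the CLAHE parameters and the way the first region is excluded (masking it to zero) are our own assumptions, since the paper does not specify them, and the file name in the usage comment is hypothetical.

```python
import cv2
from scipy.ndimage import binary_fill_holes

def segment_heart(gray):
    """Two-pass CLAHE + Otsu segmentation sketch (our reading of Sec. 2.1):
    gray is an 8-bit fluorescence image of a single embryo heart."""
    smooth = cv2.GaussianBlur(gray, (7, 7), 0)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

    # Pass 1: brightest (strongly contracted) part of the heart.
    eq1 = clahe.apply(smooth)
    _, bright = cv2.threshold(eq1, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Pass 2: exclude the region found above and threshold the remainder,
    # which captures the dimmer fluorescent tissue.
    rest = smooth.copy()
    rest[bright > 0] = 0
    eq2 = clahe.apply(rest)
    _, dim = cv2.threshold(eq2, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    mask = (bright > 0) | (dim > 0)
    return binary_fill_holes(mask)          # post-processing: fill holes

# Usage (hypothetical file name):
# mask = segment_heart(cv2.imread("embryo_frame.png", cv2.IMREAD_GRAYSCALE))
```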
Fig. 2. The image of fluorescent heart (courtesy of Biobide, Spain) consists of three brightness levels: one corresponding to the background and two to the heart
2.2 Clustering
This method relies, like the previous one, on the assumption that there are three different brightness levels, although it is a statistical approach. It is based on unsupervised classification in order to distinguish between object and background pixels. First, each pixel is characterized by the mean luminance value of the 3 × 3 mask centered at the pixel. A one-dimensional feature space results. Then, we use a k-means (k = 3) classifier in order to separate the pixels into three clusters. The cluster to which the pixel at position (0, 0) belongs is then defined as the background and the others as the region of the heart. Similarly to above, we apply hole filling as postprocessing. More information on k-means clustering can be found in [8].
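An illustrative OpenCV sketch of this clustering step is given below; the termination criteria and attempt count are our own choices, not values from the paper.

```python
import cv2
import numpy as np
from scipy.ndimage import binary_fill_holes

def segment_heart_kmeans(gray):
    """k-means (k = 3) segmentation sketch following Sec. 2.2: each pixel is
    described by the mean luminance of its 3x3 neighbourhood, the cluster of
    the pixel at (0, 0) is taken as background, the rest as heart."""
    feat = cv2.blur(gray, (3, 3)).reshape(-1, 1).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 0.5)
    _, labels, _ = cv2.kmeans(feat, 3, None, criteria, 10,
                              cv2.KMEANS_RANDOM_CENTERS)
    labels = labels.reshape(gray.shape)
    background = labels[0, 0]
    return binary_fill_holes(labels != background)   # hole filling as post-processing
```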
2.3 Voronoi-Based Segmentation
In this case the classification is based on the mean and standard deviation values to classify the pixels into two regions. The Voronoi diagram-based segmentation [9] divides an image into regions using a Voronoi diagram and classifies the regions
as either inside or outside the target based on classification statistics (mean and standard deviation); it then breaks up the regions on the boundary between the two classifications into smaller regions and repeats the classification and subdivision on the new set of regions. The classification statistics can be obtained from an image prior, which is a binary image of a preliminary segmentation. In order to compute the image prior, we first apply a bilateral filter to smooth the image while preserving edges. Afterwards, the gradient magnitude is computed using a recursive Gaussian filter and a Sigmoid filter to map the intensity range into [0, 255]. Then a threshold is applied to the gradient magnitude to obtain a binary mask. As the binary mask may contain holes, we apply a morphological closing operation and fill the holes to complete the object's shape. Then the main region of the heart is isolated from noise in the binary image by applying a region growing algorithm to the binary mask with the brightest pixel in the image as seed point. Typically, the brightest pixel in the gray-level image belongs to the region of the heart. After the Voronoi segmentation we apply again morphological closing, hole filling, isolation of the main region, and morphological erosion to smooth the contours.
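The sketch below illustrates only the image-prior part of this pipeline with SciPy; note that it substitutes a median filter for the bilateral filter and uses made-up parameter values, so it is a simplified stand-in rather than the ITK implementation used in the paper.

```python
import numpy as np
from scipy import ndimage as ndi

def heart_prior(gray, grad_threshold=0.6):
    """Sketch of the binary image prior used to initialise the Voronoi-based
    segmentation (Sec. 2.3).  gray is a float image in [0, 1]; filter sizes
    and the gradient threshold are illustrative choices."""
    # Edge-preserving smoothing stand-in (the paper uses a bilateral filter;
    # a median filter keeps this sketch dependent on SciPy only).
    smooth = ndi.median_filter(gray, size=5)
    # Gradient magnitude of a Gaussian-smoothed image, mapped with a sigmoid.
    grad = ndi.gaussian_gradient_magnitude(smooth, sigma=2.0)
    grad = 1.0 / (1.0 + np.exp(-(grad - grad.mean()) / (grad.std() + 1e-9)))
    mask = grad > grad_threshold
    # Close gaps and fill the interior of the outlined shape.
    mask = ndi.binary_closing(mask, structure=np.ones((5, 5)))
    mask = ndi.binary_fill_holes(mask)
    # Keep only the connected component containing the brightest pixel.
    labels, _ = ndi.label(mask)
    seed = np.unravel_index(np.argmax(smooth), smooth.shape)
    return labels == labels[seed]
```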
2.4 Level Set
The main idea of this method differs from the previous methods in the refinement step: first a pre-segmentation is accomplished, which is then refined using the level set approach [10]. We chose this method because of its fast performance and the availability of source code, which can be found at [11]. The method starts with a morphological reconstruction to suppress structures that are lighter than their surroundings and that are connected to the image border. Then, edges are detected using the Canny edge detector. Dilation, hole filling, and erosion are applied to the contour image. The biggest region is considered as the region of the heart while the others are considered as noise. We complete the process by applying again dilation and hole filling. A Gaussian filter is applied to smooth the original gray level image for noise removal. Then we apply the level set method [10] with the contours of the binary mask as initialization. We chose the edge indicator function 1/(1+g) as suggested by the author, where g is the gradient magnitude of the Gaussian-filtered gray level image.
2.5 Watershed
This approach differs from the previous ones as it does not rely on a pre-segmentation by binarization; it is based on the topology of the image [12]. First the border structures are suppressed by morphological reconstruction. This is followed by a strong low-pass filtering (Gaussian filter) in a morphological reconstruction by erosion using the inverse of the morphological gradient. This attenuates unwanted portions of the signal while maintaining the signal intensity, as the watershed method is known to oversegment the image. Afterwards, a
small threshold is applied to set the background to zero and the image intensity is adjusted such that 1% of the data is saturated at low and high intensities. We apply the watershed segmentation to this gradient magnitude. An oversegmented image may result, with typically one region belonging to the background and several regions belonging to the heart. The latter ones are joined to form the region of the heart.
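A much-simplified scikit-image sketch of this idea is shown below; identifying the background basin via the image corner is our own shortcut, not the classification step described above.

```python
from scipy import ndimage as ndi
from skimage.filters import sobel
from skimage.segmentation import watershed

def segment_heart_watershed(gray):
    """Simplified watershed sketch in the spirit of Sec. 2.5: the watershed
    transform of a strongly smoothed gradient image typically yields one
    background basin and several heart basins; the latter are joined."""
    smooth = ndi.gaussian_filter(gray.astype(float), sigma=3)
    gradient = sobel(smooth)
    labels = watershed(gradient)             # over-segmentation into basins
    background = labels[0, 0]                # corner basin taken as background
    return labels != background
```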
3 Identification of the Chambers
The objective is now to divide the heart into the chambers based on the results of the methods presented in the previous section.
3.1 Convexity Defects
The method assumes that there is a constriction between the two chambers (see Figure 1) causing two convex points in the contour of the heart's shape. Therefore, we compute the convexity defects of the contour using its convex hull. Generally more than two convexity defects are found, due to irregularities in the contour caused by the segmentation, as depicted in Figure 3. Moreover, we assume that the convexity defects denoting the constriction between the chambers are parallel. Thus, we choose the four most important convexity defects, i.e. the four points with the highest distance from the convex hull, and compute the angle for each pair as:
\[ \theta = \arccos\frac{v_1 \cdot v_2}{|v_1||v_2|} \qquad (1) \]
where v1, v2 are respectively the vectors between the start and end points of the first and second convexity defects. If the angle is lower than a small threshold, then the pair of convexity defects is considered a possible candidate for the constriction, otherwise it is rejected. Finally, from the remaining candidates we choose the pair with the highest mean distance as the points of the constriction. We compute the straight line interpolating these points, which separates both chambers. As there is often a high variation of the straight line along the image sequence, we correct it by Double Exponential Smoothing-based Prediction (DESP) [13] using the results of the previous images.
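An OpenCV sketch of this defect-pair selection is given below; the angle threshold, the use of the absolute cosine for the parallelism test and the use of the defect depth as the "distance" are our own interpretation of the description above.

```python
import cv2
import numpy as np

def chamber_split_points(mask, angle_thresh_deg=25.0):
    """Sketch of the convexity-defect chamber split (Sec. 3.1) on a binary
    heart mask.  Returns the two constriction points, or None."""
    cnts, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_NONE)
    cnt = max(cnts, key=cv2.contourArea)
    hull = cv2.convexHull(cnt, returnPoints=False)
    defects = cv2.convexityDefects(cnt, hull)
    if defects is None:
        return None
    # Keep the four deepest defects (depth is stored as a fixed-point value).
    defects = sorted(defects[:, 0, :], key=lambda d: -d[3])[:4]
    best, best_depth = None, -1.0
    for a in range(len(defects)):
        for b in range(a + 1, len(defects)):
            d1, d2 = defects[a], defects[b]
            v1 = cnt[d1[1]][0] - cnt[d1[0]][0]       # end - start of defect 1
            v2 = cnt[d2[1]][0] - cnt[d2[0]][0]
            cosang = abs(np.dot(v1, v2)) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
            theta = np.degrees(np.arccos(np.clip(cosang, -1, 1)))    # Eq (1)
            depth = (d1[3] + d2[3]) / 2.0 / 256.0
            if theta < angle_thresh_deg and depth > best_depth:
                best = (tuple(cnt[d1[2]][0]), tuple(cnt[d2[2]][0]))
                best_depth = depth
    return best
```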
3.2 Watershed
This method is based on the results of the Watershed segmentation of subsection 2.5. The general idea is to divide the segmented shape into the two chambers by applying a second watershed segmentation. Therefore, the background is masked out and a watershed segmentation is applied in this area after a strong low-pass filtering. If two regions result, then they correspond to a rough identification of two chambers. Otherwise the regions have to be joined until only two regions remain. Therefore, we use the chamber identification of the previous image. We compute the intersection of a region in the current image with the
Fig. 3. The segmented heart (red line) and the convex hull (yellow line) with convexity defects of the shape (yellow points) (fluorescent image courtesy of Biobide, Spain)
identified chambers of the previous image. Then, the region is identified to belong to the chamber where the intersection is maximal. It can happen that only one region is obtained by the Watershed segmentation. Then, the segmentation of the previous image is used for further processing of the current image. This chamber identification is very rough whereas the outline is not coincident with that one of subsection 2.5 as can be seen in Figure 4. Thus unassigned pixels remain. In order to assign them to one of the chambers, an Euclidean distance transform is computed for each chamber. Then, the non-assigned pixels of the segmentation are joined with the chamber for which the distance transform is smaller.
4 Results and Discussion
In this section, we show and discuss the results obtained with the methods presented above. First, we compare the algorithms for segmenting the shape of the heart from section 2 using an accuracy measure. Then, we evaluate visually
Fig. 4. The Watershed segmentation (blue line) and the first rough identification of the chambers (red line) (fluorescent image courtesy of Biobide, Spain)
Table 1. Mean, Std, Max and Min accuracy for the segmentation algorithms

Method                 Mean accuracy  Standard deviation  Max accuracy  Min accuracy
Adaptive binarization  0.870          0.052               0.946         0.725
Clustering             0.876          0.057               0.954         0.759
Voronoi segmentation   0.890          0.046               0.937         0.769
Level set              0.850          0.062               0.935         0.724
Watershed              0.856          0.044               0.896         0.717
the results of the chamber segmentation algorithms from section 3. Moreover, we give computational times for all methods.
4.1 Comparison of Segmentation Algorithms
There are several methods to measure the performance of segmentation algorithms. In this work, we compare the segmented images with ground truth images which were obtained by manual segmentation. Although manual segmentations may be very subjective, we consider unsupervised evaluation methods such as [14,15] unsuitable for this comparison. Furthermore, the authors of [14] state that existing unsupervised approaches are most effective at comparing different parameterizations of the same algorithm and that they are less effective at comparing segmentations from different algorithms. We used the Jaccard coefficient [16,17] as performance measure for each segmentation method. This coefficient measures the coincidence between the segmentation result R and the ground truth A. Then, the segmentation accuracy is measured as: P (R, A) =
|R ∩ A| |R ∩ A| = |R ∪ A| |R| + |A| − |R ∩ A|
(2)
with | · | as the number of pixels of the given region. The nominator |R ∩ A| means how much of the object has been detected while the denominator |R ∪ A| is a normalization factor to scale the accuracy measure into the range of [0, 1]. Likewise pixels falsely detected as belonging to the object (false positives) are penalized by the normalization factor. Thus, this accuracy measure is insensitive to small variations in the ground truth construction and incorporates both, false positives and negatives, in one unified function [17]. In our experiments we used 26 image sequences with a resolution of 124 × 124 pixels and with a sufficient image quality for a objective manual evaluation. For each image sequence we segmented manualy the first 20 images with the above presented methods and compared them with a ground truth segmentation. The results of the mean Jaccard coefficient for each sequence are presented in Table 1. The Voronoi segmentation and both thresholding methods outperform the watershed and level set methods. A visual inspection of the segmentation results reveals similar results.
1048
P. Kr¨ amer et al.
The level set method gives good results on high contrast edges, but in regions where edges are blurred, the level set does not approach well the shape of the heart resulting in holes in the object shape or a too large shape. Moreover, we found it difficult to determine a common set of parameters suitable for all sequences. Finally we used the parameter set used in the example code [11] and fixed the number of iterations to 20. A specific choice of parameter set and number of iterations for each sequence/image might improve the results. The contours of the watershed method appear very rough and are often too tight. This might be due to the strong low-pass filtering in the post-processing which causes an edge mismatch. Equally a false classification of the regions into background and foreground may cause an inaccurate segmentation. The Voronoi segmentation method reveals the best results in term of accuracy measure. The contours are typically a little bit irregular, some postprocessing could be applied to smooth them. In case of low-contrast contours it may behave similar to the level set method. The overall results is quite satisfying. The adaptive binarization tends to have a slightly larger contours, but approaches well the object shape. This might cause the lower accuracy results, but the overall segmentation results are good. Sometimes in case of low-contrast edges the object shape may be incomplete. The clustering method tends also to larger contours, but slightly tighter than the adaptive binarization method. Therefore, a higher accuracy is achieved. However, in case of low-contrast edges it reveals more often incomplete shapes than the adaptive binarization. Note that the accuracy can vary as the randomized choice of initial cluster may result in slightly different segmentation results. The computational cost can not be directly compared as the implementations use different programming languages and libraries (the adaptive binarization, clustering, and Voronoi methods are implemented in C++ using respectively OpenCV, OpenCV and Torch, and ITK; the level set and watershed methods are implemented in Matlab), but the average execution time for each image is about one second.
4.2
Chamber Identification
In this section, we present some results for the convexity defects and watershed methods used to divide the heart into its chamber. We only evaluated the convexity defects method in combination with the adaptive binarization and clustering methods as they have the shortest execution times and we obtained good results in terms of accuracy and visual inspection. Evaluation of the results was realized by determining manually how many images were correctly divided into two chambers with respect to all segmented images. During evaluation we became aware of the fact that in some cases it is very difficult to decide if the chambers have been separated accurately enough or not, because the line segmenting both chamber might be displaced with respect to the restriction. Here, we adopted a hard line and classified such images as not correctly segmented.
Comparison of Segmentation Algorithms
1049
Table 2. The ratio of correct chamber identification for the algorithms presented Method Ratio Adaptive binarization + convexity defects 0.704 Clustering + convexity defects 0.577 Watershed 0.456
For evaluation we used only 24 out of the 26 sequences from above, because in two other ones the chambers are superimposed. Hence, the evaluation in those cases is very difficult and we chose to discard those sequences, we visually inspected 480 images. Our results are shown in Table 2, where the best result is obtained for the adaptive binarization method which also has the shortest execution time.
5
Conclusion and Future Work
We presented different methods to automatically segment the cardiac chambers of the zebrafish embryo from fluorescent microscopy video sequences. First, we implemented and compared various methods to extract the heart as a whole. The Voronoi-based segmentation gives the best results in terms of accuracy, followed by thresholding methods, such as the adaptive binarization and clustering method. Other methods, such as level set and watershed were also implemented but showed worse results they were also found more difficult to configure because of the variability in the images. We then compared various methods to identify the two chambers from the whole heart segmentation. This processing task can be very useful for cardiac study, allowing to analyze morphological and functional activity of each chamber separately. Cardiac pathology, such as fibrillation for example, can affect the atrium (atrial fibrillation) or the ventricle (ventricular fibrillation). It is therefore important to be able to extract not only the whole heart, but also each chamber from the microscopic video sequence. Our comparative study showed that the adaptive binarization method in combination with the detection of convexity defects outperforms clearly the other methods. While current methods of analyzing the embryo heart images are still mainly based on manual and visual inspection through the microscope, we have proposed image processing methods so that to substitute manual segmentation with automatic process. Although manual control and visual assessment are still necessary, our methods have the potential to drastically reduce biologist workload.
References
1. Vermot, J., Fraser, S., Liebling, M.: Fast fluorescence microscopy for imaging the dynamics of embryonic development. HFSP Journal 2, 143–155 (2008)
2. Luengo-Oroz, M.A., Faure, E., Lombardot, B., Sance, R., Bourgine, P., Peyriéras, N., Santos, A.: Twister segment morphological filtering. A new method for live zebrafish embryos confocal images processing. In: ICIP, pp. 253–256. IEEE, Los Alamitos (2007)
3. Fink, M., Callol-Massot, C., Chu, A., Ruiz-Lozano, P., Belmonte, J.C., Giles, W., Bodmer, R., Ocorr, K.: A new method for detection and quantification of heartbeat parameters in drosophila, zebrafish, and embryonic mouse hearts. BioTechniques 46(2), 101–113 (2009) 4. Liebling, M., Forouhar, A., Wolleschensky, R., Zimmermann, B., Ankerhold, R., Fraser, S., Gharib, M., Dickinson, M.E.: Rapid three-dimensional imaging and analysis of the beating embryonic heart reveals functional changes during development. Developmental Dynamics 235, 2940–2948 (2006) 5. Hu, N., Sedmera, D., Yost, H., Clark, E.: Structure and function of the developing zebrafish heart. The Anatomical Record 260, 148–157 (2000) 6. Zuiderveld, K.: Contrast Limited Adaptive Histogram Equalization. In: Graphics Gems IV, pp. 474–485. Academic Press, London (1994) 7. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. on Systems, Man and Cybernetics 1, 62–69 (1979) 8. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, Heidelberg (2007) 9. Imelinska, C., Downes, M., Yuan, W.: Semi-automated color segmentation of anatomical tissue. Computerized Medical Imaging and Graphics 24, 173–180 (2002) 10. Li, C., Xu, C., Gui, C., Fox, M.: Level set evolution without re-initialization: A new variational formulation. In: CVPR, vol. 1, pp. 430–436. IEEE, Los Alamitos (2005) 11. Li, C.: Home page, http://www.engr.uconn.edu/~ cmli/ 12. Vincent, L., Soille, P.: Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence 13, 583–598 (1991) 13. LaViola Jr., J.: Double exponential smoothing: An alternative to kalman filterbased predictive tracking. In: Immersive Projection Technology and Virtual Environments, pp. 199–206 (2003) 14. Zhang, H., Fritts, J.E., Goldman, S.A.: Image segmentation evaluation: A survey of unsupervised methods. Computer Vision and Image Understanding 110, 260–280 (2008) 15. Sezgin, M., Sankur, B.: Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging 13, 146–168 (2004) 16. Cox, T., Cox, M.: Multidimensional Scaling, 2nd edn. Chapman & Hall/CRC, Boca Raton (2001) 17. Ge, F., Wang, S., Liu, T.: New benchmark for image segmentation evaluation. Journal of Electronic Imaging 16, 33011 (2007)
A Quality Pre-processor for Biological Cell Images

Adele P. Peskin¹, Karen Kafadar², and Alden Dima³

¹ NIST, Boulder, CO 80305
² Dept of Statistics and Physics, Indiana University, Bloomington, IN 47408-3725
³ NIST, Gaithersburg, MD 20899
Abstract. We have developed a method to rapidly test the quality of a biological image, to identify appropriate segmentation methods that will render high quality segmentations for cells within that image. The key contribution is the development of a measure of the clarity of an individual biological cell within an image that can be quickly and directly used to select a segmentation method during a high content screening process. This method is based on the gradient of the pixel intensity field at cell edges and on the distribution of pixel intensities just inside cell edges. We have developed a technique to synthesize biological cell images with varying qualities to create standardized images for testing segmentation methods. Differences in quality indices reflect observed differences in resulting masks of the same cell imaged under a variety of conditions.
1 Introduction
High content screening (HCS) has become a critical method for large-scale cell biology and is often used for drug discovery. HCS is the automation of cell biological investigation using automated microscopes and sample preparation and includes the acquisition and analysis of cellular images without human intervention. Quantitative fluorescent microscopy plays a key role in HCS as it does in cell biology in general. Typical HCS-based experiments can involve the analysis of more than a million cells [1][2][3][4][5]. Image segmentation is the most important part of analyzing biological image data [4]. Consequently, many segmentation methods have been published, including histogram-based, edge-detection-based, watershed, morphological, and stochastic techniques [4]. However, segmenting the same image with different methods can lead to different masks for a given cell, and hence different estimates of cell characteristics (e.g., area, perimeter, etc.). We refer to a mask as the set of pixels that are used to define an object within an image. Accurate segmentation is one of the challenges facing high content screening [5], given the large number of images and consequent impracticalities of human supervision. Indeed, the handling and analysis of large image sets has been identified as an impediment to the wider use of HCS [4]. Therefore it is important to
This contribution of NIST, an agency of the U.S. government, is not subject to copyright.
obtain a predictive, objective and efficient measure of the segmentation quality resulting from a particular method, on a cell-by-cell basis. Not all cells in the same image will lead to segmentations with the same accuracy. Certain clinically relevant cell lines are also known to be particularly difficult to segment, making the automated analysis of image data unreliable. In one study, the dose response of morphological features varied significantly between well segmented and poorly segmented cells taken from high content screening data [6]. The goal of this project is to provide a simple, yet reasonably faithful, measure of the quality of an individual cell in an image, to facilitate the choice of segmentation method and focus setting for the image. This index is determined on a cell-by-cell basis, by the characteristics of the edge of each cell, different from other quality indices proposed previously that describe overall image noise and distortion ([7] [8] [9] [10]). To define a quality index for each cell, we first examine a series of images and their ranges in quality, revealing types of features associated with low/high quality. Section 2 describes the data images considered in this project. Examination of these images leads to the definition of the quality index in Section 3. We then illustrate in Section 4 our method of creating synthetic images with a given quality index. While our measure is shown to behave as expected (i.e., poor images associated with low quality indices), a field study is needed to validate this measure for more general cases. We propose the investigation of this issue and others in the concluding Section 5.
2 Data Description
We examined 16 sets of fixed cell images prepared by our biological collaborators at NIST. These images consist of A10 rat smooth vascular muscle cells and 3T3 mouse fibroblasts stained with Texas Red cell body stain [11]. For each set, we compared six different images for a total of 96 images. Each image comprises multiple cells. Five of the six images are repeated acquisitions under five sets of imaging conditions. The first three sets were acquired using three different exposure times with Chroma Technology's Texas Red filter set (Excitation 555/28, #32295; dichroic beamsplitter #84000; Emission 630/60, #41834). This filter is matched to the cell body stain used and allows us to compare the effect of exposure time on acquired cell images. We used non-optimal filters to reduce the intensity signal-to-noise ratio and introduce blurring: Chroma Technology's GFP filter set (Excitation 470/40, #51359; dichroic beamsplitter #84000; Emission 525/50, #42345).¹ The resulting images were blurred in a fashion similar to the Gaussian blurring operation found in many image processing systems. These five image sets were acquired using 2x2 binning, in which the output of four CCD image elements is averaged to produce one image pixel. Binning is typically used to trade image resolution for acquisition speed and sensitivity [2].
Certain trade names are identified in this report in order to specify the experimental conditions used in obtaining the reported data. Mention of these products in no way constitutes endorsement of them. Other manufacturers may have products of equal or superior specifications.
Table 1. The five sets of imaging conditions

Image   Illumination level   Exposure time (s)   Filter type
1       low                  0.015               optimal filter (Texas Red filter set)
2       medium               0.08                optimal filter (Texas Red filter set)
3       high                 0.3                 optimal filter (Texas Red filter set)
4       low                  1.0                 non-optimal filter (GFP filter set)
5       medium               5.0                 non-optimal filter (GFP filter set)
6       (gold standard)
Fig. 1. Individual pixel intensities are color-coded over the range in each image, to show differences in edge sharpness. The ground truth mask is shown for comparison.
The sixth image is a higher-resolution 1x1 binned image, using the full image range to define pixel intensities. We use this image to create a ground truth mask for each cell, via expert manual segmentation to assure a “gold standard” image. The five sets of imaging conditions are a combination of filter type (optimal or non-optimal) and illumination level (low, medium, high), and are given in Table 1. We visualize the quality of each cell by color-coding the pixels. Figure 1 provides a visual rendition of the five conditions for cell number 4 from image set 1, revealing differences in the implied masks. Cell edges vary widely in clarity and sharpness. In particular, the images vary in terms of the number of pixel lengths (distance between pixels) needed to represent the thickness of the edge of the cells. We will attempt to quantify this feature in the next section.
3 Quality Calculation
For each cell in an image, we look at pixel intensities within an isolated region containing the cell. We fit this distribution using a 3-component Gaussian mixture via the EM (Expectation-Maximization) algorithm. A 2-component mixture model, with a cell component and a background component, resulted in estimates for means, standard deviations, and component fractions significantly different from the actual data. Thus we reasoned that the edge pixels, whose intensities span a wide range of values, form their own third distribution. The 3-component model is illustrated in Figure 2 for the second and fifth cells of Figure 1 and better reflects the actual data. We provide the equations for fitting the 3-component model (described in general terms in [12] [13]) in [14]. As illumination is increased,
Fig. 2. Normalized curves of the 3 components of pixel intensities, as an example, for the second and fifth cells of Figure 1, background(blue), edge(green), and cell(purple)
the overall range of pixel intensities increases, giving a better separation of the background and edge peaks. At high illumination, however, the center of the cell is flooded with light, and the edge distribution is very thin and pushed back towards the background distribution. In the discussion below, xp and sp denote the estimated mean and standard deviation, respectively, of the intensities from pixels of type p from the EM algorithm, where p corresponds to B (background pixels), E (edge pixels), or C (cell interior pixels). We examined 16 sets of data in detail. For each set of imaging conditions in Table 1, we find a consistent background distribution. Data for the background
mean, xB, and standard deviation, sB, as well as for the edge pixel mean, xE, and standard deviation, sE, for each image taken at low illumination with the optimal filter (set 1 in Table 1), are given in Table 2.

Table 2. Background and edge values for all low illumination, optimal filter images

Set    Cell number   Background Mean, xB   Background SD, sB   Edge Mean, xE   Edge SD, sE
1      1             115                   4                   141             19
2      5             119                   4                   193             42
3      2             117                   4                   178             36
4      1             115                   4                   188             50
5      2             116                   4                   238             85
6      1             119                   4                   195             40
7      1             119                   4                   202             38
8      2             116                   3                   131             12
9      1             119                   5                   152             24
10     1             113                   4                   156             33
11     1             115                   4                   141             20
12     1             116                   3                   158             31
13     1             113                   3                   152             25
14     1             117                   5                   190             59
15     1             114                   4                   178             48
16     1             117                   4                   171             31
Mean                 116                   3.94                173             37
SD                   0.51                  0.14                7.0             4.45
N.B. Pooled sB = 3.98 (mean(sB) = 3.94); pooled sE = 40.87 (mean(sE) = 37.06)

There is less variation in the background than the edge distributions. A single cell chosen from each image provides the data for Table 2, using the first cell of each image unless a particular cell in the image was used consistently in another part of this analysis.
For each pixel intensity between xB and xE (i.e., between the mean of the background distribution and the mean of the edge distribution), we also calculate an average value for the magnitude of the gradient at that intensity. We look at the gridded data within a bounding box on the image that contains a single cell. At each pixel location, pij, with pixel intensity, Iij, we find a local derivative of that intensity and its magnitude, using a Sobel mask. In the i direction, a numerical estimate for the directional derivative at each pij is:

∂I/∂i|pij = [2·I(i+1)j − 2·I(i−1)j + I(i+1)(j−1) − I(i−1)(j−1) + I(i+1)(j+1) − I(i−1)(j+1)]/8.    (1)

In the j direction, we find the numerical estimate at each pij:

∂I/∂j|pij = [2·Ii(j+1) − 2·Ii(j−1) + I(i−1)(j+1) − I(i−1)(j−1) + I(i+1)(j+1) − I(i+1)(j−1)]/8.    (2)

The total derivative gives the magnitude of the gradient of the intensity at pij, which we denote by g(i, j):

g(i, j) ≡ dI(i, j) = √( (∂I/∂i)²|pij + (∂I/∂j)²|pij ).    (3)

We find a gradient magnitude value g(i, j) at each location pij. The average value of the gradient magnitude at those locations where the measured intensity level is a given value, say I*, is given by:

G(I*) = ave{ g(i, j) : I(i, j) = I* }.    (4)

If, for example, a particular intensity value, Iv, occurs 3 times within the bounding box, at positions pab, pcd, and pef, the average magnitude of the gradient at intensity Iv is:

G(Iv) ≡ Gv = (gab + gcd + gef)/3.    (5)

We will refer to this average gradient function as G(I), where I is an intensity in the region of the edge. We use G(I) as a way of smoothing this otherwise noisy estimate of the gradient curve. Values of I near the edge likely lie between xB and xE + 2sE. We denote by A the intensity for which G(I) is largest for xB ≤ I ≤ xE; i.e., A = argmax_{xB ≤ I ≤ xE} G(I). In Figure 3, we show the locations of the pixels, pij, whose intensities are equal to Intensity A. We see that these pixels lie within one or two pixel lengths of the apparent edge of the cell, which we see in accompanying images of more detailed contouring of the pixel intensities.
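As a concrete illustration of Eqs. (1)–(5), the following NumPy sketch computes the Sobel-type directional derivatives, the gradient magnitude g(i, j), and the average gradient magnitude G(I) for each integer intensity in a given range. The rounding of intensities to integers and the handling of image borders are assumptions made here, not choices taken from the paper.

```python
import numpy as np

def average_gradient_per_intensity(img, lo, hi):
    """Average Sobel gradient magnitude G(I) for integer intensities I in [lo, hi],
    following Eqs. (1)-(5); `img` is the bounding box around a single cell."""
    I = img.astype(float)
    # Directional derivatives at interior pixels, Eqs. (1) and (2).
    di = (2 * I[2:, 1:-1] - 2 * I[:-2, 1:-1]
          + I[2:, :-2] - I[:-2, :-2] + I[2:, 2:] - I[:-2, 2:]) / 8.0
    dj = (2 * I[1:-1, 2:] - 2 * I[1:-1, :-2]
          + I[:-2, 2:] - I[:-2, :-2] + I[2:, 2:] - I[2:, :-2]) / 8.0
    g = np.sqrt(di**2 + dj**2)                 # gradient magnitude, Eq. (3)
    inner = np.rint(I[1:-1, 1:-1])             # intensities at the same locations
    G = {}
    for v in range(int(lo), int(hi) + 1):      # per-intensity averages, Eqs. (4)-(5)
        sel = inner == v
        if sel.any():
            G[v] = float(g[sel].mean())
    return G
```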
Fig. 3. Cell 2, set 1, imaging conditions 1-5: pixel intensities above Intensity A are red, Intensity A (+/- 1 unit), green, and below A, blue. Green pixels fall close to the apparent cell edge. Below, each cell is color-coded in 40 equally-spaced pixel ranges.
The average gradient magnitude curve, G(I) vs. intensity I, reflects two cell characteristics. The initial slope of the curve directly indicates the sharpness of the edge: the steeper the slope, the sharper the edge. The shape of this curve also indicates a feature of the cell edges. For very sharp edges, the average gradient magnitude increases monotonically across the range from the mean of the background, xB , to the mean of the edge distribution, xE , as is seen for all high illumination images (type 3 in Table 1; rightmost curve in Fig 4a). For the low and medium illumination images, the gradient increases from xB , reaches a maximum, and then falls. The graphs for the gradient magnitude for cell 5, image set 2, under low, medium, and high illumination with the optimal filter are shown in Figure 4a, and for low illumination for both filters in Figure 4b. For the type 3, high illumination images, the gradient magnitude continually increases between xB and xE , so we set the maximum gradient value in this region, Intensity A, to be the magnitude of the gradient at xE . The edge distribution is very close to the background distribution for these images, and we do not expect a rise and fall in the plot of the gradient magnitude within this small part of the curve (see the purple curve of Figure 4).
Fig. 4. Cell 5, set 2: a.) Averaged magnitudes of the pixel gradient for imaging conditions 1 (red), 2 (blue), 3 (purple); b.) for imaging conditions 1 (red), 4 (blue)
Fig. 5. The mask for cell 5 of set 2, and enlarged pictures of the lower right hand marked section, in order of decreasing quality index, imaging conditions 3,2,1,5,4
We measure the quality of an image in terms of an index that measures how rapidly the pixel intensities fall from the intensity at the maximum gradient at the edge (Intensity A) to the background mean value (xB). In particular, we calculate the expected fraction of this range that should lie on the image within two physical pixel lengths of intensity A = argmax_{xB ≤ I ≤ xE} G(I). Consider the set of images from cell 5 in set 2, which yielded the gradient curves in Figure 4. The intensity range between xB and Intensity A is divided into 10 equal-pixel-length ranges, and each range is displayed in Figure 5, which shows enlarged pictures of the lower right corner of each of the five images in the region containing the cell pseudopodia. This example is chosen because the pseudopodia are connected to the rest of the cell over a very narrow region, of approximately the same size as the boundary edge region we use to define the quality measure. From the magnitude of the gradient, we calculate the expected fraction of the cell edge that lies within two pixel lengths of the maximum gradient at the edge. We use the following sequence of steps (a code sketch of these steps is given after this paragraph):

1. Find the 3-component pixel intensity distribution; denote the means of the components by xB, xE, xC.
2. Find the average gradient magnitude at each intensity between xB and xE.
3. Smooth the gradient in this region to fill in any gaps, and denote the resulting function by G(Intensity).
4. Find the intensity, Intensity A, at which the smoothed gradient magnitude is maximized.
5. G(A) represents the maximum change in intensity among the edge pixels: find Intensity B, the intensity expected one pixel unit away from A; i.e., B = A − G(A)·(1 pixel unit).
6. Find Intensity C, the intensity expected one pixel unit away from B; i.e., C = B − G(B)·(1 pixel unit).
7. Compute the quality index as QI = (A − C)/(A − xB).

We perform the above calculations for the five different images of cell 5, image set 2. The results of each of the steps are given in Table 3. The quality index appears to describe how well the very thinnest geometry of the cell, the place where the pseudopodia are attached, appears in the image. For the optimal filter images, increasing the illumination increases the range of pixel intensities in the image, separating the background and edge distributions from each other and from the pixel intensity distribution that represents the cell and its pseudopodia.
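A minimal sketch of steps 2–7, assuming the 3-component mixture of step 1 has already been fit (e.g., with an EM implementation such as sklearn.mixture.GaussianMixture) and that G comes from the per-intensity gradient sketch above; rounding B to the nearest tabulated intensity when evaluating G(B) is an assumption made here, not a detail taken from the paper.

```python
def quality_index(G, x_B, x_E):
    """Quality index QI = (A - C) / (A - x_B) from steps 2-7 above;
    x_B and x_E are the background and edge component means (step 1)."""
    lo, hi = int(round(x_B)), int(round(x_E))
    # Step 3: simple gap filling by carrying the last known value forward.
    smoothed, last = {}, 0.0
    for v in range(lo, hi + 1):
        last = G.get(v, last)
        smoothed[v] = last
    A = max(smoothed, key=smoothed.get)      # Step 4: argmax of the smoothed gradient
    B = A - smoothed[A]                      # Step 5: one pixel unit away from A
    bI = min(max(int(round(B)), lo), hi)     # nearest tabulated intensity for G(B)
    C = B - smoothed[bI]                     # Step 6: one pixel unit away from B
    return (A - C) / (A - x_B)               # Step 7
```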
Table 3. Quality calculation for cell 5 of image set 2 for the five test images

Type   xB    A     G(A)     B     G(B)    C     Quality
1      119   146   16.06    130   7.91    122   0.852
2      165   319   109.13   210   45.23   165   1.000
3      321   799   480.50   319   5.38    314   1.015
4      192   222   7.87     215   7.02    208   0.467
5      524   692   39.92    653   33.67   620   0.429
The sharper the edge, the higher the quality index. The thin connection (2 to 3 pixel lengths) between the main part of the cell and the pseudopodia is clearly seen for the medium and high illumination figures, and pixel intensities between the two parts of the pseudopodia decrease to levels near xB. For the images taken with the non-optimal filter, the connection between the main part of the cell and the pseudopodia is lost in the images, and the pixel intensities between the two parts of the pseudopodia do not approach xB. It is clear from these figures that the decrease in pixel intensity between the inside of the cell and the background covers much more than two pixel lengths. In the medium illumination, non-optimal filter image, type 5, much less noise appears than in the corresponding low illumination image, type 4, but the edge of the cell covers, on average, about 4 to 5 pixel lengths, so no connection appears between the lower portion of the pseudopodia and the main portion of the cell.
4 Data Creation
To compare cell segmentation techniques with one another, we use the quality index we defined above to create sets of test images over a specified range of qualities. We assume that a given segmentation method will have some quality range over which the segmentation provides a given measure of accuracy. A quality index permits a quick assessment of the accuracy of the cell area from a particular segmentation method. To illustrate the creation of sets of synthetic cells with a specified range of qualities, we describe and show a set created from a ground truth mask of one of the cells in the 16 sets described above, cell 2 from set 10. The five images of the cell of interest from that set have qualities equal to 0.842, 0.910, 1.137, 0.563, and 0.636. The images of those cells are shown in Figure 6. To test our method, we will construct a cell with pixel distributions and gradients that correspond with the second image, and compare its quality index to the index for that cell, 0.910. We now construct the new cell from the mask, such that its distribution has three components whose means and standard deviations agree with those of the original cell. Then we use the other characteristics of that cell, namely the intensity of maximum gradient at the edge and the cell quality, to compare with the new cell to ensure a match between the set values and the calculated values on the synthetic image. To begin, we read in the mask from Figure 6, and assign each pixel of the new image to be either
Fig. 6. The five different images of the cell whose mask is used for the cell synthesis and the corresponding mask
on the cell boundary, inside, or outside of the cell based on the mask. We build our cell edge starting from those pixels at the edge of the mask. We assemble a pixel distribution for the new cell using the three components of the original cell. For cell 2 of image 10, the type 2 image, xB, xE, and xC respectively are 132, 225, and 826, and the sB, sE, and sC values are 5, 80, and 817. The fractional components are 0.800, 0.103, and 0.096. We calculate the expected number of pixels at each intensity by adding the contributions of each component at each intensity. If the total number of pixels in this region is T, then the expected number of pixels with intensity I is given by:

pB = 0.800T;  pE = 0.103T;  pC = 0.096T;    (6)

#pixels(I) = pB φ(I; xB, sB) + pE φ(I; xE, sE) + pC φ(I; xC, sC),    (7)
where φ(I; μ, σ) denotes the Gaussian density function with mean μ and standard deviation σ at a particular value of the intensity, I. For each pixel intensity, we compute the difference between the raw data and the data computed from the 3-component distribution above to determine the standard deviation in the raw data. We compute the standard deviation over different pixel intensity ranges. We collect two sets of pixel intensities from the 3-component distribution. One set contains intensities at or below Intensity A, the intensity where the gradient is maximized. One set lies above Intensity A. We modify each pixel intensity by changing it to its Gaussian variate determined by N(xp, sp²), where p corresponds to B (background), E (edge), or C (cell interior) pixels. The pixels in the first set are placed outside of the mask according to their distances from the edge of the mask. The pixels in the second set are placed inside of the mask in the same way. Figure 7b shows the resulting new cell alongside the original cell, 7a, from which it is made. We then test the new image by recalculating the 3-component analysis, pixel intensity gradient, and quality to ensure that it retained the same features. The 3-component analysis is only slightly different, as would be expected by imposing random noise: xB = 134, xE = 214, and xC = 770. The intensity of maximum gradient at the edge occurs at Intensity A = 181, and the quality is 0.915. The new cell has the same basic features as the original. Now we can change some of the characteristics of this synthetic cell and monitor the corresponding change in quality. First, we create a set of cells in the same way as the cell above, but change the standard deviation of the background
Fig. 7. a.) The original cell imaged with medium illumination, optimal filter. b.) Synthesized cell of the same type as (a), made as a test case. c,d.) Two cells identical except for the spread of the background pixels: standard deviations are 10 and 12 compared to original 5. e,f) Two cells identical except that the edge distribution has been shifted from a mean of 225 to a mean of 215 and 205.
Fig. 8. Original cell imaged with medium illumination, non-optimal filter; cell synthesized with the same characteristics, starting from the ground truth mask
peak. The original background peak has a standard deviation of 5, and we create two new cells with backgrounds of mean xB = 132 as before, but with standard deviations, sB equal to 10 and 12 instead of 5 (see Figure 7c,d). The quality indices of these new cells change from 0.910 to 0.766 and 0.667, respectively. Next we modify the edge distribution without changing the magnitude of the pixel intensity gradient. We shift the value of xE from its original value of 225 to either 215 or 205, so that the modified edge distribution overlaps more closely with the background distribution. The results for these two examples are shown in Figure 7e,f. The image qualities for these two examples change from 0.910 to 0.878 and 0.825, respectively. As one would expect, the quality index decreases as the overlap between the background and edge distributions increases. We can use these techniques to look at the effect of using a less optimal filter for cell imaging. We focus on the type 5 images from Table 1 of this same cell. If we try to reconstruct this cell, using the original cell’s 3-component mean and standard deviation values, the same pixel gradient, and the same value of pixel intensity for the maximum gradient at the cell edge, we create a cell with basically the same quality as the original cell (0.59 compared with 0.58 for the original), but a different shape, shown in Figure 8. Clearly the precise shape of the cell is influenced by the optical effects introduced with different filter sets, leading to different segmentation results, and hence the masks for the two cells in Figure 8 will be different. Thus this index appears to describe only the clarity of the cell edge, which will differ from the true mask as the quality decreases.
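Referring back to Eqs. (6)–(7), the following small sketch evaluates the expected pixel count per intensity for a synthetic cell. The parameter values in the example call are the ones quoted above for cell 2 of set 10 (type 2 image); the total pixel count T = 10000 is a placeholder, not a value taken from the paper.

```python
import numpy as np

def expected_pixel_counts(intensities, T, means, sds, fracs):
    """Expected number of pixels at each intensity, Eq. (7), from the
    3-component model with fractions `fracs` = (background, edge, cell)."""
    def phi(I, mu, sigma):                       # Gaussian density
        return np.exp(-0.5 * ((I - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    I = np.asarray(intensities, dtype=float)
    counts = np.zeros_like(I)
    for mu, sigma, f in zip(means, sds, fracs):  # pB, pE, pC = f * T, Eq. (6)
        counts += f * T * phi(I, mu, sigma)
    return counts

# Example with the quoted parameters for cell 2 of set 10 (type 2 image).
counts = expected_pixel_counts(np.arange(100, 1200), T=10000,
                               means=(132, 225, 826), sds=(5, 80, 817),
                               fracs=(0.800, 0.103, 0.096))
```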
5 Conclusions and Future Work
The features of a biological cell image are complex. The accuracy of a cell segmentation by any given method can depend on many different factors: the magnitude of the pixel gradient at the cell edges, the background pixel intensity distribution, the overlap between the background pixel intensity distribution and the edge pixel intensity distribution, and the spreads of these distributions. It can also depend upon other geometric features of a cell, such as the roundness of the cell and the jaggedness of its edges. The goal of this quality index is to provide an indication of the segmentation method that should be used to yield the highest accuracy in derived cell quantities (e.g., area, perimeter, etc.). Ideally, we would like to be able to distinguish between those segmentation methods that over- or under-estimate cell area and those that provide unbiased estimates of cell area, based on the quality of the edge of the cell. Because our quality index as defined here measures only edge clarity, we may need to modify it to better capture cell image features. Our future work includes a more thorough investigation of the properties of this quality index, particularly its relation to the accuracy of cell area. We plan to create sets of images with a very wide range of quality indices and cell geometries, to use as test images for a wide variety of segmentation methods. From them we hope to use this quality index (possibly modified) to choose segmentation techniques for very rapid analysis of large numbers of biological images.
Acknowledgements Kafadar’s support from Army Research Office Grant Number W911NF-05-10490 awarded to University of Colorado-Denver, and National Science Foundation Grant Number DMS 0802295 awarded to Indiana University is gratefully acknowledged. James Filliben of NIST’s Statistical Engineering Division and John Elliott of NIST’s Biochemical Science Division (BSD) designed the experiment that John performed to produce the cell images used in this study. Anne Plant and Michael Halter, of BSD, provided helpful discussion and input.
References 1. Abraham, V.C., Taylor, D.L., Haskins, J.R.: High content screening applied to large-scale cell biology. Trends in Biotechnology 22, 15–22 (2004) 2. Cox, G.C.: Optical Imaging Techniques in Cell Biology. CRC Press, Boca Raton (2007) 3. Giuliano, K.A., Haskins, J.R., Taylor, D.L.: Advances in High Content Screening for Drug Discovery. Assay and Drug Development Technologies 1(4), 565–577 (2003) 4. Zhou, X., Wong, S.T.C.: High content cellular imaging for drug development. IEEE Signal Processing Magazine 23(2), 170–174 (2006) 5. Zhou, X., Wong, S.T.C.: Informatics Challenges of High-Throughput Microscopy. IEEE Signal Processing Magazine 23(3), 63–72 (2006)
6. Hill, A.A., LaPan, P., Li, Y., Haney, S.: Impact of image segmentation on highcontent screening data quality for SK-BR-3 cells. BioMed Central Bioinformatics 8, 340 (2007) 7. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004) 8. Shnayderman, A., Gusev, A., Eskicioglu, A.M.: An SVD-based grayscale image quality measure for local and global assessment. IEEE Transactions on Image Processing 15(2), 422–429 (2006) 9. Wang, Z., Bovik, A.C.: A universal image quality index. IEEE Signal Processing Lett (March 2002) 10. Ling, H., Bovik, A.C.: Smoothing low-SNR molecular images via anisotropic median-diffusion. IEEE Transactions on Medical Imaging 21(4), 377–384 (2002) 11. Elliott, J.T., Woodward, J.T., Langenbach, K.J., Tona, A., Jones, P.L., Plant, A.L.: Vascular smooth muscle cell response on thin films of collagen. Matrix Biol. 24(7), 489–502 (2005) 12. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, And Prediction. Springer, New York (2001) 13. Scott, D.W.: Parametric Statistical Modeling by Minimum Integrated Square Error. Technometrics 43, 274–285 (2001) 14. Peskin, A.P., Kafadar, K., Santos, A.M., Haemer, G.G.: Robust Volume Calculations of Tumors of Various Sizes. In: 2009 International Conference on Image Processing. Computer Vision, and Pattern Recognition (2009)
Fast Reconstruction Method for Diffraction Imaging

Eliyahu Osherovich, Michael Zibulevsky, and Irad Yavneh

Technion — Israel Institute of Technology, Computer Science Department
{oeli,mzib,irad}@cs.technion.ac.il
Abstract. We present a fast image reconstruction method for two- and three-dimensional diffraction imaging. Provided that very little information about the phase is available, the method demonstrates convergence rates that are several orders of magnitude faster than current reconstruction techniques. Unlike current methods, our approach is based on convex optimization. Besides fast convergence, our method allows a great deal of flexibility in choosing the most appropriate objective function as well as in introducing additional information about the sought signal, e.g., smoothness. The benefits of a good choice of the objective function are demonstrated by reconstructing an image from noisy data.
1 Introduction
Determination of the structure of molecules is an important task in a variety of disciplines, including biology, chemistry, medicine, and others. Currently, one of the major techniques for such imaging is X-ray crystallography. However, its applicability is limited to cases where good quality crystals are available. Unfortunately, biological specimens such as cells, viruses, and many important macromolecules are difficult to crystallize. One of the promising alternatives for obtaining the structure of single biomolecules or cells is the rapidly developing technique of X-ray diffraction microscopy, also known as coherent diffraction imaging (CDI) [1]. In CDI a highly coherent beam of X-rays is scattered by electrons of the specimen, generating a diffraction pattern which, after being captured by a CCD sensor, is used to reconstruct the electron density of the specimen. It can be shown that under certain conditions the diffracted wavefront is approximately equal (within a scale factor) to the Fourier transform (FT) of the specimen. However, due to the physical nature of the sensor, the phase of the diffracted wave is lost. Hence, the problem is equivalent to reconstruction of an image (denoted by x) from the magnitude of its FT (denoted by r). The same problem is met in astronomy and X-ray crystallography, where it is often called the phase (retrieval) problem. In its classical form the phase retrieval problem assumes that the phase information is lost completely. However, current reconstruction methods do require additional information about the specimen support. Therefore, they either obtain the support information from other sources, e.g., low-resolution images, or
attempt to reconstruct the support along with the phase retrieval using some sort of bootstrapping technique [2]. In this work we consider a situation where the additional information available consists of phase uncertainty limits. Discussion of possible ways to obtain such information is postponed until Sect. 5. For the moment we assume that a rough estimate of the phase is available, i.e., the phase uncertainty is slightly less than π radians. For this case we present an efficient solution method based on convex optimization. The rest of the paper is organized as follows. In Sect. 2 we give a short overview of previous work. This overview is divided into two main subsections: the theoretical foundations are given in Sect. 2.1, where we chiefly consider the uniqueness of the solution, and current reconstruction methods are reviewed in Sect. 2.2. In Sect. 3 we present our convex optimization procedure. Quantitative results of our simulations are presented in Sect. 4, followed by discussion and conclusions in Sect. 5.
2 Previous Work
Since a very important part of the information is lost, it is not obvious that the image can be reconstructed from the FT magnitude alone. Therefore, several research efforts were devoted to the theoretical and practical aspects of the problem. In the subsequent sections we give a short review of the major results and methods.

2.1 Theoretical Foundations
It has been shown [3] that, under mild restrictions, a finite-support one-dimensional or multidimensional image is uniquely specified to within a scale factor by the tangent of its FT phase. The FT magnitude, in contrast, does not uniquely specify a one-dimensional image. For multidimensional images, the FT magnitude is almost always sufficient to specify the image to within a trivial transformation. Namely, the set of images for which it is insufficient is of measure zero [3]. By the trivial transformations we mean a combination of coordinate reversal, sign change, and translation. As follows from the above results, the uniqueness properties of the reconstruction of one-dimensional and multidimensional images are quite distinct. In addition, the phase and the magnitude of the FT contribute differently to the uniqueness of reconstruction. By combining the FT magnitude with some phase information we can further restrict the uncertainty in the reconstruction result. It has been shown [4] that, under mild restrictions, the signed FT magnitude, i.e., the magnitude plus one bit of phase information per wavelength (i.e., a phase uncertainty of π radians), is sufficient to specify one-dimensional and multidimensional images uniquely.

2.2 Current Reconstruction Methods
Current reconstruction methods date back to the pioneering work of Gerchberg and Saxton (GS) [5]. Later, their method was significantly improved by
Fig. 1. Schematic of the Gerchberg-Saxton algorithm: x is transformed by F to x̂, the Fourier domain constraints are satisfied, F⁻¹ is applied, and the object domain constraints are satisfied
Fienup [6] who suggested the Hybrid Input Output (HIO) algorithm. Since the latter algorithm (HIO) is significantly better than the GS method, we limit ourselves to it in our simulations and comparisons. The GS algorithm utilizes alternating projections on the constraint sets. The algorithm performs iterative Fourier transformations back and forth between the object and Fourier domains. In each domain it applies the known data (constraints). In our case, it is the modulus of a complex number in the Fourier domain and non-negativity in the object domain. A schematic of the GS algorithm is given in Fig. 1. The HIO algorithm is identical to the GS method in the Fourier domain; however, their behavior in the object domain differs. In the GS algorithm we have

x_{k+1}(t) = x'_k(t) if t ∉ ∨, and 0 if t ∈ ∨,    (1)

while in the HIO the update rule is given by

x_{k+1}(t) = x'_k(t) if t ∉ ∨, and x_k(t) − β x'_k(t) if t ∈ ∨,    (2)

where β ∈ [0, 1], ∨ is the set of points at which x'_k(t) violates the object domain constraints (positivity), and x'_k is obtained from x_k by enforcing the FT magnitude as shown in Fig. 1. It has long been known that the GS algorithm is equivalent to the gradient projection method with constant step length for the problem defined by (3) [6]. Recently, it has been shown that in this case the gradient is an eigenvector of the Hessian with the corresponding eigenvalue equal to two [7]. Thus, the GS method is, in fact, the Newton method combined with projections. To our knowledge, Fienup's HIO algorithm has no such counterpart in the realm of convex optimization. Despite the remarkable success of these two algorithms, they have several drawbacks. First, they require a good estimation of the support. Second, their flexibility is limited, e.g., it seems that the algorithms cannot be adapted to a particular noise distribution in the measurements nor can additional prior knowledge be easily integrated into the scheme.
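For reference, here is a minimal NumPy sketch of one possible implementation of the HIO update rule of Eq. (2), with positivity as the only object-domain constraint (as in the text). The FFT normalization convention, the value of beta, and the iteration count are assumptions of this sketch, not parameters reported by the authors.

```python
import numpy as np

def hio(r, x0, beta=0.9, n_iter=200):
    """Hybrid Input-Output iteration, Eq. (2). `r` is the measured Fourier
    magnitude, `x0` a real-valued initial guess of the same shape."""
    x = x0.copy()
    for _ in range(n_iter):
        X = np.fft.fftn(x)
        X = r * np.exp(1j * np.angle(X))   # Fourier domain: enforce the magnitude r
        xp = np.real(np.fft.ifftn(X))      # x'_k in the text
        bad = xp < 0                       # the violation set (positivity constraint)
        x_next = xp.copy()
        x_next[bad] = x[bad] - beta * xp[bad]
        x = x_next
    return x
```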
3 Convex Optimization Methods
First, we must clarify that the problem is not convex. Thus, the term "convex optimization" is probably misleading. We do perform a convex relaxation at some stage and it turns out to be crucial for successful reconstruction. However, the problem itself remains non-convex. Hence, in this paper by "convex optimization methods" we mean the classical gradient and Newton-type methods. Let us start by formulating the optimization problem. A very common formulation is as follows:

min ‖ |Fx| − r ‖²   s.t. x ≥ 0,    (3)

where F denotes the FT operator (a matrix in the discrete case). Of course, there is an endless number of ways to choose the objective function. A particular choice may affect the convergence speed and numerical stability. However, in our view, it is more important to choose the objective function that properly reflects the underlying physical phenomena. For example, the choice of Equation (3) is suitable when the measured quantity is r and the noise in the measurements has a Gaussian distribution with zero mean. In practice, however, we measure r² and not r, and the noise distribution is Poissonian rather than Gaussian. The corresponding objective function and its influence on the reconstruction quality are shown in Sect. 4. Unfortunately, the phase retrieval problem turns out to be particularly tough for convex optimization methods. For example, it was reported that naive application of Newton-type algorithms to the problem formulated by Equation (3) fails, except for tiny problems [8]. This failure is due to the high non-linearity and non-convexity of the problem. To our knowledge, no globally convergent convex optimization method exists at the moment. Despite this general failure, we demonstrate in this work that the situation can change dramatically if additional phase information is available. Let us consider one pixel in the Fourier domain. If the phase is known to lie within a certain interval [α, β], the correct complex number must belong to the arc defined by α and β as depicted in Fig. 2a. Despite this additional information, the problem still remains non-convex and cannot be solved efficiently. However, we perform a convex relaxation. That is, we relax our requirements on the modulus and let the complex number lie in the convex region C as shown in Fig. 2b. The problem now becomes convex. Its formal definition is as follows:

min d(Fx, C)²   s.t. x ≥ 0,    (4)

where d(a, C) denotes the Euclidean distance from point a to the convex set C. In our experience, several dozen iterations are sufficient to solve this convex problem (see Fig. 5a). Of course, the solution does not match the original image because both the phase and magnitude may vary significantly. However, we suggest the following method for the solution of the original problem. Stage 1:
Fig. 2. Convex relaxation: (a) phase uncertainty — the complex number lies on the arc in the z-plane between the phase bounds α and β; (b) the convex region C
starting at a random x0, we solve the problem defined by Equation (4). Stage 2: the solution obtained in Stage 1 (denoted by x1) is used as the starting point for the minimization problem that combines both the convex and non-convex parts, as defined below:

min ‖ |Fx| − r ‖² + d(Fx, C)²   s.t. x ≥ 0.    (5)

More precisely, in our implementation we use the unconstrained minimization formulation, i.e., instead of Equations (4) and (5) we minimize the following functionals:

Ec(x) = d(Fx, C)² + ‖[x]−‖²    (6)

E(x) = ‖ |Fx| − r ‖² + d(Fx, C)² + ‖[x]−‖²    (7)

where [x]− is defined as follows:

[x]− = 0 if x ≥ 0, and x if x < 0.    (8)
Results of our simulations are given in the next section.
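As an illustration of how the unconstrained functional of Eq. (6) can be evaluated and differentiated, here is a hedged NumPy sketch of Ec(x) and its gradient. The per-pixel projection onto the convex region C (which depends on the phase bounds, as in Fig. 2b) is left as a user-supplied function `proj_C`, since its exact construction is not reproduced here; the unnormalized-FFT adjoint factor is an implementation choice of this sketch.

```python
import numpy as np

def Ec(x, proj_C):
    """Stage-1 functional, Eq. (6): E_c(x) = d(Fx, C)^2 + ||[x]_-||^2."""
    X = np.fft.fftn(x)
    d = X - proj_C(X)                  # residual to the nearest point of C
    neg = np.minimum(x, 0.0)           # [x]_-
    return float(np.sum(np.abs(d) ** 2) + np.sum(neg ** 2))

def grad_Ec(x, proj_C):
    """Gradient of E_c for a real image x (adjoint of NumPy's fftn is N * ifftn)."""
    X = np.fft.fftn(x)
    d = X - proj_C(X)
    return 2.0 * x.size * np.real(np.fft.ifftn(d)) + 2.0 * np.minimum(x, 0.0)
```

Such a value/gradient pair is what a limited-memory solver such as SESOP or L-BFGS (as used in Sect. 4) would consume in Stage 1; E(x) of Eq. (7) adds the non-convex magnitude term in the same fashion.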
4 Simulations and Results
Due to the high dimensionality of the problem (especially in the 3D case) we shall limit our choice to the methods that do not keep the Hessian matrix or its approximation. Hence, in our implementation we used a modified version of the SESOP algorithm [9] and the LBFGS method [10]. Both algorithms demonstrated very close results. The main difference is that SESOP guarantees that there are two Fourier transforms per iteration just like in the GS and HIO methods. The LBFGS method, on the other hand, cannot guarantee that. However, in practice the number of the Fourier transforms is very close to that of SESOP and HIO.
Fig. 3. Caffeine molecule: (a) 3D model (PDB), (b) electron density (2D), (c) 3D Fourier modulus, (d) 2D Fourier modulus
The method was tested across a variety of data. In this section we present some of these examples. The first example is a molecule of caffeine whose 3D model, a 2D projection of its electron density, and the corresponding FT modulus are shown in Fig. 3. This information was obtained from a PDB¹ (protein database) file. In addition, we use a "natural" image which represents a class of images with rich texture and tight support. Moreover, it may be easier to estimate the visual reconstruction quality of such images. This image and its Fourier modulus are given in Fig. 4. Note that we assume that a rectilinear sampling is available in the 3D case. In practice, however, the sensors measure a two-dimensional slice of the 3D volume. Provided that a sufficient number of such slices were measured, an interpolation can be used to form a rectilinear array of measurements [11]. However, the slices can be incorporated directly into our minimization scheme. This will be addressed in future work. In our experiments we used a phase uncertainty of 3 radians. The bounds were chosen at random such that the true phase has a uniform distribution inside the interval. The starting point (x0) was also chosen randomly. Of course, there
¹ See http://www.pdb.org for more information.
Fig. 4. A natural image (Lena): (a) original image, (b) Fourier modulus
is an obvious way to make a more educated guess: by choosing the middle of the uncertainty interval. However, our experiments indicate that the starting point has little influence. In all cases the reconstructed images obtained with our method were visually indistinguishable from the original. Therefore, we present the values of Ec(x) and E(x) as defined in Equations (6) and (7) to visualize the progress of the first and the second stages, respectively. The second stage is compared with the HIO algorithm, for which the error term is E(x) without the phase bounds constraint, i.e.,

E_HIO(x) = ‖ |Fx| − r ‖² + ‖[x]−‖².    (9)
The first experiment is as follows. First, we ran 60 iterations of Stage 1. The progress for the different images is shown in Fig. 5a. In the second stage we ran 200 iterations of our algorithm (SESOP) starting at the solution obtained in the previous stage (x1). To compare the convergence rate with the current methods, we ran the HIO algorithm twice. One run started at x0, so that the algorithm is unaware of the additional phase information. The other run started at x1, hence the phase information was made (indirectly) available to the algorithm. The results for 2D and 3D reconstruction of the caffeine molecule are shown in Figs. 5b and 5c, respectively. The results for the natural image are shown in Fig. 5d. It is evident from these results that our method significantly outperforms the HIO algorithm in all experiments. However, its superiority on the "Lena" image is tremendous. In addition to the examples shown in this paper we have studied a number of other examples. Based on our observations we conclude that our algorithm demonstrates a significantly better convergence rate when the interval of phase uncertainty is not too close to π radians. Besides the fast convergence rate, our method allows us to incorporate additional information about the image or the noise distribution in the measurements. Typical noise in photon/electron counting processes is distributed according to
Fig. 5. Reconstruction results: (a) Stage 1 — Ec(x) vs. number of iterations for Caffeine 2D, Caffeine 3D, and Lena; (b) Stage 2 — Caffeine 2D; (c) Stage 2 — Caffeine 3D; (d) Stage 2 — Lena. Panels (b)–(d) plot E(x) vs. number of iterations for HIO starting at x0, HIO starting at x1, and our method starting at x1
the Poisson distribution. In this case the maximum-likelihood criterion implies the functional for the error measure in the Fourier domain given in Equation (10):

E_P(x) = 1ᵀ( |Fx|² − r² ln |Fx|² ).    (10)
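A small sketch of how the Poisson maximum-likelihood functional of Eq. (10) can be evaluated; the epsilon guard against log(0) is an implementation detail added here and is not part of the paper.

```python
import numpy as np

def E_P(x, r):
    """Poisson ML error measure in the Fourier domain, Eq. (10):
    E_P(x) = 1^T ( |Fx|^2 - r^2 ln|Fx|^2 )."""
    m2 = np.abs(np.fft.fftn(x)) ** 2
    eps = 1e-12                        # numerical guard, not in the paper
    return float(np.sum(m2 - r ** 2 * np.log(m2 + eps)))
```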
To demonstrate the performance of our method we contaminated the measurements (r²) of the "Lena" image with Poisson noise such that the signal-to-noise ratio (SNR) was 53.6 dB. The phase uncertainty was 3 radians as before. First, we started by solving the convex problem, as defined by Equation (4). The solution obtained was then used as the starting point for the second stage of our method using the non-convex functional defined in Equation (10). The HIO algorithm also started at this solution. In addition to using the objective function that fits the noise distribution, we also included a regularization term in the object space. In this example, we used the total variation functional [12]

TV(x) = ∫ |∇x|    (11)

with a small weight. Total variation is a good prior for a broad range of images, especially for those that are approximately piecewise constant. In our case, the introduction of this regularization added about 3 dB to the reconstruction SNR. The reconstruction results are shown in Fig. 6. Our method achieved an SNR of
Fig. 6. Reconstruction from noisy data: (a) our method, 30 dB; (b) HIO, 16.7 dB
30 dB, while the HIO algorithm produced a significantly inferior result; its SNR was only 16.7 dB. Note that the SNR in the reconstructed images is worse than the SNR in the data. This is a typical situation for ill-conditioned problems. It is also worth noting that Poisson noise of small intensity can be well approximated by Gaussian noise. However, if one uses the objective function suitable for Gaussian noise in r², i.e.,

‖ |Fx|² − r² ‖²,    (12)

the reconstruction results are a few dB worse than those obtained with the proper choice of the objective function.
5 Discussion and Conclusions
We presented a convex optimization method for the phase retrieval problem when the phase is known to lie within a certain interval. Straightforward incorporation of this information does not lead to a successful reconstruction. Therefore, we designed a two-stage method. In the first stage we perform a convex relaxation and solve the resulting convex problem. In the second stage the original objective function is introduced into the scheme and the reconstruction continues from the solution of the first stage. The algorithm demonstrates a significantly better convergence rate compared to current reconstruction methods. Moreover, in contrast to these methods, our technique is flexible enough to allow incorporation of additional information. Practical examples of such information include the measurement noise distribution and knowledge that the reconstructed image is smooth or piecewise constant. There are several ways to obtain rough phase estimates. The first, and probably most obvious, is to introduce into the scene an object whose Fourier transform is known. In this case the recorded data is the modulus of a sum of two complex numbers, one of which is known. It is easy to show that in this case the phase of the unknown number can be reconstructed by solving two equations with
two unknowns. Since our method allows the phase uncertainty to be as large as 3 radians, the Fourier transform of this additional object needs to be known only approximately. However, we believe that a rough phase estimation is possible without introducing a change into the physical setup. At this time we are working on an algorithm that performs some sort of bootstrapping for simultaneous phase retrieval and phase uncertainty estimation. For example, one could start with one of the current reconstruction methods that are able to reconstruct the image; at some point the phase will be close enough to the true one, and if we manage to identify such a moment we can switch to our algorithm and obtain a faster reconstruction. As a last resort, the whole interval of 2π radians can be split into two or three intervals and each one can be tested separately. In this case, once a correct partitioning is found, our method will recover the image.
References 1. Neutze, R., Wouts, R., van der Spoel, D., Weckert, E., Hajdu, J.: Potential for biomolecular imaging with femtosecond x-ray pulses. Nature 406(6797), 752–757 (2000) 2. Marchesini, S., He, H., Chapman, H.N., Hau-Riege, S.P., Noy, A., Howells, M.R., Weierstall, U., Spence, J.C.H.: X-ray image reconstruction from a diffraction pattern alone. Physical Review B 68(14), 140101 (2003)
3. Hayes, M.: The reconstruction of a multidimensional sequence from the phase or magnitude of its Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 30(2), 140–154 (1982) 4. Hove, P.V., Hayes, M., Lim, J., Oppenheim, A.: Signal reconstruction from signed Fourier transform magnitude. IEEE Transactions on Acoustics, Speech and Signal Processing 31(5), 1286–1293 (1983) 5. Gerchberg, R.W., Saxton, W.O.: A practical algorithm for the determination of phase from image and diffraction plane pictures. Optik 35, 237–246 (1972) 6. Fienup, J.R.: Phase retrieval algorithms: a comparison. Applied Optics 21(15), 2758–2769 (1982) 7. Osherovich, E., Zibulevsky, M., Yavneh, I.: Signal reconstruction from the modulus of its Fourier transform. Technical Report CS-2009-09, Technion (December 2008) 8. Nieto-Vesperinas, M.: A study of the performance of nonlinear least-square optimization methods in the problem of phase retrieval. Journal of Modern Optics 33, 713–722 (1986) 9. Narkiss, G., Zibulevsky, M.: Sequential subspace optimization method for large-scale unconstrained problems. CCIT 559, Technion, EE Department (2005) 10. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Mathematical Programming 45(1), 503–528 (1989) 11. Miao, J., Hodgson, K.O., Sayre, D.: An approach to three-dimensional structures of biomolecules by using single-molecule diffraction images. Proceedings of the National Academy of Sciences of the United States of America 98(12), 6641–6645 (2001) 12. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60(1-4), 259–268 (1992)
Evaluation of Cardiac Ultrasound Data by Bayesian Probability Maps

Mattias Hansson¹, Sami Brandt¹,², Petri Gudmundsson³, and Finn Lindgren⁴

¹ School of Technology, Malmö University, Sweden
[email protected]
² Information Processing Laboratory, Oulu University, Finland
³ Faculty of Health and Society, Malmö University, Sweden
⁴ Centre for Mathematical Sciences, Lund University, Sweden
Abstract. In this paper we present improvements to our Bayesian approach for describing the position distribution of the endocardium in cardiac ultrasound image sequences. The problem is represented as a latent variable model, which represents the inside and outside of the endocardium, for which the posterior density is estimated. We start our construction by assuming a three-component Rayleigh mixture model: for blood, echocardiographic artifacts, and tissue. The Rayleigh distribution has been previously shown to be a suitable model for blood and tissue in cardiac ultrasound images. From the mixture model parameters we build a latent variable model, with two realizations: tissue and endocardium. The model is refined by incorporating priors for spatial and temporal smoothness, in the form of total variation, connectivity, preferred shapes and position, by using the principal components and location distribution of manually segmented training shapes. The posterior density is sampled by a Gibbs method to estimate the expected latent variable image which we call the Bayesian Probability Map, since it describes the probability of pixels being classified as either heart tissue or within the endocardium. By sampling the translation distribution of the latent variables, we improve the convergence rate of the algorithm. Our experiments show promising results indicating the usefulness of the Bayesian Probability Maps for the clinician since, instead of producing a single segmenting curve, it highlights the uncertain areas and suggests possible segmentations.
1 Introduction Echocardiography is more accessible, mobile and inexpensive compared to other imaging techniques and has become a widely used diagnostic method in cardiology in recent years. Unfortunately, ultrasound images suffer from inherent problems which in large part stem from noise, often referred to as speckle contamination. Speckle is the result of interference between echoes, which are produced when the ultrasound beam is reflected from tissue, and has the properties of a random field, see [1, 2]. The use of the Rayleigh distribution in modeling speckle in ultrasonic B-scan images is well-established through early works, such as [1, 3], and more recently in [4]. There is much previous work done in the field of segmentation of cardiac ultrasound images, of which an excellent overview is given in [5]. Here we will only mention
those works which, like our algorithm, treat segmentation of blood and tissue as a pixel-classification or region-based problem. Our model makes a dependency assumption on neighboring pixels via total variation. A similar approach is employed in [6–10], where Markov random field (MRF) regularization is used. Like our model, in [7, 9–11] a Bayesian framework is used, although the construction of the posterior density function is different. Our approach uses priors on location and shape; of the aforementioned, only one work [9] uses a shape prior. Also in [9] probabilistic pixel class prediction is used, which is reminiscent of the proposed Bayesian Probability Maps (BPM). In this paper, we present improvements to our method [12] for determining the position of the endocardium in ultrasound sequences. Information about position may be used for determining ejection fraction (by comparing systolic and diastolic volume) and assessment of regional wall abnormalities of the heart; measures used in the diagnosis of ischaemic heart disease. The problem is represented as a latent variable model, which represents the inside and outside of the endocardium. The method uses priors for spatial and temporal smoothness, in the form of total variation, connectivity, preferred shapes and location, by using the principal components and location distribution of manually segmented training shapes. The main steps of the method are: 1) We assume a three-component Rayleigh mixture model for the pixel intensities (of blood, echocardiographic artifacts, and tissue) and estimate the parameters by expectation maximization. 2) A latent variable model with two realizations, tissue and endocardium, is built using the estimated mixture model parameters. The posterior distribution of the latent variables is then sampled. 3) The mean of the posterior gives us the Bayesian probability map, which describes the position distribution of the endocardium. Instead of giving a single segmenting curve, the certainty of which may vary along the curve, our method provides a more versatile measure. Our method shares some analogy with other region-based methods, but our approach of describing the position of the endocardium as the expected latent variable image and incorporating priors on location, connectivity, shape and smoothness in space and time, is in its construction novel to our knowledge. The improvement of our previous model consists of: 1) the use of a three-component mixture model, improving the sensitivity of the algorithm in distinguishing between tissue and blood. 2) A connectivity prior ensuring that samples are spatially simply connected. 3) An atrium prior, which prevents blood in the atrium being classified as within the endocardium. 4) Sampling of the translation distribution, which improves the convergence of the algorithm.
2 Model Our goal is to determine the position of the endocardium in an ultrasound sequence. To accomplish this we represent the endocardium by the latent variable model with values one and zero for the inside and outside, respectively, and estimate the posterior distribution of the latent variable model

P(u|z, θ) ∝ p(z|u, θ) P(u|θ),   (1)

where u is the vector of latent variables, z represents the image intensities stacked into a single vector and θ are parameters. The Rayleigh distribution has been reported to be
an appropriate model for blood and tissue in cardiac ultrasound images, see [1, 3, 4]. Therefore, to construct the likelihood p(z|u, θ), we assume a Rayleigh mixture model for pixel intensities in the ultrasound images, as described in Section 2.1. In Section 2.2, we construct the prior distribution P(u|θ) by using prior knowledge such as temporal and spatial smoothness, connectivity, shape and location.

2.1 Likelihood

By empirical observation we model the ultrasound data as a three-component mixture model: one for the blood intensities, one for echocardiographic artifacts, and finally one for the tissue. By artifacts we refer to areas with tissue-like intensity caused by cordae, papillary muscles, ribs or a local increase in signal strength. Denoting the intensity value of pixel k in an ultrasound image by z_k, we assume that

p(z_k|θ) = α_1 p_rayl(z_k|σ_1) + α_2 p_rayl(z_k|σ_2) + α_3 p_rayl(z_k|σ_3),   (2)

where θ = {α_i, σ_i; i = 1, 2, 3} are the mixture model parameters and p_rayl(z|σ) = (z/σ) exp(−z²/(2σ)), σ > 0, is the Rayleigh probability density function. In our previous work we employed a two-component mixture model, but we have found that a three-component model better discriminates between tissue and blood, by adding a category for echocardiographic artifacts. Pixels are assumed to be independent in the mixture model. The likelihood is then defined as

p(z|w, θ) = ∏_j ∏_k P(W_j = k | z_j, σ_k)^{δ(w_j − k)},   k = 1, 2, 3,   (3)

where W_j and w_j are the random latent variable and its realization, respectively, corresponding to z_j, and δ denotes the Kronecker delta function. w_j = 1 if x_j is in the blood pool, w_j = 2 if x_j is in an echocardiographic artifact, w_j = 3 if x_j is in the cardiac tissue, and P(W_j = i | z_j, θ) = α_i p_rayl(z_j|σ_i) / Σ_{k=1}^{3} α_k p_rayl(z_j|σ_k), i = 1, 2, 3. We use the parameters θ to build a latent variable model, with only two realizations: tissue (0) and endocardium (1). The likelihood of this model is defined as

p(z|u, θ) = ∏_j P(U_j ∈ endocardium | z_j, σ)^{u_j} P(U_j ∈ tissue | z_j, σ)^{1−u_j},   σ = {σ_1, σ_2, σ_3},   (4)

where U_j and u_j are the random latent variable and its realization, respectively, corresponding to z_j, and P(U_j ∈ tissue | z_j, θ) = α_1 p_rayl(z_j|σ_1) / Σ_{i=1}^{3} α_i p_rayl(z_j|σ_i) and P(U_j ∈ endocardium | z_j, θ) = 1 − P(U_j ∈ tissue | z_j, θ).

2.2 Prior

Our prior model

P(u|θ) = P_B(u|θ) P_{TV|B}(u|θ) P_{shape|B,TV}(u|θ) P_{atrium|B,TV,shape}(u|θ) × P_{con|B,TV,shape,atrium}(u|θ) P_{location|B,TV,shape,atrium,con}(u|θ)   (5)
Fig. 1. The symbol u represents the m latent variable images Iu stacked into a single vector. Each Iu corresponds to an image in the ultrasound sequence of length m.
consists of six components, each of which characterizes a different kind of preferred property. The Bernoulli component P_B is the discrete latent variable distribution following from the Rayleigh mixture model. The total variation P_{TV|B} enforces spatial and temporal smoothness for latent variable images, and possible shape variations around the mean shape are characterized by trained eigenshapes of manually segmented images through P_{shape|B,TV}. The sequence of ultrasound images is divided into subsequences, to take the temporal variations of the endocardium into account, and so for each part of the ultrasound sequence a corresponding set of eigenshapes and mean is used. The atrium contains blood which should not be classified as being within the endocardium. The atrium prior P_{atrium|B,TV,shape} reduces the probability of a pixel being classified as within the endocardium exponentially with distance, from the horizontal position of the ventricle to the bottom of the ultrasound image. Although some blood may still be misclassified, this prior will prevent large misclassifications. The connectivity prior P_{con|B,TV,shape,atrium} enforces that all samples u are spatially simply connected. The location prior P_{location|B,TV,shape,atrium,con} is constructed from the mean of the unregistered binary training shapes. The location prior describes the experimental probability value for each pixel location being either inside or outside of the endocardium, thus allowing only similar latent variable values as observed in the training data. The Bernoulli prior is defined as

P_B(u|θ) ∝ ∏_j α^{u_j} (1 − α)^{1−u_j},   (6)

and is thus a prior on the proportion of zeros and ones in u, with j ∈ {1, ..., N}, where N is the total number of latent variables in u. Let I_u(x; n) be a latent variable image, where x and n are its spatial and temporal coordinates, respectively (see Figure 1). The total variation prior is then given by P_{TV|B}(u|θ) ∝ exp{−λ_TV ||I_u(x; n) ∗ h||_{L1}}, where h is a three-dimensional Laplacian kernel and ∗ denotes convolution. Let I_{u,r}(x; n) be the translationally registered latent variable image corresponding to I_u(x; n), where the center of mass has been shifted to the origin; u_{nr} and ū_{nr} are the corresponding latent variable vectors. The shape prior is defined as
P_{shape|B,TV}(u|θ) ∝ ∏_n exp{−λ_shape (u_{nr} − ū_{nr})^T (C_n + λ_0 I)^{−1} (u_{nr} − ū_{nr})},   (7)
where C_n represents the truncated covariance of the training shapes, whose center of mass has been shifted to the origin, and λ_0 I is the Tikhonov regularizer [13]. The atrium prior is defined as

P_{atrium|B,TV,shape}(u|θ) ∝ ∏_n ∏_x a(x)^{I_u(x; n)},   (8)

where

a(x) = 1, if x_2 > x_a;  a(x) = 1 − exp{(x_2 − x_a) / max_x(x_2 − x_a)}, otherwise,   (9)

and x_a = max_n argmin_{x_2} {I_{u_r}^{train}(x; n) > 0}, x = (x_1, x_2). In every training image I_{u_r}^{train} there is a least x_2-coordinate x_l s.t. I_{u_r}^{train}((x_1, x_l)) > 0; x_a is the largest out of all x_l. This gives an approximate location of the ventricle, where the atrium starts. The connectivity prior is defined as

P_{con|B,TV,shape,atrium}(u|θ) ∝ 1 if u ∈ N, 0 otherwise,   (10)

where N = {u : u is spatially simply connected}. The location prior is defined as

P_{location|B,TV,shape,atrium,con}(u|θ) ∝ ∏_j [h(g ∗ Ī_u^{train})_j]^{u_j},   (11)

where Ī_u^{train} = (1/K) Σ_k I_u^{train,k} is the mean training image and K is the number of training images, g is a Gaussian kernel, and h is the step function s.t. h(t) = 1 for t > 0, otherwise h(t) = 0. This component has the effect that when sampling individual latent variables outside of the (smoothed) mean shape, the result of sampling will be that the latent variable is set to zero. Inside the (unregistered) mean shape the sampling is unaffected. Three parameters control the influence of the priors: λ_TV, λ_shape and λ_0. By increasing λ_TV we can regularize our sampling, while increasing λ_shape makes the influence of the shape prior larger. Finally, λ_0 increases the influence of the mean shape in the formation of the shape prior; this is crucial when segmenting very noisy images that do not respond well to the subtle control of a shape prior with small λ_0.
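As an illustration of how the total-variation component of the prior can be evaluated, the sketch below computes the log of P_{TV|B}(u|θ) up to a constant, i.e. −λ_TV ||I_u ∗ h||_{L1}, for a binary latent-variable volume. It is a minimal sketch; the particular 3-D Laplacian stencil, the λ_TV value and the array layout are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import convolve

def tv_log_prior(I_u, lam_tv=1.0):
    """Log of the total-variation prior (up to an additive constant):
    -lam_tv * || I_u * h ||_L1, with h a 3-D Laplacian kernel.
    I_u is a binary array of shape (rows, cols, frames)."""
    # assumed discretisation of the 3-D Laplacian: -6 at the centre, +1 at the six face neighbours
    h = np.zeros((3, 3, 3))
    h[1, 1, 1] = -6.0
    h[0, 1, 1] = h[2, 1, 1] = 1.0
    h[1, 0, 1] = h[1, 2, 1] = 1.0
    h[1, 1, 0] = h[1, 1, 2] = 1.0
    response = convolve(I_u.astype(float), h, mode="nearest")
    return -lam_tv * np.abs(response).sum()

# toy usage: a small random binary latent-variable volume
I_u = (np.random.rand(32, 32, 5) > 0.5).astype(float)
print(tv_log_prior(I_u, lam_tv=0.75))
```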
3 Algorithm Our algorithm for generating Bayesian Probability Maps can be divided into three parts. First, the mixture model parameters are estimated by the EM algorithm from our ultrasound data; these parameters are used to compute the class posterior probabilities for
[Block diagram: Mixture Parameter Estimation → Sampling of the Posterior → Sample Mean → Bayesian Probability Map]
Fig. 2. Summary of the proposed algorithm to construct the Bayesian probability map
each latent variable after seeing the corresponding image values — these probabilities are used as input in the following step, in constructing the likelihood function that is further transformed to the posterior probabilities in the second step. The posterior is then sampled by Gibbs sampling and the samples are used to compute the Bayesian probability map. The algorithm is summarized in Fig. 2.

3.1 Estimation of Mixture Model Parameters

The complete data likelihood is represented according to the latent variable model as

p(z, w|θ) = ∏_j ∏_i p_rayl(z_j|σ_i)^{δ(w_j − i)},   (12)

where z are the pixel intensity values and w = (w_1, ..., w_N) are interpreted as missing data. The mixture parameters θ = {α_i, σ_i; i = 1, ..., 3} are estimated by Expectation Maximization (EM) [14]. That is, on the E-step, we build the expected complete data log-likelihood, conditioned on the measured data and the previous parameter estimates, or

χ(θ, θ̂^{(n−1)}) = E_{w|z,θ̂^{(n−1)}} {log p(z, w|θ)} = Σ_{j=1}^{N} Σ_{k=1}^{3} P(W_j = k | z_j, θ̂^{(n−1)}) log p_rayl(z_j|θ).   (13)

On the M-step, the expected complete data log-likelihood is maximized to obtain an update for the parameters,

θ̂^{(n)} = argmax_θ χ(θ, θ̂^{(n−1)}),   (14)

and the steps are iterated until convergence.
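As a concrete illustration of the estimation in Sect. 3.1, the sketch below runs EM for a three-component Rayleigh mixture under the parameterisation of Eq. (2), p_rayl(z|σ) = (z/σ) exp(−z²/(2σ)). It is a minimal sketch, not the authors' implementation; the initialisation and the synthetic test data are assumptions.

```python
import numpy as np

def rayleigh_pdf(z, sigma):
    # p_rayl(z | sigma) = (z / sigma) * exp(-z^2 / (2 sigma)), sigma > 0
    return (z / sigma) * np.exp(-z**2 / (2.0 * sigma))

def em_rayleigh_mixture(z, n_components=3, n_iter=100):
    """EM for a Rayleigh mixture; ad hoc initial values, assumed for illustration."""
    z = np.asarray(z, dtype=float)
    alphas = np.full(n_components, 1.0 / n_components)
    sigmas = np.quantile(z, [0.25, 0.5, 0.75])[:n_components] ** 2 + 1e-6
    for _ in range(n_iter):
        # E-step: responsibilities P(W_j = k | z_j, theta)
        comp = np.stack([a * rayleigh_pdf(z, s) for a, s in zip(alphas, sigmas)])
        resp = comp / comp.sum(axis=0, keepdims=True)
        # M-step: closed-form updates for this parameterisation
        nk = resp.sum(axis=1)
        alphas = nk / z.size
        sigmas = (resp * z**2).sum(axis=1) / (2.0 * nk + 1e-12)
    return alphas, sigmas

# toy usage with synthetic intensities drawn from two Rayleigh modes
rng = np.random.default_rng(0)
z = np.concatenate([rng.rayleigh(2.0, 5000), rng.rayleigh(8.0, 5000)])
print(em_rayleigh_mixture(z))
```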
3.2 Sampling of the Posterior

To improve convergence, the sampling of the posterior (1) was performed by alternating between conventional Gibbs sampling [15, 16] and sampling of latent variable image translations. On the Gibbs step, we draw the elements of the sample latent variable vector u from the conditional distribution

P(u_j^{(i)} | u_1^{(i)}, ..., u_{j−1}^{(i)}, u_{j+1}^{(i−1)}, ..., u_N^{(i−1)}) = {P(u_j = k | u_1^{(i)}, ..., u_{j−1}^{(i)}, u_{j+1}^{(i−1)}, ..., u_N^{(i−1)})}_{k=0}^{1},   j = 1, 2, ..., N.   (15)
Then, to obtain the sample vector u^{(i)}, we sample the distribution of translations which spatially move the latent variable image I_u. The details of the translation sampling step are as follows. We want to sample the conditional translation distribution

P(t|u, z, θ) ≡ P(u′|u, z, θ),   (16)

where the latent variable vector u′ is obtained from u by spatially translating the latent variable image I_u by t. Now we may write

P(t|u, z, θ) ∝ ( ∏_{j=1}^{N} p_rayl(z_j|σ_1)^{u′_j} p_rayl(z_j|σ_2)^{1−u′_j} ) p(u′|θ) ∝ ∏_{j=1}^{N} (p_rayl(z_j|σ_1) v_{1,j})^{u′_j} (p_rayl(z_j|σ_2) v_{2,j})^{1−u′_j},   (17)

where we have used the fact that, apart from the location prior, the conditional translation distribution is independent of the priors; the location prior is encoded in the mask vectors v_1 = v and v_2 = 1 − v, where v is the vector corresponding to the matrix g ∗ Ī_u^{train}, cf. (11). It follows that

log P(t|u, z, θ) = Σ_{j=1}^{N} [ u′_j log(p_rayl(z_j|σ_1) v_{1,j}) + (1 − u′_j) log(p_rayl(z_j|σ_2) v_{2,j}) ] + C,   (18)

where C is a constant that does not depend on the translation. The sums above represent correlations between the translated latent variable image and the masked log probability densities. Hence, the logarithms of the conditional translation probabilities can be computed by the correlation theorem, after which we are able to draw the translation sample and finally obtain the sample u^{(i)} = u′. After each iteration the center of mass of each latent variable image is calculated, which determines the area of influence of the shape prior.

3.3 Sample Mean

To characterize the posterior distribution, we compute an estimate of the conditional mean of the latent variable vector over the posterior,

E{u|z, θ} ≈ (1/M) Σ_i u^{(i)} = {P̂(U_k ∈ endocardium)}_{k=1}^{N} ≡ û_CM,   (19)

using the latent variable sample vectors u^{(i)}. By the strong law of large numbers, û_CM → E{u|z, θ} as the number of samples M → ∞. The corresponding image I_{û_CM} represents the Bayesian probability map.
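The translation-sampling step of Sect. 3.2 evaluates Eq. (18) for every candidate shift; as noted above, this can be done with the correlation theorem. The sketch below computes the unnormalised log-probability of all circular 2-D shifts of one latent-variable image by FFT correlation and then draws a shift. The toy inputs standing in for the masked log-densities are assumptions, not the authors' data.

```python
import numpy as np

def translation_log_probs(u_img, score1, score2):
    """Unnormalised log P(t | u, z, theta) over all circular 2-D shifts t,
    following Eq. (18): sum_j u'_j*log(p1*v1) + (1 - u'_j)*log(p2*v2).
    score1/score2 are the per-pixel log terms; u_img is the binary latent image."""
    s = score1 - score2            # only this part depends on the shifted image
    const = score2.sum()           # shift-independent contribution of the (1 - u') term
    corr = np.real(np.fft.ifft2(np.fft.fft2(s) * np.conj(np.fft.fft2(u_img))))
    return const + corr            # entry [ty, tx] = log-prob of shifting u by (ty, tx)

def sample_translation(log_probs, rng):
    p = np.exp(log_probs - log_probs.max())
    p /= p.sum()
    idx = rng.choice(p.size, p=p.ravel())
    return np.unravel_index(idx, p.shape)

# toy usage with random arrays standing in for the masked log-densities
rng = np.random.default_rng(1)
u_img = (rng.random((64, 64)) > 0.8).astype(float)
score1 = rng.normal(size=(64, 64))
score2 = rng.normal(size=(64, 64))
print(sample_translation(translation_log_probs(u_img, score1, score2), rng))
```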
4 Experiments

4.1 Material

The ultrasound data used in this paper consists of cardiac cycles of two-chamber (2C) apical long-axis views of the heart. The sequences were obtained using the echocardiogram machines Philips Sonos 7500, Philips iE33 or GE Vivid 7, from consecutive adult patients referred to the echocardiography laboratory at Malmö University Hospital, Sweden, which has a primary catchment area of 250,000 inhabitants. Expert outlines of the endocardium in the sequences have been provided by the same hospital.

4.2 Initialization

As an initial estimate of the mixture model parameters we set α^{(0)} to the proportion of object pixels in the training images, and σ_1 and σ_2 are set to the maximum likelihood estimate σ̂ = ((1/(2Q)) Σ_{i=1}^{Q} x_i²)^{1/2} of object and background pixels in the training data, where Q is the number of pixels in the training set. The Gibbs sampling algorithm is seeded by a sample obtained by Bayesian classification of the mean of the annotated images for each category of the heart cycle. Prior parameters λ_TV, λ_shape, λ_0 are set manually.

4.3 Evaluation

We divide our data into two sets: a training set and a validation set. The training set consists of 20 cardiac cycles. The training set is further divided into sets corresponding to parts of the cardiac cycle. The validation set consists of 4 different cardiac cycles. As evaluation measure the expected misclassification E_mc of a pixel, w.r.t. the expert outline, is used. Let I_true(x; n) be the ground truth images corresponding to the data z. Then the expected misclassification of a pixel in the examined sequence is given by

E_mc = 1 − (1/N) Σ_n Σ_x [ (1 − I_true(x; n)) P(I_u(x; n) = 0) + I_true(x; n) P(I_u(x; n) = 1) ].   (20)

This measure is needed since it is impossible to present the entire sequence in images. A low E_mc guarantees that the Bayesian Probability Map correctly describes the position of the endocardium in the entire sequence, not just for a few selected images.
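A minimal sketch of evaluating Eq. (20) given a probability map and an expert outline; the array layout (rows, columns, frames) is an assumption.

```python
import numpy as np

def expected_misclassification(bpm, truth):
    """E_mc from Eq. (20): bpm[x, y, n] = P(I_u(x;n) = 1) (the probability map),
    truth[x, y, n] in {0, 1} is the expert outline."""
    correct = (1.0 - truth) * (1.0 - bpm) + truth * bpm
    return 1.0 - correct.mean()

# toy usage: a perfect map gives E_mc = 0, an uninformative map gives 0.5
truth = (np.random.rand(64, 64, 10) > 0.7).astype(float)
print(expected_misclassification(truth, truth))                      # 0.0
print(expected_misclassification(np.full_like(truth, 0.5), truth))   # 0.5
```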
Fig. 3. (Color online) Graph Cut (red) applied to Validation sequence A with expert outline (white)
Fig. 4. (Color online) Validation sequences A (43 frames) and B (40 frames). BPM with overlaid expert outline (white). Systole (A1-2,B1-2) and Diastole (A3-4,B3-4).
Fig. 5. (Color online) Validation sequences C (21 frames) and D (35 frames). BPM with overlaid expert outline (white). Systole (C1-2,D1-2) and Diastole (C3-4,D3-4).
Table 1. Parameters and E_mc for validation sequences

Validation Sequence   λ_TV   λ_shape   λ_0    E_mc
A                     0.75   125       100    0.0335
B                     1.5    60        100    0.0367
C                     4      400       800    0.0507
D                     5      500       1000   0.0569
4.4 Results Figures 4 and 5 contain Bayesian Probability Maps (BPM) formed from 150-200 samples, the approximate number of samples needed to reach a stationary distribution. Running time for sampling is approximately 3 hours for the entire ultrasound sequence on an Intel Xeon 2.33 GHz, 9 GB RAM server. The probability map spans colors from red to blue according to the probability of an area being within the endocardium. Hence, red indicates the highest probability. Table 1 contains the parameter settings and the E_mc for the validation sequences. From each sequence four frames are displayed, two at systole and two at diastole. Sequences A and B have quite modest parameter values, as the underlying estimate of tissue and blood is quite satisfactory and does not need much intervention in the form of priors. Sequences C and D required a large λ_0, due to artifacts in the chamber (in the case of D due to a rib of the patient). Overall these results are superior to those we have previously published. 4.5 Comparison with Graph Cut Method We compare our results with a Graph Cut method as described in [17–19]. In Figure 3 we observe that the Graph Cut method fails to identify the location as clearly as the proposed method for sequence A. For validation sequences B, C and D no results were obtained, since a singular covariance matrix was obtained. This may be attributed to the very noisy nature of these sequences. This comparison is limited and given to show the differences between the proposed method and pure Graph Cut algorithms, as there are some fundamental similarities such as pixel dependencies; similar to the methods described in [6, 7, 9–11], the Graph Cut method uses an MRF for this. However, more complex methods share many similarities with our method, e.g. the one described in [7], which we plan to include in a future comparative study.
5 Conclusion and Future Work We have presented improvements to our approach [12] to cardiac ultrasound segmentation, which consists of modeling the endocardium by latent variables. The latent variable distribution is then sampled, which yields the Bayesian Probability Map describing the location of the endocardium. The improvements consist of a three-component mixture model, a connectivity prior, an atrium prior and sampling of the translation distribution. We plan to introduce a method of estimating the prior parameters, and by this refine our results. We will expand our comparative evaluation with graph-based methods and those akin to the approach proposed by [7].
References 1. Burckhardt, C.B.: Speckle in Ultrasound B-Mode Scans. IEEE Trans. Sonics and Ultrasonics 25, 1–6 (1978) 2. Wagner, R.F., Smith, S.W., Sandrik, J.M., Lopez, H.: Statistics of Speckle in Ultrasound B-Scans. IEEE Trans. Sonics and Ultrasonics 30, 156–163 (1983) 3. Goodman, J.: Laser Speckle and Related Phenomenon. Springer, New York (1975) 4. Tao, Z., Tagare, H., Beaty, J.: Evaluation of four probability distribution models for speckle in clinical cardiac ultrasound images. MedImg 25, 1483–1491 (2006) 5. Noble, J.A., Boukerroui, D.: Ultrasound image segmentation: A survey. IEEE Transactions on Medical Imaging 25, 987–1010 (2006) 6. Friedland, N., Adam, D.: Automatic ventricular cavity boundary detection from sequential ultrasound images using simulated anneal. IEEE Trans. Med. Imag. 8, 344–353 (1989) 7. Boukerroui, D., Baskurt, A., Noble, J.A., Basset, O.: Segmentation of ultrasound images: multiresolution 2d and 3d algorithm based on global and local statistics. Pattern Recogn. Lett. 24, 779–790 (2003) 8. Dias, J.M.B., Leitao, J.M.N.: Wall position and thickness estimation from sequences of echocardiographic images. IEEE Trans. Med. Imag 15, 25–38 (1996) 9. Song, M., Haralick, R., Sheehan, F., Johnson, R.: Integrated surface model optimization for freehand three-dimensional echocardiography. IEEE Transactions on Medical Imaging 21, 1077–1090 (2002) 10. Xiao, G., Brady, J.M., Noble, A.J., Zhang, Y.: Contrast enhancement and segmentation of ultrasound images: a statistical method. Medical Imaging 2000: Image Processing 3979, 1116–1125 (2000) 11. Figueiredo, M., Leitao, J.: Bayesian estimation of ventricular contours in angiographic images. IEEE Transactions on Medical Imaging 11, 416–429 (1992) 12. Hansson, M., Brandt, S., Gudmundsson, P.: Bayesian probability maps for the evaluation of cardiac ultrasound data. In: MICCAI Workshop: Probabilistic Models for Medical Image Analysis (2009) 13. Tikhonov, A.: Solutions of Ill Posed Problems. Vh Winston, Scripta series in mathematics (1977) 14. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood form incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. Ser. B-Stat. Methodol. 39, 1–38 (1977) 15. Geman, S., Geman, D.: Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Trans. Pattern Analysis and Machine Intelligence 6, 721–741 (1984) 16. MacKay, D.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003), http://www.inference.phy.cam.ac.uk/mackay/itila/book.html 17. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence 20, 1222–1239 (2001) 18. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Analysis and Machine Intelligence 26, 147–159 (2004) 19. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms. IEEE Trans. Pattern Analysis and Machine Intelligence 26, 1124–1137 (2004)
Animated Classic Mosaics from Video Yu Liu and Olga Veksler Department of Computer Science, University of Western Ontario London, Ontario, Canada, N6A 5B7
[email protected],
[email protected]
Abstract. Generating artificial classic mosaics from digital images is an area of NPR rendering that has recently seen several successful approaches. A sequence of mosaic images creates a unique and compelling animation style; however, there has been little work in this area. We address the problem of creating animated mosaics directly from real video sequences. As with any animation, the main challenge is to enforce temporal coherency between the frames. For this purpose, we develop a new motion segmentation algorithm. Our algorithm requires only minimal help from the user. We pack the tiles into the discovered coherent motion layers, using color information in all the frames in a global manner. Occlusions and dis-occlusions are handled gracefully. We produce colorful, temporally coherent and uniquely appealing mosaic animations. We believe that our method is the first one to animate classic mosaics directly from video.
1 Introduction
Mosaics are composed of a large number of regularly shaped tiles, such as rectangles and squares, artfully arranged. Simulating classic mosaics from digital images is one of the areas in non-photorealistic rendering that has been widely investigated [1,2,3,4,5]. One of the reasons for the popularity of NPR rendering in computer graphics is that a stylized image can have a more profound impact on the user than the original. This is perhaps even more true of an NPR animation [6,7,8,9]. Creating animated mosaics manually is very labor intensive. However, there has been little work on creating mosaic animations automatically or interactively [10,4]. We develop a system for creating animated mosaics directly from video sequences. Our approach is inspired by [10], who were the first to realize the unique set of challenges for mosaic animation. In many NPR animation methods, in order to facilitate temporal coherence, the primitives are allowed to deform, scale, blend, etc. However, to stay faithful to the classic mosaic style, the tile primitives cannot undergo any such transformations. Each individual frame must be a convincing mosaic, while the whole sequence exhibits a convincing motion. One way to achieve temporal coherency is to displace groups of tiles in a consistent manner. For this purpose we develop a new motion segmentation algorithm with occlusion reasoning. Our algorithm requires minimal help from the user. We pack the tiles into the discovered coherent motion layers, using color information in all the frames in a global manner. Our tile packing algorithm is based on the one for still mosaics [5], with several modifications to address video input. Occlusions are handled gracefully. We produce colorful, temporally coherent and uniquely appealing mosaic animations, see Fig. 1. We believe that our method is the first one to animate classic mosaics directly from video.
Fig. 1. Several frames from the "Walking" sequence, and the corresponding mosaic
2 Related Work
There are several approaches to still classic mosaic rendering from a digital image. In order to obtain a visually pleasing mosaic, most methods agree on the following basic principles. First, mosaic tiles should be placed at orientations that emphasize perceptually important curves in an image. This is usually done by placing the tile sides parallel to the important curves. Which curves are important is often decided through user interaction or edge detection. The second principle is to maximize the number of tiles, while avoiding overlap as much as possible. This, combined with the first principle, implies that the tile orientations should align with important boundaries and vary smoothly in the image, since smoothly varying orientations allow a tighter packing. The last principle is that the tile color distribution should reflect that of the underlying image. There is a variety of techniques for classic mosaics [1,2,11]. All of the above have a number of heuristics steps, and their behavior may be hard to predict and control. We base our animated mosaics on the still mosaic method in [5], which is based on principled global optimization. They formulate an explicit objective function that incorporates the desired mosaic properties. User interaction and explicit edge detection are not required. The main challenge to our animated mosaics is ensuring temporal consistency. Stylizing each frame individually produces disturbing artifacts. Artifacts may be more tolerable in the moving parts of the scene and could be regarded as a special effect. However flickering artifacts are especially pronounced in the static regions of the frame. Therefore most NPR methods seeking to stylize a video have to deal with temporal consistency.
There are roughly two ways to approach temporal consistency. The first group of methods treats a video as a space-time 3D volume [8,4,9]. Rather than directing 2D (flat) primitives in the direction of the scene motion, temporal coherency is achieved by using volumetric (3D) rendering primitives to fill the 3D space-time volume. The advantage is that motion estimation, which is a notoriously hard computer vision problem, is avoided. This approach, however, is harder to adapt to mosaic animation, since the cross sections of the 3D primitives must be valid mosaic tiles. The second group of methods is based on explicit computation of motion, typically based on optical flow [6,7]. The idea is to let the rendering primitive (brush strokes, etc.) follow the motion field so that the primitives appear attached to the scene objects. Our work falls into this second group. We are aware of only two methods [10,4] for mosaic animation. In [4], a moving mosaic is created by packing 3D volumes with temporally repeating animated shapes. This work is very interesting and produces appealing animations; however, it is far from our goal of rendering a real video in a classic mosaic style. Our work is based on [10]. They make an observation that many devices for temporal coherence in NPR animation are based on the changes of primitive rendering units (i.e. scale, blend, etc.), which is not appropriate for classic mosaic animation. They argue that for classic mosaics specifically, one should target coherent motion of groups of primitives. However, in their work they assume that the motion is given by the user. The input to their algorithm is an animated scene represented as a collection of 2D "containers", with known correspondences between containers in adjacent frames. We extend the work of [10] to real video sequences. Thus we must estimate the "containers" and their correspondence.
3 Energy Minimization with Graph Cuts
Many problems in vision and graphics can be stated as labeling problems. In a labeling problem, one has a set of pixels P, which is often the set of all image pixels, and a finite set of labels L. The task is to assign some label l ∈ L to each image pixel. Let f_p denote the label assigned to pixel p, and let f be the collection of all pixel-label assignments. Typically there are two types of constraints on pixels. Unary constraints, denoted by D_p(l), reflect how likely each label l ∈ L is for pixel p. The lower the value of D_p(l), the more likely label l is for pixel p. Usually D_p(l) are modeled from the observed data. Binary constraints, denoted by V_pq(l_1, l_2), express how likely it is for two neighboring pixels p and q to have labels l_1 and l_2, respectively. Binary constraints usually come from prior knowledge about the optimal labeling. An energy function is formulated to measure the quality of f:

E(f) = E_smooth(f) + E_data(f).   (1)
E_data(f) is called the data term, and it sums up the unary constraints: E_data(f) = Σ_{p∈P} D_p(f_p). E_smooth is called the smoothness term, and it sums up the binary constraints:

E_smooth = Σ_{{p,q}∈N} w_pq · V_pq(f_p, f_q).   (2)
In Eq. (2), N is a collection of neighboring pixel pairs, often the standard 4- or 8-connected grid. The choice of V_pq reflects a priori knowledge about the desired labeling. A frequent choice is V_pq(f_p, f_q) = min(K, |f_p − f_q|^C), where K, C are constants. If K = C = 1, the smoothness term corresponds to the famous Potts model. Many commonly used V_pq are NP-hard to optimize, but there are approximations based on graph cuts [12]. We use the min-cut implementation of [13].
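To make Eqs. (1)-(2) concrete, the following sketch evaluates E(f) for a given labeling on a 4-connected grid using the truncated pairwise term min(K, |f_p − f_q|^C); with K = C = 1 this is the Potts model. This only evaluates the energy of a labeling; it is not the graph-cut minimiser of [12,13], and the toy sizes are assumptions.

```python
import numpy as np

def labeling_energy(f, D, K=1.0, C=1.0, w=1.0):
    """E(f) = sum_p D_p(f_p) + sum_{{p,q} in N} w_pq * min(K, |f_p - f_q|^C)
    on a 4-connected grid. f: (H, W) integer labels, D: (H, W, L) unary costs."""
    H, W = f.shape
    data = D[np.arange(H)[:, None], np.arange(W)[None, :], f].sum()
    smooth = 0.0
    # horizontal and vertical neighbour pairs
    smooth += (w * np.minimum(K, np.abs(f[:, 1:] - f[:, :-1]) ** C)).sum()
    smooth += (w * np.minimum(K, np.abs(f[1:, :] - f[:-1, :]) ** C)).sum()
    return data + smooth

# toy usage: 3 labels on a 10x10 grid
rng = np.random.default_rng(2)
f = rng.integers(0, 3, size=(10, 10))
D = rng.random((10, 10, 3))
print(labeling_energy(f, D))
```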
4 Review of Still Classic Mosaic
We now review the static mosaic algorithm of [5]. They design an objective function that incorporates the desired mosaic properties, such as: (i) tiles should align with strong intensity edges; (ii) nearby tiles should have similar orientations; (iii) tiles should avoid crossing strong image edges; (iv) the gap space should be minimal; (v) tiles should not overlap. User interaction and explicit edge detection are avoided. We start with the label set. Let I be the image to generate the mosaic from, and let P be the collection of all pixels of I. For each p ∈ P we wish to assign a label which is an ordered pair: (v_p, ϕ_p). Here v_p ∈ {0, 1} is the binary "visibility" variable. If v_p = 1, then a tile centered at p is placed in the mosaic. If v_p = 0 then the mosaic does not have a tile centered at p. We assume that all tiles are squares with a fixed side tSize. The second part of the label, ϕ_p, specifies the orientation of the tile centered at p, if there is such a tile. If v_p = 1 then ϕ_p has a meaning (i.e. tile orientation); if v_p = 0, the value of ϕ_p is not used. The set of tile orientations Φ is discretized into m angles, at equal intervals. Since tiles are rotationally symmetric, only the angles in [0, π/2) are needed. We set m = 32. Let T(p, ϕ_p) denote the set of pixels covered by a tile centered at pixel p and with orientation ϕ_p. The color of the tile is the average color of the pixels it covers. Let ϕ = {ϕ_p | p ∈ P} and v = {v_p | p ∈ P}. A mosaic then is an ordered pair of variables (v, ϕ) s.t. v ∈ {0, 1}^n and ϕ ∈ Φ^n, where n is the size of P. The energy function for a mosaic (v, ϕ) is formulated as:

E(v, ϕ) = Σ_{p∈P} (1 − v_p) + Σ_{p∈P} v_p · D_p(ϕ_p) + Σ_{{p,q}∈N} V_pq(v_p, v_q, ϕ_p, ϕ_q).   (3)
The first sum in Eq. (3) minimizes the gap space. The second sum in Eq. (3) is the data term. Each Dp (ϕp ) measures the quality of a tile with center at p and with orientation ϕp . Dp (ϕp ) is computed from the local area around p. Under a good orientation ϕp , the sum of image gradients under the tile is small and there is a strong intensity gradient around at least one tile edge. Multiplying Dp (ϕp ) by vp ensures that we measure the quality only of the tiles that are visible.
The last term in Eq. (3) is the smoothness term. The neighborhood system is N = {{p, q} | dist(p, q) ≤ √2 · tSize}, where dist(p, q) is the Euclidean distance between pixels p and q. This N is large enough to contain all pairs {p, q} s.t. if we place tiles centered at p and q, they are either adjacent or overlapping. The interaction term is:

V_pq(ϕ_p, ϕ_q, v_p, v_q) =
  0,   if v_p = 0 or v_q = 0;
  w_s · |ϕ_p − ϕ_q|_{mod(π/2)},   if v_p = v_q = 1 and T(p, ϕ_p) ∩ T(q, ϕ_q) = ∅;
  ∞,   if v_p = v_q = 1 and T(p, ϕ_p) ∩ T(q, ϕ_q) ≠ ∅,   (4)

where

|ϕ_p − ϕ_q|_{mod(π/2)} =
  |ϕ_p − ϕ_q|,   if |ϕ_p − ϕ_q| ≤ π/4;
  π/2 − |ϕ_p − ϕ_q|,   otherwise.   (5)
The smoothness term serves two purposes. First, any finite energy labeling does not have overlapping tiles. Second, adjacent tiles are encouraged to have similar orientations. We only consider the orientations of neighboring tiles that are actually placed in the mosaic. The modulo arithmetic in Eq. (5) reflects the fact that rotation by angle ϕ_p gives the same result as rotation by angle ϕ_p + π/2. The energy in Eq. (3) is too difficult to optimize in all variables simultaneously. In [5], they use an incremental approach, based on graph cuts [12], which first optimizes the orientation and then the visibility variables.
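For illustration, the orientation distance of Eq. (5) can be computed as follows. The only assumption is that input angles may lie outside [0, π/2), in which case they are folded into that range first.

```python
import math

def orient_dist(phi_p, phi_q):
    """|phi_p - phi_q|_mod(pi/2) from Eq. (5): the distance between two tile
    orientations, accounting for the pi/2 rotational symmetry of square tiles."""
    d = abs(phi_p - phi_q) % (math.pi / 2.0)  # fold into [0, pi/2)
    return d if d <= math.pi / 4.0 else math.pi / 2.0 - d

# a tile at 5 degrees and one at 85 degrees differ by only 10 degrees of "tile angle"
print(math.degrees(orient_dist(math.radians(5), math.radians(85))))  # ~10
```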
5 Overview of Mosaic Animation
Fig. 2 gives a schematic overview. We start with a sequence of m frames, I1 , I2 , ..., Im . We assume that the scene background is known and stationary. However, we are interested in background replacement, since typical office scenes
Fig. 2. Summary of the approach: (a) background subtraction; (b) initial motion segmentation; (c) user interaction; (d) corrected motion segmentation; (e) still mosaic in key segments; (f) mosaic propagated
have boring backgrounds that do not produce appealing mosaics. Thus the first step is background subtraction, Fig. 2(a). To ensure temporal coherence, we find groups of pixels that have common motion. This is the task of motion segmentation. A group of pixels with coherent motion is called a layer. We develop a motion segmentation algorithm for the whole sequence in a global optimization framework, Fig. 2(b). Let L be a layer of pixels with common motion throughout the whole sequence. If general motions are allowed, the "containers" corresponding to L in two different frames may undergo drastic changes in scale, shear, etc. One has to come up with non-trivial strategies for filling these corresponding "containers" with tiles such that each container is a valid mosaic and the apparent motion between the frames is acceptable. In [10], they explore two such strategies with different visual effects. Unlike [10], we are already facing a formidable task of motion segmentation of a real video, so we chose to leave exploring the strategies from [10], as well as developing new ones, for future work. We assume that the motion of layer L between frames I_l and I_{l+1} is well approximated by rotation and translation. Notice that between each individual pair of frames, the translation and rotation parameters of L can be different. With this restriction, the "containers" corresponding to layer L in frames I_l and I_{l+1} have identical shape, except if there is an occlusion or out-of-frame motion. Therefore, we also need to include occlusion detection as a part of our motion segmentation. Under the restriction to rigid layer motion, packing two corresponding "containers" between two frames then becomes basically equivalent to moving tiles from one container to another, following the computed motion, except that parts of a container may become occluded by another layer. Automatic motion segmentation rarely produces error-free results. Therefore we ask the user for corrections, Fig. 2(c). We sample and present a portion of frames to the user. If a part of an object was not segmented correctly, the user finds a nearby frame where the same part was correctly segmented, and clicks on this part. These user-indicated correct segmentations are then propagated to the rest of the sequence, Fig. 2(d). During propagation, we also handle occlusions. Finally we pack the tiles using the algorithm in Sec. 4, with some adjustments to take advantage of the full video sequence. First the tiles are placed into the "key" segments indicated by user interaction, Fig. 2(e). These segments are likely to correspond to regions with higher image quality. Next the mosaics of the key segments are propagated to the rest of the sequence, taking occlusions into consideration. Lastly, a mosaic is placed in any segments that have not been tiled yet, and we render the tiles with the corresponding image colors, Fig. 2(f).
6 Detailed Description
6.1 Background Subtraction
Most indoor office scenes are dull in color, resulting in unimpressive mosaic backgrounds. Thus we remove the background and render the moving object in front of
a lively scene, rendered as a classic mosaic using [5]. Background subtraction is a well-studied area in computer vision [14]. To get accurate background subtraction, we use global optimization, similar to that of [15].
6.2 Initial Motion Segmentation
Temporal coherency of the final animation depends most of all on the accuracy of motion segmentation, which is a widely studied problem in computer vision [16]. Methods based on global optimization [17,18] produce more accurate results. Our algorithm, particularly suitable for our application, is most closely related to that of [17]. In [17], motion segmentation is performed on a pair of frames at a time. First a sparse set of feature points is matched across two frames. Then, using RANSAC [19], several motion models are fitted to the matched points. Next, dense assignment of image pixels to motion layers is performed. The algorithm is further iterated, refining motion models and reassigning pixels to motion layers. For our application, we need motion layers for the whole sequence, not a pair of frames. One solution is to track feature points throughout the whole sequence, as in [18], but the number of features that appear in all frames is limited. Our solution is to first estimate pairwise motion models between two adjacent frames, and then find correspondences between adjacent motion models. Global motion models (i.e. those describing a motion from the first frame to the last) are formed from the correspondences. Finally, global motion segmentation is performed for all frames at the same time, using the global motion models. Let I_1, I_2, ..., I_m be the m input frames. We match feature points between pairs of frames I_d and I_{d+1}, for d = 1, ..., m − 1. Next we fit k motion models using RANSAC between each pair of adjacent frames. Let M_d, for d = 1, ..., m − 1, be the set of motion models estimated between frames I_d and I_{d+1}. The initial number of models in each M_d is k. Let M_d^i stand for the ith motion model in M_d, i.e. M_d^i is the ith estimated motion model between frames d and d + 1. Fig. 3 is an oversimplified illustration for 3 frames and k = 4. We first perform dense motion segmentation between each adjacent pair of frames independently, using the estimated motion models M_d. Given a pair of frames I_d and I_{d+1}, the label set L consists of the k estimated motion models in M_d, with one label per motion model. To densely assign labels to pixels in frame I_d, we perform graph cut optimization with the energy as in Eq. (1). The data term for pixel p and label l measures how likely p is to have motion M_d^l from frame d to d + 1. The data term is based on the color difference between p in I_d and the pixel in I_{d+1} it corresponds to according to motion model M_d^l. We use the Potts smoothness term. Let S^1, S^2, ..., S^{m−1} be the resulting segmentations. Here S^d corresponds to the segmentation of frame d, and S_p^d ∈ M_d, i.e. S_p^d is the motion model label assigned to pixel p, out of the possible set of motions M_d. Fig. 3 illustrates a hypothetical result of pairwise motion segmentation.
Fig. 3. Oversimplified example of pairwise motion segmentation. Four models are extracted between each pair of frames, i.e. k = 4. Different labels are illustrated by different colors. Notice that after motion segmentation on frame pairs, we do not know that the "red" model in frame 1 should correspond to the "green" model in frame 2 and to the "yellow" model in frame 3. In practice, motion correspondences are not as easy to resolve as in this picture. Three global motion models are extracted: purple (combines M_1^1 and M_2^2), brown (combines M_1^2 and M_2^3), and blue (combines M_1^3 and M_2^1).
Initial motion segmentation finds groups of pixels with consistent motion between pairs of frames, but we need such pixel groups across the whole sequence. Thus we perform global optimization across the whole sequence. Let 1, 2, ..., c be the c hypothetical global motion labels. Each individual global motion label l should describe how pixels obeying global motion l move from the first frame to the second, from the second to the third, etc. We have motion models M_d^i that describe how pixels move from frame d to d + 1, but we do not know how these same pixels move from frame d + 1 to d + 2. That is, for a motion model in M_d, we do not know the "corresponding" motion model in M_{d+1}. We use the following heuristic but simple procedure for determining which motion model in M_{d+1} corresponds to motion model M_d^i. Consider the motion segmentations S^1, ..., S^{m−1}, performed between pairs of frames individually. Let R_d^i = {p ∈ P | S_p^d = M_d^i}, that is, R_d^i is the set of pixels that are labeled with motion i in frame d. Let us warp pixels in R_d^i to frame d + 1 using motion model M_d^i, and let W(R_d^i) be the set of warped pixels. If at least 80% of pixels in W(R_d^i) are assigned to the same motion model, say model M_{d+1}^j, and if the size of W(R_d^i) is equal to at least 80% of all pixels in S^{d+1} that are assigned motion model M_{d+1}^j, then we say that motion model M_d^i corresponds to motion model M_{d+1}^j. In Fig. 3 the corresponding motion models are indicated by the arrows with the same color. Occasionally we need to combine two or three motion models to satisfy this condition, i.e. we need to take several models in frame d so that pixels assigned to either of these models make up 80% of pixels assigned to the same motion in frame d + 1. This happens because motion segmentation occurs at different levels of precision. For example, between frames d and d + 1, an arm could be fitted with two motions, but between the next pair of frames, d + 1 and d + 2, the whole arm is fitted with one motion. In such cases, we add new motion models to sets M_d
Fig. 4. User Interaction and motion segmentation correction
and M_{d+1}: the model allowing for no arm splitting is added to M_d (the added model is simply a combination computed from the two motion models allowing the split), and the motion model with the arm split is added to M_{d+1} (the new model is based on warping the two "split" models from the previous pair of frames). The procedure described in the previous paragraph creates many global motion labels by linking labels between pairs of frames into a single chain, see Fig. 3. Notice that chains can start after the first frame and end before the last frame, allowing for the appearance of new layers and the disappearance of old layers. With global motion labels, we can perform global layer segmentation. However, performed on the pixel level, the whole sequence does not fit into the memory on a 32-bit architecture. Therefore, we oversegment each frame into "superpixels" using the segmentation algorithm of [20]. Optimization is performed by assigning labels to superpixels, resulting in huge memory savings. The neighborhood system is three-dimensional, with superpixels between the frames also connected. Specifically, we connect a superpixel p in frame d to the closest superpixel q in frame d + 1. This is justified because we expect the motions to be relatively slow. Data terms are still based on color similarity. For a superpixel, the data term is computed as the average of the data terms for its pixels.
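A sketch of the model-linking heuristic described above (the 80% overlap test between warped pairwise segmentations). The warp helper and the toy masks are assumptions for illustration, not the authors' code.

```python
import numpy as np

def link_models(S_d, S_d1, warp, n_models, tau=0.8):
    """Decide which pairwise motion model in frame d+1 corresponds to each model
    in frame d. S_d, S_d1: (H, W) arrays of model indices; warp(mask, i) warps a
    binary mask from frame d to d+1 under model i (assumed helper). Returns dict i -> j."""
    links = {}
    for i in range(n_models):
        R_i = (S_d == i)
        W_i = warp(R_i, i).astype(bool)
        if W_i.sum() == 0:
            continue
        votes = np.bincount(S_d1[W_i], minlength=n_models)
        j = int(votes.argmax())
        frac_warp = votes[j] / W_i.sum()                      # warped pixels landing on model j
        frac_target = W_i.sum() / max((S_d1 == j).sum(), 1)   # coverage of model j's region
        if frac_warp >= tau and frac_target >= tau:
            links[i] = j
    return links

# toy usage: identity warp, two models that keep their support between frames
warp = lambda mask, i: mask
S_d = np.zeros((8, 8), int); S_d[:, 4:] = 1
S_d1 = S_d.copy()
print(link_models(S_d, S_d1, warp, n_models=2))   # {0: 0, 1: 1}
```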
6.3 User Interaction
The initial results of motion segmentation are not likely to be accurate for all frames. Therefore we ask the user for guidance. We sample one fifth of the frames and show their motion segmentation to the user. To correct the segmentation, starting with the first frame that is not accurately segmented, the user has to point out its correct segmentation in a nearby frame, by a single click. Consider Fig. 4. The middle picture shows segmentation results with gross errors due to occlusion, highlighted with a rectangle. The hands are correctly segmented in the frame on the left.
6.4 Correction of Motion Segmentation
Let F^1, F^2, ..., F^{m−1} be the motion segmentation with global labels. Suppose the user clicks on a group of pixels assigned a global label l in frame i. Let G_l^i be this group of pixels, i.e. G_l^i is spatially contiguous, contains the pixel the user clicked on, and G_l^i = {p ∈ P | F_p^i = l}. We fix the labels of pixels in G_l^i to strongly prefer label l in the ith frame. That is, we set the data penalties to be infinite for all labels other than l for pixels in G_l^i in the ith frame. Furthermore, we warp
Fig. 5. Results on “Waving arms” sequence and “Overlapping arms” sequence
pixels in G_l^i to the (i + 1)th frame according to the motion label l. Let W(G_l^i) be the set of warped pixels in frame i + 1. We set w_pq (see Sec. 3) between pixels in G_l^i and W(G_l^i) to be large. Here p is a pixel in frame i and q is the pixel in frame i + 1 that p gets warped to by the global motion model l. Now we are ready to talk about occlusion handling. The coefficient w_pq is also set in proportion to the color similarity between pixels p and q. The more similar the colors, the higher is w_pq. Weighting w_pq in direct proportion to color similarity helps us to handle occlusions automatically. Consider Fig. 4 again. Let O be the group of pixels in the area where the left hand occludes the right hand. Both the left hand pixels and the right hand pixels in the first frame get connected by strong links to pixels in O. However, the links from the left hand are stronger, since the left hand pixels are actually visible in the second frame and their color similarity, on average, is stronger than that between the right hand and pixels in O. Therefore pixels in O get assigned the correct label. After the data terms and the neighborhood weights w_pq are updated, the motion segmentation is recomputed again, propagating user corrections throughout the whole sequence and resolving occlusions.
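A sketch of how the inter-frame weight w_pq could be set large and scaled by colour similarity, as described above, so that occluded correspondences receive weaker links. The Gaussian form and the constants are illustrative assumptions; the paper does not specify the exact function.

```python
import numpy as np

def w_pq(color_p, color_q, base=1.0, boost=10.0, sigma=10.0):
    """Weight between a pixel p in frame i and the pixel q it is warped to in
    frame i+1: a large boost scaled by colour similarity (assumed Gaussian form)."""
    d2 = float(np.sum((np.asarray(color_p, float) - np.asarray(color_q, float)) ** 2))
    return base + boost * np.exp(-d2 / (2.0 * sigma ** 2))

# a matching colour gives a strong link, a very different one almost none
print(w_pq([120, 60, 60], [122, 58, 61]))   # close to base + boost
print(w_pq([120, 60, 60], [20, 200, 180]))  # close to base
```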
6.5 Mosaic Rendering
Now we are ready to pack tiles. We start with the “key” segments pointed out by the user, since these are likely to correspond to image data of high quality. For a still mosaic, given a pixel p and orientation label ϕ, we need to decide on the penalty of placing tile with center at p and orientation ϕ. This penalty is modeled from the data around pixel p. For a video sequence, the penalty should depend not just on the current frame, but on all the other frames in the sequence. Let K be a “key” segment in frame I d that the user clicked on. If we place a tile centered at p under orientation ϕ, this tile will be propagated by the global motion model throughout the whole sequence. Therefore, to model the data penalty, we propagate the tile throughout the whole sequence (notice its orientation will change in different frames) and compute the data penalty in each frame of the sequence, using the same procedure in each frame as for the still mosaic. The final data term for pixel p to have a tile centered at it with orientation ϕ in frame I d is the average of all the data terms from all the frames. After packing the key segments and propagating them throughout the sequence, we pack the empty regions. We start with the first frame, pack any unprocessed regions and propagate them throughout the whole sequence using the same algorithm as for the key segments. If there are any unprocessed regions in the second frame (for example, because a new motion label appears), we repeat the procedure. We stop when the whole sequence is packed with tiles. The final step is to paint the tiles with the colors of the underlying image.
7 Experimental Results
In Fig. 1 we show the "Walking" sequence. Observe how each individual frame of the animation is a pleasing mosaic. This sequence contains significant occlusions, and parts of the leg appear and disappear from the scene. Our system produces a nice coherent animation, with correctly handled occlusions. Due to our restricted motion assumption, the animated figure has a distinctive "puppet"-like effect. Fig. 5 shows two more video sequences. The "Waving arms" sequence is relatively simple, with no significant occlusions. The motion of the torso is modeled with two layers, creating an interesting visual effect. The "Occluding arms" sequence has significant overlap between the two arms, which is handled gracefully. The torso and the head have motions very close to stationary. We decided to fix the head and the body to be stationary, which visually blends them into the background, creating an interesting "arms sticking out of the wall" effect. Our results are best viewed from the animations on the web.1
see http://www.csd.uwo.ca/faculty/olga/VideoMosaic/results.html
References 1. Hausner, A.: Simulating decorative mosaics. In: Proceedings of SIGGRAPH 2001, pp. 573–580 (2001) 2. Elber, G., Wolberg, G.: Rendering traditional mosaics. The Visual Computer 19, 67–78 (2003) 3. Blasi, G.D., Gallo, G.: Artificial mosaics. The Visual Computer 21, 373–383 (2005) 4. Dalal, K., Klein, A.W., Liu, Y., Smith, K.: A spectral approach to npr packing. In: NPAR 2006, pp. 71–78 (2006) 5. Liu, Y., Veksler, O., Juan, O.: Simulating classic mosaics with graph cuts. In: Yuille, A.L., Zhu, S.-C., Cremers, D., Wang, Y. (eds.) EMMCVPR 2007. LNCS, vol. 4679, pp. 55–70. Springer, Heidelberg (2007) 6. Litwinowicz, P.: Processing images and video for an impressionist effect. In: SIGGRAPH 1997, pp. 407–414 (1997) 7. Hertzmann, A., Perlin, K.: Painterly rendering for video and interaction. In: NPAR 2000, pp. 7–12 (2000) 8. Klein, A.W., Sloan, P.P.J., Finkelstein, A., Cohen, M.F.: Stylized video cubes. In: SCA 2002, pp. 15–22 (2002) 9. Wang, J., Xu, Y., Shum, H.Y., Cohen, M.F.: Video tooning. In: SIGGRAPH 2004., pp. 574–583 (2004) 10. Smith, K., Liu, Y., Klein, A.: Animosaics. In: SCA 2005, pp. 201–208. ACM, New York (2005) 11. Battiato, S., Di Blasi, G., Gallo, G., Guarnera, G., Puglisi, G.: Artificial mosaics by gradient vector flow. In: Proceedings of EuroGraphics (2008) 12. Boykov, Y., Veksler, O., Zabih, R.: Efficient approximate energy minimization via graph cuts. IEEE transactions on PAMI 21, 1222–1239 (2001) 13. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on PAMI 24, 137–148 (2004) 14. Elgammal, A.M., Harwood, D., Davis, L.S.: Non-parametric model for background subtraction. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 751–767. Springer, Heidelberg (2000) 15. Sun, J., Zhang, W., Tang, X., Shum, H.: Background cut. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 628–641. Springer, Heidelberg (2006) 16. Adelson, E., Weiss, Y.: A unified mixture framework for motion segmentation: Incorporating spatial coherence and estimating the number of models. In: CVPR 1996, pp. 321–326 (1996) 17. Wills, J., Agarwal, S., Belongie, S.: What went where. In: CVPR, vol. I, pp. 37–44 (2003) 18. Xiao, J., Shah, M.: Motion layer extraction in the presence of occlusion using graph cuts. PAMI 27, 1644–1659 (2005) 19. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, pp. 726–740 (1987) 20. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Comput. Vision 59, 167–181 (2004) 21. Shi, J., Tomasi, C.: Good features to track. Technical report, Ithaca, NY (1993)
Comparison of Optimisation Algorithms for Deformable Template Matching Vasileios Zografos Linköping University, Computer Vision Laboratory ISY, SE-581 83 Linköping, Sweden
[email protected]
Abstract. In this work we examine in detail the use of optimisation algorithms on deformable template matching problems. We start with the examination of simple, direct-search methods and move on to more complicated evolutionary approaches. Our goal is twofold: first, evaluate a number of methods examined under different template matching settings and introduce the use of certain, novel evolutionary optimisation algorithms to computer vision, and second, explore and analyse any additional advantages of using a hybrid approach over existing methods. We show that in computer vision tasks, evolutionary strategies provide very good choices for optimisation. Our experiments have also indicated that we can improve the convergence speed and results of existing algorithms by using a hybrid approach.
1 Introduction
Computer vision tasks such as object recognition [1], template matching [2], registration [3], tracking [4] and classification [5] usually involve a very important optimisation stage where we seek to optimise some objective function, corresponding to matching between model and image features or bringing two images into agreement. This stage requires a good algorithm that is able to find the optimum value within some time limit (often in real time) and within some short distance from the global solution. Traditionally, such tasks have been tackled using local, deterministic algorithms, such as the simplex method [6], Gauss-Newton [7] or its extension by [8,9] and other derivative-based methods [7]. Such algorithms, although they usually improve on the solution relatively fast, need to be initialised in the proximity of the global optimum, otherwise they may get stuck inside local optima. In this work, we examine the simplex and the pattern search methods, due to their simplicity, ubiquity and tractability. In recent years, a wide selection of global, stochastic optimisation algorithms have been introduced, the effectiveness of which has ensured their use in computer vision applications. Their main advantage is that they are able to find
This work has been carried out partially at University College London and at Linköping University under the DIPLECS project.
the optimum value without the need for good initialisation, but on the other hand they require considerable parameter adjustment, which in some cases is not an intuitive or straightforward process. In addition, they tend to be slow, since they require a higher number of function evaluations (NFEs). This paper is organised as follows: in Section 2 we present a selection of traditional local algorithms, followed by the global approaches in Section 3. In Section 4 we explain our test methodology on the different datasets, including a set of 2-D test functions and real-image data of varying complexity. Section 5 includes an analysis of our experimental results for each algorithm, followed by an introduction to hybrid optimisation. We conclude with Section 6.
2
Local Methods
We consider two local optimisation methods that are well known and used in computer vision and many other scientific fields. These are the downhill simplex and the pattern search methods. A simplex is a polytope of N + 1 vertices in N dimensions, which is allowed to take a series of steps, most notably the reflection, where the vertex with the worst function value is projected through the opposite face of the simplex to a hopefully better point. The simplex may also expand or contract or change its direction by rotation when no more improvements can be made. Simplex evaluations do not require calculation of function derivatives, but the simplex must be initialised with N + 1 points. This can be rather costly, but it still remains a very good solution when we need something working quickly for a problem with small computational overhead. We introduced two small yet significant modifications to the basic algorithm [6], in order to deal with local minima. The first was the ability for the simplex to restart by generating N random unit vectors at distance λ from the current minimum, whenever its progress stalled. Furthermore, we gradually reduced the distance λ based on the number of function evaluations, using a “cooling” schedule similar to Simulated Annealing [10]. Pattern search algorithms [11] conduct a series of exploratory moves around the current point, sampling the objective function in search of a new point (trial point) with a lower function value. The set of neighbourhood points sampled at every iteration is called a mesh, which is formed by adding the current point to a scalar multiple of a fixed set of vectors called the pattern, which itself is independent of the objective function. If the algorithm finds a new point in the mesh that has a lower function value than the current point, then the new point becomes the current point at the next step of the algorithm.
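As a rough illustration of the modified simplex described above, the sketch below wraps SciPy's Nelder-Mead implementation in a restart loop with a shrinking restart radius. The function name restarting_simplex, the cooling rate, the initial restart radius and the tolerances are our own illustrative choices, not the settings used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize

def restarting_simplex(fun, x0, lam=1.0, cooling=0.9, max_nfev=1000, seed=0):
    """Downhill simplex with random restarts and a 'cooling' restart radius.

    Whenever the inner Nelder-Mead run stalls, the search is restarted from a
    random unit direction at distance lam from the best point so far, and lam
    is reduced, loosely mimicking the modification described in the text.
    """
    rng = np.random.default_rng(seed)
    x_start = np.asarray(x0, dtype=float)
    best_x, best_f = x_start.copy(), fun(x_start)
    nfev = 1
    while nfev < max_nfev:
        res = minimize(fun, x_start, method='Nelder-Mead',
                       options={'maxfev': max_nfev - nfev,
                                'fatol': 1e-8, 'xatol': 1e-8})
        nfev += res.nfev
        if res.fun < best_f:                      # keep the best point seen so far
            best_x, best_f = res.x, res.fun
        # restart from a random unit direction at the (cooled) distance lam
        d = rng.standard_normal(best_x.size)
        x_start = best_x + lam * d / np.linalg.norm(d)
        lam *= cooling                            # "cooling" schedule on the restart radius
    return best_x, best_f, nfev

# Example: restarting_simplex(lambda x: (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2, [-3.0, 4.0])
```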
3
Global Methods
In this section we introduce certain novel optimisation methods, specifically differential evolution and SOMA, together with a more traditional, generic Genetic Algorithm. Our aim is to determine whether or not stochastic, global algorithms are more effective in overcoming the typical convergence shortcomings associated with the aforementioned local methods, but also if these two new approaches are better suited than traditional genetic algorithms to computer vision problems.
A genetic algorithm (GA) [12] belongs to a particular class of optimisation methods based on the principles of evolutionary biology. Almost all GAs follow the basic stages of: initialisation, selection, reproduction (crossover, mutation) and termination. GAs have been applied to the solution of a variety of problems in computer vision, such as feature selection [13], face detection [14] and object recognition [15]. Additionally, they have been shown [16] to perform well in problems involving large search spaces due to their ability to locate good-enough solutions very early in the optimisation process. Differential evolution (DE) [17] is an evolutionary population-based optimisation algorithm that is capable of handling non-differentiable, nonlinear and multi-modal objective functions, with any mixture of discrete, integer and continuous parameters. DE works by adding the weighted difference between two randomly chosen population vectors to a third vector, and the fitness result is compared with an individual from the current population. In this way, no separate probability distribution is required for the perturbation step and DE is completely self-organising. DE has been used successfully in a variety of engineering tasks. Finally, we examine the Self-Organizing Migrating Algorithm or SOMA, a stochastic optimisation algorithm that is modelled on the social behaviour of co-operating intelligent individuals and was chosen because of its proven ability to converge towards the global optimum [18]. SOMA maintains a population of candidate solutions. In every iteration, the whole population is evaluated and the individual with the highest fitness (or lowest error value) is designated as the leader. The remaining individuals will “migrate” towards the leader, that is, travel in the solution space in the direction of the fittest individual.
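The DE update described above (the weighted difference of two random population vectors added to a third, followed by crossover and greedy selection) can be sketched as follows. This is the basic rand/1/bin variant shown for illustration only; the experiments later use the Best1Bin strategy, and the values of pop_size, F and CR here are merely common defaults.

```python
import numpy as np

def differential_evolution_rand1bin(fun, bounds, pop_size=30, F=0.8, CR=0.5,
                                    max_gen=100, seed=0):
    """Basic DE/rand/1/bin: for each target vector, a mutant is built by adding
    the weighted difference of two random population members to a third one,
    mixed into the target by binomial crossover, and kept only if it is fitter."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    dim = lo.size
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    fit = np.array([fun(x) for x in pop])
    for _ in range(max_gen):
        for i in range(pop_size):
            r1, r2, r3 = rng.choice([j for j in range(pop_size) if j != i],
                                    size=3, replace=False)
            mutant = pop[r1] + F * (pop[r2] - pop[r3])
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True          # ensure at least one mutant component
            trial = np.where(cross, mutant, pop[i])
            f_trial = fun(trial)
            if f_trial <= fit[i]:                    # greedy selection
                pop[i], fit[i] = trial, f_trial
    best = np.argmin(fit)
    return pop[best], fit[best]
```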
4
Experimental Domain
Our task now is to compare these different strategies against a set of 2-D, analytic functions and real-image data. The aim is to determine the general properties of each of the optimisation algorithms and understand some details about their parameter settings. We may then use this information and apply the same algorithms in a template matching problem and see how they compare in more realistic circumstances. 4.1
2-D Test Functions
These functions are designed to test against universal properties of optimisation algorithms and give us an overall understanding of each method's strengths and weaknesses and possible parameter choices, before moving on to template matching specific datasets and experimentation. The original inspiration was the work of [19], but with a few modifications, which include: the sphere model, $f(\mathbf{x}) = \sum_i x_i^2$, a smooth, unimodal, symmetric, convex function used to measure the general efficiency of an optimisation algorithm. Rosenbrock's function, $f(\mathbf{x}) = \sum_i \left[(1 - x_i)^2 + 100(x_{i+1} - x_i^2)^2\right]$, which has a single global minimum inside a long, parabolic-shaped flat valley. Algorithms that are not able to discover good directions underperform in this problem by oscillating around the minimum. The six-hump camel-back function, $f(x, y) = \left(4 - 2.1x^2 + \frac{x^4}{3}\right)x^2 + xy + \left(-4 + 4y^2\right)y^2$, which has a wide and approximately flat plateau and a number of local minima. In addition, it has two, equally important global minima. Unless an algorithm is equipped to handle variable step sizes, then it is likely to get stuck in one of the flat regions. Rastrigin's function $f(\mathbf{x}) = 10n + \sum_i \left(x_i^2 - 10\cos(2\pi x_i)\right)$ and the slightly more difficult Griewank's function $f(\mathbf{x}) = \sum_i \frac{x_i^2}{4000} - \prod_i \cos\!\left(\frac{x_i}{\sqrt{i}}\right) + 1$. Both have a cosine modulation part that simulates the effects of noise (multiple modes), and are designed to test whether an algorithm can consistently jump out of local minima.
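For reference, the five test functions can be transcribed directly in their 2-D form (with n = 2 for Rastrigin's function); this is a plain restatement of the definitions above, not additional experimental code.

```python
import numpy as np

def sphere(x, y):        # smooth, unimodal, convex
    return x**2 + y**2

def rosenbrock(x, y):    # single minimum inside a long parabolic valley
    return (1 - x)**2 + 100 * (y - x**2)**2

def camel_back(x, y):    # six-hump camel back: flat plateau, two global minima
    return (4 - 2.1 * x**2 + x**4 / 3) * x**2 + x * y + (-4 + 4 * y**2) * y**2

def rastrigin(x, y):     # cosine modulation simulating noise (n = 2)
    return 10 * 2 + (x**2 - 10 * np.cos(2 * np.pi * x)) \
                  + (y**2 - 10 * np.cos(2 * np.pi * y))

def griewank(x, y):      # many regularly spaced local minima
    return (x**2 + y**2) / 4000 \
           - np.cos(x / np.sqrt(1)) * np.cos(y / np.sqrt(2)) + 1
```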
4.2
Real-Image Template Matching
In this section we propose more detailed experiments relevant to computer vision by examining deformable template matching, since it is a generic scenario that might be applied to many different areas in the field. Template matching can be expressed as the task of searching for the parameters ξ of a transformation T that will bring the model template F_0 into agreement with an image I. The transformation T, for 2-D problems, is usually an affine transformation with 6 parameters. In mathematical terms this is defined as a minimisation problem:

$$\min_{\xi} S = \sum_{x,y} g\big(I(x, y),\; T F_0(x, y)\big), \qquad (1)$$

where g(·,·) is some dissimilarity metric and the sum is over all the features in the template, in this case pixels. When g is chosen as the sum of squared differences (SSD) dissimilarity metric, (1) produces specific error surfaces which have been examined in previous work by [20] with well-known properties. For ease of analysis and visualisation we consider the transformation parameters as independent, with the following parameterisation: T = SRU_x + D, that is, a 2-D translation D, anisotropic scaling S, 1-D rotation R and shear U_x. Of particular interest to us is the translation transform, because it contains the majority of problems for optimisation algorithms. This is due to the fact that, in general, a change in translation will move the model away from the object and on to the background region, where unknown data and thus more noisy peaks in the error surface exist. Furthermore, the translation surface may vary depending on the type of template model F_0 and scene image I we use. For example, if we consider the object of interest in front of a constant background (see Fig. 1(a)), then the translation space (assuming all other transformation parameters are optimally set) is a simple convex surface (Fig. 1(d)).
Fig. 1. Common test scenarios: Simple scene with constant background (a), moderate scene with background model available (b) and complex scene without background model (c). Their respective translation error surfaces are shown in (d), (e) and (f).
This is considered to be a relatively easy scenario of a computer vision optimisation problem and it is mostly encountered in controlled environments (e.g. assembly-line visual inspection, medical image registration and so on). A second possibility is for the scene image I background to be substantially more complex (see Fig. 1(b)), with non-trivial structure and noise present. In this case, our template model F_0 may also be more elaborate, composed of a full foreground and background model. As such, we either have to know what the background is [21], build a very simple model [20], or have a statistical model of what it is expected to be like [22]. The result will be a translation error surface as in Fig. 1(e), which constitutes a moderate optimisation problem, with most global algorithms, and a number of local methods under good initialisation, expected to converge to the correct minimum. Finally, we have the hardest case, where considerable structure and noise exist in the scene image background, but a model of the latter is not available (see Fig. 1(c)). The optimisation difficulty in this scenario is apparent in the complexity of the produced 2-D translation error surface (Fig. 1(f)): all local optimisation methods not initialised in close proximity to the global minimum are expected to fail, while most global methods will converge with great difficulty and after many iterations. Regarding the remaining error spaces, we would like to draw attention to the irregularities of the 2-D scale space previously examined in [23]. Finally, the rotation and shear spaces can be easily minimised, even though for the rotation space there may be a number of local minima at angle intervals of ±π/2, depending on the rotational symmetry properties of the object.
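A minimal sketch of the SSD form of objective (1) under the parameterisation T = SRU_x + D is given below. The parameter ordering (t_x, t_y, s_x, s_y, θ, φ), the bilinear interpolation and the boundary handling are our own assumptions, and the function name ssd_objective is hypothetical. Any of the optimisers discussed in Sections 2 and 3 can then be applied to this function to reproduce the matching task in spirit.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def ssd_objective(params, template, image):
    """SSD version of Eq. (1): sum over template pixels of
    (I(T(x, y)) - F0(x, y))^2, with T = S R U_x + D as in the text.
    params = (tx, ty, sx, sy, theta_deg, shear_deg)."""
    tx, ty, sx, sy, theta, shear = params
    th, sh = np.deg2rad(theta), np.deg2rad(shear)
    S = np.diag([sx, sy])                                         # anisotropic scaling
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])                     # rotation
    Ux = np.array([[1.0, np.tan(sh)], [0.0, 1.0]])                # shear
    A = S @ R @ Ux
    ys, xs = np.mgrid[0:template.shape[0], 0:template.shape[1]]
    pts = A @ np.vstack([xs.ravel(), ys.ravel()]) + np.array([[tx], [ty]])
    # sample the scene image at the transformed template coordinates
    warped = map_coordinates(image, [pts[1], pts[0]], order=1, mode='nearest')
    return np.sum((warped - template.ravel())**2)
```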
Table 1. Comparative results from the 2-D test functions using all the algorithms

              Simplex          P. Search        GA               DE               SOMA
Sphere        26, 3.09E-5      81, 0            4600, 1.62E-5    1600, 6.44E-5    1302, 9.5E-5
Rosenbrock    70, 8.09E-5      89, 0            10^4, 1.1E-2     2800, 6.59E-5    10^4, 1.34E-2
Griewank's    10^3, 7.93E-3    10^3, 7.39E-3    8300, 4.85E-5    2100, 9.24E-5    10^4, 7.4E-3
Rastrigin's   516, 2.28E-5     81, 0            10^4, 6.45E-4    2300, 9.92E-5    4570, 1.57E-5
Camel-back    30, 3.0E-5       169, 3.0E-5      10^4, 4.9E-4     1900, 1.0E-4     2651, 0

5
Experiments: Methods and Results
We now present the experimental methods for each dataset, the algorithm configurations and the comparative results from which we aim to draw certain conclusions about the fitness and efficiency of each strategy in relation to the typical computer vision scenario. 5.1
Set 1: 2-D Test Functions
We used the total NFEs as a general and independent quantitative measurement for comparing different algorithms. Convergence was defined as a recovered error minimum no greater than τ = 10^-4 from the known global solution, and inside the allocated optimisation budget (1000 NFEs for local methods and 10000 NFEs for global methods). Additionally, we tried to use similar initialisation criteria for every method, in order to make intra-category comparison easier. For stochastic approaches, we carried out 5 test runs per function and averaged the results. The initialisation settings for each method were: Simplex initial triangle [(5,5),(5,0),(-5,-5)]; Pattern Search starting point (4,5) and polling of the mesh points at each iteration using the positiveBasis2N [11] method; GA population generated from U([-5,5]) and using stochastic uniform selection and scattered cross-over reproduction functions [12]; DE population limit=100, maximum iterations=100, F=0.8, and CR=0.5 [17], with the Best1Bin strategy and the algorithm initialised inside the soft boundaries [-5,5]; finally, for SOMA step=0.11, pathLength=2, prt=0.1, migrations=50 and populationsize=10, which approximates to 10000 NFEs, with the all-to-one randomly strategy [18] and the algorithm initialised inside the hard boundary [-5,5]. The combined output for all methods is shown in Table 1. The first number in each column corresponds to the average NFEs for this method, while the second is the absolute difference between the global and recovered minima. The bold figures represent the best performing algorithm for each function. As we can see, the Simplex performs rather well with very low required NFEs. It can also cope well with flat-region uncertainties due to its expanding/contracting nature, and negotiate moderately noisy surfaces (Rastrigin's), albeit with a high NFEs. This is not the case, however, when numerous local minima exist (Griewank's), even if the available NFEs are increased. The pattern search method requires more NFEs than the Simplex, indicating that it is not as efficient, nor can it discover good directions. It did however find the exact
location of the global minimum in most cases and managed to deal with moderate noise much more efficiently than the Simplex. Regarding flat regions, its fixed mesh expansion and contraction factors were not very adequate in cases where there was no information about the current function estimate. The GA is the worst algorithm and fails to converge below the 10^-4 threshold for NFEs=10000. Nevertheless, it can cope well with noise the majority of times and thus it is best suited for difficult problems with inexpensive function cost, where a high NFEs would be justified. DE is the best across most functions and is generally more efficient than both GA and SOMA. It also succeeds in solving Griewank's function (80% of the time), which as we have already seen is particularly problematic for all optimisation algorithms so far. Finally, SOMA performs somewhere between GA and DE, is quite efficient for simple test functions and can deal with a moderate amount of noise (not Griewank's function though), also allowing for flat-surface uncertainty with a varying step length when improvement stalls. However, it is not good at determining good search directions, since it was not able to converge on Rosenbrock's function, although it did come close. What these tests have demonstrated is that the reducing-step restarting Simplex and the DE algorithms were the best performing from the local and global methods respectively. Before we can draw any broader conclusions however, we need to perform more rigorous tests on real-image data. 5.2
Set 2: Real-Image Template Matching
We shall further analyse the fitness of each of the examined optimisation algorithms by performing more detailed tests with the 3 real-image datasets previously seen (easy, moderate and hard) using the objective function in (1). We define convergence in this context as the ability to recover a model configuration within some Euclidean distance threshold from the known optimum. We could have also used the minimum value of (1) to determine convergence, but in this case, and especially when using an SSD dissimilarity metric, it is quite possible to find an invalid model configuration with an error value that is lower than or equal to the global minimum, as discussed in [23]. As such, the threshold boundaries were defined as follows: translation t_x, t_y = 5, scale s_x, s_y = 0.1, rotation θ = 10°, and shear φ = 5°. Any configuration within these limits from the known global minimum will be considered a valid solution and convergence will be deemed successful. The same values have been used across all 3 datasets. We can now define a number of quantitative measures, such as the global minimum of a converged test run; the time to convergence, that is, how many iterations before the optimisation reached the convergence thresholds; and the convergence percentage, that is, the number of times the optimisation converged inside the set threshold. In all the following tests, we used a maximum of 2000 and 20000 NFEs for the local and global optimisation methods respectively, and each method was allowed to perform 100 separate tests, the results of which were averaged. None of the algorithms was initialised close to the ground-truth solution; instead, in order to eliminate any bias, they were started randomly within the
Table 2. Comparative results from the 3 datasets using all the algorithms

                             Simplex    P. Search    GA         DE         SOMA
Dataset 1  Convergence %     2          12           0          100        100
           NFEs              1060       476          –          3915       2551
           Minimum           2.945      1.365        –          0.3213     0.3265
Dataset 2  Convergence %     2          3            11         96         61
           NFEs              476        0*           446        889        1416
           Minimum           0.09       0.0915       0.08815    0.0799     0.0865
Dataset 3  Convergence %     1          4            63         61         97
           NFEs              1194       862          4603       11483      4070
           Minimum           0.03806    0.0389       0.0273     0.0301     0.0252
6 coefficient domains. In more detail, we used the following settings for each method: Simplex algorithm with random initial 7×6 simplex within the boundaries [1-50, 1-50, 0.5-1, 0.5-1, 1-20, 1-20], step size λ=[20, 20, 2, 2, 50, 20] and cooling rate r=0.9; Pattern search initial randomly generated population in the range [0-100, 0-100, 0.5-1.5, 0.5-1.5, 0-50, 0-10], poll method = positiveBasis2N, initial mesh size=30, rotate and scale mesh, expansion factor=2 and contraction factor=0.5; GA 200 generations and 100 populations, initial population function U([0-100], [0-100], [0-1], [0-1], [0-50], [0-10]); DE populations=100, maximum iterations=200, F=0.8, CR=0.5, strategy=Best1Bin, soft boundaries=[1-100, 1-100, 0.5-2, 0.5-2, 0-100, 0-50]. Finally SOMA step=0.5, pathLength=1.5, prt=0.1, migrations=100, popsize=50, hard boundaries=[1-100, 1-100, 0.5-2, 0.5-2, (-180)-180, (-50)-50]. These settings were kept fixed throughout all the datasets. Dataset 1 - MRI image: The first test data consist of an MRI scan of a human brain in front of a black background (Fig. 1(a)). A template of the object was subjected to a 2-D affine transform with: (tx, ty) = 65, 68; (sx, sy) = 0.925, 1.078; θ = −25 and φ = −5.5826. The SSD error between the optimal template and the scene, including minor interpolation and approximation errors, is 6.6689. After 100 experimental runs with each algorithm, we obtained the results in rows 2-4 of Table 2. It is clear that both DE and SOMA have the best performance, with all their test runs converging inside the threshold. DE uses only about 20% of its optimisation budget to achieve convergence on average, but SOMA is the clear winner with approximately 1400 fewer NFEs required for comparable results. Next we have the genetic algorithm, which very surprisingly did not manage to converge in any of the 100 tests but instead converged inside one of the many pronounced local minima of the rotation parameter θ (due to the symmetry of the human brain scan), while having successfully identified the other parameters. From the local methods, due to the absence of good initialisation, we expect much lower convergence rates than the global methods. Compared between themselves, the pattern search can converge many more times and at around half the NFEs that the simplex requires.
Dataset 2 - CMU PIE data: The second instalment of tests was carried out on a real image sample (see Fig. 1(b)) from the CMU PIE database [21], with a complex background, which is however given as a separate image. This is a more difficult scenario than previously and we expect a lower convergence rate across all the methods. On this occasion, the ground truth is located at [82, 52, 1.0786, 1.1475, 10°, −4.8991°] with an SSD error of 0.1885. After 100 test runs for each optimisation algorithm, we obtain the results in rows 5-7 of Table 2. As expected, we see an overall drop in the recognition results, with DE being the best performing method, while at the same time displaying convergence behaviour reminiscent of a local method; that is, converging in under 900 NFEs. The rest of the algorithms perform rather poorly, with SOMA at 61% and GA at a much lower 11%. Furthermore, all methods find a good minimum at <0.1, which is lower than the known solution, since they can effectively overcome any inherent interpolation and approximation errors. We also note that in the case of the pattern search algorithm, the only 3 cases that succeeded in converging correctly were the ones that were randomly initialised inside the basin of attraction. Dataset 3 - Real image data without a background model: Finally, we arrive at the hardest case: a real image with a complex background, but without any model of the latter (see Fig. 1(c) and (f)). Due to the increased difficulty associated with this particular dataset, it is expected that the overall optimisation performance will be further reduced. The optimal SSD solution in this case is [106, 59, 0.9048, 1.0444, 12.02°, 0°], with an SSD error of 0.0488. If we use the same optimisation settings as we did previously, we get the following results after 100 test runs (Table 2, rows 8-10). SOMA performs very well with a 97% convergence ratio, with the GA coming second at 63% and DE not particularly efficient at 61%. We also see that it takes DE many more iterations to converge, whereas SOMA and GA on average reach the global minimum around 2.5 times faster. Despite that, all the global methods reached approximately the same minimum error. Both local methods managed to converge fast and towards a very good solution, but only for a limited number of cases, most probably due to the absence of good starting points. In conclusion, we may say that both DE and SOMA perform consistently well in all 3 cases, with an expected performance penalty associated with the increased difficulty of each dataset, and both reach approximately the same minimum at the end of their allocated time budget. Where they differ, however, is in the time they require for initial convergence, with SOMA being the clear winner since it manages to approximate the correct solution much earlier than DE. This makes SOMA ideal for the hybrid approach later on. As far as the GA is concerned, we have seen that it can reach an equally good minimum error, just like SOMA and DE, when and if it converges successfully. Nevertheless, it has the tendency to get stuck in pronounced local minima in all but the simplest datasets, which consequently reduces its effectiveness on template matching-based object recognition. The two local methods, simplex and pattern search, can converge very fast and nearly at the same minimum whenever they can reach its proximity. We can therefore use either one for the hybrid approach next.
5.3
Hybrid Approach
The hybrid approach is essentially the combination of a global, stochastic algorithm (in this case SOMA), designed to get us close to the basin of attraction as early as possible from a random, distant location on the error surface, and a local method (Simplex), whose purpose is to rapidly refine the recovered solution, much faster and more efficiently than the global method alone can. The only additional issue with using a hybrid method is how to determine the best time to switch between methods. One possibility is to use a number of concurrent criteria to decide when we are close to the switch point. The first such criterion could be a proximity threshold, such as the Euclidean distance previously used to determine convergence. Another could be the observed relative gain of each successful iteration: when the gain falls below some predetermined value, we can assume that the global algorithm has nearly stalled and switch to the local method. Finally, a third criterion might be the relative change of each parameter at every iteration. Alternatively, we may opt to use a fixed NFE-related threshold, based on the information we have about the optimisation behaviour of SOMA. If, for example, we revisit Table 2, we can see that on average and across all 3 datasets, SOMA requires between 1500-4000 NFEs to reach the minimum error threshold. We can therefore use this prior knowledge and set SOMA to run for a fixed number of 4000 NFEs. Such a number will most likely ensure that the switch to the simplex is performed when we are near the solution. The only settings that we altered since the previous test runs are: SOMA migrations=20, popsize=50 ≈ 4000 NFEs; Simplex 2000 NFEs, initial 7×6 simplex that includes as a vertex V_1 the optimum recovered solution from the SOMA run and 6 random vertices V_2-V_7 generated at a distance d=[5,5,0.1,10,5] from V_1. Note that this is the Euclidean distance threshold from earlier. We carried out 100 test runs of the hybrid method for each of the 3 datasets and we present the results in Table 3. The second row shows the convergence rate of the hybrid method. The percentage differences (±%) in this row are in relation to the original SOMA results (the SOMA column of Table 2). The next two rows show the average SSD error of the 100 hybrid runs and the original 100 SOMA runs at 6000 NFEs. The percentage differences of row 3 are in relation to the results in row 4. Finally, the last row shows the average SSD error of the original 100 SOMA runs at the maximum 20000 NFEs, with a percentage difference in relation to the results in row 4. We see that the convergence ratio of the hybrid is only around 15-30% lower than the original tests, but the error is between 20-65% lower than the SOMA-only approach for the same NFEs. In fact, the error values are quite

Table 3. The results of the hybrid and SOMA tests at 6000 and 20000 NFEs
                                 Dataset 1        Dataset 2        Dataset 3
Convergence % (±%)               86% (-14%)       41% (-33%)       81% (-16.5%)
Hybrid SSD @ 6000 NFEs (±%)      0.4275 (-65%)    0.0868 (-24%)    0.02661 (-22%)
SOMA SSD @ 6000 NFEs             1.215            0.1138           0.03419
SOMA SSD @ 20000 NFEs (±%)       0.3265 (-73%)    0.08659 (-24%)   0.02523 (-26%)
[Fig. 2 panels (a)-(c), "Hybrid vs SOMA converged test runs" for Datasets 1-3: SSD error versus NFE, comparing SOMA, the hybrid SOMA part and the hybrid Simplex part.]
Fig. 2. Plots comparing the hybrid approach and the SOMA method for the 3 datasets
close to the original recovered minima using the full 20000 NFEs. This can also be seen in the iteration plots in Fig. 2, where we can observe the secondary drop of the local method, which always manages to refine the optimisation further (i.e. there is no stall at the switch point), indicating that on average we have chosen good switch points and that the local method can converge faster than the global method for the same number of iterations. We may therefore say that by using a hybrid approach, it is possible to obtain results that are very close to a global-algorithm-only solution, but at a considerably reduced NFE cost. In that sense, a hybrid optimiser might be useful in situations where we are faced with a costly objective function but a good initialisation for a local-only method is not available. With the application of the hybrid method we may still use a global algorithm for initialisation, while avoiding the increased NFE overhead.
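A sketch of the fixed-switch-point hybrid driver described above: a global stochastic stage run for roughly 4000 NFEs, followed by a simplex refinement of the recovered solution. SciPy's differential evolution is used here purely as a stand-in for SOMA, and the budget split, population settings and function name hybrid_optimise are illustrative assumptions rather than the authors' configuration.

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

def hybrid_optimise(fun, bounds, switch_nfe=4000, local_nfe=2000, seed=0):
    """Fixed-switch-point hybrid: global stochastic stage for about switch_nfe
    evaluations, then a Nelder-Mead refinement started at the best recovered point."""
    dim = len(bounds)
    popsize = 15                                       # DE population multiplier
    maxiter = max(1, switch_nfe // (popsize * dim))    # rough NFE budget for the global stage
    glob = differential_evolution(fun, bounds, maxiter=maxiter, popsize=popsize,
                                  seed=seed, polish=False, tol=0)
    # Local stage: simplex refinement of the switch point.
    loc = minimize(fun, glob.x, method='Nelder-Mead',
                   options={'maxfev': local_nfe})
    return (loc.x, loc.fun) if loc.fun < glob.fun else (glob.x, glob.fun)
```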
6
Conclusion
In this paper we have examined the suitability of a number of different optimisation methods (both novel and traditional) for the task of template matching. We have tested against a series of 2-D analytic functions designed to highlight the generic properties of each optimisation method, followed by three realistic datasets of progressive difficulty, commonly encountered in computer vision. Our results show that the novel methods outperform the traditional approaches in all cases, and we hope that this work serves as a first step towards introducing these novel methods to the computer vision community and establishing them as better alternatives to the methods currently being used. Finally, we argue that a hybrid combination of a global and a local method can produce equally good results in a fraction of the time required by a global method alone. We demonstrate this to some degree with a number of additional experiments.
References 1. Peters, G.: Theories of three-dimensional object perception: A survey. Recent Research Developments in Pattern Recognition, Transworld Research Network (2000) 2. Jain, A.K., Zhong, Y., Dubuisson-Jolly, M.P.: Deformable template models: A review. Signal Processing 71, 109–129 (1998)
3. Hill, D.L.G., Batchelor, P.G., Holden, M., Hawkes, D.J.: Medical image registration [invited topical review]. Physics in Medicine and Biology 46, R1–R45 (2001) 4. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Comput. Surv. 38, 1–45 (2006) 5. Hasegawa, O., Kanade, T.: Type classification, color estimation, and specific target detection of moving targets on public streets. Machine Vision and Applications 16, 116–121 (2005) 6. Nelder, J.A., Mead, R.: A simplex method for function minimization. Computer Journal 7, 308–313 (1965) 7. Nocedal, J., Wright, S.: Numerical optimization. Springer, New York (1999) 8. Levenberg, K.: A method for the solution of certain problems in least squares. Quart. Appl. Math. 2, 164–168 (1944) 9. Marquardt, D.: An algorithm for least-squares estimation of nonlinear parameters. SIAM J. Appl. Math. 11, 431–441 (1963) 10. Betke, M., Makris, N.C.: Fast object recognition in noisy images using Simulated Annealing. In: Proceedings of the 5th ICCV, pp. 523–530 (1995) 11. Audet, C., Dennis Jr., J.E.: Analysis of generalized pattern searches. SIAM J. on Optim. 13, 889–903 (2003) 12. Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. The MIT Press, Cambridge (1992) 13. Kim, H.D., Park, C.H., Yang, H.C., Sim, K.B.: Genetic algorithm based feature selection method development for pattern recognition. In: International Joint Conference SICE-ICASE (2006) 14. Bebis, G., Uthiram, S., Georgiopoulos, M.: Genetic search for face detection and verification. In: ICIIS, pp. 360–367 (1999) 15. Hill, A., Taylor, C.J., Cootes, T.F.: Object recognition by flexible template matching using genetic algorithms. In: Proceedings of the 2nd ECCV, London, UK, pp. 852–856. Springer, Heidelberg (1992) 16. Goldberg, D.E.: Genetic Algorithms in Search, Optimzation & Machine Learning. Addison-Wesley, Reading (1989) 17. Storn, R., Price, K.V.: Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. Journal of Gl. Optim. 11, 341–359 (1997) 18. Zelinka, I.: SOMA-Self Organizing Migrating Algorithm. In: Onwubolu, G., Babu, B.V. (eds.) New optimization techniques in engineering. Springer, Berlin (2004) 19. DeJong, K.A.: An analysis of the behavior of a class of genetic adaptive systems. PhD thesis, University of Michigan, Ann Arbor, MI, USA (1975) 20. Buxton, B., Zografos, V.: Flexible template and model matching using image intensity. In: Proceedings DICTA, pp. 438–447 (2005) 21. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination and expression (PIE) database. In: Proc. of the 5th FG. (2002) 22. Srivastava, A., Lee, A., Simoncelli, E., Zhu, S.C.: On advances in statistical modeling of natural images. Journal of Math. Imag. and Vis. 18, 17–33 (2003) 23. Zografos, V., Buxton, B.F.: Affine invariant, model-based object recognition using robust metrics and Bayesian statistics. In: Kamel, M.S., Campilho, A.C. (eds.) ICIAR 2005. LNCS, vol. 3656, pp. 407–414. Springer, Heidelberg (2005)
Two Step Variational Method for Subpixel Optical Flow Computation
Yoshihiko Mochizuki1, Yusuke Kameda1, Atsushi Imiya2, Tomoya Sakai2, and Takashi Imaizumi2
1 Graduate School of Advanced Integration Science, Chiba University
2 Institute of Media and Information Technology, Chiba University
Yayoicho 1-33, Inage-ku, Chiba, 263-8522, Japan
Abstract. We develop an algorithm for super-resolution optical flow computation by combining variational super-resolution and variational optical flow computation. Our method first computes the gradient and the spatial difference of a high-resolution image from those of low-resolution images directly, without computing any high-resolution images. Second, the algorithm computes the optical flow of the high-resolution image using the results of the first step.
1
Introduction
Super-resolution is a technique to recover a high-resolution image and/or image sequence from a low-resolution image and/or image sequence. In this paper, we develop a two-step method for subpixel-accurate optical flow computation. To combine a super-resolution technique for still images with optical flow computation for temporal images, we adopt a variational method [4] both for super-resolution and for optical flow computation. Our method computes the gradient and the spatial difference of a high-resolution image from those of low-resolution images directly, without computing high-resolution images as intermediate information for super-resolution of optical flow fields. Subpixel-accurate optical flow computation is required to compute the optical flow vectors of inter-grid points. In multiresolution optical flow computation, the optical flow field computed in the coarse grid system is propagated to the field in the finer grid. This propagated field is used as the first estimate for the accurate optical flow computation in the finer grid. Super-resolution optical flow computation is a technique for the estimation of optical flow vectors beyond the base grid, which is the grid system used to express the original measured images, in the multiresolution image analysis framework [3]. Interpolation is a fundamental technique to recover a high-resolution image [2]. The spline technique is a typical method for interpolation. Furthermore, spline interpolation is a classical method for super-resolution of images and shapes. A well-established signal super-resolution technique is the reconstruction of signals with finite support from a low-frequency part of the signals [1]. In this methodology, additional constraints, such as the convex space in which the signals exist, are used for accurate and robust signal recovery [5].
For the recovery of a high-resolution image from a low-resolution image, we are required to assume an image observation system and analytical properties of signals and images as priors [1,2]. The pyramid transform reduces the size of an image if we use the same pixel size for image representation [6,7]. If we use the same image landscape for the results of the image pyramid transform, reduction by the pyramid transform acts as low-pass filtering for the dezooming of signals and images. Therefore, we deal with a pyramid-transform-based image observation system. Furthermore, we accept the minimisation of the norm of the higher-order derivatives of images as a prior, since the order of the derivatives of an image defines its continuity order. Moreover, super-resolution involving higher-order derivatives as priors is achieved by solving a polyharmonic elliptic equation. A blurring operation decreases the resolution of the original images. In multiresolution optical flow computation, we are required to combine a deblurring operation, which is the inverse operation of blurring, with optical flow computation. We combine variational super-resolution and variational optical flow computation for the super-resolution optical flow computation.
2
Reduction and Resolution Conversion
Setting the operation R to be a spatiotemporally invariant linear reduction operation, super-resolution optical flow is the problem of computing the optical flow [8,9] $\boldsymbol{u} = (u, v)$ of the image $f(x, y, t)$, which is the solution of

$$\nabla f^\top \boldsymbol{u} + \partial_t f = 0, \qquad (1)$$

from the low-resolution image $g(x, y, t) = R^k f(x, y, t)$ of $f(x, y, t)$, such that

$$g(x, y, t) = \int_{\mathbb{R}^2} w_k(u)\, w_k(v)\, f(2^k x - u,\; 2^k y - v,\; t)\, du\, dv, \qquad (2)$$

for

$$w_k(x) = \begin{cases} \dfrac{1}{2^k}\left(1 - \dfrac{|x|}{2^k}\right), & |x| \le 2^k, \\ 0, & |x| > 2^k. \end{cases} \qquad (3)$$

Setting

$$g(x, y, t) = Rf(x, y, t) = \int_{\mathbb{R}^2} w_1(u)\, w_1(v)\, f(2x - u,\; 2y - v,\; t)\, du\, dv, \qquad (4)$$

there is the relation $R^{k+1} f = R(R^k f)$ for $k \ge 1$. We set

$$Eg(x, y) = \frac{1}{\sigma^2} \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} w_\sigma(u)\, w_\sigma(v)\, g\!\left(\frac{s - u}{\sigma}, \frac{t - v}{\sigma}\right) ds\, dt. \qquad (5)$$
The operations E and $E^k$ are the dual operations of R and $R^k$, respectively (see Appendix). The operation E is called expansion and achieves linear interpolation.
Since $R^k$ is a spatiotemporal invariant linear reduction operation, we have the relation

$$\nabla g = \frac{1}{2^k} R^k \nabla f, \qquad \partial_t g = R^k \partial_t f. \qquad (6)$$

Therefore, if we can recover $f_x$, $f_y$, and $f_t$ from $g$, we can compute optical flow of $f$ as the solution of eq. (1).
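A minimal sketch of one application of the reduction operator R in eq. (4), discretising the triangle weight w_1 at integer offsets (the usual [1/4, 1/2, 1/4] kernel) and subsampling by a factor of 2; the boundary handling and the function names are our own assumptions. By relation (6), the derivatives of g produced this way are, up to the factor 1/2^k, the reduced derivatives of f.

```python
import numpy as np
from scipy.ndimage import convolve1d

W1 = np.array([0.25, 0.5, 0.25])   # discretised triangle weight w_1 at offsets -1, 0, 1

def reduce_R(f):
    """One application of R (Eq. 4): separable triangle smoothing of a 2-D frame f,
    followed by factor-2 subsampling."""
    g = convolve1d(f, W1, axis=0, mode='reflect')
    g = convolve1d(g, W1, axis=1, mode='reflect')
    return g[::2, ::2]

def reduce_Rk(f, k):
    """R^k f, obtained by applying R k times (R^{k+1} f = R(R^k f))."""
    for _ in range(k):
        f = reduce_R(f)
    return f
```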
3 3.1
Variational Super-Resolution
3.1
Super-Resolution
Setting R to be the pyramid transform, we deal with the subpixel super-resolution problem of reconstructing f from g = Rf. The variational method reconstructs f by minimising the functional

$$J(f) = \int_{\mathbb{R}^2} \left\{ (R^k f - g)^2 + \kappa\, Q(f) \right\} dx, \qquad (7)$$
where g and f are the original image and the result of super-resolution. In eq. (7), Q(f) is the constraint on the subpixel image, which satisfies the condition

$$\lambda Q(f) + (1 - \lambda) Q(g) \ge Q(\lambda f + (1 - \lambda) g), \quad 0 \le \lambda \le 1. \qquad (8)$$

If $Q(f) = |\nabla f|^2$, the solution is constructed using cubic B-splines [10,5,13,11,12,14,15].
3.2
Optical Flow Computation
To solve a singular equation defined by eq. (1), we use the regularisation method [8,9] which minimises the criterion

$$J(\boldsymbol{u}) = \int_{\mathbb{R}^2} \left\{ (\nabla f^\top \boldsymbol{u} + \partial_t f)^2 + \alpha P(\boldsymbol{u}) \right\} dx, \qquad (9)$$

where $P(\boldsymbol{u})$ is a convex prior of $\boldsymbol{u}$ such that

$$\lambda P(\boldsymbol{u}) + (1 - \lambda) P(\boldsymbol{v}) \ge P(\lambda \boldsymbol{u} + (1 - \lambda) \boldsymbol{v}), \quad 0 \le \lambda \le 1. \qquad (10)$$

If

$$P(\boldsymbol{u}) = \mathrm{tr}\, \nabla\boldsymbol{u} \nabla\boldsymbol{u}^\top = |\nabla u|^2 + |\nabla v|^2, \qquad (11)$$
$$P(\boldsymbol{u}) = \mathrm{tr}\, \boldsymbol{H}\boldsymbol{H}^\top = |u_{xx}|^2 + 2|u_{xy}|^2 + |u_{yy}|^2 + |v_{xx}|^2 + 2|v_{xy}|^2 + |v_{yy}|^2, \qquad (12)$$
$$P(\boldsymbol{u}) = |\nabla u| + |\nabla v|, \qquad (13)$$

where $\boldsymbol{H}$ is the Hessian of $\boldsymbol{u}$, we have the Horn-Schunck method [8,17,18], deformable model method [10,16], and total variational method [19,20], respectively, for optical flow computation.
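For the prior (11), the minimiser of (9) is the classical Horn-Schunck flow. Below is a minimal sketch of the standard Horn-Schunck fixed-point iteration, assuming the spatiotemporal derivatives f_x, f_y, f_t are supplied (in the two-step method of Section 3.3 they come from the super-resolution step); the averaging kernel, the iteration count and the way α enters the denominator follow the textbook scheme, not this paper's discretisation.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(fx, fy, ft, alpha=1.0, n_iter=200):
    """Classical Horn-Schunck iteration for functional (9) with prior (11)."""
    # averaging kernel approximating the local mean used in the HS update
    avg = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]], dtype=float) / 12.0
    u = np.zeros_like(fx)
    v = np.zeros_like(fx)
    for _ in range(n_iter):
        u_bar = convolve(u, avg, mode='nearest')
        v_bar = convolve(v, avg, mode='nearest')
        num = fx * u_bar + fy * v_bar + ft
        den = alpha + fx**2 + fy**2
        u = u_bar - fx * num / den
        v = v_bar - fy * num / den
    return u, v
```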
3.3
Algorithm of Super-Resolution Optical Flow
Our purpose is to compute $\boldsymbol{u}$ which minimises the criterion

$$S(\boldsymbol{u}) = \int_{\mathbb{R}^2} \left\{ (R^k f - g)^2 + \kappa Q(f) + (\nabla f^\top \boldsymbol{u} + \partial_t f)^2 + \alpha P(\boldsymbol{u}) \right\} dx. \qquad (14)$$

If $\alpha \gg 1$, this minimisation can be approximately separated into a two-step method. We first solve the system of minimisation problems

$$J(f_x) = \int_{\mathbb{R}^2} \left\{ \left( \frac{1}{2^k} R^k f_x - g_x \right)^2 + \kappa Q(f_x) \right\} dx, \qquad (15)$$
$$J(f_y) = \int_{\mathbb{R}^2} \left\{ \left( \frac{1}{2^k} R^k f_y - g_y \right)^2 + \kappa Q(f_y) \right\} dx, \qquad (16)$$
$$J(f_t) = \int_{\mathbb{R}^2} \left\{ (R^k f_t - g_t)^2 + \kappa Q(f_t) \right\} dx. \qquad (17)$$

Second, using the solutions $f_x$, $f_y$ and $f_t$, we compute the solution which minimises

$$I(\boldsymbol{u}) = \int_{\mathbb{R}^2} \left\{ (\nabla f^\top \boldsymbol{u} + \partial_t f)^2 + \alpha P(\boldsymbol{u}) \right\} dx. \qquad (18)$$

For the details of this decomposition, see the Appendix. The spatiotemporal derivatives $f_x$, $f_y$, and $f_t$ are computed by the dynamical systems

$$\frac{\partial f_x}{\partial \tau} = -c\, \nabla_{f_x} J(f_x), \qquad \frac{\partial f_y}{\partial \tau} = -c\, \nabla_{f_y} J(f_y), \qquad \frac{\partial f_t}{\partial \tau} = -c\, \nabla_{f_t} J(f_t), \qquad (19)$$

for $c > 0$. The optical flow $\boldsymbol{u}$ of the high-resolution image $f$ is computed using the dynamics

$$\frac{\partial \boldsymbol{u}}{\partial \tau} = -k\, \nabla_{\boldsymbol{u}} I(\boldsymbol{u}), \qquad (20)$$

for a positive constant $k$.
4
Numerical Examples
We evaluated the performance of our method for k = 2, that is, we compute the optical flow field of the image f from g = R^2 f. Furthermore, we set Q(f) = |∇f|^2 and P(u) = |∇u|^2 + |∇v|^2. Table 1 shows the sizes of the original image sequences.

Table 1. Dimension of image sequences

Sequence   Original size   Reduced size   # of frames   Boundary condition
yose       316 × 252       79 × 63        14            free
rotsph     200 × 200       50 × 50        45            free
oldmbl     512 × 512       128 × 128      30            free
Two Step Variational Method for Subpixel Optical Flow Computation
(a) yose (Low resolution)
(b) (Low tion)
rotsph resolu-
(d) yose (SRed)
(e) rotsph (SR-ed)
(f) oldmbl (SR-ed)
(g) yose (High resolution)
(h) (High tion)
(i) (High tion)
(j) yose (original)
(k) rotsph (original)
rotsph resolu-
(c) (Low tion)
1113
oldmbl resolu-
oldmbl resolu-
(l) oldmbl (original)
Fig. 1. Images for optical flow computation and the results. Images in the first and second rows show images from the original and reduced image sequences. The reduction operation is applied twice to each image in bottom Optical flow field in the second and fourth rows are computed from the reduced and original image sequences.
1114
Y. Mochizuki et al.
Fig. 2. The HSV chart of optical flow vectors. (a) shows the HSV colour chart for vector expression. (b) is the ground truth of the motion field of our experiment.
From these sequences, we generated images at 1/16 of the size by applying the reduction operation twice. Then, we computed the optical flow vectors of the original image sequence from the spatial and temporal derivatives of the reduced images. For comparison, we applied the expansion twice to the optical flow fields of the reduced images. We set κ = 1.0 × |∇f| and the sampling interval of τ for the discrete computation of the dynamics to 0.001. Furthermore, Table 1 shows the dimensions of the test images, Yosemite (yose), Rotating Sphere (rotsph), and Old Marbled Block (oldmbl). Images in the first and second rows in Fig. 1 show images from the original and reduced image sequences. The reduction operation is applied twice to each image in the bottom. The original images and reduced images are expressed in the same landscape. Therefore, a pixel of the images on the right in Fig. 1 is 16 times larger than that of the original image. The optical flow fields in the second and fourth rows in Fig. 1 are computed from the reduced and original image sequences. The computed optical flow fields are shown in Fig. 1 using the HSV colour chart of Fig. 2. Setting $\boldsymbol{u}_S(x, y, t) = (u_S(x, y, t), v_S(x, y, t))^\top$ and $\boldsymbol{u}_O(x, y, t) = (u_O(x, y, t), v_O(x, y, t))^\top$ to be the optical flow fields obtained as the result of super-resolution and computed from the original image sequence, respectively, we define the values

$$\theta(x, y, t) = \cos^{-1}\!\left(\frac{\boldsymbol{u}_S^\top \boldsymbol{u}_O}{|\boldsymbol{u}_S|\,|\boldsymbol{u}_O|}\right), \qquad N(t) = |\boldsymbol{u}_S(x, y, t) - \boldsymbol{u}_O(x, y, t)|,$$
$$\mathrm{avr}\theta(t) = \frac{1}{|\Omega|} \int_\Omega \theta(x, y, t)\, dx\, dy, \qquad \mathrm{var}\theta(t) = \text{the variance of } \mathrm{avr}\theta(t).$$

Figure 3 shows the angle and norm errors between the optical flow fields computed from the original images and the low-resolution images. Table 2 shows the mean angle errors and their variances for the three images against α. The table shows that the proposed method robustly computes the optical flow of a high-resolution image from a low-resolution one.
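The error measures above can be transcribed directly; a small sketch, assuming the two flow fields are given as component arrays and clipping the cosine argument to avoid numerical problems in arccos. The function name flow_errors is hypothetical.

```python
import numpy as np

def flow_errors(uS, vS, uO, vO, eps=1e-12):
    """Angle error theta (in degrees) and norm error N between the super-resolved
    flow (uS, vS) and the flow computed from the original sequence (uO, vO)."""
    dot = uS * uO + vS * vO
    norms = np.sqrt(uS**2 + vS**2) * np.sqrt(uO**2 + vO**2) + eps
    theta = np.degrees(np.arccos(np.clip(dot / norms, -1.0, 1.0)))
    N = np.sqrt((uS - uO)**2 + (vS - vO)**2)
    return theta.mean(), theta.var(), N.mean()
```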
[Fig. 3 panels (a)-(c): mean angle error [deg] versus frame number, and panels (d)-(f): mean norm error versus frame number, for the three sequences, with one curve per regularisation weight α = 10^0, 10^1, 10^2, 10^3, 10^4, 10^5.]
Fig. 3. Statistic results of computed optical flow
Table 2. Statistic result (mean and standard deviation of means of angle errors for all frames) of yose

α        10^0          10^1          10^2          10^3          10^4          10^5
avrθ     2.11 × 10^-1  1.45 × 10^-1  7.87 × 10^-2  5.95 × 10^-2  6.70 × 10^-2  6.85 × 10^-2
varθ     8.63 × 10^-2  5.04 × 10^-2  1.42 × 10^-2  3.76 × 10^-3  4.04 × 10^-3  3.87 × 10^-3
avrθ     1.26 × 10^-1  1.15 × 10^-1  1.02 × 10^-1  1.22 × 10^-1  1.22 × 10^-1  1.50 × 10^-1
varθ     7.39 × 10^-2  6.89 × 10^-2  5.56 × 10^-2  5.07 × 10^-2  1.68 × 10^-2  4.39 × 10^-2
avrθ     3.35 × 10^-1  2.03 × 10^-1  7.69 × 10^-2  4.13 × 10^-2  4.60 × 10^-2  6.86 × 10^-2
varθ     1.12 × 10^-1  6.16 × 10^-2  9.16 × 10^-3  2.87 × 10^-3  1.99 × 10^-3  2.52 × 10^-3
5
Conclusions
In this paper, we have developed an algorithm for super-resolution optical flow computation, which computes the optical flow vectors of a sequence of high-resolution images from a sequence of low-resolution images. As a first step towards feature super-resolution, which recovers image features of a high-resolution image from a low-resolution image, we dealt with least-squares- and energy-smoothness-based formulations, since the input image and the output high-resolution features are related through linear transformations.
It is possible to adopt other formulations, such as L1-TV, L2-TV, and L1-L1 formulations, to combine model-fitting terms and priors [21,22,23,24]. This research was supported by "Computational anatomy for computer-aided diagnosis and therapy: Frontiers of medical image sciences", funded by a Grant-in-Aid for Scientific Research on Innovative Areas, MEXT, Japan, by Grants-in-Aid for Scientific Research funded by the Japan Society for the Promotion of Science, by a Research Fellowship for Young Scientists funded by the Japan Society for the Promotion of Science, and by the Research Associate Program of Chiba University.
References 1. Youla, D.: Generalized image restoration by the method of alternating orthogonal projections. IEEE Transactions on Circuits and Systems 25, 694–702 (1978) 2. Stark, H. (ed.): Image Recovery: Theory and Application. Academic Press, New York (1992) 3. Amiaz, T., Lubetzky, E., Kiryati, N.: Coarse to over-fine optical flow estimation. Pattern Recognition 40, 2496–2503 (2007) 4. Ruhnau, P., Kohlberger, T., Schnoerr, C., Nobach, H.: Variational optical flow estimation for particle image velocimetry. Experiments in Fluids 38, 21–32 (2005) 5. Wahba, G., Wendelberger, J.: Some new mathematical methods for variational objective analysis using splines and cross-validation. Monthly Weather Review 108, 36–57 (1980) 6. Burt, P.J., Adelson, E.H.: The Laplacian pyramid as a compact image code. IEEE Trans. Communications 31, 532–540 (1983) 7. Hwan, S., Hwang, S.-H., Lee, U.K.: A hierarchical optical flow estimation algorithm based on the interlevel motion smoothness constraint. Pattern Recognition 26, 939–952 (1993) 8. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–204 (1981) 9. Beauchemin, S.S., Barron, J.L.: The computation of optical flow. ACM Computing Surveys 26, 433–467 (1995) 10. Suter, D.: Motion estimation and vector splines. In: Proceedings of CVPR 1994, pp. 939–942 (1994) 11. Amodei, L., Benbourhim, M.N.: A vector spline approximation. Journal of Approximation Theory 67, 51–79 (1991) 12. Benbourhim, M.N., Bouhamidi, A.: Approximation of vector fields by thin plate splines with tension. Journal of Approximation Theory 136, 198–229 (2005) 13. Suter, D., Chen, F.: Left ventricular motion reconstruction based on elastic vector splines. IEEE Trans. Medical Imaging, 295–305 (2000) 14. Sorzano, C.O.S., Thévenaz, P., Unser, M.: Elastic registration of biological images using vector-spline regularization. IEEE Trans. Biomedical Engineering 52, 652–663 (2005) 15. Steidl, G., Didas, S., Neumann, J.: Splines in higher order TV regularization. IJCV 70, 241–255 (2006) 16. Grenander, U., Miller, M.: Computational anatomy: An emerging discipline. Quarterly of Applied Mathematics 4, 617–694 (1998) 17. Weickert, J., Schnörr, Ch.: Variational optic flow computation with a spatiotemporal smoothness constraint. Journal of Mathematical Imaging and Vision 14, 245–255 (2001)
18. Weickert, J., Bruhn, A., Papenberg, N., Brox, T.: Variational optic flow computation: From continuous models to algorithms. In: Proceedings of International Workshop on Computer Vision and Image Analysis, IWCVIA 2003 (2003) 19. Papenberg, N., Bruhn, A., Brox, T., Didas, S., Weickert, J.: Highly accurate optic flow computation with theoretically justified warping. International Journal of Computer Vision 67, 141–158 (2006) 20. Werner, T., Pock, T., Cremers, D., Bischof, H.: An unbiased second-order prior for high-accuracy motion estimation. In: Rigoll, G. (ed.) DAGM 2008. LNCS, vol. 5096, pp. 396–405. Springer, Heidelberg (2008) 21. Rodríguez, P., Wohlberg, B.: Efficient minimization method for a generalized total variation functional. IEEE Trans. Image Processing 18, 322–332 (2009) 22. Schmidt, M., Fung, G., Rosales, R.: Fast optimization methods for L1 regularization: A comparative study and two new approaches. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 286–297. Springer, Heidelberg (2007) 23. Pock, T., Urschler, M., Zach, C., Beichel, R.R., Bischof, H.: A duality based algorithm for TV-L1-optical-flow image registration. In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI 2007, Part II. LNCS, vol. 4792, pp. 511–518. Springer, Heidelberg (2007) 24. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007)
Appendix

The relation¹

$$\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} Rf(x, y)\, g(x, y)\, dx\, dy
= \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} w_\sigma(u)\, w_\sigma(v)\, f(\sigma x - u, \sigma y - v)\, du\, dv \right) g(x, y)\, dx\, dy$$
$$= \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f(u, v) \left( \frac{1}{\sigma^2} \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} w_\sigma(u)\, w_\sigma(v)\, g\!\left(\frac{s - u}{\sigma}, \frac{t - v}{\sigma}\right) ds\, dt \right) du\, dv
= \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f(u, v)\, Eg(u, v)\, du\, dv$$

implies that the operation E is the dual operation of R. Therefore, the dual operation of $R^k$ is $E^k$.

The Euler–Lagrange equations of $S(\boldsymbol{u}) = (J(f_x) + J(f_y) + J(f_t)) + I(\boldsymbol{u})$ form a system of linear partial differential equations,

$$Q f_x - \frac{1}{\kappa} E^k\!\left(R^k f_x - \frac{1}{\sigma^k} g_x\right) + \frac{1}{\alpha}\,(\nabla f^\top \boldsymbol{u} + \partial_t f)\, u = 0,$$
$$Q f_y - \frac{1}{\kappa} E^k\!\left(R^k f_y - \frac{1}{\sigma^k} g_y\right) + \frac{1}{\alpha}\,(\nabla f^\top \boldsymbol{u} + \partial_t f)\, v = 0,$$
$$Q f_t - \frac{1}{\kappa} E^k\!\left(R^k f_t - g_t\right) + \frac{1}{\alpha}\,(\nabla f^\top \boldsymbol{u} + \partial_t f) = 0,$$

and

$$P \boldsymbol{u} - \frac{1}{\alpha}\,(\nabla f^\top \boldsymbol{u} + \partial_t f)\, \nabla f = 0.$$

If the inequality $\alpha \gg |\nabla f^\top \boldsymbol{u} + \partial_t f|$ is satisfied, these partial differential equations are approximately replaced by the system of equations

$$Q f_x - \frac{1}{\kappa} E^k\!\left(R^k f_x - \frac{1}{\sigma^k} g_x\right) = 0,$$
$$Q f_y - \frac{1}{\kappa} E^k\!\left(R^k f_y - \frac{1}{\sigma^k} g_y\right) = 0,$$
$$Q f_t - \frac{1}{\kappa} E^k\!\left(R^k f_t - g_t\right) = 0.$$

These equations are the Euler–Lagrange equations of eqs. (15), (16), and (17).

¹ In both the domain of definition and the range space of the transformation R, the inner products of functions are defined as
$$(f, g)_D = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f(x, y)\, g(x, y)\, dx\, dy, \qquad (Rf, Rg)_R = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} Rf(x, y)\, Rg(x, y)\, dx\, dy.$$
The dual operation $R^*$ of the operation R satisfies the relation $(f, Rg)_R = (R^* f, g)_D$.
A Variational Approach to Semiautomatic Generation of Digital Terrain Models
Markus Unger1, Thomas Pock1, Markus Grabner1, Andreas Klaus2, and Horst Bischof1
1 Institute for Computer Graphics and Vision, Graz University of Technology
2 Microsoft Photogrammetry
http://www.gpu4vision.org
Abstract. We present a semiautomatic approach to generate high quality digital terrain models (DTM) from digital surface models (DSM). A DTM is a model of the earth's surface, where all man-made objects and the vegetation have been removed. In order to achieve this, we use a variational energy minimization approach. The proposed energy functional incorporates Huber regularization to yield piecewise smooth surfaces and an L1 norm in the data fidelity term. Additionally, a minimum constraint is used in order to prevent the ground level from being pulled up, while buildings and vegetation are pulled down. Being convex, the proposed formulation allows us to compute the globally optimal solution. Clearly, a fully automatic approach does not yield the desired result in all situations. Therefore, we additionally allow the user to affect the algorithm using different user interaction tools. Furthermore, we provide a real-time 3D visualization of the output of the algorithm, which additionally helps the user to assess the final DTM. We present results of the proposed approach using several real data sets.
1
Introduction
When modeling the surface of the earth, several terms for height field data are commonly used. Digital surface models (DSM) describe the earth's surface with all vegetation, buildings, bridges and other man-made objects. These models are often used for visual applications or further processing such as map generation and city modeling. Generation of these datasets can be done by Lidar, Ifsar, photogrammetry or other remote sensing approaches [1]. The digital terrain model (DTM) represents the bare surface of the earth, without any vegetation or man-made objects. The DTM is sometimes also referred to as the digital elevation model (DEM). DTMs are often required for mapping, land usage analysis, flood modeling, geological studies and as input for classification algorithms. In this work we focus on the semiautomatic generation of high quality DTMs based on a given DSM. A lot of work has been done on the generation of DTMs and their possible applications. In [2], a short review on the extraction of buildings from DSM data is given. A good overview in the context of change detection can be found
in [3]. The different approaches can be divided into several categories. The first category is based on local operators, e.g. [4], [5] and [6]. Other methods rely on surface-based methods, e.g. [7] and [8]. The third category makes use of segmentation, e.g. [9], [10] and [11]. In [12], a hybrid method that combines segmentation and surface-based methods is used. In [13], aerial images are used for the automatic generation of maps. They extracted building, vegetation and street layers. Based on this information, a DTM is also generated. A variational approach for DTM generation was presented in [7], where a DTM is generated based on the minimization of an energy making use of Tukey's robust error norm. Unfortunately, this approach cannot model sharp discontinuities. The main contribution of our approach is the exploitation of a novel variational framework for interactive DTM generation that combines robust Huber norm regularization with an L1 data fidelity term. A fast minimization algorithm and the implementation on the GPU make the approach suitable for working interactively with the data. Combined with a real-time 3D visualization, our approach allows precise assessment and interaction to obtain highly accurate DTMs. The remainder of this paper is organized as follows: First, we describe the proposed approach in Section 2. In Section 3, the algorithm is explained in detail, and a fast minimization algorithm is presented. Implementation details are described in Section 4, and experimental results on different datasets are shown and discussed in Section 5.
2
A Semiautomatic Approach
In our approach we rely solely on the DSM to remove vegetation and man-made objects. This task is not trivial: e.g. a rock might look very similar to a building, or bridges might be difficult to distinguish from motorways elevated merely by soil. Therefore, human supervision and interaction is sometimes required to obtain high quality results. When working interactively, speed, global convergence and visualization are important issues. We address these issues by minimizing a convex energy functional to obtain a globally optimal solution. As recently shown [14], variational methods have great parallelisation potential and show huge speedups when implemented on the graphics processing unit (GPU). Interactive visualization greatly benefits from the data already being on the GPU. We use a three-step algorithm that is based on the minimization of a convex energy functional. Our processing chain is illustrated on a 1D example in Fig. 1. In the first step we regularize the DSM using the Huber norm [15] and use a minimum constraint enforcing the result to always have a lower height than the original DSM. This way all vegetation and buildings are dragged down. By thresholding the difference image between the regularized and the input image, a detection mask is generated. In the third step, the detected areas are interpolated using the same energy as in the first step, but without the minimum constraint. Thus slight noise in the DTM is removed. The Huber norm allows for sharp discontinuities, but smoothes small gradients. Interaction is realized by letting the user modify the detection mask. As the energy is convex, the globally optimal
Fig. 1. Process chain illustrated on a 1D example. (a) DSM input data, (b) regularization using Huber norm and minimum constraint, (c) thresholding step to generate detection mask, (d) first DTM model after interpolation step, (e) interaction by modifying the mask and (f) the final DTM.
Fig. 2. Comparison of different DSM visualizations, showing the advantage of rendering in 3D and of additional texture information: (a) 2D view, (b) 3D rendering, (c) 3D rendering with texture.
solution can be obtained, which is especially important when interaction is allowed, as we cannot get stuck in local minima. A more mathematical description of the process chain can be found in Section 3.2. When working on 2D images where depth is represented as grayscale values, visual inspection might be very difficult. Therefore a 3D representation is beneficial. Additionally, texture information such as ortho images can be of great help during the inspection process. See Fig. 2 for an example of different visualization modes. As the variational approach is easy to parallelise, we implemented the whole algorithm directly on the GPU and are able to render the current result in realtime.
3 Algorithm for DTM Generation
3.1 A Convex Minimization Problem
We propose to use the following convex minimization problem:

\min_{u \in K} \left\{ \int_\Omega |\nabla u|_\varepsilon \, dx + \int_\Omega \lambda(x)\, |u - f| \, dx \right\}  (1)
with the DSM f : Ω → R and the DTM u : Ω → R. The minimum constraint u ≤ f is realized by forcing the solution u to be in the convex set

K = \{ u : \Omega \to \mathbb{R},\ u(x) \le f(x)\ \forall x \in \Omega \}.  (2)
The first term in (1) is the regularization term using the Huber norm (or Gauss TV) [15]. Traditionally used regularization terms like the Tikhonov regularization [16] or Total Variation regularization [17] either blur the image or cause staircasing artifacts. The Huber norm has already been used to overcome these problems for image denoising applications [18], [19]. It is defined as

|x|_\varepsilon = \begin{cases} \frac{|x|^2}{2\varepsilon} & \text{if } |x| \le \varepsilon \\ |x| - \frac{\varepsilon}{2} & \text{if } |x| > \varepsilon \end{cases}  (3)

The Huber norm is linear for values larger than ε and quadratic for values smaller than ε. This regularization allows for sharp discontinuities at large gradients, but smoothes the image for small gradients. By using an L1 norm in the data term, the removal of structures becomes contrast invariant. The positive, spatially varying parameter λ(x) in (1) is used to balance the regularization and data terms. One can distinguish the following three cases:
– 0 < λ(x) < ∞: Regularization is performed. The smaller the value of λ, the bigger the features that are removed.
– λ(x) = 0: These areas are inpainted using solely the Huber norm regularization.
– λ(x) → ∞: Forces u(x) = f(x).
Obviously we can use λ(x) to incorporate a priori information like outliers, results from predetection steps or a mask containing vegetation and man made objects.
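For reference, a minimal NumPy sketch of the Huber norm (3), evaluated elementwise:

```python
import numpy as np

def huber_norm(x, eps):
    """Elementwise Huber norm |x|_eps of equation (3)."""
    ax = np.abs(x)
    return np.where(ax <= eps, ax**2 / (2.0 * eps), ax - eps / 2.0)
```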
3.2 Processing Chain
The processing chain is summarized as follows:
1. Regularization step: Set λ(x) = 0 where outliers occur. Solve the minimization problem (1) using the minimum constraint. In this step strong regularization is performed, so all objects are removed but lots of details are lost.
2. Detection step: Perform thresholding using a small threshold ξ to obtain the detection mask g : Ω → [0, 1] as

g(x) = \begin{cases} 1 & \text{if } |u(x) - f(x)| > \xi \\ 0 & \text{if } |u(x) - f(x)| \le \xi \end{cases}  (4)

3. Interpolation step: Set λ(x) = 0 where g(x) = 1. By replacing the convex set K by K = {u : Ω → R} we can solve (1) a second time to get the final DTM. In this step the areas from the detection step are interpolated, and only very slight regularization is applied.
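The three steps can be sketched as follows. The sketch assumes a hypothetical solver solve_model(f, lam, constrained) that minimizes energy (1) for a given λ(x) (one possible realization is given in Section 3.3); the parameter names and values are placeholders, not those of the paper.

```python
import numpy as np

def generate_dtm(dsm, lam_reg, lam_interp, xi, outlier_mask=None, user_mask=None):
    """Three-step chain of Section 3.2 around a hypothetical solve_model()."""
    lam = np.full(dsm.shape, lam_reg, dtype=np.float32)
    if outlier_mask is not None:
        lam[outlier_mask] = 0.0                 # inpaint known outliers
    # 1. regularization step, with the minimum constraint u <= f
    u_reg = solve_model(dsm, lam, constrained=True)
    # 2. detection step: threshold the difference image, eq. (4)
    mask = np.abs(u_reg - dsm) > xi
    if user_mask is not None:                   # interactive corrections to the mask
        mask |= user_mask
    # 3. interpolation step: lambda = 0 inside the mask, no minimum constraint
    lam = np.full(dsm.shape, lam_interp, dtype=np.float32)
    lam[mask] = 0.0
    return solve_model(dsm, lam, constrained=False)
```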
3.3 Solving the Minimization Problem
As the problem formulation is highly non-linear, an efficient minimization procedure is of great importance but not trivial to find. We first consider a dual formulation of the Huber norm. The convex (or Legendre-Fenchel) conjugate [20] of a function f(x, u, ∇u) = |∇u|_ε with respect to ∇u is given as

f^*(x, u, p) = \sup_{\nabla u \in \mathbb{R}^d} \{ p \cdot \nabla u - f(x, u, \nabla u) \} = I_{\{|p| \le 1\}} + \frac{\varepsilon}{2} |p|^2,  (5)

with the dual variable p = (∇u)^* and the indicator function I_Σ for the set Σ given as

I_\Sigma(x) = \begin{cases} 0 & \text{if } x \in \Sigma \\ \infty & \text{else.} \end{cases}  (6)

Likewise, the convex conjugate of f^*(x, u, p) is given as

f^{**}(x, u, \nabla u) = \sup_{p} \{ \nabla u \cdot p - f^*(x, u, p) \} = \sup_{p \in C} \left\{ \nabla u \cdot p - \frac{\varepsilon}{2} |p|^2 \right\},  (7)

with C = \{ p : \Omega \to \mathbb{R}^d,\ |p(x)| \le 1\ \forall x \in \Omega \}. Since f(x, u, ∇u) is convex and lower semi-continuous in ∇u, the biconjugate satisfies f^{**}(x, u, ∇u) = f(x, u, ∇u). Thus the energy from (1) can be written as the following primal-dual functional:

\min_{u \in K} \sup_{p \in C} \left\{ \int_\Omega \nabla u \cdot p \, dx - \frac{\varepsilon}{2} \int_\Omega |p|^2 \, dx + \int_\Omega \lambda(x) |u - f| \, dx \right\}.  (8)

The energy in (8) is hard to minimize. Similar to [21] and [22], we introduce the following convex approximation of the primal-dual energy:

\min_{u \in K, v} \sup_{p \in C} \left\{ \int_\Omega \nabla u \cdot p \, dx - \frac{\varepsilon}{2} \int_\Omega |p|^2 \, dx + \frac{1}{2\theta} \int_\Omega (u - v)^2 \, dx + \int_\Omega \lambda(x) |v - f| \, dx \right\}  (9)
As θ → 0, the convex formulation (9) approaches the original energy (8). We set θ = 0.01 to obtain a solution that is very close to the true minimizer of (1). The energy now represents an optimization problem in the variables u, v and p. We propose to use a derivation of Chambolle’s projected gradient descent algorithm [23], as it was already used for L1 data terms in [21], [22] and [14]. The numerical scheme can be written as:

v^{n+1} = \Pi_K \{ f + S_{\lambda\theta}(u^n - f) \},  (10)
p^{n+1} = \Pi_C \left\{ p^n + \frac{\tau}{\theta} (\nabla u^n - \varepsilon p^n) \right\},  (11)
u^{n+1} = \Pi_K \{ v^{n+1} + \theta \, \nabla \cdot p^{n+1} \}.  (12)

Here, the shrinkage formula [24] in (10) is given as S_\gamma(x) = (|x| - \gamma)^+ \operatorname{sgn}(x). The reprojection onto the set C is realized as \Pi_C(p) = p / \max\{1, |p|\}, and for the reprojection onto the set K we can simply write \Pi_K(w, f) = \min\{w, f\}. The upper bound for the time step τ in (11) is given as \tau \le \frac{1}{4 + \varepsilon}. The three steps (10)–(12) are iterated until convergence. We assume convergence if the change of the energy falls below a specified threshold.
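A compact NumPy sketch of iterations (10)–(12), with a forward-difference gradient and its adjoint divergence. The boundary handling, fixed iteration count and default parameter values are our own simplifications and are not the authors' GPU implementation.

```python
import numpy as np

def grad(u):
    # forward differences with Neumann boundary conditions
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return gx, gy

def div(px, py):
    # negative adjoint of the forward-difference gradient
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0]; dx[:, 1:-1] = px[:, 1:-1] - px[:, :-2]; dx[:, -1] = -px[:, -2]
    dy[0, :] = py[0, :]; dy[1:-1, :] = py[1:-1, :] - py[:-2, :]; dy[-1, :] = -py[-2, :]
    return dx + dy

def solve_model(f, lam, eps=0.1, theta=0.01, n_iter=500, constrained=True):
    """Primal-dual iterations (10)-(12) for energy (1); lam is lambda(x)."""
    u = f.copy(); v = f.copy()
    px = np.zeros_like(f); py = np.zeros_like(f)
    tau = 1.0 / (4.0 + eps)                      # step size bound of (11)
    for _ in range(n_iter):
        # (10): shrinkage with threshold lambda*theta, then projection onto K
        r = u - f
        v = f + np.sign(r) * np.maximum(np.abs(r) - lam * theta, 0.0)
        if constrained:
            v = np.minimum(v, f)
        # (11): dual ascent and reprojection onto |p| <= 1
        gx, gy = grad(u)
        px = px + (tau / theta) * (gx - eps * px)
        py = py + (tau / theta) * (gy - eps * py)
        norm = np.maximum(1.0, np.sqrt(px**2 + py**2))
        px /= norm; py /= norm
        # (12): primal update and projection onto K
        u = v + theta * div(px, py)
        if constrained:
            u = np.minimum(u, f)
    return u
```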
4 Implementation
In Section 2, we already mentioned the importance of fast response times to maintain interactivity and the importance of 3D visualization for more reliable assessment of the results. As the algorithm presented in Section 3 can be easily parallelised, the implementation was done on the GPU using the CUDA framework [25]. For the visual assessment, fast high-quality rendering is necessary. Currently, interaction is only possible on a 2D representation, while the 3D visualization is shown in a separate window. A brush of adjustable size is used to add or delete pixels in the detection mask.

4.1 Visualization
Interactive visualization clearly benefits from the fact that all data used by the algorithm already resides in GPU memory. However, the values of the DTM or DSM sampled on a regular grid are by themselves not sufficient to create a visually appealing representation. For properly rendering the surface, we must split each square (formed by four samples) into two triangles since graphics systems are highly optimized to render planar convex polygons. Moreover, for realistic illumination the surface orientation must be provided. The normal vectors (in particular their discontinuities) have significant impact on the perceived shape of the object and must therefore be chosen carefully. Retained mode graphics packages such as OpenInventor [26] typically include sophisticated offline procedures to precompute normal vectors for a given polygonal surface. However, since the surface geometry is dynamically created in our approach, we must compute topology and normal vectors on-the-fly as well. The
Fig. 3. Comparison of (a) uniform and (b) adaptive triangulation of grid squares. In the uniform case, all squares are triangulated along the same diagonal. Better results are obtained with adaptive triangulation, where the diagonal is chosen which better matches surface features.
choice between the two possible triangulations of a square is based on the second derivative of the surface estimated in a neighborhood of the square [27]. This allows surface features to be properly represented, as illustrated in Fig. 3. For the computation of the normal vector, we must distinguish between smooth regions of the surface and sharp edges. We do so by comparing each triangle’s geometric normal vector n(x) with ∇u(x). If both are sufficiently similar (i.e., they span an angle less than ϕ_c, which is chosen by the user), we consider the surface to be smooth at x and use ∇u(x) for rendering. Otherwise, x is likely located on a crease edge (e.g., a roof line), and n(x) is used for rendering. This is demonstrated in Fig. 3(b), where roof lines are clearly visible, while the roofs and the ground appear smooth. We employ a geometry shader [28] to perform the procedure discussed above, thus avoiding any additional traffic over the CPU/GPU bus since the data is already available in GPU memory.
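The per-vertex normal selection can be sketched as follows. Converting the height-field gradient to a 3D normal and the default threshold value are our own choices, and the adaptive-diagonal criterion of [27] is not reproduced here.

```python
import numpy as np

def shading_normal(n_geom, grad_u, phi_c_deg=30.0):
    """Choose the shading normal: use the smooth gradient-derived normal unless it
    deviates from the triangle's geometric normal by more than phi_c (crease edge)."""
    gx, gy = grad_u
    n_smooth = np.array([-gx, -gy, 1.0])            # upward normal of the height field
    n_smooth /= np.linalg.norm(n_smooth)
    n_geom = np.asarray(n_geom, float)
    n_geom /= np.linalg.norm(n_geom)
    cos_angle = float(np.clip(n_geom @ n_smooth, -1.0, 1.0))
    if np.degrees(np.arccos(cos_angle)) < phi_c_deg:
        return n_smooth                              # smooth region
    return n_geom                                    # crease edge (e.g. roof line)
```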
5 Results and Discussion
The examples presented in this section are all calculated on 2048 × 2048 datasets, while rendering was done on a 1024 × 1024 grid with the texture overlays in the original resolution. In our implementation the three processing steps (regularization, detection and interpolation) are iterated all the time. This allows for an instant reaction of the algorithm to any parameter of the processing chain, but leads to longer overall convergence times. Furthermore, the 3D rendering immediately shows changes in parameters or the detection mask. We only present examples using the approach presented in this paper, as interactive methods are hard to compare and no ground truth was available. In Fig. 4, an example of the processing chain without interaction is presented. The first rendering (Fig. 4(a)) shows the DSM with a true ortho image as texture. Although the algorithm does not make use of this information, it gives the user important additional information during assessment of the result. In Fig. 4(b), the image is shown after the regularization step, where a lot of important information is lost. Also note the staircasing artifacts due to a low value
Fig. 4. Different steps of the processing chain. The DSM (a) is processed using the minimum constraint (b) (Regularization step). From the thresholded difference the detection mask is calculated (c) (Detection step). By applying the algorithm again using the detection mask one gets the final DTM (d) (Interpolation step).
of ε and strong regularization. Nevertheless, the detection mask as depicted in Fig. 4(c) reliably covers vegetation and man made objects. In Fig. 4(d), the high quality DTM is shown. For the interpolation step only very slight regularization is applied. Note that fine details and strong edges are preserved. In Fig. 5, another example is given that illustrates the interaction with the algorithm. As can be seen from the detection mask in Fig. 5(c), the small stone hill on the right side was partially removed. The detection mask can be interactively modified in 2D, while the DTM rendering always shows the current result. With the modified mask as displayed in Fig. 5(d), we obtain the high quality result shown in Fig. 5(f). One can see in Fig. 5 that the algorithm can accidentally remove small rocks. On the other hand, very big buildings, as depicted in Fig. 6, are interpreted as natural elevations. Experimental results showed that the algorithm easily detects all high buildings with a small footprint (e.g. skyscrapers), while there are sometimes problems detecting large but flat buildings (e.g. in industry). Of course it is easy to include a priori information into our approach using λ(x). We therefore applied MSER detection on the DSM as a preprocessing step. The MSER detector finds regions by thresholding an image using all gray values. If a region
Fig. 5. Processing chain using interaction: (a) DSM, (b) DSM with texture, (c) automatically generated detection mask, (d) detection mask after interaction, (e) automatically generated DTM, (f) DTM after interaction. In (e) the DTM obtained without any interaction is shown; it contains some errors, as the detection mask (c) is not entirely correct. The detection mask can be edited interactively (d), which results in a high quality DTM (f).
stays stable over several thresholding steps, it is selected. Flat areas with sharp edges, which usually correspond to buildings, are easily detected by the MSER detector [29] (see Fig. 6(b)). If we additionally use this information in the first processing step, we finally obtain a high quality DTM where all large buildings, vegetation and other man made objects are removed. The iterative procedure as used for the experimental setup took approximately 10 seconds for 2048 × 2048 images on an NVIDIA GTX 285. For larger datasets, the regularization and detection steps might be precomputed for the whole image, and only the interpolation step could then be calculated interactively, which would result in much faster response times.
Fig. 6. Example where a preprocessing step is necessary, as simple detection (a) is insufficient: (a) mask without MSER, (b) MSER detection, (c) mask with MSER, (d) DSM rendering, (e) mask rendering, (f) DTM rendering without MSER, (g) DTM rendering with MSER. MSER detection is used to obtain big buildings (b). In (e) the MSER regions are shown in blue, and other regions found by the algorithm are depicted in light red. In the DTM without MSER detection (f) big buildings remain, while in the DTM using MSER detection (g) all vegetation and man made objects are successfully removed.
6 Conclusion
We presented a framework for semiautomatic generation of DTMs based solely on DSMs. The main contribution is a three-step algorithm that first detects vegetation and man made objects by minimizing a convex energy functional that makes use of the Huber norm, an L1 data term and a convex minimum constraint that only allows points in the DSM to be dragged down. In the second step, the detection mask is generated, which can be modified interactively by the
user. Interpolation is applied to these regions in a third step using the same energy from the first step but without the minimum constraint. The Huber norm allows us to obtain smooth surfaces, while preserving sharp edges. To aid visual inspection of DSM and DTM a realtime 3D visualization method was presented. As shown by experimental results, the framework delivers high-quality results where fine details of the terrain are preserved. We also showed that a priori information like MSER detection can be easily incorporated to improve removal of large buildings. Future work will focus on the incorporation of additional information such as true ortho images, as color provides valuable additional information. Furthermore we will try to improve interaction by allowing modification of the mask directly on the 3D representation and by adding additional interaction possibilities (e.g. level up and down functions, or marking regions by a single click on the image using image segmentation).
Acknowledgement. This work was supported by the Austrian Research Promotion Agency (FFG) within the VM-GPU Project No. 813396.
References
1. Belliss, S., McNeill, S., Barringer, J., Pairman, D., North, H.: Digital terrain modelling for exploration and mining. In: Proceedings of the New Zealand Minerals & Mining Conference (2000)
2. Paparoditis, N., Boudet, L., Tournaire, O.: Automatic man-made object extraction and 3D scene reconstruction from geomatic-images. Is there still a long way to go? In: Urban Remote Sensing Joint Event (2007)
3. Champion, N., Matikainen, L., Rottensteiner, F., Liang, X., Hyyppä, J.: A test of 2D building change detection methods: Comparison, evaluation and perspectives. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences 37 (2008)
4. Eckstein, W., Munkelt, O.: Extracting objects from digital terrain models. In: Remote Sensing and Reconstruction for Three-Dimensional Objects and Scenes, SPIE, pp. 43–51 (1995)
5. Weidner, U., Förstner, W.: Towards automatic building extraction from high-resolution digital elevation models. ISPRS Journal of Photogrammetry and Remote Sensing 50, 38–49 (1995)
6. Zhang, K., Chen, S.C., Whitman, D., Shyu, M.L., Yan, J., Zhang, C.: A progressive morphological filter for removing nonground measurements from airborne lidar data. IEEE Transactions on Geoscience and Remote Sensing 41, 872–882 (2003)
7. Champion, N., Boldo, D.: A robust algorithm for estimating digital terrain models from digital surface models in dense urban areas. In: PCV Photogrammetric Computer Vision (2006)
8. Sohn, G., Dowman, I.: Terrain surface reconstruction by the use of tetrahedron model with the MDL criterion. In: Photogrammetric Computer Vision, A, p. 336 (2002)
9. Rottensteiner, F.: Automatic generation of high-quality building models from lidar data. IEEE Computer Graphics and Applications 23, 42–50 (2003)
10. Baillard, C., Maître, H.: 3-D reconstruction of urban scenes from aerial stereo imagery: a focusing strategy. Comput. Vis. Image Underst. 76, 244–258 (1999)
11. Sithole, G., Vosselman, G.: Filtering of airborne laser scanner data based on segmented point clouds. In: Workshop on Laser Scanning (2005)
12. Baillard, C.: A hybrid method for deriving DTMs from urban DEMs. In: ISPRS International Society for Photogrammetry and Remote Sensing, vol. B3b, p. 109 (2008)
13. Zebedin, L., Klaus, A., Gruber-Geymayer, B., Karner, K.: Towards 3D map generation from digital aerial images. ISPRS Journal of Photogrammetry and Remote Sensing 60, 413–427 (2006)
14. Pock, T., Unger, M., Cremers, D., Bischof, H.: Fast and exact solution of total variation models on the GPU. In: CVPR Workshop on Visual Computer Vision on GPUs, Anchorage, Alaska, USA (2008)
15. Huber, P.: Robust Statistics. Wiley, New York (1981)
16. Tikhonov, A.: On the stability of inverse problems. Dokl. Akad. Nauk SSSR 39, 195–198 (1943)
17. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Phys. D 60, 259–268 (1992)
18. Keeling, S.L.: Total variation based convex filters for medical imaging. Appl. Math. Comput. 139, 101–119 (2003)
19. Hintermüller, M., Stadler, G.: An infeasible primal-dual algorithm for total bounded variation-based inf-convolution-type image restoration. SIAM J. Sci. Comput. 28, 1–23 (2006)
20. Rockafellar, R.T.: Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, Princeton (1997); reprint of the 1970 original, Princeton Paperbacks
21. Aujol, J.F., Gilboa, G., Chan, T., Osher, S.: Structure-texture image decomposition: modeling, algorithms, and parameter selection. Intl. J. of Computer Vision 67, 111–136 (2006)
22. Chan, T., Esedoglu, S.: Aspects of total variation regularized L1 function approximation. SIAM Journal of Applied Mathematics 65, 1817–1837 (2004)
23. Chambolle, A.: Total variation minimization and a class of binary MRF models. In: Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 136–152 (2005)
24. Goldstein, T., Osher, S.: The split Bregman method for L1 regularized problems. UCLA CAM Report 08-29 (2008)
25. NVidia: NVidia CUDA Compute Unified Device Architecture programming guide 2.0. Technical report, NVIDIA Corp., Santa Clara, CA, USA (2008)
26. Wernecke, J.: The Inventor Mentor. Addison-Wesley, Reading (1994)
27. Grabner, M.: On-the-fly greedy mesh simplification for 2.5D regular grid data acquisition systems. In: Proceedings of the Annual Conference of the Austrian Association for Pattern Recognition (AAPR), Graz, Austria, pp. 103–110 (2002)
28. Segal, M., Akeley, K.: The OpenGL graphics system: A specification. Technical report, The Khronos Group Inc. (2009), http://www.opengl.org
29. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: British Machine Vision Conference, vol. 1, pp. 384–393 (2002)
Real-Time Articulated Hand Detection and Pose Estimation
Giorgio Panin, Sebastian Klose, and Alois Knoll
Technische Universität München, Fakultät für Informatik, Boltzmannstrasse 3, 85748 Garching bei München, Germany
{panin,kloses,knoll}@in.tum.de
Abstract. We propose a novel method for planar hand detection from a single uncalibrated image, with the purpose of estimating the articulated pose of a generic model, roughly adapted to the current hand shape. The proposed method combines line and point correspondences, associated with fingertips, finger lines and concavities, extracted from color and intensity edges. The method robustly resolves ambiguous association issues, and refines the pose estimation through nonlinear optimization. The result can be used to initialize a contour-based tracking algorithm, as well as a model adaptation procedure.
1 Introduction
Hand tracking is an important and still challenging task in computer vision, for many desirable applications such as gesture recognition for natural Human-Computer Interfaces (HCI), virtual devices, and tele-manipulation tasks (see for example the review work [4, Chap. 2]). In order to reduce the problem complexity, dedicated devices (data gloves) have been developed, directly providing the required measurements for pose estimation. However, such devices somehow constrain the field of applicability as well as the motion freedom of the user, at the same time requiring a carefully calibrated and often expensive setup (particularly when infrared cameras and markers are involved). In a purely markerless context, [10] employs a flock of features for tracking, while detection is performed by an AdaBoost classifier [11] trained on Haar features [17]; however, although showing nice robustness properties, neither procedure provides any articulated pose information, but only the approximate location in the image. The most well-known approaches to articulated tracking in 2D and 3D [14,15,6,2,13,5] are instead based on contours, which provide a rich and precise visual cue, and profit from a large pool of predicted features (contour points and lines) from the previous frame, through dynamical data association and local search. However, all of these tracking approaches assume an at least partially manual initialization (hand detection), providing an initial localization of the hand. Hand detection amounts to a global search in a high-dimensional parameter space,
using purely static data association [12,1] and fusion mechanisms, which strongly limits the number of distinctive features that can be reliably matched to the model. This paper deals with the problem of fully automatic, articulated hand detection, using static feature correspondences (points and lines) extracted from two complementary modalities, namely skin color and intensity edges. The paper is organized as follows: In Section 2 we first describe the visual cues and the association criteria used to obtain the geometric feature correspondences to the generic model. Afterwards, Section 3 describes the articulated pose estimation procedure. Experimental results are provided in Section 4, together with a discussion of possible directions for future development.
2 Visual Features for Hand Detection
From the input image, we detect two kinds of features: fingertips and concavities, obtained from skin color segmentation, and finger lines, detected along the intensity edges.

2.1 Point Features from Color Segmentation
The input image is first converted to HSV color space, which is well-suited for skin color segmentation, and pixels are classified through a 2D Gaussian Mixture Model (GMM) in the Hue and Saturation channels [18]. Afterwards, we compute the convex hull of the main connected component (blob), and note that most of the time, fingertips and concavities are approximately located, respectively, on the convex hull vertices and the concavity defects [3]: the latter are defined as the maximum-distance points to the respective hull segments (left side of Fig. 1). In order to identify the fingertips among the hull vertices, all vertices and defects have to be properly thresholded and classified. In particular, the overall palm scale is estimated by r_palm, the radius of the maximum-inscribed circle (MIC) within the color blob; this proves to be robust with respect to the finger configuration, and the MIC center also provides a rough position estimate. The MIC is quickly computed by maximizing the distance transform [7]. Fig. 1 also illustrates the scheme used to identify the hull points representing fingertips and the concavities representing palm points. For this purpose, for each hull segment we consider two scale-independent and dimensionless indices, related to the palm size r: the maximum concavity depth D/r and the length L/r; with these values, we classify segments according to the four cases indicated in the picture. The next step consists in merging the fingertip points, by removing too short segments from the sequence (cases 3 and 4 with L/r < t_L^-) and averaging their endpoints. If a sequence of 5 fingertips is obtained, we identify the thumb and the small finger by looking for the largest clockwise angle between fingertips,
measured around the palm center c. Otherwise, the algorithm recognizes the case of insufficient information and returns a detection failure, thus avoiding any attempt at further processing and pose estimation.
Fig. 1. Left: Using the convex hull for fingertip and concavity detection. Right: Segments detected via the probabilistic Hough transform.
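The construction above maps directly onto standard OpenCV primitives. The following sketch, assuming OpenCV 4 and a binary skin mask, extracts hull-vertex fingertip candidates, concavity defects with their depths D, and the MIC palm estimate (center and radius r) via the distance transform; the subsequent case-based classification of hull segments is not reproduced here, and the function name is our own.

```python
import cv2
import numpy as np

def fingertips_and_defects(skin_mask):
    """Convex-hull fingertip/concavity candidates and the MIC palm estimate."""
    contours, _ = cv2.findContours(skin_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    blob = max(contours, key=cv2.contourArea)           # main connected component
    hull_idx = cv2.convexHull(blob, returnPoints=False)
    defects = cv2.convexityDefects(blob, hull_idx)      # (start, end, far, depth*256)
    tips = [tuple(blob[i][0]) for i in hull_idx.flatten()]
    concavities = ([(tuple(blob[f][0]), d / 256.0)      # defect point and depth D
                    for s, e, f, d in defects[:, 0]]
                   if defects is not None else [])
    # maximum-inscribed circle via the distance transform: radius r and center
    dist = cv2.distanceTransform((skin_mask > 0).astype(np.uint8), cv2.DIST_L2, 5)
    _, r_palm, _, center = cv2.minMaxLoc(dist)
    return tips, concavities, center, r_palm
```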
2.2 Line Features from Intensity Edges
As a second modality, we use intensity edges. In particular, from the Canny edge map we detect straight line segments, through a probabilistic Hough transform [9], that can be matched to the model lines. The right side of Fig. 1 shows an example of line detection. A segment correspondence in principle provides 2 point correspondences (i.e. 4 measurements). However, as we can see from Fig. 1 (right side), the endpoints of the segment are not as well localized as the line itself (in terms of direction and distance to the origin); therefore, the most reliable matching can be obtained by pure line correspondences. A line is described in homogeneous coordinates by a 3-vector l = (a, b, d)^T, defined by the equation

ax + by + d = 0 \;\Leftrightarrow\; l^T x = 0  (1)
with x = (x, y, 1)^T the homogeneous coordinates of a point belonging to l. We also assume, without loss of generality, that the orthogonal vector n = (a, b)^T is normalized (a^2 + b^2 = 1), so that the third component d represents the distance of the line to the origin.
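As an illustration, the sketch below (OpenCV assumed; all parameter values are placeholders, not those of the paper) detects segments with the probabilistic Hough transform and converts each one to the normalized homogeneous line representation l = (a, b, d) used above.

```python
import cv2
import numpy as np

def detect_finger_lines(gray, canny_lo=50, canny_hi=150,
                        hough_thresh=40, min_len=30, max_gap=5):
    """Straight segments from intensity edges as normalized lines (a, b, d)."""
    edges = cv2.Canny(gray, canny_lo, canny_hi)
    segs = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180.0,
                           threshold=hough_thresh,
                           minLineLength=min_len, maxLineGap=max_gap)
    lines = []
    for x1, y1, x2, y2 in (segs[:, 0] if segs is not None else []):
        n = np.array([y1 - y2, x2 - x1], dtype=float)   # orthogonal to the segment
        n /= np.linalg.norm(n)
        d = -n @ np.array([x1, y1], dtype=float)        # so that a*x + b*y + d = 0
        lines.append((n[0], n[1], d))
    return lines
```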
2.3 Data Association to a Generic Model
For pose estimation, the detected features have to be associated to the correct model features from a generic model. For this purpose, we abstractly describe a finger model as a triplet of points and two lines (right side of Fig. 2): the fingertip (x), the left and the right
Fig. 2. Left: a simple shape model, made up of ellipses and rectangles; red lines represent the 2D skeleton; green contour lines are used for matching. Right: Definition of model line and point features.
concavity points (c_l, c_r), two lines representing the left and the right edges (l_l, l_r), and a set of flags, signalling whether the respective feature has been detected in the image. The candidate points obtained from the convex hull are first considered, by taking parallel neighboring lines l_left and l_right, previously aligned in order to have the same normal directions, forming a candidate finger if:

|n_l^T n_r| > 1.0 - t_{cosAngle}  (2)
(l_l^T x > 0 \wedge l_r^T x < 0) \vee (l_l^T x < 0 \wedge l_r^T x > 0)  (3)
t_{tipDist}^- < |l_{l,r}^T x| < t_{tipDist}^+  (4)
|l_l^T c_l| < t_{concDist} \wedge |l_r^T c_r| < t_{concDist}  (5)

These conditions state that: the lines have to be approximately parallel, i.e. the angle between the two normals is checked against a threshold angle (eq. (2)); the fingertip x should lie between both lines (eq. (3)); the fingertip x should be close enough to both lines l (but also not too close, eq. (4)); and finally, each concavity point c should be quite close to the respective line l (eq. (5)).
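A direct transcription of conditions (2)–(5) might look as follows; the threshold names and default values are our own illustrative choices.

```python
import numpy as np

def is_candidate_finger(ll, lr, x, cl, cr, t_cos_angle=0.03,
                        t_tip_min=5.0, t_tip_max=80.0, t_conc=10.0):
    """Check conditions (2)-(5) for one candidate finger. ll, lr are normalized
    lines (a, b, d) with a^2 + b^2 = 1; x is the fingertip, cl/cr the concavities."""
    ll, lr = np.asarray(ll, float), np.asarray(lr, float)
    xh = np.array([x[0], x[1], 1.0])
    clh = np.array([cl[0], cl[1], 1.0])
    crh = np.array([cr[0], cr[1], 1.0])
    parallel = abs(ll[:2] @ lr[:2]) > 1.0 - t_cos_angle             # (2)
    dl, dr = ll @ xh, lr @ xh                                       # signed distances
    between = (dl > 0 > dr) or (dl < 0 < dr)                        # (3)
    tip_ok = all(t_tip_min < abs(d) < t_tip_max for d in (dl, dr))  # (4)
    conc_ok = abs(ll @ clh) < t_conc and abs(lr @ crh) < t_conc     # (5)
    return parallel and between and tip_ok and conc_ok
```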
3 Articulated Pose Estimation from Corresponding Features
After establishing the correct data association, the next problem is to estimate the hand pose, minimizing the re-projection error of all detected model features (points and lines) with respect to their noisy measurements. In our approach, the hand model is an articulated skeleton (left side of Fig. 2), composed of 6 rigid links (one for the palm and for each of the fingers).
In particular, the palm undergoes a 2D similarity transform (roto-translation and uniform scale) with 4 dof, while each finger carries an additional rotation angle θ_i, so that the 9 pose parameters p are

p = (t_x, t_y, \theta_p, s, \theta_1, \theta_2, \theta_3, \theta_4, \theta_5)^T  (6)
Starting from the generic model, made up of simple shapes (ellipses and rectangles), we first compute the reference lines and points by using the same procedure as in Sec. 2.3, applied to a rendered image of the model. This has the advantage of keeping generality with respect to the model, while at the same time providing the reference features in an automatic way, for a given shape.
3.1 Single-Body Pose Estimation
The pose of each link of the hand in 2D can be represented by a (3 × 3) homogeneous transform T, projecting points from model to screen coordinates. Moreover, in order to keep generality for the articulated chain, a parent transform \bar{T} pre-multiplying T may be present, considered constant for a single-body pose estimation, and possibly belonging to a different transformation group (for example, for each finger \bar{T} may be a full similarity, while T is a single-axis rotation). For our purposes, we restrict the attention to 2D similarities

\bar{T} T = \begin{pmatrix} sR & t \\ 0^T & 1 \end{pmatrix}  (7)

where s is a scale factor, R a (2 × 2) orthogonal matrix, and t the translation vector. This is a good model for planar hand estimation problems, where the distance to the camera center is large enough compared to the hand size.

Point correspondences. Given N model points X and corresponding noisy image measurements x in homogeneous coordinates

X_i = (X_i, Y_i, 1)^T; \quad x_i = (x_i, y_i, 1)^T  (8)

we look for the optimal transformation T^* that projects the model points X_i “as close as possible” to the measured points x_i, i.e. satisfying

\bar{T} T \cdot X_i = x_i, \quad \forall i  (9)

We can pre-process the data points x_i by writing

T \cdot X_i = \bar{T}^{-1} x_i \equiv \bar{x}_i, \quad \forall i  (10)

where \bar{x}_i = (\bar{x}_i, \bar{y}_i, 1)^T become the data values for pose estimation. If T(p) has a linear parametrization p, then we have

T_{2 \times 3}(p)\, X_i = \hat{X}_i \cdot p  (11)

where the (2 × n) matrix \hat{X}_i is a function of X_i, and T_{2 \times 3} is the upper (2 × 3) submatrix of T (non-homogeneous coordinates).
Line correspondences. We formulate the line correspondence problem as follows [12]: given n_l model segments (L_i^1, L_i^2) matched to image lines l_i = (n_i, d_i)^T, find T such that both projected endpoints lie on l_i:

l_i^T \bar{T} T \cdot L_i^1 = l_i^T \bar{T} T \cdot L_i^2 = 0, \quad \forall i  (12)

In the above equation, the term \bar{T} can again be removed by pre-processing the data lines l

\bar{l}^T = l^T \bar{T} = (\bar{n}^T, \bar{d})  (13)

which can be seen as the dual version of (10). Finally, if the parametrization is linear in p, then the estimation problem becomes linear as well: by denoting with \hat{L}_i^1, \hat{L}_i^2 the equivalent matrices to L_i^1, L_i^2, respectively (as in the previous section), we can write the two equations (12) in a more compact way

\hat{L}_i \cdot p + \bar{\mathbf{d}}_i = 0; \quad \hat{L}_i = \begin{pmatrix} \bar{n}_i^T \hat{L}_i^1 \\ \bar{n}_i^T \hat{L}_i^2 \end{pmatrix}; \quad \bar{\mathbf{d}}_i = \begin{pmatrix} \bar{d}_i \\ \bar{d}_i \end{pmatrix}  (14)

Single-body pose estimation. Under a linear parametrization T(p), given n_l line and n_p point correspondences respectively, we can write the single-body LSE estimation problem

p^* = \min_{p \in \mathbb{R}^d} \left( \sum_{i=1}^{n_l} \left\| \hat{L}_i \cdot p + \bar{\mathbf{d}}_i \right\|^2 + \sum_{j=1}^{n_p} \left\| \hat{X}_j \cdot p - \bar{x}_j \right\|^2 \right)  (15)

with \hat{L}_i, \bar{\mathbf{d}}_i defined in (14), and the pre-processed measurements \bar{l}_i, \bar{x}_j given by (13) and (10), respectively. This problem is linear in p, and can be solved in one step via the singular value decomposition (SVD) technique.
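As a concrete sketch of problem (15), the code below stacks the point and line equations for one link under a 4-parameter similarity T_{2x3} = [[a, -b, t_x], [b, a, t_y]] (one possible linear parametrization, not necessarily the exact one used by the authors) and solves the resulting system in the least-squares sense.

```python
import numpy as np

def point_matrix(X):
    """X_hat for p = (a, b, tx, ty): T_2x3(p) @ (X, Y, 1) = point_matrix(X) @ p."""
    x, y = X
    return np.array([[x, -y, 1.0, 0.0],
                     [y,  x, 0.0, 1.0]])

def estimate_similarity(points, lines):
    """Solve (15) for one link. 'points' are pairs (X_model, x_bar) of pre-processed
    point correspondences; 'lines' are pairs ((L1, L2), (n_bar, d_bar)) with the
    pre-processed image line of (13). Returns the estimated T_2x3."""
    rows, rhs = [], []
    for X, x_bar in points:                       # two equations per point, eq. (11)
        A = point_matrix(X)
        rows.extend(A)
        rhs.extend(x_bar[:2])
    for (L1, L2), (n_bar, d_bar) in lines:        # one equation per endpoint, eq. (12)
        for L in (L1, L2):
            rows.append(np.asarray(n_bar, float) @ point_matrix(L))
            rhs.append(-d_bar)
    A = np.asarray(rows, float)
    b = np.asarray(rhs, float)
    p, *_ = np.linalg.lstsq(A, b, rcond=None)     # least-squares solution of (15)
    a_, b_, tx, ty = p
    return np.array([[a_, -b_, tx], [b_, a_, ty]])
```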
3.2 Articulated Pose Estimation
Recovering articulated pose parameters is accomplished by a two-step procedure.

Initialization. In order to initialize the articulated parameters, we use a hierarchical approach:
1. We examine the skeleton tree, starting from the root (i.e. the palm of the hand) and estimating its similarity parameters alone. For this purpose, two point correspondences (concavity defects, palm center) are sufficient to estimate the similarity parameters p_1, ..., p_4 [16].
2. Afterwards, for all child nodes (i.e. the fingers), we use the parent T estimate as a reference matrix \bar{T} for each link, and employ all available point and line correspondences in order to estimate its pose as in (15).
This approach does not require any initial guess for p, and usually provides a good initial estimate p_0.
Nonlinear LSE refinement using contour points. The geometric error of the articulated chain with respect to the global pose parameters p has an overall nonlinear form, due to the fact that intermediate T matrices are multiplied along the skeleton in order to produce the finger transforms, and each of them is a function of a subset of pose parameters. For this purpose, the measurements are obtained in a standard way [8], by sampling a set of m contour points y_i^f and image normals n_i^f uniformly over the articulated chain (Fig. 3). The contour points are re-projected and matched to the closest intensity edges z_i at each Gauss-Newton iteration, providing dynamic data association with a much larger set of measurements for the pose estimation problem. By writing the normal equations (for the sake of simplicity, with equal weights for all features), we have

\sum_{i=1}^{m} J_{y_i}^T n_i n_i^T J_{y_i} \, \delta p = \sum_{i=1}^{m} J_{y_i}^T n_i n_i^T e_{y_i}  (16)

where δp are the incremental pose parameters w.r.t. the previous iteration, and the (2 × 9) Jacobian matrices

J_{y_i} = \left[ \frac{\partial y_i^f}{\partial p_1} \; \cdots \; \frac{\partial y_i^f}{\partial p_9} \right]  (17)

provide the derivatives of the screen projections w.r.t. the pose parameters, for each sample contour point. Fig. 3 shows an example of non-linear pose estimation.
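A minimal sketch of one Gauss-Newton step built from the normal equations (16); the residual definition e = z − y, the small damping term and the direct solve are our own choices.

```python
import numpy as np

def gauss_newton_step(J_list, n_list, y_list, z_list):
    """One update dp from (16). J_list: per-point (2 x 9) Jacobians; n_list: image
    normals (2,); y_list: projected model contour points; z_list: matched edges."""
    H = np.zeros((9, 9))
    g = np.zeros(9)
    for J, n, y, z in zip(J_list, n_list, y_list, z_list):
        Jn = J.T @ n                      # J^T n, shape (9,)
        H += np.outer(Jn, Jn)             # accumulate J^T n n^T J
        g += Jn * float(n @ (z - y))      # J^T n n^T e, since n^T e is a scalar
    dp = np.linalg.solve(H + 1e-9 * np.eye(9), g)   # tiny damping for robustness
    return dp
```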
Fig. 3. Articulated pose estimation with contour points and normals, after Gauss-Newton optimization
4 Experimental Results
We provide here some experiments, showing the performance of the detector for different, more or less crucial hand poses (with closed fingers). The first row of Fig. 4 shows the result of the pose initialization algorithm of Sec. 3.2, obtained after detection of line and point features, with our data
Fig. 4. Hand detection result for different postures. Top row: initialization; bottom row: nonlinear LSE refinement.
association procedure. It can be seen, that in all cases the detected pose of the hand is close to the correct one; however, the initialization privileges the palm parameters, and uses only the basic features from the detection step. The second row of Fig. 4 show the result of the subsequent pose refinement (Sec. 3.2) after 10 Gauss-Newton iterations. In all cases the tracker converges to a correct pose estimate, despite the mismatched sizes of the model fingers to the real size of the subject. As already emphasized, this step achieves the correct overall scaling and matching, by uniformly optimizing over the full contour features (yellow lines), and ignoring the wrist and internal model lines (gray).
Fig. 5. Left: detection performance with background clutter. Right: detection failures.
Considering more challenging situations, in the first 3 frames of Fig. 5 we show the detection performance in the presence of clutter, concerning both intensity edges and other skin-colored objects. The last 3 frames show some examples of data association failures: bent fingers, out-of-plane rotations, and too close fingers. In the whole sequence, all of these cases are recognized by the detector, which does not attempt to perform any incorrect pose estimation. In order to provide numerical evaluations, we tested the algorithm against ground-truth data, obtained by a manual alignment of the model to the above given images. Denoting with p^T and p^E the true and estimated pose respectively, we compute the estimation error as e_i = p_i^T - p_i^E. Fig. 6 shows the error components for each of the images. In particular, translations (t_x, t_y) are given in pixels, and rotation angles in 10^{-1} degrees.
[Figure: per-component pose estimation error for images 1–6; y-axis: error of pose degree of freedom (approximately −20 to 25); legend: s, R, T_x, T_y, R_thumb, R_pointer, R_middle, R_ring, R_little.]
Fig. 6. Error of the pose detection w.r.t. the visually computed ground truth. (s, R, t_x, t_y) = palm scale, rotation and translation; R_thumb = rotation angle for the thumb, etc.
The presented algorithm was tested on an Intel Core 2 Duo at 2.33 GHz, with 2 GB RAM and an NVIDIA 8600GT GPU with 256 MB graphics memory. As operating system we used Ubuntu Linux 8.04. For video input, an AVT Guppy F033C firewire camera was used to capture frames with a resolution of 656 × 494 at 25 Hz. Using this setup, the algorithm runs at 5 FPS on average.
5 Conclusion and Future Work
In this paper, we presented a hand detection and pose estimation methodology, based on a generic model with articulated degrees of freedom, using geometric feature correspondences of points and lines. In particular, the method has been demonstrated for a planar case, with a similarity transform and planar finger motion. A full 3D detection involves more complex issues, which can best be dealt with by using multiple views and the related feature association. However, the ideas presented so far can serve as a basis for a more complex approach, where multiple convex hulls are used in order to detect fingertips and palm concavities, while detected edge segments can be (at least in part) associated with individual finger links, by using the detected point information.
References
1. Bar-Shalom, Y.: Tracking and Data Association. Academic Press Professional, Inc., San Diego (1987)
2. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. In: CVPR 1998: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, p. 8. IEEE Computer Society Press, Los Alamitos (1998)
3. Bulatov, Y., Jambawalikar, S., Kumar, P., Sethia, S.: Hand recognition using geometric classifiers. In: ICBA, pp. 753–759 (2004)
4. de Campos, T.E.: 3D Visual Tracking of Articulated Objects and Hands. PhD thesis, University of Oxford (2006)
5. Deutscher, J., Reid, I.D.: Articulated body motion capture by stochastic search. Int'l Journal of Computer Vision 61(2) (2005)
6. Drummond, T., Cipolla, R.: Real-time tracking of highly articulated structures in the presence of noisy measurements. In: ICCV-WS 1999, pp. 315–320 (2001)
7. Felzenszwalb, P.F., Huttenlocher, D.P.: Distance transforms of sampled functions. Technical report, Cornell Computing and Information Science (September 2004)
8. Harris, C.: Tracking with rigid models, pp. 59–73 (1993)
9. Kiryati, N., Eldar, Y., Bruckstein, A.M.: A probabilistic Hough transform. Pattern Recogn. 24(4), 303–316 (1991)
10. Kölsch, M., Turk, M.: Fast 2D hand tracking with flocks of features and multi-cue integration. Computer Vision and Pattern Recognition Workshop 10, 158 (2004)
11. Kölsch, M., Turk, M.: Robust hand detection. In: FGR, pp. 614–619 (2004)
12. Lowe, D.G.: Fitting parameterized three-dimensional models to images. IEEE Trans. Pattern Anal. Mach. Intell. 13(5), 441–450 (1991)
13. MacCormick, J., Isard, M.: Partitioned sampling, articulated objects, and interface-quality hand tracking. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 3–19. Springer, Heidelberg (2000)
14. Rehg, J.M., Kanade, T.: DigitEyes: Vision-based human hand tracking. Technical report, Pittsburgh, PA, USA (1993)
15. Stenger, B., Mendonça, P.R.S., Cipolla, R.: Model-based 3D tracking of an articulated hand. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, p. 310 (2001)
16. Umeyama, S.: Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell. 13(4), 376–380 (1991)
17. Viola, P.A., Jones, M.J.: Robust real-time face detection. In: ICCV, p. 747 (2001)
18. Zivkovic, Z.: Improved adaptive Gaussian mixture model for background subtraction. In: Proceedings of the 17th International Conference on Pattern Recognition, vol. 2 (2004)
Background Subtraction in Video Using Recursive Mixture Models, Spatio-Temporal Filtering and Shadow Removal
Zezhi Chen¹, Nick Pears², Michael Freeman², and Jim Austin¹,²
¹ Cybula Limited, York, UK
² Department of Computer Science, University of York, York, UK
Abstract. We describe our approach to segmenting moving objects from the color video data supplied by a nominally stationary camera. There are two main contributions in our work. The first contribution augments Zivkovic and Heijden’s recursively updated Gaussian mixture model approach with a multidimensional Gaussian kernel spatio-temporal smoothing transform. We show that this improves the segmentation performance of the original approach, particularly in adverse imaging conditions, such as when there is camera vibration. Our second contribution is to present a comprehensive comparative evaluation of shadow and highlight detection approaches, which is an essential component of background subtraction in unconstrained outdoor scenes. A comparative evaluation of these approaches over different color spaces is currently lacking in the literature. We show that both segmentation and shadow removal perform best when the RGB color space is used.
1 Introduction

We consider the case of a nominally static camera observing a scene, such as is the case in many visual surveillance applications, and we aim to generate a background/foreground segmentation, with automatic removal of any shadows cast by the foreground object onto the background. In real applications, cameras are often mounted on metal poles, which can oscillate in the wind, thus making the problem more difficult. This problem is also addressed in this paper. To segment moving objects, a background model is built from the data and objects are segmented if they appear significantly different from this modelled background. Significant problems to be addressed include (i) how to correctly and efficiently model and update the background model, (ii) how to deal with camera vibration and (iii) how to deal with shadows. In this paper our contributions are a spatio-temporal filtering improvement to Zivkovic’s recursively updated Gaussian mixture model approach [1], and a comprehensive evaluation of shadow/highlight detection across different color spaces, which is currently lacking in the literature. We also present quantitative results of our complete foreground/background segmentation system with shadow removal in several real-world scenarios. This is valuable to those developing pragmatic visual surveillance solutions that demand a high quality foreground segmentation.
A robust visual segmentation system should not depend on careful placement of the camera; rather, it should be robust to whatever is in its visual field, whatever lighting effects occur or whatever the weather conditions. It should be capable of dealing with movement through cluttered areas, objects overlapping in the visual field, shadows, lighting changes, effects of moving elements of the scene (e.g. camera vibration, swaying trees) and slow-moving objects. The simplest form of the background model is a time-averaged background image. However, this method suffers from many problems, for example it requires a large memory and a training period absent of foreground objects. Static foreground objects during the training period would be considered as a part of the background. This limits its utility in real-time applications. A Gaussian mixture model (GMM) was proposed by Friedman and Russell [2] and it was refined for real-time tracking by Stauffer and Grimson [3]. The algorithm relies on the assumptions that the background is visible more frequently than any foreground regions and that it has models with relatively narrow variances. The system can deal with real-time outdoor scenes with lighting changes, repetitive motions from clutter, and long-term scene changes. Many adaptive GMM models have been proposed to improve the background subtraction method since that original work. Power and Schoonees [4] presented a GMM employed with a hysteresis threshold. They introduced a faster and more logical application of the fundamental approximation than that used in [5]. The standard GMM update equations have been extended to improve the speed and adaptation of the model [6], [7]. All these GMMs use a fixed number of components. Zivkovic et al. [1] presented an improved GMM that adaptively chooses the number of Gaussian mixture components for each pixel on-line, according to a Bayesian perspective. We call this method the Zivkovic-Heijden Gaussian mixture model (ZHGMM) in the remainder of this paper. Another main challenge in the application of background subtraction is identifying the shadows that objects cast, which also move along with them in the scene. Shadows cause serious problems while segmenting and extracting moving objects due to the misclassification of shadow points as foreground. Prati et al. [8] presented a survey of moving shadow detection approaches. Cucchiara et al. [9] proposed the detection of moving objects, ghosts and shadows in HSV colour space and gave a comparison of different background subtraction methods. This paper focuses on two issues: 1) How to get a robust GMM, which models the real background as accurately as possible, and can deal with lighting changes in difficult and challenging environments, such as bad weather and camera vibration. 2) How to remove the shadows and highlight reflections, since these can affect many subsequent tasks such as foreground object classification. The contributions of this paper are (i) an improvement to the ZHGMM algorithm, using a multi-dimensional spatio-temporal Gaussian kernel smoothing transform and (ii) a comprehensive survey of moving shadow and highlight reflection detection approaches in various colour spaces for moving object segmentation applications. The paper is organised as follows: In the next section the ZHGMM approach is reviewed. In Section 3, ZHGMM with the multi-dimensional Gaussian kernel density transform (MDGKT) is proposed. The training of the MDGKT is given in Section 4.
A comprehensive analysis of various shadow removal methods is given in Section 5. Section 6 gives a quantitative evaluation of the background model update and foreground object segmentation. Finally, we present conclusions in Section 7.
2 ZHGMM Review

In this section, we provide a brief outline of the recursive mixture model estimation procedure described by Zivkovic et al. [1] [10]. First, we choose a reasonable time adaptation period of T frames (e.g. T = 100 frames) over which to generate the background model, so that, at time t, we have the training set X_T = \{x^{(t)}, x^{(t-1)}, \ldots, x^{(t-T)}\} for each pixel. For each new sample, we update the training data set X_T and re-estimate the density. In general, these samples contain values that belong to both the background (BG) and foreground (FG) object(s). Therefore, we should denote the estimated density as \hat{p}(x^{(t)} \mid X_T, BG + FG). We use a GMM with M components (we set M = 4):
\hat{p}(x^{(t)} \mid X_T, BG + FG) = \sum_{m=1}^{M} w_m \, \mathcal{N}(x^{(t)}; \mu_m, \Sigma_m)  (1)
where μ_m is the estimate of the mean of the mth Gaussian and Σ_m is the estimate of the variances that describe the mth GMM component. For computational reasons (easy invertibility), an assumption is usually made that the dimensions of X_T are independent, so that Σ_m is diagonal. A further assumption is that the components (e.g. red, green and blue pixel values) have the same variances [3], so that the covariance matrix is assumed to be of the form Σ_m = σ_m^2 I, where I is a 3 × 3 identity matrix. Note that a single σ_m may be a reasonable approximation in a linear colour space, but it may be an excessive simplification in non-linear colour spaces. Thus, in this work, the covariance of a Gaussian component is diagonal, with three separate estimates of variance. The estimated mixing weights (the portion of the data accounted for by each Gaussian) of the mth Gaussian in the GMM at time t, denoted by w_m, are non-negative and normalized. Given a new data sample x^{(t)} at time t, the recursive update equations are
w_m \leftarrow w_m + \alpha\,(o_m^{(t)} - w_m) - \alpha c_T  (2)

\mu_m \leftarrow \mu_m + o_m^{(t)} (\alpha / w_m)\, \delta_m  (3)

\sigma_m^2 \leftarrow \sigma_m^2 + o_m^{(t)} (\alpha / w_m) (\delta_m^T \delta_m - \sigma_m^2)  (4)
where x^{(t)} = [x_1, x_2, x_3]^T, μ_m = [μ_1, μ_2, μ_3]^T, δ_m = [δ_1, δ_2, δ_3]^T and σ_m^2 = [σ_1^2, σ_2^2, σ_3^2]^T for a 3-channel colour image, with δ_m = x^{(t)} − μ_m. Instead of the time interval T mentioned above, here the constant α defines an exponentially decaying envelope that is used to limit the influence of the old data, and we note that α = 1/T. c_T is the negative Dirichlet prior evidence weight [1], which means that we will accept that the class exists only if there is enough evidence from the data for its existence. It will suppress the components that are not supported by the data, and we discard the components with negative weights. This also ensures that the mixture weights are non-negative. For a new sample the ownership o_m^{(t)} is set to 1 for the “close” component with largest w_m and the others are set to zero. We define that a sample is “close” to a
component if the Mahalanobis distance (MD) from the component is, for example, less than three. The squared Mahalanobis distance from the mth component is calculated as D_m^2(x^{(t)}) = \delta_m^T \Sigma_m^{-1} \delta_m. If there are no “close” components, a new component is generated with w_{m+1} = α, μ_{m+1} = x^{(t)} and σ_{m+1} = σ_0, where σ_0 is some appropriate initial variance. If the maximum number of components M is reached, we discard the component with smallest w_m. After each weight update, using equation (2), we need to renormalize the weights so that they again sum to unity.
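For illustration, a per-pixel NumPy version of the recursive update (2)–(4); vectorizing over all pixels, the exact background/foreground decision rule of [1] and the parameter values are our own simplifications, and the variance update is applied per channel.

```python
import numpy as np

def update_pixel(x, w, mu, var, alpha=0.005, c_T=0.01, var0=15.0**2, D_thr=3.0):
    """One recursive update of (2)-(4) for a single pixel.
    x: (3,) colour sample; w: (M,) weights; mu: (M, 3) means; var: (M, 3) variances.
    Arrays are modified in place; returns True if a 'close' component was matched."""
    x = np.asarray(x, float)
    d = x[None, :] - mu                                      # delta_m
    D2 = np.sum(d * d / np.maximum(var, 1e-6), axis=1)       # squared Mahalanobis distance
    close = np.where((w > 0) & (D2 < D_thr**2))[0]
    o = np.zeros_like(w)                                     # ownership o_m^(t)
    if close.size:
        m = close[np.argmax(w[close])]                       # close component with largest weight
        o[m] = 1.0
        w += alpha * (o - w) - alpha * c_T                   # (2)
        w_m = max(w[m], 1e-6)
        mu[m]  += (alpha / w_m) * d[m]                       # (3)
        var[m] += (alpha / w_m) * (d[m] * d[m] - var[m])     # (4), per channel
    else:                                                    # no close component: spawn a new one
        w += alpha * (o - w) - alpha * c_T
        k = int(np.argmin(w))                                # replace the weakest component
        w[k], mu[k], var[k] = alpha, x.copy(), np.full(3, var0)
    np.clip(w, 0.0, None, out=w)                             # discard negative weights
    w /= max(w.sum(), 1e-6)                                  # renormalize
    return bool(close.size)
```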
3 ZHGMM with Multi-dimensional Gaussian Kernel Density Transform

An image is typically represented as a two-dimensional matrix of p-dimensional vectors, where p = 1 in the gray-level case, p = 3 for colour images, and p > 3 for multispectral images. The space of the matrix is known as the spatial domain, while the gray, colour or multispectral information is known as the spectral domain [11] [12]. For algorithms that use image sequences, there is also the temporal domain. In order to provide spatio-temporal smoothing for each spectral component, a multivariate kernel is defined as the product of two radially symmetric kernels, and the Euclidean metric allows a single bandwidth parameter for each domain:

K_{h_t, h_s}(x) = \frac{C}{h_s^2 h_t}\, k\!\left( \left\| \frac{x^s}{h_s} \right\|^2 \right) k\!\left( \left\| \frac{x^t}{h_t} \right\|^2 \right)  (5)
where x^s is the spatial part and x^t is the temporal part of the feature vector, k(x) is a common kernel profile (we use a Gaussian) used in both the spatial and temporal domains, h_s and h_t are the kernel bandwidths, and C is the corresponding normalization constant. In order to improve the stability and robustness of the ZHGMM, we have used this Multi-Dimensional Gaussian Kernel density Transform (MDGKT) as a pre-process, which only requires a pair of bandwidth parameters (h_s, h_t) to control the size of the kernel, thus determining the resolution and time interval of the ZHGMM.
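As a stand-in for the MDGKT pre-process of eq. (5), the following sketch applies a separable spatio-temporal Gaussian filter to a buffer of frames using SciPy; the direct use of the bandwidths as per-axis sigmas is our own simplification.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mdgkt(frames, sigma_s=1.0, sigma_t=1.0):
    """Spatio-temporal Gaussian pre-smoothing of a frame buffer.
    frames: array of shape (T, H, W, C); no smoothing across colour channels."""
    stack = np.asarray(frames, dtype=np.float32)
    return gaussian_filter(stack, sigma=(sigma_t, sigma_s, sigma_s, 0.0))
```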
4 Online Training of MDGKT

A sample RGB image is shown in Fig. 1(a). The variation of the red and blue values of a pixel stream over 596 frames is shown in Fig. 1(e) and (f). The black curves show the variation of the original red and blue components, and the red curves illustrate the variation of the red and blue components in the MDGKT image. A Gaussian kernel with bandwidth (h_s, h_t) = (5, 5) and standard deviation (std) of 0.5 was chosen as the kernel profile. The std of the original image is 1.834 and 1.110, but the std of the MDGKT output image is only 1.193 and 0.832. Obviously, MDGKT reduces the std figures. Fig. 1(c) and (d) show the scatter plots of the original and MDGKT image (red, blue) values of the same pixel. Fig. 1(d) shows that the distribution of the MDGKT image is more localised within two Gaussian components of the mixture model, illustrating the
effect of the spatio-temporal filtering in the spectral domain. The mixture of these two Gaussians for the blue colour component of the original pixel and the estimated GMM distribution using MDGKT are shown in Fig. 1(b). The MDGKT algorithm described above allows us to identify the foreground pixels in each new frame while updating the description of each pixel’s background model. This procedure is effective in determining the boundary of moving objects, thus moving regions can be characterized not only by their position, but also by size, aspect ratio, moments and other shape and colour information. These characteristics can be used for later processing and classification, for example using a support vector machine [13]. To analyse the performance of the algorithm, we used two dynamic scenes. The results are shown in Fig. 2: (a) and (d) are the original images, one an outdoor scene and the other an indoor scene; (b) and (e) are the results of the ZHGMM algorithm; (c) and (f) are the results of our MDGKT algorithm. Note that the results shown are without the application of any post-processing.
Fig. 1. The effect of spatio-temporal filtering. (a) A sample image. (b) GMM distribution of the blue component value of a sample pixel. (c) and (d) Scatter plots of the corresponding pixel in the original and MDGKT images, respectively. (e) and (f) The variation of the red and blue colour components over time (the red trace is spatio-temporally filtered).
Fig. 2. Comparative results of ZHGMM and MDGKT algorithms
5 Shadow Removal

The previous section showed promising initial results for our MDGKT background subtraction algorithm. However, the algorithm is susceptible to both global and local illumination changes such as shadows and highlight reflections (specularities). These
often cause subsequent processes, such as tracking and recognition, to fail. Prati et al. [8] present a comprehensive survey of moving shadow detection approaches. It is important to recognize the type of features utilized for shadow detection. Some approaches improve performance by using spatial information, working at a region level or at a frame level instead of at pixel level [14]. Finlayson et al. [15] proposed a method to remove shadows from a still image using illumination invariance. We give a comparison of several different shadow removal methods, working in different colour spaces, below. For the sake of clarity, we distinguish two different foreground segmentations: segmentation F1 is the foreground segmentation which includes shadows (the MDGKT segmentation output), while F2 is the foreground segmentation after we have removed shadows.

5.1 Working with RGB and Normalized RGB Colour Space

(i) RGB colour. The observed colour vector is projected onto the expected colour vector, and the ith pixel’s brightness distortion α_i is a scalar value (less than unity for a shadow) describing the fraction of remaining ‘brightness’. This may be obtained by minimizing [16]

\phi(\alpha_i) = (I_i - \alpha_i E_i)^2  (6)
where I_i = [I_{Ri}, I_{Gi}, I_{Bi}] denotes the ith pixel value in RGB space and E_i = [\mu_{Ri}, \mu_{Gi}, \mu_{Bi}] represents the ith pixel’s expected (mean) RGB value in the MDGKT. The solution to equation (6) is an alpha value equal to the inner product of I_i and E_i, divided by the square of the Euclidean norm of E_i. Colour distortion is defined as the orthogonal distance between the observed colour and the expected colour vector. Thus, the colour distortion of the ith pixel is CD_i = \| I_i - \alpha_i E_i \|. If we balance the colour bands by rescaling the colour values by the pixel standard deviation s_i = [\sigma_{Ri}, \sigma_{Gi}, \sigma_{Bi}], the brightness and chromaticity distortion become

\alpha_i = \frac{ I_{Ri}\mu_{Ri}/\sigma_{Ri}^2 + I_{Gi}\mu_{Gi}/\sigma_{Gi}^2 + I_{Bi}\mu_{Bi}/\sigma_{Bi}^2 }{ (\mu_{Ri}/\sigma_{Ri})^2 + (\mu_{Gi}/\sigma_{Gi})^2 + (\mu_{Bi}/\sigma_{Bi})^2 }  (7)

CD_i = \sqrt{ \frac{(I_{Ri} - \alpha_i \mu_{Ri})^2}{\sigma_{Ri}^2} + \frac{(I_{Gi} - \alpha_i \mu_{Gi})^2}{\sigma_{Gi}^2} + \frac{(I_{Bi} - \alpha_i \mu_{Bi})^2}{\sigma_{Bi}^2} }  (8)
Then a pixel in the foreground segmentation (F1) may be classified as either a shadow or highlight on the true background as follows: ⎧ Shadow ⎨ ⎩Highlight
CDi < β1 and α i < 1 CDi < β1 and α i > β 2
(9)
$\beta_1$ is a selected threshold value, used to determine the similarity of the chromaticity between the MDGKT model and the current observed image. If a pixel from a moving object in the current image has a very low RGB value, this dark pixel will always be misclassified as a shadow, because the value of the dark pixel is close to the origin in RGB space and all chromaticity lines in RGB space meet at the origin. Thus a dark colour point is always considered to be close or similar to
any chromaticity line. We introduce a threshold $\beta_2$ for the normalized brightness distortion to avoid this problem. This is defined as $\beta_2 = 1/(1-\varepsilon)$, where $\varepsilon$ is a lower band for the normalized brightness distortion. An automatic threshold selection method was provided by Horprasert et al. [16].

(ii) Normalized RGB. Given three colour variables $R_i$, $G_i$ and $B_i$, the chromaticity coordinates are $r_i = R_i/(R_i+G_i+B_i)$, $g_i = G_i/(R_i+G_i+B_i)$ and $b_i = B_i/(R_i+G_i+B_i)$, where $r_i + g_i + b_i = 1$ [17], and $s_i = R_i+G_i+B_i$ is a brightness measure. Let a pixel value of the background MDGKT model be $\langle r_i, g_i, s_i\rangle$. Assume that this pixel is covered by a shadow
in frame t and let $\langle r_{ti}, g_{ti}, s_{ti}\rangle$ be the observed value for this pixel at this frame. Then, for a pixel in the foreground segmentation (F1):

$$\begin{cases}\text{Shadow} & \beta_1 < s_{ti}/s_i \le \beta_2\\ \text{Highlight} & \beta_3 < s_{ti}/s_i\end{cases} \qquad (10)$$

where $\beta_1$, $\beta_2$ and $\beta_3$ are selected threshold values used to determine the similarity of the normalized brightness between the MDGKT model and the current observed image. It is expected that, in the shadow area, the observed value $s_{ti}$ will be darker than the normal value $s_i$, up to a certain limit; on the other hand, in the highlight area, $s_{ti} > s_i$. Hence $\beta_1 > 0$, $\beta_2 \le 1$ and $\beta_3 > 1$. These thresholds may be adapted for different environments (e.g. indoor or outdoor images, or the brightness of the light source).

5.2 Working with HSV Colour Space
HSV colour space explicitly separates chromaticity and luminosity and has proven easier than RGB space for formulating shadow detection mathematically [8][9]. HSV space is more closely related to the human visual system than RGB and is more sensitive to the brightness changes caused by shadows. For each pixel in F1, which has initially been segmented as foreground, we check whether it is a shadow on the background according to the following consideration: if a shadow is cast on a background, the hue and saturation components change, but only within a certain limit. The difference in saturation is an absolute difference, while the difference in hue is an angular difference.

$$\begin{cases}\text{Shadow} & \beta_1 < V_{Ii}/V_{Bi} < \beta_2 \ \text{and}\ |H_{Ii}-H_{Bi}| < \tau_H \ \text{and}\ |S_{Ii}-S_{Bi}| < \tau_S\\ \text{Highlight} & V_{Ii}/V_{Bi} > \beta_3 \ \text{and}\ |H_{Ii}-H_{Bi}| < \tau_H \ \text{and}\ |S_{Ii}-S_{Bi}| < \tau_S\end{cases} \qquad (11)$$

with $0 < \beta_1, \beta_2, \tau_H, \tau_S < 1$ and $\beta_3 > 1$. Intuitively, this means that a shadow darkens a covered point, and a highlight brightens a covered point, but only within a certain range. Prati et al. [8] state that the shadow often has a lower saturation; from our experimental results, we see that the shadow sometimes has a higher saturation than the background. However, a shadow or highlight cast on a background does not change its hue and saturation as significantly as its intensity.
5.3 Working with YCbCr and Lab Colour Spaces
We now consider the luminance and chrominance (YCbCr) colour space to remove shadows from the results of background subtraction. If a shadow is cast on a background, the shadow darkens a point of the MDGKT model. The luminance distortion is $\alpha_i = Y_{Ii}/Y_{Bi} < 1$, and the chrominance component difference is $CH_i = (|Cb_i^I - Cb_i^B| + |Cr_i^I - Cr_i^B|)/2 < \beta_1$, where $Y_{Ii}, Cb_i^I, Cr_i^I$ and $Y_{Bi}, Cb_i^B, Cr_i^B$ are the Y, Cb, Cr components in the current image and in the MDGKT model respectively. A pixel in F1 is classified as follows:

$$\begin{cases}\text{Shadow} & \alpha_i < 1 \ \text{and}\ CH_i < \beta_1\\ \text{Highlight} & \alpha_i > \beta_2 \ \text{and}\ CH_i < \beta_1\end{cases} \qquad (12)$$

where $\beta_1 < 1$ and $\beta_2 > 1$. A similar criterion is used for shadow removal in Lab space.
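As an illustration of the RGB-space rule of Eqs. (6)-(9), the following sketch computes the brightness and chromaticity distortions per pixel with NumPy; the function name and the threshold values beta1 and beta2 are assumptions for illustration, not values taken from the paper.

    import numpy as np

    def classify_shadow_rgb(I, mu, sigma, beta1=0.2, beta2=1.25):
        """Label pixels as shadow/highlight using Eqs. (7)-(9).
        I, mu, sigma: HxWx3 arrays (current frame, per-pixel background
        mean and std from the background model)."""
        eps = 1e-6
        num = np.sum(I * mu / (sigma**2 + eps), axis=2)        # Eq. (7) numerator
        den = np.sum((mu / (sigma + eps))**2, axis=2) + eps    # Eq. (7) denominator
        alpha = num / den                                      # brightness distortion
        cd = np.sqrt(np.sum(((I - alpha[..., None] * mu) / (sigma + eps))**2,
                            axis=2))                           # Eq. (8)
        shadow = (cd < beta1) & (alpha < 1.0)                  # Eq. (9)
        highlight = (cd < beta1) & (alpha > beta2)
        return shadow, highlight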
6 Quantitative Evaluation

This section demonstrates the performance of the proposed algorithms on several videos of both indoor and outdoor scenes, using an image size of 320 x 240. A quantitative comparison of the two GMMs (ZHGMM and MDGKT) combined with the different shadow removal methods is presented. A set of test videos was chosen and, in order to compute the evaluation metrics, ground truth is necessary for each frame. We obtained this ground truth by manually classifying points in the images as foreground, background and shadow regions. We prepared 41 ground truth frames for a 'walking people' sequence and 26 for a 'moving car' sequence. Sample frames of each sequence and their ground truth mark-up are given in Fig. 3. All shadow removal methods in the five colour spaces, using both GMMs, have been fully implemented. Quantitative results for the true positive rate (TPR) and specificity (SPC) metrics are reported in Table 1.
(a) Walking    (b) Moving car
Fig. 3. Ground truth images: the red manual mark-up shows the foreground segmentation that we are interested in, the blue mark-up shows the shadow cast by the foreground.

Table 1. Experimental quantitative results
Colour space      ZHGMM TPR  ZHGMM SPC  MDGKT TPR  MDGKT SPC
RGB               0.8548     0.9838     0.9552     0.9853
Lab               0.7165     0.9828     0.8499     0.9846
YCbCr             0.6183     0.9811     0.6748     0.9811
Normalized RGB    0.6077     0.9628     0.6356     0.9714
HSV               0.5039     0.9671     0.6327     0.9712
Fig. 4 shows sample frames 9 and 17 of the 'walking' video and frames 3 and 8 of the 'moving car' video. Each two-by-two block of images refers to the same frame of the original video. The top-left image is the original frame. The bottom-left image shows the foreground segmentation (F1) results: all coloured pixels are the foreground segmentation output of the MDGKT algorithm, while black pixels represent the modelled background. The coloured pixels are categorized as foreground object (yellow), shadow (green) or highlight (red) by our shadow removal algorithm operating in RGB colour space. The shadow and highlight pixels are then removed, followed by a post-processing binary morphology stage of dilation and erosion to remove sparse noise. This gives the final foreground segmentation, shown in the bottom-right image of each block. Finally, the top-right image in each block is a synthetic image, created by using the final foreground segmentation as a mask to extract the foreground object from the original frame and superimposing it on the background model (the mean value of each pixel). Clearly these synthetic images are largely shadow-free. The two videos in Fig. 4 are scenes with very strong shadows.
Fig. 4. Segmentation results with heavily shadowed input images
7 Conclusions

Online learning of adaptive GMMs on nonstationary distributions is an important technique for moving object segmentation. This paper has presented an improvement to an existing adaptive Gaussian mixture model, using a multi-dimensional spatio-temporal Gaussian kernel smoothing transform for background modelling in moving object segmentation applications. The model update process can robustly deal with slow lighting changes (from clear to cloudy conditions or vice versa), blurred images, camera vibration in very strong wind, and difficult environmental conditions such as rain and snow. The proposed solution significantly enhances segmentation results over a commonly used recursive GMM. We have given a comprehensive analysis of performance results in a wide range of environments and for a wide variety of colour space representations. The system has been successfully used to segment objects in both indoor and outdoor scenes, with strong shadows, light shadows and highlight reflections, and we have verified the system with a rigorous evaluation. We have found that working in standard RGB colour space provides the best results.
Acknowledgements

The authors would like to acknowledge the provision of the data sequences (the indoor scene images) from the CAVIAR project at the University of Edinburgh, as well as funding under the UK Technology Strategy Board project CLASSAC, and support from Cybula Ltd.
References

1. Zivkovic, Z., van der Heijden, F.: Recursive unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(5), 651–656 (2004)
2. Friedman, N., Russell, S.: Image segmentation in video sequences: a probabilistic approach. In: Proc. 13th Conf. on Uncertainty in Artificial Intelligence, pp. 175–181 (1997)
3. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 246–252 (1999)
4. Power, P.W., Schoonees, J.A.: Understanding background mixture models for foreground segmentation. In: Proceedings of Image and Vision Computing, New Zealand (2002)
5. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 747–757 (2000)
6. KaewTraKulPong, P., Bowden, R.: An improved adaptive background mixture model for real-time tracking with shadow detection. In: Proc. of 2nd European Workshop on Advanced Video Based Surveillance Systems, ch. 11, pp. 135–144 (2001)
7. Lee, D.-S.: Effective Gaussian mixture learning for video background subtraction. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5), 827–832 (2005)
8. Prati, A., Mikic, I., Trivedi, M.M., Cucchiara, R.: Detecting moving shadows: algorithms and evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(7), 918–923 (2003)
9. Cucchiara, R., Piccardi, M., Prati, A.: Detecting moving objects, ghosts and shadows in video streams. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(10), 1337–1342 (2003)
10. Zivkovic, Z., van der Heijden, F.: Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters 27(7), 773–780 (2006)
11. Chen, Z., Husz, Z.L., Wallace, I., Wallace, A.M.: Video object tracking based on a Chamfer distance transform. In: IEEE International Conference on Image Processing, San Antonio, Texas, USA, pp. 357–360 (2007)
12. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603–619 (2002)
13. Joachims, T.: Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), Philadelphia, Pennsylvania, USA, pp. 217–226 (2006)
14. Elgammal, A.M., Harwood, D., Davis, L.S.: Non-parametric model for background subtraction. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 751–767. Springer, Heidelberg (2000)
15. Finlayson, G., Hordley, S., Drew, M.: Removing shadows from images. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 823–836. Springer, Heidelberg (2002)
16. Horprasert, T., Harwood, D., Davis, L.S.: A statistical approach for real-time robust background subtraction and shadow detection. In: Proceedings of IEEE ICCV 1999 Frame Rate Workshop (1999)
17. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.S.: Background and foreground modelling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE 90(7), 1151–1163 (2002)
A Generalization of Moment Invariants on 2D Vector Fields to Tensor Fields of Arbitrary Order and Dimension

Max Langbein and Hans Hagen

TU Kaiserslautern
Abstract. For object recognition in tensorfields and in pointclouds, the recognition of features in the object irrespective of their rotation is an important task. Rotationally invariant features exist for 2d scalar fields and for 3d scalar fields as moments of a second order structure tensor. For higher order structure tensors iterative algorithms for computing something similar to an eigenvector-decomposition exist. In this paper, we introduce a method to compute a basis for analytical rotationally invariant moments of tensorfields of – in principle – any order and dimension and give an example using up to 4th-order structure tensors in 3d.
1 Introduction
Objects in space usually have an arbitrary rotation and position, which are the transformations that do not deform the objects. A way to recognize objects independently of these transformations is to compute a "fingerprint" of a given object that is invariant to rotation and translation and to compare it to the fingerprints of the objects you want to find. In this paper we will show a type of fingerprint that can be computed from tensor fields of arbitrary order and dimension. We will call them moments of the field because of their relationship to the moment of inertia in physics. This work was inspired by the invariants of 2D vector field moments defined in [1].
1.1 Types of Invariance We Want to Support
Our basis of invariant moments M will have certain invariance properties, which are:

– Rotational invariance (section 2)
  – only in value: M(f(x)) = M(R f(x))
  – only in domain: M(f(x)) = M(f(R^-1 x))
  – in domain and value, linked: M(f(x)) = M(R f(R^-1 x))
  – in domain and value, independent: M(f(x)) = M(R f(R*^-1 x))
– Translation invariance in the domain: M(f(x)) = M(f(x − t)) (section 3)
– Scale invariance (section 4)
  – in value: M(f(x)) = M(s f(x))
  – in domain: M(f(x)) = M(f(x/s))
  – in domain and value, linked: M(f(x)) = M(s^p f(x/s))
  – in domain and value, independent: M(f(x)) = M(t f(x/s))

M is the set of invariant moments computed from the field, R, R* are rotations, s, t ∈ ℝ, x, t ∈ ℝ^d, p ∈ ℕ, and f is a tensor field (of the same dimension as x). Rotation, translation and scale invariance can be achieved at the same time.
Fig. 1. Some operations for which invariance can be achieved (panels: f(x), Rf(R^-1 x), s f(x), f(x/t)): rotation, scaling in value and in domain, plus all combinations (R is a rotation, t, s ∈ ℝ)
1.2 Related Work
In [1] a method is described for pattern recognition in vector fields on a 2-dimensional domain using complex invariant moments; it works best for structured rectilinear grids. In [2], rotationally invariant pattern recognition is done by fitting polynomials to the data and then calculating invariants of these polynomials under certain operations; its disadvantage is sensitivity to noise. In [3] the use of integral invariants in general is studied, their stability against random noise is analyzed, and an algorithm is given for computing them for surfaces in 3D using a mixture of FFT and octrees. In [4], a higher-order structure tensor similar to our tensors is defined, together with a visualization metaphor for higher-order tensors and an algorithm corresponding to an eigenvector decomposition for higher-order tensors. The "eigenvalues" there are also rotationally invariant, but not computable analytically. In [5] a generalized trace $\mathrm{gentr}(f(u)) = \frac{3}{2\pi}\int_\Omega f(u)\,du$ is defined, where $\Omega$ is the unit sphere. If $f(u) = T_{i_1\ldots i_n} u_{i_1}\cdots u_{i_n}$, it can be expressed with the total traces we will define later.
2 Rotational Invariance

2.1 Definitions
f is always a tensor-valued function with limited support, i.e. a finite scalar, vector or matrix field. We define the tensor-valued functional ᵐA(f) as:
$${}^{m}\!A(f) := \int_{\mathbb{R}^d} \underbrace{x \otimes \cdots \otimes x}_{m} \otimes f(x)\,dV, \qquad x \in \mathbb{R}^d,\ dV := dx_1\cdots dx_d$$

which explicitly reads:

$${}^{m}\!A(f)_{i_1\ldots i_m j_1\ldots j_n} = \int_{\mathbb{R}^d} x_{i_1}\cdots x_{i_m}\, f_{j_1\ldots j_n}(x_1,\ldots,x_d)\,dx_1\cdots dx_d\,.$$
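As a rough illustration, the definition above can be discretized by replacing the integral with a sum over the sample points of the field; the sketch below does this with NumPy for a scalar field (the function name and the assumption of unit cell volume are ours; for a tensor-valued field, f would carry additional value indices appended after the m domain indices).

    import numpy as np
    from itertools import product

    def moment_tensor(f, coords, m):
        """Discrete approximation of the order-m moment tensor  mA(f)
        for a scalar field: 'coords' is an N x d array of sample points,
        'f' the corresponding N field values; the integral is replaced
        by a plain sum (unit cell volume assumed)."""
        N, d = coords.shape
        A = np.zeros((d,) * m)
        for idx in product(range(d), repeat=m):
            # x_{i1} * ... * x_{im} summed against the field values
            A[idx] = np.sum(np.prod(coords[:, list(idx)], axis=1) * f)
        return A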
The tensor rotation R(T) is defined by:

$$R(T)_{j_1\ldots j_n} = \sum_{i_1\ldots i_n}\Big(\prod_{k=1}^{n} R_{i_k j_k}\Big) T_{i_1\ldots i_n}$$

where $R_{ij}$ is the rotation matrix. In the following, we use the convention:

$$\sum_{(a,b)} T_{j_1\ldots j_n} := \sum_{i}\big(T_{j_1\ldots j_n},\ j_a = j_b = i\big)$$

$$\sum_{(a,b)(c,d)} T_{j_1\ldots j_n} := \sum_{ij}\big(T_{j_1\ldots j_n},\ j_a = j_b = i,\ j_c = j_d = j\big)$$

etc. We will call those sums traces, and total traces if the result is of 0th order. Examples: $\sum_{(1,3)} T_{j_1 j_2 j_3 j_4} = \sum_i T_{i j_2 i j_4}$, $\sum_{(1,3)(2,4)} T_{j_1 j_2 j_3 j_4} = \sum_{ik} T_{ikik}$.
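A total trace over a given set of index pairs maps directly onto a tensor contraction; the sketch below is one way to evaluate it with NumPy's einsum, using zero-based index pairs as in Table 1 (the function name is ours).

    import numpy as np

    def total_trace(T, pairs):
        """Total trace of tensor T over the given (zero-based) index pairs,
        e.g. pairs=[(0, 2), (1, 3)] for a 4th-order tensor."""
        labels = list(range(T.ndim))
        for a, b in pairs:           # give paired indices the same einsum label
            labels[b] = labels[a]
        return np.einsum(T, labels)  # every label appears twice -> full contraction

For example, total_trace(T, [(0, 2), (1, 3)]) evaluates the sum of T[i, k, i, k] over i and k, i.e. the second example above.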
2.2 Rotational Invariance: M(Rf(R^-1 x)) = M(f(x))
With Lemma 1 below we show that $\sum_{(a,b)} R(T) = R(\sum_{(a,b)} T)$. By induction one sees that all total traces are rotationally invariant. We also observe that ᵐA(R(f(R^-1 x))) = R(ᵐA(f)), i.e. the rotation of the field f is equivalent to the rotation of the tensor ᵐA (Lemma 2). So, all total traces of ᵐA are rotationally invariant moments of the field f.

Other types of rotational invariance:
– rotation-invariant independently in value and domain: leave out total traces that have pairs between value (the i's) and domain (the j's) indices of the tensor ᵐA.
– rotation-invariant only in value: only take traces between the value indices.
– rotation-invariant only in domain: only take traces between the domain indices.
Lemma 1 ($\sum_{(a,b)} R(T) = R(\sum_{(a,b)} T)$). W.l.o.g. let a = 1, b = 2. Then

$$\Big(\sum_{(1,2)} R(T)\Big)_{j_3\ldots j_n} = \sum_{j_1=j_2=k=1}^{d}\ \sum_{i_1\ldots i_n=1}^{d}\Big(\prod_{l=1}^{n} R_{i_l j_l}\Big) T_{i_1\ldots i_n} = \sum_{i_3\ldots i_n=1}^{d}\ \sum_{i_1,i_2=1}^{d}\Big(\sum_{k=1}^{d} R_{i_1 k} R_{i_2 k}\Big)\Big(\prod_{l=3}^{n} R_{i_l j_l}\Big) T_{i_1\ldots i_n}$$

Using the orthonormality of R, $\sum_{k=1}^{d} R_{i_1 k} R_{i_2 k} = \begin{cases}1 & i_1 = i_2\\ 0 & i_1 \neq i_2\end{cases}$, this becomes

$$= \sum_{i_3\ldots i_n=1}^{d}\Big(\prod_{l=3}^{n} R_{i_l j_l}\Big)\sum_{i_1=i_2=m=1}^{d} T_{i_1\ldots i_n} = \Big(R\Big(\sum_{(1,2)} T\Big)\Big)_{j_3\ldots j_n}$$
Lemma 2 (${}^m\!A(R(f(R^{-1}x))) = R({}^m\!A(f))$).

$${}^m\!A(R(f(R^{-1}x))) = \int_{\mathbb{R}^d} x \otimes\cdots\otimes x \otimes R(f(R^{-1}x))\,dx_1\cdots dx_d \overset{x=Ry}{=} \int_{\mathbb{R}^d} Ry \otimes\cdots\otimes Ry \otimes R(f(y))\,dy_1\cdots dy_d$$

$$\Rightarrow\ \big({}^m\!A(R(f(R^{-1}x)))\big)_{i_1\ldots i_m j_1\ldots j_n} = \int_{\mathbb{R}^d}\prod_{k=1}^{m}\Big(\sum_{\alpha_k=1}^{d} R_{\alpha_k i_k}\, y_{\alpha_k}\Big)\sum_{\beta_1\ldots\beta_n}\Big(\prod_{l=1}^{n} R_{\beta_l j_l}\Big) f_{\beta_1\ldots\beta_n}(y)\,dy_1\cdots dy_d$$

$$= \sum_{\alpha_1\ldots\alpha_m}\sum_{\beta_1\ldots\beta_n}\Big(\prod_{k=1}^{m} R_{\alpha_k i_k}\Big)\Big(\prod_{l=1}^{n} R_{\beta_l j_l}\Big)\int_{\mathbb{R}^d}\Big(\prod_{k=1}^{m} y_{\alpha_k}\Big) f_{\beta_1\ldots\beta_n}(y)\,dy_1\cdots dy_d = \big(R({}^m\!A(f))\big)_{i_1\ldots i_m j_1\ldots j_n}$$
2.3 A Basis for the Rotationally Invariant Moments
We want to find a set of total traces of the tensor set A that forms a basis for all rotationally invariant moments of A. Let M(d) be the number of free parameters of a rotation in dimension d (M(3) = 3, M(2) = 1), and let L be the number of independent components of A (a symmetric 3x3 matrix, e.g., has 6 independent components). Then there will be maximally L − M(d) independent moments (if the set contains tensors of order higher than 1) that should be part of the basis. We use total traces for these functions. If you want a complete basis for the moments of one tensor T, you first take all traces of T, then those of T ⊗ T, then of T ⊗ T ⊗ T, and so on until you have enough independent moments.

Getting a set of independent moments: If the moments are dependent, then one moment M_1 is a function of the other moments, M_1(a) = f(M_2(a), ..., M_n(a)), where a is the vector with the independent components of the tensor set A. You see then that

$$\frac{\partial M_1(a)}{\partial a} = \frac{d\, f(M_2(a),\ldots,M_n(a))}{da} = \sum_{i=2}^{n}\frac{\partial f}{\partial M_i}\frac{\partial M_i}{\partial a}$$
So if the moments are dependent, their derivatives are linearly dependent. To test the linear dependence of the derivatives, it is sufficient to test the dependence of their values at a random test vector. It is then possible that independent moments will be recognized as dependent, but this is not very harmful because we have enough candidates. To get a complete set of independent moments, we apply the following algorithm (a code sketch is given below):

1. Create a random test tensor at which the derivatives will be evaluated.
2. Create a matrix with as many columns as independent tensor components and initially zero rows.
3. For every candidate (a total trace of a tensor product of the tensors in the tensor set) do:
   – Evaluate its first derivative at the test tensor.
   – Append the derivative as the lowest row of the matrix and zero it out by adding multiples of the upper rows (Gaussian elimination).
   – If that row is now not zero, the moment is linearly independent of the ones already in the basis and we add it to the basis.
   – If the basis is complete (enough entries): stop.

Remark 1. The calculations have to be done in exact arithmetic (i.e. using fractions of arbitrary-precision numbers).

Reducing the number of candidates for moments in the basis: The number of possibilities for total traces of a tensor of order n is $(n-1)\cdot(n-3)\cdots 3\cdot 1 = n!/(2^{n/2}(n/2)!)$. The number stems from the following: pair the first index with one of the n−1 remaining ones, then take the first remaining unpaired index and pair it with one of the now n−3 remaining ones, and so on. For a tensor with 12 indices, e.g., there are 10395 possibilities.
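The following sketch of the greedy independence test assumes that each candidate moment is available as a callable returning its exact rational gradient at the test tensor; it keeps the accepted rows in reduced row-echelon form using Python fractions, in line with Remark 1 (all names are ours).

    from fractions import Fraction

    def build_basis(candidates, n_needed, test_vector):
        """Greedy selection of independent moments. Each element of
        'candidates' is a callable grad(a) returning the gradient of that
        candidate moment at 'a' as a list of Fractions (one entry per
        independent tensor component). Returns the indices selected."""
        rows, pivots, basis = [], [], []
        for ci, grad in enumerate(candidates):
            row = [Fraction(x) for x in grad(test_vector)]
            # reduce against the rows already accepted (their pivots are 1)
            for prow, pcol in zip(rows, pivots):
                if row[pcol] != 0:
                    f = row[pcol]
                    row = [r - f * p for r, p in zip(row, prow)]
            nz = [i for i, v in enumerate(row) if v != 0]
            if not nz:
                continue                 # dependent candidate, skip it
            pcol = nz[0]
            row = [v / row[pcol] for v in row]   # normalize pivot to 1
            # keep the stored rows reduced against the new pivot column
            for k in range(len(rows)):
                if rows[k][pcol] != 0:
                    f = rows[k][pcol]
                    rows[k] = [p - f * r for p, r in zip(rows[k], row)]
            rows.append(row); pivots.append(pcol); basis.append(ci)
            if len(basis) == n_needed:
                break
        return basis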
You can greatly reduce this number by taking the symmetry of the tensors into account: one observes that total traces of tensors which are symmetric in certain groups of indices, and that have the same number of connections between these groups, are equal. Examples for that number: let $T_{ijk}$ be a totally symmetric tensor. Then T ⊗ T has two groups of indices in which it is totally symmetric (the first three and the last three indices). The number of connections for $\sum_{(1,2)(3,4)(5,6)} T\otimes T$ is one (the pair (3,4)), while for $\sum_{(1,4)(2,5)(3,6)} T\otimes T$ it is three (all pairs). You can exploit this and take only one total trace for every number of connections between groups of totally symmetric indices in the tensor. If you have tensors that are tensor products, you also do not take total traces which leave two tensors of the tensor product unconnected, because these are computable from traces of lower tensor-product orders.

Example: Let $T_{ij}$ be a totally symmetric tensor. Then $\sum_{(1,2)(3,4)} T\otimes T = (\sum_{(1,2)} T)^2$.

3D example (vector field folded with the kernels 1, x, x ⊗ x): ²A = $\int_{\mathbb{R}^3} x\otimes x\otimes f(x)\,dV$, ¹A = $\int_{\mathbb{R}^3} x\otimes f(x)\,dV$, ⁰A = $\int_{\mathbb{R}^3} f(x)\,dV$. The number of components in A is 6·3 + 9 + 3 = 30, and the number of independent moments is 30 − 3 = 27. The moments are calculated as total traces from ⁰A ⊗ ⁰A, ¹A, ¹A ⊗ ¹A, ¹A ⊗ ¹A ⊗ ¹A, ⁰A ⊗ ²A, ²A ⊗ ²A.
3D example (scalar field folded with kernels up to order 4): ⁴A = $\int_{\mathbb{R}^3} x\otimes x\otimes x\otimes x\otimes f(x)\,dV$, ³A = $\int_{\mathbb{R}^3} x\otimes x\otimes x\otimes f(x)\,dV$, ²A = $\int_{\mathbb{R}^3} x\otimes x\otimes f(x)\,dV$, ¹A = $\int_{\mathbb{R}^3} x\otimes f(x)\,dV$, ⁰A = $\int_{\mathbb{R}^3} f(x)\,dV$. The number of A-components is 1 + 3 + 6 + 10 + 15 = 35. After translating the tensor set A to its gravity center (see section 3), ¹A becomes zero and can be ignored; we also do not worry about ⁰A, because it is already rotationally invariant. The remaining number of independent moments is then 35 − 4 = 31. The entries of the moment basis are calculated as total traces of ²A, ²A ⊗ ²A, ²A ⊗ ²A ⊗ ²A, ³A ⊗ ³A, ..., ³A ⊗ ··· ⊗ ³A, ⁴A, ..., ⁴A ⊗ ··· ⊗ ⁴A, ²A ⊗ ⁴A, ..., ²A ⊗ ³A ⊗ ³A, ..., ³A ⊗ ³A ⊗ ⁴A (see Table 1). These moments are also invariant against mirroring of the field. If you want to distinguish between mirrored versions of the field f and the original ones, you have to include $\sum_{(1,10)(2,13)(3,16)(4,11)(5,14)(6,17)(7,12)(8,15)(9,18)}$ ³A ⊗ ³A ⊗ ³A ⊗ ε ⊗ ε ⊗ ε in the basis, where ε is the totally antisymmetric tensor of 3rd order.
Number of independent moments for 3D tensor fields: We first look at the scalar-field case, where the tensors ⁿA are totally symmetric. The number of components of a totally symmetric tensor is equivalent to the number of different words (= index sets) over an alphabet with d characters (= the different index values), not counting permuted versions of the words. This is known as the word problem, with the number $\binom{n+d-1}{d-1}$. So the number of components of ⁿA in 3D is $\binom{n+2}{2}$. The total number of components L of the tensor set A is then

$$L = \sum_{n=0}^{N}\binom{n+2}{2} = \sum_{n=0}^{N}\frac{(n+1)(n+2)}{2} = \frac{1}{6}N^3 + N^2 + \frac{11}{6}N + 1$$

with N = the maximum number of indices in the tensors.
For non-scalar tensor fields f(x), you have to multiply L by the number of independent components of f (e.g. 3 for a vector field, 6 for a stress field). If you want the number of rotationally invariant moments computed from A, you have to subtract from the number of components of A the dimension of the rotation group SO(3), which is three (or two if the maximum tensor order is one).

N                          0   1   2    3    4    5    6
L(N)                       1   4  10   20   35   56   84
# moments, scalar fields   1   2   7   17   32   53   81
# moments, vector fields   1   4  27   57  102  165  249
# moments, stress fields   3  21  57  117  207  333  501
3 Translation Invariance: M(f(x)) = M(f(x − t))
General principle: For translation invariance, one always computes a vector c from the field which moves in the same way the field f(x − t) moves (so c + t is constant). Then one computes the moments M in a coordinate system with origin at −c. For scalar fields, a simple choice for −c is the center of gravity, which can be computed from A as

$$\frac{{}^1\!A}{{}^0\!A} = \frac{\int_{\mathbb{R}^d} x\, f(x)\,dV}{\int_{\mathbb{R}^d} f(x)\,dV}$$

The translated tensors are computed from A and c for scalar fields as follows:

$${}^0\tilde A = {}^0\!A,\qquad {}^1\tilde A = {}^1\!A + c\,{}^0\!A,\qquad {}^2\tilde A = {}^2\!A + c\otimes{}^1\!A + {}^1\!A\otimes c + c\otimes c\,{}^0\!A$$
In the general case, where f(x) can be a tensor field and one has x-powers higher than $(x\otimes)^2$, it is more complicated: you have to express

$$\int_{\mathbb{R}^d}\underbrace{(x+c)\otimes\cdots\otimes(x+c)}_{n}\otimes f(x)\,dV$$

as a combination of A and c. The tensors ⁿÃ of the translated tensor set Ã are then computed as follows:

$${}^n\tilde A = \sum_{k=0}^{n}\binom{n}{k}\, B(k,n)$$

with B(k, n) = C(k, n) symmetrized in the first n indices and

$$C(k,n) = \underbrace{c\otimes\cdots\otimes c}_{n-k}\otimes\,{}^{k}\!A$$

To be more efficient, we look at it componentwise and go to a specific dimension:
Translation in 3D: We use the following notation in 3-dimensional space: $a_{ijk}$ denotes a component of the totally symmetric tensor ${}^{i+j+k}\!A$ which has i indices equal to 1, j indices equal to 2, and k indices equal to 3. The translation of this tensor set is computed by

$$\tilde a_{IJK} = \sum_{i=0}^{I}\sum_{j=0}^{J}\sum_{k=0}^{K}\binom{I}{i}\binom{J}{j}\binom{K}{k}\, a_{ijk}\, d_{I-i,\,J-j,\,K-k}$$

where $d_{ijk} := c_1^i c_2^j c_3^k$ denotes the components of a tensor constructed from the translation vector c. The formula above results from

$$\tilde a_{IJK} = \int_{\mathbb{R}^3}(x_1+c_1)^I (x_2+c_2)^J (x_3+c_3)^K\, f(x)\,dx_1 dx_2 dx_3$$
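A direct transcription of this formula might look as follows; the dictionary representation of the moment components and the function name are assumptions, and the input must contain every index triple up to the maximum order.

    from math import comb

    def translate_moments(a, c):
        """Translate the scalar-field moment set using the a_{IJK} formula:
        'a' maps index triples (i, j, k) to the component a_{ijk} of the
        totally symmetric tensor of order i+j+k (all triples up to the
        maximum order must be present); 'c' is the translation vector."""
        out = {}
        for (I, J, K) in a:
            s = 0.0
            for i in range(I + 1):
                for j in range(J + 1):
                    for k in range(K + 1):
                        d = c[0]**(I - i) * c[1]**(J - j) * c[2]**(K - k)
                        s += (comb(I, i) * comb(J, j) * comb(K, k)
                              * a[(i, j, k)] * d)
            out[(I, J, K)] = s
        return out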
By implementing this efficiently we get the following operation counts:

Maximum order (N)           2    3    4    5     6
Number of multiplications  27  101  287  689  1467

Vector fields f and alternative choices for −c: One does not always want to center the coordinate system at the center of gravity, and for higher-order fields a center of gravity is not defined, so other choices of −c are needed. One choice for the translation vector −c is the gravity center of the vector lengths,

$$\frac{\int_{\mathbb{R}^d} x\,\|f(x)\|\,dV}{\int_{\mathbb{R}^d}\|f(x)\|\,dV}\,.$$
4 Scale Invariance
Scale Invariance
In this section we will describe how to make the rotation-invariant moments described above scale-invariant. 4.1
Scale Invariance in Value: M (f (x)) = M (sf (x))
For the scale invariance , you have to look in which order the scaling factor s appears in the moments.
This s-order is just the number of A-tensors from which M is calculated ( (1,4)(2,5)(3,6) 3A ⊗ 3A for example has an s-order of 2). We denote it ov (order in value). Let M (k) denote the subset of moments with ov = k. Let ck := ||M (k)||∞ = maxi |M (k)i | denote a norm for these moments. You want to find a factor s so that for every set of moments ck s−k ≤ 1 and for one set ck s−k = 1. Then multiply M (k) by s−k . The resulting moments are scale √ invariant. Let l be the order of the set with cl s−l = 1 so s = l cl . A consequence √ √ √ √ is ck / l cl k ≤ 1 ⇒ k ck ≤ l cl . So s will be set to maxk k ck .
A Generalization of Moment Invariants on 2D Vector Fields to Tensor Fields
1159
Table 1. This table represents a basis for the rotation-invariant moments of the 3D tensor set { 2A, 3A, 4A} which are used for the moments of a scalarfield invariant to scale of value, rotation and translation. Every row in this tables corresponds to an invariant moment. On the left side, the tensor is written from which the total traces are taken. On the right, the index pairs are written which are summed over. The index indices start with 0 (are zero-based). 2A 2A ⊗ 2A 2A ⊗ 2A ⊗ 2A 3A ⊗ 3A
: : : :
3A ⊗ 3A ⊗ 3A ⊗ 3A :
4A : 4A ⊗ 4A : 4A ⊗ 4A ⊗ 4A :
2A ⊗ 4A : 2A ⊗ 2A ⊗ 4A : 2A ⊗ 2A ⊗ 2A ⊗
2A ⊗ 4A : 4A ⊗ 4A :
2A ⊗ 3A ⊗ 3A : 2A ⊗ 2A ⊗ 3A ⊗ 3A : 3A ⊗ 3A ⊗ 4A :
4.2
(0 (0 (0 (0 (0 (0 (0 (9 (3 (0 (0 (0 (0 (8 (4 (0 (2 (0 (4 (0 (2 (2 (0 (2 (2 (0 (4 (0
1) 2)(1 3) 2)(1 4)(3 5) 3)(1 4)(2 5) 1)(3 4)(2 5) 6)(1 9)(2 10)(3 7)(4 8)(5 11) 3)(1 6)(2 9)(4 7)(5 10)(8 11) 10)(0 3)(1 6)(2 11)(4 7)(5 8) 4)(6 7)(9 10)(0 5)(1 8)(2 11) 1)(2 3) 4)(1 5)(2 6)(3 7) 1)(4 5)(2 6)(3 7) 4)(1 5)(2 8)(3 9)(6 10)(7 11) 9)(0 4)(1 5)(2 6)(3 10)(7 11) 5)(8 9)(0 6)(1 7)(2 10)(3 11) 1)(4 5)(8 9)(2 6)(3 10)(7 11) 3)(0 4)(1 5) 4)(1 5)(2 6)(3 7) 5)(0 2)(1 6)(3 7) 6)(1 7)(2 4)(3 8)(5 9) 3)(0 6)(1 7)(4 8)(5 9) 3)(6 7)(0 4)(1 8)(5 9) 2)(1 6)(3 7)(4 8)(5 9) 3)(0 5)(1 6)(4 7) 3)(5 6)(0 4)(1 7) 2)(1 5)(3 6)(4 7) 5)(0 7)(1 8)(2 6)(3 9) 1)(2 6)(3 7)(4 8)(5 9)
4.2 Scale Invariance in Domain: M(f(x)) = M(f(x/s))
We define ᵐÃ as the A-tensor computed from the domain-scaled field:

$${}^m\tilde A := {}^m\!A(f(x/s)) = \int_{\mathbb{R}^d}(x\,\otimes)^m f(x/s)\,dx_1\ldots dx_d$$

Substitution of x with y = x/s gives

$${}^m\tilde A = \int_{\mathbb{R}^d}(sy\,\otimes)^m f(y)\,(s\,dy_1)\cdots(s\,dy_d) = s^{m+d}\,{}^m\!A$$

You see that the order of the scaling factor in ᵐÃ is the number of x-powers plus the dimension of the field. We will call that order the domain order $o_d$ of the tensor A, so $o_d({}^n\!A) = n + d$ if f is defined on $\mathbb{R}^d$. If you want to compute the domain order $o_d$ of the moments, you refer to the order of the source tensors of the moments. Example in 3D: $o_d(\sum_{(1,4)(2,5)(3,6)}$ ³A ⊗ ³A$) = o_d($³A ⊗ ³A$) = 2\,o_d($³A$) = 12$. If M(l) denotes the subset of moments with $o_d = l$, the invariant moments are $M(l)\,s^{-l}$ with $s = \max_l\sqrt[l]{\|M(l)\|_\infty}$. The proof is analogous to section 4.1.
4.3 Scale Invariance in Domain and Value: M(f(x)) = M(s^p f(x/s))
Here, the combined order $o_c = o_v + o_d$ is of importance. Example in 3D with p = 1: the trace of ³A has combined order 6 + 1 = 7, and the trace of ³A ⊗ ³A has combined order 7 · 2 = 14. Again, you find the order by looking at the order of the source tensors of the moments. If M(l) now denotes the subset of moments with $o_c = l$, the invariant moments are $M(l)\,s^{-l}$ with $s = \max_l\sqrt[l]{\|M(l)\|_\infty}$. The proof is analogous to section 4.1.
4.4 Scale Invariance Independently in Domain and Value: M(f(x)) = M(s f(x/t))
We construct the invariant moments in the following way. Let $c_{kl} := \|M(k,l)\|_\infty$ be the norms of the moment subsets with domain order $o_d = k$ and value order $o_v = l$. You construct the quotient of two norms $q_k := c_{k_1 l_1}^{\,l_2}/c_{k_2 l_2}^{\,l_1}$ with the property $k = o_d(q_k) = l_2 k_1 - l_1 k_2 > 0$ and $o_v(q_k) = l_1 l_2 - l_2 l_1 = 0$ (so the value scaling factor is eliminated). Of course, one should choose $c_{k_2 l_2} \neq 0$. Now from $s = \max_k\{\sqrt[k]{q_k}\}$ one constructs the domain-scale-invariant moments $M(k,l)\,s^{-k}$. Subsequently, you apply the method for scale invariance in value as previously described in section 4.1.

Example (scalar case with a kernel of order up to 4, 3D point-cloud setting): For this case, the moment sets $\{\sum_{(0,1)}$ ²A$\}$ with the norm $c_{21}$ and $\{\sum_{(0,1)(2,3)}$ ⁴A$\}$ with the norm $c_{41}$ are a good choice for the construction of the quotients. We then have $k_1 = 4$, $k_2 = 2$, $l_1 = l_2 = 1$, $k = 4 - 2 = 2$, $q_2 = c_{41}/c_{21}$ and $s = \sqrt{q_2} = \sqrt{c_{41}/c_{21}}$. We also have (as experiments showed) always $t = s^{-2} c_{21} = c_{21}^2/c_{41}$. The scale-invariant moments are then $M(k,l)\,t^{-l} s^{-k}$.
5 Results and Conclusions
We have presented a method to compute moments of a finite tensor field that are invariant to rotation, scale and translation. They are computed in the following steps: first a set of structure tensors is computed; then it is made invariant to translation; then the basis of rotationally invariant moments is computed from it; and finally the basis is made invariant to scale. We have also given some figures on the efficiency of the different steps. We have demonstrated the method for the case of a 3D scalar field with tensor orders up to 4 in the tensor set; the resulting moments are given in Table 1. The evaluation of one set of the corresponding polynomials in the ᵐA-components in long double precision needs 2.6 μs on an Intel Xeon 3 GHz CPU.
References

1. Schlemmer, M., Heringer, M., Morr, F., Hotz, I., Hering-Bertram, M., Garth, C., Kollmann, W., Hamann, B., Hagen, H.: Moment invariants for the analysis of 2d flow fields. IEEE Transactions on Visualization and Computer Graphics 13, 1743–1750 (2007)
2. Keren, D.: Using symbolic computation to find algebraic invariants. IEEE Trans. Pattern Anal. Mach. Intell. 16, 1143–1149 (1994)
3. Pottmann, H., Wallner, J., Huang, Q.X., Yang, Y.L.: Integral invariants for robust geometry processing. Computer Aided Geometric Design 26, 37–59 (2009)
4. Schultz, T., Weickert, J., Seidel, H.P.: A higher-order structure tensor. Research Report MPI-I-2007-4-005, Max-Planck-Institut für Informatik, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany (2007)
5. Özarslan, E., Vemuri, B.C., Mareci, T.H.: Generalized scalar measures for diffusion MRI using trace, variance, and entropy. Magnetic Resonance in Medicine 53, 866–867 (2005)
A Real-Time Road Sign Detection Using Bilateral Chinese Transform

Rachid Belaroussi and Jean-Philippe Tarel

Université Paris Est, LEPSIS, INRETS-LCPC, 58 boulevard Lefebvre, 75732 Paris, France
[email protected] [email protected] www.lcpc.fr
Abstract. We present a real-time approach for circular and polygonal road sign detection¹ in still images, regardless of their pose and orientation. Object detection is done using a pairwise gradient-based symmetry transform able to detect circles and polygons indistinctly. This symmetry transform of gradient orientation, the so-called Bilateral Chinese Transform (BCT), decomposes an object into a set of parallel contours with opposite gradients and models this gradient-field symmetry using an accumulation of radial symmetry evidence. On a test database of 89 images of size 640x480 containing 92 road signs, 79 are correctly detected (86%) with 25 false positives using the BCT approach, in about 30 ms/image.
1 Introduction
Most road sign detectors use color modelling in a connected component segmentation that is further validated by an appearance-based model (template matching, genetic algorithms, SVMs, neural networks, classifiers); these approaches are subject to over-segmentation and missed targets. The Radial Symmetry Transform (RST) was first used for road sign detection in [1]: greyscale images are used, but only 60 mph or 40 mph circular speed signs are detected. This transform suffers from several impairments: it can only detect circular shapes, and it yields a large number of false positives. Indeed, two additional steps are required: a template matching recognition stage and a temporal filter to validate consistent candidates (which is a way to replace the color information). To cope with the lack of generality of the circular shape, Loy et al. [2] developed three specific versions of the RST for octagonal, square and triangular shapes respectively. This approach lacks generality and was only tested on 15 images for each shape. More extensive results can be found in [3], where a connected component segmentation (red signs only) is further validated by a Radial Symmetry Transform: the authors found the RST quite sensitive to missing edge points and to pre-defined object scales. It also requires a template matching validation step to remove false positives.
¹ Part of the iTowns-MDCO project, funded by the Agence Nationale de la Recherche.
Fig. 1. Radial and reflectional symmetry between two gradient vectors n_i and n_j can be modelled using the (θ_i, θ_j, α_ij) parameters. In the Generalized Symmetry Transform, the symmetry magnitude is the product of two terms, Θ_radial and Θ_refl, for radial and reflectional symmetries respectively.
The difficulty with the RST is that it is a mono-variate transformation where any edge point casts its vote in several accumulators (to encompass multiple scales) independently of its neighbors. This induces a relatively high number of false positives. Introduced by Reisfeld [4], the Generalized Symmetry Transform (GST) uses an edge pairwise voting scheme particularly convenient for reflectional and radial symmetry detection. It is a kind of bi-variate Hough Transform; therefore it results in fewer false positives and is more noise tolerant than the RST. An issue raised by the GST is that it is not specific enough, as a pair of gradient vectors with no symmetry may have a significant contribution to the votes for a given point. The Chinese Transform (CT), introduced by Milgram et al. [5], avoids this impairment of the GST, as demonstrated in [6]. We propose here a variant of the latter approach, named the Bilateral Chinese Transform (BCT), which focuses on pairs of gradient vectors with a strict radial symmetry. The proposed BCT is rather general and can also detect symmetry axes. In this paper, we test our BCT approach for circular and polygonal (not triangular) road sign detection in still images. Unlike the standard Chinese Transform, the BCT makes no assumption on the light/dark or dark/light contrast of the objects to be detected. The BCT works on intensity images and can detect circular, square and polygonal shapes regardless of their pose: it is an orientation-free detector, i.e. robust to in-plane and small out-of-plane rotations. We compare the performance of the standard Chinese Transform (CT) and the Bilateral Chinese Transform (BCT) in terms of road sign detection. No prior color model is used: color information is required only to build the normalized red channel, which is then processed by a symmetry transform (CT or BCT) to detect road sign centers and their spatial extent. We also compare the BCT to a color-segmentation-based approach. This third algorithm is based on a red color model: red-like regions of interest are processed to detect road sign centers with a standard Chinese Transform. Color
is further used to segment a road sign and to define its spatial extent using the Camshift [7] algorithm. This paper is organized as follows: section 2 defines the GST, CT and BCT operators and compares them for the task of road sign segmentation; experimental results are discussed in section 3, and section 4 gives conclusions and perspectives of this work.
2 Radial and Reflectional Symmetry Transforms

2.1 The Original Approach of the Generalized Symmetry Transform
The GST is a context-free operator, detecting symmetries if used at large scale or corners at small scale; different approaches inspired by the GST (especially the idea of edge pairwise voting) are more specific to a class of symmetry or an object shape. We are interested in the symmetry magnitude defined by the GST: it is an accumulator of the contributions of all pairs of edge points. For each point P of the image, a set of voters is defined:

$$\Gamma(P) = \Big\{(i,j)\ \Big|\ \frac{P_i + P_j}{2} = P\Big\} \qquad (1)$$

The accumulator of the GST defining the symmetry magnitude at point P is:

$$\mathrm{Accu}(P) = \sum_{(i,j)\in\Gamma(P)} C(i,j) \qquad (2)$$

where C(i, j) is the contribution of a pair of points $(P_i, P_j)$ to the votes of its middle point P. C(i, j) is the product of a distance weighting function D(i, j), a phase weighting function Θ(i, j) and a function r of the gradient magnitudes:

$$C(i,j) = D(i,j)\,\Theta(i,j)\, r_i r_j \qquad (3)$$

$r_i$ can be a logarithmic function of the gradient magnitude: $r_i = \log(\|n_i\|)$. D(i, j) is a distance weighting function decreasing with $\|P_i - P_j\|$, typically:

$$D(i,j) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{\|P_i - P_j\|}{2\sigma}\Big) \qquad (4)$$
where σ depends on the spatial extent of the target. The formula for Θ(i, j) is given in Fig. 1. Several versions of the GST have been proposed, with different distance or phase weighting functions or a different feature space Γ depending on the application. The GST has been applied to face and facial feature detection [8], gait recognition and car license plate detection [9], and a noise-tolerant version using negative votes [10] has recently been proposed. In the field of reflectional symmetry, [11] proposed to cast the vote of a pair of edge points (P1, P2) in the (r, θ) space (the parameters of the line passing through the midpoint of [P1 P2]), using a Standard Hough Transform weighted by a function of the angle between the edge orientations. Their approach is used to model fingers and grasping gestures.
Fig. 2. Standard CT accumulators for a rectangle and a circular shape: both symmetry axis (horizontal and vertical) are detected, as well as the center of the circle
Fig. 3. Bilateral CT accumulator in the case of ellipses: the BCT is in-plane rotation free

2.2 The Chinese Transform CT
For the case of radial symmetry, another approach has recently been proposed, the so-called Chinese Transform (CT) [5], and successfully applied to facial feature and finger detection [6]. The Chinese Transform is related to the GST, yet it is quite different as it does not make use of an explicit reflectional symmetry formulation. The main distinction between these two transforms is the phase weighting function Θ(i, j). Like the GST, the feature space of the CT is the image, and a pair of edge pixels votes for its middle point. The difference lies in the conditions required to form a pair. A special case of the CT is the Fast Circle Detection (FCD) [12] for iris detection. The CT, like the FCD, uses the assumption that the object to detect is bright on a dark background, so that the pairs of edge points to consider are reduced to couples (P1, P2) with gradients pointing at each other. However, this assumption of convergent gradients is verified in the case of the iris or eyes (generally), but not in the case of road sign detection. The Chinese Transform only makes use of a radial symmetry term, as the FCD does, but with a condition on the alignment of the gradient vectors taking into account possible reflectional symmetry. As for the GST, a scale parameter defines the area of influence of each pixel.

$$\Theta(i,j) = \underbrace{W_\beta(\theta_i - \alpha_{ij})}_{\text{Restricted alignment}}\ \underbrace{W_\delta(|\theta_i - \theta_j| - \pi)}_{\text{Radial symmetry}} \qquad (5)$$

where $W_R$ is a top-hat function of width R:

$$W_R(x) = 1 \ \text{if}\ |x| < R,\qquad 0\ \text{elsewhere} \qquad (6)$$
δ is a small quantity ensuring that |θ_i − θ_j| = π holds numerically, so that only edge points with opposite directions (radial symmetry term) cast their vote for their
Fig. 4. Bilateral Chinese Transform. (a) In the case of a dark object on a clearer background, a pair of edge points with opposite (and diverging from each other) gradient orientation (ni , nj ) results in a negative contribution in the accumulator. (b) A pair of edge points with a radially symmetric gradient orientation (ni , nj ) pointing towards each others results in a positive vote for their middle point (bright/dark rectangle).
middle point. The first term of the phase weight is actually a condition on the alignment of $P_iP_j$ and $n_i$. For β = 0 the CT is equivalent to the FCD algorithm. As shown in Fig. 2, the CT can model both circular and polygonal shapes. However, it supposes that the object to detect has light/dark contrast. An overview of the image processing used to feed the CT symmetry transform is given in Fig. 5. The spatial extent of objects with high symmetry magnitude makes use of a second accumulator, as explained in the next section.
2.3 Bilateral Chinese Transform BCT
To take into account the case of objects with divergent edge gradients (see Fig. 3 and Fig. 4), a modification of the phase weighting function of the CT is proposed:

$$\Theta(i,j) = \Big(\underbrace{W_\beta(\theta_i - \alpha_{ij})}_{\text{light/dark}} - \underbrace{W_\beta(\theta_i - \alpha_{ij} - \pi)}_{\text{dark/light}}\Big)\times\underbrace{W_\delta(|\theta_i - \theta_j| - \pi)}_{\text{Radial symmetry}} \qquad (7)$$
where δ is the precision with which the assertion |θ_i − θ_j| = π is numerically verified: δ = 2π/N if the gradient orientation is quantized into N values. $\alpha_{ij}$ is the orientation of the vector $P_iP_j$ with respect to the horizontal axis (see the convention in Fig. 1). β is a limit angle between $n_i$ and $P_iP_j$: it is the tolerance on the alignment of the gradient vectors $n_i$ and $n_j$, and it delineates the region of influence of the point $P_i$, as shown in Fig. 4. A small value of β would fit a strictly circular
Fig. 5. Overview diagram (CT or BCT). The color image is converted in the red normalized channel which is further processed by the symmetry transform (BCT or CT) to detect symmetrical objects (both center and spatial extent).
Fig. 6. CT+Camshift processing. A red color model is built in the Hue-Saturation plane of HSV. It is backprojected onto the color image: grey level of red-like area processed by CT for object centers detection and a Camshift segmentation is performed.
shape, whereas β = π/2 can be used for the more general case, yet it increases the number of voters and thus the processing load of the algorithm and the number of false positives. Regarding the distance weighting function, we find it more convenient and faster to use thresholds on the distance between points:

$$D(i,j) = W_{R_{max}}(\|P_i - P_j\|) - W_{R_{min}}(\|P_i - P_j\|) \qquad (8)$$

where $R_{min}$ and $R_{max}$ are respectively the minimum and maximum width of a road sign; that is to say, we try to detect objects within the size range $[R_{min}, R_{max}]$. The accumulator is incremented as a percentage of the product of Θ(i, j) and of the gradient magnitude functions $r_i = \log(1 + \|n_i\|)$:

$$\mathrm{Accu}(P) = \sum_{(i,j)\in\Gamma(P)} D(i,j)\,\Theta(i,j)\, r_i r_j \qquad (9)$$
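A naive, unoptimized sketch of this voting scheme (Eqs. (7)-(9)) is given below; it is quadratic in the number of edge points, and the threshold values are illustrative, not those used in the paper.

    import numpy as np

    def bct_accumulate(gray, r_min=10, r_max=60, beta=np.pi/8, delta=np.pi/8,
                       mag_thresh=30.0):
        """Pairs of edge points with (near-)opposite gradients and restricted
        alignment vote +/- r_i r_j for their midpoint."""
        gy, gx = np.gradient(gray.astype(np.float64))
        mag = np.hypot(gx, gy)
        theta = np.arctan2(gy, gx)
        ys, xs = np.nonzero(mag > mag_thresh)          # edge points
        acc = np.zeros_like(gray, dtype=np.float64)
        r = np.log1p(mag)
        for a in range(len(xs)):
            for b in range(a + 1, len(xs)):
                dx, dy = xs[b] - xs[a], ys[b] - ys[a]
                dist = np.hypot(dx, dy)
                if not (r_min < dist < r_max):
                    continue
                # radial symmetry: gradients (anti)parallel
                dth = np.abs(np.abs(theta[ys[a], xs[a]] - theta[ys[b], xs[b]]) - np.pi)
                if dth > delta:
                    continue
                alpha = np.arctan2(dy, dx)             # orientation of P_i P_j
                d1 = np.abs(np.angle(np.exp(1j * (theta[ys[a], xs[a]] - alpha))))
                sign = 0
                if d1 < beta:                          # gradients point at each other
                    sign = +1
                elif d1 > np.pi - beta:                # gradients point away
                    sign = -1
                if sign:
                    my, mx = (ys[a] + ys[b]) // 2, (xs[a] + xs[b]) // 2
                    acc[my, mx] += sign * r[ys[a], xs[a]] * r[ys[b], xs[b]]
        return acc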
Fig. 7. Bilateral Chinese Transform in cases of polygonal shapes: source image and accumulator. Light/dark object edges (top) results in a positive contribution to the accumulator; the dark/light polygon at bottom has negative votes in the accumulator. The case of out-of-plane rotation is illustrated on the right with a diamond (top) and a rectangle (bottom).
For the case of polygonal shapes with four or more sides, we use β = π/8 in our experiments, the gradient orientation being quantized into 8 directions (Freeman code). As shown in Fig. 5, the BCT has two output accumulators, one for the centers of objects and the other recording their radius. The center of a target can be extracted from the accumulator of Eq. (9). To define the corresponding spatial extent, another accumulator, registering the distances $\|P_i - P_j\|$ between voters, is built during the voting process:

$$\mathrm{Radius}(P) = \frac{1}{N}\sum_{(i,j)\in\Gamma(P)}\|P_i - P_j\|/2 \qquad (10)$$
with N = Card(Γ (P )). The spatial extent is then an arithmetic average of the set of voters, and therefore the scale is not discretized. That is an advantage over methods like [1] using a multi-scale approach (one accumulator per scale).
3 Experimental Results

3.1 A Pose-Free Detector for Circles and Polygons
The BCT approach dispenses with the bright/dark contrast assumption made in the CT by enabling the object to be detected to have either dark/bright or bright/dark contrast. Furthermore, as mentioned in [10], the use of a signed accumulator strengthens the BCT by making it noise tolerant. Figure 3 illustrates the signed accumulators obtained with the BCT over two elliptical shapes: it is worth noticing that the BCT is also orientation free. This can especially be seen in Figure 7, illustrating the case of polygons.
Fig. 8. Road sign segmentation examples using the Bilateral Chinese Transform
3.2 Performances
We used a database of 89 images of streets of Paris with cluttered backgrounds (see Fig. 8), containing 81 circular red signs and a total of 92 road signs (11 blue or yellow signs, rectangular or circular). Both the CT and the BCT algorithms were tested under the same conditions, using the same image processing steps, as illustrated in Fig. 5. Compared to the BCT algorithm, the CT algorithm is less efficient, as can be seen in the curves of Fig. 9. Figure 9(a) plots the correct detection rate (CDR) versus the false detection rate (FDR), and Figure 9(b) plots the Dice coefficients:

$$\mathrm{CDR} = \frac{TP}{P}\qquad\quad \mathrm{FDR} = \frac{FP}{\text{Number of images}}\qquad\quad \text{Dice coeff.} = \frac{2\,TP}{TP + FP + P}$$

where TP = number of true positives, FP = number of false positives and P = total number of signs. Table 1 indicates the performance of the algorithms at a given point of the ROC curves: the BCT is able to detect 79 signs out of 92 with 25 false positives, whereas the standard CT detects only 69 road signs with 24 false positives. As the normalized red channel is processed, red signs are likely to have light/dark contrast and can be detected by a standard CT. Road signs of other colors, especially blue signs, are likely to have dark/light contrast: most of them are detected by the BCT, unlike the standard CT. The Dice curve reaches a maximum value of 82% for the BCT compared to 77% for the CT: with a difference of 5% in the Dice coefficient, the BCT significantly improves on the CT algorithm. We also compared the BCT with a color-model-based approach, illustrated in Fig. 6: a red 2D histogram in the Hue-Saturation plane is built from an external database. The red color model is backprojected to find regions of interest (ROI). The gradient field of these ROI is computed and feeds the input of a standard Chinese
Fig. 9. Comparison of the performances of Bilateral CT (black) and Standard CT (yellow), over a test set of 89 images with 92 signs. (a) ROC curves plotting the Correct detection rate (%) versus the False positive rate. (b) Dice coefficient (%) versus accumulators threshold.
Transform. The spatial extent of objects with high symmetry magnitude is further segmented using Camshift [7]. The CT+Camshift algorithm is faster but misses more road signs than the BCT, as reported in Table 1. Among the 81 red signs, 75% are detected with 22 false positives, in about 10 ms/frame on a standard PIV @ 1.2 GHz laptop. This illustrates the weakness of the color selection, since the red color model is built with a different camera than the one used for the test bench: many red signs do not have the colors of the H-S histogram and cannot be detected. The BCT, on the other hand, avoids the issue of color constancy by using the normalized red channel r = R/(R+G+B). This channel is less dependent on illumination conditions. Moreover, it enables us to detect signs of any color, as a blue sign is likely to appear darker than its background and a red one is usually brighter; yellow and green signs would also generally be contrasted enough to be detected in this channel.

Table 1. Road sign detection performances on an 89-image urban database
Approach                        Correctly detected  False positives  Targets  Processing time (avg)
Bilateral Chinese Transform     79 (86%)            25               92       30 ms/frame
Standard Chinese Transform      69 (75%)            24               92       30 ms/frame
CT with Camshift on red signs   61 (75%)            22               81       10 ms/frame
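For reference, the evaluation metrics defined in Section 3.2 can be computed directly from the counts reported in Table 1; a small sketch (function name ours):

    def detection_metrics(tp, fp, total_signs, n_images):
        """CDR, FDR and Dice coefficient as defined in Section 3.2."""
        cdr = tp / total_signs
        fdr = fp / n_images
        dice = 2 * tp / (tp + fp + total_signs)
        return cdr, fdr, dice

    # e.g. the BCT row of Table 1:
    # detection_metrics(79, 25, 92, 89) -> (0.859, 0.281, 0.806)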
4 Conclusion and Perspectives
We have presented a road sign segmentation system using an efficient symmetry detector, the Bilateral Chinese Transform. It can process a 640x480 image in about 30 ms, with a high detection rate: 86% with 25 false positives over a set of 89 images containing 92 road signs of different colors and shapes (except triangular signs). It performs better than the standard CT, which detects only 75% of the signs with 24 false positives. We compared it to an approach using a color segmentation step: the BCT is a bit slower, but it is more efficient and is not limited to a particular color. The BCT algorithm is more general than the RST as it can detect circles, squares, diamonds and polygons. It is also more precise than the GST because it focuses exclusively on pairs of points with radial symmetry. The next step in our approach is to add a recognition stage in order to remove the remaining false positives, which usually occur on symmetrical objects like cars, windows, logos, signboards or pedestrians. The BCT is less adequate in the case of triangular signs; indeed, the triangle center location is quite fuzzy using this approach. To cope with this issue, we developed a specific geometrical transformation for this case [13].
References

1. Barnes, N., Zelinsky, A., Fletcher, L.: Real-time speed sign detection using the radial symmetry detector. IEEE Transactions on Intelligent Transportation Systems (2008)
2. Loy, G., Barnes, N.: Fast shape-based road sign detection for a driver assistance system. In: Intelligent Robots and Systems, IROS (2004)
3. Foucher, P., Charbonnier, P., Kebbous, H.: Evaluation of a Road Sign Pre-detection System by Image Analysis. In: VISAPP (2), pp. 362–367 (2009)
4. Reisfeld, D., Wolfson, H., Yeshurun, Y.: Context Free Attentional Operators: the Generalized Symmetry Transform. Int. J. of Computer Vision (1995)
5. Milgram, M., Belaroussi, R., Prevost, L.: Multi-stage Combination of Geometric and Colorimetric Detectors for Eyes Localization. In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 1010–1017. Springer, Heidelberg (2005)
6. Belaroussi, R., Milgram, M.: A Real Time Fingers Detection by Symmetry Transform Using a Two Cameras System. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part II. LNCS, vol. 5359, pp. 703–712. Springer, Heidelberg (2008)
7. Bradsky, G.: Computer Vision Face Tracking For Use in a Perceptual User Interface. Intel Technology Journal (1998)
8. Hayfron-Acquah, J.B., Nixon, M.S., Carter, J.N.: Automatic gait recognition by symmetry analysis. Pattern Recognition Letters (2003)
9. Kim, D.S., Chien, S.I.: Automatic car license plate extraction using modified generalized symmetry transform and image warping. In: ISIE (2001)
10. Park, C.-J., Seob, K.-S., Choib, H.-M.: Symmetric polarity in generalized symmetry transformation. Pattern Recognition Letters (2007)
11. Li, W.H., Zhang, A.M., Kleeman, L.: Fast Global Reflectional Symmetry Detection for Robotic Grasping and Visual Tracking. In: ACRA (2005)
12. Rad, A.A., Faez, K., Qaragozlou, N.: Fast Circle Detection Using Gradient Pair Vectors. In: Digital Image Computing: Techniques and Applications (2003)
13. Belaroussi, R., Tarel, J.-P.: Angle Vertex and Bisector Geometric Model for Triangular Road Sign Detection. In: Workshop on Applications of Computer Vision (2009)
Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information

Mustafa Berkay Yilmaz, Hakan Erdogan, and Mustafa Unel

Sabanci University, Faculty of Engineering and Natural Sciences, Istanbul, Turkey
[email protected], {haerdogan,munel}@sabanciuniv.edu
Abstract. In this work, we propose a method which can extract critical points on a face using both location and texture information. This new approach can automatically learn feature information from training data. It finds the best facial feature locations by maximizing the joint distribution of location and texture parameters. We first introduce an independence assumption. Then, we improve upon this model by assuming dependence of location parameters but independence of texture parameters. We model combined location parameters with a multivariate Gaussian for computational reasons. The texture parameters are modeled with a Gaussian mixture model. It is shown that the new method outperforms active appearance models for the same experimental setup.
1 Introduction
Modeling flexible shapes is an important problem in vision. Usually, critical points on flexible shapes are detected and then the shape of the object can be deduced from the locations of these key points. The face can be considered a flexible object, and critical points on a face can be easily identified. In this paper, we call those critical points facial features, and our goal is to detect their locations. Facial feature extraction is an important problem that has applications in many areas such as face detection, facial expression analysis and lipreading. Approaches like Active Appearance Models (AAM) and Active Shape Models (ASM) [1] are widely used for the purpose of facial feature extraction. These are very popular methods; however, they give favorable results only if the training and test sets consist of a single person. They cannot perform as well for person-independent general models. AAM uses subspaces of location and texture parameters which are learned from training data. However, by default, this learning is not probabilistic and every point in the subspace is considered equally likely¹. This is highly unrealistic, since we believe some configurations in the subspace should be favored over others.
In some approaches, distributions of the AAM/ASM coefficients are used as a prior for the model.
In this work, we propose a new probabilistic method which is able to learn both texture and location information of facial features in a person-independent manner. The algorithm expects as input a face image, i.e., the output of a good face detection algorithm. We show that, using this method, it is possible to find the locations of facial features in a face image with lower pixel error than AAM. The rest of the paper is organized as follows: Section 2 explains our statistical model, experimental results are presented in Section 3, and Section 4 concludes the paper and proposes some future improvements.
2 Modeling Facial Features
Facial features are critical points on a human face such as lip corners, eye corners and the nose tip. Every facial feature is described by its location and texture components. Let the vector l_i = [x_i, y_i]^T denote the location of the i-th feature in a 2D image (the location vector could be three-dimensional in a 3D setup), and let t_i = t_i(l_i) be the texture vector associated with it. We use f_i = [l_i^T, t_i^T]^T to denote the overall feature vector of the i-th critical point on the face. The dimension of the location vector is 2, and the dimension of the texture vector is p for each facial feature. Define l = [l_1^T, l_2^T, ..., l_N^T]^T, t = [t_1^T, t_2^T, ..., t_N^T]^T and f = [f_1^T, f_2^T, ..., f_N^T]^T as the concatenated vectors of location, texture and combined parameters, respectively. Our goal is to find the best facial feature locations by maximizing the joint distribution of locations and textures of the facial features. We define the joint probability of all features as

P(f) = P(t, l).    (1)

In this paper, we make different assumptions and simplifications to be able to calculate and optimize this objective function. The optimal facial feature locations are found by solving the following optimization problem:

\hat{l} = \arg\max_{l} P(t, l).    (2)
It is not easy to solve this problem without simplifying assumptions. Hence, we introduce some of the possible assumptions in the following section.

2.1 Independent Features Model
We can simplify this formula by assuming independence of each feature from each other. Thus, we obtain:

P(t, l) \approx \prod_{i=1}^{N} P(t_i, l_i).    (3)
We can calculate the joint probability P(t_i, l_i) by concatenating the texture and location vectors, obtaining a concatenated vector f_i of size p + 2. We can then assume a parametric distribution for this combined vector and learn its parameters from training data. One choice of parametric distribution is a Gaussian mixture model (GMM), which provides a multi-modal distribution. With this assumption, we can estimate each feature location independently, which makes the model suitable for parallel computation. Since

\hat{l}_i = \arg\max_{l_i} P(t_i, l_i),    (4)

each feature point can be searched and optimized independently. The search involves extracting texture features for each candidate location (pixel) and evaluating the likelihood function for the concatenated vector at that location. The pixel coordinates which provide the highest likelihood score are chosen as the sought feature location \hat{l}_i. Although this assumption can yield somewhat reasonable feature points, the resultant points are not optimal, since the dependence among the locations of facial features in a typical face is ignored.
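To make the search concrete, the following minimal Python sketch fits a GMM to the concatenated texture-location vectors of one feature and then scores every candidate pixel, keeping the maximizer of equation (4). It assumes a texture extractor is supplied by the caller (for instance the PCA coefficients of Section 2.3) and uses scikit-learn's GaussianMixture rather than the authors' own implementation; the function and parameter names are illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_feature_gmm(train_textures, train_locations, n_components=3):
    # Concatenate the texture (p-dim) and location (2-dim) vectors of one
    # facial feature over all training images and fit a GMM to the joint vectors.
    joint = np.hstack([train_textures, train_locations])
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    return gmm.fit(joint)

def search_feature(gmm, image, extract_texture, candidates):
    # Evaluate the joint GMM log-likelihood at every candidate pixel and
    # return the pixel with the highest score (equation (4)).
    best_score, best_loc = -np.inf, None
    for (x, y) in candidates:
        t = extract_texture(image, x, y)          # p-dim texture descriptor
        f = np.hstack([t, [x, y]])                # concatenated vector of size p + 2
        score = gmm.score_samples(f[None, :])[0]  # log-likelihood of the joint vector
        if score > best_score:
            best_score, best_loc = score, (x, y)
    return best_loc, best_score

Since each feature is handled independently, the per-feature searches can be run in parallel, as noted above.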
2.2 Dependent Locations Model
Another assumption we can make is that the locations of the features are dependent while the textures are independent. First, we write the joint probability as

P(t, l) = P(l) P(t | l).    (5)

Next, we approximate the second term as

P(t | l) \approx \prod_{i=1}^{N} P(t_i | l) \approx \prod_{i=1}^{N} P(t_i | l_i),

where we assume (realistically) that the texture of each facial feature depends only on its own location and is independent of the other locations and textures. Since the locations are modeled jointly as P(l), we assume dependency among the locations of the facial features. With this assumption, the joint probability becomes

P(t, l) = P(l) \prod_{i=1}^{N} P(t_i | l_i).    (6)
We believe this assumption is reasonable, since the appearance of a person's nose may not give much information about the appearance of the same person's eye or lip unless that person is also in the training data. Since we assume that the training and test data involve different subjects, for a more realistic performance assessment, we conjecture that this assumption is valid. The dependence among feature locations, however, is more dominant and is related to the facial geometry of human beings. The location of
the eyes is a good indicator of the location of the nose tip, for example. Hence, we believe it is necessary to model the dependence of locations. Finding the location vector l that maximizes equation (2) yields the optimal locations of all features on the face.

2.3 Location and Texture Features
It is possible to use Gabor or SIFT features to model the texture parameters; however, we preferred a faster alternative for the speed of the algorithm. The texture parameters are extracted from rectangular patches around the facial feature points. We train subspace models for them and use p subspace coefficients as the representation of the texture. Principal component analysis (PCA) is one of the most commonly used subspace models and is the one used in this work, as in [2-6]. The location parameters can be represented directly as x and y coordinates.
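A minimal sketch of this texture representation follows, assuming grayscale images and hand-marked feature locations; the patch size, the subspace dimension and all names are illustrative, and scikit-learn's PCA stands in for whatever subspace implementation the authors used.

import numpy as np
from sklearn.decomposition import PCA

def learn_texture_subspace(images, marked_locations, half_w=4, half_h=4, p=30):
    # Collect the rectangular patch around the hand-marked location of one
    # facial feature in every training image and learn a p-dimensional PCA
    # subspace; the p projection coefficients serve as the texture vector t_i.
    patches = []
    for img, (x, y) in zip(images, marked_locations):
        patch = img[y - half_h:y + half_h, x - half_w:x + half_w]
        patches.append(patch.astype(np.float64).ravel())
    pca = PCA(n_components=p)
    pca.fit(np.vstack(patches))
    return pca

def texture_features(pca, img, x, y, half_w=4, half_h=4):
    # Texture descriptor of a candidate pixel: PCA coefficients of its patch.
    patch = img[y - half_h:y + half_h, x - half_w:x + half_w]
    return pca.transform(patch.astype(np.float64).ravel()[None, :])[0]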
2.4 Modeling Location and Texture Features
A multivariate Gaussian distribution is defined as follows:

N(x; μ, Σ) = \frac{1}{(2π)^{N/2} |Σ|^{1/2}} \exp\left(-\frac{1}{2} (x - μ)^T Σ^{-1} (x - μ)\right),    (7)

where x is the input vector, N is the dimension of x, Σ is the covariance matrix and μ is the mean vector. For the model defined in Section 2.1, the probability P(f_i) of each concatenated feature vector f_i is modeled using a mixture of Gaussian distributions. The GMM likelihood can be written as follows:

P(f_i) = \sum_{k=1}^{K} w_i^k N(f_i; μ_i^k, Σ_i^k).    (8)
Here K is the number of mixtures, and w_i^k, μ_i^k and Σ_i^k are the weight, mean vector and covariance matrix of the k-th mixture component; N indicates a Gaussian distribution with the specified mean vector and covariance matrix. For the model defined in Section 2.2, the probability P(t|l) of the texture parameters t given the locations l is also modeled using a GMM as in equation (8). During testing, for each facial feature i, a GMM texture log-likelihood image is calculated as

I_i(x, y) = \log P(t_i | l_i = [x\ y]^T).    (9)

Note that, to obtain I_i(x, y), we extract texture features t_i around each candidate pixel l_i = [x\ y]^T and find their log-likelihood using the GMM model for facial feature i. Our model for P(l) is a Gaussian model, resulting in a convex objective function. The location vector l of all features is modeled as follows:

P(l) = N(l; μ, Σ).    (10)
Candidate locations for feature i are modeled using a single Gaussian model trained with the feature locations from the training database. The marginal Gaussian distribution of the location of a feature is thresholded, and a binary elliptical region is obtained for that feature; thus, an elliptical search region is found, and the GMM scores are calculated only inside these ellipses for faster computation. The model parameters are learned from the training data using maximum likelihood; the expectation maximization (EM) algorithm is used to learn the parameters of the GMMs [7].
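One way to realize such an elliptical search region, sketched below as an assumption rather than the authors' exact procedure, is to threshold the Mahalanobis distance under the 2x2 marginal Gaussian of feature i; the threshold value is illustrative.

import numpy as np

def elliptical_search_region(mu_i, cov_i, height, width, max_mahalanobis=3.0):
    # mu_i (2-vector, in x, y order) and cov_i (2x2) are the marginal mean and
    # covariance of feature i taken from the joint location model N(l; mu, Sigma).
    # Thresholding the Mahalanobis distance yields a binary elliptical mask;
    # GMM texture scores are then evaluated only inside this mask.
    inv_cov = np.linalg.inv(cov_i)
    ys, xs = np.mgrid[0:height, 0:width]
    d = np.stack([xs - mu_i[0], ys - mu_i[1]], axis=-1)   # per-pixel offsets, shape (H, W, 2)
    m2 = np.einsum('hwi,ij,hwj->hw', d, inv_cov, d)       # squared Mahalanobis distance
    return m2 <= max_mahalanobis ** 2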
2.5 Algorithm
For the independent features model, we calculate P(f_i) in equation (8) using the GMM scores for each candidate location l_i of feature i and take the location with the maximum GMM score as the location of feature i. For the dependent locations model, we propose the following algorithm. Taking the logarithm of equation (6) and noting that the texture of each feature depends on its location, we can define an objective function which depends only on the location vector:

φ(l) = \log P(t, l) = \log P(l) + \sum_{i=1}^{N} \log P(t_i | l_i).    (11)

Using the Gaussian model for location and the GMM for texture defined in Section 2.4, we can write the objective function φ as

φ(l) = -\frac{β}{2} (l - μ)^T Σ^{-1} (l - μ) + \sum_{i=1}^{N} I_i(x_i, y_i) + \text{constant}.    (12)

Here, μ is the mean location vector and Σ^{-1} is the precision (inverse covariance) matrix, learned during training; β is an adjustable coefficient, and I_i(x, y) is the score image of feature i defined in equation (9). The goal is thus to find the location vector l giving the maximum value of φ(l):

\hat{l} = \arg\max_{l} φ(l).    (13)

To find this vector, we use the following gradient ascent algorithm:

l^{(n)} = l^{(n-1)} + k_n \nabla φ(l^{(n-1)}).    (14)

Here, n denotes the iteration number. We can write the location vector l as

l = [x_1, y_1, x_2, y_2, ..., x_N, y_N]^T.    (15)

Then the gradient of φ is

\nabla φ(l) = [\partial φ/\partial x_1, \partial φ/\partial y_1, ..., \partial φ/\partial y_N]^T.    (16)

For a single feature i,

\partial φ/\partial x_i = \frac{\partial}{\partial x_i} \log P(l) + \sum_{i=1}^{N} \frac{\partial}{\partial x_i} \log P(t_i | l_i)    (17)

and

\partial φ/\partial y_i = \frac{\partial}{\partial y_i} \log P(l) + \sum_{i=1}^{N} \frac{\partial}{\partial y_i} \log P(t_i | l_i).    (18)

The gradient of the location part can be calculated in closed form thanks to the Gaussian model, and the gradient of the texture part can be approximated from the score image using its discrete gradients. Plugging in the values of the gradients, we obtain the following gradient ascent update equation:

l^{(n)} = l^{(n-1)} + k_n \left( -β Σ^{-1} (l^{(n-1)} - μ) + G \right),    (19)

where

G = \left[ G_x^1(l_1^{(n-1)}),\; G_y^1(l_1^{(n-1)}),\; ...,\; G_x^N(l_N^{(n-1)}),\; G_y^N(l_N^{(n-1)}) \right]^T.    (20)
Here, G_x^i and G_y^i are the two-dimensional numerical gradients of I_i(x, y) in the x and y directions, respectively; the gradients are computed only at integer pixel coordinates of the image. G is the vector collecting the gradients at all current feature locations in the face image, and k_n is the step size, which can be tuned at every iteration n. Since l^{(n)} is a real-valued vector, we use bilinear interpolation to evaluate the gradients at non-integer pixel locations. Iterations continue until the location difference between two consecutive iterations falls below a stopping criterion.
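A minimal sketch of this update loop is given below, assuming the score images I_i have been computed beforehand; it uses numpy's discrete gradients and scipy's map_coordinates for the bilinear interpolation, and the values of β, the step size and the stopping tolerance are illustrative rather than the authors' settings.

import numpy as np
from scipy.ndimage import map_coordinates

def refine_locations(l0, mu, inv_cov, score_images, beta=4.0, k=0.05,
                     n_iter=200, tol=1e-3):
    # Gradient ascent on phi(l) as in equation (19).  l = [x1, y1, ..., xN, yN];
    # mu is the mean location vector, inv_cov the (2N x 2N) precision matrix,
    # and score_images[i] the log-likelihood image I_i of feature i.
    grads = [np.gradient(I) for I in score_images]   # per feature: (dI/dy, dI/dx)
    l = l0.astype(np.float64).copy()
    for _ in range(n_iter):
        G = np.empty_like(l)
        for i, (gy, gx) in enumerate(grads):
            x, y = l[2 * i], l[2 * i + 1]
            coords = np.array([[y], [x]])             # map_coordinates uses (row, col)
            G[2 * i] = map_coordinates(gx, coords, order=1)[0]       # bilinear dI_i/dx
            G[2 * i + 1] = map_coordinates(gy, coords, order=1)[0]   # bilinear dI_i/dy
        step = k * (-beta * inv_cov.dot(l - mu) + G)
        l += step
        if np.linalg.norm(step) < tol:                # stopping criterion
            break
    return l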
3 Experimental Results
For training, we used a human face database of 316 images with hand-marked facial feature locations for each image. In this database, the texture information inside rectangular patches around the facial features is used to train a PCA subspace model. We used 9 facial features: the left and right eye corners, the nose tip, and the left, right, bottom and upper lip corners. PCA subspaces of different dimensions are obtained using the texture information inside rectangular patches around these facial features. We also computed the mean location of each feature and the covariance matrix of the feature locations. To get rid of side-illumination effects in different regions of the image, we applied the 4-region adaptive histogram equalization method of [4] in both the training and testing stages. For a faster
Fig. 1. Facial features used in this work
feature location extraction, the training and testing images of size 320x300 are downsampled to 80x75, preserving the aspect ratio. The facial features used in our experimental setup are shown in Figure 1, and the training parameters used for each facial feature are shown in Table 1. The parameters are the PCA subspace dimension, the window size used around the facial feature point, and the histogram equalization method. For features with histogram equalization method 1, histogram equalization is applied to the red, green and blue channels separately and the resulting image is converted to gray level; for features with histogram equalization method 2, the image is first converted to gray level and histogram equalization is then applied to the result. These training parameters were found experimentally; the value giving the best result is used for each parameter and each feature. For features with large variability between different people, like the jaw and lips, we had to train larger-dimensional PCA subspaces and use larger windows. We used β = 4 and step size k_n = 0.05 for all iterations. For testing, we used 100 human face images not seen in the training data. For the independent model explained in Section 2.1, the PCA coefficients and location vectors are concatenated, and a GMM model is used to obtain scores; for each feature, the pixel giving the highest GMM score is selected as the initial location. These locations are then used to initialize the dependent locations model of Section 2.2.

Table 1. Training parameters used for facial features

Feature  PCA dimension  Window size  Hist. eq.
1        30             8x8          2
2        30             8x8          1
3        30             10x10        2
4        30             5x5          2
5        20             5x5          2
6        50             12x12        1
7        50             10x13        2
8        50             12x12        2
9        50             19x19        2
Fig. 2. Facial feature locations obtained using the (a) independent and (b) dependent locations models, with a good independent-model initialization
Fig. 3. Facial feature locations obtained using the (a) independent and (b) dependent location models, with an inaccurate independent-model initialization
Using the method explained in Section 2, the locations and textures of the features are refined iteratively. Sample results for the independent and dependent locations models are shown in Figures 2 and 3. In Figure 3, the independent model gives an inaccurate initialization due to the limitations of the model; however, the dependent locations model corrects the feature locations fairly well, using the relative positions of the features in the face. Pixel errors of the independent and dependent locations models on 100 face images of size 320x300 are shown in Table 2. The pixel error of a single facial feature on a face image is the Euclidean distance between the location found for that feature and its manually labeled location. We compute the mean pixel error over all facial features on a single face image; the Mean row in Table 2 is the mean of these per-image errors over all face images, and the Maximum row is the maximum over all face images. The maximum error is shown to indicate the worst-case performance of the algorithms.
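The error statistics can be computed as in the following small sketch; the array layout (detected and ground-truth locations stacked per image and per feature) is an assumption made for illustration.

import numpy as np

def pixel_errors(detected, ground_truth):
    # detected, ground_truth: arrays of shape (n_images, n_features, 2).
    # Per-image error = mean Euclidean distance over the facial features;
    # report its mean and maximum over all test images (cf. Table 2).
    d = np.linalg.norm(detected - ground_truth, axis=2)   # (n_images, n_features)
    per_image = d.mean(axis=1)
    return per_image.mean(), per_image.max()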
Table 2. Comparison of pixel errors of independent and dependent location models with AAM

Error     Independent  Dependent  AAM-API
Mean      5.86         4.70       8.13
Maximum   29.55        15.78      17.84
3.1 Comparison with AAM
Our method is compared with the AAM method [1] using the AAM-API implementation [8]; note that other AAM search algorithms and implementations, such as [9, 10], may perform differently. The same data set is used for training and testing as in Section 3, and the mean and maximum pixel errors of the proposed method are compared in Table 2. An advantage of AAM is that it takes global pose variations into account, whereas our algorithm models the probability distributions of facial feature locations arising from inter-subject differences when there are no major global pose variations. It is critical that our algorithm takes as input the result of a good face detector. We are planning to improve our algorithm so that global pose variations are also dealt with.
4 Conclusions and Future Work
We obtained promising facial feature extraction results with the independent and dependent locations models proposed in this work. The dependent locations model improves dramatically on the independent one, and it also outperforms AAM. We plan to investigate better texture parameters, and compensating for global pose variations is expected to further improve our approach.
Acknowledgments. This work has been supported by TUBITAK (Scientific and Technical Research Council of Turkey), research support program (program code 1001), project number 107E015.
References

1. Cootes, T.F., Taylor, C.J.: Statistical models of appearance for medical image analysis and computer vision. In: SPIE Medical Imaging (2001)
2. Demirel, H., Clarke, T.J., Cheung, P.Y.K.: Adaptive automatic facial feature segmentation. In: International Conference on Automatic Face and Gesture Recognition (1996)
3. Luettin, J., Thacker, N.A., Beet, S.W.: Speaker identification by lipreading. In: International Conference on Spoken Language Processing (1996)
4. Meier, U., Stiefelhagen, R., Yang, J., Waibel, A.: Towards unrestricted lip reading. International Journal of Pattern Recognition and Artificial Intelligence (1999)
5. Hillman, P.M., Hannah, J.M., Grant, P.M.: Global fitting of a facial model to facial features for model-based video coding. In: International Symposium on Image and Signal Processing and Analysis, pp. 359–364 (2003)
6. Ozgur, E., Yilmaz, B., Karabalkan, H., Erdogan, H., Unel, M.: Lip segmentation using adaptive color space training. In: International Conference on Auditory and Visual Speech Processing (2008)
7. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (1977)
8. The AAM-API: http://www2.imm.dtu.dk/aam/aamapi/
9. Matthews, I., Baker, S.: Active appearance models revisited. International Journal of Computer Vision 60, 135–164 (2003)
10. Theobald, B.-J., Matthews, I., Baker, S.: Evaluating error functions for robust active appearance models. In: Proceedings of the International Conference on Automatic Face and Gesture Recognition, pp. 149–154 (2006)
Common Motion Map Based on Codebooks
Ionel Pop1,2, Mihaela Scuturici1, and Serge Miguet1
1 Université de Lyon, CNRS, Université Lyon 2, LIRIS, UMR5205, F-69676, France
2 Foxstream, www.foxstream.fr
Abstract. This article presents a method to learn common motion patterns in video scenes, in order to detect abnormal behaviors or rare events based on global motion. The motion orientations are observed and learned, providing a common motion map. As in the background modeling technique using codebooks [1], we store motion information in a motion map. The motion map is then projected on various angles, allowing an easy visualization of common motion patterns. The motion map is also used to detect abnormal or rare events.
1 Introduction
Abnormality detection in video is a relatively recent field of study. Its results are used in the domain of video surveillance, mostly to detect uncommon behaviors. After a period of scene observation, the system is able to learn the common motion patterns of the moving objects in the scene; any object that does not follow one of the learned patterns is then detected as having abnormal behavior. In most cases, abnormal event detection is considered to be the same as rare event detection. The current study presents several methods to build motion patterns along with their frequencies. The proposed approach is able to integrate new data and to continuously adapt the patterns; therefore, no separate learning phase is needed. It is possible to generate a visual representation of the learned data, which shows the common movements in the scene. The method was tested on videos from fixed cameras, although it may also be applied to mobile cameras, as long as the displacements remain consistent from one view to another (e.g., along a road). The rest of this study is organized as follows. The next section discusses related work. The proposed approach is presented in Section 3, followed by a description of our implementation choices. Before the conclusion, we present the results obtained on several test videos.
2 Related Work
Most of the existing methods of rare event detection use either a trajectory classifier or a motion classifier. In the case of a trajectory classifier, the trajectories are either simulated or acquired using various object tracking algorithms. In the case of simulated
trajectories, they are randomly generated using a predefined model. In the other case, there is an object segmentation and tracking step. The trajectories are then normalized (temporally, spatially or both). Some authors ([2], [3]) split them into segments representing sub-trajectories. The resulting segments may have fixed or variable length; for variable length, the division is made according to certain criteria, such as discontinuities in speed, acceleration, etc. The trajectories (or sub-trajectories) are then clustered, which requires a metric. Many clustering methods (spectral clustering, k-means, vector quantization, etc.) and metrics have been proposed. For instance, [4] and, in a later work, [5] use mixtures of Von Mises distributions to model trajectories; similarities between trajectories are estimated with the Bhattacharyya distance, and the k-medoids algorithm is used for clustering. Mixtures of Von Mises distributions are the equivalent of normal distributions for circular variables (i.e., angles). [6] use a Dynamic Time Warping (DTW) approach to estimate the similarity between two trajectories; based on this similarity, an incremental clustering algorithm produces a model of trajectories, and any new trajectory which does not fit one of the models is considered a rare event (rare trajectory). These methods assume a neat moving-object segmentation and tracking, which is sometimes difficult to obtain (difficult outdoor scenes, crowds, moving camera, etc.).
A motion classifier works at a lower level: there is no longer a notion of object or tracker. The system observes and learns the motion of the pixels over time, builds a model of these movements, and any new observation which does not fit this model is flagged as abnormal. There are two classes of methods. The first one applies to human gesture recognition, when a moving silhouette or a moving body part is identified. The second one tries to characterize the global motion field and does not need a neat segmentation of moving objects. For the first class, we cite [7], who introduced the notion of motion templates or motion history images in order to perform human gesture recognition. The method requires a silhouette or a part of it, and therefore cannot be applied to complex videos (e.g., outdoor videos with a moving background, or crowd videos). For the second class, we cite [8], who combined all the optical flow vectors observed during the learning phase into a global motion field for a fixed camera. Each motion pattern is a cluster of motion vectors (each flow vector is associated with only one motion pattern). Redundant vectors are filtered with a modified neural network, and the remaining vectors are analyzed to find continuous sequences of maximum length, called super-tracks. These super-tracks are used to build a motion model in difficult scenes (crowd and aerial videos). A drawback of this method is that no temporal order is included in these motion patterns. Another global approach is presented in [9], who build Internal Motion Histograms using the gradient magnitudes of optical flow fields estimated from two consecutive frames.
The system presented in this study belongs to the second category of rare-event detectors, i.e., it learns the predominant movements of the pixels in the scene. The main advantage of this approach is that there is no need to track objects, so the applications of such a system are much broader. It is particularly appropriate for situations where tracking is difficult, such as crowded scenes or cluttered foreground.
3 Proposed Solution
The unusual motion detection we propose is based on a 'motion map' which keeps information on the observed motion of each block in the image. The method is inspired by the codebook method, which models the background of an image using codewords. We model the motion of each block by keeping the angle of its moving direction, obtained with an optical flow method.
3.1 Optical Flow
Optical flow estimates the motion of each region (usually at the pixel level) from one image to the next. The most common algorithms build on Lucas-Kanade [10] and Horn-Schunck [11]. Both are based on the same idea, brightness conservation: the changes which occur in a region are due only to motion. An additional constraint states that the values of the optical flow do not change considerably within a neighborhood. These algorithms estimate a flow vector for each pixel of the input image. Another type of optical flow algorithm is based on block matching: the image is decomposed into fixed-size blocks (overlapping or non-overlapping), each block is searched for in the next frame, and the optical flow is estimated from the new position of the block. The block size is influenced by the size of the objects moving in the image. This type of algorithm does not yield a flow vector for each pixel, so the result has fewer flow vectors than the input has pixels. After testing these algorithms, we decided to use the block-matching approach. One of its main advantages is the precision of the estimated flow direction: the other approaches perform quite well on synthetic videos, but their performance suffers from the noise and luminosity variations of real video sequences. Another advantage of block matching is the smaller volume of output data, so we work with less, but more precise, data. The block-matching algorithm also presents some problems. After the motion stops, in some dark regions without texture, the flow vectors need some time before their values decay to zero; this is because a uniform background does not allow a proper estimation of the displacement, and a texture-less moving background is not different from a static texture-less background. This problem is solved by considering the optical flow only in moving regions, which are found using a background subtraction based on a Gaussian mixture model.
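A minimal block-matching sketch between two grayscale frames is given below, using the sum of absolute differences over a small search window; the block and search sizes are illustrative, and in practice only blocks inside the foreground mask from the background subtraction would be kept.

import numpy as np

def block_matching_flow(prev, curr, block=8, search=7):
    # For each non-overlapping block of the previous frame, find the
    # displacement (within +/- search pixels) minimizing the sum of absolute
    # differences in the current frame; return per-block orientations and magnitudes.
    h, w = prev.shape
    by, bx = h // block, w // block
    angles = np.zeros((by, bx))
    mags = np.zeros((by, bx))
    for j in range(by):
        for i in range(bx):
            y0, x0 = j * block, i * block
            ref = prev[y0:y0 + block, x0:x0 + block].astype(np.int32)
            best, best_dxy = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ys, xs = y0 + dy, x0 + dx
                    if ys < 0 or xs < 0 or ys + block > h or xs + block > w:
                        continue
                    cand = curr[ys:ys + block, xs:xs + block].astype(np.int32)
                    sad = np.abs(ref - cand).sum()
                    if sad < best:
                        best, best_dxy = sad, (dx, dy)
            dx, dy = best_dxy
            angles[j, i] = np.arctan2(dy, dx) % (2 * np.pi)   # orientation in [0, 2*pi)
            mags[j, i] = np.hypot(dx, dy)
    return angles, mags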
3.2 Incremental Codebook-Model
The model used to store the common displacements was inspired by [1], who use codebooks to model the static background of a scene. The current study uses a version of codebooks adapted to circular values. Each block is associated with a list of codewords. A codeword contains min, max, lmin, lmax and n: min and max are the minimum and maximum of the values accepted by the codeword, lmin and lmax are the minimum and maximum of the values which may be accepted by the codeword, and n counts the number of matched vectors. How these minima and maxima are calculated on angles is detailed in the next section. Each time a new flow vector matches the codeword, this information is updated (if necessary). The matching and updating algorithm is detailed in Algorithm 1.
Algorithm 1. Learning of flow vectors. θ^t(x,y) is the flow orientation of each block (x = 1..w, y = 1..h) at instant t; listCW(x,y) is the list of codewords of a block. The function isBetween checks whether its first parameter is between the other two parameters. initialW is an algorithm parameter which controls the maximal width of the codeword.

Input: θ^t(x,y), listCW(x,y) for x = 1..w, y = 1..h

foreach x = 1..w, y = 1..h do
    foundCW := false
    foreach cw ∈ listCW(x,y) do
        if isBetween(θ^t(x,y), cw.lmin, cw.lmax) then
            foundCW := true
            if isBetween(θ^t(x,y), cw.max, cw.lmax) then
                cw.lmin += |θ^t(x,y), cw.max|_c
                cw.max := θ^t(x,y)
            else if isBetween(θ^t(x,y), cw.min, cw.lmin) then
                cw.lmax -= |θ^t(x,y), cw.min|_c
                cw.min := θ^t(x,y)
            end
        end
    end
    if not foundCW then
        create new codeword c
        c.min := c.max := θ^t(x,y)
        c.lmin := θ^t(x,y) - initialW/2
        c.lmax := θ^t(x,y) + initialW/2
        c.n := 1
        add c to listCW(x,y)
    end
end
The main difference with the original algorithm in [1] lies in the acceptance condition and the update method. [1] compute the values of lmin and lmax from min and max using a sub-unit multiplier. Due to the circular nature of angles, this approach is not applicable here. The proposed solution is to store lmin and lmax together with min and max and to update them together. The main idea is preserved: min and lmax decrease together, and max and lmin increase together.
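A minimal Python transcription of the update for a single block is sketched below. The is_between helper anticipates one possible reading of the circular containment test of Section 3.3, and incrementing n on every match follows the description of n as a match counter even though the printed listing only sets it at creation; all names and the initial width are illustrative.

import math
from dataclasses import dataclass

TWO_PI = 2.0 * math.pi

def circ_dist(a, b):
    # Length of the smaller arc between two angles in [0, 2*pi).
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)

def is_between(alpha, beta1, beta2):
    # alpha lies on the arc running from beta1 to beta2 in the increasing direction.
    return (alpha - beta1) % TWO_PI <= (beta2 - beta1) % TWO_PI

@dataclass
class Codeword:
    min: float
    max: float
    lmin: float
    lmax: float
    n: int = 1

def update_block(codewords, theta, initial_w=0.1):
    # One step of Algorithm 1 for a single block: match the observed flow
    # orientation theta against the block's codewords, widening the matched
    # codeword if needed, or create a new one.
    for cw in codewords:
        if is_between(theta, cw.lmin, cw.lmax):
            if is_between(theta, cw.max, cw.lmax):
                cw.lmin = (cw.lmin + circ_dist(theta, cw.max)) % TWO_PI
                cw.max = theta
            elif is_between(theta, cw.min, cw.lmin):
                cw.lmax = (cw.lmax - circ_dist(theta, cw.min)) % TWO_PI
                cw.min = theta
            cw.n += 1    # match counter (our addition, per the text above)
            return
    codewords.append(Codeword(min=theta, max=theta,
                              lmin=(theta - initial_w / 2) % TWO_PI,
                              lmax=(theta + initial_w / 2) % TWO_PI))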
3.3 Circular-Value Arithmetics
A large part of the values used in this study are angles. Due to their periodicity, common operations such as sums and differences do not behave as in R. For a better understanding of the previous sections, we detail the principal operations applied to angles. The angles take real values in the interval [0, 2π); addition and subtraction are computed modulo 2π, and for the rest of the study any value outside this interval is replaced with its equivalent value inside it. Comparing two angles is a bit tricky: we define an increasing direction on the unit circle, as shown in Figure 1. When comparing two angles α and β, there are two arcs (A and B) connecting them; we orient the smaller arc (B) along the increasing direction, so that it starts at the smaller value (α) and points to the larger one (β). To estimate the average value of a set of angles, we associate to each angle θ a unit vector with phase θ; the average is defined as the phase of the sum vector.
Fig. 1. (a) Illustration of the comparison between two angles α and β; the arrow shows the selected increasing direction. (b) Illustration of multi-frame filtering: the gray cells are abnormal blocks activated at the frame number indicated in the cell. The upper chain has a length of 3 frames; the lower one has a length of only two, but if the last cell is activated before its ttl expires, the chain may continue.
Another operation necessary for this study is the function isBetween(α, β1, β2), which checks whether the value of the angle α lies between the values of β1 and β2. The value of this Boolean function is the evaluation of the expression (α − β1) + (α − β2) == (β1 − β2), using the difference and addition presented in the previous paragraph. The results of these operations are sometimes undefined, for instance when comparing, or estimating the average of, opposite angles. The proposed method takes these limitations into account and avoids them.
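The sketch below collects these circular helpers; the is_between test is one possible realization of the containment check described above (an arc test in the increasing direction), stated here as an assumption rather than the authors' exact formula, and the circular mean is the phase of the sum vector.

import math

TWO_PI = 2.0 * math.pi

def circ_diff(a, b):
    # Difference a - b reduced to [0, 2*pi), following the increasing direction.
    return (a - b) % TWO_PI

def is_between(alpha, beta1, beta2):
    # True when alpha lies on the arc that runs from beta1 to beta2 in the
    # increasing direction on the unit circle.
    return circ_diff(alpha, beta1) <= circ_diff(beta2, beta1)

def circ_mean(angles):
    # Average of a set of angles: phase of the sum of the corresponding unit
    # vectors (undefined when the sum is the origin, e.g. for opposite angles).
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c) % TWO_PI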
Fig. 2. Visualization of rare events as red blocks. (a) A bus passes through a zone with little traffic compared to the rest of the image. (b) Event detection during the first 1000 frames: almost all movement is detected as rare. (c) A bus stops, and the pedestrians and cars moving around it trigger the detector. (d) A car crosses the continuous line.
3.4 Multi-frame Noise Filtering
Algorithm 1 builds a codebook model which stores the movements of the blocks in the scene as values of the angle α. Based on their frequencies, it is possible to estimate the probability of a new value, and a threshold on this probability filters out common events. In the current study, this probability is estimated simply as the frequency of occurrence of the activated codeword relative to the frequencies of all codewords created for that block.
This simple approach may detect abnormal events, but it is also sensitive to the slightest error in the estimation of the block flow. Therefore, instead of deciding between rare and common events based on evidence from a single frame, the system waits for a confirmation of the abnormal event over a few frames; if the abnormal event is still present, the system signals it. The system also needs to take into account possible displacements of the abnormal regions. In this case, it builds an abnormality chain containing the list of temporally consecutive abnormal blocks. To achieve this, two values are associated with each block: a level value and a time-to-live (ttl) value. The level marks the index in the abnormality chain, while the ttl is the number of frames the system may wait for further evidence of the block's abnormality. An abnormality alert is raised when some blocks have a level value above a predefined threshold. The detailed procedure is presented in Algorithm 2, and an example is shown in Figure 1b.
Algorithm 2. Reducing the false alarm rate. prob^t(x,y) is the probability of the current flow value of each block (x = 1..w, y = 1..h) given the past values; blockProp(x,y) is a structure containing the ttl and level properties of the blocks; of(x,y) is the current optical flow. In addition, the algorithm needs two parameters: thresholdProb (the minimum probability for a normal flow orientation) and thresholdLevel (the minimal length of an abnormal chain).

Input: prob^t(x,y), blockProp(x,y), of(x,y) for x = 1..w, y = 1..h

copy blockProp to tempBlockProp
foreach x = 1..w, y = 1..h do
    if prob^t(x,y) < thresholdProb then
        tempBlockProp(x+of(x,y).x, y+of(x,y).y).ttl := initialTTL
        tempBlockProp(x+of(x,y).x, y+of(x,y).y).level :=
            max(tempBlockProp(x+of(x,y).x, y+of(x,y).y).level, blockProp(x,y).level + 1)
    end
end
copy tempBlockProp to blockProp
foreach x = 1..w, y = 1..h do
    blockProp(x,y).ttl := blockProp(x,y).ttl - 1
    if blockProp(x,y).ttl == 0 then
        blockProp(x,y).level := 0
    end
    if blockProp(x,y).level > thresholdLevel and prob^t(x,y) < thresholdProb then
        blockProp(x,y).abnormal := true
    end
end
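A minimal Python sketch in the spirit of Algorithm 2 follows; it treats a block as a candidate abnormal block when the relative frequency of its activated codeword falls below thresholdProb, as described above. The per-block flow is assumed to be expressed in block units, and the array names and default thresholds are illustrative.

import numpy as np

def filter_abnormal(prob, flow_dx, flow_dy, ttl, level,
                    threshold_prob=0.1, threshold_level=5, initial_ttl=3):
    # One frame of multi-frame noise filtering.  prob[y, x] is the relative
    # frequency of the activated codeword; ttl and level are integer arrays
    # holding the chain properties.  Returns the abnormality mask and the
    # updated ttl and level arrays.
    h, w = prob.shape
    new_ttl, new_level = ttl.copy(), level.copy()
    for y in range(h):
        for x in range(w):
            if prob[y, x] < threshold_prob:
                # Propagate the chain to the block the motion points to.
                ty = min(max(y + int(round(flow_dy[y, x])), 0), h - 1)
                tx = min(max(x + int(round(flow_dx[y, x])), 0), w - 1)
                new_ttl[ty, tx] = initial_ttl
                new_level[ty, tx] = max(new_level[ty, tx], level[y, x] + 1)
    new_ttl -= 1
    new_level[new_ttl <= 0] = 0
    abnormal = (new_level > threshold_level) & (prob < threshold_prob)
    return abnormal, new_ttl, new_level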
Fig. 3. Projection of the codebook model on various angles, starting at 0° with a step of 10°. The images are read from left to right and then from top to bottom. These images were obtained using initialW = 0.1.
4 Experimental Results
The method proposed in this work was tested on various videos (available at http://ngsim.camsys.com/). They consist of almost vertical views of a street: the first one is a three-way crossroad, the second one a four-way crossroad, and the third one is almost linear, with multiple lanes in each direction. Figure 2 shows an image from each video. Each video lasts 15 minutes and has a resolution of 640x480. The learning capabilities of the system were tested by projecting the model on angles at various moments of the video; Figure 3 shows these projections for the second video, for a block size of 5x5 pixels.
Each block of the model may have more than one codeword, which is difficult to represent in a single image. The solution we chose is to project the frequency of occurrence for each angle: after the analysis, the model is projected on a set of predefined angles (0 to 2π in 36 increments), and for each projection a grayscale image is generated, with white meaning a strong presence of the projected angle. We noticed that the model started to be consistent after 500 frames, and that only small changes take place after the 3000th frame. These small figures may be explained by the density of the traffic (the videos were taken at 9 AM), which provides enough evidence in a small amount of time. In the second part of our tests, the algorithm presented in Section 3.4 was used to detect rare events in the videos. The parameters used were a ttl of 3 frames, thresholdProb = 0.1 and thresholdLevel = 5. In the first part of each video (around 1000 frames), many events were signaled as rare; this is mainly because the codebook is almost empty, and each new event is classified as abnormal since it has not yet been seen. These alarms could have been avoided by configuring the system to consider the first codeword of each block as normal, but this solution was not chosen because it may miss some legitimate events, e.g., in areas where there should not be any movement at all. After this 'learning phase', the system becomes more experienced and detects only rare events, mostly vehicles crossing the continuous line, as seen in Figure 2.
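One possible way to generate such a projection image is sketched below, assuming the Codeword structure and is_between helper from the earlier sketches; a block is drawn brighter when the codewords accepting the projected angle have accumulated more matches, and the normalization is illustrative.

import numpy as np

TWO_PI = 2.0 * np.pi

def is_between(alpha, beta1, beta2):
    # Circular containment test (same convention as in the Section 3.3 sketch).
    return (alpha - beta1) % TWO_PI <= (beta2 - beta1) % TWO_PI

def project_codebook(codebook, angle, height, width):
    # codebook[(y, x)] is the list of codewords of a block (objects with fields
    # lmin, lmax and n).  Sum the match counts of the codewords that accept the
    # projected angle; bright pixels mark a strong presence of that direction
    # (cf. Figure 3).
    img = np.zeros((height, width))
    for (y, x), codewords in codebook.items():
        img[y, x] = sum(cw.n for cw in codewords
                        if is_between(angle, cw.lmin, cw.lmax))
    m = img.max()
    return (255 * img / m).astype(np.uint8) if m > 0 else img.astype(np.uint8)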
5 Conclusions and Future Works
This work presented an original, incremental approach based on learning the flow in the image using a codebook algorithm. The advantage of this method is the implicit estimation of the codeword limits. Using this model, it is possible to estimate the probability of each observed flow vector on each block. The main difference and improvement compared to the existing approaches presented in the previous sections is the iterative character of the solution: the system is able to learn while observing the scene, and there is no need to stop it in order to process the observed information. For the purpose of this study the learning was unsupervised, but it can easily be modified to take additional input from a human operator. Other contributions of the present work are the adaptive learning of circular values (i.e., angles) and the noise filtering using multi-frame evidence. Although the use of Von Mises distributions would provide better precision, the computational requirements would be too high. According to [12], the estimation of the mean of a Von Mises distribution is straightforward: it is the phase of the sum of the vectors representing the input angles on the unit circle. On the other hand, the estimation of the precision (the equivalent of the inverse of the standard deviation) is complex and would require a large amount of time, considering that it should be done for each flow vector.
Finally, using evidence from more than one frame limits the false-alarm rate without impacting the detection rate. The approach presented in this work needs more validation, so more tests are being conducted; we especially want to test it on other types of video and compare it to other proposed approaches. The next step for this approach is to extract higher-level knowledge about the scene. When visualizing the projections of the codebook, information about the displacement of the vehicles in the scene can be deduced; the system should be able to do this automatically and present the common trajectory paths of the scene.
References

1. Kim, K., Chalidabhongse, T., Harwood, D., Davis, L.: Background modeling and subtraction by codebook construction. In: Proc. International Conference on Image Processing ICIP 2004, vol. 5, pp. 3061–3064 (2004)
2. Hsieh, J.W., Yu, S.L., Chen, Y.S.: Motion-based video retrieval by trajectory matching. IEEE Transactions on Circuits and Systems for Video Technology 16, 396–409 (2006)
3. Bashir, F., Khokhar, A., Schonfeld, D.: Object trajectory-based activity classification and recognition using hidden Markov models. IEEE Transactions on Image Processing 16, 1912–1919 (2007)
4. Calderara, S., Cucchiara, R., Prati, A.: Detection of abnormal behaviors using a mixture of von Mises distributions. In: Proc. IEEE Conference on Advanced Video and Signal Based Surveillance AVSS 2007, pp. 141–146 (2007)
5. Prati, A., Calderara, S., Cucchiara, R.: Using circular statistics for trajectory shape analysis. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition CVPR 2008, pp. 1–8 (2008)
6. Pop, I., Scuturici, M., Miguet, S.: Incremental trajectory aggregation in video sequences. In: Proc. 19th International Conference on Pattern Recognition ICPR 2008, pp. 1–4 (2008)
7. Bobick, A., Davis, J.: Real-time recognition of activity using temporal templates. In: Proc. 3rd IEEE Workshop on Applications of Computer Vision WACV 1996, pp. 39–42 (1996)
8. Hu, M., Ali, S., Shah, M.: Detecting global motion patterns in complex videos. In: Proc. 19th International Conference on Pattern Recognition ICPR 2008, pp. 1–5 (2008)
9. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006)
10. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artificial Intelligence, pp. 674–679 (1981)
11. Horn, B.K., Schunck, B.G.: Determining optical flow. Technical report, Massachusetts Institute of Technology, Cambridge, MA, USA (1980)
12. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
Author Index
Aanæs, Henrik I-656 Abid, Saad Bin II-857 Abugharbieh, Rafeef I-944, I-1089 Aguzzi, Adriano I-367 Ahmed, Sohail I-531 Al-Huseiny, Muayed S. II-377 Alba, Alfonso I-554 Aldavert, David I-44 Ali, Asem I-774, II-519 Ali, Imtiaz II-578 Almeida, Jurandy I-435 Almeida, Tiago A. I-435 Amayeh, Gholamreza I-243 Amberg, Brian I-875 Andersen, M. I-999 Andersen, R. I-999 Andersen, Vedrana I-656 Ansar, Adnan I. II-325 Arce-Santana, Edgar R. I-554 Argyros, Antonis A. II-140, II-460 Arnold, Ben II-519 Asama, Hajime II-430 Asari, Vijayan K. II-799 Aslan, Melih S. II-519 Atalay, Volkan I-1 Athalye, Ashwini II-970 Atsumi, Masayasu II-778 Austin, Jim II-1141 Bab-Hadiashar, Alireza I-415 Bærentzen, J. Andreas I-656 Bai, Xiaojing I-886 Bajcsy, Ruzena I-317 Baldock, Richard A. I-924 Baltzakis, Haris II-140 Barbosa, Jorge G. I-586 Bartczak, Bogumil II-228 Basah, Shafriza Nisha I-415 Bebis, George I-243 Bekaert, Philippe I-843, II-788 Belaroussi, Rachid II-1161 Berberich, Eric I-608 Bergeron, R. Daniel II-117 Bernardino, Alexandre I-223 Bessy, F. II-1041
Bevilacqua, Alessandro II-827, II-837 Bhatt, Rajen II-728 Bible, Paul I-1009 Birchfield, Stanley T. I-425 Biri, V. II-919 Birnbach, Bastian II-992 Bischof, Horst II-1119 Boisvert, Jonathan I-586 Bolitho, Matthew I-678 Boto, F. II-1041 Botterweck, Goetz II-857 Bouakaz, S. II-25 Boudreaux, Hollie I-1009 Bouguila, Nizar II-450 Brandt, Sami II-1073 Breuß, Michael II-949 Brits, Alessio M. II-345 Broxton, Michael J. I-700, I-710 Brun, Anders I-337 Buckles, Bill P. II-240 Buhmann, Joachim M. I-367 Burkhardt, Hans I-34, I-287, I-865 Burns, Randal I-678 Burschka, Darius I-1043 Caglioti, Vincenzo I-147 Callol, C. II-1041 Campos-Delgado, Daniel U. I-554 Carmona, Rhadam´es I-644 Carozza, Ludovico II-827, II-837 Carreno, Jose II-970 Castel´ an, Mario II-662 Caunce, Angela I-750 Cawley, Ciar´ an II-857 Cervato, Cinzia I-1009 Chang, Carolina II-686 Charmette, Baptiste I-201 Chausse, Fr´ed´eric I-201 Chavez, Aaron II-550 Chellali, Ryad I-808 Chen, Jingying II-588 Chen, Xinglin II-719 Chen, Ying II-480 Chen, Yisong I-668
Chen, Yunmei I-855 Chen, Zezhi II-1141 Cheng, Fuhua (Frank) II-982 Cheng, Jun II-529, II-719 Chesi, G. II-470 Chhabada, Sanyogita I-125 Choi, Byungkuk I-67 Choi, Chong-Ho I-740 Choudhary, Alok II-293 Cohen, Elaine I-101 Cootes, Tim I-750 Corani, Giorgio I-576 Cristinacce, David I-750 Cruz-Neira, Carolina I-1009 Cuypers, Tom I-843, II-788 Dana, Kristin J. II-335 Das, Dipankar II-160, II-172 Deena, Salil I-89 Desbrun, Mathieu I-656 Deverly, S. II-919 De Vieilleville, F. I-327 Dhall, Abhinav II-728 Dhillon, Daljit Singh I-831 Diao, Mamadou II-619 Dickerson, Julie II-81, II-909 Dima, Alden II-1051 Dinov, Ivo D. I-955 Djeraba, Chabane II-674 Dong, Bin I-914, I-955 Dong, Junyu II-558 Dong, Xinghui II-558 D’Orazio, T. II-440 Dornaika, Fadi I-730 Dornhege, C. I-79 Duan, Ye I-265, II-817 Duchaineau, Mark I-720 Duric, Zoran II-417 Dutreve, L. II-25 Egawa, Akira I-135 Elhabian, Shireen I-774 El Kaissi, Muhieddine II-81, II-909 Emeliyanenko, Pavel I-608 Erdogan, Hakan II-1171 Ertl, Thomas I-357 Fahmi, Rachid II-519 Falk, Robert I-347 Fallah, Yaser P. I-317
Fan, Fengtao II-982 Fan, Lixin I-457, I-819, II-252 Fan, Wentao II-450 Farag, Aly I-347, I-774, II-519 Farag, Amal I-347 Fari˜ naz Balseiro, Fernando I-965 Fasuga, Radoslav II-397 Fehr, Janis I-34, I-287 Felsberg, Michael I-211, II-184 Feng, Powei I-620 Ferrydiansyah, Reza I-521 Figueira, Dario I-223 Fijany, Amir I-808 Filipovych, Roman II-367 Fleck, Daniel II-417 Fontaine, Jean-Guy I-808 Forss´en, Per-Erik I-211 Francken, Yannick I-843, II-788 Franek, Lucas II-737 Freeman, Michael II-1141 Frey, Steffen I-357 Fr¨ ohlich, Bernd I-644, II-104 Fuchs, Thomas J. I-367 Fuentes, Olac I-762 Galata, Aphrodite I-89 Gallus, William I-1009 Gambardella, Luca Maria I-576 Gao, Yan II-293 Garc´ıa, Narciso II-150 Garc´ıa-Sevilla, Pedro II-509 Gaspar, Jos´e I-223 Gault, Travis I-774 Geetha, H. II-568 Geist, Robert I-55 Gen¸c, Serkan I-1 Gherardi, Alessandro II-827, II-837 Gianaroli, Luca I-576 Gissler, M. I-79 Giusti, Alessandro I-147, I-576 Glocker, Ben I-1101 Goesele, Michael I-596 Govindu, Venu Madhav I-831 Grabner, Markus II-1119 Graham, James I-347 Grau, S. II-847 Gregoire, Jean-Marc I-379 Grimm, Paul II-992 Gross, Ari II-747 Gschwandtner, Michael II-35
Author Index Gudmundsson, Petri II-1073 Gupta, Om K. I-233 Gustafson, David II-550 Hacihaliloglu, Ilker I-944 Hagan, Aaron II-960 Hagen, Hans II-1151 Hagenburg, Kai II-949 Hamam, Y. II-407 Hamann, Bernd II-71 Hamarneh, Ghassan I-1055, I-1079, I-1089 Hanbury, Allan II-303 Hansson, Mattias II-1073 Hariharan, Srivats I-531 Hashimoto, Yuki II-538 Haybaeck, Johannes I-367 He, Zhoucan II-608 Healy, Patrick II-857 Hedborg, Johan I-211 Heikenwalder, Mathias I-367 Helmuth, Jo A. I-544 Henrich, Dominik I-784 Herlin, P. I-327 Hermans, Chris I-843 Herubel, A. II-919 Hess-Flores, Mauricio I-720 Heun, Patrick I-865 Hill, Bill I-924 Hinkenjann, Andr´e I-987 Ho, Qirong I-253 Hodgson, Antony I-944 Hoetzl, Elena II-1021 Holgado, O. II-1041 Hoppe, Hugues I-678 Horebeek, Johan Van II-662 Hoseinnezhad, Reza I-415 Hosseini, Fouzhan I-808 Hotta, Kazuhiro II-489 House, Donald I-167 Hu, Weiming II-480, II-631, II-757 Huang, Xixia I-265 H¨ ubner, Thomas I-189 Hung, Y.S. II-470 Husz, Zsolt L. I-924 Hutson, Malcolm I-511 Ibarbia, I. II-1041 Imaizumi, Takashi II-1109 Imiya, Atsushi I-403, II-807, II-1109
Islam, Mohammad Moinul Islam, Mohammed Nazrul
1193
II-799 II-799
Jarvis, Ray A. I-233 Jeong, Jaeheon I-490 Jia, Ming II-81, II-909 Jiang, Jun II-719 Jiang, Xiaoyi II-737 Johnson, Gregory P. II-970 Jung, Keechul I-500 Kafadar, Karen II-1051 Kalbe, Thomas I-596 Kalyani, T. II-568 Kameda, Yusuke I-403, II-1109 Kampel, Martin II-598 Kanev, Kamen I-965 Karim, Mohammad A. II-799 Kaˇspar, Petr II-397 Kawabata, Kuniaki II-430 Kawai, Takamitsu I-934 Kazhdan, Michael I-678 Keuper, Margret I-865 Khan, Ghulam Mohiuddin II-728 Khan, Rehanullah II-303 Khandani, Masoumeh Kalantari I-317 Kholgade, Natasha II-357 Kim, Beomjin I-125 Kim, Jinwook II-49 Kim, Jongman II-619 Kim, Soojae II-49 Kim, Taemin I-700, I-710 Kim, Younghui II-59 Klaus, Andreas II-1119 Klose, Sebastian I-12, II-1131 Knoblauch, Daniel I-720, II-208 Knoll, Alois I-12, II-1131 Kobayashi, Yoshinori II-160, II-172 Koch, Reinhard II-228 Koch, Thomas I-596 K¨ olzer, Konrad II-992 Komodakis, Nikos I-1101 Kong, Yu II-757 Koroutchev, Kostadin I-965 Korutcheva, Elka I-965 Kosov, Sergey I-796 Koutis, Ioannis I-1067 Kr¨ amer, P. II-1041 Kr¨ uger, Norbert I-275 Kuck, Roland I-1019
Author Index
Kuester, Falko I-720, II-208 Kuhn, Stefan I-784 Kumagai, Hikaru II-430 Kuno, Yoshinori II-160, II-172 Kurata, Takeshi I-500 Kurz, Christian I-391 Kwok, Tsz-Ho II-709 Lachaud, J.-O. I-327 Ladikos, Alexander I-480 Lai, Shuhua II-982 Lampo, Tomas II-686 Langbein, Max II-1151 Laramee, Robert S. II-117 Larsen, C. I-999 Larsson, Fredrik II-184 Latecki, Longin Jan II-747 Lee, Byung-Uk II-283 Lee, Hwee Kuan I-253, I-531 Lee, Minsik I-740 Leite, Neucimar J. I-435 Lemon, Oliver II-588 Leo, M. II-440 Lepetit, Vincent I-819 Letamendia, A. II-1041 Lewis, Robert R. I-896 Lezoray, O. I-327 Li, Chunming I-886 Li, Li II-480, II-631 Li, Pengcheng II-529 Li, Wanqing II-480 Li, Xiaoling II-817 Liensberger, Christian II-303 Lindgren, Finn II-1073 Ling, Haibin II-757 Lipsa, Dan R. II-117 Lisin, Dimitri A. I-157 Liu, Xiaoping II-240 Liu, Yu II-1085 L¨ offler, Alexander I-975 Lopez de Mantaras, Ramon I-44 Luhandjula, T. II-407 Lundy, Michael I-710 Lux, Christopher II-104 Madsen, O. I-999 Magli, Cristina I-576 Mahmoodi, Sasan II-377 Mai, Fei II-470 Makris, Pascal I-379
Malm, Patrik I-337 Mao, Yu I-955 Mattausch, Oliver II-13 Mazzeo, P.L. II-440 McCaffrey, James D. I-179 McGraw, Tim I-934 McIntosh, Chris I-1079 McPhail, Travis I-620 Mertens, Tom II-788 Meyer, A. II-25 Miguet, Serge II-1181 Miller, Gary L. I-1067 Miller, Mike I-774 Mimura, Yuta II-489 Minetto, Rodrigo I-435 Ming, Wei II-767 Mishima, Taketoshi II-430 Miyata, Natsuki II-1002 Moch, Holger I-367 Mochimaru, Masaaki II-641 Mochizuki, Yoshihiko II-1109 Moeslund, T.B. I-999 Mohedano, Ra´ ul II-150 Monacelli, E. II-407 Moratto, Zachary I-710 Moreland, Kenneth II-92 Moreno, Plinio I-223 Morishita, Soichiro II-430 Moura, Daniel C. I-586 M¨ uller, Christoph I-357 Mulligan, Jane I-490, II-264 Mu˜ niz Gutierrez, Jose Luis I-965 Musuvathy, Suraj I-101 Nagl, Frank II-992 Navab, Nassir I-480, I-1101 Navr´ atil, Paul A. II-970 Nebel, B. I-79 Nefian, Ara V. I-700, I-710 Nielsen, Mads Thorsted I-275 Nixon, Mark S. II-377 Noh, Junyong I-67, II-59 Odom, Christian N.S. I-1031 Ohnishi, Naoya I-403, II-807 Oikonomidis, Iasonas II-460 Olson, Clark F. II-325 Osher, Stanley I-914 Osherovich, Eliyahu II-1063 Ota, Jun II-1002
Author Index Owen, Charles B. I-521 Owen, G. Scott II-869, II-889 Padeken, Jan I-865 Padgett, Curtis W. II-325 Pajarola, Renato I-189, II-1 Paloc, C. II-1041 Panda, Preeti Ranjan I-111 Panin, Giorgio I-12, II-1131 Papazov, Chavdar I-1043 Paragios, Nikos I-1101 Parham, Thomas I-1009 Park, Anjin I-500 Park, Jinho I-67 Park, Junhee II-283 Pasquinelli, Melissa A. II-129 Paulhac, Ludovic I-379 Pavlopoulou, Christina I-906 Paysan, Pascal I-875 Pears, Nick II-1141 Perry, Thomas P. I-924 Peskin, Adele P. II-1051 Pilz, Florian I-275 Pla, Filiberto II-509 Plancoulaine, B. I-327 Pock, Thomas II-1119 Poliˇsˇcuk, Radek II-1011 Pop, Ionel II-1181 Proen¸ca, Hugo II-698 Puerto-Souza, Gustavo A. II-662 Pugeault, Nicolas I-275 Puig, A. II-847 Pundlik, Shrinivas J. I-425 Pylv¨ an¨ ainen, Timo I-457, I-819, II-252 Qin, Jianzhao
I-297
Raducanu, Bogdan I-730 Rajadell, Olga II-509 Ramel, Jean-Yves I-379 Ramisa, Arnau I-44 Rao, Josna I-1089 Rara, Ham I-774, II-519 Rastgar, Houman I-447 Reiners, Dirk I-511, I-1031, II-81, II-909 Reisert, Marco I-287 Repplinger, Michael I-975 Rhyne, Theresa-Marie II-129, II-929 Ribeiro, Eraldo II-367 Riensche, Roderick M. I-896
1195
Ring, Wolfgang II-1021 Riva, Andrea I-147 Robinson, T. II-568 Rodr´ıguez, Gabriel I-644 Rodr´ıguez Albari˜ no, Apolinar I-965 Rohling, Robert I-944 Rojas, Freddy II-970 Ronneberger, Olaf I-865 Rosenbaum, Ren´e II-71 Rosenhahn, Bodo I-391, II-196 Roth, Thorsten I-987 Royer, Eric I-201 Rubinstein, Dmitri I-975 Safari, Saeed I-808 Sagraloff, Michael I-608 Saito, Hideo II-641, II-651 Sakai, Tomoya I-403, II-807, II-1109 Sandberg, Kristian I-564 Sankaran, Shvetha I-531 Santos-Victor, Jos´e I-223 Savakis, Andreas II-357 Savitsky, Eric I-914 Sbalzarini, Ivo F. I-544 Scherzer, Daniel II-13 Scheuermann, Bj¨ orn II-196 Schlegel, Philipp II-1 Schw¨ arzler, Michael II-13 Scott, Jon I-125 Scuturici, Mihaela II-1181 Sedlacek, David II-218 Segal, Aleksandr V. I-710 Seidel, Hans-Peter I-391, I-796 Sharif, Md. Haidar II-674 Sharma, Gaurav II-728 Shetty, Nikhil J. I-1031 Shi, Fanhuai I-265, II-817 Shi, Yonggang I-955 Shirayama, Susumu I-135 Sierra, Javier II-686 Silpa, B.V.N. I-111 Silva, Jayant II-335 Simonsen, Kasper Broegaard I-275 Singh, Mayank I-167 Sirakov, Nikolay Metodiev II-1031 Slusallek, Philipp I-975 Smith, William A.P. I-632 Spagnolo, P. II-440 Sparr, Ted M. II-117
1196
Author Index
Starr, Thomas I-774 Steele, Jay E. I-55 Stelling, Pete I-1009 St¨ ottinger, Julian II-303 Streicher, Alexander I-34 Strengert, Magnus I-357 Sun, Quansen I-886 Suo, Xiaoyuan II-869, II-889 ˇ Surkovsk´ y, Martin II-397 Suzuki, Seiji II-641 Szumilas, Lech I-22 Takahashi, Haruhisa II-489 Tallury, Syamal II-129 Tang, Angela Chih-Wei II-274 Tapamo, Jules R. II-345 Tarel, Jean-Philippe II-1161 Tavakkoli, Alireza I-243 Tavares, Jo˜ ao Manuel R.S. I-586 Taylor, Chris I-750 Teoh, Soon Tee I-468 Terabayashi, Kenji II-538, II-1002 Teschner, M. I-79 Thakur, Sidharth II-129, II-929 Thiel, Steffen II-857 Thomas, D.G. II-568 Thorm¨ ahlen, Thorsten I-391, I-796 Tian, Yibin II-767 Toga, Arthur W. I-955 Toledo, Ricardo I-44 Tolliver, David I-1067 Tomono, Masahiro I-690 Torres, Ricardo da S. I-435 Tougne, Laure II-578 Tsui, Andrew II-879 Tu, Zhuowen I-955 Uberti, Marco I-147 Uematsu, Yuko II-651 Ueng, Shyh-Kuang II-899 Uhl, Andreas II-35 Umeda, Kazunori II-538, II-1002 Unel, Mustafa II-1171 Unger, Markus II-1119 Ushkala, Karthik II-1031 Vallotton, Pascal I-531 Veksler, Olga II-1085 Vemuri, Kumar S.S. I-111 Vetter, Thomas I-875
Vincent, Andr´e I-447 Virto, J.M. II-1041 Vogel, Oliver II-949 Wald, D. II-1041 Wang, Charlie C.L. II-709 Wang, Demin I-447 Wang, Jun II-939 Wang, Qing II-315, II-608 Wang, Shengke II-558 Wang, Yalin I-955 Wang, Yunhong II-499 Warren, Joe I-620 Wei, Qingdi II-631, II-757 Weickert, Joachim II-949 Weier, Martin I-987 Welk, Martin II-949 Wesche, Gerold I-1019 Westing, Brandt II-970 Wild, Peter J. I-367 Wildenauer, Horst I-22 Williams, Martyn I-632 Williams, Q. II-407 Wimmer, Michael II-13 Wood, Zo¨e II-879 Wu, Bo-Zong II-274 Wuertele, Eve II-81, II-909 Wyk, B.J. van II-407 Xi, Yongjian II-817 Xia, Deshen I-886 Xiang, Ping II-519 Xu, Wei I-490, II-264 Yang, Fu-Sheng II-899 Yang, Heng II-315, II-608 Yang, Xingwei II-747 Yassine, Inas I-934 Yavneh, Irad II-1063 Ye, Xiaojing I-855 Yilmaz, Mustafa Berkay II-1171 Yin, Lijun II-387 Yoder, Gabriel II-387 Yorozu, Nina II-651 You, Mi I-67 Yu, Stella X. I-157, I-307, I-906 Yu, Weimiao I-253, I-531 Yu, Ye II-240 Yu, Zeyun II-939 Yuan, Ruifeng II-529
Author Index Yuksel, Cem I-167 Yung, Nelson H.C. I-297 Zara, Jiri II-218 Zhang, Guangpeng II-499 Zhang, Jian II-470 Zhang, Liang I-447 Zhang, Xiaoqin II-480, II-757 Zhao, Wenchuang II-529
Zhao, Ye II-960 Zheng, Jun I-762 Zhu, Lierong I-934 Zhu, Pengfei II-631 Zhu, Ying II-869, II-889 Zibulevsky, Michael II-1063 Zografos, Vasileios II-1097 Zweng, Andreas II-598